
 # ProtoZero design document

 ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library
 purposefully built for Perfetto's tracing use cases.

 ## Motivations

 ProtoZero has been designed and optimized for proto serialization, which is used
 by all Perfetto tracing paths.
 Deserialization was introduced only at a later stage of the project and is
 mainly used by offline tools
 (e.g., [TraceProcessor](/docs/analysis/trace-processor.md)).
 The _zero-copy zero-alloc zero-syscall_ statement applies only to the
 serialization code.

 Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace
 event in Perfetto is a proto
 (see [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This
 allows events to be strongly typed and makes it easier for the team to maintain
 backwards compatibility using a language that is understood across the board.

 Tracing fast-paths need to have very little overhead, because instrumentation
 points are sprinkled all over the codebase of projects like Android
 and Chrome and are performance-critical.

 Overhead here is not just the CPU time (or instructions retired) it
 takes to execute the instrumentation point. A big source of overhead in a
 tracing system is represented by the working set of the instrumentation points,
 specifically extra I-cache and D-cache misses which would slow down the
 non-tracing code _after_ the tracing instrumentation point.

 The major design departures of ProtoZero from canonical C++ protobuf libraries
 like [libprotobuf](https://github.com/google/protobuf) are:

 * Treating serialization and deserialization as different use-cases served by
   different code.

 * Optimizing for binary size and working-set-size on the serialization paths.

 * Ignoring most of the error checking and long-tail features of protobuf
   (repeated vs optional, full type checks).

 * ProtoZero is not designed as a general-purpose protobuf de/serialization
   library and is heavily customized to keep the trace-writing code minimal and
   allow the compiler to see through the architectural layers.

 * Code generated by ProtoZero needs to be hermetic. When building the
   amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), all
   Perfetto tracing sources must not depend on any library other than the
   C++ standard library and the C library.

 ## Usage

 At the build-system level, ProtoZero is extremely similar to the conventional
 libprotobuf library.
 The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is based on top of the
 libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a
 `protoc` compiler plugin.

 ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends
 on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by
 it, however, has no runtime dependency (not even header-only dependencies) on
 libprotobuf.

 In order to generate ProtoZero stubs from proto you need to:

 1. Build the ProtoZero compiler plugin, which lives in
    [src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/).
    ```bash
    tools/ninja -C out/default protozero_plugin protoc
    ```

 2. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`:
    ```bash
    out/default/protoc \
        --plugin=protoc-gen-plugin=out/default/protozero_plugin \
        --plugin_out=wrapper_namespace=pbzero:/tmp/ \
        test_msg.proto
    ```
    This generates `/tmp/test_msg.pbzero.{cc,h}`.

    NOTE: The .cc file is always empty. ProtoZero-generated code is header only.
    The .cc file is emitted only because some build systems' rules assume that
    protobuf codegens generate both a .cc and a .h file.

 ## Proto serialization

 The quickest way to understand ProtoZero design principles is to start from a
 small example and compare the generated code between libprotobuf and ProtoZero.

 ```protobuf
 syntax = "proto2";

 message TestMsg {
   optional string str_val = 1;
   optional int32 int_val = 2;
   repeated TestMsg nested = 3;
 }
 ```

 #### libprotobuf approach

 The libprotobuf approach is to generate a C++ class that has one member for each
 proto field, with dedicated serialization and de-serialization methods.

 ```bash
 out/default/protoc  --cpp_out=. test_msg.proto
 ```

 generates test_msg.pb.{cc,h}. With many degrees of simplification, it looks
 as follows:

 ```c++
 // This class is generated by the standard protoc compiler in the .pb.h source.
 class TestMsg : public protobuf::MessageLite {
  private:
   int32_t int_val_;
   ArenaStringPtr str_val_;
   RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>

  public:
   const std::string& str_val() const;
   void set_str_val(const std::string& value);

   bool has_int_val() const;
   int32_t int_val() const;
   void set_int_val(int32_t value);

   ::TestMsg* add_nested();
   ::TestMsg* mutable_nested(int index);
   const TestMsg& nested(int index);

   std::string SerializeAsString();
   bool ParseFromString(const std::string&);
 };
 ```

 The main characteristics of these stubs are:

 * Code generated from .proto messages can be used in the codebase as general
   purpose objects, without ever using the `SerializeAs*()` or `ParseFrom*()`
   methods (although anecdotal evidence suggests that most projects use these
   proto-generated classes only at the de/serialization endpoints).

 * The end-to-end journey of serializing a proto involves two steps:
   1. Setting the individual int / string / vector fields of the generated class.
   2. Doing a serialization pass over these fields.

   In turn this has side-effects on the code generated. STL copy/assignment
   operators for strings and vectors are non-trivial because, for instance, they
   need to deal with dynamic memory resizing.

 #### ProtoZero approach

 ```c++
 // This class is generated by the ProtoZero plugin in the .pbzero.h source.
 class TestMsg : public protozero::Message {
  public:
   void set_str_val(const std::string& value) {
     AppendBytes(/*field_id=*/1, value.data(), value.size());
   }
   void set_str_val(const char* data, size_t size) {
     AppendBytes(/*field_id=*/1, data, size);
   }
   void set_int_val(int32_t value) {
     AppendVarInt(/*field_id=*/2, value);
   }
   TestMsg* add_nested() {
     return BeginNestedMessage<TestMsg>(/*field_id=*/3);
   }
 };
 ```

 The ProtoZero-generated stubs are append-only. As the `set_*`, `add_*` methods
 are invoked, the passed arguments are directly serialized into the target
 buffer. This introduces some limitations:

 * Readback is not possible: these classes cannot be used as C++ struct
   replacements.

 * No error-checking is performed: nothing prevents a non-repeated field from
   being emitted twice in the serialized proto if the caller accidentally calls
   a `set_*()` method twice. Basic type checks are still performed at
   compile-time, though.

 * Nested fields must be filled in a stack fashion and cannot be written
   interleaved. Once a nested message is started, its fields must be set before
   going back to setting the fields of the parent message. This turns out not to
   be a problem for most tracing use-cases.

 The append-only design also has a number of advantages:

 * The classes generated by ProtoZero don't add any extra state on top of the
   base class they derive from (`protozero::Message`). They define only inline
   setter methods that call base-class serialization methods. Compilers can
   see through all the inline expansions of these methods.

 * As a consequence of that, the binary cost of ProtoZero is independent of the
   number of protobuf messages defined and their fields, and depends only on the
   number of `set_*`/`add_*` calls. This (i.e. binary cost of non-used proto
   messages and fields) anecdotally has been a big issue with libprotobuf.

 * The serialization methods don't involve any copy or dynamic allocation. The
   inline expansion calls directly into the corresponding `AppendVarInt()` /
   `AppendString()` methods of `protozero::Message`.

 * This allows trace events to be serialized directly into the
   [tracing shared memory buffers](/docs/concepts/buffers.md), even if they are
   not contiguous.
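
The append-only behaviour described above can be illustrated with a minimal sketch. `SketchMessage` and `SketchTestMsg` are hypothetical names, not the real `protozero::Message` API, and the sketch assumes a caller-provided buffer that is large enough (no bounds checking, no scattered buffers). It only aims to show that each setter encodes its argument straight into the output and that the generated class adds no state of its own.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for protozero::Message. The only state is the write
// pointer into a caller-provided buffer (assumed to be large enough).
class SketchMessage {
 public:
  explicit SketchMessage(uint8_t* buf) : begin_(buf), wptr_(buf) {}

  size_t size() const { return static_cast<size_t>(wptr_ - begin_); }

  // Field preamble with wire type 0 (varint), followed by the value.
  void AppendVarInt(uint32_t field_id, uint64_t value) {
    WriteVarInt((static_cast<uint64_t>(field_id) << 3) | 0);
    WriteVarInt(value);
  }

  // Field preamble with wire type 2 (length-delimited), length, then payload.
  void AppendBytes(uint32_t field_id, const void* data, size_t size) {
    WriteVarInt((static_cast<uint64_t>(field_id) << 3) | 2);
    WriteVarInt(size);
    std::memcpy(wptr_, data, size);
    wptr_ += size;
  }

 private:
  // Standard protobuf varint: 7 payload bits per byte, MSB set on all but
  // the last byte.
  void WriteVarInt(uint64_t value) {
    while (value >= 0x80) {
      *wptr_++ = static_cast<uint8_t>((value & 0x7f) | 0x80);
      value >>= 7;
    }
    *wptr_++ = static_cast<uint8_t>(value);
  }

  uint8_t* begin_;
  uint8_t* wptr_;
};

// Hypothetical generated stub: no data members, only inline forwarding setters.
class SketchTestMsg : public SketchMessage {
 public:
  using SketchMessage::SketchMessage;

  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  void set_int_val(int32_t value) {
    // Negative int32 values are sign-extended to 64 bits, as in protobuf.
    AppendVarInt(/*field_id=*/2,
                 static_cast<uint64_t>(static_cast<int64_t>(value)));
  }
};

static_assert(sizeof(SketchTestMsg) == sizeof(SketchMessage),
              "the generated stub adds no state to its base class");
```

Calling `set_str_val("foo", 3)` followed by `set_int_val(42)` on a `SketchTestMsg` leaves the seven bytes `0a 03 66 6f 6f 10 2a` in the buffer, with no intermediate copy of the string.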

 ### Scattered buffer writing

 A key part of the ProtoZero design is supporting direct serialization on
 non-globally-contiguous sequences of contiguous memory regions.

 This happens by decoupling `protozero::Message`, the base class for all the
 generated classes, from the `protozero::ScatteredStreamWriter`.
 The problem it solves is the following: ProtoZero is based on direct
 serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in
 most cases. At the same time, the amount of data a caller may write into an
 individual message is not bounded by the chunk size: a trace event can be up
 to 256 MiB in size.

 ![ProtoZero scattered buffers diagram](/docs/images/protozero-ssw.png)

 #### Fast-path

 At all times the underlying `ScatteredStreamWriter` knows the bounds
 of the current buffer. All write operations are bounds-checked and hit a
 slow-path when crossing the buffer boundary.

 Most write operations can be completed within the current buffer boundaries.
 In that case, the cost of a `set_*` operation is in essence a `memcpy()` with
 the extra overhead of var-int encoding for protobuf preambles and
 length-delimited fields.

 #### Slow-path

 When crossing the boundary, the slow-path asks the
 `ScatteredStreamWriter::Delegate` for a new buffer. The implementation of
 `GetNewBuffer()` is up to the client. In tracing use-cases, that call will
 acquire a new thread-local chunk from the tracing shared memory buffer.

 Other heap-based implementations are possible. For instance, the ProtoZero
 sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see
 [scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)),
 which allocates a new heap buffer when crossing the boundaries of the current
 one.

 Consider the following example:

 ```c++
 TestMsg outer_msg;
 for (int i = 0; i < 1000; i++) {
   TestMsg* nested = outer_msg.add_nested();
   nested->set_int_val(42);
 }
 ```

 At some point one of the `set_int_val()` calls will hit the slow-path and
 acquire a new buffer. The overall idea is having a serialization mechanism
 that is extremely lightweight most of the time and that requires some extra
 function calls only when crossing a buffer boundary, so that their cost gets
 amortized across all trace events.
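
The fast-path / slow-path split can be sketched as follows. This is an illustration under assumptions, not the actual `ScatteredStreamWriter` code: `Range`, `Delegate`, `SketchWriter` and `HeapDelegate` are hypothetical names, and the heap-backed delegate merely stands in for whatever the client uses to supply buffers.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// A contiguous memory range [begin, end).
struct Range {
  uint8_t* begin;
  uint8_t* end;
};

// Hypothetical counterpart of ScatteredStreamWriter::Delegate.
class Delegate {
 public:
  virtual ~Delegate() = default;
  virtual Range GetNewBuffer() = 0;
};

class SketchWriter {
 public:
  explicit SketchWriter(Delegate* delegate)
      : delegate_(delegate), cur_(delegate->GetNewBuffer()), wptr_(cur_.begin) {}

  void WriteBytes(const uint8_t* src, size_t size) {
    // Fast path: one bounds check and a memcpy within the current buffer.
    if (size <= static_cast<size_t>(cur_.end - wptr_)) {
      std::memcpy(wptr_, src, size);
      wptr_ += size;
      return;
    }
    WriteBytesSlowPath(src, size);
  }

 private:
  // Slow path: fill what is left, then ask the delegate for more buffers.
  void WriteBytesSlowPath(const uint8_t* src, size_t size) {
    while (size > 0) {
      if (wptr_ == cur_.end) {
        cur_ = delegate_->GetNewBuffer();
        wptr_ = cur_.begin;
      }
      const size_t n = std::min(size, static_cast<size_t>(cur_.end - wptr_));
      std::memcpy(wptr_, src, n);
      wptr_ += n;
      src += n;
      size -= n;
    }
  }

  Delegate* delegate_;
  Range cur_;
  uint8_t* wptr_;
};

// Hypothetical heap-backed delegate handing out fixed-size chunks.
class HeapDelegate : public Delegate {
 public:
  explicit HeapDelegate(size_t chunk_size) : chunk_size_(chunk_size) {}

  Range GetNewBuffer() override {
    chunks_.emplace_back(new uint8_t[chunk_size_]());
    uint8_t* p = chunks_.back().get();
    return Range{p, p + chunk_size_};
  }

  size_t num_chunks() const { return chunks_.size(); }
  const uint8_t* chunk(size_t i) const { return chunks_[i].get(); }

 private:
  size_t chunk_size_;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};
```

With 4-byte chunks, writing 3 bytes and then 3 more makes the second call take the slow path: one byte lands at the end of the first chunk and the remaining two at the start of a newly acquired one.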

 In the context of the overall Perfetto tracing use case, the slow-path involves
 grabbing a process-local mutex and finding the next free chunk in the shared
 memory buffer. Hence writes are lock-free as long as they happen within the
 thread-local chunk and require a critical section to acquire a new chunk once
 every 4KB-32KB (depending on the trace configuration).

 The assumption is that the likelihood that two threads will cross the chunk
 boundary and call `GetNewBuffer()` at the same time is extremely low and hence
 the critical section is uncontended most of the time.

 ```mermaid
 sequenceDiagram
   participant C as Call site
   participant M as Message
   participant SSR as ScatteredStreamWriter
   participant DEL as Buffer Delegate
   C->>M: set_int_val(...)
   activate C
   M->>SSR: AppendVarInt(...)
   deactivate C
   Note over C,SSR: A typical write on the fast-path

   C->>M: set_str_val(...)
   activate C
   M->>SSR: AppendString(...)
   SSR->>DEL: GetNewBuffer(...)
   deactivate C
   Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.
 ```

 ### Deferred patching

 Nested messages in the protobuf binary encoding are prefixed with their
 varint-encoded size.

 Consider the following:

 ```c++
 TestMsg* nested = outer_msg.add_nested();
 nested->set_int_val(42);
 nested->set_str_val("foo");
 ```

 The canonical encoding of this protobuf message, using libprotobuf, would be:

 ```bash
 1a 07 0a 03 66 6f 6f 10 2a
 ^-+-^ ^-----+------^ ^-+-^
   |         |          |
   |         |          +--> Field ID: 2 [int_val], value = 42.
   |         |
   |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
   |
   +------> Field ID: 3 [nested], length: 7  # !!!
 ```

 The second byte in this sequence (07) is problematic for direct encoding. At the
 point where `outer_msg.add_nested()` is called, we can't possibly know upfront
 what the overall size of the nested message will be (in this case, 5 + 2 = 7).
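
The preamble bytes in the dump above follow the standard protobuf rule `(field_id << 3) | wire_type`. The short sketch below re-derives them and the nested length; `MakeTag` and the constant names are illustrative helpers, not ProtoZero identifiers.

```cpp
#include <cstdint>

// Protobuf wire types used by the example message.
constexpr uint32_t kWireTypeVarInt = 0;
constexpr uint32_t kWireTypeLengthDelimited = 2;

// A field preamble packs the field id and the wire type together.
constexpr uint32_t MakeTag(uint32_t field_id, uint32_t wire_type) {
  return (field_id << 3) | wire_type;
}

// nested (field 3) and str_val (field 1) are length-delimited,
// int_val (field 2) is a varint.
static_assert(MakeTag(3, kWireTypeLengthDelimited) == 0x1a, "nested");
static_assert(MakeTag(1, kWireTypeLengthDelimited) == 0x0a, "str_val");
static_assert(MakeTag(2, kWireTypeVarInt) == 0x10, "int_val");

// Payload of the nested message:
//   str_val: 1 (preamble) + 1 (length) + 3 ("foo") = 5 bytes
//   int_val: 1 (preamble) + 1 (value 42)           = 2 bytes
constexpr uint32_t kStrValBytes = 1 + 1 + 3;
constexpr uint32_t kIntValBytes = 1 + 1;
static_assert(kStrValBytes + kIntValBytes == 7, "nested length");
```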

 The way we get around this in ProtoZero is by reserving four bytes for the
 _size_ of each nested message and back-filling them once the message is
 finalized (or when we try to set a field in one of the parent messages).
 We do this by encoding the size of the message using redundant varint encoding,
 in this case: `87 80 80 00` instead of `07`.
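
A minimal sketch of the redundant encoding follows. The function names and the `kSizeFieldBytes` constant are chosen for illustration and are not claimed to match the ProtoZero sources. Every byte except the last has its continuation bit set, so the value always occupies four bytes, and a standard varint decoder still reads it back correctly.

```cpp
#include <cstddef>
#include <cstdint>

constexpr size_t kSizeFieldBytes = 4;

// Encodes `value` as a varint that always occupies kSizeFieldBytes bytes.
// Only values below 2^28 fit, since each byte carries 7 payload bits.
inline void WriteRedundantVarInt(uint32_t value, uint8_t* dst) {
  for (size_t i = 0; i < kSizeFieldBytes; ++i) {
    const uint8_t msb = (i < kSizeFieldBytes - 1) ? 0x80 : 0;
    dst[i] = static_cast<uint8_t>((value & 0x7f) | msb);
    value >>= 7;
  }
}

// Plain varint decoder, unaware of the redundant encoding.
inline uint64_t ParseVarInt(const uint8_t* src, size_t* bytes_read) {
  uint64_t value = 0;
  size_t i = 0;
  for (uint32_t shift = 0; shift < 64; shift += 7) {
    const uint8_t byte = src[i++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if ((byte & 0x80) == 0)
      break;
  }
  *bytes_read = i;
  return value;
}
```

Four bytes with 7 payload bits each give 28 bits, that is, sizes below 2^28 bytes (256 MiB), which is consistent with the maximum trace event size quoted earlier.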

 At the C++ level, the `protozero::Message` class holds a pointer to its `size`
 field, which typically points to the beginning of the message, where the four
 bytes are reserved, and back-fills it in the `Message::Finalize()` pass.
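
The reserve-then-back-fill flow can be sketched for a single nested message in one contiguous buffer. `SketchNestedWriter` and its methods are hypothetical and heavily simplified: one nesting level, field ids below 16 so that the preamble fits in one byte, and no bounds checking.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

class SketchNestedWriter {
 public:
  static constexpr size_t kSizeFieldBytes = 4;

  explicit SketchNestedWriter(uint8_t* buf) : begin_(buf), wptr_(buf) {}

  // Writes the length-delimited preamble and reserves room for the size.
  void BeginNested(uint32_t field_id) {
    *wptr_++ = static_cast<uint8_t>((field_id << 3) | 2);
    size_field_ = wptr_;       // Remember where the size must be patched.
    wptr_ += kSizeFieldBytes;  // Leave a four-byte hole.
    payload_begin_ = wptr_;
  }

  // Appends already-encoded field bytes of the nested message.
  void AppendRaw(const void* data, size_t size) {
    std::memcpy(wptr_, data, size);
    wptr_ += size;
  }

  // Back-fills the reserved bytes with a redundant four-byte varint.
  void Finalize() {
    uint32_t size = static_cast<uint32_t>(wptr_ - payload_begin_);
    for (size_t i = 0; i < kSizeFieldBytes; ++i) {
      const uint8_t msb = (i < kSizeFieldBytes - 1) ? 0x80 : 0;
      size_field_[i] = static_cast<uint8_t>((size & 0x7f) | msb);
      size >>= 7;
    }
  }

  size_t size() const { return static_cast<size_t>(wptr_ - begin_); }

 private:
  uint8_t* begin_;
  uint8_t* wptr_;
  uint8_t* size_field_ = nullptr;
  uint8_t* payload_begin_ = nullptr;
};
```

Beginning field 3, appending `10 2a` (int_val) and `0a 03 66 6f 6f` (str_val), then finalizing, yields `1a 87 80 80 00 10 2a 0a 03 66 6f 6f`: the same message as the canonical encoding, three bytes longer and with fields in call order.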

 This works fine for cases where the entire message lies in one contiguous buffer
 but opens a further challenge: a message can be several MBs big. Looking at this
 from the overall tracing perspective, the shared memory buffer chunk that holds
 the beginning of a message can be long gone (i.e. committed in the central
 service buffer) by the time we get to the end.

 In order to support this use case, at the tracing code level (outside of
 ProtoZero), when a message crosses the buffer boundary, its `size` field gets
 redirected to a temporary patch buffer
 (see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then
 sent out-of-band, piggybacking over the next commit IPC (see
 [Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)).

 ### Performance characteristics

 NOTE: For the full code of the benchmark see
       `/src/protozero/test/protozero_benchmark.cc`

 We consider two scenarios: writing a simple event and writing a nested event.

 #### Simple event

 Consists of filling a flat proto message with 4 integers (2 x 32-bit,
 2 x 64-bit) and a 32-byte string, as follows:

 ```c++
 template <typename T>
 void FillMessage_Simple(T* msg) {
   msg->set_field_int32(...);
   msg->set_field_uint32(...);
   msg->set_field_int64(...);
   msg->set_field_uint64(...);
   msg->set_field_string(...);
 }
 ```

 #### Nested event

 Consists of filling a similar message which is recursively nested 3 levels deep:

 ```c++
 template <typename T>
 void FillMessage_Nested(T* msg, int depth = 0) {
   FillMessage_Simple(msg);
   if (depth < 3) {
     auto* child = msg->add_field_nested();
     FillMessage_Nested(child, depth + 1);
   }
 }
 ```

 #### Comparison terms

 We compare, for the same message type, the performance of ProtoZero,
 libprotobuf and a speed-of-light serializer.

 The speed-of-light serializer is a very simple C++ class that just appends
 data into a linear buffer, making all sorts of favourable assumptions: it does
 not use any binary-stable encoding, it does not perform bounds checking,
 all writes are 64-bit aligned, and it does not deal with thread-safety.

 ```c++
 struct SOLMsg {
   template <typename T>
   void Append(T x) {
     // The memcpy will be elided by the compiler, which will emit just a
     // 64-bit aligned mov instruction.
     memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x));
     ptr_ += sizeof(x);
   }

   void set_field_int32(int32_t x) { Append(x); }
   void set_field_uint32(uint32_t x) { Append(x); }
   void set_field_int64(int64_t x) { Append(x); }
   void set_field_uint64(uint64_t x) { Append(x); }
   void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }

   alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8];
   char* ptr_ = &storage_[0];
 };
 ```

 The speed-of-light serializer serves as a reference for _how fast a serializer
 could be if argument marshalling and bound checking were zero cost._

 #### Benchmark results

 ##### Google Pixel 3 - aarch64

 ```bash
 $ cat out/droid_arm64/args.gn
 target_os = "android"
 is_clang = true
 is_debug = false
 target_cpu = "arm64"

 $ ninja -C out/droid_arm64/ perfetto_benchmarks && \
   adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
   adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'

 ------------------------------------------------------------------------
 Benchmark                                 Time           CPU Iterations
 ------------------------------------------------------------------------
 BM_Protozero_Simple_Libprotobuf         402 ns        398 ns    1732807
 BM_Protozero_Simple_Protozero           242 ns        239 ns    2929528
 BM_Protozero_Simple_SpeedOfLight        118 ns        117 ns    6101381
 BM_Protozero_Nested_Libprotobuf        1810 ns       1800 ns     390468
 BM_Protozero_Nested_Protozero           780 ns        773 ns     901369
 BM_Protozero_Nested_SpeedOfLight        138 ns        136 ns    5147958
 ```

 ##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux

 ```bash
 $ cat out/linux_clang_release/args.gn
 is_clang = true
 is_debug = false

 $ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
   out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*

 ------------------------------------------------------------------------
 Benchmark                                 Time           CPU Iterations
 ------------------------------------------------------------------------
 BM_Protozero_Simple_Libprotobuf         428 ns        428 ns    1624801
 BM_Protozero_Simple_Protozero           261 ns        261 ns    2715544
 BM_Protozero_Simple_SpeedOfLight        111 ns        111 ns    6297387
 BM_Protozero_Nested_Libprotobuf        1625 ns       1625 ns     436411
 BM_Protozero_Nested_Protozero           843 ns        843 ns     849302
 BM_Protozero_Nested_SpeedOfLight        140 ns        140 ns    5012910
 ```