|  | # ProtoZero design document | 
|  |  | 
|  | ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization libary | 
|  | purposefully built for Perfetto's tracing use cases. | 
|  |  | 
|  | ## Motivations | 
|  |  | 
|  | ProtoZero has been designed and optimized for proto serialization, which is used | 
|  | by all Perfetto tracing paths. | 
|  | Deserialization was introduced only at a later stage of the project and is | 
|  | mainly used by offline tools | 
|  | (e.g., [TraceProcessor](/docs/analysis/trace-processor.md). | 
|  | The _zero-copy zero-alloc zero-syscall_ statement applies only to the | 
|  | serialization code. | 
|  |  | 
|  | Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace | 
|  | event in Perfetto is a proto | 
|  | (see [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This | 
|  | allows events to be strongly typed and makes it easier for the team to maintain | 
|  | backwards compatibility using a language that is understood across the board. | 
|  |  | 
|  | Tracing fast-paths need to have very little overhead, because instrumentation | 
|  | points are sprinkled all over the codebase of projects like Android | 
|  | and Chrome and are performance-critical. | 
|  |  | 
|  | Overhead here is not just defined as CPU time (or instructions retired) it | 
|  | takes to execute the instrumentation point. A big source of overhead in a | 
|  | tracing system is represented by the working set of the instrumentation points, | 
|  | specifically extra I-cache and D-cache misses which would slow down the | 
|  | non-tracing code _after_ the tracing instrumentation point. | 
|  |  | 
|  | The major design departures of ProtoZero from canonical C++ protobuf libraries | 
|  | like [libprotobuf](https://github.com/google/protobuf) are: | 
|  |  | 
|  | * Treating serialization and deserialization as different use-cases served by | 
|  | different code. | 
|  |  | 
|  | * Optimizing for binary size and working-set-size on the serialization paths. | 
|  |  | 
|  | * Ignoring most of the error checking and long-tail features of protobuf | 
|  | (repeated vs optional, full type checks). | 
|  |  | 
|  | * ProtoZero is not designed as general-purpose protobuf de/serialization and is | 
|  | heavily customized to maintain the tracing writing code minimal and allow the | 
|  | compiler to see through the architectural layers. | 
|  |  | 
|  | * Code generated by ProtoZero needs to be hermetic. When building the | 
|  | amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), the all | 
|  | perfetto tracing sources need to not have any dependency on any other | 
|  | libraries other than the C++ standard library and C library. | 
|  |  | 
|  | ## Usage | 
|  |  | 
|  | At the build-system level, ProtoZero is extremely similar to the conventional | 
|  | libprotobuf library. | 
|  | The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is based on top of the | 
|  | libprotobuf parser and compiler infrastructure. ProtoZero is as a `protoc` | 
|  | compiler plugin. | 
|  |  | 
|  | ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends | 
|  | on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by | 
|  | it, however, has no runtime dependency (not even header-only dependencies) on | 
|  | libprotobuf. | 
|  |  | 
|  | In order to generate ProtoZero stubs from proto you need to: | 
|  |  | 
|  | 1. Build the ProtoZero compiler plugin, which lives in | 
|  | [src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/). | 
|  | ```bash | 
|  | tools/ninja -C out/default protozero_plugin protoc | 
|  | ``` | 
|  |  | 
|  | 2. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`: | 
|  | ```bash | 
|  | out/default/protoc \ | 
|  | --plugin=protoc-gen-plugin=out/default/protozero_plugin \ | 
|  | --plugin_out=wrapper_namespace=pbzero:/tmp/  \ | 
|  | test_msg.proto | 
|  | ``` | 
|  | This generates `/tmp/test_msg.pbzero.{cc,h}`. | 
|  |  | 
|  | NOTE: The .cc file is always empty. ProtoZero-generated code is header only. | 
|  | The .cc file is emitted only because some build systems' rules assume that | 
|  | protobuf codegens generate both a .cc and a .h file. | 
|  |  | 
|  | ## Proto serialization | 
|  |  | 
|  | The quickest way to undestand ProtoZero design principles is to start from a | 
|  | small example and compare the generated code between libprotobuf and ProtoZero. | 
|  |  | 
|  | ```protobuf | 
|  | syntax = "proto2"; | 
|  |  | 
|  | message TestMsg { | 
|  | optional string str_val = 1; | 
|  | optional int32 int_val = 2; | 
|  | repeated TestMsg nested = 3; | 
|  | } | 
|  | ``` | 
|  |  | 
|  | #### libprotobuf approach | 
|  |  | 
|  | The libprotobuf approach is to generate a C++ class that has one member for each | 
|  | proto field, with dedicated serialization and de-serialization methods. | 
|  |  | 
|  | ```bash | 
|  | out/default/protoc  --cpp_out=. test_msg.proto | 
|  | ``` | 
|  |  | 
|  | generates test_msg.pb.{cc,h}. With many degrees of simplification, it looks | 
|  | as follows: | 
|  |  | 
|  | ```c++ | 
|  | // This class is generated by the standard protoc compiler in the .pb.h source. | 
|  | class TestMsg : public protobuf::MessageLite { | 
|  | private: | 
|  | int32 int_val_; | 
|  | ArenaStringPtr str_val_; | 
|  | RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg> | 
|  |  | 
|  | public: | 
|  | const std::string& str_val() const; | 
|  | void set_str_val(const std::string& value); | 
|  |  | 
|  | bool has_int_val() const; | 
|  | int32_t int_val() const; | 
|  | void set_int_val(int32_t value); | 
|  |  | 
|  | ::TestMsg* add_nested(); | 
|  | ::TestMsg* mutable_nested(int index); | 
|  | const TestMsg& nested(int index); | 
|  |  | 
|  | std::string SerializeAsString(); | 
|  | bool ParseFromString(const std::string&); | 
|  | } | 
|  | ``` | 
|  |  | 
|  | The main characteristic of these stubs are: | 
|  |  | 
|  | * Code generated from .proto messages can be used in the codebase as general | 
|  | purpose objects, without ever using the `SerializeAs*()` or `ParseFrom*()` | 
|  | methods (although anecdotal evidence suggests that most project use these | 
|  | proto-generated classes only at the de/serialization endpoints). | 
|  |  | 
|  | * The end-to-end journey of serializing a proto involves two steps: | 
|  | 1. Setting the individual int / string / vector fields of the generated class. | 
|  | 2. Doing a serialization pass over these fields. | 
|  |  | 
|  | In turn this has side-effects on the code generated. STL copy/assignment | 
|  | operators for strings and vectors are non-trivial because, for instance, they | 
|  | need to deal with dynamic memory resizing. | 
|  |  | 
|  | #### ProtoZero approach | 
|  |  | 
|  | ```c++ | 
|  | // This class is generated by the ProtoZero plugin in the .pbzero.h source. | 
|  | class TestMsg : public protozero::Message { | 
|  | public: | 
|  | void set_str_val(const std::string& value) { | 
|  | AppendBytes(/*field_id=*/1, value.data(), value.size()); | 
|  | } | 
|  | void set_str_val(const char* data, size_t size) { | 
|  | AppendBytes(/*field_id=*/1, data, size); | 
|  | } | 
|  | void set_int_val(int32_t value) { | 
|  | AppendVarInt(/*field_id=*/2, value); | 
|  | } | 
|  | TestMsg* add_nested() { | 
|  | return BeginNestedMessage<TestMsg>(/*field_id=*/3); | 
|  | } | 
|  | } | 
|  | ``` | 
|  |  | 
|  | The ProtoZero-generated stubs are append-only. As the `set_*`, `add_*` methods | 
|  | are invoked, the passed arguments are directly serialized into the target | 
|  | buffer. This introduces some limitations: | 
|  |  | 
|  | * Readback is not possible: these classes cannot be used as C++ struct | 
|  | replacements. | 
|  |  | 
|  | * No error-checking is performed: nothing prevents a non-repeated field to be | 
|  | emitted twice in the serialized proto if the caller accidentally calls a | 
|  | `set_*()` method twice. Basic type checks are still performed at compile-time | 
|  | though. | 
|  |  | 
|  | * Nested fields must be filled in a stack fashion and cannot be written | 
|  | interleaved. Once a nested message is started, its fields must be set before | 
|  | going back setting the fields of the parent message. This turns out to not be | 
|  | a problem for most tracing use-cases. | 
|  |  | 
|  | This has a number of advantages: | 
|  |  | 
|  | * The classes generated by ProtoZero don't add any extra state on top of the | 
|  | base class they derive (`protozero::Message`). They define only inline | 
|  | setter methods that call base-class serialization methods. Compilers can | 
|  | see through all the inline expansions of these methods. | 
|  |  | 
|  | * As a consequence of that, the binary cost of ProtoZero is independent of the | 
|  | number of protobuf messages defined and their fields, and depends only on the | 
|  | number of `set_*`/`add_*` calls. This (i.e. binary cost of non-used proto | 
|  | messages and fields) anecdotally has been a big issue with libprotobuf. | 
|  |  | 
|  | * The serialization methods don't involve any copy or dynamic allocation. The | 
|  | inline expansion calls directly into the corresponding `AppendVarInt()` / | 
|  | `AppendString()` methods of `protozero::Message`. | 
|  |  | 
|  | * This allows to directly serialize trace events into the | 
|  | [tracing shared memory buffers](/docs/concepts/buffers.md), even if they are | 
|  | not contiguous. | 
|  |  | 
|  | ### Scattered buffer writing | 
|  |  | 
|  | A key part of the ProtoZero design is supporting direct serialization on | 
|  | non-globally-contiguous sequences of contiguous memory regions. | 
|  |  | 
|  | This happens by decoupling `protozero::Message`, the base class for all the | 
|  | generated classes, from the `protozero::ScatteredStreamWriter`. | 
|  | The problem it solves is the following: ProtoZero is based on direct | 
|  | serialization into shared memory buffers chunks. These chunks are 4KB - 32KB in | 
|  | most cases. At the same time, there is no limit in how much data the caller will | 
|  | try to write into an individual message, a trace event can be up to 256 MiB big. | 
|  |  | 
|  |  | 
|  |  | 
|  | #### Fast-path | 
|  |  | 
|  | At all times the underlying `ScatteredStreamWriter` knows what are the bounds | 
|  | of the current buffer. All write operations are bound checked and hit a | 
|  | slow-path when crossing the buffer boundary. | 
|  |  | 
|  | Most write operations can be completed within the current buffer boundaries. | 
|  | In that case, the cost of a `set_*` operation is in essence a `memcpy()` with | 
|  | the extra overhead of var-int encoding for protobuf preambles and | 
|  | length-delimited fields. | 
|  |  | 
|  | #### Slow-path | 
|  |  | 
|  | When crossing the boundary, the slow-path asks the | 
|  | `ScatteredStreamWriter::Delegate` for a new buffer. The implementation of | 
|  | `GetNewBuffer()` is up to the client. In tracing use-cases, that call will | 
|  | acquire a new thread-local chunk from the tracing shared memory buffer. | 
|  |  | 
|  | Other heap-based implementations are possible. For instance, the ProtoZero | 
|  | sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see | 
|  | [scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)), | 
|  | which allocates a new heap buffer when crossing the boundaries of the current | 
|  | one. | 
|  |  | 
|  | Consider the following example: | 
|  |  | 
|  | ```c++ | 
|  | TestMsg outer_msg; | 
|  | for (int i = 0; i < 1000; i++) { | 
|  | TestMsg* nested = outer_msg.add_nested(); | 
|  | nested->set_int_val(42); | 
|  | } | 
|  | ``` | 
|  |  | 
|  | At some point one of the `set_int_val()` calls will hit the slow-path and | 
|  | acquire a new buffer. The overall idea is having a serialization mechanism | 
|  | that is extremely lightweight most of the times and that requires some extra | 
|  | function calls when buffer boundary, so that their cost gets amortized across | 
|  | all trace events. | 
|  |  | 
|  | In the context of the overall Perfetto tracing use case, the slow-path involves | 
|  | grabbing a process-local mutex and finding the next free chunk in the shared | 
|  | memory buffer. Hence writes are lock-free as long as they happen within the | 
|  | thread-local chunk and require a critical section to acquire a new chunk once | 
|  | every 4KB-32KB (depending on the trace configuration). | 
|  |  | 
|  | The assumption is that the likeliness that two threads will cross the chunk | 
|  | boundary and call `GetNewBuffer()` at the same time is extremely low and hence | 
|  | the critical section is un-contended most of the times. | 
|  |  | 
|  | ```mermaid | 
|  | sequenceDiagram | 
|  | participant C as Call site | 
|  | participant M as Message | 
|  | participant SSR as ScatteredStreamWriter | 
|  | participant DEL as Buffer Delegate | 
|  | C->>M: set_int_val(...) | 
|  | activate C | 
|  | M->>SSR: AppendVarInt(...) | 
|  | deactivate C | 
|  | Note over C,SSR: A typical write on the fast-path | 
|  |  | 
|  | C->>M: set_str_val(...) | 
|  | activate C | 
|  | M->>SSR: AppendString(...) | 
|  | SSR->>DEL: GetNewBuffer(...) | 
|  | deactivate C | 
|  | Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks. | 
|  | ``` | 
|  |  | 
|  | ### Deferred patching | 
|  |  | 
|  | Nested messages in the protobuf binary encoding are prefixed with their | 
|  | varint-encoded size. | 
|  |  | 
|  | Consider the following: | 
|  |  | 
|  | ```c++ | 
|  | TestMsg* nested = outer_msg.add_nested(); | 
|  | nested->set_int_val(42); | 
|  | nested->set_str_val("foo"); | 
|  | ``` | 
|  |  | 
|  | The canonical encoding of this protobuf message, using libprotobuf, would be: | 
|  |  | 
|  | ```bash | 
|  | 1a 07 0a 03 66 6f 6f 10 2a | 
|  | ^-+-^ ^-----+------^ ^-+-^ | 
|  | |         |          | | 
|  | |         |          +--> Field ID: 2 [int_val], value = 42. | 
|  | |         | | 
|  | |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f). | 
|  | | | 
|  | +------> Field ID: 3 [nested], length: 7  # !!! | 
|  | ``` | 
|  |  | 
|  | The second byte in this sequence (07) is problematic for direct encoding. At the | 
|  | point where `outer_msg.add_nested()` is called, we can't possibly know upfront | 
|  | what the overall size of the nested message will be (in this case, 5 + 2 = 7). | 
|  |  | 
|  | The way we get around this in ProtoZero is by reserving four bytes for the | 
|  | _size_ of each nested message and back-filling them once the message is | 
|  | finalized (or when we try to set a field in one of the parent messages). | 
|  | We do this by encoding the size of the message using redundant varint encoding, | 
|  | in this case: `87 80 80 00` instead of `07`. | 
|  |  | 
|  | At the C++ level, the `protozero::Message` class holds a pointer to its `size` | 
|  | field, which typically points to the beginning of the message, where the four | 
|  | bytes are reserved, and back-fills it in the `Message::Finalize()` pass. | 
|  |  | 
|  | This works fine for cases where the entire message lies in one contiguous buffer | 
|  | but opens a further challenge: a message can be several MBs big. Looking at this | 
|  | from the overall tracing perspective, the shared memory buffer chunk that holds | 
|  | the beginning of a message can be long gone (i.e. committed in the central | 
|  | service buffer) by the time we get to the end. | 
|  |  | 
|  | In order to support this use case, at the tracing code level (outside of | 
|  | ProtoZero), when a message crosses the buffer boundary, its `size` field gets | 
|  | redirected to a temporary patch buffer | 
|  | (see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then | 
|  | sent out-of-band, piggybacking over the next commit IPC (see | 
|  | [Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)) | 
|  |  | 
|  | ### Performance characteristics | 
|  |  | 
|  | NOTE: For the full code of the benchmark see | 
|  | `/src/protozero/test/protozero_benchmark.cc` | 
|  |  | 
|  | We consider two scenarios: writing a simple event and a nested event | 
|  |  | 
|  | #### Simple event | 
|  |  | 
|  | Consists of filling a flat proto message with of 4 integers (2 x 32-bit, | 
|  | 2 x 64-bit) and a 32 bytes string, as follows: | 
|  |  | 
|  | ```c++ | 
|  | void FillMessage_Simple(T* msg) { | 
|  | msg->set_field_int32(...); | 
|  | msg->set_field_uint32(...); | 
|  | msg->set_field_int64(...); | 
|  | msg->set_field_uint64(...); | 
|  | msg->set_field_string(...); | 
|  | } | 
|  | ``` | 
|  |  | 
|  | #### Nested event | 
|  |  | 
|  | Consists of filling a similar message which is recursively nested 3 levels deep: | 
|  |  | 
|  | ```c++ | 
|  | void FillMessage_Nested(T* msg, int depth = 0) { | 
|  | FillMessage_Simple(msg); | 
|  | if (depth < 3) { | 
|  | auto* child = msg->add_field_nested(); | 
|  | FillMessage_Nested(child, depth + 1); | 
|  | } | 
|  | } | 
|  | ``` | 
|  |  | 
|  | #### Comparison terms | 
|  |  | 
|  | We compare, for the same message type, the performance of ProtoZero, | 
|  | libprotobuf and a speed-of-light serializer. | 
|  |  | 
|  | The speed-of-light serializer is a very simple C++ class that just appends | 
|  | data into a linear buffer making all sorts of favourable assumptions. It does | 
|  | not use any binary-stable encoding, it does not perform bound checking, | 
|  | all writes are 64-bit aligned, it doesn't deal with any thread-safety. | 
|  |  | 
|  | ```c++ | 
|  | struct SOLMsg { | 
|  | template <typename T> | 
|  | void Append(T x) { | 
|  | // The memcpy will be elided by the compiler, which will emit just a | 
|  | // 64-bit aligned mov instruction. | 
|  | memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x)); | 
|  | ptr_ += sizeof(x); | 
|  | } | 
|  |  | 
|  | void set_field_int32(int32_t x) { Append(x); } | 
|  | void set_field_uint32(uint32_t x) { Append(x); } | 
|  | void set_field_int64(int64_t x) { Append(x); } | 
|  | void set_field_uint64(uint64_t x) { Append(x); } | 
|  | void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); } | 
|  |  | 
|  | alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8]; | 
|  | char* ptr_ = &storage_[0]; | 
|  | }; | 
|  | ``` | 
|  |  | 
|  | The speed-of-light serializer serves as a reference for _how fast a serializer | 
|  | could be if argument marshalling and bound checking were zero cost._ | 
|  |  | 
|  | #### Benchmark results | 
|  |  | 
|  | ##### Google Pixel 3 - aarch64 | 
|  |  | 
|  | ```bash | 
|  | $ cat out/droid_arm64/args.gn | 
|  | target_os = "android" | 
|  | is_clang = true | 
|  | is_debug = false | 
|  | target_cpu = "arm64" | 
|  |  | 
|  | $ ninja -C out/droid_arm64/ perfetto_benchmarks && \ | 
|  | adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \ | 
|  | adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*' | 
|  |  | 
|  | ------------------------------------------------------------------------ | 
|  | Benchmark                                 Time           CPU Iterations | 
|  | ------------------------------------------------------------------------ | 
|  | BM_Protozero_Simple_Libprotobuf         402 ns        398 ns    1732807 | 
|  | BM_Protozero_Simple_Protozero           242 ns        239 ns    2929528 | 
|  | BM_Protozero_Simple_SpeedOfLight        118 ns        117 ns    6101381 | 
|  | BM_Protozero_Nested_Libprotobuf        1810 ns       1800 ns     390468 | 
|  | BM_Protozero_Nested_Protozero           780 ns        773 ns     901369 | 
|  | BM_Protozero_Nested_SpeedOfLight        138 ns        136 ns    5147958 | 
|  | ``` | 
|  |  | 
|  | ##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux | 
|  |  | 
|  | ```bash | 
|  |  | 
|  | $ cat out/linux_clang_release/args.gn | 
|  | is_clang = true | 
|  | is_debug = false | 
|  |  | 
|  | $ ninja -C out/linux_clang_release/ perfetto_benchmarks && \ | 
|  | out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto* | 
|  |  | 
|  | ------------------------------------------------------------------------ | 
|  | Benchmark                                 Time           CPU Iterations | 
|  | ------------------------------------------------------------------------ | 
|  | BM_Protozero_Simple_Libprotobuf         428 ns        428 ns    1624801 | 
|  | BM_Protozero_Simple_Protozero           261 ns        261 ns    2715544 | 
|  | BM_Protozero_Simple_SpeedOfLight        111 ns        111 ns    6297387 | 
|  | BM_Protozero_Nested_Libprotobuf        1625 ns       1625 ns     436411 | 
|  | BM_Protozero_Nested_Protozero           843 ns        843 ns     849302 | 
|  | BM_Protozero_Nested_SpeedOfLight        140 ns        140 ns    5012910 | 
|  | ``` |