# ProtoZero design document

ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library
purposefully built for Perfetto's tracing use cases.

## Motivations

ProtoZero has been designed and optimized for proto serialization, which is used
by all Perfetto tracing paths.
Deserialization was introduced only at a later stage of the project and is
mainly used by offline tools
(e.g., [TraceProcessor](/docs/analysis/trace-processor.md)).
The _zero-copy zero-alloc zero-syscall_ statement applies only to the
serialization code.

Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace
event in Perfetto is a proto
(see the [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This
allows events to be strongly typed and makes it easier for the team to maintain
backwards compatibility using a language that is understood across the board.

Tracing fast-paths need to have very little overhead, because instrumentation
points are sprinkled all over the codebase of projects like Android
and Chrome and are performance-critical.

Overhead here is not just the CPU time (or instructions retired) it
takes to execute the instrumentation point. A big source of overhead in a
tracing system is the working set of the instrumentation points,
specifically the extra I-cache and D-cache misses that slow down the
non-tracing code _after_ the tracing instrumentation point.

The major design departures of ProtoZero from canonical C++ protobuf libraries
like [libprotobuf](https://github.com/google/protobuf) are:

* Treating serialization and deserialization as different use-cases served by
  different code.

* Optimizing for binary size and working-set-size on the serialization paths.

* Ignoring most of the error checking and long-tail features of protobuf
  (repeated vs optional, full type checks).

* ProtoZero is not designed as a general-purpose protobuf de/serialization
  library and is heavily customized to keep the tracing write code minimal and
  allow the compiler to see through the architectural layers.

* Code generated by ProtoZero needs to be hermetic. When building the
  amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), all
  Perfetto tracing sources must not depend on any library other than the C++
  standard library and the C library.

## Usage

At the build-system level, ProtoZero is extremely similar to the conventional
libprotobuf library.
The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is built on top of the
libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a
`protoc` compiler plugin.

ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends
on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by
it, however, has no runtime dependency (not even header-only dependencies) on
libprotobuf.

In order to generate ProtoZero stubs from a .proto file you need to:

1. Build the ProtoZero compiler plugin, which lives in
   [src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/).
   ```bash
   tools/ninja -C out/default protozero_plugin protoc
   ```

2. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`:
   ```bash
   out/default/protoc \
       --plugin=protoc-gen-plugin=out/default/protozero_plugin \
       --plugin_out=wrapper_namespace=pbzero:/tmp/ \
       test_msg.proto
   ```
   This generates `/tmp/test_msg.pbzero.{cc,h}`.

   NOTE: The .cc file is always empty. ProtoZero-generated code is header only.
   The .cc file is emitted only because some build systems' rules assume that
   protobuf codegens generate both a .cc and a .h file.

## Proto serialization

The quickest way to understand the ProtoZero design principles is to start from a
small example and compare the generated code between libprotobuf and ProtoZero.

```protobuf
syntax = "proto2";

message TestMsg {
  optional string str_val = 1;
  optional int32 int_val = 2;
  repeated TestMsg nested = 3;
}
```

#### libprotobuf approach

The libprotobuf approach is to generate a C++ class that has one member for each
proto field, with dedicated serialization and de-serialization methods.

```bash
out/default/protoc --cpp_out=. test_msg.proto
```

generates `test_msg.pb.{cc,h}`. With many degrees of simplification, it looks
as follows:

```c++
// This class is generated by the standard protoc compiler in the .pb.h source.
class TestMsg : public protobuf::MessageLite {
 private:
  int32 int_val_;
  ArenaStringPtr str_val_;
  RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>

 public:
  const std::string& str_val() const;
  void set_str_val(const std::string& value);

  bool has_int_val() const;
  int32_t int_val() const;
  void set_int_val(int32_t value);

  ::TestMsg* add_nested();
  ::TestMsg* mutable_nested(int index);
  const TestMsg& nested(int index) const;

  std::string SerializeAsString();
  bool ParseFromString(const std::string&);
};
```

The main characteristics of these stubs are:

* Code generated from .proto messages can be used in the codebase as general
  purpose objects, without ever using the `SerializeAs*()` or `ParseFrom*()`
  methods (although anecdotal evidence suggests that most projects use these
  proto-generated classes only at the de/serialization endpoints).

* The end-to-end journey of serializing a proto involves two steps:
  1. Setting the individual int / string / vector fields of the generated class.
  2. Doing a serialization pass over these fields.

  In turn this has side-effects on the generated code. STL copy/assignment
  operators for strings and vectors are non-trivial because, for instance, they
  need to deal with dynamic memory resizing.

#### ProtoZero approach

```c++
// This class is generated by the ProtoZero plugin in the .pbzero.h source.
class TestMsg : public protozero::Message {
 public:
  void set_str_val(const std::string& value) {
    AppendBytes(/*field_id=*/1, value.data(), value.size());
  }
  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  void set_int_val(int32_t value) {
    AppendVarInt(/*field_id=*/2, value);
  }
  TestMsg* add_nested() {
    return BeginNestedMessage<TestMsg>(/*field_id=*/3);
  }
};
```

The ProtoZero-generated stubs are append-only. As the `set_*`, `add_*` methods
are invoked, the passed arguments are directly serialized into the target
buffer. This introduces some limitations:

* Readback is not possible: these classes cannot be used as C++ struct
  replacements.

* No error-checking is performed: nothing prevents a non-repeated field from
  being emitted twice in the serialized proto if the caller accidentally calls a
  `set_*()` method twice. Basic type checks are still performed at compile-time
  though.

* Nested fields must be filled in a stack fashion and cannot be written
  interleaved. Once a nested message is started, its fields must be set before
  going back to setting the fields of the parent message. This turns out not to
  be a problem for most tracing use-cases.

This has a number of advantages:

* The classes generated by ProtoZero don't add any extra state on top of the
  base class they derive from (`protozero::Message`). They define only inline
  setter methods that call base-class serialization methods. Compilers can
  see through all the inline expansions of these methods.

* As a consequence of that, the binary cost of ProtoZero is independent of the
  number of protobuf messages defined and their fields, and depends only on the
  number of `set_*`/`add_*` calls. This (i.e. the binary cost of unused proto
  messages and fields) anecdotally has been a big issue with libprotobuf.

* The serialization methods don't involve any copy or dynamic allocation. The
  inline expansion calls directly into the corresponding `AppendVarInt()` /
  `AppendString()` methods of `protozero::Message`.

* This allows trace events to be serialized directly into the
  [tracing shared memory buffers](/docs/concepts/buffers.md), even if they are
  not contiguous.

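The append-only idea can be illustrated with a minimal, self-contained sketch.
`MiniMessage` and `MiniTestMsg` below are hypothetical stand-ins for
`protozero::Message` and the generated `TestMsg`; they are not the real
ProtoZero classes. The sketch assumes a caller-provided buffer that is large
enough, and it omits bounds checks and nested messages.

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for protozero::Message: appends fields straight into a
// caller-provided buffer. No bounds checking, no nested messages.
class MiniMessage {
 public:
  explicit MiniMessage(uint8_t* buf) : begin_(buf), ptr_(buf) {
    assert(buf != nullptr);
  }
  size_t size() const { return static_cast<size_t>(ptr_ - begin_); }

 protected:
  void AppendVarInt(uint32_t field_id, uint64_t value) {
    WriteRawVarInt((field_id << 3) | 0);  // Preamble, wire type 0 (varint).
    WriteRawVarInt(value);
  }
  void AppendBytes(uint32_t field_id, const void* data, size_t size) {
    WriteRawVarInt((field_id << 3) | 2);  // Preamble, wire type 2 (bytes).
    WriteRawVarInt(size);
    std::memcpy(ptr_, data, size);
    ptr_ += size;
  }

 private:
  // Base-128 varint: 7 payload bits per byte, MSB set on all but the last.
  void WriteRawVarInt(uint64_t value) {
    while (value >= 0x80) {
      *ptr_++ = static_cast<uint8_t>((value & 0x7f) | 0x80);
      value >>= 7;
    }
    *ptr_++ = static_cast<uint8_t>(value);
  }

  uint8_t* begin_;
  uint8_t* ptr_;
};

// Hypothetical equivalent of a generated stub: no extra state, only inline
// setters that forward to the base class.
class MiniTestMsg : public MiniMessage {
 public:
  using MiniMessage::MiniMessage;
  void set_str_val(const char* data, size_t size) {
    AppendBytes(/*field_id=*/1, data, size);
  }
  void set_int_val(int32_t value) {
    // Negative int32 values are sign-extended to 64 bits before encoding.
    AppendVarInt(/*field_id=*/2,
                 static_cast<uint64_t>(static_cast<int64_t>(value)));
  }
};
```

With this sketch, calling `set_str_val("foo", 3)` followed by `set_int_val(42)`
leaves the bytes `0a 03 66 6f 6f 10 2a` in the buffer. Calling `set_int_val()`
a second time simply appends the field again, which is the lack of
error-checking described in the limitations above.
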
### Scattered buffer writing

A key part of the ProtoZero design is supporting direct serialization on
non-globally-contiguous sequences of contiguous memory regions.

This happens by decoupling `protozero::Message`, the base class for all the
generated classes, from the `protozero::ScatteredStreamWriter`.
The problem it solves is the following: ProtoZero is based on direct
serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in
most cases. At the same time, there is no limit on how much data the caller will
try to write into an individual message; a trace event can be up to 256 MiB.

![ProtoZero scattered buffers diagram](/docs/images/protozero-ssw.png)

#### Fast-path

At all times the underlying `ScatteredStreamWriter` knows the bounds
of the current buffer. All write operations are bounds-checked and hit a
slow-path when crossing the buffer boundary.

Most write operations can be completed within the current buffer boundaries.
In that case, the cost of a `set_*` operation is in essence a `memcpy()` with
the extra overhead of var-int encoding for protobuf preambles and
length-delimited fields.

#### Slow-path

When crossing the boundary, the slow-path asks the
`ScatteredStreamWriter::Delegate` for a new buffer. The implementation of
`GetNewBuffer()` is up to the client. In tracing use-cases, that call will
acquire a new thread-local chunk from the tracing shared memory buffer.

Other heap-based implementations are possible. For instance, the ProtoZero
sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see
[scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)),
which allocates a new heap buffer when crossing the boundaries of the current
one.

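The split between fast-path and slow-path can be sketched as follows. This is
an illustration only: `Range`, `Delegate`, `HeapDelegate` and
`MiniScatteredWriter` are hypothetical names, loosely modelled on the
description above rather than on the real ProtoZero sources. The delegate here
allocates small heap chunks; writes that fit in the current chunk are a bounds
check plus a `memcpy()`, while writes that cross the boundary request a new
chunk through `GetNewBuffer()`.

```c++
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <memory>
#include <vector>

// Hypothetical contiguous memory region handed out by the delegate.
struct Range {
  uint8_t* begin;
  uint8_t* end;
};

// Hypothetical delegate interface: the client decides where buffers come from.
class Delegate {
 public:
  virtual ~Delegate() = default;
  virtual Range GetNewBuffer() = 0;
};

// Heap-backed delegate: allocates a fresh fixed-size chunk on every request.
class HeapDelegate : public Delegate {
 public:
  explicit HeapDelegate(size_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size > 0);
  }
  Range GetNewBuffer() override {
    chunks_.emplace_back(new uint8_t[chunk_size_]());
    uint8_t* begin = chunks_.back().get();
    return Range{begin, begin + chunk_size_};
  }
  size_t chunk_count() const { return chunks_.size(); }
  const uint8_t* chunk(size_t index) const { return chunks_[index].get(); }

 private:
  size_t chunk_size_;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};

class MiniScatteredWriter {
 public:
  explicit MiniScatteredWriter(Delegate* delegate)
      : delegate_(delegate), cur_(delegate->GetNewBuffer()), ptr_(cur_.begin) {}

  void WriteBytes(const uint8_t* src, size_t size) {
    // Fast-path: the write fits in the current chunk.
    if (size <= static_cast<size_t>(cur_.end - ptr_)) {
      std::memcpy(ptr_, src, size);
      ptr_ += size;
      return;
    }
    WriteBytesSlowPath(src, size);
  }

 private:
  // Slow-path: fill what is left of the current chunk, then keep asking the
  // delegate for new chunks until all the bytes are written.
  void WriteBytesSlowPath(const uint8_t* src, size_t size) {
    while (size > 0) {
      if (ptr_ == cur_.end) {
        cur_ = delegate_->GetNewBuffer();
        ptr_ = cur_.begin;
      }
      const size_t n = std::min(size, static_cast<size_t>(cur_.end - ptr_));
      std::memcpy(ptr_, src, n);
      ptr_ += n;
      src += n;
      size -= n;
    }
  }

  Delegate* delegate_;
  Range cur_;
  uint8_t* ptr_;
};
```
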
Consider the following example:

```c++
TestMsg outer_msg;
for (int i = 0; i < 1000; i++) {
  TestMsg* nested = outer_msg.add_nested();
  nested->set_int_val(42);
}
```

At some point one of the `set_int_val()` calls will hit the slow-path and
acquire a new buffer. The overall idea is to have a serialization mechanism
that is extremely lightweight most of the time and that requires some extra
function calls only when crossing a buffer boundary, so that their cost gets
amortized across all trace events.

In the context of the overall Perfetto tracing use case, the slow-path involves
grabbing a process-local mutex and finding the next free chunk in the shared
memory buffer. Hence writes are lock-free as long as they happen within the
thread-local chunk and require a critical section to acquire a new chunk once
every 4KB-32KB (depending on the trace configuration).

The assumption is that the likelihood that two threads will cross the chunk
boundary and call `GetNewBuffer()` at the same time is extremely low and hence
the critical section is uncontended most of the time.

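The amortization argument can be made concrete with a small sketch.
`ChunkPool` and `ChunkWriter` are hypothetical names and do not correspond to
Perfetto classes; the pool only hands out byte budgets instead of real memory,
which is enough to count how often the mutex is taken.

```c++
#include <cassert>
#include <cstddef>
#include <mutex>

// Hypothetical pool standing in for the shared memory buffer. It only hands
// out byte budgets; a real implementation would return memory chunks.
class ChunkPool {
 public:
  explicit ChunkPool(size_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size > 0);
  }

  // Slow-path: the only operation that takes the mutex.
  size_t AcquireChunk() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++acquisitions_;
    return chunk_size_;
  }

  size_t acquisitions() {
    std::lock_guard<std::mutex> lock(mutex_);
    return acquisitions_;
  }

 private:
  std::mutex mutex_;
  const size_t chunk_size_;
  size_t acquisitions_ = 0;
};

// Hypothetical per-thread writer: it touches only its own state unless the
// current chunk is exhausted.
class ChunkWriter {
 public:
  explicit ChunkWriter(ChunkPool* pool) : pool_(pool) {}

  void Write(size_t size) {
    while (size > 0) {
      if (available_ == 0)
        available_ = pool_->AcquireChunk();  // Slow-path, takes the mutex.
      const size_t n = size < available_ ? size : available_;
      available_ -= n;  // Fast-path: no locking, no shared state.
      size -= n;
    }
  }

 private:
  ChunkPool* pool_;
  size_t available_ = 0;
};
```

Under these assumptions, 1000 writes of 16 bytes each against 4096-byte chunks
take the mutex only 4 times: 16000 bytes need four chunks, and every other
write stays on the lock-free path.
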
```mermaid
sequenceDiagram
  participant C as Call site
  participant M as Message
  participant SSR as ScatteredStreamWriter
  participant DEL as Buffer Delegate
  C->>M: set_int_val(...)
  activate C
  M->>SSR: AppendVarInt(...)
  deactivate C
  Note over C,SSR: A typical write on the fast-path

  C->>M: set_str_val(...)
  activate C
  M->>SSR: AppendString(...)
  SSR->>DEL: GetNewBuffer(...)
  deactivate C
  Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.
```

### Deferred patching

Nested messages in the protobuf binary encoding are prefixed with their
varint-encoded size.

Consider the following:

```c++
TestMsg* nested = outer_msg.add_nested();
nested->set_int_val(42);
nested->set_str_val("foo");
```

The canonical encoding of this protobuf message, using libprotobuf, would be:

```bash
1a 07  0a 03 66 6f 6f  10 2a
^-+-^  ^-----+------^  ^-+-^
  |          |           |
  |          |           +--> Field ID: 2 [int_val], value = 42.
  |          |
  |          +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
  |
  +------> Field ID: 3 [nested], length: 7  # !!!
```

The second byte in this sequence (07) is problematic for direct encoding. At the
point where `outer_msg.add_nested()` is called, we can't possibly know upfront
what the overall size of the nested message will be (in this case, 5 + 2 = 7).

The way we get around this in ProtoZero is by reserving four bytes for the
_size_ of each nested message and back-filling them once the message is
finalized (or when we try to set a field in one of the parent messages).
We do this by encoding the size of the message using redundant varint encoding,
in this case: `87 80 80 00` instead of `07`.

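The redundant encoding can be sketched in a few lines. `kSizeFieldBytes`,
`WriteRedundantVarInt()` and `ParseVarInt()` are hypothetical names chosen for
this illustration, not the ProtoZero API. The writer always emits four bytes,
setting the continuation bit on the first three even when the remaining payload
bits are zero; an ordinary varint decoder still reads the original value back.
Four bytes carry 4 x 7 = 28 payload bits, so this sketch can represent sizes
below 2^28 bytes (256 MiB).

```c++
#include <cassert>
#include <cstddef>
#include <cstdint>

// Number of bytes reserved for the size of each nested message.
constexpr size_t kSizeFieldBytes = 4;

// Writes `size` as a fixed-width varint: every byte except the last has the
// continuation bit (0x80) set, regardless of the remaining payload.
inline void WriteRedundantVarInt(uint32_t size, uint8_t* dst) {
  assert(size < (1u << 28));  // 4 bytes x 7 payload bits each.
  for (size_t i = 0; i < kSizeFieldBytes; ++i) {
    const uint8_t continuation = (i < kSizeFieldBytes - 1) ? 0x80 : 0;
    dst[i] = static_cast<uint8_t>((size & 0x7f) | continuation);
    size >>= 7;
  }
}

// Plain varint decoder: it needs no special handling for the redundant form.
inline uint64_t ParseVarInt(const uint8_t* src, size_t len) {
  uint64_t value = 0;
  for (size_t i = 0; i < len && i < 10; ++i) {
    value |= static_cast<uint64_t>(src[i] & 0x7f) << (7 * i);
    if ((src[i] & 0x80) == 0)
      break;
  }
  return value;
}
```
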
At the C++ level, the `protozero::Message` class holds a pointer to its `size`
field, which typically points to the beginning of the message, where the four
bytes are reserved, and back-fills it in the `Message::Finalize()` pass.

This works fine for cases where the entire message lies in one contiguous buffer
but opens a further challenge: a message can be several MBs big. Looking at this
from the overall tracing perspective, the shared memory buffer chunk that holds
the beginning of a message can be long gone (i.e. committed in the central
service buffer) by the time we get to the end.

In order to support this use case, at the tracing code level (outside of
ProtoZero), when a message crosses the buffer boundary, its `size` field gets
redirected to a temporary patch buffer
(see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then
sent out-of-band, piggybacking over the next commit IPC (see
[Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)).

### Performance characteristics

NOTE: For the full code of the benchmark see
      `/src/protozero/test/protozero_benchmark.cc`.

We consider two scenarios: writing a simple event and a nested event.

#### Simple event

Consists of filling a flat proto message with 4 integers (2 x 32-bit,
2 x 64-bit) and a 32-byte string, as follows:

```c++
template <typename T>
void FillMessage_Simple(T* msg) {
  msg->set_field_int32(...);
  msg->set_field_uint32(...);
  msg->set_field_int64(...);
  msg->set_field_uint64(...);
  msg->set_field_string(...);
}
```

#### Nested event

Consists of filling a similar message which is recursively nested 3 levels deep:

```c++
template <typename T>
void FillMessage_Nested(T* msg, int depth = 0) {
  FillMessage_Simple(msg);
  if (depth < 3) {
    auto* child = msg->add_field_nested();
    FillMessage_Nested(child, depth + 1);
  }
}
```

#### Comparison terms

We compare, for the same message type, the performance of ProtoZero,
libprotobuf and a speed-of-light serializer.

The speed-of-light serializer is a very simple C++ class that just appends
data into a linear buffer, making all sorts of favourable assumptions: it does
not use any binary-stable encoding, it does not perform bounds checking,
all writes are 64-bit aligned, and it doesn't deal with thread-safety.

```c++
struct SOLMsg {
  template <typename T>
  void Append(T x) {
    // The memcpy will be elided by the compiler, which will emit just a
    // 64-bit aligned mov instruction.
    memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x));
    ptr_ += sizeof(x);
  }

  void set_field_int32(int32_t x) { Append(x); }
  void set_field_uint32(uint32_t x) { Append(x); }
  void set_field_int64(int64_t x) { Append(x); }
  void set_field_uint64(uint64_t x) { Append(x); }
  // Note: strcpy() returns its destination pointer, so ptr_ is not advanced
  // past the copied string.
  void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }

  alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8];
  char* ptr_ = &storage_[0];
};
```

The speed-of-light serializer serves as a reference for _how fast a serializer
could be if argument marshalling and bounds checking were zero cost._

#### Benchmark results

##### Google Pixel 3 - aarch64

```bash
$ cat out/droid_arm64/args.gn
target_os = "android"
is_clang = true
is_debug = false
target_cpu = "arm64"

$ ninja -C out/droid_arm64/ perfetto_benchmarks && \
  adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
  adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'

------------------------------------------------------------------------
Benchmark                                Time          CPU   Iterations
------------------------------------------------------------------------
BM_Protozero_Simple_Libprotobuf        402 ns       398 ns      1732807
BM_Protozero_Simple_Protozero          242 ns       239 ns      2929528
BM_Protozero_Simple_SpeedOfLight       118 ns       117 ns      6101381
BM_Protozero_Nested_Libprotobuf       1810 ns      1800 ns       390468
BM_Protozero_Nested_Protozero          780 ns       773 ns       901369
BM_Protozero_Nested_SpeedOfLight       138 ns       136 ns      5147958
```

##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux

```bash
$ cat out/linux_clang_release/args.gn
is_clang = true
is_debug = false

$ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
  out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*

------------------------------------------------------------------------
Benchmark                                Time          CPU   Iterations
------------------------------------------------------------------------
BM_Protozero_Simple_Libprotobuf        428 ns       428 ns      1624801
BM_Protozero_Simple_Protozero          261 ns       261 ns      2715544
BM_Protozero_Simple_SpeedOfLight       111 ns       111 ns      6297387
BM_Protozero_Nested_Libprotobuf       1625 ns      1625 ns       436411
BM_Protozero_Nested_Protozero          843 ns       843 ns       849302
BM_Protozero_Nested_SpeedOfLight       140 ns       140 ns      5012910
```