Blame - docs/design-docs/protozero.md - third_party/perfetto

Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	1	# ProtoZero design document
				2
				3	ProtoZero is a zero-copy zero-alloc zero-syscall protobuf serialization library
				4	purposefully built for Perfetto's tracing use cases.
				5
				6	## Motivations
				7
				8	ProtoZero has been designed and optimized for proto serialization, which is used
				9	by all Perfetto tracing paths.
				10	Deserialization was introduced only at a later stage of the project and is
				11	mainly used by offline tools
				12	(e.g., [TraceProcessor](/docs/analysis/trace-processor.md)).
				13	The _zero-copy zero-alloc zero-syscall_ statement applies only to the
				14	serialization code.
				15
				16	Perfetto makes extensive use of protobuf in tracing fast-paths. Every trace
				17	event in Perfetto is a proto
				18	(see [TracePacket](/docs/reference/trace-packet-proto.autogen) reference). This
				19	allows events to be strongly typed and makes it easier for the team to maintain
				20	backwards compatibility using a language that is understood across the board.
				21
				22	Tracing fast-paths need to have very little overhead, because instrumentation
				23	points are sprinkled all over the codebase of projects like Android
				24	and Chrome and are performance-critical.
				25
				26	Overhead here is not just defined as the CPU time (or instructions retired) it
				27	takes to execute the instrumentation point. A big source of overhead in a
				28	tracing system is represented by the working set of the instrumentation points,
				29	specifically extra I-cache and D-cache misses which would slow down the
				30	non-tracing code _after_ the tracing instrumentation point.
				31
				32	The major design departures of ProtoZero from canonical C++ protobuf libraries
				33	like [libprotobuf](https://github.com/google/protobuf) are:
				34
				35	* Treating serialization and deserialization as different use-cases served by
				36	different code.
				37
				38	* Optimizing for binary size and working-set-size on the serialization paths.
				39
				40	* Ignoring most of the error checking and long-tail features of protobuf
				41	(repeated vs optional, full type checks).
				42
				43	* ProtoZero is not designed as general-purpose protobuf de/serialization and is
				44	heavily customized to keep the trace-writing code minimal and allow the
				45	compiler to see through the architectural layers.
				46
				47	* Code generated by ProtoZero needs to be hermetic. When building the
				48	amalgamated [Tracing SDK](/docs/instrumentation/tracing-sdk.md), all
				49	Perfetto tracing sources must not depend on any libraries other than the
				50	C++ standard library and the C library.
				51
				52	## Usage
				53
				54	At the build-system level, ProtoZero is extremely similar to the conventional
Andrew Shulaev	dad4850	2020-06-02 15:59:01 +0100	[diff] [blame]	55	libprotobuf library.
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	56	The ProtoZero `.proto -> .pbzero.{cc,h}` compiler is based on top of the
				57	libprotobuf parser and compiler infrastructure. ProtoZero is implemented as a `protoc`
				58	compiler plugin.
				59
				60	ProtoZero has a build-time-only dependency on libprotobuf (the plugin depends
				61	on libprotobuf's parser and compiler). The `.pbzero.{cc,h}` code generated by
				62	it, however, has no runtime dependency (not even header-only dependencies) on
				63	libprotobuf.
				64
				65	In order to generate ProtoZero stubs from proto you need to:
				66
				67	1. Build the ProtoZero compiler plugin, which lives in
				68	[src/protozero/protoc_plugin/](/src/protozero/protoc_plugin/).
				69	```bash
				70	tools/ninja -C out/default protozero_plugin protoc
				71	```
				72
				73	2. Invoke the libprotobuf `protoc` compiler passing the `protozero_plugin`:
				74	```bash
				75	out/default/protoc \
				76	--plugin=protoc-gen-plugin=out/default/protozero_plugin \
				77	--plugin_out=wrapper_namespace=pbzero:/tmp/ \
				78	test_msg.proto
				79	```
				80	This generates `/tmp/test_msg.pbzero.{cc,h}`.
				81
				82	NOTE: The .cc file is always empty. ProtoZero-generated code is header only.
				83	The .cc file is emitted only because some build systems' rules assume that
				84	protobuf codegens generate both a .cc and a .h file.
				85
				86	## Proto serialization
				87
				88	The quickest way to understand ProtoZero's design principles is to start from a
				89	small example and compare the generated code between libprotobuf and ProtoZero.
				90
				91	```protobuf
				92	syntax = "proto2";
				93
				94	message TestMsg {
				95	  optional string str_val = 1;
				96	  optional int32 int_val = 2;
				97	  repeated TestMsg nested = 3;
				98	}
				99	```
				100
Andrew Shulaev	dad4850	2020-06-02 15:59:01 +0100	[diff] [blame]	101	#### libprotobuf approach
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	102
				103	The libprotobuf approach is to generate a C++ class that has one member for each
				104	proto field, with dedicated serialization and de-serialization methods.
				105
				106	```bash
				107	out/default/protoc --cpp_out=. test_msg.proto
				108	```
				109
				110	generates test_msg.pb.{cc,h}. With many degrees of simplification, it looks
				111	as follows:
				112
				113	```c++
				114	// This class is generated by the standard protoc compiler in the .pb.h source.
				115	class TestMsg : public protobuf::MessageLite {
				116	 private:
				117	  int32_t int_val_;
				118	  ArenaStringPtr str_val_;
				119	  RepeatedPtrField<TestMsg> nested_;  // Effectively a vector<TestMsg>
				120
				121	 public:
				122	  const std::string& str_val() const;
				123	  void set_str_val(const std::string& value);
				124
				125	  bool has_int_val() const;
				126	  int32_t int_val() const;
				127	  void set_int_val(int32_t value);
				128
				129	  ::TestMsg* add_nested();
				130	  ::TestMsg* mutable_nested(int index);
				131	  const TestMsg& nested(int index) const;
				132
				133	  std::string SerializeAsString();
				134	  bool ParseFromString(const std::string&);
				135	};
				136	```
				137
				138	The main characteristics of these stubs are:
				139
				140	* Code generated from .proto messages can be used in the codebase as general
Andrew Shulaev	dad4850	2020-06-02 15:59:01 +0100	[diff] [blame]	141	purpose objects, without ever using the `SerializeAs()` or `ParseFrom()`
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	142	methods (although anecdotal evidence suggests that most projects use these
				143	proto-generated classes only at the de/serialization endpoints).
				144
				145	* The end-to-end journey of serializing a proto involves two steps:
				146	1. Setting the individual int / string / vector fields of the generated class.
				147	2. Doing a serialization pass over these fields.
				148
Andrew Shulaev	dad4850	2020-06-02 15:59:01 +0100	[diff] [blame]	149	In turn this has side-effects on the code generated. STL copy/assignment
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	150	operators for strings and vectors are non-trivial because, for instance, they
				151	need to deal with dynamic memory resizing.
				152
				153	#### ProtoZero approach
				154
				155	```c++
				156	// This class is generated by the ProtoZero plugin in the .pbzero.h source.
				157	class TestMsg : public protozero::Message {
				158	 public:
				159	  void set_str_val(const std::string& value) {
				160	    AppendBytes(/*field_id=*/1, value.data(), value.size());
				161	  }
				162	  void set_str_val(const char* data, size_t size) {
				163	    AppendBytes(/*field_id=*/1, data, size);
				164	  }
				165	  void set_int_val(int32_t value) {
				166	    AppendVarInt(/*field_id=*/2, value);
				167	  }
				168	  TestMsg* add_nested() {
				169	    return BeginNestedMessage<TestMsg>(/*field_id=*/3);
				170	  }
				171	};
				172	```
				173
				174	The ProtoZero-generated stubs are append-only. As the `set_`, `add_` methods
				175	are invoked, the passed arguments are directly serialized into the target
				176	buffer. This introduces some limitations:
				177
				178	* Readback is not possible: these classes cannot be used as C++ struct
				179	replacements.
				180
				181	* No error-checking is performed: nothing prevents a non-repeated field from being
				182	emitted twice in the serialized proto if the caller accidentally calls a
				183	`set_*()` method twice. Basic type checks are still performed at compile-time
				184	though.
				185
				186	* Nested fields must be filled in a stack fashion and cannot be written
				187	interleaved. Once a nested message is started, its fields must be set before
				188	going back to setting the fields of the parent message. This turns out to not be
				189	a problem for most tracing use-cases.
				190
				191	The append-only design also has a number of advantages:
				192
				193	* The classes generated by ProtoZero don't add any extra state on top of the
				194	base class they derive (`protozero::Message`). They define only inline
				195	setter methods that call base-class serialization methods. Compilers can
				196	see through all the inline expansions of these methods.
				197
				198	* As a consequence of that, the binary cost of ProtoZero is independent of the
				199	number of protobuf messages defined and their fields, and depends only on the
				200	number of `set_`/`add_` calls. This (i.e. binary cost of non-used proto
				201	messages and fields) anecdotally has been a big issue with libprotobuf.
				202
				203	* The serialization methods don't involve any copy or dynamic allocation. The
				204	inline expansion calls directly into the corresponding `AppendVarInt()` /
				205	`AppendString()` methods of `protozero::Message`.
				206
				207	* This allows trace events to be serialized directly into the
				208	[tracing shared memory buffers](/docs/concepts/buffers.md), even if they are
				209	not contiguous.
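
The append-only model described above can be illustrated with a minimal, self-contained sketch. This is not the real `protozero::Message`: `SketchMessage`, `SketchTestMsg`, the caller-provided flat buffer and the `bytes_written()` helper are all assumptions made for illustration. The point it tries to show is that each setter encodes straight into the buffer and that the stub class carries no state of its own.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for protozero::Message, for illustration only.
// It appends into a caller-provided buffer and keeps no per-field state.
class SketchMessage {
 public:
  SketchMessage(uint8_t* buf, size_t capacity)
      : begin_(buf), cur_(buf), end_(buf + capacity) {}

  // Wire type 0: preamble followed by the varint-encoded value.
  void AppendVarInt(uint32_t field_id, uint64_t value) {
    WriteVarInt(static_cast<uint64_t>(field_id) << 3);
    WriteVarInt(value);
  }

  // Wire type 2: preamble, varint length, then the raw bytes.
  void AppendBytes(uint32_t field_id, const void* data, size_t size) {
    WriteVarInt((static_cast<uint64_t>(field_id) << 3) | 2);
    WriteVarInt(size);
    assert(static_cast<size_t>(end_ - cur_) >= size);
    memcpy(cur_, data, size);
    cur_ += size;
  }

  size_t bytes_written() const { return static_cast<size_t>(cur_ - begin_); }

 private:
  void WriteVarInt(uint64_t value) {
    do {
      assert(cur_ < end_);
      const uint8_t low_bits = static_cast<uint8_t>(value & 0x7f);
      value >>= 7;
      *cur_++ = static_cast<uint8_t>(low_bits | (value ? 0x80 : 0));
    } while (value);
  }

  uint8_t* begin_;
  uint8_t* cur_;
  uint8_t* end_;
};

// Shaped like a generated stub: inline setters only, no data members.
class SketchTestMsg : public SketchMessage {
 public:
  using SketchMessage::SketchMessage;
  void set_str_val(const char* data, size_t size) { AppendBytes(1, data, size); }
  void set_int_val(int32_t value) {
    // int32 values are sign-extended to 64 bits on the wire.
    AppendVarInt(2, static_cast<uint64_t>(static_cast<int64_t>(value)));
  }
};

static_assert(sizeof(SketchTestMsg) == sizeof(SketchMessage),
              "the stub adds no state on top of its base class");
```

With this sketch, calling `set_int_val(42)` and then `set_str_val("foo", 3)` leaves the seven bytes `10 2a 0a 03 66 6f 6f` in the buffer, in call order rather than field-number order. Calling `set_int_val()` twice would simply emit field 2 twice, which is the missing error-checking mentioned in the limitations.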
				210
				211	### Scattered buffer writing
				212
				213	A key part of the ProtoZero design is supporting direct serialization on
				214	non-globally-contiguous sequences of contiguous memory regions.
				215
				216	This happens by decoupling `protozero::Message`, the base class for all the
				217	generated classes, from the `protozero::ScatteredStreamWriter`.
				218	The problem it solves is the following: ProtoZero is based on direct
				219	serialization into shared memory buffer chunks. These chunks are 4KB - 32KB in
				220	most cases. At the same time, there is no limit on how much data the caller will
				221	try to write into an individual message; a trace event can be up to 256 MiB big.
				222
				223	![ProtoZero scattered buffers diagram](/docs/images/protozero-ssw.png)
				224
				225	#### Fast-path
				226
				227	At all times the underlying `ScatteredStreamWriter` knows the bounds
				228	of the current buffer. All write operations are bounds-checked and hit a
				229	slow-path when crossing the buffer boundary.
				230
				231	Most write operations can be completed within the current buffer boundaries.
				232	In that case, the cost of a `set_*` operation is in essence a `memcpy()` with
				233	the extra overhead of var-int encoding for protobuf preambles and
				234	length-delimited fields.
				235
				236	#### Slow-path
				237
				238	When crossing the boundary, the slow-path asks the
				239	`ScatteredStreamWriter::Delegate` for a new buffer. The implementation of
				240	`GetNewBuffer()` is up to the client. In tracing use-cases, that call will
				241	acquire a new thread-local chunk from the tracing shared memory buffer.
				242
				243	Other heap-based implementations are possible. For instance, the ProtoZero
				244	sources provide a helper class `HeapBuffered<TestMsg>`, mainly used in tests (see
				245	[scattered_heap_buffer.h](/include/perfetto/protozero/scattered_heap_buffer.h)),
				246	which allocates a new heap buffer when crossing the boundaries of the current
				247	one.
				248
				249	Consider the following example:
				250
				251	```c++
				252	TestMsg outer_msg;
				253	for (int i = 0; i < 1000; i++) {
				254	  TestMsg* nested = outer_msg.add_nested();
				255	  nested->set_int_val(42);
				256	}
				257	```
				258
				259	At some point one of the `set_int_val()` calls will hit the slow-path and
				260	acquire a new buffer. The overall idea is having a serialization mechanism
				261	that is extremely lightweight most of the time and that requires some extra
				262	function calls when crossing the buffer boundary, so that their cost gets amortized across
				263	all trace events.
				264
				265	In the context of the overall Perfetto tracing use case, the slow-path involves
				266	grabbing a process-local mutex and finding the next free chunk in the shared
				267	memory buffer. Hence writes are lock-free as long as they happen within the
				268	thread-local chunk and require a critical section to acquire a new chunk once
				269	every 4KB-32KB (depending on the trace configuration).
				270
				271	The assumption is that the likelihood that two threads will cross the chunk
Deepanjan Roy	1ff7fdb	2020-06-24 15:19:18 -0400	[diff] [blame]	272	boundary and call `GetNewBuffer()` at the same time is extremely low and hence
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	273	the critical section is un-contended most of the time.
				274
				275	```mermaid
				276	sequenceDiagram
				277	participant C as Call site
				278	participant M as Message
				279	participant SSR as ScatteredStreamWriter
				280	participant DEL as Buffer Delegate
				281	C->>M: set_int_val(...)
				282	activate C
				283	M->>SSR: AppendVarInt(...)
				284	deactivate C
				285	Note over C,SSR: A typical write on the fast-path
				286
				287	C->>M: set_str_val(...)
				288	activate C
				289	M->>SSR: AppendString(...)
				290	SSR->>DEL: GetNewBuffer(...)
				291	deactivate C
				292	Note over C,DEL: A write on the slow-path when crossing 4KB - 32KB chunks.
				293	```
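
The fast-path / slow-path split can be sketched as follows. This is a simplified illustration, not the real `ScatteredStreamWriter`: `SketchRange`, `SketchDelegate`, `SketchHeapDelegate` and `SketchScatteredWriter` are hypothetical names, the delegate hands out small heap chunks rather than shared memory chunks, and there is no locking.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#include <memory>
#include <vector>

// A contiguous region of writable memory (hypothetical helper type).
struct SketchRange {
  uint8_t* begin;
  uint8_t* end;
};

// Simplified analogue of a buffer delegate: the client decides where the
// next contiguous region comes from.
class SketchDelegate {
 public:
  virtual ~SketchDelegate() = default;
  virtual SketchRange GetNewBuffer() = 0;
};

// Hands out fixed-size heap chunks and keeps them alive.
class SketchHeapDelegate : public SketchDelegate {
 public:
  explicit SketchHeapDelegate(size_t chunk_size) : chunk_size_(chunk_size) {
    assert(chunk_size_ > 0);
  }
  SketchRange GetNewBuffer() override {
    chunks_.emplace_back(new uint8_t[chunk_size_]);
    uint8_t* begin = chunks_.back().get();
    return {begin, begin + chunk_size_};
  }
  size_t num_chunks() const { return chunks_.size(); }
  const uint8_t* chunk(size_t i) const { return chunks_[i].get(); }

 private:
  size_t chunk_size_;
  std::vector<std::unique_ptr<uint8_t[]>> chunks_;
};

class SketchScatteredWriter {
 public:
  explicit SketchScatteredWriter(SketchDelegate* delegate)
      : delegate_(delegate), cur_(delegate->GetNewBuffer()) {}

  void WriteBytes(const uint8_t* src, size_t size) {
    // Fast path: the write fits in the current chunk, so it is one memcpy.
    if (static_cast<size_t>(cur_.end - cur_.begin) >= size) {
      memcpy(cur_.begin, src, size);
      cur_.begin += size;
      return;
    }
    WriteBytesSlowPath(src, size);
  }

 private:
  // Slow path: fill what is left, then ask the delegate for more chunks.
  void WriteBytesSlowPath(const uint8_t* src, size_t size) {
    while (size > 0) {
      if (cur_.begin == cur_.end)
        cur_ = delegate_->GetNewBuffer();
      const size_t avail = static_cast<size_t>(cur_.end - cur_.begin);
      const size_t n = size < avail ? size : avail;
      memcpy(cur_.begin, src, n);
      cur_.begin += n;
      src += n;
      size -= n;
    }
  }

  SketchDelegate* delegate_;
  SketchRange cur_;  // Unwritten part of the current chunk.
};
```

With 4-byte chunks, a 3-byte write stays on the fast path, while a following 7-byte write spills over two chunk boundaries and calls `GetNewBuffer()` twice. In the tracing use case the delegate would return shared memory chunks instead, and only the delegate call would need the critical section.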
				294
				295	### Deferred patching
				296
				297	Nested messages in the protobuf binary encoding are prefixed with their
				298	varint-encoded size.
				299
				300	Consider the following:
				301
				302	```c++
				303	TestMsg* nested = outer_msg.add_nested();
				304	nested->set_int_val(42);
				305	nested->set_str_val("foo");
				306	```
				307
				308	The canonical encoding of this protobuf message, using libprotobuf, would be:
				309
				310	```bash
				311	1a 07 0a 03 66 6f 6f 10 2a
				312	^-+-^ ^-----+------^ ^-+-^
				313	  |         |          |
				314	  |         |          +--> Field ID: 2 [int_val], value = 42.
				315	  |         |
				316	  |         +------> Field ID: 1 [str_val], len = 3, value = "foo" (66 6f 6f).
				317	  |
Andrew Shulaev	dad4850	2020-06-02 15:59:01 +0100	[diff] [blame]	318	  +------> Field ID: 3 [nested], length: 7  # !!!
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	319	```
				320
				321	The second byte in this sequence (07) is problematic for direct encoding. At the
				322	point where `outer_msg.add_nested()` is called, we can't possibly know upfront
				323	what the overall size of the nested message will be (in this case, 5 + 2 = 7).
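
The byte-level breakdown of the canonical encoding can be checked mechanically with a small decoder. The sketch below is an illustration only: `SketchField`, `ReadVarInt` and `ParseFields` are hypothetical names, and it handles just the two wire types that appear in this example (0 for varints, 2 for length-delimited fields).

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#include <vector>

struct SketchField {
  uint32_t field_id;
  uint32_t wire_type;     // 0 = varint, 2 = length-delimited.
  uint64_t value;         // Varint value, or payload length for wire type 2.
  size_t payload_offset;  // Start of the payload (wire type 2 only).
};

// Reads a varint starting at *pos and advances *pos past it.
inline uint64_t ReadVarInt(const uint8_t* data, size_t size, size_t* pos) {
  uint64_t value = 0;
  for (uint32_t shift = 0; shift < 64; shift += 7) {
    assert(*pos < size);
    const uint8_t byte = data[(*pos)++];
    value |= static_cast<uint64_t>(byte & 0x7f) << shift;
    if (!(byte & 0x80))
      break;
  }
  return value;
}

// Splits one level of a serialized message into its fields (no recursion).
inline std::vector<SketchField> ParseFields(const uint8_t* data, size_t size) {
  std::vector<SketchField> fields;
  size_t pos = 0;
  while (pos < size) {
    const uint64_t tag = ReadVarInt(data, size, &pos);
    SketchField f{static_cast<uint32_t>(tag >> 3),
                  static_cast<uint32_t>(tag & 7), 0, 0};
    assert(f.wire_type == 0 || f.wire_type == 2);  // Enough for this example.
    f.value = ReadVarInt(data, size, &pos);
    if (f.wire_type == 2) {
      f.payload_offset = pos;
      assert(f.value <= size - pos);
      pos += static_cast<size_t>(f.value);
    }
    fields.push_back(f);
  }
  return fields;
}
```

Running `ParseFields()` over `1a 07 0a 03 66 6f 6f 10 2a` yields a single outer field (ID 3, length 7). Running it again over those 7 payload bytes yields field 1 with length 3 and field 2 with value 42.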
				324
				325	The way we get around this in ProtoZero is by reserving four bytes for the
				326	_size_ of each nested message and back-filling them once the message is
				327	finalized (or when we try to set a field in one of the parent messages).
				328	We do this by encoding the size of the message using redundant varint encoding,
				329	in this case: `87 80 80 00` instead of `07`.
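
A minimal sketch of the redundant encoding follows. The function and constant names are hypothetical and not taken from the ProtoZero sources. Every byte except the last carries the continuation bit even when the remaining value is zero, so the output is always four bytes long. Four bytes give 4 x 7 = 28 payload bits, which is where a 256 MiB size limit comes from.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

constexpr size_t kSketchSizeFieldBytes = 4;
// 4 bytes x 7 payload bits = 28 bits, i.e. sizes below 256 MiB.
constexpr uint32_t kSketchMaxSize = (1u << 28) - 1;

// Always writes exactly kSketchSizeFieldBytes bytes, padding with
// continuation bytes (0x80) that contribute zero to the value.
inline void WriteRedundantVarInt(uint32_t value, uint8_t* dst) {
  assert(value <= kSketchMaxSize);
  for (size_t i = 0; i < kSketchSizeFieldBytes; ++i) {
    const uint8_t continuation = (i < kSketchSizeFieldBytes - 1) ? 0x80 : 0;
    dst[i] = static_cast<uint8_t>((value & 0x7f) | continuation);
    value >>= 7;
  }
}

// A plain varint decoder: it needs no special handling for the padding.
inline uint32_t ReadVarInt32(const uint8_t* src) {
  uint32_t value = 0;
  for (uint32_t shift = 0; shift < 35; shift += 7) {
    const uint8_t byte = *src++;
    value |= static_cast<uint32_t>(byte & 0x7f) << shift;
    if (!(byte & 0x80))
      break;
  }
  return value;
}
```

`WriteRedundantVarInt(7, buf)` produces `87 80 80 00`, and `ReadVarInt32()` returns 7 for both that sequence and the compact `07`, which is why the padded form stays readable by an ordinary varint decoder.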
				330
				331	At the C++ level, the `protozero::Message` class holds a pointer to its `size`
				332	field, which typically points to the beginning of the message, where the four
				333	bytes are reserved, and back-fills it in the `Message::Finalize()` pass.
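
The reserve-then-back-fill flow can be sketched as below, under simplifying assumptions: a single contiguous buffer, at most one open nested message, and hypothetical names (`SketchNestedWriter`, `BeginNested()`, `FinalizeNested()`) that do not correspond to the real `protozero::Message` interface.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

class SketchNestedWriter {
 public:
  SketchNestedWriter(uint8_t* buf, size_t capacity)
      : begin_(buf), cur_(buf), end_(buf + capacity) {}

  void AppendVarInt(uint32_t field_id, uint64_t value) {
    WriteVarInt(static_cast<uint64_t>(field_id) << 3);  // Wire type 0.
    WriteVarInt(value);
  }

  void AppendBytes(uint32_t field_id, const void* data, size_t size) {
    WriteVarInt((static_cast<uint64_t>(field_id) << 3) | 2);  // Wire type 2.
    WriteVarInt(size);
    assert(static_cast<size_t>(end_ - cur_) >= size);
    memcpy(cur_, data, size);
    cur_ += size;
  }

  // Writes the preamble, reserves 4 bytes for the size and remembers where.
  void BeginNested(uint32_t field_id) {
    assert(size_field_ == nullptr);  // One nesting level keeps this short.
    WriteVarInt((static_cast<uint64_t>(field_id) << 3) | 2);
    assert(end_ - cur_ >= 4);
    size_field_ = cur_;
    cur_ += 4;
  }

  // Back-fills the reserved bytes with a redundant 4-byte varint.
  void FinalizeNested() {
    assert(size_field_ != nullptr);
    uint32_t size = static_cast<uint32_t>(cur_ - (size_field_ + 4));
    assert(size < (1u << 28));
    for (int i = 0; i < 4; ++i) {
      size_field_[i] = static_cast<uint8_t>((size & 0x7f) | (i < 3 ? 0x80 : 0));
      size >>= 7;
    }
    size_field_ = nullptr;
  }

  size_t bytes_written() const { return static_cast<size_t>(cur_ - begin_); }

 private:
  void WriteVarInt(uint64_t value) {
    do {
      assert(cur_ < end_);
      const uint8_t low_bits = static_cast<uint8_t>(value & 0x7f);
      value >>= 7;
      *cur_++ = static_cast<uint8_t>(low_bits | (value ? 0x80 : 0));
    } while (value);
  }

  uint8_t* begin_;
  uint8_t* cur_;
  uint8_t* end_;
  uint8_t* size_field_ = nullptr;  // Points at the reserved size bytes.
};
```

Calling `BeginNested(3)`, `AppendVarInt(2, 42)`, `AppendBytes(1, "foo", 3)` and `FinalizeNested()` on this sketch produces `1a 87 80 80 00 10 2a 0a 03 66 6f 6f`: the same content as the canonical encoding, with fields in call order and a 4-byte size in place of `07`.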
				334
				335	This works fine for cases where the entire message lies in one contiguous buffer
				336	but opens a further challenge: a message can be several MBs big. Looking at this
				337	from the overall tracing perspective, the shared memory buffer chunk that holds
				338	the beginning of a message can be long gone (i.e. committed in the central
				339	service buffer) by the time we get to the end.
				340
				341	In order to support this use case, at the tracing code level (outside of
				342	ProtoZero), when a message crosses the buffer boundary, its `size` field gets
				343	redirected to a temporary patch buffer
				344	(see [patch_list.h](/src/tracing/core/patch_list.h)). This patch buffer is then
				345	sent out-of-band, piggybacking over the next commit IPC (see
				346	[Tracing Protocol ABI](/docs/design-docs/api-and-abi.md#tracing-protocol-abi)).
				347
				348	### Performance characteristics
				349
				350	NOTE: For the full code of the benchmark see
				351	`/src/protozero/test/protozero_benchmark.cc`
				352
				353	We consider two scenarios: writing a simple event and writing a nested event.
				354
				355	#### Simple event
				356
				357	Consists of filling a flat proto message with 4 integers (2 x 32-bit,
				358	2 x 64-bit) and a 32-byte string, as follows:
				359
				360	```c++
				361	void FillMessage_Simple(T* msg) {
				362	  msg->set_field_int32(...);
				363	  msg->set_field_uint32(...);
				364	  msg->set_field_int64(...);
				365	  msg->set_field_uint64(...);
				366	  msg->set_field_string(...);
				367	}
				368	```
				369
				370	#### Nested event
				371
				372	Consists of filling a similar message which is recursively nested 3 levels deep:
				373
				374	```c++
				375	void FillMessage_Nested(T* msg, int depth = 0) {
				376	  FillMessage_Simple(msg);
				377	  if (depth < 3) {
				378	    auto* child = msg->add_field_nested();
				379	    FillMessage_Nested(child, depth + 1);
				380	  }
				381	}
				382	```
				383
				384	#### Comparison terms
				385
				386	We compare, for the same message type, the performance of ProtoZero,
				387	libprotobuf and a speed-of-light serializer.
				388
				389	The speed-of-light serializer is a very simple C++ class that just appends
				390	data into a linear buffer, making all sorts of favourable assumptions. It does
				391	not use any binary-stable encoding, it does not perform bounds checking,
				392	all writes are 64-bit aligned, and it does not deal with thread-safety.
				393
				394	```c++
				395	struct SOLMsg {
				396	  template <typename T>
				397	  void Append(T x) {
				398	    // The memcpy will be elided by the compiler, which will emit just a
				399	    // 64-bit aligned mov instruction.
Primiano Tucci	112f849	2020-09-14 21:31:54 +0200	[diff] [blame]	400	    memcpy(reinterpret_cast<void*>(ptr_), &x, sizeof(x));
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	401	    ptr_ += sizeof(x);
				402	  }
				403
				404	  void set_field_int32(int32_t x) { Append(x); }
				405	  void set_field_uint32(uint32_t x) { Append(x); }
				406	  void set_field_int64(int64_t x) { Append(x); }
				407	  void set_field_uint64(uint64_t x) { Append(x); }
				408	  void set_field_string(const char* str) { ptr_ = strcpy(ptr_, str); }
				409
Primiano Tucci	112f849	2020-09-14 21:31:54 +0200	[diff] [blame]	410	  alignas(uint64_t) char storage_[sizeof(g_fake_input_simple) + 8];
Primiano Tucci	a662485	2020-05-21 19:12:50 +0100	[diff] [blame]	411	  char* ptr_ = &storage_[0];
				412	};
				413	```
				414
				415	The speed-of-light serializer serves as a reference for _how fast a serializer
				416	could be if argument marshalling and bound checking were zero cost._
				417
				418	#### Benchmark results
				419
				420	##### Google Pixel 3 - aarch64
				421
				422	```bash
				423	$ cat out/droid_arm64/args.gn
				424	target_os = "android"
				425	is_clang = true
				426	is_debug = false
				427	target_cpu = "arm64"
				428
				429	$ ninja -C out/droid_arm64/ perfetto_benchmarks && \
				430	adb push --sync out/droid_arm64/perfetto_benchmarks /data/local/tmp/perfetto_benchmarks && \
				431	adb shell '/data/local/tmp/perfetto_benchmarks --benchmark_filter=BM_Proto*'
				432
				433	------------------------------------------------------------------------
				434	Benchmark                              Time        CPU   Iterations
				435	------------------------------------------------------------------------
				436	BM_Protozero_Simple_Libprotobuf      402 ns     398 ns      1732807
				437	BM_Protozero_Simple_Protozero        242 ns     239 ns      2929528
				438	BM_Protozero_Simple_SpeedOfLight     118 ns     117 ns      6101381
				439	BM_Protozero_Nested_Libprotobuf     1810 ns    1800 ns       390468
				440	BM_Protozero_Nested_Protozero        780 ns     773 ns       901369
				441	BM_Protozero_Nested_SpeedOfLight     138 ns     136 ns      5147958
				442	```
				443
				444	##### HP Z920 workstation (Intel Xeon E5-2690 v4) running Linux
				445
				446	```bash
				447
				448	$ cat out/linux_clang_release/args.gn
				449	is_clang = true
				450	is_debug = false
				451
				452	$ ninja -C out/linux_clang_release/ perfetto_benchmarks && \
				453	out/linux_clang_release/perfetto_benchmarks --benchmark_filter=BM_Proto*
				454
				455	------------------------------------------------------------------------
				456	Benchmark                              Time        CPU   Iterations
				457	------------------------------------------------------------------------
				458	BM_Protozero_Simple_Libprotobuf      428 ns     428 ns      1624801
				459	BM_Protozero_Simple_Protozero        261 ns     261 ns      2715544
				460	BM_Protozero_Simple_SpeedOfLight     111 ns     111 ns      6297387
				461	BM_Protozero_Nested_Libprotobuf     1625 ns    1625 ns       436411
				462	BM_Protozero_Nested_Protozero        843 ns     843 ns       849302
				463	BM_Protozero_Nested_SpeedOfLight     140 ns     140 ns      5012910
				464	```