Blame - docs/wrapping-upb.md - third_party/protobuf

Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	1
				2	<!---
				3	This document contains embedded graphviz diagrams inside ```dot blocks.
				4
				5	To convert it to rendered form using render.py:
				6	$ ./render.py wrapping-upb.in.md
				7
				8	You can also live-preview this document with all diagrams using Markdown Preview Enhanced
				9	in Visual Studio Code:
				10	https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
				11	--->
				12
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	13	# Building a protobuf library on upb
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	14
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	15	This is a guide for creating a new protobuf implementation based on upb. It
				16	starts from the beginning and walks you through the process, highlighting
				17	some important design choices you will need to make.
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	18
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	19	## Overview
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	20
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	21	A protobuf implementation consists of two main pieces:
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	22
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	23	1. a code generator, run at compile time, to turn `.proto` files into source
				24	files in your language (we will call this "zlang", assuming an extension of ".z").
				25	2. a runtime component, which implements the wire format and provides the data
				26	structures for representing protobuf data and metadata.
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	27
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	28	<br/>
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	29
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	30	```dot {align="center"}
				31	digraph {
				32	rankdir=LR;
				33	newrank=true;
				34	node [style="rounded,filled" shape=box]
				35	"foo.proto" -> protoc;
				36	"foo.proto" [shape=folder];
				37	protoc [fillcolor=lightgrey];
				38	protoc -> "protoc-gen-zlang";
				39	"protoc-gen-zlang" -> "foo.z";
				40	"protoc-gen-zlang" [fillcolor=palegreen3];
				41	"foo.z" [shape=folder];
				42	labelloc="b";
				43	label="Compile Time";
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	44	}
				45	```
				46
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	47	<br/>
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	48
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	49	```dot {align="center"}
				50	digraph {
				51	newrank=true;
				52	node [style="rounded,filled" shape=box fillcolor=lightgrey]
				53	"foo.z" -> "zlang/upb glue (FFI)";
				54	"zlang/upb glue (FFI)" -> "upb (C)";
				55	"zlang/upb glue (FFI)" [fillcolor=palegreen3];
				56	labelloc="b";
				57	label="Runtime";
				58	}
				59	```
				60
				61	The parts in green are what you will need to implement.
				62
				63	Note that your code generator (`protoc-gen-zlang`) does not need to generate
				64	any C code (e.g. `foo.c`). While upb itself is written in C, upb's parsers and
				65	serializers are fully table-driven, which means there is never any need or even
				66	benefit to generating C code for each proto. upb is capable of full-speed
				67	parsing even when schema data is loaded at runtime from strings embedded into
				68	`foo.z`. This is a key benefit of upb compared with C++ protos, which have
				69	traditionally relied on generated parsers in `foo.pb.cc` files to achieve full
				70	parsing speed, and suffered a ~10x speed penalty in the parser when the schema
				71	data was loaded at runtime.
				72
				73	## Prerequisites
				74
				75	There are a few things that the language runtime must provide in order to wrap
				76	upb.
				77
				78	1. FFI: To wrap upb, your language must be able to call into a C API
				79	through a Foreign Function Interface (FFI). Most languages support FFI in
				80	some form, either through "native extensions" (in which you write some C
				81	code to implement new methods in your language) or through a direct FFI (in
				82	which you can call into regular C functions directly from your language
				83	using a special library).
				84	2. Finalizers, Destructors, or Cleaners: The runtime must provide
				85	finalizers or destructors of some sort. There must be a way of triggering a
				86	call to a C function when the language garbage collects or otherwise
				87	destroys an object. We don't care much whether it is a finalizer, a
				88	destructor, or a cleaner, as long as it gets called eventually when the
				89	object is destroyed. upb allocates memory in C space, and a finalizer is our
				90	only way of making sure that memory is freed and does not leak.
				91	3. HashMap with weak values: (optional) This is not a strong requirement,
				92	but it is sometimes helpful to have a global hashmap with weak values to act
				93	as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not
				94	the keys). There is some question about whether we want to continue to use
				95	this pattern going forward.
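
As a rough illustration of prerequisites 2 and 3, the sketch below uses Python's standard `weakref` module. It is a toy model, not upb's API: integer handles stand in for C pointers, and `release_native`, `Wrapper`, and `get_wrapper` are hypothetical names introduced only for this example.

```python
import gc
import weakref

freed = []  # records which handles were "freed"; stands in for C-side cleanup


def release_native(handle):
    # Hypothetical stand-in for the C function the glue would call via FFI.
    freed.append(handle)


class Wrapper:
    """Language-level object wrapping a (simulated) C pointer."""

    def __init__(self, handle):
        self.handle = handle
        # Run release_native(handle) once this wrapper has been collected.
        weakref.finalize(self, release_native, handle)


# Weak *values*: the cache never keeps a wrapper alive on its own.
cache = weakref.WeakValueDictionary()


def get_wrapper(handle):
    """Return the existing wrapper for a handle, or create one."""
    wrapper = cache.get(handle)
    if wrapper is None:
        wrapper = Wrapper(handle)
        cache[handle] = wrapper
    return wrapper


a = get_wrapper(0x1000)
b = get_wrapper(0x1000)
same = a is b      # both lookups yield the same wrapper object
del a, b           # drop the last strong references
gc.collect()       # not needed on CPython, but harmless elsewhere
```

Because the cache holds its values weakly, dropping the last user reference lets the wrapper be collected, which in turn triggers the finalizer and removes the cache entry.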
				96
				97	## Reflection vs. MiniTables
				98
				99	The first key design decision you will need to make is whether your generated
				100	code will access message data via reflection or minitables. Generally more
				101	dynamic languages will want to use reflection and more static languages will
				102	want to use minitables.
				103
				104	### Reflection
				105
				106	Reflection-based data access makes the most sense in highly dynamic language
				107	interpreters, where method dispatch is generally resolved via strings and hash
				108	table lookups.
				109
				110	In such languages, you can often implement a special method like `__getattr__`
				111	(Python) or `method_missing` (Ruby) that receives the method name as a string.
				112	Using upb's reflection, you can look up a field name using the method name,
				113	thereby using a hash table belonging to upb instead of one provided by the
				114	language.
				115
				116	```python
				117	class FooMessage:
				118	    # Written in Python for illustration, but in practice we will want to
				119	    # implement this in C for speed.
				120	    def __getattr__(self, name):
				121	        field = FooMessage.descriptor.fields_by_name[name]
				122	        return field.get_value(self)
				123	```
				124
				125	Using this design, we only need to attach a single `__getattr__` method to each
				126	message class, instead of defining a getter/setter for each field. In this way
				127	we can avoid duplicating hash tables between upb and the language interpreter,
				128	reducing memory usage.
				129
				130	Reflection-based access requires loading full reflection at runtime. Your
				131	generated code will need to embed serialized descriptors (i.e. a serialized
				132	message of `descriptor.proto`), which has some amount of size overhead and
				133	exposes all message/field names to the binary. It also forces a hash table
				134	lookup in the critical path of field access. If method calls in your language
				135	already have this overhead, then this is no added burden, but for statically
				136	dispatched languages it would cause extra overhead.
				137
				138	If we take this path to its logical conclusion, all class creation can be
				139	performed fully dynamically, using only a binary descriptor as input. The
				140	"generated code" becomes little more than an embedded descriptor plus a
				141	library call to load it. Python has recently gone down this path. Generated
				142	code now looks something like this:
				143
				144	```python
				145	# main_pb2.py
				146	from google3.net.proto2.python.internal import builder as _builder
				147	from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool
				148
				149	DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>")
				150	_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
				151	_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
				152	```
				153
				154	This is all the runtime needs to create all of the classes for messages defined
				155	in that serialized descriptor. This code has no pretense of readability, but
				156	a separate `.pyi` stub file provides a fully expanded and readable list of all
				157	methods a user can expect to be available:
				158
				159	```python
				160	# main_pb2.pyi
				161	from google3.net.proto2.python.public import descriptor as _descriptor
				162	from google3.net.proto2.python.public import message as _message
				163	from typing import ClassVar as _ClassVar, Optional as _Optional
				164
				165	DESCRIPTOR: _descriptor.FileDescriptor
				166
				167	class MyMessage(_message.Message):
				168	    __slots__ = ["my_field"]
				169	    MY_FIELD_FIELD_NUMBER: _ClassVar[int]
				170	    my_field: str
				171	    def __init__(self, my_field: _Optional[str] = ...) -> None: ...
				172	```
				173
				174	To use reflection-based access:
				175
Protobuf Team Bot	f3a0cc4	2022-11-18 10:00:20 -0800	[diff] [blame]	176	1. Load and access descriptor data using the interfaces in upb/def.h.
				177	2. Access message data using the interfaces in upb/reflection.h.
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	178
				179	### MiniTables
				180
Joshua Haberman	34495f8	2022-09-09 12:22:28 -0700	[diff] [blame]	181	MiniTables are a "lite" schema representation that is much smaller than
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	182	reflection. MiniTables omit names, options, and almost everything else from the
				183	`.proto` file, retaining only enough information to parse and serialize binary
				184	format.
				185
				186	MiniTables can be loaded into upb through MiniDescriptors. MiniDescriptors are
				187	a byte-oriented format that can be embedded into your generated code and passed
				188	to upb to construct MiniTables. MiniDescriptors only use printable characters,
				189	and therefore do not require escaping when embedding them into generated code
				190	strings. Overall the size savings of MiniDescriptors are ~60x compared with
				191	regular descriptors.
				192
				193	MiniTables and MiniDescriptors are a natural choice for compiled languages that
				194	resolve method calls at compile time. For languages that are sometimes compiled
				195	and sometimes interpreted, there might not be an obvious choice. When a method
				196	call is statically bound, we want to remove as much overhead as possible,
				197	especially from accessors. In the extreme case, we can use unsafe APIs to read
				198	raw memory at a known offset:
				199
				200	```java
				201	// Example of a maximally-optimized generated accessor.
				202	class FooMessage {
				203	  public long getBarField() {
				204	    // Using Unsafe should give us performance that is comparable to a
				205	    // native member access.
				206	    //
				207	    // The constant "24" is obtained from upb at compile time.
				208	    return sun.misc.Unsafe.getLong(this.ptr, 24);
				209	  }
				210	}
				211	```
				212
				213	This design is very low-level, and tightly couples the generated code to one
				214	specific version of the schema and compiler. A slower but safer version would
				215	look up a field by field number:
				216
				217	```java
				218	// Example of a more loosely-coupled accessor.
				219	class FooMessage {
				220	  public long getBarField() {
				221	    // The constant "2" is the field number. Internally this will look
				222	    // up the number "2" in the MiniTable and use that to read the value
				223	    // from the message.
				224	    return upb.glue.getLong(this.ptr, 2);
				225	  }
				226	}
				227	```
				228
				229	One downside of MiniTables is that they cannot support parsing or serializing
Protobuf Team Bot	bf88f8b	2022-07-20 09:22:30 -0700	[diff] [blame]	230	to JSON or TextFormat, because they do not know the field names. It should be
Joshua Haberman	2a5919d	2022-05-09 10:32:14 -0700	[diff] [blame]	231	possible to generate reflection data "on the side", into separate generated
				232	code files, so that reflection is only pulled in if it is being used. However,
				233	APIs to do this do not exist yet.
				234
				235	To use MiniTable-based access:
				236
Protobuf Team Bot	f3a0cc4	2022-11-18 10:00:20 -0800	[diff] [blame]	237	1. Load and access MiniDescriptor data using the interfaces in upb/mini_table.h.
				238	2. Access message data using the interfaces in upb/msg_accessors.h.
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	239
				240	## Memory Management
				241
				242	One of the core design challenges when wrapping upb is memory management. Every
				243	language runtime will have some memory management system, whether it is
				244	garbage collection, reference counting, manual memory management, or some hybrid
				245	of these. upb is written in C and uses arenas for memory management, but upb is
				246	designed to integrate with a wide variety of memory management schemes, and it
				247	provides a number of tools for making this integration as smooth as possible.
				248
				249	### Arenas
				250
				251	upb defines data structures in C to represent messages, arrays (repeated
				252	fields), and maps. A protobuf message is a hierarchical tree of these objects.
				253	For example, a relatively simple protobuf tree might look something like this:
				254
Joshua Haberman	6a2c01a	2022-04-12 08:57:33 -0700	[diff] [blame]	255	```dot {align="center"}
				256	digraph G {
				257	rankdir=LR;
				258	newrank=true;
				259	node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
				260	upb_msg -> upb_msg2;
				261	upb_msg -> upb_array;
				262	upb_msg [label="upb Message" fillcolor=1]
				263	upb_msg2 [label="upb Message"];
				264	upb_array [label="upb Array"]
				265	}
				266	```
Joshua Haberman	5832e80	2022-01-06 21:08:17 -0800	[diff] [blame]	267
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	268	All upb objects are allocated from an arena. An arena lets you allocate objects
				269	individually, but you cannot free individual objects; you can only free the arena
				270	as a whole. When the arena is freed, all of the individual objects allocated
				271	from that arena are freed together.
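
The arena contract described above can be modeled in a few lines of Python. This is only a conceptual sketch: `ToyArena` is a hypothetical class, not upb's arena, and `bytearray` objects stand in for memory that upb would allocate in C.

```python
class ToyArena:
    """Toy model of an arena: allocate individually, free all at once."""

    def __init__(self):
        self._blocks = []
        self._freed = False

    def alloc(self, size):
        """Allocate one object from the arena."""
        if self._freed:
            raise RuntimeError("arena already freed")
        block = bytearray(size)
        self._blocks.append(block)
        return block

    def live_blocks(self):
        return len(self._blocks)

    def free(self):
        # There is deliberately no per-object free(): everything allocated
        # from this arena is released together.
        self._blocks.clear()
        self._freed = True


arena = ToyArena()
msg = arena.alloc(32)   # e.g. a message
arr = arena.alloc(64)   # e.g. an array
count_before = arena.live_blocks()
arena.free()
count_after = arena.live_blocks()
```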
				272
Joshua Haberman	6a2c01a	2022-04-12 08:57:33 -0700	[diff] [blame]	273	```dot {align="center"}
				274	digraph G {
				275	rankdir=LR;
				276	newrank=true;
				277	subgraph cluster_0 {
				278	label = "upb Arena"
				279	graph[style="rounded,filled" fillcolor=gray]
				280	node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
				281	upb_msg -> upb_array;
				282	upb_msg -> upb_msg2;
				283	upb_msg [label="upb Message" fillcolor=1]
				284	upb_msg2 [label="upb Message"];
				285	upb_array [label="upb Array"];
				286	}
				287	}
				288	```
Joshua Haberman	5832e80	2022-01-06 21:08:17 -0800	[diff] [blame]	289
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	290	In simple cases, the entire tree of objects will all live in a single arena.
				291	This has the nice property that there cannot be any dangling pointers between
				292	objects, since all objects are freed at the same time.
				293
				294	However, upb allows you to create links between any two objects, whether or
				295	not they are in the same arena. The library does not know or care what arenas
				296	the objects are in when you create links between them.
				297
Joshua Haberman	6a2c01a	2022-04-12 08:57:33 -0700	[diff] [blame]	298	```dot {align="center"}
				299	digraph G {
				300	rankdir=LR;
				301	newrank=true;
				302	subgraph cluster_0 {
				303	label = "upb Arena 1"
				304	graph[style="rounded,filled" fillcolor=gray]
				305	node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
				306	upb_msg -> upb_array;
				307	upb_msg -> upb_msg2;
				308	upb_msg [label="upb Message 1" fillcolor=1]
				309	upb_msg2 [label="upb Message 2"];
				310	upb_array [label="upb Array"];
				311	}
				312	subgraph cluster_1 {
				313	label = "upb Arena 2"
				314	graph[style="rounded,filled" fillcolor=gray]
				315	node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1]
				316	upb_msg3;
				317	}
				318	upb_msg2 -> upb_msg3;
				319	upb_msg3 [label="upb Message 3"];
				320	}
				321	```
Joshua Haberman	5832e80	2022-01-06 21:08:17 -0800	[diff] [blame]	322
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	323	When objects are on separate arenas, it is the user's responsibility to ensure
				324	that there are no dangling pointers. In the example above, this means Arena 2
				325	must outlive Message 1 and Message 2.
				326
				327	### Integrating GC with upb
				328
				329	In languages with automatic memory management, the goal is to handle all of the
				330	arenas behind the scenes, so that the user does not have to manage them manually
				331	or even know that they exist.
				332
				333	We can achieve this goal if we set up the object graph in a particular way. The
				334	general strategy is to create wrapper objects around all of the C objects,
				335	including the arena. Our key goal is to make sure the arena wrapper is not
				336	GC'd until all of the C objects in that arena have become unreachable.
				337
				338	For this example, we will assume we are wrapping upb in Python:
				339
Joshua Haberman	6a2c01a	2022-04-12 08:57:33 -0700	[diff] [blame]	340	```dot {align="center"}
				341	digraph G {
				342	rankdir=LR;
				343	newrank=true;
				344	compound=true;
				345
				346	subgraph cluster_1 {
				347	label = "upb Arena"
				348	graph[style="rounded,filled" fillcolor=gray]
				349	node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
				350	upb_msg -> upb_array [style=dashed];
				351	upb_msg -> upb_msg2 [style=dashed];
				352	upb_msg [label="upb Message" fillcolor=1]
				353	upb_msg2 [label="upb Message"];
				354	upb_array [label="upb Array"]
				355	dummy [style=invis]
				356	}
				357	subgraph cluster_python {
				358	node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2]
				359	peripheries=0
				360	py_upb_msg [label="Python Message"];
				361	py_upb_msg2 [label="Python Message"];
				362	py_upb_arena [label="Python Arena"];
				363	}
				364	py_upb_msg -> upb_msg [style=dashed];
				365	py_upb_msg2->upb_msg2 [style=dashed];
				366	py_upb_msg2 -> py_upb_arena [color=springgreen4];
				367	py_upb_msg -> py_upb_arena [color=springgreen4];
				368	py_upb_arena -> dummy [lhead=cluster_1, color=red];
				369	{
				370	rank=same;
				371	upb_msg;
				372	py_upb_msg;
				373	}
				374	{
				375	rank=same;
				376	upb_array;
				377	upb_msg2;
				378	py_upb_msg2;
				379	}
				380	{ rank=same;
				381	dummy;
				382	py_upb_arena;
				383	}
				384	dummy->upb_array [style=invis];
				385	dummy->upb_msg2 [style=invis];
				386
				387	subgraph cluster_01 {
				388	node [shape=plaintext]
				389	peripheries=0
				390	key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
				391	<tr><td align="right" port="i1">raw ptr</td></tr>
				392	<tr><td align="right" port="i2">unique ptr</td></tr>
				393	<tr><td align="right" port="i3">shared (GC) ptr</td></tr>
				394	</table>>]
				395	key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
				396	<tr><td port="i1"> </td></tr>
				397	<tr><td port="i2"> </td></tr>
				398	<tr><td port="i3"> </td></tr>
				399	</table>>]
				400	key:i1:e -> key2:i1:w [style=dashed]
				401	key:i2:e -> key2:i2:w [color=red]
				402	key:i3:e -> key2:i3:w [color=springgreen4]
				403	}
				404	key2:i1:w -> upb_msg [style=invis];
				405	{
				406	rank=same;
				407	key;
				408	upb_msg;
				409	}
				410	}
				411	```
Joshua Haberman	5832e80	2022-01-06 21:08:17 -0800	[diff] [blame]	412
Joshua Haberman	11c9468	2022-01-06 20:20:45 -0800	[diff] [blame]	413	In this example we have three different kinds of pointers:
				414
				415	* raw ptr: This is a pointer that carries no ownership.
				416	* unique ptr: This is a pointer that has unique ownership of the target. The owner
				417	will free the target in its destructor (or finalizer, or cleaner). There can
				418	only be a single unique pointer to a given object.
				419	* shared (GC) ptr: This is a pointer that has shared ownership of the
				420	target. Many objects can point to the target, and the target will be deleted
				421	only when all such references are gone. In a runtime with automatic memory
				422	management (GC), this is a reference that participates in GC. In Python such
				423	references use reference counting, but in other VMs they may use mark and
				424	sweep or some other form of GC instead.
				425
				426	The Python Message wrappers have only raw pointers to the underlying message,
				427	but they contain a shared pointer to the arena that will ensure that the raw
				428	pointer remains valid. Only when all message wrapper objects are destroyed
				429	will the Python Arena become unreachable, and the upb arena ultimately freed.
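
The ownership scheme above can be sketched in pure Python. This is a simplified model under stated assumptions, not the real Python/upb glue: `Arena`, `Message`, and `arena_events` are hypothetical names, integers stand in for C pointers, and `weakref.finalize` plays the role of the destructor that would free the upb arena.

```python
import gc
import weakref

arena_events = []  # records simulated arena frees


class Arena:
    """Wrapper with unique ownership of one (simulated) upb arena."""

    _next_handle = 0

    def __init__(self):
        Arena._next_handle += 1
        self.handle = Arena._next_handle
        # When this wrapper is collected, "free" the underlying arena.
        weakref.finalize(self, arena_events.append, ("free", self.handle))


class Message:
    """Wrapper around a message that lives in an arena."""

    def __init__(self, arena, raw_ptr):
        self._arena = arena  # shared (GC) reference: keeps the arena alive
        self._ptr = raw_ptr  # raw pointer: carries no ownership

    def submessage(self, raw_ptr):
        # A child wrapper shares the parent's arena.
        return Message(self._arena, raw_ptr)


arena = Arena()
root = Message(arena, 0x10)
child = root.submessage(0x20)

del arena, root
gc.collect()
events_while_child_alive = list(arena_events)  # child still holds the arena

del child
gc.collect()
events_after_all_gone = list(arena_events)
```

The arena is freed only after the last message wrapper goes away, regardless of the order in which the user drops the arena and the messages.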
				430
				431	### Links between arenas with "Fuse"
				432
				433	The design given above works well for objects that live in a single arena. But
				434	what if a user wants to create a link between two objects in different arenas?
				435
				436	TODO
				437
				438	## UTF-8 vs. UTF-16
				439
				440	TODO
				441
				442	## Object Cache
				443
				444	TODO