blob: 6a98c2268a9953a622795d4649068b5039636fb8 [file] [log] [blame] [view]
Joshua Haberman11c94682022-01-06 20:20:45 -08001
2<!---
3This document contains embedded graphviz diagrams inside ```dot blocks.
4
5To convert it to rendered form using render.py:
6 $ ./render.py wrapping-upb.in.md
7
8You can also live-preview this document with all diagrams using Markdown Preview Enhanced
9in Visual Studio Code:
10 https://marketplace.visualstudio.com/items?itemName=shd101wyy.markdown-preview-enhanced
11--->
12
Joshua Haberman2a5919d2022-05-09 10:32:14 -070013# Building a protobuf library on upb
Joshua Haberman11c94682022-01-06 20:20:45 -080014
Joshua Haberman2a5919d2022-05-09 10:32:14 -070015This is a guide for creating a new protobuf implementation based on upb. It
16starts from the beginning and walks you through the process, highlighting
17some important design choices you will need to make.
Joshua Haberman11c94682022-01-06 20:20:45 -080018
Joshua Haberman2a5919d2022-05-09 10:32:14 -070019## Overview
Joshua Haberman11c94682022-01-06 20:20:45 -080020
Joshua Haberman2a5919d2022-05-09 10:32:14 -070021A protobuf implementation consists of two main pieces:
Joshua Haberman11c94682022-01-06 20:20:45 -080022
Joshua Haberman2a5919d2022-05-09 10:32:14 -0700231. a code generator, run at compile time, to turn `.proto` files into source
24 files in your language (we will call this "zlang", assuming an extension of ".z").
252. a runtime component, which implements the wire format and provides the data
26 structures for representing protobuf data and metadata.
Joshua Haberman11c94682022-01-06 20:20:45 -080027
Joshua Haberman2a5919d2022-05-09 10:32:14 -070028<br/>
Joshua Haberman11c94682022-01-06 20:20:45 -080029
Joshua Haberman2a5919d2022-05-09 10:32:14 -070030```dot {align="center"}
31digraph {
32 rankdir=LR;
33 newrank=true;
34 node [style="rounded,filled" shape=box]
35 "foo.proto" -> protoc;
36 "foo.proto" [shape=folder];
37 protoc [fillcolor=lightgrey];
38 protoc -> "protoc-gen-zlang";
39 "protoc-gen-zlang" -> "foo.z";
40 "protoc-gen-zlang" [fillcolor=palegreen3];
41 "foo.z" [shape=folder];
42 labelloc="b";
43 label="Compile Time";
Joshua Haberman11c94682022-01-06 20:20:45 -080044}
45```
46
Joshua Haberman2a5919d2022-05-09 10:32:14 -070047<br/>
Joshua Haberman11c94682022-01-06 20:20:45 -080048
Joshua Haberman2a5919d2022-05-09 10:32:14 -070049```dot {align="center"}
50digraph {
51 newrank=true;
52 node [style="rounded,filled" shape=box fillcolor=lightgrey]
53 "foo.z" -> "zlang/upb glue (FFI)";
54 "zlang/upb glue (FFI)" -> "upb (C)";
55 "zlang/upb glue (FFI)" [fillcolor=palegreen3];
56 labelloc="b";
57 label="Runtime";
58}
59```
60
61The parts in green are what you will need to implement.
62
63Note that your code generator (`protoc-gen-zlang`) does *not* need to generate
64any C code (eg. `foo.c`). While upb itself is written in C, upb's parsers and
65serializers are fully table-driven, which means there is never any need or even
66benefit to generating C code for each proto. upb is capable of full-speed
67parsing even when schema data is loaded at runtime from strings embedded into
68`foo.z`. This is a key benefit of upb compared with C++ protos, which have
69traditionally relied on generated parsers in `foo.pb.cc` files to achieve full
70parsing speed, and suffered a ~10x speed penalty in the parser when the schema
71data was loaded at runtime.
72
73## Prerequisites
74
75There are a few things that the language runtime must provide in order to wrap
76upb.
77
781. **FFI**: To wrap upb, your language must be able to call into a C API
79 through a Foreign Function Interface (FFI). Most languages support FFI in
80 some form, either through "native extensions" (in which you write some C
81 code to implement new methods in your language) or through a direct FFI (in
82 which you can call into regular C functions directly from your language
83 using a special library).
842. **Finalizers, Destructors, or Cleaners**: The runtime must provide
85 finalizers or destructors of some sort. There must be a way of triggering a
86 call to a C function when the language garbage collects or otherwise
87 destroys an object. We don't care much whether it is a finalizer, a
88 destructor, or a cleaner, as long as it gets called eventually when the
89 object is destroyed. upb allocates memory in C space, and a finalizer is our
90 only way of making sure that memory is freed and does not leak.
913. **HashMap with weak values**: (optional) This is not a strong requirement,
92 but it is sometimes helpful to have a global hashmap with weak values to act
93 as a `upb_msg* -> wrapper` object cache. We want the values to be weak (not
94 the keys). There is some question about whether we want to continue to use
95 this pattern going forward.
96
97## Reflection vs. MiniTables
98
99The first key design decision you will need to make is whether your generated
100code will access message data via reflection or minitables. Generally more
101dynamic languages will want to use reflection and more static languages will
102want to use minitables.
103
104### Reflection
105
106Reflection-based data access makes the most sense in highly dynamic language
107interpreters, where method dispatch is generally resolved via strings and hash
108table lookups.
109
110In such languages, you can often implement a special method like `__getattr__`
111(Python) or `method_missing` (Ruby) that receives the method name as a string.
112Using upb's reflection, you can look up a field name using the method name,
113thereby using a hash table belonging to upb instead of one provided by the
114language.
115
116```python
117class FooMessage:
118 # Written in Python for illustration, but in practice we will want to
119 # implement this in C for speed.
120 def __getattr__(self, name):
121 field = FooMessage.descriptor.fields_by_name[name]
122 return field.get_value(self)
123```
124
125Using this design, we only need to attach a single `__getattr__` method to each
126message class, instead of defining a getter/setter for each field. In this way
127we can avoid duplicating hash tables between upb and the language interpreter,
128reducing memory usage.
129
130Reflection-based access requires loading full reflection at runtime. Your
131generated code will need to embed serialized descriptors (ie. a serialized
132message of `descriptor.proto`), which has some amount of size overhead and
133exposes all message/field names to the binary. It also forces a hash table
134lookup in the critical path of field access. If method calls in your language
135already have this overhead, then this is no added burden, but for statically
136dispatched languages it would cause extra overhead.
137
138If we take this path to its logical conclusion, all class creation can be
139performed fully dynamically, using only a binary descriptor as input. The
140"generated code" becomes little more than an embedded descriptor plus a
141library call to load it. Python has recently gone down this path. Generated
142code now looks something like this:
143
144```python
145# main_pb2.py
146from google3.net.proto2.python.internal import builder as _builder
147from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool
148
149DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>")
150_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
151_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
152```
153
154This is all the runtime needs to create all of the classes for messages defined
155in that serialized descriptor. This code has no pretense of readability, but
156a separate `.pyi` stub file provides a fully expanded and readable list of all
157methods a user can expect to be available:
158
159```python
160# main_pb2.pyi
161from google3.net.proto2.python.public import descriptor as _descriptor
162from google3.net.proto2.python.public import message as _message
163from typing import ClassVar as _ClassVar, Optional as _Optional
164
165DESCRIPTOR: _descriptor.FileDescriptor
166
167class MyMessage(_message.Message):
168 __slots__ = ["my_field"]
169 MY_FIELD_FIELD_NUMBER: _ClassVar[int]
170 my_field: str
171 def __init__(self, my_field: _Optional[str] = ...) -> None: ...
172```
173
174To use reflection-based access:
175
Protobuf Team Botf3a0cc42022-11-18 10:00:20 -08001761. Load and access descriptor data using the interfaces in upb/def.h.
1772. Access message data using the interfaces in upb/reflection.h.
Joshua Haberman2a5919d2022-05-09 10:32:14 -0700178
179### MiniTables
180
Joshua Haberman34495f82022-09-09 12:22:28 -0700181MiniTables are a "lite" schema representation that are much smaller than
Joshua Haberman2a5919d2022-05-09 10:32:14 -0700182reflection. MiniTables omit names, options, and almost everything else from the
183`.proto` file, retaining only enough information to parse and serialize binary
184format.
185
186MiniTables can be loaded into upb through *MiniDescriptors*. MiniDescriptors are
187a byte-oriented format that can be embedded into your generated code and passed
188to upb to construct MiniTables. MiniDescriptors only use printable characters,
189and therefore do not require escaping when embedding them into generated code
190strings. Overall the size savings of MiniDescriptors are ~60x compared with
191regular descriptors.
192
193MiniTables and MiniDescriptors are a natural choice for compiled languages that
194resolve method calls at compile time. For languages that are sometimes compiled
195and sometimes interpreted, there might not be an obvious choice. When a method
196call is statically bound, we want to remove as much overhead as possible,
197especially from accessors. In the extreme case, we can use unsafe APIs to read
198raw memory at a known offset:
199
200```java
201// Example of a maximally-optimized generated accessor.
202class FooMessage {
203 public long getBarField() {
204 // Using Unsafe should give us performance that is comparable to a
205 // native member access.
206 //
207 // The constant "24" is obtained from upb at compile time.
208 sun.misc.Unsafe.getLong(this.ptr, 24);
209 }
210}
211```
212
213This design is very low-level, and tightly couples the generated code to one
214specific version of the schema and compiler. A slower but safer version would
215look up a field by field number:
216
217```java
218// Example of a more loosely-coupled accessor.
219class FooMessage {
220 public long getBarField() {
221 // The constant "2" is the field number. Internally this will look
222 // up the number "2" in the MiniTable and use that to read the value
223 // from the message.
224 upb.glue.getLong(this.ptr, 2);
225 }
226}
227```
228
229One downside of MiniTables is that they cannot support parsing or serializing
Protobuf Team Botbf88f8b2022-07-20 09:22:30 -0700230to JSON or TextFormat, because they do not know the field names. It should be
Joshua Haberman2a5919d2022-05-09 10:32:14 -0700231possible to generate reflection data "on the side", into separate generated
232code files, so that reflection is only pulled in if it is being used. However
233APIs to do this do not exist yet.
234
235To use MiniTable-based access:
236
Protobuf Team Botf3a0cc42022-11-18 10:00:20 -08002371. Load and access MiniDescriptors data using the interfaces in upb/mini_table.h.
2382. Access message data using the interfaces in upb/msg_accessors.h.
Joshua Haberman11c94682022-01-06 20:20:45 -0800239
240## Memory Management
241
242One of the core design challenges when wrapping upb is memory management. Every
243language runtime will have some memory management system, whether it is
244garbage collection, reference counting, manual memory management, or some hybrid
245of these. upb is written in C and uses arenas for memory management, but upb is
246designed to integrate with a wide variety of memory management schemes, and it
247provides a number of tools for making this integration as smooth as possible.
248
249### Arenas
250
251upb defines data structures in C to represent messages, arrays (repeated
252fields), and maps. A protobuf message is a hierarchical tree of these objects.
253For example, a relatively simple protobuf tree might look something like this:
254
Joshua Haberman6a2c01a2022-04-12 08:57:33 -0700255```dot {align="center"}
256digraph G {
257 rankdir=LR;
258 newrank=true;
259 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
260 upb_msg -> upb_msg2;
261 upb_msg -> upb_array;
262 upb_msg [label="upb Message" fillcolor=1]
263 upb_msg2 [label="upb Message"];
264 upb_array [label="upb Array"]
265}
266```
Joshua Haberman5832e802022-01-06 21:08:17 -0800267
Joshua Haberman11c94682022-01-06 20:20:45 -0800268All upb objects are allocated from an arena. An arena lets you allocate objects
269individually, but you cannot free individual objects; you can only free the arena
270as a whole. When the arena is freed, all of the individual objects allocated
271from that arena are freed together.
272
Joshua Haberman6a2c01a2022-04-12 08:57:33 -0700273```dot {align="center"}
274digraph G {
275 rankdir=LR;
276 newrank=true;
277 subgraph cluster_0 {
278 label = "upb Arena"
279 graph[style="rounded,filled" fillcolor=gray]
280 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
281 upb_msg -> upb_array;
282 upb_msg -> upb_msg2;
283 upb_msg [label="upb Message" fillcolor=1]
284 upb_msg2 [label="upb Message"];
285 upb_array [label="upb Array"];
286 }
287}
288```
Joshua Haberman5832e802022-01-06 21:08:17 -0800289
Joshua Haberman11c94682022-01-06 20:20:45 -0800290In simple cases, the entire tree of objects will all live in a single arena.
291This has the nice property that there cannot be any dangling pointers between
292objects, since all objects are freed at the same time.
293
294However upb allows you to create links between any two objects, whether or
295not they are in the same arena. The library does not know or care what arenas
296the objects are in when you create links between them.
297
Joshua Haberman6a2c01a2022-04-12 08:57:33 -0700298```dot {align="center"}
299digraph G {
300 rankdir=LR;
301 newrank=true;
302 subgraph cluster_0 {
303 label = "upb Arena 1"
304 graph[style="rounded,filled" fillcolor=gray]
305 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
306 upb_msg -> upb_array;
307 upb_msg -> upb_msg2;
308 upb_msg [label="upb Message 1" fillcolor=1]
309 upb_msg2 [label="upb Message 2"];
310 upb_array [label="upb Array"];
311 }
312 subgraph cluster_1 {
313 label = "upb Arena 2"
314 graph[style="rounded,filled" fillcolor=gray]
315 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1]
316 upb_msg3;
317 }
318 upb_msg2 -> upb_msg3;
319 upb_msg3 [label="upb Message 3"];
320}
321```
Joshua Haberman5832e802022-01-06 21:08:17 -0800322
Joshua Haberman11c94682022-01-06 20:20:45 -0800323When objects are on separate arenas, it is the user's responsibility to ensure
324that there are no dangling pointers. In the example above, this means Arena 2
325must outlive Message 1 and Message 2.
326
327### Integrating GC with upb
328
329In languages with automatic memory management, the goal is to handle all of the
330arenas behind the scenes, so that the user does not have to manage them manually
331or even know that they exist.
332
333We can achieve this goal if we set up the object graph in a particular way. The
334general strategy is to create wrapper objects around all of the C objects,
335including the arena. Our key goal is to make sure the arena wrapper is not
336GC'd until all of the C objects in that arena have become unreachable.
337
338For this example, we will assume we are wrapping upb in Python:
339
Joshua Haberman6a2c01a2022-04-12 08:57:33 -0700340```dot {align="center"}
341digraph G {
342 rankdir=LR;
343 newrank=true;
344 compound=true;
345
346 subgraph cluster_1 {
347 label = "upb Arena"
348 graph[style="rounded,filled" fillcolor=gray]
349 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out]
350 upb_msg -> upb_array [style=dashed];
351 upb_msg -> upb_msg2 [style=dashed];
352 upb_msg [label="upb Message" fillcolor=1]
353 upb_msg2 [label="upb Message"];
354 upb_array [label="upb Array"]
355 dummy [style=invis]
356 }
357 subgraph cluster_python {
358 node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2]
359 peripheries=0
360 py_upb_msg [label="Python Message"];
361 py_upb_msg2 [label="Python Message"];
362 py_upb_arena [label="Python Arena"];
363 }
364 py_upb_msg -> upb_msg [style=dashed];
365 py_upb_msg2->upb_msg2 [style=dashed];
366 py_upb_msg2 -> py_upb_arena [color=springgreen4];
367 py_upb_msg -> py_upb_arena [color=springgreen4];
368 py_upb_arena -> dummy [lhead=cluster_1, color=red];
369 {
370 rank=same;
371 upb_msg;
372 py_upb_msg;
373 }
374 {
375 rank=same;
376 upb_array;
377 upb_msg2;
378 py_upb_msg2;
379 }
380 { rank=same;
381 dummy;
382 py_upb_arena;
383 }
384 dummy->upb_array [style=invis];
385 dummy->upb_msg2 [style=invis];
386
387 subgraph cluster_01 {
388 node [shape=plaintext]
389 peripheries=0
390 key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
391 <tr><td align="right" port="i1">raw ptr</td></tr>
392 <tr><td align="right" port="i2">unique ptr</td></tr>
393 <tr><td align="right" port="i3">shared (GC) ptr</td></tr>
394 </table>>]
395 key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0">
396 <tr><td port="i1">&nbsp;</td></tr>
397 <tr><td port="i2">&nbsp;</td></tr>
398 <tr><td port="i3">&nbsp;</td></tr>
399 </table>>]
400 key:i1:e -> key2:i1:w [style=dashed]
401 key:i2:e -> key2:i2:w [color=red]
402 key:i3:e -> key2:i3:w [color=springgreen4]
403 }
404 key2:i1:w -> upb_msg [style=invis];
405 {
406 rank=same;
407 key;
408 upb_msg;
409 }
410}
411```
Joshua Haberman5832e802022-01-06 21:08:17 -0800412
Joshua Haberman11c94682022-01-06 20:20:45 -0800413In this example we have three different kinds of pointers:
414
415* **raw ptr**: This is a pointer that carries no ownership.
416* **unique ptr**: This is a pointer has *unique ownership* of the target. The owner
417 will free the target in its destructor (or finalizer, or cleaner). There can
418 only be a single unique pointer to a given object.
419* **shared (GC) ptr**: This is a pointer that has *shared ownership* of the
420 target. Many objects can point to the target, and the target will be deleted
421 only when all such references are gone. In a runtime with automatic memory
422 management (GC), this is a reference that participates in GC. In Python such
423 references use reference counting, but in other VMs they may use mark and
424 sweep or some other form of GC instead.
425
426The Python Message wrappers have only raw pointers to the underlying message,
427but they contain a shared pointer to the arena that will ensure that the raw
428pointer remains valid. Only when all message wrapper objects are destroyed
429will the Python Arena become unreachable, and the upb arena ultimately freed.
430
431### Links between arenas with "Fuse"
432
433The design given above works well for objects that live in a single arena. But
434what if a user wants to create a link between two objects in different arenas?
435
436TODO
437
438## UTF-8 vs. UTF-16
439
440TODO
441
442## Object Cache
443
444TODO