This is a guide for creating a new protobuf implementation based on upb. It starts from the beginning and walks you through the process, highlighting some important design choices you will need to make.
A protobuf implementation consists of two main pieces:
.proto
files into source files in your language (we will call this “zlang”, assuming an extension of “.z”).digraph { rankdir=LR; newrank=true; node [style="rounded,filled" shape=box] "foo.proto" -> protoc; "foo.proto" [shape=folder]; protoc [fillcolor=lightgrey]; protoc -> "protoc-gen-zlang"; "protoc-gen-zlang" -> "foo.z"; "protoc-gen-zlang" [fillcolor=palegreen3]; "foo.z" [shape=folder]; labelloc="b"; label="Compile Time"; }
digraph { newrank=true; node [style="rounded,filled" shape=box fillcolor=lightgrey] "foo.z" -> "zlang/upb glue (FFI)"; "zlang/upb glue (FFI)" -> "upb (C)"; "zlang/upb glue (FFI)" [fillcolor=palegreen3]; labelloc="b"; label="Runtime"; }
The parts in green are what you will need to implement.
Note that your code generator (protoc-gen-zlang
) does not need to generate any C code (eg. foo.c
). While upb itself is written in C, upb's parsers and serializers are fully table-driven, which means there is never any need or even benefit to generating C code for each proto. upb is capable of full-speed parsing even when schema data is loaded at runtime from strings embedded into foo.z
. This is a key benefit of upb compared with C++ protos, which have traditionally relied on generated parsers in foo.pb.cc
files to achieve full parsing speed, and suffered a ~10x speed penalty in the parser when the schema data was loaded at runtime.
There are a few things that the language runtime must provide in order to wrap upb.
upb_msg* -> wrapper
object cache. We want the values to be weak (not the keys). There is some question about whether we want to continue to use this pattern going forward.The first key design decision you will need to make is whether your generated code will access message data via reflection or minitables. Generally more dynamic languages will want to use reflection and more static languages will want to use minitables.
Reflection-based data access makes the most sense in highly dynamic language interpreters, where method dispatch is generally resolved via strings and hash table lookups.
In such languages, you can often implement a special method like __getattr__
(Python) or method_missing
(Ruby) that receives the method name as a string. Using upb's reflection, you can look up a field name using the method name, thereby using a hash table belonging to upb instead of one provided by the language.
class FooMessage: # Written in Python for illustration, but in practice we will want to # implement this in C for speed. def __getattr__(self, name): field = FooMessage.descriptor.fields_by_name[name] return field.get_value(self)
Using this design, we only need to attach a single __getattr__
method to each message class, instead of defining a getter/setter for each field. In this way we can avoid duplicating hash tables between upb and the language interpreter, reducing memory usage.
Reflection-based access requires loading full reflection at runtime. Your generated code will need to embed serialized descriptors (ie. a serialized message of descriptor.proto
), which has some amount of size overhead and exposes all message/field names to the binary. It also forces a hash table lookup in the critical path of field access. If method calls in your language already have this overhead, then this is no added burden, but for statically dispatched languages it would cause extra overhead.
If we take this path to its logical conclusion, all class creation can be performed fully dynamically, using only a binary descriptor as input. The “generated code” becomes little more than an embedded descriptor plus a library call to load it. Python has recently gone down this path. Generated code now looks something like this:
# main_pb2.py from google3.net.proto2.python.internal import builder as _builder from google3.net.proto2.python.public import descriptor_pool as _descriptor_pool DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile("<...>") _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals()) _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.main_pb2', globals())
This is all the runtime needs to create all of the classes for messages defined in that serialized descriptor. This code has no pretense of readability, but a separate .pyi
stub file provides a fully expanded and readable list of all methods a user can expect to be available:
# main_pb2.pyi from google3.net.proto2.python.public import descriptor as _descriptor from google3.net.proto2.python.public import message as _message from typing import ClassVar as _ClassVar, Optional as _Optional DESCRIPTOR: _descriptor.FileDescriptor class MyMessage(_message.Message): __slots__ = ["my_field"] MY_FIELD_FIELD_NUMBER: _ClassVar[int] my_field: str def __init__(self, my_field: _Optional[str] = ...) -> None: ...
To use reflection-based access:
MiniTables are a “lite” schema representation that are much smaller than reflection. MiniTables omit names, options, and almost everything else from the .proto
file, retaining only enough information to parse and serialize binary format.
MiniTables can be loaded into upb through MiniDescriptors. MiniDescriptors are a byte-oriented format that can be embedded into your generated code and passed to upb to construct MiniTables. MiniDescriptors only use printable characters, and therefore do not require escaping when embedding them into generated code strings. Overall the size savings of MiniDescriptors are ~60x compared with regular descriptors.
MiniTables and MiniDescriptors are a natural choice for compiled languages that resolve method calls at compile time. For languages that are sometimes compiled and sometimes interpreted, there might not be an obvious choice. When a method call is statically bound, we want to remove as much overhead as possible, especially from accessors. In the extreme case, we can use unsafe APIs to read raw memory at a known offset:
// Example of a maximally-optimized generated accessor. class FooMessage { public long getBarField() { // Using Unsafe should give us performance that is comparable to a // native member access. // // The constant "24" is obtained from upb at compile time. sun.misc.Unsafe.getLong(this.ptr, 24); } }
This design is very low-level, and tightly couples the generated code to one specific version of the schema and compiler. A slower but safer version would look up a field by field number:
// Example of a more loosely-coupled accessor. class FooMessage { public long getBarField() { // The constant "2" is the field number. Internally this will look // up the number "2" in the MiniTable and use that to read the value // from the message. upb.glue.getLong(this.ptr, 2); } }
One downside of MiniTables is that they cannot support parsing or serializing to JSON or TextFormat, because they do not know the field names. It should be possible to generate reflection data “on the side”, into separate generated code files, so that reflection is only pulled in if it is being used. However APIs to do this do not exist yet.
To use MiniTable-based access:
One of the core design challenges when wrapping upb is memory management. Every language runtime will have some memory management system, whether it is garbage collection, reference counting, manual memory management, or some hybrid of these. upb is written in C and uses arenas for memory management, but upb is designed to integrate with a wide variety of memory management schemes, and it provides a number of tools for making this integration as smooth as possible.
upb defines data structures in C to represent messages, arrays (repeated fields), and maps. A protobuf message is a hierarchical tree of these objects. For example, a relatively simple protobuf tree might look something like this:
digraph G { rankdir=LR; newrank=true; node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] upb_msg -> upb_msg2; upb_msg -> upb_array; upb_msg [label="upb Message" fillcolor=1] upb_msg2 [label="upb Message"]; upb_array [label="upb Array"] }
All upb objects are allocated from an arena. An arena lets you allocate objects individually, but you cannot free individual objects; you can only free the arena as a whole. When the arena is freed, all of the individual objects allocated from that arena are freed together.
digraph G { rankdir=LR; newrank=true; subgraph cluster_0 { label = "upb Arena" graph[style="rounded,filled" fillcolor=gray] node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] upb_msg -> upb_array; upb_msg -> upb_msg2; upb_msg [label="upb Message" fillcolor=1] upb_msg2 [label="upb Message"]; upb_array [label="upb Array"]; } }
In simple cases, the entire tree of objects will all live in a single arena. This has the nice property that there cannot be any dangling pointers between objects, since all objects are freed at the same time.
However upb allows you to create links between any two objects, whether or not they are in the same arena. The library does not know or care what arenas the objects are in when you create links between them.
digraph G { rankdir=LR; newrank=true; subgraph cluster_0 { label = "upb Arena 1" graph[style="rounded,filled" fillcolor=gray] node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] upb_msg -> upb_array; upb_msg -> upb_msg2; upb_msg [label="upb Message 1" fillcolor=1] upb_msg2 [label="upb Message 2"]; upb_array [label="upb Array"]; } subgraph cluster_1 { label = "upb Arena 2" graph[style="rounded,filled" fillcolor=gray] node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1] upb_msg3; } upb_msg2 -> upb_msg3; upb_msg3 [label="upb Message 3"]; }
When objects are on separate arenas, it is the user's responsibility to ensure that there are no dangling pointers. In the example above, this means Arena 2 must outlive Message 1 and Message 2.
In languages with automatic memory management, the goal is to handle all of the arenas behind the scenes, so that the user does not have to manage them manually or even know that they exist.
We can achieve this goal if we set up the object graph in a particular way. The general strategy is to create wrapper objects around all of the C objects, including the arena. Our key goal is to make sure the arena wrapper is not GC'd until all of the C objects in that arena have become unreachable.
For this example, we will assume we are wrapping upb in Python:
digraph G { rankdir=LR; newrank=true; compound=true; subgraph cluster_1 { label = "upb Arena" graph[style="rounded,filled" fillcolor=gray] node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=1, ordering=out] upb_msg -> upb_array [style=dashed]; upb_msg -> upb_msg2 [style=dashed]; upb_msg [label="upb Message" fillcolor=1] upb_msg2 [label="upb Message"]; upb_array [label="upb Array"] dummy [style=invis] } subgraph cluster_python { node [style="rounded,filled" shape=box colorscheme=accent8 fillcolor=2] peripheries=0 py_upb_msg [label="Python Message"]; py_upb_msg2 [label="Python Message"]; py_upb_arena [label="Python Arena"]; } py_upb_msg -> upb_msg [style=dashed]; py_upb_msg2->upb_msg2 [style=dashed]; py_upb_msg2 -> py_upb_arena [color=springgreen4]; py_upb_msg -> py_upb_arena [color=springgreen4]; py_upb_arena -> dummy [lhead=cluster_1, color=red]; { rank=same; upb_msg; py_upb_msg; } { rank=same; upb_array; upb_msg2; py_upb_msg2; } { rank=same; dummy; py_upb_arena; } dummy->upb_array [style=invis]; dummy->upb_msg2 [style=invis]; subgraph cluster_01 { node [shape=plaintext] peripheries=0 key [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> <tr><td align="right" port="i1">raw ptr</td></tr> <tr><td align="right" port="i2">unique ptr</td></tr> <tr><td align="right" port="i3">shared (GC) ptr</td></tr> </table>>] key2 [label=<<table border="0" cellpadding="2" cellspacing="0" cellborder="0"> <tr><td port="i1"> </td></tr> <tr><td port="i2"> </td></tr> <tr><td port="i3"> </td></tr> </table>>] key:i1:e -> key2:i1:w [style=dashed] key:i2:e -> key2:i2:w [color=red] key:i3:e -> key2:i3:w [color=springgreen4] } key2:i1:w -> upb_msg [style=invis]; { rank=same; key; upb_msg; } }
In this example we have three different kinds of pointers:
The Python Message wrappers have only raw pointers to the underlying message, but they contain a shared pointer to the arena that will ensure that the raw pointer remains valid. Only when all message wrapper objects are destroyed will the Python Arena become unreachable, and the upb arena ultimately freed.
The design given above works well for objects that live in a single arena. But what if a user wants to create a link between two objects in different arenas?
TODO
TODO
TODO