| # upb Design |
| |
| [TOC] |
| |
| upb is a protobuf kernel written in C. It is a fast and conformant |
| implementation of protobuf, with a low-level C API that is designed to be |
| wrapped in other languages. |
| |
| upb is not designed to be used by applications directly. The C API is very |
| low-level, unsafe, and changes frequently. It is important that upb is able to |
| make breaking API changes as necessary, to avoid taking on technical debt that |
| would compromise upb's goals of small code size and fast performance. |
| |
| ## Design goals |
| |
| Goals: |
| |
| - Full protobuf conformance |
| - Small code size |
| - Fast performance (without compromising code size) |
| - Easy to wrap in language runtimes |
| - Easy to adapt to different memory management schemes (refcounting, GC, etc) |
| |
| Non-Goals: |
| |
| - Stable API |
| - Safe API |
| - Ergonomic API for applications |
| |
| Parameters: |
| |
| - C99 |
| - 32 or 64-bit CPU (assumes 4 or 8 byte pointers) |
| - Uses pointer tagging, but avoids other implementation-defined behavior |
| - Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc) |
| - No global state, fully re-entrant |
| |
| ## Arenas |
| |
| All memory management in upb uses arenas, using the type `upb_Arena`. Arenas are |
| an alternative to `malloc()` and `free()` that significantly reduces the costs |
| of memory allocation. |
| |
| Arenas obtain blocks of memory using some underlying allocator (likely |
| `malloc()` and `free()`), and satisfy allocations using a simple bump allocator |
| that walks through each block in linear order. Allocations cannot be freed |
| individually: it is only possible to free the arena as a whole, which frees all |
| of the underlying blocks. |
| |
| Here is an example of using the `upb_Arena` type: |
| |
| ```c |
| upb_Arena* arena = upb_Arena_New(); |
| |
| // Perform some allocations. |
| int* x = upb_Arena_Malloc(arena, sizeof(*x)); |
| int* y = upb_Arena_Malloc(arena, sizeof(*y)); |
| |
| // We cannot free `x` and `y` separately, we can only free the arena |
| // as a whole. |
| upb_Arena_Free(arena); |
| ``` |
| |
| upb uses arenas for all memory management, and this fact is reflected in the API |
| for all upb data structures. All upb functions that allocate take a `upb_Arena*` |
| parameter and perform allocations using that arena rather than calling |
| `malloc()` or `free()`. |
| |
| ```c |
| // upb API to create a message. |
| UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table, |
| upb_Arena* arena); |
| |
| void MakeMessage(const upb_MiniTable* mini_table) { |
| upb_Arena* arena = upb_Arena_New(); |
| |
| // This message is allocated on our arena. |
| upb_Message* msg = upb_Message_New(mini_table, arena); |
| |
| // We can free the arena whenever we want, but we cannot free the |
| // message separately from the arena. |
| upb_Arena_Free(arena); |
| |
| // msg is now deleted. |
| } |
| ``` |
| |
| Arenas are a key part of upb's performance story. Parsing a large protobuf |
| payload usually involves rapidly creating a series of messages, arrays (repeated |
| fields), and maps. It is crucial for parsing performance that these allocations |
| are as fast as possible. Equally important, freeing the tree of messages should |
| be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg |
| n)`. |
| |
| ### Avoiding Dangling Pointers |
| |
| Objects allocated on an arena will frequently contain pointers to other |
| arena-allocated objects. For example, a `upb_Message` will have pointers to |
| sub-messages that are also arena-allocated. |
| |
| Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide |
| automatic safety from dangling pointers. Instead, upb provides tools to help |
| bridge between higher-level memory management schemes (GC, refcounting, RAII, |
| borrow checkers) and arenas. |
| |
| If there is only one arena, dangling pointers within the arena are impossible, |
| because all objects are freed at the same time. This is the simplest case. The |
| user must still be careful not to keep dangling pointers that point at arena |
| memory after it has been freed, but dangling pointers *between* the arena |
| objects will be impossible. |
| |
| But what if there are multiple arenas? If we have a pointer from one arena to |
| another, how do we ensure that this will not become a dangling pointer? |
| |
| To help with the multiple arena case, upb provides a primitive called "fuse". |
| |
| ```c |
| // Fuses the lifetimes of `a` and `b`. None of the blocks from `a` or `b` |
| // will be freed until both arenas are freed. |
| UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b); |
| ``` |
| |
| When two arenas are fused together, their lifetimes are irreversibly joined, |
| such that none of the arena blocks in either arena will be freed until *both* |
| arenas are freed with `upb_Arena_Free()`. This means that dangling pointers |
| between the two arenas will no longer be possible. |
| |
| Fuse is useful when joining two messages from separate arenas (making one a |
| sub-message of the other). Fuse is a relatively cheap operation, on the order of |
| 150ns, and is very nearly `O(1)` in the number of arenas being fused (the true |
| complexity is the inverse Ackermann function, which grows extremely slowly). |
| |
| Each arena does consume some memory, so repeatedly creating and fusing an |
| additional arena is not free, but the CPU cost of fusing two arenas together is |
| modest. |
| |
| ### Initial Block and Custom Allocators |
| |
| `upb_Arena` normally uses `malloc()` and `free()` to allocate and return its |
| underlying blocks. But this default strategy can be customized to support the |
| needs of a particular language. |
| |
| The lowest-level function for creating a `upb_Arena` is: |
| |
| ```c |
| // Creates an arena from the given initial block (if any -- n may be 0). |
| // Additional blocks will be allocated from |alloc|. If |alloc| is NULL, |
| // this is a fixed-size arena and cannot grow. |
| UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc); |
| ``` |
| |
| The buffer `[mem, n]` will be used as an "initial block", which is used to |
| satisfy allocations before calling any underlying allocation function. Note that |
| the `upb_Arena` itself will be allocated from the initial block if possible, so |
| the amount of memory available for allocation from the arena will be less than |
| `n`. |
| |
| The `alloc` parameter specifies a custom memory allocation function which will |
| be used once the initial block is exhausted. The user can pass `NULL` as the |
| allocation function, in which case the initial block is the only memory |
| available in the arena. This can allow upb to be used even in situations where |
| there is no heap. |
| |
| It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating |
| operations like `upb_Message_New()` should be checked for failure if there is |
| any possibility that a fixed size arena is in use. |
| |
| ## Schemas |
| |
| Nearly all operations in upb require that you have a schema. A protobuf schema |
| is a data structure that contains all of the message, field, enum, etc. |
| definitions that are specified in a `.proto` file. To create, parse, serialize, |
| or access a message you must have a schema. For this reason, loading a schema is |
| generally the first thing you must do when you use upb. [^0] |
| |
| [^0]: This requirement comes from the protobuf wire format itself, which is a |
| deep insight about the nature of protobuf (or at least the existing wire |
| format). Unlike JSON, protobuf cannot be parsed or manipulated in a |
| schema-less way. This is because the binary wire format does not |
| distinguish between strings and sub-messages, so a generic parser that is |
| oblivious to the schema is not possible. If a future version of the wire |
| format did distinguish between these, it could be possible to have a |
| schema-agnostic data representation, parser, and serializer. |
| |
| upb has two main data structures that represent a protobuf schema: |
| |
| * **MiniTables** are a compact, stripped down version of the schema that |
| contains only the information necessary for parsing and serializing the |
| binary wire format. |
| * **Reflection** contains basically all of the data from a `.proto` file, |
| including the original names of all messages/fields/etc., and all options. |
| |
| The table below summarizes the main differences between these two: |
| |
| | | MiniTables | Reflection | |
| | ------------------- | ------------------------- | -------------------------- | |
| | Contains | Field numbers and types | All data in `.proto` file, | |
| : : only : including names of : |
| : : : everything : |
| | Used to parse | binary format | JSON / TextFormat | |
| | Wire representation | MiniDescriptor | Descriptor | |
| | Type names | `upb_MiniTable`, | `upb_MessageDef`, | |
| : : `upb_MiniTableField`, ... : `upb_FieldDef`, ... : |
| | Registry | `upb_ExtensionRegistry` | `upb_DefPool` | |
| : : (for extensions) : : |
| |
| MiniTables are useful if you only need the binary wire format, because they are |
| much lighter weight than full reflection. |
| |
| Reflection is useful if you need to parse JSON or TextFormat, or you need access |
| to options that were specified in the `proto` file. Note that reflection also |
| includes MiniTables, so if you have reflection, you also have MiniTables |
| available. |
| |
| ### MiniTables |
| |
| MiniTables are represented by a set of data structures with names like |
| `upb_MiniTable` (representing a message), `upb_MiniTableField`, |
| `upb_MiniTableFile`, etc. Whenever you see one of these types in a function |
| signature, you know that this particular operation requires a MiniTable. For |
| example: |
| |
| ``` |
| // Parses the wire format data in the given buffer `[buf, size]` and writes it |
| // to the message `msg`, which has the type `mt`. |
| UPB_API upb_DecodeStatus upb_Decode(const char* buf, size_t size, |
| upb_Message* msg, const upb_MiniTable* mt, |
| const upb_ExtensionRegistry* extreg, |
| int options, upb_Arena* arena); |
| ``` |
| |
| The subset of upb that requires only MiniTables can be thought of as "upb lite," |
| because both the code size and the runtime memory overhead will be less than |
| "upb full" (the parts that use reflection). |
| |
| #### Loading |
| |
| There are three main ways of loading a MiniTable: |
| |
| 1. **From C generated code:** The upb code generator can emit `.upb.c` files that |
| contain the MiniTables as global constant variables. When the main program |
| links against these, the MiniTable will be placed into `.rodata` (or |
| `.data.rel.ro`) in the binary. The MiniTable can then be obtained from a |
| generated function. In Blaze/Bazel these files can be generated and linked |
| using the `upb_proto_library()` rule. |
| 2. **From MiniDescriptors:** The user can build MiniDescriptors into MiniTables |
| at runtime. MiniDescriptors are a compact upb-specific wire format designed |
| specially for this purpose. The user can call `upb_MiniTable_Build()` at |
| runtime to convert MiniDescriptors to MiniTables. |
| 3. **From reflection:** If you have already built reflection data structures |
| for your type, then you can obtain the `upb_MiniTable` corresponding to a |
| `upb_MessageDef` using `upb_MessageDef_MiniTable()`. |
| |
| For languages that are already using reflection, (3) is an obvious choice. |
| |
| For languages that are avoiding reflection, here is a general guideline for |
| choosing between (1) and (2): if the language being wrapped participates in the |
| standard binary linking model on a given platform (in particular, if it is |
| generally linked using `ld`), then it is better to use (1), which is also known |
| as "static loading". |
| |
| Static loading of MiniTables has the benefit of requiring no runtime |
| initialization[^2], leading to faster startup. Static loading of MiniTables also |
| facilitates cross-language sharing of proto messages, because sharing generally |
| requires that both languages are using exactly the same MiniTables. |
| |
| The main downside of static loading is that it requires the user to generate one |
| `.upb.c` file per `.proto` and link against the transitive closure of `.upb.c` |
| files. Blaze and Bazel make this reasonably easy, but for other build systems it |
| can be more of a challenge. |
| |
| [^2]: aside from possible pointer relocations performed by the ELF/Mach-O loader |
| if the library or binary is position-independent |
| |
| Loading from MiniDescriptors, as in option (2), has the advantage that it does |
| not require per-message linking of C code. For many language toolchains, |
| generating and linking some custom C code for every protobuf file or message |
| type would be a burdensome requirement. MiniDescriptors are a convenient way of |
| loading MiniTables without needing to cross the FFI boundary outside the core |
| runtime. |
| |
| A common pattern when using dynamic loading is to embed strings containing |
| MiniDescriptors directly into generated code. For example, the generated code in |
| Dart for a message with only primitive fields currently looks something like: |
| |
| ```dart |
| const desc = r'$(+),*-#$%&! /10'; |
| _accessor = $pb.instance.registry.newMessageAccessor(desc); |
| ``` |
| |
| The implementation of `newMessageAccessor()` is mainly just a wrapper around |
| `upb_MiniTable_Build()`, which builds a MiniTable from a MiniDescriptor. In the |
| code generator, the MiniDescriptor can be obtained from the |
| `upb_MessageDef_MiniDescriptorEncode()` API; users should never need to encode a |
| MiniDescriptor manually. |
| |
| #### Linking |
| |
| When building MiniTables dynamically, it is the user's responsibility to link |
| each message to the to the appropriate sub-messages and or enums. Each message |
| must have its message and closed enum fields linked using |
| `upb_MiniTable_SetSubMessage()` and `upb_MiniTable_SetSubEnum()`, respectively. |
| |
| A higher-level function that links all fields at the same time is also |
| available, as `upb_MiniTable_Link()`. This function pairs well with |
| `upb_MiniTable_GetSubList()` which can be used in a code generator to get a list |
| of all the messages and enums which must be passed to `upb_MiniTable_Link()`. |
| |
| A common pattern is to embed the `link()` calls directly into the generated |
| code. For example, here is an example from Dart of building a MiniTable that |
| contains sub-messages and enums: |
| |
| ```dart |
| const desc = r'$3334'; |
| _accessor = $pb.instance.registry.newMessageAccessor(desc); |
| _accessor!.link( |
| [ |
| M2.$_accessor, |
| M3.$_accessor, |
| M4.$_accessor, |
| ], |
| [ |
| E.$_accessor, |
| ], |
| ); |
| ``` |
| |
| In this case, `upb_MiniTable_GetSubList()` was used in the code generator to |
| discover the 3 sub-message fields and 1 sub-enum field that require linking. At |
| runtime, these lists of MiniTables are passed to the `link()` function, which |
| will internally call `upb_MiniTable_Link()`. |
| |
| Note that in some cases, the application may choose to delay or even skip the |
| registration of sub-message types, as part of a tree shaking strategy. |
| |
| When using static MiniTables, a manual link step is not necessary, as linking is |
| performed automatically by `ld`. |
| |
| #### Enums |
| |
| MiniTables primarily carry data about messages, fields, and extensions. However |
| for closed enums, we must also have a `upb_MiniTableEnum` structure that stores |
| the set of all numbers that are defined in the enum. This is because closed |
| enums have the unfortunate behavior of putting unknown enum values into the |
| unknown field set. |
| |
| Over time, closed enums will hopefully be phased out via editions, and the |
| relevance and overhead of `upb_MiniTableEnum` will shrink and eventually |
| disappear. |
| |
| ### Reflection |
| |
| Reflection uses types like `upb_MessageDef` and `upb_FieldDef` to represent the |
| full contents of a `.proto` file at runtime. These types are upb's direct |
| equivalents of `google::protobuf::Descriptor`, `google::protobuf::FieldDescriptor`, etc. [^1] |
| |
| [^1]: upb consistently uses `Def` where C++ would use `Descriptor` in type |
| names. This introduces divergence with C++; the rationale was to conserve |
| horizontal line length, as `Def` is less than 1/3 the length of |
| `Descriptor`. This is more relevant in C, where the type name is repeated |
| in every function, eg. `upb_FieldDef_Name()` vs. |
| `upb_FieldDescriptor_Name()`. |
| |
| Whenever you see one of these types in a function signature, you know that the |
| given operation requires reflection. For example: |
| |
| ```c |
| // Parses JSON format into a message object, using reflection. |
| UPB_API bool upb_JsonDecode(const char* buf, size_t size, upb_Message* msg, |
| const upb_MessageDef* m, const upb_DefPool* symtab, |
| int options, upb_Arena* arena, upb_Status* status); |
| ``` |
| |
| The part of upb that requires reflection can be thought of as "upb full." These |
| parts of the library cannot be used if a given application has only loaded |
| MiniTables. There is no way to convert a MiniTable into reflection. |
| |
| The `upb_DefPool` type is the top-level container that builds and owns some set |
| of defs. This type is a close analogue of `google::protobuf::DescriptorPool` in C++. The |
| user must always ensure that the `upb_DefPool` outlives any def objects that it |
| owns. |
| |
| #### Loading |
| |
| As with MiniTable loading, we have multiple options for how to load full |
| reflection: |
| |
| 1. **From C generated code**: The upb code generator can create `foo.upbdefs.c` |
| files that embed the descriptors and exports generated C functions for |
| adding them to a user-provided `upb_DefPool`. |
| 2. **From descriptors**: The user can make manual calls to |
| `upb_DefPool_AddFile()`, using descriptors obtained at runtime. Defs for |
| individual messages can then be obtained using |
| `upb_DefPool_FindMessageByName()`. |
| |
| Unlike MiniTables, loading from generated code requires runtime initialization, |
| as reflection data structures like `upb_MessageDef` are not capable of being |
| emitted directly into `.rodata` like `upb_MiniTable` is. Instead, the generated |
| code embeds serialized descriptor protos into `.rodata` which are then built |
| into heap objects at runtime. |
| |
| From this you might conclude that option (1) is nothing but a convenience |
| wrapper around option (2), but that is not quite correct either. Option (1) |
| *does* link against the static `.upb.c` structures for the MiniTables, whereas |
| option (2) will build the MiniTables from scratch on the heap. So option (1) |
| will use marginally less CPU and RAM when the descriptors are loaded into a |
| `upb_DefPool`. More importantly, the resulting descriptors will be capable of |
| reflecting over any messages built from the generated `.upb.c` MiniTables, |
| whereas descriptors built using option (2) will have distinct MiniTables that |
| cannot reflect over messages that use the generated MiniTables. |
| |
| A common pattern for dynamic languages like PHP, Ruby, or Python, is to use |
| option (2) with descriptors that are embedded into the generated code. For |
| example, the generated code in Python currently looks something like: |
| |
| ```python |
| from google.protobuf import descriptor_pool as _descriptor_pool |
| from google.protobuf.internal import builder as _builder |
| |
| _desc = b'\n\x1aprotoc_explorer/main.proto\x12\x03pkg' |
| |
| DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(_desc) |
| _globals = globals() |
| _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals) |
| _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.protoc_explorer.main_pb2', _globals) |
| ``` |
| |
| The `AddSerializedFile()` API above is mainly just a thin wrapper around |
| `upb_DefPool_AddFile()`. |