Joshua Haberman | a52fb79 | 2021-09-15 22:00:32 -0700
# upb Design

upb aims to be a minimal C protobuf kernel. It has a C API, but its primary
goal is to be the core runtime for a higher-level API.

## Design goals

- Full protobuf conformance
- Small code size
- Fast performance (without compromising code size)
- Easy to wrap in language runtimes
- Easy to adapt to different memory management schemes (refcounting, GC, etc)

## Design parameters

- C99
- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
- Uses pointer tagging, but avoids other implementation-defined behavior
- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
- No global state, fully re-entrant

## Overall Structure

The upb library is divided into two main parts:

- A core message representation, which supports binary format parsing
  and serialization.
  - `upb/upb.h`: arena allocator (`upb_arena`)
  - `upb/msg_internal.h`: core message representation and parse tables
  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
  - `upb/decode.h`: binary format parsing
  - `upb/encode.h`: binary format serialization
  - `upb/table_internal.h`: hash table (used for maps)
  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
    accessing messages without reflection.
- A reflection add-on library that supports JSON and text format.
  - `upb/def.h`: schema representation and loading from descriptors
  - `upb/reflection.h`: reflective access to message data.
  - `upb/json_encode.h`: JSON encoding
  - `upb/json_decode.h`: JSON decoding
  - `upb/text_encode.h`: text format encoding
  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
    APIs for loading reflection.

## Core Message Representation

The representation for each message consists of:
- One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
  pointer is `NULL` when no unknown fields or extensions are present.
- Hasbits for any optional/required fields.
- Case integers for each oneof.
- Data for each field.

For example, a layout for a message with two `optional int32` fields would end
up looking something like this:

```c
// For illustration only, upb does not actually generate structs.
typedef struct {
  upb_msg_internaldata* internal;  // Unknown fields and extensions.
  uint32_t hasbits;                // We are only using two hasbits.
  int32_t field1;
  int32_t field2;
} package_name_MessageName;
```

Note in particular that messages do *not* have:
- A pointer to reflection or a parse table (upb messages are not self-describing).
- A pointer to an arena (the arena must be explicitly passed into any function that
  allocates).

The upb compiler computes a layout for each message, and determines the offset for
each field using normal alignment rules (each data member must be aligned to a
multiple of its size). This layout is then embedded into the generated `.upb.h`
and `.upb.c` files in two different forms. First, as inline accessors that expect
the data at a given offset:

```c
// Example of a generated accessor, from foo.upb.h
UPB_INLINE int32_t package_name_MessageName_field1(
    const package_name_MessageName *msg) {
  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
}
```
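
The accessor above is pointer arithmetic on an untyped block of message memory. The following is a minimal, self-contained sketch of that idea. Every name in it (`SKETCH_PTR_AT`, `sketch_msg`, the offsets, and the helper functions) is a hypothetical stand-in chosen for illustration, not upb's actual macro or layout.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for a PTR_AT-style macro: a typed pointer located
 * `offset` bytes past the start of `msg`. */
#define SKETCH_PTR_AT(msg, offset, type) ((type *)((char *)(msg) + (offset)))

/* Messages are opaque: callers never see a struct definition. */
typedef void sketch_msg;

/* Assumed layout for this sketch only: bytes 0-3 are reserved for hasbits,
 * field1 lives at offset 4 and field2 at offset 8. */
enum {
  SKETCH_FIELD1_OFFSET = 4,
  SKETCH_FIELD2_OFFSET = 8,
  SKETCH_MSG_SIZE = 12
};

/* Zero-initialized message memory; returns NULL if allocation fails. */
static sketch_msg *sketch_msg_new(void) {
  return calloc(1, SKETCH_MSG_SIZE);
}

/* Reads field1 directly from its precomputed offset. */
static int32_t sketch_field1(const sketch_msg *msg) {
  return *SKETCH_PTR_AT(msg, SKETCH_FIELD1_OFFSET, const int32_t);
}

/* Writes field1 at the same offset. */
static void sketch_set_field1(sketch_msg *msg, int32_t value) {
  assert(msg != NULL);
  *SKETCH_PTR_AT(msg, SKETCH_FIELD1_OFFSET, int32_t) = value;
}
```

Because the offset is a compile-time constant, an accessor of this shape can compile down to a single load or store, which is in line with the code-size goal stated earlier.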

Secondly, the layout is emitted as a table which is used by the parser and serializer.
We call these tables "mini-tables" to distinguish them from the larger and more
optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
2-3x the speed of the main parser, though the main parser is already quite fast).

```c
// Definition of mini-table structure, from upb/msg_internal.h
typedef struct {
  uint32_t number;
  uint16_t offset;
  int16_t presence;      /* If >0, hasbit_index. If <0, ~oneof_index. */
  uint16_t submsg_index; /* undefined if descriptortype != MESSAGE or GROUP. */
  uint8_t descriptortype;
  int8_t mode;           /* upb_fieldmode, with flags from upb_labelflags */
} upb_msglayout_field;

typedef enum {
  _UPB_MODE_MAP = 0,
  _UPB_MODE_ARRAY = 1,
  _UPB_MODE_SCALAR = 2,
} upb_fieldmode;

typedef struct upb_msglayout {
  const struct upb_msglayout *const* submsgs;
  const upb_msglayout_field *fields;
  uint16_t size;
  uint16_t field_count;
  bool extendable;
  uint8_t dense_below;
  uint8_t table_mask;
} upb_msglayout;

// Example of a generated mini-table, from foo.upb.c
static const upb_msglayout_field upb_test_MessageName__fields[2] = {
  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
};

const upb_msglayout upb_test_MessageName_msg_init = {
  NULL,
  &upb_test_MessageName__fields[0],
  UPB_SIZE(16, 16), 2, false, 2, 255,
};
```
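
To make the table-driven approach concrete, here is a small sketch of how generic code could use such a table to find a field by number, test its hasbit, and read or write its value. The types and functions (`sketch_field`, `sketch_layout`, `sketch_find_field`, and so on) are hypothetical simplifications. In particular, the sketch assumes that hasbit N is bit N counted from the first byte of the message, and it uses a linear scan where a real parser would likely do something faster.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, trimmed-down version of a mini-table field entry. */
typedef struct {
  uint32_t number;
  uint16_t offset;
  int16_t presence; /* If >0, hasbit index. If <0, ~oneof_index. */
} sketch_field;

typedef struct {
  const sketch_field *fields;
  uint16_t field_count;
} sketch_layout;

/* Same offsets and hasbit indexes as the generated example above. */
static const sketch_field sketch_fields[2] = {
    {1, 4, 1},
    {2, 8, 2},
};
static const sketch_layout sketch_msg_layout = {sketch_fields, 2};

/* Linear scan by field number; returns NULL for fields not in the table. */
static const sketch_field *sketch_find_field(const sketch_layout *layout,
                                             uint32_t number) {
  uint16_t i;
  for (i = 0; i < layout->field_count; i++) {
    if (layout->fields[i].number == number) return &layout->fields[i];
  }
  return NULL;
}

/* Assumption: hasbit N is bit N counted from the start of the message. */
static bool sketch_has(const void *msg, const sketch_field *f) {
  const unsigned char *bytes = msg;
  size_t bit;
  assert(f->presence > 0); /* Only hasbit-tracked fields are handled. */
  bit = (size_t)f->presence;
  return (bytes[bit / 8] & (1u << (bit % 8))) != 0;
}

/* Stores the value at the field's offset and marks the field as present. */
static void sketch_set_int32(void *msg, const sketch_field *f, int32_t value) {
  unsigned char *bytes = msg;
  size_t bit;
  assert(f->presence > 0);
  bit = (size_t)f->presence;
  bytes[bit / 8] |= (unsigned char)(1u << (bit % 8));
  memcpy(bytes + f->offset, &value, sizeof(value));
}

/* Loads the value from the field's offset. */
static int32_t sketch_get_int32(const void *msg, const sketch_field *f) {
  int32_t value;
  memcpy(&value, (const unsigned char *)msg + f->offset, sizeof(value));
  return value;
}
```

None of these functions knows the message type at compile time. Everything they need comes from the table, which is what allows one parser and one serializer to be shared by all message types.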

The upb compiler computes separate layouts for 32-bit and 64-bit modes, since the
pointer size will be 4 or 8 bytes respectively. It embeds both sizes into the
source code, using a `UPB_SIZE(size32, size64)` macro that chooses the
appropriate size at build time based on the value of `UINTPTR_MAX`.
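
A macro with this behavior can be written in a few lines of preprocessor logic. The sketch below uses the hypothetical name `SKETCH_SIZE` and is a plausible reconstruction for illustration, not a copy of upb's definition. It assumes, as the design parameters do, that pointers are either 4 or 8 bytes wide.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical UPB_SIZE-style macro: expands to the first argument on
 * targets with 32-bit pointers and to the second argument otherwise. */
#if UINTPTR_MAX == 0xffffffff
#define SKETCH_SIZE(size32, size64) (size32)
#else
#define SKETCH_SIZE(size32, size64) (size64)
#endif

/* Offset of a member that directly follows one pointer-sized member. */
enum { SKETCH_AFTER_POINTER = SKETCH_SIZE(4, 8) };

/* Returns nonzero when the selected offset matches the real pointer size. */
static int sketch_offset_matches_pointer_size(void) {
  return SKETCH_AFTER_POINTER == (int)sizeof(void *);
}
```

Since the choice is made by the preprocessor, the same generated source compiles for both pointer sizes with no runtime cost.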

Note that `.upb.c` files contain data tables only. There is no "generated code"
except for the inline accessors in the `.upb.h` files: the entire footprint
of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.

## Memory Management Model

All memory management in upb is built around arenas. A message is never
considered to "own" the strings or sub-messages contained within it. Instead a
message and all of its sub-messages/strings/etc. are all owned by an arena and
are freed when the arena is freed. An entire message tree will probably be
owned by a single arena, but this is not required or enforced. As far as upb is
concerned, it is up to the client how to partition its arenas. upb only requires
that all reachable messages are still alive when you ask it to serialize a
message.

The arena supports both a user-supplied initial block and a custom allocation
callback, so there is a lot of flexibility in memory allocation strategy. The
allocation callback can even be `NULL` for heap-free operation. The main
constraint of the arena is that all of the memory in each arena must be freed
together.
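
The sketch below illustrates these properties with a toy bump allocator: it starts from a caller-supplied block, falls back to an optional callback when that block runs out, and releases all heap blocks at once. All names (`sketch_arena`, `sketch_alloc_func`, `sketch_heap_alloc`) and the callback convention are assumptions made for this example and do not reflect the actual `upb_arena` API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical callback convention: ptr == NULL allocates `size` bytes,
 * size == 0 frees `ptr`. */
typedef void *sketch_alloc_func(void *ptr, size_t size);

/* Header of a heap block; the union keeps the payload 8-byte aligned. */
typedef union sketch_block {
  union sketch_block *next;
  uint64_t align;
} sketch_block;

typedef struct {
  char *ptr;                /* Next free byte in the current block. */
  size_t avail;             /* Bytes left in the current block. */
  sketch_block *blocks;     /* Blocks obtained from `alloc`, newest first. */
  sketch_alloc_func *alloc; /* NULL means: never touch the heap. */
} sketch_arena;

/* A callback backed by the C heap. */
static void *sketch_heap_alloc(void *ptr, size_t size) {
  if (size == 0) {
    free(ptr);
    return NULL;
  }
  return realloc(ptr, size);
}

/* `initial` must be 8-byte aligned; it may be NULL when `size` is 0. */
static void sketch_arena_init(sketch_arena *a, void *initial, size_t size,
                              sketch_alloc_func *alloc) {
  a->ptr = initial;
  a->avail = size;
  a->blocks = NULL;
  a->alloc = alloc;
}

static void *sketch_arena_malloc(sketch_arena *a, size_t size) {
  void *ret;
  assert(size <= SIZE_MAX / 2);   /* Keeps the arithmetic below in range. */
  size = (size + 7) & ~(size_t)7; /* Round up so results stay aligned. */
  if (a->avail < size) {
    size_t payload = size > 256 ? size : 256;
    sketch_block *block;
    if (a->alloc == NULL) return NULL; /* Heap-free mode: out of space. */
    block = a->alloc(NULL, sizeof(sketch_block) + payload);
    if (block == NULL) return NULL;
    block->next = a->blocks;
    a->blocks = block;
    a->ptr = (char *)(block + 1);
    a->avail = payload;
  }
  ret = a->ptr;
  a->ptr += size;
  a->avail -= size;
  return ret;
}

/* Releases every heap block at once; there is no per-allocation free. */
static void sketch_arena_free(sketch_arena *a) {
  while (a->blocks != NULL) {
    sketch_block *next = a->blocks->next;
    a->alloc(a->blocks, 0);
    a->blocks = next;
  }
}
```

With a `NULL` callback the arena simply reports failure once the initial block is exhausted, which is the heap-free mode described above. The caller-supplied block is never freed by the arena, since the arena does not own it.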

`upb_arena` supports a novel operation called "fuse". When two arenas are fused
together, their lifetimes are irreversibly joined, such that none of the arena
blocks in either arena will be freed until *both* arenas are freed with
`upb_arena_free()`. This is useful when joining two messages from separate
arenas (making one a sub-message of the other). Fuse is a very cheap
operation, and an unlimited number of arenas can be fused together efficiently.
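
One way to obtain these semantics is a union-find structure in which the root of each group carries a reference count and the list of all blocks owned by the group. The sketch below takes that approach. It is an illustration under that assumption, with hypothetical names throughout (`sketch_arena`, `sketch_arena_fuse`, and so on). This document does not say how `upb_arena` implements fuse, so the sketch should not be read as a description of it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Every allocation is a block on a singly linked list. The payload after
 * the header is only pointer-aligned in this sketch. */
typedef struct sketch_block {
  struct sketch_block *next;
} sketch_block;

typedef struct sketch_arena {
  sketch_block self;           /* The arena header is itself the first block. */
  struct sketch_arena *parent; /* Union-find link; a root points to itself. */
  size_t refcount;             /* Roots only: arenas in the group not yet freed. */
  sketch_block *head, *tail;   /* Roots only: all blocks owned by the group. */
} sketch_arena;

static sketch_arena *sketch_find_root(sketch_arena *a) {
  while (a->parent != a) {
    a->parent = a->parent->parent; /* Path halving shortens later lookups. */
    a = a->parent;
  }
  return a;
}

static sketch_arena *sketch_arena_new(void) {
  sketch_arena *a = malloc(sizeof(*a));
  if (a == NULL) return NULL;
  a->self.next = NULL;
  a->parent = a;
  a->refcount = 1;
  a->head = a->tail = &a->self;
  return a;
}

/* One heap block per allocation keeps the sketch short (no bump pointer). */
static void *sketch_arena_malloc(sketch_arena *a, size_t size) {
  sketch_arena *root = sketch_find_root(a);
  sketch_block *block = malloc(sizeof(sketch_block) + size);
  if (block == NULL) return NULL;
  block->next = NULL;
  root->tail->next = block;
  root->tail = block;
  return block + 1;
}

/* Joins the two groups: one root adopts the other's count and block list. */
static void sketch_arena_fuse(sketch_arena *a, sketch_arena *b) {
  sketch_arena *ra = sketch_find_root(a);
  sketch_arena *rb = sketch_find_root(b);
  if (ra == rb) return; /* Already in the same group. */
  rb->parent = ra;
  ra->refcount += rb->refcount;
  ra->tail->next = rb->head;
  ra->tail = rb->tail;
}

/* Returns true only when this call released the group's memory. Each arena
 * must be freed exactly once. */
static bool sketch_arena_free(sketch_arena *a) {
  sketch_arena *root = sketch_find_root(a);
  sketch_block *block;
  if (--root->refcount > 0) return false;
  block = root->head;
  while (block != NULL) {
    sketch_block *next = block->next;
    free(block); /* Also frees arena headers: each one is a block. */
    block = next;
  }
  return true;
}
```

In this sketch, fusing costs two root lookups and a constant-time list splice, and memory allocated from either arena stays valid until the last arena in the group has been freed.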

## Reflection and Descriptors

upb offers a fully-featured reflection library. There are two main ways of
using reflection:

1. You can load descriptors from strings using `upb_symtab_addfile()`.
   The upb runtime will dynamically create mini-tables like what the upb compiler
   would have created if you had compiled this type into a `.upb.c` file.
2. You can load descriptors using generated `.upbdefs.h` interfaces.
   This will load reflection that references the corresponding `.upb.c`
   mini-tables instead of building a new mini-table on the fly. This lets
   you reflect on generated types that are linked into your program.

upb's design for descriptors is similar to protobuf C++ in many ways, with
the following correspondences:

| C++ Type | upb type |
| ---------| ---------|
| `google::protobuf::DescriptorPool` | `upb_symtab` |
| `google::protobuf::Descriptor` | `upb_msgdef` |
| `google::protobuf::FieldDescriptor` | `upb_fielddef` |
| `google::protobuf::OneofDescriptor` | `upb_oneofdef` |
| `google::protobuf::EnumDescriptor` | `upb_enumdef` |
| `google::protobuf::FileDescriptor` | `upb_filedef` |
| `google::protobuf::ServiceDescriptor` | `upb_servicedef` |
| `google::protobuf::MethodDescriptor` | `upb_methoddef` |

As in C++, descriptors (defs) are created by loading a
`google_protobuf_FileDescriptorProto` into a `upb_symtab`. This creates and
links all of the def objects corresponding to that `.proto` file, and inserts
the names into a symbol table so they can be looked up by name.

Once you have loaded some descriptors into a `upb_symtab`, you can create and
manipulate messages using the interfaces defined in `upb/reflection.h`. If your
descriptors are linked to your generated layouts using option (2) above, you can
safely access the same messages using both reflection and generated interfaces.