# upb Design

upb aims to be a minimal C protobuf kernel. It has a C API, but its primary
goal is to be the core runtime for a higher-level API.

## Design goals

- Full protobuf conformance
- Small code size
- Fast performance (without compromising code size)
- Easy to wrap in language runtimes
- Easy to adapt to different memory management schemes (refcounting, GC, etc)

## Design parameters

- C99
- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
- Uses pointer tagging, but avoids other implementation-defined behavior
- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
- No global state, fully re-entrant


## Overall Structure

The upb library is divided into two main parts:

- A core message representation, which supports binary format parsing
  and serialization.
  - `upb/upb.h`: arena allocator (`upb_arena`)
  - `upb/msg_internal.h`: core message representation and parse tables
  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
  - `upb/decode.h`: binary format parsing
  - `upb/encode.h`: binary format serialization
  - `upb/table_internal.h`: hash table (used for maps)
  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
    accessing messages without reflection.
- A reflection add-on library that supports JSON and text format.
  - `upb/def.h`: schema representation and loading from descriptors
  - `upb/reflection.h`: reflective access to message data.
  - `upb/json_encode.h`: JSON encoding
  - `upb/json_decode.h`: JSON decoding
  - `upb/text_encode.h`: text format encoding
  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
    APIs for loading reflection.

## Core Message Representation

The representation for each message consists of:
- One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
  pointer is `NULL` when no unknown fields or extensions are present.
- Hasbits for any optional/required fields.
- Case integers for each oneof.
- Data for each field.

For example, a layout for a message with two `optional int32` fields would end
up looking something like this:

```c
// For illustration only, upb does not actually generate structs.
typedef struct {
  upb_msg_internaldata* internal;  // Unknown fields and extensions.
  uint32_t hasbits;                // We are only using two hasbits.
  int32_t field1;
  int32_t field2;
} package_name_MessageName;
```

Note in particular that messages do *not* have:
- A pointer to reflection or a parse table (upb messages are not self-describing).
- A pointer to an arena (the arena must be explicitly passed into any function that
  allocates).

The upb compiler computes a layout for each message, and determines the offset for
each field using normal alignment rules (each data member must be aligned to a
multiple of its size). This layout is then embedded into the generated `.upb.h`
and `.upb.c` files in two different forms. First, as inline accessors that expect
the data at a given offset:

```c
// Example of a generated accessor, from foo.upb.h
UPB_INLINE int32_t package_name_MessageName_field1(
    const upb_test_MessageName *msg) {
  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
}
```
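
As a rough illustration of the alignment rule and of offset-based access, the sketch below assigns each field an offset rounded up to a multiple of its size, then reads and writes a field through a raw offset. All names here (`sketch_align_up`, `sketch_layout`, `SKETCH_PTR_AT`, and so on) are hypothetical and not part of upb, and the final rounding of the message size to 8 bytes is an assumption. The real compiler may order and pack fields differently.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an offset-based access macro (not upb's own). */
#define SKETCH_PTR_AT(msg, ofs, type) ((type *)((char *)(msg) + (ofs)))

/* Round `ofs` up to the next multiple of `size` (a power of two). */
static size_t sketch_align_up(size_t ofs, size_t size) {
  assert(size != 0 && (size & (size - 1)) == 0);
  return (ofs + size - 1) & ~(size - 1);
}

/* Assign an offset to each field in order, aligning each to its own size.
 * Returns the total message size, rounded up to 8 bytes (an assumption). */
static size_t sketch_layout(const size_t *field_sizes, size_t n,
                            size_t *offsets_out) {
  size_t ofs = 0;
  size_t i;
  for (i = 0; i < n; i++) {
    ofs = sketch_align_up(ofs, field_sizes[i]);
    offsets_out[i] = ofs;
    ofs += field_sizes[i];
  }
  return sketch_align_up(ofs, 8);
}

/* Allocate zeroed message memory; calloc returns suitably aligned storage. */
static void *sketch_new_msg(size_t size) { return calloc(1, size); }

/* Read an int32 field from message memory at a known offset. */
static int32_t sketch_get_int32(const void *msg, size_t ofs) {
  return *SKETCH_PTR_AT(msg, ofs, const int32_t);
}

/* Write an int32 field into message memory at a known offset. */
static void sketch_set_int32(void *msg, size_t ofs, int32_t val) {
  *SKETCH_PTR_AT(msg, ofs, int32_t) = val;
}
```

With field sizes `{4, 4, 8, 1}` this yields offsets `0, 4, 8, 16`. The accessors carry no knowledge of the message type beyond the offset, which is why a generated accessor can be a one-line inline function.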

Secondly, the layout is emitted as a table which is used by the parser and serializer.
We call these tables "mini-tables" to distinguish them from the larger and more
optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
2-3x the speed of the main parser, though the main parser is already quite fast).

```c
// Definition of mini-table structure, from upb/msg_internal.h
typedef struct {
  uint32_t number;
  uint16_t offset;
  int16_t presence;       /* If >0, hasbit_index.  If <0, ~oneof_index. */
  uint16_t submsg_index;  /* undefined if descriptortype != MESSAGE or GROUP. */
  uint8_t descriptortype;
  int8_t mode;            /* upb_fieldmode, with flags from upb_labelflags */
} upb_msglayout_field;

typedef enum {
  _UPB_MODE_MAP = 0,
  _UPB_MODE_ARRAY = 1,
  _UPB_MODE_SCALAR = 2,
} upb_fieldmode;

typedef struct {
  const struct upb_msglayout *const* submsgs;
  const upb_msglayout_field *fields;
  uint16_t size;
  uint16_t field_count;
  bool extendable;
  uint8_t dense_below;
  uint8_t table_mask;
} upb_msglayout;

// Example of a generated mini-table, from foo.upb.c
static const upb_msglayout_field upb_test_MessageName__fields[2] = {
  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
};

const upb_msglayout upb_test_MessageName_msg_init = {
  NULL,
  &upb_test_MessageName__fields[0],
  UPB_SIZE(16, 16), 2, false, 2, 255,
};
```
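
To make the role of such a table more concrete, here is a minimal sketch of how a table-driven parser might store a decoded `int32` value: it looks up the field by number, writes the value at the recorded offset, and sets a hasbit when `presence` is positive. The types `sketch_field` and `sketch_table` are simplified, hypothetical mirrors of the structures above. The linear search and the assumption that hasbits form a bitmap starting at byte 0 of the message are illustrative choices, not a description of upb's decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical mirror of a mini-table field entry. */
typedef struct {
  uint32_t number;  /* Field number from the .proto file. */
  uint16_t offset;  /* Byte offset of the field's data in the message. */
  int16_t presence; /* If >0, hasbit index. If <0, ~oneof_index. */
} sketch_field;

typedef struct {
  const sketch_field *fields;
  uint16_t size; /* Total message size in bytes. */
  uint16_t field_count;
} sketch_table;

/* Linear search by field number; returns NULL for unknown fields. */
static const sketch_field *sketch_find_field(const sketch_table *t,
                                             uint32_t number) {
  uint16_t i;
  for (i = 0; i < t->field_count; i++) {
    if (t->fields[i].number == number) return &t->fields[i];
  }
  return NULL;
}

/* Assumption: hasbits are a bitmap starting at byte 0 of the message. */
static void sketch_set_hasbit(void *msg, int16_t presence) {
  assert(presence > 0);
  ((unsigned char *)msg)[presence / 8] |= (unsigned char)(1u << (presence % 8));
}

/* Returns 1 if the field's hasbit is set, 0 otherwise. */
static int sketch_has(const void *msg, const sketch_field *f) {
  assert(f->presence > 0);
  return (((const unsigned char *)msg)[f->presence / 8] >> (f->presence % 8)) & 1;
}

/* Store a decoded int32. Returns 0 if the field number is not in the table. */
static int sketch_store_int32(void *msg, const sketch_table *t,
                              uint32_t number, int32_t val) {
  const sketch_field *f = sketch_find_field(t, number);
  if (f == NULL) return 0;
  assert((size_t)f->offset + sizeof(val) <= t->size);
  memcpy((char *)msg + f->offset, &val, sizeof(val));
  if (f->presence > 0) sketch_set_hasbit(msg, f->presence);
  return 1;
}

/* Load an int32 from the offset recorded in the table entry. */
static int32_t sketch_load_int32(const void *msg, const sketch_field *f) {
  int32_t val;
  memcpy(&val, (const char *)msg + f->offset, sizeof(val));
  return val;
}
```

Because the parser consults only the table, the same parsing code can serve every message type, which is consistent with the goal of keeping per-message output limited to data.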

The upb compiler computes separate layouts for 32 and 64 bit modes, since the
pointer size will be 4 or 8 bytes respectively. The upb compiler embeds both
sizes into the source code, using a `UPB_SIZE(size32, size64)` macro that
chooses the appropriate size at build time based on the value of `UINTPTR_MAX`.
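
A macro of this kind could be defined roughly as follows. `SKETCH_SIZE` is a hypothetical name used for illustration, and upb's actual definition may differ. The sketch assumes that `UINTPTR_MAX` reflects the pointer width, which holds on common 32-bit and 64-bit targets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical macro in the spirit of UPB_SIZE(size32, size64):
 * the preprocessor picks one argument based on the pointer width. */
#if UINTPTR_MAX == 0xffffffff
#define SKETCH_SIZE(size32, size64) (size32)
#else
#define SKETCH_SIZE(size32, size64) (size64)
#endif

/* Offset of a field that directly follows one pointer-sized member. */
enum { SKETCH_FIELD_AFTER_PTR = SKETCH_SIZE(4, 8) };

static size_t sketch_offset_after_pointer(void) {
  /* C99 has no static assertion, so this sketch checks at run time. */
  assert(SKETCH_FIELD_AFTER_PTR == sizeof(void *));
  return SKETCH_FIELD_AFTER_PTR;
}
```

Since the selection happens in the preprocessor, the unused size disappears entirely from the compiled output, and a single generated source file can serve both pointer widths.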

Note that `.upb.c` files contain data tables only. There is no "generated code"
except for the inline accessors in the `.upb.h` files: the entire footprint
of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.

## Memory Management Model

All memory management in upb is built around arenas. A message is never
considered to "own" the strings or sub-messages contained within it. Instead, a
message and all of its sub-messages/strings/etc. are owned by an arena and
are freed when the arena is freed. An entire message tree will probably be
owned by a single arena, but this is not required or enforced. As far as upb is
concerned, it is up to the client how to partition its arenas. upb only requires
that when you ask it to serialize a message, all reachable messages are
still alive.

The arena supports both a user-supplied initial block and a custom allocation
callback, so there is a lot of flexibility in memory allocation strategy. The
allocation callback can even be `NULL` for heap-free operation. The main
constraint of the arena is that all of the memory in each arena must be freed
together.
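
The following is a minimal sketch of an arena with these two properties, written under stated assumptions and not taken from upb. The names (`sketch_arena`, `sketch_alloc_fn`, and so on), the realloc-style callback signature, the 8-byte allocation granularity, and the 4096-byte minimum block size are all hypothetical. The caller is assumed to pass an initial block that is at least 8-byte aligned.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical realloc-style callback: size == 0 frees `ptr`. */
typedef void *sketch_alloc_fn(void *ptr, size_t size);

typedef struct sketch_block {
  struct sketch_block *next;
} sketch_block;

typedef struct {
  char *ptr;              /* Next free byte in the current block. */
  size_t avail;           /* Bytes remaining in the current block. */
  sketch_block *blocks;   /* Blocks obtained from the callback. */
  sketch_alloc_fn *alloc; /* May be NULL for heap-free operation. */
} sketch_arena;

enum { SKETCH_ALIGN = 8, SKETCH_MIN_BLOCK = 4096 };

static size_t sketch_round_up(size_t n) {
  return (n + SKETCH_ALIGN - 1) & ~(size_t)(SKETCH_ALIGN - 1);
}

static void sketch_arena_init(sketch_arena *a, void *initial, size_t n,
                              sketch_alloc_fn *alloc) {
  a->ptr = initial;
  a->avail = initial ? n : 0;
  a->blocks = NULL;
  a->alloc = alloc;
}

/* Bump-allocate; returns NULL if space runs out and no callback is set. */
static void *sketch_arena_malloc(sketch_arena *a, size_t size) {
  void *ret;
  if (size > SIZE_MAX / 2) return NULL; /* Guard against overflow. */
  size = sketch_round_up(size);
  if (a->avail < size) {
    size_t header = sketch_round_up(sizeof(sketch_block));
    size_t payload = size > SKETCH_MIN_BLOCK ? size : SKETCH_MIN_BLOCK;
    sketch_block *b;
    if (a->alloc == NULL) return NULL;
    b = a->alloc(NULL, header + payload);
    if (b == NULL) return NULL;
    b->next = a->blocks;
    a->blocks = b;
    a->ptr = (char *)b + header;
    a->avail = payload;
  }
  ret = a->ptr;
  a->ptr += size;
  a->avail -= size;
  return ret;
}

/* Release every callback-provided block at once; there is no per-object free. */
static void sketch_arena_free(sketch_arena *a) {
  sketch_block *b = a->blocks;
  while (b != NULL) {
    sketch_block *next = b->next;
    a->alloc(b, 0);
    b = next;
  }
  a->blocks = NULL;
  a->ptr = NULL;
  a->avail = 0;
}

/* Example callback backed by the C heap. */
static void *sketch_heap_alloc(void *ptr, size_t size) {
  if (size == 0) {
    free(ptr);
    return NULL;
  }
  return realloc(ptr, size);
}
```

Passing a stack buffer as the initial block with a `NULL` callback gives an arena that never touches the heap and simply fails once the buffer is exhausted.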

`upb_arena` supports a novel operation called "fuse". When two arenas are fused
together, their lifetimes are irreversibly joined, such that none of the arena
blocks in either arena will be freed until *both* arenas are freed with
`upb_arena_free()`. This is useful when joining two messages from separate
arenas (making one a sub-message of the other). Fuse is a very cheap
operation, and an unlimited number of arenas can be fused together efficiently.
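
This document does not say how fuse is implemented. One way to obtain these semantics, shown here purely as an assumption for illustration, is a union-find structure in which the root of each fused group holds a shared reference count and the combined block list. All names (`fuse_arena`, `fuse_arenas`, and so on) are hypothetical, and the sketch assumes the arena handles themselves outlive the whole group.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct fuse_block {
  struct fuse_block *next;
  union { long long ll; double d; void *p; } data[]; /* Aligned payload. */
} fuse_block;

typedef struct fuse_arena {
  struct fuse_arena *parent; /* Union-find parent; a root points to itself. */
  size_t refcount;           /* At a root: unfreed arenas in the group. */
  fuse_block *head, *tail;   /* At a root: blocks owned by the group. */
} fuse_arena;

static void fuse_arena_init(fuse_arena *a) {
  a->parent = a;
  a->refcount = 1;
  a->head = a->tail = NULL;
}

/* Find the group's root, halving the path along the way. */
static fuse_arena *fuse_find_root(fuse_arena *a) {
  while (a->parent != a) {
    a->parent = a->parent->parent;
    a = a->parent;
  }
  return a;
}

/* One block per allocation keeps the sketch short. */
static void *fuse_arena_malloc(fuse_arena *a, size_t size) {
  fuse_arena *root = fuse_find_root(a);
  fuse_block *b = malloc(offsetof(fuse_block, data) + size);
  if (b == NULL) return NULL;
  b->next = root->head;
  if (root->head == NULL) root->tail = b;
  root->head = b;
  return b->data;
}

/* Join two groups: merge refcounts and splice block lists in O(1). */
static void fuse_arenas(fuse_arena *a, fuse_arena *b) {
  fuse_arena *ra = fuse_find_root(a);
  fuse_arena *rb = fuse_find_root(b);
  if (ra == rb) return;
  rb->parent = ra;
  ra->refcount += rb->refcount;
  if (rb->head != NULL) {
    rb->tail->next = ra->head;
    if (ra->head == NULL) ra->tail = rb->tail;
    ra->head = rb->head;
    rb->head = rb->tail = NULL;
  }
}

/* Drop one reference. Blocks are released only when the last arena in the
 * group is freed. Returns the number of blocks released. */
static size_t fuse_arena_free(fuse_arena *a) {
  fuse_arena *root = fuse_find_root(a);
  size_t freed = 0;
  assert(root->refcount > 0);
  if (--root->refcount > 0) return 0;
  while (root->head != NULL) {
    fuse_block *next = root->head->next;
    free(root->head);
    root->head = next;
    freed++;
  }
  root->tail = NULL;
  return freed;
}
```

In this sketch, freeing one of two fused arenas releases nothing. Memory allocated from either arena stays valid until the second is freed, which is the property that makes it safe to link a message from one arena into a message from another. A production version would likely add union by rank or size to bound the depth of the parent chain.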

## Reflection and Descriptors

upb offers a fully-featured reflection library. There are two main ways of
using reflection:

1. You can load descriptors from strings using `upb_symtab_addfile()`.
   The upb runtime will dynamically create mini-tables like what the upb compiler
   would have created if you had compiled this type into a `.upb.c` file.
2. You can load descriptors using generated `.upbdefs.h` interfaces.
   This will load reflection that references the corresponding `.upb.c`
   mini-tables instead of building a new mini-table on the fly. This lets
   you reflect on generated types that are linked into your program.

upb's design for descriptors is similar to protobuf C++ in many ways, with
the following correspondences:

| C++ Type | upb type |
| ---------| ---------|
| `google::protobuf::DescriptorPool` | `upb_symtab` |
| `google::protobuf::Descriptor` | `upb_msgdef` |
| `google::protobuf::FieldDescriptor` | `upb_fielddef` |
| `google::protobuf::OneofDescriptor` | `upb_oneofdef` |
| `google::protobuf::EnumDescriptor` | `upb_enumdef` |
| `google::protobuf::FileDescriptor` | `upb_filedef` |
| `google::protobuf::ServiceDescriptor` | `upb_servicedef` |
| `google::protobuf::MethodDescriptor` | `upb_methoddef` |

As in C++, descriptors (defs) are created by loading a
`google_protobuf_FileDescriptorProto` into a `upb_symtab`. This creates and
links all of the def objects corresponding to that `.proto` file, and inserts
the names into a symbol table so they can be looked up by name.

Once you have loaded some descriptors into a `upb_symtab`, you can create and
manipulate messages using the interfaces defined in `upb/reflection.h`. If your
descriptors are linked to your generated layouts using option (2) above, you can
safely access the same messages using both reflection and generated interfaces.