
 # upb Design

 [TOC]

 upb is a protobuf kernel written in C. It is a fast and conformant
 implementation of protobuf, with a low-level C API that is designed to be
 wrapped in other languages.

 upb is not designed to be used by applications directly. The C API is very
 low-level, unsafe, and changes frequently. It is important that upb is able to
 make breaking API changes as necessary, to avoid taking on technical debt that
 would compromise upb's goals of small code size and fast performance.

 ## Design goals

 Goals:

 -   Full protobuf conformance
 -   Small code size
 -   Fast performance (without compromising code size)
 -   Easy to wrap in language runtimes
 -   Easy to adapt to different memory management schemes (refcounting, GC, etc)

 Non-Goals:

 -   Stable API
 -   Safe API
 -   Ergonomic API for applications

 Parameters:

 -   C99
 -   32 or 64-bit CPU (assumes 4 or 8 byte pointers)
 -   Uses pointer tagging, but avoids other implementation-defined behavior
 -   Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
 -   No global state, fully re-entrant
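
The pointer-tagging parameter above can be illustrated with a small sketch. The helper names below are hypothetical and are not part of upb's API; the sketch only shows the general technique of storing a one-bit flag in the low bit of an aligned pointer. Converting between pointers and integers is implementation-defined in C, which is why the list calls this out explicitly.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

// Hypothetical helpers for illustration; not upb functions.

// Packs `flag` into the low bit of `ptr`. The pointed-to object must be at
// least 2-byte aligned so that the low bit of the address is always zero.
static uintptr_t toy_tag_pointer(void* ptr, bool flag) {
  uintptr_t bits = (uintptr_t)ptr;
  assert((bits & 1) == 0);  // Alignment guarantees the low bit is free.
  return bits | (flag ? (uintptr_t)1 : (uintptr_t)0);
}

// Recovers the original pointer by clearing the tag bit.
static void* toy_untag_pointer(uintptr_t tagged) {
  return (void*)(tagged & ~(uintptr_t)1);
}

// Reads the flag back out of the tagged value.
static bool toy_pointer_flag(uintptr_t tagged) { return (tagged & 1) != 0; }
```

A caller would tag a pointer once, store the `uintptr_t`, and always untag it before dereferencing.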

 ## Arenas

 All memory management in upb uses arenas, using the type `upb_Arena`. Arenas are
 an alternative to `malloc()` and `free()` that significantly reduces the costs
 of memory allocation.

 Arenas obtain blocks of memory using some underlying allocator (likely
 `malloc()` and `free()`), and satisfy allocations using a simple bump allocator
 that walks through each block in linear order. Allocations cannot be freed
 individually: it is only possible to free the arena as a whole, which frees all
 of the underlying blocks.

 Here is an example of using the `upb_Arena` type:

 ```c
   upb_Arena* arena = upb_Arena_New();

   // Perform some allocations.
   int* x = upb_Arena_Malloc(arena, sizeof(*x));
   int* y = upb_Arena_Malloc(arena, sizeof(*y));

   // We cannot free `x` and `y` separately; we can only free the arena
   // as a whole.
   upb_Arena_Free(arena);
 ```
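
To make the bump-allocator idea concrete, here is a minimal sketch of a toy arena. It is not upb's implementation: the `ToyArena` and `ToyBlock` types, the 256-byte first block, the doubling growth policy, and the 8-byte alignment are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

// Toy types for illustration only; upb's real arena is more involved.
typedef struct ToyBlock {
  struct ToyBlock* next;  // Previously allocated block, or NULL.
  size_t size;            // Usable bytes that follow this header.
} ToyBlock;

typedef struct {
  ToyBlock* blocks;  // Newest block first.
  char* ptr;         // Next free byte in the newest block.
  size_t avail;      // Bytes still free in the newest block.
} ToyArena;

static void ToyArena_Init(ToyArena* a) {
  a->blocks = NULL;
  a->ptr = NULL;
  a->avail = 0;
}

// Obtains a fresh block from malloc() that can hold at least `min` bytes.
static int ToyArena_AddBlock(ToyArena* a, size_t min) {
  size_t size = a->blocks ? a->blocks->size * 2 : 256;
  if (size < min) size = min;
  ToyBlock* b = malloc(sizeof(ToyBlock) + size);
  if (!b) return 0;
  b->next = a->blocks;
  b->size = size;
  a->blocks = b;
  a->ptr = (char*)(b + 1);
  a->avail = size;
  return 1;
}

// Bump allocation: round the size up to 8 bytes, then advance `ptr`.
static void* ToyArena_Malloc(ToyArena* a, size_t n) {
  if (n == 0) n = 1;
  n = (n + 7) & ~(size_t)7;
  if (a->avail < n && !ToyArena_AddBlock(a, n)) return NULL;
  void* ret = a->ptr;
  a->ptr += n;
  a->avail -= n;
  return ret;
}

// Frees every block at once; allocations are never freed individually.
// Returns the number of blocks that were released.
static size_t ToyArena_Free(ToyArena* a) {
  size_t count = 0;
  ToyBlock* b = a->blocks;
  while (b) {
    ToyBlock* next = b->next;
    free(b);
    b = next;
    count++;
  }
  ToyArena_Init(a);
  return count;
}
```

The fast path of `ToyArena_Malloc()` is a comparison and two additions, which is where the allocation speed of a bump allocator comes from.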

 upb uses arenas for all memory management, and this fact is reflected in the API
 for all upb data structures. All upb functions that allocate take a `upb_Arena*`
 parameter and perform allocations using that arena rather than calling
 `malloc()` or `free()`.

 ```c
 // upb API to create a message.
 UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
                                      upb_Arena* arena);

 void MakeMessage(const upb_MiniTable* mini_table) {
   upb_Arena* arena = upb_Arena_New();

   // This message is allocated on our arena.
   upb_Message* msg = upb_Message_New(mini_table, arena);

   // We can free the arena whenever we want, but we cannot free the
   // message separately from the arena.
   upb_Arena_Free(arena);

   // msg is now deleted.
 }
 ```

 Arenas are a key part of upb's performance story. Parsing a large protobuf
 payload usually involves rapidly creating a series of messages, arrays (repeated
 fields), and maps. It is crucial for parsing performance that these allocations
 are as fast as possible. Equally important, freeing the tree of messages should
 be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg
 n)`.
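
One way to see where a logarithmic bound could come from: if each new block is twice the size of the previous one, the number of blocks grows with the logarithm of the total bytes allocated, and freeing the arena costs one `free()` per block rather than one per object. The sketch below only does that arithmetic. The doubling policy and the function name are assumptions of this sketch, not a statement about upb's actual growth strategy.

```c
#include <assert.h>
#include <stddef.h>

// Counts how many blocks a toy arena needs to hold `total` bytes when the
// first block has `first` bytes and every later block is twice as large as
// the one before it.
static size_t ToyBlocksNeeded(size_t total, size_t first) {
  assert(first > 0);
  size_t blocks = 0;
  size_t capacity = 0;
  size_t next = first;
  while (capacity < total) {
    capacity += next;
    next *= 2;
    blocks++;
  }
  return blocks;
}
```

Under these assumptions, `ToyBlocksNeeded(1000000, 256)` is 12, so a megabyte of small objects would be released with a dozen calls to the underlying allocator.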

 ### Avoiding Dangling Pointers

 Objects allocated on an arena will frequently contain pointers to other
 arena-allocated objects. For example, a `upb_Message` will have pointers to
 sub-messages that are also arena-allocated.

 Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
 automatic safety from dangling pointers. Instead, upb provides tools to help
 bridge between higher-level memory management schemes (GC, refcounting, RAII,
 borrow checkers) and arenas.

 If there is only one arena, dangling pointers within the arena are impossible,
 because all objects are freed at the same time. This is the simplest case. The
 user must still be careful not to keep dangling pointers that point at arena
 memory after it has been freed, but dangling pointers *between* the arena
 objects will be impossible.

 But what if there are multiple arenas? If we have a pointer from one arena to
 another, how do we ensure that this will not become a dangling pointer?

 To help with the multiple arena case, upb provides a primitive called "fuse".

 ```c
 // Fuses the lifetimes of `a` and `b`.  None of the blocks from `a` or `b`
 // will be freed until both arenas are freed.
 UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
 ```

 When two arenas are fused together, their lifetimes are irreversibly joined,
 such that none of the arena blocks in either arena will be freed until *both*
 arenas are freed with `upb_Arena_Free()`. This means that dangling pointers
 between the two arenas will no longer be possible.

 Fuse is useful when joining two messages from separate arenas (making one a
 sub-message of the other). Fuse is a relatively cheap operation, on the order of
 150ns, and is very nearly `O(1)` in the number of arenas being fused (the true
 complexity is the inverse Ackermann function, which grows extremely slowly).

 Each arena does consume some memory, so repeatedly creating and fusing an
 additional arena is not free, but the CPU cost of fusing two arenas together is
 modest.
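
The inverse Ackermann bound mentioned above is characteristic of a disjoint-set (union-find) structure. The following is a minimal single-threaded sketch of that technique applied to fused lifetimes. The `ToyFusedArena` type and its functions are hypothetical; this is not upb's implementation and it omits the actual block storage.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

// Toy model of fused arena lifetimes; not upb's real data structure.
typedef struct ToyFusedArena {
  struct ToyFusedArena* parent;  // Points to itself when this is a root.
  size_t members;                // Root only: arenas in this fused group.
  size_t live;                   // Root only: members not yet freed.
  bool released;                 // Root only: true once the group is freed.
} ToyFusedArena;

static void ToyFusedArena_Init(ToyFusedArena* a) {
  a->parent = a;
  a->members = 1;
  a->live = 1;
  a->released = false;
}

// Finds the root of the group, shortening the path as it goes (path halving).
static ToyFusedArena* ToyFusedArena_Root(ToyFusedArena* a) {
  while (a->parent != a) {
    a->parent = a->parent->parent;
    a = a->parent;
  }
  return a;
}

// Joins the groups of `a` and `b`, attaching the smaller group to the larger.
static void ToyFusedArena_Fuse(ToyFusedArena* a, ToyFusedArena* b) {
  ToyFusedArena* ra = ToyFusedArena_Root(a);
  ToyFusedArena* rb = ToyFusedArena_Root(b);
  if (ra == rb) return;  // Already fused.
  if (ra->members < rb->members) {
    ToyFusedArena* tmp = ra;
    ra = rb;
    rb = tmp;
  }
  rb->parent = ra;
  ra->members += rb->members;
  ra->live += rb->live;
}

// Drops one member's claim on the group. Returns true only when this call
// was the last one, which is when a real arena would free all of its blocks.
static bool ToyFusedArena_Free(ToyFusedArena* a) {
  ToyFusedArena* root = ToyFusedArena_Root(a);
  if (--root->live > 0) return false;
  root->released = true;
  return true;
}
```

Union by size combined with path halving is what yields the near-constant amortized cost per operation. In this toy model, freeing two of three fused arenas releases nothing; only the third call to `ToyFusedArena_Free()` releases the group.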

 ### Initial Block and Custom Allocators

 `upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
 underlying blocks. But this default strategy can be customized to support the
 needs of a particular language.

 The lowest-level function for creating a `upb_Arena` is:

 ```c
 // Creates an arena from the given initial block (if any -- n may be 0).
 // Additional blocks will be allocated from |alloc|.  If |alloc| is NULL,
 // this is a fixed-size arena and cannot grow.
 UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
 ```

 The buffer `[mem, n]` will be used as an "initial block", which is used to
 satisfy allocations before calling any underlying allocation function. Note that
 the `upb_Arena` itself will be allocated from the initial block if possible, so
 the amount of memory available for allocation from the arena will be less than
 `n`.

 The `alloc` parameter specifies a custom memory allocation function which will
 be used once the initial block is exhausted. The user can pass `NULL` as the
 allocation function, in which case the initial block is the only memory
 available in the arena. This can allow upb to be used even in situations where
 there is no heap.

 It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
 operations like `upb_Message_New()` should be checked for failure if there is
 any possibility that a fixed-size arena is in use.
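
The failure mode described above can be sketched with a toy fixed-size arena that has no fallback allocator. The `ToyFixedArena` type is hypothetical, and unlike `upb_Arena_Init()` it keeps its bookkeeping outside the caller's buffer; the point is only that allocation returns `NULL` once the buffer is exhausted, so callers must check.

```c
#include <assert.h>
#include <stddef.h>

// Toy fixed-size arena over a caller-provided buffer; it can never grow.
typedef struct {
  char* ptr;     // Next free byte in the buffer.
  size_t avail;  // Bytes remaining in the buffer.
} ToyFixedArena;

// `mem` must be suitably aligned for the objects that will be stored in it.
static void ToyFixedArena_Init(ToyFixedArena* a, void* mem, size_t n) {
  a->ptr = mem;
  a->avail = n;
}

// Returns NULL when the request does not fit in the remaining space.
static void* ToyFixedArena_Malloc(ToyFixedArena* a, size_t n) {
  size_t rounded = (n + 7) & ~(size_t)7;  // Keep allocations 8-byte aligned.
  if (rounded < n || rounded > a->avail) return NULL;
  void* ret = a->ptr;
  a->ptr += rounded;
  a->avail -= rounded;
  return ret;
}

// Example caller that treats allocation as fallible and reports failure.
static int ToyStoreInt(ToyFixedArena* a, int value, int** out) {
  int* slot = ToyFixedArena_Malloc(a, sizeof(*slot));
  if (!slot) return 0;  // Out of memory: propagate instead of dereferencing.
  *slot = value;
  *out = slot;
  return 1;
}
```

With a 16-byte buffer, two calls to `ToyStoreInt()` succeed and the third reports failure without touching memory outside the buffer.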

 ## Schemas

 Nearly all operations in upb require that you have a schema. A protobuf schema
 is a data structure that contains all of the message, field, enum, etc.
 definitions that are specified in a `.proto` file. To create, parse, serialize,
 or access a message you must have a schema. For this reason, loading a schema is
 generally the first thing you must do when you use upb. [^0]

 [^0]: This requirement comes from the protobuf wire format itself, and it
     reflects a deep fact about the nature of protobuf (or at least the existing
     wire format). Unlike JSON, protobuf cannot be parsed or manipulated in a
     schema-less way. This is because the binary wire format does not
     distinguish between strings and sub-messages, so a generic parser that is
     oblivious to the schema is not possible. If a future version of the wire
     format did distinguish between these, it could be possible to have a
     schema-agnostic data representation, parser, and serializer.
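
The ambiguity described in the footnote can be shown with four bytes. In the sketch below, the outer field uses wire type 2 (length-delimited), which the wire format shares between strings, bytes, and sub-messages. The helper and constant names are hypothetical and only handle single-byte tags.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { kToyWireVarint = 0, kToyWireDelimited = 2 };

typedef struct {
  uint32_t field_number;
  uint32_t wire_type;
} ToyTag;

// Splits a single-byte tag (field numbers 1 to 15) into its two parts:
// the low three bits are the wire type, the rest is the field number.
static ToyTag ToyDecodeTag(uint8_t byte) {
  ToyTag tag;
  tag.field_number = byte >> 3;
  tag.wire_type = byte & 7;
  return tag;
}

// Field 1, length-delimited, length 2, payload {0x08, 0x01}.
//
// If the schema says field 1 is a string, the payload is the two-byte
// string "\x08\x01". If the schema says field 1 is a sub-message, the same
// payload is a message whose own field 1 is the varint 1. Nothing in the
// bytes themselves says which reading is intended.
static const uint8_t kToyPayload[] = {0x0A, 0x02, 0x08, 0x01};
```

Both readings are well-formed, which is why a parser needs the schema to choose between them.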

 upb has two main data structures that represent a protobuf schema:

 *   **MiniTables** are a compact, stripped down version of the schema that
     contains only the information necessary for parsing and serializing the
     binary wire format.
 *   **Reflection** contains basically all of the data from a `.proto` file,
     including the original names of all messages/fields/etc., and all options.

 The table below summarizes the main differences between these two:

 |                     | MiniTables                | Reflection                 |
 | ------------------- | ------------------------- | -------------------------- |
 | Contains            | Field numbers and types   | All data in `.proto` file, |
 :                     : only                      : including names of         :
 :                     :                           : everything                 :
 | Used to parse       | binary format             | JSON / TextFormat          |
 | Wire representation | MiniDescriptor            | Descriptor                 |
 | Type names          | `upb_MiniTable`,          | `upb_MessageDef`,          |
 :                     : `upb_MiniTableField`, ... : `upb_FieldDef`, ...        :
 | Registry            | `upb_ExtensionRegistry`   | `upb_DefPool`              |
 :                     : (for extensions)          :                            :

 MiniTables are useful if you only need the binary wire format, because they are
 much lighter weight than full reflection.

 Reflection is useful if you need to parse JSON or TextFormat, or you need access
 to options that were specified in the `.proto` file. Note that reflection also
 includes MiniTables, so if you have reflection, you also have MiniTables
 available.

 ### MiniTables

 MiniTables are represented by a set of data structures with names like
 `upb_MiniTable` (representing a message), `upb_MiniTableField`,
 `upb_MiniTableFile`, etc. Whenever you see one of these types in a function
 signature, you know that this particular operation requires a MiniTable. For
 example:

 ```c
 // Parses the wire format data in the given buffer `[buf, size]` and writes it
 // to the message `msg`, which has the type `mt`.
 UPB_API upb_DecodeStatus upb_Decode(const char* buf, size_t size,
                                     upb_Message* msg, const upb_MiniTable* mt,
                                     const upb_ExtensionRegistry* extreg,
                                     int options, upb_Arena* arena);
 ```

 The subset of upb that requires only MiniTables can be thought of as "upb lite,"
 because both the code size and the runtime memory overhead will be less than
 "upb full" (the parts that use reflection).

 #### Loading

 There are three main ways of loading a MiniTable:

 1.  **From C generated code:** The upb code generator can emit `.upb.c` files that
     contain the MiniTables as global constant variables. When the main program
     links against these, the MiniTable will be placed into `.rodata` (or
     `.data.rel.ro`) in the binary. The MiniTable can then be obtained from a
     generated function. In Blaze/Bazel these files can be generated and linked
     using the `upb_proto_library()` rule.
 2.  **From MiniDescriptors:** The user can build MiniDescriptors into MiniTables
     at runtime. MiniDescriptors are a compact upb-specific wire format designed
     specially for this purpose. The user can call `upb_MiniTable_Build()` at
     runtime to convert MiniDescriptors to MiniTables.
 3.  **From reflection:** If you have already built reflection data structures
     for your type, then you can obtain the `upb_MiniTable` corresponding to a
     `upb_MessageDef` using `upb_MessageDef_MiniTable()`.

 For languages that are already using reflection, (3) is an obvious choice.

 For languages that are avoiding reflection, here is a general guideline for
 choosing between (1) and (2): if the language being wrapped participates in the
 standard binary linking model on a given platform (in particular, if it is
 generally linked using `ld`), then it is better to use (1), which is also known
 as "static loading".

 Static loading of MiniTables has the benefit of requiring no runtime
 initialization[^2], leading to faster startup. Static loading of MiniTables also
 facilitates cross-language sharing of proto messages, because sharing generally
 requires that both languages are using exactly the same MiniTables.

 The main downside of static loading is that it requires the user to generate one
 `.upb.c` file per `.proto` and link against the transitive closure of `.upb.c`
 files. Blaze and Bazel make this reasonably easy, but for other build systems it
 can be more of a challenge.

 [^2]: aside from possible pointer relocations performed by the ELF/Mach-O loader
     if the library or binary is position-independent

 Loading from MiniDescriptors, as in option (2), has the advantage that it does
 not require per-message linking of C code. For many language toolchains,
 generating and linking some custom C code for every protobuf file or message
 type would be a burdensome requirement. MiniDescriptors are a convenient way of
 loading MiniTables without needing to cross the FFI boundary outside the core
 runtime.

 A common pattern when using dynamic loading is to embed strings containing
 MiniDescriptors directly into generated code. For example, the generated code in
 Dart for a message with only primitive fields currently looks something like:

 ```dart
   const desc = r'$(+),*-#$%&! /10';
   _accessor = $pb.instance.registry.newMessageAccessor(desc);
 ```

 The implementation of `newMessageAccessor()` is mainly just a wrapper around
 `upb_MiniTable_Build()`, which builds a MiniTable from a MiniDescriptor. In the
 code generator, the MiniDescriptor can be obtained from the
 `upb_MessageDef_MiniDescriptorEncode()` API; users should never need to encode a
 MiniDescriptor manually.

 #### Linking

 When building MiniTables dynamically, it is the user's responsibility to link
 each message to the appropriate sub-messages and/or enums. Each message
 must have its message and closed enum fields linked using
 `upb_MiniTable_SetSubMessage()` and `upb_MiniTable_SetSubEnum()`, respectively.

 A higher-level function that links all fields at the same time is also
 available, as `upb_MiniTable_Link()`. This function pairs well with
 `upb_MiniTable_GetSubList()` which can be used in a code generator to get a list
 of all the messages and enums which must be passed to `upb_MiniTable_Link()`.

 A common pattern is to embed the `link()` calls directly into the generated
 code. For example, here is Dart code that builds a MiniTable containing
 sub-messages and enums:

 ```dart
   const desc = r'$3334';
   _accessor = $pb.instance.registry.newMessageAccessor(desc);
   _accessor!.link(
       [
         M2.$_accessor,
         M3.$_accessor,
         M4.$_accessor,
       ],
       [
         E.$_accessor,
       ],
   );
 ```

 In this case, `upb_MiniTable_GetSubList()` was used in the code generator to
 discover the 3 sub-message fields and 1 sub-enum field that require linking. At
 runtime, these lists of MiniTables are passed to the `link()` function, which
 will internally call `upb_MiniTable_Link()`.

 Note that in some cases, the application may choose to delay or even skip the
 registration of sub-message types, as part of a tree shaking strategy.

 When using static MiniTables, a manual link step is not necessary, as linking is
 performed automatically by `ld`.

 #### Enums

 MiniTables primarily carry data about messages, fields, and extensions. However,
 for closed enums, we must also have a `upb_MiniTableEnum` structure that stores
 the set of all numbers that are defined in the enum. This is because closed
 enums have the unfortunate behavior of putting unknown enum values into the
 unknown field set.

 Over time, closed enums will hopefully be phased out via editions, and the
 relevance and overhead of `upb_MiniTableEnum` will shrink and eventually
 disappear.
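
The role of the enum table during parsing can be sketched as a membership test: a value that is in the set is stored in the field, and any other value is routed to the unknown field set. The types below are hypothetical stand-ins; the actual layout of `upb_MiniTableEnum` is not described here and may differ.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

// Hypothetical stand-in for the set of numbers defined by a closed enum.
typedef struct {
  const int32_t* values;  // Sorted in ascending order.
  size_t count;
} ToyEnumTable;

// Binary search for `number` among the defined enum values.
static bool ToyEnumTable_Contains(const ToyEnumTable* t, int32_t number) {
  size_t lo = 0;
  size_t hi = t->count;
  while (lo < hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (t->values[mid] == number) return true;
    if (t->values[mid] < number) {
      lo = mid + 1;
    } else {
      hi = mid;
    }
  }
  return false;
}

typedef enum { kToyStoreInField, kToyStoreInUnknown } ToyEnumAction;

// Decides where a parsed value for a closed enum field should go.
static ToyEnumAction ToyClassifyEnumValue(const ToyEnumTable* t,
                                          int32_t number) {
  return ToyEnumTable_Contains(t, number) ? kToyStoreInField
                                          : kToyStoreInUnknown;
}
```

An open enum would skip this check entirely and store every value in the field, which is why the table is only needed for closed enums.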

 ### Reflection

 Reflection uses types like `upb_MessageDef` and `upb_FieldDef` to represent the
 full contents of a `.proto` file at runtime. These types are upb's direct
 equivalents of `google::protobuf::Descriptor`, `google::protobuf::FieldDescriptor`, etc. [^1]

 [^1]: upb consistently uses `Def` where C++ would use `Descriptor` in type
     names. This introduces divergence with C++; the rationale was to conserve
     horizontal line length, as `Def` is less than 1/3 the length of
     `Descriptor`. This is more relevant in C, where the type name is repeated
     in every function, e.g. `upb_FieldDef_Name()` vs.
     `upb_FieldDescriptor_Name()`.

 Whenever you see one of these types in a function signature, you know that the
 given operation requires reflection. For example:

 ```c
 // Parses JSON format into a message object, using reflection.
 UPB_API bool upb_JsonDecode(const char* buf, size_t size, upb_Message* msg,
                             const upb_MessageDef* m, const upb_DefPool* symtab,
                             int options, upb_Arena* arena, upb_Status* status);
 ```

 The part of upb that requires reflection can be thought of as "upb full." These
 parts of the library cannot be used if a given application has only loaded
 MiniTables. There is no way to convert a MiniTable into reflection.

 The `upb_DefPool` type is the top-level container that builds and owns some set
 of defs. This type is a close analogue of `google::protobuf::DescriptorPool` in C++. The
 user must always ensure that the `upb_DefPool` outlives any def objects that it
 owns.

 #### Loading

 As with MiniTable loading, we have multiple options for how to load full
 reflection:

 1.  **From C generated code**: The upb code generator can create `foo.upbdefs.c`
     files that embed the descriptors and export generated C functions for
     adding them to a user-provided `upb_DefPool`.
 2.  **From descriptors**: The user can make manual calls to
     `upb_DefPool_AddFile()`, using descriptors obtained at runtime. Defs for
     individual messages can then be obtained using
     `upb_DefPool_FindMessageByName()`.

 Unlike MiniTables, loading from generated code requires runtime initialization,
 as reflection data structures like `upb_MessageDef` are not capable of being
 emitted directly into `.rodata` like `upb_MiniTable` is. Instead, the generated
 code embeds serialized descriptor protos into `.rodata` which are then built
 into heap objects at runtime.

 From this you might conclude that option (1) is nothing but a convenience
 wrapper around option (2), but that is not quite correct either. Option (1)
 *does* link against the static `.upb.c` structures for the MiniTables, whereas
 option (2) will build the MiniTables from scratch on the heap. So option (1)
 will use marginally less CPU and RAM when the descriptors are loaded into a
 `upb_DefPool`. More importantly, the resulting descriptors will be capable of
 reflecting over any messages built from the generated `.upb.c` MiniTables,
 whereas descriptors built using option (2) will have distinct MiniTables that
 cannot reflect over messages that use the generated MiniTables.

 A common pattern for dynamic languages like PHP, Ruby, or Python is to use
 option (2) with descriptors that are embedded into the generated code. For
 example, the generated code in Python currently looks something like:

 ```python
 from google.protobuf import descriptor_pool as _descriptor_pool
 from google.protobuf.internal import builder as _builder

 _desc = b'\n\x1aprotoc_explorer/main.proto\x12\x03pkg'

 DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(_desc)
 _globals = globals()
 _builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
 _builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.protoc_explorer.main_pb2', _globals)
 ```

 The `AddSerializedFile()` API above is mainly just a thin wrapper around
 `upb_DefPool_AddFile()`.
	# upb Design

	[TOC]

	upb is a protobuf kernel written in C. It is a fast and conformant
	implementation of protobuf, with a low-level C API that is designed to be
	wrapped in other languages.

	upb is not designed to be used by applications directly. The C API is very
	low-level, unsafe, and changes frequently. It is important that upb is able to
	make breaking API changes as necessary, to avoid taking on technical debt that
	would compromise upb's goals of small code size and fast performance.

	## Design goals

	Goals:

	- Full protobuf conformance
	- Small code size
	- Fast performance (without compromising code size)
	- Easy to wrap in language runtimes
	- Easy to adapt to different memory management schemes (refcounting, GC, etc)

	Non-Goals:

	- Stable API
	- Safe API
	- Ergonomic API for applications

	Parameters:

	- C99
	- 32 or 64-bit CPU (assumes 4 or 8 byte pointers)
	- Uses pointer tagging, but avoids other implementation-defined behavior
	- Aims to never invoke undefined behavior (tests with ASAN, UBSAN, etc)
	- No global state, fully re-entrant

	## Arenas

	All memory management in upb uses arenas, using the type `upb_Arena`. Arenas are
	an alternative to `malloc()` and `free()` that significantly reduces the costs
	of memory allocation.

	Arenas obtain blocks of memory using some underlying allocator (likely
	`malloc()` and `free()`), and satisfy allocations using a simple bump allocator
	that walks through each block in linear order. Allocations cannot be freed
	individually: it is only possible to free the arena as a whole, which frees all
	of the underlying blocks.

	Here is an example of using the `upb_Arena` type:

	```c
	upb_Arena* arena = upb_Arena_New();

	// Perform some allocations.
	int* x = upb_Arena_Malloc(arena, sizeof(*x));
	int* y = upb_Arena_Malloc(arena, sizeof(*y));

	// We cannot free `x` and `y` separately, we can only free the arena
	// as a whole.
	upb_Arena_Free(arena);
	```

	upb uses arenas for all memory management, and this fact is reflected in the API
	for all upb data structures. All upb functions that allocate take a `upb_Arena*`
	parameter and perform allocations using that arena rather than calling
	`malloc()` or `free()`.

	```c
	// upb API to create a message.
	UPB_API upb_Message* upb_Message_New(const upb_MiniTable* mini_table,
	upb_Arena* arena);

	void MakeMessage(const upb_MiniTable* mini_table) {
	upb_Arena* arena = upb_Arena_New();

	// This message is allocated on our arena.
	upb_Message* msg = upb_Message_New(mini_table, arena);

	// We can free the arena whenever we want, but we cannot free the
	// message separately from the arena.
	upb_Arena_Free(arena);

	// msg is now deleted.
	}
	```

	Arenas are a key part of upb's performance story. Parsing a large protobuf
	payload usually involves rapidly creating a series of messages, arrays (repeated
	fields), and maps. It is crucial for parsing performance that these allocations
	are as fast as possible. Equally important, freeing the tree of messages should
	be as fast as possible, and arenas can reduce this cost from `O(n)` to `O(lg
	n)`.

	### Avoiding Dangling Pointers

	Objects allocated on an arena will frequently contain pointers to other
	arena-allocated objects. For example, a `upb_Message` will have pointers to
	sub-messages that are also arena-allocated.

	Unlike unique ownership schemes (such as `unique_ptr<>`), arenas cannot provide
	automatic safety from dangling pointers. Instead, upb provides tools to help
	bridge between higher-level memory management schemes (GC, refcounting, RAII,
	borrow checkers) and arenas.

	If there is only one arena, dangling pointers within the arena are impossible,
	because all objects are freed at the same time. This is the simplest case. The
	user must still be careful not to keep dangling pointers that point at arena
	memory after it has been freed, but dangling pointers between the arena
	objects will be impossible.

	But what if there are multiple arenas? If we have a pointer from one arena to
	another, how do we ensure that this will not become a dangling pointer?

	To help with the multiple arena case, upb provides a primitive called "fuse".

	```c
	// Fuses the lifetimes of `a` and `b`. None of the blocks from `a` or `b`
	// will be freed until both arenas are freed.
	UPB_API bool upb_Arena_Fuse(upb_Arena* a, upb_Arena* b);
	```

	When two arenas are fused together, their lifetimes are irreversibly joined,
	such that none of the arena blocks in either arena will be freed until both
	arenas are freed with `upb_Arena_Free()`. This means that dangling pointers
	between the two arenas will no longer be possible.

	Fuse is useful when joining two messages from separate arenas (making one a
	sub-message of the other). Fuse is a relatively cheap operation, on the order of
	150ns, and is very nearly `O(1)` in the number of arenas being fused (the true
	complexity is the inverse Ackermann function, which grows extremely slowly).

	Each arena does consume some memory, so repeatedly creating and fusing an
	additional arena is not free, but the CPU cost of fusing two arenas together is
	modest.

	### Initial Block and Custom Allocators

	`upb_Arena` normally uses `malloc()` and `free()` to allocate and return its
	underlying blocks. But this default strategy can be customized to support the
	needs of a particular language.

	The lowest-level function for creating a `upb_Arena` is:

	```c
	// Creates an arena from the given initial block (if any -- n may be 0).
	// Additional blocks will be allocated from \|alloc\|. If \|alloc\| is NULL,
	// this is a fixed-size arena and cannot grow.
	UPB_API upb_Arena* upb_Arena_Init(void* mem, size_t n, upb_alloc* alloc);
	```

	The buffer `[mem, n]` will be used as an "initial block", which is used to
	satisfy allocations before calling any underlying allocation function. Note that
	the `upb_Arena` itself will be allocated from the initial block if possible, so
	the amount of memory available for allocation from the arena will be less than
	`n`.

	The `alloc` parameter specifies a custom memory allocation function which will
	be used once the initial block is exhausted. The user can pass `NULL` as the
	allocation function, in which case the initial block is the only memory
	available in the arena. This can allow upb to be used even in situations where
	there is no heap.

	It follows that `upb_Arena_Malloc()` is a fallible operation, and all allocating
	operations like `upb_Message_New()` should be checked for failure if there is
	any possibility that a fixed size arena is in use.

	## Schemas

	Nearly all operations in upb require that you have a schema. A protobuf schema
	is a data structure that contains all of the message, field, enum, etc.
	definitions that are specified in a `.proto` file. To create, parse, serialize,
	or access a message you must have a schema. For this reason, loading a schema is
	generally the first thing you must do when you use upb. [^0]

	[^0]: This requirement comes from the protobuf wire format itself, which is a
	deep insight about the nature of protobuf (or at least the existing wire
	format). Unlike JSON, protobuf cannot be parsed or manipulated in a
	schema-less way. This is because the binary wire format does not
	distinguish between strings and sub-messages, so a generic parser that is
	oblivious to the schema is not possible. If a future version of the wire
	format did distinguish between these, it could be possible to have a
	schema-agnostic data representation, parser, and serializer.

	upb has two main data structures that represent a protobuf schema:

	* MiniTables are a compact, stripped down version of the schema that
	contains only the information necessary for parsing and serializing the
	binary wire format.
	* Reflection contains basically all of the data from a `.proto` file,
	including the original names of all messages/fields/etc., and all options.

	The table below summarizes the main differences between these two:

	\| \| MiniTables \| Reflection \|
	\| ------------------- \| ------------------------- \| -------------------------- \|
	\| Contains \| Field numbers and types \| All data in `.proto` file, \|
	: : only : including names of :
	: : : everything :
	\| Used to parse \| binary format \| JSON / TextFormat \|
	\| Wire representation \| MiniDescriptor \| Descriptor \|
	\| Type names \| `upb_MiniTable`, \| `upb_MessageDef`, \|
	: : `upb_MiniTableField`, ... : `upb_FieldDef`, ... :
	\| Registry \| `upb_ExtensionRegistry` \| `upb_DefPool` \|
	: : (for extensions) : :

	MiniTables are useful if you only need the binary wire format, because they are
	much lighter weight than full reflection.

	Reflection is useful if you need to parse JSON or TextFormat, or you need access
	to options that were specified in the `proto` file. Note that reflection also
	includes MiniTables, so if you have reflection, you also have MiniTables
	available.

	### MiniTables

	MiniTables are represented by a set of data structures with names like
	`upb_MiniTable` (representing a message), `upb_MiniTableField`,
	`upb_MiniTableFile`, etc. Whenever you see one of these types in a function
	signature, you know that this particular operation requires a MiniTable. For
	example:

	```
	// Parses the wire format data in the given buffer `[buf, size]` and writes it
	// to the message `msg`, which has the type `mt`.
	UPB_API upb_DecodeStatus upb_Decode(const char* buf, size_t size,
	upb_Message* msg, const upb_MiniTable* mt,
	const upb_ExtensionRegistry* extreg,
	int options, upb_Arena* arena);
	```

	The subset of upb that requires only MiniTables can be thought of as "upb lite,"
	because both the code size and the runtime memory overhead will be less than
	"upb full" (the parts that use reflection).

	#### Loading

	There are three main ways of loading a MiniTable:

	1. From C generated code: The upb code generator can emit `.upb.c` files that
	contain the MiniTables as global constant variables. When the main program
	links against these, the MiniTable will be placed into `.rodata` (or
	`.data.rel.ro`) in the binary. The MiniTable can then be obtained from a
	generated function. In Blaze/Bazel these files can be generated and linked
	using the `upb_proto_library()` rule.
	2. From MiniDescriptors: The user can build MiniDescriptors into MiniTables
	at runtime. MiniDescriptors are a compact upb-specific wire format designed
	specially for this purpose. The user can call `upb_MiniTable_Build()` at
	runtime to convert MiniDescriptors to MiniTables.
	3. From reflection: If you have already built reflection data structures
	for your type, then you can obtain the `upb_MiniTable` corresponding to a
	`upb_MessageDef` using `upb_MessageDef_MiniTable()`.

	For languages that are already using reflection, (3) is an obvious choice.

	For languages that are avoiding reflection, here is a general guideline for
	choosing between (1) and (2): if the language being wrapped participates in the
	standard binary linking model on a given platform (in particular, if it is
	generally linked using `ld`), then it is better to use (1), which is also known
	as "static loading".

	Static loading of MiniTables has the benefit of requiring no runtime
	initialization[^2], leading to faster startup. Static loading of MiniTables also
	facilitates cross-language sharing of proto messages, because sharing generally
	requires that both languages are using exactly the same MiniTables.

	The main downside of static loading is that it requires the user to generate one
	`.upb.c` file per `.proto` and link against the transitive closure of `.upb.c`
	files. Blaze and Bazel make this reasonably easy, but for other build systems it
	can be more of a challenge.

	[^2]: aside from possible pointer relocations performed by the ELF/Mach-O loader
	if the library or binary is position-independent

	Loading from MiniDescriptors, as in option (2), has the advantage that it does
	not require per-message linking of C code. For many language toolchains,
	generating and linking some custom C code for every protobuf file or message
	type would be a burdensome requirement. MiniDescriptors are a convenient way of
	loading MiniTables without needing to cross the FFI boundary outside the core
	runtime.

	A common pattern when using dynamic loading is to embed strings containing
	MiniDescriptors directly into generated code. For example, the generated code in
	Dart for a message with only primitive fields currently looks something like:

	```dart
	const desc = r'$(+),*-#$%&! /10';
	_accessor = $pb.instance.registry.newMessageAccessor(desc);
	```

	The implementation of `newMessageAccessor()` is mainly just a wrapper around
	`upb_MiniTable_Build()`, which builds a MiniTable from a MiniDescriptor. In the
	code generator, the MiniDescriptor can be obtained from the
	`upb_MessageDef_MiniDescriptorEncode()` API; users should never need to encode a
	MiniDescriptor manually.
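
To make the contrast between static and dynamic loading more concrete, here is a minimal sketch in C. Everything in it is a hypothetical stand-in: `ToyTable`, `kToyStaticTable`, `toy_table_build()`, and the one-character-per-field encoding are invented for illustration and are not upb's real types, API, or MiniDescriptor format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy stand-ins only: NOT upb's real types or its MiniDescriptor encoding.
 * In this toy format, each character describes one field:
 * 'i' = int32, 's' = string, 'b' = bool. */
#define TOY_MAX_FIELDS 8

typedef enum { TOY_INT32, TOY_STRING, TOY_BOOL } ToyFieldType;

typedef struct {
  size_t field_count;
  ToyFieldType fields[TOY_MAX_FIELDS];
} ToyTable;

/* "Static loading": a code generator emits the table as constant data, so
 * nothing has to be built when the program starts. */
const ToyTable kToyStaticTable = {2, {TOY_INT32, TOY_STRING}};

/* "Dynamic loading": the table is built at runtime from a compact string.
 * Returns NULL on malformed input or allocation failure. The caller owns
 * the result and releases it with free(). */
ToyTable* toy_table_build(const char* desc, size_t len) {
  assert(desc != NULL);
  if (len > TOY_MAX_FIELDS) return NULL;
  ToyTable* table = malloc(sizeof(*table));
  if (table == NULL) return NULL;
  table->field_count = len;
  for (size_t i = 0; i < len; i++) {
    switch (desc[i]) {
      case 'i': table->fields[i] = TOY_INT32; break;
      case 's': table->fields[i] = TOY_STRING; break;
      case 'b': table->fields[i] = TOY_BOOL; break;
      default:
        free(table);  /* Unknown type character: reject the descriptor. */
        return NULL;
    }
  }
  return table;
}
```

In this sketch, `toy_table_build("is", 2)` yields a heap table with the same contents as `kToyStaticTable`, but only the static one is available without any runtime work or per-message generated C code.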

	#### Linking

	When building MiniTables dynamically, it is the user's responsibility to link
	each message to the appropriate sub-messages and/or enums. Each message
	must have its message and closed enum fields linked using
	`upb_MiniTable_SetSubMessage()` and `upb_MiniTable_SetSubEnum()`, respectively.

	A higher-level function that links all fields at the same time is also
	available, as `upb_MiniTable_Link()`. This function pairs well with
	`upb_MiniTable_GetSubList()` which can be used in a code generator to get a list
	of all the messages and enums which must be passed to `upb_MiniTable_Link()`.

	A common pattern is to embed the `link()` calls directly into the generated
	code. For example, here is Dart code that builds a MiniTable containing
	sub-messages and enums:

	```dart
	const desc = r'$3334';
	_accessor = $pb.instance.registry.newMessageAccessor(desc);
	_accessor!.link(
	    [
	      M2.$_accessor,
	      M3.$_accessor,
	      M4.$_accessor,
	    ],
	    [
	      E.$_accessor,
	    ],
	);
	```

	In this case, `upb_MiniTable_GetSubList()` was used in the code generator to
	discover the 3 sub-message fields and 1 sub-enum field that require linking. At
	runtime, these lists of MiniTables are passed to the `link()` function, which
	will internally call `upb_MiniTable_Link()`.
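
The idea of linking all fields in one call can be sketched with a toy model in C. The types and the function `toy_table_link()` below are hypothetical and are not upb's API. The sketch assumes that sub-tables and sub-enums are supplied in field order, and this toy version insists that both lists have exactly the expected length.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-ins only: NOT upb's real types. */
typedef enum { TOY_SCALAR, TOY_MESSAGE, TOY_CLOSED_ENUM } ToyKind;

typedef struct { int unused; } ToyEnum;
typedef struct ToyTable ToyTable;

typedef struct {
  ToyKind kind;
  const ToyTable* sub_table; /* Set by linking when kind == TOY_MESSAGE. */
  const ToyEnum* sub_enum;   /* Set by linking when kind == TOY_CLOSED_ENUM. */
} ToyField;

struct ToyTable {
  ToyField* fields;
  size_t field_count;
};

/* Links every message field and closed enum field of `t` in one call.
 * Entries of `sub_tables` and `sub_enums` are consumed in field order.
 * Returns false, leaving `t` unmodified, if either list has the wrong
 * length. */
bool toy_table_link(ToyTable* t, const ToyTable** sub_tables,
                    size_t sub_table_count, const ToyEnum** sub_enums,
                    size_t sub_enum_count) {
  assert(t != NULL);
  size_t need_tables = 0;
  size_t need_enums = 0;
  for (size_t i = 0; i < t->field_count; i++) {
    if (t->fields[i].kind == TOY_MESSAGE) need_tables++;
    if (t->fields[i].kind == TOY_CLOSED_ENUM) need_enums++;
  }
  if (need_tables != sub_table_count || need_enums != sub_enum_count) {
    return false;
  }
  size_t next_table = 0;
  size_t next_enum = 0;
  for (size_t i = 0; i < t->field_count; i++) {
    ToyField* f = &t->fields[i];
    if (f->kind == TOY_MESSAGE) f->sub_table = sub_tables[next_table++];
    if (f->kind == TOY_CLOSED_ENUM) f->sub_enum = sub_enums[next_enum++];
  }
  return true;
}
```

Counting before assigning keeps the table untouched on failure, which is a choice made for this sketch rather than a statement about how upb behaves.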

	Note that in some cases, the application may choose to delay or even skip the
	registration of sub-message types, as part of a tree shaking strategy.

	When using static MiniTables, a manual link step is not necessary, as linking is
	performed automatically by `ld`.

	#### Enums

	MiniTables primarily carry data about messages, fields, and extensions. However,
	for closed enums, we must also have a `upb_MiniTableEnum` structure that stores
	the set of all numbers that are defined in the enum. This is because closed
	enums have the unfortunate behavior of putting unknown enum values into the
	unknown field set.
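
As a rough illustration of why the parser needs the set of defined numbers, here is a minimal sketch in C. `ToyClosedEnum`, `toy_enum_has_value()`, and `toy_classify_enum_value()` are hypothetical names; the flat array and linear search are assumptions made for brevity and say nothing about the real layout of `upb_MiniTableEnum`.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for the value set of a closed enum: NOT upb's real layout. */
typedef struct {
  const int32_t* values; /* Every number defined in the enum. */
  size_t value_count;
} ToyClosedEnum;

/* Returns true if `value` is one of the numbers defined in the enum. */
bool toy_enum_has_value(const ToyClosedEnum* e, int32_t value) {
  assert(e != NULL);
  for (size_t i = 0; i < e->value_count; i++) {
    if (e->values[i] == value) return true;
  }
  return false;
}

typedef enum { TOY_STORE_IN_FIELD, TOY_STORE_AS_UNKNOWN } ToyDisposition;

/* Decides where a parsed value for a closed enum field should go: known
 * numbers are stored in the field, anything else is kept as unknown data. */
ToyDisposition toy_classify_enum_value(const ToyClosedEnum* e,
                                       int32_t wire_value) {
  return toy_enum_has_value(e, wire_value) ? TOY_STORE_IN_FIELD
                                           : TOY_STORE_AS_UNKNOWN;
}
```

An open enum would need no such table, since every parsed number would be stored in the field directly.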

	Over time, closed enums will hopefully be phased out via editions, and the
	relevance and overhead of `upb_MiniTableEnum` will shrink and eventually
	disappear.

	### Reflection

	Reflection uses types like `upb_MessageDef` and `upb_FieldDef` to represent the
	full contents of a `.proto` file at runtime. These types are upb's direct
	equivalents of `google::protobuf::Descriptor`, `google::protobuf::FieldDescriptor`, etc. [^1]

	[^1]: upb consistently uses `Def` where C++ would use `Descriptor` in type
	names. This introduces divergence with C++; the rationale was to conserve
	horizontal line length, as `Def` is less than 1/3 the length of
	`Descriptor`. This is more relevant in C, where the type name is repeated
	in every function, e.g. `upb_FieldDef_Name()` vs.
	`upb_FieldDescriptor_Name()`.

	Whenever you see one of these types in a function signature, you know that the
	given operation requires reflection. For example:

	```c
	// Parses JSON format into a message object, using reflection.
	UPB_API bool upb_JsonDecode(const char* buf, size_t size, upb_Message* msg,
	                            const upb_MessageDef* m, const upb_DefPool* symtab,
	                            int options, upb_Arena* arena, upb_Status* status);
	```

	The parts of upb that require reflection can be thought of as "upb full." These
	parts of the library cannot be used if an application has loaded only
	MiniTables. There is no way to convert a MiniTable into reflection data.

	The `upb_DefPool` type is the top-level container that builds and owns some set
	of defs. This type is a close analogue of `google::protobuf::DescriptorPool` in C++. The
	user must always ensure that the `upb_DefPool` outlives any def objects that it
	owns.
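
The ownership rule can be pictured with a toy pool in C. This is only a sketch: `ToyDefPool`, `ToyMessageDef`, and the `toy_pool_*` functions are hypothetical, and the fixed-size storage is an assumption chosen to keep the example short. The point it illustrates is that def pointers handed out by the pool point into memory the pool owns.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins only: NOT upb's real types or API. */
#define TOY_MAX_DEFS 16
#define TOY_MAX_NAME 64

typedef struct {
  char full_name[TOY_MAX_NAME];
} ToyMessageDef;

typedef struct {
  ToyMessageDef defs[TOY_MAX_DEFS]; /* Defs live inside the pool itself. */
  size_t count;
} ToyDefPool;

/* Creates an empty pool, or returns NULL on allocation failure. */
ToyDefPool* toy_pool_new(void) { return calloc(1, sizeof(ToyDefPool)); }

/* Frees the pool and, with it, every def it owns. Any ToyMessageDef
 * pointer obtained from this pool is dangling after this call. */
void toy_pool_free(ToyDefPool* pool) { free(pool); }

/* Looks up a def by full name; returns NULL if it is not in the pool. */
const ToyMessageDef* toy_pool_find_message(const ToyDefPool* pool,
                                           const char* full_name) {
  assert(pool != NULL && full_name != NULL);
  for (size_t i = 0; i < pool->count; i++) {
    if (strcmp(pool->defs[i].full_name, full_name) == 0) {
      return &pool->defs[i];
    }
  }
  return NULL;
}

/* Adds a def to the pool. Returns NULL if the pool is full, the name is
 * too long, or a def with that name already exists. */
const ToyMessageDef* toy_pool_add_message(ToyDefPool* pool,
                                          const char* full_name) {
  assert(pool != NULL && full_name != NULL);
  if (pool->count == TOY_MAX_DEFS) return NULL;
  if (strlen(full_name) >= TOY_MAX_NAME) return NULL;
  if (toy_pool_find_message(pool, full_name) != NULL) return NULL;
  ToyMessageDef* def = &pool->defs[pool->count++];
  strcpy(def->full_name, full_name);
  return def;
}
```

A caller of this toy API would keep the pool alive for as long as any def pointer is in use, and call `toy_pool_free()` only after the last such use.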

	#### Loading

	As with MiniTable loading, we have multiple options for how to load full
	reflection:

	1. From C generated code: The upb code generator can create `foo.upbdefs.c`
	files that embed the descriptors and export generated C functions for
	adding them to a user-provided `upb_DefPool`.
	2. From descriptors: The user can make manual calls to
	`upb_DefPool_AddFile()`, using descriptors obtained at runtime. Defs for
	individual messages can then be obtained using
	`upb_DefPool_FindMessageByName()`.

	Unlike MiniTables, loading from generated code requires runtime initialization,
	as reflection data structures like `upb_MessageDef` are not capable of being
	emitted directly into `.rodata` like `upb_MiniTable` is. Instead, the generated
	code embeds serialized descriptor protos into `.rodata` which are then built
	into heap objects at runtime.

	From this you might conclude that option (1) is nothing but a convenience
	wrapper around option (2), but that is not quite correct. Option (1)
	does link against the static `.upb.c` structures for the MiniTables, whereas
	option (2) will build the MiniTables from scratch on the heap. So option (1)
	will use marginally less CPU and RAM when the descriptors are loaded into a
	`upb_DefPool`. More importantly, the resulting descriptors will be capable of
	reflecting over any messages built from the generated `.upb.c` MiniTables,
	whereas descriptors built using option (2) will have distinct MiniTables that
	cannot reflect over messages that use the generated MiniTables.

	A common pattern for dynamic languages like PHP, Ruby, or Python is to use
	option (2) with descriptors that are embedded into the generated code. For
	example, the generated code in Python currently looks something like:

	```python
	from google.protobuf import descriptor_pool as _descriptor_pool
	from google.protobuf.internal import builder as _builder

	_desc = b'\n\x1aprotoc_explorer/main.proto\x12\x03pkg'

	DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(_desc)
	_globals = globals()
	_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, _globals)
	_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'google3.protoc_explorer.main_pb2', _globals)
	```

	The `AddSerializedFile()` API above is mainly just a thin wrapper around
	`upb_DefPool_AddFile()`.