# upb Design

upb aims to be a minimal C protobuf kernel. It has a C API, but its primary
goal is to be the core runtime for a higher-level API.

## Design goals

- Full protobuf conformance
- Small code size
- Fast performance (without compromising code size)
- Easy to wrap in language runtimes
- Easy to adapt to different memory management schemes (refcounting, GC, etc.)

## Design parameters

- C99
- 32- or 64-bit CPU (assumes 4- or 8-byte pointers)
- Uses pointer tagging, but avoids other implementation-defined behavior
- Aims to never invoke undefined behavior (tested with ASAN, UBSAN, etc.)
- No global state, fully re-entrant

## Overall Structure

The upb library is divided into two main parts:

- A core message representation, which supports binary format parsing
  and serialization.
  - `upb/upb.h`: arena allocator (`upb_arena`)
  - `upb/msg_internal.h`: core message representation and parse tables
  - `upb/msg.h`: accessing metadata common to all messages, like unknown fields
  - `upb/decode.h`: binary format parsing
  - `upb/encode.h`: binary format serialization
  - `upb/table_internal.h`: hash table (used for maps)
  - `upbc/protoc-gen-upbc.cc`: compiler that generates `.upb.h`/`.upb.c` APIs for
    accessing messages without reflection.
- A reflection add-on library that supports JSON and text format.
  - `upb/def.h`: schema representation and loading from descriptors
  - `upb/reflection.h`: reflective access to message data.
  - `upb/json_encode.h`: JSON encoding
  - `upb/json_decode.h`: JSON decoding
  - `upb/text_encode.h`: text format encoding
  - `upbc/protoc-gen-upbdefs.cc`: compiler that generates `.upbdefs.h`/`.upbdefs.c`
    APIs for loading reflection.

## Core Message Representation

The representation for each message consists of:
- One pointer (`upb_msg_internaldata*`) for unknown fields and extensions. This
  pointer is `NULL` when no unknown fields or extensions are present.
- Hasbits for any optional/required fields.
- Case integers for each oneof.
- Data for each field.

For example, a layout for a message with two `optional int32` fields would end
up looking something like this:

```c
// For illustration only, upb does not actually generate structs.
typedef struct {
  upb_msg_internaldata* internal;  // Unknown fields and extensions.
  uint32_t hasbits;                // We are only using two hasbits.
  int32_t field1;
  int32_t field2;
} package_name_MessageName;
```

Note in particular that messages do not have:
- A pointer to reflection or a parse table (upb messages are not self-describing).
- A pointer to an arena (the arena must be explicitly passed into any function that
  allocates).

The upb compiler computes a layout for each message, and determines the offset for
each field using normal alignment rules (each data member must be aligned to a
multiple of its size). This layout is then embedded into the generated `.upb.h`
and `.upb.c` files in two different forms. First, as inline accessors that expect
the data at a given offset:

```c
// Example of a generated accessor, from foo.upb.h
UPB_INLINE int32_t upb_test_MessageName_field1(
    const upb_test_MessageName *msg) {
  return *UPB_PTR_AT(msg, UPB_SIZE(4, 4), int32_t);
}
```
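
To make the accessor pattern concrete, here is a minimal, self-contained sketch of reading and writing a field at a fixed byte offset, guarded by a hasbit. The `SKETCH_PTR_AT` macro, the `sketch_*` functions, and the chosen offsets and hasbit index are hypothetical stand-ins for illustration; they are not upb's definitions or its generated output. A caller would create a message with `sketch_msg_new()`, call `sketch_set_field1(msg, 42)`, and then observe `sketch_has_field1(msg)` become true.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical offset macro: view the bytes at `ofs` from `msg` as `type`. */
#define SKETCH_PTR_AT(msg, ofs, type) ((type *)((char *)(msg) + (ofs)))

/* Layout assumed for this sketch only: a 32-bit hasbits word at offset 0,
 * followed by one int32 field at offset 4. */
enum {
  SKETCH_HASBITS_OFS = 0,
  SKETCH_FIELD1_OFS = 4,
  SKETCH_FIELD1_HASBIT = 1,
  SKETCH_MSG_SIZE = 8
};

/* Allocates a zeroed message: all hasbits clear, all fields zero. */
static void *sketch_msg_new(void) {
  return calloc(1, SKETCH_MSG_SIZE);
}

/* True if field1 has been explicitly set. */
static bool sketch_has_field1(const void *msg) {
  uint32_t hasbits = *SKETCH_PTR_AT(msg, SKETCH_HASBITS_OFS, const uint32_t);
  return (hasbits & (1u << SKETCH_FIELD1_HASBIT)) != 0;
}

/* Reads field1 directly from its offset; no table or reflection needed. */
static int32_t sketch_field1(const void *msg) {
  return *SKETCH_PTR_AT(msg, SKETCH_FIELD1_OFS, const int32_t);
}

/* Writes field1 and records its presence in the hasbits word. */
static void sketch_set_field1(void *msg, int32_t value) {
  assert(msg != NULL);
  *SKETCH_PTR_AT(msg, SKETCH_HASBITS_OFS, uint32_t) |=
      1u << SKETCH_FIELD1_HASBIT;
  *SKETCH_PTR_AT(msg, SKETCH_FIELD1_OFS, int32_t) = value;
}
```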

Secondly, the layout is emitted as a table which is used by the parser and serializer.
We call these tables "mini-tables" to distinguish them from the larger and more
optimized "fast tables" used in `upb/decode_fast.c` (an experimental parser that is
2-3x the speed of the main parser, though the main parser is already quite fast).

```c
// Definition of mini-table structure, from upb/msg_internal.h
typedef struct {
  uint32_t number;
  uint16_t offset;
  int16_t presence;       /* If >0, hasbit_index.  If <0, ~oneof_index. */
  uint16_t submsg_index;  /* undefined if descriptortype != MESSAGE or GROUP. */
  uint8_t descriptortype;
  int8_t mode;            /* upb_fieldmode, with flags from upb_labelflags */
} upb_msglayout_field;

typedef enum {
  _UPB_MODE_MAP = 0,
  _UPB_MODE_ARRAY = 1,
  _UPB_MODE_SCALAR = 2,
} upb_fieldmode;

typedef struct {
  const struct upb_msglayout *const* submsgs;
  const upb_msglayout_field *fields;
  uint16_t size;
  uint16_t field_count;
  bool extendable;
  uint8_t dense_below;
  uint8_t table_mask;
} upb_msglayout;

// Example of a generated mini-table, from foo.upb.c
static const upb_msglayout_field upb_test_MessageName__fields[2] = {
  {1, UPB_SIZE(4, 4), 1, 0, 5, _UPB_MODE_SCALAR},
  {2, UPB_SIZE(8, 8), 2, 0, 5, _UPB_MODE_SCALAR},
};

const upb_msglayout upb_test_MessageName_msg_init = {
  NULL,
  &upb_test_MessageName__fields[0],
  UPB_SIZE(16, 16), 2, false, 2, 255,
};
```
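
As a rough illustration of why a table of offsets is enough to drive generic code, the sketch below looks up a field by number in a simplified table and then reads or writes it in raw message bytes. The `sketch_*` types and functions are hypothetical and far smaller than `upb_msglayout_field`. The sketch also assumes that hasbits are packed at the start of the message and that every field is an `int32`; a real parser and serializer must additionally handle sub-messages, repeated fields, oneofs, and unknown fields.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef struct {
  uint32_t number;  /* Field number from the .proto file. */
  uint16_t offset;  /* Byte offset of the field's data in the message. */
  int16_t presence; /* If >0, index of the field's hasbit. */
} sketch_field;

typedef struct {
  const sketch_field *fields;
  uint16_t field_count;
  uint16_t size; /* Total message size in bytes. */
} sketch_layout;

/* Linear search by field number; returns NULL for an unknown field. */
static const sketch_field *sketch_find_field(const sketch_layout *l,
                                             uint32_t number) {
  for (uint16_t i = 0; i < l->field_count; i++) {
    if (l->fields[i].number == number) return &l->fields[i];
  }
  return NULL;
}

/* Tests the field's hasbit (assumed to live at the start of the message). */
static bool sketch_has(const void *msg, const sketch_field *f) {
  assert(f->presence > 0);
  unsigned char byte = ((const unsigned char *)msg)[f->presence / 8];
  return ((byte >> (f->presence % 8)) & 1) != 0;
}

/* Stores an int32 at the field's offset and sets its hasbit. */
static void sketch_set_int32(void *msg, const sketch_field *f, int32_t value) {
  assert(f->presence > 0);
  ((unsigned char *)msg)[f->presence / 8] |=
      (unsigned char)(1u << (f->presence % 8));
  memcpy((char *)msg + f->offset, &value, sizeof(value));
}

/* Loads an int32 from the field's offset. */
static int32_t sketch_get_int32(const void *msg, const sketch_field *f) {
  int32_t value;
  memcpy(&value, (const char *)msg + f->offset, sizeof(value));
  return value;
}

/* Table for a message with two int32 fields: data at offsets 4 and 8,
 * hasbits 1 and 2, 16 bytes in total. */
static const sketch_field sketch_fields[2] = {
  {1, 4, 1},
  {2, 8, 2},
};
static const sketch_layout sketch_msg_layout = {sketch_fields, 2, 16};
```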

The upb compiler computes separate layouts for 32-bit and 64-bit modes, since the
pointer size will be 4 or 8 bytes respectively. The upb compiler embeds both
sizes into the source code, using a `UPB_SIZE(size32, size64)` macro that
chooses the appropriate size at build time based on the value of `UINTPTR_MAX`.
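
A build-time selection of this kind can be sketched with the preprocessor alone. The `SKETCH_SIZE` macro below is a hypothetical equivalent written for illustration, not a copy of upb's `UPB_SIZE` definition, and it assumes a target where `UINTPTR_MAX` reflects the pointer width.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical size-selection macro: pick the first argument on targets
 * with 32-bit pointers and the second argument otherwise. */
#if UINTPTR_MAX == 0xffffffff
#define SKETCH_SIZE(size32, size64) (size32)
#else
#define SKETCH_SIZE(size32, size64) (size64)
#endif

/* Offset of a member that directly follows one pointer-sized member. */
enum { SKETCH_AFTER_POINTER = SKETCH_SIZE(4, 8) };

/* Returns the offset selected for this build. Both candidate values are
 * present in the source, but only one survives preprocessing. */
static int sketch_after_pointer_offset(void) {
  assert(SKETCH_AFTER_POINTER == 4 || SKETCH_AFTER_POINTER == 8);
  return SKETCH_AFTER_POINTER;
}
```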

Note that `.upb.c` files contain data tables only. There is no "generated code"
except for the inline accessors in the `.upb.h` files: the entire footprint
of `.upb.c` files is in `.rodata`, none in `.text` or `.data`.

## Memory Management Model

All memory management in upb is built around arenas. A message is never
considered to "own" the strings or sub-messages contained within it. Instead, a
message and all of its sub-messages, strings, etc. are owned by an arena and
are freed when the arena is freed. An entire message tree will probably be
owned by a single arena, but this is not required or enforced. As far as upb is
concerned, it is up to the client how to partition its arenas. upb only requires
that all reachable messages are still alive when you ask it to serialize a
message.

The arena supports both a user-supplied initial block and a custom allocation
callback, so there is a lot of flexibility in memory allocation strategy. The
allocation callback can even be `NULL` for heap-free operation. The main
constraint of the arena is that all of the memory in each arena must be freed
together.
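
The following is a minimal sketch of an arena with these properties: an optional initial block, an optional allocation callback, and a single free operation that releases everything at once. All `sketch_*` names, the realloc-style callback signature, and the 4096-byte block size are assumptions made for illustration; this is not upb's `upb_arena` implementation. The sketch also assumes the caller passes a suitably aligned initial block and ignores integer overflow in size arithmetic.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical realloc-style callback: size == 0 frees `ptr`. */
typedef void *sketch_alloc_func(void *ptr, size_t size);

/* Strictest alignment this sketch supports. */
typedef union {
  long long ll;
  long double ld;
  void *p;
} sketch_max_align;

/* Header of each heap block, padded so the data after it stays aligned. */
typedef union sketch_block {
  union sketch_block *next;
  sketch_max_align align;
} sketch_block;

typedef struct {
  char *ptr;                /* Next free byte in the current block. */
  size_t avail;             /* Bytes left in the current block. */
  sketch_block *blocks;     /* Heap blocks, newest first. */
  sketch_alloc_func *alloc; /* NULL means heap-free operation. */
} sketch_arena;

/* A callback backed by the C heap. */
static void *sketch_heap_alloc(void *ptr, size_t size) {
  if (size == 0) {
    free(ptr);
    return NULL;
  }
  return realloc(ptr, size);
}

static void sketch_arena_init(sketch_arena *a, void *initial, size_t size,
                              sketch_alloc_func *alloc) {
  assert(initial != NULL || size == 0);
  a->ptr = initial;
  a->avail = size;
  a->blocks = NULL;
  a->alloc = alloc;
}

/* Bump-allocates `size` bytes; returns NULL if no memory is available. */
static void *sketch_arena_malloc(sketch_arena *a, size_t size) {
  const size_t unit = sizeof(sketch_max_align);
  if (size == 0) size = 1;
  size = (size + unit - 1) / unit * unit; /* Keep allocations aligned. */
  if (size > a->avail) {
    size_t block_size = size > 4096 ? size : 4096;
    sketch_block *b;
    if (!a->alloc) return NULL; /* Heap-free mode: initial block is full. */
    b = a->alloc(NULL, sizeof(sketch_block) + block_size);
    if (!b) return NULL;
    b->next = a->blocks;
    a->blocks = b;
    a->ptr = (char *)(b + 1);
    a->avail = block_size;
  }
  void *ret = a->ptr;
  a->ptr += size;
  a->avail -= size;
  return ret;
}

/* Releases every heap block together; individual frees do not exist. */
static void sketch_arena_free(sketch_arena *a) {
  sketch_block *b = a->blocks;
  while (b) {
    sketch_block *next = b->next;
    a->alloc(b, 0);
    b = next;
  }
  a->blocks = NULL;
  a->ptr = NULL;
  a->avail = 0;
}
```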

`upb_arena` supports a novel operation called "fuse". When two arenas are fused
together, their lifetimes are irreversibly joined, such that none of the arena
blocks in either arena will be freed until both arenas are freed with
`upb_arena_free()`. This is useful when joining two messages from separate
arenas (making one a sub-message of the other). Fuse is a very cheap
operation, and an unlimited number of arenas can be fused together efficiently.
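
One way to obtain these semantics is a union-find structure with a shared reference count: fusing links two groups under one root, and memory is released only when the last arena in the group is freed. The sketch below illustrates that idea under stated assumptions. It is single-threaded, every `sketch_*` name is hypothetical, each allocation is its own heap block for brevity, and returned memory is aligned only for pointer-sized types. It should not be read as upb's actual implementation of fuse.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct sketch_block {
  struct sketch_block *next;
} sketch_block;

typedef struct sketch_arena {
  sketch_block self;           /* The arena is the last block of its own list. */
  struct sketch_arena *parent; /* Union-find link; a root points to itself. */
  size_t refcount;             /* Root only: arenas in the group not yet freed. */
  sketch_block *head, *tail;   /* Root only: every block owned by the group. */
} sketch_arena;

static sketch_arena *sketch_arena_new(void) {
  sketch_arena *a = malloc(sizeof(*a));
  if (!a) return NULL;
  a->self.next = NULL;
  a->parent = a;
  a->refcount = 1;
  a->head = a->tail = &a->self;
  return a;
}

/* Finds the group's root, shortening the path as it goes. */
static sketch_arena *sketch_find_root(sketch_arena *a) {
  while (a->parent != a) {
    a->parent = a->parent->parent; /* Path halving. */
    a = a->parent;
  }
  return a;
}

/* Allocates `size` bytes owned by the arena's group. */
static void *sketch_arena_malloc(sketch_arena *a, size_t size) {
  sketch_arena *r = sketch_find_root(a);
  sketch_block *b = malloc(sizeof(sketch_block) + size);
  if (!b) return NULL;
  b->next = r->head; /* Push at the head; the tail stays valid. */
  r->head = b;
  return b + 1;
}

/* Irreversibly joins the lifetimes of the two arenas' groups. */
static void sketch_arena_fuse(sketch_arena *a, sketch_arena *b) {
  sketch_arena *ra = sketch_find_root(a);
  sketch_arena *rb = sketch_find_root(b);
  if (ra == rb) return; /* Already fused. */
  if (ra->refcount < rb->refcount) { /* Attach the smaller group. */
    sketch_arena *tmp = ra;
    ra = rb;
    rb = tmp;
  }
  ra->tail->next = rb->head; /* Splice the block lists in O(1). */
  ra->tail = rb->tail;
  ra->refcount += rb->refcount;
  rb->parent = ra;
}

/* Drops one arena. Returns true only if this call released the memory,
 * which happens when it was the last live arena in its group. */
static bool sketch_arena_free(sketch_arena *a) {
  sketch_arena *r = sketch_find_root(a);
  assert(r->refcount > 0);
  if (--r->refcount > 0) return false;
  sketch_block *b = r->head;
  while (b) {
    sketch_block *next = b->next;
    free(b); /* Frees data blocks and the arena structs themselves. */
    b = next;
  }
  return true;
}
```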

## Reflection and Descriptors

upb offers a fully-featured reflection library. There are two main ways of
using reflection:

1. You can load descriptors from strings using `upb_symtab_addfile()`.
   The upb runtime will dynamically create mini-tables like what the upb compiler
   would have created if you had compiled this type into a `.upb.c` file.
2. You can load descriptors using generated `.upbdefs.h` interfaces.
   This will load reflection that references the corresponding `.upb.c`
   mini-tables instead of building a new mini-table on the fly. This lets
   you reflect on generated types that are linked into your program.

upb's design for descriptors is similar to protobuf C++ in many ways, with
the following correspondences:

| C++ Type | upb type |
| ---------| ---------|
| `google::protobuf::DescriptorPool` | `upb_symtab` |
| `google::protobuf::Descriptor` | `upb_msgdef` |
| `google::protobuf::FieldDescriptor` | `upb_fielddef` |
| `google::protobuf::OneofDescriptor` | `upb_oneofdef` |
| `google::protobuf::EnumDescriptor` | `upb_enumdef` |
| `google::protobuf::FileDescriptor` | `upb_filedef` |
| `google::protobuf::ServiceDescriptor` | `upb_servicedef` |
| `google::protobuf::MethodDescriptor` | `upb_methoddef` |

As in C++, descriptors (defs) are created by loading a
`google_protobuf_FileDescriptorProto` into a `upb_symtab`. This creates and
links all of the def objects corresponding to that `.proto` file, and inserts
the names into a symbol table so they can be looked up by name.

Once you have loaded some descriptors into a `upb_symtab`, you can create and
manipulate messages using the interfaces defined in `upb/reflection.h`. If your
descriptors are linked to your generated layouts using option (2) above, you can
safely access the same messages using both reflection and generated interfaces.