Author: @zhangskz
Approved: 2023-05-26
The “Lite” implementation for Java uses a custom format for embedding descriptors, motivated by critical code-size and performance requirements for Android.
The code generator for Java Lite encodes a descriptor-like info string, which is stored in RawMessageInfo. This is decoded into MessageSchema, which serves as the descriptor-like schema that Java Lite uses for parsing and serialization.
The current implementation makes significant use of an is_proto3 bit in the encoding, which is problematic for editions. Note that any parser changes to the format would also need to maintain backwards compatibility, because we guarantee that parsers remain backwards compatible within a major version.
Fortunately, we already have bits for most Editions Zero features in the corresponding MessageInfo field entry encoding.
We will move the remaining syntax usages that read is_proto3 over to these bits. Several other syntax usages need to be made editions-compatible by merging implementations.
As new editions features are added that must be represented in MessageInfo, we will eventually need to revamp the MessageInfo encoding to support these changes. However, this should be avoidable for Editions Zero.
RawMessageInfo should be augmented with an additional is_edition bit in the unused bits of flags.
[0]: flags, flags & 0x1 = is proto2?, flags & 0x2 = is message?, flags & 0x4 = is edition?
The decoded ProtoSyntax should add a corresponding EDITIONS option based on this bit.
public enum ProtoSyntax { PROTO2, PROTO3, EDITIONS }
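A minimal sketch of how a decoder might map the flags word to a syntax, assuming the bit layout listed above. The class and method names (FlagsSketch, syntaxFromFlags) are hypothetical, and checking the edition bit before the proto2 bit is an assumption rather than something this document specifies.

```java
// Hypothetical sketch; not the actual RawMessageInfo code.
final class FlagsSketch {
  enum ProtoSyntax { PROTO2, PROTO3, EDITIONS }

  static final int IS_PROTO2_BIT = 0x1;   // flags & 0x1 = is proto2?
  static final int IS_MESSAGE_BIT = 0x2;  // flags & 0x2 = is message? (not used below)
  static final int IS_EDITION_BIT = 0x4;  // flags & 0x4 = is edition? (the new bit)

  // Assumption: the edition bit takes precedence over the proto2 bit.
  static ProtoSyntax syntaxFromFlags(int flags) {
    if ((flags & IS_EDITION_BIT) != 0) {
      return ProtoSyntax.EDITIONS;
    }
    return (flags & IS_PROTO2_BIT) != 0 ? ProtoSyntax.PROTO2 : ProtoSyntax.PROTO3;
  }
}
```

Because the new bit occupies a previously unused position, flags words generated before this change leave it clear and still decode as PROTO2 or PROTO3 in this sketch.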
For now, there is no need to explicitly encode the raw editions string or feature options. These resolved features will be encoded directly in their corresponding field entries.
Field entries in RawMessageInfo already encode bits corresponding to most resolved Editions Zero features in GetExperimentalJavaFieldType. This is decoded in fieldTypeWithExtraBits by reading the corresponding bits.
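The decoding step can be sketched as below. The mask values are assumptions chosen for illustration (a low byte for the field type, then one bit per feature), as are the class and method names; treat this as a picture of how feature bits are read, not as the actual MessageSchema code.

```java
// Hypothetical sketch of splitting a field entry into its type and feature bits.
final class FieldEntrySketch {
  // All masks below are illustrative assumptions, not values taken from this document.
  static final int FIELD_TYPE_MASK = 0xFF;
  static final int REQUIRED_BIT = 0x100;
  static final int UTF8_CHECK_BIT = 0x200;
  static final int LEGACY_ENUM_IS_CLOSED_BIT = 0x800;
  static final int HAS_HAS_BIT = 0x1000;

  // The low byte carries the field type number.
  static int fieldType(int fieldTypeWithExtraBits) {
    return fieldTypeWithExtraBits & FIELD_TYPE_MASK;
  }

  // Each resolved feature is answered by one bit test, with no syntax check.
  static boolean isRequired(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & REQUIRED_BIT) != 0;
  }

  static boolean enforcesUtf8(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & UTF8_CHECK_BIT) != 0;
  }

  static boolean isLegacyClosedEnum(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & LEGACY_ENUM_IS_CLOSED_BIT) != 0;
  }

  static boolean hasHasBit(int fieldTypeWithExtraBits) {
    return (fieldTypeWithExtraBits & HAS_HAS_BIT) != 0;
  }
}
```

With these assumed masks, an entry of 9 | 0x200 | 0x1000 would decode as field type 9 with UTF-8 enforcement and a has-bit, and without the required or closed-enum bits.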
Several places already use these bits properly, but there are a few syntax usages in the decoding that should be replaced by checking the corresponding feature bit.
There are several unused bits that we could use for future field-level features before breaking the encoding format, but we should not need these for editions zero.
The results of the is_proto3
and feature bits only seem to be used within protobuf, and don't seem to be publicly exposed.
In the compiler, message fields with features.message_encoding = DELIMITED should be treated as groups before message info is encoded.
This means that GetExperimentalJavaFieldTypeForSingular should encode the field's type as GROUP (17) instead of its actual type MESSAGE (9), e.g.
int GetExperimentalJavaFieldTypeForSingular(const FieldDescriptor* field) {
  int result = field->type();
  if (result == FieldDescriptor::TYPE_MESSAGE) {
    if (field->isDelimited()) {
      return 17;  // GROUP
    }
  }
  // Mapping for all other cases is unchanged and omitted from this excerpt.
}
ImmutableMessageFieldLiteGenerator::GenerateFieldInfo calls this when generating the message field's field info.
The nested message's MessageInfo encoding does not need to be changed, as this is already identical for group and message.
Since each message field will be handled separately, this means that the post-editions proto file below
// foo.proto
edition = "tbd"
message Foo {
  message Bar {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar bar = 1 [features.message_encoding = DELIMITED];
  Bar baz = 2; // not DELIMITED
}
will be encoded and treated by MessageSchema like its pre-editions equivalent below.
message Foo {
  group Bar = 1 {
    int32 x = 1;
    repeated int32 y = 2;
  }
  Bar baz = 2; // not DELIMITED
}
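The type substitution, and its effect on the wire, can be sketched in Java as follows. The values 17 (GROUP) and 9 (MESSAGE) come from the text above, and the wire types (2 for length-delimited, 3 for start-group) come from the protobuf wire format. The class and its helpers are hypothetical stand-ins for the compiler's C++ logic and cover only message-typed fields.

```java
// Hypothetical Java rendering of the compiler-side decision; the real logic lives in C++.
final class DelimitedSketch {
  static final int MESSAGE = 9;  // encoded type for a length-prefixed message field
  static final int GROUP = 17;   // encoded type for a group (delimited) field

  static final int WIRETYPE_LENGTH_DELIMITED = 2;
  static final int WIRETYPE_START_GROUP = 3;

  // Type written into the field entry for a message-typed field.
  static int encodedFieldType(boolean isDelimited) {
    return isDelimited ? GROUP : MESSAGE;
  }

  // First tag emitted for the field: (field number << 3) | wire type.
  static int startTag(int fieldNumber, int encodedType) {
    int wireType = (encodedType == GROUP) ? WIRETYPE_START_GROUP : WIRETYPE_LENGTH_DELIMITED;
    return (fieldNumber << 3) | wireType;
  }
}
```

For the Foo example, bar (field 1, DELIMITED) would get encoded type 17 and start tag 11, while baz (field 2) would keep encoded type 9 and start tag 18.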
We recommend this approach because it minimizes changes to the encoding and to how groups are treated.
In a future breaking change, we could consider renaming FieldType.GROUP to FieldType.MESSAGE_DELIMITED for clarity, while preserving the same number and encoding. For now, we will leave the naming for this enum as-is.
Alternatively, we could encode features.message_encoding = DELIMITED as-is as type MESSAGE. The MessageInfo encoding would encode these as a normal message field, using an unused (0x1100) bit as kIsMessageEncodingDelimitedBit.
This could be used to indicate that the message should be parsed/serialized from the wire format as if it were a group. This would need to be passed along to MessageSchema, which would then treat messages with this bit set as groups, e.g. in case Message.
This is less ideal, since it would require handling this in multiple places.
There are several places that branch on syntax into separate proto2/proto3 codepaths. These generally duplicate a lot of code and should be unified into a single syntax-agnostic code path branching on the relevant feature bits.
This code tends to be pretty opaque, so we should document this with comments or add helpers (e.g. isEnforceUtf8) to indicate what feature bits are used as we make changes here.
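As an illustration of such a helper, the sketch below replaces a syntax check with a feature-bit check when reading a string field. Only the helper name isEnforceUtf8 comes from the text above; the mask value, the class, and the readString method are hypothetical, and the strict/lenient decoding shown here uses the standard Java charset API rather than protobuf's own UTF-8 code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: one code path that branches on a feature bit, not on syntax.
final class Utf8PathSketch {
  // Assumed mask for illustration; the real bit position may differ.
  static final int ENFORCE_UTF8_MASK = 0x200;

  // Named helper so call sites document which feature bit they depend on.
  static boolean isEnforceUtf8(int fieldBits) {
    return (fieldBits & ENFORCE_UTF8_MASK) != 0;
  }

  static String readString(byte[] bytes, int fieldBits) throws CharacterCodingException {
    if (isEnforceUtf8(fieldBits)) {
      // Strict path: malformed input raises CharacterCodingException.
      return StandardCharsets.UTF_8
          .newDecoder()
          .onMalformedInput(CodingErrorAction.REPORT)
          .onUnmappableCharacter(CodingErrorAction.REPORT)
          .decode(ByteBuffer.wrap(bytes))
          .toString();
    }
    // Lenient path: malformed sequences are replaced with U+FFFD.
    return new String(bytes, StandardCharsets.UTF_8);
  }
}
```

The point of the design is that a proto2, proto3, or editions field all reach the same method, and the behaviour differs only through the bit carried in the field entry.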
There is a lot of dead code in Java Lite so several syntax usages can also be deleted or merged where possible.
Add a new backwards-compatible MessageInfo encoding for editions.
The is_edition bit could toggle the encoding format being used, where is_edition == true indicates the new encoding format and is_edition == false indicates the old encoding.
This would allow us to encode additional information that the current encoding format has no available bits to support, such as the editions string or additional features.
For example, the current encoding format only has a fixed number of available field entry bits where we could encode new feature bits. We will need to introduce a new encoding format once we exceed these, or if we want to encode features at the message level.
In a future major version bump when support for proto2/3 is officially dropped, we could drop support for the previous encoding format.
The recommendation is to revisit alternative 1 along with alternative 2 post-Editions zero as we need to support additional feature bits.
We could switch Java Lite to use the MiniDescriptor encoding specification.
Like Java Lite, this encoding seems to be optimized to be lightweight and with minimal descriptor information.
MiniDescriptors do not currently encode proto2/proto3 syntax, which makes them mostly editions-compatible. MiniDescriptors encode FieldModifier/MessageModifier bits that correspond to some Editions Zero features, similarly to the Java Lite field feature bits, and can be augmented to support additional features.
Supposedly, this encoding format should support an arbitrary number of modifier bits, but this needs to be double-checked to verify there isn't a similar hard limit to the number of features.
It is unclear whether this is sufficiently optimized for Android's needs and how compatible this would be with Java Lite's Schemas.
The recommendation is to revisit alternative 2 along with alternative 1 post-Editions zero as we need to support additional feature bits.
Pros:
Unify implementations for lower long-term maintenance cost.
MiniDescriptor encoding will eventually need to be updated for editions anyway.
Cons:
Blocks Editions Zero on more complex encoding changes that aren't necessary.
Requires even more invasive updates to all MessageInfo decodings.
Probably requires major version bumps to break compatibility.
Unknown code size/schema compatibility constraints that would need to be explored.
There are a few possible changes to MiniDescriptors on the table that we should wait to settle before bringing on additional implementations.