Author: @mcy
Approved: 2022-11-28
The Protobuf language is surprisingly lax in what it allows in some places. These corners of the syntax space are rarely exercised in real use, yet they add complexity to backends and runtimes.
This document describes several such corners in the language, and how we might use Editions to fix them (spoiler: we'll add a feature for each one and then ratchet the features).
This is primarily a memo on a use-case for Editions, and not a design doc per se.
Protobuf does not enforce any constraints on names other than the “ASCII identifier” rule: they must match the regex [A-Za-z_][A-Za-z0-9_]*. This causes problems for backends, which must map these names onto identifiers in many target languages.
The Protobuf language should be as strict as possible in what patterns it accepts for identifiers, since these need to be transformed to many languages. Thus, we propose the following regexes for the three casings used in Protobuf:
([A-Z][a-zA-Z0-9]*)+ for PascalCase. We require this case for:
[a-z][a-z0-9]*(_[a-z0-9]+)* for snake_case. We require this case for:
[A-Z][A-Z0-9]*(_[A-Z0-9]+)* for SHOUTY_CASE. We require this case for:
These patterns are intended to reject extraneous underscores, and to make casing of ASCII letters consistent. We explicitly only support ASCII for maximal portability to target languages. Note that option names are not included, since those are defined as fields in a proto, and would be subject to this rule automatically.
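As a rough illustration of how a checker might apply the three patterns above, the following Python sketch anchors each regex to the whole identifier. The names CASE_PATTERNS and matches_case are hypothetical and are not part of protoc.

```python
import re

# The three patterns proposed above, keyed by casing name.
# (Hypothetical table; the names are illustrative only.)
CASE_PATTERNS = {
    "PascalCase": re.compile(r"([A-Z][a-zA-Z0-9]*)+"),
    "snake_case": re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*"),
    "SHOUTY_CASE": re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*"),
}


def matches_case(name: str, case: str) -> bool:
    """Return True if the whole identifier matches the pattern for `case`."""
    return CASE_PATTERNS[case].fullmatch(name) is not None
```

Under these patterns, foo__bar, _foo, and foo_ all fail the snake_case check because each underscore must be followed by at least one letter or digit. The PascalCase pattern nests one repetition inside another; it accepts the same strings as [A-Z][a-zA-Z0-9]*, which a real implementation might prefer in order to avoid pathological backtracking on long inputs.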
To migrate, we would introduce a bool feature feature.relax_identifier_rules, which can be applied to any entity. When it is false, the compiler would reject .proto files which contain identifiers that don't match the above constraints. It would default to true and would switch to false in a future edition.
Currently, the Protobuf language allows using keywords as identifiers. This makes the parser somewhat more complicated than it has to be for minimal benefit, and shadowing behavior is not well-specified. For example, what does the following compile as?
message Foo {
  message int32 {}
  optional int32 foo = 1;
}
This is particularly fraught in places where either a keyword or a type name can follow. For example, optional foo = 1; is a proto3 non-optional field with type optional, but the parser can't tell until it sees the = sign.
To avoid this and eventually stop supporting this in the parser, we make the following set of keywords true reserved names that cannot be used as identifiers:
bool bytes double edition enum extend extensions fixed32 fixed64 float group import int32 int64 map max message oneof option optional package public repeated required reserved returns rpc service sfixed32 sfixed64 sint32 sint64 stream string syntax to uint32 uint64 weak
Additionally, we introduce the syntax #optional for escaping a keyword as an identifier. This may only be used on keywords, and not non-keyword identifiers.
To migrate, we would introduce a bool feature feature.keywords_as_identifiers, which can be applied to any entity. When it is false, the compiler would reject .proto files which contain identifiers that use the names of keywords. It would migrate true->false in a future edition. The #optional syntax would not need to be feature-gated.
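The reservation and escaping rules above could be checked roughly as follows. This is a minimal sketch, assuming the check runs on one identifier token at a time; check_identifier is a hypothetical helper, and the keyword set is copied from the list in this document.

```python
# Reserved keywords, as listed in this document.
KEYWORDS = frozenset("""
bool bytes double edition enum extend extensions fixed32 fixed64 float group
import int32 int64 map max message oneof option optional package public
repeated required reserved returns rpc service sfixed32 sfixed64 sint32 sint64
stream string syntax to uint32 uint64 weak
""".split())


def check_identifier(token: str) -> str:
    """Return the identifier named by `token`, or raise ValueError.

    A leading '#' escapes a keyword; it is rejected on non-keywords.
    """
    if token.startswith("#"):
        name = token[1:]
        if name not in KEYWORDS:
            raise ValueError(f"'#' may only escape a keyword, got {token!r}")
        return name
    if token in KEYWORDS:
        raise ValueError(f"{token!r} is a keyword; write '#{token}' to use it as a name")
    return token
```

With this sketch, #optional yields the identifier optional, while a bare optional or an escaped non-keyword such as #foo is rejected.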
From time to time we may introduce new keywords. The best procedure for doing so is to add a feature.xxx_is_a_keyword feature, start it out as false, and then switch it to true in an edition, which would cause it to be treated as a keyword for the purposes of this check. There's nothing stopping us from starting to use it in the syntax without an edition if it would be relatively unambiguous (i.e., a “contextual” keyword). Rust provides guidance here: they really hate contextual keywords since they complicate the parser, so keywords start out as contextual and become properly reserved in the next Rust edition.
Right now, an empty package is technically permitted. We should remove this functionality from the language completely and require every file to declare a package.
We would introduce a feature like feature.allow_missing_package, start it out as true, and switch it to false.
Currently, reserved "foo-bar"; is accepted. It is not a valid name for a field and thus should be rejected. Ideally we should remove this syntax altogether and only permit the use of identifiers in this position, such as reserved foo, bar;.
We would introduce a feature like feature.allow_strings_in_reserved, start it out as true, and then switch it to false.
Right now, Protobuf defines a complicated name resolution scheme that involves matching subsets of names inspired by that of C++ (which is even more complicated than ours!). Instead, we should require that every name be either a single identifier OR fully-qualified. This is an attempt to move to Go-style name resolution, which is significantly simpler to implement and explain.
In particular, if a name is a single identifier, then:
Because any multi-component path must be fully qualified, we no longer need the .foo.Bar syntax, except to refer to messages defined in files without a package. We forbid .-prefixed names except in that case.
We would introduce a feature like features.use_cpp_style_name_resolution, start it out as true, and then switch it to false.
Ideally, if we get strict identifier names, we can tell that Foo.Bar is rooted at a message, rather than a package. In that case, we could go as far as saying that “names that start with a lower-case letter are fully-qualified, otherwise they are relative to the current package, and will only find things defined in the current file.”
Unlike Go, we do not allow finding things in other files of the same package without being fully-qualified; this mostly comes from doing source-diving in very large packages, like the Go runtime, where it is very hard to find where something is defined.
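The stricter variant quoted above (lower-case first letter means fully-qualified, otherwise relative to the current package and limited to the current file) might look like the sketch below. It assumes strict identifier casing is already enforced and that names are non-empty; resolve and its parameters are hypothetical names used only for illustration.

```python
from typing import Optional, Set


def resolve(name: str, current_package: str,
            file_symbols: Set[str], all_symbols: Set[str]) -> Optional[str]:
    """Resolve `name` to a fully-qualified symbol, or return None.

    file_symbols: fully-qualified names defined in the current file.
    all_symbols:  fully-qualified names visible at all (current file plus imports).
    """
    if name[0].islower():
        # Starts with a snake_case package component: treat as fully qualified.
        return name if name in all_symbols else None
    # Starts with a PascalCase type name: relative to the current package,
    # and only found if it is defined in the current file.
    candidate = f"{current_package}.{name}"
    return candidate if candidate in file_symbols else None
```

For example, with the current package my.pkg, Foo.Bar resolves to my.pkg.Foo.Bar only if the current file defines it; a type Qux defined in a different file of my.pkg has to be written as my.pkg.Qux.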
Right now, we allow aliases in enums:
enum Foo { BAR = 5; BAZ = 5; }
This results in significant complexity in some parts of the backend, and weird behavior in textproto and JSON. We should disallow this.
We would introduce a feature like features.allow_enum_aliases, which would switch from true to false.
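Detecting aliases amounts to finding enum numbers that carry more than one name. A minimal sketch, assuming the enum is available as a mapping from value name to number (find_enum_aliases is a hypothetical helper):

```python
from collections import defaultdict
from typing import Dict, List


def find_enum_aliases(values: Dict[str, int]) -> Dict[int, List[str]]:
    """Map each number used by more than one enum value to those value names."""
    names_by_number = defaultdict(list)
    for name, number in values.items():
        names_by_number[number].append(name)
    return {n: names for n, names in names_by_number.items() if len(names) > 1}
```

For the enum Foo shown above, the sketch reports that 5 is shared by BAR and BAZ, which is the ambiguity a number-to-name conversion has to deal with.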
We should adopt the Go rule that all non-public imports must be used (i.e., every import provides at least one type referred to in the file).
We would introduce a feature like features.allow_unused_imports, which would switch from true to false.
There are a few idioms for tracking the next unused field number that are checked by linters, such as // Next ID: N. We should codify this in the language by requiring that every message begin with reserved N to max;, with the intent that N is the next never-used field number. Because it is required to be the first production in the message, it can be
We could also require that every field number be either used or reserved, on top of the single N to max; reservation. Alternatively, we could require that every field number up to the largest one used be reserved; gaps between field numbers are usually a smell.
This applies equally to message fields and enum values.
We would introduce a feature like features.allow_unused_numbers, which we would switch from true to false.
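The stricter option, where every number below N is either used or reserved, could be checked along these lines. This is a sketch under the assumption that used numbers, explicitly reserved numbers, and N are already extracted from the message or enum; check_numbering and the first parameter are hypothetical.

```python
from typing import Iterable, List, Tuple


def check_numbering(used: Iterable[int], reserved: Iterable[int],
                    next_id: int, first: int = 1) -> Tuple[List[int], List[int]]:
    """Return (unaccounted, out_of_range) for a message or enum.

    unaccounted:  numbers in [first, next_id) that are neither used nor reserved.
    out_of_range: used numbers that collide with `reserved next_id to max;`.
    """
    used, reserved = set(used), set(reserved)
    unaccounted = [n for n in range(first, next_id)
                   if n not in used and n not in reserved]
    out_of_range = sorted(n for n in used if n >= next_id)
    return unaccounted, out_of_range
```

A message that uses 1, 2, and 5, reserves 3, and begins with reserved 6 to max; would be flagged for leaving 4 unaccounted. For enums, first would presumably be 0 rather than 1.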
Protobuf will implicitly concatenate two adjacent strings in any place it allows quoted strings, e.g. option foo = "bar " "baz";. This has caused interesting problems around reserved in the past, if a comma is omitted: reserved "foo" "bar"; is reserved "foobar";.
We would introduce a feature like features.concatenate_adjacent_strings, which would switch from true to false.
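To make the missing-comma hazard concrete, here is a small sketch of a parser for the string list inside a reserved statement, with a flag standing in for the proposed feature. The tokenizer is deliberately simplified (no escape sequences), and parse_reserved_names is a hypothetical function, not protoc's parser.

```python
import re
from typing import List

# One token: a double-quoted string without escapes, or a comma.
TOKEN = re.compile(r'\s*("[^"]*"|,)')


def parse_reserved_names(body: str, concatenate_adjacent: bool) -> List[str]:
    """Parse the comma-separated string list of a `reserved` statement body."""
    names: List[str] = []
    pos, expect_string = 0, True
    body = body.strip()
    while pos < len(body):
        m = TOKEN.match(body, pos)
        if m is None:
            raise ValueError(f"unexpected input at offset {pos}")
        tok, pos = m.group(1), m.end()
        if tok == ",":
            if expect_string:
                raise ValueError("expected a string before ','")
            expect_string = True
        elif expect_string:
            names.append(tok[1:-1])
            expect_string = False
        elif concatenate_adjacent:
            names[-1] += tok[1:-1]  # current behavior: silently merge
        else:
            raise ValueError("adjacent strings; missing ','?")
    if expect_string:
        raise ValueError("expected a string")
    return names
```

With concatenation enabled, the body "foo" "bar" silently yields the single name foobar; with it disabled, the same input is reported as an error instead.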
The package declaration can appear anywhere in the file after syntax or edition. We should take cues from Go and require it to be the first thing in the file, after the edition.
We would introduce a feature like features.package_anywhere, which would switch from true to false.
Boolean options can use true, false, True, False, T, or F as a value: option my_bool = T;. We should restrict to only true and false.
We would introduce a feature like features.loose_bool_options, which would switch from true to false.
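The restriction is easy to express as a lookup table. The sketch below uses the spellings listed in this document as the loose set; parse_bool_option and its loose flag are hypothetical stand-ins for the proposed feature.

```python
# Spellings accepted in strict mode, and the loose set as listed in this document.
STRICT = {"true": True, "false": False}
LOOSE = {**STRICT, "True": True, "False": False, "T": True, "F": False}


def parse_bool_option(text: str, loose: bool) -> bool:
    """Parse a boolean option value, raising ValueError on other spellings."""
    table = LOOSE if loose else STRICT
    if text not in table:
        raise ValueError(f"invalid boolean literal: {text!r}")
    return table[text]
```

In strict mode, option my_bool = T; would be rejected, while true and false keep working in both modes.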
We permit non-decimal integer literals for field numbers, e.g. optional int32 x = 0x01;. Thankfully(?) we do not already permit a leading + or -. We should require decimal literals, since there is very little reason to allow other literals and doing so makes the Protobuf language harder to parse.
We would introduce a feature like features.non_decimal_field_numbers, which would switch from true to false.
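A sketch of the literal check follows, assuming the usual C-style spellings for hexadecimal (0x prefix) and octal (leading 0) literals. parse_field_number and allow_non_decimal are hypothetical names, and range validation of the resulting number is left out.

```python
import re

DECIMAL = re.compile(r"[1-9][0-9]*")
HEX = re.compile(r"0[xX][0-9A-Fa-f]+")
OCTAL = re.compile(r"0[0-7]*")


def parse_field_number(text: str, allow_non_decimal: bool) -> int:
    """Parse a field-number literal; no leading sign is accepted in either mode."""
    if DECIMAL.fullmatch(text):
        return int(text, 10)
    if allow_non_decimal and HEX.fullmatch(text):
        return int(text, 16)
    if allow_non_decimal and OCTAL.fullmatch(text):
        return int(text, 8)
    raise ValueError(f"invalid field number literal: {text!r}")
```

In strict mode only a plain decimal such as 1 is accepted; 0x01 and 017 are rejected even though they denote valid numbers today.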