# Stricter Schemas with Editions
**Author:** [@mcy](https://github.com/mcy)
**Approved:** 2022-11-28
## Overview
The Protobuf language is surprisingly lax in what it allows in some places.
These corners of the syntax space are rarely exercised in real use, yet they add
complexity to backends and runtimes.
This document describes several such corners in the language, and how we might
use Editions to fix them (spoiler: we'll add a feature for each one and then
ratchet the features).
This is primarily a memo on a use-case for Editions, and not a design doc per
se.
## Potential Lints
### Entity Names
Protobuf does not enforce any constraints on names other than the "ASCII
identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This
results in problems for backends:
* Backends need to be able to convert between PascalCase, camelCase,
snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky.
* Extraneous underscores, such as underscores in names that want to be
PascalCase, trailing underscores, leading underscores, and repeated
underscores create problems for case conversion and can clash with private
names generated by backends.
* Protobuf does not support non-ASCII identifiers, mostly out of inertia.
Because some languages (Java most prominent among them) do not support them,
we can never support them, but we are not particularly clear on this point.
The Protobuf language should be as strict as possible in what patterns it
accepts for identifiers, since these need to be transformed to many languages.
Thus, we propose the following regexes for the three casings used in Protobuf:
* `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for:
    * Messages.
    * Enums.
    * Services.
    * Methods.
* `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for:
    * Fields (including extensions).
    * Package components.
* `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for:
    * Enum values.
These patterns are intended to reject extraneous underscores, and to make casing
of ASCII letters consistent. We explicitly only support ASCII for maximal
portability to target languages. Note that option names are not included, since
those are defined as fields in a proto, and would be subject to this rule
automatically.
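As a rough illustration of how these patterns would sort names, the sketch below uses invented package, message, and field names (none come from a real schema) and assumes each regex must match the whole identifier:

```
package acme.billing.v2;  // each package component is snake_case

message InvoiceLine {  // PascalCase
  optional int64 unit_price_cents = 1;  // snake_case
}

enum PaymentState {  // PascalCase
  PAYMENT_STATE_UNKNOWN = 0;  // SHOUTY_CASE
  PAYMENT_STATE_SETTLED = 1;
}

message GetInvoiceRequest {}
message Invoice {}

service InvoiceService {  // PascalCase
  rpc GetInvoice(GetInvoiceRequest) returns (Invoice);  // PascalCase method
}

// Names the patterns would reject:
//   Invoice_Line     underscore in a PascalCase name
//   invoiceLine      PascalCase name starting with a lower-case letter
//   _unit_price      leading underscore
//   unit_price_      trailing underscore
//   unit__price      repeated underscore
//   unitPrice        upper-case letter in a snake_case name
//   Payment_settled  lower-case letters in an enum value
```

Note that the PascalCase pattern still accepts runs of capitals such as `HTTPServer`, so it does not by itself settle how acronyms convert to other casings.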
To migrate, we would introduce a bool feature `feature.relax_identifier_rules`,
which can be applied to any entity. When it is false, the compiler would reject
`.proto` files which contain identifiers that don't match the above
constraints. It would default to true and would switch to false in a future
edition.
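A sketch of what opting out might look like once the default has flipped, assuming the feature is written like an ordinary option on the entity it covers. The `features.` option spelling and all entity names here are assumptions for illustration; this memo does not fix a concrete syntax.

```
// Hypothetical: a file on an edition where the strict rules are the default.

message legacy_record {  // not PascalCase, so the message opts out
  option features.relax_identifier_rules = true;
}

message UserProfile {
  // The feature can be applied to any entity, so one field can opt out alone.
  optional string userName = 1 [features.relax_identifier_rules = true];
  optional string display_name = 2;  // already conforms, no opt-out needed
}
```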
### Keywords as Identifiers
Currently, the Protobuf language allows using keywords as identifiers. This
makes the parser somewhat more complicated than it has to be for minimal
benefit, and shadowing behavior is not well-specified. For example, what does
the following compile as?
```
message Foo {
  message int32 {}
  optional int32 foo = 1;
}
```
This is particularly fraught in places where either a keyword or a type name can
follow. For example, `optional foo = 1;` is a proto3 non-optional field named
`foo` with type `optional`, but the parser can't tell until it sees the `=` sign.
To avoid this and eventually stop supporting this in the parser, we make the
following set of keywords true reserved names that cannot be used as
identifiers:
```
bool bytes double edition enum extend extensions fixed32
fixed64 float group import int32 int64 map max
message oneof option optional package public repeated required
reserved returns rpc service sfixed32 sfixed64 sint32 sint64
stream string syntax to uint32 uint64 weak
```
Additionally, we introduce the syntax `#optional` for escaping a keyword as an
identifier. This may *only* be used on keywords, and not non-keyword
identifiers.
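For illustration, escaped keywords might look like the following. The `#` escape is the syntax proposed above, so current compilers would not accept this; the message and field names are invented.

```
message Settings {
  // `optional` and `package` are reserved names, so they must be escaped.
  optional bool #optional = 1;
  optional string #package = 2;

  // Not allowed: `timeout` is not a keyword, so it may not be escaped.
  // optional int32 #timeout = 3;
}
```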
To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`,
which can be applied to any entity. When it is false, the compiler would reject
`.proto` files which contain identifiers that use the names of keywords. It
would migrate true->false in a future edition. The `#optional` syntax would not
need to be feature-gated.
From time to time we may introduce new keywords. The best procedure for doing so
is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then
switch it to false in an edition, which would cause it to be treated as a
keyword for the purposes of this check. There's nothing stopping us from
starting to use it in the syntax without an edition if it would be relatively
unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: its
maintainers dislike contextual keywords because they complicate the parser, so
keywords start out as contextual and become properly reserved in the next Rust
edition.
### Nonempty Package
Right now, an empty package is technically permitted. We should remove this
functionality from the language completely and require every file to declare a
package.
We would introduce a feature like `feature.allow_missing_package`, start it out
as true, and switch it to false.
### Invalid Names in `reserved`
Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field
and thus should be rejected. Ideally we should remove this syntax altogether and
only permit the use of identifiers in this position, such as `reserved foo,
bar;`.
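A small sketch of the two forms side by side, using invented names. The identifier form is shown commented out because it is the proposed syntax, not something this memo says is accepted today.

```
message Account {
  reserved 2, 5 to 7;  // number reservations are unaffected

  // Today: names are quoted strings, and the contents are not checked,
  // which is how something like "foo-bar" gets through.
  reserved "legacy_id", "old_owner";

  // Proposed: identifiers only, so an invalid name cannot be written at all.
  // reserved legacy_id, old_owner;
}
```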
We would introduce a feature like `feature.allow_strings_in_reserved`, start it
out as true, and then switch it to false.
### Almost All Names are Fully Qualified
Right now, Protobuf defines a complicated name resolution scheme, inspired by
that of C++ (which is even more complicated than ours!), that involves matching
subsets of names. Instead, we should require that every name be either a
single identifier OR fully-qualified. This is an attempt to move to Go-style
name resolution, which is significantly simpler to implement and explain.
In particular, if a name is a single identifier, then:
* It must be the name of a type defined at the top level of the current file.
* If it is the name of a message or enum for a field's type, it may be the
name of a type defined in the current message. This does *not* apply to
extension fields.
Because any multi-component path must be fully qualified, we no longer need the
`.foo.Bar` syntax, except to refer to messages defined in files without a
package. We forbid `.`-prefixed names except in that case.
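The rules above can be sketched with a small file. The file name, package, and types are invented, and `geo/units.proto` is assumed to define `geo.units.Length`.

```
// Hypothetical file geo/shapes.proto.
package geo.shapes;

import "geo/units.proto";

message Point {}

message Polygon {
  message Ring {
    repeated Point points = 1;  // single identifier: top-level type in this file
  }

  repeated Ring rings = 1;                  // single identifier: type defined in the current message
  optional geo.units.Length perimeter = 2;  // multi-component path: fully qualified
  optional geo.shapes.Point centroid = 3;   // fully qualified also works for local types

  // These resolve today but would be rejected, since they are only
  // partially qualified:
  //   optional units.Length perimeter = 2;
  //   optional shapes.Point centroid = 3;
  //   optional Polygon.Ring outer = 4;
}
```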
We would introduce a feature like `features.use_cpp_style_name_resolution`,
start it out as true, and then switch it to false.
Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted
at a message, rather than a package. In that case, we could go as far as saying
that "names that start with a lower-case letter are fully-qualified, otherwise
they are relative to the current package, and will only find things defined in
the current file."
Unlike Go, we do not allow finding things in other packages without being
fully-qualified; this mostly comes from doing source-diving in very large
packages, like the Go runtime, where it is very hard to find where something is
defined.
### Unique Enum Values
Right now, we allow aliases in enums:
```
enum Foo {
  option allow_alias = true;  // protoc requires this today before it accepts aliases
  BAR = 5;
  BAZ = 5;
}
```
This results in significant complexity in some parts of the backend, and weird
behavior in textproto and JSON. We should disallow this.
We would introduce a feature like `features.allow_enum_aliases`, which would
switch from true to false.
### Imports are Used
We should adopt the Go rule that all non-public imports are used (i.e., every
import provides at least one type referred to in the file).
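As a sketch of how the rule would apply (the `acme/moved_types.proto` path is invented; the two `google/protobuf` files are the well-known types):

```
import "google/protobuf/timestamp.proto";  // used by Event.created_at
import "google/protobuf/duration.proto";   // unused: would be rejected
import public "acme/moved_types.proto";    // public import: exempt from the rule

message Event {
  optional google.protobuf.Timestamp created_at = 1;
}
```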
We would introduce a feature like `features.allow_unused_imports`, which would
switch from true to false.
### Next Field # is Reserved
There are a few idioms for this checked by linters, such as `// Next ID: N`. We
should codify this in the language by requiring that every message begin with
`reserved N to max;`, with the intent that `N` is the next never-used field
number. Because it is required to be the first production in the message, it can
be
We could, additionally, require that *every* field number be either used or
reserved, in addition to having a single `N to max;` reservation. Alternatively,
we could require that every field number up to the largest one used be reserved;
gaps between field numbers are usually a smell.
This applies equally to message fields and enum values.
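A minimal sketch of the convention with invented names. The `reserved ... to max;` statements are ordinary syntax today; only the requirement that they appear first would be new.

```
message User {
  reserved 5 to max;  // 5 is the next never-used field number

  optional string name = 1;
  optional string email = 2;
  reserved 3;  // retired field: reserved rather than left as a gap
  optional int64 created_at_ms = 4;
}

enum Role {
  reserved 3 to max;  // same idea for enum values

  ROLE_UNKNOWN = 0;
  ROLE_ADMIN = 1;
  ROLE_MEMBER = 2;
}
```

Adding a field to `User` would then mean claiming number 5 and changing the first line to `reserved 6 to max;`, so the "next ID" bookkeeping is checked by the compiler instead of living in a comment.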
We would introduce a feature like `features.allow_unused_numbers`, which we
would switch from true to false.
### Disallow Implicit String Concatenation
Protobuf will implicitly concatenate two adjacent strings in any place it allows
quoted strings, e.g. `option foo = "bar " "baz";`. This has caused interesting
problems around `reserved` in the past, if a comma is omitted: `reserved "foo"
"bar";` is parsed as `reserved "foobar";`.
We would introduce a feature like `features.concatenate_adjacent_strings`, which
would switch from true to false.
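To make the concatenation pitfall concrete, here is a short sketch with invented message names and an invented Java package:

```
// Adjacent strings are joined: this is the same as "com.example.billing".
option java_package = "com.example." "billing";

message Legacy {
  reserved "foo" "bar";  // missing comma: reserves the single name "foobar"
}

message Intended {
  reserved "foo", "bar";  // reserves the two names "foo" and "bar"
}
```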
### Package Is First
The `package` declaration can appear anywhere in the file after `syntax` or
`edition`. We should take cues from Go and require it to be the first thing in
the file, after the edition.
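For example, with an invented package and import path, the ordering change might look like this:

```
// Accepted today, rejected under the proposal: `package` comes late.
//   syntax = "proto2";
//   import "acme/common.proto";
//   option java_package = "com.acme.billing";
//   package acme.billing;

// Order required under the proposal:
syntax = "proto2";
package acme.billing;

import "acme/common.proto";
option java_package = "com.acme.billing";
```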
We would introduce a feature like `features.package_anywhere`, which would
switch from true to false.
### Strict Boolean Options
Boolean options can use true, false, True, False, T, or F as a value: `option
my_bool = T;`. We should restrict to only `true` and `false`.
We would introduce a feature like `features.loose_bool_options`, which would
switch from true to false.
### Decimal Field Numbers
We permit non-decimal integer literals for field numbers, e.g. `optional int32
x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should
require decimal literals, since there is very little reason to allow other
literals and allowing them makes the Protobuf language harder to parse.
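A short sketch of why non-decimal literals are easy to misread. The message is invented, and the octal case assumes the usual C-style reading of a leading `0`.

```
message Sample {
  optional int32 a = 0x10;  // hexadecimal: this is field number 16
  optional int32 b = 010;   // octal: this is field number 8, not 10
  optional int32 c = 3;     // decimal: the only form the proposal would keep
}
```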
We would introduce a feature like `features.non_decimal_field_numbers`, which
would switch from true to false.