| # Stricter Schemas with Editions |
| |
| **Author:** [@mcy](https://github.com/mcy) |
| |
| **Approved:** 2022-11-28 |
| |
| ## Overview |
| |
| The Protobuf language is surprisingly lax in what it allows in some places, even |
| though these corners of the syntax space are rarely exercised in real use, and |
| which add complexity to backends and runtimes. |
| |
| This document describes several such corners in the language, and how we might |
| use Editions to fix them (spoiler: we'll add a feature for each one and then |
| ratchet the features). |
| |
| This is primarily a memo on a use-case for Editions, and not a design doc per |
| se. |
| |
| ## Potential Lints |
| |
| ### Entity Names |
| |
| Protobuf does not enforce any constraints on names other than the "ASCII |
| identifier" rule: they must match the regex `[A-Za-z_][A-Za-z0-9_]*`. This |
| results in problems for backends: |
| |
| * Backends need to be able to convert between PascalCase, camelCase, |
| snake_case, and SHOUTY_CASE. Doing so correctly is surprisingly tricky. |
| * Extraneous underscores, such as underscores in names that want to be |
| PascalCase, trailing underscores, leading underscores, and repeated |
| underscores create problems for case conversion and can clash with private |
| names generated by backends. |
| * Protobuf does not support non-ASCII identifiers, mostly out of inertia more |
| than anything else. Because some languages (Java most prominent among them) |
| do not support them, we can never support them, but we are not particularly |
| clear on this point. |
| |
| The Protobuf language should be as strict as possible in what patterns it |
| accepts for identifiers, since these need to be transformed to many languages. |
| Thus, we propose the following regexes for the three casings used in Protobuf: |
| |
| * `([A-Z][a-zA-Z0-9]*)+` for PascalCase. We require this case for: |
| * Messages. |
| * Enums. |
| * Services. |
| * Methods. |
| * `[a-z][a-z0-9]*(_[a-z0-9]+)*` for snake_case. We require this case for: |
| * Fields (including extensions). |
| * Package components. |
| * `[A-Z][A-Z0-9]*(_[A-Z0-9]+)*` for SHOUTY_CASE. We require this case for: |
| * Enum values. |
| |
| These patterns are intended to reject extraneous underscores, and to make casing |
| of ASCII letters consistent. We explicitly only support ASCII for maximal |
| portability to target languages. Note that option names are not included, since |
| those are defined as fields in a proto, and would be subject to this rule |
| automatically. |
| |
| To migrate, we would introduce a bool feature `feature.relax_identifier_rules`, |
| which can be applied to any entity. When set, it would cause the compiler to |
| reject `.proto` files which contain identifiers that don't match the above |
| constraints. It would default to true and would switch to false in a future |
| edition. |
| |
| ### Keywords as Identifiers |
| |
| Currently, the Protobuf language allows using keywords as identifiers. This |
| makes the parser somewhat more complicated than it has to be for minimal |
| benefit, and shadowing behavior is not well-specified. For example, what does |
| the following compile as? |
| |
| ``` |
| message Foo { |
| message int32 {} |
| optional int32 foo = 1; |
| } |
| ``` |
| |
| This is particularly fraught in places where either a keyword or a type name can |
| follow. For example, `optional foo = 1;` is a proto3 non-optional with type |
| `optional`, but the parser can't tell until it sees the `=` sign. |
| |
| To avoid this and eventually stop supporting this in the parser, we make the |
| following set of keywords true reserved names that cannot be used as |
| identifiers: |
| |
| ``` |
| bool bytes double edition enum extend extensions fixed32 |
| fixed64 float group import int32 int64 map max |
| message oneof option optional package public repeated required |
| reserved returns rpc service sfixed32 sfixed64 sint32 sint64 |
| stream string syntax to uint32 uint64 weak |
| ``` |
| |
| Additionally, we introduce the syntax `#optional` for escaping a keyword as an |
| identifier. This may *only* be used on keywords, and not non-keyword |
| identifiers. |
| |
| To migrate, we would introduce a bool feature `feature.keywords_as_identifiers`, |
| which can be applied to any entity. When set, it would cause the compiler to |
| reject `.proto` files which contain identifiers that use the names of keywords. |
| It would migrate true->false in a future edition. The `#optional` syntax would |
| not need to be feature-gated. |
| |
| From time to time we may introduce new keywords. The best procedure for doing so |
| is to add a `feature.xxx_is_a_keyword` feature, start it out as true, and then |
| switch it to false in an edition, which would cause it to be treated as a |
| keyword for the purposes of this check. There's nothing stopping us from |
| starting to use it in the syntax without an edition if it would be relatively |
| unambiguous (i.e., a "contextual" keyword). Rust provides guidance here: they |
| really hate contextual keywords since it complicates the parser, so keywords |
| start out as contextual and become properly reserved in the next Rust edition. |
| |
| ### Nonempty Package |
| |
| Right now, an empty package is technically permitted. We should remove this |
| functionality from the language completely and require every file to declare a |
| package. |
| |
| We would introduce a feature like `feature.allow_missing_package`, start it out |
| as true, and switch it to false. |
| |
| ### Invalid Names in `reserved` |
| |
| Currently, `reserved "foo-bar";` is accepted. It is not a valid name for a field |
| and thus should be rejected. Ideally we should remove this syntax altogether and |
| only permit the use of identifiers in this position, such as `reserved foo, |
| bar;`. |
| |
| We would introduce a feature like `feature.allow_strings_in_reserved`, start it |
| out as true, and then switch it to false. |
| |
| ### Almost All Names are Fully Qualified |
| |
| Right now, Protobuf defines a complicated name resolution scheme that involves |
| matching subsets of names inspired by that of C++ (which is even more |
| complicated than ours!). Instead, we should require that every name be either a |
| single identifier OR fully-qualified. This is an attempt to move to Go-style |
| name resolution, which is significantly simpler to implement and explain. |
| |
| In particular, if a name is a single identifier, then: |
| |
| * It must be the name of a type defined at the top level of the current file. |
| * If it is the name of a message or enum for a field's type, it may be the |
| name of a type defined in the current message. This does *not* apply to |
| extension fields. |
| |
| Because any multi-component path must be fully qualified, we no longer need the |
| `.foo.Bar` syntax anymore, except to refer to messages defined in files without |
| a package. We forbid `.`-prefixed names except in that case. |
| |
| We would introduce a feature like `features.use_cpp_style_name_resolution`, |
| start it out as true, and then switch it to false. |
| |
| Ideally, if we get strict identifier names, we can tell that `Foo.Bar` is rooted |
| at a message, rather than a package. In that case, we could go as far as saying |
| that "names that start with a lower-case letter are fully-qualified, otherwise |
| they are relative to the current package, and will only find things defined in |
| the current file." |
| |
| Unlike Go, we do not allow finding things in other packages without being |
| fully-qualified; this mostly comes from doing source-diving in very large |
| packages, like the Go runtime, where it is very hard to find where something is |
| defined. |
| |
| ### Unique Enum Values |
| |
| Right now, we allow aliases in enums: |
| |
| ``` |
| enum Foo { |
| BAR = 5; |
| BAZ = 5; |
| } |
| ``` |
| |
| This results in significant complexity in some parts of the backend, and weird |
| behavior in textproto and JSON. We should disallow this. |
| |
| We would introduce a feature like `features.allow_enum_aliases`, which would |
| switch from true to false. |
| |
| ### Imports are Used |
| |
| We should adopt the Go rule that all non-public imports are used (i.e, every |
| import provides at least one type referred to in the file). |
| |
| We would introduce a feature like `features.allow_unused_imports`, which would |
| switch from true to false. |
| |
| ### Next Field # is Reserved |
| |
| There's a few idioms for this checked by linters, such as `// Next ID: N`. We |
| should codify this in the language by rewriting that every message begin with |
| `reserved N to max;`, with the intent that `N` is the next never-used field |
| number. Because it is required to be the first production in the message, it can |
| be |
| |
| We could, additionally, require that *every* field number be either used or |
| reserved, in addition to having a single `N to max;` reservation. Alternatively, |
| we could require that every field number up to the largest one used be reserved; |
| gaps between message numbers are usually a smell. |
| |
| This applies equally to message fields and enum values. |
| |
| We would introduce a feature like `features.allow_unused_numbers`, which we |
| would switch from true to false. |
| |
| ### Disallow Implicit String Concatenation |
| |
| Protobuf will implicitly concatenate two adjacent strings in any place it allows |
| quoted strings, e.g. `option foo = "bar " "baz;`. This has caused interesting |
| problems around `reserved` in the past, if a comma is omitted: `reserved "foo" |
| "bar";` is `reserved "foobar";`. |
| |
| We would introduce a feature like `features.concatenate_adjacent_strings`, which |
| would switch from true to false. |
| |
| ### Package Is First |
| |
| The `package` declaration can appear anywhere in the file after `syntax` or |
| `edition`. We should take cues from Go and require it to be the first thing in |
| the file, after the edition. |
| |
| We would introduce a feature like `features.package_anywhere`, which would |
| switch from true to false. |
| |
| ### Strict Boolean Options |
| |
| Boolean options can use true, false, True, False, T, or F as a value: `option |
| my_bool = T;`. We should restrict to only `true` and `false`. |
| |
| We would introduce a feature like `features.loose_bool_options`, which would |
| switch from true to false. |
| |
| ### Decimal Field Numbers |
| |
| We permit non-decimal integer literals for field numbers, e.g. `optional int32 |
| x = 0x01;`. Thankfully(?) we do not already permit a leading + or -. We should |
| require decimal literals, since there is very little reason to allow other |
| literals and makes the Protobuf language harder to parse. |
| |
| We would introduce a feature like `features.non_decimal_field_numbers`, which |
| would switch from true to false. |