Author: @mkruskal-google, @zhangskz
Approved: 2023-08-23
“What are Protobuf Editions” lays out a plan for allowing more targeted features that are not owned by the protobuf team, implemented as extensions of the global features proto. One thing it left ambiguous was who should own these extensions. Language, code generator, and runtime implementation are similar but not identical distinctions.
“Editions Zero Feature: utf8_validation” (not available externally, though a later version, “Editions Zero: utf8_validation Without Problematic Options”, is) is a recent plan to add a new set of generator features for utf8 validation. While the sole feature we had originally created (legacy_closed_enum in Java and C++) didn't have any ambiguity here, this one did. Specifically in Python, the current behaviors across proto2/proto3 are distinct for all 3 implementations: pure Python, Python/C++, and Python/upb.
In meetings, we've discussed various alternatives, captured below. The original plan was to make feature extensions runtime implementation-specific (e.g. C++, Java, Python, upb). Some notable complications came up, though:
Polyglot - it's not clear how upb or C++ runtimes should behave in multi-language situations. Which feature sets do they consider for runtime behaviors? Note: this is already a serious issue today, where all proto2 strings and many proto3 strings are completely unsafe across languages.
Shared Implementations - Runtimes like upb and C++ are used as backing implementations of multiple other languages (e.g. Python, Rust, Ruby, PHP). If we have a single set of upb or cpp features, migrating to those shared implementations would be more difficult (since there are no independent switches per language). Note: this is already the situation we're in today, where switching the runtime implementation can cause subtle and dangerous behavior changes.
Given that we only have two behaviors, and one of them is unambiguous, it seems reasonable to punt on this decision until we have more information. We may encounter more edge cases that require feature extensions (and give us more information) during the rollout of edition zero. We also have a lot of freedom to re-model features in later editions, so keeping the initial implementation as simple as possible seems best (i.e. Alternative 2).
Alternative 1: Features would be per-runtime implementation as originally described in “Editions Zero Feature: utf8_validation.” For example, Protobuf Python users would set different features depending on the backing implementation (e.g. features.(pb.cpp).<feature>, features.(pb.upb).<feature>).
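As a rough sketch of what per-runtime features would mean for a schema author, the field below sets one feature per backing runtime. The extensions pb.cpp and pb.upb come from the text above; the feature name utf8_validation inside them, its VERIFY and NONE values, and the upb_features.proto import path are assumptions made purely for illustration and are not existing definitions.

```proto
edition = "2023";

package example;

// cpp_features.proto exists, but it does not define the feature used below.
// upb_features.proto is a hypothetical path.
import "google/protobuf/cpp_features.proto";
import "google/protobuf/upb_features.proto";

message Profile {
  // Hypothetical per-runtime features: a Python user has to set one value
  // for each backing implementation their code might run on.
  string display_name = 1 [
    features.(pb.cpp).utf8_validation = VERIFY,
    features.(pb.upb).utf8_validation = NONE
  ];
}
```

Because every language backed by upb or C++ would read the same two settings, a schema like this offers no way to say "validate in Python but not in Ruby", which is the Shared Implementations complication described earlier.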
Alternative 2: Features would be per-generator only (i.e. each protoc plugin would own one set of features). This was the second decision we made in later discussions, and while very similar to the above alternative, it's more in line with our goal of making features primarily for codegen.
For example, all Python implementations would share the same set of features (e.g. features.(pb.python).<feature>). However, certain features could be targeted to specific implementations (e.g. features.(pb.python).upb_utf8_validation would only be used by Python/upb).
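A minimal sketch of the per-generator shape, for comparison. The extension pb.python and the feature name upb_utf8_validation are taken from the text above; the python_features.proto import path and the VERIFY value are assumptions, not existing definitions.

```proto
edition = "2023";

package example;

// Hypothetical path for the file that would define the pb.python extension.
import "google/protobuf/python_features.proto";

message Profile {
  // A single generator-owned feature set covers every Python implementation.
  // This particular feature would only be consulted by Python/upb; pure
  // Python and Python/C++ would ignore it.
  string display_name = 1 [
    features.(pb.python).upb_utf8_validation = VERIFY
  ];
}
```

The implementation targeting lives in the feature name rather than in the extension, so the Python generator remains the sole owner of everything under pb.python.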
Alternative 3: Since this whole discussion revolves around the utf8 validation feature, one option would be to just remove it from edition zero. Instead of adding a new toggle for UTF8 behavior, we could simply migrate everyone who doesn't enforce utf8 today to bytes. This would likely need another new codegen feature for generating byte getters/setters as strings, but that wouldn't have any of the ambiguity we're seeing today.
Unfortunately, this doesn't seem feasible because of all the different behaviors laid out in “Editions Zero Feature: utf8_validation.” UTF8 validation isn't really a binary on/off decision, and it can vary widely between languages. There are many cases where UTF8 is validated in some languages but not others, and there's also the C++ “hint” behavior that logs errors but allows invalid UTF8.
Note: This could still be partially done in a follow-up LSC by targeting specific combinations of the new feature that disable validation in all relevant languages.
Alternative 4: Another option is to allow for shared feature set messages. For example, upb would define a feature message, but not make it an extension of the global FeatureSet. Instead, languages with upb implementations would have a field of this type to allow for finer-grained controls. C++ would both extend the global FeatureSet and also be allowed as a field in other languages.
For example, Python utf8 validation could be specified as:
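A hedged sketch of the shape this could take, assuming the Python feature message has fields named cpp and upb that hold the shared feature messages. Those field names, the utf8_validation feature, its VERIFY and NONE values, and the import path are all illustrative assumptions.

```proto
edition = "2023";

package example;

// Hypothetical path for the file that would define the pb.python extension.
import "google/protobuf/python_features.proto";

message Profile {
  // The shared C++ and upb feature messages are nested as fields inside the
  // Python feature set, so each backing implementation gets its own switch
  // that applies to Python only.
  string display_name = 1 [
    features.(pb.python).cpp.utf8_validation = VERIFY,
    features.(pb.python).upb.utf8_validation = NONE
  ];
}
```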
We could have checks during feature validation that enforce that impossible combinations aren't specified. For example, with our current implementation, features.(pb.python).cpp should always be identical to features.(pb.cpp), since we don't have any mechanism for distinguishing them.