docs/design/editions/edition-zero-features.md
Authors: @mcy, @zhangskz, @mkruskal-google
Approved: 2022-07-22
Feature flags, and their defaults, that we will introduce to define the converged semantics of Edition Zero.
NOTE: This document is largely replaced by the topic, Feature Settings for Editions (to be released soon).
Edition Zero Features defines the "first edition" of the brave new world of
no-syntax Protobuf. This document defines the actual mechanics of the features
(in the narrow sense of editions) we need to implement in protoc, as well as the
chosen defaults.
This document will require careful review from various stakeholders, because it
is essentially defining a new Protobuf syntax, even if it isn't spelled that
way. In particular, we need to ensure that there is a way to rewrite existing
proto2 and proto3 files as editions files, and the behavior of "mixed
syntax" messages, without any nasty surprises.
Note that it is an explicit goal that it be possible to take an arbitrary proto2/proto3 file and convert it to editions without semantic changes, via appropriate application of features.
We must keep in mind that the status quo is messy. Many languages have some areas where they currently diverge from the correct proto2/proto3 semantics. For edition zero, we must preserve these idiosyncratic behaviors, because that is the only way for a proto2/proto3 -> editions LSC to be a no-op.
For example, in this document we define a feature features.enum = {CLOSED,OPEN}. But currently Go does not implement closed enum semantics for
syntax=proto2 as it should. This behavior is out of conformance, but we must
preserve this out-of-conformance behavior for edition zero.
In other words, defining features and their semantics is in scope for edition zero, but fixing code generators to perfectly match those semantics is explicitly out-of-scope.
Because we need to speak of two proto syntaxes, proto2 and proto3, that have
disagreeing terminology in some places, we'll define the following terms to aid
discussion. When a term appears in code font, it refers to the Protobuf
language keyword.
has method and a Clear method; default values are always serialized
if the hasbit is set.int32 representing a field of this type matches one of the known set of
valid values.int32
field with well-known values.For the purposes of this document, we will use the syntax described in Features as Custom Options, since it is the prevailing consensus among those working on editions, and allows us to have enum-typed features. The exact names for the features are a matter of bikeshedding.
There are two kinds of syntax behaviors we need to capture: those that are
turned on by a keyword, like required, and those that are implicit, like open
enums. The differences between proto2 and proto3 today are:
required but not defaulted; Proto3 has defaulted
but not required. Proto3 also does not allow custom defaults on
defaulted fields, and on message-typed fields, defaulted is a synonym
for optional.Any is the
canonical workaround).We propose defining the following features as part of edition zero:
This feature is enum-typed and controls the presence discipline of a singular field:
EXPLICIT (default) - the field has explicit presence discipline. Any
explicitly set value will be serialized onto the wire (even if it is the
same as the default value).IMPLICIT - the field has no presence discipline. The default value is
not serialized onto the wire (even if it is explicitly set).LEGACY_REQUIRED - the field is wire-required and API-optional. Setting
this will require being in the required allowlist. Any explicitly set
value will be serialized onto the wire (even if it is the same as the
default value).The syntax for singular fields is a much debated question. After discussing the
tradeoffs, we have chosen to eliminate both the optional and required
keywords, making them parse errors. Singular fields are spelled as in proto3
(no label), and will take on the presence discipline given by
features.:presence. Migration will require deleting every instance of
optional in proto files in google3, of which there are 385,236.
It is important to observe that proto2 users are much likelier to care about presence than proto3 users, since the design of proto3 discourages thinking about presence as an interesting feature of protos, so arguably introducing proto2-style presence will not register on most users' mental radars. This is difficult to prove concretely.
IMPLICIT fields behave much like proto3 implicit fields: they cannot have
custom defaults and are ignored on submessage fields. Also, if it is an
enum-typed field, that enum must be open (i.e., it is either defined in a
syntax = proto3; file or it specifies option features.enum = OPEN;
transitively).
We also make some semantic changes:
IMPLICIT``fields may have a custom default value, unlike inproto3. Whether anIMPLICIT` field containing its default value is serialized
becomes an implementation choice (implementations are encouraged to try to
avoid serializing too much, though).has_optional_keyword() and has_presence() now check for EXPLICIT, and
are effectively synonyms.proto3_optional is rejected as a parse error (use the feature instead).Migrating from proto2/3 involves deleting all optional/required labels and
adding IMPLICIT and LEGACY_REQUIURED annotations where necessary.
optional. This may confuse proto3 users who are used to
optional not being a default they reach for. Will result in
significant (trivial, but noisy) churn in proto3 files. The keyword is
effectively line noise, since it does not indicate anything other than
"this is a singular field".singular. This results in more churn but
avoids breaking peoples' priors.optional and no label to coexist in a file, which take on their
original meanings unless overridden by features.field_presence. The
fact that a top-level features.field_presence = IMPLICIT breaks the
proto3 expectation that optional means EXPLICIT may be a source of
confusion.proto:allow_required, which must be present for required to not be a
syntax error.required/optional and introduce defaulted as a real keyword. We
will not have another easy chance to introduce such syntax (which we do,
because edition = ... is a breaking change).IMPLICIT fields. This is technically not really
needed for converged semantics, but trying to remove the Proto3-ness from
IMPLICIT fields seems useful for consistency.In the future, we can introduce something like features.always_serialize or a
similar new enumerator (ALWAYS_SERIALIZE) to the when_missing enum, which
makes EXPLICIT_PRESENCE fields unconditionally serialized, allowing
LEGACY_REQUIRED fields to become EXPLICIT_PRESENCE in a future large-scale
change. The details of such a migration are out-of-scope for this document.
Given the following files:
// foo.proto
syntax = "proto2"
message Foo {
required int32 x = 1;
optional int32 y = 2;
repeated int32 z = 3;
}
// bar.proto
syntax = "proto3"
message Bar {
int32 x = 1;
optional int32 y = 2;
repeated int32 z = 3;
}
post-editions, they will look like this:
// foo.proto
edition = "tbd"
message Foo {
int32 x = 1 [features.field_presence = LEGACY_REQUIRED];
int32 y = 2;
repeated int32 z = 3;
}
// bar.proto
edition = "tbd"
option features.field_presence = NO_PRESENCE;
message Bar {
int32 x = 1;
int32 y = 2 [features.field_presence = EXPLICIT_PRESENCE];
repeated int32 z = 3;
}
Enum types come in two distinct flavors: closed and open.
closed enums will store enum values that are out of range in the unknown field set.
open enums will parse out of range values into their fields directly.
NOTE: Closed enums can cause confusion for parallel arrays (two repeated fields that expect to have index i refer to the same logical concept in both fields) because an unknown enum value from a parallel array will be placed in the unknown field set and the arrays will cease being parallel. Similarly parsing and serializing can change the order of a repeated closed enum by moving unknown values to the end.
NOTE: Some runtimes (C++ and Java, in particular) currently do not use
the declaration site of enums to determine whether an enum field is treated
as open; rather, they use the syntax of the message the field is defined in,
instead. To preserve this proto2 quirk until we can migrate users off of it,
Java and C++ (and runtimes with the same quirk) will use the value of
features.enum as set at the file level of messages (so, if a file sets
features.enum = CLOSED at the file level, enum fields defined in it behave
as if the enum was closed, regardless of declaration). IMPLICIT singular
fields in Java and C++ ignore this and are always treated as open, because
they used to only be possible to define in proto3 files, which can't use
proto2 enums.
In proto2, enum values are closed and no requirements are placed upon the
first enum value. The first enum value will be used as the default value.
In proto3, enum values are open and the first enum value must be zero. The
first enum value is used as the default value, but that value is required to
be zero.
In edition zero, We will add a feature features.enum_type = {CLOSED,OPEN}. The
default will be OPEN. Upgraded proto2 files will explicitly set
features.enum_type = CLOSED. The requirement of having the first enum value be
zero will be dropped.
NOTE: Nominally this exposes a new state in the configuration space, OPEN enums with a non-zero default. We decided that excluding this option simply because it was previously inexpressible was a false economy.
CLOSED enums, but that is a semantic
change.Given the following files:
// foo.proto
syntax = "proto2"
enum Foo {
A = 2, B = 4, C = 6,
}
// bar.proto
syntax = "proto3"
enum Bar {
A = 0, B = 1, C = 5,
}
post-editions, they will look like this:
// foo.proto
edition = "tbd"
option features.enum_type = CLOSED;
enum Foo {
A = 2, B = 4, C = 6,
}
// bar.proto
edition = "tbd"
enum Bar {
A = 0, B = 1, C = 5,
}
If we wanted to merge them into one file, it would look like this:
// foo.proto
edition = "tbd"
enum Foo {
option features.enum_type = CLOSED;
A = 2, B = 4, C = 6,
}
enum Bar {
A = 0, B = 1, C = 5,
}
In proto3, the repeated_field_encoding attribute defaults to PACKED. In
proto2, the repeated_field_encoding attribute defaults to EXPANDED. Users
explicitly enabled packed fields 12.3k times, but only explicitly disable it 200
times. Thus we can see a clear preference for repeated_field_encoding = PACKED
emerge. This data matches best practices. As such, the default value for
features.repeated_field_encoding will be PACKED.
The existing [packed = …] syntax will be made an alias for setting the feature
in edition zero. This alias will eventually be removed. Whether that removal
happens during the initial large-scale change to enable edition zero or as a
follow on will be decided at the time.
In the long term, we would like to remove explicit usages of
features.repeated_field_encoding = EXPANDED, but we would prefer to separate
that large-scale change from the landing of edition zero. So we will explicitly
set features.repeated_field_encoding to EXPANDED at the file level when we
migrate proto2 files to edition zero.
features.repeated_field_encoding and instead specify [packed = false] when converting from proto2. This will be incredibly noisy,
syntax-wise and diff-wise.Given the following files:
// foo.proto
syntax = "proto2"
message Foo {
repeated int32 x = 1;
repeated int32 y = 2 [packed = true];
repeated int32 z = 3;
}
// bar.proto
syntax = "proto3"
message Foo {
repeated int32 x = 1;
repeated int32 y = 2 [packed = false];
repeated int32 z = 3;
}
post-editions, they will look like this:
// foo.proto
edition = "tbd"
options features.repeated_field_encoding = EXPANDED;
message Foo {
repeated int32 x = 1;
repeated int32 y = 2 [packed = true];
repeated int32 z = 3;
}
// bar.proto
edition = "tbd"
message Foo {
repeated int32 x = 1;
repeated int32 y = 2 [packed = false];
repeated int32 z = 3;
}
Note that post migration, we have not changed packed to
features.repeated_field_encoding = PACKED, although we could choose to do so
if the diff cost is not monumental. We prefer to defer to an LSC after editions
are shipped, if possible.
WARNING: UTF8 validation is actually messier than originally believed. This feature is being reconsidered in Editions Zero Feature: utf8_validation.
This feature is a tristate:
MANDATORY - this means that a runtime MUST verify UTF-8.HINT - this means that a runtime may refuse to parse invalid UTF-8, but it
can also simply skip the check for performance in some build modes.NONE - this field behaves like a bytes field on the wire, but parsers
may mangle the string in an unspecified way (for example, Java may insert
spaces as replacement characters).The default will be MANDATORY.
Long term, we would like to remove this feature and make all string fields
MANDATORY.
In the infinite future, we would like to remove this feature and force all
string fields to be UTF-8 validated. To do this, we need to recognize that
what many callers want from their string fields is a bytes field with a
string-like API. To ease the transition, we would add per-codegen backend
features, like java.bytes_as_string, that give a bytes field a generated API
resembling that of a string field (with caveats about replacement characters
forced by the host language's string type).
The migration would take HINT or SKIP string fields and convert them into
bytes fields with the appropriate API modifiers, depending on which languages
use that proto; C++-only protos, for example, are a no-op.
There is an argument to be made for "I want a string type, and I explicitly want replacement U+FFFD characters if I get something that isn't UTF-8." It is unclear if this is something users want and we would need to investigate it before making a decision.
This feature is dual state in edition zero:
ALLOW - this means that a runtime must allow JSON parsing and
serialization. Checks will be applied at the proto level to make sure that
there is a well-defined mapping to JSON.LEGACY_BEST_EFFORT - this means that a runtime will do the best it can to
parse and serialize JSON. Certain protos will be allowed that can result in
undefined behavior at runtime (e.g. many:1 or 1:many mappings).The default will be ALLOW, which maps the to the current proto3 behavior.
LEGACY_BEST_EFFORT will be used for proto2 files that require it (e.g. they’ve
set deprecated_legacy_json_field_conflicts)
ALLOW - there are ~30 cases internally where protos have invalid
JSON mappings and rely on unspecified (but luckily well defined) runtime
behavior.Long term, we would like to either remove this feature entirely or add a
DISALLOW option instead of LEGACY_BEST_EFFORT. This will more strictly
enforce that protos without a valid JSON mapping can’t be serialized or parsed
to JSON. DISALLOW will be enforced at the proto-language level, where no
message marked ALLOW can contain any message/enum marked DISALLOW (e.g.
through extensions or fields)
Extensions may be used on all messages. This lifts a restriction from proto3.
Extensions do not play nicely with TypeResolver. This is actually fixable, but
probably only worth it if someone complains.
features.allow_extensions, default true. This feels unnecessary since
uttering extend and extensions is required to use extensions in the
first place.This feature defaults to LENGTH_PREFIXED. The group syntax does not exist
under editions. Instead, message-typed fields that have
features.message_encoding = DELIMITED set will be encoded as groups (wire type
3/4) rather than byte blobs (wire type 2). This reflects the existing API
(groups are funny message fields) and simplifies the parser.
A proto2 group field will be converted into a nested message type of the same
name, and a singular submessage field that is features.message_encoding = DELIMITED with the message type's name in snake_case.
This could be used in the future to switch new message fields to use group encoding, which suggested previously as an efficiency direction.
editions with no changes. group syntax is deprecated, so
we may as well take the opportunity to knock it out.required. This is mostly
orthogonal.Given the following file
// foo.proto
syntax = "proto2"
message Foo {
group Bar = 1 {
optional int32 x = 1;
repeated int32 y = 2;
}
}
post-editions, it will look like this:
// foo.proto
edition = "tbd"
message Foo {
message Bar {
optional int32 x = 1;
repeated int32 y = 2;
}
Bar bar = 1 [features.message_encoding = DELIMITED];
}
Putting together all of the above, we propose the following Features message,
including retention and target rules associated with fields.
message Features {
enum FieldPresence {
EXPLICIT = 0;
IMPLICIT = 1;
LEGACY_REQUIRED = 2;
}
optional FieldPresence field_presence = 1 [
retention = RUNTIME,
target = FILE,
target = FIELD
];
enum EnumType {
OPEN = 0;
CLOSED = 1;
}
optional EnumType enum = 2 [
retention = RUNTIME,
target = FILE,
target = ENUM
];
enum RepeatedFieldEncoding {
PACKED = 0;
UNPACKED = 1;
}
optional RepeatedFieldEncoding repeated_field_encoding = 3 [
retention = RUNTIME,
target = FILE,
target = FIELD
];
enum StringFieldValidation {
MANDATORY = 0;
HINT = 1;
NONE = 2;
}
optional StringFieldValidation string_field_validation = 4 [
retention = RUNTIME,
target = FILE,
target = FIELD
];
enum MessageEncoding {
LENGTH_PREFIXED = 0;
DELIMITED = 1;
}
optional MessageEncoding message_encoding = 5 [
retention = RUNTIME,
target = FILE,
target = FIELD
];
extensions 1000; // for features_cpp.proto
extensions 1001; // for features_java.proto
}