rfcs/2022-02-23-9481-config-schema.md
Vector's configuration, while driven by code, is hard to document, as well as programmatically describe and validate outside of Vector itself. This RFC proposes an enhancement to how we define configuration types so that Vector can emit an authoritative schema that can be used to drive configuration editing and validation, in both interactive and programmatic contexts.
Vector's configuration represents a fairly dense entrypoint into the vast flexibility provided by Vector: sources, transforms, sinks, tests, each with their own varying levels of complexity.
While Vector's documentation is generally fairly high-quality, users struggle to efficiently and correctly configure Vector for a variety of different reasons:
- values rendered in unhelpful formats (e.g. `1e+07` instead of `1000000`, even when we know the unit is bytes, and could instead display `1MB`)

Additionally, some of these pains are also felt during development. While users may struggle with an issue caused by following incorrect documentation, developers also struggle to correctly update the documentation itself. Vector uses Cue, a data definition language that can define data and its schema all in one, which is how we codify our configuration in a somewhat programmatic/strict way. While Cue itself is a powerful data constraint language, it can take time to master and can be fairly inscrutable when errors are encountered.
Overlapping with some of these user pains, knowing when the documentation needs to be updated -- or simply remembering to update it -- can easily be missed during development, which in turn leads to documentation that drifts out-of-sync until someone notices what has occurred.
In order to address these concerns, we would update Vector to generate a configuration schema to provide a single source of truth in terms of how a particular version of Vector could be configured.
This schema would be based on JSONSchema, a specification for defining and validating JSON documents, as we can trivially convert any supported encoding for Vector configurations into JSON. JSONSchema is the most comprehensive specification for schema definition and validation, especially when also considering the ubiquity of JSON itself, in which JSONSchema documents are themselves written.
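For illustration, a hypothetical sink stanza (the component names are only examples) shows how a TOML configuration maps mechanically to the JSON that the schema would actually validate:

```toml
[sinks.out]
type = "console"
inputs = ["in"]
encoding.codec = "json"
```

```json
{ "sinks": { "out": { "type": "console", "inputs": ["in"], "encoding": { "codec": "json" } } } }
```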
As a developer, the primary impact of this would be the need to use new helpers, and patterns, that become required for documenting configuration types. This would represent a net-new change to creating and updating components, although as part of developing this feature, all existing components would be migrated over to bootstrap the process.
In turn, developers could equate this to something like a new CI check or lint being added that told them when their code did not comply, and what changes needed to be made in order to do so.
Overall, the cognitive overhead of this proposal would be low, as we would rely on the compiler, and CI, as much as possible in order to surface errors or non-compliance and explain exactly what needed to be changed in order to fix the errors.
### 1. The `Configurable` trait

This trait, and its derive macro, would form the basis of validating the compliance of a configuration as well as walking a configuration to generate its schema. There are two primary requirements for generating the configuration schema: discoverability and compliance.
Discoverability, or the actual logic of inspecting the configuration types -- encoding information about their fields, allowable values, and so on -- is the most obvious requirement. We need to be able to find these types, know they can be inspected, and then actually do the work. Compliance is perhaps even more important: unless all configuration types can be inspected, the configuration schema can never be used to correctly validate a user-supplied configuration.
The Configurable trait would form the basis for discoverability. It would provide a minimalistic
interface that walked the type, and walked its fields, mapping closely to the traditional "Visitor"
pattern. The trait would allow exposing common items such as name, description, allowable type,
units, and so on. Additionally, it would allow for defining custom metadata, or extensions, that
could be parsed by external code to satisfy more advanced workflows i.e. configuration migration,
testing, etc.
The Configurable derive macro would form the basis for compliance. While it would generally
provide the scaffolding to generate an implementation of the Configurable trait -- walking each
field, gathering attributes and doc comments and so on -- it would also be able to validate that
those things exist at compile time. As an example, we can enforce that all `Configurable`
implementors are fully documented: the type itself, their subfields, and so on. It would then become
extremely hard for developers to add new fields or types to the configuration that weren't
documented by the time a new Vector release was cut.
### 2. `Configurable` trait as a vehicle for encapsulating all facets of configuration

As discussed in point 1, the `Configurable` trait is meant to provide a common interface for
configuration types, and the types used within those types, such as types from the standard library
or third-party crates, such that they can describe their "shape", value constraints, and any other
relevant metadata. Below is an abbreviated version of the Configurable trait, along with
supporting types:
```rust
/// The shape of the field.
///
/// This maps similar to the concept of JSON's data types, where types are generalized and have
/// generalized representations. This allows us to provide general-but-relevant mappings to core
/// types, such as integers and strings and so on, while providing escape hatches for customized
/// types that may be encoded and decoded via "normal" types but otherwise have specific rules or
/// requirements.
///
/// Additionally, the shape of a field can encode some basic properties about the field to which it
/// is attached. For example, numbers can be bounded on the lower or upper end, while strings
/// could define a minimum length, or even an allowed pattern via regular expressions.
///
/// In this way, they describe a more complete shape of the field than simply the data type alone.
#[derive(Clone)]
pub enum Shape {
    Null,
    Boolean,
    String(StringShape),
    Number(NumberShape),
    Array(ArrayShape),
    Map(MapShape),
    Composite(Vec<Shape>),
}

#[derive(Clone, Default)]
pub struct StringShape {
    minimum_length: Option<usize>,
    maximum_length: Option<usize>,
    allowed_pattern: Option<&'static str>,
}

#[derive(Clone)]
pub enum NumberShape {
    Unsigned {
        effective_lower_bound: u128,
        effective_upper_bound: u128,
    },
    Signed {
        effective_lower_bound: i128,
        effective_upper_bound: i128,
    },
    FloatingPoint {
        effective_lower_bound: f64,
        effective_upper_bound: f64,
    },
}

#[derive(Clone)]
pub struct ArrayShape {
    element_shape: Box<Shape>,
    minimum_length: Option<usize>,
    maximum_length: Option<usize>,
}

#[derive(Clone)]
pub struct MapShape {
    required_fields: HashMap<&'static str, Shape>,
    allowed_unknown_field_shape: Option<Shape>,
}

pub struct Field {
    name: &'static str,
    description: &'static str,
    shape: Shape,
    fields: Vec<Field>,
    metadata: Metadata<Value>,
}

#[derive(Clone, Default)]
pub struct Metadata<T: Serialize> {
    default: Option<T>,
    attributes: Vec<(String, String)>,
}

pub trait Configurable<'de>: Serialize + Deserialize<'de> + Sized
where
    Self: Clone,
{
    /// Gets the human-readable description of this value, if any.
    ///
    /// For standard types, this will be `None`. Commonly, custom types would implement this
    /// directly, while fields using standard types would provide a field-specific description that
    /// would be used instead of the default description.
    fn description() -> Option<&'static str>;

    /// Gets the shape of this value.
    fn shape() -> Shape;

    /// Gets the metadata for this value.
    fn metadata() -> Metadata<Self>;

    /// The fields for this value, if any.
    fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>>;
}
```
The Configurable trait defines some very basic core functionality: the description of this type
(if applicable), the "shape" of the type, any metadata associated with it, and the fields it
exposes. It also enforces (de)serialization capabilities on the type as this represents a base level
of functionality required by types that will be utilized in a Vector configuration.
Description and shape are required because they are both inherent and inextricable qualities of anything that we expose as a configurable option. Metadata and fields are optional as not every type will have metadata, and not every type actually has fields. For example, any scalar value -- string, number, bool, etc -- is a singular unit, and the same with arrays. Anything that looks like an "object", however, must have fields, as that is an inherent characteristic of an "object".
At the top level, there must always be a type that is Configurable which maps to the Vector
configuration itself, and then fields within it. From this point on, we'll relate characteristics of
the Configurable trait in the context of the types that implement it being fields.
Shape represents the inherent type of a field, as well as any additional constraints on that type.
This is where we start to see the mappings from Rust types to their serialized representation, and in
general, the Shape variants map closely to the various JSON types. We've added some general
constraints here -- lower/upper bounds on numbers, min/max length and an acceptable regex pattern for
strings, expected element shape for arrays, expected fields for maps, etc -- but this is merely for
fleshing out the concept. We could extend this as needed, but generally we would strive to only
encode intrinsic properties of these types within Shape, depending on metadata for more
custom/situational constraints.
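As a sketch of that principle, here is roughly what the intrinsic shape for a scalar like `u16` might look like, using a trimmed-down version of the `NumberShape` enum from above (the `u16_shape` helper is hypothetical; the real implementation would live behind `Configurable::shape`):

```rust
// Trimmed-down version of the RFC's `NumberShape`, keeping only the variant
// needed for this example.
#[derive(Clone, Debug, PartialEq)]
pub enum NumberShape {
    Unsigned {
        effective_lower_bound: u128,
        effective_upper_bound: u128,
    },
}

// What a `Configurable` implementation for `u16` might report: the bounds are
// intrinsic to the Rust type itself, so nothing extra needs to be declared.
pub fn u16_shape() -> NumberShape {
    NumberShape::Unsigned {
        effective_lower_bound: u128::from(u16::MIN),
        effective_upper_bound: u128::from(u16::MAX),
    }
}
```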
Following on from `Shape`, we have the ability to define metadata about fields. One major thing we
use metadata for is providing default values for a given type/field. Keeping defaults out of
`Shape` avoids muddying it with concerns unrelated to the data type itself. Metadata also lets us use
a generically-typed struct to capture real Rust values, and then eventually serialize them down to a
generic representation that can flow into the schema. Additionally, and perhaps most
obviously, metadata can also be used for generic key/value data about the given type.
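The typed-default propagation can be sketched with a trimmed-down `Metadata<T>`; `map_default_value` appears in the code examples below, while the `SinkDefaults` type here is a hypothetical stand-in for a real configuration type:

```rust
// Trimmed-down version of the RFC's `Metadata<T>`: only the default value is
// kept for this example.
pub struct Metadata<T> {
    pub default: Option<T>,
}

impl<T> Metadata<T> {
    // Project the parent's default value down into a field's type, e.g. from a
    // whole sink configuration down to a single field, without ever leaving
    // strongly-typed Rust.
    pub fn map_default_value<U>(&self, f: impl Fn(&T) -> U) -> Metadata<U> {
        Metadata {
            default: self.default.as_ref().map(f),
        }
    }
}

// Hypothetical parent type whose default gets projected onto its fields.
pub struct SinkDefaults {
    pub endpoint: String,
    pub max_events: u32,
}
```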
Finally, we come to fields. As mentioned above, fields are the realization, essentially, of the sum
of Configurable types that can represent a Vector configuration. They are a coalesced version of
all the data provided by Configurable and are ultimately the data that gets used to drive schema
generation. One point here is that this is the interface where typed metadata will be serialized
such that `Field` has all the data necessary to generate a schema: name, description, shape,
default value, custom metadata, subfields, etc.
Below is an example of a very simple sink configuration which supports batching and uses the
ubiquitous BatchConfig type:
```rust
#[derive(Serialize, Deserialize, Clone)]
struct SinkConfig {
    endpoint: String,
    batch: BatchConfig,
}

impl<'de> Configurable<'de> for SinkConfig {
    fn description() -> Option<&'static str> {
        Some("Configuration for the sink.")
    }

    fn shape() -> Shape {
        let mut required_fields = HashMap::new();
        required_fields.insert("endpoint", <String as Configurable>::shape());
        required_fields.insert("batch", <BatchConfig as Configurable>::shape());

        Shape::Map(MapShape {
            required_fields,
            allowed_unknown_field_shape: None,
        })
    }

    fn metadata() -> Metadata<Self> {
        Metadata {
            default: Some(SinkConfig {
                endpoint: String::from("foo"),
                batch: BatchConfig::default(),
            }),
            ..Default::default()
        }
    }

    fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>> {
        let shape = Self::shape();
        let mut required_field_shapes = match shape {
            Shape::Map(MapShape { required_fields, .. }) => required_fields.clone(),
            _ => unreachable!("SinkConfig is a fixed-field object and cannot be another shape"),
        };

        let base_metadata = <Self as Configurable>::metadata();
        let merged_metadata = merge_metadata_overrides(base_metadata, overrides);

        let endpoint_shape = required_field_shapes.remove("endpoint").expect("shape for `endpoint` must exist");
        let endpoint_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.endpoint.clone());

        let batch_shape = required_field_shapes.remove("batch").expect("shape for `batch` must exist");
        let batch_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.batch.clone());

        let mut fields = HashMap::new();
        fields.insert("endpoint", Field::new::<String>(
            "endpoint",
            "The endpoint to send requests to.",
            endpoint_shape,
            endpoint_override_metadata,
        ));
        fields.insert("batch", Field::new::<BatchConfig>(
            "batch",
            <BatchConfig as Configurable>::description().expect("`BatchConfig` has no defined description, and an override description was not provided."),
            batch_shape,
            batch_override_metadata,
        ));

        Some(fields)
    }
}
```
```rust
#[derive(Serialize, Deserialize, Default, Clone)]
struct BatchConfig {
    max_events: Option<u32>,
    max_bytes: Option<u32>,
    max_timeout: Option<Duration>,
}

impl<'de> Configurable<'de> for BatchConfig {
    fn description() -> Option<&'static str> {
        Some("Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.")
    }

    fn shape() -> Shape {
        let mut required_fields = HashMap::new();
        required_fields.insert("max_events", <Option<u32> as Configurable>::shape());
        required_fields.insert("max_bytes", <Option<u32> as Configurable>::shape());
        required_fields.insert("max_timeout", <Option<Duration> as Configurable>::shape());

        Shape::Map(MapShape {
            required_fields,
            allowed_unknown_field_shape: None,
        })
    }

    fn metadata() -> Metadata<Self> {
        Metadata {
            default: Some(BatchConfig {
                max_events: Some(1000),
                max_bytes: Some(1048576),
                max_timeout: Some(Duration::from_secs(60)),
            }),
            ..Default::default()
        }
    }

    fn fields(overrides: Metadata<Self>) -> Option<HashMap<&'static str, Field>> {
        let shape = Self::shape();
        let mut required_field_shapes = match shape {
            Shape::Map(MapShape { required_fields, .. }) => required_fields.clone(),
            _ => unreachable!("BatchConfig is a fixed-field object and cannot be another shape"),
        };

        let base_metadata = <Self as Configurable>::metadata();
        let merged_metadata = merge_metadata_overrides(base_metadata, overrides);

        let max_events_shape = required_field_shapes.remove("max_events").expect("shape for `max_events` must exist");
        let max_events_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.max_events);

        let max_bytes_shape = required_field_shapes.remove("max_bytes").expect("shape for `max_bytes` must exist");
        let max_bytes_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.max_bytes);

        let max_timeout_shape = required_field_shapes.remove("max_timeout").expect("shape for `max_timeout` must exist");
        let max_timeout_override_metadata = merged_metadata.clone()
            .map_default_value(|default| default.max_timeout);

        let mut fields = HashMap::new();
        fields.insert("max_events", Field::new::<Option<u32>>(
            "max_events",
            "Maximum number of events per batch.",
            max_events_shape,
            max_events_override_metadata,
        ));
        fields.insert("max_bytes", Field::new::<Option<u32>>(
            "max_bytes",
            "Maximum number of bytes per batch.",
            max_bytes_shape,
            max_bytes_override_metadata,
        ));
        fields.insert("max_timeout", Field::new::<Option<Duration>>(
            "max_timeout",
            "Maximum period of time a batch can exist before being forcibly flushed.",
            max_timeout_shape,
            max_timeout_override_metadata,
        ));

        Some(fields)
    }
}
```
This code represents the expected boilerplate that would be generated by a Configurable derive
macro, so it may appear verbose but such verbosity would be entirely hidden from developers unless
they needed to manually implement Configurable for a remote type in a third-party crate.
Immediately, we can observe a few things about the trait's usage in practice. The design of
Configurable::metadata and Configurable::fields allow us to define metadata such that it is
automatically propagated downward as far as we wish. In the above example, we use this specifically
for the ability to define a default BatchConfig value at the SinkConfig level, while being able
to pass the value of each individual field down. This means that, while we may be using a global
definition of the shape of BatchConfig, we can define an override of the default value for it at
the point of usage.
Additionally, as the metadata is typed, the derive macro can generate rich code that utilizes the
raw types involved rather than having to deal with downcasted/generalized versions. We utilize this
in the `map_default_value` calls shown above to grab the value of a specific
field from a value of `Self`, allowing us to continue generating and providing typed metadata as we
render each field, and that field's fields, and so on.
Additionally, you can see the generated code around descriptions, where we can layer on additional checks to ensure required pieces are present, giving us the ability to add run-time checks on top of compile-time checks, further extending our goal of misuse resistance.
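A minimal sketch of such a run-time guard, assuming a fallible `Field` constructor (the `Result`-returning two-argument signature here is an illustrative assumption, not the RFC's actual API):

```rust
// Trimmed-down `Field`, keeping only the pieces needed for the guard.
pub struct Field {
    pub name: &'static str,
    pub description: &'static str,
}

impl Field {
    // Even if a compile-time check were somehow bypassed, constructing a
    // `Field` without a description fails loudly instead of silently
    // producing an undocumented schema entry.
    pub fn new(name: &'static str, description: &'static str) -> Result<Self, String> {
        if description.trim().is_empty() {
            return Err(format!("field `{name}` is missing a description"));
        }
        Ok(Field { name, description })
    }
}
```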
### 3. `Configurable` derive macro as a vehicle for easily defining high-level constraints on configuration types

Following from the code examples in point 2, we'll explore what the user-defined types would look
like when using the proposed Configurable derive macro. Some of the fields and other metadata may
differ from the above example as it would have become too verbose to display above. All of that
said, let's take a look:
```rust
/// Configuration for the sink.
#[derive(Clone)]
#[configurable_component]
#[configurable(metadata("status", "beta"))]
struct SinkConfig {
    /// The endpoint to send requests to.
    #[configurable(format(uri), deprecated("url"))]
    #[serde(alias = "url")]
    endpoint: String,

    #[serde(default = "default_batch_config_for_sink")]
    #[configurable(subfield(max_events, range(max = 1000)))]
    batch: BatchConfig,
}

/// Controls batching behavior i.e. maximum batch size, the maximum time before a batch is flushed, etc.
#[derive(Default, Clone)]
#[configurable_component]
struct BatchConfig {
    /// Maximum number of events per batch.
    #[configurable(non_zero)]
    max_events: Option<u32>,

    /// Maximum number of bytes per batch.
    #[configurable(non_zero)]
    max_bytes: Option<u32>,

    /// Maximum period of time a batch can exist before being forcibly flushed.
    #[configurable(non_zero)]
    max_timeout: Option<Duration>,
}
```
At a high level, the Configurable derive macro primarily deals with generating the boilerplate
Configurable implementation for the given type, but goes further with actually being able to
introspect existing attributes as well as allowing further constraints to be applied. Additionally,
you can see a distinct attribute macro here -- configurable_component -- which we use to apply the
derive attribute for us. You can think of #[configurable_component] as being a string replacement
marker for #[derive(Serialize, Deserialize, Configurable)]. This lets us enforce that Serialize
and Deserialize are derived, along with Configurable, as they all must be present for any
configuration type that we want to include within the schema.
Above, we can see the two previously shown types, with doc comments, typical derives for
(de)serialization, and so on. The Configurable derive macro can trivially consume information such
as the doc comments to provide a description for types and fields. The real power, as mentioned
above, is when we get into using the configurable attribute on fields, and how it uses existing
attributes.
We can see a number of usages of #[configurable(...)] which represents specific attributes
supported by the Configurable derive macro. In this example, we're doing a few different things:
- declaring a format of `uri` on `SinkConfig::endpoint`, which will let consumers of the
  schema know to validate this field according to the `uri` format defined in the JSONSchema
  specification
- marking the `url` alias of `SinkConfig::endpoint` as deprecated, which gets added as
  custom metadata
- overriding a subfield of `SinkConfig::batch`, where we apply a range override,
  specifically setting the maximum value for `max_events` to 1000
- constraining all fields on `BatchConfig` to have a non-zero constraint, ensuring that none of the
  values can ever be passed in as zero

In and of themselves, these are powerful constraints to be able to apply inline with the definition of the configuration types themselves, and they can then be exposed via the schema using either native JSONSchema support or custom metadata if they aren't natively supported yet.
As well, we can interrogate the other attributes present on the fields, including existing serde
field attributes. On the batch field in SinkConfig, a default batch configuration has been
defined using the typical serde field attribute, default, which can either take a direct value or
a reference to a function that can generate the value. Our derive macro can also see this attribute
when it runs, and can utilize it to generate our own default value. This is generically applicable
to whatever attributes we want to be able to interrogate, and so this provides an extremely powerful
primitive to be able to take advantage of existing code, as well as support new attributes from
other crates that get utilized in the future.
Additionally, the ability to specify custom metadata using the configurable attribute is a
powerful escape hatch when we need to encode behaviours, or inline relevant data, to configuration
types and fields that aren't related to the schema of a Vector configuration itself... or aren't
possible to encode in JSONSchema. For example, this could be used for something simple, like in
the above example, where we're defining the status of the sink implementation as beta. This might be
used to drive the generated content of the vector.dev website and documentation.
Some constraints might be harder to express in JSONSchema, however, such as whether or not a sink supports acknowledgements. That capability isn't terribly relevant to the Vector configuration itself, beyond validating that any fields which toggle it on or off have been set correctly. There is semantic relevance, however: knowing whether a sink does or doesn't support acknowledgements could allow a validator to surface issues to the user. For example, if acknowledgements are enabled on a sink but the source does not support acknowledgements, then end-to-end acknowledgements would not actually function. While the configuration loading code can also detect this, providing these semantic definitions within the schema itself allows us to more generically encode these types of behaviors, and allows external tools, which don't have the benefit of running Vector directly, to correctly suss out incompatibilities and misconfigurations.
Further, and following the example code itself, this can allow us to enforce constraints that are only partially able to be represented in JSONSchema, such as aliased fields. While JSONSchema already supports the ability to define a schema such that a field can be represented by multiple names, it has no concept of a deprecated field. Utilizing custom metadata, we can encode metadata that indicates which field name variant is the deprecated one. This can be used not only to drive behavior in the generated documentation, but in other tools as well, such as automatically transforming one version of a configuration to a newer version by analyzing the schemas, and being able to reason that if field X used to be able to be referenced via A or B, and now only B is allowed, we can just rename A to B. We could also check that schemas don't remove fields unless those fields were already marked as deprecated, being able to enforce our unofficial guideline of not removing fields unless they've been marked as deprecated for at least one release.
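The alias-rename transformation described above can be sketched as follows; the helper name and the flat string-map representation of a configuration are assumptions for illustration, not actual tooling:

```rust
use std::collections::HashMap;

// Hypothetical migration helper: schema metadata tells an external tool that
// `deprecated` (e.g. `url`) is a deprecated alias of `canonical` (e.g.
// `endpoint`), so the rename can be performed mechanically, without running
// Vector itself.
pub fn migrate_deprecated_alias(
    config: &mut HashMap<String, String>,
    deprecated: &str,
    canonical: &str,
) {
    if let Some(value) = config.remove(deprecated) {
        // Prefer an explicitly-set canonical field over the deprecated alias.
        config.entry(canonical.to_string()).or_insert(value);
    }
}
```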
### 4. Using `schemars` to generate the actual JSONSchema

With the necessary information being derived from our configuration types directly, we still need a
way to take that and emit an actual JSONSchema document for Vector's configuration. We would utilize
schemars, a crate for generating JSONSchema documents, for this purpose.
The schemars crate, among other things, provides a small set of traits and types related
specifically to programmatically generating a JSONSchema document. Types which a user wishes to
document must implement JSONSchema, which interacts with a SchemaGenerator object that actually
holds the in-progress schema as a type is being walked.
Our Configurable derive macro would implement this trait automatically for the given type, using
the information provided by Configurable to ultimately drive the schema generation. While the
boilerplate to read the data given by Configurable and use it to feed the schema generator is
almost entirely mechanical and boring, we can take a look at how the above code example from point 3
might look when actually turned into a JSONSchema document:
```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema#",
  "title": "SinkConfig",
  "type": "object",
  "oneOf": [
    { "required": ["endpoint"] },
    { "required": ["url"] }
  ],
  "properties": {
    "endpoint": {
      "description": "The endpoint to send requests to.",
      "type": "string",
      "format": "uri"
    },
    "url": {
      "description": "The endpoint to send requests to.",
      "type": "string",
      "format": "uri",
      "deprecated": true
    },
    "batch": {
      "allOf": [{ "$ref": "#/definitions/BatchSettings" }],
      "properties": {
        "max_events": {
          "type": [
            "null",
            "number"
          ],
          "maximum": 1000
        }
      },
      "default": {
        "max_bytes": 1048576,
        "max_events": 1000,
        "max_timeout": 60
      }
    }
  },
  "_metadata": {
    "status": "beta"
  },
  "definitions": {
    "BatchSettings": {
      "description": "Controls batching behavior.",
      "type": "object",
      "properties": {
        "max_bytes": {
          "description": "Maximum number of bytes per batch.",
          "type": [
            "null",
            "integer"
          ],
          "minimum": 1,
          "maximum": 4294967296
        },
        "max_events": {
          "description": "Maximum number of events per batch.",
          "type": [
            "null",
            "integer"
          ],
          "minimum": 1,
          "maximum": 4294967296
        },
        "max_timeout": {
          "description": "Maximum period of time a batch can exist before being forcibly flushed.",
          "anyOf": [
            { "type": "null" },
            { "$ref": "#/definitions/duration" }
          ],
          "minimum": 1
        }
      }
    },
    "duration": {
      "type": "number",
      "minimum": 0,
      "maximum": 9007199254740991
    }
  }
}
```
While this is long and verbose, users almost exclusively interact with JSONSchema documents by using a library that validates JSON documents using the schema. Briefly, though, you can observe the following things:
- for the `SinkConfig::endpoint` field, we've enumerated its alias, `url`, along with the fact that it's deprecated
- for the `BatchConfig` type, we reference its shared definition, but we can also union that definition with a separate one that enforces a maximum size of 1000 events per batch
- default values, such as those defined at the `SinkConfig` level, are carried into the schema for `BatchConfig`

## Rationale

In general, the lack of a configuration schema, and of the ability to treat our Rust code as the source of truth, hurts both users and developers. As such, doing this work represents a huge opportunity to reduce developer toil when it comes to generating and synchronizing our generated documentation. It additionally represents a large quality-of-life improvement for our users, who expect our documentation to be up-to-date and semantically meaningful.
If we didn't do this work, we would be stuck with the current manual toil, which is not only extra work on the part of developers -- learning Cue, remembering to update it, etc -- but also reviewers, who now need to catch when these things are missed.
Utilizing an extensible system for expressing the schema of Vector's configuration gives us the runway to solve our current problems, but also the ability to handle future problems and requirements as they come up.
## Drawbacks

The primary drawback of this approach is that it limits us to things we can reasonably express within the limitations of the Rust language itself, and what we're comfortable with representing via attributes. For example, single-line and multi-line doc comments are trivially supported, but if we wanted to start pushing more semantically-relevant information, such as configuration snippets, it could be technically possible but would appear as a very ugly attribute usage, which could make the source muddy and unclear.
Additionally, while derive macros are written in Rust code, which can be much easier to grok than declarative macros, they still represent a section of code that may be harder to understand and modify than a more brute-force approach.
## Prior Art

Notably, and perhaps most obviously, Cue itself can be used for data validation purposes. There are also other projects that slot themselves into the same general domain, such as OPA/Rego and CDDL.
The primary problem with all of these solutions is that they're all custom languages, with far less cross-platform support, and a far steeper learning curve. Even though our approach generally concerns itself with obfuscating the schema tool itself, and focusing on making it trivially to generate the schema from annotated Rust types/fields, eventually the rubber must meet the road, and this is where these other tools would fall down for our case and become very hard to wield correctly.
## Alternatives

### Using `schemars` only

While `schemars` itself has support for annotating Rust types in almost the exact same way as we've
proposed above, it lacks a few features necessary for our use case:

- support for `serde`'s `alias` attribute feature

The lack of these features makes it much harder to correctly generate our schema, as we would still
be required to do the minimum amount of work to support defining custom attributes for the missing
features, and the work to support at least one custom attribute is 90% of the work to support two, or
three, or more. It would also mean that developers would need to figure out
whether or not `schemars` provided a certain behavior via its attributes, versus only needing to
remember how to get to the documentation for our proposed internal approach.
Another alternative is the possibility of moving the configuration source-of-truth outside of Rust and using it in the opposite direction, to generate the Rust types themselves. This is technically possible, although fundamentally suboptimal for a few reasons.
In general, the issue of integrating the code is the biggest reason to avoid such an approach.
Developers already have issues with getting IDE language assist extensions/plugins to correctly
provide type information and hinting for types that are in Rust code which is imported from the
filesystem, such as the approach taken by prost for importing the Rust code it generates from
Protocol Buffers definitions. Those issues can be easily kept at bay as-is because the rate of
change to things like our Protocol Buffers definitions is low, but configuration types for
components are both more prevalent overall and experience far more churn, which means whatever
potential issues existed would statistically show up more often.
Even an approach where we more directly placed the code into the normal src hierarchy, to avoid
needing to do prost-style code import, would still be at risk of causing friction during the
development process.
## Plan of Attack

Incremental steps to execute this change. These will be converted to issues after the RFC is approved:

1. Define the `Configurable` trait, and supporting types, and implement it by hand for Rust standard library types, and at least one of each major component type (source, transform, sink), to vet out the design and expose any corner cases.
2. Implement the `JSONSchema` trait from `schemars` by hand for all types which have a manual `Configurable` implementation.
3. Define a top-level configuration type and implement the `JSONSchema` trait for it.
4. Write a `Configurable` derive macro, and a `configurable_component` attribute macro, that can generate a boilerplate `Configurable` trait implementation without any support for attributes or attribute interrogation.
5. Extend the derive macro to also generate the `JSONSchema` trait implementation.
6. Add support for interrogating `serde` type/field/variant attributes to generate metadata, with a goal of being on par with what `schemars` supports, plus whatever they don't support that we need.
7. Add support for custom `configurable` attributes, specifically in the vein of what `schemars` provides, in terms of adding field value constraints or custom metadata.
8. Migrate all remaining manual implementations of `Configurable`/`JSONSchema` over to the derive macro/attributes.
9. Add a compile-time check (i.e. the `inventory`-based stuff) to enforce that the configuration types implement `Configurable`, thereby ensuring that all configuration types are adhering to the requirements enforced by the derive macro.
10. Move existing ad hoc validation over to `configurable` attributes, as well as default values that get merged in after-the-fact. Currently, many defaults are merged in, and validations are performed, only once a sink is in the process of being built and the configuration has been deserialized, which means some validation happens during deserialization and some happens after, leading to slightly discontiguous error messages.
11. Figure out how to generate Cue from the resulting schema. This would likely involve the `cue` command, but would also involve figuring out how to integrate the resulting Cue into our existing Cue documentation.