design/design-doc-terrajet.md
Managed resources are granular, high fidelity Crossplane representations of a resource in an external system. A provider extends Crossplane by installing controllers for new kinds of managed resources. Providers typically group conceptually related managed resources; for example the AWS provider installs support for AWS managed resources like RDSInstance and S3Bucket.
The Crossplane community has been developing providers in Go to add support for the services they need since the inception of the project. However, the development process can be cumbersome, and sometimes the sheer number of managed resources needed is simply too large to develop one by one. In order to address this issue, we have come up with several solutions, like crossplane-tools for simple generic code generation and integration with the code generator pipeline of the AWS Controllers for Kubernetes project to generate fully-fledged resources, albeit ones that still need manual additions to make sure corner cases are handled.
The main challenge with the code generation efforts is that the corresponding APIs are usually not consistent enough to have a generic pipeline that works for every API. Hence, while we can generate the parts that are generic across the board, we still need resource-specific logic to be implemented manually, just like in our provider-aws pipeline.
Fortunately, we are not alone in this space. Terraform is a powerful tool that provisions infrastructure in a declarative fashion, and the challenges it faced regarding provider API communication are very similar to ours. As the Crossplane community, we'd like to stand on the shoulders of giants and leverage the support that has been developed over the years.
We would like to develop a code generation pipeline that can be used for generating a Crossplane provider from any Terraform provider. We'd like that pipeline to be able to generate both the CRD types and the controller code for every resource the Terraform provider supports.
While it'd be nice to generate a whole provider from scratch, that's not quite what we aim for. The main goal of the code generation effort is to generate the things that scale with the number of managed resources. For example, it should be fine to manually write the ProviderConfig CRD and its logic, but we should do our best to offload custom diff operations to Terraform.
There are different ways we can choose to leverage Terraform providers:
* Import the Terraform provider package as a Go library and call its functions directly.
* Talk to the provider's gRPC server, just like Terraform CLI does.
* Generate HCL/JSON configuration and drive the Terraform CLI.
In order to make a decision, let's inspect what makes up support for a managed resource and see how each option covers those requirements.
The Go code that needs to be written to add support for a managed resource can be roughly classified as follows:
| Functionality | Import Provider | Talk to Provider | HCL/JSON |
|---|---|---|---|
| Separate Spec and Status structs | Iterate over schema.Provider | Iterate over schema.Provider | Iterate over schema.Provider |
| Spec -> API Input | Generated cty Translation | Generated cty Translation | JSON Marshalling (multi-key) |
| Target SDK Struct -> Spec Late Initialization | Generated cty Translation | Generated cty Translation | JSON Marshalling (multi-key) and reflection/code-generation |
| API Output -> Status | Generated cty Translation | Generated cty Translation | JSON Marshalling (multi-key) |
| Spec <-> API Output Comparison for up-to-date check | Generated cty Translation (leaking) | Generated cty Translation (leaking) | terraform plan diff |
| Readiness Check | create/update completion | create/update completion | terraform apply completion |
| Calls to API | call create/update/delete/read funcs | call create/update/delete/read funcs | terraform apply/destroy |
| Connection Details | iterate over create result | iterate over create result | sensitive_attributes in tfstate |
Note that even if we import provider code, we need to work with the `cty` format, because the input the provider functions expect is of type `schema.ResourceData`, which can be generated only if you have a `terraform.InstanceState`, since all of its fields are internal. While the state can be constructed, due to how it stores key/values, using `cty` empowered with the schema of the resource would be the best choice.
"leaking" in this context means that there are resource-specific details we can't generate.
The jsoniter library allows the usage of tag keys other than `json`, which lets us define multiple keys for the same field to cover snake-case Terraform struct fields and camel-case CRD struct fields.
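As a quick illustration (the struct and values here are examples, not generated output), the same struct can be marshalled with the default `json` tag key and with a jsoniter config whose `TagKey` is set to `tf`:

```go
package main

import (
	"fmt"

	jsoniter "github.com/json-iterator/go"
)

// VPCParameters is an example struct carrying both camel-case json tags for
// the CRD and snake-case tf tags matching the Terraform schema.
type VPCParameters struct {
	CIDRBlock        string `json:"cidrBlock" tf:"cidr_block"`
	EnableDNSSupport *bool  `json:"enableDnsSupport,omitempty" tf:"enable_dns_support,omitempty"`
}

func main() {
	p := VPCParameters{CIDRBlock: "10.0.0.0/16"}

	// The default behaviour uses the json tag key, producing the CRD representation.
	crdJSON, _ := jsoniter.Marshal(p)

	// A frozen config with TagKey set to "tf" produces the snake-case
	// representation expected by Terraform, from the very same struct.
	tf := jsoniter.Config{TagKey: "tf"}.Froze()
	tfJSON, _ := tf.Marshal(p)

	fmt.Println(string(crdJSON)) // {"cidrBlock":"10.0.0.0/16"}
	fmt.Println(string(tfJSON))  // {"cidr_block":"10.0.0.0/16"}
}
```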
The main advantage of this approach is that we're closer to the actual calls made to the external API, which could make things run faster since we eliminate the HCL middleware.
On the other hand, we'd have to do some of the operations Terraform CLI does, like diffing, importing, conversion of JSON/HCL to cty and such. Providers do give you customization functions whenever needed, but the main execution pipeline lives in Terraform CLI. So, we would have to reimplement those parts of Terraform CLI.
Additionally, the mechanics of providers naturally include core Terraform code structs that we'd need to develop logic to work with, and that would increase our bug surface since they are not intended to be called by anything other than Terraform CLI code. This fact has consequences for how the authors of the providers organize their code as well. For example, the Azure provider stores everything under an `internal` package, which would force us to fork the provider.
When Terraform starts executing an action, it spawns the provider gRPC server and talks with it. Similarly, a Crossplane provider process could bring up such a server, and the different reconciliation routines could talk with it via gRPC.
This option is not too different from importing the provider, since we'd still work with the `cty` wire format and implement things that Terraform CLI provides. In fact, the earlier effort to build code generation on top of Terraform used this approach. It generates strongly-typed functions for `cty` and has its own logic for merge/diff operations, though it doesn't support these operations for nested schema structs.
We can produce a blob of JSON and give it to Terraform CLI for processing whenever we need to execute an operation. [As mentioned][json-hcl-compatibility] in the Terraform docs, JSON can be a full replacement for HCL, and thanks to the jsoniter library we'll be able to use multiple tag keys in marshal/unmarshal operations depending on the context. So, we will not need to generate any conversion functions to get a JSON representation with snake case.
Additionally, this option has the advantage that we'd be interacting with Terraform just like any user, hence all features, stability guarantees and community support would be available to us. For example, we won't need to build logic for importing resources or comparison tooling to see whether an update is due; we'll call the CLI and parse its JSON output.
A disadvantage of this approach is that we'd need to work with the filesystem to manage the JSON files and the .tfstate file that contains the state. The footprint of multiple instances of Terraform CLI running at the same time could also cause resource problems; however, we have workarounds for these, like pooling the executions. Additionally, forking the CLI to make it work with an existing provider server is an option we can consider as well.
From the three options listed, interacting with the CLI is the one that will get us to the finish line fastest without too much compromise. Let's inspect the main drivers of this decision.
Terraform CLI has generic functionality for the operations it performs, and providers only inject custom functions into these operations. For example, resources declare custom diff functions so that Terraform CLI is aware of resource-specific cases when it runs its plan logic. Another example would be custom import logic. As you can see, there is only a small amount of exception handling here, so if we interacted at this level we'd need to build the import logic ourselves and then inject these customizations. So, while it's intriguing to see CRUD operations exported right away, it's the other operations that need more effort and time.
Interacting with the CLI means we get to use Hashicorp's implementation, which is well-tested by the community, and not spend cycles reimplementing it.
Since the providers are designed to be consumed only by Terraform CLI, they are built for its specific needs, and whatever the CLI doesn't need is not exposed. For example, the `ResourceDiff` struct has many fields and functions that are not exposed. So the implementation we'd end up writing if we went with the other two options would be very similar to what Terraform CLI has already implemented.
Additionally, we see that some providers hide everything under an `internal` package. This makes it essential to fork the code for the importing option. In the other options, we still need to import the provider to get the schema used for generating the CRD types, but that happens at development time, so we can have workarounds to let the generator access the schema.
As explained in detail below, we'll have an interface that includes all operations a usual Crossplane controller needs on top of ExternalClient, like IsUpToDate, GetConnectionDetails, etc. We'll hide all interaction with Terraform behind that interface, and after we get the initial XRM-compliant coverage, we can get back to the shortcomings of this approach and assess whether it'd be worth switching to the provider-import option. If we decide to change, there will certainly be more than the underlying implementation to change, but the cost will be lowered by that interface, and by that time we'd already have the breadth of coverage our users need.
Overall, the CLI is not the cleanest option, but it is the one that will get us the breadth of coverage we want in a short amount of time, with a good stability promise, and then the chance to iterate and see what works best for Crossplane users.
Conceptually, there are two main parts to be implemented as part of this design. One is the code generator, and the other is the common tooling that will be used by the generated code.
As opposed to usual code generation tools that are fairly narrowly scoped, a code generator for a fully-fledged controller has several moving parts, which makes its configuration more complex. Hence, we need a medium that supports complex, provider- and/or resource-specific statements so that provider developers are able to introduce unusual configurations and corner cases. In order to achieve that, we'll have a repository with code generation tooling, and every provider will have its own pipeline defined in cmd/generator/main.go. For the most part, it will be a struct with simple inputs like the Terraform Provider URL and such, but when we need complex configuration, we'll have all the capabilities of Go. All code generator utilities to be used in that Go program will live in the crossplane/terrajet repository, together with the common tools that will be imported by the generated provider code.
Every Terraform provider exposes a `*schema.Schema` object per resource type that carries all information related to each field, such as whether it's immutable, computed, required, etc. In our pipeline, we will iterate through each `*schema.Schema` recursively and generate spec and status structs and fields; see the Terraform AWS Provider for an example of that struct. Then we will use the standard Go types library to encode the information and print it using standard Go templating tooling. These structs will need to be convertible to and from JSON with different keys depending on where they are used. An example output will look like the following:
```go
// VPCParameters define the desired state of an AWS Virtual Private Cloud.
type VPCParameters struct {
	// Region is the region you'd like your VPC to be created in.
	Region *string `json:"region" tf:"region"`
	// CIDRBlock is the IPv4 network range for the VPC, in CIDR notation. For
	// example, 10.0.0.0/16.
	// +kubebuilder:validation:Required
	// +immutable
	CIDRBlock string `json:"cidrBlock" tf:"cidr_block"`
	// A boolean flag to enable/disable DNS support in the VPC
	// +optional
	EnableDNSSupport *bool `json:"enableDnsSupport,omitempty" tf:"enable_dns_support"`
	// Indicates whether the instances launched in the VPC get DNS hostnames.
	// +optional
	EnableDNSHostNames *bool `json:"enableDnsHostNames,omitempty" tf:"enable_dns_host_names"`
	// The allowed tenancy of instances launched into the VPC.
	// +optional
	InstanceTenancy *string `json:"instanceTenancy,omitempty" tf:"instance_tenancy"`
}
```
As you might have noticed, the field tags Terraform uses are snake case, whereas the JSON tags we use should be camel case and the Go field names should be upper camel case. A small utility for converting back and forth between these strings will be introduced. The source of truth will always be the snake case string from the Terraform Provider schema.
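A minimal sketch of such a utility, hand-rolled and ignoring initialisms (which the real implementation would need to handle), could look like this:

```go
package conversion

import "strings"

// ToCamelCase converts a snake_case Terraform attribute name into a
// lowerCamelCase JSON field name, e.g. "enable_dns_support" becomes
// "enableDnsSupport".
func ToCamelCase(snake string) string {
	parts := strings.Split(snake, "_")
	for i := 1; i < len(parts); i++ {
		if parts[i] == "" {
			continue
		}
		parts[i] = strings.ToUpper(parts[i][:1]) + parts[i][1:]
	}
	return strings.Join(parts, "")
}

// ToSnakeCase does the reverse, e.g. "enableDnsSupport" becomes
// "enable_dns_support".
func ToSnakeCase(camel string) string {
	var b strings.Builder
	for i, r := range camel {
		if r >= 'A' && r <= 'Z' {
			if i > 0 {
				b.WriteByte('_')
			}
			r = r + ('a' - 'A')
		}
		b.WriteRune(r)
	}
	return b.String()
}
```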
We will introduce an additional tag key called `tf` in order to store the field name used in the Terraform schema, so that conversions don't require strongly-typed functions to be generated. Namely, the following mechanisms will be used for each:
* Spec -> API Input: `jsoniter.Marshal` (using the `tf` key)
* Target SDK Struct -> Spec Late Initialization: `jsoniter.Unmarshal` (using the `tf` key) into an empty `Parameters` object, then merge into the existing `Parameters` object using the reflect library to late-initialize.
* API Output -> Status: `jsoniter.Unmarshal` (using the `tf` key) on the output of `terraform show --json`
* Spec <-> API Output Comparison: `terraform plan --json` to see if an update is needed

We will have a single implementation of ExternalClient that will be used across all providers built on top of Terraform. However, since connection details differ between provider APIs, there will be a single ExternalConnecter struct per provider, implemented manually, which will instantiate the generic ExternalClient implementation for each CRD.
The Terraform controller will have two main parts:
* Scheduler: manages all interactions with Terraform in a non-blocking fashion.
* Reconciler: implements ExternalClient and talks with the scheduler.

The Scheduler will be responsible for managing all interactions related to Terraform in a non-blocking fashion. While it's known that the underlying implementations will be interacting with Terraform, we'd like to hide the fact that it's talking with the CLI so that we have the option to replace it with another mechanism in the future. So, the interface will have functions that represent granular Crossplane operations instead of CLI calls.
A rough sketch of that interface could look like the following:
```go
type Scheduler interface {
	Exists(ctx context.Context, mg resource.Managed) (bool, error)
	UpdateStatus(ctx context.Context, mg resource.Managed) error
	LateInitialize(ctx context.Context, mg resource.Managed) (bool, error)
	IsReady(ctx context.Context, mg resource.Managed) (bool, error)
	IsUpToDate(ctx context.Context, mg resource.Managed) (bool, error)
	GetConnectionDetails(ctx context.Context, mg resource.Managed) (resource.ConnectionDetails, error)
	Apply(ctx context.Context, mg resource.Managed) (*ApplyResult, error)
	Delete(ctx context.Context, mg resource.Managed) (*DeletionResult, error)
}

type ApplyResult struct {
	// Tells whether the apply operation is completed.
	Completed bool
	// Sensitive information that is available during creation/update.
	ConnectionDetails resource.ConnectionDetails
}
```
While the interface is generic, if we decide to change the underlying implementation, it's likely that some things above that interface would have to change as well. But investing in this abstraction is worth it now because the cost is low, given that the same implementation would live in the reconciler anyway if it were not put under this interface.
We call it a reconciler for lack of a better word, but in reality it's just an implementation of the existing ExternalClient; hence the actual reconciler is the generic managed reconciler that is used by all managed resource controllers.
The main responsibility of the reconciler is to utilize the functions of the scheduler to implement Crossplane functionality. When we look at what a fully-fledged ExternalClient implementation does, together with the functionality in the managed reconciler, the following list is what needs to be handled by the generated provider:
* Status update: copy the `Show` result into status.
* Late initialization: copy the `Show` result into spec and late-init with the existing spec using reflection.
* Readiness check: check whether the `Apply` operation is completed.
* Up-to-date check: use the `Plan` function.
* Connection details on observation: `sensitive_attributes` in the `Show` result will be published.
* Creation: `Apply`; `sensitive_attributes` in the `Show` result will be published.
* Update: `Apply`; `sensitive_attributes` in the `Show` result will be published.
* Deletion: `Destroy`.
* External name: the `Id()` and `SetId()` functions of `ResourceData` can be used, with the fallback of accepting a fieldpath in the configuration object of the code generator.
* Default tags: most resources have a `tags` field; a generic utility to check whether the field exists and, if so, add the default tags will cover this functionality.

Since the CRDs are generated using the schema of the Terraform resource, copying the related portions of tfstate to spec and status will let us store the state in etcd. In each reconciliation, we'll construct the tfstate from the fields of the custom resource instance.
There are two exceptions though: the resource id and sensitive attributes. We'll store the resource id as the external name of the resource, i.e. metadata.annotations[crossplane.io/external-name], and the sensitive attributes will be stored in the connection detail secret of the resource. For sensitive inputs, we'll have a mechanism to add a SecretKeySelector for that field and use it to get the input from a Secret the user points to.
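For illustration, a sensitive input such as a database password could be represented like this; the parameters struct is hypothetical, while SecretKeySelector is the existing crossplane-runtime type:

```go
package v1alpha1

import xpv1 "github.com/crossplane/crossplane-runtime/apis/common/v1"

// DatabaseParameters is a hypothetical parameters struct with a sensitive
// input. Instead of generating a plain `password` string field from the
// Terraform schema, the pipeline would emit a secret reference; the
// controller reads the referenced Secret key and injects the plain value
// into the Terraform configuration it renders.
type DatabaseParameters struct {
	// PasswordSecretRef points to the Secret key that holds the value of the
	// `password` attribute in the Terraform schema.
	PasswordSecretRef xpv1.SecretKeySelector `json:"passwordSecretRef"`
}
```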
Crossplane tries to match the corresponding provider API as closely as possible, with a clear separation, i.e. a dedicated spec.forProvider. While Terraform also does that for most resources, there are occasionally fields in the Terraform schema of a resource that configure the behavior of Terraform execution rather than only the request made to the provider API. For example, force_destroy is a field in the S3 Bucket Terraform schema but there is no corresponding API call; Terraform just deletes all objects in the bucket before calling deletion.
For such cases, we'll expose configuration points for provider authors to provide per-resource exception information. They will be able to remove/add specific fields and manipulate their JSON tags or code comments. Since authors will call the pipeline in a Go context, they will be able to reuse the same exception information for many resources if it's generic enough.
It is still to be decided whether we'll keep the current cross-resource referencing mechanism.
In order to reference information that exists on another resource, we need the following in addition to the name of the target resource:
* the kind and group of the target resource,
* the field path of the information on the target resource,
* and, if needed, a transformation to apply to the value before injecting it.
With our current cross-resource referencing mechanism, developers are able to provide this information at development time, except for the name of the target resource. For example, in order to reference a VPC, a Crossplane user only needs to give the name of the VPC, like the following:
```yaml
spec:
  forProvider:
    vpcIdRef:
      name: my-little-vpc
```
In order to achieve the same simple API, we need to be able to gather the identification information listed above. Unfortunately, Terraform's referencing mechanism is very generic, in the sense that a resource implementation doesn't know what it can reference or whether it's referenced by another object.
Terraform Provider authors need to define a separate schema that includes values that might be needed by others, which is called a `data_source` instead of a `resource`. Additionally, HCL as a language provides primitives for string manipulation that make it easier to write ad-hoc manipulation functions compared to YAML.
So this information is something for which we need human input as a customization of the code generation pipeline. In the code generator implementation of providers, i.e. in cmd/generator/main.go, developers will specify which field of a specific resource references what, and how the transform should be handled in Go. Then the code generator will add the necessary reference & selector fields and generate the ResolveReferences function of that resource. Since the medium is Go, developers will have full flexibility in specifying this information, as sketched below.
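A hypothetical sketch of what such per-resource reference configuration could look like; the type and field names are illustrative, not terrajet's actual API:

```go
package main

// ReferenceConfig describes a single cross-resource reference that the
// generator should wire up for a resource; a hypothetical configuration type.
type ReferenceConfig struct {
	// Field is the Terraform attribute holding the referenced value,
	// e.g. "vpc_id" on an aws_subnet.
	Field string
	// TargetKind is the managed resource kind being referenced, e.g. "VPC".
	TargetKind string
	// Extract returns the value to inject given the referenced resource's
	// external name; being plain Go, any transformation can be applied.
	Extract func(externalName string) string
}

// subnetReferences would instruct the generator to add VPCIDRef/VPCIDSelector
// fields to the Subnet CRD and to generate its ResolveReferences function.
var subnetReferences = []ReferenceConfig{{
	Field:      "vpc_id",
	TargetKind: "VPC",
	Extract:    func(name string) string { return name },
}}

func main() {
	// The real pipeline would consume subnetReferences alongside the rest of
	// the generator configuration.
	_ = subnetReferences
}
```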
There are a few risks associated with interacting with Terraform CLI. Firstly, executing Terraform CLI requires the provider binary to exist in its workspace folder. Additionally, each execution brings up a separate provider server to talk to. We'll investigate how to make all Terraform invocations use the same provider process. But in the worst case, we'd need to manage a pool of workspace folders and drop the main.tf and terraform.tfstate files of the managed resource into place when a reconciliation request comes in, as sketched below.
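A rough sketch of such a per-resource workspace, assuming we shell out to the terraform binary with os/exec; the Workspace type and file layout are illustrative:

```go
package terraform

import (
	"context"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// Workspace is a hypothetical per-resource working directory that holds the
// rendered main.tf.json and the terraform.tfstate reconstructed from the
// custom resource on every reconciliation.
type Workspace struct {
	Dir string
}

// Apply writes the configuration and state into the workspace and shells out
// to the Terraform CLI. Output parsing, timeouts and provider binary caching
// are omitted; a real implementation would handle all of them.
func (w Workspace) Apply(ctx context.Context, mainTFJSON, tfstate []byte) error {
	if err := os.WriteFile(filepath.Join(w.Dir, "main.tf.json"), mainTFJSON, 0o600); err != nil {
		return err
	}
	if err := os.WriteFile(filepath.Join(w.Dir, "terraform.tfstate"), tfstate, 0o600); err != nil {
		return err
	}
	for _, args := range [][]string{
		{"init", "-input=false"},
		{"apply", "-auto-approve", "-input=false"},
	} {
		cmd := exec.CommandContext(ctx, "terraform", args...)
		cmd.Dir = w.Dir
		if out, err := cmd.CombinedOutput(); err != nil {
			return fmt.Errorf("terraform %s: %w: %s", args[0], err, out)
		}
	}
	return nil
}
```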
Conceptually, Terraform modules correspond to composition in Crossplane. However, migrating a Terraform module to a composition means a whole rewrite. We could possibly generate a CRD at runtime from a given Terraform module schema and reconcile it using Terraform CLI. Hence, users would be able to bring their modules as-is but expose them as a managed resource that they can use directly or in a composition.
Some providers have declarative versions of the SDKs that their Terraform provider is built on top of. For example, Google has a declarative client library with strongly-typed structs. Such libraries handle all the necessary customizations and exceptions, so we could generate code to use those libraries instead of depending on Terraform in the future. However, not all providers have declarative libraries, AWS being one such example.
The earlier effort to build a code generator for Terraform providers made the choice of interacting with provider servers via gRPC. So, the logic that existed in Terraform CLI had to be reimplemented, and in some cases, like cty conversion, it used code generation instead of generic marshaling like Terraform CLI does. Additionally, because that effort wasn't finished, it doesn't cover all Terraform logic; hence there are places where we'd need to keep implementing functionality of the CLI.
Overall, the cost of writing the pipeline from scratch targeting the CLI is lower than the cost of understanding the existing code, spotting the broken parts, adopting it and making it work today.
The main reason we don't target provider SDKs directly is the whole suite of exceptional cases that need to be handled and customizations that should be implemented per resource. The Terraform community has done that for years with hand-crafted code, and no other tool has as much coverage as Terraform.
There are many projects in the infrastructure space that build on top of Terraform. Each of these projects has its own limitations, additional features and different license restrictions.