design/design-doc-provider-strategy.md
Crossplane providers have been developed since the inception of the project using the published APIs of the cloud providers, git servers and any software that exposes an API. However, the coverage of the providers hasn't been at the level we'd like it to be. In many cases, users opt to add support for the resources they need themselves, depending on how automated the process is. But there are many users who don't have the means to make that contribution and instead check back at a later date to see if their needs are met.
In order to lower the cost of adding a new resource and reach a critical mass that will increase the pace of the contributions to keep up with users, we have decided to invest in a project called Terrajet that will let provider developers generate the CRDs and use a common runtime that wraps Terraform CLI for its operations. This way we're able to add support for a resource in a matter of minutes.
With great power comes great responsibility. Now that we are able to generate Terrajet-based providers, what to do with the existing providers is a question that we need to answer sooner rather than later.
This document summarizes the current landscape of tools that we can imagine being used in the providers in the long term, and then proposes a strategy that will inform our next steps with Terrajet providers as well as where these new tools sit in the big picture of the Crossplane ecosystem.
It's important to note that we have not made any decisions about using the tools here, we're analysing them as possible candidates to see what it'd take to use them as part of a native code generation pipeline and migrate to that.
In this section, we will examine each of the candidate tools that cloud providers maintain in addition to their Terraform providers. Each tooling section includes a list of metadata requirements that we have for full automation of the code generation. Note that the requirements are not blockers for building a code generator; they only reflect the effort needed to build it and the custom code that may be needed per resource. The more requirements are met, the easier the code generation implementation will be.
Note that we already have a native code generation pipeline built with AWS Controllers for Kubernetes (ACK).
AWS Cloud Control API is a new managed service
announced
by AWS. In its essence, it's what powers
CloudFormation, which is a
declarative API of AWS for managing cloud resources. It has many similarities to
how users interact with CloudFormation but it's not as heavyweight. The main
difference between the two is that CloudFormation allows you to manage
multiple resources in a single object called Stack whereas Cloud Control API
supports a single resource and the tracking is done via request tokens.
You can give it a try by using the script here.
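For illustration, here is a minimal Go sketch of the same create-and-poll flow using the `cloudcontrol` package of aws-sdk-go-v2. The desired-state JSON and the polling loop are simplified assumptions rather than an excerpt from that script.

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/aws/aws-sdk-go-v2/aws"
	"github.com/aws/aws-sdk-go-v2/config"
	"github.com/aws/aws-sdk-go-v2/service/cloudcontrol"
	"github.com/aws/aws-sdk-go-v2/service/cloudcontrol/types"
)

func main() {
	ctx := context.Background()
	cfg, err := config.LoadDefaultConfig(ctx)
	if err != nil {
		panic(err)
	}
	cc := cloudcontrol.NewFromConfig(cfg)

	// CreateResource is asynchronous; it returns a progress event carrying the
	// request token that is used to track the operation.
	out, err := cc.CreateResource(ctx, &cloudcontrol.CreateResourceInput{
		TypeName:     aws.String("AWS::ECR::Repository"),
		DesiredState: aws.String(`{"RepositoryName": "sample-repository"}`),
	})
	if err != nil {
		panic(err)
	}

	// Poll the request token until the operation reaches a terminal status.
	for {
		st, err := cc.GetResourceRequestStatus(ctx, &cloudcontrol.GetResourceRequestStatusInput{
			RequestToken: out.ProgressEvent.RequestToken,
		})
		if err != nil {
			panic(err)
		}
		switch st.ProgressEvent.OperationStatus {
		case types.OperationStatusSuccess:
			fmt.Println("created:", aws.ToString(st.ProgressEvent.Identifier))
			return
		case types.OperationStatusFailed:
			panic(aws.ToString(st.ProgressEvent.StatusMessage))
		}
		time.Sleep(2 * time.Second)
	}
}
```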
Here are some high-level notes about how Cloud Control behaves:
Metadata requirements:
- `primaryIdentifier`
- `readOnlyProperties`
- `createOnlyProperties`

The schema of the resources used in Cloud Control is exactly the same as CloudFormation's.
The following is a list of validated example YAMLs of the same resource in different kinds of providers:
```yaml
# Native provider-aws
apiVersion: ecr.aws.crossplane.io/v1alpha1
kind: Repository
metadata:
  name: example
  labels:
    region: us-east-1
spec:
  forProvider:
    region: us-east-1
    imageScanningConfiguration:
      scanOnPush: true
    imageTagMutability: IMMUTABLE
    tags:
      - key: key1
        value: value1
```

```yaml
# Terrajet provider-jet-aws
apiVersion: ecr.aws.jet.crossplane.io/v1alpha1
kind: Repository
metadata:
  name: sample-repository
spec:
  forProvider:
    region: us-east-1
    imageScanningConfiguration:
      - scanOnPush: true
    imageTagMutability: IMMUTABLE
    tags:
      key1: value1
```

```yaml
# Cloud Control API input
ImageScanningConfiguration:
  ScanOnPush: true
ImageTagMutability: IMMUTABLE
RepositoryName: sample-repository # field is marked as identifier in the schema.
```

```yaml
# Possible CloudControl Implementation using provided schema
apiVersion: ecr.aws.crossplane.io/v1alpha1
kind: Repository
metadata:
  name: sample-repository
spec:
  forProvider:
    region: us-east-1
    imageScanningConfiguration:
      - scanOnPush: true
    imageTagMutability: IMMUTABLE
    tags:
      - key: key1
        value: value1
```
As you might have noticed, the schemas are very similar except for the tags field,
where the Terraform maintainers apparently made a deliberate decision to improve
UX while Cloud Control stuck with how the AWS SDK represents tags, similar to our
native provider.
A more complex resource would be an RDS Instance. The example YAML isn't
validated with Cloud Control API since there is no support for that resource yet
but it works with CloudFormation. See the comparison snippet
here. Two things to note there are the autogeneratePassword field, which we
implemented in the native provider to improve UX, and the skipFinalSnapshot
parameter, which is not included in the Cloud Control schema since it's a
parameter for deletion. Other than these two, native and Cloud Control are very
similar, and the Terraform one has different names for a subset of the fields.
Google DCL is a declarative Go library that declarative infra tools such as Terraform, Ansible and Pulumi use instead of the low-level SDK. While Cloud Control and Azure Resource Manager can also be called using provider SDKs, this library can be considered a separate SDK with all the functionality and strongly-typed structs it provides, as opposed to the JSON blob that the other two use as the medium of configuration.
By looking at the codebase, it's clear that the first user of this library was Terraform, since a lot of the config options and the way the functions are supposed to be called are optimized for how Terraform works. One notable example is that all calls are synchronous and blocking, just like Terraform.
You can give it a try by using the Go program here.
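As a rough sketch of what calling DCL directly could look like, assuming the package layout of the declarative-resource-client-library repository; the config options, helper functions and struct fields shown here may differ between DCL versions and should be treated as assumptions:

```go
package main

import (
	"context"
	"fmt"

	"github.com/GoogleCloudPlatform/declarative-resource-client-library/dcl"
	"github.com/GoogleCloudPlatform/declarative-resource-client-library/services/google/compute"
)

func main() {
	ctx := context.Background()

	// Credentials are assumed to come from the environment; a production setup
	// would typically pass an authenticated HTTP client into the config.
	cfg := dcl.NewConfig(dcl.WithUserAgent("crossplane-dcl-sketch"))
	client := compute.NewClient(cfg)

	desired := &compute.Network{
		Name:                  dcl.String("example"),
		Project:               dcl.String("my-project"), // hypothetical project ID
		AutoCreateSubnetworks: dcl.Bool(false),
	}

	// Apply is synchronous and blocking, just like Terraform's usage of DCL:
	// it creates the network or updates it to match the desired state.
	observed, err := client.ApplyNetwork(ctx, desired)
	if err != nil {
		panic(err)
	}
	if observed.SelfLink != nil {
		fmt.Println("self link:", *observed.SelfLink)
	}
}
```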
Here are some high-level notes about how Google DCL works:

- compute APIs are on DCL, but Instance is not.
- container APIs are not on DCL at all; there are no Go files in DCL for that group.
- For Instance there is the schema file but no Go files.
- `delete_default_routes_on_create` in Network.

Metadata requirements:

- `ID()` function similar to TF.
- `x-dcl-references`
- `required`
- `readOnly`
- `immutable`
- `enum`

The schema the Terraform Google Provider uses is exactly the same schema as DCL, with some minor exceptions. The following is a list of example YAMLs in different contexts:
```yaml
# Crossplane Native GCP Provider
apiVersion: compute.gcp.crossplane.io/v1beta1
kind: Network
metadata:
  name: example
spec:
  forProvider:
    autoCreateSubnetworks: false
    routingConfig:
      routingMode: REGIONAL
```

```yaml
# Possible provider-jet-gcp, written by looking at TF schema
apiVersion: compute.gcp.crossplane.io/v1beta1
kind: Network
metadata:
  name: example
spec:
  forProvider:
    autoCreateSubnetworks: false
    routingConfig:
      routingMode: REGIONAL
    deleteDefaultRoutesOnCreate: false # This field is added by Terraform maintainers.
```

```yaml
# Possible Crossplane Provider built with DCL
apiVersion: compute.gcp.crossplane.io/v1beta1
kind: Network
metadata:
  name: example
spec:
  forProvider:
    autoCreateSubnetworks: false
    routingConfig:
      routingMode: REGIONAL
```
For a more complex resource such as GKE Cluster, see this YAML
file.
Azure Resource Manager is the name of the control plane used for all Azure resource provisioning and management tasks. You can use service-specific API endpoints to manage resources and the Azure SDK has many packages with strong-typed structs that target those APIs. It's closer to GCP API where all operations are resource-based as opposed to AWS API where most of the endpoints are verb-based.
Azure Resource Manager Template is a special endpoint that allows management of multiple resources in a dynamic way, similar to AWS CloudFormation. The ARM Template API is what powers Azure Service Operator (ASO) v2.
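As a hedged sketch of what driving the ARM deployment API from Go could look like, using the `armresources` package of the Azure SDK for Go; the subscription ID, resource group, deployment name and the empty template body are placeholders:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/Azure/azure-sdk-for-go/sdk/azcore/to"
	"github.com/Azure/azure-sdk-for-go/sdk/azidentity"
	"github.com/Azure/azure-sdk-for-go/sdk/resourcemanager/resources/armresources"
)

func main() {
	ctx := context.Background()
	cred, err := azidentity.NewDefaultAzureCredential(nil)
	if err != nil {
		panic(err)
	}
	client, err := armresources.NewDeploymentsClient("<subscription-id>", cred, nil)
	if err != nil {
		panic(err)
	}

	// An ARM template can declare one or more resources in a single deployment;
	// the resources array is left empty here as a placeholder.
	var template map[string]interface{}
	if err := json.Unmarshal([]byte(`{
	  "$schema": "https://schema.management.azure.com/schemas/2019-04-01/deploymentTemplate.json#",
	  "contentVersion": "1.0.0.0",
	  "resources": []
	}`), &template); err != nil {
		panic(err)
	}

	// Deployments are asynchronous; the returned poller tracks the operation
	// until it reaches a terminal state.
	poller, err := client.BeginCreateOrUpdate(ctx, "example-rg", "example-deployment",
		armresources.Deployment{
			Properties: &armresources.DeploymentProperties{
				Template: template,
				Mode:     to.Ptr(armresources.DeploymentModeIncremental),
			},
		}, nil)
	if err != nil {
		panic(err)
	}
	res, err := poller.PollUntilDone(ctx, nil)
	if err != nil {
		panic(err)
	}
	fmt.Println("provisioning state:", *res.Properties.ProvisioningState)
}
```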
A few high-level notes about how it works:
Metadata requirements:
- `x-ms-secret`, though some fields like `kubeconfig` are not marked as secret.
- `required`
- `readOnly`

Since all tooling uses the same spec as the source of truth for their code generation, the schemas are the same with minor differences.
The following examples are for PostgreSQL Server.
```yaml
# Crossplane Native Provider
apiVersion: database.azure.crossplane.io/v1beta1
kind: PostgreSQLServer
metadata:
  name: example-psql
spec:
  forProvider:
    administratorLogin: myadmin
    resourceGroupName: example-rg
    location: West US 2
    minimalTlsVersion: TLS12
    sslEnforcement: Disabled
    version: "11"
    sku:
      tier: GeneralPurpose
      capacity: 2
      family: Gen5
    storageProfile:
      storageMB: 20480
```

```yaml
# Terrajet provider-jet-azure
apiVersion: postgresql.azure.jet.crossplane.io/v1alpha1
kind: PostgresqlServer
metadata:
  name: example
spec:
  forProvider:
    name: example-psqlserver
    resourceGroupName: example
    location: "East US"
    administratorLogin: "psqladminun"
    skuName: "GP_Gen5_4" # different than native where 3 fields are used.
    version: "11"
    storageMb: 640000 # different than native where this is given under storageProfile
    publicNetworkAccessEnabled: true # schema same, but not supported in our native provider yet.
    sslEnforcementEnabled: true # in native, enum Disabled/Enabled instead of boolean
    sslMinimalTlsVersionEnforced: "TLS1_2" # named differently as minimalTlsVersion
```

```yaml
# Possible ARM implementation example,
# written by looking at https://docs.microsoft.com/en-us/azure/templates/microsoft.dbforpostgresql/servers?tabs=json
# Exactly same as native implementation.
apiVersion: database.azure.crossplane.io/v1alpha1
kind: PostgreSQLServer
metadata:
  name: example-psql
spec:
  forProvider:
    administratorLogin: myadmin
    resourceGroupName: example-rg
    location: West US 2
    minimalTlsVersion: TLS12
    sslEnforcement: Disabled
    version: "11"
    sku:
      tier: GeneralPurpose
      capacity: 2
      family: Gen5
    storageProfile:
      storageMB: 20480
```
The Terraform providers of the big three clouds differ in the heterogeneity of the APIs they use:
In terms of coverage, the best bet is still the Terraform providers, except for Azure where the ARM Template API has full coverage. At the same time, AWS CloudFormation has the same coverage level, too, if we were to include it. So, while it depends on how you look at it, the Terraform providers have all of them unified and have filled the gaps in the schemas whenever needed.
In terms of the shape of the APIs we build, you can see from the examples that the closer we stay to the lower-level API, the more similar the field names and structures we get. But it seems like this is mostly due to the different versioning practices followed. For example, we see that Azure TF has different names and formats for some of the fields because it didn't change its schema when they updated the SDK they use; instead they implemented manual pairings so as not to break users, since TF provides a single-version HCL schema to users whereas Azure versions their schemas by date. Overall, the differences are mostly due to TF not wanting to break users with the new schemas published by the clouds.
Another aspect of the API shape discussion is the additional properties that TF
providers decide to add. For example, GCP Network has a
delete_default_routes_on_create property in the TF provider that doesn't exist in
the GCP API. Even though it's very rare, the Crossplane native provider has similar
additions as well. For example, in order to avoid a race condition, we removed
the nodepool section from the GKE cluster to have people create NodePools
separately, and that made us remove the default node pool at every creation,
which is something users toggle in TF via the remove_default_node_pool property.
Metadata requirements:
- `Sensitive`
- `Required`
- `Computed`

One of the most important aspects of the schema discussions is what is
considered a resource or a property by each tool. While we can always handle
property name, value and structural differences with Kubernetes API versioning
tools, it's more challenging to handle the case where a single resource is
defined as two or more separate resources in the new version, or vice versa.
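As a sketch of what those Kubernetes API versioning tools buy us: a field rename between two versions of the same CRD, similar to the minimalTlsVersion vs. sslMinimalTlsVersionEnforced difference shown earlier, can be absorbed with controller-runtime's conversion machinery. The PostgreSQLServer types below are assumed to be generated API types living in sibling version packages; only the conversion methods are shown.

```go
package v1alpha1

import (
	"sigs.k8s.io/controller-runtime/pkg/conversion"

	// Hypothetical newer version of the same API group; its PostgreSQLServer
	// is assumed to implement conversion.Hub.
	"example.com/provider-azure/apis/database/v1alpha2"
)

// ConvertTo converts this v1alpha1 PostgreSQLServer (Terraform-shaped field
// names) to the v1alpha2 hub version (native-shaped field names).
func (src *PostgreSQLServer) ConvertTo(dstRaw conversion.Hub) error {
	dst := dstRaw.(*v1alpha2.PostgreSQLServer)
	dst.ObjectMeta = src.ObjectMeta
	// A rename such as sslMinimalTlsVersionEnforced -> minimalTlsVersion is a
	// straight copy; differing value formats would need explicit mapping here.
	dst.Spec.ForProvider.MinimalTLSVersion = src.Spec.ForProvider.SSLMinimalTLSVersionEnforced
	return nil
}

// ConvertFrom converts from the v1alpha2 hub version back to v1alpha1.
func (dst *PostgreSQLServer) ConvertFrom(srcRaw conversion.Hub) error {
	src := srcRaw.(*v1alpha2.PostgreSQLServer)
	dst.ObjectMeta = src.ObjectMeta
	dst.Spec.ForProvider.SSLMinimalTLSVersionEnforced = src.Spec.ForProvider.MinimalTLSVersion
	return nil
}
```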
The most notable example of such a resource boundary difference is the Bucket
resource in AWS. From the SDK perspective, Bucket has a very limited set of
properties and you configure other aspects such as CORSRule with different
calls. However, both CloudFormation and the AWS TF provider make a deliberate
decision to include those calls in their Bucket definition, and there is no
clear metadata in the SDK that makes this decision for you.
From a resource definition perspective, each cloud deserves its own summary:
Currently, there is a single stable and recommended TF provider for each of the big three clouds:
With the analysis we have done, it seems that the creation of a new provider for Cloud Control was more of an exception than a strategy that Terraform follows when it comes to what tooling a provider uses. For example, the Google provider uses DCL and the SDK at the same time.
We can have the following set of providers to cover the three big clouds:
The main advantages of this approach are:
The main disadvantages:
We cannot use different tools for the same CRD at the same apiVersion, which
includes group and version. So, we need a separation level on either group or
version level. The most important thing to look for is the migration path for
implementation changes. For example, there can be a bug in Azure TF provider
that we can't solve and we might want to switch to native implementation. The
choice we make here should lower the cost of this switch as much as possible
with the best user experience.
We can separate the CRDs by the tool their managed reconciler uses on group level like:
- `s3.aws.jet.crossplane.io/v1alpha1` for Terrajet S3
- `s3.aws.cc.crossplane.io/v1alpha1` for Cloud Control S3
- `s3.aws.crossplane.io/v1alpha1` for native S3

Since the separation is on group level, it may seem like we can have a different
Bucket CRD in each group, but that's not really feasible today since
cross-resource referencing works only with a single target type, i.e.
spec.s3Import.bucketNameRef can only target a single type, because if it
targeted multiple types then there could be more than a single candidate for
resolution with the same name. The sketch below illustrates this constraint.
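The following sketch shows that constraint with crossplane-runtime's reference resolver; the RestoreJob type and its bucketName fields are hypothetical stand-ins for a generated managed resource, and the point is that `reference.To` names exactly one candidate type.

```go
package sketch

import (
	"context"

	"github.com/crossplane/crossplane-runtime/pkg/reference"
	"sigs.k8s.io/controller-runtime/pkg/client"

	// Hypothetical API package that defines the Bucket managed resource.
	"example.com/provider-aws/apis/s3/v1beta1"
)

// ResolveReferences resolves spec.forProvider.bucketNameRef/Selector into
// spec.forProvider.bucketName for the hypothetical RestoreJob resource.
func (mg *RestoreJob) ResolveReferences(ctx context.Context, c client.Reader) error {
	r := reference.NewAPIResolver(c, mg)

	rsp, err := r.Resolve(ctx, reference.ResolutionRequest{
		CurrentValue: reference.FromPtrValue(mg.Spec.ForProvider.BucketName),
		Reference:    mg.Spec.ForProvider.BucketNameRef,
		Selector:     mg.Spec.ForProvider.BucketNameSelector,
		// Only a single managed type (and its list type) can be targeted, so a
		// Bucket kind that exists in several API groups would be ambiguous here.
		To:      reference.To{Managed: &v1beta1.Bucket{}, List: &v1beta1.BucketList{}},
		Extract: reference.ExternalName(),
	})
	if err != nil {
		return err
	}
	mg.Spec.ForProvider.BucketName = reference.ToPtrValue(rsp.ResolvedValue)
	mg.Spec.ForProvider.BucketNameRef = rsp.ResolvedReference
	return nil
}
```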
The main advantages:
The main disadvantages:
We can separate the CRDs by the tool their managed reconciler uses on version level like:
- `s3.aws.crossplane.io/v1alpha1` for the initial introduction of the S3 resource.
- `s3.aws.crossplane.io/v1alpha2` when we decide to switch to another tool, either Cloud Control or native SDK.
- `s3.aws.crossplane.io/v1beta1` for when we feel it's mature enough to support on beta level.

The main advantages:
The main disadvantages:
For each of the big three cloud providers, we have a long-term plan of using their own tools to the extent possible, e.g. Cloud Control for AWS, DCL for GCP and possibly ARM Templates for Azure. However, their Terraform providers are our best short-to-medium-term bet due to the vast coverage and maturity they provide. That was one of our main motivations behind Terrajet.
Given that we know the underlying tool will change at some point in the future, we need to think about how to manage that change with the least amount of burden on users to do the migration and on maintainers to provide the tools for it and maintain the whole process. In addition, it is important that the migration process be light-weight for the cases where we'd like to switch to another implementation because of a bug in a Terraform provider.
In that regard, Option C where we separate on version level sounds like the best trade-off due to the following reasons:
The course of action would look roughly like the following:
v1beta1 and upwards.

With this strategy, we'll be able to have full coverage in a very short time and continue improving those resources in a single provider.
The main risk of this approach is what happens if we end up in a state where we just cannot have an automated migration to the new CRD. In such cases, we can provide the manual instructions/scripts that we'd have to provide for every migration in the other two options. For example, if the resource definition changes with the tooling change, or even with a change in the source API, the migration effort will be similar to the other two options.
Another thing to keep in mind as a drawback is that debugging will be harder for users, since error messages could look different if they are returned before making it to the API, e.g. Terraform errors. Additionally, when they dive into the code, they will see different logic depending on the tool used in the controller. It's not as bad as looking at Terraform code vs. the native SDK, since whatever tool we use, we build it on top of the same managed reconciler that powers all Crossplane managed resource controllers. However, it's still something to keep in mind.
We have decided to take Option A, where Terrajet-based providers reside in their own repositories, completely separate from the native providers.
The main drivers of this decision are:
The costs that we will need to account for with this decision are roughly: