rfd/0134-cloud-first-deployment.md
This RFD describes the Cloud-first deployment approach, as well as the release process changes required to support it.
Our current feature delivery process is not compatible with the company's shift to "SaaSify" Teleport. The new (or rather, extended) Cloud-focused process aims to address shortcomings like:

- major features reaching Cloud customers with a month's delay,
- a 2-3 hour release build that makes multiple deploys a day unrealistic,
- engineers not knowing when their changes actually get deployed to Cloud.
Before diving into the new process, let's set the stage with a high-level overview of the current release process.
The current release versioning scheme follows semver, with quarterly major releases and monthly minor releases.

Major releases are rolled out to Cloud quarterly, with a month's delay. Because major features are reserved for major releases, Cloud customers receive major features with at least a month's delay.

The release process is triggered by the creation of a GitHub tag and builds the full suite of artifacts, which takes 2-3 hours, so multiple deploys a day are unrealistic.

Cloud upgrades happen once a week, to a release published the previous week, and engineers often don't have a clear idea of when their change is being deployed.
In the Cloud-first world, not every release deployed on Cloud will have a corresponding self-hosted release tag. For this reason, we can't reuse the current versioning scheme and apply it to both Cloud and self-hosted.
Instead, we want to get to a place where Cloud and self-hosted releases are decoupled, with Cloud releases being regularly cut from the main development branch and self-hosted releases cut from their respective release branches.
Although semantic versioning loses most of its meaning with a continuous release train, we need to keep it for Cloud releases so the automatic upgrades system we've built keeps working.
With this in mind, the Cloud version will consist of the base semver version with a `-cloud.X.Y` suffix, where `X` represents a timestamp of when the tag was cut and `Y` the corresponding commit hash: `v14.0.0-cloud.20230712T123633.aabbcc`.

Once it is time for the next self-hosted major release (every 3-4 months), a new release branch will be cut and the major component of the Cloud release version will increment.
```
                      14.0.0-alpha.1   14.0.0    14.0.1
branch/v14      |-----|-----------------|---------|-----....---->
                |
                |
   14.0.0-cloud.20230712T123633.aabbcc
       |        |      15.0.0-cloud...     15.0.0-cloud...      16.0.0-cloud...   16.0.0-cloud...
master |--....--|------|-------....--------|---------------|----|--------....-----|---------....---->
                                                           |
                                                           |
branch/v15                                                 |-----|-----------------|---------|---....---->
                                                                 15.0.0-alpha.1    15.0.0    15.0.1
```
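To make the scheme concrete, below is a minimal Go sketch of how such a tag could be constructed and checked. The `cloudTag` helper and the six-character short hash are assumptions for illustration, not the actual release tooling:

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/mod/semver"
)

// cloudTag builds a Cloud release tag of the form
// v<major>.0.0-cloud.<timestamp>.<short hash>, e.g.
// v14.0.0-cloud.20230712T123633.aabbcc.
func cloudTag(major int, commitTime time.Time, commitHash string) string {
	ts := commitTime.UTC().Format("20060102T150405")
	return fmt.Sprintf("v%d.0.0-cloud.%s.%s", major, ts, commitHash[:6])
}

func main() {
	tag := cloudTag(14, time.Date(2023, 7, 12, 12, 36, 33, 0, time.UTC), "aabbccddeeff")

	// The tag stays a valid semver prerelease, so the automatic
	// upgrades system can keep ordering releases with standard
	// semver comparison rules (the fixed-width timestamp makes
	// lexical prerelease ordering match chronological ordering).
	fmt.Println(tag, semver.IsValid(tag)) // v14.0.0-cloud.20230712T123633.aabbcc true
}
```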
Once a self-hosted release branch is cut, subsequent self-hosted minor/patch releases are published without any changes to the current process, but on a less frequent cadence than Cloud releases off of master. Changes to the release branches will follow the same "develop on master, backport to release branch" workflow we're using now.
The self-hosted release and promotion process is slow, taking 2-3 hours, because it builds the full suite of artifacts. Cloud users need most of these artifacts as well for the agents they run on their own infrastructure, but the Cloud control plane needs only a very limited subset. For this reason, the Cloud release and promotion process will be split in two parts: a rapid release covering only the control plane services (auth, proxy), and a full release producing the remaining artifacts. This will allow us to ship rapid control plane releases while still having the ability to push the remaining artifacts when needed.
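As an illustration of that split, here is a Go sketch of how the two workflows might select what to build. The artifact names and the `artifactsFor` helper are hypothetical; the real definitions will live in the release pipeline:

```go
package main

import "fmt"

// Hypothetical artifact sets for the two Cloud release workflows.
var (
	// Rapid path: only what the Cloud control plane (auth, proxy) deploys.
	controlPlaneArtifacts = []string{"teleport-oci-image"}

	// Full path: everything agents and clients on customer
	// infrastructure need, in addition to the control plane image.
	fullArtifacts = append(controlPlaneArtifacts,
		"deb-packages", "rpm-packages",
		"macos-packages", "windows-binaries",
		"helm-charts",
	)
)

// artifactsFor returns the artifact set a given release run should build.
func artifactsFor(fullRelease bool) []string {
	if fullRelease {
		return fullArtifacts
	}
	return controlPlaneArtifacts
}

func main() {
	fmt.Println(artifactsFor(false)) // rapid control plane release
	fmt.Println(artifactsFor(true))  // full artifact release
}
```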
Both workflows will run in GitHub Actions and will reuse existing bits of the full self-hosted release pipeline.
The Cloud release promotion process will be staged, with a different class of tenants being upgraded at each stage: first canary tenants, then staging, and finally production tenants.
The canary tenants will be a set of pre-created clusters, seeded with some data, that engineers will use for connecting their agents and testing their features. Canary clusters will be upgraded automatically once a Cloud release is cut.
Before the release is promoted further, engineers will test any significant change in the Cloud staging environment and then explicitly sign off that the change is ready for deployment by applying a label (e.g. `cloud/verified`) to the master PR corresponding to the change. The release manager will then approve the further rollout.
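One way the sign-off gate could be automated is for the promotion tooling to query the GitHub API for the label before allowing further rollout. A minimal sketch, assuming the `cloud/verified` label convention above, a `GITHUB_TOKEN` environment variable, and a placeholder PR number; this RFD does not prescribe the implementation:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"net/http"
	"os"
)

// hasVerifiedLabel reports whether the given PR carries the
// cloud/verified sign-off label. PRs are issues in the GitHub API,
// so the issue labels endpoint works for them.
func hasVerifiedLabel(ctx context.Context, prNumber int) (bool, error) {
	url := fmt.Sprintf("https://api.github.com/repos/gravitational/teleport/issues/%d/labels", prNumber)
	req, err := http.NewRequestWithContext(ctx, http.MethodGet, url, nil)
	if err != nil {
		return false, err
	}
	req.Header.Set("Authorization", "Bearer "+os.Getenv("GITHUB_TOKEN"))
	req.Header.Set("Accept", "application/vnd.github+json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return false, fmt.Errorf("listing labels: unexpected status %s", resp.Status)
	}

	var labels []struct {
		Name string `json:"name"`
	}
	if err := json.NewDecoder(resp.Body).Decode(&labels); err != nil {
		return false, err
	}
	for _, l := range labels {
		if l.Name == "cloud/verified" {
			return true, nil
		}
	}
	return false, nil
}

func main() {
	// 12345 is a placeholder PR number.
	ok, err := hasVerifiedLabel(context.Background(), 12345)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("cloud/verified present:", ok)
}
```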
Once the change is rolled out to production tenants, engineers will be expected to monitor error rates and Cloud metrics relevant to their changes.
Let's now look at some of the edge case scenarios in the release process and how we'll be handling them.
Teleport's component compatibility guarantees become somewhat obsolete in a Cloud-first world. Historically, we've been reserving behavior-incompatible changes for major releases. The concept of a major release, however, becomes non-existent in the Cloud-first world where any change, backwards compatible or not, gets deployed to all users within days.
When applied to the Cloud release model, and accounting for potential delays in agent automatic upgrades, the compatibility requirement becomes:

> Any Cloud release must always be compatible with the oldest version of an agent that's connected to Cloud and enrolled in automatic upgrades.
Compatibility requirements affect the control plane (proxy, auth, and other services we host for our users), as well as agents and clients.
For self-hosted releases, the compatibility guarantees will stay the same: https://goteleport.com/docs/faq/#version-compatibility.
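To make the requirement testable, here is a minimal Go sketch of a release gate, assuming the usual Teleport rule (see the FAQ linked above) that agents may lag the control plane by at most one major version and must never be ahead of it. It is illustrative, not actual release tooling:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"

	"golang.org/x/mod/semver"
)

// compatibleWithOldestAgent checks whether a candidate Cloud release
// remains compatible with the oldest agent connected to Cloud and
// enrolled in automatic upgrades.
func compatibleWithOldestAgent(release, oldestAgent string) (bool, error) {
	if !semver.IsValid(release) || !semver.IsValid(oldestAgent) {
		return false, fmt.Errorf("invalid semver: %q / %q", release, oldestAgent)
	}
	relMajor, err := strconv.Atoi(strings.TrimPrefix(semver.Major(release), "v"))
	if err != nil {
		return false, err
	}
	agentMajor, err := strconv.Atoi(strings.TrimPrefix(semver.Major(oldestAgent), "v"))
	if err != nil {
		return false, err
	}
	// Agents may be at most one major version behind the release,
	// and never ahead of it.
	return agentMajor <= relMajor && relMajor-agentMajor <= 1, nil
}

func main() {
	// One major behind is allowed; two majors behind would block the release.
	ok, _ := compatibleWithOldestAgent("v15.0.0-cloud.20230901T120000.ddeeff", "v14.2.1")
	fmt.Println(ok) // true
}
```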
Bad releases can happen even with the multi-staged rollout process. For that situation, we will follow a dedicated rollback process.
The existing security release process is compatible with Cloud-first deployment. Both `-cloud.X` and regular self-hosted tags will be published from the teleport-private repository.
If another Cloud release needs to be published while a security embargo is in effect, it will be published from the teleport-private repository so that it includes the security fix, as in the current process.
No changes are expected in the relationship between teleport-private and teleport with regard to security release publishing in the short term.
Prerequisites for implementing the Cloud-first deployment process:

- tsh and other clients.