beps/0004-scaffolder-task-idempotency/README.md
Scaffolder retryable task idempotency provides the means to make each action of a task idempotent. By default, an action is not considered idempotent; it has to be explicitly crafted so that it can be re-run multiple times with the same effect as if it had run only once.
The aim is to make the task engine more resilient to system crashes and redeployments. If the task engine is in the process of executing
tasks when the system stops, the task engine will restore those tasks after restart and continue their execution.
Another purpose is to make it possible to manually retry a task from its last failed step.
We will not aim to magically provide idempotency for actions; it has to be explicitly implemented.
We believe that idempotency is the best way to achieve this: it allows actions to be rerun multiple times, gracefully dealing with semi-complete state.
We believe that serialization of workspaces is the way to achieve re-running tasks in a non-sticky way, meaning a task can be restored and retried on a different scaffolder task worker. The serialized workspace can be stored in the database, or additional modules could be installed to provide other storage options, since this data may be large in some cases.
Secrets will be stored for a longer period of time in the database and wiped once the task reaches a completed state (successfully finished or archived). Depending on the lifetime of the task, it's possible that these secrets could expire. Refreshing these tokens is out of scope for now, but could perhaps be achieved by notifying users that they need to go back to the task page to re-trigger the task.
Below is a simplified idempotent version of the GitHub repository creation action:
```typescript
export function createGithubRepoCreateAction(options: {
  integrations: ScmIntegrationRegistry;
  githubCredentialsProvider?: GithubCredentialsProvider;
}) {
  const { integrations, githubCredentialsProvider } = options;

  return createTemplateAction<{
    repoUrl: string;
    secrets?: { [key: string]: string };
    repoVariables?: { [key: string]: string };
    token?: string;
  }>({
    id: 'github:repo:create',
    description: 'Creates a GitHub repository.',
    examples,
    schema: {
      input: {
        type: 'object',
        required: ['repoUrl'],
        properties: {
          repoUrl: inputProps.repoUrl,
          token: inputProps.token,
          secrets: inputProps.secrets,
          repoVariables: inputProps.repoVariables,
        },
      },
    },
    async handler(ctx) {
      const {
        repoUrl,
        repoVariables,
        token: providedToken,
      } = ctx.input;

      const octokitOptions = await getOctokitOptions({
        integrations,
        credentialsProvider: githubCredentialsProvider,
        token: providedToken,
        repoUrl,
      });
      const client = new Octokit(octokitOptions);

      const { owner, repo } = parseRepoUrl(repoUrl, integrations);
      if (!owner) {
        throw new InputError('Invalid repository owner provided in repoUrl');
      }

      const user = await client.rest.users.getByUsername({
        username: owner,
      });

      // The repository is only created once; on retries the stored
      // checkpoint result is replayed instead of re-running the callback.
      const { remoteUrl } = await ctx.checkpoint({
        key: 'repo.creation.v1',
        fn: async () => {
          const response =
            user.data.type === 'Organization'
              ? await client.rest.repos.createInOrg({
                  name: repo,
                  org: owner,
                })
              : await client.rest.repos.createForAuthenticatedUser({
                  name: repo,
                });
          return { remoteUrl: response.data.clone_url };
        },
      });

      if (repoVariables) {
        await ctx.checkpoint({
          key: 'repo.create.variables',
          fn: async () => {
            for (const [key, value] of Object.entries(repoVariables)) {
              await client.rest.actions.createRepoVariable({
                owner,
                repo,
                name: key,
                value,
              });
            }
          },
        });
      }

      ctx.output('remoteUrl', remoteUrl);
    },
  });
}
```
Implement an API similar to `CatalogProcessorCache`, allowing markers or keys to be stored so that users can write idempotent actions.
This context persists across retries:

```typescript
const repoMarker = await cache.get<RepoMarker>('repo.marker.key');
```
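A minimal in-memory sketch of how such a cache could be used to skip already-completed work across retries. The `TaskStateCache` interface, `InMemoryTaskStateCache` class, and `ensureRepo` function here are all hypothetical illustrations, not the final Backstage API:

```typescript
// Hypothetical cache interface mirroring CatalogProcessorCache's get/set shape.
interface TaskStateCache {
  get<T>(key: string): Promise<T | undefined>;
  set<T>(key: string, value: T): Promise<void>;
}

// In-memory stand-in; the real implementation would persist to the task's
// stored state so markers survive worker restarts.
class InMemoryTaskStateCache implements TaskStateCache {
  private store = new Map<string, unknown>();
  async get<T>(key: string): Promise<T | undefined> {
    return this.store.get(key) as T | undefined;
  }
  async set<T>(key: string, value: T): Promise<void> {
    this.store.set(key, value);
  }
}

type RepoMarker = { repoUrl: string };

let creationCalls = 0;

// An idempotent step: only perform the creation if no marker exists yet.
async function ensureRepo(cache: TaskStateCache): Promise<RepoMarker> {
  const existing = await cache.get<RepoMarker>('repo.marker.key');
  if (existing) {
    return existing; // already created on a previous attempt
  }
  creationCalls += 1; // stands in for the real repository-creation API call
  const marker = { repoUrl: 'https://github.com/example/repo.git' };
  await cache.set('repo.marker.key', marker);
  return marker;
}
```

Running `ensureRepo` twice against the same cache performs the creation only once; the second call returns the stored marker.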
Checkpoints will allow action authors to create actions where code paths are skipped if they have already run. This will be provided on a context object, and the action author provides a key and a callback.
```typescript
await ctx.checkpoint({
  key: 'repo.creation',
  fn: async () => {
    const { repoUrl } = await client.rest.Repository.create({});
    return { repoUrl };
  },
});
```
This checkpoint will be backed by task-stored context, namespaced with a versioned checkpoint prefix. It will look like:
```json
{
  "repo.creation": {
    "status": "success",
    "result": {
      "repoUrl": "https://github.com/backstage/backstage.git"
    }
  }
}
```
or a failed attempt as:
```json
{
  "repo.creation": {
    "status": "failed",
    "reason": "Namespace is not valid"
  }
}
```
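The replay behavior described above can be sketched as follows. This is a hypothetical, simplified implementation assuming an in-memory state object; the real scaffolder would read and write the persisted task state:

```typescript
// Hypothetical shapes; the real scaffolder types may differ.
type CheckpointState =
  | { status: 'success'; result: unknown }
  | { status: 'failed'; reason: string };

// Sketch of how ctx.checkpoint could be backed by stored task state:
// a successful entry short-circuits to the stored result, a failed entry
// is retried, and fresh results are persisted before being returned.
function makeCheckpoint(state: Record<string, CheckpointState>) {
  return async function checkpoint<T>(opts: {
    key: string;
    fn: () => Promise<T>;
  }): Promise<T> {
    const prior = state[opts.key];
    if (prior?.status === 'success') {
      return prior.result as T; // already done on a previous run
    }
    try {
      const result = await opts.fn();
      state[opts.key] = { status: 'success', result };
      return result;
    } catch (err) {
      state[opts.key] = { status: 'failed', reason: String(err) };
      throw err;
    }
  };
}
```

On a retried task, checkpoints with a `success` entry never re-run their callback, which is what makes wrapping a non-idempotent API call safe.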
`DatabaseTaskStore` will provide two extra methods, `saveTaskState` and `getTaskState`. The type of the state in the API will be
represented as `JsonObject`.
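A sketch of what those two methods could look like, with an in-memory map standing in for the database column. The `InMemoryTaskStore` class and the exact option shapes are illustrative assumptions, and `JsonObject` is stubbed locally (the real type lives in `@backstage/types`):

```typescript
// Minimal JsonObject stand-in for the @backstage/types definition.
type JsonValue = null | boolean | number | string | JsonValue[] | { [k: string]: JsonValue };
type JsonObject = { [k: string]: JsonValue };

// Hypothetical sketch of the two extra DatabaseTaskStore methods, with an
// in-memory map standing in for the `state` column on the `tasks` table.
class InMemoryTaskStore {
  private states = new Map<string, JsonObject>();

  async saveTaskState(options: { taskId: string; state: JsonObject }): Promise<void> {
    this.states.set(options.taskId, options.state);
  }

  async getTaskState(options: { taskId: string }): Promise<JsonObject | undefined> {
    return this.states.get(options.taskId);
  }
}
```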
Task state will be stored in an extra `state` column in the `tasks` table with the following structure:
```json
{
  "state": {
    "checkpoints": {
      "repo.creation": {
        "status": "success",
        "result": {
          "repoUrl": "https://github.com/backstage/backstage.git"
        }
      },
      "repo.add.member": {
        "status": "success",
        "result": {
          "id": "2345"
        }
      }
    }
  }
}
```
The workspace will be serialized and stored in the database by default. This serialization should occur at the end of each step and after each checkpoint. It will be possible to provide additional modules that extend workspace serialization to other providers, such as GCS or S3, instead of the database. This would be useful for larger workspaces: instead of taking up space in the database, we can store these directory structures in a more appropriate place.
The workspace will need to be archived into a single binary, such as a tar or zip file, and stored as a binary blob in the remote store. This will perform better than iterating through each file path and storing the contents along with the permissions.
There could be an impact to the speed of task recovery as it downloads the workspace, but this is an accepted risk and a tradeoff for the benefits of having the workspace stored in a remote store.
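The round-trip idea can be sketched as below. This simplified stand-in gzips a JSON map of paths to contents rather than producing a real tar archive (which would also preserve permissions), but the shape is the same: one binary blob per workspace, stored remotely and restored on another worker. The function names and the `Workspace` type are hypothetical:

```typescript
import { gzipSync, gunzipSync } from 'zlib';

// Simplified stand-in for tar-based workspace serialization. A real
// implementation would archive the workspace directory, including file
// permissions; here a path -> contents map illustrates the round trip.
type Workspace = Record<string, string>;

function serializeWorkspace(ws: Workspace): Buffer {
  return gzipSync(Buffer.from(JSON.stringify(ws), 'utf8'));
}

function restoreWorkspace(blob: Buffer): Workspace {
  return JSON.parse(gunzipSync(blob).toString('utf8'));
}
```

Storing a single compressed blob keeps the remote store interface trivial (put/get by task ID) at the cost of re-downloading the whole workspace on recovery, which is the tradeoff accepted above.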
We're going to release this behind `EXPERIMENTAL_`-prefixed flags in the template schema to enable it on a per-template level. Once we're happy with the implementation and after heavy testing, we can consider making this opt-in at the plugin level, before rolling it out to all templates and the scaffolder plugin entirely.
There could also be the option to put this behind a `scaffolder.backstage.io/v1beta4` apiVersion if the `EXPERIMENTAL_` options are not enough or cause too much of a headache.
None present. However, this BEP does unblock features such as longer-running tasks and Gated Workflows.