service/worker/scanner/README.md
First and foremost: most code in this folder is disabled by default dynamic config, and it should be considered beta-like quality in general. Understand what you are enabling before enabling it, and run it at your own risk. It is shared primarily so others can learn from and leverage what we have already encountered, if they end up needing it or something similar.
Any actually recommended fixing processes will be explicitly called out in release notes or similar. This document is not recommending any, merely describing.
There are a variety of problems with the Scanner and Fixer related code, so please do not take this document as a sign that it is a structure we want to keep. It has just been confusing enough that it took substantial time to understand, and now some of that effort is written down to save people the full effort in the future.
This folder as a whole contains a variety of data-cleanup workflows.
As a general rule, these all scan the full database (for one kind of data), check
some data, and clean it up if necessary.
How their code does that varies quite a bit, however.
E.g. the "history scavenger" finds old branches of history that lost their CAS race
to update the workflow's official history. It is not possible to guarantee that
these are cleaned up while a workflow runs, because any cleanup could have failed.
Instead, the history scavenger periodically walks through the whole database and
looks for these orphaned history branches, and deletes them.
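The shared scan / check / fix pattern can be sketched in a few lines. This is an illustrative toy, not the scavenger's real API: `record`, `scanAll`, and `cleanup` are all hypothetical stand-ins for one kind of data and its storage operations.

```go
package main

import "fmt"

// record is a hypothetical stand-in for one kind of data, e.g. a history branch.
type record struct {
	id       string
	orphaned bool
}

// scanAll stands in for paging through the whole database.
func scanAll() []record {
	return []record{{"branch-1", false}, {"branch-2", true}}
}

// cleanup stands in for deleting the bad record.
func cleanup(r record) { fmt.Println("deleted", r.id) }

func main() {
	// the shared pattern: scan everything, check each record, fix if needed
	for _, r := range scanAll() {
		if r.orphaned { // the "check"
			cleanup(r) // the "fix"
		}
	}
}
```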
The most complex of these processes are based around Scanner and Fixer, and
so this readme is almost exclusively written for them.
For others, just read their code; it's probably faster than reading any doc about
their code.
## Scanner and Fixer workflows

- "Invariants" are `Check` and `Fix` methods that make sure our invariants hold,
  and fix them if they did not for some reason.
  - These live in the `common/reconciliation/invariant` folder, e.g.
    `concreteExecutionExists.go` checks (`Check`) that a current execution record
    points to a concrete record that exists. If it does not, it deletes the
    current record (the `Fix`).
  - Invariants are combined in an `InvariantManager`, which runs multiple and
    aggregates the results.
- Scanner workflows run `Check`, and push all failing checks to the blobstore.
- Fixer workflows run `Fix` (also via an `InvariantManager`) on the downloaded
  results from the most-recent Scanner.
  - Fixer is structured much like Scanner, except it calls `Fix` instead, and
    it gets its data from a different `Iterator` (this one iterates over the scanner
    results in your blobstore).
  - Fixer may run different invariants than the ones that failed `Check` previously,
    and time passes before `Fix` is called, so invariants should likely `Check` first internally.
- Both are customized via `*shardscanner.ScannerConfig` instances,
  which contain everything that customizes each particular kind of scanner/fixer,
  and which is stored in / retrieved from the context by the workflow's type as a key.
  - Hooks (in the `ScannerConfig`) are where much of the non-workflow behavior comes
    from, e.g. the `create*Hooks` funcs in `concrete_execution.go`.
- `scanShardActivity` (in `shardscanner/activities.go`)
  essentially iterates over shards, and runs `scanShard` on each one, which creates
  a `NewScanner` that gets its config and behavior from args / environment / hooks.

Last but not least: there are other workflows in this folder which do not follow these patterns!
Check each module's `Start` function; most are pretty easily found in there.

Hopefully that made sense.
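To make the `Check` / `Fix` / `InvariantManager` relationship concrete, here is a toy sketch. All type and method shapes here are simplified assumptions for illustration; the real interfaces in `common/reconciliation/invariant` are more involved.

```go
package main

import "fmt"

// Simplified result types -- the real ones carry more fields.
type CheckResultType string

const (
	Healthy   CheckResultType = "healthy"
	Corrupted CheckResultType = "corrupted"
)

type CheckResult struct {
	Type CheckResultType
	Name string
}

// Invariant pairs a Check with a Fix, as described above.
type Invariant interface {
	Check(record any) CheckResult
	Fix(record any) error
}

// manager is a toy InvariantManager: run every invariant's Check
// and aggregate the results.
type manager struct{ invariants []Invariant }

func (m manager) Check(record any) []CheckResult {
	results := make([]CheckResult, 0, len(m.invariants))
	for _, inv := range m.invariants {
		results = append(results, inv.Check(record))
	}
	return results
}

// concreteExists is a stand-in for concreteExecutionExists.go:
// "does the current record point at a concrete record that exists?"
type concreteExists struct{ exists bool }

func (c concreteExists) Check(record any) CheckResult {
	if c.exists {
		return CheckResult{Healthy, "concrete_execution_exists"}
	}
	return CheckResult{Corrupted, "concrete_execution_exists"}
}

// Fix would delete the dangling current record in the real invariant.
func (c concreteExists) Fix(record any) error { return nil }

func main() {
	m := manager{invariants: []Invariant{concreteExists{exists: false}}}
	for _, r := range m.Check("some-current-record") {
		fmt.Println(r.Name, r.Type)
	}
}
```

A Scanner would feed every record through `manager.Check` and push failures to the blobstore; a Fixer would read those failures back and call `Fix`.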
Scanners and fixers are generally disabled by default, as they can consume
substantial resources and may delete or modify data to correct issues.
Because of this, you generally need to modify your dynamic config to run them.
At the time of writing, you can enable these with config like the following.
Enable scanner workflows (these are per data source / per record type, like "concrete executions" and "timers"):

```yaml
worker.executionsScannerEnabled:
  - value: true # default false
worker.currentExecutionsScannerEnabled:
  - value: true # default false
worker.timersScannerEnabled:
  - value: true # default false
worker.historyScannerEnabled:
  - value: true # default false
worker.taskListScannerEnabled:
  - value: true # default true, only used on sql stores
```
Enable scanner invariants (currently each one only supports one data source / record type, but there may be multiple invariants for the data source):

```yaml
# concretes
worker.executionsScannerInvariantCollectionStale:
  - value: true # default false
worker.executionsScannerInvariantCollectionMutableState:
  - value: true # default true
worker.executionsScannerInvariantCollectionHistory:
  - value: true # default true

# timer invariant is implied as there is only one.
# to enable it, enable the workflow.

# currents, NONE OF THESE WORK because of type mismatch
worker.currentExecutionsScannerInvariantCollectionHistory:
  - value: true # default true
worker.currentExecutionsInvariantCollectionMutableState:
  - value: true # default true
```
Enable fixer workflows (also one per type):

```yaml
worker.concreteExecutionFixerEnabled:
  - value: true # default false
worker.currentExecutionFixerEnabled:
  - value: true # default false
worker.timersFixerEnabled:
  - value: true # default false
```
Enable fixer to run on a domain (required to do anything to a domain's data, which also means nothing will be fixed without this):

```yaml
worker.currentExecutionFixerDomainAllow:
  - value: true # default false
    constraints: {domainName: "your-domain"} # for example, or have no constraints to enable for all domains
worker.concreteExecutionFixerDomainAllow:
  - value: true # default false
worker.timersFixerDomainAllow:
  - value: true # default false
```
Enable fixer invariants:

```yaml
# concretes
worker.executionsFixerInvariantCollectionStale:
  - value: true # default false
worker.executionsFixerInvariantCollectionMutableState:
  - value: true # default true
worker.executionsFixerInvariantCollectionHistory:
  - value: true # default true

# timer invariant is enabled if timer-fixer is enabled, as there is only one
# current execution fixer has never worked and does not currently support dynamic config
```
There are a few ways to run local clusters and make changes and test things out, but this is the way I did things when reading and changing this code:
- `make cadence-server` to ensure it builds.
- `./cadence-server start --services worker` to start up a worker that will
  connect to your docker compose cluster.

The default docker compose setup starts a worker instance, but due to the default
dynamic config setup where all but `worker.taskListScannerEnabled` are disabled,
the in-docker worker will not run (most) scanner/fixer tasklists and will not steal
any tasks from a local worker.
So you can often simply run it without any changes, start up your local (customized) worker service outside of docker, and everything will Just Work™.
This way you can leverage the normal docker compose yaml files with minimal effort, use the web UI to observe the results, and rapidly change/rebuild/rerun/debug/etc without needing to deal with docker.
If you do need to debug the tasklist scanner, I would recommend making a custom build, and modifying the docker compose file to use your build. Details on that are below, but they are not necessary for other scanners/fixers.
There are a few ways to achieve this, but I like modifying the docker-compose*.yaml
file to use a custom local build, and just changing its dynamic config to disable
the tasklist scanner.
To do that, see the docker/README.md file for instructions. Personally I prefer making a unique auto-setup tag so it does not replace any non-customized docker compose runs in the future. E.g.:
```yaml
services:
  # ...
  cadence:
    image: ubercadence/server:my-auto-setup # use your new tag
    # ...
    environment:
      # note this env var, it is the file you need to change
      - "DYNAMIC_CONFIG_FILE_PATH=config/dynamicconfig/development.yaml"
```
And just make a build after changing the dynamic config file. This will copy the file into the docker image, and any local changes won't affect the running container.
```yaml
worker.taskListScannerEnabled:
  - value: false # default true, only used on sql stores
```
Just set ^ this, and make sure the others are not explicitly enabled as they are disabled by default, and you're likely done.
By default, `DYNAMIC_CONFIG_FILE_PATH` points to the `config/dynamicconfig/development.yaml`
file, so you likely want to change that one to enable your code.

To push records through the whole pipeline, you can fake failures in the invariant you care about:

```go
func (i *invariant) Check(...) {
	x := true // go vet doesn't currently complain
	if x {    // about dead code with this. handy!
		return CheckResult{Failed} // fake failure, so all records go to fixer
	}
	// ... the rest of your normal code, unchanged
}

func (i *invariant) Fix(...) {
	// print the fix rather than doing it, so the next runs try too.
	// or do the `if x {` trick here too
}
```
Run the worker (e.g. from your IDE) with `start --services worker` in its start-args.
Add some breakpoints or print statements, run it, and see what happens :)
If Scanner found anything interesting, you should now have a /tmp/blobstore folder with
files like {uuid()}_0.corrupted. These uuids are random, and the _0.corrupted suffix
marks them as page 0 (out of N), and that they refer to corrupted entries. You'll have
one uuid per Scanner shard (configurable, for concurrency) that found issues, and multiple
pages per shard if the results exceeded the paging size limit.
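The `{uuid}_{page}.{kind}` layout described above is easy to pull apart. `parseBlobName` here is a hypothetical helper of mine, just to make the naming scheme explicit; it is not a function from the repo.

```go
package main

import (
	"fmt"
	"strings"
)

// parseBlobName splits names like "<uuid>_0.corrupted" into the shard's
// random uuid, the page number, and the result kind. Hypothetical helper,
// only illustrating the {uuid}_{page}.{kind} layout of /tmp/blobstore files.
func parseBlobName(name string) (uuid, page, kind string, ok bool) {
	base, kind, ok := strings.Cut(name, ".")
	if !ok {
		return "", "", "", false
	}
	i := strings.LastIndex(base, "_")
	if i < 0 {
		return "", "", "", false
	}
	return base[:i], base[i+1:], kind, true
}

func main() {
	uuid, page, kind, _ := parseBlobName("8f14e45f-ceea-4671-9a1b-0f72b7e52b3a_0.corrupted")
	fmt.Println(uuid, page, kind)
}
```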
If Fixer found anything, /tmp/blobstore should now have {uuid()}_0.skipped and/or
{uuid()}_0.fixed files. These uuids are also random, and do not refer to the Scanner
file that their data came from, and the uuid and paging patterns are the same as Scanner.
You may also have *.failed files, following the same pattern. Similarly, these are cases
where an invariant returned a Failure result (they can come from scanner or fixer).
Only *.corrupted files from scanner will be processed by fixer, however.
Note that *.failed files can contain invariant results of all statuses, as the status
of a record trends towards "failed", and only the final status is recorded.
For details, see the behavior in the InvariantManager.
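The "trends towards failed" behavior amounts to a severity precedence. This is my simplification of what the InvariantManager does when picking the determining status, not the real code:

```go
package main

import "fmt"

// severity orders statuses so that "failed" dominates "corrupted",
// which dominates "healthy". A simplification of the InvariantManager's
// determining-status logic for illustration only.
var severity = map[string]int{"healthy": 0, "corrupted": 1, "failed": 2}

// determiningStatus returns the most severe status seen across invariants.
func determiningStatus(statuses []string) string {
	final := "healthy"
	for _, s := range statuses {
		if severity[s] > severity[final] {
			final = s
		}
	}
	return final
}

func main() {
	// one failed check drags the whole record to "failed", which is why
	// a *.failed file can contain invariant results of all statuses.
	fmt.Println(determiningStatus([]string{"healthy", "corrupted", "failed"}))
}
```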
You can also see the results in the scanner and fixer workflows in the UI. In particular:
- Workflow IDs are defined in code, e.g. `concreteExecutionsScannerWFID = "cadence-sys-executions-scanner"`.
- Query them with `aggregate_report` (works in UI, others require arguments and you currently
  need to use the CLI) to get overall counts.
- Other queries return the blobstore keys under `/tmp/blobstore`, so you can look up
  specific detailed results.
Fixer workflows read those files (from `/tmp/blobstore`), and return the same kind of
structure (new random UUIDs, new page ranges, referring to new files in `/tmp/blobstore`).
If you are not printing or debugging whatever info you are looking for, check the contents of these files to verify they're doing what you expect!
This is the new "stale" invariant working in its concrete execution scanner -> the concrete execution fixer also working, with faked results to simplify testing. I also ran all the other concrete invariants because I was curious.
I made a change like this:
```go
func (c *staleWorkflowCheck) Check(
	ctx context.Context,
	execution interface{},
) CheckResult {
	x := true
	if x {
		return CheckResult{
			CheckResultType: CheckResultTypeCorrupted,
			InvariantName:   c.Name(),
			Info:            "fake corrupt",
		}
	}
	_, result := c.check(ctx, execution)
	return result
}
```
so the Check call would always fail, and the Fix call would run the real check.
And added dynamic config like this:
```yaml
# enable these invariants
worker.executionsScannerInvariantCollectionStale:
  - value: true # default false
worker.executionsScannerInvariantCollectionMutableState:
  - value: true # default true
worker.executionsScannerInvariantCollectionHistory:
  - value: true # default true
worker.executionsFixerInvariantCollectionStale:
  - value: true # default false
worker.executionsFixerInvariantCollectionMutableState:
  - value: true # default true
worker.executionsFixerInvariantCollectionHistory:
  - value: true # default true

# these invariants are all run by the concrete execution scanner/fixer
worker.executionsScannerEnabled: # note slightly different name
  - value: true # default false
worker.concreteExecutionFixerEnabled:
  - value: true # default false
worker.concreteExecutionFixerDomainAllow:
  - value: true # default false
```
After a scanner and fixer run, /tmp/blobstore contained *.corrupted and *.skipped files.
The *.corrupted files contained data like this:
```json
{
  "Input": {
    "Execution": { ... },
    "Result": {
      "CheckResultType": "corrupted",
      "DeterminingInvariantType": "stale_workflow",
      "CheckResults": [
        {
          "CheckResultType": "healthy",
          "InvariantName": "history_exists",
          "Info": "",
          "InfoDetails": ""
        },
        {
          "CheckResultType": "healthy",
          "InvariantName": "open_current_execution",
          "Info": "",
          "InfoDetails": ""
        },
        {
          "CheckResultType": "corrupted",
          "InvariantName": "stale_workflow",
          "Info": "fake corrupt",
          "InfoDetails": ""
        }
      ]
    }
  },
  "Result": {
    "FixResultType": "skipped",
    "DeterminingInvariantName": null,
    "FixResults": null
  }
}
```
You can see the two healthy invariants, and the one I faked.
When fixer then ran I got this in a *.skipped file:
```json
{
  "Execution": { ... },
  "Input": {
    "Execution": { ... },
    "Result": {
      "CheckResultType": "corrupted",
      "DeterminingInvariantType": "stale_workflow",
      "CheckResults": [
        {
          "CheckResultType": "healthy",
          "InvariantName": "history_exists",
          "Info": "",
          "InfoDetails": ""
        },
        {
          "CheckResultType": "healthy",
          "InvariantName": "open_current_execution",
          "Info": "",
          "InfoDetails": ""
        },
        {
          "CheckResultType": "corrupted",
          "InvariantName": "stale_workflow",
          "Info": "fake corrupt",
          "InfoDetails": ""
        }
      ]
    }
  },
  "Result": {
    "FixResultType": "skipped",
    "DeterminingInvariantName": null,
    "FixResults": [
      {
        "FixResultType": "skipped",
        "InvariantName": "history_exists",
        "CheckResult": {
          "CheckResultType": "healthy",
          "InvariantName": "history_exists",
          "Info": "",
          "InfoDetails": ""
        },
        "Info": "skipped fix because execution was healthy",
        "InfoDetails": ""
      },
      {
        "FixResultType": "skipped",
        "InvariantName": "open_current_execution",
        "CheckResult": {
          "CheckResultType": "healthy",
          "InvariantName": "open_current_execution",
          "Info": "",
          "InfoDetails": ""
        },
        "Info": "skipped fix because execution was healthy",
        "InfoDetails": ""
      },
      {
        "FixResultType": "skipped",
        "InvariantName": "stale_workflow",
        "CheckResult": {
          "CheckResultType": "",
          "InvariantName": "",
          "Info": "",
          "InfoDetails": ""
        },
        "Info": "no need to fix: completed workflow still within retention + 10-day buffer",
        "InfoDetails": "completed workflow still within retention + 10-day buffer, closed 2023-09-20 20:26:04.924876012 -0500 CDT and allowed to exist until 2023-10-07"
      }
    ]
  }
}
```
Notice that all three invariants ran in the fixer (all three were enabled), and all three fixes were skipped because they did not find any problems.
If I had also faked the stale workflow invariant Fix, you would see a FixResultType
of "fixed" on that invariant in fixer, and a file named *.fixed rather than *.skipped.
This is the current-execution scanner working -> the current-execution fixer misbehaving
and using the wrong type, and creating *.failed files, as of commit eb55629d.
I faked the current execution invariant to always say "corrupt" in Check, and panic in Fix,
and enabled the current execution scanner and fixer in dynamic config, and ran the worker.
First, the scanner run creates *.corrupted files with entries like this:
```json
{
  "Input": {
    "Execution": { ... },
    "Result": {
      "CheckResultType": "corrupted",
      "DeterminingInvariantType": "concrete_execution_exists",
      "CheckResults": [
        {
          "CheckResultType": "corrupted",
          "InvariantName": "concrete_execution_exists",
          "Info": "execution is open without having concrete execution",
          "InfoDetails": "concrete execution not found. WorkflowId: e905c98f-108a-4191-9ef2-ca07a1361f9c, RunId: 6bc5386b-c043-4eb1-a332-c3bb7b5188f0"
        }
      ]
    }
  },
  "Result": {
    "FixResultType": "skipped",
    "DeterminingInvariantName": null,
    "FixResults": null
  }
}
```
This shows that it identified my "always corrupt" results correctly, in the concrete_execution_exists invariant.
This is later consumed by the current execution fixer, which produces *.failed files with
contents like this:
```json
{
  "Execution": { ... },
  "Input": {
    "Execution": { ... },
    "Result": {
      "CheckResultType": "corrupted",
      "DeterminingInvariantType": "concrete_execution_exists",
      "CheckResults": [
        {
          "CheckResultType": "corrupted",
          "InvariantName": "concrete_execution_exists",
          "Info": "execution is open without having concrete execution",
          "InfoDetails": "concrete execution not found. WorkflowId: e905c98f-108a-4191-9ef2-ca07a1361f9c, RunId: 6bc5386b-c043-4eb1-a332-c3bb7b5188f0"
        }
      ]
    }
  },
  "Result": {
    "FixResultType": "failed",
    "DeterminingInvariantName": "history_exists",
    "FixResults": [
      {
        "FixResultType": "failed",
        "InvariantName": "history_exists",
        "CheckResult": {
          "CheckResultType": "failed",
          "InvariantName": "history_exists",
          "Info": "failed to check: expected concrete execution",
          "InfoDetails": ""
        },
        "Info": "failed fix because check failed",
        "InfoDetails": ""
      },
      {
        "FixResultType": "failed",
        "InvariantName": "open_current_execution",
        "CheckResult": {
          "CheckResultType": "failed",
          "InvariantName": "open_current_execution",
          "Info": "failed to check: expected concrete execution",
          "InfoDetails": ""
        },
        "Info": "failed fix because check failed",
        "InfoDetails": ""
      }
    ]
  }
}
```
You can see scanner's data as the input-result, and completely different invariants running and failing as the fixer result.
Overall it seems like it probably deserves a rewrite, though some of the parts are pretty clearly reusable (invariants, iterators, etc).
[*]: may not actually be brief