skills/cache-expert/references/debugging.md
This guide focuses on practical debugging for current dagql + filesync cache behavior.
Use a tight test repro before adding logs.
Recommended integration command format:
dagger --progress=plain call engine-dev test --pkg ./core/integration --run='<TestSuiteName>/<SubtestName>'
This command rebuilds the dev engine, runs it as an ephemeral service, and then runs tests against it. Output includes, among other things, test log output (t.Logf).
Capture output to a file under /tmp to avoid overwhelming terminal context:
dagger --progress=plain call engine-dev test --pkg ./core/integration --run='<TestSuiteName>/<SubtestName>' > /tmp/cache-debug.log 2>&1
rg -n "panic:|--- FAIL:|^FAIL\s" /tmp/cache-debug.log
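Once the log is captured, it can help to reduce the scan above to a list of failing test names. The following is a minimal sketch, not part of the Dagger tooling: `failed_tests` is a hypothetical helper name, it assumes the standard `--- FAIL: <name> (<duration>)` lines that `go test` emits, and it uses `grep -E` instead of `rg` only so that it runs where `rg` is not installed.

```sh
# failed_tests LOGFILE
# Print the unique names of failing tests found in a captured log.
# Assumes go test style "--- FAIL: <name> (<duration>)" lines, which may be
# prefixed by progress output, so the match is not anchored to line start.
failed_tests() (
  grep -E -e '--- FAIL: ' "$1" \
    | sed -E 's/.*--- FAIL: ([^[:space:]]+).*/\1/' \
    | sort -u
)
```

For example, `failed_tests /tmp/cache-debug.log` prints one failing test or subtest name per line, which can be pasted into the next `--run=` filter.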
During long runs, periodically grep for panics. If the engine panics, tests may hang indefinitely:
rg -n "panic:|fatal error:|SIGSEGV|stack trace" /tmp/cache-debug.log
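The periodic check can be wrapped in a small polling loop so a hung run is noticed without re-reading the whole log by hand. This is a sketch under assumptions: `watch_for_panic` is a hypothetical helper name, the default interval and poll count are arbitrary, and `grep -E` stands in for `rg` for portability.

```sh
# watch_for_panic LOGFILE [INTERVAL_SECONDS] [MAX_CHECKS]
# Poll a growing log for engine panics. Prints the matching lines and
# returns 0 as soon as one is seen; returns 1 if none appear in time.
watch_for_panic() (
  log="$1"; interval="${2:-30}"; max_checks="${3:-120}"
  i=0
  while [ "$i" -lt "$max_checks" ]; do
    if grep -nE 'panic:|fatal error:|SIGSEGV|stack trace' "$log" 2>/dev/null; then
      exit 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  exit 1
)
```

Run it in a second terminal against the same file the test command is writing to, for example `watch_for_panic /tmp/cache-debug.log 30`.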
If a test appears hung (engine still alive but no test progress), capture a goroutine dump from the inner dev engine process with SIGQUIT (THESE INSTRUCTIONS MUST BE FOLLOWED CLOSELY TO AVOID SENDING SIGQUIT TO THE WRONG PROCESS):
engine_ctr="$(docker ps --format '{{.Names}}' | rg '^dagger-engine-v' | head -n1)"
docker exec "$engine_ctr" sh -lc '
  for p in /proc/[0-9]*; do
    pid=${p#/proc/}
    [ "$pid" = "1" ] && continue
    cmd="$(tr "\0" " " < "$p/cmdline" 2>/dev/null || true)"
    case "$cmd" in
      *"/usr/local/bin/dagger-engine"*)
        echo "sending SIGQUIT to inner dagger-engine pid=$pid" >&2
        kill -QUIT "$pid"
        exit 0
        ;;
    esac
  done
  echo "no inner dagger-engine process found" >&2
  exit 1
'
Then inspect the same run log for the dump:
rg -n "goroutine [0-9]+|fatal error:|SIGQUIT|chan receive|chan send|semacquire|sync\\.Mutex|deadlock" /tmp/cache-debug.log
AFTER SENDING SIGQUIT the tests may hang. Once you confirm the log output has SIGQUIT stack traces, you are done and don't need to wait for the test hang to end.
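To decide when it is safe to stop waiting, one option is to count goroutine headers in the log. This is a hedged sketch: `has_goroutine_dump` is a hypothetical helper name, and it assumes the Go runtime's usual `goroutine N [state]:` header lines appear in the captured output.

```sh
# has_goroutine_dump LOGFILE
# Report how many goroutine stack headers the log contains and
# return 0 only if at least one is present.
has_goroutine_dump() (
  # grep -c exits non-zero on zero matches, so tolerate that here.
  count="$(grep -cE 'goroutine [0-9]+ \[' "$1" 2>/dev/null || true)"
  echo "goroutine stanzas: ${count:-0}"
  [ "${count:-0}" -gt 0 ]
)
```

If `has_goroutine_dump /tmp/cache-debug.log` succeeds, the dump has been captured and the hung test run can be interrupted.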
To compare behavior against an engine from another git ref:
dagger --progress=plain call engine-dev --source 'https://github.com/dagger/dagger#main' test --pkg ./core/integration --run='TestSomeSuite/TestSomeSubtestYouWant'
Do not run multiple suites in parallel unless necessary; each suite is CPU-heavy and concurrent runs significantly degrade performance.
DO NOT EVER USE broad ./... WHEN RUNNING TESTS AS YOU WILL ACCIDENTALLY CAPTURE INTEGRATION TESTS OR OTHER TESTS YOU DID NOT MEAN TO RUN.
./core/integration, ./dagql/idtui and ./dagql/idtui/multiprefixw are integration-style test packages (not quick unit loops). Avoid running them during tight cache-debug cycles unless you explicitly need those integration paths.
When a failure happens in CI, start from the trace if one is available. The user may provide either a raw trace ID or a command copied from the web UI, such as:
dagger trace <trace-id>
Replay that trace locally with plain progress and capture it to a temp file:
dagger --progress=plain trace <trace-id> > /tmp/ci-trace-<trace-id>.log 2>&1
This does not rerun the CI job. It fetches and prints the recorded trace in the
same style as local --progress=plain output, so the rest of this debugging
guide applies: keep the full output in /tmp, inspect it with rg, and avoid
dumping the whole trace into the conversation.
If the user gives a GitHub PR URL instead of a trace ID, first inspect the PR's commit statuses and collect the Dagger Cloud target URLs for the checks of interest. With GitHub CLI this usually looks like:
pr_url='https://github.com/dagger/dagger/pull/13119'
head_sha="$(gh pr view "$pr_url" --json headRefOid --jq .headRefOid)"
gh api "repos/dagger/dagger/commits/$head_sha/status" \
  --jq '.statuses[] | select(.target_url | startswith("https://dagger.cloud/")) | [.state, .context, .target_url] | @tsv'
For failed checks, add select(.state != "success"). A Dagger status target URL
has this shape:
https://dagger.cloud/{org}/checks/{moduleRef}@{moduleVersion}?check={checkName}
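The URL shape above can be split into the fields a later lookup needs using plain parameter expansion. This is an illustrative sketch only: `parse_check_url` is a hypothetical helper name, it assumes the URL follows exactly the shape shown, and it does not URL-decode values (an encoded check name such as one containing `%3A` would be printed as-is).

```sh
# parse_check_url URL
# Split a Dagger Cloud status target URL of the form
#   https://dagger.cloud/{org}/checks/{moduleRef}@{moduleVersion}?check={checkName}
# into key=value lines. A run=<checkID> parameter is reported when present.
parse_check_url() (
  rest="${1#https://dagger.cloud/}"
  org="${rest%%/checks/*}"
  tail="${rest#*/checks/}"
  path="${tail%%\?*}"            # everything before the query string
  query=""
  case "$tail" in *\?*) query="${tail#*\?}" ;; esac
  module_ref="${path%@*}"        # text before the last "@"
  module_version="${path##*@}"   # text after the last "@"
  check=""; run=""
  IFS='&'; set -f                # split the query on "&" without globbing
  for pair in $query; do
    case "$pair" in
      check=*) check="${pair#check=}" ;;
      run=*) run="${pair#run=}" ;;
    esac
  done
  printf 'org=%s\nmoduleRef=%s\nmoduleVersion=%s\ncheck=%s\nrun=%s\n' \
    "$org" "$module_ref" "$module_version" "$check" "$run"
)
```

The body runs in a subshell, so the `IFS` and `set -f` changes do not leak into the calling shell.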
For public repos, the Cloud GraphQL API can map that URL data to check IDs and trace IDs without rerunning anything:
curl -sS -X POST https://api.dagger.cloud/query \
  -H 'Content-Type: application/json' \
  --data '{
    "query": "query($org:String!,$moduleRef:String!,$moduleVersion:String!){ org(name:$org){ moduleChecks(moduleRef:$moduleRef,moduleVersion:$moduleVersion){ commitSHA checks { id name status traceId spanId moduleRef moduleVersion } } } }",
    "variables": {
      "org": "dagger",
      "moduleRef": "github.com/dagger/dagger",
      "moduleVersion": "e7600fda40142627a4206ec04de3a5f702be5a45"
    }
  }' > /tmp/ci-checks.json
jq -r --arg check 'test-split:test-base' \
  '.data.org.moduleChecks[].checks[]
   | select(.name == $check)
   | [.status, .name, .id, .traceId]
   | @tsv' /tmp/ci-checks.json
If the Dagger Cloud URL contains run=<checkID>, prefer that exact check ID.
Current GitHub status URLs often only include check=<name>, so the lookup is
"latest matching check for this org/module/version/name"; be careful after
reruns and prefer the non-success/latest row that matches the status being
debugged.
Once you have the trace ID, replay it with dagger --progress=plain trace ...
and capture output to /tmp as described above.
Start with the usual failure scan:
rg -n "panic:|fatal error:|SIGSEGV|--- FAIL:|^FAIL\s|Error:|error:" /tmp/ci-trace-<trace-id>.log
Then inspect around the interesting spans:
rg -n "TestName|FieldName|module name|command text" /tmp/ci-trace-<trace-id>.log
sed -n '<start>,<end>p' /tmp/ci-trace-<trace-id>.log
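The two-step "find the line, then print a window around it" pattern can be combined into one helper. This is a sketch with assumed names: `show_context` and its default radius of 20 lines are hypothetical, and `grep -E` is used in place of `rg` for portability.

```sh
# show_context LOGFILE PATTERN [RADIUS]
# Print RADIUS lines before and after the first line matching PATTERN.
show_context() (
  log="$1"; pattern="$2"; radius="${3:-20}"
  line="$(grep -nE -e "$pattern" "$log" | head -n 1 | cut -d: -f1)"
  if [ -z "$line" ]; then
    echo "no match for: $pattern" >&2
    exit 1
  fi
  start=$((line - radius))
  if [ "$start" -lt 1 ]; then start=1; fi
  end=$((line + radius))
  sed -n "${start},${end}p" "$log"
)
```

For example, `show_context /tmp/ci-trace-<trace-id>.log 'TestName' 40` prints a bounded window that is small enough to read without dumping the whole trace.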
Use the replayed trace to identify the exact failing call, subtest, generated
command, or engine error. Once the failing surface is clear, decide whether to
reproduce it locally with a tight dagger --progress=plain call engine-dev ...
command or debug directly from the recorded CI trace.
For most testing/debugging flows, prefer ephemeral engines via:
dagger --progress=plain call engine-dev ...
However, for performance debugging (pprof snapshots, repeated profiling loops, endpoint inspection), use a persistent dev engine running in Docker.
docker rm -fv dagger-engine.dev
docker volume rm dagger-engine.dev
./hack/dev
Notes:
- The persistent engine runs as a Docker container named dagger-engine.dev.
Use ./hack/with-dev to target the running dagger-engine.dev:
./hack/with-dev go test -v -count=1 -run='TestWorkspace/TestWorkspaceContentAddressed/storing_a_Directory' ./core/integration/
You can also run Dagger commands through the same wrapper:
./hack/with-dev ./bin/dagger ...
Important CLI gotcha:
- If you run ./hack/with-dev bash -c 'dagger ...', you may accidentally pick up a non-dev dagger binary from PATH.
- Use an explicit ./bin/dagger to avoid ambiguity.
Because the engine is a normal Docker container, you can use standard Docker tools:
- docker logs dagger-engine.dev
- docker exec -it dagger-engine.dev sh
- docker kill -s <SIGNAL> dagger-engine.dev
The dev engine exposes debug endpoints on localhost:6060.
- The routes are defined in cmd/engine/debug.go (see route setup near line 29).
Example heap profile capture over 15 seconds:
curl 'http://localhost:6060/debug/pprof/heap?seconds=15' > /tmp/heap.pprof
Then inspect with:
go tool pprof /tmp/heap.pprof
General profiling guidance:
When debugging leaked dagql cache refs, start with Prometheus metrics before adding deep logs.
Enable metrics on the target engine:
_EXPERIMENTAL_DAGGER_METRICS_ADDR=0.0.0.0:9090
_EXPERIMENTAL_DAGGER_METRICS_CACHE_UPDATE_INTERVAL=1s
Key metrics:
- dagger_connected_clients
- dagger_dagql_cache_entries
- dagger_dagql_cache_ongoing_calls_entries
- dagger_dagql_cache_completed_calls_entries
- dagger_dagql_cache_completed_calls_by_content_entries
- dagger_dagql_cache_ongoing_arbitrary_entries
- dagger_dagql_cache_completed_arbitrary_entries
Interpretation:
- If connected_clients is 0 but dagql_cache_entries stays non-zero, refs are retained.
- completed_calls growth: call-result refs not released.
- ongoing_calls growth: waiter/cancel path likely stuck.
- *_arbitrary_* growth: opaque/arbitrary cache path leak.
- dagger_dagql_cache_entries is index-entry count, not unique-result count. The same shared result may appear in multiple indexes.
Practical scrape tip for nested-engine integration tests:
- Scrape the nested engine by its service hostname (for example, curl http://dev-engine:9090/metrics).
Useful correlation log (session teardown):
engine/server/session.go logs:
- released dagql cache refs for session with beforeEntries and afterEntries
- If afterEntries trends upward across completed sessions, session close is not releasing all refs.
dagql/objects.go:
- preselect: log newID, returned cacheCfgResp.CacheKey.ID, and decoded args after rewrite
- newCacheKey: log ID, DoNotCache, TTL, ConcurrencyKey
dagql/cache.go:
- GetOrInitCall: log callKey, storageKey, contentKey, hit path taken
- wait: log index insertion (storageKey, resultCallKey, contentDigestKey)
dagql/session_cache.go:
- DoNotCache retries (noCacheNext)
- session close handling (isClosed checks)
Also inspect:
- dagql/cache.go around DB select/update logic
- dagql/db/queries.go compare-and-upsert behavior
- engine/filesync/change_cache.go for change dedupe/wait/release
- engine/filesync/localfs.go for conflict detection (verifyExpectedChange) and release timing
Check in order:
- Did GetCacheConfig rewrite the ID unexpectedly?
Check:
- Which key produced the result (storageKey vs contentDigestKey)?
Check:
- Is noCacheNext set for this key?
- Was the result marked DoNotCache and then reinserted?
Check:
Prefer small, high-signal log lines with:
- the hit path taken (storage, content, miss, ongoing)
This usually narrows root cause quickly without overwhelming logs.
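The first metric interpretation rule from the metrics section (no connected clients but a non-zero cache entry count) can be checked mechanically before any logs are added. This is a rough sketch under assumptions: `check_cache_leak` is a hypothetical helper name, it reads Prometheus text-format output on stdin, and it assumes these two metrics are exposed as plain `name value` lines without labels that contain spaces.

```sh
# check_cache_leak
# Read Prometheus text-format metrics on stdin and return non-zero when
# cache entries remain while no clients are connected.
check_cache_leak() {
  awk '
    /^#/ || NF < 2 { next }          # skip HELP/TYPE comments and blank lines
    {
      name = $1
      sub(/[{].*/, "", name)         # drop a label set, if any
      value[name] = $2 + 0
    }
    END {
      clients = value["dagger_connected_clients"]
      entries = value["dagger_dagql_cache_entries"]
      printf "connected_clients=%d cache_entries=%d\n", clients, entries
      if (clients == 0 && entries > 0) {
        print "possible leak: cache entries retained with no connected clients"
        exit 1
      }
    }
  '
}
```

With metrics enabled as described earlier, something like `curl -s http://localhost:9090/metrics | check_cache_leak` can be run after all sessions have closed; the host and port depend on how the metrics address was exposed.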