skills/cache-expert/references/cache_pruning.md
This document describes the current pruning and retention model for the dagql
cache.
The source of truth is the code, mainly:
- dagql/cache.go
- dagql/cache_prune.go
- engine/server/gc.go
- engine/server/server.go
- core/schema/coremod.go
- engine/snapshots/persistent_metadata.go

This doc is about:
The live cache is a DAG of materialized results.
Each sharedResult may depend on other sharedResults through exact dependency
edges in sharedResult.deps.
Conceptually, retention works like graph reachability:
The implementation does not literally maintain one explicit synthetic root node.
Instead, it maintains explicit classes of ownership edges, and
incomingOwnershipCount is the compact runtime summary of whether a result is
still retained.
Still, the "can I reach this from a root?" mental model is the right one.
The cache's e-graph tells us about equivalence and lookup reuse. It does not by itself retain results.
Retention comes from explicit ownership edges:
This distinction matters a lot.
Examples of things that are not retention edges:
A result may be equivalent to another result and still be collectible if nothing owns it.
incomingOwnershipCount

sharedResult.incomingOwnershipCount is the authoritative liveness count.
It is incremented when the cache adds a real ownership edge and decremented when that edge goes away.
When the count reaches zero, the result becomes collectible.
Collection then:
- OnRelease hooks

So the runtime system is not a tracing GC. It is explicit ownership accounting with cascade cleanup.
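The counting-and-cascade behavior described above can be illustrated with a minimal sketch. Everything here is hypothetical: the `result` type, `addOwner`, `link`, and `dropOwner` are simplified stand-ins, not the real `sharedResult` code, and the sketch ignores locking entirely.

```go
package main

import "fmt"

// result is a hypothetical, simplified stand-in for sharedResult.
type result struct {
	name                   string
	incomingOwnershipCount int
	deps                   []*result
	onRelease              func()
}

// addOwner records one new ownership edge pointing at r.
func addOwner(r *result) { r.incomingOwnershipCount++ }

// link makes parent depend on dep; the dependency is itself an ownership edge.
func link(parent, dep *result) {
	parent.deps = append(parent.deps, dep)
	addOwner(dep)
}

// dropOwner removes one ownership edge. At zero the result is collected:
// its release hook runs and the edges it held on its deps are dropped.
func dropOwner(r *result) {
	r.incomingOwnershipCount--
	if r.incomingOwnershipCount > 0 {
		return
	}
	if r.onRelease != nil {
		r.onRelease()
	}
	for _, dep := range r.deps {
		dropOwner(dep)
	}
	r.deps = nil
}

// demo builds A -> B -> C, gives A and C one external owner each,
// then drops A's owner and reports which results were released.
func demo() []string {
	var released []string
	mk := func(name string) *result {
		r := &result{name: name}
		r.onRelease = func() { released = append(released, name) }
		return r
	}
	a, b, c := mk("A"), mk("B"), mk("C")
	link(a, b)
	link(b, c)
	addOwner(a)
	addOwner(c)
	dropOwner(a)
	return released
}

func main() {
	fmt.Println(demo()) // A and B are released; C keeps its external owner
}
```

No traversal from roots is needed at collection time: each dropped edge either leaves the count above zero or triggers a local cascade.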
There are three important root classes today.
When a session obtains cache-backed results, the session gets ownership edges to those results.
Those edges live for the duration of the session.
When the session ends:
- ReleaseSession drops those session ownership edges

This is the most ordinary retention class: "the client is using this result, so keep it alive."
When a field is marked IsPersistable, completed results of that field get a
persisted edge.
That edge does not disappear at session end.
Instead, it remains until later prune work explicitly removes it.
This is how results survive beyond the session that created them and become eligible for shutdown persistence and later restart reuse.
Persisted edges can also carry expiration metadata and can be marked unpruneable.
There are some special cases where the engine intentionally keeps results for its own lifetime.
The main current example is core typedef retention.
core/schema/coremod.go builds the static core typedef graph and then calls
cache.MakeResultUnpruneable(...) on each typedef result. That effectively
installs persisted edges that are never eligible for prune.
This is not a separate retention mechanism in the cache internals. It is the
same persisted-edge machinery with the unpruneable bit set.
Dependency edges are how retention propagates transitively.
If result A depends on result B, then A holds an ownership edge to B.
That means:
This is why a persistable result's transitive dependency closure is retained even though only the top-level result was directly marked persistable.
The dependency edges that matter here are the exact ones in sharedResult.deps,
not symbolic graph relationships.
The cache adds exact dependency edges from a few important sources:
- AddExplicitDependency calls
- ResultCallRef dependencies extracted from the authoritative ResultCall
- result_deps

The important thing is not how they were discovered. The important thing is that once they exist, they participate in real retention and prune simulation.
A big part of the retention story is session teardown.
On session removal, the engine:
- engineCache.ReleaseSession

That drops the session root set and immediately runs the same ownership cascade logic the cache uses everywhere else.
So even before explicit disk pruning policies run, ordinary session release is already constantly pruning the cache back to the non-session-retained graph.
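As a rough illustration of why session release already acts as pruning, the sketch below models a cache where one result holds both a session edge and a persisted edge. The `cache` and `node` types and the `releaseSession` method are hypothetical simplifications, not the engine's actual API.

```go
package main

import "fmt"

// node is a hypothetical, simplified stand-in for a cached result.
type node struct {
	owners int   // plays the role of incomingOwnershipCount
	deps   []int // exact dependency edges, by result ID
}

type cache struct {
	nodes    map[int]*node
	sessions map[string][]int // session ID -> results the session owns
}

// drop removes one ownership edge and cascades when the count hits zero.
func (c *cache) drop(id int) {
	n, ok := c.nodes[id]
	if !ok {
		return
	}
	n.owners--
	if n.owners > 0 {
		return
	}
	delete(c.nodes, id)
	for _, dep := range n.deps {
		c.drop(dep)
	}
}

// releaseSession drops every edge held by the session.
func (c *cache) releaseSession(session string) {
	for _, id := range c.sessions[session] {
		c.drop(id)
	}
	delete(c.sessions, session)
}

// demo returns the IDs still live after the only session ends.
func demo() []int {
	c := &cache{
		nodes: map[int]*node{
			1: {owners: 2, deps: []int{2}}, // session edge + persisted edge
			2: {owners: 1},                 // owned only by result 1
			3: {owners: 1},                 // session edge only
		},
		sessions: map[string][]int{"s1": {1, 3}},
	}
	c.releaseSession("s1")
	var live []int
	for id := 1; id <= 3; id++ {
		if _, ok := c.nodes[id]; ok {
			live = append(live, id)
		}
	}
	return live
}

func main() {
	fmt.Println(demo()) // results 1 and 2 survive; 3 is collected
}
```

Result 1 survives because its persisted edge remains, and result 2 survives transitively through the dependency edge from result 1.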
User-visible persistable behavior is driven by Field.IsPersistable().
At execution time this becomes CallRequest.IsPersistable.
When a persistable result is completed, initCompletedResult calls
upsertPersistedEdgeLocked.
That:
This is why persistable results stay alive after session close.
MakeResultUnpruneable is a special case of persisted retention.
It installs a persisted edge with:
- unpruneable = true

Prune candidate selection skips those results entirely.
This is what the core typedef retention path uses today.
Persisted edges may have an expiresAtUnix.
That expiration does not by itself immediately delete the result. Instead, it affects candidate ordering and eligibility during prune.
Expired persisted edges are preferred prune candidates.
The prune operation does not directly remove arbitrary results.
The thing it cuts is the persisted edge.
That is an important design point.
Why?
Because persisted edges are the durable roots for cache retention beyond live sessions. If prune wants to stop keeping something, it removes that root edge. The normal ownership cascade then collects anything that is no longer reachable.
So the prune algorithm is really:
The current prune policy type is dagql.CachePrunePolicy.
It includes:
- All
- Filters
- KeepDuration
- ReservedSpace
- MaxUsedSpace
- MinFreeSpace
- TargetSpace
- CurrentFreeSpace

This policy shape is still BuildKit-influenced.
That is intentional for now:
So the current pruning system is Dagger-owned in implementation, but still uses policy concepts inspired by BuildKit.
The engine server builds dagql prune policies in engine/server/gc.go.
That layer:
- CurrentFreeSpace from actual disk stats
- engineCache.Prune

So dagql owns the prune implementation, while engine/server owns policy construction and triggering.
At a high level, the prune implementation in dagql/cache_prune.go does this:
This is a best-effort pruning pass, not an optimal solver.
An important design goal is: prune should not become a stop-the-world GC.
The implementation addresses that in two ways:
The cache briefly takes a snapshot of the information it needs:
Then it releases the lock and does the expensive reasoning outside the lock.
Only once the plan is chosen does the cache reacquire the live lock and attempt to remove persisted edges from the real cache.
That means the slow part is simulation, not holding the live graph lock.
The prune snapshot is a simplified view of the live cache:
- pruneSnapshotResult per live result

There is also pruneUsageIdentityState tracking shared-storage identities.
This snapshot is enough to simulate edge cuts without touching live cache state.
Before choosing prune candidates, the cache computes the active closure from session roots.
This means:
Anything in that active closure is not a prune candidate, even if it has a persisted edge.
This is an important subtlety:
Only results with persisted edges are considered.
Candidate collection skips results if:
- KeepDuration

So pruning is not scanning "all results." It is scanning the persisted-root set and applying a few simple eligibility rules.
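The active-closure protection and the candidate filter can be sketched together. This is a hypothetical model: `snapResult` and its fields are invented for illustration, and the KeepDuration rule shown (skip anything used more recently than the keep window) is an assumption about what that check means.

```go
package main

import (
	"fmt"
	"sort"
)

// snapResult is a hypothetical, simplified snapshot entry.
type snapResult struct {
	deps         []int
	sessionRoot  bool
	persisted    bool
	unpruneable  bool
	lastUsedUnix int64
}

// activeClosure returns every result reachable from a session root.
func activeClosure(results map[int]snapResult) map[int]bool {
	active := map[int]bool{}
	var stack []int
	for id, r := range results {
		if r.sessionRoot {
			stack = append(stack, id)
		}
	}
	for len(stack) > 0 {
		id := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if active[id] {
			continue
		}
		active[id] = true
		stack = append(stack, results[id].deps...)
	}
	return active
}

// candidates keeps only persisted, pruneable, inactive, idle results.
func candidates(results map[int]snapResult, now, keepSeconds int64) []int {
	active := activeClosure(results)
	var out []int
	for id, r := range results {
		switch {
		case !r.persisted, r.unpruneable, active[id]:
			continue
		case now-r.lastUsedUnix < keepSeconds: // assumed meaning of KeepDuration
			continue
		}
		out = append(out, id)
	}
	sort.Ints(out)
	return out
}

func demo() []int {
	results := map[int]snapResult{
		1: {deps: []int{2}, sessionRoot: true},
		2: {persisted: true},                    // persisted but in the active closure
		3: {persisted: true, unpruneable: true}, // engine-owned
		4: {persisted: true, lastUsedUnix: 100}, // idle: the only candidate
		5: {lastUsedUnix: 100},                  // no persisted edge
		6: {persisted: true, lastUsedUnix: 990}, // used within the keep window
	}
	return candidates(results, 1000, 60)
}

func main() {
	fmt.Println(demo())
}
```

Result 2 shows the subtlety: it has a persisted edge, yet it is excluded because a live session still reaches it.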
The current candidate ordering is heuristic and intentionally simple.
Candidates are sorted roughly by:
This is not sophisticated. It is a basic heuristic.
There is a lot of room to improve this later.
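A simple ordering heuristic of this kind might look like the sketch below. Only the "expired edges first" rule comes from this document; the least-recently-used tie-break, the `candidate` type, and the convention that a zero `expiresAtUnix` means "never expires" are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sort"
)

// candidate is a hypothetical view of one persisted edge.
type candidate struct {
	id            int
	expiresAtUnix int64 // 0 is assumed to mean "no expiration"
	lastUsedUnix  int64
}

func expired(c candidate, now int64) bool {
	return c.expiresAtUnix != 0 && c.expiresAtUnix <= now
}

// order sorts expired candidates first, then least recently used first.
func order(cands []candidate, now int64) {
	sort.SliceStable(cands, func(i, j int) bool {
		ei, ej := expired(cands[i], now), expired(cands[j], now)
		if ei != ej {
			return ei
		}
		return cands[i].lastUsedUnix < cands[j].lastUsedUnix
	})
}

func demo() []int {
	cands := []candidate{
		{id: 1, expiresAtUnix: 0, lastUsedUnix: 100},
		{id: 2, expiresAtUnix: 900, lastUsedUnix: 500}, // expired at now=1000
		{id: 3, expiresAtUnix: 2000, lastUsedUnix: 50},
	}
	order(cands, 1000)
	ids := make([]int, len(cands))
	for i, c := range cands {
		ids[i] = c.id
	}
	return ids
}

func main() {
	fmt.Println(demo()) // expired candidate 2 first, then 3, then 1
}
```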
The current reclaim planner is greedy.
It does not try to solve a globally optimal selection problem.
Given the current candidate order, it simulates cutting persisted edges one by one until the target reclaim threshold is reached.
That is intentionally cheap and simple compared to trying to solve a more optimal subset selection problem.
This is very much a "good enough for now" pruning strategy.
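The greedy edge-cut simulation can be sketched as follows. The `simNode` type and the `simulateCut` and `plan` functions are hypothetical; the sketch assumes ownership counts in the snapshot are consistent and that each result has a known size.

```go
package main

import "fmt"

// simNode is a hypothetical snapshot copy of one result.
type simNode struct {
	owners int
	deps   []int
	size   int64
}

// simulateCut drops one ownership edge on id in the snapshot and returns
// the bytes freed by the resulting cascade.
func simulateCut(nodes map[int]*simNode, id int) int64 {
	n := nodes[id]
	n.owners--
	if n.owners > 0 {
		return 0
	}
	freed := n.size
	for _, dep := range n.deps {
		freed += simulateCut(nodes, dep)
	}
	return freed
}

// plan walks candidates in order and stops once the target is met.
func plan(nodes map[int]*simNode, ordered []int, target int64) ([]int, int64) {
	var picked []int
	var reclaimed int64
	for _, id := range ordered {
		if reclaimed >= target {
			break
		}
		picked = append(picked, id)
		reclaimed += simulateCut(nodes, id)
	}
	return picked, reclaimed
}

func demo() ([]int, int64) {
	nodes := map[int]*simNode{
		1: {owners: 1, deps: []int{3}, size: 10}, // persisted edge only
		2: {owners: 1, deps: []int{3}, size: 20}, // persisted edge only
		3: {owners: 2, size: 100},                // shared dependency of 1 and 2
		4: {owners: 1, size: 50},                 // persisted edge only
	}
	return plan(nodes, []int{1, 2, 4}, 120)
}

func main() {
	picked, reclaimed := demo()
	fmt.Println(picked, reclaimed)
}
```

Cutting candidate 1 frees only its own 10 bytes because result 3 is still owned by result 2. The shared 100 bytes are credited only when the second cut removes the last owner, which is why the simulation works on edges rather than on individual results.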
The simulation state tracks:
Applying a candidate means:
This is why the simulation is "edge cut" based rather than "delete this result" based.
Multiple results can represent the same underlying physical storage.
This is handled through cache-usage identities.
The relevant interfaces are:
- hasCacheUsageIdentity
- cacheUsageSizer
- cacheUsageMayChange

The basic idea is:

- sharedResultID

This is how pruning avoids double-counting shared snapshots or other shared storage.
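A minimal sketch of identity-based deduplication is shown below. The `usageEntry` type is hypothetical, and the fallback to a key derived from the result ID when no identity is reported is an assumption about how the sharedResultID is used.

```go
package main

import "fmt"

// usageEntry is a hypothetical measurement for one result.
type usageEntry struct {
	resultID int
	identity string // shared-storage identity; empty if the result reports none
	size     int64
}

// totalUsage counts each storage identity once, however many results share it.
func totalUsage(entries []usageEntry) int64 {
	seen := map[string]bool{}
	var total int64
	for _, e := range entries {
		key := e.identity
		if key == "" {
			// Assumed fallback: the result's own ID acts as its identity.
			key = fmt.Sprintf("result:%d", e.resultID)
		}
		if seen[key] {
			continue
		}
		seen[key] = true
		total += e.size
	}
	return total
}

func demo() int64 {
	return totalUsage([]usageEntry{
		{resultID: 1, identity: "snap-a", size: 100},
		{resultID: 2, identity: "snap-a", size: 100}, // same storage as result 1
		{resultID: 3, size: 10},
	})
}

func main() {
	fmt.Println(demo()) // 110, not 210
}
```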
Prune needs approximate reclaim sizes, so it measures usage before planning.
The flow is:
Important details:
- self values participate

This measurement phase is separate from candidate simulation, but the simulation depends on its output.
pruneTargetBytes computes the reclaim target from policy thresholds.
The current logic is still policy-shaped rather than deeply semantic:
- MaxUsedSpace
- ReservedSpace
- MinFreeSpace
- TargetSpace

If thresholds are not triggered but the policy is effectively "prune matching things anyway" (All or filters), the target becomes effectively unlimited.
That is how explicit user prune requests can still remove matching entries even without disk pressure.
Once the plan is built, the cache applies it against live state by calling
removePersistedEdge for each planned candidate.
This is where real-time drift matters.
Between snapshot time and apply time:
The implementation accepts that.
If removePersistedEdge says the edge is already gone, prune just skips it.
This is fine. Pruning is best effort.
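The snapshot, plan, and apply phases, including tolerance for drift, can be sketched like this. The `liveCache` type and `prune` helper are hypothetical, and unlike the real apply path this sketch takes the lock once per edge rather than once for the whole apply step.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// liveCache is a hypothetical cache that tracks only persisted edges.
type liveCache struct {
	mu        sync.Mutex
	persisted map[int]bool
}

// snapshot copies the persisted-edge set under a short critical section.
func (c *liveCache) snapshot() []int {
	c.mu.Lock()
	defer c.mu.Unlock()
	ids := make([]int, 0, len(c.persisted))
	for id := range c.persisted {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}

// removePersistedEdge reports false if the edge was already gone.
func (c *liveCache) removePersistedEdge(id int) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	if !c.persisted[id] {
		return false
	}
	delete(c.persisted, id)
	return true
}

// prune snapshots, plans without holding the lock, then applies best effort.
func prune(c *liveCache, plan func([]int) []int) (removed, skipped []int) {
	snap := c.snapshot()
	planned := plan(snap) // the slow part runs with the lock released
	for _, id := range planned {
		if c.removePersistedEdge(id) {
			removed = append(removed, id)
		} else {
			skipped = append(skipped, id) // drifted since the snapshot
		}
	}
	return removed, skipped
}

func demo() (removed, skipped []int) {
	c := &liveCache{persisted: map[int]bool{1: true, 2: true, 3: true}}
	return prune(c, func(ids []int) []int {
		c.removePersistedEdge(2) // simulate another actor during planning
		return ids
	})
}

func main() {
	removed, skipped := demo()
	fmt.Println(removed, skipped)
}
```

Edge 2 disappears while the plan is being computed, so the apply step simply records it as skipped and continues.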
The live apply path relies on the same ownership cascade used everywhere else:
- OnRelease

At a high level, dagql retention and pruning are expressed through snapshot owner leases.
When a retained result owns snapshots, the cache ensures the snapshot manager attaches a lease for that result's owner slots.
When a result is finally collected, its OnRelease cleanup removes those owner
leases.
Actual physical snapshot reclamation is then largely delegated to containerd:
The prune path itself triggers snapshot metadata GC after it has actually removed entries, but the low-level cleanup semantics are intentionally delegated to containerd rather than reimplemented in dagql.
That is enough to understand the current prune story at a high level. The lease/snapshot side can be documented in finer detail separately.
Pruning can leave the union-find class ID space sparse.
So after prune removes anything, the cache may compact eq-classes.
This:
This is maintenance work to keep the e-graph structure tidy after repeated merge-and-prune cycles.
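As an illustration of what compacting a sparse class ID space involves, the sketch below renumbers class IDs densely while preserving which results share a class. The `compact` function and the map-based representation are assumptions; the real e-graph structures are not described in this document.

```go
package main

import (
	"fmt"
	"sort"
)

// compact renumbers sparse class IDs to 0..n-1 and returns the new
// result-to-class mapping along with the number of classes.
func compact(classOf map[int]int) (map[int]int, int) {
	seen := map[int]bool{}
	var old []int
	for _, class := range classOf {
		if !seen[class] {
			seen[class] = true
			old = append(old, class)
		}
	}
	sort.Ints(old) // deterministic renumbering
	remap := make(map[int]int, len(old))
	for dense, class := range old {
		remap[class] = dense
	}
	out := make(map[int]int, len(classOf))
	for id, class := range classOf {
		out[id] = remap[class]
	}
	return out, len(old)
}

func demo() (map[int]int, int) {
	// Results 1 and 2 share class 7; classes 19 and 42 survived pruning.
	return compact(map[int]int{1: 7, 2: 7, 3: 42, 4: 19})
}

func main() {
	classes, n := demo()
	fmt.Println(classes, n)
}
```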
The same size-accounting machinery also feeds usage reporting.
UsageEntriesAll:
- CacheUsageEntry values

The engine exposes that through EngineLocalCacheEntries.
So the prune-size view and the user-visible cache-entry view come from the same accounting path.
The static core schema typedef graph is intentionally retained for the life of the engine.
core/schema/coremod.go does this by calling MakeResultUnpruneable on the
typedef results when building the core schema view state.
That means:
This is one of the clearest examples of "engine-owned lifetime" rather than session-owned or merely persistable lifetime.
The current algorithm is intentionally basic.
Important limitations:
This is not trying to be the final word in pruning quality.
It is a straightforward best-effort heuristic that works with the current cache ownership model.
The current dagql prune model treats persisted edges as prunable retention roots, protects live session closure from prune, takes a quick snapshot of the retained graph, runs a simple greedy edge-cut simulation outside the lock, then cuts real persisted edges and lets normal ownership cascade and containerd lease cleanup do the rest.