skills/cache-expert/references/cache_persistence.md
This document describes the current persistence model for the dagql cache.
The source of truth is the code, mainly:
- dagql/cache.go
- dagql/cache_persistence_import.go
- dagql/cache_persistence_worker.go
- dagql/cache_persistence_self.go
- dagql/cache_persistence_resolver.go
- dagql/persistdb/schema.sql
- core/persisted_object.go

This doc is about the persistence model itself: what is persisted, when it is persisted, how objects encode themselves, and what guarantees we do and do not make.
The persistence model is intentionally simple and intentionally best effort.
The cache is fundamentally an in-memory cache.
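While the engine is running, the cache lives and changes entirely in memory.
Disk persistence is only used as a startup/shutdown checkpoint: the store is loaded once on startup and flushed once on graceful shutdown.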
This is not meant to behave like a database with durability guarantees. If the engine crashes or is killed ungracefully, losing the cache is acceptable. It is just a cache.
Right now, if persistence is suspect, we do not try to salvage pieces of it. We wipe it and cold-start.
dagql.NewCache is the entry point.
If no DB path is configured, the cache is just in-memory and persistence is effectively disabled.
If a DB path is configured, dagql startup does this:
- meta.schema_version
- meta.clean_shutdown
- clean_shutdown=0

That clean_shutdown=0 write at startup is important: it means the store is
considered dirty until a later successful graceful close explicitly marks it
clean.
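The startup checks can be sketched roughly as follows. This is a minimal model, not the dagql implementation: metaStore, openStore, and the schema version value "1" are hypothetical, the meta table is represented as an in-memory map, and the exact order of the real steps is an assumption.

```go
package main

import "fmt"

// metaStore is a hypothetical stand-in for the meta table: a plain key/value map.
type metaStore map[string]string

// wantSchemaVersion is an assumed value, used for illustration only.
const wantSchemaVersion = "1"

// openStore reports whether the persisted state may be imported.
// In every case it leaves the store marked dirty (clean_shutdown=0).
func openStore(meta metaStore) (usable bool) {
	usable = meta["schema_version"] == wantSchemaVersion && meta["clean_shutdown"] == "1"
	if !usable {
		// Blunt recovery: discard everything and cold-start.
		for k := range meta {
			delete(meta, k)
		}
		meta["schema_version"] = wantSchemaVersion
	}
	// Dirty until a later graceful close writes clean_shutdown=1.
	meta["clean_shutdown"] = "0"
	return usable
}

func main() {
	meta := metaStore{"schema_version": "1", "clean_shutdown": "1"}
	fmt.Println(openStore(meta)) // first start after a clean shutdown
	fmt.Println(openStore(meta)) // restart with no graceful close in between
}
```

In this model the second call returns false because nothing wrote clean_shutdown=1 between the two starts, which is the situation after a crash or kill.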
At the engine level, the dagql persisted graph is the source of truth for
restartable local cache state. If dagql startup discards persistence, or if the
dagql cache cannot be opened at all, engine/server.NewServer also discards
the local worker/containerd state before retrying startup. That reset removes
the worker state directory and dagql SQLite files, then recreates the
containerd metadata DB, content store, snapshotter, snapshot manager, and dagql
cache from empty state. Root-level non-cache state such as the engine secret
salt is preserved.
During normal engine execution:
- clean_shutdown=0 and the shutdown clean_shutdown=1

The runtime cache state may change constantly, but none of that is pushed to SQLite during normal operation.
The engine shutdown path matters here.
engine/server/server.go:GracefulStop does the important sequencing:
- clean_shutdown=1

The session removal part is critical. Before persistence, the engine tries to get rid of session-owned state first so the retained graph is in a steady state.
That means:
- ReleaseSession removes session ownership edges from the cache

By the time Cache.Close() persists, the cache should reflect the post-session
retained state rather than some partially attached session state.
One subtle but important detail: graceful shutdown may still prune before persistence. So "everything marked persistable gets written" is not quite the whole story. More precisely:
The failure strategy is intentionally blunt.
If any of these happen:
- clean_shutdown != 1

the persistence DB is wiped. During full engine startup, the corresponding worker/containerd state is wiped too, because those snapshots and content are only meaningful if the dagql graph that owns them is trusted.
If persistence fails during Cache.Close():
- clean_shutdown=1 is not recorded

Then on the next startup, the store is seen as unclean and wiped.
We do not try to preserve partial progress or repair a half-written store.
The persistence store is SQLite via modernc.org/sqlite.
The DB is opened with pragmas chosen explicitly for cache semantics rather than database durability:
- journal_mode=WAL
- busy_timeout=10000
- synchronous=OFF
- BEGIN IMMEDIATE transactions

The important implication is that we intentionally choose better performance over robust crash durability. That matches the "cache, not database" model.
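One way to express these settings is through DSN query parameters. The sketch below uses the _pragma and _txlock parameters documented by the modernc.org/sqlite driver; whether dagql sets them through the DSN or by executing PRAGMA statements is an assumption here, and cacheDSN and pragmas are hypothetical helpers. The code only builds and inspects the string, so it runs without the driver.

```go
package main

import (
	"fmt"
	"net/url"
)

// cacheDSN builds a SQLite DSN carrying the pragmas listed above.
// Passing them via the DSN is an assumption of this sketch.
func cacheDSN(path string) string {
	q := url.Values{}
	q.Add("_pragma", "journal_mode(WAL)")
	q.Add("_pragma", "busy_timeout(10000)")
	q.Add("_pragma", "synchronous(OFF)")
	q.Set("_txlock", "immediate") // transactions start with BEGIN IMMEDIATE
	return "file:" + path + "?" + q.Encode()
}

// pragmas parses a DSN back into its pragma list, for inspection.
func pragmas(dsn string) ([]string, error) {
	u, err := url.Parse(dsn)
	if err != nil {
		return nil, err
	}
	return u.Query()["_pragma"], nil
}

func main() {
	p, err := pragmas(cacheDSN("/tmp/dagql-cache.db"))
	fmt.Println(p, err)
}
```

With synchronous=OFF, SQLite hands writes to the operating system without waiting for them to reach disk, which is the durability trade-off described above.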
The schema lives in dagql/persistdb/schema.sql.
There are three broad groups of data:
meta

Currently used for:
- schema_version
- clean_shutdown

Graph tables:
- results
- eq_classes
- eq_class_digests
- terms
- term_inputs
- result_output_eq_classes
- result_deps
- persisted_edges
- result_snapshot_links

This is the persisted mirror of the in-memory dagql cache/e-graph state.

Snapshot metadata tables:
- snapshot_content_links
- imported_layer_blob_index
- imported_layer_diff_index

These do not describe the dagql graph directly. They mirror auxiliary snapshot manager metadata needed to reconstruct snapshot/content relationships and imported-layer indexes on restart.
On graceful shutdown, the cache snapshots and writes:
- sharedResults in resultsByID

That is important: the store does not just save a small set of "roots." It saves the live retained cache graph and the metadata needed to reconstruct it.
In other words, persistence is trying to serialize the current cache state, not just enough information to replay everything later.
Not everything in the cache is persisted.
Important omissions:
- ongoingCalls
- cache_arbitrary.go

Those are runtime-only.
The persisted store is about retained dagql call-cache state and snapshot metadata, not every transient runtime structure.
The main user-visible way something survives beyond a session is through
IsPersistable.
At the dagql field-definition level, Field.IsPersistable() sets the field spec
to mark results of that field as eligible for persistence.
At execution time, that turns into CallRequest.IsPersistable, and the cache
responds by adding a persisted edge for the completed result.
That persisted edge does two things:
Because retained results also keep their exact result dependencies alive, making a result persistable retains its transitive dependency closure too.
This is why shutdown persistence naturally includes more than just the root persistable results: the retained graph includes whatever those roots depend on.
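The closure effect can be illustrated with a plain graph walk. This is a toy model rather than the dagql e-graph: results are string IDs, deps is a hypothetical dependency map, and retained is an illustrative helper.

```go
package main

import (
	"fmt"
	"sort"
)

// retained returns every result reachable from the persisted roots
// by following exact result dependencies.
func retained(roots []string, deps map[string][]string) []string {
	seen := map[string]bool{}
	stack := append([]string(nil), roots...)
	for len(stack) > 0 {
		id := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		if seen[id] {
			continue
		}
		seen[id] = true
		stack = append(stack, deps[id]...)
	}
	out := make([]string, 0, len(seen))
	for id := range seen {
		out = append(out, id)
	}
	sort.Strings(out)
	return out
}

func main() {
	deps := map[string][]string{
		"container": {"rootfs", "config"},
		"rootfs":    {"baseImage"},
		"scratch":   nil, // not reachable from any persisted root
	}
	fmt.Println(retained([]string{"container"}, deps))
	// prints [baseImage config container rootfs]
}
```

Only "container" carries a persisted edge in this example, yet three more results are retained with it, while "scratch" is not.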
Pruning is the part of the system that decides which persisted edges survive over time. That deserves its own doc, but it is directly relevant here because it controls what still exists to flush at shutdown.
The results table stores one self_payload blob per result.
That blob is not a raw Go serialization of the whole object. It is a structured
PersistedResultEnvelope defined in dagql/cache_persistence_self.go.
The current envelope kinds are:
- null
- object_self
- scalar_json
- list

The envelope also carries:
- resultID
- typeName
- sessionResourceHandle

The envelope is the generic dagql-level wrapper. Object-specific details live inside object JSON payloads implemented by the object types themselves.
There are three main interfaces to know:
PersistedObject

Implemented by typed self payloads that know how to encode themselves directly:
- EncodePersistedObject(context.Context, PersistedObjectCache) (json.RawMessage, error)

This is how objects serialize their own internal state to JSON.

PersistedObjectDecoder

Implemented by zero-value object types that know how to reconstruct themselves:
- DecodePersistedObject(context.Context, *Server, uint64, *ResultCall, json.RawMessage) (Typed, error)

This is how object payloads are rebuilt on import or first hit.

PersistedSnapshotRefLinkProvider

Implemented by objects that can name the durable snapshots they own:
- PersistedSnapshotRefLinks() []PersistedSnapshotRefLink

This is how object payloads expose snapshot ownership links for
result_snapshot_links.
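The division of labour between the three interfaces can be sketched with simplified stand-ins. The signatures below are deliberately reduced (no context, cache, server, or call frame), toyDir is not core.Directory, and the snapshotLink fields and the meaning of the uint64 parameter are assumptions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Simplified, hypothetical versions of the three interfaces above.
type encoder interface {
	EncodePersistedObject() (json.RawMessage, error)
}

type decoder interface {
	DecodePersistedObject(id uint64, payload json.RawMessage) (any, error)
}

type snapshotLinker interface {
	PersistedSnapshotRefLinks() []snapshotLink
}

// snapshotLink has assumed fields; the real link type may differ.
type snapshotLink struct{ RefKey, Role string }

// toyDir is an illustrative object type.
type toyDir struct {
	Path     string `json:"path"`
	Snapshot string `json:"snapshot"`
}

func (d *toyDir) EncodePersistedObject() (json.RawMessage, error) { return json.Marshal(d) }

// DecodePersistedObject is called on a zero value and returns a fresh object.
func (*toyDir) DecodePersistedObject(_ uint64, payload json.RawMessage) (any, error) {
	var out toyDir
	if err := json.Unmarshal(payload, &out); err != nil {
		return nil, err
	}
	return &out, nil
}

func (d *toyDir) PersistedSnapshotRefLinks() []snapshotLink {
	return []snapshotLink{{RefKey: d.Snapshot, Role: "snapshot"}}
}

// Compile-time checks that toyDir satisfies all three sketches.
var (
	_ encoder        = (*toyDir)(nil)
	_ decoder        = (*toyDir)(nil)
	_ snapshotLinker = (*toyDir)(nil)
)

func main() {
	payload, _ := (&toyDir{Path: "/src", Snapshot: "snap-1"}).EncodePersistedObject()
	obj, err := (&toyDir{}).DecodePersistedObject(7, payload)
	fmt.Println(string(payload), obj.(*toyDir).Path, err)
}
```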
Persisted object payloads often refer to other persisted dagql objects.
Those references are encoded through encodePersistedObjectRef, which stores the
referenced object's sharedResultID.
This is a major current caveat:
That is accepted for now. The persistence format is a snapshot of one engine's cache state, not a portable interchange format.
The lazy system is separate conceptually, but it matters directly to persistence.
The core lazy interface includes:
- Evaluate
- AttachDependencies
- EncodePersisted

That last method is the persistence hook.
For objects like Directory, File, and Container, persisted object encoding
often has two broad forms:
This is a big design point: laziness does not block persistence as long as the lazy operation is structurally representable.
If an object has neither:
then persistence returns ErrPersistStateNotReady.
Today Directory and File explicitly do this when they have neither snapshot
nor lazy state available to encode.
That is important because shutdown persistence is all-or-nothing from the point
of view of clean restart. If a persistable retained result cannot be serialized,
the flush fails, clean_shutdown=1 is not recorded, and the next startup wipes
the store.
Snapshots are not encoded only implicitly through object JSON.
There are two related persistence mechanisms:
Objects expose PersistedSnapshotRefLinks(), and those are written into
result_snapshot_links.
These links describe:
Examples:
Separately, the snapshot manager exports:
Those rows are written into the snapshot metadata tables and loaded back into the snapshot manager at startup.
Import does more than just rebuild tables in memory.
After reading the mirrored rows, startup also:
This is how persisted dagql ownership is translated back into live containerd lease ownership at startup.
The ordering matters:
That intentionally biases failure modes toward temporary over-retention rather than accidental ownership loss.
importPersistedState rebuilds the live cache in several phases:
The opportunistic eager decode is subtle:
The implementation detail behind that is important:
- ensurePersistedHitValueLoaded

This matches the current code path and is not just a vague policy choice.
This was added originally with the intention of handling dynamic object reconstruction, especially around module objects and other schema-dependent object types. Import-time decode does not necessarily have enough live schema/type context to rebuild those objects correctly, so the system defers full decode until the object is accessed through an actual server/resolver path.
It may be worth reassessing whether that concern is still fully valid in the current architecture, but today the lazy-on-first-use decode is still real and intentional in the implementation.
So after import, a result may exist in the graph with:
- persistedEnvelope
- hasValue == false

That is valid. ensurePersistedHitValueLoaded is the boundary that materializes
that payload before the result escapes to callers.
Shutdown persistence is a two-step process:
snapshotPersistState walks the in-memory cache and builds a detached
persistStateSnapshot.
This is important for performance and correctness:
- egraphMu is held only while copying the graph state out

So the live cache is not held under the graph lock for the whole flush.
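The copy-then-write pattern can be sketched as below. toyCache, snapshot, and apply are hypothetical; mu plays the role of egraphMu, and a map stands in for the SQLite store. The write step in this sketch replaces the destination wholesale.

```go
package main

import (
	"fmt"
	"sync"
)

// toyCache is hypothetical; mu plays the role of egraphMu.
type toyCache struct {
	mu      sync.Mutex
	results map[uint64]string
}

// snapshot copies state out while holding the lock, and does nothing else.
func (c *toyCache) snapshot() map[uint64]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	out := make(map[uint64]string, len(c.results))
	for id, payload := range c.results {
		out[id] = payload
	}
	return out
}

// apply rewrites the destination from the detached snapshot.
// No cache lock is held while it runs.
func apply(dst, snap map[uint64]string) {
	for id := range dst {
		delete(dst, id)
	}
	for id, payload := range snap {
		dst[id] = payload
	}
}

func main() {
	c := &toyCache{results: map[uint64]string{1: "a", 2: "b"}}
	snap := c.snapshot()

	c.mu.Lock()
	c.results[3] = "c" // a later mutation does not affect the detached snapshot
	c.mu.Unlock()

	disk := map[uint64]string{99: "stale"}
	apply(disk, snap)
	fmt.Println(len(disk), disk[99] == "")
}
```

The slow part (the write) works only on the detached copy, so other users of the cache wait only for the duration of the copy.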
applyPersistStateSnapshot then:
This is a whole-snapshot rewrite, not an incremental update.
That is another deliberate simplification:
For each result being flushed, persistence stores:
- call_frame_json
- self_payload

It also stores separately:
The authoritative call frame matters a lot. It is the semantic identity and reconstruction anchor used during later decode.
Persisted object references are keyed by sharedResultID, which is only
meaningful inside one engine cache snapshot.
If shutdown is not clean, the store is wiped on the next startup.
If import or startup validation fails, the store is wiped. If shutdown flush fails, the next startup wipes the store.
One explicit example today: Container.EncodePersistedObject still rejects
containers carrying:
That is a known first-cut restriction.
The arbitrary in-memory cache is session/runtime-only today.
The persistence model makes a few strong performance choices:
- synchronous=OFF

The price paid is lower durability and a willingness to wipe the store if anything looks wrong.
If you want to understand the live implementation quickly, this order works well:
1. dagql/cache.go
   - NewCache
   - prepareCacheDBs
   - ReleaseSession
   - Close
2. engine/server/server.go
   - GracefulStop
3. dagql/cache_persistence_worker.go
   - persistCurrentState
   - snapshotPersistState
   - applyPersistStateSnapshot
   - persistResultEnvelope
4. dagql/cache_persistence_import.go
   - importPersistedState
   - ensurePersistedHitValueLoaded
5. dagql/cache_persistence_self.go
   - PersistedResultEnvelope
   - PersistedObject
   - PersistedObjectDecoder
   - PersistedSnapshotRefLinkProvider
   - encodePersistedResultEnvelope
   - decodePersistedResultEnvelope
6. dagql/persistdb/schema.sql
7. core/persisted_object.go
The current dagql persistence model is a best-effort startup/shutdown snapshot of the live in-memory cache: load once on startup, run entirely in memory, flush once on graceful shutdown, and wipe the whole store whenever the on-disk state looks unsafe or inconsistent.