crates/event_store/README.md
Embedded event store and authoritative log of state-affecting messages for NautilusTrader.
[!WARNING] Early alpha. The API is not stable and may change between versions. Event-store capture, replay, and verification workflows are the current focus.
The nautilus-event-store crate is designed to capture commands, generated events, raw venue
reports, reconciliation outputs, and request/response traffic at the message bus's publish and
send entry points. It persists those messages as a durable per-run log and exposes a typed replay
path. Combined with cache snapshots anchored to a durable log high-watermark, it provides the
state-affecting history needed for audit, deterministic replay, end-to-end correlation of agent
decisions, and counterfactual research.
The event store is the authoritative durable boundary for the deterministic engine's state-affecting history:
nautilus-event-store is a single-node, embedded event store for one trading instance. It
captures state-affecting bus traffic at the publish and send entry points, persists it as a
per-run log, and exposes replay paths for the kernel, agents, and deterministic simulation
testing (DST).
It defines:
It does not define:
Two load-bearing guarantees:
(seed, binary_hash, config_hash, schema_version, log) reproduces observable behavior
bit-identically across the crates covered by deterministic simulation testing when paired with
captured raw venue inputs and the canonical replay order. Outside that scope, such as adapter
network I/O, replay is causally close: same logical sequence, not bit-exact byte order.The store also enables:
intent_id.parent_run_id in the manifest.seq whose entry has been durably acknowledged by the backend.
Snapshots are anchored to this value.seq > anchor.intent_id, correlation_id,
caused_by. Header propagation spans command construction, endpoint sends, generated events,
and reconciliation reports.Captured at the message bus dispatch boundaries:
SubmitOrder, CancelOrder, ModifyOrder, Subscribe*, Unsubscribe*, etc.) at
the send and publish entry points, before any handler runs.OrderStatusReport, FillReport, PositionStatusReport) on a dedicated
topic before reconciliation synthesizes any derived event from them.OrderInitialized produced for external
orders during reconciliation.Subscribe* and Unsubscribe* activations and removals) so
replay reconstructs the strategy or agent's observation window.RunStarted (durably committed before any other entry of the run) and
RunEnded on graceful shutdown. The RunStarted payload includes the registered component
manifest so replay binds actors, strategies, algorithms, subscriptions, and command endpoints
without consulting external config.Capturing raw reports before synthesis is load-bearing: replay must be able to re-run reconciliation against the same input the live engine saw, not only against the events it produced.
OrderInitialized after the cache
insert succeeds. The raw venue report is captured before reconciliation, so replay can inspect
both the input and the derived order origin.position_id propagation is engine-derived from OrderFilled.
Replay runs that propagation rule, and verification asserts that replay reproduces those links.RunStarted component
manifest, and verification coverage.Capture sites are instrumented at the bus's publish and send entry points. Each captured message
is forwarded to a dedicated writer thread. The default store backend is
redb, an embedded ACID key-value store.
+-----------------+ publish/send +---------------------+
| MessageBus tap | ---------------> | Bus capture adapter |
+-----------------+ +----------+----------+
| sync_channel
v
+--------------------+
| Event store writer | dedicated thread, batched commits
+----------+---------+
v
+-----------------+
| Store backend | redb (default)
+-----------------+
Bus-level capture frees adapters and engines from per-call instrumentation. Capture sites pass
the entry to the writer through a bounded std::sync::mpsc::sync_channel, mirroring the dedicated
writer-thread shape used by the existing logging subsystem. The event store uses a bounded
channel where logging uses an unbounded one; the audit contract requires capture, so the
backpressure policy is no-drop.
The writer batches up to ~100 entries or ~5 ms (whichever first), commits them in a single
backend transaction, and amortizes the fsync cost across the batch. This policy comes from the
storage-backend benchmark: with redb 4.1.0, 256 B payloads, Durability::Immediate, and an
NVMe/ext4 host, a batch size of 100 measured 5.16 ms p50 commit latency and 18,356 entries/sec.
That is two-to-three orders of magnitude above the expected captured-traffic rate; larger batches
cut tail latency only marginally and trade against burst responsiveness. The
high-watermark advances only when the backend acknowledges the commit. Backpressure: the channel
send blocks until the writer accepts. If a stall exceeds a configurable threshold, the kernel
halts rather than dropping or proceeding unaudited. Readers compose with the writer through the
backend's MVCC.
Persistence ordering has two boundaries. Inbound bus messages and raw reports are durably committed before handler dispatch. Generated lifecycle events are durably committed before bus fanout to downstream handlers. Some synchronous-core paths mutate cache before publishing the generated lifecycle event; for those paths, the event store records the successful mutation at the publish boundary and does not claim durable-before-cache-write ordering. State-affecting handlers reached through the bus run after the writer has acknowledged the captured entry's batch.
The tap must fire before fanout, never after. Three guarantees depend on it:
The public API is intentionally not frozen yet. The implementation should expose four roles:
seq, and looking up
secondary indices.intent_id lookups, and run iteration.Entry metadata must include a per-run seq replay-order authority, domain timestamps such as
ts_init and optional bus-accepted time, topic or endpoint identity, payload type, payload bytes,
and correlation headers. Exact Rust type names, field names, and serialization choices are early
alpha implementation details.
The on-disk realization is backend-specific. The crate exposes only the logical layout. Every
backend stores per-run entries keyed by seq, with sidecar indices for intent_id and message
id lookups, plus a manifest and an optional snapshot anchor.
Logical layout, per-run:
seq.intent_id -> seq, client_order_id -> seq, venue_order_id -> seq.The redb backend stores entries and indices in named tables
inside one redb file per run; a custom-WAL backend stores entries as length-prefixed records
with frame-level CRC32 in segment files plus sidecar index files. Both expose the same logical
contract; the crate does not promise a particular file format.
Run manifest:
run_id <start_ts_init>-<short_uuid>; sortable by start time, unique by uuid
parent_run_id optional; set when this run resumes a crashed predecessor
instance_id from kernel config
binary_hash hash of the trader binary
schema_version bumps when entry payload schema changes
crate_versions hash of Cargo.lock or equivalent crate version manifest
feature_flags active Cargo features
adapter_versions per-adapter version stamp
config_hash hash of the kernel config
registered_components actor/strategy/algorithm ids, config hashes, endpoint bindings
seed if running under a seeded mode
start_ts_init first ts_init in the run
end_ts_init last ts_init in the run, or null if crashed
high_watermark largest seq durably acknowledged at end of run
status "running" | "ended" | "crashed-recovered" | "quarantined"
RunStarted is the first entry of every run and must be durably committed before any
state-affecting entry. RunEnded is the last entry on graceful shutdown. On boot, the kernel
scans <instance_id>/ for any run whose status is running and that lacks a RunEnded entry,
seals it as crashed-recovered (or quarantines it for inspection), and starts a new run that
records the crashed predecessor's run_id as its parent_run_id.
seq, or lookup by intent_id /
client_order_id / venue_order_id. Loads in seconds for a day's run; no data catalog needed.ts_init,
filtered by the strategy or agent's Subscribe* activations from the captured stream.ts_init plus instrument identity. Tolerates clock skew within deterministic
simulation scope; outside that scope, ordering is causally close but not bit-exact for
adapter-originated traffic.A boot kernel flag --replay-from <run-id> skips venue reconciliation against the live venue,
restores the snapshot anchored to the run's high-watermark, replays the tail in seq order, and
exits or freezes for inspection. Reconciliation behavior during replay is reconstructed from the
captured raw venue reports; replay never queries the live venue. Live restart, when it adopts
event-store recovery, deduplicates by entry id when catching up past the run tail against the
live venue.
Crash recovery composes four primitives, addressed at four crash classes:
Live catch-up past the run tail (when the new run resumes trading) deduplicates against the captured stream by entry id and venue identifiers.
Backend recovery time is asymmetric in the storage-backend benchmark. After graceful shutdown,
the redb backend reopened in single-digit milliseconds for files up to one-month scope;
copy-on-write with header validation skips log replay entirely. After a crash with an in-flight
transaction, redb walked the file at roughly NVMe-bandwidth (around 700 MB/s on the benchmark
host) to discard aborted writes and rebuild allocator state. Per-run files at one-month scope
(~18 GiB) reopened in tens of seconds. Per-run-file rotation is therefore the scaling unit:
long-lived runs that grow indefinitely would breach the restart target and break the audit
posture.
The store's determinism guarantee depends on what is paired with it:
seq order. Forensics, audit, and
decision-level replay are exact replays of the captured stream.(seed, binary_hash, config_hash, schema_version, log): bit-identical across the in-scope
crates. Adapter network I/O is out of scope for bit-identicality; replay there is causally
close, same logical sequence.ts_publish bounds drift but is not the ordering authority; seq is.seq is the replay-order authority, assigned by the writer at commit time. ts_init is a domain
timestamp, strictly monotonic and unique system-wide via AtomicTime's AcqRel
compare-and-exchange. ts_publish, when populated, records the bus-accepted time; the writer
stamps it on receive, and the bus stamps it before fanout when cross-subscriber ordering matters.
Neither is used to order replay.
Idempotency under replay composes four primitives:
seq; seq never points at a
different entry once committed.seq advance.seq=N+1 without having seen seq=N detects a
silent skip rather than missing it.client_order_id and venue_order_id
before applying.The four primitives compose to make double-apply and silent-skip unrepresentable, not merely unlikely.
Cache snapshots are owned by the cache, not by the event store. The event store stores only an
anchor: the high-watermark seq at the moment the snapshot was captured, plus a content-
addressed reference to the snapshot blob. The cache writes the blob to its own backing store;
the event store records the anchor inside the same transaction that advances the
high-watermark, so the anchor is durable iff the high-watermark is durable.
This is the blob-outside-store pattern. The alternative (snapshot-as-event) would couple snapshot retention to log retention and require the event store to understand snapshot blobs; we keep them separate.
Restore reads the anchor, fetches the blob from cache storage, validates its content hash
against the anchor, and replays log entries with seq > anchor. A blob whose hash mismatches
the anchor fails the restore and quarantines the run for inspection.
Entries with seq <= snapshot_anchor are reclaimable. Reclamation rotates segments (or
backend files) as a unit, never by individual entry. The retention contract trades audit and
forensics depth against storage cost; the operator picks the trade per deployment:
The default is full retention; bounded modes are configured per deployment. Reclamation never
touches a run whose status is running. Bounded retention must keep at least one
known-good prior sealed run as a complete restore point: its manifest, its snapshot anchor,
the external snapshot blob the anchor references, and the entry tail since that anchor.
A bare redb file alone is not a restore point. The supervisor falls back to this restore
point when the latest run quarantines on corruption, rather than entering a restart loop.
The default backend is redb: a pure-Rust ACID kv store with a
single-file B-tree and MVCC.
redb's single-writer many-reader model aligns with the dedicated writer thread. The
EventStore trait encapsulates the backend so backend swaps do not touch consumers, and the
crate does not expose redb-specific types in its public API. Durability is set explicitly:
every commit uses Durability::Immediate. The redb file is one per run, sized for retention
rotation as a unit.
Backend failure model. The storage-backend benchmark characterized redb's response to disk
pressure and physical corruption. Disk pressure (ENOSPC, RLIMIT_FSIZE) returns a typed
Io(FileTooLarge) error from Table::insert or WriteTransaction::commit; the aborted
batch's entries are not visible after reopen, prior commits are intact, and the high-
watermark has not advanced. The writer surfaces the error to the kernel halt path and
fail-stops. Header-region corruption is detected on open as a typed Storage(Corrupted)
error. Data-page corruption (truncation, mid-tree bit-flips) is not framewise-checksummed
by redb 4.x and surfaces as an assertion failure or unreachable-code panic, on open or on
first read. The release profile builds with panic = "abort", so in-process catch_unwind
does not contain these panics; the trader process aborts.
Two consequences:
zero-tail corruption opens cleanly and
panics on first read, so the verifier exercises a full integrity scan, not just the
file header.seq, ts_init, ts_publish, topic,
payload_type, payload, headers) computed at capture time and stored alongside the
entry. Readers, replay, export, and the verifier process recompute and check it; a
mismatch quarantines the run. Sidecar indices (intent_id -> seq,
client_order_id -> seq, venue_order_id -> seq) are rebuildable projections from the
seq -> entry table, not authoritative storage; the verifier rebuilds and cross-checks
them. Corruption that redb misses is caught by hash mismatch on the next read of the
affected entry, and the run never proceeds unaudited.Sealed runs may be exported to opt-in downstream sinks. An outage of any downstream sink never blocks trading.
Nautilus Agents records each agent decision
cycle as a DecisionEnvelope. The event store records the engine-side history that surrounds that
decision: the ordered world inputs, the command stream, and the generated state-affecting events
keyed by intent_id.
Together, the decision envelope and the event-store slice form one auditable transaction:
Agent-private records remain the agent framework's capture surface. The event store owns the trader-side durable history that makes those decisions reproducible against engine state.
The event store and DST close different halves of replay. The event store supplies the durable input history; DST supplies the execution environment that proves the same history produces the same engine behavior.
Together, a captured run and the DST seams form the replay contract:
position_id propagation.Under simulation (cfg(madsim)), a synchronous in-memory event store replaces the writer thread,
so tests assert against an authoritative in-process log without disk I/O or thread scheduling.
Production-shape integration tests outside cfg(madsim) exercise the on-disk writer.
NautilusTrader is an open-source, production-grade, Rust-native engine for multi-asset, multi-venue trading systems.
The system spans research, deterministic simulation, and live execution within a single event-driven architecture, providing research-to-live semantic parity.
See the docs for more detailed usage.
The source code for NautilusTrader is available on GitHub under the GNU Lesser General Public License v3.0.
NautilusTrader™ is developed and maintained by Nautech Systems, a technology company specializing in the development of high-performance trading systems. For more information, visit https://nautilustrader.io.
Use of this software is subject to the Disclaimer.