contrib/replay/README.md
⏪ ═══════════════════════ ⏩
★
/|
/ |
/ |
/ ✨|
/ |
/_____|
( o o )
\ > /
~~~~~~~~~~~
\ ~~~~ /
\ /
✨ | | ✨
◀──── | | ────▶
/ \
⌛ 🔮 ⌛
═══════════════════════
FDB REPLAY
Time Travel Wizard
An interactive Terminal User Interface (TUI) for "replaying" FoundationDB simulation trace files.
This tool is in an EXPERIMENTAL stage.
I built the tool I have always wanted for FDB. Tired of grepping trace files and repeating the same patterns again and again, I wanted to codify the techniques I use and the data I look at while debugging failures, making them fast and easily available without regressing on core grep/less functionality.
Ultimately, the goal is to make myself more productive and to learn FDB better. In the future, the tool may help others too, if they find it useful.
Prerequisites: Go 1.21+ must be installed on the system.
If Go is installed, replay is built automatically as part of the default FDB build. If Go is not installed, CMake will emit a warning during configuration and skip the replay target. The rest of the FDB build will proceed normally.
The replay binary is built by default when you build FDB:
cmake --build . # replay is included in the default build
cmake --build . --target replay # or build replay specifically
CMake handles everything automatically:
bin/replay

If you prefer to build directly with Go:
cd contrib/replay
go build -o replay .
replay [trace-file.xml] # Load specific trace file
replay # Auto-load latest trace*.xml in current directory
replay -h | --help # Show help
Tip: Create an alias r for quick access:
alias r=replay
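
The auto-load behaviour (`replay` with no argument picks the latest `trace*.xml` in the current directory) could be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the helper name `latestTrace` and the rule "latest means newest modification time" are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// latestTrace returns the trace*.xml file in dir with the newest
// modification time. Both the helper name and the "newest mtime" rule
// are assumptions made for this sketch.
func latestTrace(dir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(dir, "trace*.xml"))
	if err != nil {
		return "", err
	}
	var newest string
	var newestMod time.Time
	for _, path := range matches {
		info, err := os.Stat(path)
		if err != nil {
			continue // file vanished or is unreadable; skip it
		}
		if newest == "" || info.ModTime().After(newestMod) {
			newest, newestMod = path, info.ModTime()
		}
	}
	if newest == "" {
		return "", fmt.Errorf("no trace*.xml files found in %s", dir)
	}
	return newest, nil
}

func main() {
	path, err := latestTrace(".")
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("would load:", path)
}
```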
The main motivation is to be fast in debugging simulation issues, as well as understanding how FDB works.
FDB simulation produces XML trace files that can be gigabytes in size with millions of events. Debugging a test failure typically involves:
This tool aims to make all of that faster and more intuitive.
This tool is meant to be organic - it will grow and evolve over time. There will always be features to add, improvements to make, and new patterns to codify.
Feature Wishlist (always growing):
Opportunity for Richer Logging:
Building this tool has revealed opportunities to improve FDB's trace logging itself. For example, I realized that some trace events don't include the role ID. Including it is very useful, because it shows which role (e.g., ClusterController) emitted the trace event or, even better, which specific StorageServer (out of dozens in simulation) it came from.
The tool and the logging can evolve together - as we add features to the tool, we may discover places where richer trace events would help debugging.
The core idea is that you navigate the trace file and, wherever you are, the tool reflects the state of the cluster "as of" that time/line.
Everything is based on where you are in the timeline:
You can go back and forth in time, and the tool reacts to that - showing you the cluster state at that exact moment.
From experience debugging FDB issues, you have to do multiple rounds of "back and forth" like this to narrow down the root cause of a bug.
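
One way to picture the "as of" lookup is a binary search over events sorted by time, which is the kind of time-based lookup the architecture diagram attributes to `trace.go`. The sketch below is illustrative only: the `Event` type and the `indexAtTime` helper are hypothetical names, and the real tool may resolve the current position differently.

```go
package main

import (
	"fmt"
	"sort"
)

// Event is a stripped-down stand-in for a trace event (hypothetical).
type Event struct {
	Time float64
	Type string
}

// indexAtTime returns the index of the last event with Time <= t,
// or -1 if every event is later than t. events must be sorted by Time.
func indexAtTime(events []Event, t float64) int {
	// sort.Search finds the first index whose Time is strictly after t.
	i := sort.Search(len(events), func(i int) bool {
		return events[i].Time > t
	})
	return i - 1
}

func main() {
	events := []Event{
		{Time: 0.0, Type: "ProgramStart"},
		{Time: 1.5, Type: "MasterRecoveryState"},
		{Time: 1.5, Type: "TLogStart"},
		{Time: 4.2, Type: "Role"},
	}
	// Cluster state "as of" t=2.0 is built from events[0..2].
	fmt.Println(indexAtTime(events, 2.0)) // prints 2
}
```

Everything up to and including the returned index is what a state reconstruction would consume for that moment. Moving backward in time just means picking a smaller index and rebuilding from the prefix.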
| Key | Action | Description |
|---|---|---|
| `Ctrl+N` | Next event | Jump to the next visible event (respects filters) |
| `Ctrl+P` | Previous event | Jump to the previous visible event |
| `Ctrl+V` | Page forward | Jump ~1 second forward in time |
| `Alt+V` | Page backward | Jump ~1 second backward in time |
| `g` | Go to start | Jump to the first visible event |
| `G` / `Shift+G` | Go to end | Jump to the last visible event |
| `t` | Time jump | Open popup to enter a specific time in seconds |

| Key | Action | Description |
|---|---|---|
| `/` | Search forward | Enter search pattern (supports `*` wildcard) |
| `?` | Search backward | Enter search pattern (supports `*` wildcard) |
| `n` | Next match | Go to next match in original search direction |
| `N` / `Shift+N` | Previous match | Go to match in opposite direction |
| `Esc` | Clear search | Clear search highlighting |

Search patterns support wildcards: `*Recovery*`, `Type=Master*`, `Machine=2.0.1.*`
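
A `*` wildcard like the one in these patterns can be implemented by translating the pattern into a regular expression. The sketch below assumes that `*` matches any run of characters, that every other character is literal, and that the pattern may match anywhere in a `Key=Value` rendering of the event. The tool's actual matching rules may differ, and `compileWildcard` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// compileWildcard converts a pattern in which '*' matches any run of
// characters into a regexp. All other characters are treated literally.
// The result is unanchored, so it matches substrings (an assumption).
func compileWildcard(pattern string) (*regexp.Regexp, error) {
	parts := strings.Split(pattern, "*")
	for i, p := range parts {
		parts[i] = regexp.QuoteMeta(p) // escape '.', '=', etc.
	}
	return regexp.Compile(strings.Join(parts, ".*"))
}

func main() {
	// Hypothetical flattened rendering of one trace event.
	line := "Type=MasterRecoveryState Machine=2.0.1.0:1 Severity=10"
	patterns := []string{"*Recovery*", "Type=Master*", "Machine=2.0.1.*", "Type=TLog*"}
	for _, p := range patterns {
		re, err := compileWildcard(p)
		if err != nil {
			fmt.Println("bad pattern:", err)
			continue
		}
		fmt.Printf("%-16s %v\n", p, re.MatchString(line))
	}
}
```

Escaping with `regexp.QuoteMeta` matters for patterns such as `Machine=2.0.1.*`: without it, each `.` in the address would match any character rather than a literal dot.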
| Key | Action | Description |
|---|---|---|
| `r` | Next recovery start | Jump to next MasterRecoveryState with StatusCode=0 |
| `R` / `Shift+R` | Previous recovery start | Jump to previous recovery start |
| `e` | Next recovery event | Jump to any next MasterRecoveryState event |
| `E` / `Shift+E` | Previous recovery event | Jump to any previous recovery event |

Recovery states are color-coded:
| Key | Action | Description |
|---|---|---|
| `3` | Next warning | Jump to next Severity=30 event |
| `#` / `Shift+3` | Previous warning | Jump to previous warning |
| `4` | Next error | Jump to next Severity=40 event |
| `$` / `Shift+4` | Previous error | Jump to previous error |

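
The recovery and severity jumps above share one shape: scan forward or backward from the current position for the first event satisfying a predicate. A minimal sketch of that idea follows. The `Event` type, the helper names, and the "stay in place when nothing matches" behaviour are all assumptions for illustration; only the predicates themselves (MasterRecoveryState with StatusCode=0, Severity=30, Severity=40) come from the tables.

```go
package main

import "fmt"

// Event is a simplified, hypothetical stand-in for a parsed trace event.
type Event struct {
	Type     string
	Severity int
	Attrs    map[string]string
}

// isRecoveryStart matches MasterRecoveryState events with StatusCode=0.
func isRecoveryStart(e Event) bool {
	return e.Type == "MasterRecoveryState" && e.Attrs["StatusCode"] == "0"
}

// hasSeverity builds a predicate for an exact severity (30 = warning, 40 = error).
func hasSeverity(level int) func(Event) bool {
	return func(e Event) bool { return e.Severity == level }
}

// next returns the index of the first match after cur, or cur if none.
func next(events []Event, cur int, pred func(Event) bool) int {
	for i := cur + 1; i < len(events); i++ {
		if pred(events[i]) {
			return i
		}
	}
	return cur
}

// prev returns the index of the closest match before cur, or cur if none.
func prev(events []Event, cur int, pred func(Event) bool) int {
	for i := cur - 1; i >= 0; i-- {
		if pred(events[i]) {
			return i
		}
	}
	return cur
}

func main() {
	// Event types other than MasterRecoveryState are made-up placeholders.
	events := []Event{
		{Type: "ProgramStart", Severity: 10},
		{Type: "MasterRecoveryState", Severity: 10, Attrs: map[string]string{"StatusCode": "0"}},
		{Type: "ExampleWarning", Severity: 30},
		{Type: "MasterRecoveryState", Severity: 10, Attrs: map[string]string{"StatusCode": "14"}},
		{Type: "ExampleError", Severity: 40},
	}
	fmt.Println(next(events, 0, isRecoveryStart)) // prints 1
	fmt.Println(next(events, 0, hasSeverity(40))) // prints 4
	fmt.Println(prev(events, 4, hasSeverity(30))) // prints 2
}
```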
Press f to open the filter configuration popup. Filters allow you to focus on specific events.
Filter Categories:
Raw Filters (pattern matching)
- `Type=MasterRecoveryState`, `Role*TLog`, `Severity=40`
- `a` to add, `e` to edit, `r` to remove, `d` to disable/enable
- `t` to search and add by Type name (fuzzy search)
- `c` to toggle "common" event types (pre-defined important events)

Machine Filters
- `Enter` to open machine selection popup

Time Range Filter
- `d` to toggle, `Enter` to configure start/end times

Message Filter
- NetworkMessageSent events
- `d` to toggle

Filter Logic:
- `Space` to disable/enable all filtering

| Key | Action | Description |
|---|---|---|
| `c` | Config view | Show full DB configuration JSON (scrollable) |
| `x` | Health view | Show network latencies, degraded peers, connections |
| `h` | Help | Show all keyboard shortcuts |
| `f` | Filter | Configure event filters |

Config View (`c`):
- `Ctrl+N`/`Ctrl+P`

Health View (`x`):

The left pane shows the cluster topology at the current time:
- `TLog [abc123] (e=5)`
- `->` arrow

NetworkMessageSent Visualization:
- `-->` arrow
- `<--` arrow

The bottom of the screen shows:
| Key | Action |
|---|---|
| `q` / `Q` / `Ctrl+C` | Quit |

+------------------+
| trace*.xml |
| (FDB sim trace) |
+--------+---------+
|
v
+-----------------------------------------------------------------------------------+
| main.go |
| - CLI argument parsing |
| - Auto-find latest trace*.xml if not specified |
| - Load and parse trace file |
| - Launch TUI |
+-----------------------------------------------------------------------------------+
|
v
+-----------------------------------------------------------------------------------+
| trace.go |
| - XML parsing (streaming decoder) |
| - TraceEvent struct (Time, Type, Machine, ID, Severity, Attrs) |
| - DBConfig parsing from MasterRecoveryState events |
| - RecoveryState tracking |
| - EpochVersionInfo tracking (from GetDurableResult, UpdateRegistration) |
| - Binary search for time-based lookups |
+-----------------------------------------------------------------------------------+
|
v
+-----------------------------------------------------------------------------------+
| cluster.go |
| - Worker/RoleInfo structs |
| - BuildClusterState() - reconstructs topology from Role events |
| - Address parsing (DC extraction, main vs tester) |
| - Epoch tracking from TLogStart, LogRouterStart, BackupWorkerStart |
+-----------------------------------------------------------------------------------+
|
v
+-----------------------------------------------------------------------------------+
| ui.go |
| - Bubbletea TUI framework (Model-Update-View pattern) |
| - Split-pane layout: Topology (left) | Events (right) |
| - Popup overlays: Help, Config, Health, Filter, Time Jump |
| - Navigation handlers (Ctrl+N/P, g/G, r/R, e/E, etc.) |
| - Search with wildcard support |
| - Filter system (Raw, Machine, Time, Message) |
| - NetworkMessageSent visualization |
+-----------------------------------------------------------------------------------+
|
v
+------------------+
| Terminal |
| (user sees) |
+------------------+
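
To make the `trace.go` stage of the diagram concrete, here is a rough sketch of streaming XML parsing into the listed `TraceEvent` fields (Time, Type, Machine, ID, Severity, Attrs) using Go's `encoding/xml` token decoder. It assumes each event is an `<Event .../>` element whose data lives in XML attributes. Field types, function names, and the sample input are illustrative guesses, not the tool's actual code.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"io"
	"strconv"
	"strings"
)

// TraceEvent mirrors the fields named in the diagram; types are assumed.
type TraceEvent struct {
	Time     float64
	Type     string
	Machine  string
	ID       string
	Severity int
	Attrs    map[string]string // every attribute not promoted to a field
}

// parseTrace reads <Event> elements one token at a time, so the XML is
// never held in memory as a whole document tree.
func parseTrace(r io.Reader) ([]TraceEvent, error) {
	dec := xml.NewDecoder(r)
	var events []TraceEvent
	for {
		tok, err := dec.Token()
		if err == io.EOF {
			return events, nil
		}
		if err != nil {
			// Return what was parsed so far, e.g. for a truncated file.
			return events, err
		}
		start, ok := tok.(xml.StartElement)
		if !ok || start.Name.Local != "Event" {
			continue
		}
		ev := TraceEvent{Attrs: map[string]string{}}
		for _, a := range start.Attr {
			switch a.Name.Local {
			case "Time":
				ev.Time, _ = strconv.ParseFloat(a.Value, 64)
			case "Type":
				ev.Type = a.Value
			case "Machine":
				ev.Machine = a.Value
			case "ID":
				ev.ID = a.Value
			case "Severity":
				ev.Severity, _ = strconv.Atoi(a.Value)
			default:
				ev.Attrs[a.Name.Local] = a.Value
			}
		}
		events = append(events, ev)
	}
}

// parseTraceString is a convenience wrapper for in-memory input.
func parseTraceString(s string) ([]TraceEvent, error) {
	return parseTrace(strings.NewReader(s))
}

// sample is made-up input in the assumed attribute-per-field layout.
const sample = `<?xml version="1.0"?>
<Trace>
<Event Severity="10" Time="0.000000" Type="ProgramStart" Machine="0.0.0.0:0" ID="0000000000000000" />
<Event Severity="10" Time="1.250000" Type="MasterRecoveryState" Machine="2.0.1.0:1" ID="abc123" StatusCode="0" />
</Trace>`

func main() {
	events, err := parseTraceString(sample)
	if err != nil {
		fmt.Println("parse error:", err)
	}
	for _, ev := range events {
		fmt.Printf("%.2f %s %s %v\n", ev.Time, ev.Type, ev.Machine, ev.Attrs)
	}
}
```

Returning the partial slice together with the error is a deliberate choice in this sketch: a trace that ends without its closing tag still yields every event read before the break.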
main.go (~120 lines)
- `trace*.xml` if no argument provided

trace.go (~515 lines)

cluster.go (~265 lines)

ui.go (~4600 lines)

Go dependencies are managed automatically by the build system. For reference: