scripts/repro-dolt-hang/INCIDENT-REPORT.md
Date: 2026-02-23
Dolt Version: 1.82.2 (now upgraded to 1.82.4)
Platform: macOS Darwin 25.3.0, arm64
Reporter: Steve Yegge (multi-agent workspace / beads project)
A shared Dolt SQL server (PID 13360, port 3307) became completely unresponsive under concurrent load from ~20 AI coding agents. All queries timed out, causing a cascade failure that bricked the entire multi-agent workspace. Recovery required force-killing the Dolt server and all ~15 stuck bd/gt processes.
The workspace is a multi-agent environment where ~20 Claude Code agents run
concurrently, each issuing bd (beads CLI) commands that connect to a shared
Dolt SQL server.
config.yaml:
behavior:
  autocommit: false
listener:
  host: 127.0.0.1
  port: 3307
  # max_connections, back_log, max_connections_timeout_millis all at defaults
The shared server hosts ~15 databases (beads, hq, plus project-specific and test databases from automated test runs).
Each bd command is a separate Go process using go-sql-driver/mysql. The
transaction pattern was:
sqlTx, err := db.BeginTx(ctx, nil)
// ... INSERT/UPDATE operations ...
sqlTx.ExecContext(ctx, "CALL DOLT_COMMIT('-Am', ?, '--author', ?)", msg, author)
sqlTx.Commit() // ← REDUNDANT: DOLT_COMMIT already ends the transaction
Per Tim Sehn's guidance (2026-02-22), DOLT_COMMIT() implicitly commits the SQL
transaction, making the explicit tx.Commit() redundant and adding "raciness."
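A minimal sketch of the pattern without the explicit commit, for illustration only. It is not the beads implementation: `txLike`, `writeAndDoltCommit`, `recordingTx`, `demoCalls`, and the `issues` table are hypothetical names. One assumption is worth flagging: `database/sql` keeps a connection reserved for a `*sql.Tx` until `Commit` or `Rollback` is called, so the sketch defers `Rollback` to release it, assuming a rollback after a successful `DOLT_COMMIT` has nothing left to undo.

```go
package main

import (
	"context"
	"database/sql"
	"fmt"
)

// txLike is the subset of *sql.Tx this sketch needs; *sql.Tx satisfies it.
type txLike interface {
	ExecContext(ctx context.Context, query string, args ...any) (sql.Result, error)
	Rollback() error
}

// writeAndDoltCommit runs one write statement and ends the transaction with
// DOLT_COMMIT only. It never calls tx.Commit().
func writeAndDoltCommit(ctx context.Context, tx txLike, stmt, msg, author string) error {
	// Assumption: after a successful DOLT_COMMIT this rollback has nothing to
	// undo. It is deferred so database/sql returns the connection to the pool.
	defer tx.Rollback()

	if _, err := tx.ExecContext(ctx, stmt); err != nil {
		return fmt.Errorf("write: %w", err)
	}
	if _, err := tx.ExecContext(ctx, "CALL DOLT_COMMIT('-Am', ?, '--author', ?)", msg, author); err != nil {
		return fmt.Errorf("dolt commit: %w", err)
	}
	return nil
}

// recordingTx is a hypothetical stand-in that records statements, so the
// sketch runs without a Dolt server.
type recordingTx struct{ calls []string }

func (r *recordingTx) ExecContext(_ context.Context, q string, _ ...any) (sql.Result, error) {
	r.calls = append(r.calls, q)
	return nil, nil
}

func (r *recordingTx) Rollback() error {
	r.calls = append(r.calls, "ROLLBACK")
	return nil
}

// demoCalls runs the pattern against recordingTx and returns what was sent.
func demoCalls() ([]string, error) {
	tx := &recordingTx{}
	err := writeAndDoltCommit(context.Background(), tx,
		"INSERT INTO issues (title) VALUES ('example')", "add issue", "agent <agent@example.com>")
	return tx.calls, err
}

func main() {
	calls, err := demoCalls()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	for _, c := range calls {
		fmt.Println(c)
	}
}
```

Against a real server, the `tx` argument would come from `db.BeginTx(ctx, nil)`. Whether the trailing rollback is fully harmless on Dolt is something to confirm with the Dolt team rather than take from this sketch.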
db.SetMaxOpenConns(10)
db.SetMaxIdleConns(5)
db.SetConnMaxLifetime(5 * time.Minute)
No query-level timeouts: the root context has no deadline.
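One way to bound each operation is to derive a per-call context with a deadline and pass it to the `database/sql` `*Context` methods. The sketch below shows only the context mechanics. `runWithDeadline`, `hungQuery`, and `timedOut` are hypothetical names, and `hungQuery` merely simulates a call that never returns on its own.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// runWithDeadline gives op its own deadline, derived from parent.
func runWithDeadline(parent context.Context, d time.Duration, op func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, d)
	defer cancel()
	return op(ctx)
}

// hungQuery is a hypothetical stand-in for a query against an unresponsive
// server: it returns only once its context is done.
func hungQuery(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// timedOut reports whether a simulated hung query is cut off after d.
func timedOut(d time.Duration) bool {
	err := runWithDeadline(context.Background(), d, hungQuery)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("hung query cut off by deadline:", timedOut(50*time.Millisecond))
}
```

In a real bd command, `op` would wrap the `ExecContext` or `QueryContext` calls. The right timeout value is a judgment call that this sketch does not try to settle.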
- bd commands (create, update, list, close)
- tx.Commit() after DOLT_COMMIT (per Tim's guidance)
- scripts/repro-dolt-hang/main.go: could not reproduce the hang with 50 concurrent workers doing 1000 ops against a single database

The repro fires N goroutines, each doing BEGIN → INSERT → DOLT_COMMIT in a loop, with a watchdog monitoring server responsiveness. It was tested with up to 50 workers / 1000 ops using both the old (with tx.Commit) and new (without) patterns on Dolt 1.82.4:
[old] 1000/1000 success (100.0%), max latency 312ms, 0 unresponsive events
[new] 1000/1000 success (100.0%), max latency 321ms, 0 unresponsive events
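The worker-plus-watchdog shape described above might look roughly like the sketch below. It is not the contents of `scripts/repro-dolt-hang/main.go`: `runLoad`, `stats`, and `fakeOp` are hypothetical names, and `fakeOp` stands in for the real BEGIN → INSERT → DOLT_COMMIT sequence so the sketch runs without a server.

```go
package main

import (
	"context"
	"fmt"
	"sync"
	"time"
)

// stats summarizes one load run.
type stats struct {
	ok, failed, unresponsive int
	maxLatency               time.Duration
}

// runLoad spreads ops operations across workers goroutines while a watchdog
// calls ping every interval and counts pings that fail or exceed pingTimeout.
func runLoad(workers, ops int, op, ping func(context.Context) error, interval, pingTimeout time.Duration) stats {
	var (
		mu sync.Mutex
		st stats
	)
	jobs := make(chan struct{}, ops)
	for i := 0; i < ops; i++ {
		jobs <- struct{}{}
	}
	close(jobs)

	ctx, stop := context.WithCancel(context.Background())
	defer stop()

	// Watchdog: probes the server independently of the workers.
	var wd sync.WaitGroup
	wd.Add(1)
	go func() {
		defer wd.Done()
		tick := time.NewTicker(interval)
		defer tick.Stop()
		for {
			select {
			case <-ctx.Done():
				return
			case <-tick.C:
				pctx, cancel := context.WithTimeout(ctx, pingTimeout)
				err := ping(pctx)
				cancel()
				// Ignore errors caused by our own shutdown.
				if err != nil && ctx.Err() == nil {
					mu.Lock()
					st.unresponsive++
					mu.Unlock()
				}
			}
		}
	}()

	// Workers: each pulls jobs until the channel is drained.
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				start := time.Now()
				err := op(ctx)
				lat := time.Since(start)
				mu.Lock()
				if err != nil {
					st.failed++
				} else {
					st.ok++
				}
				if lat > st.maxLatency {
					st.maxLatency = lat
				}
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	stop()
	wd.Wait()
	return st
}

// fakeOp is a hypothetical stand-in for BEGIN -> INSERT -> DOLT_COMMIT.
func fakeOp(_ context.Context) error { return nil }

func main() {
	st := runLoad(50, 1000, fakeOp, fakeOp, time.Millisecond, time.Second)
	fmt.Printf("%d/%d success, %d failed, %d unresponsive events\n", st.ok, 1000, st.failed, st.unresponsive)
}
```

Because the watchdog runs on its own goroutine and its own deadline, it can still report an unresponsive server when every worker is blocked.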
The simple repro doesn't trigger the hang. Suspected additional factors in production:
- autocommit: false in server config
- tx.Commit() after DOLT_COMMIT adding raciness

Is this the bug Tim mentioned fixing? ("We fixed the bug you ran into",
email 2026-02-21.) The 1.82.3 and 1.82.4 changelogs don't show a concurrency
fix. Was the fix in an earlier release, or is the "fix" the guidance to drop
the explicit tx.Commit() after DOLT_COMMIT?
Can the redundant tx.Commit() after DOLT_COMMIT cause a server hang?
Tim said it "adds raciness" — could that raciness escalate to full server
unresponsiveness under high concurrent load?
Should we configure max_connections / back_log / max_connections_timeout_millis
explicitly? Currently all at defaults. With 20 agents creating separate
connection pools (MaxOpenConns=10 each), we could hit 200 connections.
Multiple databases on one server — is there any known issue with lock contention across databases on the same Dolt server? We have ~15 databases including leftover test databases.
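The connection arithmetic behind the max_connections question can be written out as a small sketch. `worstCaseConns` and `perProcessCap` are hypothetical helpers, and the server budget of 100 is an illustrative number only, not Dolt's actual default.

```go
package main

import "fmt"

// worstCaseConns is the upper bound on simultaneous connections when every
// process fills its own pool.
func worstCaseConns(processes, maxOpenPerProcess int) int {
	return processes * maxOpenPerProcess
}

// perProcessCap returns the largest per-process pool size that keeps the
// worst case within serverBudget, never going below 1.
func perProcessCap(serverBudget, processes int) int {
	if processes <= 0 {
		return serverBudget
	}
	c := serverBudget / processes
	if c < 1 {
		return 1
	}
	return c
}

func main() {
	// 20 agents, each with MaxOpenConns=10, as in the report.
	fmt.Println("worst case:", worstCaseConns(20, 10))
	// Hypothetical server budget of 100 connections shared by 20 processes.
	fmt.Println("per-process cap:", perProcessCap(100, 20))
}
```

Since each bd command is a separate process, the number of live pools may differ from the number of agents. The process count passed in should reflect whatever is actually running at peak.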
cd beads/
go run ./scripts/repro-dolt-hang 50 20 both