docs/issues/multi-stmt-tx-transiency.md
ErrCannotUpdateKeyTransiency on UPDATE-after-INSERT of the same rowResolved in the follow-up commit that changes
OngoingTx.transientEntrieskeyRef allocation to a negative-integer space so it can't collide withtx.entries[]slice indices. See the "Root cause — confirmed" and "Fix" sections below.
High — blocks any ORM that wraps INSERT + UPDATE of the same row in a single transaction, which is the norm. Hit in the Gitea compat exercise on /api/v1/user/repos where Gitea's WatchRepo runs:
BEGIN
INSERT INTO repository (...) RETURNING id
INSERT INTO watch (user_id, repo_id, mode, ...) RETURNING id
UPDATE repository SET num_watches = num_watches + 1 WHERE id = ?
COMMIT
Surfaced to the client as:
pq: cannot change a non-transient key to transient or vice versa
3 lines against a freshly opened Postgres-wire session:
BEGIN;
INSERT INTO q (id, n) VALUES (1, 0);
UPDATE q SET n = n + 1 WHERE id = 1; -- fails here
COMMIT;
Table schema is minimal — one primary-key column + one data column. No AUTO_INCREMENT, no secondary indexes, no defaults, no NOT NULL, no CHECK. Autocommit variant (each statement its own tx) passes, so the bug is specific to multi-statement single-tx sequences.
Driver-independent: reproduced with lib/pq and pgx/v5; also reproduced in-process when wrapping the three statements with engine.NewTx (produces already closed instead of the transiency message, but same root tx state corruption).
embedded/store/ongoing_tx.go:271-273:
_, wasTransient := tx.transientEntries[keyRef]
if isKeyUpdate {
if wasTransient != isTransient {
return ErrCannotUpdateKeyTransiency
}
}
OngoingTx.entriesByKey[kid] → keyRef stores keyRefs from two overlapping integer spaces:
len(tx.entries)-1 — a slice index into tx.entries[], starting at 0 and growing monotonically.SetTransient path and the indexer-walk path at ongoing_tx.go:336-341) used len(tx.entriesByKey) — a counter meant as an opaque map key into tx.transientEntries[], also starting at 0 and growing.Both counters share the int space starting at 0. When the SQL engine's primary-row indexer runs the walk during an INSERT, it stashes a transient entry for the mapped PK key at keyRef=0. The outer non-transient main write for the row key then lands at keyRef=0 too (as len(tx.entries)-1).
On a subsequent write to the main row key (the UPDATE), wasTransient := tx.transientEntries[keyRef] looks up transientEntries[0] and finds the stale indexer-walk entry — concluding (incorrectly) that the main key was previously written transient. wasTransient != isTransient and the check at line 271-273 fires.
The refInterceptor reader at line 224-225 had the same ambiguity but was harder to trip because in the Gitea repro no one actually reads the mapped key through it before the next write tips over the transience check.
Allocate transient keyRefs in a negative integer space disjoint from the non-negative tx.entries[] slice indices:
tkey := -(len(tx.transientEntries) + 1)
tx.transientEntries[tkey] = e
tx.entriesByKey[kid] = tkey
Applied at the two insertion sites (ongoing_tx.go:339-340 for the indexer-walk path and 353-354 for the outer SetTransient path). Existing lookups (tx.transientEntries[keyRef] at lines 224 and 269) and the update sites (lines 337 and 347) are unchanged — they correctly read back whatever integer is in entriesByKey regardless of sign.
0 ≤ keyRef < len(tx.entries) — always non-negative.tx.transientEntries is a map[int]*EntrySpec — it indexes by int, so negative integer keys are valid map keys with no performance impact.wasTransient := tx.transientEntries[keyRef] check now returns true iff the keyRef points to a transient entry for this specific key — which is what the invariant always meant to enforce.refInterceptor at line 224-225 (entrySpec, transient := tx.transientEntries[keyRef]) reads the correct entry regardless of sign.transientEntries is a per-tx in-memory map never persisted as indexed keys.embedded/store/... full test suite green (27s). embedded/sql/... full test suite green (36s). No new regressions across embedded/... other than the pre-existing embedded/document/TestGetDocument… failures that reproduce on HEAD before the fix (max key length exceeded, unrelated).OngoingTx.set walks every registered indexer after accepting a write. For the SQL engine's primary-row indexer (SourcePrefix=rowEntryPrefix, no SourceEntryMapper, TargetEntryMapper rewrites RowPrefix+… into MappedPrefix+…), a non-transient tx.set(rowKey) triggers the block at ongoing_tx.go:331-340:
if !bytes.Equal(key, targetKey) {
kid := sha256.Sum256(targetKey)
keyRef, isKeyUpdate := tx.entriesByKey[kid]
if isKeyUpdate {
tx.transientEntries[keyRef] = e
} else {
tx.transientEntries[len(tx.entriesByKey)] = e
tx.entriesByKey[kid] = len(tx.entriesByKey)
...
}
}
So a user-visible non-transient write on rowKey causes an internal, transient entry to be stashed at targetKey = MappedPrefix + table.id + primaryIndex.id + pkEncVals. On a subsequent write affecting the same row, the flag consistency check on targetKey triggers against one of the earlier transient entries.
Interaction graph (to be confirmed):
tx.set(rowKey, non-transient) → indexer walk stashes targetKey as transient.tx.set(rowKey, non-transient) → indexer walk revisits targetKey. The transience check may see the transient entry from step 1 and compare it against what the current walk wants to write; one of the two paths tracked by entriesByKey/transientEntries flips and trips the invariant.OngoingTx.set needs closer trace to confirm which of these is the flipped key and whether the invariant is actually being violated or whether the check is being applied to the wrong pair of writes.
targetKey already exists.entriesByKey space so user writes and indexer-walk writes don't share hash buckets.Option 1 is least invasive but risks masking genuine user bugs. Option 2 is the most targeted. Option 3 is the cleanest architecturally but blast-radius touches every code path that reads from entriesByKey.
IsExplicitCloseRequired / treating each statement as auto-commit — the whole point of wrapping in a transaction is to not do that.With the same multi-statement pattern, these errors also surfaced in Gitea's dashboard and were fixed incrementally:
issueIDsFromSearch, pq: already closed — same tx state corruption symptom, fixed in ddde60eb by deferring the outer reader close in jointRowReader.Read.eventsource/manager_run.go: Unable to get UIDcounts: pq: syntax error: unexpected CASE at position 24 — still open. SUM(CASE WHEN … END) needs a new engine-level AggExp AST node that holds a ValueExp instead of a column, parallel to AggColSelector. embedded/sql/gitea_background_regression_test.go::TestSumCaseWhenAggregate carries a t.Skip(...) marker pointing at this follow-up. The error surface is background (Gitea's notification event poller); all user-visible pages render.Post-fix remaining background-only errors against Gitea 1.25.5:
workflows.go init.0.1: yaml unmarshal errors — Gitea's own bindata YAML init, unrelated to the DB layer.InsertRun, pq: values are not comparable — fires through the pgsql wire on INSERT INTO action_run … with 22 scalar params. A direct engine-level INSERT with the same schema + bound types passes (TestInsertActionRunShape), so the error is in the pgsql wire's type coercion during Bind, not in the SQL engine. Needs pgsql-wire-level bind tracing to root-cause. Safe to defer.Fixed when:
BEGIN/COMMIT.POST /api/v1/user/repos returns 201 and the repo is browseable via the UI.go test ./embedded/store/... ./embedded/sql/... full-suite green.