design-bucket-config-serialization.md
Issue #9651 — concurrent PutBucketVersioning + PutBucketEncryption (as Terraform
issues them in parallel) intermittently lose the encryption write.
The bucket's entire config lives in one filer entry, /buckets/<name>. Every
config API does a read-modify-write of that single entry, and the writes are not
serialized:
updateBucketConfig(bucket, fn) (s3api_bucket_config.go:468) — sources from a
possibly-stale cached BucketConfig, mutates Entry.Extended, writes the
whole entry. Used by: versioning, object-lock config, lifecycle, ACL/owner.UpdateBucketMetadata → setBucketMetadata (:1042) — reads a fresh entry,
mutates Entry.Content, writes the whole entry. Used by: encryption, CORS,
tagging, ownership, policy, notification.Two ingredients produce the lost update:
updateBucketConfig
rebuilds from a stale cached BucketConfig whose Content predates the
concurrent encryption write, so writing the whole entry reverts Content.Sequential calls always pass (each sees the previous write), so it only surfaces under concurrency — and CI's slower IO widens the window (the "2 of ~12 runs").
WriteCondition,
ObjectTransaction); do not reintroduce a distributed lock.Both updateBucketConfig and UpdateBucketMetadata must run their RMW under one
per-bucket critical section, and re-read the entry fresh from the filer inside
it — not rebuild from the cached BucketConfig. The lock alone is insufficient:
without the fresh read, two serialized writers still each apply a stale snapshot.
The two writers touch disjoint fields (Extended[versioning] vs Content). If
each path updated only its own field instead of rewriting the whole entry, neither
could clobber the other regardless of ordering. This is the structural fix and
makes serialization a defense-in-depth concern rather than a correctness
requirement for cross-field cases.
The bucket entry is a single filer entry, so unlike object writes there is no sharding — the question is purely the scope of the lock:
| Layer | Serializes across | Cost | Notes |
|---|---|---|---|
| 1. Gateway-local per-bucket lock | one gateway process | tiny | fixes the reported (single-gateway/CI) case |
| 2. Filer per-path lock via conditional write | all gateways on one filer | small | reuses #9640 CreateEntry+WriteCondition |
| 3. Route-by-key to bucket-key owner filer | all gateways and filers | medium | same mechanism as the object DLM-removal |
Add a bounded per-bucket lock table to S3ApiServer, reusing the same
util.LockTable the filer uses for its per-path lock:
// in S3ApiServer
bucketConfigLocks *util.LockTable[string] // serialize bucket-entry RMW
func (s3a *S3ApiServer) withBucketConfigLock(bucket string, fn func() s3err.ErrorCode) s3err.ErrorCode {
lk := s3a.bucketConfigLocks.AcquireLock("bucketConfig", bucket, util.ExclusiveLock)
defer s3a.bucketConfigLocks.ReleaseLock(bucket, lk)
return fn()
}
Wrap the RMW in both chokepoints, and inside the lock read the entry fresh:
updateBucketConfig: acquire the lock; re-read /buckets/<name> from the filer
(not the cache); rebuild BucketConfig from that fresh entry; apply fn; write;
invalidate cache; release.UpdateBucketMetadata/setBucketMetadata: same lock key; it already reads fresh,
so it just needs to share the critical section.Both must use the same lock keyed on bucket, so versioning and encryption
contend on one mutex. This closes the reported window. Limitation: only one
gateway; two gateways behind a load balancer still race.
Test: parallel PutBucketVersioning + PutBucketEncryption, assert both persist
(the exact Terraform scenario), plus an N-way parallel variant over distinct
fields.
Move the writers off whole-entry rewrites:
ObjectTransaction PATCH_EXTENDED on /buckets/<name>. The owner filer reads
the entry fresh under its per-path lock and merges only the named keys, so the
gateway never sends a whole-entry snapshot — this dissolves both ingredients for
these fields.Content-based config (encryption, CORS, tags blob) — chosen and
implemented (b3): extend PATCH_EXTENDED with set_content. Under the same
per-path lock the filer reads the entry fresh, merges extended attributes, and
replaces Content, preserving the rest. So a content write becomes a field-level
patch too — setBucketMetadata patches Content, updateBucketConfig patches
extended keys, and the two serialize on the lock instead of racing whole-entry
rewrites. This is cleaner than the alternatives below: no client-side retry, no
storage migration, and it reuses ObjectTransaction's existing atomic lock.
CreateEntry overwrite with IF_ETAG_MATCH + retry
(#9640): correct but needs client-side retry, and the bucket directory entry has
no reliable ETag to compare on.Content blob
into its own Extended key. Then even intra-blob writes (tags vs encryption)
stop racing. Larger migration; tracked separately.Once all paths are field-level patches, the phase-1 gateway lock is unnecessary — the filer enforces atomicity. (This is the path taken: phase 1 was skipped.)
If multiple filers can write /buckets/<name> concurrently, a filer-local per-path
lock no longer suffices. Route bucket-config writes to
PrimaryForKey("/buckets/<name>") (the lock-ring view) and serialize on that one
owner filer — the same route-by-key design used to take object writes off the DLM.
Overkill for rare config writes; include only if multi-filer bucket writes are real.
PATCH_EXTENDED is atomic field-level merge at the filer (no snapshot);
CAS turns a concurrent Content write into a retry, enforced under the filer's
per-path lock — correct for any number of gateways sharing a filer.All of these funnel through the two chokepoints, so fixing the chokepoints covers them — but the fix must not leave any of them on an unserialized path:
updateBucketConfig: versioning, object-lock config, lifecycle, ACL/owner.UpdateBucketMetadata/setBucketMetadata: encryption, CORS, tagging,
ownership controls, bucket policy, notification.CreateEntry/DeleteEntry of /buckets/<name>) already
go through the filer's per-path lock on CreateEntry; ensure they take the same
bucket lock if they also patch config.Under the lock, read the entry from the filer, never rebuild from the cached
BucketConfig. The cache is for reads; it must be invalidated on every write and
never be the source for an RMW. This is the single most important detail — the lock
without the fresh read does not fix the bug.