.agents/skills/hybrid-cloud-outboxes/references/debugging.md
Understanding the pipeline helps locate where things break:
1. An outbox is saved inside `outbox_context(transaction.atomic(...))`.
2. With `flush=True`, `drain_shard()` runs synchronously for that shard.
3. `enqueue_outbox_jobs` (cell) / `enqueue_outbox_jobs_control` (control) runs on a cron schedule.
4. `schedule_batch` partitions the ID range into `CONCURRENCY=5` chunks and spawns `drain_outbox_shards` tasks.
5. `drain_outbox_shards` calls `process_outbox_batch`, which calls:
   - `find_scheduled_shards(lo, hi)` to find shards with `scheduled_for <= now`
   - `prepare_next_from_shard(shard)` to lock the first message and bump backoff
   - `shard_outbox.drain_shard(flush_all=True)` to process the shard
6. `drain_shard` loops: `process_shard()` (lock) -> `process()` -> `process_coalesced()` -> `send_signal()`.
7. Receivers are selected by `OutboxCategory`, executing the handler logic (RPC calls, tombstones, etc.).

When processing fails, `prepare_next_from_shard` bumps `scheduled_for` using exponential backoff:
Attempt 1: now + 2 * initial_delay (initial delay ~seconds)
Attempt 2: now + 4 * initial_delay
Attempt 3: now + 8 * initial_delay
...
Maximum: 1 hour between retries
The backoff is computed as:
def next_schedule(self, now):
    return now + min((self.last_delay() * 2), datetime.timedelta(hours=1))
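
Where last_delay() is scheduled_for - scheduled_from, i.e. the delay that was applied on the previous attempt. Because each failure doubles that value, the delay grows geometrically until it reaches the one-hour cap.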
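To see how quickly the cap is reached, the schedule can be simulated outside Sentry. The following is a minimal sketch, not the model's actual code: `simulate_backoff` and the standalone `next_schedule` are hypothetical helpers, and the simulation assumes each retry runs exactly when the shard becomes eligible and that `scheduled_from` is reset to that moment.

```python
import datetime

MAX_DELAY = datetime.timedelta(hours=1)


def next_schedule(now, scheduled_from, scheduled_for):
    # Same formula as the method above: double the previous delay, capped at one hour.
    last_delay = scheduled_for - scheduled_from
    return now + min(last_delay * 2, MAX_DELAY)


def simulate_backoff(initial_delay, attempts):
    """Return the delay scheduled after each consecutive failed attempt."""
    now = datetime.datetime(2024, 1, 1, tzinfo=datetime.timezone.utc)
    scheduled_from, scheduled_for = now, now + initial_delay
    delays = []
    for _ in range(attempts):
        # Assumption: the retry happens at the instant the shard becomes eligible.
        now = scheduled_for
        new_scheduled_for = next_schedule(now, scheduled_from, scheduled_for)
        delays.append(new_scheduled_for - now)
        scheduled_from, scheduled_for = now, new_scheduled_for
    return delays


if __name__ == "__main__":
    start = datetime.timedelta(seconds=10)
    for attempt, delay in enumerate(simulate_backoff(start, 10), start=1):
        print(f"attempt {attempt}: retry in {delay}")
```

Starting from a 10-second delay, the retry delay doubles on each failure (20 s, 40 s, 80 s, ...) and is pinned at one hour from the ninth failed attempt onward.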
When debugging stuck outboxes, you'll often need to generate SQL for a developer to run against production PostgreSQL. Follow these rules to construct the correct query.
| Direction | Model class | Table name |
|---|---|---|
| Cell -> Control | CellOutbox | sentry_regionoutbox |
| Control -> Cell(s) | ControlOutbox | sentry_controloutbox |
How to determine direction: Look at the model that changed.
- If the model uses `@cell_silo_model` (or inherits `ReplicatedCellModel`), it writes to `sentry_regionoutbox`.
- If the model uses `@control_silo_model` (or inherits `ReplicatedControlModel`), it writes to `sentry_controloutbox`.

Both tables share these columns:
| Column | Type | Description |
|---|---|---|
| `id` | bigint | Auto-increment primary key |
| `shard_scope` | int | OutboxScope enum value (see category.py) |
| `shard_identifier` | bigint | Shard key (e.g., org ID, user ID) |
| `category` | int | OutboxCategory enum value |
| `object_identifier` | bigint | ID of the source model instance |
| `payload` | jsonb | Optional JSON data (nullable) |
| `scheduled_from` | timestamptz | When this attempt started |
| `scheduled_for` | timestamptz | When eligible for next processing |
| `date_added` | timestamptz | When the outbox was created |
sentry_controloutbox has one additional column:
| Column | Type | Description |
|---|---|---|
| `region_name` | varchar | Target cell for this outbox |
Before constructing a query, resolve the integer values for the category and scope from src/sentry/hybridcloud/outbox/category.py. Read the file to get the exact values. For example:
- `OutboxCategory.ORGANIZATION_MEMBER_UPDATE = 3`
- `OutboxScope.ORGANIZATION_SCOPE = 0`

Always include the resolved enum names as SQL comments so the developer knows what the magic numbers mean.
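One way to keep the enum names attached to the numbers is to generate the query text from the enum members themselves. The sketch below is illustrative only: `OutboxCategory` and `OutboxScope` here are local stand-ins holding just the two example values quoted above (the real enums must be read from `category.py`), and `build_shard_query` and `TABLES` are hypothetical names.

```python
from enum import IntEnum


class OutboxCategory(IntEnum):
    # Stand-in: only the example value quoted in this document.
    ORGANIZATION_MEMBER_UPDATE = 3


class OutboxScope(IntEnum):
    # Stand-in: only the example value quoted in this document.
    ORGANIZATION_SCOPE = 0


# Direction -> table name, as listed in the table-selection rules.
TABLES = {"cell": "sentry_regionoutbox", "control": "sentry_controloutbox"}


def build_shard_query(direction, scope, category, shard_identifier, limit=50):
    """Build a read-only query with the enum names spelled out as SQL comments."""
    table = TABLES[direction]
    return "\n".join(
        [
            f"-- shard_scope: {int(scope)} = {scope.name}",
            f"-- category: {int(category)} = {category.name}",
            "SELECT id, object_identifier, scheduled_from, scheduled_for, date_added",
            f"FROM {table}",
            f"WHERE shard_scope = {int(scope)}",
            f"  AND category = {int(category)}",
            f"  AND shard_identifier = {int(shard_identifier)}",
            "ORDER BY id DESC",
            f"LIMIT {int(limit)};",
        ]
    )


if __name__ == "__main__":
    print(
        build_shard_query(
            "cell",
            OutboxScope.ORGANIZATION_SCOPE,
            OutboxCategory.ORGANIZATION_MEMBER_UPDATE,
            shard_identifier=42,
        )
    )
```

Casting every interpolated value with `int()` keeps the generated text limited to numeric literals, which matters when the output is pasted into a production session.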
When generating SQL for a developer, print the query to the terminal so they can copy-paste it into a production psql session. Always include:
- `LIMIT` clauses to avoid overwhelming output

-- Find cell outbox shards stuck in backoff
-- shard_scope: 0 = ORGANIZATION_SCOPE, 1 = USER_SCOPE, etc.
-- category: see OutboxCategory enum in category.py
SELECT
shard_scope,
shard_identifier,
category,
count(*) AS depth,
min(scheduled_for) AS next_attempt,
min(date_added) AS oldest_message,
max(date_added) AS newest_message
FROM sentry_regionoutbox
WHERE scheduled_for > NOW()
GROUP BY shard_scope, shard_identifier, category
ORDER BY depth DESC
LIMIT 20;
-- Find control outbox shards stuck in backoff
SELECT
region_name,
shard_scope,
shard_identifier,
category,
count(*) AS depth,
min(scheduled_for) AS next_attempt
FROM sentry_controloutbox
WHERE scheduled_for > NOW()
GROUP BY region_name, shard_scope, shard_identifier, category
ORDER BY depth DESC
LIMIT 20;
-- Inspect messages in a specific shard (most recent first)
-- Replace <scope>, <shard_id> with actual values
SELECT
id,
category,
object_identifier,
payload,
scheduled_from,
scheduled_for,
date_added
FROM sentry_regionoutbox
WHERE shard_scope = <scope> -- e.g., 0 = ORGANIZATION_SCOPE
AND shard_identifier = <shard_id> -- e.g., the organization_id
ORDER BY id DESC
LIMIT 50;
-- Count pending outboxes for a specific category
-- category: <N> = <CATEGORY_NAME>
SELECT count(*) AS pending
FROM sentry_regionoutbox
WHERE category = <N>;
-- Find all outboxes for a specific model instance
-- category: <N> = <CATEGORY_NAME>
SELECT
id,
shard_scope,
shard_identifier,
payload,
scheduled_from,
scheduled_for,
date_added
FROM sentry_regionoutbox
WHERE category = <N>
AND object_identifier = <object_id>
ORDER BY id DESC
LIMIT 20;
-- Top 10 deepest shards across all scopes/categories
SELECT
shard_scope,
shard_identifier,
count(*) AS depth
FROM sentry_regionoutbox
GROUP BY shard_scope, shard_identifier
ORDER BY depth DESC
LIMIT 10;
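Once results come back, a shard can be triaged by comparing its `next_attempt` value (the `min(scheduled_for)` column in the backoff queries) with the current time. A rough sketch, assuming the timestamp has already been parsed into a timezone-aware `datetime`; `triage_shard` is a hypothetical helper, not part of Sentry:

```python
import datetime


def triage_shard(next_attempt, now=None):
    """Describe a shard's state from its earliest scheduled_for timestamp."""
    if now is None:
        now = datetime.datetime.now(datetime.timezone.utc)
    if next_attempt > now:
        # scheduled_for in the future means the shard is waiting out a backoff delay.
        return f"in backoff, next attempt in {next_attempt - now}"
    # scheduled_for in the past means the shard is ready for the next drain cycle.
    return "eligible now; if it stays pending, check that drain tasks are running"


if __name__ == "__main__":
    now = datetime.datetime.now(datetime.timezone.utc)
    print(triage_shard(now + datetime.timedelta(minutes=30), now=now))
    print(triage_shard(now - datetime.timedelta(minutes=5), now=now))
```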
When a developer asks you to debug stuck outboxes:
1. Pick the table: `sentry_regionoutbox` for cell models, `sentry_controloutbox` for control models.
2. Read `src/sentry/hybridcloud/outbox/category.py` to get the integer values for the relevant `OutboxCategory` and `OutboxScope`.
3. Explain what the results mean (e.g., "if `scheduled_for` is far in the future, the shard is in exponential backoff after repeated failures").

The `should_skip_shard()` method checks these options:
# Skip specific organization shards (cell outboxes)
"hybrid_cloud.authentication.disabled_organization_shards": [org_id_1, org_id_2]
# Skip specific user shards (cell/control outboxes)
"hybrid_cloud.authentication.disabled_user_shards": [user_id_1, user_id_2]
When a shard is skipped, its outboxes remain in the table but are not processed until the option is removed.
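The check can be pictured as a membership test against those option lists. This is a simplified sketch rather than the real method: `should_skip_shard` is written here as a free function, the options are a plain dict, and the mapping from scope name to option key is an assumption based on the comments above.

```python
# Assumed mapping from scope name to the option that disables shards of that scope.
DISABLED_SHARD_OPTIONS = {
    "ORGANIZATION_SCOPE": "hybrid_cloud.authentication.disabled_organization_shards",
    "USER_SCOPE": "hybrid_cloud.authentication.disabled_user_shards",
}


def should_skip_shard(options, scope_name, shard_identifier):
    """Return True when the shard's identifier is listed in the matching option."""
    option_key = DISABLED_SHARD_OPTIONS.get(scope_name)
    if option_key is None:
        # Scopes without a disable option are never skipped in this sketch.
        return False
    return shard_identifier in options.get(option_key, [])


if __name__ == "__main__":
    options = {"hybrid_cloud.authentication.disabled_organization_shards": [101, 202]}
    print(should_skip_shard(options, "ORGANIZATION_SCOPE", 101))
    print(should_skip_shard(options, "ORGANIZATION_SCOPE", 303))
```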
Set the option value lower than the code's replication_version to prevent a backfill from running:
# If model.replication_version = 3, setting this to 2 prevents the v3 backfill:
"outbox_replication.sentry_mymodel.replication_version": 2
See references/backfill.md for details on how find_replication_version() uses min(option_value, coded_version).
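The effect of the option follows directly from the `min()` rule. A small sketch under stated assumptions: `effective_replication_version` and `option_key_for` are hypothetical helpers, and the key format is inferred from the single example above.

```python
def option_key_for(table_name):
    """Build the option key, following the pattern of the example above."""
    return f"outbox_replication.{table_name}.replication_version"


def effective_replication_version(option_value, coded_version):
    """The lower of the configured option and the version declared in code wins."""
    return min(option_value, coded_version)


if __name__ == "__main__":
    print(option_key_for("sentry_mymodel"))
    print(effective_replication_version(option_value=2, coded_version=3))
```

With the model at version 3 and the option at 2, the effective version is 2, so the v3 backfill is held back. Setting the option above the coded version has no further effect.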
| Metric | Type | Description |
|---|---|---|
| `outbox.saved` | counter | Outbox rows saved (per category tag) |
| `outbox.processed` | counter | Coalesced outbox groups processed |
| `outbox.processing_lag` | histogram | Time from date_added to processing |
| `outbox.coalesced_net_processing_time` | histogram | Time spent in send_signal() |
| `outbox.coalesced_net_queue_time` | histogram | Total queue time for coalesced messages |
| `schedule_batch.queued_batch_size` | gauge | Number of drain tasks spawned per cycle |
| `schedule_batch.maximum_shard_depth` | gauge | Deepest shard in the current batch |
| `schedule_batch.total_outbox_count` | gauge | Total pending outbox count |
For local debugging or in a Django shell:
from sentry.hybridcloud.models.outbox import CellOutbox, ControlOutbox
# Top 10 deepest cell shards
for shard in CellOutbox.get_shard_depths_descending(limit=10):
    print(f"Scope={shard['shard_scope']} ID={shard['shard_identifier']} Depth={shard['depth']}")

- Check that the `enqueue_outbox_jobs` task is running (Taskbroker / cron).
- Check that `drain_outbox_shards` tasks are being spawned (check Taskbroker queue).
- Check whether the shard is in backoff (`scheduled_for > now()`).
- `OutboxFlushError` wraps the original exception with the outbox details.
- Note the `OutboxCategory` value in the stuck outbox.
- With `payload_for_update()`, ensure the payload contains only immutable or slowly-changing data.
- `OutboxFlushError`: The signal receiver raised an exception during `outbox_runner()`. Read the nested exception.
- `OutboxRecursionLimitError`: More than 10 drain iterations, likely an outbox handler that creates more outboxes in an infinite loop.
- `QuerySet.update()` / `QuerySet.delete()` bypass outbox creation.
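
The recursion-limit failure is easiest to see with a toy drain loop. This sketch is not Sentry's implementation: `drain`, the in-memory queue, the two handlers, and the local `OutboxRecursionLimitError` class are all stand-ins, and only the limit of 10 iterations is taken from the description above.

```python
from collections import deque


class OutboxRecursionLimitError(Exception):
    """Local stand-in for the real exception class."""


MAX_DRAIN_ITERATIONS = 10


def drain(queue, handler):
    """Process batches until the queue is empty or the iteration limit is hit."""
    iterations = 0
    while queue:
        if iterations >= MAX_DRAIN_ITERATIONS:
            raise OutboxRecursionLimitError(
                f"queue still non-empty after {iterations} drain iterations"
            )
        # Take everything currently queued; handlers may enqueue more work.
        batch = list(queue)
        queue.clear()
        for message in batch:
            handler(message, queue)
        iterations += 1
    return iterations


def well_behaved(message, queue):
    """Handles the message without creating new outboxes."""


def self_feeding(message, queue):
    """Every handled message creates another one, so the queue never empties."""
    queue.append(message + 1)


if __name__ == "__main__":
    print(drain(deque([1, 2, 3]), well_behaved))
    try:
        drain(deque([1]), self_feeding)
    except OutboxRecursionLimitError as exc:
        print(f"raised: {exc}")
```

A handler that finishes its work without enqueuing anything drains in a single pass, while one that always enqueues a follow-up message can never empty the queue and trips the limit.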