.agents/commands/comet/analyze-sentry-issue.md
Command: cursor analyze-sentry-issue
Drive an end-to-end triage of a Sentry issue: pull all events directly from Sentry's REST API, aggregate, locate the emitting code, diagnose root cause and observability gaps, and propose concrete fixes (with code locations) the engineer can apply or ticket. This is a guided runbook, not just a data dump — at each step the workflow gives the engineer something to decide on, not just numbers.
This is the canonical entry point for Sentry analysis in this repo. It uses the `SENTRY_ACCESS_TOKEN` from `.env.local` and Sentry's REST API directly.
## Phase 1: Inputs & token preflight

- Accept either a full issue URL (`https://<org>.sentry.io/issues/<id>/?...`) or the bare numeric issue ID.
- Verify `SENTRY_ACCESS_TOKEN` is present in `.env.local`. Never echo the token. If missing, point the engineer at `.agents/docs/SENTRY_MCP_SETUP.md` and stop.
- Read the token from `.env.local` directly, not from `.mcp.json`.
- Never put the token on a command line (`curl -H "Authorization: Bearer $TOKEN"` leaks it via `ps`). Set the header in code via the language's HTTP client (e.g. Python `urllib.request`, Node `fetch`), or use a temporary header file (`curl -H @file`, mode 600) deleted afterward.

## Phase 2: Pull events

Page through `GET https://us.sentry.io/api/0/issues/<issue_id>/events/?full=false&limit=100` with `Authorization: Bearer $SENTRY_ACCESS_TOKEN`, following the `Link: rel="next"; results="true"; cursor=...` header.
⚠️ The host is hardcoded to the US region (`us.sentry.io`) because that's where Comet's Sentry org lives. If you're on the EU region (`de.sentry.io`) or a self-hosted Sentry, swap in `$SENTRY_HOST` from `.env.local` — otherwise the call will silently hit the wrong API and either fail auth or return empty results.
Default cap: ~3 pages (300 events). Distinct-message and tag distributions converge fast; pulling thousands of events per analysis is rarely necessary and slows the workflow. Bump the cap (and tell the engineer you're doing so) only when:
- `count` is large and the question being asked actually depends on the long tail (e.g., "which rare exception types are hiding in here?").

Capture per event: `eventID`, `message`, `title`, `user.id` (nullable — treat missing as `<no-user>`), `release` (nullable — treat missing as `<no-release>`), and `tags`. Note: Sentry returns tags as an array of `{key, value}` objects; normalize it into a `{key: value}` dict before any tag aggregation in Phase 3.
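A minimal sketch of the fetch loop under this runbook's constraints: token read from `.env.local` (assumed to be stored as a `SENTRY_ACCESS_TOKEN=...` line), header set in code rather than on a command line, `Link` cursor followed, page cap enforced, tags normalized on the way out. Function names are illustrative:

```python
import json
import re
import urllib.request

def load_token(path: str = ".env.local") -> str:
    # Read the token from .env.local directly; never echo it or put it in argv.
    with open(path) as f:
        for line in f:
            if line.startswith("SENTRY_ACCESS_TOKEN="):
                return line.split("=", 1)[1].strip()
    raise SystemExit("SENTRY_ACCESS_TOKEN missing, see .agents/docs/SENTRY_MCP_SETUP.md")

def fetch_events(issue_id: str, max_pages: int = 3) -> list[dict]:
    token = load_token()
    url = f"https://us.sentry.io/api/0/issues/{issue_id}/events/?full=false&limit=100"
    events: list[dict] = []
    for _ in range(max_pages):  # default cap: ~3 pages (300 events)
        req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
        with urllib.request.urlopen(req) as resp:
            events.extend(json.load(resp))
            link = resp.headers.get("Link", "")
        # Follow the rel="next" entry only while results="true".
        m = re.search(r'<([^>]+)>;\s*rel="next";\s*results="true"', link)
        if not m:
            break
        url = m.group(1)
    for ev in events:
        # Normalize tags from [{key, value}, ...] into a flat dict for Phase 3.
        ev["tags"] = {t["key"]: t["value"] for t in ev.get("tags", [])}
    return events
```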
## Phase 3: Aggregate

Compute and present (a `Counter` sketch follows the list):

- Events fetched vs. the issue's reported `count` (mismatch hints at unindexed pagination or pruning).
- Distinct-message distribution. `KeyError: '<X>'` → which key was missing (count by key); `<ExceptionType>: ...` → exception-type distribution.
- User distribution, including `default_*` IDs. Per-user-per-day clustering matters: if one user fires N events in one minute, it's typically a single broken run iterating a dataset, not N independent failures.
- Tag distributions: `cli_command`, `installation_type`, `os_type`, `python_version`, `environment`, etc.
- Sentry's `title` is often a log string emitted by the application, not the actual exception. Find where it comes from.
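A sketch of those rollups with `collections.Counter`, assuming `events` comes from the fetch sketch above (tags already normalized to dicts; field shapes follow the Phase 2 capture list):

```python
import re
from collections import Counter

def aggregate(events: list[dict]) -> dict[str, Counter]:
    messages = Counter(ev.get("message") or ev.get("title") or "<no-message>" for ev in events)
    # Exception-type distribution from leading "<ExceptionType>: ..." messages.
    exc_types = Counter(
        m.group(1)
        for ev in events
        if (m := re.match(r"([A-Za-z_]\w*(?:Error|Exception))\b", ev.get("message") or ""))
    )
    # Which key was missing, for KeyError-shaped messages.
    missing_keys = Counter(
        m.group(1)
        for ev in events
        if (m := re.search(r"KeyError: '([^']+)'", ev.get("message") or ""))
    )
    users = Counter((ev.get("user") or {}).get("id") or "<no-user>" for ev in events)
    releases = Counter(ev.get("release") or "<no-release>" for ev in events)
    tags = Counter()  # (key, value) pairs; assumes tags were normalized to dicts
    for ev in events:
        tags.update(ev.get("tags", {}).items())
    return {
        "messages": messages,
        "exc_types": exc_types,
        "missing_keys": missing_keys,
        "users": users,
        "releases": releases,
        "tags": tags,
    }
```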
## Phase 4: Locate the emitting code

The Sentry project the issue belongs to (returned by the API as `project.slug`, or visible in the issue's "Project" field) tells you which codebase area emitted the event. Map it to the corresponding subtree by purpose: backend service → `apps/opik-backend/` (Java), frontend app → `apps/opik-frontend/` (TypeScript/React), Python SDKs → `sdks/python/` or `sdks/opik_optimizer/`, TypeScript SDK → `sdks/typescript/`. If the project name doesn't make the mapping obvious, ask the engineer.
Steps:
1. Take the issue's message (or title) and grep it inside the project's subtree. Use a substring distinctive enough to land on one or two callsites.
2. Identify the emitting idiom. Python: `logger.error / .warning / .exception(...)` (logging integration) or `sentry_sdk.capture_exception(...)`. Java: `log.error("msg", e)` (SLF4J) or `Sentry.captureException(e)`. TypeScript: `console.error(...)`, `logger.error(...)`, `Sentry.captureException(e)`, or `Sentry.captureMessage(...)`.
3. Check whether the exception object actually travels with the event (illustrated below). Python: is the `exc_info=<exc>` argument present? Java: `log.error("msg", e)` attaches; `log.error("msg " + e)` does not. TypeScript: is the error passed to `captureException`, or just stringified into a message?
4. If multiple distinct messages share the same log template at the callsite, that's the fingerprint-collision pattern: Sentry buckets unrelated failure modes together because the template hash is identical.
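To make step 3 concrete in the Python flavor, here is the contrast the `exc_info=` check looks for, at a hypothetical callsite rather than real repo code:

```python
import logging

logger = logging.getLogger("opik")  # hypothetical callsite, not repo code

def run_task(task):
    try:
        task()
    except Exception as exc:
        # Stringified only: Sentry receives a bare message with no traceback,
        # and every exception type shares the "Task failed: %s" template,
        # which is the fingerprint-collision pattern from step 4.
        logger.error("Task failed: %s", exc)
        # Attached: exc_info carries the exception object, so the event gets
        # a full stack trace and an exception type to group on.
        logger.error("Task failed: %s", exc, exc_info=exc)
```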
## Phase 5: Diagnose

Lead the engineer through these decisions, citing concrete events and code locations:
- **Observability gap?** Three sub-questions, expressed in the language of the project from Phase 4:
  - Is the exception object lost? Python: logged without `exc_info=`. Java: passed by string concatenation instead of as the second SLF4J argument. TypeScript: stringified into a message instead of passed to `Sentry.captureException`.
  - Is structured context missing? Python: `extra=` / `sentry_sdk.set_context(...)`. Java: `Sentry.setExtra(...)` / MDC. TypeScript: `Sentry.setContext(...)` / `Sentry.withScope(...)`.
  - Are fingerprints colliding? (See the template-collision check in Phase 4, step 4.)
- **SDK bug, user error, or infra?** Decide which bucket the dominant failure mode falls into, citing the Phase 3 distributions.
- **Per-customer concentration?** If one named org/server/user dominates, flag it for direct outreach — they're hitting a recoverable wall.
## Phase 6: Propose fixes

For each finding produced in Phase 5, give the engineer a specific, actionable proposal with file paths and line numbers. Two layers, in order of priority:
1. **Observability fix** (almost always cheap, ship first). Patch the identified callsites to attach the exception object and/or structured context using the idioms from the project's language (see Phase 5). If template collision is the issue, change the log template so distinct exception types fingerprint distinctly. State the expected outcome explicitly: future Sentry events will carry the data the engineer just spent time discovering wasn't there. A before/after sketch follows this list.
2. **Behavior fix.** Address the underlying root cause identified in the Phase 5 diagnosis.
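A before/after sketch of layer 1 at a hypothetical Python callsite; `group_id` and `item_id` stand in for whatever context Phase 3 surfaced:

```python
# Before: exception stringified into a shared template. No traceback reaches
# Sentry, and every error type collides into the same issue.
logger.error("Evaluation task failed (group=%s): %s", group_id, exc)

# After: attach the exception object plus the structured context discovered
# in Phase 3. With exc_info present, Sentry has the stack trace and the
# exception type to group on instead of the shared template.
logger.error(
    "Evaluation task failed (group=%s): %s",
    group_id,
    exc,
    exc_info=exc,
    extra={"group_id": group_id, "item_id": item_id},
)
```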
## Wrap-up

Offer (do not auto-execute):
- Writing a regression test in the matching test tree (`sdks/python/tests/`, `sdks/typescript/`, `apps/opik-backend/src/test/java/`, `apps/opik-frontend/`).
- Creating the branch (`<user>/OPIK-<ticket>-<slug>` per `.claude/rules/git-workflow.md`) and using the first-commit message format `[OPIK-####] [<COMPONENT>] <type>: …`, where `<COMPONENT>` matches the project (e.g. `[SDK]`, `[BE]`, `[FE]`).
- Adding `Fixes <SENTRY-ISSUE-SHORT-ID>` to the commit (the short ID is shown on the issue page, format like `<PROJECT>-XYZ`); it auto-resolves the Sentry issue when the commit ships — call this out explicitly so the engineer doesn't have to remember.

The final summary should be short and decision-oriented, not a data dump.
## Python SDK priors

Use these as starting hypotheses when the issue's emitting code is in the Python SDK subtree (`sdks/python/` or `sdks/opik_optimizer/`) — they recur often enough to be worth checking first:
- `LOGGER.error/.warning(..., exception)` without `exc_info=`: very common; roughly half of `LOGGER.{error,warning,exception}` calls in `sdks/python/src/opik/` lack `exc_info`. Always check this — fixing it usually unblocks triage on its own. (A rough scanner for these callsites is sketched below.)
- Shared log templates like `"Evaluation task failed (group=%s): %s: %s"` or `"Task failed for item %s: %s"` group every distinct task error into one Sentry issue. The fingerprint-collision diagnosis applies whenever the issue title cites one exception type but the message distribution shows many.
- Supervisor-style callsites (`runner/supervisor.py`) capture child crashes via the parent — `stderr_tail` and `exit_code` are in scope but only the message gets logged. Look for `extra=` opportunities, not `exc_info=`, on these.

For backend (Java) and frontend / TypeScript SDK projects, no project-specific priors have been collected yet — fall back to the language-agnostic checks in Phases 4 and 5.
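A rough scanner for the first prior. It is regex-based, so treat hits as grep leads rather than proof; multi-line calls are matched crudely and only one level of nested parentheses is tolerated:

```python
import re
from pathlib import Path

# LOGGER.<level>( ... ), tolerating one level of nested parentheses.
CALL = re.compile(r"LOGGER\.(?:error|warning|exception)\((?:[^()]|\([^()]*\))*\)", re.S)

for path in Path("sdks/python/src/opik").rglob("*.py"):
    src = path.read_text()
    for m in CALL.finditer(src):
        if "exc_info" not in m.group(0):
            line_no = src[: m.start()].count("\n") + 1
            print(f"{path}:{line_no}: {m.group(0).splitlines()[0]}")
```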