.agents/skills/debug-issue-with-datadog/SKILL.md
Use this skill whenever the task is investigative rather than implementational: a user, customer, or oncall has surfaced a problem and you need to figure out what is actually happening in production and where in the code it lives. The deliverable is an analysis, not a patch — though the analysis should make the right patch obvious.
Typical triggers:

- A Linear issue (LFE-XXXX ID) describes a production failure, error spike, or customer report.
- A request like "investigate LFE-8837" and similar — these expect the structured analysis output below.

If the task is "implement this fix" rather than "figure out what's broken", this is the wrong skill — go to backend-dev-guidelines or the relevant package guide.
Read the inputs first, then plan the Datadog sweep, then read the code, then write the analysis. Do not skip ahead to suggested patches before the data supports them.
**Intake.** Pull every signal already available in the report. See references/intake.md. For a Linear URL/ID, fetch the issue and its comments via the Linear MCP — the description is often updated inline as triage proceeds. For a GitHub issue, use `gh issue view`. For pasted text, treat it as the description.
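As a rough illustration, the intake routing above could be sketched like this. The URL patterns and return labels are assumptions for the sketch, not part of the skill's contract:

```python
import re

def classify_intake(text: str) -> str:
    """Rough triage of what kind of intake the user provided.

    The LFE-XXXX pattern and the linear.app / github.com URL shapes
    are assumptions based on the examples in this skill.
    """
    t = text.strip()
    if re.search(r"linear\.app/", t) or re.fullmatch(r"LFE-\d+", t):
        return "linear"       # fetch issue + comments via the Linear MCP
    if re.search(r"github\.com/.+/issues/\d+", t):
        return "github"       # use `gh issue view`
    return "pasted-text"      # treat the text itself as the description
```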
**Scope the sweep.** From the intake, pick the affected subsystem and time window. Use references/repo-debug-map.md to translate "PostHog integration", "ingestion failures", "evals stuck", etc. into the Datadog filters and source files you should be looking at.
**Run the broad Datadog sweep.** Default to the full sweep in references/datadog-playbook.md: APM spans, error logs, metrics, and monitors — split across prod-eu and prod-us (and prod-hipaa / prod-jp when relevant). Always check regional disparity first; it usually rules whole hypotheses in or out.
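A toy sketch of that regional-disparity check, with made-up counts standing in for the sweep's results (the 90% threshold is an illustrative cutoff, not a documented rule):

```python
# Hypothetical per-region error counts pulled from the sweep.
counts = {"prod-eu": 4200, "prod-us": 12}

total = sum(counts.values())
share = {env: n / total for env, n in counts.items()}

# A lopsided share is strong evidence the incident is region-local,
# which immediately rules out hypotheses about shared code paths.
regional = max(share.values()) > 0.9
```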
**Cluster the errors.** Group by `(projectId, error.message)` or `(error.type, error.message)`. Treat each distinct cluster as its own hypothesis — Langfuse incidents commonly have multiple coexisting root causes, not one.
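The clustering step can be sketched with hypothetical log records; the field names follow the filters above, but real Datadog payloads differ:

```python
from collections import Counter

# Hypothetical log records; real ones come from the Datadog log search.
logs = [
    {"projectId": "p1", "error": {"message": "S3 timeout"}},
    {"projectId": "p1", "error": {"message": "S3 timeout"}},
    {"projectId": "p2", "error": {"message": "invalid payload"}},
]

clusters = Counter(
    (log["projectId"], log["error"]["message"]) for log in logs
)

# Each distinct (projectId, error.message) key is its own hypothesis.
for key, count in clusters.most_common():
    print(key, count)
```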
**Map clusters to code.** For each cluster, open the relevant handler file from the repo-debug map and read enough of it to confirm or refute the hypothesis. Cite specific files and line ranges in the output.
**Write the analysis** using references/output-template.md.
**Deliver.** Default: print the analysis in chat. If the user asked for it, also save it in the form they specified (a file, a Linear comment via the Linear MCP, etc.).
Two Datadog MCP servers are typically available — one bound to the EU site
(datadoghq.eu) and one to the US site (datadoghq.com). Always run
region-relevant queries against both unless intake clearly localizes the
incident. The prod-eu / prod-us env tags live on each side respectively.
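A minimal sketch of fanning one filter out to both regions; the helper and its names are illustrative, not an actual MCP call:

```python
# Hypothetical helper: the env/site pairing mirrors the two MCP servers
# described above; nothing here is a real Datadog or MCP API.
REGIONS = {
    "prod-eu": "datadoghq.eu",
    "prod-us": "datadoghq.com",
}

def regional_queries(base_query: str) -> dict[str, str]:
    """Return the same filter scoped to each region's env tag."""
    return {
        site: f"env:{env} {base_query}"
        for env, site in REGIONS.items()
    }
```

Running the same query on both sides up front is what makes the regional-disparity check cheap.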
Starter queries:

- `service:worker resource_name:"process posthog-integration-project" status:error`
- `service:worker env:prod-eu @langfuse.project.id:cm1r6u… status:error`
- Prefer `aggregate_spans` / `aggregate_events` grouped by `(error.message, projectId)` over fetching individual traces.

See references/datadog-playbook.md for the full set of starter queries and parameter shapes.
From the output template:

- Per-`projectId` (or per-cluster) error counts.
- Specific file citations under `worker/src/features/**` or `web/src/**`.

Findings come first, recommendations last. If the data is thin, say so explicitly and propose what would need to be true to confirm each hypothesis — do not invent root causes.
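One possible shape for a per-cluster finding; the field names are assumptions loosely inferred from this file, since the real template lives in references/output-template.md:

```python
from dataclasses import dataclass, field

@dataclass
class ClusterFinding:
    """Sketch of a per-cluster analysis record; fields are illustrative."""
    cluster_key: tuple[str, str]   # e.g. (projectId, error.message)
    error_count: int
    code_refs: list[str] = field(default_factory=list)  # file:line ranges
    hypothesis: str = ""
    confirmed: bool = False        # only flip once the data supports it
```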
Related guides:

- backend-dev-guidelines
- clickhouse-best-practices
- AGENTS.md for the affected directory