The default for this skill is the broad sweep: APM spans + logs + metrics.
Two MCP servers are typically connected — one for datadoghq.eu, one for
datadoghq.com. Run the same query on both and compare. The contrast itself
is often the most informative finding (e.g. LFE-9475's 23.5% EU vs 0.7% US
error rate immediately ruled out PostHog Cloud as the global cause).
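The cheapest version of that comparison is a single broad error query run once against each server, then compared per env (a sketch; narrow by service and resource as described below):

service:worker status:error

Run it on the datadoghq.eu server and again on datadoghq.com; a lopsided error count is the signal to dig into that region.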
The fields that scope every query:
- Site: datadoghq.eu for EU, datadoghq.com for US.
- env: prod-eu, prod-us, prod-hipaa, prod-jp.
- service: worker (default in worker/src/env.ts) or web (default in web/src/env.mjs). Some deployments override to langfuse.
- resource_name: process <queue-name> — see repo-debug-map.md.

Use aggregate_spans first; only fetch individual traces once a cluster is identified.
Starter shape (rename the resource for the relevant subsystem):
service:worker resource_name:"process posthog-integration-project" status:error
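The same shape with the queue left as a placeholder (exact queue names are listed in repo-debug-map.md):

service:worker resource_name:"process <queue-name>" status:error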
Aggregations to run, in order:
- env — confirms region split.
- error.message — primary error classes.
- (projectId, error.message) — which tenants are affected, by class. projectId lives on span tags as @projectId for log search and as a tag for spans (depends on instrumentation site).

If aggregate_spans returns no results, check the service and resource names against repo-debug-map.md.

Logs are the right tool for messages the handler emitted. Spans are the right tool for which handler invocations failed.
Starter shapes:
service:worker env:prod-eu @projectId:cm1r6u1iq00ccfvrkoy8vg3ms status:error
service:worker env:prod-eu "[POSTHOG]" status:error
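Once the span aggregation has named an error class, pivoting on it directly also works (connection_limit here is just an illustrative class, borrowed from the metrics notes below):

service:worker env:prod-eu @error.message:*connection_limit* status:error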
Useful log facets:
- @projectId or @langfuse.project.id — Langfuse project (cuid).
- @error.kind / @error.message / @error.stack — when the Winston logger serialized an Error.
- @queue / @jobName — when set by BullMQ instrumentation.

For high-volume subsystems (ingestion-queue, otel-ingestion-queue), prefer analyze_datadog_logs with grouping over search_datadog_logs — the raw matches are too noisy.
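A concrete shape for that, with the queue scoped via the @queue facet described above; the grouping (e.g. by @error.message) is whatever analyze_datadog_logs exposes as a parameter, not part of the query string:

service:worker env:prod-eu @queue:ingestion-queue status:error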
Pick 2–3 metrics that match the subsystem. Common ones:
- trace.bullmq.process.errors and trace.bullmq.process.duration — per-queue health from the BullMQ OTel instrumentation. Filter by resource_name:"process <queue-name>".
- trace.http_request.errors and trace.http_request.duration for HTTP handlers (service:web).
- clickhouse.query.duration, clickhouse.memory_usage — the worker doesn't emit these directly, they come from the clickhouse integration in the infra repo.
- aurora.databaseconnections, aurora.deadlocks — relevant when the symptom is connection_limit / connection pool errors.

If the subsystem isn't already known, run search_datadog_metrics for the subsystem name and pick the obvious counter / gauge / histogram triplet.
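A sketch of the shape these metric queries take (exact tag names on the trace.* metrics are worth confirming with search_datadog_metrics):

sum:trace.bullmq.process.errors{env:prod-eu} by {resource_name}.as_count()
avg:trace.bullmq.process.duration{env:prod-eu} by {resource_name}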
search_datadog_monitors for the subsystem name — tells you what alerts
would have fired and what their thresholds are. A muted monitor on the
affected subsystem is itself a finding (see LFE-9475: "EU alert muted for
a week").

search_datadog_incidents for the time window — links any pre-existing incident the user may not have referenced.
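If the MCP tool passes a Manage Monitors search string through (an assumption worth verifying), the muted-monitor angle can be checked directly; <subsystem> is a placeholder:

<subsystem> muted:true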
Skip RUM unless the issue is "page broken" / "slow load". Then:
- search_datadog_rum_events filtered by @view.url: patterns matching the affected route.
- service:web API errors at the same time.
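A starter filter for the RUM query, with the route left as a placeholder (how error events are selected depends on what search_datadog_rum_events exposes):

@view.url:*<affected-route>*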
Once a cluster is identified, fetch one or two representative traces with get_datadog_trace to read the actual stack and confirm where in the handler the throw originates. This is what lets you point at a specific file and line range in the analysis.
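To surface a representative trace to feed get_datadog_trace, a project-scoped span query is usually enough; the projectId and resource here are just the ones from the earlier examples, and @projectId as a span tag carries the instrumentation caveat noted in the aggregation notes:

service:worker env:prod-eu resource_name:"process posthog-integration-project" @projectId:cm1r6u1iq00ccfvrkoy8vg3ms status:error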
Don't take error.message literally if it goes through a custom error wrapper — validateWebhookURL rejections, for example, are re-logged as "DNS lookup failed" but are actually validator rejections.

End the analysis with the actual Datadog UI URLs you queried, e.g.:
https://app.datadoghq.eu/apm/traces?query=resource_name%3A%22process+posthog-integration-project%22+status%3Aerror
https://app.datadoghq.com/apm/traces?query=resource_name%3A%22process+posthog-integration-project%22+status%3Aerror
so the human reader can re-run the same query.