# Repo Debug Map — Subsystem → Code → Datadog Filters

For each subsystem we ship monitors and incidents on, this is the canonical map between the symptom, the Datadog query that surfaces it, and the source files where the bug almost certainly lives.

When intake gives you a subsystem (PostHog, evals, exports, etc.), start here to pick the right Datadog filters and the right files to read.

## Worker Async Jobs

Worker handlers are wrapped by `instrumentAsync` in their queue file. The span resource name follows the pattern `process <queue-name>`. Queue and job name constants live in `packages/shared/src/server/queues.ts` (the `QueueName` and `QueueJobs` enums).
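
Concretely, a handler in one of these queue files looks roughly like the sketch below. This is a hedged illustration only: the `instrumentAsync` signature, the handler name, and the payload handling are assumptions; just the span-naming convention comes from the table that follows.

```typescript
// Minimal sketch of the instrumentAsync wrapping pattern described above.
// The helper signature, handler name, and payload handling are illustrative,
// not the actual Langfuse implementations.
import { Job } from "bullmq";

// Assumed shape of the shared instrumentation helper.
declare function instrumentAsync<T>(
  opts: { name: string },
  fn: () => Promise<T>
): Promise<T>;

// Business logic lives under worker/src/features/posthog/ (see the table below).
declare function handlePostHogIntegrationJob(payload: unknown): Promise<void>;

export const postHogIntegrationProcessor = async (job: Job): Promise<void> =>
  // The span resource name follows the `process <queue-name>` convention, so
  // this job surfaces in APM as "process posthog-integration-project".
  instrumentAsync({ name: "process posthog-integration-project" }, () =>
    handlePostHogIntegrationJob(job.data)
  );
```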

| Subsystem | Queue file | Handler dir | Span resource_name | Log prefix |
| --- | --- | --- | --- | --- |
| PostHog integration | `worker/src/queues/postHogIntegrationQueue.ts` | `worker/src/features/posthog/` | `process posthog-integration-project` | `[POSTHOG]` |
| Mixpanel integration | `worker/src/queues/mixpanelIntegrationQueue.ts` | `worker/src/features/mixpanel/` | `process mixpanel-integration-project` | `[MIXPANEL]` |
| Blob storage export | `worker/src/queues/blobStorageIntegrationQueue.ts` | `worker/src/features/blobstorage/` | `process blob-storage-project` | `[BLOBSTORAGE]` |
| Data retention | `worker/src/queues/dataRetentionQueue.ts` | `worker/src/features/batch-data-retention-cleaner/` | `process data-retention-project` | n/a |
| Event propagation | `worker/src/queues/eventPropagationQueue.ts` | `worker/src/features/eventPropagation/` | `process event-propagation` | n/a |
| Cloud usage metering | `worker/src/queues/cloudUsageMeteringQueue.ts` | `worker/src/ee/` (cloud-only) | `process cloud-usage-metering` | n/a |
| Free-tier usage threshold | `worker/src/queues/cloudFreeTierUsageThresholdQueue.ts` | `worker/src/ee/usageThresholds/` | `process cloud-free-tier-usage-threshold` | n/a |
| Ingestion (single event) | `worker/src/queues/ingestionQueue.ts` | `worker/src/features/ingestion/` (and `IngestionService`) | BullMQ default span | n/a |
| OTel ingestion | `worker/src/queues/otelIngestionQueue.ts` | `worker/src/features/otel/` | BullMQ default span | n/a |
| Evaluation execution | `worker/src/queues/evalQueue.ts` | `worker/src/features/evaluation/` | BullMQ default span | n/a |
| Batch export | `worker/src/queues/batchExportQueue.ts` | `worker/src/features/batchExport/` | BullMQ default span | n/a |
| Webhook delivery | `worker/src/queues/webhooks.ts` | `worker/src/features/webhooks/` | BullMQ default span | n/a |
| Trace / score / dataset / project delete | `worker/src/queues/{traceDelete,scoreDelete,datasetDelete,projectDelete}.ts` | `worker/src/features/traces/`, `…/scores/`, `…/datasets/` | BullMQ default span | n/a |

For queues using BullMQ default spans (no `instrumentAsync` wrapper), search APM with `service:worker operation_name:bullmq.process`, filtered by `bullmq.queue:<queue-name>`.
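
As a concrete illustration, the helper below assembles that search string from a queue name. Take the real queue-name values from the `QueueName` enum in `packages/shared/src/server/queues.ts`; the "batch-export" literal here is only a placeholder, not necessarily the actual enum value.

```typescript
// Builds the APM search string for queues that only emit BullMQ default spans.
// Take the real queue name from the QueueName enum in
// packages/shared/src/server/queues.ts; "batch-export" is a placeholder.
const bullmqSpanQuery = (queueName: string): string =>
  `service:worker operation_name:bullmq.process bullmq.queue:${queueName}`;

// Paste the result into the APM trace search bar:
console.log(bullmqSpanQuery("batch-export"));
// -> service:worker operation_name:bullmq.process bullmq.queue:batch-export
```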

## Web (Next.js / tRPC / public API)

| Subsystem | Code | Span / log filter |
| --- | --- | --- |
| Public REST API | `web/src/pages/api/public/**` | `service:web resource_name:"GET /api/public/<path>"` |
| tRPC procedures | `web/src/server/api/routers/**` | `service:web resource_name:"POST /api/trpc/<router>.<proc>"` |
| Auth / API key verification | `web/src/features/public-api/server/apiAuth.ts` | look for `verifyAuthHeaderAndReturnScope` spans |
| Stripe billing | `web/src/ee/features/billing/server/stripeBillingService.ts` | wrapped in `instrumentAsync`; spans named after the method |

## Shared Layers

These are not subsystems on their own, but are frequently the actual cause behind a worker subsystem failure.

| Layer | Location | Common failure modes |
| --- | --- | --- |
| ClickHouse access | `packages/shared/src/server/clickhouse/`, `packages/shared/src/server/repositories/` | OOM (Code: 241), buffer cancel (Code: 734), JOIN spills, slow queries on un-pre-filtered traces |
| Prisma access | `packages/shared/src/db.ts` and per-feature repos | connection pool timeout (worker default `connection_limit=5`), N+1 queries |
| Queue contracts | `packages/shared/src/server/queues.ts` | wrong queue name, missing schema validation |
| Logger / instrumentation | `packages/shared/src/server/logger.ts`, `packages/shared/src/server/instrumentation.ts` | log silently dropped because `LANGFUSE_LOG_LEVEL` is set wrong, or span missing because the handler doesn't call `instrumentAsync` |
| Webhook URL validation | `packages/shared/src/server/validateWebhookURL.ts` | rejects with messages that look like DNS errors but are SSRF guard rejections |
| Encryption | `packages/shared/encryption` | bad keys → 403/auth-style failures masquerading as upstream errors |

## Common Symptoms → First Files To Read

  • "403 from upstream": check the per-integration credentials table in Postgres (PostHogIntegration, BlobStorageIntegration, WebhookConfig, etc.) and the encryption layer.
  • "Timeout": check the SDK timeout default and the per-stream flush/batch size in the handler. Worker async jobs default to long-running but the upstream SDK does not.
  • "DNS lookup failed": distinguish actual DNS from validateWebhookURL rejection. The error message wrapping is misleading on purpose.
  • "Cannot write to canceled buffer" (CH): ClickHouse stream wasn't aborted when the downstream consumer threw. Look for an AbortController threaded through the handler.
  • "Connection pool timeout" (Prisma): worker connection_limit is set in the connection string; jobs doing per-row findFirst() exhaust it. Check whether the integration row could be cached in closure scope.
  • "memory limit exceeded" (CH): look for unbounded JOINs without a pre-filter CTE, especially in analytics integrations.
  • "Header overflow": Node HTTP parser's default 80 KB ceiling. Either raise --max-http-header-size for the worker, or replace the SDK's HTTP client.

## Where to Look for Already-Shipped Fixes

Before recommending a patch, confirm it isn't already merged or in flight:

- Attachments on the Linear issue (PRs and commits are auto-linked).
- `git log --oneline --since=<recent-window> -- <handler-path>`.
- Open PRs touching the file via `gh pr list --search "<filename>"`.