.agents/skills/weekly-production-review/SKILL.md
Use this skill to produce a source-grounded, event-centric production review. The report should help engineering understand the week, not just list tool output.
- Production environments in scope: prod-us, prod-eu, prod-hipaa, and prod-jp.
- Write `No measurements found` when a requested signal cannot be queried or measured.
- Use `datadog-query-recipes` for production Datadog query shapes and environment/site routing.
- Use `linear-bug-triage` only after a human explicitly approves a Linear write-back.
- Query Linear by the bug label first. Include all bug-labeled tickets created, updated, or completed during the window, plus those still open with production evidence. When status, owner, or evidence is unclear, inspect the issue details and comments of likely production bugs.

Every production event should have exactly one canonical object in the review:

- an incident.io incident;
- a Linear production bug; or
- an explicit alert disposition, when the alert is expected/test, monitor noise, or unknown/no measurements and no incident or Linear bug should be created yet.

Treat Datadog as evidence, not the canonical event. Treat the public status page as the customer-facing mirror, not the engineering source of truth.
Use this table to decide what is missing:
| Canonical Object | Should Link To | How To Represent In Review |
|---|---|---|
| incident.io incident | status-page URL, Datadog alert/monitor/query links, Linear follow-ups | event row sources plus customer incident linked sources |
| Linear production bug | Datadog monitor/query/trace/log links, incident.io incident if any, status incident if any | Linear bug evidence plus event row sources |
| Alert disposition | monitor ID/title, env, reason, verdict, owner/team if visible | Datadog table row with Linked Event set to disposition |
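The "Should Link To" column works as a checklist. The sketch below is hypothetical: the link keys (`status_page`, `datadog`, and so on) and the required/optional split are assumptions transcribed from the table, with "if any" and "if visible" entries treated as optional. It calls no Datadog, Linear, or incident.io API.

```python
from __future__ import annotations

# Hypothetical checklist transcribed from the "Should Link To" column.
# True = always expected; False = expected only when the related object exists.
EXPECTED_LINKS: dict[str, dict[str, bool]] = {
    "incident.io incident": {"status_page": True, "datadog": True, "linear_followup": True},
    "linear production bug": {"datadog": True, "incident_io": False, "status_incident": False},
    "alert disposition": {"monitor": True, "env": True, "reason": True, "verdict": True, "owner": False},
}


def missing_links(
    kind: str,
    present: set[str],
    related: frozenset[str] | set[str] = frozenset(),
) -> list[str]:
    """Return expected link keys absent from `present`, sorted for stable output.

    Optional keys are reported only when `related` says the related object exists.
    """
    missing = []
    for key, always in EXPECTED_LINKS[kind].items():
        if key not in present and (always or key in related):
            missing.append(key)
    return sorted(missing)
```

For example, a Linear production bug with only a Datadog link is complete unless an incident.io incident or status incident is known to exist.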
For a healthy review, each real production event should satisfy one of:

- Canonical event = incident.io incident
- OR canonical event = Linear production bug
- OR canonical event = explicit alert disposition
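A minimal sketch of checking this rule, assuming each event row records its canonical object in two hypothetical fields, `canonical_kind` and `canonical_ref`. Storing a single kind per row is what makes "exactly one" hold by construction.

```python
from __future__ import annotations

from dataclasses import dataclass

# The three canonical object kinds allowed by the rule above.
CANONICAL_KINDS = {"incident.io incident", "Linear production bug", "alert disposition"}


@dataclass
class EventRow:
    name: str
    canonical_kind: str | None = None  # a single field, so an event cannot have two canonical objects
    canonical_ref: str | None = None   # INC reference, Linear issue key, or disposition verdict


def unhealthy_events(rows: list[EventRow]) -> list[str]:
    """Return names of events that lack a valid canonical object."""
    return [
        row.name
        for row in rows
        if row.canonical_kind not in CANONICAL_KINDS or not row.canonical_ref
    ]
```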
When proposing or later creating links, use short stable titles:
- `Datadog monitor: <monitor name>`
- `Datadog logs: <env/service/symptom>`
- `Datadog spans: <env/route/symptom>`
- `Datadog trace: <trace id or route>`
- `Status incident: <status title>`
- `incident.io: <INC reference>`
- `Linear follow-up: <issue key>`

Do not write any of these links unless the user explicitly asks for changes after reviewing the report.
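The title prefixes are fixed, so proposed titles can be generated rather than typed. This is a hypothetical helper for drafting titles in the report only; the short `kind` keys are assumptions, and nothing here writes links anywhere.

```python
from __future__ import annotations

# Prefixes copied from the list above; the short keys are hypothetical.
LINK_PREFIXES = {
    "monitor": "Datadog monitor",
    "logs": "Datadog logs",
    "spans": "Datadog spans",
    "trace": "Datadog trace",
    "status": "Status incident",
    "incident": "incident.io",
    "linear": "Linear follow-up",
}


def link_title(kind: str, detail: str) -> str:
    """Build a short, stable title such as 'Datadog logs: prod-eu/api 5xx'."""
    detail = " ".join(detail.split())  # collapse whitespace so the same link keeps the same title
    if not detail:
        raise ValueError("link detail must not be empty")
    return f"{LINK_PREFIXES[kind]}: {detail}"
```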
Start from all Linear tickets with the bug label that were touched during the review window. Do not rely only on text searches for prod, incident, or Datadog; those searches are useful for enrichment but do not define the set of tickets to review.
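A sketch of the scope rule over already-exported ticket data, assuming a hypothetical `Ticket` record (this is not the Linear API). The window is half-open, `[start, end)`, and the precedence among reasons is an assumption: completion also bumps the update time, so "completed" is checked before "updated".

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime


@dataclass
class Ticket:
    key: str
    labels: set[str]
    created: datetime
    updated: datetime
    completed: datetime | None
    is_open: bool
    has_prod_evidence: bool


def touched_because(t: Ticket, start: datetime, end: datetime) -> str | None:
    """Return why a bug-labeled ticket is in scope for [start, end), or None."""
    if "bug" not in t.labels:
        return None  # text-search hits without the bug label are enrichment, not scope

    def in_window(ts: datetime | None) -> bool:
        return ts is not None and start <= ts < end

    if in_window(t.created):
        return "created"
    if in_window(t.completed):
        return "completed"
    if in_window(t.updated):
        return "updated"
    if t.is_open and t.has_prod_evidence:
        return "open production bug"
    return None
```

The returned string maps directly onto the Touched Last Week Because column.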
Use this table for the bug section:
| Linear | Title | Summary | Owner | Status | Touched Last Week Because | Production Evidence | Classification | Counted? |
|---|---|---|---|---|---|---|---|---|
Column rules:

- Linear: the issue key linked to Linear, such as LFE-123.
- Title: the Linear issue title in a separate column. Do not collapse the title into the Linear link, because reviewers need to scan IDs and titles independently.
- Summary: one operational sentence based on the issue body, comments, and evidence. Avoid guessing at fixes.
- Owner: the assignee if present; otherwise the owning team if clear; otherwise Unassigned.
- Status: the Linear status or state, plus completion timing when useful, such as Done May 18, Todo, Triage, or Canceled.
- Touched Last Week Because: created, updated, completed, or open production bug.
- Production Evidence: prod env, customer impact, status incident, Datadog link, measured logs/spans/errors, or No measurements found.
- Classification: one of production/customer-impacting, internal-only, self-hosted, staging/dev, duplicate/canceled/no-action, or unclear.
- Counted?: yes only when the bug label and production/customer-impacting evidence support including the ticket in the fixed/open production bug counts.

For headline counts, report fixed and open production bugs separately from the total number of bug-labeled tickets reviewed.
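The Counted? rule and the headline counts can be sketched as follows. `BugRow` is a hypothetical record for one table row, and treating a ticket as fixed when its status is in `done_statuses` (default `{"Done"}`) is an assumption; adjust it to the team's Linear workflow.

```python
from __future__ import annotations

from dataclasses import dataclass

CLASSIFICATIONS = {
    "production/customer-impacting", "internal-only", "self-hosted",
    "staging/dev", "duplicate/canceled/no-action", "unclear",
}


@dataclass
class BugRow:
    key: str
    labels: set[str]
    classification: str
    status: str  # Linear status such as "Done", "Todo", "Triage"


def counted(row: BugRow) -> bool:
    """Counted? is yes only for bug-labeled, production/customer-impacting tickets."""
    if row.classification not in CLASSIFICATIONS:
        raise ValueError(f"{row.key}: unknown classification {row.classification!r}")
    return "bug" in row.labels and row.classification == "production/customer-impacting"


def headline_counts(
    rows: list[BugRow], done_statuses: frozenset[str] = frozenset({"Done"})
) -> dict[str, int]:
    """Fixed and open production bugs, kept separate from the total reviewed."""
    production = [row for row in rows if counted(row)]
    fixed = sum(1 for row in production if row.status in done_statuses)
    return {"fixed": fixed, "open": len(production) - fixed, "reviewed": len(rows)}
```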
Use this table as the evidence layer:
| Monitor/Page Signal | Env | Count / Window | Why It Alerted | Verdict | Linked Event |
|---|---|---|---|---|---|
The Datadog table answers "what alerted or paged?" It is monitor-centric, not the primary narrative. Use these verdicts:
- customer incident
- confirmed bug
- infra/dependency
- expected/test
- monitor noise
- unknown/no measurements

Group repeated pages by monitor name or ID, environment, service/team, and trigger reason. Exclude or clearly mark SLO/burn-rate monitors, test monitors, and maintenance-window noise when the review is about actionable breakage.
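A sketch of the grouping step, assuming pages have already been exported into a hypothetical `Page` record and that SLO/burn-rate, test, and maintenance pages carry recognizable tags (the tag names below are assumptions). Excluded pages go into their own tally instead of being dropped, so they can still be marked in the table.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Page:
    monitor: str  # monitor name or ID
    env: str
    team: str     # service/team
    reason: str   # trigger reason
    tags: frozenset[str] = frozenset()


# Hypothetical tags marking SLO/burn-rate, test, and maintenance-window pages.
EXCLUDED_TAGS = frozenset({"slo", "burn-rate", "test", "maintenance"})


def group_pages(pages: list[Page]) -> tuple[Counter, Counter]:
    """Count pages per (monitor, env, team, reason), keeping excluded noise in its own tally."""
    actionable: Counter = Counter()
    excluded: Counter = Counter()
    for page in pages:
        key = (page.monitor, page.env, page.team, page.reason)
        if page.tags & EXCLUDED_TAGS:
            excluded[key] += 1
        else:
            actionable[key] += 1
    return actionable, excluded
```

Each actionable key becomes one row, with its count in the Count / Window column.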
Linked Event should be the canonical incident.io reference, Linear issue key, or explicit disposition. Do not leave a real page's Linked Event as `none` unless the next action is to classify the alert.
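A sketch of a lint pass over drafted Datadog rows, assuming each row is a plain dict with hypothetical keys (`monitor`, `verdict`, `linked_event`, `next_action`) and that the literal next action `"classify alert"` marks the one allowed exception.

```python
from __future__ import annotations

VERDICTS = {
    "customer incident", "confirmed bug", "infra/dependency",
    "expected/test", "monitor noise", "unknown/no measurements",
}


def linked_event_problems(rows: list[dict[str, str]]) -> list[str]:
    """Flag rows with an unknown verdict, or an empty Linked Event without a classify action."""
    problems = []
    for row in rows:
        monitor, verdict = row["monitor"], row["verdict"]
        linked = row.get("linked_event", "").strip().lower()
        if verdict not in VERDICTS:
            problems.append(f"{monitor}: unknown verdict {verdict!r}")
        elif linked in {"", "none"} and row.get("next_action") != "classify alert":
            problems.append(f"{monitor}: Linked Event is none; link an event or set a disposition")
    return problems
```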
Use this as the main engineering narrative:
| Event | Impact | Sources | State | Owner / Team | Next Action |
|---|---|---|---|---|---|
The event-centric view answers "what actually broke?" Combine related status incidents, Datadog pages, Linear bugs, and follow-ups into one row when the evidence supports it. If correlation is inferential, say so.
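One way to combine sources is union-find over explicit references only, such as a Linear issue that names an incident.io incident or a status incident that links a monitor. This is a sketch with hypothetical source IDs; any grouping based on timing or similar symptoms is inferential and should be added by hand and labeled as such in the row.

```python
from __future__ import annotations


def merge_events(sources: list[str], explicit_links: list[tuple[str, str]]) -> list[set[str]]:
    """Group source IDs into candidate events with union-find over explicit links."""
    parent = {source: source for source in sources}

    def find(x: str) -> str:
        # Path halving keeps the trees shallow.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in explicit_links:
        parent[find(a)] = find(b)

    groups: dict[str, set[str]] = {}
    for source in sources:
        groups.setdefault(find(source), set()).add(source)
    # Sort for deterministic output.
    return sorted(groups.values(), key=sorted)
```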
Good event rows set State to one of: fixed, mitigated, open, monitoring, noise, or unknown.

Use this section for public status-page incidents and accepted incident.io incidents:
| Incident | Severity / Status | Start / End / Duration | Impact | Linked Sources |
|---|---|---|---|---|
Lead each incident summary with its reference or URL. Preserve uncertainty when status-page timezone, severity, linked alerts, or Linear follow-ups are missing.
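A sketch of filling the Start / End / Duration cell while preserving timezone uncertainty, assuming timestamps arrive as ISO-8601 strings (the helper name is hypothetical). Naive timestamps still get a duration, but it is flagged; mixed naive/aware pairs are not subtracted at all, since Python raises `TypeError` on that subtraction.

```python
from __future__ import annotations

from datetime import datetime


def start_end_duration(start: str, end: str | None) -> str:
    """Render 'start / end / duration' from ISO-8601 timestamps, flagging unknown timezones."""
    s = datetime.fromisoformat(start)
    if end is None:
        return f"{start} / ongoing / unknown"
    e = datetime.fromisoformat(end)
    if (s.tzinfo is None) != (e.tzinfo is None):
        return f"{start} / {end} / unknown (timezone mismatch)"
    if e < s:
        raise ValueError("end is before start")
    minutes = int((e - s).total_seconds()) // 60
    duration = f"{minutes // 60}h {minutes % 60}m"
    if s.tzinfo is None:
        duration += " (timezone unknown)"  # both naive: assumed same zone, but unverified
    return f"{start} / {end} / {duration}"
```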
Start the final report with:
- Headline counts: fixed and open production bugs, reported separately from the total number of bug-labeled Linear tickets reviewed.

Then present sections in this order: