.agents/skills/metrics-instrumentation/SKILL.md
Normative spec for the backend half of operational observability: per-stage OpenTelemetry metrics in apps/opik-backend. The metrics are designed so a flow-ordered Grafana dashboard can read top-to-bottom — the failing stage is the one whose numbers break — with a per-workspace drill. Building that dashboard (layout, query contracts, per-customer name resolution, dependency panels, validation) is a separate concern, specified by the dashboard-authoring skill in comet monitoring tooling; this skill covers only what to emit.
Patterns applied in this implementation (online scoring is the worked example):
sampler_decisions_total{decision}) and enqueue-to-Redis (enqueue_total{result}). A producer that emits nothing makes downstream starvation explainable rather than mysterious.processing_time, plus per-Redis-op read/claim/ack_and_remove/list_pending_time).http_server_request_duration_seconds, with 5xx broken down by endpoint × error_type × workspace, so an ingestion problem is never mistaken for a scoring problem.error_type (the exception class) and, for shared async plumbing, by component (listener/subscriber): processing_errors_total, unexpected_errors_total, and enqueue_total{result="error"} (a push failure = real loss).queue_delay (enqueue→pickup) is kept separate from processing_time (scorer/LLM work) so a backlog is distinguishable from a slow scorer; end-to-end = queue_delay + processing_time.system.query_log, so a "slow pipeline" resolves to the dependency causing it.To see these conventions already in practice, grep apps/opik-backend for the existing metric families rather than specific classes (metric names are the stable contract; class locations move): the online-scoring *_sampler_decisions_total, *_enqueue_total{result}, *_processing_time_milliseconds, *_queue_delay_milliseconds, *_processing_errors_total{error_type} / *_unexpected_errors_total, the per-Redis-op *_{read,claim,ack_and_remove,list_pending}_time_milliseconds, and the attachment upload byte/size families.
Apply alongside the opik-backend skill (general conventions, logging rule).
1.1 Decompose the workflow into ordered stages, each defined by: input, output, failure mode, existing instrumentation.
1.2 Choose the instrument by the question it answers. Most stages need a counter (did it happen, and how often?) and a histogram (how long did it take?); add a gauge only for a level you cannot derive from those.
_total suffix. MUST NOT be used for a value that can decrease.histogram_quantile. Use whenever a p95/p99 matters, not just an average: processing time, queue delay, per-dependency op time, end-to-end latency, and payload size (bytes/chars). Prefer native (no explicit buckets, §2.1) over classic le-bucketed histograms; reach for classic buckets only when a downstream consumer (e.g. an existing exporter) forces them. Do NOT use a histogram where a counter suffices — a plain success/error tally needs no distribution.Rule of thumb: "how many happened" → counter; "how long / how big" → histogram; "how much is there right now" → gauge (or UpDownCounter). Cover each stage with RED (Rate, Errors, Duration) for the work flowing through it and USE (Utilization, Saturation, Errors) for the resource it runs on.
1.3 Define identity dimensions, each bounded in cardinality:
workspace_id and workspace_name — the customer drill. Always paired; workspace_name falls back to workspace_id when the name is absent (§2.3).evaluator_type, decision, the Redis op, content/mime type). Lets the dashboard sum by(...) per stage (§2.2).result ∈ {success, error} on outcome counters.error_type on every error counter — the exception class / failure category — plus a component label (listener/subscriber/endpoint) where one counter serves many call sites, so the outcome panel breaks errors down by both cause and origin.Cardinality MUST stay bounded by #workspaces × #types × #error_types; an unbounded value (trace id, user input, raw message, full URL) MUST NOT be placed on a label.
2.1 Meters MUST be created via the OTel API (GlobalOpenTelemetry.getMeter(namespace)), one namespace per workflow:
private static final String METRIC_NAMESPACE = "<workflow>";
var meter = GlobalOpenTelemetry.getMeter(METRIC_NAMESPACE);
meter.counterBuilder("%s_<stage>_total".formatted(METRIC_NAMESPACE)).setDescription("…").build();
meter.histogramBuilder("%s_<stage>_time".formatted(METRIC_NAMESPACE)).setUnit("ms").ofLongs().build(); // native histogram
meter.gaugeBuilder("%s_<stage>_size".formatted(METRIC_NAMESPACE)).build();
meter.upDownCounterBuilder("%s_<stage>_in_flight".formatted(METRIC_NAMESPACE)).build(); // signed level (inc/dec)
Counters surface in Prometheus with a _total suffix. Native histograms MUST NOT define explicit buckets (no _bucket/_sum/_count/le).
2.2 The stage/type dimension MUST be a label, not part of the metric name (e.g. online_scoring_enqueue_total{evaluator_type, result}), so the dashboard can sum by(evaluator_type)(rate(...)). Name-encoding the dimension is permitted ONLY to extend an existing metric family; it complicates dashboard aggregation (a name-encoded dimension cannot be rate()d across names in one selector), so prefer a label.
2.3 workspace_id and workspace_name MUST be read from the reactive request context (the workspace-id / workspace-name context keys), reusing the shared workspace attribute-key constants rather than redeclaring them per call site:
var workspaceId = ctx.getOrDefault(WORKSPACE_ID, "");
var workspaceName = StringUtils.defaultIfBlank(ctx.getOrDefault(WORKSPACE_NAME, workspaceId), workspaceId);
counter.add(1, Attributes.of(TYPE_KEY, type, WORKSPACE_ID_KEY, workspaceId, WORKSPACE_NAME_KEY, workspaceName, RESULT_KEY, "success"));
workspace_name MUST fall back to workspace_id when the name is absent.workspaceName field and populate it from the workspace-name context key at the publish site (as the entity-created events feeding this workflow already do).2.4 The instrumented operation MUST be reactive and read the context with deferContextual:
public Mono<Void> enqueue(List<?> messages, Type type) {
return Flux.deferContextual(ctx -> { /* resolve workspace per 2.3 */
return Flux.fromIterable(messages).flatMap(m -> redisAdd(m)
.doOnNext(id -> counter.add(1, successAttrs))
.doOnError(e -> { counter.add(1, errorAttrs); log.error("Error … id='{}'", id, e); }));
}).then().subscribeOn(Schedulers.boundedElastic());
}
Mono<Void> and MUST NOT self-subscribe. Callers MUST subscribe or compose it..then(...) / flatMap) so it inherits the workspace context..contextWrite(ctx -> ctx.put(WORKSPACE_ID, id).put(WORKSPACE_NAME, defaultIfBlank(name, id))) and an error-logging consumer (one helper, not duplicated per caller).findById) MUST run via Mono.fromCallable(...).subscribeOn(Schedulers.boundedElastic()). The enqueue/IO work MUST run on a bounded scheduler, not the caller's (e.g. EventBus) thread. Where a caller already holds a resolved object, provide an overload that skips the lookup.2.5 Log statements MUST single-quote placeholders (evaluator='{}' workspaceId='{}') per .agents/skills/opik-backend/SKILL.md, and SHOULD include the batch size on enqueue logs.
2.6 mvn -o compile MUST succeed (spotless clean) before delivery.
3.1 Introducing a Mono<Void> (lazy) return breaks tests that called the method for its side effect. Restore green by stubbing the reactive method (lenient().when(pub.enqueue(any(),any())).thenReturn(Mono.empty())) where production composes it, .block()-ing the returned Mono in unit tests that assert downstream effects, and replacing verifyNoInteractions(mock) with verify(mock, never()).method(...) where a lenient stub now exists.
4.1 A returned Mono is inert until subscribed; an unsubscribed enqueue is a silent no-op. Every call site MUST compose or subscribe it.
4.2 The workspace MUST be sourced from the reactive context, not a name-service lookup, where the context carries it (§2.3).
4.3 IO/enqueue work MUST run on a bounded scheduler, off the caller's thread (§2.4).
4.4 A backpressure / poll-tick counter (e.g. backpressure_drops_total) counts skipped scheduler ticks while a consumer is busy; it is NOT lost work and MUST NOT be alerted on alone. Emit it, but document it as benign for whoever builds the dashboard.
4.5 Work MUST be done against origin/main (create the worktree from it), not a possibly-stale local checkout.
5.1 The change is delivered as two PRs on their own branches <user>/<TASK>-<name>: this metrics PR (opik repo) and a companion dashboard PR built per the dashboard-authoring skill and delivered to comet monitoring.
5.2 The metrics PR (this repo) MUST contain no customer, cluster, or infrastructure identifiers, MUST follow .github/pull_request_template.md including a ## Documentation section (the PR linter fails without it), ## Issues (Resolves OPIK-XXXX), and ## AI-WATERMARK: yes with tools/model/scope/human-verification. Title [OPIK-XXXX] [BE] …. Announce per .agents/commands/comet/send-code-review-slack.md.
5.3 The dashboard panel for a new metric stays empty until this backend PR deploys — so the two PRs are independent and can land in either order.