docs/design/telemetry-llm-request-timing-design.md
Issue #3731 — Phase 4 of hierarchical session tracing. Adds time-to-first-token, request-setup duration, sampling duration, and per-attempt retry telemetry to the
qwen-code.llm_requestspan so operators can answer "why was this LLM call slow?" without guessing.Builds on Phase 1 (#4126), Phase 1.5 (#4302), Phase 2 (#4321). Independent of Phase 3 (#4410, in review) — recommended to land Phase 3 first so Phase 4's per-attempt fields aggregate cleanly under subagent subtrees.
qwen-code.llm_request spans today carry only model, prompt_id, input_tokens, output_tokens, success, error, duration_ms. Operators reading a single trace cannot tell:
duration_ms was the model thinking vs the network setup. A 12-second duration_ms could be 11s of retries followed by 1s of fast generation, or 100ms of setup followed by 12s of slow streaming — the trace doesn't say.retryWithBackoff (utils/retry.ts:285) only calls debugLogger.warn — no OTel event, no span attribute. The 4 LLM call sites that go through it (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039) have zero retry visibility in traces or metrics. ContentRetryEvent exists for content-recovery retries inside geminiChat.ts:806,830 but not for the more common rate-limit / 5xx retries.api.request.breakdown is dead code. The metric is defined at metrics.ts:242-251 with 4 ApiRequestPhase values, exported from index.ts:117, tested in metrics.test.ts:646-675 — but recordApiRequestBreakdown() has zero callers in production code. The metric infrastructure is paid for; the data flow was never connected.These gaps make qwen-code.llm_request the least informative span in the trace tree. Tool spans (#4126/#4321) and subagent spans (#4410) both surface lifecycle phases; LLM spans collapse the entire request into one opaque duration.
| Component | Location | Why we don't touch it |
|---|---|---|
| LLM request span lifecycle | session-tracing.ts startLLMRequestSpan / endLLMRequestSpan | Phase 1 (#4126) established the helpers. We extend the metadata interface, don't restructure |
| Active span propagation into provider generators | loggingContentGenerator.ts:213,287 | Phase 1 (#4126) replaced withSpan('api.*') with native helpers; the active context already reaches the stream wrapper |
ContentRetryEvent schema + consumers | types.ts:626, qwen-logger.ts:947, loggers.ts:717 | Existing event keeps its shape and downstreams; we add a sibling event class for the retryWithBackoff path |
LogToSpanProcessor log-bridge spans | log-to-span-processor.ts | ContentRetryEvent's existing bridge continues to nest under the active LLM span. Phase 4 does not change this |
ApiRequestPhase enum | metrics.ts:330-334 | Public surface (4 values). We populate 3 of the 4 from production code; leave the enum unchanged for backward compatibility |
Per-provider chunk normalization → GenerateContentResponse | loggingContentGenerator.ts:286-393 | Each provider already normalizes to Google's GenerateContentResponse shape before LoggingContentGenerator sees the stream. TTFT detection runs centrally over this normalized shape; no per-provider code |
retryWithBackoff general-purpose retry | utils/retry.ts:140 | Used by both LLM callers and non-LLM (channels/weixin/src/api.ts). We extend with an opt-in onRetry callback rather than hard-coupling to LLM telemetry |
Non-streaming generateContent | loggingContentGenerator.ts:212 | TTFT is not meaningful for non-streaming; the new fields stay undefined. Span lifecycle and existing attrs unchanged |
maxRetries=3, google-genai SDK internal retries). These happen entirely inside the third-party SDK; observing them requires disabling SDK retries and reimplementing in retryWithBackoff. Separate decision, not Phase 4.ttft_to_reasoning_ms vs ttft_to_answer_ms, but only after we know there's demand.duration_ms - ttft_ms - request_setup_ms; child span adds nothing for OTel-only backends (claude-code uses one for Perfetto only). Stored as a span attribute instead — see D6.QWEN_CODE_UNATTENDED_RETRY) event-level rate limiting. A single LLM request can produce 50+ ContentRetryEvent / ApiRetryEvent records under persistent retry. Capping emission is a follow-up — Phase 4 emits all events; if production volumes prove unbearable, add a per-span emission cap with a "+N more attempts (truncated)" summary event in a follow-up PR.TOKEN_PROCESSING breakdown phase. Enum value exists but qwen-code has no real post-stream local processing worth measuring (<10ms typical). Skipped in production callers; enum value retained for future use or for callers we don't control.ContentRetryEvent onto LLM span as span events. Same reasoning as Phase 3's subagent_execution LogRecord: existing consumers (qwen-logger RUM, future metrics) are tightly coupled to the LogRecord. Bridge-span coverage is good enough.| Source | Key takeaway |
|---|---|
claude-code (Anthropic) claude.ts:1762, 1789, 1982, 2882 | TTFT captured as Date.now() - start on message_start SSE event; start reset per retry attempt. requestSetupMs = start - startIncludingRetries. attemptStartTimes array preserved per attempt. Confirms feasibility of the approach; their TTFT semantic is "first stream event" (we diverge to "first content" — see D1) |
claude-code perfettoTracing.ts:549-671 | Renders Request Setup → Attempt N (retry) → First Token → Sampling as nested B/E pairs. Demonstrates the visual decomposition; qwen-code does the same decomposition with OTel attributes since we have no Perfetto |
claude-code sessionTracing.ts:447 | Only ttft_ms makes it onto the OTel span (not requestSetupMs, not samplingMs, not per-attempt timing). We deliberately put more on the span — claude-code has Perfetto for visualization; we don't |
opencode (sst/opencode) session/llm.ts, route/client.ts | No TTFT measurement. Single LLM.run Effect span covers everything. Validates that the gap exists across competing tools; not a reference for what to do |
| OTel GenAI Semantic Conventions (status: Development / Experimental) | gen_ai.usage.input_tokens (Stable), gen_ai.usage.output_tokens (Stable), gen_ai.usage.cached_tokens (Experimental), gen_ai.request.model (Stable), gen_ai.server.time_to_first_token (Experimental, seconds as double). Dual-emit pattern follows #4410 precedent |
| OTel Trace Spec — Span Events | "Events SHOULD NOT be used to record information that's better captured as Span Attributes." Confirms per-attempt info belongs on the LLM span attributes + log-bridge spans, not as Span Events on the parent |
Phase 3 design doc (telemetry-subagent-spans-design.md) | Established the dual-emit pattern (qwen-code.subagent.id + gen_ai.agent.id) and the "private name is authoritative" rule. Phase 4 follows the same convention for TTFT and token fields |
TTFT measures wall-clock from the successful attempt's request dispatch to the first stream chunk that contains user-visible output. A chunk is "user-visible" if any normalized Part in candidates[0].content.parts is one of:
text with non-empty stringfunctionCall (tool use)inlineData (image, binary)executableCodethought / reasoning content (whatever the provider surfaces — Gemini's thought, Anthropic's <thinking> block, OpenAI o1 reasoning chunk)Chunks containing only role metadata or only usageMetadata (final usage-summary chunk) do not trigger TTFT.
Why not "first stream event of any kind" (claude-code's choice): claude-code measures TTFT at message_start, an Anthropic-specific metadata event that fires 50–300ms before any actual content. Their internal headlessProfiler.ts already separates time_to_first_response_ms for the "user saw something" semantic, acknowledging the distinction. qwen-code spans multiple providers (Anthropic, OpenAI, Gemini, Qwen) — picking the metadata-event semantic means TTFT for Anthropic is fundamentally different from TTFT for OpenAI (which has no analogous metadata-only first event). The user-visible-content semantic is uniform across all 4 providers and matches "time-to-first-token" literally.
Why include thought / reasoning: from the operator's perspective, reasoning chunks are still "the model produced output." Excluding them would understate TTFT for reasoning-heavy models (o1, Qwen thinking variants). Future split into ttft_to_reasoning_ms vs ttft_to_answer_ms is possible; not Phase 4.
Why include tool-call-only chunks: agent tool-decision LLM calls (one tool_use, no text) are common in qwen-code's workflow. Excluding them means TTFT is undefined for these requests. The functionCall Part is meaningful output.
Cross-product comparison note: design doc explicitly states qwen-code.ttft_ms ≈ claude-code.time_to_first_response_ms ≠ claude-code.ttft_ms. Operators comparing across products should align on the user-visible-content semantic.
LoggingContentGenerator.generateContentStreamThe first-chunk detection runs inside the existing stream wrapper at loggingContentGenerator.ts:393 (async function* processStreamGenerator). Per-call variables (start, ttftMs) live in the method's closure; never as instance fields.
Why never instance fields: LoggingContentGenerator is instantiated once per ContentGenerator (contentGenerator.ts:377) and shared across all concurrent generateContentStream calls — subagent fan-out, warmup queries, side-queries from geminiChat. An instance field would be overwritten across concurrent calls, producing nonsense TTFT for one of every two interleaved requests.
Why not AsyncLocalStorage: ALS would work but adds a context-management layer for a piece of state that doesn't need to escape the method. Method-local is simpler, zero overhead, zero risk of leakage.
// loggingContentGenerator.ts — inside generateContentStream
const attemptStart = Date.now(); // per-call local
const requestEntryTime = Date.now(); // also per-call local — see D3
let ttftMs: number | undefined;
const attemptStartTimes: number[] = [attemptStart];
let retryTotalDelayMs = 0;
let finalAttempt = 1;
// stream wrapper inspects each chunk; first one matching hasUserVisibleContent:
// ttftMs = Date.now() - attemptStart;
hasUserVisibleContent(chunk) is a small standalone helper colocated with the wrapper, exported for tests:
function hasUserVisibleContent(chunk: GenerateContentResponse): boolean {
const parts = chunk.candidates?.[0]?.content?.parts;
if (!parts?.length) return false;
return parts.some(
(p) =>
(typeof p.text === 'string' && p.text.length > 0) ||
p.functionCall !== undefined ||
p.inlineData !== undefined ||
p.executableCode !== undefined ||
// @ts-expect-error — `thought` is not on all SDK versions but providers emit it
p.thought !== undefined,
);
}
request_setup_ms computation: entry-time vs successful-attempt-startrequest_setup_ms measures wall-clock from generateContentStream/generateContent entry to the start of the successful attempt — including all failed retries, backoff sleeps, and any pre-retry preparation work.
request_setup_ms = attemptStart_of_successful_attempt - requestEntryTime;
When attempt === 1 and no retries happened, request_setup_ms is small (just SDK setup). When retries occurred, it captures the entire retry-budget overhead.
Putting it on the OTel span (diverges from claude-code, which puts it only on Perfetto): rationale at three levels:
duration_ms=12000, request_setup_ms=11500, ttft_ms=200, sampling_ms=300 → instantly diagnoses "retries ate 11.5s, model itself was fast." Computing request_setup_ms from other fields requires also exposing sampling_ms, which we do anyway (D6).input_tokens, output_tokens attributes. Backend ingest cost is not material.onRetry callback option on retryWithBackoff + new ApiRetryEventretryWithBackoff currently calls logRetryAttempt (retry.ts:343) which only writes to debugLogger.warn. We extend the RetryOptions interface with an opt-in callback:
// utils/retry.ts
interface RetryOptions<T> {
// ... existing fields ...
/**
* Optional. Called once per failed attempt, before the backoff sleep.
* Receives the attempt number (1-based), the error, and the delay before
* the next attempt. Use this to emit telemetry events for LLM call sites;
* leave undefined for non-LLM callers (e.g., channels/weixin) so they
* stay silent in LLM-specific telemetry channels.
*/
onRetry?: (info: RetryAttemptInfo) => void;
}
interface RetryAttemptInfo {
attempt: number; // 1-based, matches debugLogger output
error: unknown;
errorStatus?: number;
delayMs: number; // backoff delay before next attempt
}
The 4 LLM call sites (client.ts:1540, baseLlmClient.ts:193,282, geminiChat.ts:1039) register a callback that emits a new ApiRetryEvent:
// types.ts — new event class, sibling to ContentRetryEvent
export class ApiRetryEvent implements BaseTelemetryEvent {
'event.name': typeof EVENT_API_RETRY;
'event.timestamp': string;
model: string;
prompt_id?: string;
attempt_number: number; // 1-based
error_type: string;
error_message: string; // truncated to 256 chars
status_code?: number;
retry_delay_ms: number;
// ... duration_ms set to retry_delay_ms so LogToSpanProcessor renders
// a bridge span of meaningful width
duration_ms: number;
}
Why a new event class, not extending ContentRetryEvent:
ContentRetryEvent has 2 downstream consumers (qwen-logger, log-record export). Changing its payload risks breaking them.Why not embed callback IN retry.ts: retry.ts is called by channels/weixin/src/api.ts too (microsoft messaging API retries). Hard-coupling LLM telemetry inside retry.ts would emit ApiRetryEvent for non-LLM retries. The onRetry callback is opt-in per caller — LLM callers opt in, weixin caller doesn't.
ContentRetryEvent coexistence: ContentRetryEvent stays as-is for content-recovery retries inside geminiChat.ts:806,830. ApiRetryEvent covers the rate-limit / 5xx retries from retryWithBackoff. The two events fire from different layers and never duplicate. Existing log-bridge behavior for both events is preserved via LogToSpanProcessor — both events nest under the active LLM span automatically (Phase 1 wiring ensures the LLM span is active during retries).
Persistent retry mode (QWEN_CODE_UNATTENDED_RETRY): a single 429-loop request may emit 50+ events. Out of scope to rate-limit emission in Phase 4 — if production volumes prove unbearable, add a per-span cap with summary event in a follow-up PR. The aggregated attempt and retry_total_delay_ms on the parent LLM span (D5) remain accurate regardless of event cap.
OTel span attributes are scalars (string | number | boolean | array of these). Map-typed attributes (like retry_count_by_status: {429:2, 503:1}) require JSON serialization and are awkward to query. Skip them.
| Attribute | Type | Semantic |
|---|---|---|
attempt | int | 1-based final attempt count (attemptStartTimes.length) |
retry_total_delay_ms | int | Sum of all delayMs reported by onRetry; 0 if no retries |
ttft_ms | int | TTFT per D1; undefined for non-streaming or aborted-before-first-chunk requests |
request_setup_ms | int | Per D3 |
sampling_ms | int | Per D6 |
output_tokens_per_second | double | Derived; output_tokens / (sampling_ms / 1000); undefined when sampling_ms === 0 |
Per-attempt status-code distribution (e.g., "2 of the 3 attempts were 429s") is queryable from log-bridge spans of ApiRetryEvent records. No need to duplicate it as a flattened attribute on the parent.
Why sampling_ms and output_tokens_per_second on the span: derivable but cumbersome to compute in backend queries when summing across many spans. Same cost-benefit as request_setup_ms (D3).
recordApiRequestBreakdown() for 3 of 4 phasesIn endLLMRequestSpan (or the wrapper that calls it), after computing TTFT/setup/sampling, emit:
recordApiRequestBreakdown(config, model, [
{ phase: ApiRequestPhase.REQUEST_PREPARATION, durationMs: requestSetupMs },
{ phase: ApiRequestPhase.NETWORK_LATENCY, durationMs: ttftMs }, // ttftMs = network + first-token-generation
{ phase: ApiRequestPhase.RESPONSE_PROCESSING, durationMs: samplingMs },
]);
Why skip TOKEN_PROCESSING: qwen-code does stream chunk processing inline (consolidation happens in the wrapper at loggingContentGenerator.ts:644); the post-stream wrap-up phase is <10ms and not architecturally distinct. Filling it with a meaningless value pollutes the histogram. Leaving the enum value unused is safe — apiRequestBreakdownHistogram.record(value, {model, phase}) is just a histogram with phase as a label; missing labels are simply absent in queries.
Why not redefine NETWORK_LATENCY: the spec name is slightly misleading (it's network + first-token-generation, not pure network latency), but:
metrics.ts:330-334 which is exported from index.ts:117 and tested.Document the semantic in the design doc; leave the enum unchanged.
Why on the span path, not parallel: keeps recordApiRequestBreakdown colocated with span attribute writes — single gated emission point (see D7 idempotency), single ordering invariant.
endLLMRequestSpan idempotency: metric recording gated on existing double-end guardPhase 1.5 (#4302) established that endLLMRequestSpan may be called twice (abort path + error path collision). The existing guard at session-tracing.ts:~470 (if (!activeSpans.has(...)) return;) prevents double span.end(). Phase 4 metric recording (D6) must sit inside the same guarded block, before span.end():
// session-tracing.ts — endLLMRequestSpan
const llmCtx = activeSpans.get(spanRef);
if (!llmCtx) return; // already ended — double-end guard
activeSpans.delete(spanRef); // claim the end
// ... compute duration, set attributes ...
if (metadata) {
recordApiRequestBreakdown(config, llmCtx.attributes.model, [...]); // NEW — gated
recordTokenUsageMetrics(...); // existing
}
span.end();
This guarantees metric is recorded exactly once per LLM request, matching the span lifecycle.
Why not record in loggingContentGenerator: it doesn't see the abort path. Recording at the span lifecycle layer ensures every LLM request that opens a span produces exactly one breakdown sample, regardless of success/failure/abort.
Each Phase 4 attribute that corresponds to an OTel GenAI semconv attribute is written twice on the span:
| qwen-code private (authoritative) | GenAI semconv (compat layer) | Unit conversion | Spec status |
|---|---|---|---|
ttft_ms (ms, int) | gen_ai.server.time_to_first_token (s, double) | ttftMs / 1000 | Experimental |
input_tokens (int) | gen_ai.usage.input_tokens (int) | identical | Stable |
output_tokens (int) | gen_ai.usage.output_tokens (int) | identical | Stable |
cached_input_tokens (int) (when present) | gen_ai.usage.cached_tokens (int) | identical | Experimental |
qwen-code.model (string) | gen_ai.request.model (string) | identical | Stable |
Existing token attribute names on the LLM span (set in endLLMRequestSpan before Phase 4): qwen-code uses bare input_tokens and output_tokens already. Phase 4 adds the gen_ai.usage.* siblings to match #4410's pattern. The bare names stay; don't rename.
Fields with no GenAI semconv equivalent — request_setup_ms, sampling_ms, retry_total_delay_ms, attempt, output_tokens_per_second — are emitted only under the qwen-code namespace.
Why "private authoritative, semconv as compat":
ttft_ms etc. Treating those as canonical avoids a flag-day migration.gen_ai.server.time_to_first_token before reaching Stable. If/when it does, we update the semconv emission; the qwen-code names don't move.gen_ai.* attributes without our involvement.Why dual-emit unit conversion (ms ↔ seconds): GenAI semconv chose seconds-as-double for latency; qwen-code chose ms-as-int (matches duration_ms already on the span). Both representations have value; the conversion is cheap.
session-tracing.ts)// session-tracing.ts — LLMRequestMetadata interface extended (additive)
export interface LLMRequestMetadata {
// ... existing fields: inputTokens, outputTokens, cachedInputTokens, success, error, ...
/** Time from successful attempt start to first user-visible content chunk (ms). Undefined for non-streaming or aborted-before-first-chunk requests. */
ttftMs?: number;
/** Time from generateContent entry to start of successful attempt (ms). Includes all failed retries + backoff. */
requestSetupMs?: number;
/** Final attempt number (1-based). 1 = no retries. */
attempt?: number;
/** Sum of all backoff delays before the successful attempt (ms). */
retryTotalDelayMs?: number;
}
// No new exported helpers — Phase 4 reuses startLLMRequestSpan / endLLMRequestSpan with extended metadata.
// types.ts — new event class
export class ApiRetryEvent implements BaseTelemetryEvent {
'event.name': typeof EVENT_API_RETRY = EVENT_API_RETRY;
'event.timestamp': string;
model: string;
prompt_id?: string;
attempt_number: number;
error_type: string;
error_message: string;
status_code?: number;
retry_delay_ms: number;
duration_ms: number; // = retry_delay_ms, drives LogToSpanProcessor bridge span width
constructor(opts: { model: string; promptId?: string; attemptNumber: number; error: unknown; statusCode?: number; retryDelayMs: number }) { ... }
}
// constants.ts
export const EVENT_API_RETRY = 'qwen-code.api_retry';
// loggers.ts
export function logApiRetry(config: Config, event: ApiRetryEvent): void { ... }
// utils/retry.ts — RetryOptions extension
interface RetryOptions<T> {
// ... existing ...
onRetry?: (info: RetryAttemptInfo) => void;
}
interface RetryAttemptInfo {
attempt: number;
error: unknown;
errorStatus?: number;
delayMs: number;
}
// Inside retryWithBackoff, where logRetryAttempt is called today:
options.onRetry?.({ attempt, error, errorStatus, delayMs: actualDelay });
logRetryAttempt(attempt, error, errorStatus); // existing debugLogger call unchanged
// loggingContentGenerator.ts:283 — generateContentStream
async generateContentStream(req, userPromptId): Promise<AsyncGenerator<GenerateContentResponse>> {
const requestEntryTime = Date.now();
let attemptStart = requestEntryTime;
const attemptStartTimes: number[] = [attemptStart];
let retryTotalDelayMs = 0;
let finalAttempt = 1;
// Use existing startLLMRequestSpan (Phase 1)
// Pass onRetry callback to whatever retry layer is in use:
const onRetry: RetryAttemptInfo & { invoke: ... } = (info) => {
finalAttempt = info.attempt + 1; // we're about to start attempt N+1
retryTotalDelayMs += info.delayMs;
attemptStart = Date.now() + info.delayMs; // approximate; actual reset is at top of next attempt
attemptStartTimes.push(attemptStart);
// emit ApiRetryEvent
logApiRetry(this.config, new ApiRetryEvent({
model: req.model,
promptId: userPromptId,
attemptNumber: info.attempt,
error: info.error,
statusCode: info.errorStatus,
retryDelayMs: info.delayMs,
}));
};
// stream wrapper detects first user-visible chunk:
return this.processStreamGenerator(stream, ..., {
onFirstUserVisibleChunk: (now) => {
ttftMs = now - attemptStart;
},
});
}
At span end (already in Phase 1's endLLMRequestSpan flow), include the new fields in LLMRequestMetadata:
endLLMRequestSpan(llmSpan, {
success: true,
inputTokens,
outputTokens,
cachedInputTokens,
ttftMs,
requestSetupMs: attemptStart - requestEntryTime,
attempt: finalAttempt,
retryTotalDelayMs,
});
generateContent (loggingContentGenerator.ts:212) does not produce streaming chunks. TTFT is undefined; request_setup_ms is still meaningful (captures retry overhead). The breakdown metric records 2 phases (REQUEST_PREPARATION + RESPONSE_PROCESSING where RESPONSE_PROCESSING = duration_ms - request_setup_ms), not 3.
Each of the 4 LLM retryWithBackoff call sites adds onRetry:
// client.ts:1540 (similar at baseLlmClient.ts:193, 282, geminiChat.ts:1039)
const result = await retryWithBackoff(apiCall, {
...existingOptions,
onRetry: (info) => {
logApiRetry(
this.config,
new ApiRetryEvent({
model,
promptId: userPromptId,
attemptNumber: info.attempt,
error: info.error,
statusCode: info.errorStatus,
retryDelayMs: info.delayMs,
}),
);
// also feed back into LoggingContentGenerator's local retry accumulator
// (when in scope — for callers that don't go through LoggingContentGenerator,
// the LLM span still gets `attempt` and `retry_total_delay_ms` via the
// metadata path because endLLMRequestSpan is called at the LLM layer)
},
});
The non-LLM caller (channels/weixin/src/api.ts) does not register onRetry — no ApiRetryEvent is emitted for its retries, matching today's behavior.
LoggingContentGenerator instance is shared (one per ContentGenerator, contentGenerator.ts:377). Three concurrent generateContentStream calls (e.g., 3 subagents fan out via coreToolScheduler.runConcurrently) execute three independent closures of generateContentStream:
call_A: attemptStart_A, ttftMs_A, ... (closure)
call_B: attemptStart_B, ttftMs_B, ... (closure)
call_C: attemptStart_C, ttftMs_C, ... (closure)
Per-call locals never overlap. Stream chunks are detected against the local attemptStart of each call. Span attributes are set at each call's own endLLMRequestSpan.
AsyncLocalStorageContextManager (registered by NodeSDK at sdk.ts:273) already ensures the active OTel context — and thus the parent span passed to startLLMRequestSpan — is correct per fiber.
| File | Change | LOC est |
|---|---|---|
packages/core/src/telemetry/constants.ts | Add EVENT_API_RETRY constant | +2 |
packages/core/src/telemetry/types.ts | Add ApiRetryEvent class + union member | +40 |
packages/core/src/telemetry/loggers.ts | Add logApiRetry() function | +20 |
packages/core/src/telemetry/qwen-logger/qwen-logger.ts | Add logApiRetryEvent() for RUM downstream consistency | +20 |
packages/core/src/telemetry/session-tracing.ts | Extend LLMRequestMetadata (ttftMs, requestSetupMs, attempt, retryTotalDelayMs); extend endLLMRequestSpan to set new attrs + breakdown metric + dual-emit gen_ai.* | +60 |
packages/core/src/telemetry/metrics.ts | Wire recordApiRequestBreakdown callsite inside endLLMRequestSpan (no change to the existing recorder) | 0 |
packages/core/src/utils/retry.ts | Add onRetry?: (info: RetryAttemptInfo) => void to RetryOptions; export RetryAttemptInfo; invoke callback in the existing logRetryAttempt site | +25 |
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.ts | TTFT capture: method-local accumulators + hasUserVisibleContent helper + first-chunk detection in stream wrapper; pass new metadata to endLLMRequestSpan | +80 |
packages/core/src/core/client.ts | Wire onRetry callback at retryWithBackoff call site (client.ts:1540) | +15 |
packages/core/src/core/baseLlmClient.ts | Wire onRetry callback at 2 retryWithBackoff call sites | +25 |
packages/core/src/core/geminiChat.ts | Wire onRetry callback at retryWithBackoff call site (geminiChat.ts:1039) | +15 |
packages/core/src/telemetry/session-tracing.test.ts | endLLMRequestSpan sets ttft_ms / request_setup_ms / attempt / retry_total_delay_ms / sampling_ms / output_tokens_per_second + gen_ai dual-emit + breakdown metric (each phase) + idempotent end | +120 |
packages/core/src/core/loggingContentGenerator/loggingContentGenerator.test.ts | hasUserVisibleContent (text / functionCall / inlineData / executableCode / thought / role-only / usage-only); concurrent calls don't cross-contaminate; TTFT undefined when aborted before first chunk; TTFT undefined on non-streaming | +100 |
packages/core/src/utils/retry.test.ts | onRetry invoked per failed attempt with correct attempt, delayMs, error, errorStatus; absence of onRetry is silent (no telemetry emitted) | +50 |
packages/core/src/telemetry/loggers.test.ts | logApiRetry emits LogRecord with expected payload; bridges through LogToSpanProcessor to nested span under active LLM span | +40 |
Total: 14 files, ~610 LOC. Larger than Phase 2 (#4321) but comparable to Phase 3 (#4410) and justified by the breadth of integration (4 retry sites + telemetry plumbing + streaming wrapper).
If review pushes back on size: split into Phase 4a + 4b + 4c:
LLMRequestMetadata + dual-emit. Self-contained value (TTFT visibility from day one).onRetry callback + ApiRetryEvent + 4 caller wiring. Independently a bug fix for the retryWithBackoff telemetry gap.recordApiRequestBreakdown activation + parent span aggregation attrs (attempt, retry_total_delay_ms, sampling_ms, output_tokens_per_second). Depends on 4a + 4b.| Test | What it proves |
|---|---|
hasUserVisibleContent returns true for text/functionCall/inlineData/executableCode/thought | D1 semantics across part types |
hasUserVisibleContent returns false for role-only and usage-only chunks | D1 negative cases |
| streaming: TTFT measured from attempt start to first user-visible chunk | End-to-end TTFT detection |
| streaming: TTFT undefined if stream aborts before any user-visible chunk | Edge case |
| streaming: TTFT computed from final attempt's start (not first attempt) | D3 — TTFT reset on retry |
| non-streaming: TTFT remains undefined | S3 decision |
concurrent generateContentStream calls don't cross-contaminate TTFT | D2 — method-local guarantee |
endLLMRequestSpan sets all Phase 4 attrs (ttft_ms, request_setup_ms, sampling_ms, attempt, retry_total_delay_ms, output_tokens_per_second) | Attribute presence |
endLLMRequestSpan dual-emits gen_ai.server.time_to_first_token + gen_ai.usage.* + gen_ai.request.model | D8 dual-emit |
endLLMRequestSpan records breakdown metric with 3 phases for streaming, 2 for non-streaming | D6 |
endLLMRequestSpan called twice: metric recorded exactly once, attrs not re-set | D7 idempotency |
retryWithBackoff with onRetry: callback invoked per failed attempt with correct args | D4 callback contract |
retryWithBackoff without onRetry: no telemetry emitted (silent for non-LLM callers) | P2 — channels/weixin scope protection |
client.ts / baseLlmClient.ts / geminiChat.ts retry callsites emit ApiRetryEvent on retry | Integration of D4 at 4 sites |
ApiRetryEvent LogRecord bridges via LogToSpanProcessor to a child span under active LLM span | Trace tree correctness |
LLM span attempt field correctly reflects final attempt number under retries | D5 aggregation |
LLM span retry_total_delay_ms correctly sums onRetry delays | D5 aggregation |
output_tokens_per_second undefined when sampling_ms === 0 (no streaming) | Avoid divide-by-zero |
| Case | Handling |
|---|---|
| Stream aborts before any chunk arrives | ttftMs = undefined, sampling_ms = undefined, output_tokens_per_second = undefined. attempt, request_setup_ms still set. success = false |
| Stream aborts after first chunk | ttftMs set; sampling_ms = duration_ms - ttftMs - request_setup_ms; reflects partial response time. success = false |
| Retry succeeds on attempt 1 (no retries) | attempt = 1, retry_total_delay_ms = 0, no ApiRetryEvent emitted, breakdown metric records request_setup_ms close to 0 |
| Persistent retry mode 50+ attempts | 50+ ApiRetryEvent records emitted (out-of-scope cap deferred); LLM span attempt = 51, retry_total_delay_ms = sum of all delays. Operator sees aggregated view on span; full per-attempt detail in log-bridge spans |
Non-LLM retryWithBackoff caller (channels/weixin) | No onRetry registered; only existing debugLogger.warn fires. No ApiRetryEvent; no breakdown metric (caller isn't an LLM site) |
endLLMRequestSpan called twice (abort + error race) | Phase 1.5 guard at activeSpans.delete() returns early on second call; recordApiRequestBreakdown is inside the guard, recorded exactly once |
Anthropic message_start chunk arrives before content | hasUserVisibleContent returns false for it (no parts with text/functionCall/etc.); TTFT not triggered until subsequent content_block_delta chunk |
OpenAI first chunk with empty delta.content but role only | hasUserVisibleContent returns false; TTFT not triggered until first chunk with non-empty delta |
| Tool-call-only response (no text) | First chunk with functionCall Part triggers TTFT; output_tokens_per_second computed against tool-call token count |
| Concurrent subagents (3 calls in flight) | Each call's closure has its own attemptStart, ttftMs, attemptStartTimes. Per-call span receives its own metadata at endLLMRequestSpan. No interleaving (D2) |
SDK-level retries inside openai-sdk (maxRetries=3) | Invisible to qwen-code telemetry — happens entirely inside SDK before retryWithBackoff sees the request. attempt reflects retryWithBackoff attempts only. Out of scope (see Out-of-scope) |
gen_ai.server.time_to_first_token spec renames before reaching Stable | Single-file update: session-tracing.ts:endLLMRequestSpan. The qwen-code-native ttft_ms stays authoritative — no downstream impact |
| Subagent's LLM request | Parent is the subagent span (Phase 3). Phase 4 fields nest correctly. Aggregations grouped by qwen-code.subagent.id give per-subagent LLM perf — design-doc-future, easy follow-up |
| Reasoning model with long thought blocks | First thought Part triggers TTFT; sampling_ms includes both thinking + answer phases. Split into separate metrics deferred |
The change is additive at the OTel and metric level — every new attribute is optional, every new event is a new class. Existing dashboards that don't filter on the new fields keep working unchanged.
Behavior-affecting changes:
ApiRetryEvent LogRecord starts flowing → log volume increases proportional to retry rate (typically <1% of requests retry). Mitigate by sampling LogRecord at the SDK layer if needed.qwen-code.api.request.breakdown starts producing time series → mild Prometheus cardinality bump ({model, phase} — bounded).output_tokens_per_second derived attribute may appear unusual on dashboards filtering "all attributes" — document.Rollback path: revert the single PR (or each of 4a/4b/4c independently). All new fields use defensive defaults (undefined / 0) and don't change span structure.
qwen-code.llm_request spans regardless of whether they're under a qwen-code.subagent (Phase 3) or qwen-code.interaction (Phase 1) parent. Recommend Phase 3 land first so per-attempt aggregation under subagent subtrees works naturally.traceparent + X-Qwen-Code-Session-Id outbound propagation). They touch the HTTP layer; Phase 4 touches the stream/retry/metric layer.clearDetailedSpanState chat-compression follow-up (#4097 follow-up). Different surface.onRetry callback firing semantics: invoked before backoff sleep (current proposal) or after (when the next attempt is about to start)? Before is simpler — callback has all the info immediately; after would require capturing the just-completed delay separately. Pre-sleep is the recommendation; document in callback contract.attempt_durations_ms: number[] array? OTel supports array-of-primitive attributes. Useful for "which attempt of N was slow" diagnostics. Defer until production data shows demand — log-bridge spans already carry the equivalent.attempt > N threshold should we start sampling? N = 5 then 1-in-10? N = 10 then summary-only? Defer until we have production volume data.TOKEN_PROCESSING phase: keep enum value dormant or wire it to something (e.g., consolidation time)? Defer — wait for a real use case.ttft_ms/output_tokens/input_tokens per subagent subtree. Not Phase 4 scope but the data flow enables it.