docs/design/adaptive-output-token-escalation/adaptive-output-token-escalation-design.md
Defaults to the model's declared output limit unless the user or environment configures
max_tokens, then uses escalation and multi-turn recovery only when a response still hits MAX_TOKENS.
Every API request reserves a fixed GPU slot proportional to max_tokens. A low default can reduce slot reservation, but it also makes normal large responses more likely to truncate. For file-writing workflows, that can produce incomplete tool-call arguments and force the scheduler to reject the partial write.
Use the model's declared output limit by default. When a response is truncated (the model hits max_tokens), escalate once, then attempt multi-turn recovery, and finally fall back to the tool scheduler, following the layered flow shown in the diagram.
This favors correctness for large generation and file-edit tasks. Operators who need a lower reservation can still set QWEN_CODE_MAX_OUTPUT_TOKENS, and that explicit value is respected.
Request (max_tokens = user/env value or model output limit)
│
▼
┌─────────────────────────┐
│ Response truncated? │──── No ──▶ Done ✓
│ (MAX_TOKENS) │
└───────────┬──────────────┘
│ Yes
▼
┌──────────────────────────────────────────────────┐
│ Layer 1: Escalate to model output limit │
│ ┌────────────────────────────────────────────┐ │
│ │ Pop partial response from history │ │
│ │ RETRY (isContinuation: false → reset UI) │ │
│ │ Re-send at max(64K, model output limit) │ │
│ └────────────────────────────────────────────┘ │
└───────────┬──────────────────────────────────────┘
│
▼
┌─────────────────────────┐
│ Still truncated? │──── No ──▶ Done ✓
│ (MAX_TOKENS) │
└───────────┬──────────────┘
│ Yes
▼
┌──────────────────────────────────────────────────┐
│ Layer 2: Multi-turn recovery (up to 3×) │
│ ┌────────────────────────────────────────────┐ │
│ │ Keep partial response in history │ │
│ │ Push user message: "Resume directly..." │ │
│ │ RETRY (isContinuation: true → keep UI buf) │ │
│ │ Re-send with updated history │ │
│ │ Model continues from where it left off │ │
│ └──────────────┬─────────────────────────────┘ │
│ │ │
│ ┌──────┴──────┐ │
│ │ Succeeded? │── Yes ──▶ Done ✓ │
│ └──────┬──────┘ │
│ │ No (still truncated) │
│ ▼ │
│ attempt < 3? ── Yes ──▶ loop back ↑ │
└───────────┬──────────────────────────────────────┘
│ No (exhausted)
▼
┌──────────────────────────────────────────────────┐
│ Layer 3: Tool scheduler fallback │
│ ┌────────────────────────────────────────────┐ │
│ │ Reject truncated Edit/Write tool calls │ │
│ │ Return guidance: "You MUST split into │ │
│ │ smaller parts — write skeleton first, │ │
│ │ then edit incrementally." │ │
│ └────────────────────────────────────────────┘ │
└──────────────────────────────────────────────────┘
The effective max_tokens is resolved in the following priority order:
| Priority | Source | Value (known model) | Value (unknown model) | Escalation behavior |
|---|---|---|---|---|
| 1 (highest) | User config (samplingParams.max_tokens) | min(userValue, modelLimit) | userValue | No escalation |
| 2 | Environment variable (QWEN_CODE_MAX_OUTPUT_TOKENS) | min(envValue, modelLimit) | envValue | No escalation |
| 3 (lowest) | Model/default output limit | modelLimit | DEFAULT_OUTPUT_TOKEN_LIMIT = 32K | Escalates to model limit (64K floor) + recovery |
A "known model" is one that has an explicit entry in OUTPUT_PATTERNS (checked via hasExplicitOutputLimit()). For known models, the effective value is always capped at the model's declared output limit to avoid API errors. Unknown models (custom deployments, self-hosted endpoints) pass the user's value through directly, since the backend may support larger limits.
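The priority order in the table can be sketched as a single resolution function. This is a minimal illustration, not the project's implementation: `resolveMaxTokens`, `ResolveInput`, and `ResolvedLimit` are hypothetical names, an unknown model is represented by `modelLimit` being `undefined`, and the numeric value used for the 32K default is an assumption.

```typescript
// Hypothetical sketch of the priority table; names are illustrative only.
const DEFAULT_OUTPUT_TOKEN_LIMIT = 32_000; // "32K" in the table; exact value assumed

interface ResolveInput {
  userMaxTokens?: number; // samplingParams.max_tokens, if configured
  envMaxTokens?: number; // parsed QWEN_CODE_MAX_OUTPUT_TOKENS, if set
  modelLimit?: number; // declared output limit; undefined for unknown models
}

interface ResolvedLimit {
  maxTokens: number;
  canEscalate: boolean; // only the model/default source escalates
}

function resolveMaxTokens(input: ResolveInput): ResolvedLimit {
  const { userMaxTokens, envMaxTokens, modelLimit } = input;

  // Priority 1 (user config) wins over priority 2 (environment variable).
  const explicit = userMaxTokens ?? envMaxTokens;
  if (explicit !== undefined) {
    // Known models are capped at the declared limit; unknown models pass through.
    const maxTokens =
      modelLimit !== undefined ? Math.min(explicit, modelLimit) : explicit;
    return { maxTokens, canEscalate: false };
  }

  // Priority 3: model limit, or the default for unknown models.
  return {
    maxTokens: modelLimit ?? DEFAULT_OUTPUT_TOKEN_LIMIT,
    canEscalate: true,
  };
}
```

Returning `canEscalate` alongside the value keeps the "No escalation" column of the table in the same place as the limit itself, so a caller cannot accidentally escalate past an explicit user or environment setting.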
This logic is implemented in three content generators:
- DefaultOpenAICompatibleProvider.applyOutputTokenLimit(): OpenAI-compatible providers
- DashScopeProvider: inherits applyOutputTokenLimit() from the default provider
- AnthropicContentGenerator.buildSamplingParameters(): Anthropic provider

The escalation logic lives in geminiChat.ts, placed outside the main retry loop. This is intentional: escalation is triggered by a stream that completes successfully but is truncated, not by an error. The sequence is:
1. Stream completes successfully (lastError === null)
2. Last chunk has finishReason === MAX_TOKENS
3. Guard checks pass:
- maxTokensEscalated === false (prevent infinite escalation)
- hasUserMaxTokensOverride === false (respect user intent)
4. Compute escalated limit: max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output'))
5. Pop the partial model response from chat history
6. Yield RETRY event (isContinuation: false) → UI discards partial output and resets buffers
7. Re-send the same request with maxOutputTokens: escalatedLimit
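The guard checks and history handling in the steps above might look roughly like the following. This is a sketch under stated assumptions, not the code in geminiChat.ts: `EscalationState`, `Message`, `RetryEvent`, and `planEscalation` are hypothetical, and the caller is assumed to re-send the request with the returned `maxOutputTokens`.

```typescript
// Hypothetical sketch of the escalation guard; types and names are illustrative.
const ESCALATED_MAX_TOKENS = 64_000;

type Message = { role: 'user' | 'model'; text: string };
type RetryEvent = { type: 'retry'; isContinuation: boolean };

interface EscalationState {
  lastError: Error | null;
  finishReason: string | undefined;
  maxTokensEscalated: boolean;
  hasUserMaxTokensOverride: boolean;
}

function planEscalation(
  history: Message[],
  state: EscalationState,
  modelOutputLimit: number,
): { event: RetryEvent; maxOutputTokens: number } | null {
  // Steps 1-3: successful stream, truncated, and both guards pass.
  const eligible =
    state.lastError === null &&
    state.finishReason === 'MAX_TOKENS' &&
    !state.maxTokensEscalated &&
    !state.hasUserMaxTokensOverride;
  if (!eligible) return null;

  // Step 4: the escalated limit never drops below the 64K floor.
  const maxOutputTokens = Math.max(ESCALATED_MAX_TOKENS, modelOutputLimit);

  // Step 5: drop the partial model response so the retry starts clean.
  const last = history[history.length - 1];
  if (last && last.role === 'model') history.pop();

  // Mark as escalated so a second truncation cannot escalate again.
  state.maxTokensEscalated = true;

  // Step 6: isContinuation false tells the UI to discard partial output.
  return { event: { type: 'retry', isContinuation: false }, maxOutputTokens };
}
```

Setting `maxTokensEscalated` inside the same function that approves the escalation keeps the "at most once" guarantee local, which is what prevents infinite escalation.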
If the escalated response is also truncated (finishReason === MAX_TOKENS), the recovery loop runs up to MAX_OUTPUT_RECOVERY_ATTEMPTS (3) times:
1. Partial model response is already in history (pushed by processStreamResponse)
2. Push a recovery user message: OUTPUT_RECOVERY_MESSAGE
3. Yield RETRY event (isContinuation: true) → UI keeps text buffer for continuation
4. Re-send with updated history (model sees its partial output + recovery instruction)
5. If still truncated and attempts remain, loop back to step 1
6. If a recovery attempt throws (empty response, network error):
- Pop the dangling recovery message from history
- Break out of recovery loop
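The recovery loop above can be illustrated with a simplified sketch. Several things here are assumptions: `runRecovery` and its types are hypothetical, the `send` callback is synchronous to keep the example short (the real path is a stream), the placeholder text for OUTPUT_RECOVERY_MESSAGE is not the actual message, and pushing the model reply inside the loop stands in for what processStreamResponse does.

```typescript
// Hypothetical, simplified sketch of the multi-turn recovery loop.
const MAX_OUTPUT_RECOVERY_ATTEMPTS = 3;
const OUTPUT_RECOVERY_MESSAGE = 'Resume directly...'; // placeholder, real text not shown in this doc

type Message = { role: 'user' | 'model'; text: string };
type SendResult = { text: string; finishReason: string };

function runRecovery(
  history: Message[], // already ends with the truncated model response
  send: (history: Message[]) => SendResult,
  onRetry: (isContinuation: boolean) => void,
): { attempts: number; recovered: boolean } {
  let attempts = 0;
  let truncated = true; // entered only after a truncated escalated response

  while (truncated && attempts < MAX_OUTPUT_RECOVERY_ATTEMPTS) {
    attempts += 1;

    // Step 2: ask the model to continue from where it stopped.
    history.push({ role: 'user', text: OUTPUT_RECOVERY_MESSAGE });

    // Step 3: isContinuation true tells the UI to keep its text buffer.
    onRetry(true);

    try {
      // Step 4: the model sees its partial output plus the instruction.
      const result = send(history);
      history.push({ role: 'model', text: result.text });
      truncated = result.finishReason === 'MAX_TOKENS';
    } catch {
      // Step 6: remove the dangling recovery message and stop.
      history.pop();
      break;
    }
  }

  return { attempts, recovered: !truncated };
}
```

Popping the recovery message on failure matters because a trailing user message with no model reply would otherwise be sent as part of the next real request.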
When the Turn class receives a RETRY event, it clears accumulated state to prevent inconsistencies:
- pendingToolCalls: cleared to avoid duplicate tool calls if the first truncated response contained completed tool calls that are repeated in the escalated response
- pendingCitations: cleared to avoid duplicate citations
- finishReason: reset to undefined so the new response's finish reason is used

The isContinuation flag is passed through to the UI so it can decide whether to reset text buffers (escalation) or keep them (recovery).
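A minimal sketch of that reset is shown here. The class `TurnStateSketch`, its field types, and the `handleRetry` return value are assumptions made for illustration; only the three field names and the isContinuation flag come from the description above.

```typescript
// Hypothetical sketch of the state reset on a RETRY event.
type RetryEvent = { type: 'retry'; isContinuation: boolean };

class TurnStateSketch {
  pendingToolCalls: string[] = []; // element type assumed for illustration
  pendingCitations: Set<string> = new Set();
  finishReason: string | undefined = undefined;

  handleRetry(event: RetryEvent): { resetTextBuffer: boolean } {
    // Clear accumulated state so the retried response is not merged
    // with leftovers from the truncated one.
    this.pendingToolCalls = [];
    this.pendingCitations.clear();
    this.finishReason = undefined;

    // Escalation (isContinuation false) resets the UI text buffer;
    // recovery (isContinuation true) keeps it.
    return { resetTextBuffer: !event.isContinuation };
  }
}
```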
Defined in geminiChat.ts and tokenLimits.ts:
| Constant | Value | Purpose |
|---|---|---|
| ESCALATED_MAX_TOKENS | 64,000 | Floor for escalation when the model limit is low |
| MAX_OUTPUT_RECOVERY_ATTEMPTS | 3 | Max multi-turn recovery attempts after escalation |
The effective escalated limit is max(ESCALATED_MAX_TOKENS, tokenLimit(model, 'output')):
| Model | Escalated limit |
|---|---|
| Claude Opus 4.6 | 131,072 (128K) |
| GPT-5 / o-series | 131,072 (128K) |
| Qwen3.x | 65,536 (64K) |
| Unknown models | 64,000 (floor) |
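The rows of this table follow directly from the max() formula. The sketch below reproduces them with a stand-in lookup: `tokenLimitSketch` is a hypothetical substitute for tokenLimit(model, 'output'), the model keys are made up for illustration, and the numeric value of the 32K default is an assumption.

```typescript
// Hypothetical worked example of the escalated-limit formula.
const ESCALATED_MAX_TOKENS = 64_000;
const DEFAULT_OUTPUT_TOKEN_LIMIT = 32_000; // "32K"; exact value assumed

// Illustrative keys; the limits are the values listed in the table.
const declaredOutputLimits = new Map<string, number>([
  ['claude-opus-4.6', 131_072],
  ['qwen3.x', 65_536],
]);

// Stand-in for tokenLimit(model, 'output').
function tokenLimitSketch(model: string): number {
  return declaredOutputLimits.get(model) ?? DEFAULT_OUTPUT_TOKEN_LIMIT;
}

function escalatedLimit(model: string): number {
  return Math.max(ESCALATED_MAX_TOKENS, tokenLimitSketch(model));
}

console.log(escalatedLimit('claude-opus-4.6')); // 131072
console.log(escalatedLimit('qwen3.x')); // 65536
console.log(escalatedLimit('my-self-hosted-model')); // 64000
```

For known models the declared limit already exceeds the floor, so the floor only changes the outcome for models that fall back to the default.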
- … max_tokens, so a lower value over-reserves less).
- … tengu_otk_slot_v1) that defaults to off for third-party providers ("not validated on Bedrock/Vertex"), i.e. its default behavior for non-first-party serving is exactly "use the model's declared limit." qwen-code's providers are all third-party / OpenAI-compatible / self-hosted, so matching that default-off behavior is the safe choice; assuming the low default is safe for every backend is not.
- … QWEN_CODE_MAX_OUTPUT_TOKENS (e.g. 8000) to restore the lower per-request reservation. A GrowthBook-style feature flag is intentionally not reintroduced: qwen-code has no such infrastructure, and the env var already covers the need.
- ESCALATED_MAX_TOKENS (64K) serves as a floor for unknown models where tokenLimit() returns the default 32K.