docs/users/configuration/model-providers.md

Qwen Code allows you to configure multiple model providers through the `modelProviders` setting in your `settings.json`. This enables you to switch between different AI models and providers using the `/model` command.

Use `modelProviders` to declare curated model lists per auth type that the `/model` picker can switch between. Keys must be valid auth types (`openai`, `anthropic`, `gemini`, etc.). Each entry requires an `id` and must include `envKey`, with optional `name`, `description`, `baseUrl`, and `generationConfig`. Credentials are never persisted in settings; the runtime reads them from `process.env[envKey]`. Qwen OAuth models remain hard-coded and cannot be overridden.

> [!NOTE]
> Only the `/model` command exposes non-default auth types. Anthropic, Gemini, etc., must be defined via `modelProviders`. The `/auth` command lists Qwen OAuth, Alibaba Cloud Coding Plan, and API Key as the built-in authentication options.

> [!NOTE]
> **Model uniqueness:** Models within the same `authType` are uniquely identified by the combination of `id` + `baseUrl`. This means you can define the same model ID (e.g., `"gpt-4o"`) multiple times under a single `authType` as long as each entry has a different `baseUrl` — for example, one pointing to OpenAI directly and another to a proxy endpoint. If two entries share both the same `id` and the same `baseUrl` (or both omit `baseUrl`), the first occurrence wins and subsequent duplicates are skipped with a warning.

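As an illustration (a sketch only — the proxy URL and its environment variable are placeholders), the following catalog defines `"gpt-4o"` twice under `openai`; both entries are kept because their `baseUrl` values differ:

```jsonc
{
  "modelProviders": {
    "openai": [
      {
        "id": "gpt-4o",
        "name": "GPT-4o (OpenAI)",
        "envKey": "OPENAI_API_KEY",
        "baseUrl": "https://api.openai.com/v1"
      },
      {
        "id": "gpt-4o",
        "name": "GPT-4o (proxy)",
        "envKey": "PROXY_API_KEY", // placeholder env var
        "baseUrl": "https://proxy.example.com/v1" // placeholder proxy endpoint
      }
    ]
  }
}
```
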
Below are comprehensive configuration examples for different authentication types, showing the available parameters and their combinations.

The `modelProviders` object keys must be valid `authType` values. Currently supported auth types are:

| Auth Type | Description |
| --- | --- |
| `openai` | OpenAI-compatible APIs (OpenAI, Azure OpenAI, local inference servers like vLLM/Ollama) |
| `anthropic` | Anthropic Claude API |
| `gemini` | Google Gemini API |
| `qwen-oauth` | Qwen OAuth (hard-coded, cannot be overridden in `modelProviders`) |

> [!WARNING]
> If an invalid auth type key is used (e.g., a typo like `"openai-custom"`), the configuration will be silently skipped and the models will not appear in the `/model` picker. Always use one of the supported auth type values listed above.

Qwen Code uses the following official SDKs to send requests to each provider:

| Auth Type | SDK Package |
| --- | --- |
| `openai` | `openai` - Official OpenAI Node.js SDK |
| `anthropic` | `@anthropic-ai/sdk` - Official Anthropic SDK |
| `gemini` | `@google/genai` - Official Google GenAI SDK |
| `qwen-oauth` | `openai` with custom provider (DashScope-compatible) |

This means the `baseUrl` you configure should be compatible with the corresponding SDK's expected API format. For example, when using the `openai` auth type, the endpoint must accept OpenAI API format requests.

### OpenAI-Compatible APIs (`openai`)

This auth type supports not only OpenAI's official API but also any OpenAI-compatible endpoint, including aggregated model providers like OpenRouter.

```json
{
  "env": {
    "OPENAI_API_KEY": "sk-your-actual-openai-key-here",
    "OPENROUTER_API_KEY": "sk-or-your-actual-openrouter-key-here"
  },
  "modelProviders": {
    "openai": [
      {
        "id": "gpt-4o",
        "name": "GPT-4o",
        "envKey": "OPENAI_API_KEY",
        "baseUrl": "https://api.openai.com/v1",
        "generationConfig": {
          "timeout": 60000,
          "maxRetries": 3,
          "enableCacheControl": true,
          "contextWindowSize": 128000,
          "modalities": {
            "image": true
          },
          "customHeaders": {
            "X-Client-Request-ID": "req-123"
          },
          "extra_body": {
            "enable_thinking": true,
            "service_tier": "priority"
          },
          "samplingParams": {
            "temperature": 0.2,
            "top_p": 0.8,
            "max_tokens": 4096,
            "presence_penalty": 0.1,
            "frequency_penalty": 0.1
          }
        }
      },
      {
        "id": "gpt-4o-mini",
        "name": "GPT-4o Mini",
        "envKey": "OPENAI_API_KEY",
        "baseUrl": "https://api.openai.com/v1",
        "generationConfig": {
          "timeout": 30000,
          "samplingParams": {
            "temperature": 0.5,
            "max_tokens": 2048
          }
        }
      },
      {
        "id": "openai/gpt-4o",
        "name": "GPT-4o (via OpenRouter)",
        "envKey": "OPENROUTER_API_KEY",
        "baseUrl": "https://openrouter.ai/api/v1",
        "generationConfig": {
          "timeout": 120000,
          "maxRetries": 3,
          "samplingParams": {
            "temperature": 0.7
          }
        }
      }
    ]
  }
}
```

### Anthropic (`anthropic`)

```json
{
  "env": {
    "ANTHROPIC_API_KEY": "sk-ant-your-actual-anthropic-key-here"
  },
  "modelProviders": {
    "anthropic": [
      {
        "id": "claude-3-5-sonnet",
        "name": "Claude 3.5 Sonnet",
        "envKey": "ANTHROPIC_API_KEY",
        "baseUrl": "https://api.anthropic.com/v1",
        "generationConfig": {
          "timeout": 120000,
          "maxRetries": 3,
          "contextWindowSize": 200000,
          "samplingParams": {
            "temperature": 0.7,
            "max_tokens": 8192,
            "top_p": 0.9
          }
        }
      },
      {
        "id": "claude-3-opus",
        "name": "Claude 3 Opus",
        "envKey": "ANTHROPIC_API_KEY",
        "baseUrl": "https://api.anthropic.com/v1",
        "generationConfig": {
          "timeout": 180000,
          "samplingParams": {
            "temperature": 0.3,
            "max_tokens": 4096
          }
        }
      }
    ]
  }
}
```

### Gemini (`gemini`)

```json
{
  "env": {
    "GEMINI_API_KEY": "AIza-your-actual-gemini-key-here"
  },
  "modelProviders": {
    "gemini": [
      {
        "id": "gemini-2.0-flash",
        "name": "Gemini 2.0 Flash",
        "envKey": "GEMINI_API_KEY",
        "baseUrl": "https://generativelanguage.googleapis.com",
        "capabilities": {
          "vision": true
        },
        "generationConfig": {
          "timeout": 60000,
          "maxRetries": 2,
          "contextWindowSize": 1000000,
          "schemaCompliance": "auto",
          "samplingParams": {
            "temperature": 0.4,
            "top_p": 0.95,
            "max_tokens": 8192,
            "top_k": 40
          }
        }
      }
    ]
  }
}
```

### Local Inference Servers (OpenAI-Compatible)

Most local inference servers (vLLM, Ollama, LM Studio, etc.) provide an OpenAI-compatible API endpoint. Configure them using the `openai` auth type with a local `baseUrl`:

```json
{
  "env": {
    "OLLAMA_API_KEY": "ollama",
    "VLLM_API_KEY": "not-needed",
    "LMSTUDIO_API_KEY": "lm-studio"
  },
  "modelProviders": {
    "openai": [
      {
        "id": "qwen2.5-7b",
        "name": "Qwen2.5 7B (Ollama)",
        "envKey": "OLLAMA_API_KEY",
        "baseUrl": "http://localhost:11434/v1",
        "generationConfig": {
          "timeout": 300000,
          "maxRetries": 1,
          "contextWindowSize": 32768,
          "samplingParams": {
            "temperature": 0.7,
            "top_p": 0.9,
            "max_tokens": 4096
          }
        }
      },
      {
        "id": "llama-3.1-8b",
        "name": "Llama 3.1 8B (vLLM)",
        "envKey": "VLLM_API_KEY",
        "baseUrl": "http://localhost:8000/v1",
        "generationConfig": {
          "timeout": 120000,
          "maxRetries": 2,
          "contextWindowSize": 128000,
          "samplingParams": {
            "temperature": 0.6,
            "max_tokens": 8192
          }
        }
      },
      {
        "id": "local-model",
        "name": "Local Model (LM Studio)",
        "envKey": "LMSTUDIO_API_KEY",
        "baseUrl": "http://localhost:1234/v1",
        "generationConfig": {
          "timeout": 60000,
          "samplingParams": {
            "temperature": 0.5
          }
        }
      }
    ]
  }
}
```

For local servers that don't require authentication, you can use any placeholder value for the API key:

```bash
# For Ollama (no auth required)
export OLLAMA_API_KEY="ollama"

# For vLLM (if no auth is configured)
export VLLM_API_KEY="not-needed"
```

> [!NOTE]
> The `extra_body` parameter is only supported for OpenAI-compatible providers (`openai`, `qwen-oauth`). It is ignored for the Anthropic and Gemini providers.

> [!NOTE]
> **About `envKey`:** The `envKey` field specifies the name of an environment variable, not the actual API key value. For the configuration to work, you need to ensure the corresponding environment variable is set with your real API key. There are two ways to do this:
>
> - **Option 1: Using a `.env` file (recommended for security):**
>
>   ```bash
>   # ~/.qwen/.env (or project root)
>   OPENAI_API_KEY=sk-your-actual-key-here
>   ```
>
>   Be sure to add `.env` to your `.gitignore` to prevent accidentally committing secrets.
>
> - **Option 2: Using the `env` field in `settings.json` (as shown in the examples above):**
>
>   ```json
>   { "env": { "OPENAI_API_KEY": "sk-your-actual-key-here" } }
>   ```
>
> Each provider example includes an `env` field to illustrate how the API key should be configured.

## Alibaba Cloud Coding Plan

Alibaba Cloud Coding Plan provides a pre-configured set of Qwen models optimized for coding tasks. This feature is available for users with Alibaba Cloud Coding Plan API access and offers a simplified setup experience with automatic model configuration updates.

When you authenticate with an Alibaba Cloud Coding Plan API key using the `/auth` command, Qwen Code automatically configures the following models:

| Model ID | Name | Description |
| --- | --- | --- |
| `qwen3.5-plus` | qwen3.5-plus | Advanced model with thinking enabled |
| `qwen3-coder-plus` | qwen3-coder-plus | Optimized for coding tasks |
| `qwen3-max-2026-01-23` | qwen3-max-2026-01-23 | Latest max model with thinking enabled |

After you authenticate with the `/auth` command in Qwen Code, these models are automatically configured and added to your `/model` picker.

Alibaba Cloud Coding Plan supports two regions:

| Region | Endpoint | Description |
| --- | --- | --- |
| China | `https://coding.dashscope.aliyuncs.com/v1` | Mainland China endpoint |
| Global/International | `https://coding-intl.dashscope.aliyuncs.com/v1` | International endpoint |

The region is selected during authentication and stored in `settings.json` under `codingPlan.region`. To switch regions, re-run the `/auth` command and select a different region.

When you configure Coding Plan through the `/auth` command, the API key is stored using the reserved environment variable name `BAILIAN_CODING_PLAN_API_KEY`. By default, it is stored in the `env` field of your `settings.json` file.

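As a rough illustration (a sketch only — the exact fields written by `/auth` may differ, and the region value shown is an assumption), the resulting user `settings.json` might contain something like:

```jsonc
{
  "env": {
    // Written by /auth; consider moving it to ~/.qwen/.env instead (see the warning below).
    "BAILIAN_CODING_PLAN_API_KEY": "your-api-key-here"
  },
  "codingPlan": {
    "region": "china" // assumed value; the region is chosen during /auth
  }
}
```
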
> [!WARNING]
> **Security recommendation:** For better security, move the API key from `settings.json` to a separate `.env` file and load it as an environment variable. For example:
>
> ```bash
> # ~/.qwen/.env
> BAILIAN_CODING_PLAN_API_KEY=your-api-key-here
> ```
>
> Then ensure this file is added to your `.gitignore` if you're using project-level settings.

Coding Plan model configurations are versioned. When Qwen Code detects a newer version of the model template, you will be prompted to update; accepting the update refreshes your configured Coding Plan models to the latest template. This ensures you always have access to the latest model configurations and features without manual intervention.

If you prefer to manually configure Coding Plan models, you can add them to your `settings.json` like any OpenAI-compatible provider:

```json
{
  "modelProviders": {
    "openai": [
      {
        "id": "qwen3-coder-plus",
        "name": "qwen3-coder-plus",
        "description": "Qwen3-Coder via Alibaba Cloud Coding Plan",
        "envKey": "YOUR_CUSTOM_ENV_KEY",
        "baseUrl": "https://coding.dashscope.aliyuncs.com/v1"
      }
    ]
  }
}
```

> [!NOTE]
> When using manual configuration:
>
> - You can use any environment variable name for `envKey`
> - You do not need to configure `codingPlan.*`
> - Automatic updates will not apply to manually configured Coding Plan models

> [!WARNING]
> If you also use automatic Coding Plan configuration, automatic updates may overwrite your manual configurations if they use the same `envKey` and `baseUrl` as the automatic configuration. To avoid this, ensure your manual configuration uses a different `envKey` if possible.

## Configuration Precedence

The effective auth/model/credential values are chosen per field using the following precedence (first present wins). You can combine `--auth-type` with `--model` to point directly at a provider entry; these CLI flags are applied before the lower layers.

| Layer (highest → lowest) | authType | model | apiKey | baseUrl | apiKeyEnvKey | proxy |
| --- | --- | --- | --- | --- | --- | --- |
| Programmatic overrides | `/auth` | `/auth` input | `/auth` input | `/auth` input | — | — |
| Model provider selection | — | `modelProvider.id` | `env[modelProvider.envKey]` | `modelProvider.baseUrl` | `modelProvider.envKey` | — |
| CLI arguments | `--auth-type` | `--model` | `--openaiApiKey` (or provider-specific equivalents) | `--openaiBaseUrl` (or provider-specific equivalents) | — | — |
| Environment variables | — | Provider-specific mapping (e.g. `OPENAI_MODEL`) | Provider-specific mapping (e.g. `OPENAI_API_KEY`) | Provider-specific mapping (e.g. `OPENAI_BASE_URL`) | — | — |
| Settings (`settings.json`) | `security.auth.selectedType` | `model.name` | `security.auth.apiKey` | `security.auth.baseUrl` | — | — |
| Default / computed | Falls back to `AuthType.QWEN_OAUTH` | Built-in default (OpenAI ⇒ `qwen3-coder-plus`) | — | — | — | `Config.getProxy()` if configured |

\* When present, CLI auth flags override settings. Otherwise, `security.auth.selectedType` or the implicit default determines the auth type. Qwen OAuth and OpenAI are the only auth types surfaced without extra configuration.

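As a small illustration of the per-field layering (a sketch, not a definitive trace): with the settings below, running `qwen --model gpt-4o-mini` resolves the model to `gpt-4o-mini` because the CLI argument outranks `model.name`, while the auth type still comes from `security.auth.selectedType` since no `--auth-type` flag is passed.

```jsonc
// ~/.qwen/settings.json
{
  "security": {
    "auth": { "selectedType": "openai" } // used because no --auth-type flag is passed
  },
  "model": { "name": "gpt-4o" } // overridden by --model gpt-4o-mini on the CLI
}
```
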
> [!WARNING]
> **Deprecation of `security.auth.apiKey` and `security.auth.baseUrl`:** Directly configuring API credentials via `security.auth.apiKey` and `security.auth.baseUrl` in `settings.json` is deprecated. These settings were used in earlier versions for credentials entered through the UI, but that credential input flow was removed in version 0.10.1, and the fields will be removed entirely in a future release. Migrate to `modelProviders` for all model and credential configuration, and use `envKey` to reference environment variables instead of hardcoding credentials in settings files.

The configuration resolution follows a strict layering model with one crucial rule: the `modelProvider` layer is impermeable.

When a modelProvider model IS selected (e.g., a provider-configured model chosen via the `/model` command):

- The provider's `generationConfig` is applied atomically
- Fields defined in `modelProviders[].generationConfig` use the provider's values
- Fields the provider does not define are `undefined` (not inherited from settings)

If a model is listed in `modelProviders`, put all model-specific generation settings for that model in the matching provider entry. Top-level `model.generationConfig` values, including `contextWindowSize`, `modalities`, `customHeaders`, and `extra_body`, are ignored for provider models. Configure those fields under `modelProviders[authType][].generationConfig` for them to apply.

When NO modelProvider model is selected (e.g., using `--model` with a raw model ID, or using CLI/env/settings directly), each field of `generationConfig` is resolved independently through the following layers:

| Priority | Source | Behavior |
| --- | --- | --- |
| 1 | Programmatic overrides | Runtime `/model`, `/auth` changes |
| 2 | `modelProviders[authType][].generationConfig` | Impermeable layer - completely replaces all `generationConfig` fields; lower layers do not participate |
| 3 | `settings.model.generationConfig` | Only used for Runtime Models (when no provider model is selected) |
| 4 | Content-generator defaults | Provider-specific defaults (e.g., OpenAI vs Gemini) - only for Runtime Models |

The following fields are treated as atomic objects - provider values completely replace the entire object, and no merging occurs:

- `samplingParams` - temperature, top_p, max_tokens, etc.
- `customHeaders` - custom HTTP headers
- `extra_body` - extra request body parameters

```jsonc
// User settings (~/.qwen/settings.json)
{
  "model": {
    "generationConfig": {
      "timeout": 30000,
      "samplingParams": { "temperature": 0.5, "max_tokens": 1000 }
    }
  }
}
```

```jsonc
// modelProviders configuration
{
  "modelProviders": {
    "openai": [
      {
        "id": "gpt-4o",
        "envKey": "OPENAI_API_KEY",
        "generationConfig": {
          "timeout": 60000,
          "samplingParams": { "temperature": 0.2 }
        }
      }
    ]
  }
}
```

When `gpt-4o` is selected from `modelProviders`:

- `timeout = 60000` (from provider, overrides settings)
- `samplingParams.temperature = 0.2` (from provider, completely replaces the settings object)
- `samplingParams.max_tokens = undefined` (not defined in the provider, and the provider layer does not inherit from settings — fields not provided are explicitly set to `undefined`)

When using a raw model via `--model gpt-4` (not from `modelProviders`, which creates a Runtime Model):

- `timeout = 30000` (from settings)
- `samplingParams.temperature = 0.5` (from settings)
- `samplingParams.max_tokens = 1000` (from settings)

The merge strategy for `modelProviders` itself is REPLACE: the entire `modelProviders` object from project settings overrides the corresponding section in user settings, rather than merging the two.

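For example (a sketch with placeholder entries, assuming the project-scope file lives at `<project>/.qwen/settings.json`), if both scopes define `modelProviders`, only the project catalog takes effect; the two lists are not concatenated:

```jsonc
// ~/.qwen/settings.json (user scope)
{
  "modelProviders": {
    "openai": [
      { "id": "gpt-4o", "envKey": "OPENAI_API_KEY" },
      { "id": "gpt-4o-mini", "envKey": "OPENAI_API_KEY" }
    ]
  }
}
```

```jsonc
// <project>/.qwen/settings.json (project scope) — this object replaces the user catalog above entirely
{
  "modelProviders": {
    "openai": [
      { "id": "llama-3.1-8b", "envKey": "VLLM_API_KEY", "baseUrl": "http://localhost:8000/v1" }
    ]
  }
}
```
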
## Reasoning Configuration (`reasoning`)

The optional `reasoning` field under `generationConfig` controls how aggressively the model reasons before responding. The Anthropic and Gemini converters always honor it. The OpenAI-compatible pipeline honors it unless `generationConfig.samplingParams` is set — see the "Interaction with `samplingParams`" caveat below.

```jsonc
{
  "modelProviders": {
    "openai": [
      {
        "id": "deepseek-v4-pro",
        "name": "DeepSeek V4 Pro",
        "baseUrl": "https://api.deepseek.com/v1",
        "envKey": "DEEPSEEK_API_KEY",
        "generationConfig": {
          // The four-tier scale:
          //   'low' | 'medium' — server-mapped to 'high' on DeepSeek
          //   'high'           — default reasoning intensity
          //   'max'            — DeepSeek-specific extra-strong tier
          // Or set `false` to disable reasoning entirely.
          "reasoning": { "effort": "max" }
        }
      }
    ]
  }
}
```

| Protocol / provider | Wire shape | Notes |
| --- | --- | --- |
| OpenAI / DeepSeek (`api.deepseek.com`) | Flat `reasoning_effort: <effort>` body parameter | When `reasoning.effort` is set in the nested config shape, it's rewritten to flat `reasoning_effort` and `'low'`/`'medium'` are normalized to `'high'`, `'xhigh'` to `'max'` — mirroring DeepSeek's server-side back-compat. Top-level `samplingParams.reasoning_effort` or `extra_body.reasoning_effort` overrides skip this normalization and ship verbatim. |
| OpenAI (other compatible servers) | `reasoning: { effort, ... }` passed through verbatim | Set via `samplingParams` (e.g. `samplingParams.reasoning_effort` for GPT-5/o-series) when the provider expects a different shape. |
| Anthropic (real `api.anthropic.com`) | `output_config: { effort }` plus the `effort-2025-11-24` beta header | Real Anthropic accepts `'low'`/`'medium'`/`'high'` only. `'max'` is clamped to `'high'` with a `debugLogger.warn` line (once per generator); if you want max effort, switch the baseURL to a DeepSeek-compatible endpoint that supports it. |
| Anthropic (`api.deepseek.com/anthropic`) | Same `output_config: { effort }` + beta header | `'max'` is passed through unchanged. |
| Gemini (`@google/genai`) | `thinkingConfig: { includeThoughts: true, thinkingLevel }` | `'low'` → `LOW`, `'high'`/`'max'` → `HIGH`, others → `THINKING_LEVEL_UNSPECIFIED` (Gemini has no MAX tier). |

### `reasoning: false`

Setting `reasoning: false` (the literal boolean) explicitly disables thinking on every provider — useful for cheap side queries that don't benefit from reasoning. This is honored at the request level too via `request.config.thinkingConfig.includeThoughts: false` for one-off calls (e.g. suggestion generation).

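For instance, a provider entry might disable thinking for a cheap utility model (a sketch; the model id is just a placeholder):

```jsonc
{
  "modelProviders": {
    "openai": [
      {
        "id": "gpt-4o-mini", // placeholder model for cheap side queries
        "envKey": "OPENAI_API_KEY",
        "generationConfig": {
          "reasoning": false // literal boolean: disables thinking on every provider
        }
      }
    ]
  }
}
```
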
On an `api.deepseek.com` baseURL, the OpenAI pipeline emits the explicit `thinking: { type: 'disabled' }` field that DeepSeek V4+ requires — the server-side default is `'enabled'`, so simply omitting `reasoning_effort` would still pay thinking latency/cost. Self-hosted DeepSeek backends (sglang/vllm) and other OpenAI-compatible servers do not receive this field; if you need to disable thinking on those, inject `thinking: { type: 'disabled' }` (or whatever knob your inference framework exposes) via `samplingParams`/`extra_body`.

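For self-hosted backends, a sketch of that manual injection might look like the following (the endpoint and model id are placeholders, and the accepted knob depends on your inference framework):

```jsonc
{
  "modelProviders": {
    "openai": [
      {
        "id": "deepseek-v4", // placeholder self-hosted model id
        "envKey": "VLLM_API_KEY",
        "baseUrl": "http://localhost:8000/v1", // placeholder sglang/vllm endpoint
        "generationConfig": {
          "extra_body": {
            // Not emitted automatically for non-api.deepseek.com hosts;
            // add it here only if your backend accepts this knob.
            "thinking": { "type": "disabled" }
          }
        }
      }
    ]
  }
}
```
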
### Interaction with `samplingParams` (OpenAI-compatible only)

> [!WARNING]
> When `generationConfig.samplingParams` is set on an OpenAI-compatible provider, the pipeline ships those keys to the wire verbatim and skips the separate `reasoning` injection entirely. So a config like `{ samplingParams: { temperature: 0.5 }, reasoning: { effort: 'max' } }` will silently drop the reasoning field on OpenAI/DeepSeek requests.
>
> If you set `samplingParams`, include the reasoning knob inside it directly — for DeepSeek that's `samplingParams.reasoning_effort`, for GPT-5/o-series it's `samplingParams.reasoning_effort` (their flat field) or `samplingParams.reasoning` (the nested object). For OpenRouter and other providers the field name varies; consult the provider docs.
>
> The Anthropic and Gemini converters are unaffected — they always read `reasoning.effort` directly regardless of `samplingParams`.

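A sketch of the DeepSeek case, with the reasoning knob folded into `samplingParams` itself (the field name may differ for other providers):

```jsonc
{
  "modelProviders": {
    "openai": [
      {
        "id": "deepseek-v4-pro",
        "envKey": "DEEPSEEK_API_KEY",
        "baseUrl": "https://api.deepseek.com/v1",
        "generationConfig": {
          "samplingParams": {
            "temperature": 0.5,
            // Goes inside samplingParams; a separate "reasoning" object would be dropped here.
            "reasoning_effort": "max"
          }
        }
      }
    ]
  }
}
```
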
### `budget_tokens`

You can pin an exact thinking-token budget by including `budget_tokens` alongside `effort`:

```json
"reasoning": { "effort": "high", "budget_tokens": 50000 }
```

For Anthropic this becomes `thinking.budget_tokens`. For OpenAI/DeepSeek the field is preserved but currently ignored by the server — `reasoning_effort` is the load-bearing knob.

## Provider Models vs. Runtime Models

Qwen Code distinguishes between two types of model configurations:

- **Provider Models** - defined in the `modelProviders` configuration; they appear in the `/model` command list with full metadata (name, description, capabilities)
- **Runtime Models** - configured via CLI arguments (`--model`), environment variables, or settings, without `modelProviders`

When you configure a model without using `modelProviders`, Qwen Code automatically creates a RuntimeModelSnapshot to preserve your configuration:

```bash
# This creates a RuntimeModelSnapshot with ID: $runtime|openai|my-custom-model
qwen --auth-type openai --model my-custom-model --openaiApiKey $KEY --openaiBaseUrl https://api.example.com/v1
```

The snapshot:

- Appears in the `/model` command list as a runtime option
- Can be selected again via `/model $runtime|openai|my-custom-model`

| Aspect | Provider Model | Runtime Model |
| --- | --- | --- |
| Configuration source | `modelProviders` in settings | CLI, env, settings layers |
| Configuration atomicity | Complete, impermeable package | Layered, each field resolved independently |
| Reusability | Always available in `/model` list | Captured as snapshot, appears if complete |
| Team sharing | Yes (via committed settings) | No (user-local) |
| Credential storage | Reference via `envKey` only | May capture actual key in snapshot |

> [!IMPORTANT]
> Define `modelProviders` in the user-scope `~/.qwen/settings.json` whenever possible and avoid persisting credential overrides in any scope. Keeping the provider catalog in user settings prevents merge/override conflicts between project and user scopes and ensures `/auth` and `/model` updates always write back to a consistent scope.

- `/model` and `/auth` persist `model.name` (where applicable) and `security.auth.selectedType` to the closest writable scope that already defines `modelProviders`; otherwise they fall back to the user scope. This keeps workspace/user files in sync with the active provider catalog.
- For models that are not defined in `modelProviders`, the resolver mixes CLI/env/settings layers, creating Runtime Models. This is fine for single-provider setups but cumbersome when frequently switching. Define provider catalogs whenever multi-model workflows are common so that switches stay atomic, source-attributed, and debuggable.