LiteLLM Architecture - LiteLLM SDK + AI Gateway

This document helps contributors understand where to make changes in LiteLLM.


How It Works

The LiteLLM AI Gateway (Proxy) uses the LiteLLM SDK internally for all LLM calls:

```
OpenAI SDK (client)    ──▶  LiteLLM AI Gateway (proxy/)  ──▶  LiteLLM SDK (litellm/)  ──▶  LLM API
Anthropic SDK (client) ──▶  LiteLLM AI Gateway (proxy/)  ──▶  LiteLLM SDK (litellm/)  ──▶  LLM API
Any HTTP client        ──▶  LiteLLM AI Gateway (proxy/)  ──▶  LiteLLM SDK (litellm/)  ──▶  LLM API
```

The AI Gateway adds authentication, rate limiting, budgets, and routing on top of the SDK. The SDK handles the actual LLM provider calls, request/response transformations, and streaming.
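
Because the gateway exposes OpenAI-compatible routes, any OpenAI client can point at it. A minimal sketch, assuming a locally running gateway on port 4000 and a virtual key issued by the proxy; the URL, port, key, and model alias are placeholders, not taken from this document:

```python
# Hedged sketch: calling the AI Gateway with the stock OpenAI SDK.
# base_url, port, model alias, and the virtual key are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",   # LiteLLM AI Gateway
    api_key="sk-my-virtual-key",        # key issued/managed by the proxy
)

response = client.chat.completions.create(
    model="gpt-4o",  # resolved by the gateway's router to a configured deployment
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
```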


1. AI Gateway (Proxy) Request Flow

The AI Gateway (litellm/proxy/) wraps the SDK with authentication, rate limiting, and management features.

```mermaid
sequenceDiagram
    participant Client
    participant ProxyServer as proxy/proxy_server.py
    participant Auth as proxy/auth/user_api_key_auth.py
    participant Redis as Redis Cache
    participant Hooks as proxy/hooks/
    participant Router as router.py
    participant Main as main.py + utils.py
    participant Handler as llms/custom_httpx/llm_http_handler.py
    participant Transform as llms/{provider}/chat/transformation.py
    participant Provider as LLM Provider API
    participant CostCalc as cost_calculator.py
    participant LoggingObj as litellm_logging.py
    participant DBWriter as db/db_spend_update_writer.py
    participant Postgres as PostgreSQL

    %% Request Flow
    Client->>ProxyServer: POST /v1/chat/completions
    ProxyServer->>Auth: user_api_key_auth()
    Auth->>Redis: Check API key cache
    Redis-->>Auth: Key info + spend limits
    ProxyServer->>Hooks: max_budget_limiter, parallel_request_limiter
    Hooks->>Redis: Check/increment rate limit counters
    ProxyServer->>Router: route_request()
    Router->>Main: litellm.acompletion()
    Main->>Handler: BaseLLMHTTPHandler.completion()
    Handler->>Transform: ProviderConfig.transform_request()
    Handler->>Provider: HTTP Request
    Provider-->>Handler: Response
    Handler->>Transform: ProviderConfig.transform_response()
    Transform-->>Handler: ModelResponse
    Handler-->>Main: ModelResponse
    
    %% Cost Attribution (in utils.py wrapper)
    Main->>LoggingObj: update_response_metadata()
    LoggingObj->>CostCalc: _response_cost_calculator()
    CostCalc->>CostCalc: completion_cost(tokens × price)
    CostCalc-->>LoggingObj: response_cost
    LoggingObj-->>Main: Set response._hidden_params["response_cost"]
    Main-->>ProxyServer: ModelResponse (with cost in _hidden_params)
    
    %% Response Headers + Async Logging
    ProxyServer->>ProxyServer: Extract cost from hidden_params
    ProxyServer->>LoggingObj: async_success_handler()
    LoggingObj->>Hooks: async_log_success_event()
    Hooks->>DBWriter: update_database(response_cost)
    DBWriter->>Redis: Queue spend increment
    DBWriter->>Postgres: Batch write spend logs (async)
    ProxyServer-->>Client: ModelResponse + x-litellm-response-cost header
```

Proxy Components

```mermaid
graph TD
    subgraph "Incoming Request"
        Client["POST /v1/chat/completions"]
    end

    subgraph "proxy/proxy_server.py"
        Endpoint["chat_completion()"]
    end

    subgraph "proxy/auth/"
        Auth["user_api_key_auth()"]
    end

    subgraph "proxy/"
        PreCall["litellm_pre_call_utils.py"]
        RouteRequest["route_llm_request.py"]
    end

    subgraph "litellm/"
        Router["router.py"]
        Main["main.py"]
    end

    subgraph "Infrastructure"
        DualCache["DualCache
(in-memory + Redis)"]
        Postgres["PostgreSQL
(keys, teams, spend logs)"]
    end

    Client --> Endpoint
    Endpoint --> Auth
    Auth --> DualCache
    DualCache -.->|cache miss| Postgres
    Auth --> PreCall
    PreCall --> RouteRequest
    RouteRequest --> Router
    Router --> DualCache
    Router --> Main
    Main --> Client
```

Key proxy files:

  • proxy/proxy_server.py - Main API endpoints
  • proxy/auth/ - Authentication (API keys, JWT, OAuth2)
  • proxy/hooks/ - Proxy-level callbacks
  • router.py - Load balancing, fallbacks
  • router_strategy/ - Routing algorithms (lowest_latency.py, simple_shuffle.py, etc.)
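
To show how router.py and router_strategy/ fit together, here is a minimal sketch of a Router balancing two deployments behind one alias; the API keys and the exact routing_strategy string are assumptions, not taken from this document:

```python
# Hedged sketch: load balancing two deployments behind one model alias.
# API keys and the routing_strategy value are placeholders.
import litellm

router = litellm.Router(
    model_list=[
        {
            "model_name": "gpt-4o",  # alias that clients request
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "KEY_A"},
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "KEY_B"},
        },
    ],
    routing_strategy="simple-shuffle",  # strategies live in router_strategy/
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
)
```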

LLM-specific proxy endpoints:

| Endpoint | Directory | Purpose |
|---|---|---|
| /v1/messages | proxy/anthropic_endpoints/ | Anthropic Messages API |
| /vertex-ai/* | proxy/vertex_ai_endpoints/ | Vertex AI passthrough |
| /gemini/* | proxy/google_endpoints/ | Google AI Studio passthrough |
| /v1/images/* | proxy/image_endpoints/ | Image generation |
| /v1/batches | proxy/batches_endpoints/ | Batch processing |
| /v1/files | proxy/openai_files_endpoints/ | File uploads |
| /v1/fine_tuning | proxy/fine_tuning_endpoints/ | Fine-tuning jobs |
| /v1/rerank | proxy/rerank_endpoints/ | Reranking |
| /v1/responses | proxy/response_api_endpoints/ | OpenAI Responses API |
| /v1/vector_stores | proxy/vector_store_endpoints/ | Vector stores |
| /* (passthrough) | proxy/pass_through_endpoints/ | Direct provider passthrough |

Proxy Hooks (proxy/hooks/__init__.py):

| Hook | File | Purpose |
|---|---|---|
| max_budget_limiter | proxy/hooks/max_budget_limiter.py | Enforce budget limits |
| parallel_request_limiter | proxy/hooks/parallel_request_limiter_v3.py | Rate limiting per key/user |
| cache_control_check | proxy/hooks/cache_control_check.py | Cache validation |
| responses_id_security | proxy/hooks/responses_id_security.py | Response ID validation |
| litellm_skills | proxy/hooks/skills_injection.py | Skills injection |

To add a new proxy hook, implement CustomLogger and register in PROXY_HOOKS.
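
A minimal sketch of such a hook, assuming the CustomLogger base class exposes async_pre_call_hook and async_log_success_event with roughly these signatures; check the existing hooks in proxy/hooks/ for the authoritative interface:

```python
# Hedged sketch of a proxy hook; method names/signatures mirror existing hooks
# but should be verified against litellm.integrations.custom_logger.CustomLogger.
from litellm.integrations.custom_logger import CustomLogger


class MyRequestGuard(CustomLogger):  # hypothetical hook
    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Runs before the LLM call; raising here rejects the request.
        return data

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Runs asynchronously after a successful call (e.g. for spend/metrics).
        pass

# Per this document, register the hook in PROXY_HOOKS (proxy/hooks/__init__.py).
```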

Infrastructure Components

The AI Gateway uses external infrastructure for persistence and caching:

```mermaid
graph LR
    subgraph "AI Gateway (proxy/)"
        Proxy["proxy_server.py"]
        Auth["auth/user_api_key_auth.py"]
        DBWriter["db/db_spend_update_writer.py
DBSpendUpdateWriter"]
        InternalCache["utils.py
InternalUsageCache"]
        CostCallback["hooks/proxy_track_cost_callback.py
_ProxyDBLogger"]
        Scheduler["APScheduler
ProxyStartupEvent"]
    end

    subgraph "SDK (litellm/)"
        Router["router.py
Router.cache (DualCache)"]
        LLMCache["caching/caching_handler.py
LLMCachingHandler"]
        CacheClass["caching/caching.py
Cache"]
    end

    subgraph "Redis (caching/redis_cache.py)"
        RateLimit["Rate Limit Counters"]
        SpendQueue["Spend Increment Queue"]
        KeyCache["API Key Cache"]
        TPM_RPM["TPM/RPM Tracking"]
        Cooldowns["Deployment Cooldowns"]
        LLMResponseCache["LLM Response Cache"]
    end

    subgraph "PostgreSQL (proxy/schema.prisma)"
        Keys["LiteLLM_VerificationToken"]
        Teams["LiteLLM_TeamTable"]
        SpendLogs["LiteLLM_SpendLogs"]
        Users["LiteLLM_UserTable"]
    end

    Auth --> InternalCache
    InternalCache --> KeyCache
    InternalCache -.->|cache miss| Keys
    InternalCache --> RateLimit
    Router --> TPM_RPM
    Router --> Cooldowns
    LLMCache --> CacheClass
    CacheClass --> LLMResponseCache
    CostCallback --> DBWriter
    DBWriter --> SpendQueue
    DBWriter --> SpendLogs
    Scheduler --> SpendLogs
    Scheduler --> Keys
```

| Component | Purpose | Key Files/Classes |
|---|---|---|
| Redis | Rate limiting, API key caching, TPM/RPM tracking, cooldowns, LLM response caching, spend queuing | caching/redis_cache.py (RedisCache), caching/dual_cache.py (DualCache) |
| PostgreSQL | API keys, teams, users, spend logs | proxy/utils.py (PrismaClient), proxy/schema.prisma |
| InternalUsageCache | Proxy-level cache for rate limits + API keys (in-memory + Redis) | proxy/utils.py (InternalUsageCache) |
| Router.cache | TPM/RPM tracking, deployment cooldowns, client caching (in-memory + Redis) | router.py (Router.cache: DualCache) |
| LLMCachingHandler | SDK-level LLM response/embedding caching | caching/caching_handler.py (LLMCachingHandler), caching/caching.py (Cache) |
| DBSpendUpdateWriter | Batches spend updates to reduce DB writes | proxy/db/db_spend_update_writer.py (DBSpendUpdateWriter) |
| Cost Tracking | Calculates and logs response costs | proxy/hooks/proxy_track_cost_callback.py (_ProxyDBLogger) |
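
A minimal sketch of the DualCache read-through behavior described above, assuming DualCache is importable from litellm.caching.caching and exposes get_cache/set_cache; with no Redis configured only the in-memory layer is used:

```python
# Hedged sketch: DualCache checks in-memory first, then Redis on a miss.
# The key naming here is illustrative, not the proxy's actual key schema.
from litellm.caching.caching import DualCache

cache = DualCache()  # no RedisCache passed, so this is in-memory only

cache.set_cache(key="api_key:sk-1234", value={"spend": 0.42, "max_budget": 10.0})
key_info = cache.get_cache(key="api_key:sk-1234")
print(key_info)  # {'spend': 0.42, 'max_budget': 10.0}
```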

Background Jobs (APScheduler, initialized in ProxyStartupEvent.initialize_scheduled_background_jobs() in proxy/proxy_server.py):

| Job | Interval | Purpose | Key Files |
|---|---|---|---|
| update_spend | 60s | Batch write spend logs to PostgreSQL | proxy/db/db_spend_update_writer.py |
| reset_budget | 10-12min | Reset budgets for keys/users/teams | proxy/management_helpers/budget_reset_job.py |
| add_deployment | 10s | Sync new model deployments from DB | proxy/proxy_server.py (ProxyConfig) |
| cleanup_old_spend_logs | cron/interval | Delete old spend logs | proxy/management_helpers/spend_log_cleanup.py |
| check_batch_cost | 30min | Calculate costs for batch jobs | proxy/management_helpers/check_batch_cost_job.py |
| check_responses_cost | 30min | Calculate costs for responses API | proxy/management_helpers/check_responses_cost_job.py |
| process_rotations | 1hr | Auto-rotate API keys | proxy/management_helpers/key_rotation_manager.py |
| _run_background_health_check | continuous | Health check model deployments | proxy/proxy_server.py |
| send_weekly_spend_report | weekly | Slack spend alerts | proxy/utils.py (SlackAlerting) |
| send_monthly_spend_report | monthly | Slack spend alerts | proxy/utils.py (SlackAlerting) |
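
For orientation, a minimal sketch of how an interval job like update_spend is wired with APScheduler; the function body and scheduling details below are stand-ins, not the proxy's actual implementation:

```python
# Hedged sketch of an APScheduler interval job; update_spend below is a stub.
from apscheduler.schedulers.asyncio import AsyncIOScheduler

async def update_spend():
    ...  # flush queued spend increments from Redis to PostgreSQL

scheduler = AsyncIOScheduler()
scheduler.add_job(update_spend, "interval", seconds=60)
scheduler.start()  # in the proxy this runs during startup, on the running event loop
```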

Cost Attribution Flow:

  1. LLM response returns to utils.py wrapper after litellm.acompletion() completes
  2. update_response_metadata() (llm_response_utils/response_metadata.py) is called
  3. logging_obj._response_cost_calculator() (litellm_logging.py) calculates cost via litellm.completion_cost() (cost_calculator.py)
  4. Cost is stored in response._hidden_params["response_cost"]
  5. proxy/common_request_processing.py extracts cost from hidden_params and adds to response headers (x-litellm-response-cost)
  6. logging_obj.async_success_handler() triggers callbacks including _ProxyDBLogger.async_log_success_event()
  7. DBSpendUpdateWriter.update_database() queues spend increments to Redis
  8. Background job update_spend flushes queued spend to PostgreSQL every 60s
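
The same cost is visible to SDK callers. A minimal sketch of reading it after step 4; note that _hidden_params is a private attribute, so treat its shape as internal:

```python
# Hedged sketch: reading the computed cost off a completed SDK response.
import litellm

response = litellm.completion(
    model="openai/gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "hello"}],
)
cost = response._hidden_params.get("response_cost")
print(f"response_cost: {cost}")  # the value the proxy returns as x-litellm-response-cost
```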

2. SDK Request Flow

The SDK (litellm/) provides the core LLM calling functionality used by both direct SDK users and the AI Gateway.

```mermaid
graph TD
    subgraph "SDK Entry Points"
        Completion["litellm.completion()"]
        Messages["litellm.messages()"]
    end

    subgraph "main.py"
        Main["completion()
acompletion()"]
    end

    subgraph "utils.py"
        GetProvider["get_llm_provider()"]
    end

    subgraph "llms/custom_httpx/"
        Handler["llm_http_handler.py
BaseLLMHTTPHandler"]
        HTTP["http_handler.py
HTTPHandler / AsyncHTTPHandler"]
    end

    subgraph "llms/{provider}/chat/"
        TransformReq["transform_request()"]
        TransformResp["transform_response()"]
    end

    subgraph "litellm_core_utils/"
        Streaming["streaming_handler.py"]
    end

    subgraph "integrations/ (async, off main thread)"
        Callbacks["custom_logger.py
Langfuse, Datadog, etc."]
    end

    Completion --> Main
    Messages --> Main
    Main --> GetProvider
    GetProvider --> Handler
    Handler --> TransformReq
    TransformReq --> HTTP
    HTTP --> Provider["LLM Provider API"]
    Provider --> HTTP
    HTTP --> TransformResp
    TransformResp --> Streaming
    Streaming --> Response["ModelResponse"]
    Response -.->|async| Callbacks
```

Key SDK files:

  • main.py - Entry points: completion(), acompletion(), embedding()
  • utils.py - get_llm_provider() resolves model → provider
  • llms/custom_httpx/llm_http_handler.py - Central HTTP orchestrator
  • llms/custom_httpx/http_handler.py - Low-level HTTP client
  • llms/{provider}/chat/transformation.py - Provider-specific transformations
  • litellm_core_utils/streaming_handler.py - Streaming response handling
  • integrations/ - Async callbacks (Langfuse, Datadog, etc.)
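
A minimal sketch of these entry points, including the streaming path handled by streaming_handler.py; the provider/model string is a placeholder:

```python
# Hedged sketch of the SDK entry points; provider/model strings are placeholders.
import asyncio
import litellm

async def main():
    # Non-streaming: resolves the provider, transforms the request, returns a ModelResponse.
    response = await litellm.acompletion(
        model="anthropic/claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": "hello"}],
    )
    print(response.choices[0].message.content)

    # Streaming: chunks are normalized by litellm_core_utils/streaming_handler.py.
    stream = await litellm.acompletion(
        model="anthropic/claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": "hello"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())
```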

3. Translation Layer

When a request comes in, it goes through a translation layer that converts between API formats. Each translation is isolated in its own file, making it easy to test and modify independently.

Where to find translations

| Incoming API | Provider | Translation File |
|---|---|---|
| /v1/chat/completions | Anthropic | llms/anthropic/chat/transformation.py |
| /v1/chat/completions | Bedrock Converse | llms/bedrock/chat/converse_transformation.py |
| /v1/chat/completions | Bedrock Invoke | llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py |
| /v1/chat/completions | Gemini | llms/gemini/chat/transformation.py |
| /v1/chat/completions | Vertex AI | llms/vertex_ai/gemini/transformation.py |
| /v1/chat/completions | OpenAI | llms/openai/chat/gpt_transformation.py |
| /v1/messages (passthrough) | Anthropic | llms/anthropic/experimental_pass_through/messages/transformation.py |
| /v1/messages (passthrough) | Bedrock | llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py |
| /v1/messages (passthrough) | Vertex AI | llms/vertex_ai/vertex_ai_partner_models/anthropic/experimental_pass_through/transformation.py |
| Passthrough endpoints | All | proxy/pass_through_endpoints/llm_provider_handlers/ |

Example: Debugging prompt caching

If /v1/messages → Bedrock Converse prompt caching isn't working but Bedrock Invoke works:

  1. Bedrock Converse translation: llms/bedrock/chat/converse_transformation.py
  2. Bedrock Invoke translation: llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py
  3. Compare how each handles cache_control in transform_request()

How translations work

Each provider has a Config class that inherits from BaseConfig (llms/base_llm/chat/transformation.py):

```python
class ProviderConfig(BaseConfig):
    def transform_request(self, model, messages, optional_params, litellm_params, headers):
        # Convert OpenAI format → Provider format
        return {"messages": transformed_messages, ...}
    
    def transform_response(self, model, raw_response, model_response, logging_obj, ...):
        # Convert Provider format → OpenAI format
        return ModelResponse(choices=[...], usage=Usage(...))
```

The BaseLLMHTTPHandler (llms/custom_httpx/llm_http_handler.py) calls these methods - you never need to modify the handler itself.


4. Adding/Modifying Providers

To add a new provider:

  1. Create llms/{provider}/chat/transformation.py
  2. Implement Config class with transform_request() and transform_response()
  3. Add tests in tests/llm_translation/test_{provider}.py

To add a feature (e.g., prompt caching):

  1. Find the translation file from the table above
  2. Modify transform_request() to handle the new parameter
  3. Add unit tests that verify the transformation

Testing checklist

When adding a feature, verify it works across all paths:

| Test | File Pattern |
|---|---|
| OpenAI passthrough | tests/llm_translation/test_openai*.py |
| Anthropic direct | tests/llm_translation/test_anthropic*.py |
| Bedrock Invoke | tests/llm_translation/test_bedrock*.py |
| Bedrock Converse | tests/llm_translation/test_bedrock*converse*.py |
| Vertex AI | tests/llm_translation/test_vertex*.py |
| Gemini | tests/llm_translation/test_gemini*.py |

Unit testing translations

Translations are designed to be unit testable without making API calls:

```python
from litellm.llms.bedrock.chat.converse_transformation import BedrockConverseConfig

def test_prompt_caching_transform():
    config = BedrockConverseConfig()
    result = config.transform_request(
        model="anthropic.claude-3-opus",
        messages=[{"role": "user", "content": "test", "cache_control": {"type": "ephemeral"}}],
        optional_params={},
        litellm_params={},
        headers={}
    )
    assert "cachePoint" in str(result)  # Verify cache_control was translated
```