LiteLLM Architecture - LiteLLM SDK + AI Gateway

This document helps contributors understand where to make changes in LiteLLM.


How It Works

The LiteLLM AI Gateway (Proxy) uses the LiteLLM SDK internally for all LLM calls:

```
OpenAI SDK (client)    ──▶  LiteLLM AI Gateway (proxy/)  ──▶  LiteLLM SDK (litellm/)  ──▶  LLM API
Anthropic SDK (client) ──▶  LiteLLM AI Gateway (proxy/)  ──▶  LiteLLM SDK (litellm/)  ──▶  LLM API
Any HTTP client        ──▶  LiteLLM AI Gateway (proxy/)  ──▶  LiteLLM SDK (litellm/)  ──▶  LLM API
```

The AI Gateway adds authentication, rate limiting, budgets, and routing on top of the SDK. The SDK handles the actual LLM provider calls, request/response transformations, and streaming.
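
Because the gateway exposes OpenAI-compatible routes, any OpenAI client can point at it. A minimal sketch, assuming a locally running gateway on port 4000 and a virtual key issued by the proxy; the URL, port, key, and model alias are placeholders, not taken from this document:

```python
# Hedged sketch: calling the AI Gateway with the stock OpenAI SDK.
# base_url, port, model alias, and the virtual key are illustrative placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:4000",   # LiteLLM AI Gateway
    api_key="sk-my-virtual-key",        # key issued/managed by the proxy
)

response = client.chat.completions.create(
    model="gpt-4o",  # resolved by the gateway's router to a configured deployment
    messages=[{"role": "user", "content": "hello"}],
)
print(response.choices[0].message.content)
```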


1. AI Gateway (Proxy) Request Flow

The AI Gateway (litellm/proxy/) wraps the SDK with authentication, rate limiting, and management features.

```mermaid
sequenceDiagram
    participant Client
    participant ProxyServer as proxy/proxy_server.py
    participant Auth as proxy/auth/user_api_key_auth.py
    participant Redis as Redis Cache
    participant Hooks as proxy/hooks/
    participant Router as router.py
    participant Main as main.py + utils.py
    participant Handler as llms/custom_httpx/llm_http_handler.py
    participant Transform as llms/{provider}/chat/transformation.py
    participant Provider as LLM Provider API
    participant CostCalc as cost_calculator.py
    participant LoggingObj as litellm_logging.py
    participant DBWriter as db/db_spend_update_writer.py
    participant Postgres as PostgreSQL

    %% Request Flow
    Client->>ProxyServer: POST /v1/chat/completions
    ProxyServer->>Auth: user_api_key_auth()
    Auth->>Redis: Check API key cache
    Redis-->>Auth: Key info + spend limits
    ProxyServer->>Hooks: max_budget_limiter, parallel_request_limiter
    Hooks->>Redis: Check/increment rate limit counters
    ProxyServer->>Router: route_request()
    Router->>Main: litellm.acompletion()
    Main->>Handler: BaseLLMHTTPHandler.completion()
    Handler->>Transform: ProviderConfig.transform_request()
    Handler->>Provider: HTTP Request
    Provider-->>Handler: Response
    Handler->>Transform: ProviderConfig.transform_response()
    Transform-->>Handler: ModelResponse
    Handler-->>Main: ModelResponse
    
    %% Cost Attribution (in utils.py wrapper)
    Main->>LoggingObj: update_response_metadata()
    LoggingObj->>CostCalc: _response_cost_calculator()
    CostCalc->>CostCalc: completion_cost(tokens × price)
    CostCalc-->>LoggingObj: response_cost
    LoggingObj-->>Main: Set response._hidden_params["response_cost"]
    Main-->>ProxyServer: ModelResponse (with cost in _hidden_params)
    
    %% Response Headers + Async Logging
    ProxyServer->>ProxyServer: Extract cost from hidden_params
    ProxyServer->>LoggingObj: async_success_handler()
    LoggingObj->>Hooks: async_log_success_event()
    Hooks->>DBWriter: update_database(response_cost)
    DBWriter->>Redis: Queue spend increment
    DBWriter->>Postgres: Batch write spend logs (async)
    ProxyServer-->>Client: ModelResponse + x-litellm-response-cost header
```

Proxy Components

```mermaid
graph TD
    subgraph "Incoming Request"
        Client["POST /v1/chat/completions"]
    end

    subgraph "proxy/proxy_server.py"
        Endpoint["chat_completion()"]
    end

    subgraph "proxy/auth/"
        Auth["user_api_key_auth()"]
    end

    subgraph "proxy/"
        PreCall["litellm_pre_call_utils.py"]
        RouteRequest["route_llm_request.py"]
    end

    subgraph "litellm/"
        Router["router.py"]
        Main["main.py"]
    end

    subgraph "Infrastructure"
        DualCache["DualCache
(in-memory + Redis)"]
        Postgres["PostgreSQL
(keys, teams, spend logs)"]
    end

    Client --> Endpoint
    Endpoint --> Auth
    Auth --> DualCache
    DualCache -.->|cache miss| Postgres
    Auth --> PreCall
    PreCall --> RouteRequest
    RouteRequest --> Router
    Router --> DualCache
    Router --> Main
    Main --> Client
```

Key proxy files:

  • proxy/proxy_server.py - Main API endpoints
  • proxy/auth/ - Authentication (API keys, JWT, OAuth2)
  • proxy/hooks/ - Proxy-level callbacks
  • router.py - Load balancing, fallbacks
  • router_strategy/ - Routing algorithms (lowest_latency.py, simple_shuffle.py, etc.)
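
To show how router.py and router_strategy/ fit together, here is a minimal sketch of a Router balancing two deployments behind one alias; the API keys and the exact routing_strategy string are assumptions, not taken from this document:

```python
# Hedged sketch: load balancing two deployments behind one model alias.
# API keys and the routing_strategy value are placeholders.
import litellm

router = litellm.Router(
    model_list=[
        {
            "model_name": "gpt-4o",  # alias that clients request
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "KEY_A"},
        },
        {
            "model_name": "gpt-4o",
            "litellm_params": {"model": "openai/gpt-4o", "api_key": "KEY_B"},
        },
    ],
    routing_strategy="simple-shuffle",  # strategies live in router_strategy/
)

response = router.completion(
    model="gpt-4o",
    messages=[{"role": "user", "content": "hello"}],
)
```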

LLM-specific proxy endpoints:

| Endpoint | Directory | Purpose |
|---|---|---|
| /v1/messages | proxy/anthropic_endpoints/ | Anthropic Messages API |
| /vertex-ai/* | proxy/vertex_ai_endpoints/ | Vertex AI passthrough |
| /gemini/* | proxy/google_endpoints/ | Google AI Studio passthrough |
| /v1/images/* | proxy/image_endpoints/ | Image generation |
| /v1/batches | proxy/batches_endpoints/ | Batch processing |
| /v1/files | proxy/openai_files_endpoints/ | File uploads |
| /v1/fine_tuning | proxy/fine_tuning_endpoints/ | Fine-tuning jobs |
| /v1/rerank | proxy/rerank_endpoints/ | Reranking |
| /v1/responses | proxy/response_api_endpoints/ | OpenAI Responses API |
| /v1/vector_stores | proxy/vector_store_endpoints/ | Vector stores |
| /* (passthrough) | proxy/pass_through_endpoints/ | Direct provider passthrough |

Proxy Hooks (proxy/hooks/__init__.py):

| Hook | File | Purpose |
|---|---|---|
| max_budget_limiter | proxy/hooks/max_budget_limiter.py | Enforce budget limits |
| parallel_request_limiter | proxy/hooks/parallel_request_limiter_v3.py | Rate limiting per key/user |
| cache_control_check | proxy/hooks/cache_control_check.py | Cache validation |
| responses_id_security | proxy/hooks/responses_id_security.py | Response ID validation |
| litellm_skills | proxy/hooks/skills_injection.py | Skills injection |

To add a new proxy hook, implement CustomLogger and register in PROXY_HOOKS.
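
A minimal sketch of such a hook, assuming the CustomLogger base class exposes async_pre_call_hook and async_log_success_event with roughly these signatures; check the existing hooks in proxy/hooks/ for the authoritative interface:

```python
# Hedged sketch of a proxy hook; method names/signatures mirror existing hooks
# but should be verified against litellm.integrations.custom_logger.CustomLogger.
from litellm.integrations.custom_logger import CustomLogger


class MyRequestGuard(CustomLogger):  # hypothetical hook
    async def async_pre_call_hook(self, user_api_key_dict, cache, data, call_type):
        # Runs before the LLM call; raising here rejects the request.
        return data

    async def async_log_success_event(self, kwargs, response_obj, start_time, end_time):
        # Runs asynchronously after a successful call (e.g. for spend/metrics).
        pass

# Per this document, register the hook in PROXY_HOOKS (proxy/hooks/__init__.py).
```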

Infrastructure Components

The AI Gateway uses external infrastructure for persistence and caching:

```mermaid
graph LR
    subgraph "AI Gateway (proxy/)"
        Proxy["proxy_server.py"]
        Auth["auth/user_api_key_auth.py"]
        DBWriter["db/db_spend_update_writer.py
DBSpendUpdateWriter"]
        InternalCache["utils.py
InternalUsageCache"]
        CostCallback["hooks/proxy_track_cost_callback.py
_ProxyDBLogger"]
        Scheduler["APScheduler
ProxyStartupEvent"]
    end

    subgraph "SDK (litellm/)"
        Router["router.py
Router.cache (DualCache)"]
        LLMCache["caching/caching_handler.py
LLMCachingHandler"]
        CacheClass["caching/caching.py
Cache"]
    end

    subgraph "Redis (caching/redis_cache.py)"
        RateLimit["Rate Limit Counters"]
        SpendQueue["Spend Increment Queue"]
        KeyCache["API Key Cache"]
        TPM_RPM["TPM/RPM Tracking"]
        Cooldowns["Deployment Cooldowns"]
        LLMResponseCache["LLM Response Cache"]
    end

    subgraph "PostgreSQL (proxy/schema.prisma)"
        Keys["LiteLLM_VerificationToken"]
        Teams["LiteLLM_TeamTable"]
        SpendLogs["LiteLLM_SpendLogs"]
        Users["LiteLLM_UserTable"]
    end

    Auth --> InternalCache
    InternalCache --> KeyCache
    InternalCache -.->|cache miss| Keys
    InternalCache --> RateLimit
    Router --> TPM_RPM
    Router --> Cooldowns
    LLMCache --> CacheClass
    CacheClass --> LLMResponseCache
    CostCallback --> DBWriter
    DBWriter --> SpendQueue
    DBWriter --> SpendLogs
    Scheduler --> SpendLogs
    Scheduler --> Keys
```

| Component | Purpose | Key Files/Classes |
|---|---|---|
| Redis | Rate limiting, API key caching, TPM/RPM tracking, cooldowns, LLM response caching, spend queuing | caching/redis_cache.py (RedisCache), caching/dual_cache.py (DualCache) |
| PostgreSQL | API keys, teams, users, spend logs | proxy/utils.py (PrismaClient), proxy/schema.prisma |
| InternalUsageCache | Proxy-level cache for rate limits + API keys (in-memory + Redis) | proxy/utils.py (InternalUsageCache) |
| Router.cache | TPM/RPM tracking, deployment cooldowns, client caching (in-memory + Redis) | router.py (Router.cache: DualCache) |
| LLMCachingHandler | SDK-level LLM response/embedding caching | caching/caching_handler.py (LLMCachingHandler), caching/caching.py (Cache) |
| DBSpendUpdateWriter | Batches spend updates to reduce DB writes | proxy/db/db_spend_update_writer.py (DBSpendUpdateWriter) |
| Cost Tracking | Calculates and logs response costs | proxy/hooks/proxy_track_cost_callback.py (_ProxyDBLogger) |
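
A minimal sketch of the DualCache read-through behavior described above, assuming DualCache is importable from litellm.caching.caching and exposes get_cache/set_cache; with no Redis configured only the in-memory layer is used:

```python
# Hedged sketch: DualCache checks in-memory first, then Redis on a miss.
# The key naming here is illustrative, not the proxy's actual key schema.
from litellm.caching.caching import DualCache

cache = DualCache()  # no RedisCache passed, so this is in-memory only

cache.set_cache(key="api_key:sk-1234", value={"spend": 0.42, "max_budget": 10.0})
key_info = cache.get_cache(key="api_key:sk-1234")
print(key_info)  # {'spend': 0.42, 'max_budget': 10.0}
```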

Background Jobs (APScheduler, initialized in ProxyStartupEvent.initialize_scheduled_background_jobs() in proxy/proxy_server.py):

| Job | Interval | Purpose | Key Files |
|---|---|---|---|
| update_spend | 60s | Batch write spend logs to PostgreSQL | proxy/db/db_spend_update_writer.py |
| reset_budget | 10-12min | Reset budgets for keys/users/teams | proxy/management_helpers/budget_reset_job.py |
| add_deployment | 10s | Sync new model deployments from DB | proxy/proxy_server.py (ProxyConfig) |
| cleanup_old_spend_logs | cron/interval | Delete old spend logs | proxy/management_helpers/spend_log_cleanup.py |
| check_batch_cost | 30min | Calculate costs for batch jobs | proxy/management_helpers/check_batch_cost_job.py |
| check_responses_cost | 30min | Calculate costs for responses API | proxy/management_helpers/check_responses_cost_job.py |
| process_rotations | 1hr | Auto-rotate API keys | proxy/management_helpers/key_rotation_manager.py |
| _run_background_health_check | continuous | Health check model deployments | proxy/proxy_server.py |
| send_weekly_spend_report | weekly | Slack spend alerts | proxy/utils.py (SlackAlerting) |
| send_monthly_spend_report | monthly | Slack spend alerts | proxy/utils.py (SlackAlerting) |
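
For orientation, a minimal sketch of how an interval job like update_spend is wired with APScheduler; the function body and scheduling details below are stand-ins, not the proxy's actual implementation:

```python
# Hedged sketch of an APScheduler interval job; update_spend below is a stub.
from apscheduler.schedulers.asyncio import AsyncIOScheduler

async def update_spend():
    ...  # flush queued spend increments from Redis to PostgreSQL

scheduler = AsyncIOScheduler()
scheduler.add_job(update_spend, "interval", seconds=60)
scheduler.start()  # in the proxy this runs during startup, on the running event loop
```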

Cost Attribution Flow:

  1. LLM response returns to utils.py wrapper after litellm.acompletion() completes
  2. update_response_metadata() (llm_response_utils/response_metadata.py) is called
  3. logging_obj._response_cost_calculator() (litellm_logging.py) calculates cost via litellm.completion_cost() (cost_calculator.py)
  4. Cost is stored in response._hidden_params["response_cost"]
  5. proxy/common_request_processing.py extracts cost from hidden_params and adds to response headers (x-litellm-response-cost)
  6. logging_obj.async_success_handler() triggers callbacks including _ProxyDBLogger.async_log_success_event()
  7. DBSpendUpdateWriter.update_database() queues spend increments to Redis
  8. Background job update_spend flushes queued spend to PostgreSQL every 60s
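
The same cost is visible to SDK callers. A minimal sketch of reading it after step 4; note that _hidden_params is a private attribute, so treat its shape as internal:

```python
# Hedged sketch: reading the computed cost off a completed SDK response.
import litellm

response = litellm.completion(
    model="openai/gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "hello"}],
)
cost = response._hidden_params.get("response_cost")
print(f"response_cost: {cost}")  # the value the proxy returns as x-litellm-response-cost
```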

2. SDK Request Flow

The SDK (litellm/) provides the core LLM calling functionality used by both direct SDK users and the AI Gateway.

```mermaid
graph TD
    subgraph "SDK Entry Points"
        Completion["litellm.completion()"]
        Messages["litellm.messages()"]
    end

    subgraph "main.py"
        Main["completion()
acompletion()"]
    end

    subgraph "utils.py"
        GetProvider["get_llm_provider()"]
    end

    subgraph "llms/custom_httpx/"
        Handler["llm_http_handler.py
BaseLLMHTTPHandler"]
        HTTP["http_handler.py
HTTPHandler / AsyncHTTPHandler"]
    end

    subgraph "llms/{provider}/chat/"
        TransformReq["transform_request()"]
        TransformResp["transform_response()"]
    end

    subgraph "litellm_core_utils/"
        Streaming["streaming_handler.py"]
    end

    subgraph "integrations/ (async, off main thread)"
        Callbacks["custom_logger.py
Langfuse, Datadog, etc."]
    end

    Completion --> Main
    Messages --> Main
    Main --> GetProvider
    GetProvider --> Handler
    Handler --> TransformReq
    TransformReq --> HTTP
    HTTP --> Provider["LLM Provider API"]
    Provider --> HTTP
    HTTP --> TransformResp
    TransformResp --> Streaming
    Streaming --> Response["ModelResponse"]
    Response -.->|async| Callbacks
```

Key SDK files:

  • main.py - Entry points: completion(), acompletion(), embedding()
  • utils.py - get_llm_provider() resolves model → provider
  • llms/custom_httpx/llm_http_handler.py - Central HTTP orchestrator
  • llms/custom_httpx/http_handler.py - Low-level HTTP client
  • llms/{provider}/chat/transformation.py - Provider-specific transformations
  • litellm_core_utils/streaming_handler.py - Streaming response handling
  • integrations/ - Async callbacks (Langfuse, Datadog, etc.)
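
A minimal sketch of these entry points, including the streaming path handled by streaming_handler.py; the provider/model string is a placeholder:

```python
# Hedged sketch of the SDK entry points; provider/model strings are placeholders.
import asyncio
import litellm

async def main():
    # Non-streaming: resolves the provider, transforms the request, returns a ModelResponse.
    response = await litellm.acompletion(
        model="anthropic/claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": "hello"}],
    )
    print(response.choices[0].message.content)

    # Streaming: chunks are normalized by litellm_core_utils/streaming_handler.py.
    stream = await litellm.acompletion(
        model="anthropic/claude-3-5-sonnet-20240620",
        messages=[{"role": "user", "content": "hello"}],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())
```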

3. Translation Layer

When a request comes in, it goes through a translation layer that converts between API formats. Each translation is isolated in its own file, making it easy to test and modify independently.

Where to find translations

| Incoming API | Provider | Translation File |
|---|---|---|
| /v1/chat/completions | Anthropic | llms/anthropic/chat/transformation.py |
| /v1/chat/completions | Bedrock Converse | llms/bedrock/chat/converse_transformation.py |
| /v1/chat/completions | Bedrock Invoke | llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py |
| /v1/chat/completions | Gemini | llms/gemini/chat/transformation.py |
| /v1/chat/completions | Vertex AI | llms/vertex_ai/gemini/transformation.py |
| /v1/chat/completions | OpenAI | llms/openai/chat/gpt_transformation.py |
| /v1/messages (passthrough) | Anthropic | llms/anthropic/experimental_pass_through/messages/transformation.py |
| /v1/messages (passthrough) | Bedrock | llms/bedrock/messages/invoke_transformations/anthropic_claude3_transformation.py |
| /v1/messages (passthrough) | Vertex AI | llms/vertex_ai/vertex_ai_partner_models/anthropic/experimental_pass_through/transformation.py |
| Passthrough endpoints | All | proxy/pass_through_endpoints/llm_provider_handlers/ |

Example: Debugging prompt caching

If /v1/messages → Bedrock Converse prompt caching isn't working but Bedrock Invoke works:

  1. Bedrock Converse translation: llms/bedrock/chat/converse_transformation.py
  2. Bedrock Invoke translation: llms/bedrock/chat/invoke_transformations/anthropic_claude3_transformation.py
  3. Compare how each handles cache_control in transform_request()

How translations work

Each provider has a Config class that inherits from BaseConfig (llms/base_llm/chat/transformation.py):

```python
class ProviderConfig(BaseConfig):
    def transform_request(self, model, messages, optional_params, litellm_params, headers):
        # Convert OpenAI format → Provider format
        return {"messages": transformed_messages, ...}
    
    def transform_response(self, model, raw_response, model_response, logging_obj, ...):
        # Convert Provider format → OpenAI format
        return ModelResponse(choices=[...], usage=Usage(...))
```

The BaseLLMHTTPHandler (llms/custom_httpx/llm_http_handler.py) calls these methods - you never need to modify the handler itself.


4. Adding/Modifying Providers

To add a new provider:

  1. Create llms/{provider}/chat/transformation.py
  2. Implement Config class with transform_request() and transform_response()
  3. Add tests in tests/llm_translation/test_{provider}.py

To add a feature (e.g., prompt caching):

  1. Find the translation file from the table above
  2. Modify transform_request() to handle the new parameter
  3. Add unit tests that verify the transformation

Testing checklist

When adding a feature, verify it works across all paths:

| Test | File Pattern |
|---|---|
| OpenAI passthrough | tests/llm_translation/test_openai*.py |
| Anthropic direct | tests/llm_translation/test_anthropic*.py |
| Bedrock Invoke | tests/llm_translation/test_bedrock*.py |
| Bedrock Converse | tests/llm_translation/test_bedrock*converse*.py |
| Vertex AI | tests/llm_translation/test_vertex*.py |
| Gemini | tests/llm_translation/test_gemini*.py |

Unit testing translations

Translations are designed to be unit testable without making API calls:

```python
from litellm.llms.bedrock.chat.converse_transformation import BedrockConverseConfig

def test_prompt_caching_transform():
    config = BedrockConverseConfig()
    result = config.transform_request(
        model="anthropic.claude-3-opus",
        messages=[{"role": "user", "content": "test", "cache_control": {"type": "ephemeral"}}],
        optional_params={},
        litellm_params={},
        headers={}
    )
    assert "cachePoint" in str(result)  # Verify cache_control was translated
```