cookbook/litellm_proxy_server/readme.md
Make /chat/completions requests for 50+ LLM models: Azure, OpenAI, Replicate, Anthropic, Hugging Face

Example: for the model parameter, use claude-2, gpt-3.5, gpt-4, command-nightly, stabilityai/stablecode-completion-alpha-3b-4k
```json
{
  "model": "replicate/llama-2-70b-chat:2c1608e18606fad2812020dc541930f2d0495ce32eee50074220b87300bc16e1",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
- Consistent Input/Output Format
  - Call all models using the OpenAI format: completion(model, messages)
  - Text responses are always available at ['choices'][0]['message']['content'] (see the sketch after this list)
- Error Handling using model fallbacks (if GPT-4 fails, try llama2)
- Logging - Log requests, responses and errors to Supabase, Posthog, Mixpanel, Sentry, Lunary, Athina, Helicone (any of the supported providers here: https://litellm.readthedocs.io/en/latest/advanced/)
  - Example: Logs sent to Supabase
- Token Usage & Spend - Track input + completion tokens used + spend per model
- Caching - Implementation of semantic caching
- Streaming & Async Support - Return generators to stream text responses
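As a quick illustration of the consistent format referenced above, the same call and the same access path work whichever provider serves the request. This is a minimal sketch, assuming the litellm Python package is installed and the relevant provider API keys are set in your environment:

```python
# Minimal sketch of the consistent output format, assuming `pip install litellm`
# and the relevant provider API keys are set in the environment.
from litellm import completion

messages = [{"role": "user", "content": "Hello, whats the weather in San Francisco??"}]

for model in ["gpt-3.5-turbo", "claude-2", "command-nightly"]:
    response = completion(model=model, messages=messages)
    # The text response is always available at the same path, regardless of provider
    print(model, "->", response['choices'][0]['message']['content'])
```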
/chat/completions (POST)

This endpoint is used to generate chat completions for 50+ supported LLM API models. Use llama2, GPT-4, claude-2, etc.

This API endpoint accepts all inputs in raw JSON and expects the following inputs:

- model (string, required): ID of the model to use for chat completions. See all supported models here: https://litellm.readthedocs.io/en/latest/supported/ (e.g. gpt-3.5-turbo, gpt-4, claude-2, command-nightly, stabilityai/stablecode-completion-alpha-3b-4k)
- messages (array, required): A list of messages representing the conversation context. Each message should have a role (system, user, assistant, or function), content (the message text), and name (for the function role).
- Additional optional parameters: temperature, functions, function_call, top_p, n, stream. See the full list of supported inputs here: https://litellm.readthedocs.io/en/latest/input/

Example request for claude-2:
```json
{
  "model": "claude-2",
  "messages": [
    {
      "content": "Hello, whats the weather in San Francisco??",
      "role": "user"
    }
  ]
}
```
Making a request to the proxy with Python:

```python
import requests
import json

# TODO: use your URL
url = "http://localhost:5000/chat/completions"

payload = json.dumps({
    "model": "gpt-3.5-turbo",
    "messages": [
        {
            "content": "Hello, whats the weather in San Francisco??",
            "role": "user"
        }
    ]
})
headers = {
    'Content-Type': 'application/json'
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
```
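The proxy also supports streaming (see Streaming & Async Support above). The sketch below sets "stream": true in the payload and consumes the response incrementally; it assumes the proxy emits OpenAI-style server-sent-event chunks (data: {...} lines with partial text under choices[0].delta), which is an assumption to verify against your deployment:

```python
import requests
import json

url = "http://localhost:5000/chat/completions"  # TODO: use your URL

# Same request as above, but with streaming enabled
payload = {
    "model": "gpt-3.5-turbo",
    "messages": [{"content": "Hello, whats the weather in San Francisco??", "role": "user"}],
    "stream": True,
}

# stream=True tells requests not to read the whole body at once
with requests.post(url, json=payload, stream=True) as response:
    for line in response.iter_lines():
        if not line:
            continue
        decoded = line.decode("utf-8")
        # Assumption: OpenAI-style SSE chunks, e.g. `data: {...}` and a final `data: [DONE]`
        if decoded.startswith("data: "):
            decoded = decoded[len("data: "):]
        if decoded.strip() == "[DONE]":
            break
        chunk = json.loads(decoded)
        # Streaming chunks carry partial text under choices[0]["delta"]["content"]
        delta = chunk["choices"][0].get("delta", {})
        print(delta.get("content") or "", end="", flush=True)
print()
```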
Responses from the server are returned in the following format (for all LLM models). More info on output here: https://litellm.readthedocs.io/en/latest/output/
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "I'm sorry, but I don't have the capability to provide real-time weather information. However, you can easily check the weather in San Francisco by searching online or using a weather app on your phone.",
        "role": "assistant"
      }
    }
  ],
  "created": 1691790381,
  "id": "chatcmpl-7mUFZlOEgdohHRDx2UpYPRTejirzb",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 41,
    "prompt_tokens": 16,
    "total_tokens": 57
  }
}
```
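Because every model returns this same shape, a client can read the generated text and token usage the same way regardless of provider. A minimal sketch, continuing the Python request example above:

```python
data = response.json()

# The generated text is always at the same path, whichever model served the request
answer = data["choices"][0]["message"]["content"]
finish_reason = data["choices"][0]["finish_reason"]

# Token usage, useful for tracking spend per model
usage = data["usage"]
print(answer)
print(f"finish_reason={finish_reason}, "
      f"prompt_tokens={usage['prompt_tokens']}, "
      f"completion_tokens={usage['completion_tokens']}, "
      f"total_tokens={usage['total_tokens']}")
```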
Clone the repo and install the required dependencies:

```shell
git clone https://github.com/BerriAI/liteLLM-proxy
pip install -r requirements.txt
```

Set your LLM API key:

```python
os.environ['OPENAI_API_KEY'] = "YOUR_API_KEY"
```

or set OPENAI_API_KEY in your .env file

Run the server:

```shell
python main.py
```
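Since the proxy can route to any supported provider, you can export keys for more than one backend before starting it. A minimal sketch; the exact environment variable names per provider are assumptions here, so check the LiteLLM docs for the providers you use:

```python
import os

# Illustrative provider keys; variable names per provider are an assumption,
# check the LiteLLM docs for the backends you plan to route to
os.environ["OPENAI_API_KEY"] = "YOUR_OPENAI_KEY"
os.environ["ANTHROPIC_API_KEY"] = "YOUR_ANTHROPIC_KEY"      # e.g. for claude-2
os.environ["REPLICATE_API_TOKEN"] = "YOUR_REPLICATE_TOKEN"  # e.g. for replicate/llama-2 models
```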
Quick Start: Deploy on Railway
GCP, AWS, Azure
This project includes a Dockerfile, allowing you to build and deploy a Docker image on the cloud provider of your choice.