The Llama API provider enables you to use Meta's hosted Llama models through their official API service. This includes access to the latest Llama 4 multimodal models and Llama 3.3 text models, as well as accelerated variants from partners like Cerebras and Groq.
First, get an API key from Meta and set it as an environment variable:
```sh
export LLAMA_API_KEY="your_api_key_here"
```
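The provider reads the key from `LLAMA_API_KEY`. If you prefer to keep the key in the config file, the sketch below assumes the provider also accepts an `apiKey` config field, as promptfoo's OpenAI-compatible providers do; treat the field name as an assumption and verify it before relying on it:

```yaml
# Sketch (assumption): passing the key explicitly instead of via LLAMA_API_KEY
providers:
  - id: llamaapi:Llama-3.3-70B-Instruct
    config:
      apiKey: your_api_key_here # assumed config field; the env var is the documented path
```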
Use the `llamaapi:` prefix to specify Llama API models:
```yaml
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Llama-3.3-70B-Instruct
  - llamaapi:chat:Llama-3.3-8B-Instruct # Explicit chat format
```
Generation parameters can be configured per provider:

```yaml
providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.7 # Controls randomness (0.0-2.0)
      max_tokens: 1000 # Maximum response length
      top_p: 0.9 # Nucleus sampling parameter
      frequency_penalty: 0 # Reduce repetition (-2.0 to 2.0)
      presence_penalty: 0 # Encourage topic diversity (-2.0 to 2.0)
      stream: false # Set to true to stream responses
```
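These parameters can also be used to compare settings side by side: listing the same model twice with different configs runs both in one eval. A minimal sketch (the `label` field is used here only to tell the two entries apart in results):

```yaml
# Sketch: one model, two temperature settings, evaluated against the same tests
providers:
  - id: llamaapi:Llama-3.3-70B-Instruct
    label: llama-70b-precise
    config:
      temperature: 0.1
  - id: llamaapi:Llama-3.3-70B-Instruct
    label: llama-70b-creative
    config:
      temperature: 0.9
```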
The following Llama 4 models are available:

- `Llama-4-Maverick-17B-128E-Instruct-FP8`: Industry-leading multimodal model with image and text understanding
- `Llama-4-Scout-17B-16E-Instruct-FP8`: Class-leading multimodal model with superior visual intelligence

Both Llama 4 models support image and text inputs, tool calling, structured (JSON schema) output, and streaming.
The Llama 3.3 text models are:

- `Llama-3.3-70B-Instruct`: Enhanced-performance text model
- `Llama-3.3-8B-Instruct`: Lightweight, ultra-fast variant

Both Llama 3.3 models support tool calling, structured (JSON schema) output, and streaming.
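If you're unsure which variant fits your workload, you can list several models as providers and compare their outputs in a single eval, for example:

```yaml
# Sketch: comparing a Llama 4 model against both Llama 3.3 variants
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
  - llamaapi:Llama-3.3-70B-Instruct
  - llamaapi:Llama-3.3-8B-Instruct
```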
For applications requiring ultra-low latency, accelerated variants are available from partners like Cerebras and Groq:
- `Cerebras-Llama-4-Maverick-17B-128E-Instruct` (32k context, 900 RPM, 300k TPM)
- `Cerebras-Llama-4-Scout-17B-16E-Instruct` (32k context, 600 RPM, 200k TPM)
- `Groq-Llama-4-Maverick-17B-128E-Instruct` (128k context, 1000 RPM, 600k TPM)

Note: Accelerated variants are text-only and don't support image inputs.
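Assuming these names follow the same `llamaapi:` prefix convention as the other models, selecting an accelerated variant looks like this (keep prompts text-only, per the note above):

```yaml
# Sketch: low-latency partner-hosted variants; text-only inputs
providers:
  - llamaapi:Groq-Llama-4-Maverick-17B-128E-Instruct
  - llamaapi:Cerebras-Llama-4-Scout-17B-16E-Instruct
```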
Basic text generation works with all models:
```yaml
providers:
  - llamaapi:Llama-3.3-70B-Instruct

prompts:
  - 'Explain quantum computing in simple terms'

tests:
  - vars: {}
    assert:
      - type: contains
        value: 'quantum'
```
Llama 4 models can process images alongside text:
```yaml
providers:
  - llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8

prompts:
  - role: user
    content:
      - type: text
        text: 'What do you see in this image?'
      - type: image_url
        image_url:
          url: 'https://example.com/image.jpg'

tests:
  - vars: {}
    assert:
      - type: llm-rubric
        value: 'Accurately describes the image content'
```
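The image URL can also come from test variables, so one prompt can be exercised against many images. A sketch reusing the message structure above with a templated URL (the specific URLs and rubric wording are placeholders):

```yaml
prompts:
  - role: user
    content:
      - type: text
        text: 'What do you see in this image?'
      - type: image_url
        image_url:
          url: '{{image_url}}'

tests:
  - vars:
      image_url: 'https://example.com/photo-of-a-cat.jpg' # placeholder URL
    assert:
      - type: llm-rubric
        value: 'Mentions a cat'
  - vars:
      image_url: 'https://example.com/sales-chart.png' # placeholder URL
    assert:
      - type: llm-rubric
        value: 'Describes a chart or graph'
```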
Generate responses following a specific JSON schema:
```yaml
providers:
  - id: llamaapi:Llama-4-Maverick-17B-128E-Instruct-FP8
    config:
      temperature: 0.1
      response_format:
        type: json_schema
        json_schema:
          name: product_review
          schema:
            type: object
            properties:
              rating:
                type: number
                minimum: 1
                maximum: 5
              summary:
                type: string
              pros:
                type: array
                items:
                  type: string
              cons:
                type: array
                items:
                  type: string
            required: ['rating', 'summary']

prompts:
  - 'Review this product: {{product_description}}'

tests:
  - vars:
      product_description: 'Wireless headphones with great sound quality but short battery life'
    assert:
      - type: is-json
      - type: javascript
        value: 'JSON.parse(output).rating >= 1 && JSON.parse(output).rating <= 5'
```
Enable models to call external functions:
```yaml
providers:
  - id: llamaapi:Llama-3.3-70B-Instruct
    config:
      tools:
        - type: function
          function:
            name: get_weather
            description: Get current weather for a location
            parameters:
              type: object
              properties:
                location:
                  type: string
                  description: City and state, e.g. San Francisco, CA
                unit:
                  type: string
                  enum: ['celsius', 'fahrenheit']
              required: ['location']

prompts:
  - "What's the weather like in {{city}}?"

tests:
  - vars:
      city: 'New York, NY'
    assert:
      - type: function-call
        value: get_weather
      - type: javascript
        value: "output.arguments.location.includes('New York')"
```
Enable real-time response streaming:
```yaml
providers:
  - id: llamaapi:Llama-3.3-8B-Instruct
    config:
      stream: true
      temperature: 0.7

prompts:
  - 'Write a short story about {{topic}}'

tests:
  - vars:
      topic: 'time travel'
    assert:
      - type: contains
        value: 'time'
```
All rate limits are applied per team (across all API keys):
| Model Type | Requests/min | Tokens/min |
|---|---|---|
| Standard Models | 3,000 | 1,000,000 |
| Cerebras Models | 600-900 | 200,000-300,000 |
| Groq Models | 1,000 | 600,000 |
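Large evals can hit these limits quickly. promptfoo's `evaluateOptions` can cap how many requests run in parallel; a sketch (the value is arbitrary, tune it to your limits):

```yaml
# Sketch: keep a large eval under the per-team rate limits
evaluateOptions:
  maxConcurrency: 4 # run at most 4 requests in parallel; lower this if you see 429s
```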
Rate limit information is available in response headers:
- `x-ratelimit-limit-tokens`: Total token limit
- `x-ratelimit-remaining-tokens`: Remaining tokens
- `x-ratelimit-limit-requests`: Total request limit
- `x-ratelimit-remaining-requests`: Remaining requests

Set `max_tokens` appropriately to control response length and keep token usage within these limits.

Common errors:

- **Error: 401 Unauthorized** - Verify that the `LLAMA_API_KEY` environment variable is set and contains a valid key.
- **Error: 429 Too Many Requests** - You have hit a rate limit; lower request concurrency or wait before retrying.
- **Error: Model not found** - Check that the model name matches one of the supported models listed above.
- **Error: Invalid image format** - Image inputs are only supported by the non-accelerated Llama 4 models; check that the image URL is accessible and in a supported format.
The Meta Llama API has strong data commitments: prompts and outputs sent through the API are not used to train Meta's models.
The table below compares the Llama API with other providers:

| Feature | Llama API | OpenAI | Anthropic |
|---|---|---|---|
| Multimodal | ✅ (Llama 4) | ✅ | ✅ |
| Tool Calling | ✅ | ✅ | ✅ |
| JSON Schema | ✅ | ✅ | ❌ |
| Streaming | ✅ | ✅ | ✅ |
| Context Window | 32k-128k | 128k | 200k |
| Trains on your data | ❌ | ✅ | ❌ |
Check out the examples directory for complete working configurations.
For questions and support, visit the Llama API documentation or join the promptfoo Discord community.