Llama-3.3-70B-Instruct is Meta's latest 70-billion-parameter instruction-tuned language model, featuring improved performance and efficiency over Llama 3.1. With a 128K token context window and enhanced capabilities across reasoning, coding, and multilingual tasks, Llama 3.3 delivers state-of-the-art results while remaining accessible for production deployment.
Key Features:
- 70 billion parameters, instruction-tuned for dialogue and tool use
- 128K token context window
- Improved reasoning, coding, and multilingual performance over Llama 3.1
- Native tool calling support via the `llama3` tool-call parser
License: Llama 3.3 is licensed under the Llama 3.3 Community License. See LICENSE for details.
For more details, please refer to the official Llama models repository.
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for AMD GPUs (MI300X, MI325X, MI355X).
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your AMD GPU setup.
import { Llama33Deployment } from "/src/snippets/autoregressive/llama33-70b-deployment.jsx";
<Llama33Deployment />

AMD GPU Deployment:
- Use the AMD-quantized `amd/Llama-3.3-70B-Instruct-FP8-KV` model for FP8 inference
- Add `--tool-call-parser llama3` for function calling support

For basic API usage and request examples, please refer to:
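As a quick reference, a minimal chat completion request looks like the sketch below. It assumes an SGLang server is already running locally on port 30000, as in the deployment commands in this guide; the endpoint is OpenAI-compatible.

```python
from openai import OpenAI

# Connect to the local SGLang server (OpenAI-compatible API;
# the api_key value is ignored by the local server).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
)
print(response.choices[0].message.content)
```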
Llama 3.3 70B Instruct supports native tool calling. Enable the tool parser during deployment:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tool-call-parser llama3 \
  --tp 1 \
  --host 0.0.0.0 \
  --port 30000
```
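Once the server is up, one quick way to confirm the model is registered is to list models through the OpenAI-compatible endpoint. This sketch assumes the default host and port from the command above:

```python
from openai import OpenAI

# Sanity check: query the OpenAI-compatible /v1/models endpoint
# to confirm the server is reachable and the model is loaded.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
print([model.id for model in client.models.list().data])
```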
Python Example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    temperature=0.7
)

# Check for tool calls
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
```
Handling Tool Call Results:
```python
# After executing the function, send the result back
def get_weather(location, unit="celsius"):
    # Your weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Build conversation with tool result
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Tokyo", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Tokyo", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    temperature=0.7
)
print(final_response.choices[0].message.content)
# Output: "The current weather in Tokyo is 22°C and sunny. A perfect day!"
```
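Putting both halves together: rather than hard-coding the tool-call ID and arguments as above, a minimal dispatch loop can parse the model's arguments with `json.loads`, execute the function, and feed the result back for a final answer. This is a sketch that reuses the `client`, `tools`, and `get_weather` definitions from the earlier examples:

```python
import json

# Start the conversation and let the model decide whether to call the tool
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    tools=tools,
)
message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]

    # Parse the JSON-encoded arguments and run the real function
    args = json.loads(tool_call.function.arguments)
    result = get_weather(**args)

    # Echo the assistant's tool call, then attach the tool result
    messages.append(message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": result,
    })

    # Ask the model to produce the final user-facing answer
    final = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=messages,
    )
    print(final.choices[0].message.content)
```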
Leverage the 128K context window for processing long documents:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Example with long document
long_document = "..." * 10000  # Your long document here

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{long_document}"}
    ],
    temperature=0.7,
    max_tokens=1000
)
print(response.choices[0].message.content)
```
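Before sending a very long prompt, it can be worth a rough length check against the 128K window. The heuristic below (about 4 characters per token) is an assumption for illustration, not an exact count; use the model's tokenizer when precision matters:

```python
# Rough pre-flight check (heuristic: ~4 characters per token on average;
# an approximation, not an exact tokenizer count).
MAX_CONTEXT_TOKENS = 128_000
RESERVED_OUTPUT_TOKENS = 1000  # matches max_tokens in the request above

approx_tokens = len(long_document) // 4
if approx_tokens + RESERVED_OUTPUT_TOKENS > MAX_CONTEXT_TOKENS:
    raise ValueError(f"Document likely too long: ~{approx_tokens} tokens")
```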
Use the SGLang benchmarking suite to test model performance with different workload patterns:
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input 1024 \
  --random-output 1024 \
  --max-concurrency 16
```
Input/Output Length: Adjust `--random-input` and `--random-output` to test different workload patterns:
- `--random-input 1024 --random-output 1024` (balanced input and output)
- `--random-input 1024 --random-output 8192` (generation-heavy)
- `--random-input 8192 --random-output 1024` (prefill-heavy, e.g. summarization)

Concurrency Levels: Adjust `--max-concurrency` to test different load scenarios:
- `--max-concurrency 1 --num-prompts 100` (single-stream latency)
- `--max-concurrency 16 --num-prompts 1000` (moderate load)
- `--max-concurrency 100 --num-prompts 2000` (high-concurrency throughput)
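To run the whole sweep in one go, a small driver script can shell out to the benchmark once per concurrency level. This is a sketch that assumes a server is already running and reuses only the flags shown above:

```python
import subprocess
import sys

# Sweep the concurrency levels listed above, one benchmark run each
for concurrency, num_prompts in [(1, 100), (16, 1000), (100, 2000)]:
    subprocess.run(
        [
            sys.executable, "-m", "sglang.bench_serving",
            "--backend", "sglang",
            "--dataset-name", "random",
            "--num-prompts", str(num_prompts),
            "--random-input", "1024",
            "--random-output", "1024",
            "--max-concurrency", str(concurrency),
        ],
        check=True,  # stop the sweep if any run fails
    )
```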