Llama-3.3-70B-Instruct is Meta's latest 70-billion-parameter instruction-tuned language model, featuring improved performance and efficiency over Llama 3.1. With a 128K token context window and enhanced capabilities across reasoning, coding, and multilingual tasks, Llama 3.3 delivers state-of-the-art results while remaining accessible for production deployment.
Key Features:
- 70 billion parameters, instruction-tuned for dialogue and tool use
- 128K token context window
- Improved reasoning, coding, and multilingual performance over Llama 3.1
- Native tool calling support via the `llama3` tool-call parser
License: Llama 3.3 is licensed under the Llama 3.3 Community License. See LICENSE for details.
For more details, please refer to the official Llama models repository.
Please refer to the official SGLang installation guide for installation instructions.
This section provides deployment configurations optimized for AMD GPUs (MI300X, MI325X, MI355X).
Interactive Command Generator: Use the configuration selector below to automatically generate the appropriate deployment command for your AMD GPU setup.
import { Llama33Deployment } from "/src/snippets/autoregressive/llama33-70b-deployment.jsx";
<Llama33Deployment />

AMD GPU Deployment:
- Use the AMD-quantized `amd/Llama-3.3-70B-Instruct-FP8-KV` model for FP8 inference
- Add `--tool-call-parser llama3` for function calling support

For basic API usage and request examples, please refer to:
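As a quick reference, a minimal chat completion request looks like the sketch below. It assumes an SGLang server is already running locally on port 30000, as in the deployment commands in this guide; the endpoint is OpenAI-compatible.

```python
from openai import OpenAI

# Connect to the local SGLang server (OpenAI-compatible API;
# the api_key value is ignored by the local server).
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[{"role": "user", "content": "Hello! Who are you?"}],
)
print(response.choices[0].message.content)
```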
Llama 3.3 70B Instruct supports native tool calling. Enable the tool parser during deployment:
```bash
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.3-70B-Instruct \
  --tool-call-parser llama3 \
  --tp 1 \
  --host 0.0.0.0 \
  --port 30000
```
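Once the server is up, one quick way to confirm the model is registered is to list models through the OpenAI-compatible endpoint. This sketch assumes the default host and port from the command above:

```python
from openai import OpenAI

# Sanity check: query the OpenAI-compatible /v1/models endpoint
# to confirm the server is reachable and the model is loaded.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
print([model.id for model in client.models.list().data])
```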
Python Example:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Define available tools
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a location",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The city name"
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit"
                    }
                },
                "required": ["location"]
            }
        }
    }
]

# Make request
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": "What's the weather in Tokyo?"}
    ],
    tools=tools,
    temperature=0.7
)

# Check for tool calls
message = response.choices[0].message
if message.tool_calls:
    tool_call = message.tool_calls[0]
    print(f"Function: {tool_call.function.name}")
    print(f"Arguments: {tool_call.function.arguments}")
```
Handling Tool Call Results:
```python
# After executing the function, send the result back
def get_weather(location, unit="celsius"):
    # Your weather API call here
    return f"The weather in {location} is 22°{unit[0].upper()} and sunny."

# Build conversation with tool result
messages = [
    {"role": "user", "content": "What's the weather in Tokyo?"},
    {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": "call_123",
            "type": "function",
            "function": {
                "name": "get_weather",
                "arguments": '{"location": "Tokyo", "unit": "celsius"}'
            }
        }]
    },
    {
        "role": "tool",
        "tool_call_id": "call_123",
        "content": get_weather("Tokyo", "celsius")
    }
]

final_response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    temperature=0.7
)
print(final_response.choices[0].message.content)
# Output: "The current weather in Tokyo is 22°C and sunny. A perfect day!"
```
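Putting both halves together: rather than hard-coding the tool-call ID and arguments as above, a minimal dispatch loop can parse the model's arguments with `json.loads`, execute the function, and feed the result back for a final answer. This is a sketch that reuses the `client`, `tools`, and `get_weather` definitions from the earlier examples:

```python
import json

# Start the conversation and let the model decide whether to call the tool
messages = [{"role": "user", "content": "What's the weather in Tokyo?"}]
response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=messages,
    tools=tools,
)
message = response.choices[0].message

if message.tool_calls:
    tool_call = message.tool_calls[0]

    # Parse the JSON-encoded arguments and run the real function
    args = json.loads(tool_call.function.arguments)
    result = get_weather(**args)

    # Echo the assistant's tool call, then attach the tool result
    messages.append(message)
    messages.append({
        "role": "tool",
        "tool_call_id": tool_call.id,
        "content": result,
    })

    # Ask the model to produce the final user-facing answer
    final = client.chat.completions.create(
        model="meta-llama/Llama-3.3-70B-Instruct",
        messages=messages,
    )
    print(final.choices[0].message.content)
```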
Leverage the 128K context window for processing long documents:
```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:30000/v1",
    api_key="EMPTY"
)

# Example with long document
long_document = "..." * 10000  # Your long document here

response = client.chat.completions.create(
    model="meta-llama/Llama-3.3-70B-Instruct",
    messages=[
        {"role": "user", "content": f"Summarize this document:\n\n{long_document}"}
    ],
    temperature=0.7,
    max_tokens=1000
)
print(response.choices[0].message.content)
```
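Before sending a very long prompt, it can be worth a rough length check against the 128K window. The heuristic below (about 4 characters per token) is an assumption for illustration, not an exact count; use the model's tokenizer when precision matters:

```python
# Rough pre-flight check (heuristic: ~4 characters per token on average;
# an approximation, not an exact tokenizer count).
MAX_CONTEXT_TOKENS = 128_000
RESERVED_OUTPUT_TOKENS = 1000  # matches max_tokens in the request above

approx_tokens = len(long_document) // 4
if approx_tokens + RESERVED_OUTPUT_TOKENS > MAX_CONTEXT_TOKENS:
    raise ValueError(f"Document likely too long: ~{approx_tokens} tokens")
```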
Use the SGLang benchmarking suite to test model performance with different workload patterns:
```bash
python -m sglang.bench_serving \
  --backend sglang \
  --dataset-name random \
  --num-prompts 1000 \
  --random-input 1024 \
  --random-output 1024 \
  --max-concurrency 16
```
Input/Output Length: Adjust `--random-input` and `--random-output` to test different workload patterns:
- `--random-input 1024 --random-output 1024` (balanced input and output)
- `--random-input 1024 --random-output 8192` (generation-heavy)
- `--random-input 8192 --random-output 1024` (prefill-heavy, e.g. summarization)

Concurrency Levels: Adjust `--max-concurrency` to test different load scenarios:
- `--max-concurrency 1 --num-prompts 100` (single-stream latency)
- `--max-concurrency 16 --num-prompts 1000` (moderate load)
- `--max-concurrency 100 --num-prompts 2000` (high-concurrency throughput)
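To run the whole sweep in one go, a small driver script can shell out to the benchmark once per concurrency level. This is a sketch that assumes a server is already running and reuses only the flags shown above:

```python
import subprocess
import sys

# Sweep the concurrency levels listed above, one benchmark run each
for concurrency, num_prompts in [(1, 100), (16, 1000), (100, 2000)]:
    subprocess.run(
        [
            sys.executable, "-m", "sglang.bench_serving",
            "--backend", "sglang",
            "--dataset-name", "random",
            "--num-prompts", str(num_prompts),
            "--random-input", "1024",
            "--random-output", "1024",
            "--max-concurrency", str(concurrency),
        ],
        check=True,  # stop the sweep if any run fails
    )
```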