# Llama 4: `meta-llama/Llama-4-Scout-17B-16E-Instruct`

🚧 We are preparing a collection of UQFF quantized models! 🚧
The Llama 4 collection is a family of natively multimodal AI models that enable text and multimodal experiences.
Architecture:
Integration in mistral.rs:
The Python and HTTP APIs support sending images as URLs, local file paths, or base64-encoded strings.
The Rust SDK accepts images as `DynamicImage` values from the `image` crate.
You can find this example here.
## HTTP server

We support an OpenAI-compatible HTTP API for multimodal models. This example demonstrates sending a chat completion request with an image.
Note: the `image_url` may be a local path, a URL, or a base64-encoded string.
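Since base64 data URLs are accepted, a local image can be embedded directly in the request. A minimal sketch of building such a content part (the placeholder bytes below stand in for a real JPEG file you would read from disk):

```python
import base64

# In practice, read a real file instead:
#   with open("mountain.jpg", "rb") as f:
#       raw = f.read()
raw = b"\xff\xd8\xff\xe0placeholder-jpeg-bytes"  # illustrative placeholder only

# Encode the raw bytes and wrap them in a data URL.
b64 = base64.b64encode(raw).decode("utf-8")

# This dict can be used as the image part of a chat message's "content" list.
image_part = {
    "type": "image_url",
    "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
}
```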
Image:
<h6><a href = "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg">Credit</a></h6>

Prompt:
Please describe this image in detail.
Output:
The image presents a breathtaking mountain landscape, with a snow-capped peak dominating the scene. The mountain's rugged terrain is characterized by numerous ridges and valleys, while its summit is adorned with several structures that appear to be communication towers or antennas.
**Key Features:**
* **Mountain:** The mountain is the central focus of the image, showcasing a mix of snow-covered and bare areas.
* **Sky:** The sky above the mountain features a dramatic display of clouds, with dark grey clouds at the top gradually giving way to lighter blue skies towards the bottom.
* **Valley:** In the foreground, a valley stretches out, covered in trees that are mostly bare, suggesting a winter setting.
* **Lighting:** The lighting in the image is striking, with the sun casting a warm glow on the mountain's snow-covered slopes while leaving the surrounding areas in shadow.
**Overall Impression:**
The image exudes a sense of serenity and majesty, capturing the beauty of nature in a dramatic and awe-inspiring way. The contrast between the snow-covered mountain and the bare trees in the valley creates a visually appealing scene that invites the viewer to appreciate the natural world.
Start the server with in-situ quantization:

```bash
mistralrs serve multimodal -p 1234 --isq 4 -m meta-llama/Llama-4-Scout-17B-16E-Instruct
```
```python
from openai import OpenAI
import httpx
import textwrap
import json

client = OpenAI(api_key="foobar", base_url="http://localhost:1234/v1/")

completion = client.chat.completions.create(
    model="default",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {
                    "type": "text",
                    "text": "Please describe this image in detail.",
                },
            ],
        },
    ],
    max_tokens=256,
    frequency_penalty=1.0,
    top_p=0.1,
    temperature=0,
)
resp = completion.choices[0].message.content
print(resp)
```
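Because the server speaks the standard OpenAI chat-completions wire format, you can also build the request body yourself and POST it without the `openai` package. A sketch of the payload (the actual POST is commented out since it requires the server started above to be running on `localhost:1234`; `httpx` or `curl` would work equally well):

```python
import json

# The same request as above, expressed as a plain chat-completions payload.
payload = {
    "model": "default",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                    },
                },
                {"type": "text", "text": "Please describe this image in detail."},
            ],
        }
    ],
    "max_tokens": 256,
    "temperature": 0,
}

body = json.dumps(payload)

# Requires the running server, e.g. with httpx:
# resp = httpx.post("http://localhost:1234/v1/chat/completions",
#                   json=payload, timeout=None)
# print(resp.json()["choices"][0]["message"]["content"])
```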
## Rust

You can find this example here.

This is a minimal example of running the Llama 4 model on an image fetched from a URL.
```rust
use anyhow::Result;
use mistralrs::{IsqType, MultimodalMessages, MultimodalModelBuilder, TextMessageRole};

#[tokio::main]
async fn main() -> Result<()> {
    let model = MultimodalModelBuilder::new("meta-llama/Llama-4-Scout-17B-16E-Instruct")
        .with_isq(IsqType::Q4K)
        .with_logging()
        .build()
        .await?;

    let bytes = match reqwest::blocking::get(
        "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg",
    ) {
        Ok(http_resp) => http_resp.bytes()?.to_vec(),
        Err(e) => anyhow::bail!(e),
    };
    let image = image::load_from_memory(&bytes)?;

    let messages = MultimodalMessages::new().add_image_message(
        TextMessageRole::User,
        "What is this?",
        vec![image],
    );

    let response = model.send_chat_request(messages).await?;

    println!("{}", response.choices[0].message.content.as_ref().unwrap());
    dbg!(
        response.usage.avg_prompt_tok_per_sec,
        response.usage.avg_compl_tok_per_sec
    );

    Ok(())
}
```
## Python

You can find this example here.

This example demonstrates loading the model and sending a chat completion request with an image.

Note: the `image_url` may be a local path, a URL, or a base64-encoded string.
```python
from mistralrs import Runner, Which, ChatCompletionRequest, MultimodalArchitecture

runner = Runner(
    which=Which.MultimodalPlain(
        model_id="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        arch=MultimodalArchitecture.Llama4,
    ),
    in_situ_quant="4",
)

res = runner.send_chat_completion_request(
    ChatCompletionRequest(
        model="default",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": "https://www.nhmagazine.com/content/uploads/2019/05/mtwashingtonFranconia-2-19-18-108-Edit-Edit.jpg"
                        },
                    },
                    {
                        "type": "text",
                        "text": "What is this?",
                    },
                ],
            }
        ],
        max_tokens=256,
        presence_penalty=1.0,
        top_p=0.1,
        temperature=0.1,
    )
)
print(res.choices[0].message.content)
print(res.usage)
```