mistral.rs v0.8.2 Report

Benchmark Results

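The release figures cover Gemma 4 E4B on GB10 and B200. All throughput values are in tokens per second. Each speedup is mistral.rs throughput divided by the comparison engine's throughput at the same prompt length (prefill) or decode depth (decode), so values above 1.000x favor mistral.rs.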

| Mode | Hardware | mistral.rs Q8 speedup range vs llama.cpp |
|---|---|---|
| Prefill | GB10 | 1.382x to 2.113x |
| Decode | GB10 | 1.086x to 1.097x |
| Prefill | B200 | 1.246x to 2.793x |
| Decode | B200 | 1.210x to 1.263x |

| Mode | Hardware | Mean Q8 speedup vs llama.cpp | Points |
|---|---|---:|---:|
| Decode | B200 | 1.242x | 6 |
| Prefill | B200 | 2.194x | 6 |
| Decode | GB10 | 1.090x | 6 |
| Prefill | GB10 | 1.828x | 6 |

| Mode | Hardware | Mean BF16 speedup vs vLLM | Points |
|---|---|---:|---:|
| Decode | B200 | 1.060x | 6 |
| Prefill | B200 | 1.238x | 6 |
| Decode | GB10 | 1.331x | 6 |
| Prefill | GB10 | 1.005x | 6 |

Headline observations from the release artifacts:

  • For Gemma 4 E4B Q8, mistral.rs is faster than llama.cpp on every GB10 and B200 point in the release CSV. Mean prefill speedup is 1.828x on GB10 and 2.194x on B200; mean decode speedup is 1.090x on GB10 and 1.242x on B200.
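  • For Gemma 4 E4B BF16, the mean speedup over vLLM is 1.005x on GB10 prefill (effectively tied), 1.331x on GB10 decode, 1.238x on B200 prefill, and 1.060x on B200 decode. These are means over six points, and mistral.rs is not faster at every point: per-point speedups range from 0.908x to 1.069x for GB10 prefill, 0.917x to 1.614x for B200 prefill, and 0.873x to 1.331x for B200 decode.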
  • The full appendix also includes Gemma 4 26B-A4B and H100 SXM. Those data are useful for technical inspection but are not part of the headline release figures.
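
The headline range and mean figures can be re-derived from the appendix rows. The sketch below does this for GB10 Gemma 4 E4B Q8 decode, using the rounded tokens-per-second values printed in the appendix. The function name `speedup_summary` is illustrative and is not part of the release tooling. Because the inputs are rounded to one decimal place, a recomputed figure can differ in the last digit from one computed on the raw data.

```python
from statistics import fmean

# GB10, Gemma 4 E4B, Q8_0 decode: tokens per second at depths
# 128, 512, 2048, 4096, 8192, 16384 (values copied from the appendix).
MISTRALRS_TPS = [45.5, 45.3, 44.9, 44.3, 43.3, 41.4]
LLAMACPP_TPS = [41.9, 41.7, 41.2, 40.4, 39.7, 37.9]


def speedup_summary(ours, theirs):
    """Return (min, mean, max) of the per-point speedups ours / theirs."""
    if len(ours) != len(theirs):
        raise ValueError("both engines need the same number of points")
    ratios = [a / b for a, b in zip(ours, theirs)]
    return min(ratios), fmean(ratios), max(ratios)


low, mean, high = speedup_summary(MISTRALRS_TPS, LLAMACPP_TPS)
print(f"range {low:.3f}x to {high:.3f}x, mean {mean:.3f}x over {len(MISTRALRS_TPS)} points")
```

For these inputs the script prints a range of 1.086x to 1.097x and a mean of 1.090x, matching the GB10 decode rows in the tables above. The mean is an arithmetic mean of per-point ratios, not a ratio of mean throughputs.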

Method

  • Workloads: prompt lengths and decode depths of 128, 512, 2048, 4096, 8192, and 16384 tokens.
  • Decode workload: 256 generated tokens at each requested depth.
  • Iteration policy: 1 warmup iteration and 3 measured iterations.
  • mistral.rs configuration: paged attention enabled with --max-seq-len 16896 --pa-context-len 16896. GB10 and B200 logs report 16864 usable tokens, which covers the largest decode case of 16384 context tokens plus 256 generated tokens (16640 in total).
  • Quantized comparisons: mistral.rs UQFF q4 is compared with llama.cpp GGUF Q4_K_M; mistral.rs UQFF q8 is compared with llama.cpp GGUF Q8_0.
  • BF16 comparison: mistral.rs safetensors are compared with vLLM BF16.
  • GB10 and B200 mistral.rs rows were rerun on 2026-06-01 after the Gemma 4 KV-sharing correctness fix. Their llama.cpp and vLLM comparison rows are copied from the prior 2026-05-31 full sweep and were not rerun for this report.
  • H100 mistral.rs rows were rerun on 2026-06-01 on the H100 SXM host. H100 llama.cpp and vLLM rows are copied from the 2026-05-31 H100 sweep raw artifacts.
  • Copied H100 llama.cpp rows use llama-bench with -ngl 99 -fa 1, as recorded in the H100 source report metadata and llama.cpp JSON fields.
  • H100 vLLM rows use the single-request sweep script reproduced below with vLLM 0.19.0, Transformers 5.9.0, and Torch 2.10.0+cu128.
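
The context budget in the list above is simple arithmetic, sketched here as a check that every decode workload fits in the usable token count reported by the logs. All constants come from the list; the helper name `tokens_needed` is illustrative.

```python
# Values taken from the Method list above.
LENGTHS = [128, 512, 2048, 4096, 8192, 16384]
GEN_LEN = 256          # generated tokens per decode run
MAX_SEQ_LEN = 16896    # value passed to --max-seq-len and --pa-context-len
USABLE_TOKENS = 16864  # capacity reported in the GB10 and B200 logs


def tokens_needed(depth: int, gen_len: int = GEN_LEN) -> int:
    """Context tokens plus generated tokens for one decode run."""
    return depth + gen_len


worst_case = max(tokens_needed(n) for n in LENGTHS)
headroom = USABLE_TOKENS - worst_case

for n in LENGTHS:
    need = tokens_needed(n)
    print(f"depth {n:>5}: needs {need:>5} tokens, fits={need <= USABLE_TOKENS}")
print(f"worst case {worst_case}, headroom {headroom}, configured {MAX_SEQ_LEN}")
```

The largest case needs 16640 tokens, which leaves 224 tokens of headroom under the reported 16864.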

Commands and Reproducibility

Sweep Settings

```bash
CTX=16896
ITER=3
WARMUP=1
GEN=256
LENGTHS=128,512,2048,4096,8192,16384
```

Entrypoints

B200 used the same bench command shapes shown below.

GB10 Commands

```bash
# gb10_e4b_q4k
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# gb10_e4b_q8
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# gb10_e4b_bf16
target/release/mistralrs bench -m google/gemma-4-E4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# gb10_26b_q4k
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# gb10_26b_q8
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# gb10_26b_bf16
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
```

B200 Commands

```bash
# b200_e4b_q4k
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# b200_e4b_q8
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# b200_e4b_bf16
target/release/mistralrs bench -m google/gemma-4-E4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# b200_26b_q4k
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# b200_26b_q8
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1

# b200_26b_bf16
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
```
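
The GB10 and B200 commands above share one shape and differ only in the model ID and the optional --quant flag. As a minimal sketch, the command lines can be generated rather than copied by hand. The helper name `bench_command` is hypothetical, every flag is taken verbatim from the commands above, and the sketch only builds strings without running the binary.

```python
import shlex
from typing import Optional

LENGTHS = "128,512,2048,4096,8192,16384"
CTX = 16896


def bench_command(model: str, quant: Optional[int] = None) -> str:
    """Build one bench command line; quant=None gives the BF16 shape."""
    args = ["target/release/mistralrs", "bench", "-m", model]
    if quant is not None:
        args += ["--quant", str(quant)]
    args += [
        "--paged-attn", "on",
        "--max-seq-len", str(CTX),
        "--pa-context-len", str(CTX),
        "--prompt-len", LENGTHS,
        "--depth", LENGTHS,
        "--gen-len", "256",
        "--iterations", "3",
        "--warmup", "1",
    ]
    # shlex.join quotes arguments only when the shell would need it.
    return shlex.join(args)


for model in ("google/gemma-4-E4B-it", "google/gemma-4-26B-A4B-it"):
    for quant in (4, 8, None):
        print(bench_command(model, quant))
```

Generating the lines from one function makes it harder for a single run to drift from the shared sweep settings.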

H100 mistral.rs Command Template

```bash
target/release/mistralrs bench -m "$MODEL" --paged-attn on \
  --max-seq-len "$CTX" --pa-context-len "$CTX" \
  --prompt-len "$LEN" --depth "$LEN" --gen-len "$GEN" \
  --iterations "$ITER" --warmup "$WARMUP"

# UQFF variants add one of:
#   --from-uqff 4
#   --from-uqff 8
```

Model Artifacts

| Artifact | HF repo id | Revision recorded in reports | Use |
|---|---|---|---|
| Gemma 4 E4B BF16 | google/gemma-4-E4B-it | d6436b3d62967e1af08bbb046c6300b2a9ae8e85 | BF16 safetensors |
| Gemma 4 26B-A4B BF16 | google/gemma-4-26B-A4B-it | 6e6f6edea8c52db2094dca3086e4b963a0034dfc | BF16 safetensors |
| Gemma 4 E4B UQFF | mistralrs-community/gemma-4-E4B-it-UQFF | 0dcf5919591d49d7ef7001983eafa6f0e6b901cc | Used by --quant 4, --quant 8, or explicit H100 snapshot path |
| Gemma 4 26B-A4B UQFF | mistralrs-community/gemma-4-26B-A4B-it-UQFF | b4fb085fe7169cd8c60488e05899a3e0095a0122 | Used by --quant 4, --quant 8, or explicit H100 snapshot path |

GB10 and B200 logs record the base HF model IDs and UQFF repo selected by --quant; the resolved UQFF snapshot hashes above are recorded by the H100 download script.
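
To reproduce against the same weights, the recorded revisions can be pinned at download time. The sketch below is one way to do that with the `huggingface_hub` package and is not the H100 download script, which this report does not reproduce. It assumes `huggingface_hub` is installed and that the account has access to the Gemma repos; the helper name `fetch_pinned` is hypothetical.

```python
# Revisions copied from the Model Artifacts table.
PINNED_REVISIONS = {
    "google/gemma-4-E4B-it": "d6436b3d62967e1af08bbb046c6300b2a9ae8e85",
    "google/gemma-4-26B-A4B-it": "6e6f6edea8c52db2094dca3086e4b963a0034dfc",
    "mistralrs-community/gemma-4-E4B-it-UQFF": "0dcf5919591d49d7ef7001983eafa6f0e6b901cc",
    "mistralrs-community/gemma-4-26B-A4B-it-UQFF": "b4fb085fe7169cd8c60488e05899a3e0095a0122",
}


def fetch_pinned(repo_id: str) -> str:
    """Download one repo at its recorded revision and return the local path."""
    # Imported lazily so the pin table can be inspected without the package.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, revision=PINNED_REVISIONS[repo_id])


# Listing the pins needs no network access; call fetch_pinned(repo_id) to download.
for repo_id, revision in PINNED_REVISIONS.items():
    print(f"{repo_id}@{revision}")
```

Pinning by commit hash rather than by branch name keeps a later rerun from silently picking up updated weights.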

GGUF artifacts recorded in the llama.cpp JSON rows:

  • Gemma 4 E4B Q4_K_M: gemma-4-E4B-it-Q4_K_M.gguf.
  • Gemma 4 E4B Q8_0: gemma-4-e4b-it-Q8_0.gguf.
  • Gemma 4 26B-A4B Q4_K_M: gemma-4-26B-A4B-it-Q4_K_M.gguf.
  • Gemma 4 26B-A4B Q8_0: gemma-4-26B-A4B-it-Q8_0.gguf.

Versions and Commits

| Component | Commit or version | Notes |
|---|---|---|
| mistral.rs | 8b49f96392a767ea25a08af27669999a20e8a59e | branch cuda_graphs_v1; release v0.8.2; features cuda,cudnn,flash-attn,cutile |
| llama.cpp | 751ebd17a58a8a513994509214373bb9e6a3d66c | |
| vLLM | 73dd2f33b7a5a8a237fe7296039cec246e4c68bd | vllm 0.21.0, torch 2.11.0+cu130 |

Host metadata:

  • GB10: NVIDIA GB10, driver 580.126.09, release v0.8.2.
  • B200: NVIDIA B200, driver 580.126.20, CUDA 13.0, 183359 MiB GPU memory, release v0.8.2.
  • H100 SXM: NVIDIA H100 80GB HBM3, driver 570.124.06, CUDA 12.8, rustc 1.96.0, cargo 1.96.0, release v0.8.2.

vLLM Bench Script

```python
import argparse
import json
import statistics
from pathlib import Path

from vllm import LLM, SamplingParams


# Prompt lengths (prefill) and context depths (decode) covered by the sweep.
LENGTHS = [128, 512, 2048, 4096, 8192, 16384]


def prompt_tokens(n: int) -> list[int]:
    # Synthetic prompt of n consecutive token IDs, passed to vLLM as token IDs.
    return list(range(1000, 1000 + n))


def run_one(llm: LLM, prompt_len: int, output_len: int) -> tuple[float, float, int]:
    """Run one request; return (prefill seconds, decode seconds, generated tokens)."""
    params = SamplingParams(
        temperature=0.0,
        max_tokens=output_len,
        min_tokens=output_len,
        ignore_eos=True,
    )
    out = llm.generate(
        [{"prompt_token_ids": prompt_tokens(prompt_len)}],
        params,
        use_tqdm=False,
    )[0]
    metrics = out.metrics
    if metrics is None:
        raise RuntimeError("vLLM did not return request metrics")
    # Prefill runs from scheduling to the first token; decode from first to last token.
    prefill = metrics.first_token_ts - metrics.scheduled_ts
    decode = metrics.last_token_ts - metrics.first_token_ts
    generated = len(out.outputs[0].token_ids)
    return prefill, decode, generated


def mean(values: list[float]) -> float:
    return statistics.fmean(values)


def stdev(values: list[float]) -> float:
    # Population standard deviation; 0.0 when there is only one sample.
    return statistics.pstdev(values) if len(values) > 1 else 0.0


def main() -> None:
    parser = argparse.ArgumentParser()
    parser.add_argument("--model", required=True)
    parser.add_argument("--max-model-len", type=int, default=32768)
    parser.add_argument("--iterations", type=int, default=3)
    parser.add_argument("--warmup", type=int, default=1)
    parser.add_argument("--output-json", required=True)
    parser.add_argument("--gpu-memory-utilization", type=float, default=0.92)
    args = parser.parse_args()

    llm = LLM(
        model=args.model,
        runner="generate",
        dtype="bfloat16",
        max_model_len=args.max_model_len,
        gpu_memory_utilization=args.gpu_memory_utilization,
        enable_prefix_caching=False,
        seed=0,
        trust_remote_code=True,
        disable_log_stats=False,
        limit_mm_per_prompt={"image": 0, "video": 0, "audio": 0},
    )

    # Warm up with a short request, then with every measured shape.
    for _ in range(args.warmup):
        run_one(llm, 32, 16)
        for length in LENGTHS:
            run_one(llm, length, 1)
            run_one(llm, length, 256)

    rows = []
    # Prefill throughput: prompt tokens divided by time to first token.
    for length in LENGTHS:
        prefill_times = []
        for _ in range(args.iterations):
            prefill, _, _ = run_one(llm, length, 1)
            prefill_times.append(prefill)
        prefill_tps = [length / t for t in prefill_times]
        rows.append(
            {
                "mode": "prefill",
                "length": length,
                "tokens_per_second": mean(prefill_tps),
                "stddev": stdev(prefill_tps),
                "seconds": mean(prefill_times),
            }
        )

    # Decode throughput: the first token ends prefill, so count generated - 1 tokens.
    for length in LENGTHS:
        decode_times = []
        generated_counts = []
        for _ in range(args.iterations):
            _, decode, generated = run_one(llm, length, 256)
            decode_times.append(decode)
            generated_counts.append(generated)
        decode_tps = [
            max(generated - 1, 0) / t if t > 0 else 0.0
            for t, generated in zip(decode_times, generated_counts)
        ]
        rows.append(
            {
                "mode": "decode",
                "length": length,
                "tokens_per_second": mean(decode_tps),
                "stddev": stdev(decode_tps),
                "seconds": mean(decode_times),
                "generated_tokens": mean(generated_counts),
            }
        )

    Path(args.output_json).write_text(json.dumps({"model": args.model, "rows": rows}, indent=2))

    # Print the same rows as a Markdown table.
    print(f"Model: {args.model}")
    print("| Mode | Length | T/s | Stddev | Seconds |")
    print("|---|---:|---:|---:|---:|")
    for row in rows:
        print(
            f"| {row['mode']} | {row['length']} | "
            f"{row['tokens_per_second']:.1f} | {row['stddev']:.1f} | {row['seconds']:.4f} |"
        )


if __name__ == "__main__":
    main()
```
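
The file written through --output-json can be joined with mistral.rs numbers to produce speedup rows. The following is a minimal sketch under stated assumptions: the sample JSON mirrors the layout written by the script above but holds rounded H100 Gemma 4 E4B BF16 decode values from the appendix rather than raw output, the mistral.rs values are supplied by hand, and the helper name `speedups` is illustrative.

```python
import json

# Sample text in the layout written by --output-json. The numbers are rounded
# appendix values (H100, Gemma 4 E4B, BF16 decode), not raw script output.
SAMPLE_JSON = json.dumps({
    "model": "google/gemma-4-E4B-it",
    "rows": [
        {"mode": "decode", "length": 8192, "tokens_per_second": 134.9},
        {"mode": "decode", "length": 16384, "tokens_per_second": 120.5},
    ],
})

# mistral.rs throughput for the same points, keyed by (mode, length).
MISTRALRS_TPS = {("decode", 8192): 172.3, ("decode", 16384): 166.9}


def speedups(vllm_json: str, ours: dict) -> dict:
    """Map (mode, length) to ours / vLLM for every point present in both."""
    rows = json.loads(vllm_json)["rows"]
    theirs = {(r["mode"], r["length"]): r["tokens_per_second"] for r in rows}
    return {key: ours[key] / theirs[key] for key in ours if key in theirs}


for (mode, length), ratio in sorted(speedups(SAMPLE_JSON, MISTRALRS_TPS).items()):
    print(f"{mode} {length}: {ratio:.3f}x")
```

Keying both sides by (mode, length) ensures that a speedup is only computed when both engines were measured at the same point.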

Appendix: Full Tables

All appendix values are tokens per second. The speedup column is mistral.rs divided by the comparison engine in the same row.

GB10

Gemma 4 E4B

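Q4_K_M Prefill

| Length | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 4643.5 | 3170.7 | 1.465x |
| 512 | 7747.0 | 4666.9 | 1.660x |
| 2048 | 8605.9 | 4762.2 | 1.807x |
| 4096 | 8663.2 | 4678.4 | 1.852x |
| 8192 | 7945.7 | 4578.6 | 1.735x |
| 16384 | 7021.9 | 4315.8 | 1.627x |

Q4_K_M Decode

| Depth | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 68.0 | 61.6 | 1.104x |
| 512 | 67.4 | 61.2 | 1.101x |
| 2048 | 66.5 | 60.2 | 1.105x |
| 4096 | 65.3 | 58.6 | 1.114x |
| 8192 | 63.3 | 57.2 | 1.107x |
| 16384 | 59.2 | 53.7 | 1.102x |

Q8_0 Prefill

| Length | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 3657.1 | 2645.4 | 1.382x |
| 512 | 7289.5 | 4238.1 | 1.720x |
| 2048 | 9239.8 | 4372.2 | 2.113x |
| 4096 | 8976.2 | 4327.8 | 2.074x |
| 8192 | 8189.7 | 4235.7 | 1.933x |
| 16384 | 7021.8 | 4023.0 | 1.745x |

Q8_0 Decode

| Depth | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 45.5 | 41.9 | 1.086x |
| 512 | 45.3 | 41.7 | 1.086x |
| 2048 | 44.9 | 41.2 | 1.090x |
| 4096 | 44.3 | 40.4 | 1.097x |
| 8192 | 43.3 | 39.7 | 1.091x |
| 16384 | 41.4 | 37.9 | 1.092x |

BF16 Prefill

| Length | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 2401.3 | 2401.6 | 1.000x |
| 512 | 5688.9 | 6265.5 | 0.908x |
| 2048 | 7288.3 | 7180.7 | 1.015x |
| 4096 | 7090.7 | 7060.4 | 1.004x |
| 8192 | 6667.5 | 6451.5 | 1.033x |
| 16384 | 5896.5 | 5517.7 | 1.069x |

BF16 Decode

| Depth | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 25.2 | 19.3 | 1.306x |
| 512 | 25.6 | 19.3 | 1.326x |
| 2048 | 25.4 | 19.1 | 1.330x |
| 4096 | 25.2 | 18.9 | 1.333x |
| 8192 | 24.9 | 18.5 | 1.346x |
| 16384 | 24.2 | 18.0 | 1.344x |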

Gemma 4 26B-A4B

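Q4_K_M Prefill

| Length | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 1662.5 | 1461.3 | 1.138x |
| 512 | 3354.5 | 2855.8 | 1.175x |
| 2048 | 3932.4 | 2863.0 | 1.374x |
| 4096 | 3888.7 | 2781.8 | 1.398x |
| 8192 | 3556.1 | 2791.3 | 1.274x |
| 16384 | 3000.6 | 2659.6 | 1.128x |

Q4_K_M Decode

| Depth | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 71.1 | 67.4 | 1.055x |
| 512 | 69.4 | 65.8 | 1.055x |
| 2048 | 67.0 | 62.1 | 1.079x |
| 4096 | 65.9 | 62.1 | 1.061x |
| 8192 | 64.0 | 59.7 | 1.072x |
| 16384 | 60.8 | 57.7 | 1.054x |

Q8_0 Prefill

| Length | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 1255.5 | 1118.7 | 1.122x |
| 512 | 2728.6 | 2424.5 | 1.125x |
| 2048 | 3723.6 | 2452.1 | 1.519x |
| 4096 | 3745.2 | 2415.8 | 1.550x |
| 8192 | 3336.0 | 2374.6 | 1.405x |
| 16384 | 2893.1 | 2285.2 | 1.266x |

Q8_0 Decode

| Depth | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 49.2 | 49.0 | 1.004x |
| 512 | 48.4 | 48.1 | 1.006x |
| 2048 | 47.0 | 46.2 | 1.017x |
| 4096 | 46.4 | 46.0 | 1.009x |
| 8192 | 45.6 | 44.9 | 1.016x |
| 16384 | 43.9 | 43.9 | 1.000x |

BF16 Prefill

| Length | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 418.9 | 1014.3 | 0.413x |
| 512 | 710.8 | 2728.8 | 0.260x |
| 2048 | 666.7 | 4686.1 | 0.142x |
| 4096 | 595.0 | 5201.2 | 0.114x |
| 8192 | 589.4 | 5240.2 | 0.112x |
| 16384 | 572.5 | 4400.9 | 0.130x |

BF16 Decode

| Depth | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 27.8 | 23.9 | 1.163x |
| 512 | 27.4 | 23.7 | 1.156x |
| 2048 | 27.0 | 23.3 | 1.159x |
| 4096 | 26.8 | 23.2 | 1.155x |
| 8192 | 26.4 | 22.8 | 1.158x |
| 16384 | 25.8 | 22.2 | 1.162x |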

B200

Gemma 4 E4B

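Q4_K_M Prefill

| Length | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 7111.1 | 5265.1 | 1.351x |
| 512 | 17879.4 | 10242.0 | 1.746x |
| 2048 | 31030.3 | 11689.1 | 2.655x |
| 4096 | 32768.0 | 11873.8 | 2.760x |
| 8192 | 31347.0 | 11801.3 | 2.656x |
| 16384 | 29239.8 | 11416.6 | 2.561x |

Q4_K_M Decode

| Depth | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 234.4 | 212.8 | 1.102x |
| 512 | 233.9 | 217.5 | 1.075x |
| 2048 | 232.7 | 212.3 | 1.096x |
| 4096 | 230.7 | 211.9 | 1.089x |
| 8192 | 226.6 | 205.7 | 1.102x |
| 16384 | 217.8 | 200.1 | 1.088x |

Q8_0 Prefill

| Length | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 7529.4 | 5981.2 | 1.259x |
| 512 | 15067.1 | 12091.8 | 1.246x |
| 2048 | 35310.3 | 13515.5 | 2.613x |
| 4096 | 38280.4 | 13705.4 | 2.793x |
| 8192 | 36357.6 | 13596.7 | 2.674x |
| 16384 | 33688.9 | 13063.9 | 2.579x |

Q8_0 Decode

| Depth | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 247.4 | 195.9 | 1.263x |
| 512 | 246.6 | 198.5 | 1.242x |
| 2048 | 245.1 | 195.1 | 1.256x |
| 4096 | 243.4 | 195.2 | 1.247x |
| 8192 | 237.7 | 193.2 | 1.230x |
| 16384 | 228.4 | 188.7 | 1.210x |

BF16 Prefill

| Length | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 11122.5 | 6892.8 | 1.614x |
| 512 | 25132.8 | 18575.1 | 1.353x |
| 2048 | 52992.5 | 34926.6 | 1.517x |
| 4096 | 61369.9 | 62047.7 | 0.989x |
| 8192 | 58656.6 | 63947.3 | 0.917x |
| 16384 | 52012.7 | 50197.9 | 1.036x |

BF16 Decode

| Depth | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 206.5 | 236.5 | 0.873x |
| 512 | 205.7 | 230.5 | 0.892x |
| 2048 | 205.1 | 210.2 | 0.976x |
| 4096 | 203.7 | 188.3 | 1.082x |
| 8192 | 201.0 | 166.4 | 1.208x |
| 16384 | 193.6 | 145.5 | 1.331x |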

Gemma 4 26B-A4B

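Q4_K_M Prefill

| Length | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 2767.0 | 3588.3 | 0.771x |
| 512 | 8642.5 | 8000.8 | 1.080x |
| 2048 | 12027.1 | 8666.0 | 1.388x |
| 4096 | 16560.7 | 8682.4 | 1.907x |
| 8192 | 15723.7 | 8685.0 | 1.810x |
| 16384 | 14546.3 | 8398.0 | 1.732x |

Q4_K_M Decode

| Depth | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 241.8 | 205.4 | 1.177x |
| 512 | 252.5 | 208.9 | 1.209x |
| 2048 | 247.3 | 207.2 | 1.194x |
| 4096 | 205.6 | 204.5 | 1.005x |
| 8192 | 239.1 | 200.3 | 1.194x |
| 16384 | 232.1 | 196.2 | 1.183x |

Q8_0 Prefill

| Length | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 3180.3 | 4034.5 | 0.788x |
| 512 | 10046.8 | 9347.8 | 1.075x |
| 2048 | 13153.6 | 9500.9 | 1.384x |
| 4096 | 17758.2 | 9511.1 | 1.867x |
| 8192 | 16867.6 | 9471.7 | 1.781x |
| 16384 | 15345.3 | 9154.4 | 1.676x |

Q8_0 Decode

| Depth | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 216.3 | 195.0 | 1.109x |
| 512 | 216.0 | 196.1 | 1.101x |
| 2048 | 213.7 | 194.3 | 1.100x |
| 4096 | 212.3 | 192.4 | 1.103x |
| 8192 | 206.8 | 188.9 | 1.095x |
| 16384 | 200.4 | 186.5 | 1.075x |

BF16 Prefill

| Length | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 2256.5 | 7963.8 | 0.283x |
| 512 | 3244.8 | 20151.7 | 0.161x |
| 2048 | 2843.4 | 24120.8 | 0.118x |
| 4096 | 4224.2 | 43368.9 | 0.097x |
| 8192 | 4173.9 | 42132.4 | 0.099x |
| 16384 | 4060.8 | 33459.3 | 0.121x |

BF16 Decode

| Depth | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 163.0 | 251.9 | 0.647x |
| 512 | 162.5 | 247.1 | 0.658x |
| 2048 | 160.9 | 234.1 | 0.687x |
| 4096 | 159.8 | 218.8 | 0.730x |
| 8192 | 157.6 | 194.1 | 0.812x |
| 16384 | 153.9 | 175.1 | 0.879x |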

H100 SXM

Gemma 4 E4B

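Q4_K_M Prefill

| Length | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 8355.6 | 5461.6 | 1.530x |
| 512 | 17975.3 | 11003.6 | 1.634x |
| 2048 | 28079.0 | 11831.2 | 2.373x |
| 4096 | 31270.8 | 11940.3 | 2.619x |
| 8192 | 30229.6 | 11589.1 | 2.608x |
| 16384 | 28280.9 | 11154.1 | 2.535x |

Q4_K_M Decode

| Depth | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 215.3 | 206.1 | 1.045x |
| 512 | 214.1 | 206.6 | 1.036x |
| 2048 | 213.3 | 204.8 | 1.042x |
| 4096 | 212.1 | 202.8 | 1.046x |
| 8192 | 207.9 | 198.0 | 1.050x |
| 16384 | 198.1 | 193.2 | 1.026x |

Q8_0 Prefill

| Length | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 8533.3 | 5898.4 | 1.447x |
| 512 | 20107.3 | 11778.4 | 1.707x |
| 2048 | 31049.3 | 13413.9 | 2.315x |
| 4096 | 33957.5 | 13467.3 | 2.521x |
| 8192 | 33032.6 | 13103.0 | 2.521x |
| 16384 | 30643.4 | 12551.6 | 2.441x |

Q8_0 Decode

| Depth | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 229.1 | 186.5 | 1.228x |
| 512 | 227.9 | 187.1 | 1.218x |
| 2048 | 226.9 | 185.6 | 1.223x |
| 4096 | 224.0 | 183.5 | 1.221x |
| 8192 | 220.0 | 179.3 | 1.227x |
| 16384 | 210.7 | 175.8 | 1.199x |

BF16 Prefill

| Length | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 9718.5 | 15483.7 | 0.628x |
| 512 | 25495.9 | 40069.1 | 0.636x |
| 2048 | 43728.5 | 50940.8 | 0.858x |
| 4096 | 48236.8 | 49448.4 | 0.975x |
| 8192 | 45989.5 | 44158.2 | 1.041x |
| 16384 | 41944.1 | 35662.2 | 1.176x |

BF16 Decode

| Depth | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 178.5 | 177.0 | 1.009x |
| 512 | 177.1 | 173.6 | 1.020x |
| 2048 | 176.7 | 162.3 | 1.088x |
| 4096 | 175.0 | 149.4 | 1.171x |
| 8192 | 172.3 | 134.9 | 1.277x |
| 16384 | 166.9 | 120.5 | 1.385x |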

Gemma 4 26B-A4B

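Q4_K_M Prefill

| Length | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 3498.2 | 3512.5 | 0.996x |
| 512 | 8689.4 | 8308.1 | 1.046x |
| 2048 | 13660.9 | 8440.6 | 1.618x |
| 4096 | 14877.9 | 8348.2 | 1.782x |
| 8192 | 14255.6 | 8269.9 | 1.724x |
| 16384 | 13202.3 | 7917.6 | 1.667x |

Q4_K_M Decode

| Depth | mistral.rs UQFF q4 (tok/s) | llama.cpp GGUF Q4_K_M (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 216.0 | 211.5 | 1.021x |
| 512 | 215.1 | 212.2 | 1.014x |
| 2048 | 211.5 | 208.6 | 1.014x |
| 4096 | 208.4 | 207.3 | 1.005x |
| 8192 | 205.4 | 203.2 | 1.011x |
| 16384 | 196.9 | 200.1 | 0.984x |

Q8_0 Prefill

| Length | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 3885.7 | 3831.4 | 1.014x |
| 512 | 9549.8 | 9146.4 | 1.044x |
| 2048 | 14996.6 | 9092.3 | 1.649x |
| 4096 | 16191.7 | 8979.9 | 1.803x |
| 8192 | 15389.4 | 8841.4 | 1.741x |
| 16384 | 14160.8 | 8439.0 | 1.678x |

Q8_0 Decode

| Depth | mistral.rs UQFF q8 (tok/s) | llama.cpp GGUF Q8_0 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 204.9 | 187.4 | 1.093x |
| 512 | 205.2 | 187.8 | 1.093x |
| 2048 | 201.8 | 184.5 | 1.094x |
| 4096 | 201.1 | 184.2 | 1.092x |
| 8192 | 195.5 | 180.9 | 1.081x |
| 16384 | 190.1 | 178.6 | 1.065x |

BF16 Prefill

| Length | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 1833.1 | 8446.4 | 0.217x |
| 512 | 2733.1 | 22914.2 | 0.119x |
| 2048 | 3159.3 | 33839.5 | 0.093x |
| 4096 | 2622.3 | 34905.7 | 0.075x |
| 8192 | 3154.4 | 31892.0 | 0.099x |
| 16384 | 3093.5 | 25777.8 | 0.120x |

BF16 Decode

| Depth | mistral.rs BF16 (tok/s) | vLLM BF16 (tok/s) | mistral.rs speedup |
|---:|---:|---:|---:|
| 128 | 141.1 | 162.1 | 0.871x |
| 512 | 140.8 | 159.9 | 0.880x |
| 2048 | 138.2 | 154.2 | 0.896x |
| 4096 | 142.7 | 147.9 | 0.965x |
| 8192 | 136.5 | 136.5 | 1.000x |
| 16384 | 132.9 | 127.3 | 1.044x |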