releases/v0.8.2/report.md
The release figures cover Gemma 4 E4B on GB10 and B200. Values are tokens per second, and speedups are mistral.rs divided by the comparison engine at the same length or decode depth.
| Mode | Hardware | mistral.rs Q8 speedup range vs llama.cpp |
|---|---|---|
| Prefill | GB10 | 1.382x to 2.113x |
| Decode | GB10 | 1.086x to 1.097x |
| Prefill | B200 | 1.246x to 2.793x |
| Decode | B200 | 1.210x to 1.263x |
| Mode | Hardware | Mean speedup vs llama.cpp | Points |
|---|---|---|---|
| Decode | B200 | 1.242x | 6 |
| Prefill | B200 | 2.194x | 6 |
| Decode | GB10 | 1.090x | 6 |
| Prefill | GB10 | 1.828x | 6 |
| Mode | Hardware | Mean speedup vs vLLM | Points |
|---|---|---|---|
| Decode | B200 | 1.060x | 6 |
| Prefill | B200 | 1.238x | 6 |
| Decode | GB10 | 1.331x | 6 |
| Prefill | GB10 | 1.005x | 6 |
Headline observations from the release artifacts:
--max-seq-len 16896 --pa-context-len 16896. GB10 and B200 logs report 16864 usable tokens, covering the 16384 plus 256 decode case.llama-bench with -ngl 99 -fa 1, as recorded in the H100 source report metadata and llama.cpp JSON fields.CTX=16896
ITER=3
WARMUP=1
GEN=256
LENGTHS=128,512,2048,4096,8192,16384
B200 used the same bench command shapes shown below.
# gb10_e4b_q4k
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# gb10_e4b_q8
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# gb10_e4b_bf16
target/release/mistralrs bench -m google/gemma-4-E4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# gb10_26b_q4k
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# gb10_26b_q8
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# gb10_26b_bf16
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# b200_e4b_q4k
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# b200_e4b_q8
target/release/mistralrs bench -m google/gemma-4-E4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# b200_e4b_bf16
target/release/mistralrs bench -m google/gemma-4-E4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# b200_26b_q4k
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 4 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# b200_26b_q8
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --quant 8 --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
# b200_26b_bf16
target/release/mistralrs bench -m google/gemma-4-26B-A4B-it --paged-attn on --max-seq-len 16896 --pa-context-len 16896 --prompt-len 128,512,2048,4096,8192,16384 --depth 128,512,2048,4096,8192,16384 --gen-len 256 --iterations 3 --warmup 1
target/release/mistralrs bench -m "$MODEL" --paged-attn on \
--max-seq-len "$CTX" --pa-context-len "$CTX" \
--prompt-len "$LEN" --depth "$LEN" --gen-len "$GEN" \
--iterations "$ITER" --warmup "$WARMUP"
# UQFF variants add one of:
--from-uqff 4
--from-uqff 8
| Artifact | HF repo id | Revision recorded in reports | Use |
|---|---|---|---|
| Gemma 4 E4B BF16 | google/gemma-4-E4B-it | d6436b3d62967e1af08bbb046c6300b2a9ae8e85 | BF16 safetensors |
| Gemma 4 26B-A4B BF16 | google/gemma-4-26B-A4B-it | 6e6f6edea8c52db2094dca3086e4b963a0034dfc | BF16 safetensors |
| Gemma 4 E4B UQFF | mistralrs-community/gemma-4-E4B-it-UQFF | 0dcf5919591d49d7ef7001983eafa6f0e6b901cc | Used by --quant 4, --quant 8, or explicit H100 snapshot path |
| Gemma 4 26B-A4B UQFF | mistralrs-community/gemma-4-26B-A4B-it-UQFF | b4fb085fe7169cd8c60488e05899a3e0095a0122 | Used by --quant 4, --quant 8, or explicit H100 snapshot path |
GB10 and B200 logs record the base HF model IDs and UQFF repo selected by --quant; the resolved UQFF snapshot hashes above are recorded by the H100 download script.
GGUF artifacts recorded in the llama.cpp JSON rows:
Gemma 4 E4B Q4_K_M: gemma-4-E4B-it-Q4_K_M.gguf.Gemma 4 E4B Q8_0: gemma-4-e4b-it-Q8_0.gguf.Gemma 4 26B-A4B Q4_K_M: gemma-4-26B-A4B-it-Q4_K_M.gguf.Gemma 4 26B-A4B Q8_0: gemma-4-26B-A4B-it-Q8_0.gguf.| Component | Commit or version | Notes |
|---|---|---|
| mistral.rs | 8b49f96392a767ea25a08af27669999a20e8a59e | branch cuda_graphs_v1; release v0.8.2; features cuda,cudnn,flash-attn,cutile |
| llama.cpp | 751ebd17a58a8a513994509214373bb9e6a3d66c | |
| vLLM | 73dd2f33b7a5a8a237fe7296039cec246e4c68bd | vllm 0.21.0, torch 2.11.0+cu130 |
Host metadata:
v0.8.2.v0.8.2.v0.8.2.import argparse
import json
import statistics
from pathlib import Path
from vllm import LLM, SamplingParams
LENGTHS = [128, 512, 2048, 4096, 8192, 16384]
def prompt_tokens(n: int) -> list[int]:
return list(range(1000, 1000 + n))
def run_one(llm: LLM, prompt_len: int, output_len: int) -> tuple[float, float, int]:
params = SamplingParams(
temperature=0.0,
max_tokens=output_len,
min_tokens=output_len,
ignore_eos=True,
)
out = llm.generate(
[{"prompt_token_ids": prompt_tokens(prompt_len)}],
params,
use_tqdm=False,
)[0]
metrics = out.metrics
if metrics is None:
raise RuntimeError("vLLM did not return request metrics")
prefill = metrics.first_token_ts - metrics.scheduled_ts
decode = metrics.last_token_ts - metrics.first_token_ts
generated = len(out.outputs[0].token_ids)
return prefill, decode, generated
def mean(values: list[float]) -> float:
return statistics.fmean(values)
def stdev(values: list[float]) -> float:
return statistics.pstdev(values) if len(values) > 1 else 0.0
def main() -> None:
parser = argparse.ArgumentParser()
parser.add_argument("--model", required=True)
parser.add_argument("--max-model-len", type=int, default=32768)
parser.add_argument("--iterations", type=int, default=3)
parser.add_argument("--warmup", type=int, default=1)
parser.add_argument("--output-json", required=True)
parser.add_argument("--gpu-memory-utilization", type=float, default=0.92)
args = parser.parse_args()
llm = LLM(
model=args.model,
runner="generate",
dtype="bfloat16",
max_model_len=args.max_model_len,
gpu_memory_utilization=args.gpu_memory_utilization,
enable_prefix_caching=False,
seed=0,
trust_remote_code=True,
disable_log_stats=False,
limit_mm_per_prompt={"image": 0, "video": 0, "audio": 0},
)
for _ in range(args.warmup):
run_one(llm, 32, 16)
for length in LENGTHS:
run_one(llm, length, 1)
run_one(llm, length, 256)
rows = []
for length in LENGTHS:
prefill_times = []
for _ in range(args.iterations):
prefill, _, _ = run_one(llm, length, 1)
prefill_times.append(prefill)
prefill_tps = [length / t for t in prefill_times]
rows.append(
{
"mode": "prefill",
"length": length,
"tokens_per_second": mean(prefill_tps),
"stddev": stdev(prefill_tps),
"seconds": mean(prefill_times),
}
)
for length in LENGTHS:
decode_times = []
generated_counts = []
for _ in range(args.iterations):
_, decode, generated = run_one(llm, length, 256)
decode_times.append(decode)
generated_counts.append(generated)
decode_tps = [
max(generated - 1, 0) / t if t > 0 else 0.0
for t, generated in zip(decode_times, generated_counts)
]
rows.append(
{
"mode": "decode",
"length": length,
"tokens_per_second": mean(decode_tps),
"stddev": stdev(decode_tps),
"seconds": mean(decode_times),
"generated_tokens": mean(generated_counts),
}
)
Path(args.output_json).write_text(json.dumps({"model": args.model, "rows": rows}, indent=2))
print(f"Model: {args.model}")
print("| Mode | Length | T/s | Stddev | Seconds |")
print("|---|---:|---:|---:|---:|")
for row in rows:
print(
f"| {row['mode']} | {row['length']} | "
f"{row['tokens_per_second']:.1f} | {row['stddev']:.1f} | {row['seconds']:.4f} |"
)
if __name__ == "__main__":
main()
All appendix values are tokens per second. The speedup column is mistral.rs divided by the comparison engine in the same row.
| Length | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 4643.5 | 3170.7 | 1.465x |
| 512 | 7747.0 | 4666.9 | 1.660x |
| 2048 | 8605.9 | 4762.2 | 1.807x |
| 4096 | 8663.2 | 4678.4 | 1.852x |
| 8192 | 7945.7 | 4578.6 | 1.735x |
| 16384 | 7021.9 | 4315.8 | 1.627x |
| Depth | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 68.0 | 61.6 | 1.104x |
| 512 | 67.4 | 61.2 | 1.101x |
| 2048 | 66.5 | 60.2 | 1.105x |
| 4096 | 65.3 | 58.6 | 1.114x |
| 8192 | 63.3 | 57.2 | 1.107x |
| 16384 | 59.2 | 53.7 | 1.102x |
| Length | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 3657.1 | 2645.4 | 1.382x |
| 512 | 7289.5 | 4238.1 | 1.720x |
| 2048 | 9239.8 | 4372.2 | 2.113x |
| 4096 | 8976.2 | 4327.8 | 2.074x |
| 8192 | 8189.7 | 4235.7 | 1.933x |
| 16384 | 7021.8 | 4023.0 | 1.745x |
| Depth | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 45.5 | 41.9 | 1.086x |
| 512 | 45.3 | 41.7 | 1.086x |
| 2048 | 44.9 | 41.2 | 1.090x |
| 4096 | 44.3 | 40.4 | 1.097x |
| 8192 | 43.3 | 39.7 | 1.091x |
| 16384 | 41.4 | 37.9 | 1.092x |
| Length | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 2401.3 | 2401.6 | 1.000x |
| 512 | 5688.9 | 6265.5 | 0.908x |
| 2048 | 7288.3 | 7180.7 | 1.015x |
| 4096 | 7090.7 | 7060.4 | 1.004x |
| 8192 | 6667.5 | 6451.5 | 1.033x |
| 16384 | 5896.5 | 5517.7 | 1.069x |
| Depth | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 25.2 | 19.3 | 1.306x |
| 512 | 25.6 | 19.3 | 1.326x |
| 2048 | 25.4 | 19.1 | 1.330x |
| 4096 | 25.2 | 18.9 | 1.333x |
| 8192 | 24.9 | 18.5 | 1.346x |
| 16384 | 24.2 | 18.0 | 1.344x |
| Length | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 1662.5 | 1461.3 | 1.138x |
| 512 | 3354.5 | 2855.8 | 1.175x |
| 2048 | 3932.4 | 2863.0 | 1.374x |
| 4096 | 3888.7 | 2781.8 | 1.398x |
| 8192 | 3556.1 | 2791.3 | 1.274x |
| 16384 | 3000.6 | 2659.6 | 1.128x |
| Depth | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 71.1 | 67.4 | 1.055x |
| 512 | 69.4 | 65.8 | 1.055x |
| 2048 | 67.0 | 62.1 | 1.079x |
| 4096 | 65.9 | 62.1 | 1.061x |
| 8192 | 64.0 | 59.7 | 1.072x |
| 16384 | 60.8 | 57.7 | 1.054x |
| Length | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 1255.5 | 1118.7 | 1.122x |
| 512 | 2728.6 | 2424.5 | 1.125x |
| 2048 | 3723.6 | 2452.1 | 1.519x |
| 4096 | 3745.2 | 2415.8 | 1.550x |
| 8192 | 3336.0 | 2374.6 | 1.405x |
| 16384 | 2893.1 | 2285.2 | 1.266x |
| Depth | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 49.2 | 49.0 | 1.004x |
| 512 | 48.4 | 48.1 | 1.006x |
| 2048 | 47.0 | 46.2 | 1.017x |
| 4096 | 46.4 | 46.0 | 1.009x |
| 8192 | 45.6 | 44.9 | 1.016x |
| 16384 | 43.9 | 43.9 | 1.000x |
| Length | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 418.9 | 1014.3 | 0.413x |
| 512 | 710.8 | 2728.8 | 0.260x |
| 2048 | 666.7 | 4686.1 | 0.142x |
| 4096 | 595.0 | 5201.2 | 0.114x |
| 8192 | 589.4 | 5240.2 | 0.112x |
| 16384 | 572.5 | 4400.9 | 0.130x |
| Depth | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 27.8 | 23.9 | 1.163x |
| 512 | 27.4 | 23.7 | 1.156x |
| 2048 | 27.0 | 23.3 | 1.159x |
| 4096 | 26.8 | 23.2 | 1.155x |
| 8192 | 26.4 | 22.8 | 1.158x |
| 16384 | 25.8 | 22.2 | 1.162x |
| Length | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 7111.1 | 5265.1 | 1.351x |
| 512 | 17879.4 | 10242.0 | 1.746x |
| 2048 | 31030.3 | 11689.1 | 2.655x |
| 4096 | 32768.0 | 11873.8 | 2.760x |
| 8192 | 31347.0 | 11801.3 | 2.656x |
| 16384 | 29239.8 | 11416.6 | 2.561x |
| Depth | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 234.4 | 212.8 | 1.102x |
| 512 | 233.9 | 217.5 | 1.075x |
| 2048 | 232.7 | 212.3 | 1.096x |
| 4096 | 230.7 | 211.9 | 1.089x |
| 8192 | 226.6 | 205.7 | 1.102x |
| 16384 | 217.8 | 200.1 | 1.088x |
| Length | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 7529.4 | 5981.2 | 1.259x |
| 512 | 15067.1 | 12091.8 | 1.246x |
| 2048 | 35310.3 | 13515.5 | 2.613x |
| 4096 | 38280.4 | 13705.4 | 2.793x |
| 8192 | 36357.6 | 13596.7 | 2.674x |
| 16384 | 33688.9 | 13063.9 | 2.579x |
| Depth | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 247.4 | 195.9 | 1.263x |
| 512 | 246.6 | 198.5 | 1.242x |
| 2048 | 245.1 | 195.1 | 1.256x |
| 4096 | 243.4 | 195.2 | 1.247x |
| 8192 | 237.7 | 193.2 | 1.230x |
| 16384 | 228.4 | 188.7 | 1.210x |
| Length | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 11122.5 | 6892.8 | 1.614x |
| 512 | 25132.8 | 18575.1 | 1.353x |
| 2048 | 52992.5 | 34926.6 | 1.517x |
| 4096 | 61369.9 | 62047.7 | 0.989x |
| 8192 | 58656.6 | 63947.3 | 0.917x |
| 16384 | 52012.7 | 50197.9 | 1.036x |
| Depth | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 206.5 | 236.5 | 0.873x |
| 512 | 205.7 | 230.5 | 0.892x |
| 2048 | 205.1 | 210.2 | 0.976x |
| 4096 | 203.7 | 188.3 | 1.082x |
| 8192 | 201.0 | 166.4 | 1.208x |
| 16384 | 193.6 | 145.5 | 1.331x |
| Length | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 2767.0 | 3588.3 | 0.771x |
| 512 | 8642.5 | 8000.8 | 1.080x |
| 2048 | 12027.1 | 8666.0 | 1.388x |
| 4096 | 16560.7 | 8682.4 | 1.907x |
| 8192 | 15723.7 | 8685.0 | 1.810x |
| 16384 | 14546.3 | 8398.0 | 1.732x |
| Depth | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 241.8 | 205.4 | 1.177x |
| 512 | 252.5 | 208.9 | 1.209x |
| 2048 | 247.3 | 207.2 | 1.194x |
| 4096 | 205.6 | 204.5 | 1.005x |
| 8192 | 239.1 | 200.3 | 1.194x |
| 16384 | 232.1 | 196.2 | 1.183x |
| Length | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 3180.3 | 4034.5 | 0.788x |
| 512 | 10046.8 | 9347.8 | 1.075x |
| 2048 | 13153.6 | 9500.9 | 1.384x |
| 4096 | 17758.2 | 9511.1 | 1.867x |
| 8192 | 16867.6 | 9471.7 | 1.781x |
| 16384 | 15345.3 | 9154.4 | 1.676x |
| Depth | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 216.3 | 195.0 | 1.109x |
| 512 | 216.0 | 196.1 | 1.101x |
| 2048 | 213.7 | 194.3 | 1.100x |
| 4096 | 212.3 | 192.4 | 1.103x |
| 8192 | 206.8 | 188.9 | 1.095x |
| 16384 | 200.4 | 186.5 | 1.075x |
| Length | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 2256.5 | 7963.8 | 0.283x |
| 512 | 3244.8 | 20151.7 | 0.161x |
| 2048 | 2843.4 | 24120.8 | 0.118x |
| 4096 | 4224.2 | 43368.9 | 0.097x |
| 8192 | 4173.9 | 42132.4 | 0.099x |
| 16384 | 4060.8 | 33459.3 | 0.121x |
| Depth | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 163.0 | 251.9 | 0.647x |
| 512 | 162.5 | 247.1 | 0.658x |
| 2048 | 160.9 | 234.1 | 0.687x |
| 4096 | 159.8 | 218.8 | 0.730x |
| 8192 | 157.6 | 194.1 | 0.812x |
| 16384 | 153.9 | 175.1 | 0.879x |
| Length | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 8355.6 | 5461.6 | 1.530x |
| 512 | 17975.3 | 11003.6 | 1.634x |
| 2048 | 28079.0 | 11831.2 | 2.373x |
| 4096 | 31270.8 | 11940.3 | 2.619x |
| 8192 | 30229.6 | 11589.1 | 2.608x |
| 16384 | 28280.9 | 11154.1 | 2.535x |
| Depth | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 215.3 | 206.1 | 1.045x |
| 512 | 214.1 | 206.6 | 1.036x |
| 2048 | 213.3 | 204.8 | 1.042x |
| 4096 | 212.1 | 202.8 | 1.046x |
| 8192 | 207.9 | 198.0 | 1.050x |
| 16384 | 198.1 | 193.2 | 1.026x |
| Length | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 8533.3 | 5898.4 | 1.447x |
| 512 | 20107.3 | 11778.4 | 1.707x |
| 2048 | 31049.3 | 13413.9 | 2.315x |
| 4096 | 33957.5 | 13467.3 | 2.521x |
| 8192 | 33032.6 | 13103.0 | 2.521x |
| 16384 | 30643.4 | 12551.6 | 2.441x |
| Depth | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 229.1 | 186.5 | 1.228x |
| 512 | 227.9 | 187.1 | 1.218x |
| 2048 | 226.9 | 185.6 | 1.223x |
| 4096 | 224.0 | 183.5 | 1.221x |
| 8192 | 220.0 | 179.3 | 1.227x |
| 16384 | 210.7 | 175.8 | 1.199x |
| Length | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 9718.5 | 15483.7 | 0.628x |
| 512 | 25495.9 | 40069.1 | 0.636x |
| 2048 | 43728.5 | 50940.8 | 0.858x |
| 4096 | 48236.8 | 49448.4 | 0.975x |
| 8192 | 45989.5 | 44158.2 | 1.041x |
| 16384 | 41944.1 | 35662.2 | 1.176x |
| Depth | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 178.5 | 177.0 | 1.009x |
| 512 | 177.1 | 173.6 | 1.020x |
| 2048 | 176.7 | 162.3 | 1.088x |
| 4096 | 175.0 | 149.4 | 1.171x |
| 8192 | 172.3 | 134.9 | 1.277x |
| 16384 | 166.9 | 120.5 | 1.385x |
| Length | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 3498.2 | 3512.5 | 0.996x |
| 512 | 8689.4 | 8308.1 | 1.046x |
| 2048 | 13660.9 | 8440.6 | 1.618x |
| 4096 | 14877.9 | 8348.2 | 1.782x |
| 8192 | 14255.6 | 8269.9 | 1.724x |
| 16384 | 13202.3 | 7917.6 | 1.667x |
| Depth | mistral.rs UQFF q4 | llama.cpp GGUF Q4_K_M | mistral.rs speedup |
|---|---|---|---|
| 128 | 216.0 | 211.5 | 1.021x |
| 512 | 215.1 | 212.2 | 1.014x |
| 2048 | 211.5 | 208.6 | 1.014x |
| 4096 | 208.4 | 207.3 | 1.005x |
| 8192 | 205.4 | 203.2 | 1.011x |
| 16384 | 196.9 | 200.1 | 0.984x |
| Length | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 3885.7 | 3831.4 | 1.014x |
| 512 | 9549.8 | 9146.4 | 1.044x |
| 2048 | 14996.6 | 9092.3 | 1.649x |
| 4096 | 16191.7 | 8979.9 | 1.803x |
| 8192 | 15389.4 | 8841.4 | 1.741x |
| 16384 | 14160.8 | 8439.0 | 1.678x |
| Depth | mistral.rs UQFF q8 | llama.cpp GGUF Q8_0 | mistral.rs speedup |
|---|---|---|---|
| 128 | 204.9 | 187.4 | 1.093x |
| 512 | 205.2 | 187.8 | 1.093x |
| 2048 | 201.8 | 184.5 | 1.094x |
| 4096 | 201.1 | 184.2 | 1.092x |
| 8192 | 195.5 | 180.9 | 1.081x |
| 16384 | 190.1 | 178.6 | 1.065x |
| Length | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 1833.1 | 8446.4 | 0.217x |
| 512 | 2733.1 | 22914.2 | 0.119x |
| 2048 | 3159.3 | 33839.5 | 0.093x |
| 4096 | 2622.3 | 34905.7 | 0.075x |
| 8192 | 3154.4 | 31892.0 | 0.099x |
| 16384 | 3093.5 | 25777.8 | 0.120x |
| Depth | mistral.rs BF16 | vLLM BF16 | mistral.rs speedup |
|---|---|---|---|
| 128 | 141.1 | 162.1 | 0.871x |
| 512 | 140.8 | 159.9 | 0.880x |
| 2048 | 138.2 | 154.2 | 0.896x |
| 4096 | 142.7 | 147.9 | 0.965x |
| 8192 | 136.5 | 136.5 | 1.000x |
| 16384 | 132.9 | 127.3 | 1.044x |