# llama.cpp/tools/llama-bench

Performance testing tool for llama.cpp.

## Table of contents

1. Syntax
2. Examples
    1. Text generation with different models
    2. Prompt processing with different batch sizes
    3. Different numbers of threads
    4. Different numbers of layers offloaded to the GPU
    5. Different prefilled context
3. Output formats
    1. Markdown
    2. CSV
    3. JSON
    4. JSONL
    5. SQL

## Syntax

```
usage: llama-bench [options]

options:
  -h, --help
  --numa <distribute|isolate|numactl>       numa mode (default: disabled)
  -r, --repetitions <n>                     number of times to repeat each test (default: 5)
  --prio <0|1|2|3>                          process/thread priority (default: 0)
  --delay <0...N> (seconds)                 delay between each test (default: 0)
  -o, --output <csv|json|jsonl|md|sql>      output format printed to stdout (default: md)
  -oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
  -v, --verbose                             verbose output
  --progress                                print test progress indicators

test parameters:
  -m, --model <filename>                    (default: models/7B/ggml-model-q4_0.gguf)
  -p, --n-prompt <n>                        (default: 512)
  -n, --n-gen <n>                           (default: 128)
  -pg <pp,tg>                               (default: )
  -d, --n-depth <n>                         (default: 0)
  -b, --batch-size <n>                      (default: 2048)
  -ub, --ubatch-size <n>                    (default: 512)
  -ctk, --cache-type-k <t>                  (default: f16)
  -ctv, --cache-type-v <t>                  (default: f16)
  -dt, --defrag-thold <f>                   (default: -1)
  -t, --threads <n>                         (default: system dependent)
  -C, --cpu-mask <hex,hex>                  (default: 0x0)
  --cpu-strict <0|1>                        (default: 0)
  --poll <0...100>                          (default: 50)
  -ngl, --n-gpu-layers <n>                  (default: 99)
  -rpc, --rpc <rpc_servers>                 (default: none)
  -sm, --split-mode <none|layer|row>        (default: layer)
  -mg, --main-gpu <i>                       (default: 0)
  -nkvo, --no-kv-offload <0|1>              (default: 0)
  -fa, --flash-attn <0|1>                   (default: 0)
  -mmp, --mmap <0|1>                        (default: 1)
  -embd, --embeddings <0|1>                 (default: 0)
  -ts, --tensor-split <ts0/ts1/..>          (default: 0)
  -ot --override-tensors <tensor name pattern>=<buffer type>;...
                                            (default: disabled)
  -nopo, --no-op-offload <0|1>              (default: 0)
```
Multiple values can be given for each parameter by separating them with ','
or by specifying the parameter multiple times. Ranges can be given as
'first-last' or 'first-last+step' or 'first-last*mult'.
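To illustrate how these range forms expand, here is a standalone Python sketch (an illustrative re-implementation, not the tool's own parser; negative values are not handled):

```python
def expand_range(spec: str) -> list[int]:
    """Expand a llama-bench style value spec into a list of integers.

    Supports comma-separated values, 'first-last' (step 1),
    'first-last+step' (additive), and 'first-last*mult' (multiplicative).
    """
    values = []
    for part in spec.split(","):
        if "-" not in part:
            values.append(int(part))          # plain value, e.g. "16"
            continue
        first, rest = part.split("-", 1)
        if "+" in rest:                       # additive range: 128-512+128
            last, step = map(int, rest.split("+", 1))
            cur = int(first)
            while cur <= last:
                values.append(cur)
                cur += step
        elif "*" in rest:                     # multiplicative range: 1-8*2
            last, mult = map(int, rest.split("*", 1))
            cur = int(first)
            while cur <= last:
                values.append(cur)
                cur *= mult
        else:                                 # simple range: 1-4
            values.extend(range(int(first), int(rest) + 1))
    return values
```

Under these rules, `-t 1-8*2` is equivalent to `-t 1,2,4,8`, and `-b 128-512+128` to `-b 128,256,384,512`.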

llama-bench can perform three types of tests:

  - **Prompt processing (pp)**: processing a prompt in batches (`-p`)
  - **Text generation (tg)**: generating a sequence of tokens (`-n`)
  - **Prompt processing + text generation (pg)**: processing a prompt followed by generating a sequence of tokens (`-pg`)

With the exception of -r, -o and -v, all options can be specified multiple times to run multiple tests. Each pp and tg test is run with all combinations of the specified options. To specify multiple values for an option, the values can be separated by commas (e.g. -n 16,32), or the option can be specified multiple times (e.g. -n 16 -n 32).

Each test is repeated the number of times given by -r, and the results are averaged. The results are given in average tokens per second (t/s) and standard deviation. Some output formats (e.g. json) also include the individual results of each repetition.
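The reported numbers can be reproduced from the per-repetition samples. The small Python check below uses the `samples_ts` values from the pp 512 run in the JSON output later in this document; the reported `stddev_ts` is consistent with the sample standard deviation (n − 1 divisor):

```python
import statistics

# Per-repetition throughput samples (t/s) from one pp 512 test, taken from
# the "samples_ts" field of the JSON example output in this document.
samples_ts = [6863.1, 7147.55, 7136.37, 7115.79, 7237.21]

avg_ts = statistics.mean(samples_ts)      # corresponds to avg_ts
stddev_ts = statistics.stdev(samples_ts)  # sample stddev, corresponds to stddev_ts

print(f"{avg_ts:.2f} ± {stddev_ts:.2f} t/s")  # → 7100.00 ± 140.34 t/s
```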

Using the -d <n> option, each test can be run at a specified context depth, prefilling the KV cache with <n> tokens.

For a description of the other options, see the main example.

## Examples

### Text generation with different models

```sh
$ ./llama-bench -m models/7B/ggml-model-q4_0.gguf -m models/13B/ggml-model-q4_0.gguf -p 0 -n 128,256,512
```

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 132.19 ± 0.55 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 256 | 129.37 ± 0.54 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 512 | 123.83 ± 0.25 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 128 | 82.17 ± 0.31 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 256 | 80.74 ± 0.23 |
| llama 13B mostly Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | tg 512 | 78.08 ± 0.07 |

### Prompt processing with different batch sizes

```sh
$ ./llama-bench -n 0 -p 1024 -b 128,256,512,1024
```

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | ---: | ---: | --- | --: | --: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 128 | pp 1024 | 1436.51 ± 3.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 1932.43 ± 23.48 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 2254.45 ± 15.59 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 1024 | pp 1024 | 2498.61 ± 13.58 |

### Different numbers of threads

```sh
$ ./llama-bench -n 0 -n 16 -p 64 -t 1,2,4,8,16,32
```

| model | size | params | backend | threads | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | pp 64 | 6.17 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 1 | tg 16 | 4.05 ± 0.02 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | pp 64 | 12.31 ± 0.13 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 2 | tg 16 | 7.80 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | pp 64 | 23.18 ± 0.06 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 4 | tg 16 | 12.22 ± 0.07 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | pp 64 | 32.29 ± 1.21 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 8 | tg 16 | 16.71 ± 0.66 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | pp 64 | 33.52 ± 0.03 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 16 | tg 16 | 15.32 ± 0.05 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | pp 64 | 59.00 ± 1.11 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CPU | 32 | tg 16 | 16.41 ± 0.79 |

### Different numbers of layers offloaded to the GPU

```sh
$ ./llama-bench -ngl 10,20,30,31,32,33,34,35
```

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | pp 512 | 373.36 ± 2.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 10 | tg 128 | 13.45 ± 0.93 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | pp 512 | 472.65 ± 1.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 20 | tg 128 | 21.36 ± 1.94 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | pp 512 | 631.87 ± 11.25 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 30 | tg 128 | 40.04 ± 1.82 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | pp 512 | 657.89 ± 5.08 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 31 | tg 128 | 48.19 ± 0.81 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | pp 512 | 688.26 ± 3.29 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 32 | tg 128 | 54.78 ± 0.65 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | pp 512 | 704.27 ± 2.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 33 | tg 128 | 60.62 ± 1.76 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | pp 512 | 881.34 ± 5.40 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 34 | tg 128 | 71.76 ± 0.23 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | pp 512 | 2400.01 ± 7.72 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 35 | tg 128 | 131.66 ± 0.49 |

### Different prefilled context

```sh
$ ./llama-bench -d 0,512
```

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 | 7340.20 ± 23.45 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 | 120.60 ± 0.59 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | pp512 @ d512 | 6425.91 ± 18.88 |
| qwen2 7B Q4_K - Medium | 4.36 GiB | 7.62 B | CUDA | 99 | tg128 @ d512 | 116.71 ± 0.60 |

## Output formats

By default, llama-bench outputs the results in markdown format. The results can be output in other formats by using the -o option.

### Markdown

```sh
$ ./llama-bench -o md
```

| model | size | params | backend | ngl | test | t/s |
| --- | ---: | ---: | --- | --: | --- | ---: |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | pp 512 | 2368.80 ± 93.24 |
| llama 7B mostly Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | tg 128 | 131.42 ± 0.59 |

### CSV

```sh
$ ./llama-bench -o csv
```

```csv
build_commit,build_number,cpu_info,gpu_info,backends,model_filename,model_type,model_size,model_n_params,n_batch,n_ubatch,n_threads,cpu_mask,cpu_strict,poll,type_k,type_v,n_gpu_layers,split_mode,main_gpu,no_kv_offload,flash_attn,tensor_split,use_mmap,embeddings,n_prompt,n_gen,n_depth,test_time,avg_ns,stddev_ns,avg_ts,stddev_ts
"8cf427ff","5163","AMD Ryzen 7 7800X3D 8-Core Processor","NVIDIA GeForce RTX 4080","CUDA","models/Qwen2.5-7B-Instruct-Q4_K_M.gguf","qwen2 7B Q4_K - Medium","4677120000","7615616512","2048","512","8","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","512","0","0","2025-04-24T11:57:09Z","70285660","982040","7285.676949","100.064434"
"8cf427ff","5163","AMD Ryzen 7 7800X3D 8-Core Processor","NVIDIA GeForce RTX 4080","CUDA","models/Qwen2.5-7B-Instruct-Q4_K_M.gguf","qwen2 7B Q4_K - Medium","4677120000","7615616512","2048","512","8","0x0","0","50","f16","f16","99","layer","0","0","0","0.00","1","0","0","128","0","2025-04-24T11:57:10Z","1067431600","3834831","119.915244","0.430617"
```

### JSON

```sh
$ ./llama-bench -o json
```

```json
[
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor",
    "gpu_info": "NVIDIA GeForce RTX 4080",
    "backends": "CUDA",
    "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "model_type": "qwen2 7B Q4_K - Medium",
    "model_size": 4677120000,
    "model_n_params": 7615616512,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 8,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 512,
    "n_gen": 0,
    "n_depth": 0,
    "test_time": "2025-04-24T11:58:50Z",
    "avg_ns": 72135640,
    "stddev_ns": 1453752,
    "avg_ts": 7100.002165,
    "stddev_ts": 140.341520,
    "samples_ns": [ 74601900, 71632900, 71745200, 71952700, 70745500 ],
    "samples_ts": [ 6863.1, 7147.55, 7136.37, 7115.79, 7237.21 ]
  },
  {
    "build_commit": "8cf427ff",
    "build_number": 5163,
    "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor",
    "gpu_info": "NVIDIA GeForce RTX 4080",
    "backends": "CUDA",
    "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf",
    "model_type": "qwen2 7B Q4_K - Medium",
    "model_size": 4677120000,
    "model_n_params": 7615616512,
    "n_batch": 2048,
    "n_ubatch": 512,
    "n_threads": 8,
    "cpu_mask": "0x0",
    "cpu_strict": false,
    "poll": 50,
    "type_k": "f16",
    "type_v": "f16",
    "n_gpu_layers": 99,
    "split_mode": "layer",
    "main_gpu": 0,
    "no_kv_offload": false,
    "flash_attn": false,
    "tensor_split": "0.00",
    "use_mmap": true,
    "embeddings": false,
    "n_prompt": 0,
    "n_gen": 128,
    "n_depth": 0,
    "test_time": "2025-04-24T11:58:51Z",
    "avg_ns": 1076767880,
    "stddev_ns": 9449585,
    "avg_ts": 118.881588,
    "stddev_ts": 1.041811,
    "samples_ns": [ 1075361300, 1065089400, 1071761200, 1081934900, 1089692600 ],
    "samples_ts": [ 119.03, 120.178, 119.43, 118.307, 117.464 ]
  }
]
```

### JSONL

```sh
$ ./llama-bench -o jsonl
```

```json
{"build_commit": "8cf427ff", "build_number": 5163, "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor", "gpu_info": "NVIDIA GeForce RTX 4080", "backends": "CUDA", "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf", "model_type": "qwen2 7B Q4_K - Medium", "model_size": 4677120000, "model_n_params": 7615616512, "n_batch": 2048, "n_ubatch": 512, "n_threads": 8, "cpu_mask": "0x0", "cpu_strict": false, "poll": 50, "type_k": "f16", "type_v": "f16", "n_gpu_layers": 99, "split_mode": "layer", "main_gpu": 0, "no_kv_offload": false, "flash_attn": false, "tensor_split": "0.00", "use_mmap": true, "embeddings": false, "n_prompt": 512, "n_gen": 0, "n_depth": 0, "test_time": "2025-04-24T11:59:33Z", "avg_ns": 70497220, "stddev_ns": 883196, "avg_ts": 7263.609157, "stddev_ts": 90.940578, "samples_ns": [ 71551000, 71222800, 70364100, 69439100, 69909100 ],"samples_ts": [ 7155.74, 7188.71, 7276.44, 7373.37, 7323.8 ]}
{"build_commit": "8cf427ff", "build_number": 5163, "cpu_info": "AMD Ryzen 7 7800X3D 8-Core Processor", "gpu_info": "NVIDIA GeForce RTX 4080", "backends": "CUDA", "model_filename": "models/Qwen2.5-7B-Instruct-Q4_K_M.gguf", "model_type": "qwen2 7B Q4_K - Medium", "model_size": 4677120000, "model_n_params": 7615616512, "n_batch": 2048, "n_ubatch": 512, "n_threads": 8, "cpu_mask": "0x0", "cpu_strict": false, "poll": 50, "type_k": "f16", "type_v": "f16", "n_gpu_layers": 99, "split_mode": "layer", "main_gpu": 0, "no_kv_offload": false, "flash_attn": false, "tensor_split": "0.00", "use_mmap": true, "embeddings": false, "n_prompt": 0, "n_gen": 128, "n_depth": 0, "test_time": "2025-04-24T11:59:33Z", "avg_ns": 1068078400, "stddev_ns": 6279455, "avg_ts": 119.844681, "stddev_ts": 0.699739, "samples_ns": [ 1066331700, 1064864900, 1079042600, 1063328400, 1066824400 ],"samples_ts": [ 120.038, 120.203, 118.624, 120.377, 119.982 ]}
```

### SQL

SQL output is suitable for importing into a SQLite database. The output can be piped into the sqlite3 command line tool to add the results to a database.
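For instance, `./llama-bench -o sql | sqlite3 llama-bench.sqlite` creates the table and inserts the rows in one step. The same flow can be replayed with Python's sqlite3 module; this sketch uses an abbreviated version of the schema and values from the statements below:

```python
import sqlite3

# Replay an abbreviated form of llama-bench's SQL output (full schema shown
# in this document) into an in-memory database, then query it.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE IF NOT EXISTS test (
  model_type TEXT, n_prompt INTEGER, n_gen INTEGER, avg_ts REAL
);
INSERT INTO test VALUES ('qwen2 7B Q4_K - Medium', 512, 0, 7324.546977);
INSERT INTO test VALUES ('qwen2 7B Q4_K - Medium', 0, 128, 120.346696);
""")
rows = conn.execute(
    "SELECT model_type, n_prompt, n_gen, avg_ts FROM test ORDER BY avg_ts DESC"
).fetchall()
for model, n_prompt, n_gen, avg_ts in rows:
    print(f"{model}: pp={n_prompt} tg={n_gen} {avg_ts:.2f} t/s")
```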

```sh
$ ./llama-bench -o sql
```

```sql
CREATE TABLE IF NOT EXISTS test (
  build_commit TEXT,
  build_number INTEGER,
  cpu_info TEXT,
  gpu_info TEXT,
  backends TEXT,
  model_filename TEXT,
  model_type TEXT,
  model_size INTEGER,
  model_n_params INTEGER,
  n_batch INTEGER,
  n_ubatch INTEGER,
  n_threads INTEGER,
  cpu_mask TEXT,
  cpu_strict INTEGER,
  poll INTEGER,
  type_k TEXT,
  type_v TEXT,
  n_gpu_layers INTEGER,
  split_mode TEXT,
  main_gpu INTEGER,
  no_kv_offload INTEGER,
  flash_attn INTEGER,
  tensor_split TEXT,
  use_mmap INTEGER,
  embeddings INTEGER,
  n_prompt INTEGER,
  n_gen INTEGER,
  n_depth INTEGER,
  test_time TEXT,
  avg_ns INTEGER,
  stddev_ns INTEGER,
  avg_ts REAL,
  stddev_ts REAL
);

INSERT INTO test (build_commit, build_number, cpu_info, gpu_info, backends, model_filename, model_type, model_size, model_n_params, n_batch, n_ubatch, n_threads, cpu_mask, cpu_strict, poll, type_k, type_v, n_gpu_layers, split_mode, main_gpu, no_kv_offload, flash_attn, tensor_split, use_mmap, embeddings, n_prompt, n_gen, n_depth, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('8cf427ff', '5163', 'AMD Ryzen 7 7800X3D 8-Core Processor', 'NVIDIA GeForce RTX 4080', 'CUDA', 'models/Qwen2.5-7B-Instruct-Q4_K_M.gguf', 'qwen2 7B Q4_K - Medium', '4677120000', '7615616512', '2048', '512', '8', '0x0', '0', '50', 'f16', 'f16', '99', 'layer', '0', '0', '0', '0.00', '1', '0', '512', '0', '0', '2025-04-24T12:00:08Z', '69905000', '519516', '7324.546977', '54.032613');
INSERT INTO test (build_commit, build_number, cpu_info, gpu_info, backends, model_filename, model_type, model_size, model_n_params, n_batch, n_ubatch, n_threads, cpu_mask, cpu_strict, poll, type_k, type_v, n_gpu_layers, split_mode, main_gpu, no_kv_offload, flash_attn, tensor_split, use_mmap, embeddings, n_prompt, n_gen, n_depth, test_time, avg_ns, stddev_ns, avg_ts, stddev_ts) VALUES ('8cf427ff', '5163', 'AMD Ryzen 7 7800X3D 8-Core Processor', 'NVIDIA GeForce RTX 4080', 'CUDA', 'models/Qwen2.5-7B-Instruct-Q4_K_M.gguf', 'qwen2 7B Q4_K - Medium', '4677120000', '7615616512', '2048', '512', '8', '0x0', '0', '50', 'f16', 'f16', '99', 'layer', '0', '0', '0', '0.00', '1', '0', '0', '128', '0', '2025-04-24T12:00:09Z', '1063608780', '4464130', '120.346696', '0.504647');
```