# llama.cpp/example/batched-bench
Benchmark the batched decoding performance of llama.cpp

## Usage

There are 2 modes of operation:

- prompt not shared - each batch has a separate prompt of size PP (i.e. N_KV = B*(PP + TG))
- prompt is shared - there is a common prompt of size PP used by all batches (i.e. N_KV = PP + B*TG)
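The two N_KV formulas above can be sketched as a small helper (an illustration of the arithmetic only, not part of batched-bench itself):

```python
def n_kv_required(pp: int, tg: int, b: int, shared: bool) -> int:
    """KV cache slots needed for B batches, PP prompt tokens, TG generated tokens."""
    if shared:
        # one common prompt, plus per-batch generated tokens
        return pp + b * tg
    # each batch carries its own prompt and its own generated tokens
    return b * (pp + tg)

# prompt not shared: the full per-batch context scales with B
assert n_kv_required(128, 128, 8, shared=False) == 2048
# prompt shared: only the generated tokens scale with B
assert n_kv_required(128, 128, 8, shared=True) == 1152
```

The non-shared case with B=8, PP=TG=128 gives N_KV = 2048, matching the corresponding row in the sample results.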
```bash
./batched-bench MODEL_PATH [N_KV_MAX] [IS_PP_SHARED] [NGL] [MMQ] <PP> <TG> <PL>

# LLaMA 7B, F16, N_KV_MAX = 16384 (8GB), prompt not shared
./batched-bench ./models/llama-7b/ggml-model-f16.gguf 16384 0 99

# LLaMA 7B, Q8_0, N_KV_MAX = 16384 (8GB), prompt is shared
./batched-bench ./models/llama-7b/ggml-model-q8_0.gguf 16384 1 99

# custom set of batches
./batched-bench ./models/llama-7b/ggml-model-q8_0.gguf 2048 0 999 0 128,256,512 128,256 1,2,4,8,16,32
```

## Sample results

- PP - prompt tokens per batch
- TG - generated tokens per batch
- B - number of batches
- N_KV - required KV cache size
- T_PP - prompt processing time (i.e. time to first token)
- S_PP - prompt processing speed ((B*PP)/T_PP, or PP/T_PP when the prompt is shared)
- T_TG - time to generate all batches
- S_TG - text generation speed ((B*TG)/T_TG)
- T - total time
- S - total speed (i.e. all tokens / total time)
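Given the definitions above, the speed columns can be recomputed from the timing columns. A minimal sketch (hypothetical helper, non-shared prompt case), checked against the B=8, PP=TG=128 row of the table:

```python
def throughput(pp, tg, b, t_pp, t_tg, t_total):
    """Recompute S_PP, S_TG, and S from the timing columns (prompt not shared)."""
    n_pp = b * pp   # prompt tokens processed across all batches
    n_tg = b * tg   # tokens generated across all batches
    s_pp = n_pp / t_pp
    s_tg = n_tg / t_tg
    s = (n_pp + n_tg) / t_total
    return s_pp, s_tg, s

# B=8, PP=TG=128: the printed timings reproduce the table's speeds
# up to the rounding of the timing columns
s_pp, s_tg, s = throughput(128, 128, 8, t_pp=0.751, t_tg=7.344, t_total=8.095)
assert abs(s_tg - 139.43) < 0.01
assert abs(s - 252.99) < 0.01
assert abs(s_pp - 1363.27) < 1.0
```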
| PP  | TG  | B  | N_KV  | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s    | S t/s  |
|-----|-----|----|-------|--------|----------|--------|----------|--------|--------|
| 128 | 128 | 1  | 256   | 0.108  | 1186.64  | 3.079  | 41.57    | 3.187  | 80.32  |
| 128 | 128 | 2  | 512   | 0.198  | 1295.19  | 5.029  | 50.90    | 5.227  | 97.95  |
| 128 | 128 | 4  | 1024  | 0.373  | 1373.96  | 6.878  | 74.44    | 7.251  | 141.23 |
| 128 | 128 | 8  | 2048  | 0.751  | 1363.27  | 7.344  | 139.43   | 8.095  | 252.99 |
| 128 | 128 | 16 | 4096  | 1.570  | 1304.68  | 8.455  | 242.23   | 10.024 | 408.60 |
| 128 | 128 | 32 | 8192  | 3.408  | 1201.73  | 8.801  | 465.40   | 12.209 | 670.96 |
| 128 | 256 | 1  | 384   | 0.107  | 1196.70  | 6.329  | 40.45    | 6.436  | 59.67  |
| 128 | 256 | 2  | 768   | 0.194  | 1317.45  | 10.239 | 50.00    | 10.433 | 73.61  |
| 128 | 256 | 4  | 1536  | 0.366  | 1399.03  | 13.960 | 73.35    | 14.326 | 107.22 |
| 128 | 256 | 8  | 3072  | 0.751  | 1363.92  | 15.110 | 135.54   | 15.861 | 193.69 |
| 128 | 256 | 16 | 6144  | 1.569  | 1304.93  | 18.073 | 226.64   | 19.642 | 312.80 |
| 128 | 256 | 32 | 12288 | 3.409  | 1201.35  | 19.223 | 426.15   | 22.633 | 542.93 |