skills/mlops/evaluation/lm-evaluation-harness/references/distributed-eval.md
Guide to running evaluation across multiple GPUs using data parallelism and tensor/pipeline parallelism.
Distributed evaluation speeds up benchmarking by:
When to use:
HuggingFace models (hf)
Each GPU loads a full copy of the model and processes a subset of the evaluation data.
Single Node (8 GPUs):
accelerate launch --multi_gpu --num_processes 8 \
-m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf,dtype=bfloat16 \
--tasks mmlu,gsm8k,hellaswag \
--batch_size 16
Speedup: Near-linear (8 GPUs = ~8× faster)
Memory: Each GPU needs full model (7B model ≈ 14GB × 8 = 112GB total)
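To make "processes a subset of evaluation data" concrete, here is a minimal sketch of a strided split across ranks. It is an illustration only: the helper `shard_for_rank` is hypothetical and is not the harness's own code, and the integers stand in for evaluation examples.

```python
from itertools import islice


def shard_for_rank(docs, rank, world_size):
    """Return the examples one rank would evaluate under a strided split."""
    # Rank r takes items r, r + world_size, r + 2 * world_size, ...
    return list(islice(docs, rank, None, world_size))


docs = list(range(10))  # stand-in for 10 evaluation examples
shards = [shard_for_rank(docs, rank, world_size=4) for rank in range(4)]

for rank, shard in enumerate(shards):
    print(f"rank {rank}: {shard}")
```

Every example lands on exactly one rank, and shard sizes differ by at most one, which is why the speedup is close to linear when examples cost roughly the same.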
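Split model weights across GPUs for models too large for a single GPU. With parallelize=True the hf backend places different layers on different GPUs, so the model fits in memory but the GPUs work one after another rather than simultaneously.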
Without accelerate launcher:
lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
parallelize=True,\
dtype=bfloat16 \
--tasks mmlu,gsm8k \
--batch_size 8
With 8 GPUs: 70B model (140GB) / 8 = 17.5GB per GPU ✅
Advanced sharding:
lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
parallelize=True,\
device_map_option=auto,\
max_memory_per_gpu=40GB,\
max_cpu_memory=100GB,\
dtype=bfloat16 \
--tasks mmlu
Options:
device_map_option: "auto" (default), "balanced", "balanced_low_0"
max_memory_per_gpu: Max memory per GPU (e.g., "40GB")
max_cpu_memory: Max CPU memory for offloading
offload_folder: Disk offloading directory
Use both data parallelism and model sharding for very large models.
Example: 70B model on 16 GPUs (2 copies, 8 GPUs each):
accelerate launch --multi_gpu --num_processes 2 \
-m lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
parallelize=True,\
dtype=bfloat16 \
--tasks mmlu \
--batch_size 8
Result: 2× speedup from data parallelism, 70B model fits via tensor parallelism
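The relationship between the GPU pool and `--num_processes` in the combined setup can be sketched as follows. The helper name `hf_replica_layout` and its return keys are hypothetical; the sketch only encodes the arithmetic from the example above (16 GPUs, 8 per copy, 2 copies).

```python
def hf_replica_layout(total_gpus: int, gpus_per_copy: int) -> dict:
    """Split a GPU pool into model copies that are each sharded over several GPUs."""
    if total_gpus <= 0 or gpus_per_copy <= 0:
        raise ValueError("GPU counts must be positive")
    if total_gpus % gpus_per_copy != 0:
        raise ValueError(
            f"{total_gpus} GPUs cannot be split evenly into copies of {gpus_per_copy} GPUs"
        )
    copies = total_gpus // gpus_per_copy
    return {
        "num_processes": copies,         # value passed to accelerate launch
        "gpus_per_copy": gpus_per_copy,  # GPUs that parallelize=True shards across
        "ideal_speedup": copies,         # upper bound from data parallelism alone
    }


layout = hf_replica_layout(total_gpus=16, gpus_per_copy=8)
print(layout)
```

Note that `--num_processes` counts model copies here, not GPUs.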
Using accelerate config
Create ~/.cache/huggingface/accelerate/default_config.yaml:
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
num_machines: 1
num_processes: 8
gpu_ids: all
mixed_precision: bf16
Then run:
accelerate launch -m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu
vLLM models (vllm)
vLLM provides highly optimized distributed inference.
Single Node (4 GPUs):
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=4,\
dtype=auto,\
gpu_memory_utilization=0.9 \
--tasks mmlu,gsm8k \
--batch_size auto
Memory: 70B model split across 4 GPUs = ~35GB per GPU
Multiple model replicas:
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-7b-hf,\
data_parallel_size=4,\
dtype=auto,\
gpu_memory_utilization=0.8 \
--tasks hellaswag,arc_challenge \
--batch_size auto
Result: 4 model replicas = 4× throughput
Example: 8 GPUs = 4 TP × 2 DP:
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=4,\
data_parallel_size=2,\
dtype=auto,\
gpu_memory_utilization=0.85 \
--tasks mmlu \
--batch_size auto
Result: 70B model fits (TP=4), 2× speedup (DP=2)
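A small sketch of how the TP × DP split could be derived from a GPU count and turned into a `--model_args` string. The function `vllm_model_args` is a hypothetical helper, not part of lm-evaluation-harness or vLLM; only the argument names (`tensor_parallel_size`, `data_parallel_size`, and the others shown) come from the commands above.

```python
def vllm_model_args(pretrained, total_gpus, tensor_parallel_size, **extra):
    """Build a --model_args string where TP x DP uses exactly total_gpus GPUs."""
    if total_gpus % tensor_parallel_size != 0:
        raise ValueError("total_gpus must be a multiple of tensor_parallel_size")
    data_parallel_size = total_gpus // tensor_parallel_size

    args = {"pretrained": pretrained, "tensor_parallel_size": tensor_parallel_size}
    if data_parallel_size > 1:
        # Leftover GPUs become additional model replicas.
        args["data_parallel_size"] = data_parallel_size
    args.update(extra)
    return ",".join(f"{key}={value}" for key, value in args.items())


model_args = vllm_model_args(
    "meta-llama/Llama-2-70b-hf",
    total_gpus=8,
    tensor_parallel_size=4,
    dtype="auto",
    gpu_memory_utilization=0.85,
)
print(model_args)
```

The printed string can be passed directly after `--model_args`; choosing the smallest TP that fits the model leaves the most GPUs for replicas.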
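vLLM doesn't natively support multi-node. Use Ray: start a head process on one node, join the other nodes to it, then run the evaluation from the head node.
# Start the Ray head on the first node
ray start --head --port=6379
# On every other node, join the cluster (HEAD_IP is the head node's address)
ray start --address="$HEAD_IP:6379"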
# Run evaluation
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=8,\
dtype=auto \
--tasks mmlu
NeMo models (nemo_lm)
8 replicas on 8 GPUs:
torchrun --nproc-per-node=8 --no-python \
lm_eval --model nemo_lm \
--model_args \
path=/path/to/model.nemo,\
devices=8 \
--tasks hellaswag,arc_challenge \
--batch_size 32
Speedup: Near-linear (8× faster)
4-way tensor parallelism:
torchrun --nproc-per-node=4 --no-python \
lm_eval --model nemo_lm \
--model_args \
path=/path/to/70b_model.nemo,\
devices=4,\
tensor_model_parallel_size=4 \
--tasks mmlu,gsm8k \
--batch_size 16
2 TP × 2 PP on 4 GPUs:
torchrun --nproc-per-node=4 --no-python \
lm_eval --model nemo_lm \
--model_args \
path=/path/to/model.nemo,\
devices=4,\
tensor_model_parallel_size=2,\
pipeline_model_parallel_size=2 \
--tasks mmlu \
--batch_size 8
Constraint: devices = TP × PP
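The constraint can be checked before launching a job. This is a sketch under the assumption that a run with TP = PP = 1 is the data-replica case shown in the first NeMo example; `nemo_layout` is a hypothetical helper, not a NeMo or harness function.

```python
def nemo_layout(devices: int, tensor_parallel: int = 1, pipeline_parallel: int = 1) -> str:
    """Classify a NeMo launch and enforce devices = TP x PP for model parallelism."""
    model_parallel = tensor_parallel * pipeline_parallel
    if model_parallel == 1:
        # No model parallelism requested: every device holds a full replica.
        return "data-parallel"
    if devices != model_parallel:
        raise ValueError(
            f"devices={devices} must equal TP x PP = "
            f"{tensor_parallel} x {pipeline_parallel} = {model_parallel}"
        )
    return "model-parallel"


print(nemo_layout(devices=8))
print(nemo_layout(devices=4, tensor_parallel=4))
print(nemo_layout(devices=4, tensor_parallel=2, pipeline_parallel=2))
```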
Currently not supported by lm-evaluation-harness.
SGLang models (sglang)
lm_eval --model sglang \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tp_size=4,\
dtype=auto \
--tasks gsm8k \
--batch_size auto
Note: SGLang is deprecating data parallelism. Use tensor parallelism instead.
lm_eval --model sglang \
--model_args \
pretrained=meta-llama/Llama-2-7b-hf,\
dp_size=4,\
dtype=auto \
--tasks mmlu
| Method | GPUs | Time | Memory/GPU | Notes |
|---|---|---|---|---|
| HF (no parallel) | 1 | 8 hours | 140GB (OOM) | Won't fit |
| HF (TP=8) | 8 | 2 hours | 17.5GB | Slower, fits |
| HF (DP=8) | 8 | 1 hour | 140GB (OOM) | Won't fit |
| vLLM (TP=4) | 4 | 30 min | 35GB | Fast! |
| vLLM (TP=4, DP=2) | 8 | 15 min | 35GB | Fastest |
| Method | GPUs | Time | Speedup |
|---|---|---|---|
| HF (single) | 1 | 4 hours | 1× |
| HF (DP=4) | 4 | 1 hour | 4× |
| HF (DP=8) | 8 | 30 min | 8× |
| vLLM (DP=8) | 8 | 15 min | 16× |
Takeaway: vLLM is significantly faster than HuggingFace for inference.
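The speedup column can be reproduced from the times in the table, and dividing by the GPU count gives a per-GPU efficiency. The sketch below uses the table's own figures; the function name and the efficiency metric are additions for illustration.

```python
def speedup_and_efficiency(baseline_minutes: float, minutes: float, gpus: int):
    """Return (speedup over the baseline, speedup per GPU)."""
    speedup = baseline_minutes / minutes
    return speedup, speedup / gpus


BASELINE_MINUTES = 240  # HF (single), 4 hours
rows = [
    ("HF (DP=4)", 4, 60),
    ("HF (DP=8)", 8, 30),
    ("vLLM (DP=8)", 8, 15),
]

results = {}
for method, gpus, minutes in rows:
    results[method] = speedup_and_efficiency(BASELINE_MINUTES, minutes, gpus)
    speedup, efficiency = results[method]
    print(f"{method}: {speedup:.0f}x speedup, {efficiency:.1f}x per GPU")
```

An efficiency above 1 for vLLM does not mean super-linear scaling; the baseline is single-GPU HF, so the extra factor reflects the faster inference engine.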
Model fits on single GPU?
├─ YES: Use data parallelism
│   ├─ HF: accelerate launch --multi_gpu --num_processes N
│   └─ vLLM: data_parallel_size=N (fastest)
│
└─ NO: Use tensor/pipeline parallelism
    ├─ Model < 70B:
    │   └─ vLLM: tensor_parallel_size=4
    ├─ Model 70-175B:
    │   ├─ vLLM: tensor_parallel_size=8
    │   └─ Or HF: parallelize=True
    └─ Model > 175B:
        └─ Contact framework authors
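The same decision tree written as a function, which can be handy in launch scripts. `recommend_strategy` is a hypothetical helper; the thresholds and recommendations are copied from the tree, and treating exactly 70B and 175B as part of the middle branch is an assumption.

```python
def recommend_strategy(fits_on_single_gpu: bool, params_billions: float, num_gpus: int) -> str:
    """Map the decision tree above to a one-line recommendation."""
    if fits_on_single_gpu:
        return (
            f"data parallelism: accelerate launch --multi_gpu --num_processes {num_gpus} (HF) "
            f"or data_parallel_size={num_gpus} (vLLM, fastest)"
        )
    if params_billions < 70:
        return "vLLM: tensor_parallel_size=4"
    if params_billions <= 175:
        return "vLLM: tensor_parallel_size=8, or HF: parallelize=True"
    return "contact framework authors"


print(recommend_strategy(True, 7, num_gpus=8))
print(recommend_strategy(False, 70, num_gpus=8))
```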
Rule of thumb:
Memory (GB) = Parameters (B) × Precision (bytes) × 1.2 (overhead)
Examples:
With tensor parallelism:
Memory per GPU = Total Memory / TP
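A worked version of the two formulas above. The byte sizes per dtype are standard; the function names are hypothetical, and the 1.2 overhead factor is the rule of thumb stated above rather than a measured value. Note that figures elsewhere in this guide such as 14GB and 140GB are weights only, without the overhead factor.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def estimate_memory_gb(params_billions: float, dtype: str = "bfloat16", overhead: float = 1.2) -> float:
    """Memory (GB) = Parameters (B) x Precision (bytes) x overhead."""
    return params_billions * BYTES_PER_PARAM[dtype] * overhead


def memory_per_gpu_gb(params_billions: float, tensor_parallel: int, dtype: str = "bfloat16") -> float:
    """Memory per GPU = Total Memory / TP."""
    return estimate_memory_gb(params_billions, dtype) / tensor_parallel


for size in (7, 13, 70):
    print(f"{size}B in bfloat16: {estimate_memory_gb(size):.1f} GB total")

print(f"70B with TP=4: {memory_per_gpu_gb(70, 4):.1f} GB per GPU")
print(f"70B with TP=8: {memory_per_gpu_gb(70, 8):.1f} GB per GPU")
```

Compare the per-GPU figure against the GPU's capacity with some headroom for activations and KV cache, which grow with batch size and sequence length.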
Submit job:
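#!/bin/bash
#SBATCH --nodes=4
#SBATCH --gpus-per-node=8
#SBATCH --ntasks-per-node=1
# The first node in the allocation acts as the rendezvous host
MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR
# SLURM_NODEID is only set inside each srun task, so expand it there via bash -c
srun bash -c 'accelerate launch --multi_gpu \
  --num_machines "$SLURM_NNODES" \
  --num_processes $((SLURM_NNODES * 8)) \
  --machine_rank "$SLURM_NODEID" \
  --main_process_ip "$MASTER_ADDR" \
  --main_process_port 29500 \
  -m lm_eval --model hf \
  --model_args pretrained=meta-llama/Llama-2-7b-hf \
  --tasks mmlu,gsm8k,hellaswag \
  --batch_size 16'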
Submit:
sbatch eval_job.sh
On each node, run:
accelerate launch \
--multi_gpu \
--num_machines 4 \
--num_processes 32 \
--main_process_ip $MASTER_IP \
--main_process_port 29500 \
--machine_rank $NODE_RANK \
-m lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu
Environment variables:
MASTER_IP: IP of rank 0 node
NODE_RANK: 0, 1, 2, 3 for each node
Test on a small sample first:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-70b-hf,parallelize=True \
--tasks mmlu \
--limit 100 # Just 100 samples
# Terminal 1: Run evaluation
lm_eval --model hf ...
# Terminal 2: Monitor
watch -n 1 nvidia-smi
Look for:
# Auto batch size (recommended)
--batch_size auto
# Or tune manually
--batch_size 16 # Start here
--batch_size 32 # Increase if memory allows
--model_args dtype=bfloat16 # Faster, less memory
For multi-GPU runs, check how the GPUs are interconnected:
# Prints the GPU/NIC topology matrix; look for NVLink (NV#) between GPUs and InfiniBand NICs for multi-node runs
nvidia-smi topo -m
Solutions:
Increase tensor parallelism:
--model_args tensor_parallel_size=8 # Was 4
Reduce batch size:
--batch_size 4 # Was 16
Lower precision:
--model_args load_in_8bit=True # 8-bit quantization for the hf backend (requires bitsandbytes)
Check:
nvidia-smi
python -c "import torch; print(torch.cuda.nccl.version())"
Fix:
export NCCL_DEBUG=INFO # Enable debug logging
export NCCL_IB_DISABLE=0 # Use InfiniBand if available
Possible causes:
Profile:
lm_eval --model hf \
--model_args pretrained=meta-llama/Llama-2-7b-hf \
--tasks mmlu \
--limit 100 \
--output_path results/ \
--log_samples # Saves per-sample outputs; requires --output_path (results/ is an example path)
Symptom: GPU 0 at 100%, others at 50%
Solution: Use device_map_option=balanced:
--model_args parallelize=True,device_map_option=balanced
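The imbalance symptom can be detected automatically from `nvidia-smi` query output. This is a minimal sketch: the helper names and the 30-point utilization gap are assumptions, and the sample text stands in for real output so the example runs without a GPU.

```python
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=index,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_stats(csv_text: str) -> list:
    """Parse 'index, util, mem_used, mem_total' rows into dictionaries."""
    stats = []
    for line in csv_text.strip().splitlines():
        index, util, used, total = (field.strip() for field in line.split(","))
        stats.append(
            {"index": int(index), "util": int(util),
             "mem_used_mib": int(used), "mem_total_mib": int(total)}
        )
    return stats


def is_imbalanced(stats: list, max_gap: int = 30) -> bool:
    """Flag the run when GPU utilization differs by more than max_gap points."""
    utils = [gpu["util"] for gpu in stats]
    return max(utils) - min(utils) > max_gap


def read_gpu_stats() -> list:
    """Query the local GPUs (needs nvidia-smi on PATH)."""
    output = subprocess.run(QUERY, capture_output=True, text=True, check=True).stdout
    return parse_gpu_stats(output)


# Stand-in for real output: GPU 0 at 100%, GPU 1 at 50%.
sample = "0, 100, 38000, 40960\n1, 50, 21000, 40960\n"
sample_stats = parse_gpu_stats(sample)
print(is_imbalanced(sample_stats))
```

On a GPU machine, `is_imbalanced(read_gpu_stats())` performs the same check against live readings; a single reading can be noisy, so sampling a few times is advisable.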
# 8 A100s, data parallel
accelerate launch --multi_gpu --num_processes 8 \
-m lm_eval --model hf \
--model_args \
pretrained=meta-llama/Llama-2-7b-hf,\
dtype=bfloat16 \
--tasks mmlu,gsm8k,hellaswag,arc_challenge \
--num_fewshot 5 \
--batch_size 32
# Time: ~30 minutes
# 8 H100s, tensor parallel
lm_eval --model vllm \
--model_args \
pretrained=meta-llama/Llama-2-70b-hf,\
tensor_parallel_size=8,\
dtype=auto,\
gpu_memory_utilization=0.9 \
--tasks mmlu,gsm8k,humaneval \
--num_fewshot 5 \
--batch_size auto
# Time: ~1 hour
Requires specialized setup - contact framework maintainers
docs/model_guide.md