
# Loading Model Weights with InstantTensor


InstantTensor accelerates loading Safetensors weights on CUDA devices through distributed loading, pipelined prefetching, and direct I/O. InstantTensor also supports GDS (GPUDirect Storage) when available. For more details, see the InstantTensor GitHub repository.
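The core idea behind pipelined prefetching is to overlap disk reads with device uploads instead of running them back to back. The sketch below illustrates that overlap with a bounded queue between a reader thread and an upload stage; it is a simplified illustration only, not InstantTensor's actual implementation, and the `read`/`upload` callables are stand-ins for real file I/O and host-to-device copies.

```python
# Minimal sketch of pipelined prefetching: one thread reads chunks while
# the main thread "uploads" them, so I/O and copy overlap.
# Illustration only -- not InstantTensor's implementation.
import queue
import threading

def load_pipelined(chunks, read, upload, depth=2):
    """Read chunks ahead of the upload stage through a bounded queue."""
    q = queue.Queue(maxsize=depth)  # bounded: caps prefetch memory
    done = object()                 # sentinel marking end of stream

    def producer():
        for c in chunks:
            q.put(read(c))          # prefetch next chunk while upload runs
        q.put(done)

    t = threading.Thread(target=producer)
    t.start()
    uploaded = []
    while (item := q.get()) is not done:
        uploaded.append(upload(item))
    t.join()
    return uploaded

# Simulated stages: "read" produces bytes, "upload" records each size.
result = load_pipelined(range(4),
                        read=lambda i: b"x" * (i + 1),
                        upload=len)
```

Because the queue is bounded, at most `depth` chunks are held in host memory at once while the next read proceeds in the background.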

## Installation

```bash
pip install instanttensor
```

## Use InstantTensor in vLLM


Add `--load-format instanttensor` as a command-line argument.

For example:

```bash
vllm serve Qwen/Qwen2.5-0.5B --load-format instanttensor
```
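The same option can be passed through vLLM's offline inference API; a minimal sketch, assuming InstantTensor is installed and reusing the model from the example above:

```python
from vllm import LLM

# Equivalent of the CLI flag: select the load format when constructing
# the engine. Requires `pip install instanttensor` and a CUDA device.
llm = LLM(model="Qwen/Qwen2.5-0.5B", load_format="instanttensor")
```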

## Benchmarks

| Model | GPU | Backend | Load Time (s) | Throughput (GB/s) | Speedup |
|---|---|---|---|---|---|
| Qwen3-30B-A3B | 1*H200 | Safetensors | 57.4 | 1.1 | 1x |
| Qwen3-30B-A3B | 1*H200 | InstantTensor | 1.77 | 35 | <span style="color: green">32.4x</span> |
| DeepSeek-R1 | 8*H200 | Safetensors | 160 | 4.3 | 1x |
| DeepSeek-R1 | 8*H200 | InstantTensor | 15.3 | 45 | <span style="color: green">10.5x</span> |

For the full benchmark results, see https://github.com/scitix/InstantTensor/blob/main/docs/benchmark.md.