# VibeVoice ASR on vLLM

Model: [microsoft/VibeVoice-ASR](https://huggingface.co/microsoft/VibeVoice-ASR)
Deploy the VibeVoice ASR model as a high-performance API service using vLLM. This plugin provides OpenAI-compatible API endpoints for speech-to-text transcription with streaming support.
- `/v1/chat/completions` endpoint with streaming support

## Using the Official vLLM Docker Image (Recommended)
```bash
git clone https://github.com/microsoft/VibeVoice.git
cd VibeVoice
```
```bash
docker run -d --gpus all --name vibevoice-vllm \
  --ipc=host \
  -p 8000:8000 \
  -e VIBEVOICE_FFMPEG_MAX_CONCURRENCY=64 \
  -e PYTORCH_ALLOC_CONF=expandable_segments:True \
  -v $(pwd):/app \
  -w /app \
  --entrypoint bash \
  vllm/vllm-openai:v0.14.1 \
  -c "python3 /app/vllm_plugin/scripts/start_server.py"
```
Follow the server logs to confirm startup:

```bash
docker logs -f vibevoice-vllm
```
Note:
- The `-d` flag runs the container in the background (detached mode)
- Use `docker stop vibevoice-vllm` to stop the service
- The model will be downloaded to the HuggingFace cache (`~/.cache/huggingface`) inside the container
Once the vLLM server is running, test it with the provided script:
```bash
# Basic transcription
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav

# With hotwords for better recognition of specific terms
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api.py /app/audio.wav --hotwords "Microsoft,VibeVoice"

# With auto-recovery from repetition loops (for long audio)
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav

# Auto-recover with hotwords
docker exec -it vibevoice-vllm python3 vllm_plugin/tests/test_api_auto_recover.py /app/audio.wav --hotwords "Microsoft,VibeVoice"
```
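The auto-recovery script's actual logic lives in the repository; as a rough illustration of what detecting a "repetition loop" in long-audio output can look like, here is a hypothetical n-gram-based check (the function name and thresholds are assumptions, not the plugin's implementation):

```python
def has_repetition_loop(text: str, ngram: int = 6, max_repeats: int = 4) -> bool:
    """Return True if any run of `ngram` words repeats more than `max_repeats`
    times back-to-back -- a typical symptom of a transcription that has
    collapsed into a loop on long audio."""
    words = text.split()
    for start in range(len(words) - ngram):
        chunk = words[start:start + ngram]
        repeats = 1
        pos = start + ngram
        # Count how many times the same n-gram immediately repeats.
        while words[pos:pos + ngram] == chunk:
            repeats += 1
            pos += ngram
        if repeats > max_repeats:
            return True
    return False
```

A recovery wrapper would typically re-request the affected segment (e.g. with a different sampling seed or temperature) whenever such a check fires.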
Note:
- The audio/video file must be inside the mounted directory (`/app` in the container). Copy your files to the VibeVoice folder before testing.
- Hotwords help improve recognition of domain-specific terms such as proper nouns, technical terms, and speaker names.
| Variable | Description | Default |
|---|---|---|
| `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` | Maximum FFmpeg processes for audio decoding | `64` |
| `PYTORCH_ALLOC_CONF` | PyTorch memory allocator config | `expandable_segments:True` |
## Performance Tuning

- Set `--gpu-memory-utilization 0.9` for maximum throughput if you have a dedicated GPU
- Increase `--max-num-seqs` for higher concurrency (requires more GPU memory)
- Tune `VIBEVOICE_FFMPEG_MAX_CONCURRENCY` based on available CPU cores

## Troubleshooting

- **"CUDA out of memory"**: reduce `--gpu-memory-utilization`, `--max-num-seqs`, or `--max-model-len`
- **"Audio decoding failed"**: verify FFmpeg is available with `ffmpeg -version`
- **"Model not found"**: check that `config.json` and the model weights are present in the HuggingFace cache
- **"Plugin not loaded"**: verify the install with `pip show vibevoice` and `pip show -f vibevoice | grep entry`
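If the plugin does not load, you can also inspect registered entry points from Python. This helper is hypothetical (the plugin's actual entry-point group name is not documented here); it simply scans every installed distribution for a substring match:

```python
from importlib import metadata


def find_entry_points(needle: str = "vibevoice") -> list[str]:
    """Return sorted 'group:name' strings for every installed entry point
    whose name or target mentions `needle` (case-insensitive)."""
    needle = needle.lower()
    found = set()
    for dist in metadata.distributions():
        for ep in dist.entry_points:
            if needle in ep.name.lower() or needle in ep.value.lower():
                found.add(f"{ep.group}:{ep.name}")
    return sorted(found)
```

An empty result for `find_entry_points("vibevoice")` inside the container would suggest the package installed without registering its vLLM plugin hooks.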