
# Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

## Quick Start

Launch the service directly with Docker Compose:

```sh
docker compose -f docker-compose.cosyvoice3.yml up
```
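
If you prefer to keep the service in the background, a detached variant using plain Docker Compose (nothing project-specific here):

```sh
# Start the service detached, then follow the logs to confirm that the
# Triton server and the LLM endpoint come up cleanly.
docker compose -f docker-compose.cosyvoice3.yml up -d
docker compose -f docker-compose.cosyvoice3.yml logs -f
```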

## Build the Docker Image

To build the image from scratch:

```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

## Run a Docker Container

```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
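
The `-v` value follows Docker's `host_path:container_path` convention, so point it at whatever host directory holds your data. Once inside the container, a quick GPU sanity check (standard NVIDIA tooling, not specific to this project):

```sh
# Should list the host GPUs; if it errors, re-check the --gpus flag and the
# NVIDIA Container Toolkit installation on the host.
nvidia-smi
```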

## Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:

```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.
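
Both bounds are inclusive, so passing the same number twice runs a single stage. For example (stage numbers taken from the list below):

```sh
# Run the full pipeline, from cloning the repository to the offline benchmark.
bash run_cosyvoice3.sh -1 5

# Run only stage 2.
bash run_cosyvoice3.sh 2 2
```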

Stages:

- Stage -1: Clones the CosyVoice repository.
- Stage 0: Downloads the Fun-CosyVoice3-0.5B-2512 model and its HuggingFace LLM checkpoint.
- Stage 1: Converts the HuggingFace LLM checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- Stage 2: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- Stage 3: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice3 LLM.
- Stage 4: Runs the gRPC benchmark client for performance testing.
- Stage 5: Runs the offline TTS inference benchmark.

## Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:

```sh
# This command runs stages 0, 1, 2, and 3.
bash run_cosyvoice3.sh 0 3
```
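
Once stage 3 finishes, both servers should be accepting requests. A quick readiness probe, assuming Triton is on its default HTTP port 8000 (adjust if the script overrides it):

```sh
# Prints 200 once the Triton server reports ready to serve inference.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```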

## Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:

```sh
bash run_cosyvoice3.sh 4 4

# You can customize parameters such as the number of tasks inside the script.
```
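
For example, to sweep the concurrency, edit the task-count variable before re-running stage 4. The variable name below is an assumption for illustration; check `run_cosyvoice3.sh` for the actual name:

```sh
# Hypothetical: num_task is a placeholder for the script's real variable name.
sed -i 's/^num_task=.*/num_task=8/' run_cosyvoice3.sh
bash run_cosyvoice3.sh 4 4
```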

The following results were obtained by decoding on a single L20 GPU.

### Streaming TTS (Concurrent Tasks = 4)

First chunk latency:

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
|---|---|---|---|---|---|
| 4 | 750.42 | 740.31 | 941.05 | 977.55 | 1002.37 |

## Benchmark with Offline Inference Mode

To benchmark offline inference mode, run stage 5:

```sh
bash run_cosyvoice3.sh 5 5
```

### Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)

RTF is the real-time factor: pipeline time divided by the duration of the synthesized audio, so lower is better and values below 1 mean faster-than-real-time synthesis.

| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF |
|---|---|---|---|---|---|
| TRTLLM | 1 | 13.21 | 5.72 | 19.48 | 0.1091 |
| TRTLLM | 2 | 8.46 | 6.02 | 14.91 | 0.0822 |
| TRTLLM | 4 | 5.07 | 5.95 | 11.43 | 0.0630 |
| TRTLLM | 8 | 2.98 | 6.11 | 9.53 | 0.0562 |
| TRTLLM | 16 | 2.12 | 6.27 | 8.83 | 0.0501 |