
# Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

## Quick Start

Launch the service directly with Docker Compose:

```sh
docker compose -f docker-compose.cosyvoice3.yml up
```
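
If you prefer to keep the service in the background, a detached variant using plain Docker Compose (nothing project-specific here):

```sh
# Start the service detached, then follow the logs to confirm that the
# Triton server and the LLM endpoint come up cleanly.
docker compose -f docker-compose.cosyvoice3.yml up -d
docker compose -f docker-compose.cosyvoice3.yml logs -f
```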

## Build the Docker Image

To build the image from scratch:

```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

## Run a Docker Container

```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
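
The `-v` value follows Docker's `host_path:container_path` convention, so point it at whatever host directory holds your data. Once inside the container, a quick GPU sanity check (standard NVIDIA tooling, not specific to this project):

```sh
# Should list the host GPUs; if it errors, re-check the --gpus flag and the
# NVIDIA Container Toolkit installation on the host.
nvidia-smi
```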

## Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:

```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```

- `<start_stage>`: The stage to start from.
- `<stop_stage>`: The stage to stop after.
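
Both bounds are inclusive, so passing the same number twice runs a single stage. For example (stage numbers taken from the list below):

```sh
# Run the full pipeline, from cloning the repository to the offline benchmark.
bash run_cosyvoice3.sh -1 5

# Run only stage 2.
bash run_cosyvoice3.sh 2 2
```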

Stages:

- Stage -1: Clones the CosyVoice repository.
- Stage 0: Downloads the Fun-CosyVoice3-0.5B-2512 model and its HuggingFace LLM checkpoint.
- Stage 1: Converts the HuggingFace LLM checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- Stage 2: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- Stage 3: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice3 LLM.
- Stage 4: Runs the gRPC benchmark client for performance testing.
- Stage 5: Runs the offline TTS inference benchmark.

## Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:

```sh
# This command runs stages 0, 1, 2, and 3.
bash run_cosyvoice3.sh 0 3
```
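
Once stage 3 finishes, both servers should be accepting requests. A quick readiness probe, assuming Triton is on its default HTTP port 8000 (adjust if the script overrides it):

```sh
# Prints 200 once the Triton server reports ready to serve inference.
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8000/v2/health/ready
```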

## Benchmark with Client-Server Mode

To benchmark the running Triton server, run stage 4:

```sh
bash run_cosyvoice3.sh 4 4

# You can customize parameters such as the number of tasks inside the script.
```
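
For example, to sweep the concurrency, edit the task-count variable before re-running stage 4. The variable name below is an assumption for illustration; check `run_cosyvoice3.sh` for the actual name:

```sh
# Hypothetical: num_task is a placeholder for the script's real variable name.
sed -i 's/^num_task=.*/num_task=8/' run_cosyvoice3.sh
bash run_cosyvoice3.sh 4 4
```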

The following results were obtained by decoding on a single L20 GPU.

### Streaming TTS (Concurrent Tasks = 4)

First chunk latency:

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
|---|---|---|---|---|---|
| 4 | 750.42 | 740.31 | 941.05 | 977.55 | 1002.37 |

## Benchmark with Offline Inference Mode

To benchmark offline inference mode, run stage 5:

```sh
bash run_cosyvoice3.sh 5 5
```

### Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)

RTF is the real-time factor: pipeline time divided by the duration of the synthesized audio, so lower is better and values below 1 mean faster-than-real-time synthesis.

| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF |
|---|---|---|---|---|---|
| TRTLLM | 1 | 13.21 | 5.72 | 19.48 | 0.1091 |
| TRTLLM | 2 | 8.46 | 6.02 | 14.91 | 0.0822 |
| TRTLLM | 4 | 5.07 | 5.95 | 11.43 | 0.0630 |
| TRTLLM | 8 | 2.98 | 6.11 | 9.53 | 0.0562 |
| TRTLLM | 16 | 2.12 | 6.27 | 8.83 | 0.0501 |