images/gpu/triton/tensorrt/README.md
This guide provides instructions for building TensorRT engine files for the Llama2-7B-Chat-HF model.
First, create a Google Cloud VM with the necessary accelerator.
Set the following environment variables for your project:
export IMAGE="common-cu128-ubuntu-2204-nvidia-570-v20251009"
export ZONE="us-central1-a"
export INSTANCE_NAME="model-prep"
export MACHINE_TYPE="g2-standard-32"
export ACCELERATOR="type=nvidia-l4,count=1"
export PROJECT="<your-project>"
Then, create the instance using the following command:
gcloud compute instances create $INSTANCE_NAME \
--zone=$ZONE \
--image=$IMAGE \
--machine-type=$MACHINE_TYPE \
--image-project=deeplearning-platform-release \
--maintenance-policy=TERMINATE \
--accelerator=$ACCELERATOR \
--metadata="install-nvidia-driver=True" \
--boot-disk-size=4TB \
--boot-disk-type=pd-ssd \
--boot-disk-device-name=boot-disk \
--no-shielded-secure-boot \
--project=$PROJECT
Connect to the newly created VM and install the required packages:
sudo apt-get update
sudo apt-get install -y ca-certificates curl gnupg neovim python3-dev
Install Docker on the VM:
Restart Docker and add your user to the docker group to run Docker commands
without sudo:
sudo systemctl restart docker
sudo usermod -a -G docker $USER
newgrp docker
In your VM, create Dockerfile.llama2-7b-chat-hf with the content of
images/gpu/triton/tensorrt/Dockerfile.llama-2-7b-chat-hf. Don't forget to
provide your own HF token.
Build the Docker image:
docker build . -f Dockerfile.llama2-7b-chat-hf -t tensorrt:llama2-7b-chat-hf
Run the Docker container:
docker run --rm -it --net host --shm-size=25g --ulimit memlock=-1 --ulimit stack=67108864 --gpus all -p 8000:8000 tensorrt:llama2-7b-chat-hf bash
Inside the container, run the following commands to generate the model checkpoint, quantize it, and build the TensorRT engine.
Convert the checkpoint: bash python convert_checkpoint.py --model_dir /llama-2-7b-chat-hf \ --output_dir /tllm_checkpoint_1gpu_tp1 \ --dtype float16 \ --tp_size 1
Quantize the model: bash python3 ../../../quantization/quantize.py --dtype=float16 --output_dir /tllm_checkpoint_1gpu_tp1 --model_dir /llama-2-7b-chat-hf --qformat=fp8 --kv_cache_dtype=fp8 --tp_size 1
Build the TensorRT engine: bash trtllm-build --checkpoint_dir /tllm_checkpoint_1gpu_tp1 \ --output_dir /engines/llama-2-7b-chat-hf/fp8/1-gpu/ \ --gemm_plugin auto \ --max_batch_size 1