docs/examples/llm/nvidia_tensorrt.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/llm/nvidia_tensorrt.ipynb" target="_parent">Open In Colab</a>
TensorRT-LLM provides an easy-to-use Python API to define large language models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
For more information, refer to the TensorRT-LLM GitHub repository: https://github.com/NVIDIA/TensorRT-LLM.
Since TensorRT-LLM is an SDK for running local models in-process, a few environment setup steps must be completed before TensorRT-LLM can be used. NVIDIA CUDA 12.2 or higher is required to run TensorRT-LLM.
In this tutorial we will show how to use the connector with a GPT-2 model. For the best experience, we recommend following the installation process described in the official TensorRT-LLM GitHub repository.
The following steps show how to set up your model with TensorRT-LLM v0.8.0 for x86_64 users.
docker run --rm --runtime=nvidia --gpus all --entrypoint /bin/bash -it nvidia/cuda:12.1.0-devel-ubuntu22.04
apt-get update && apt-get -y install python3.10 python3-pip openmpi-bin libopenmpi-dev git git-lfs wget
pip3 install tensorrt_llm==0.8.0 -U --extra-index-url https://pypi.nvidia.com
python3 -c "import tensorrt_llm"
The above command should not produce any errors.
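To confirm which version was installed, you can also print the package version (tensorrt_llm exposes __version__ in recent releases):
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
This should print 0.8.0.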
For this example we will use GPT-2. The GPT-2 model files need to be created via scripts, following the instructions in the TensorRT-LLM GPT example (examples/gpt):
git clone --branch v0.8.0 https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM/examples/gpt/ && pip install -r requirements.txt
rm -rf gpt2 && git clone https://huggingface.co/gpt2-medium gpt2
cd gpt2
rm pytorch_model.bin model.safetensors
wget -q https://huggingface.co/gpt2-medium/resolve/main/pytorch_model.bin
cd ..
python3 hf_gpt_convert.py -i gpt2 -o ./c-model/gpt2 --tensor-parallelism 1 --storage-type float16
python3 build.py --model_dir=./c-model/gpt2/1-gpu --use_gpt_attention_plugin --remove_input_padding
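Before wiring the engine into LlamaIndex, it is worth checking that the build produced an engine file. A minimal sanity check, assuming build.py wrote to its default output directory ./engine_outputs (the same path the connector example below uses):
import os
# Expect the serialized engine (gpt_float16_tp1_rank0.engine) and a config.json
print(os.listdir("./engine_outputs"))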
Install the llama-index-llms-nvidia-tensorrt package:
pip install llama-index-llms-nvidia-tensorrt
Call complete with a prompt
from llama_index.llms.nvidia_tensorrt import LocalTensorRTLLM
llm = LocalTensorRTLLM(
    model_path="./engine_outputs",  # directory containing the built TensorRT engine
    engine_name="gpt_float16_tp1_rank0.engine",  # engine file produced by build.py
    tokenizer_dir="gpt2",  # Hugging Face tokenizer directory cloned above
    max_new_tokens=40,  # cap on the number of generated tokens
)
resp = llm.complete("Who is Harry Potter?")
print(str(resp))
The expected response should look like this (truncated at max_new_tokens=40):
Harry Potter is a fictional character created by J.K. Rowling in her first novel, Harry Potter and the Philosopher's Stone. The character is a wizard who lives in the fictional town
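Like any other LlamaIndex LLM, the instance can also be registered as the global default so that downstream components such as query engines use it automatically; a minimal sketch:
from llama_index.core import Settings

# Route all subsequent LlamaIndex LLM calls through the local TensorRT engine
Settings.llm = llm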