Deprecated: This example is deprecated. Use the Olive recipes instead: https://github.com/microsoft/olive-recipes/tree/main
Please note the package versions needed for using LLaMA-2 in the requirements file that fits your scenario:

- `requirements-cpu.txt`
- `requirements-cuda.txt`
  - torch with CUDA enabled is not installed automatically. This is because torch should be installed with the CUDA version used on your machine. Please visit the [PyTorch website](https://pytorch.org/) to download the torch version that matches the CUDA version installed on your machine and satisfies the requirement listed in the file (see the example below).
- `requirements-quant.txt`
- `requirements.txt`
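For example, a typical GPU setup could look like the following sketch. The CUDA 12.1 wheel index (`cu121`) is only an assumption; substitute the index that matches the CUDA version installed on your machine.

```bash
# Install the LLaMA-2 requirements for GPU machines
pip install -r requirements-cuda.txt

# Install torch built against your local CUDA version (cu121 shown as an example)
pip install torch --index-url https://download.pytorch.org/whl/cu121
```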
There are several ways to export LLaMA-2 models (using LLaMA-2 7B as an example).
```bash
# From source:
$ git clone https://github.com/microsoft/onnxruntime
$ cd onnxruntime/onnxruntime/python/tools/transformers/
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b
```
To make this option compatible with Hugging Face's Optimum, you will need to create `config.json` and `generation_config.json` for your model and store them in the same directory as your ONNX models. For example, you can find those JSON files for LLaMA-2 7B in the model's repository on the Hugging Face Hub.
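If you prefer to generate those files programmatically instead of copying them, a minimal sketch with Hugging Face `transformers` could look like this (it assumes you have already authenticated with `huggingface-cli login` for the gated LLaMA-2 repository, and that `./llama2-7b` is your ONNX output directory):

```python
from transformers import AutoConfig, GenerationConfig

name = "meta-llama/Llama-2-7b-hf"
output_dir = "./llama2-7b"  # directory that contains your exported ONNX model(s)

# Writes config.json into output_dir
AutoConfig.from_pretrained(name).save_pretrained(output_dir)

# Writes generation_config.json into output_dir
GenerationConfig.from_pretrained(name).save_pretrained(output_dir)
```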
A second option is to follow the README instructions in the custom export of LLaMA-2 (the Microsoft custom export used by the `ort-msft` benchmarks below).

A third option is to export with Hugging Face's Optimum. Note that older Optimum versions may produce two ONNX models; the two options above produce one ONNX model, and Optimum installed from source now also produces one ONNX model.
First, log into the Hugging Face CLI in your terminal:
```bash
$ huggingface-cli login
```
Once authenticated, run the following Python code to export:
```python
from optimum.onnxruntime import ORTModelForCausalLM

name = "meta-llama/Llama-2-7b-hf"
model = ORTModelForCausalLM.from_pretrained(
    name,
    export=True,
    use_auth_token=True,
)
model.save_pretrained(name.split("/")[-1] + "-onnx")
```
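As a quick check that the export worked, you can load the saved directory back with Optimum and run generation. This is only a sketch; the prompt and `max_new_tokens` value are arbitrary examples.

```python
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

onnx_dir = "Llama-2-7b-hf-onnx"  # directory created by save_pretrained above
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf", use_auth_token=True)
model = ORTModelForCausalLM.from_pretrained(onnx_dir)

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```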
Here are some additional examples for exporting LLaMA-2.
### Export Model with Different GPU Device Ids

```bash
# From source using first GPU:
$ CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --input ./Llama-2-7b-hf --output ./llama2-7b

# From wheel using second GPU:
$ CUDA_VISIBLE_DEVICES=1 python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --input ./Llama-2-7b-hf --output ./llama2-7b
```
### Export Saved Model on Disk

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --input ./Llama-2-7b-hf --output ./llama2-7b

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --input ./Llama-2-7b-hf --output ./llama2-7b
```
### Export for FP32 CUDA (with MultiHeadAttention)

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp32-gpu --precision fp32 --execution_provider cuda

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp32-gpu --precision fp32 --execution_provider cuda
```
### Export for FP32 CPU (with GroupQueryAttention)

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp32-cpu --precision fp32 --execution_provider cpu --use_gqa

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp32-cpu --precision fp32 --execution_provider cpu --use_gqa
```
### Export for FP16 CUDA (with MultiHeadAttention)

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp16 --precision fp16 --execution_provider cuda

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp16 --precision fp16 --execution_provider cuda
```
### Export for FP16 CUDA (with GroupQueryAttention)

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp16 --precision fp16 --execution_provider cuda --use_gqa

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-fp16 --precision fp16 --execution_provider cuda --use_gqa
```
Note: GroupQueryAttention can provide faster inference than MultiHeadAttention, especially for large sequence lengths (e.g. 1024 or larger). For the best performance, you should pre-allocate the KV cache buffers to have size (batch_size, num_heads, max_sequence_length, head_size) so that the past KV and present KV caches share the same memory. You also need to bind them with ONNX Runtime's IO binding.
Here is an example of how you can bind directly to torch.tensor objects:
```python
# Assumes all inputs and outputs to the model are pre-allocated with the correct shapes in GPU memory

# Bind inputs
for k, v in inputs.items():
    io_binding.bind_input(
        name=k,
        device_type="cuda",
        device_id=0,
        element_type=np.float16,
        shape=tuple(v.shape),
        buffer_ptr=v.data_ptr()
    )

# Bind outputs
for output in model.get_outputs():
    name = output.name
    if "present" in name:
        # Bind KV cache outputs to KV cache inputs
        v = inputs[name.replace("present", "past_key_values")]
        io_binding.bind_output(
            name=name,
            device_type="cuda",
            device_id=0,
            element_type=np.float16,
            shape=tuple(v.shape),
            buffer_ptr=v.data_ptr()
        )
    else:
        # Bind other outputs as actual outputs
        v = outputs[name]
        io_binding.bind_output(
            name=name,
            device_type="cuda",
            device_id=0,
            element_type=np.float16,
            shape=tuple(v.shape),
            buffer_ptr=v.data_ptr()
        )

io_binding.synchronize_inputs()
sess.run_with_iobinding(io_binding)
io_binding.synchronize_outputs()
```
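The snippet above assumes that the session, the IO binding object, and the `inputs` dictionary already exist. A minimal sketch of that setup for the KV cache portion is shown below; the tensor names, layer count, head counts, and model path are assumptions for LLaMA-2 7B and should be adjusted to match your exported model.

```python
import onnxruntime as ort
import torch

# Example path produced by convert_to_onnx; adjust to your output directory
model_path = "./llama2-7b-fp16/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx"
sess = ort.InferenceSession(model_path, providers=["CUDAExecutionProvider"])
io_binding = sess.io_binding()

# Assumed sizes for LLaMA-2 7B; max_sequence_length bounds the KV cache
batch_size, num_heads, max_sequence_length, head_size = 1, 32, 2048, 128
num_layers = 32

# Pre-allocate max-size KV cache buffers once so the past and present KV caches
# can share the same GPU memory across decoding steps
inputs = {}
for i in range(num_layers):
    for kv in ("key", "value"):
        # Input naming convention assumed; the "present.*" outputs map back to
        # these buffers via the name.replace(...) call in the snippet above
        inputs[f"past_key_values.{i}.{kv}"] = torch.zeros(
            batch_size, num_heads, max_sequence_length, head_size,
            dtype=torch.float16, device="cuda",
        )

# input_ids, attention_mask, etc. would also be added to `inputs` before binding
```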
### Export for INT8 CPU (SmoothQuant)

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int8 --precision int8 --quantization_method smooth_quant --execution_provider cpu --no_merged

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int8 --precision int8 --quantization_method smooth_quant --execution_provider cpu --no_merged
```
Note: Intel's Neural Compressor takes time to run the SmoothQuant quantization algorithm on LLMs. On an Azure Standard_NC24s_v3 VM, it takes about 30-45 minutes for each of the exported ONNX models.
### Export for INT8 CPU (DynamicQuant)

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int8 --precision int8 --quantization_method quantize_dynamic --execution_provider cpu

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int8 --precision int8 --quantization_method quantize_dynamic --execution_provider cpu
```
### Export for INT4 CUDA

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int4-gpu --precision int4 --quantization_method blockwise --execution_provider cuda --use_gqa

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int4-gpu --precision int4 --quantization_method blockwise --execution_provider cuda --use_gqa
```
### Export for INT4 CPU

```bash
# From source:
$ python3 -m models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int4-cpu --precision int4 --quantization_method blockwise --execution_provider cpu --use_gqa

# From wheel:
$ python3 -m onnxruntime.transformers.models.llama.convert_to_onnx -m meta-llama/Llama-2-7b-hf --output llama2-7b-int4-cpu --precision int4 --quantization_method blockwise --execution_provider cpu --use_gqa
```
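After any of the exports above, a quick way to confirm what the resulting model expects is to open it in an ONNX Runtime session and list its inputs and outputs. This is only a sketch; the model path below is one of the FP32 paths used later in this README, and the providers should match the execution provider you exported for.

```python
import onnxruntime as ort

model_path = "./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx"  # example path
sess = ort.InferenceSession(model_path, providers=["CPUExecutionProvider"])

# Inputs typically include input_ids, attention_mask, and the past KV cache tensors
for inp in sess.get_inputs():
    print("input: ", inp.name, inp.shape, inp.type)

# Outputs typically include logits and the present KV cache tensors
for out in sess.get_outputs():
    print("output:", out.name, out.shape, out.type)
```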
Here are some examples of how you can use the parity checker to verify your LLaMA-2 ONNX model.
FP32 model, CPU execution provider:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cpu \
    --precision fp32 \
    --cache_dir ./model_cache
```

FP32 model, CUDA execution provider:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cuda \
    --precision fp32 \
    --cache_dir ./model_cache
```

FP16 precision, CUDA execution provider:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --execution_provider cuda \
    --precision fp16 \
    --cache_dir ./model_cache
```

FP16 precision, CUDA execution provider, with buffer sharing:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.llama_parity \
    --model_name meta-llama/Llama-2-7b-hf \
    --onnx_model_path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --merged \
    --use_buffer_share \
    --execution_provider cuda \
    --precision fp16 \
    --cache_dir ./model_cache
```
Here are some examples of how you can benchmark LLaMA-2.
PyTorch without torch.compile (eager mode), FP32:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-pt-eager \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cpu \
    --auth
```
PyTorch with torch.compile, FP16:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-pt-compile \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cuda \
    --auth
```
ONNX Runtime via a Hugging Face Optimum export (hf-ort), FP32:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-ort \
    --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cpu \
    --auth
```
ONNX Runtime via a Hugging Face Optimum export (hf-ort), FP16:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type hf-ort \
    --hf-ort-dir-path ./Llama-2-7b-hf-onnx/ \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cuda \
    --auth
```
ONNX Runtime with the Microsoft custom export (ort-msft), FP32:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-msft \
    --ort-model-path ./llama-2-onnx/7B_float32/ONNX/LlamaV2_7B_float32.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cpu
```
ONNX Runtime with the Microsoft custom export (ort-msft), FP16:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-msft \
    --ort-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cuda
```
ONNX Runtime with a convert_to_onnx export (ort-convert-to-onnx), FP32:

```bash
CUDA_VISIBLE_DEVICES=1 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp32 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cpu
```
ONNX Runtime with a convert_to_onnx export (ort-convert-to-onnx), FP16:

```bash
CUDA_VISIBLE_DEVICES=4 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cuda
```
You can profile a variant by adding the `--profile` flag and providing one batch size and sequence length combination.
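For example, this sketch profiles the FP16 convert_to_onnx model from the previous command with a single batch size and sequence length; all other flags are taken from the commands above.

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark \
    --benchmark-type ort-convert-to-onnx \
    --ort-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1" \
    --sequence-lengths "16" \
    --device cuda \
    --profile
```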
You can use `benchmark_all.py` to benchmark across various options and automatically store the results in a CSV file. Here is an example.
```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_all \
    --hf-pt-eager \
    --hf-pt-compile \
    --hf-ort-dir-path ./llama2-7b-fp16/ \
    --ort-convert-to-onnx-model-path ./llama2-7b-fp16/Llama-2-7b-hf_decoder_merged_model_fp16.onnx \
    --ort-msft-model-path ./llama-2-onnx/7B_float16/ONNX/LlamaV2_7B_float16.onnx \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --precision fp16 \
    --batch-sizes "1 2" \
    --sequence-lengths "8 16" \
    --device cuda \
    --warmup-runs 5 \
    --num-runs 1000 \
    --timeout 60  # number of minutes before moving to the next benchmark
```
You can use `benchmark_e2e.py` to benchmark the full end-to-end scenario and automatically store the results in a CSV file. This tool uses argmax for sampling to standardize the benchmarking process.
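"Argmax sampling" here means greedy token selection. A one-line sketch of the idea, with dummy logits standing in for the model's real output:

```python
import torch

logits = torch.randn(1, 8, 32000)             # (batch, sequence, vocab_size) - dummy values
next_token = logits[:, -1, :].argmax(dim=-1)  # pick the highest-probability token, deterministically
print(next_token.shape)                       # (batch,)
```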
PyTorch without torch.compile (eager mode), FP32:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type pt-eager \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --prompts-file ./models/llama/prompts.json \
    --precision fp32 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cpu \
    --auth
```
PyTorch with torch.compile, FP16:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type pt-compile \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --prompts-file ./models/llama/prompts.json \
    --precision fp16 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cuda \
    --auth
```
ONNX Runtime model from convert_to_onnx, FP32:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type ort \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --onnx-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --prompts-file ./models/llama/prompts.json \
    --precision fp32 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cpu \
    --auth
```
ONNX Runtime model from convert_to_onnx, FP16:

```bash
CUDA_VISIBLE_DEVICES=0 python3 -m models.llama.benchmark_e2e \
    --benchmark-type ort \
    --model-name meta-llama/Llama-2-7b-hf \
    --cache-dir ./model_cache \
    --onnx-model-path ./llama2-7b/rank_0_Llama-2-7b-hf_decoder_merged_model_fp32.onnx \
    --prompts-file ./models/llama/prompts.json \
    --precision fp16 \
    --batch-sizes "1 2" \
    --prompt-lengths "16 64" \
    --device cuda \
    --use_buffer_share \
    --auth
```
For end-to-end inference, please visit the ONNX Runtime Inference Examples folder for a step-by-step walkthrough, code examples, and performance metrics.
These LLaMA-2 tools also support quantizing and optimizing Mistral in ONNX Runtime.
There is currently one supported way to export Mistral to ONNX format:
The following command will export Mistral in full precision:
```bash
python -m optimum.exporters.onnx -m mistralai/Mistral-7B-v0.1 --library-name transformers /path/to/model/directory
```
To convert Mistral to FP16 and apply fusion optimizations, you can run the following command:

```bash
python -m models.llama.convert_to_onnx -i /path/to/model/directory -o /path/to/optimized_model/directory -p fp16 --optimize_optimum -m mistralai/Mistral-7B-v0.1
```
The benchmarking scripts in the LLaMA directory support Mistral benchmarking. To benchmark the ORT version, you can run:
```bash
CUDA_VISIBLE_DEVICES=0 python -m models.llama.benchmark \
    -bt ort-convert-to-onnx \
    -p fp16 \
    -m mistralai/Mistral-7B-v0.1 \
    --ort-model-path /path/to/model.onnx
```
To benchmark the Hugging Face implementation without torch.compile:
```bash
CUDA_VISIBLE_DEVICES=0 python -m models.llama.benchmark \
    -bt hf-pt-eager \
    -p fp16 \
    -m mistralai/Mistral-7B-v0.1
```
And to benchmark the Hugging Face implementation with torch.compile:
```bash
CUDA_VISIBLE_DEVICES=0 python -m models.llama.benchmark \
    -bt hf-pt-compile \
    -p fp16 \
    -m mistralai/Mistral-7B-v0.1
```