
.. _compile-model-libraries:

Compile Model Libraries

To run a model with MLC LLM on any platform, we need:

  1. Model weights converted to MLC format (e.g. `RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC <https://huggingface.co/mlc-ai/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/tree/main>`__).
  2. Model library that comprises the inference logic

This page describes how to compile a model library with MLC LLM. Model compilation optimizes the model inference for a given platform, allowing users to bring their own new model architectures, use different quantization modes, and customize the overall model optimization flow.

Notably, in many cases you do not need to explicitly call ``compile``.

  • If you are using the Python API, you can skip specifying ``model_lib`` and the system will JIT-compile the library (see the sketch after this list).

  • If you are building an iOS/Android package, check out :ref:`package-libraries-and-weights`, which provides a simpler high-level command that leverages ``compile`` behind the scenes.
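
For instance, a minimal sketch of the JIT path with the Python API (the ``HF://`` model reference below is one possible weight source; a local MLC model directory works as well):

.. code:: shell

    python
    >>> from mlc_llm import MLCEngine
    >>> # No model_lib given: the model library is JIT-compiled on first use
    >>> engine = MLCEngine(model="HF://mlc-ai/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC")
    >>> engine.chat.completions.create(messages=[{"role": "user", "content": "hello"}])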

This page is still helpful for understanding the compilation flow behind the scenes, or for explicitly creating model libraries. We compile RedPajama-INCITE-Chat-3B-v1 with q4f16_1 as an example for all platforms.

.. note:: Before you proceed, make sure you have followed :ref:`install-tvm`, a required backend to compile models with MLC LLM.

Please also follow the instructions in :ref:`deploy-cli` / :ref:`deploy-python-engine` to obtain
the CLI app / Python API that can be used to chat with the compiled model.

.. contents:: Table of Contents
  :depth: 1
  :local:

0. Verify Installation

Step 1. Verify mlc_llm

We use the Python package ``mlc_llm`` to compile models. It can be installed by following :ref:`install-mlc-packages`, either by building from source or by installing the prebuilt package. Verify the ``mlc_llm`` installation on the command line via:

.. code:: bash

    $ mlc_llm --help
    # You should see help information with this line
    usage: MLC LLM Command Line Interface. [-h] {compile,convert_weight,gen_config}

.. note:: If it runs into the error ``command not found: mlc_llm``, try ``python -m mlc_llm --help``.

Step 2. Verify TVM

To compile models, you also need to follow :ref:`install-tvm`. Here we quickly verify the TVM installation from the command line (for full verification, see :ref:`tvm-validate`):

.. code:: bash

    $ python -c "import tvm; print(tvm.__file__)"
    /some-path/lib/python3.13/site-packages/tvm/__init__.py
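
Optionally, you can also check whether TVM detects a device for your target backend. A small hedged sanity check using standard TVM runtime device helpers (swap ``tvm.cuda()`` for ``tvm.metal()``, ``tvm.vulkan()``, etc. depending on your backend):

.. code:: bash

    $ python -c "import tvm; print(tvm.cuda().exist)"
    True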

1. Clone from HF and convert_weight

This replicates :ref:`convert-weights-via-MLC`; see that page for more details.

You can work under the mlc-llm repo or your own working directory. Note that all platforms can share the same compiled/quantized weights.

.. code:: shell

    # Create directory
    mkdir -p dist/models && cd dist/models
    # Clone HF weights
    git lfs install
    git clone https://huggingface.co/togethercomputer/RedPajama-INCITE-Chat-3B-v1
    cd ../..
    # Convert weight
    mlc_llm convert_weight ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
        --quantization q4f16_1 \
        -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC

2. Generate mlc-chat-config and compile

A model library is specified by:

  • The model architecture (e.g. llama-2, gpt-neox)
  • Quantization (e.g. q4f16_1, q0f32)
  • Metadata (e.g. context_window_size, sliding_window_size, prefill-chunk-size), which affects memory planning
  • Platform (e.g. cuda, webgpu, iOS)

All these knobs are specified in ``mlc-chat-config.json``, which is generated by ``gen_config``.

.. code:: shell

    # Create output directory for the model library compiled
    mkdir dist/libs

.. tabs::

.. group-tab:: Linux - CUDA

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device cuda -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so


.. group-tab:: Metal

    For M-chip Mac:

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device metal -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so

    Cross-Compiling for Intel Mac on M-chip Mac:

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device metal:x86-64 -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal_x86_64.dylib

    For Intel Mac:

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device metal -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal_x86_64.dylib


.. group-tab:: Vulkan

    For Linux:

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device vulkan -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so

    For Windows:

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device vulkan -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.dll

.. group-tab:: iOS/iPadOS

    You need a Mac to compile models for it.

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ --quantization q4f16_1 \
            --conv-template redpajama_chat --context-window-size 768 \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device iphone -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar

    .. note::
        If it runs into error

        .. code:: text

            Compilation error:
            xcrun: error: unable to find utility "metal", not a developer tool or in PATH
            xcrun: error: unable to find utility "metallib", not a developer tool or in PATH

        , please check and make sure you have the Command Line Tools for Xcode installed correctly.
        You can use ``xcrun metal`` to validate: when it prints ``metal: error: no input files``, it means the Command Line Tools for Xcode are installed and can be found, and you can proceed with compiling the model.

.. group-tab:: Android

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ --quantization q4f16_1 \
            --conv-template redpajama_chat --context-window-size 768 \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device android -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar

.. group-tab:: WebGPU

    .. code:: shell

        # 1. gen_config: generate mlc-chat-config.json and process tokenizers
        mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
            --quantization q4f16_1 --conv-template redpajama_chat \
            -o dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/
        # 2. compile: compile model library with specification in mlc-chat-config.json
        mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json \
            --device webgpu -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm

    .. note::
        To compile for webgpu, you need to build from source when installing ``mlc_llm``. Besides, you also need to follow :ref:`install-web-build`.
        Otherwise, it would run into error

        .. code:: text

            RuntimeError: Cannot find libraries: wasm_runtime.bc

    .. note::
        For webgpu, when compiling larger models like ``Llama-2-7B``, you may want to add ``--prefill-chunk-size 1024`` or lower ``--context-window-size`` to decrease memory usage.
        Otherwise, you may run into issues like:

        .. code:: text

            TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
            'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

.. note::

    For the ``conv-template``, `conversation_template.py <https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/conversation_template.py>`__
    contains a full list of conversation templates that MLC provides. If the model you are adding
    requires a new conversation template, you will need to add your own.
    Follow `this PR <https://github.com/mlc-ai/mlc-llm/pull/2163>`__ as an example.
    However, adding your own template requires you to :ref:`build mlc_llm from source <mlcchat_build_from_source>`
    in order for it to be recognized by the runtime.

For more details, please see :ref:`configure-mlc-chat-json`.
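
Once ``gen_config`` has finished, you can sanity-check the knobs recorded in the generated ``mlc-chat-config.json``. Below is a minimal sketch; the field names shown (``model_type``, ``quantization``, ``context_window_size``) are typical but may differ across MLC LLM versions:

.. code:: shell

    python
    >>> import json
    >>> with open("dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC/mlc-chat-config.json") as f:
    ...     config = json.load(f)
    >>> # Field names are assumptions based on typical configs; adjust if your version differs
    >>> print(config["model_type"], config["quantization"], config["context_window_size"])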

3. Verify output and chat

By executing the commands above, we have generated the model weights, the model library, and the chat config. We can check the output with the commands below:

.. tabs::

.. group-tab:: Linux - CUDA

    .. code:: shell

        ~/mlc-llm > ls dist/libs
          RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so      # ===> the model library

        ~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
          mlc-chat-config.json                             # ===> the chat config
          tensor-cache.json                               # ===> the model weight info
          params_shard_0.bin                               # ===> the model weights
          params_shard_1.bin
          ...
          tokenizer.json                                   # ===> the tokenizer files
          tokenizer_config.json

    We can now chat with the model using the command line interface (CLI) app or the Python API.

    .. code:: shell

        python
        >>> from mlc_llm import MLCEngine
        >>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
        ...   model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so")
        >>> engine.chat.completions.create(
        ...   messages=[{"role": "user", "content": "hello"}]
        ... )
        ChatCompletionResponse(
          choices=[ChatCompletionResponseChoice(
            message=ChatCompletionMessage(
              content="Hi! How can I assist you today?", role='assistant'
            )
          )],
          ...
        )
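
    Alternatively, a hedged sketch of the CLI route (see :ref:`deploy-cli`; the exact model-library flag name may vary across versions, so run ``mlc_llm chat --help`` to confirm):

    .. code:: shell

        mlc_llm chat ./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC \
            --model-lib ./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-cuda.so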

.. group-tab:: Metal

    .. code:: shell

        ~/mlc-llm > ls dist/libs
          RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so     # ===> the model library (will be -metal_x86_64.dylib for Intel Mac)

        ~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
          mlc-chat-config.json                             # ===> the chat config
          tensor-cache.json                               # ===> the model weight info
          params_shard_0.bin                               # ===> the model weights
          params_shard_1.bin
          ...
          tokenizer.json                                   # ===> the tokenizer files
          tokenizer_config.json

    We can now chat with the model using the command line interface (CLI) app or the Python API.

    .. code:: shell

        python
        >>> from mlc_llm import MLCEngine
        >>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
        ...   model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-metal.so")
        >>> engine.chat.completions.create(
        ...   messages=[{"role": "user", "content": "hello"}]
        ... )
        ChatCompletionResponse(
          choices=[ChatCompletionResponseChoice(
            message=ChatCompletionMessage(
              content="Hi! How can I assist you today?", role='assistant'
            )
          )],
          ...
        )


.. group-tab:: Vulkan

    .. code:: shell

        ~/mlc-llm > ls dist/libs
          RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so    # ===> the model library (will be .dll for Windows)

        ~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
          mlc-chat-config.json                             # ===> the chat config
          tensor-cache.json                               # ===> the model weight info
          params_shard_0.bin                               # ===> the model weights
          params_shard_1.bin
          ...
          tokenizer.json                                   # ===> the tokenizer files
          tokenizer_config.json

    We can now chat with the model using the command line interface (CLI) app or the Python API.

    .. code:: shell

        python
        >>> from mlc_llm import MLCEngine
        >>> engine = MLCEngine(model="./dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC",
        ...   model_lib="./dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-vulkan.so")
        >>> engine.chat.completions.create(
        ...   messages=[{"role": "user", "content": "hello"}]
        ... )
        ChatCompletionResponse(
          choices=[ChatCompletionResponseChoice(
            message=ChatCompletionMessage(
              content="Hi! How can I assist you today?", role='assistant'
            )
          )],
          ...
        )

.. group-tab:: iOS/iPadOS

    .. code:: shell

        ~/mlc-llm > ls dist/libs
          RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar   # ===> the model library

        ~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
          mlc-chat-config.json                             # ===> the chat config
          tensor-cache.json                               # ===> the model weight info
          params_shard_0.bin                               # ===> the model weights
          params_shard_1.bin
          ...
          tokenizer.json                                   # ===> the tokenizer files
          tokenizer_config.json

    The model lib ``dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-iphone.tar``
    will be packaged as a static library into the iOS app. Checkout :ref:`deploy-ios` for more details.

.. group-tab:: Android

    .. code:: shell

        ~/mlc-llm > ls dist/libs
          RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar  # ===> the model library

        ~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
          mlc-chat-config.json                             # ===> the chat config
          tensor-cache.json                               # ===> the model weight info
          params_shard_0.bin                               # ===> the model weights
          params_shard_1.bin
          ...
          tokenizer.json                                   # ===> the tokenizer files
          tokenizer_config.json

    The model lib ``dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f16_1-android.tar``
    will be packaged as a static library into the android app. Checkout :ref:`deploy-android` for more details.

.. group-tab:: WebGPU

    .. code:: shell

        ~/mlc-llm > ls dist/libs
          RedPajama-INCITE-Chat-3B-v1-q4f16_1-webgpu.wasm  # ===> the model library

        ~/mlc-llm > ls dist/RedPajama-INCITE-Chat-3B-v1-q4f16_1-MLC
          mlc-chat-config.json                             # ===> the chat config
          tensor-cache.json                               # ===> the model weight info
          params_shard_0.bin                               # ===> the model weights
          params_shard_1.bin
          ...
          tokenizer.json                                   # ===> the tokenizer files
          tokenizer_config.json

    To use this in WebGPU runtime, checkout :ref:`webllm-runtime`.

Compile Commands for More Models

This section lists compile commands for more models that you can try out. Note that these commands can be easily generalized to any model variant, as long as MLC LLM supports the architecture.

.. tabs::

.. tab:: Model: Llama-2-7B

    Please `request access <https://huggingface.co/meta-llama>`_ to the Llama-2 weights from Meta first.
    After being granted access, create the directory ``dist/models`` and download the model into it.
    For example, you can run the following code:

    .. code:: shell

        mkdir -p dist/models && cd dist/models
        git lfs install
        git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
        cd ../..

    Then convert the HF weights into MLC-compatible weights. Note that all platforms
    can share the same compiled/quantized weights.

    .. code:: shell

        mlc_llm convert_weight ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC

    Afterwards, run the following command to generate mlc config and compile the model.

    .. code:: shell

        # Create output directory for the model library compiled
        mkdir dist/libs

    .. tabs::

        .. tab:: Target: CUDA

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device cuda -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so

        .. tab:: Metal

            For M-chip Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device metal -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal.so

            Cross-Compiling for Intel Mac on M-chip Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device metal:x86-64 -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal_x86_64.dylib

            For Intel Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device metal -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-metal_x86_64.dylib

        .. tab:: Vulkan

            For Linux:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.so

            For Windows:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device vulkan -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-vulkan.dll

        .. tab:: WebGPU

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --context-window-size 2048 --conv-template llama-2 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device webgpu -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-webgpu.wasm

            .. note::
                To compile for webgpu, you need to build from source when installing ``mlc_llm``. Besides, you also need to follow :ref:`install-web-build`.
                Otherwise, it would run into error

                .. code:: text

                    RuntimeError: Cannot find libraries: wasm_runtime.bc

        .. tab:: iPhone/iPad

            You need a Mac to compile models for it.

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 --context-window-size 768 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device iphone -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-iphone.tar

        .. tab:: Android

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ --quantization q4f16_1 \
                    --conv-template llama-2 --context-window-size 768 -o dist/Llama-2-7b-chat-hf-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
                    --device android -o dist/libs/Llama-2-7b-chat-hf-q4f16_1-android.tar

.. tab:: Mistral-7B-Instruct-v0.2

    Note that Mistral uses sliding window attention (SWA). Thus, instead of specifying
    ``context-window-size``, we specify ``sliding-window-size``.

    First create directory ``dist/models`` and download the model to the directory.
    For example, you can run the following code:

    .. code:: shell

        mkdir -p dist/models && cd dist/models
        git lfs install
        git clone https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2
        cd ../..

    Then convert the HF weights into MLC-compatible weights. Note that all platforms
    can share the same compiled/quantized weights.

    .. code:: shell

        mlc_llm convert_weight ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
            -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC

    Afterwards, run the following command to generate mlc config and compile the model.

    .. code:: shell

        # Create output directory for the model library compiled
        mkdir dist/libs

    .. tabs::

        .. tab:: Target: CUDA

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device cuda -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-cuda.so

        .. tab:: Metal

            For M-chip Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device metal -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal.so


            For Intel Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device metal -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-metal_x86_64.dylib

        .. tab:: Vulkan

            For Linux:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device vulkan -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-vulkan.so

            For Windows:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device vulkan -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-vulkan.dll

        .. tab:: WebGPU

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --prefill-chunk-size 1024 --conv-template mistral_default \
                    -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device webgpu -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-webgpu.wasm

            .. note::
                To compile for webgpu, you need to build from source when installing ``mlc_llm``. Besides, you also need to follow :ref:`install-web-build`.
                Otherwise, it would run into error

                .. code:: text

                    RuntimeError: Cannot find libraries: wasm_runtime.bc

            .. note::
                For webgpu, when compiling larger models like ``Llama-2-7B``, you may want to add ``--prefill-chunk-size 1024`` or lower ``--context-window-size`` to decrease memory usage.
                Otherwise, you may run into issues like:

                .. code:: text

                    TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
                    'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

        .. tab:: iPhone/iPad

            You need a Mac to compile models for it.

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default --sliding-window-size 1024 --prefill-chunk-size 128  \
                    -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device iphone -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-iphone.tar

        .. tab:: Android

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/Mistral-7B-Instruct-v0.2/ --quantization q4f16_1 \
                    --conv-template mistral_default --sliding-window-size 1024 --prefill-chunk-size 128 -o dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/Mistral-7B-Instruct-v0.2-q4f16_1-MLC/mlc-chat-config.json \
                    --device android -o dist/libs/Mistral-7B-Instruct-v0.2-q4f16_1-android.tar

.. tab:: Other models

    First create directory ``dist/models`` and download the model to the directory.
    For example, you can run the following code:

    .. code:: shell

        mkdir -p dist/models && cd dist/models
        git lfs install
        git clone https://huggingface.co/DISTRIBUTOR/HF_MODEL
        cd ../..

    Then convert the HF weights into MLC-compatible weights. Note that all platforms
    can share the same compiled/quantized weights.

    .. code:: shell

        mlc_llm convert_weight ./dist/models/HF_MODEL/ --quantization q4f16_1 -o dist/OUTPUT-MLC

    Afterwards, run the following command to generate mlc config and compile the model.

    .. code:: shell

        # Create output directory for the model library compiled
        mkdir dist/libs

    .. tabs::

        .. tab:: Target: CUDA

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device cuda -o dist/libs/OUTPUT-cuda.so

        .. tab:: Metal

            For M-chip Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device metal -o dist/libs/OUTPUT-metal.so


            For Intel Mac:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device metal -o dist/libs/OUTPUT-metal_x86_64.dylib

        .. tab:: Vulkan

            For Linux:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device vulkan -o dist/libs/OUTPUT-vulkan.so

            For Windows:

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device vulkan -o dist/libs/OUTPUT-vulkan.dll

        .. tab:: WebGPU

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device webgpu -o dist/libs/OUTPUT-webgpu.wasm

            .. note::
                To compile for webgpu, you need to build from source when installing ``mlc_llm``. Besides, you also need to follow :ref:`install-web-build`.
                Otherwise, it would run into error

                .. code:: text

                    RuntimeError: Cannot find libraries: wasm_runtime.bc

            .. note::
                For webgpu, when compiling larger models like ``Llama-2-7B``, you may want to add ``--prefill-chunk-size 1024`` or lower ``--context-window-size`` to decrease memory usage.
                Otherwise, you may run into issues like:

                .. code:: text

                    TypeError: Failed to execute 'createBuffer' on 'GPUDevice': Failed to read the 'size' property from
                    'GPUBufferDescriptor': Value is outside the 'unsigned long long' value range.

        .. tab:: iPhone/iPad

            You need a Mac to compile models for it.

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE \
                    --context-window-size 768 -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device iphone -o dist/libs/OUTPUT-iphone.tar

        .. tab:: Android

            .. code:: shell

                # 1. gen_config: generate mlc-chat-config.json and process tokenizers
                mlc_llm gen_config ./dist/models/HF_MODEL/ --quantization q4f16_1 --conv-template CONV_TEMPLATE \
                    --context-window-size 768 -o dist/OUTPUT-MLC/
                # 2. compile: compile model library with specification in mlc-chat-config.json
                mlc_llm compile ./dist/OUTPUT-MLC/mlc-chat-config.json --device android -o dist/libs/OUTPUT-android.tar

For each model and each backend, the above only provides the recommended build command (which is the most optimized). You can also try different argument values (e.g., different quantization modes, context window sizes, etc.); these choices affect the runtime memory requirement, and the resulting libraries may not run as fast or as robustly as the ones built with the recommended commands.
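
For example, a sketch of the same RedPajama build using ``q4f32_1`` instead of ``q4f16_1`` (same three-step pattern as above; expect larger weights and different speed/memory trade-offs):

.. code:: shell

    # Same three steps as before; only the quantization mode changes
    mlc_llm convert_weight ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
        --quantization q4f32_1 -o dist/RedPajama-INCITE-Chat-3B-v1-q4f32_1-MLC
    mlc_llm gen_config ./dist/models/RedPajama-INCITE-Chat-3B-v1/ \
        --quantization q4f32_1 --conv-template redpajama_chat \
        -o dist/RedPajama-INCITE-Chat-3B-v1-q4f32_1-MLC/
    mlc_llm compile ./dist/RedPajama-INCITE-Chat-3B-v1-q4f32_1-MLC/mlc-chat-config.json \
        --device cuda -o dist/libs/RedPajama-INCITE-Chat-3B-v1-q4f32_1-cuda.so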

.. note:: Using 3-bit quantization can be overly aggressive and only works in limited settings. If you encounter issues where the compiled model does not perform as expected, consider using a higher number of bits for quantization (e.g., 4-bit quantization).

If you are interested in distributing the model beyond local execution, please check out :ref:`distribute-compiled-models`.

.. _compile-command-specification:

Compile Command Specification

As you have seen in the sections above, model compilation is split into three steps: convert weights, generate ``mlc-chat-config.json``, and compile the model. This section describes the options that can be used in each step.

1. Convert Weight

Weight conversion command follows the pattern below:

.. code:: text

    mlc_llm convert_weight \
        CONFIG \
        --quantization QUANTIZATION_MODE \
        [--model-type MODEL_TYPE] \
        [--device DEVICE] \
        [--source SOURCE] \
        [--source-format SOURCE_FORMAT] \
        --output OUTPUT

Note that ``CONFIG`` is a positional argument. Arguments wrapped with ``[ ]`` are optional.

--CONFIG It can be one of the following:

                                1. Path to a HuggingFace model directory that contains a ``config.json`` or
                                2. Path to ``config.json`` in HuggingFace format, or
                                3. The name of a pre-defined model architecture.

                                A ``config.json`` file in HuggingFace format defines the model architecture, including the vocabulary
                                size, the number of layers, the hidden size, number of attention heads, etc.
                                Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.

                                A HuggingFace directory often contains a ``config.json`` which defines the model architecture,
                                the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations,
                                as well as an optional ``generation_config.json`` that provides additional default configuration for
                                text generation.
                                Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.

                                For existing pre-defined model architecture, see ``MODEL_PRESETS``
                                `here <https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/compiler/model/model.py>`_.

--quantization QUANTIZATION_MODE The quantization mode we use to compile.

                                See :ref:`quantization_mode` for more information.
                                Available options are: ``q0f16``, ``q0f32``, ``q3f16_1``, ``q4f16_1``, ``q4f32_1``, and
                                ``q4f16_awq``.

                                We encourage you to use 4-bit quantization, as the text generated by 3-bit
                                quantized models may have bad quality depending on the model.

--model-type MODEL_TYPE Model architecture such as "llama". If not set, it is inferred from config.json.

--device DEVICE The device used to do quantization such as "cuda" or "cuda:0". Will detect from local available GPUs if not specified.

--source SOURCE The path to original model weight, infer from config if missing.

--source-format SOURCE_FORMAT The format of source model weight, infer from config if missing.

--output OUTPUT The output directory to save the quantized model weights. Will create ``params_shard_*.bin`` and ``tensor-cache.json`` in this directory.
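
For illustration, a sketch that fills in the pattern above with some of the optional flags described (the paths reuse the Llama-2 example from earlier on this page; ``llama`` and ``cuda:0`` are the example values mentioned above):

.. code:: shell

    mlc_llm convert_weight ./dist/models/Llama-2-7b-chat-hf/ \
        --quantization q4f16_1 \
        --model-type llama \
        --device cuda:0 \
        --output dist/Llama-2-7b-chat-hf-q4f16_1-MLC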

2. Generate MLC Chat Config

In order to compile a model, we first need to generate ``mlc-chat-config.json``. This file contains specifications like ``context-window-size`` and ``sliding-window-size``, among others that can alter the compiled model. We also process tokenizers in this step.

Config generation command follows the pattern below:

.. code:: text

    mlc_llm gen_config \
        CONFIG \
        --quantization QUANTIZATION_MODE \
        [--model-type MODEL_TYPE] \
        --conv-template CONV_TEMPLATE \
        [--context-window-size CONTEXT_WINDOW_SIZE] \
        [--sliding-window-size SLIDING_WINDOW_SIZE] \
        [--prefill-chunk-size PREFILL_CHUNK_SIZE] \
        [--tensor-parallel-shard TENSOR_PARALLEL_SHARDS] \
        --output OUTPUT

Note that ``CONFIG`` is a positional argument. Arguments wrapped with ``[ ]`` are optional.

--CONFIG It can be one of the following:

                                            1. Path to a HuggingFace model directory that contains a ``config.json`` or
                                            2. Path to ``config.json`` in HuggingFace format, or
                                            3. The name of a pre-defined model architecture.

                                            A ``config.json`` file in HuggingFace format defines the model architecture, including the vocabulary
                                            size, the number of layers, the hidden size, number of attention heads, etc.
                                            Example: https://huggingface.co/codellama/CodeLlama-7b-hf/blob/main/config.json.

                                            A HuggingFace directory often contains a ``config.json`` which defines the model architecture,
                                            the non-quantized model weights in PyTorch or SafeTensor format, tokenizer configurations,
                                            as well as an optional ``generation_config.json`` that provides additional default configuration for
                                            text generation.
                                            Example: https://huggingface.co/codellama/CodeLlama-7b-hf/tree/main.

                                            For existing pre-defined model architecture, see ``MODEL_PRESETS``
                                            `here <https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/compiler/model/model.py>`_.

--quantization QUANTIZATION_MODE The quantization mode we use to compile.

                                            See :ref:`quantization_mode` for more information.
                                            Available options are: ``q0f16``, ``q0f32``, ``q3f16_1``, ``q4f16_1``, ``q4f32_1``, and
                                            ``q4f16_awq``.

                                            We encourage you to use 4-bit quantization, as the text generated by 3-bit
                                            quantized models may have bad quality depending on the model.

--model-type MODEL_TYPE Model architecture such as "llama". If not set, it is inferred from config.json.

--conv-template CONV_TEMPLATE Conversation template. It depends on how the model is tuned. Use ``LM`` for vanilla base models. For existing pre-defined templates, see ``CONV_TEMPLATES`` `here <https://github.com/mlc-ai/mlc-llm/blob/main/python/mlc_llm/model/model.py>`_.

--context-window-size CONTEXT_WINDOW_SIZE Option to provide the maximum sequence length supported by the model. This is usually explicitly shown as context length or context window in the model card. If this option is not set explicitly, by default it will be determined by ``context_window_size`` or ``max_position_embeddings`` in ``config.json``; the latter is often inaccurate for some models.

--sliding-window-size SLIDING_WINDOW (Experimental) The sliding window size in sliding window attention (SWA). This optional field overrides the ``sliding_window`` in ``config.json`` for those models that use SWA. Currently only useful when compiling Mistral-based models. This flag is subject to future refactoring.

--prefill-chunk-size PREFILL_CHUNK_SIZE (Experimental) The chunk size during prefilling. By default, the chunk size is the same as ``context_window_size`` or ``sliding_window_size``. This flag is subject to future refactoring.

--tensor-parallel-shard TENSOR_PARALLEL_SHARDS Number of shards to split the model into for tensor-parallel multi-GPU inference.

--output OUTPUT The output directory for generated configurations, including mlc-chat-config.json and tokenizer configuration.
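
For illustration, a sketch that fills in the pattern above with some of the optional knobs (values mirror those used elsewhere on this page; adjust them to your model and memory budget):

.. code:: shell

    mlc_llm gen_config ./dist/models/Llama-2-7b-chat-hf/ \
        --quantization q4f16_1 \
        --conv-template llama-2 \
        --context-window-size 2048 \
        --prefill-chunk-size 1024 \
        --output dist/Llama-2-7b-chat-hf-q4f16_1-MLC/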

3. Compile Model Library

After generating ``mlc-chat-config.json``, we can compile the model into a model library (a file ending in ``.so``, ``.tar``, etc. that contains the inference logic of the model).

Model compilation command follows the pattern below:

.. code:: text

    mlc_llm compile \
        MODEL \
        [--quantization QUANTIZATION_MODE] \
        [--model-type MODEL_TYPE] \
        [--device DEVICE] \
        [--host HOST] \
        [--opt OPT] \
        [--system-lib-prefix SYSTEM_LIB_PREFIX] \
        --output OUTPUT \
        [--overrides OVERRIDES]

Note that ``MODEL`` is a positional argument. Arguments wrapped with ``[ ]`` are optional.

--MODEL A path to mlc-chat-config.json, or an MLC model directory that contains mlc-chat-config.json.

--quantization QUANTIZATION_MODE The quantization mode we use to compile. If not provided, it is inferred from ``MODEL``.

                                        See :ref:`quantization_mode` for more information.
                                        Available options are: ``q0f16``, ``q0f32``, ``q3f16_1``, ``q4f16_1``, ``q4f32_1``, and
                                        ``q4f16_awq``.

                                        We encourage you to use 4-bit quantization, as the text generated by 3-bit
                                        quantized models may have bad quality depending on the model.

--model-type MODEL_TYPE Model architecture such as "llama". If not set, it is inferred from mlc-chat-config.json.

--device DEVICE The GPU device to compile the model to. If not set, it is inferred from GPUs available locally.

--host HOST The host LLVM triple to compile the model to. If not set, it is inferred from the local CPU and OS. Examples of the LLVM triple:

                                        1) iPhones: arm64-apple-ios;
                                        2) ARM64 Android phones: aarch64-linux-android;
                                        3) WebAssembly: wasm32-unknown-unknown-wasm;
                                        4) Windows: x86_64-pc-windows-msvc;
                                        5) ARM macOS: arm64-apple-darwin.

--opt OPT Optimization flags. MLC LLM maintains a predefined set of optimization flags, denoted as ``O0``, ``O1``, ``O2``, ``O3``, where ``O0`` means no optimization, ``O2`` enables the majority of them, and ``O3`` represents extreme optimization that could potentially break the system.

                                        Meanwhile, optimization flags can be explicitly specified via detailed knobs, e.g.
                                        ``--opt="cutlass_attn=1;cutlass_norm=0;cublas_gemm=0;cudagraph=0"``.

--system-lib-prefix SYSTEM_LIB_PREFIX Adds a prefix to all exported symbols, similar to ``objcopy --prefix-symbols``. This is useful when compiling multiple models into a single library to avoid symbol conflicts. Unlike ``objcopy``, this has no effect for shared libraries.

--output OUTPUT The path to the output file. The suffix determines if the output file is a shared library or objects. Available suffixes:

                                        1) Linux: .so (shared), .tar (objects);
                                        2) macOS: .dylib (shared), .tar (objects);
                                        3) Windows: .dll (shared), .tar (objects);
                                        4) Android, iOS: .tar (objects);
                                        5) Web: .wasm (web assembly).

--overrides OVERRIDES Model configuration overrides applied on top of ``mlc-chat-config.json``. Supports ``context_window_size``, ``prefill_chunk_size``, ``sliding_window``, ``max_batch_size``, and ``tensor_parallel_shards``. Multiple knobs can be specified explicitly, e.g. ``--overrides "context_window_size=1024;prefill_chunk_size=128"``.
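
Putting these together, a hedged sketch of a compile invocation that combines the optional flags above (paths reuse the Llama-2 example from this page; adjust the device and overrides to your setup):

.. code:: shell

    mlc_llm compile ./dist/Llama-2-7b-chat-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device cuda \
        --opt O3 \
        --overrides "context_window_size=1024;prefill_chunk_size=128" \
        --output dist/libs/Llama-2-7b-chat-hf-q4f16_1-cuda.so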