Back to Mlc Llm

IDE Integration

docs/deploy/ide_integration.rst

0.20.dev07.0 KB
Original Source

.. _deploy-ide-integration:

IDE Integration

.. contents:: Table of Contents :local: :depth: 2

MLC LLM has now support for code completion on multiple IDEs. This means you can easily integrate an LLM with coding capabilities with your IDE through the MLC LLM :ref:deploy-rest-api. Here we provide a step-by-step guide on how to do this.

Convert Your Model Weights

To run a model with MLC LLM in any platform, you need to convert your model weights to the MLC format (e.g. CodeLlama-7b-hf-q4f16_1-MLC <https://huggingface.co/mlc-ai/CodeLlama-7b-hf-q4f16_1-MLC>). You can always refer to :ref:convert-weights-via-MLC for in-depth details on how to convert your model weights. If you are using your own model weights, i.e., you finetuned the model on your personal codebase, it is important to follow these steps to convert the respective weights properly. However, it is also possible to download precompiled weights from the original models, available in the MLC format. See the full list of all precompiled weights here <https://huggingface.co/mlc-ai>.

Example:

.. code:: bash

convert model weights

mlc_llm convert_weight ./dist/models/CodeLlama-7b-hf
--quantization q4f16_1
-o ./dist/CodeLlama-7b-hf-q4f16_1-MLC

Compile Your Model

Compiling the model architecture is the crucial step to optimize inference for a given platform. However, compilation relies on multiple settings that will impact the runtime. This configuration is specified inside the mlc-chat-config.json file, which can be generated by the gen_config command. You can learn more about the gen_config command here </docs/compilation/compile_models.html#generate-mlc-chat-config>__.

Example:

.. code:: bash

generate mlc-chat-config.json

mlc_llm gen_config ./dist/models/CodeLlama-7b-hf
--quantization q4f16_1 --conv-template LM
-o ./dist/CodeLlama-7b-hf-q4f16_1-MLC

.. note:: Make sure to set the --conv-template flag to LM. This template is specifically tailored to perform vanilla LLM completion, generally adopted by code completion models.

After generating the MLC model configuration file, we are all set to compile and create the model library. You can learn more about the compile command here </docs/compilation/compile_models.html#compile-model-library>__

Example:

.. tabs::

.. group-tab:: Linux - CUDA

  .. code:: bash

     # compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/CodeLlama-7b-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device cuda -o ./dist/libs/CodeLlama-7b-hf-q4f16_1-cuda.so

.. group-tab:: Metal

  For M-chip Mac:

  .. code:: bash

     # compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/CodeLlama-7b-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device metal -o ./dist/libs/CodeLlama-7b-hf-q4f16_1-metal.so

  Cross-Compiling for Intel Mac on M-chip Mac:

  .. code:: bash

     # compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/CodeLlama-7b-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device metal:x86-64 -o ./dist/libs/CodeLlama-7b-hf-q4f16_1-metal_x86_64.dylib

  For Intel Mac:

  .. code:: bash

     # compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/CodeLlama-7b-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device metal -o ./dist/libs/CodeLlama-7b-hf-q4f16_1-metal_x86_64.dylib

.. group-tab:: Vulkan

  For Linux:

  .. code:: bash

     # compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/CodeLlama-7b-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device vulkan -o ./dist/libs/CodeLlama-7b-hf-q4f16_1-vulkan.so

  For Windows:

  .. code:: bash

     # compile model library with specification in mlc-chat-config.json
     mlc_llm compile ./dist/CodeLlama-7b-hf-q4f16_1-MLC/mlc-chat-config.json \
        --device vulkan -o ./dist/libs/CodeLlama-7b-hf-q4f16_1-vulkan.dll

.. note:: The generated model library can be shared across multiple model variants, as long as the architecture and number of parameters does not change, e.g., same architecture, but different weights (your finetuned model).

Setting up the Inference Entrypoint

You can now locally deploy your compiled model with the MLC serve module. To find more details about the MLC LLM API visit our :ref:deploy-rest-api page.

Example:

.. code:: bash

python -m mlc_llm.serve.server
--model dist/CodeLlama-7b-hf-q4f16_1-MLC
--model-lib ./dist/libs/CodeLlama-7b-hf-q4f16_1-cuda.so

Configure the IDE Extension

After deploying the LLM we can easily connect the IDE with the MLC Rest API. In this guide, we will be using the Hugging Face Code Completion extension llm-ls <https://github.com/huggingface/llm-ls>__ which has support across multiple IDEs (e.g., vscode <https://github.com/huggingface/llm-vscode>, intellij <https://github.com/huggingface/llm-intellij> and nvim <https://github.com/huggingface/llm.nvim>__) to connect to an external OpenAI compatible API (i.e., our MLC LLM :ref:deploy-rest-api).

After installing the extension on your IDE, open the settings.json extension configuration file:

.. figure:: /_static/img/ide_code_settings.png :width: 450 :align: center :alt: settings.json

|

Then, make sure to replace the following settings with the respective values:

.. code:: javascript

"llm.modelId": "dist/CodeLlama-7b-hf-q4f16_1-MLC" "llm.url": "http://127.0.0.1:8000/v1/completions" "llm.backend": "openai"

This will enable the extension to send OpenAI compatible requests to the MLC Serve API. Also, feel free to tune the API parameters. Please refer to our :ref:deploy-rest-api documentation for more details about these API parameters.

.. code:: javascript

"llm.requestBody": { "best_of": 1, "frequency_penalty": 0.0, "presence_penalty": 0.0, "logprobs": false, "top_logprobs": 0, "logit_bias": null, "max_tokens": 128, "seed": null, "stop": null, "suffix": null, "temperature": 1.0, "top_p": 1.0 }

The llm-ls extension supports a variety of different model code completion templates. Choose the one that best matches your model, i.e., the template with the correct tokenizer and Fill in the Middle tokens.

.. figure:: /_static/img/ide_code_templates.png :width: 375 :align: center :alt: llm-ls templates

|

After everything is all set, the extension will be ready to use the responses from the MLC Serve API to provide off-the-shelf code completion on your IDE.

.. figure:: /_static/img/code_completion.png :width: 700 :align: center :alt: IDE Code Completion

|

Conclusion

Please, let us know if you have any questions. Feel free to open an issue on the MLC LLM repo <https://github.com/mlc-ai/mlc-llm/issues>__!