docs/ADAPTER_MODELS.md
An adapter model is a base model with X-LoRA or LoRA adapters. X-LoRA support is provided by selecting an `XLora*` architecture, and LoRA support by selecting a `Lora*` architecture. For both X-LoRA and LoRA, an ordering file must be provided (see the section below on preparing the ordering file). The ordering file describes the ordering of layers and which adapters to use (and, for X-LoRA, the order in which to use them).
When using an adapter model with a quantized base model, you will receive an error if the ordering file specifies unsupported layers.
Supported architectures:
- Llama architecture
- Phi 3 architecture
Preparing the X-LoRA/LoRA Ordering File
An ordering file must be prepared before running inference with an X-LoRA or LoRA model. However, this is easy to do with the provided scripts!
An ordering JSON file for X-LoRA contains 2 major parts:

- `order`: the adapter names, in order.
- `layers`: the ordering of the model's layers (a mapping of layer names to indices, as in the JSON example at the end of this document).

For example, given the following adapters:

```python
adapters = {
    "math": ...,
    "reasoning": ...,
    "biology": ...
}
```

The specified `order` would be `["math", "reasoning", "biology"]`.
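As an illustration of how `order` follows the adapter names, here is a minimal sketch that builds a skeleton ordering file using only Python's standard library. The adapter model IDs and output path are hypothetical; in practice, use the provided scripts described below to generate the architecture-specific `layers` mapping.

```python
import json

# Hypothetical adapter names mapped to hypothetical adapter model IDs.
adapters = {
    "math": "my-org/math-adapter",
    "reasoning": "my-org/reasoning-adapter",
    "biology": "my-org/biology-adapter",
}

ordering = {
    # The `order` field lists the adapter names in their defined order.
    "order": list(adapters.keys()),  # ["math", "reasoning", "biology"]
    # The `layers` mapping is architecture- and target-module-specific;
    # it is normally generated by the provided scripts.
    "layers": {},
    "base_model_id": "...",  # fill in your base model ID
}

with open("xlora-ordering.json", "w") as f:
    json.dump(ordering, f, indent=2)
```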
We provide an ordering file which contains the ordering for the X-LoRA model associated with the paper and its Hugging Face repository: https://huggingface.co/lamm-mit/x-lora.
An ordering JSON file for LoRA contains 2 major parts:

- `order` (optional): the adapter names, in order.
- `preload_adapters` (optional): adapters to preload for runtime activation (see the adapter activation section below).
There are 2 scripts to prepare the ordering file, and both work for X-LoRA and LoRA. The ordering file is specific to each architecture and set of target modules. Therefore, if either of those changes, a new ordering file must be created using the first option. If only the adapters or their order changed, the second option should be used.
From scratch: No ordering file for the architecture and target modules
A script `create_ordering.py` is provided which prompts the user for the model ID, target modules, and adapter names. The user is also prompted for an output file location, relative to the working directory.
Create a new ordering file from an existing ordering file for an architecture and target modules
A script `set_names.py` is provided which prompts the user for the adapter names and the old ordering file. The user is also prompted for an output file location, relative to the working directory.
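Conceptually, the second option keeps the architecture- and target-module-specific `layers` mapping from the old ordering file and only swaps in the new adapter names. Below is a rough sketch of that idea; the file paths and adapter names are hypothetical, and the actual `set_names.py` script prompts for these values interactively.

```python
import json

# Hypothetical inputs; set_names.py prompts for these interactively.
old_ordering_path = "xlora-ordering.json"
new_ordering_path = "xlora-ordering-new.json"
new_adapter_names = ["chemistry", "physics", "biology"]

with open(old_ordering_path) as f:
    ordering = json.load(f)

# The layer structure is unchanged (same architecture and target modules);
# only the adapter order is replaced.
ordering["order"] = new_adapter_names

with open(new_ordering_path, "w") as f:
    json.dump(ordering, f, indent=2)
```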
Mistral.rs supports running quantized models with X-LoRA or LoRA. The X-LoRA or LoRA adapter layers will not be quantized; only the base model is.

In the X-LoRA case, please note that aggressive quantization (e.g., 4-bit) can distort the signal and prevent the classifier from acting properly. Therefore, it is better to use less aggressive quantization such as 8-bit.
The X-LoRA implementation supports non-granular scalings. This caches the scalings after `k` completion tokens are generated, and they are reused for the remaining forward passes, avoiding the scaling pass. The number of tokens to generate before caching is defined by setting `tgt_non_granular_index`. Note that setting `tgt_non_granular_index` restricts the maximum number of running sequences to 1.
Please see this page for more details and examples.
We support dynamic adapter activation for LoRA models, allowing you to activate a set of adapters at runtime. There are Python, Rust, and HTTP APIs.

To use this feature, you should add a `preload_adapters` key to your ordering file:
```diff
{
    "order": ["..."],
    "layers": {"...": "123"},
    "base_model_id": "...",
+   "preload_adapters": [{"name": "...", "adapter_model_id": "..."}] # New field here
}
```
This allows mistral.rs to preload the adapter and enable runtime activation.
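If you prefer to update an existing ordering file programmatically rather than editing it by hand, here is a minimal sketch using only Python's standard library. The file path, adapter name, and adapter model ID are hypothetical placeholders.

```python
import json

ordering_path = "lora-ordering.json"  # hypothetical path to your ordering file

with open(ordering_path) as f:
    ordering = json.load(f)

# Register an adapter to be preloaded so it can be activated at runtime.
ordering["preload_adapters"] = [
    {"name": "math", "adapter_model_id": "my-org/math-adapter"}  # hypothetical
]

with open(ordering_path, "w") as f:
    json.dump(ordering, f, indent=2)
```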