# Model Tab
This is where you load models, apply LoRAs to a loaded model, and download new models.
## llama.cpp

Loads: GGUF models. Note: GGML models have been deprecated and no longer work.

Example: https://huggingface.co/TheBloke/Llama-2-7b-Chat-GGUF

Options:

* **Cache type**: fp16, q8_0, q4_0. Lower quantization saves VRAM at the cost of some quality.
* **Tensor split**: Comma-separated percentages for splitting the model across multiple GPUs, e.g. 30,70 for 30%/70%.
* **Extra flags**: Additional flags to pass to llama.cpp, in the format flag1=value1,flag2,flag3=value3. Example: override-tensor=exps=CPU.

## Transformers

Loads: full precision (16-bit or 32-bit) models, as well as bitsandbytes-quantized models. The repository usually has a clean name without "GGUF" or "EXL3" in it, and the model files are named model.safetensors or split into parts like model-00001-of-00004.safetensors.
Example: https://huggingface.co/lmsys/vicuna-7b-v1.5
Full precision models use a ton of VRAM, so you will usually want to select the "load_in_4bit" and "use_double_quant" options to load the model in 4-bit precision using bitsandbytes.
Options:

* **GPU split**: Comma-separated list of VRAM (in GB) to allocate per GPU, e.g. 20,7,7.
* **Attention implementation**: sdpa, eager, flash_attention_2. The default (sdpa) works well in most cases; flash_attention_2 may be useful for training.

## ExLlamav3_HF

Loads: EXL3 models. These models usually have "EXL3" or "exl3" in the model name.
Uses the ExLlamaV3 backend with Transformers samplers.
Options:

* **Cache type**: fp16, q2 to q8. You can also specify key and value bits separately, e.g. q4_q8. Lower quantization saves VRAM at the cost of some quality.
* **GPU split**: Comma-separated list of VRAM (in GB) to allocate per GPU, e.g. 20,7,7.
* **Tensor-parallel backend**: native, nccl. Default: native.

## ExLlamav3

The same as ExLlamav3_HF, but using the internal samplers of ExLlamaV3 instead of the ones in the Transformers library. Supports speculative decoding with a draft model, and natively supports multimodal (vision) models.
Options:

* **Tensor-parallel backend**: native, nccl. Default: native.

## TensorRT-LLM

Loads: TensorRT-LLM engine models. These are highly optimized models compiled specifically for NVIDIA GPUs.
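The extra-flags field described above under llama.cpp packs several flags into one string. A hypothetical parser (parse_extra_flags is an illustrative helper, not the webui's actual code) makes the format's rules concrete: items are comma-separated, a flag without "=value" acts as a boolean switch, and only the first "=" splits key from value so that values like exps=CPU survive intact:

```python
from __future__ import annotations


def parse_extra_flags(spec: str) -> dict:
    """Parse a flag string like "flag1=value1,flag2,flag3=value3".

    Illustrative helper only; not the webui's actual parsing code.
    """
    flags: dict = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            # Split at the first "=" only, so values may themselves contain "=".
            key, _, value = item.partition("=")
            flags[key] = value
        else:
            # A bare flag is treated as a boolean switch.
            flags[item] = True
    return flags
```

For example, parse_extra_flags("override-tensor=exps=CPU") keeps the full value "exps=CPU" rather than splitting it a second time.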
## Model dropdown

Here you can select a model to load, refresh the list of available models, load/unload/reload the selected model, and save its settings. The "settings" are the values in the input fields (checkboxes, sliders, dropdowns) below this dropdown.
After saving, those settings will get restored whenever you select that model again in the dropdown menu.
If the "Autoload the model" checkbox is selected, the model will be loaded as soon as it is selected in this menu. Otherwise, you will have to click the "Load" button.
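Conceptually, the saved settings amount to a per-model mapping of the field values described above, along the lines of this YAML fragment (the field names and exact file location are assumptions, not the webui's actual schema):

```yaml
# Illustrative only: a per-model settings mapping of the kind the
# "Save settings" button persists (exact schema and path may differ).
llama-2-7b-chat.Q4_K_M.gguf:
  loader: llama.cpp
  cache-type: fp16
  tensor-split: 30,70
```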
## LoRA dropdown

Used to apply LoRAs to the loaded model. Note that LoRA support is not implemented for all loaders; check the What Works page for details.
## Download

Here you can download a model or LoRA directly from https://huggingface.co.
Models are saved to user_data/models, and LoRAs to user_data/loras.

In the input field, you can enter either a Hugging Face username/model path (like facebook/galactica-125m) or the full model URL (like https://huggingface.co/facebook/galactica-125m). To specify a branch, add it at the end after a ":" character, like this: facebook/galactica-125m:main.
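The ":branch" suffix can be separated from the model path with a few lines of string handling. A sketch of one way to do it (split_branch is a hypothetical helper, not the webui's actual code); note that the ":" in a full "https://" URL must not be mistaken for a branch separator:

```python
def split_branch(spec: str, default: str = "main") -> tuple:
    """Split "facebook/galactica-125m:main" into (path, branch).

    Hypothetical helper for illustration; not the webui's actual code.
    """
    if ":" in spec:
        path, _, branch = spec.rpartition(":")
        # A real branch name contains no "/", which rules out the ":" of
        # "https://" being treated as a separator.
        if "/" not in branch:
            return path, branch
    return spec, default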
To download a single file, as is necessary for models in GGUF format, click "Get file list" after entering the model path in the input field, then copy the desired file name into the "File name" field before clicking "Download".
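The file names returned by "Get file list" correspond to direct download URLs following Hugging Face's resolve scheme. A sketch of how such a URL is built (the helper and the example file name are illustrative; the /resolve/ URL pattern itself is Hugging Face's standard one):

```python
def file_url(repo: str, filename: str, branch: str = "main") -> str:
    # Hugging Face serves raw repository files at
    # https://huggingface.co/<repo>/resolve/<branch>/<filename>
    return f"https://huggingface.co/{repo}/resolve/{branch}/{filename}"


# Example: a single GGUF file from the repository used as an example above
# (the file name is illustrative).
url = file_url("TheBloke/Llama-2-7b-Chat-GGUF", "llama-2-7b-chat.Q4_K_M.gguf")
```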