docs/source/en/peft.md
Parameter-efficient fine-tuning (PEFT) methods only fine-tune a small number of extra model parameters (adapters) on top of a pretrained model. Because only adapter parameters are updated, the optimizer tracks far fewer gradients and states, reducing memory usage significantly. Adapters are lightweight, making them convenient to share, store, and load.
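As a rough, illustrative sketch of why this saves memory (the hidden size and rank below are hypothetical, not tied to any model in this guide), compare the parameters updated by full fine-tuning of a single linear layer versus a LoRA adapter on that layer.

```python
# Back-of-the-envelope parameter count for one hypothetical linear layer.
hidden = 4096  # illustrative hidden size of the layer
r = 8          # illustrative LoRA rank

full_params = hidden * hidden  # parameters updated by full fine-tuning
lora_params = 2 * hidden * r   # LoRA trains two low-rank matrices, A and B

print(full_params)  # 16777216
print(lora_params)  # 65536
print(f"trainable fraction: {lora_params / full_params:.4%}")  # trainable fraction: 0.3906%
```

The optimizer only tracks gradients and states for the small trainable fraction, which is where the memory savings come from.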
Transformers integrates directly with the PEFT library through [`~integrations.PeftAdapterMixin`], added to all [`PreTrainedModel`] classes. You can load, add, train, switch, and delete adapters without wrapping your model in a separate [`~peft.PeftModel`]. All non-prompt-learning PEFT methods are supported (LoRA, IA3, AdaLoRA). Prompt-based methods like prompt tuning and prefix tuning require using the PEFT library directly.
Install PEFT to get started. The integration requires `peft` >= 0.18.0.

```bash
pip install -U peft
```
Create a PEFT config, such as [`~peft.LoraConfig`], and attach it to a model with [`~integrations.PeftAdapterMixin.add_adapter`].
```python
from peft import LoraConfig, TaskType
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-2b")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    inference_mode=False,
    r=8,
    lora_alpha=32,
    lora_dropout=0.1,
)
model.add_adapter(lora_config, adapter_name="my_adapter")
```
To fully fine-tune additional modules alongside an adapter, list them in `modules_to_save`. All parameters of the listed layers are updated during training. This is useful when certain layers need updating, such as the language model head (`lm_head`) when adapting a causal LM for sequence classification.
```python
lora_config = LoraConfig(
    modules_to_save=["lm_head"],
    ...
)
model.add_adapter(lora_config)
```
For common architectures (Llama, Gemma, Qwen, etc.), PEFT has predefined default targets (like `q_proj` and `v_proj`), so you don't need to specify `target_modules`. If you want to target different layers, or the model doesn't have predefined targets, pass `target_modules` explicitly as a list of module names or a regex pattern.
```python
lora_config = LoraConfig(
    target_modules=["q_proj", "k_proj"],
    ...
)
model.add_adapter(lora_config)
```
Pass the model with an attached adapter to [`Trainer`] and call [`~Trainer.train`]. [`Trainer`] only updates the adapter parameters (those with `requires_grad=True`) because the base model is frozen.
```python
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir="./output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```
During training, [`Trainer`] checkpoints contain only the adapter weights (`adapter_model.safetensors`) and configuration (`adapter_config.json`), keeping checkpoints small. The base model isn't included.
After training, save the final adapter with [`~PreTrainedModel.save_pretrained`].

```python
model.save_pretrained("./my_adapter")
```
[`Trainer`] automatically detects adapter checkpoints when resuming. It scans the checkpoint directory for subdirectories containing adapter weights and reloads each adapter with the correct trainable state.
```python
trainer.train(resume_from_checkpoint="./output/checkpoint-1000")
```
PEFT adapters work with distributed training out of the box.
For ZeRO-3, [`Trainer`] passes `exclude_frozen_parameters=True` when saving checkpoints with a PEFT model. Frozen base model weights are skipped. Only the trainable adapter parameters are saved, reducing checkpoint size and save time.
For FSDP, [`Trainer`] updates the FSDP auto-wrap policy to correctly handle LoRA layers. For QLoRA (quantized base model + LoRA), [`Trainer`] also adjusts the mixed precision policy to match the quantization storage dtype.
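As a sketch, a minimal DeepSpeed ZeRO-3 configuration might look like the following (the filename and values are illustrative; `"auto"` lets the Transformers integration fill in values from [`TrainingArguments`]). Pass it through the `deepspeed` argument, for example `TrainingArguments(deepspeed="ds_config.json", ...)`.

```json
{
  "zero_optimization": {
    "stage": 3
  },
  "bf16": {
    "enabled": "auto"
  },
  "train_batch_size": "auto",
  "gradient_accumulation_steps": "auto"
}
```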
To load an adapter, the Hub repository or local directory must contain an `adapter_config.json` file and the adapter weights.
[`~PreTrainedModel.from_pretrained`] automatically detects adapters. When it finds an `adapter_config.json`, it reads the `base_model_name_or_path` field to load the correct base model, then loads the adapter on top.
```python
from transformers import AutoModelForCausalLM

# Automatically loads the base model and attaches the adapter
model = AutoModelForCausalLM.from_pretrained("klcsp/gemma7b-lora-alpaca-11-v1")
```
To load an adapter onto an existing model, use [`~integrations.PeftAdapterMixin.load_adapter`].
```python
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("google/gemma-7b")
model.load_adapter("klcsp/gemma7b-lora-alpaca-11-v1")
```
For large models, load a quantized version in 8-bit or 4-bit precision with bitsandbytes to save memory. Add `device_map="auto"` to distribute the model across available hardware.
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    "klcsp/gemma7b-lora-alpaca-11-v1",
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
)
```
A model can hold multiple adapters at once. Add adapters with unique names, and switch between them as needed.
```python
from peft import LoraConfig

model.add_adapter(LoraConfig(r=8, lora_alpha=32), adapter_name="adapter_1")
model.add_adapter(LoraConfig(r=16, lora_alpha=64), adapter_name="adapter_2")
```
Use [`~integrations.PeftAdapterMixin.set_adapter`] to activate a specific adapter. The other adapters are disabled but remain in memory.
```python
model.set_adapter("adapter_2")
```
[`~integrations.PeftAdapterMixin.enable_adapters`] enables all attached adapters, and [`~integrations.PeftAdapterMixin.disable_adapters`] disables all of them.
```python
# Disable all adapters for base model inference
model.disable_adapters()

# Re-enable all adapters
model.enable_adapters()
```
Use [`~integrations.PeftAdapterMixin.active_adapters`] to see which adapters are currently active.
```python
model.active_adapters()
# ["adapter_1"]
```
Remove adapters you no longer need with [`~integrations.PeftAdapterMixin.delete_adapter`] to free memory.
```python
model.delete_adapter("adapter_1")
```
Loading a new adapter each time you serve a request allocates new memory. If the model is compiled with `torch.compile`, each new adapter triggers recompilation. Hotswapping replaces adapter weights in-place, avoiding both issues. Only LoRA adapters are supported.
Pass `hotswap=True` when loading a LoRA adapter to swap its weights into an existing adapter slot. Set `adapter_name` to the name of the adapter to replace (`"default"` is the default adapter name).
```python
from transformers import AutoModel

model = AutoModel.from_pretrained(...)

# Load the first adapter normally
model.load_adapter(adapter_path_1)

# Generate outputs with adapter 1
...

# Hotswap the second adapter in-place
model.load_adapter(adapter_path_2, hotswap=True, adapter_name="default")

# Generate outputs with adapter 2
```
For compiled models, call [`~integrations.PeftAdapterMixin.enable_peft_hotswap`] before loading the first adapter and before compiling.
```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(...)

max_rank = ...  # highest rank among all LoRAs you'll load
model.enable_peft_hotswap(target_rank=max_rank)
model.load_adapter(adapter_path_1, adapter_name="default")
model = torch.compile(model, ...)
output_1 = model(...)

# Hotswap without recompilation
model.load_adapter(adapter_path_2, adapter_name="default")
output_2 = model(...)
```
The `target_rank` argument sets the maximum rank among all LoRA adapters you'll load. If you have adapters with rank 8 and rank 16, pass `target_rank=16`. The default is 128.
After calling `enable_peft_hotswap`, all subsequent `load_adapter` calls hotswap by default. Pass `hotswap=False` explicitly to disable hotswapping.
Recompilation may still occur if the hotswapped adapter targets more layers than the initial adapter. Load the adapter that targets the most layers first to avoid recompilation.
> [!TIP]
> Wrap your code in `with torch._dynamo.config.patch(error_on_recompile=True)` to detect unexpected recompilation. If you detect recompilation despite following the steps above, open an issue with PEFT with a reproducible example.
With hotswapping, you keep the runtime benefits of `torch.compile` while serving multiple adapters.