
Using PEFT with custom models

examples/multilayer_perceptron/multilayer_perceptron_lora.ipynb


peft allows us to fine-tune models efficiently with LoRA. In this short notebook, we will demonstrate how to train a simple multilayer perceptron (MLP) using peft.

Imports

Make sure that you have the latest version of peft installed. To ensure that, run this in your shell:

```shell
python -m pip install --upgrade peft
```

```python
import copy
import os

# ignore bnb warnings
os.environ["BITSANDBYTES_NOWELCOME"] = "1"
```

```python
import peft
import torch
from torch import nn
import torch.nn.functional as F
```

```python
torch.manual_seed(0)
```

Data

We will create a toy dataset consisting of random data for a classification task. There is a little bit of signal in the data, so we should expect that the loss of the model can improve during training.

```python
X = torch.rand((1000, 20))
y = (X.sum(1) > 10).long()
```

```python
n_train = 800
batch_size = 64
```

```python
train_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[:n_train], y[:n_train]),
    batch_size=batch_size,
    shuffle=True,
)
eval_dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(X[n_train:], y[n_train:]),
    batch_size=batch_size,
)
```
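The signal comes from thresholding the feature sum: the sum of 20 uniform(0, 1) draws centers at 10, so the classes come out roughly balanced. A quick sanity check (a sketch, not part of the original notebook):

```python
import torch

torch.manual_seed(0)

# sums of 20 uniform(0, 1) features center at 10, so thresholding at 10
# should yield roughly balanced classes
X = torch.rand((1000, 20))
y = (X.sum(1) > 10).long()
print(f"positive fraction: {y.float().mean().item():.2f}")
```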

Model

As a model, we use a simple multilayer perceptron (MLP). For demonstration purposes, we use a very large number of hidden units. This is totally overkill for this task but it helps to demonstrate the advantages of peft. In more realistic settings, models will also be quite large on average, so this is not far-fetched.

```python
class MLP(nn.Module):
    def __init__(self, num_units_hidden=2000):
        super().__init__()
        self.seq = nn.Sequential(
            nn.Linear(20, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, num_units_hidden),
            nn.ReLU(),
            nn.Linear(num_units_hidden, 2),
            nn.LogSoftmax(dim=-1),
        )

    def forward(self, X):
        return self.seq(X)
```

Training

Here are just a few training hyper-parameters and a simple function that performs the training and evaluation loop.

```python
lr = 0.002
batch_size = 64
max_epochs = 30
# prefer the torch.accelerator API where available, otherwise fall back to CUDA/CPU
if hasattr(torch, "accelerator") and torch.accelerator.is_available():
    device = torch.accelerator.current_accelerator().type
else:
    device = "cuda" if torch.cuda.is_available() else "cpu"
```

```python
def train(model, optimizer, criterion, train_dataloader, eval_dataloader, epochs):
    for epoch in range(epochs):
        model.train()
        train_loss = 0
        for xb, yb in train_dataloader:
            xb = xb.to(device)
            yb = yb.to(device)
            outputs = model(xb)
            loss = criterion(outputs, yb)
            train_loss += loss.detach().float()
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()

        model.eval()
        eval_loss = 0
        for xb, yb in eval_dataloader:
            xb = xb.to(device)
            yb = yb.to(device)
            with torch.no_grad():
                outputs = model(xb)
            loss = criterion(outputs, yb)
            eval_loss += loss.detach().float()

        eval_loss_total = (eval_loss / len(eval_dataloader)).item()
        train_loss_total = (train_loss / len(train_dataloader)).item()
        print(f"{epoch=:<2}  {train_loss_total=:.4f}  {eval_loss_total=:.4f}")
```

Training without peft

Let's start without using peft to see what we can expect from the model training.

```python
module = MLP().to(device)
optimizer = torch.optim.Adam(module.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
```

```python
%time train(module, optimizer, criterion, train_dataloader, eval_dataloader, epochs=max_epochs)
```

Okay, so we got an eval loss of ~0.26, which is much better than random.

Training with peft

Now let's train with peft. First we check the names of the modules, so that we can configure peft to fine-tune the right modules.

```python
[(n, type(m)) for n, m in MLP().named_modules()]
```
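For larger models, the full module list gets long; one way to narrow it down is to keep only the `nn.Linear` submodules, since those are the usual LoRA targets. A sketch, where `linear_module_names` is a hypothetical helper, not a peft API:

```python
from torch import nn

def linear_module_names(model: nn.Module) -> list[str]:
    # hypothetical helper: names of all nn.Linear submodules, the usual LoRA targets
    return [name for name, mod in model.named_modules() if isinstance(mod, nn.Linear)]

# applied to the MLP above, this returns ['seq.0', 'seq.2', 'seq.4']
```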

Next we can define the LoRA config. There is nothing special going on here. We set the LoRA rank to 8 and select the layers seq.0 and seq.2 for LoRA fine-tuning. As for seq.4, which is the output layer, we add it to modules_to_save, which means it is also trained but no LoRA is applied to it.

Note: Not all layer types can be fine-tuned with LoRA. At the moment, linear layers, embeddings, Conv2D, and transformers.pytorch_utils.Conv1D are supported.

```python
config = peft.LoraConfig(
    r=8,
    target_modules=["seq.0", "seq.2"],
    modules_to_save=["seq.4"],
)
```

Now let's create the peft model by passing our initial MLP, as well as the config we just defined, to get_peft_model.

```python
module = MLP().to(device)
module_copy = copy.deepcopy(module)  # we keep a copy of the original model for later
peft_model = peft.get_peft_model(module, config)
optimizer = torch.optim.Adam(peft_model.parameters(), lr=lr)
criterion = nn.CrossEntropyLoss()
peft_model.print_trainable_parameters()
```
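print_trainable_parameters reports the same numbers you would get by counting requires_grad flags yourself. A sketch of that check (trainable_fraction is a hypothetical helper):

```python
from torch import nn

def trainable_fraction(model: nn.Module) -> float:
    # count only the parameters the optimizer will actually update
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable / total

# on the peft model above this comes out at roughly 0.01; on the plain MLP it is 1.0
```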

Checking the numbers, we see that only ~1% of parameters are actually trained, which is what we like to see.

Now let's start the training:

```python
%time train(peft_model, optimizer, criterion, train_dataloader, eval_dataloader, epochs=max_epochs)
```

In the end, we see that the eval loss is very similar to the one we saw earlier when we trained without peft. This is quite nice to see, given that we are training a much smaller number of parameters.

Check which parameters were updated

Finally, just to check that LoRA was applied as expected, we check which of the original weights were updated and which stayed the same.

```python
for name, param in peft_model.base_model.named_parameters():
    if "lora" not in name:
        continue

    print(f"New parameter {name:<13} | {param.numel():>5} parameters | updated")
```

```python
params_before = dict(module_copy.named_parameters())
for name, param in peft_model.base_model.named_parameters():
    if "lora" in name:
        continue

    name_before = (
        name.partition(".")[-1]
        .replace("base_layer.", "")
        .replace("original_", "")
        .replace("module.", "")
        .replace("modules_to_save.default.", "")
    )
    param_before = params_before[name_before]
    if torch.allclose(param, param_before):
        print(f"Parameter {name_before:<13} | {param.numel():>7} parameters | not updated")
    else:
        print(f"Parameter {name_before:<13} | {param.numel():>7} parameters | updated")
```

So we can see that apart from the new LoRA weights that were added, only the last layer was updated. Since the LoRA weights and the last layer have comparatively few parameters, this gives us a big boost in efficiency.

Sharing the model through Hugging Face Hub

Pushing the model to HF Hub

With the peft model, it is also very easy to push a model to the Hugging Face Hub. Below, we demonstrate how it works. It is assumed that you have a valid Hugging Face account and are logged in:

```python
user = "BenjaminB"  # put your user name here
model_name = "peft-lora-with-custom-model"
model_id = f"{user}/{model_name}"
```

```python
peft_model.push_to_hub(model_id);
```

As we can see, the adapter size is only 211 kB.

Loading the model from HF Hub

Now, it only takes one step to load the model from HF Hub. To do this, we can use PeftModel.from_pretrained, passing our base model and the model ID:

```python
loaded = peft.PeftModel.from_pretrained(module_copy, model_id)
type(loaded)
```

Let's check that the two models produce the same output:

```python
y_peft = peft_model(X.to(device))
y_loaded = loaded(X.to(device))
torch.allclose(y_peft, y_loaded)
```

Clean up

Finally, as a cleanup step, you may want to delete the repo.

```python
from huggingface_hub import delete_repo
```

```python
delete_repo(model_id)
```