
Supervised Fine-Tuning (SFT) Qwen3-VL with QLoRA using TRL

With Transformers Reinforcement Learning (TRL), you can fine-tune cutting-edge vision language models. TRL supports QLoRA, a quantized parameter-efficient fine-tuning technique, so we can fine-tune models like Qwen3-VL on a free Colab instance (T4 GPU).

Install dependencies

We'll install TRL with the PEFT extra, which ensures all main dependencies such as Transformers and PEFT (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install trackio to log and monitor our experiments, and bitsandbytes to enable quantization of LLMs, reducing memory consumption for both inference and training.

python
!pip install -Uq "trl[peft]" bitsandbytes trackio

Log in to Hugging Face

Log in to your Hugging Face account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your access token on your account settings page.

python
from huggingface_hub import notebook_login

notebook_login()

Load dataset

We'll load the trl-lib/llava-instruct-mix dataset from the Hugging Face Hub using the datasets library.

This dataset is a set of GPT-generated multimodal instruction-following data. We use a processed version here for convenience. You can find more details about how to configure your own multimodal dataset for training with SFT in the docs. Fine-tuning Qwen3-VL on it helps refine the model's response style and visual understanding.

python
from datasets import load_dataset

dataset_name = "trl-lib/llava-instruct-mix"
train_dataset = load_dataset(dataset_name, split="train[:10%]")

Let's review one example to understand the internal structure:

python
train_dataset[0]
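
Each entry pairs a chat-style prompt with its image(s). As a rough sketch of the layout (field names inferred from how this example is accessed later in the notebook, so verify them against the printed output above):

python
# Hedged sketch of one entry's structure (inferred, not authoritative):
# {
#     "prompt": [{"role": "user", "content": "<question about the image>"}],  # chat-style user turn(s)
#     "images": [<PIL.Image>],                                                # image(s) the prompt refers to
#     ...                                                                     # plus the assistant response field
# }
train_dataset[0].keys()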

Load model and configure LoRA/QLoRA

This notebook can be used with two fine-tuning methods. By default, it is set up for QLoRA, which includes quantization using BitsAndBytesConfig. If you prefer to use standard LoRA without quantization, simply comment out the BitsAndBytesConfig configuration.

python
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch

model_name = "Qwen/Qwen3-VL-4B-Instruct" # "Qwen/Qwen3-VL-8B-Instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="float32",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                        # Load the model in 4-bit precision to save memory
        bnb_4bit_compute_dtype=torch.float16,     # Data type used for internal computations in quantization
        bnb_4bit_use_double_quant=True,           # Use double quantization to improve accuracy
        bnb_4bit_quant_type="nf4"                 # Type of quantization. "nf4" is recommended for recent LLMs
    )
)
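
If you prefer the plain-LoRA path mentioned above, a minimal sketch of the same load without quantization looks like this (half precision is used here, since a full-precision 4B model may not fit in free Colab memory):

python
# Plain-LoRA variant (sketch): same load as above, just without `quantization_config`
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="float16",      # half precision to keep the non-quantized model within T4 memory
    device_map="auto",
)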

The following cell defines the LoRA configuration (the same config is used for QLoRA). When training with LoRA/QLoRA, we keep the base model (the one loaded above) frozen and, instead of modifying its original weights, fine-tune a LoRA adapter: a lightweight set of low-rank layers that enables efficient, memory-friendly training. The target_modules specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.

python
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different VLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
)
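
If you switch to a different VLM and are unsure which layer names to target, one quick way to find candidates (a generic PyTorch sketch, not a PEFT API) is to list the linear layers of the loaded model:

python
import torch.nn as nn

# Collect the unique trailing names of Linear-like modules; these are what `target_modules` matches.
# With bitsandbytes quantization, linear layers are subclasses of nn.Linear, so they still show up here.
linear_layer_names = sorted({
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
})
print(linear_layer_names)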

Train model

We'll configure SFT using SFTConfig, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the TRL SFTConfig documentation.

python
from trl import SFTConfig

output_dir = "Qwen3-VL-4B-Instruct-trl-sft"

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    # Training schedule / optimization
    #num_train_epochs=1,
    max_steps=10,                                         # Number of optimizer steps for this quick demo. For full trainings, use `num_train_epochs` instead
    per_device_train_batch_size=2,                        # Batch size per GPU/CPU
    gradient_accumulation_steps=8,                        # Gradients are accumulated over multiple steps → effective batch size = 2 * 8 = 16
    warmup_steps=5,                                       # Gradually increase LR during first N steps
    learning_rate=2e-4,                                   # Learning rate for the optimizer
    optim="adamw_8bit",                                   # Optimizer
    max_length=None,                                      # For VLMs, truncating may remove image tokens, leading to errors during training. max_length=None avoids it

    # Logging / reporting
    output_dir=output_dir,                                # Where to save model checkpoints and logs
    logging_steps=1,                                      # Log training metrics every N steps
    report_to="trackio",                                  # Experiment tracking tool

    # Hub integration
    push_to_hub=True,
)

Configure the SFT Trainer, passing the previously configured training_args. We don't use an eval dataset to keep memory usage low, but you can configure one (see the sketch after the next cell).

python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
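
If you have memory to spare, a minimal sketch of adding evaluation would be to hold out a slice of the dataset, enable evaluation in SFTConfig (e.g., eval_strategy="steps" and eval_steps), and pass the held-out split to the trainer. The split expression and step values below are illustrative:

python
# Illustrative: hold out a small slice of the same dataset for evaluation
eval_dataset = load_dataset(dataset_name, split="train[10%:11%]")

trainer = SFTTrainer(
    model=model,
    args=training_args,              # with `eval_strategy="steps"` and `eval_steps` set in SFTConfig above
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,       # evaluated periodically during training
    peft_config=peft_config,
)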

Show memory stats before training

python
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

And train!

python
trainer_stats = trainer.train()

Show memory stats after training

python
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Saving the fine-tuned model

In this step, we save the fine-tuned model both locally and to the Hugging Face Hub using the credentials from your account.

python
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)

Load the fine-tuned model and run inference

Now, let's test our fine-tuned model by loading the LoRA/QLoRA adapter and performing inference. We'll start by loading the base model, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.

python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base_model = model_name
adapter_model = output_dir # Replace with your HF username or organization + fine-tuned model name to load the adapter from the Hub

model = Qwen3VLForConditionalGeneration.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)

processor = AutoProcessor.from_pretrained(base_model)

Now build the chat-style messages from one example of the training dataset:

python
problem = train_dataset[0]['prompt'][0]['content']
image = train_dataset[0]['images'][0]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": problem},
        ],
    },
]

python
messages

python
image

python
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
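
Optionally, if you want a standalone model that no longer needs PEFT at inference time, you can merge the adapter weights into the base model. A minimal sketch (the output path is illustrative):

python
# Optional: fold the LoRA adapter into the base weights for standalone use
merged_model = model.merge_and_unload()                                # returns the base model with adapter weights merged in
merged_model.save_pretrained("Qwen3-VL-4B-Instruct-trl-sft-merged")   # illustrative local path
processor.save_pretrained("Qwen3-VL-4B-Instruct-trl-sft-merged")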