
Supervised Fine-Tuning (SFT) Qwen3-VL with QLoRA using TRL

With Transformers Reinforcement Learning (TRL), you can fine-tune cutting-edge vision language models. TRL supports QLoRA, a quantized parameter-efficient fine-tuning technique, so we can fine-tune models like Qwen3-VL on a free Colab instance (T4 GPU).

Install dependencies

We'll install TRL with the PEFT extra, which ensures all main dependencies such as Transformers and PEFT (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install trackio to log and monitor our experiments, and bitsandbytes to enable quantization of LLMs, reducing memory consumption for both inference and training.

python
!pip install -Uq "trl[peft]" bitsandbytes trackio

Log in to Hugging Face

Log in to your Hugging Face account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your access token on your account settings page.

python
from huggingface_hub import notebook_login

notebook_login()

Load dataset

We'll load the trl-lib/llava-instruct-mix dataset from the Hugging Face Hub using the datasets library.

This dataset is a set of GPT-generated multimodal instruction-following data. We use a processed version here for convenience. You can find more details about how to configure your own multimodal dataset for training with SFT in the docs. Fine-tuning Qwen3-VL on it helps refine the model's response style and visual understanding.

python
from datasets import load_dataset

dataset_name = "trl-lib/llava-instruct-mix"
train_dataset = load_dataset(dataset_name, split="train[:10%]")

Let's review one example to understand the internal structure:

python
train_dataset[0]
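
Each entry pairs a chat-style prompt with its image(s). As a rough sketch of the layout (field names inferred from how this example is accessed later in the notebook, so verify them against the printed output above):

python
# Hedged sketch of one entry's structure (inferred, not authoritative):
# {
#     "prompt": [{"role": "user", "content": "<question about the image>"}],  # chat-style user turn(s)
#     "images": [<PIL.Image>],                                                # image(s) the prompt refers to
#     ...                                                                     # plus the assistant response field
# }
train_dataset[0].keys()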

Load model and configure LoRA/QLoRA

This notebook can be used with two fine-tuning methods. By default, it is set up for QLoRA, which includes quantization using BitsAndBytesConfig. If you prefer to use standard LoRA without quantization, simply comment out the BitsAndBytesConfig configuration.

python
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch

model_name = "Qwen/Qwen3-VL-4B-Instruct" # "Qwen/Qwen3-VL-8B-Instruct"

model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="float32",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,                        # Load the model in 4-bit precision to save memory
        bnb_4bit_compute_dtype=torch.float16,     # Data type used for internal computations in quantization
        bnb_4bit_use_double_quant=True,           # Use double quantization to improve accuracy
        bnb_4bit_quant_type="nf4"                 # Type of quantization. "nf4" is recommended for recent LLMs
    )
)
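
If you prefer the plain-LoRA path mentioned above, a minimal sketch of the same load without quantization looks like this (half precision is used here, since a full-precision 4B model may not fit in free Colab memory):

python
# Plain-LoRA variant (sketch): same load as above, just without `quantization_config`
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="float16",      # half precision to keep the non-quantized model within T4 memory
    device_map="auto",
)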

The following cell defines the LoRA configuration (the same config is used for QLoRA). When training with LoRA/QLoRA, we keep the base model (the one loaded above) frozen and, instead of modifying its original weights, fine-tune a LoRA adapter: a lightweight set of low-rank layers that enables efficient, memory-friendly training. The target_modules specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.

python
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different VLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
)
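
If you switch to a different VLM and are unsure which layer names to target, one quick way to find candidates (a generic PyTorch sketch, not a PEFT API) is to list the linear layers of the loaded model:

python
import torch.nn as nn

# Collect the unique trailing names of Linear-like modules; these are what `target_modules` matches.
# With bitsandbytes quantization, linear layers are subclasses of nn.Linear, so they still show up here.
linear_layer_names = sorted({
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
})
print(linear_layer_names)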

Train model

We'll configure SFT using SFTConfig, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the TRL SFTConfig documentation.

python
from trl import SFTConfig

output_dir = "Qwen3-VL-4B-Instruct-trl-sft"

# Configure training arguments using SFTConfig
training_args = SFTConfig(
    # Training schedule / optimization
    #num_train_epochs=1,
    max_steps=10,                                         # Number of optimizer steps for this quick demo. For full trainings, use `num_train_epochs` instead
    per_device_train_batch_size=2,                        # Batch size per GPU/CPU
    gradient_accumulation_steps=8,                        # Gradients are accumulated over multiple steps → effective batch size = 2 * 8 = 16
    warmup_steps=5,                                       # Gradually increase LR during first N steps
    learning_rate=2e-4,                                   # Learning rate for the optimizer
    optim="adamw_8bit",                                   # Optimizer
    max_length=None,                                      # For VLMs, truncating may remove image tokens, leading to errors during training. max_length=None avoids it

    # Logging / reporting
    output_dir=output_dir,                                # Where to save model checkpoints and logs
    logging_steps=1,                                      # Log training metrics every N steps
    report_to="trackio",                                  # Experiment tracking tool

    # Hub integration
    push_to_hub=True,
)

Configure the SFT Trainer, passing the previously configured training_args. We don't use an eval dataset to keep memory usage low, but you can configure one (see the sketch after the next cell).

python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
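
If you have memory to spare, a minimal sketch of adding evaluation would be to hold out a slice of the dataset, enable evaluation in SFTConfig (e.g., eval_strategy="steps" and eval_steps), and pass the held-out split to the trainer. The split expression and step values below are illustrative:

python
# Illustrative: hold out a small slice of the same dataset for evaluation
eval_dataset = load_dataset(dataset_name, split="train[10%:11%]")

trainer = SFTTrainer(
    model=model,
    args=training_args,              # with `eval_strategy="steps"` and `eval_steps` set in SFTConfig above
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,       # evaluated periodically during training
    peft_config=peft_config,
)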

Show memory stats before training

python
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

And train!

python
trainer_stats = trainer.train()

Show memory stats after training

python
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Saving the fine-tuned model

In this step, we save the fine-tuned model both locally and to the Hugging Face Hub using the credentials from your account.

python
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)

Load the fine-tuned model and run inference

Now, let's test our fine-tuned model by loading the LoRA/QLoRA adapter and performing inference. We'll start by loading the base model, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.

python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel

base_model = model_name
adapter_model = output_dir # Replace with your HF username or organization + fine-tuned model name to load the adapter from the Hub

model = Qwen3VLForConditionalGeneration.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)

processor = AutoProcessor.from_pretrained(base_model)

Now build the chat-style messages from one example of the training dataset:

python
problem = train_dataset[0]['prompt'][0]['content']
image = train_dataset[0]['images'][0]

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": problem},
        ],
    },
]

python
messages

python
image

python
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
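
Optionally, if you want a standalone model that no longer needs PEFT at inference time, you can merge the adapter weights into the base model. A minimal sketch (the output path is illustrative):

python
# Optional: fold the LoRA adapter into the base weights for standalone use
merged_model = model.merge_and_unload()                                # returns the base model with adapter weights merged in
merged_model.save_pretrained("Qwen3-VL-4B-Instruct-trl-sft-merged")   # illustrative local path
processor.save_pretrained("Qwen3-VL-4B-Instruct-trl-sft-merged")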