examples/notebooks/sft_qwen_vl.ipynb
With Transformer Reinforcement Learning (TRL), you can fine-tune cutting-edge vision language models. It supports QLoRA, a quantized parameter-efficient fine-tuning technique, so we can fine-tune models like Qwen3-VL on a free Colab instance (T4 GPU).
We'll install TRL with the PEFT extra, which ensures all main dependencies such as Transformers and PEFT (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install trackio to log and monitor our experiments, and bitsandbytes to enable quantization of LLMs, reducing memory consumption for both inference and training.
!pip install -Uq "trl[peft]" bitsandbytes trackio
Log in to your Hugging Face account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your access token on your account settings page.
from huggingface_hub import notebook_login
notebook_login()
We'll load the trl-lib/llava-instruct-mix dataset from the Hugging Face Hub using the datasets library.
This dataset is a set of GPT-generated multimodal instruction-following data. We use a processed version here for convenience. For more details on configuring your own multimodal dataset for training with SFT, see the docs. Fine-tuning Qwen3-VL on this dataset helps refine its response style and visual understanding.
from datasets import load_dataset
dataset_name = "trl-lib/llava-instruct-mix"
train_dataset = load_dataset(dataset_name, split="train[:10%]")
Let's review one example to understand the internal structure:
train_dataset[0]
This notebook can be used with two fine-tuning methods. By default, it is set up for QLoRA, which includes quantization using BitsAndBytesConfig. If you prefer to use standard LoRA without quantization, simply comment out the BitsAndBytesConfig configuration.
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch
model_name = "Qwen/Qwen3-VL-4B-Instruct" # "Qwen/Qwen3-VL-8B-Instruct"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_name,
    dtype="float32",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,  # Load the model in 4-bit precision to save memory
        bnb_4bit_compute_dtype=torch.float16,  # Data type used for internal computations in quantization
        bnb_4bit_use_double_quant=True,  # Also quantize the quantization constants to save extra memory
        bnb_4bit_quant_type="nf4",  # Type of quantization. "nf4" is recommended for recent LLMs
    ),
)
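To get a feel for why 4-bit loading matters on a 16 GB T4, here is a rough back-of-the-envelope sketch. It assumes about 4 billion parameters (taken from the "4B" in the model name, not measured) and counts only the weights: quantization constants, modules kept in higher precision, activations, gradients, and optimizer state are all ignored, so real usage will be higher. The helper `weight_memory_gb` is a hypothetical name used for illustration.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to store the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# Assumption: roughly 4 billion parameters, read off the "4B" in the model name.
num_params = 4e9

# Compare storage cost at different precisions (weights only).
for label, bits in [("float32", 32), ("float16", 16), ("4-bit (nf4)", 4)]:
    print(f"{label:>12}: ~{weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions, full-precision weights alone would roughly fill the T4, while 4-bit weights leave room for the adapter, activations, and optimizer state.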
The following cell defines the LoRA configuration (it acts as QLoRA when the base model is loaded with quantization, as above). When training with LoRA/QLoRA, the weights of the base model selected above stay frozen. Instead of modifying them, we train a LoRA adapter: a small set of low-rank matrices added to selected layers, which makes training efficient and memory-friendly. The target_modules setting specifies which parts of the model (e.g., attention or projection layers) LoRA adapts during fine-tuning.
from peft import LoraConfig
# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different VLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=['down_proj','o_proj','k_proj','q_proj','gate_proj','up_proj','v_proj'],
)
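As a rough illustration of why a LoRA adapter is lightweight, the sketch below counts the parameters LoRA adds to a single linear layer. For a weight of shape (d_out x d_in), LoRA trains two low-rank matrices, A of shape (r x d_in) and B of shape (d_out x r), and scales their product by lora_alpha / r. The layer size used here is a hypothetical value chosen for illustration, not one read from Qwen3-VL, and `lora_trainable_params` is an illustrative helper, not part of PEFT.

```python
def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    """Parameters added by one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * (d_in + d_out)


# Hypothetical layer size for illustration only; not taken from Qwen3-VL.
d_in, d_out = 4096, 4096
r, lora_alpha = 32, 32  # values used in the LoraConfig of this notebook

full = d_in * d_out                              # parameters in the frozen weight
added = lora_trainable_params(d_in, d_out, r)    # parameters that are trained
scaling = lora_alpha / r                         # factor applied to the low-rank update

print(f"full weight:  {full:,} params")
print(f"LoRA adapter: {added:,} params ({added / full:.2%} of the full weight)")
print(f"scaling applied to the update: {scaling}")
```

Raising r increases adapter capacity and memory use linearly; with r equal to lora_alpha, the update is applied with a scaling factor of 1.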
We'll configure SFT using SFTConfig, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the TRL SFTConfig documentation.
from trl import SFTConfig
output_dir = "Qwen3-VL-4B-Instruct-trl-sft"
# Configure training arguments using SFTConfig
training_args = SFTConfig(
    # Training schedule / optimization
    #num_train_epochs=1,
    max_steps=10,  # Number of optimizer steps to run. For full training runs, use `num_train_epochs` instead
    per_device_train_batch_size=2,  # Batch size per GPU/CPU
    gradient_accumulation_steps=8,  # Gradients are accumulated over multiple steps → effective batch size = 2 * 8 = 16
    warmup_steps=5,  # Gradually increase LR during first N steps
    learning_rate=2e-4,  # Learning rate for the optimizer
    optim="adamw_8bit",  # Optimizer
    max_length=None,  # For VLMs, truncating may remove image tokens, leading to errors during training. max_length=None avoids it

    # Logging / reporting
    output_dir=output_dir,  # Where to save model checkpoints and logs
    logging_steps=1,  # Log training metrics every N steps
    report_to="trackio",  # Experiment tracking tool

    # Hub integration
    push_to_hub=True,
)
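The batch-size settings in this configuration interact, so it can help to work out what they imply. The short sketch below computes the effective batch size and the number of training samples seen in this short run. It assumes a single GPU, as on a free Colab instance; `num_devices` is an illustrative variable, not an SFTConfig parameter.

```python
per_device_train_batch_size = 2
gradient_accumulation_steps = 8
max_steps = 10
num_devices = 1  # Assumption: a single GPU, as on a free Colab instance

# Each optimizer step aggregates gradients from this many samples.
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps * num_devices

# Total samples consumed over the whole (short) run.
samples_seen = effective_batch_size * max_steps

print(f"Effective batch size: {effective_batch_size}")
print(f"Samples seen in {max_steps} steps: {samples_seen}")
```

With these values the run touches only a small number of samples, which is enough to verify the setup; switch to `num_train_epochs` for a real training run.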
Configure the SFT Trainer. We pass the previously configured training_args. We don't use an evaluation dataset, to keep memory usage low, but you can configure one.
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
Show memory stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
And train!
trainer_stats = trainer.train()
Show memory stats after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
In this step, we save the fine-tuned model both locally and to the Hugging Face Hub using the credentials from your account.
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)
Now, let's test our fine-tuned model by loading the LoRA/QLoRA adapter and performing inference. We'll start by loading the base model, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
base_model = model_name
adapter_model = f"{output_dir}" # Replace with your HF username or organization + fine-tuned model name
model = Qwen3VLForConditionalGeneration.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)
problem = train_dataset[0]['prompt'][0]['content']
image = train_dataset[0]['images'][0]
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": problem},
        ],
    },
]
messages
image
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
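The trimming step in the cell above is needed because, for decoder-only models, generate returns the prompt tokens followed by the newly generated ones. The toy sketch below shows the same slicing on plain Python lists. The token IDs are made up for illustration; real IDs come from the processor and the model.

```python
# Toy token IDs, made up for illustration only.
input_ids = [[101, 7, 8, 9]]                    # one prompt of 4 tokens
generated_ids = [[101, 7, 8, 9, 42, 43, 44]]    # prompt followed by 3 new tokens

# Drop the first len(prompt) tokens from each output sequence.
trimmed = [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]

print(trimmed)  # [[42, 43, 44]]
```

Decoding only the trimmed IDs keeps the prompt text out of the printed answer.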