examples/notebooks/grpo_qwen3_vl.ipynb
With Transformers Reinforcement Learning (TRL), you can fine-tune cutting-edge vision language models (VLMs). TRL supports QLoRA, a quantized parameter-efficient fine-tuning technique, so we can use a free Colab instance (T4 GPU) to fine-tune models like Qwen3-VL.
We'll install TRL with the PEFT extra, which ensures the main dependencies such as Transformers and PEFT (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install trackio to log and monitor our experiments, bitsandbytes to enable quantization and reduce memory consumption for both inference and training, and math_verify to check mathematical answers inside the length-based reward.
!pip install -Uq "trl[peft]" bitsandbytes trackio math_verify
Log in to your Hugging Face account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your access token on your account settings page.
from huggingface_hub import notebook_login
notebook_login()
We'll load the lmms-lab/multimodal-open-r1-8k-verified dataset from the Hugging Face Hub using the datasets library.
This dataset contains math problems, each paired with an image that illustrates the problem and a solution written in a thinking format tailored for VLMs. Training on it should improve the model's math and reasoning abilities.
from datasets import load_dataset
dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'
train_dataset = load_dataset(dataset_id, split='train[:5%]')
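As an optional sanity check, you can quickly inspect which columns the split contains and peek at one raw sample before preprocessing:
# Optional: inspect the columns and one raw problem of the subset we just loaded
print(train_dataset.column_names)
print(train_dataset[0]["problem"][:200])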
In addition to the problem and image columns, we also include a custom system prompt that tells the model how we'd like it to structure its generations.
The system prompt is extracted from DeepSeek R1. Refer to this previous recipe for more details.
We convert the dataset samples into conversation samples, each containing the system prompt plus one image and its problem description, since this is the format the GRPO trainer expects.
We also set padding_side="left" so that padded prompts in a batch end right where the completion begins. This keeps generated completions concatenated directly after the prompt, which GRPO needs in order to compute the token-level log-probabilities of the completions correctly.
from transformers import AutoProcessor
model_name = "Qwen/Qwen3-VL-4B-Instruct" # "Qwen/Qwen3-VL-8B-Instruct"
processor = AutoProcessor.from_pretrained(model_name, padding_side="left")
SYSTEM_PROMPT = (
"You are a helpful AI Assistant that provides well-reasoned and detailed responses. "
"You first think about the reasoning process as an internal monologue and then provide the user with the answer. "
"Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"
)
def make_conversation(example):
prompt = [
{
"role": "system",
"content": [{"type": "text", "text": SYSTEM_PROMPT}],
},
{
"role": "user",
"content": [
{"type": "image", "image": example["image"]},
{"type": "text", "text": example["problem"]},
],
},
]
return {"prompt": prompt, "image": example["image"]}
train_dataset = train_dataset.map(make_conversation)
Let's review one example to understand the internal structure:
train_dataset[0]
train_dataset = train_dataset.remove_columns(['problem', 'original_question', 'original_answer'])
train_dataset[0]
This notebook can be used with two fine-tuning methods. By default, it is set up for QLoRA, which includes quantization using BitsAndBytesConfig. If you prefer to use standard LoRA without quantization, simply comment out the BitsAndBytesConfig configuration.
from transformers import Qwen3VLForConditionalGeneration, BitsAndBytesConfig
import torch
model = Qwen3VLForConditionalGeneration.from_pretrained(
model_name, dtype="float32",
device_map="auto",
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16
),
)
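For reference, the plain-LoRA alternative mentioned above simply drops the quantization_config (left commented out here so the QLoRA path above stays the default):
# Alternative (plain LoRA, no 4-bit quantization): load the model without a quantization_config
# model = Qwen3VLForConditionalGeneration.from_pretrained(
#     model_name,
#     dtype="float32",
#     device_map="auto",
# )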
The following cell defines the LoRA configuration (also used for QLoRA). When training with LoRA/QLoRA, we keep the base model (the one selected above) frozen and, instead of modifying its original weights, fine-tune a LoRA adapter: a lightweight set of low-rank layers that enables efficient, memory-friendly training. The target_modules specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.
from peft import LoraConfig
# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different VLMs might have different attention/projection layer names.
peft_config = LoraConfig(
r=8,
lora_alpha=32,
lora_dropout=0.1,
target_modules=["q_proj", "v_proj"],
)
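If you're unsure which projection layers your chosen model exposes, a quick, purely inspectional way to list candidate module names (an illustrative snippet, not part of the training setup) is:
# Optional: list leaf module names ending in "proj" to pick sensible `target_modules`
# (inspection only; this does not modify the model)
proj_names = sorted({name.split(".")[-1] for name, _ in model.named_modules() if name.endswith("proj")})
print(proj_names)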
We'll configure GRPO using GRPOConfig, keeping the parameters minimal so the training fits on a free Colab instance. You can adjust these settings if more resources are available. For full details on all available parameters, check the TRL GRPOConfig documentation.
First, we need to define the reward functions that the training algorithm will use to improve the model. In this case, we'll include two of them.
We'll use a format reward that rewards the model when the output wraps its reasoning in <think> tags and its final answer in <answer> tags, plus a length-based reward to discourage overthinking. Both functions have been extracted from here.
import re
def format_reward(completions, **kwargs):
"""Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
pattern = r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$"
# Completions arrive in conversational format, so extract the assistant message text first
completion_contents = [completion[0]["content"] for completion in completions]
matches = [re.match(pattern, content, re.DOTALL | re.MULTILINE) for content in completion_contents]
return [1.0 if match else 0.0 for match in matches]
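As a quick, optional sanity check, we can call the format reward on a couple of toy completions written in the conversational format the GRPO trainer passes to reward functions:
# Toy completions: the first follows the <think>/<answer> format, the second doesn't
good = [{"role": "assistant", "content": "<think>\nreasoning\n</think>\n<answer>\n42\n</answer>"}]
bad = [{"role": "assistant", "content": "The answer is 42."}]
print(format_reward([good, bad]))  # expected: [1.0, 0.0]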
from math_verify import LatexExtractionConfig, parse, verify
from latex2sympy2_extended import NormalizationConfig
def len_reward(completions, solution, **kwargs) -> list[float]:
"""Compute length-based rewards to discourage overthinking and promote token efficiency.
Taken from the Kimi 1.5 tech report: https://huggingface.co/papers/2501.12599
Args:
completions: List of model completions
solution: List of ground truth solutions
Returns:
List of rewards where:
- For correct answers: reward = 0.5 - (len - min_len)/(max_len - min_len)
- For incorrect answers: reward = min(0, 0.5 - (len - min_len)/(max_len - min_len))
"""
# Completions arrive in conversational format, so extract the assistant message text first
contents = [completion[0]["content"] for completion in completions]
# First check correctness of answers
correctness = []
for content, sol in zip(contents, solution):
gold_parsed = parse(
sol,
extraction_mode="first_match",
extraction_config=[LatexExtractionConfig()],
)
if len(gold_parsed) == 0:
# Skip unparsable examples
correctness.append(True) # Treat as correct to avoid penalizing
print("Failed to parse gold solution: ", sol)
continue
answer_parsed = parse(
content,
extraction_config=[
LatexExtractionConfig(
normalization_config=NormalizationConfig(
nits=False,
malformed_operators=False,
basic_latex=True,
equations=True,
boxed=True,
units=True,
),
boxed_match_priority=0,
try_extract_without_anchor=False,
)
],
extraction_mode="first_match",
)
correctness.append(verify(answer_parsed, gold_parsed))
# Calculate lengths
lengths = [len(content) for content in contents]
min_len = min(lengths)
max_len = max(lengths)
# If all responses have the same length, return zero rewards
if max_len == min_len:
return [0.0] * len(completions)
rewards = []
for length, is_correct in zip(lengths, correctness):
lambda_val = 0.5 - (length - min_len) / (max_len - min_len)
if is_correct:
reward = lambda_val
else:
reward = min(0, lambda_val)
rewards.append(float(reward))
return rewards
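Similarly (and again optional), we can sanity-check the length reward with two toy completions that share the same correct boxed answer; the shorter one should receive the higher reward. The values below are illustrative inputs, not taken from the dataset:
# Toy check: both completions are correct, so the shorter one should score higher
short_completion = [{"role": "assistant", "content": "<answer>\n$\\boxed{2}$\n</answer>"}]
long_completion = [{"role": "assistant", "content": "Let me reason about this in unnecessary detail. " * 5 + "<answer>\n$\\boxed{2}$\n</answer>"}]
print(len_reward([short_completion, long_completion], solution=["$2$", "$2$"]))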
After defining the reward function(s), we can define the GRPOConfig.
from trl import GRPOConfig
output_dir = "Qwen3-VL-4B-Instruct-trl-grpo"
# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
learning_rate=2e-5,
#num_train_epochs=1,
max_steps=100,  # Total number of training steps. For full trainings, use `num_train_epochs` instead
# Parameters that control the data preprocessing
per_device_train_batch_size=2,
max_completion_length=1024,  # default: 256. Max completion length produced during training
num_generations=2,  # default: 8. Number of completions generated per prompt for comparison
fp16=True,
# Parameters related to reporting and saving
output_dir=output_dir, # Where to save model checkpoints and logs
logging_steps=1, # Log training metrics every N steps
report_to="trackio", # Experiment tracking tool
# Hub integration
push_to_hub=True,
log_completions=True
)
Configure the GRPO Trainer. We pass the previously configured training_args. We don't use an eval dataset to keep memory usage low, but you can configure one.
from trl import GRPOTrainer
trainer = GRPOTrainer(
model=model,
reward_funcs=[format_reward, len_reward],
args=training_args,
train_dataset=train_dataset,
peft_config=peft_config,
)
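Optionally, we can verify that only the LoRA adapter parameters are trainable once the trainer has wrapped the model with the PEFT config:
# Rough sanity check: only the LoRA adapter parameters should require gradients
trainable = sum(p.numel() for p in trainer.model.parameters() if p.requires_grad)
print(f"Trainable (adapter) parameters: {trainable:,}")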
Show memory stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
And train!
trainer_stats = trainer.train()
Show memory stats after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
In this step, we save the fine-tuned model both locally and to the Hugging Face Hub using the credentials from your account.
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_id)
Now, let's test our fine-tuned model by loading the LoRA/QLoRA adapter and performing inference. We'll start by loading the base model, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
base_model = model_name
adapter_model = output_dir  # To load the adapter from the Hub instead, prefix with your username or organization, e.g. "<username>/Qwen3-VL-4B-Instruct-trl-grpo"
model = Qwen3VLForConditionalGeneration.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)
processor = AutoProcessor.from_pretrained(base_model)
from datasets import load_dataset
dataset_id = 'lmms-lab/multimodal-open-r1-8k-verified'
train_dataset = load_dataset(dataset_id, split='train[:5%]')
problem = train_dataset[0]['problem']
image = train_dataset[0]['image']
messages = [
{
"role": "system", "content": [
{"type": "text", "text": SYSTEM_PROMPT}
]
},
{
"role": "user",
"content": [
{"type": "image", "image": image},
{"type": "text", "text": problem},
],
},
]
messages
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
tokenize=True,
return_tensors="pt",
return_dict=True,
).to(model.device)
# Inference: Generation of the output
generated_ids = model.generate(**inputs, max_new_tokens=500)
generated_ids_trimmed = [
out_ids[len(in_ids) :] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
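Optionally, if you'd like a standalone checkpoint that can be loaded without PEFT at inference time, you can merge the adapter into the base weights (a sketch; this works here because we attached the adapter to a non-quantized base model above):
# Optional: merge the LoRA adapter into the base weights and save a standalone model
merged_model = model.merge_and_unload()
merged_model.save_pretrained(f"{output_dir}-merged")
processor.save_pretrained(f"{output_dir}-merged")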