GRPO EssentialAI/rnj-1-instruct with QLoRA using TRL

With Transformers Reinforcement Learning (TRL), you can fine-tune cutting-edge large language models. It supports QLoRA, a quantized parameter-efficient fine-tuning technique, so we can use Colab to fine-tune models like EssentialAI/rnj-1-instruct.

TRL GitHub Repository — star us to support the project!
Official TRL Examples
Community Tutorials

In this notebook, we'll add reasoning capabilities to the model, teaching it to generate reasoning traces (<think></think>) before giving us the final answer (<answer></answer>).

Install dependencies

We'll install TRL with the PEFT extra, which ensures all main dependencies such as Transformers and PEFT (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install trackio to log and monitor our experiments, and bitsandbytes to enable quantization of LLMs, reducing memory consumption for both inference and training.

python

!pip install -Uq "trl[peft]" bitsandbytes trackio math_verify

Log in to Hugging Face

Log in to your Hugging Face account to save your fine-tuned model, track your experiment results directly on the Hub, or access gated models. You can find your access token on your account settings page.

python

from huggingface_hub import notebook_login

notebook_login()

Load dataset

We'll load the AI-MO/NuminaMath-TIR dataset from the Hugging Face Hub using the datasets library.

This dataset contains maths problems, each paired with a step-by-step solution written in a reasoning format tailored for LLMs. Training on it is meant to improve the model's mathematical reasoning.

We only use a subset for educational purposes. In a real scenario, we'd use the complete dataset.

python

from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_id, split='train[:5%]')

In addition to the existing columns, we add a custom system prompt that tells the model how to structure its output.

This system prompt is an adapted version of the original one extracted from DeepSeek R1. For additional background, see this previous recipe. We extend the prompt with examples and a more explicit, verbose formulation to make the desired behavior easier for the model to learn. Depending on your goals, you may further enrich the prompt to simplify learning, or intentionally shorten and harden it to encourage more robust and generalizable behavior.

We convert the dataset samples into conversation samples, including the system prompt and problem description per sample, since this is how the GRPO trainer expects them.

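Prompts also need to be padded on the left (padding_side="left") so that the completions generated during training are concatenated directly after each prompt. This matters for GRPO, which computes token-level probabilities for the group of completions sampled from the same prompt and scores them relative to one another. The code below does not set padding_side itself: when no tokenizer is passed, GRPOTrainer is expected to load one with left padding. If you pass your own tokenizer, make sure it uses padding_side="left".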
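To see why the padding side matters for generation, here is a minimal pure-Python sketch. The token IDs, the `PAD` value, and the `pad_batch` helper are all hypothetical and only illustrate the idea; in practice the tokenizer handles padding.

```python
PAD = 0  # hypothetical padding token ID

def pad_batch(sequences, side="left"):
    """Pad lists of token IDs to the same length on the given side."""
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [PAD] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded

prompts = [[11, 12, 13], [21]]  # two prompts of different lengths (made-up IDs)

left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")

# Left padding: every prompt ends at the last position, so the first
# generated token directly follows real prompt tokens in every row.
print(left)   # [[11, 12, 13], [0, 0, 21]]

# Right padding: the shorter prompt ends in PAD tokens, so generation
# would continue after padding instead of after the prompt.
print(right)  # [[11, 12, 13], [21, 0, 0]]
```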

python

SYSTEM_PROMPT = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Use exactly one <think>...</think> block followed by exactly one <answer>...</answer> block.

Examples:

User: What is 2 + 2?
Assistant:
<think>
I will add 2 and 2 together.
</think>
<answer>4</answer>

User: What is 3 × 5?
Assistant:
<think>
I will multiply 3 by 5.
</think>
<answer>15</answer>

User: Find the GCD of 12 and 18.
Assistant:
<think>
I will list the divisors of 12 and 18 and find the greatest one they have in common.
</think>
<answer>6</answer>
"""

def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }

train_dataset = train_dataset.map(make_conversation)

Let's review one example to understand the internal structure:

python

print(train_dataset[0])

And remove the columns that are not needed for training:

python

train_dataset = train_dataset.remove_columns(['messages', 'problem'])
print(train_dataset)

Load model and configure LoRA/QLoRA

This notebook can be used with two fine-tuning methods. By default, it is set up for QLoRA, which includes quantization using BitsAndBytesConfig. If you prefer to use standard LoRA without quantization, simply comment out the BitsAndBytesConfig configuration.

python

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

model_name = "EssentialAI/rnj-1-instruct"

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    dtype="float32",
    device_map="auto",
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.float16
    ),
)
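As a rough illustration of why 4-bit quantization helps on a Colab GPU, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is an assumed value for illustration, and `weight_memory_gb` is a hypothetical helper. Real usage is higher, because some layers are typically left unquantized and activations, the adapter, and optimizer state also take memory.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate memory needed to store the weights, in GiB."""
    return num_params * bits_per_param / 8 / 1024**3

NUM_PARAMS = 8e9  # assumed parameter count, for illustration only

fp32_gb = weight_memory_gb(NUM_PARAMS, 32)
fp16_gb = weight_memory_gb(NUM_PARAMS, 16)

# NF4 stores 4 bits per weight plus quantization constants. The QLoRA paper
# reports about 0.127 extra bits per parameter when double quantization is on.
nf4_gb = weight_memory_gb(NUM_PARAMS, 4 + 0.127)

print(f"float32 weights: {fp32_gb:.1f} GiB")
print(f"float16 weights: {fp16_gb:.1f} GiB")
print(f"NF4 weights (double quant): {nf4_gb:.1f} GiB")
```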

The following cell defines the LoRA configuration (the same configuration is used for QLoRA). When training with LoRA/QLoRA, we keep the base model selected above frozen and, instead of modifying its original weights, we fine-tune a LoRA adapter: a small set of additional low-rank weights that enables efficient, memory-friendly training. The target_modules specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.

python

from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
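To get a feel for why the adapter is lightweight, the sketch below counts the trainable parameters LoRA adds to a single linear layer. The layer shape is hypothetical, and `lora_param_count` is an illustrative helper, not part of PEFT. It assumes the standard LoRA formulation, where the update is the product of two low-rank matrices scaled by lora_alpha / r.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters of one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r

d_in, d_out = 4096, 4096  # hypothetical layer shape, for illustration only
r, lora_alpha = 32, 32    # values from the LoraConfig above

full = d_in * d_out
adapter = lora_param_count(d_in, d_out, r)
scaling = lora_alpha / r  # factor applied to the low-rank update B @ A

print(f"Full weight matrix: {full:,} parameters")
print(f"LoRA adapter:       {adapter:,} parameters ({adapter / full:.2%} of the full matrix)")
print(f"Update scaling (lora_alpha / r): {scaling}")
```

With r equal to lora_alpha, the scaling factor is 1, so the low-rank update is added to the frozen weights unscaled.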

Train model

We'll configure GRPO using GRPOConfig, keeping the parameters minimal so the training fits on a Colab instance. You can adjust these settings depending on the resources available. For full details on all available parameters, check the TRL GRPOConfig documentation.

First, we need to define the reward functions that the training algorithm will use to improve the model. In this case, we'll include just one: a format reward that rewards the model when its output contains a <think> block followed by an <answer> block. This simplifies the pipeline for educational purposes; in a real scenario, you'd also need at least one reward function that checks the correctness of the model's answer. The function has been extracted from here.

💡 Note:
You can further refine this reward by making it more granular. For example, assigning partial rewards when <think> and <answer> appear independently, or when they are present but incorrectly ordered. This can make the learning signal denser and speed up early training. However, overly simplifying the reward may reduce robustness, even if it helps the model converge faster. In practice, there is a trade-off between ease of learning and the generalization quality of the final model.
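As a sketch of the note above, a more granular format reward could give partial credit for each tag pair and a bonus for the correct order. The function name and the 0.25/0.25/0.5 weights are hypothetical choices for illustration, not values from TRL or from this notebook.

```python
import re

def granular_format_reward(completions, **kwargs):
    """Hypothetical partial-credit variant of a format reward (illustration only)."""
    rewards = []
    for item in completions:
        # Conversational completions arrive as a list of message dicts
        text = item[0]["content"] if isinstance(item, list) else item

        think = re.search(r"<think>.*?</think>", text, re.DOTALL)
        answer = re.search(r"<answer>.*?</answer>", text, re.DOTALL)

        score = 0.0
        if think:
            score += 0.25  # reasoning block present
        if answer:
            score += 0.25  # answer block present
        if think and answer and think.end() <= answer.start():
            score += 0.5   # both present and correctly ordered
        rewards.append(score)
    return rewards

examples = [
    "<think>a</think><answer>1</answer>",  # well-formed
    "<answer>1</answer><think>a</think>",  # wrong order
    "<think>a</think>",                    # answer missing
    "no tags",                             # nothing usable
]
print(granular_format_reward(examples))  # [1.0, 0.5, 0.25, 0.0]
```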

python

import re

def format_reward(completions, **kwargs):
    """Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
    pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"

    matches = []
    for item in completions:
        if isinstance(item, list):
            text = item[0]['content']
        else:
            text = item
        match = re.match(pattern, text, re.DOTALL | re.MULTILINE)
        matches.append(match)

    return [1.0 if match else 0.0 for match in matches]
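The text above mentions that a real pipeline would also check answer correctness. The following is a minimal sketch of such a reward under loud assumptions: it relies on a hypothetical `reference_answer` dataset column holding only the final answer as a string, and on the trainer forwarding dataset columns to reward functions as keyword arguments. Plain string comparison is crude; a math-aware checker such as the math_verify package installed earlier would be more appropriate for real use.

```python
import re

def extract_answer(text):
    """Return the content of the first <answer>...</answer> block, or None."""
    match = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return match.group(1).strip() if match else None

def exact_match_reward(completions, reference_answer, **kwargs):
    """Hypothetical correctness reward: 1.0 when the extracted answer equals the reference string."""
    rewards = []
    for item, reference in zip(completions, reference_answer):
        # Conversational completions arrive as a list of message dicts
        text = item[0]["content"] if isinstance(item, list) else item
        predicted = extract_answer(text)
        is_correct = predicted is not None and predicted == reference.strip()
        rewards.append(1.0 if is_correct else 0.0)
    return rewards

# Made-up completions and references used only to exercise the function
demo_completions = [
    "<think>2 + 2</think><answer> 4 </answer>",
    "<think>guess</think><answer>5</answer>",
    "no answer tags",
]
demo_rewards = exact_match_reward(demo_completions, reference_answer=["4", "4", "4"])
print(demo_rewards)  # [1.0, 0.0, 0.0]
```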

After defining the reward function(s), we can define the GRPOConfig. You can adapt the values in the config depending on your training setting and even fit the training in more constrained setups like free Colab (T4).

python

from trl import GRPOConfig

output_dir = "EssentialAI-rnj-1-instruct-trl-grpo"

# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
    learning_rate=2e-5,                                   # Learning rate used during training
    num_train_epochs=1,                                   # Number of full dataset passes. For testing, use `max_steps` instead
    #max_steps=100,

    # Parameters that control batching and generation
    per_device_train_batch_size=8,
    max_completion_length=256, # default: 256             # Max completion length produced during training
    num_generations=8, # default: 8                       # Number of generations produced during training for comparison

    # Parameters related to reporting and saving
    output_dir=output_dir,                                # Where to save model checkpoints and logs
    logging_steps=10,                                     # Log training metrics every N steps
    report_to="trackio",                                  # Experiment tracking tool
    trackio_space_id = output_dir,                        # HF Space where your trackio dashboard will be hosted

    # Hub integration
    push_to_hub=True,                                     # Push the resulting model to the Hub
    log_completions=True,                                 # Log completions during training
)
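To illustrate what num_generations is used for: GRPO samples a group of completions for each prompt and turns their rewards into advantages relative to the group. The sketch below shows the basic idea with a mean and standard-deviation normalization. The helper name, the `eps` value, and the reward values are assumptions for illustration, and the exact normalization in TRL may differ.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize one prompt's completion rewards by the group mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical format rewards for the 8 completions sampled from one prompt
group_rewards = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group_rewards)

# Completions that earned the reward get a positive advantage, the rest a negative one
for reward, advantage in zip(group_rewards, advantages):
    print(f"reward={reward:.1f} -> advantage={advantage:+.3f}")

# If every completion gets the same reward, all advantages are zero,
# so that prompt contributes no learning signal.
uniform = group_relative_advantages([1.0] * 8)
print(uniform)
```

This is one reason a denser or more granular reward can help early in training: groups in which every completion scores 0 carry no signal.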

Configure the GRPO trainer, passing the previously defined training_args. We don't pass an evaluation dataset, to keep memory usage low, but you can add one.

python

from trl import GRPOTrainer

trainer = GRPOTrainer(
    model=model,
    reward_funcs=[format_reward],
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)

Show memory stats before training

python

gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

And train!

python

trainer_stats = trainer.train()

Show memory stats after training

python

used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Save the fine-tuned model

In this step, we save the fine-tuned model both locally and to the Hugging Face Hub using the credentials from your account.

python

trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_id)

Load the fine-tuned model and run inference

Now, let's test our fine-tuned model by loading the LoRA/QLoRA adapter and performing inference. We'll start by loading the base model, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.

python

output_dir = 'sergiopaniego/EssentialAI-rnj-1-instruct-trl-grpo'
model_name = "EssentialAI/rnj-1-instruct"

python

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_model = model_name
adapter_model = output_dir  # Hub repo ID of the trained adapter; replace the username or organization with your own

model = AutoModelForCausalLM.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)

tokenizer = AutoTokenizer.from_pretrained(base_model)

python

train_dataset[0]

python

from datasets import load_dataset

dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_id, split='train[:5%]')

problem = train_dataset[0]['problem']

# Use plain-string content, matching the prompt format used during training
messages = [
    {"role": "system", "content": SYSTEM_PROMPT},
    {"role": "user", "content": problem},
]

python

messages

python

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=False,
).to(model.device)

# --- Generate Prediction --- #
print("Generating prediction...")
output_ids = model.generate(
    input_ids,
    max_new_tokens=256,  # Match max_completion_length from training so the answer is not cut off
    pad_token_id=tokenizer.eos_token_id,
    do_sample=True,
    temperature=0.2,
    top_p=0.95
)

response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)
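Once the model follows the format, the decoded response can be split into its reasoning and answer parts. The helper below is a minimal sketch; `parse_response` and the sample strings are hypothetical and not part of TRL or Transformers.

```python
import re

def parse_response(text):
    """Split a completion into reasoning and answer (None when a tag pair is missing)."""
    think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answer.group(1).strip() if answer else None,
    }

# Made-up completion used only to demonstrate the helper
sample = "<think>\nI will add 2 and 2 together.\n</think>\n<answer>4</answer>"
parsed = parse_response(sample)
print(parsed["answer"])  # 4

# A completion cut off by the generation length limit has no closing tags
truncated = parse_response("<think>\nI will add 2 and")
print(truncated)  # {'reasoning': None, 'answer': None}
```

In the notebook, you would call `parse_response(response)` on the decoded output. A missing answer usually means the format was not followed or the generation was truncated.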