examples/notebooks/grpo_rnj_1_instruct.ipynb
With Transformer Reinforcement Learning (TRL), you can fine-tune cutting-edge large language models. It supports QLoRA, a quantized parameter-efficient fine-tuning technique, so we can use Colab to fine-tune models like EssentialAI/rnj-1-instruct.
In this notebook, we'll add reasoning capabilities to the model, teaching it to generate reasoning traces (<think></think>) before giving us the final answer (<answer></answer>).
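To make the target output format concrete, here is a minimal sketch that splits a completion into its reasoning trace and final answer. The helper name parse_completion is hypothetical and is not part of TRL; it only illustrates the <think>/<answer> structure described above.

```python
import re

# Hypothetical helper (not part of TRL): split a completion into its
# reasoning trace and its final answer.
def parse_completion(text):
    match = re.search(
        r"<think>(.*?)</think>\s*<answer>(.*?)</answer>", text, re.DOTALL
    )
    if match is None:
        # The completion does not follow the expected format.
        return None
    return {"think": match.group(1).strip(), "answer": match.group(2).strip()}

sample = "<think>\nI will add 2 and 2 together.\n</think>\n<answer>4</answer>"
parsed = parse_completion(sample)
print(parsed)
```

A completion that lacks either block returns None, which is the kind of output the format reward defined later in this notebook is meant to discourage.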
We'll install TRL with the PEFT extra, which ensures all main dependencies such as Transformers and PEFT (a package for parameter-efficient fine-tuning, e.g., LoRA/QLoRA) are included. Additionally, we'll install trackio to log and monitor our experiments, and bitsandbytes to enable quantization of LLMs, reducing memory consumption for both inference and training.
!pip install -Uq "trl[peft]" bitsandbytes trackio math_verify
Log in to your Hugging Face account to save your fine-tuned model, track your experiment results directly on the Hub or access gated models. You can find your access token on your account settings page.
from huggingface_hub import notebook_login
notebook_login()
We'll load the AI-MO/NuminaMath-TIR dataset from the Hugging Face Hub using the datasets library.
This dataset contains maths problems, each paired with a solution written in a step-by-step thinking format tailored for LLMs. Training our model on it should improve its mathematical reasoning.
We only use a subset for educational purposes. In a real scenario, we'd use the complete dataset.
from datasets import load_dataset
dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_id, split='train[:5%]')
In addition to the existing columns, we add a custom system prompt that tells the model how we'd like it to structure its output.
This system prompt is an adapted version of the original one extracted from DeepSeek R1. For additional background, see this previous recipe. We extend the prompt with examples and a more explicit, verbose formulation to make the desired behavior easier for the model to learn. Depending on your goals, you may further enrich the prompt to simplify learning, or intentionally shorten and harden it to encourage more robust and generalizable behavior.
We convert the dataset samples into conversation samples, including the system prompt and problem description per sample, since this is how the GRPO trainer expects them.
Prompts are also left-padded (padding_side="left") so that the completions generated during training are concatenated directly after the prompt. GRPO relies on this alignment when it computes token-level probabilities for each completion in a sampled group; unlike preference-based methods, it does not compare preferred and rejected responses.
SYSTEM_PROMPT = """A conversation between User and Assistant. The user asks a question, and the Assistant solves it.
The assistant first thinks about the reasoning process in the mind and then provides the user with the answer.
The reasoning process and answer are enclosed within <think> </think> and <answer> </answer> tags.
Use exactly one <think>...</think> block followed by exactly one <answer>...</answer> block.
Examples:
User: What is 2 + 2?
Assistant:
<think>
I will add 2 and 2 together.
</think>
<answer>4</answer>
User: What is 3 × 5?
Assistant:
<think>
I will multiply 3 by 5.
</think>
<answer>15</answer>
User: Find the GCD of 12 and 18.
Assistant:
<think>
I will list the divisors of 12 and 18 and find the greatest one they have in common.
</think>
<answer>6</answer>
"""
def make_conversation(example):
    return {
        "prompt": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": example["problem"]},
        ],
    }
train_dataset = train_dataset.map(make_conversation)
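The left padding mentioned earlier can be illustrated with toy token ids. This is a sketch only: the helper pad_batch, the ids, and pad_id=0 are made up for illustration and do not reflect the real tokenizer.

```python
# Hypothetical helper: pad a batch of token-id lists to the same width.
# pad_id=0 is an arbitrary value chosen for this illustration.
def pad_batch(sequences, pad_id=0, side="left"):
    width = max(len(seq) for seq in sequences)
    padded = []
    for seq in sequences:
        filler = [pad_id] * (width - len(seq))
        padded.append(filler + seq if side == "left" else seq + filler)
    return padded

prompts = [[11, 12, 13], [21]]  # two toy prompts of different lengths
left = pad_batch(prompts, side="left")
right = pad_batch(prompts, side="right")
print("left :", left)
print("right:", right)
```

With left padding, every prompt ends in the last column, so generated tokens can be appended right after the real prompt tokens. With right padding, the shorter prompt would have pad tokens sitting between the prompt and its completion.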
Let's review one example to understand the internal structure:
print(train_dataset[0])
And remove the columns that are not needed for training:
train_dataset = train_dataset.remove_columns(['messages', 'problem'])
print(train_dataset)
This notebook can be used with two fine-tuning methods. By default, it is set up for QLoRA, which includes quantization using BitsAndBytesConfig. If you prefer to use standard LoRA without quantization, simply comment out the BitsAndBytesConfig configuration.
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
model_name = "EssentialAI/rnj-1-instruct"
model = AutoModelForCausalLM.from_pretrained(
model_name,
dtype="float32",
device_map="auto",
quantization_config=BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16
),
)
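As a rough back-of-the-envelope check on why 4-bit quantization helps, the sketch below estimates the memory needed for the weights alone at different precisions. The parameter count is a hypothetical round number, not a figure taken from the model card, and the estimate ignores quantization constants, activations, gradients, and optimizer state.

```python
# Rough estimate of weight memory only (no activations, no optimizer state).
def weight_memory_gb(num_params, bits_per_param):
    return num_params * bits_per_param / 8 / 1024**3

# Hypothetical parameter count, used only for illustration.
num_params = 8_000_000_000

for label, bits in [("float32", 32), ("float16", 16), ("nf4 (4-bit)", 4)]:
    print(f"{label}: {weight_memory_gb(num_params, bits):.1f} GB")
```

Under these assumptions, 4-bit weights take one eighth of the float32 footprint, which is what makes QLoRA practical on smaller GPUs.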
The following cell defines LoRA (or QLoRA if needed). When training with LoRA/QLoRA, we use a base model (the one selected above) and, instead of modifying its original weights, we fine-tune a LoRA adapter, a lightweight layer that enables efficient and memory-friendly training. The target_modules specify which parts of the model (e.g., attention or projection layers) will be adapted by LoRA during fine-tuning.
from peft import LoraConfig
# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
r=32,
lora_alpha=32,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj",],
)
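To get a feel for how lightweight the adapter is, the sketch below counts the parameters that standard LoRA adds to a single linear layer. For a weight of shape (d_out, d_in), LoRA trains two low-rank matrices of shapes (r, d_in) and (d_out, r), and scales their product by lora_alpha / r. The layer dimensions here are hypothetical and are not read from the model.

```python
# Parameters added by LoRA to one linear layer: A is (r, d_in), B is (d_out, r).
def lora_param_count(d_in, d_out, r):
    return r * (d_in + d_out)

r, lora_alpha = 32, 32   # values from the LoraConfig above
d_in = d_out = 4096      # hypothetical layer size, for illustration only

added = lora_param_count(d_in, d_out, r)
full = d_in * d_out
scaling = lora_alpha / r  # standard LoRA scaling factor

print(f"LoRA parameters for this layer: {added:,}")
print(f"Full weight parameters:         {full:,}")
print(f"Fraction trained:               {added / full:.2%}")
print(f"Scaling factor:                 {scaling}")
```

With r equal to lora_alpha, the scaling factor is 1.0, so the adapter update is applied without extra amplification.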
We'll configure GRPO using GRPOConfig, keeping the parameters minimal so the training fits on a Colab instance. You can adjust these settings depending on the resources available. For full details on all available parameters, check the TRL GRPOConfig documentation.
First, we need to define the reward functions that the training algorithm will use to improve the model. In this case, we'll include just one reward function.
We'll use a format reward that rewards the model when the output includes <think> and <answer> tags. This is a simplification of the pipeline for educational purposes; in a real scenario, you'd at least also need a reward function that checks the correctness of the model's answer. The function has been extracted from here.
💡 Note:
You can further refine this reward by making it more granular. For example, assigning partial rewards when <think> and <answer> appear independently, or when they are present but incorrectly ordered. This can make the learning signal denser and speed up early training. However, overly simplifying the reward may reduce robustness, even if it helps the model converge faster. In practice, there is a trade-off between ease of learning and the generalization quality of the final model.
import re
def format_reward(completions, **kwargs):
    """Reward function that checks if the reasoning process is enclosed within <think> and </think> tags, while the final answer is enclosed within <answer> and </answer> tags."""
    pattern = r"<think>.*?</think>.*?<answer>.*?</answer>"
    matches = []
    for item in completions:
        # Conversational completions arrive as a list of message dicts; plain ones as strings.
        if isinstance(item, list):
            text = item[0]['content']
        else:
            text = item
        # re.match anchors at the start, so the completion must begin with <think>.
        match = re.match(pattern, text, re.DOTALL | re.MULTILINE)
        matches.append(match)
    return [1.0 if match else 0.0 for match in matches]
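The note above suggests a more granular reward. One possible sketch is shown below: it gives partial credit when each block appears, plus a bonus when they appear in the right order. The function name granular_format_reward and the weights 0.25 / 0.25 / 0.5 are arbitrary choices for illustration, not values taken from TRL or from this notebook's training run.

```python
import re

# Hypothetical alternative to format_reward with partial credit.
# The weights below are arbitrary and would need tuning in practice.
def granular_format_reward(completions, **kwargs):
    rewards = []
    for item in completions:
        # Accept both conversational (list of message dicts) and plain-string completions.
        text = item[0]["content"] if isinstance(item, list) else item
        think = re.search(r"<think>.*?</think>", text, re.DOTALL)
        answer = re.search(r"<answer>.*?</answer>", text, re.DOTALL)
        score = 0.0
        if think:
            score += 0.25  # a complete <think> block is present
        if answer:
            score += 0.25  # a complete <answer> block is present
        if think and answer and think.end() <= answer.start():
            score += 0.5   # the reasoning comes before the answer
        rewards.append(score)
    return rewards

print(granular_format_reward([
    "<think>a</think><answer>1</answer>",
    "<answer>1</answer><think>a</think>",
    "<think>a</think>",
    "no tags at all",
]))
```

To try it, you could pass it in reward_funcs alongside or instead of format_reward; whether the denser signal actually helps depends on the model and the data.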
After defining the reward function(s), we can define the GRPOConfig. You can adapt the values in the config depending on your training setting and even fit the training in more constrained setups like free Colab (T4).
from trl import GRPOConfig
output_dir = "EssentialAI-rnj-1-instruct-trl-grpo"
# Configure training arguments using GRPOConfig
training_args = GRPOConfig(
learning_rate=2e-5, # Learning rate used during training
num_train_epochs=1, # Number of full dataset passes. For testing, use `max_steps` instead
#max_steps=100,
# Parameters that control the data preprocessing
per_device_train_batch_size=8,
max_completion_length=256, # default: 256 # Max completion length produced during training
num_generations=8, # default: 8 # Number of generations produced during training for comparison
# Parameters related to reporting and saving
output_dir=output_dir, # Where to save model checkpoints and logs
logging_steps=10, # Log training metrics every N steps
report_to="trackio", # Experiment tracking tool
trackio_space_id = output_dir, # HF Space where your trackio dashboard will be hosted
# Hub integration
push_to_hub=True, # Push the resulting model to the Hub
log_completions=True, # Log completions during training
)
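num_generations in the config above sets how many completions are sampled per prompt. GRPO scores each completion relative to the others in its group rather than with a separate value model. The sketch below shows the core idea, normalizing rewards within one group. It is a simplification: the helper name, the eps value, and the use of the population standard deviation are assumptions, and TRL's actual implementation may differ in these details.

```python
from statistics import mean, pstdev

# Simplified group-relative advantage: normalize rewards within one group
# of completions sampled for the same prompt. eps avoids division by zero.
def group_relative_advantages(rewards, eps=1e-4):
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Toy rewards for 8 completions of one prompt (matches num_generations=8).
group = [1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0]
advantages = group_relative_advantages(group)
print([round(a, 3) for a in advantages])

# If every completion gets the same reward, all advantages are zero.
print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Completions that beat the group average get a positive advantage and the rest get a negative one. When all completions in a group receive the same reward, the advantages collapse to zero and that prompt contributes no learning signal, which is one reason a reward that is too easy or too hard slows training.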
Configure the GRPO Trainer. We pass the previously configured training_args. We don't use an eval dataset, to keep memory usage low, but you can configure one.
from trl import GRPOTrainer
trainer = GRPOTrainer(
model=model,
reward_funcs=[format_reward],
args=training_args,
train_dataset=train_dataset,
peft_config=peft_config,
)
Show memory stats before training
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
And train!
trainer_stats = trainer.train()
Show memory stats after training
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
In this step, we save the fine-tuned model both locally and to the Hugging Face Hub using the credentials from your account.
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_id)
Now, let's test our fine-tuned model by loading the LoRA/QLoRA adapter and performing inference. We'll start by loading the base model, then attach the adapter to it, creating the final fine-tuned model ready for evaluation.
output_dir = 'sergiopaniego/EssentialAI-rnj-1-instruct-trl-grpo'
model_name = "EssentialAI/rnj-1-instruct"
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
base_model = model_name
adapter_model = f"{output_dir}" # Replace with your HF username or organization
model = AutoModelForCausalLM.from_pretrained(base_model, dtype="float32", device_map="auto")
model = PeftModel.from_pretrained(model, adapter_model)
tokenizer = AutoTokenizer.from_pretrained(base_model)
train_dataset[0]
from datasets import load_dataset
dataset_id = 'AI-MO/NuminaMath-TIR'
train_dataset = load_dataset(dataset_id, split='train[:5%]')
problem = train_dataset[0]['problem']
messages = [
{
"role": "system", "content": [
{"type": "text", "text": SYSTEM_PROMPT}
]
},
{
"role": "user",
"content": [
{"type": "text", "text": problem},
],
},
]
messages
input_ids = tokenizer.apply_chat_template(
messages,
add_generation_prompt=True,
return_tensors="pt",
return_dict=False,
).to(model.device)
# --- Generate Prediction --- #
print("Generating prediction...")
output_ids = model.generate(
input_ids,
max_new_tokens=50,
pad_token_id=tokenizer.eos_token_id,
do_sample=True,
temperature=0.2,
top_p=0.95
)
response = tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True)
print(response)