examples/notebooks/sft_tool_calling.ipynb
Learn how to teach a language model to perform tool calling using Supervised Fine-Tuning (SFT) with LoRA/QLoRA and the TRL library.
The model used in this notebook does not have native tool-calling support. We extend its Jinja2 chat template (via tiny_aya_chat_template.jinja) to serialize tool schemas into the system preamble and render tool calls as structured <tool_call> XML inside the model's native <|START_RESPONSE|> / <|END_RESPONSE|> delimiters. The modified template is saved with the tokenizer, making inference reproducible: just load the tokenizer from the output directory and call apply_chat_template with tools=TOOLS.
We'll install TRL with the PEFT extra, which brings in the main dependencies such as Transformers and PEFT (parameter-efficient fine-tuning). We also install trackio for experiment logging, bitsandbytes for 4-bit quantization, and liger-kernel for memory-efficient fused kernels.
!pip install -Uq "trl[peft]" trackio bitsandbytes liger-kernel
Log in to your Hugging Face account to push the fine-tuned model to the Hub and access gated models. You can find your access token on your account settings page.
from huggingface_hub import notebook_login
notebook_login()
We load the bebechien/SimpleToolCalling dataset, which contains user queries paired with the correct tool call to handle each request. Each sample provides a user_content, a tool_name, and tool_arguments.
from datasets import load_dataset
dataset_name = "bebechien/SimpleToolCalling"
dataset = load_dataset(dataset_name, split="train")
dataset
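As a quick sanity check, you can confirm the raw columns before converting them; the expected fields match the dataset description above.
# Peek at the raw schema and one sample row.
print(dataset.column_names)  # expected: ['user_content', 'tool_name', 'tool_arguments']
print(dataset[0])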
We define two tools: search_knowledge_base for internal company documents and search_google for public information. We then write a custom Jinja2 chat template that extends the model's default template with two additions:
- Tool schemas are serialized into the system preamble when tools is passed to apply_chat_template.
- Assistant tool_calls render as structured <tool_call> XML inside the model's existing <|START_RESPONSE|> / <|END_RESPONSE|> delimiters.

Each training sample uses the standard tool_calls message format with a tools key; SFTTrainer passes these to apply_chat_template automatically.
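For intuition, a rendered assistant turn is expected to look roughly like the sketch below; the exact layout comes from tiny_aya_chat_template.jinja, so treat this as an illustration rather than the template's literal output.
# Illustrative only: a tool call rendered per the description above; actual
# whitespace and tag attributes are defined by the template.
rendered = (
    "<|START_RESPONSE|>"
    '<tool_call>{"name": "search_google", "arguments": {"query": "..."}}</tool_call>'
    "<|END_RESPONSE|>"
)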
import json

# These are the tool schemas that are used in the dataset
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "search_knowledge_base",
            "description": "Search internal company documents, policies and project data.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "query string"}},
                "required": ["query"],
            },
            "return": {"type": "string"},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "search_google",
            "description": "Search public information.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string", "description": "query string"}},
                "required": ["query"],
            },
            "return": {"type": "string"},
        },
    },
]
def create_conversation(sample):
    return {
        "prompt": [{"role": "user", "content": sample["user_content"]}],
        "completion": [
            {
                "role": "assistant",
                "tool_calls": [
                    {
                        "type": "function",
                        "function": {
                            "name": sample["tool_name"],
                            "arguments": json.loads(sample["tool_arguments"]),
                        },
                    }
                ],
            },
        ],
        "tools": TOOLS,
    }
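Before mapping, a small optional check (not part of the original flow) confirms that every tool_name referenced in the dataset has a matching schema in TOOLS.
# Optional sanity check: every tool the dataset references should have a schema.
known_tools = {t["function"]["name"] for t in TOOLS}
assert set(dataset.unique("tool_name")) <= known_tools, "dataset references an unknown tool"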
dataset = dataset.map(create_conversation, remove_columns=dataset.features)
# Split dataset into 50% training samples and 50% test samples
dataset = dataset.train_test_split(test_size=0.5, shuffle=True)
Let's inspect an example from the training set to verify the format:
dataset['train'][0]
dataset
Choose the model you want to fine-tune. This notebook uses CohereLabs/tiny-aya-global by default.
model_id, output_dir = "CohereLabs/tiny-aya-global", "tiny-aya-global-SFT" # ✅ ~9.1 GB VRAM
Load the model with 4-bit quantization using BitsAndBytesConfig (QLoRA). To use standard LoRA without quantization, comment out the quantization_config parameter. We also download the custom chat template so that SFTConfig can install it on the tokenizer (via chat_template_path) before training.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",  # Change to Flash Attention if GPU has support
    dtype=torch.float16,  # Change to bfloat16 if GPU has support
    use_cache=True,  # Whether to cache attention outputs to speed up inference
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,  # Load the model in 4-bit precision to save memory
        bnb_4bit_compute_dtype=torch.float16,  # Data type used for internal computations in quantization
        bnb_4bit_use_double_quant=True,  # Use double quantization to improve accuracy
        bnb_4bit_quant_type="nf4",  # Type of quantization. "nf4" is recommended for recent LLMs
    ),
)
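If you prefer standard LoRA without quantization (higher VRAM, no bitsandbytes), a minimal variant of the call above simply drops quantization_config.
# Alternative sketch: standard LoRA with no 4-bit quantization. Uncomment to use;
# expect a noticeably larger memory footprint than the QLoRA setup above.
# model = AutoModelForCausalLM.from_pretrained(
#     model_id,
#     attn_implementation="sdpa",
#     dtype=torch.float16,  # or torch.bfloat16 if your GPU supports it
# )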
!wget https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/tiny_aya_chat_template.jinja
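You can optionally peek at the downloaded template to confirm it contains the tool-handling blocks described earlier.
# Optional: print the start of the template to verify the download.
with open("tiny_aya_chat_template.jinja") as f:
    print(f.read()[:500])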
Configure LoRA. Instead of updating the model's original weights, we fine-tune a lightweight LoRA adapter. The target_modules specify which layers receive the adapter — update these if using a different model architecture.
from peft import LoraConfig

# You may need to update `target_modules` depending on the architecture of your chosen model.
# For example, different LLMs might have different attention/projection layer names.
peft_config = LoraConfig(
    r=32,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"],
)
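If you switch architectures and are unsure which layer names to target, a small helper like this (an illustrative sketch, not part of the original notebook) lists the Linear module names present in the loaded model; quantized bitsandbytes layers typically subclass nn.Linear, so they are included.
import torch.nn as nn

# List candidate module names for LoRA's target_modules in any architecture.
linear_names = {
    name.split(".")[-1]
    for name, module in model.named_modules()
    if isinstance(module, nn.Linear)
}
print(sorted(linear_names))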
Configure the training run with SFTConfig. The settings below are tuned for low memory usage. For full details on available parameters, see the TRL SFTConfig documentation.
from trl import SFTConfig

training_args = SFTConfig(
    # Training schedule / optimization
    per_device_train_batch_size=1,  # Batch size per GPU
    gradient_accumulation_steps=4,  # Effective batch size = 1 * 4 = 4
    warmup_steps=5,
    learning_rate=2e-4,  # Learning rate for the optimizer
    optim="paged_adamw_8bit",  # Optimizer
    chat_template_path="tiny_aya_chat_template.jinja",  # Use the tool-aware chat template
    # Logging / reporting
    logging_steps=1,  # Log training metrics every N steps
    report_to="trackio",  # Experiment tracking tool
    trackio_space_id=output_dir,  # HF Space where the experiment tracking will be saved
    output_dir=output_dir,  # Where to save model checkpoints and logs
    max_length=1024,  # Maximum input sequence length
    activation_offloading=True,  # Offload activations to CPU to reduce GPU memory usage
    # Hub integration
    push_to_hub=True,  # Automatically push the trained model to the Hugging Face Hub
    # The model will be saved under your Hub account in the repository named `output_dir`
)
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=dataset['train'],
    peft_config=peft_config,
)
Show memory stats before training:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
And train!
trainer_stats = trainer.train()
Show memory stats after training:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Save the trained LoRA adapter locally and push it to the Hugging Face Hub.
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)
Load the trained LoRA adapter on top of the base model and merge it into the weights for efficient inference.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load from output_dir to get the tokenizer with the updated chat template
tokenizer = AutoTokenizer.from_pretrained(output_dir)

base_model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",
    dtype=torch.float16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, output_dir)
model = model.merge_and_unload()
model.eval()
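If you want a checkpoint that loads without PEFT, you can optionally save the merged weights; the directory name below is illustrative, not from the original notebook.
# Optional: persist the merged model and tokenizer for standalone loading.
# merged_dir = f"{output_dir}-merged"  # illustrative name
# model.save_pretrained(merged_dir)
# tokenizer.save_pretrained(merged_dir)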
Define a prediction function that uses apply_chat_template with tools=TOOLS to construct the prompt. The model generates a JSON tool call inside its native response delimiters; skip_special_tokens=True strips those delimiters, leaving just the JSON string.
def generate_prediction(prompt):
    # Render the conversation with the tool schemas injected by the chat template.
    text = tokenizer.apply_chat_template(
        prompt, tools=TOOLS, tokenize=False, add_generation_prompt=True
    )
    model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=512,
    )
    # Keep only the newly generated tokens, then strip the special delimiters.
    output_ids = generated_ids[0][len(model_inputs.input_ids[0]):]
    return tokenizer.decode(output_ids, skip_special_tokens=True)
Let's test the fine-tuned model on an example from the test set:
sample_test_data = dataset["test"][0] # Get a sample from the test set
user_content = sample_test_data["prompt"]
print(f"User Query: {user_content}")
predicted_output = generate_prediction(user_content)
print(f"Predicted Output: {predicted_output}")
The fine-tuned model still retains the base model's strong multilingual capabilities:
user_content = "Explica en español qué significa la palabra japonesa 'ikigai' y da un ejemplo práctico." # Spanish question
user_content = [{"role": "user", "content": user_content}]
print(f"User Query: {user_content}")
predicted_output = generate_prediction(user_content)
print(f"Predicted Output: {predicted_output}")