examples/notebooks/sft_nemotron_3.ipynb
Fine-tune NVIDIA Nemotron 3 models using LoRA and SFT with the TRL library.
The Nemotron 3 family includes hybrid Mamba2/Transformer models in different sizes (Nano, Super, etc.), natively supported in transformers (no trust_remote_code needed). See the full collection for all available checkpoints.
Note: This notebook requires a GPU with sufficient VRAM (e.g., A100 80GB). It is not designed for free Colab instances.
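As a quick sanity check (assuming a CUDA runtime), you can confirm a GPU is available and see how much VRAM it has before running anything heavy:
import torch

assert torch.cuda.is_available(), "This notebook requires a CUDA GPU."
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, {props.total_memory / 1024**3:.1f} GB VRAM")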
We install TRL with the PEFT and quantization extras, along with trackio for experiment tracking. Nemotron 3 models use Mamba2 layers, which require the mamba_ssm and causal_conv1d CUDA kernels.
!pip install -Uq "trl[peft,quantization]" trackio
# Nemotron 3 requires transformers>=5.3.0, which may not be installed by default with TRL
!pip install -Uq "transformers>=5.3.0"
# Mamba2 CUDA kernels (--no-build-isolation needed so they find your PyTorch/CUDA installation)
!pip install --no-build-isolation mamba_ssm==2.2.5 # installation takes around 30 mins
!pip install --no-build-isolation causal_conv1d==1.5.2
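Optionally, verify that everything imports cleanly before moving on. Importing mamba_ssm and causal_conv1d will fail here if the CUDA kernels did not build against your PyTorch installation:
import transformers
import mamba_ssm  # noqa: F401 -- fails if the CUDA kernels did not build correctly
import causal_conv1d  # noqa: F401

print("transformers:", transformers.__version__)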
Log in to your Hugging Face account to save your fine-tuned model and track experiments on the Hub. You can find your access token on your account settings page.
from huggingface_hub import notebook_login
notebook_login()
We load the HuggingFaceH4/Multilingual-Thinking dataset from the Hugging Face Hub. This dataset focuses on multilingual reasoning, where the chain of thought has been translated into several languages such as French, Spanish, and German. By fine-tuning a reasoning-capable model on this dataset, it learns to generate reasoning steps in multiple languages.
For efficiency, we'll load only the training split:
from datasets import load_dataset
dataset_name = "HuggingFaceH4/Multilingual-Thinking"
train_dataset = load_dataset(dataset_name, split="train")
This dataset contains several columns. We'll only need messages, since it holds the full conversation and is the column the SFT trainer consumes.
train_dataset
Let's look at a full example to understand the internal structure:
train_dataset[0]
Each message in the messages column can carry an extra thinking field, which we merge into the message content using <think>...</think> tags. We also remove the extra columns that are not needed for training:
def merge_thinking_and_remove_key(example):
    new_messages = []
    for msg in example["messages"]:
        content = msg["content"]
        thinking = msg.get("thinking")
        if thinking and isinstance(thinking, str) and thinking.strip():
            content = f"<think>\n{thinking}\n</think>\n{content}"
        new_messages.append({"role": msg["role"], "content": content})
    example["messages"] = new_messages
    return example
train_dataset = train_dataset.remove_columns(["reasoning_language", "developer", "user", "analysis", "final"])
train_dataset = train_dataset.map(merge_thinking_and_remove_key)
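A quick check that the merge worked: the assistant turn (typically the last message in each conversation) should now embed its reasoning between <think> and </think> before the final answer.
print(train_dataset[0]["messages"][-1]["content"][:500])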
Load an NVIDIA Nemotron 3 model. These models are natively supported in transformers so no trust_remote_code is needed. We use attn_implementation="eager" as required by the hybrid Mamba2/Transformer architecture.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# Select a Nemotron 3 checkpoint below
model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
# model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"
output_dir = "nemotron-3-sft"
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="eager",
    dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
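Optionally, check the parameter count and memory footprint of the loaded model (parameters and buffers only; activations and optimizer state come on top during training):
print(f"Parameters: {model.num_parameters() / 1e9:.2f}B")
print(f"Memory footprint: {model.get_memory_footprint() / 1024**3:.2f} GB")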
Configure the LoRA adapter. Instead of modifying the original weights, we fine-tune a lightweight LoRA adapter for efficient and memory-friendly training.
from peft import LoraConfig
peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
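If you want to confirm these projection names exist in the hybrid architecture (attention layers only appear in some blocks), a quick scan of the module names helps. This is purely an optional sanity check; it assumes the checkpoint's attention and MLP projections use the module names listed above.
target_names = set(peft_config.target_modules)
matches = [name for name, _ in model.named_modules() if name.split(".")[-1] in target_names]
print(f"{len(matches)} modules will receive LoRA adapters, e.g. {matches[:3]}")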
Configure SFT using SFTConfig. Note that gradient_checkpointing is set to False because Nemotron 3 (NemotronHForCausalLM) does not support it. For full details on all available parameters, check the TRL SFTConfig documentation.
from trl import SFTConfig
training_args = SFTConfig(
    # Training schedule / optimization
    per_device_train_batch_size=1,  # Batch size per GPU
    gradient_accumulation_steps=4,  # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    num_train_epochs=1,  # Number of full dataset passes
    learning_rate=2e-4,  # Learning rate for the optimizer
    optim="paged_adamw_8bit",  # Memory-efficient 8-bit optimizer
    # Logging / reporting
    logging_steps=10,  # Log training metrics every N steps
    report_to="trackio",  # Experiment tracking tool
    trackio_space_id=output_dir,  # HF Space where the experiment tracking will be saved
    output_dir=output_dir,  # Where to save model checkpoints and logs
    max_length=128,  # Kept short due to VRAM constraints; increase for better results
    gradient_checkpointing=False,  # NemotronH does not support gradient checkpointing
    # Hub integration
    push_to_hub=True,  # Push the trained model to the Hugging Face Hub
)
Configure the SFT Trainer. We pass the previously configured training_args and the peft_config for LoRA.
from trl import SFTTrainer
trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)
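Because we passed peft_config, SFTTrainer wraps the model with the LoRA adapter internally, so trainer.model should be a PEFT-wrapped model. The following reports how small the trainable portion is compared to the full model:
trainer.model.print_trainable_parameters()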
Show memory stats before training:
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)
print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")
And train!
trainer_stats = trainer.train()
Show memory stats after training:
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)
print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")
Save the fine-tuned model both locally and to the Hugging Face Hub.
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)
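For reference, here is a minimal sketch of how you would reload the adapter in a fresh session. The repo id below is a placeholder; by default the Hub repo is created under your account with the same name as output_dir.
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_model = AutoModelForCausalLM.from_pretrained(
    "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16",
    attn_implementation="eager",
    dtype=torch.bfloat16,
)
reloaded_model = PeftModel.from_pretrained(base_model, "your-username/nemotron-3-sft")  # placeholder repo id
reloaded_tokenizer = AutoTokenizer.from_pretrained("nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16")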
Let's run the fine-tuned model using standard transformers generation (model.generate).
model = trainer.model
model.eval()
messages = [{"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
from transformers import TextStreamer
_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=128,
    do_sample=True,  # required for the sampling parameters below to take effect
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)
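Optionally, merge the LoRA weights into the base model so it can be served without PEFT. merge_and_unload() is the PEFT helper on the adapter-wrapped model; the result is a plain transformers model you can save or push directly.
merged_model = trainer.model.merge_and_unload()
merged_model.save_pretrained(f"{output_dir}-merged")
tokenizer.save_pretrained(f"{output_dir}-merged")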