
Fine-Tune NVIDIA Nemotron 3 with SFT and LoRA using TRL

examples/notebooks/sft_nemotron_3.ipynb

Fine-tune NVIDIA Nemotron 3 models using LoRA and SFT with the TRL library.

The Nemotron 3 family includes hybrid Mamba2/Transformer models in different sizes (Nano, Super, etc.), natively supported in transformers (no trust_remote_code needed). See the full collection for all available checkpoints.

Note: This notebook requires a GPU with sufficient VRAM (e.g., A100 80GB). It is not designed for free Colab instances.
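To get a feel for why so much VRAM is needed, here is a back-of-the-envelope estimate of the weight memory alone (a sketch: it assumes roughly 30B parameters stored in bf16 at 2 bytes each, and ignores activations, optimizer state, and KV/Mamba caches, which add substantially on top):

```python
# Rough VRAM estimate for the model weights alone
# (assumption: ~30B parameters in bf16; excludes activations,
# optimizer state, and caches, which add substantially on top)
n_params = 30e9          # approximate parameter count
bytes_per_param = 2      # bf16
weights_gb = n_params * bytes_per_param / 1024**3
print(f"{weights_gb:.1f} GB")  # ~55.9 GB for weights alone
```

This is why LoRA (which keeps the base weights frozen) and an 8-bit optimizer are used later in this notebook.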

Install dependencies

We install TRL with the PEFT and quantization extras, along with trackio for experiment tracking. Nemotron 3 models use Mamba2 layers, which require the mamba_ssm and causal_conv1d CUDA kernels.

python
!pip install -Uq "trl[peft,quantization]" trackio

# Nemotron 3 requires transformers>=5.3.0, which may not be installed by default with TRL
!pip install -Uq "transformers>=5.3.0"

# Mamba2 CUDA kernels (--no-build-isolation needed so they find your PyTorch/CUDA installation)
!pip install --no-build-isolation mamba_ssm==2.2.5  # installation takes around 30 mins
!pip install --no-build-isolation causal_conv1d==1.5.2
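After installing, it can be worth checking that the `transformers>=5.3.0` requirement is actually met. A minimal version-comparison helper (a sketch: it assumes plain numeric version strings with no pre-release tags):

```python
# Minimal check for the transformers>=5.3.0 requirement
# (sketch: assumes plain numeric versions, no pre-release suffixes)
def meets_min(version: str, minimum: str = "5.3.0") -> bool:
    parse = lambda v: tuple(int(part) for part in v.split("."))
    return parse(version) >= parse(minimum)

print(meets_min("5.3.0"))   # True
print(meets_min("4.57.1"))  # False
```

In practice you would pass `transformers.__version__` to `meets_min`.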

Log in to Hugging Face

Log in to your Hugging Face account to save your fine-tuned model and track experiments on the Hub. You can find your access token on your account settings page.

python
from huggingface_hub import notebook_login

notebook_login()

Load Dataset

We load the HuggingFaceH4/Multilingual-Thinking dataset from the Hugging Face Hub. This dataset focuses on multilingual reasoning, where the chain of thought has been translated into several languages such as French, Spanish, and German. By fine-tuning a reasoning-capable model on this dataset, it learns to generate reasoning steps in multiple languages.

For efficiency, we'll load only the training split:

python
from datasets import load_dataset

dataset_name = "HuggingFaceH4/Multilingual-Thinking"
train_dataset = load_dataset(dataset_name, split="train")

This dataset contains several columns. We'll only need the messages column, since it holds the conversation and is the one the SFT trainer uses.

python
train_dataset

Let's look at a full example to understand the internal structure:

python
train_dataset[0]

The messages column contains an extra thinking field that we need to merge into the message content using <think>...</think> tags. We also remove the extra columns that are not needed for training:

python
def merge_thinking_and_remove_key(example):
    new_messages = []
    for msg in example["messages"]:
        content = msg["content"]
        thinking = msg.get("thinking")
        if thinking and isinstance(thinking, str) and thinking.strip():
            content = f"<think>\n{thinking}\n</think>\n{content}"
        new_messages.append({"role": msg["role"], "content": content})
    example["messages"] = new_messages
    return example

train_dataset = train_dataset.remove_columns(["reasoning_language", "developer", "user", "analysis", "final"])
train_dataset = train_dataset.map(merge_thinking_and_remove_key)
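To confirm the transformation does what we expect, the same merge logic can be exercised on a toy conversation (a standalone copy of the function above, with a made-up example):

```python
# Standalone sanity check of the thinking-merge logic on a toy example
def merge_thinking_and_remove_key(example):
    new_messages = []
    for msg in example["messages"]:
        content = msg["content"]
        thinking = msg.get("thinking")
        if thinking and isinstance(thinking, str) and thinking.strip():
            content = f"<think>\n{thinking}\n</think>\n{content}"
        new_messages.append({"role": msg["role"], "content": content})
    example["messages"] = new_messages
    return example

toy = {"messages": [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4", "thinking": "Simple addition."},
]}
merged = merge_thinking_and_remove_key(toy)
print(merged["messages"][1]["content"])
# <think>
# Simple addition.
# </think>
# 4
```

Note that messages without a thinking field (like the user turn above) pass through unchanged.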

Load model and configure LoRA

Load an NVIDIA Nemotron 3 model. These models are natively supported in transformers so no trust_remote_code is needed. We use attn_implementation="eager" as required by the hybrid Mamba2/Transformer architecture.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Select a Nemotron 3 checkpoint below
model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
# model_id = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16"

output_dir = "nemotron-3-sft"

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="eager",
    dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

Configure the LoRA adapter. Instead of modifying the original weights, we fine-tune a lightweight LoRA adapter for efficient and memory-friendly training.

python
from peft import LoraConfig

peft_config = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=[
        "q_proj",
        "k_proj",
        "v_proj",
        "o_proj",
        "gate_proj",
        "up_proj",
        "down_proj",
    ],
)
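To see why this is so memory-friendly, the added parameter count can be computed directly: each adapted weight W of shape (d_out, d_in) gains two small matrices A (r × d_in) and B (d_out × r), while W itself stays frozen. The hidden size below is a hypothetical value for illustration, not Nemotron 3's actual dimension:

```python
# LoRA adds r*d_in + d_out*r trainable parameters per adapted matrix,
# versus d_out*d_in for the full weight (dims here are hypothetical)
def lora_added_params(d_in: int, d_out: int, r: int = 8) -> int:
    return r * d_in + d_out * r

d = 4096  # hypothetical hidden size
full = d * d
added = lora_added_params(d, d, r=8)
print(added, f"{added / full:.2%}")  # 65536 trainable params, ~0.39% of the full matrix
```

With r=8, each adapted square projection trains well under 1% of the parameters the full weight would require.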

Train model

Configure SFT using SFTConfig. Note that gradient_checkpointing is set to False because Nemotron 3 (NemotronHForCausalLM) does not support it. For full details on all available parameters, check the TRL SFTConfig documentation.

python
from trl import SFTConfig

training_args = SFTConfig(
    # Training schedule / optimization
    per_device_train_batch_size=1,                          # Batch size per GPU
    gradient_accumulation_steps=4,                          # Effective batch size = per_device_train_batch_size * gradient_accumulation_steps
    num_train_epochs=1,                                     # Number of full dataset passes
    learning_rate=2e-4,                                     # Learning rate for the optimizer
    optim="paged_adamw_8bit",                               # Memory-efficient 8-bit optimizer

    # Logging / reporting
    logging_steps=10,                                       # Log training metrics every N steps
    report_to="trackio",                                    # Experiment tracking tool
    trackio_space_id=output_dir,                            # HF Space where the experiment tracking will be saved
    output_dir=output_dir,                                  # Where to save model checkpoints and logs

    max_length=128,                                         # Kept short due to VRAM constraints; increase for better results
    gradient_checkpointing=False,                           # NemotronH does not support gradient checkpointing

    # Hub integration
    push_to_hub=True,                                       # Push the trained model to the Hugging Face Hub
)
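The schedule above implies a small effective batch size and a fixed number of optimizer steps per epoch; a quick arithmetic check (the dataset size below is a placeholder — substitute len(train_dataset) in practice):

```python
# Effective batch size and steps per epoch under the settings above
# (n_examples is a placeholder; use len(train_dataset) in practice)
per_device_train_batch_size = 1
gradient_accumulation_steps = 4
effective_batch = per_device_train_batch_size * gradient_accumulation_steps

n_examples = 1000  # placeholder dataset size
steps_per_epoch = -(-n_examples // effective_batch)  # ceiling division
print(effective_batch, steps_per_epoch)  # 4 250
```

Increasing gradient_accumulation_steps raises the effective batch size without increasing per-step VRAM, at the cost of fewer optimizer updates per epoch.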

Configure the SFT Trainer. We pass the previously configured training_args and the peft_config for LoRA.

python
from trl import SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    peft_config=peft_config,
)

Show memory stats before training:

python
gpu_stats = torch.cuda.get_device_properties(0)
start_gpu_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
max_memory = round(gpu_stats.total_memory / 1024 / 1024 / 1024, 3)

print(f"GPU = {gpu_stats.name}. Max memory = {max_memory} GB.")
print(f"{start_gpu_memory} GB of memory reserved.")

And train!

python
trainer_stats = trainer.train()

Show memory stats after training:

python
used_memory = round(torch.cuda.max_memory_reserved() / 1024 / 1024 / 1024, 3)
used_memory_for_lora = round(used_memory - start_gpu_memory, 3)
used_percentage = round(used_memory / max_memory * 100, 3)
lora_percentage = round(used_memory_for_lora / max_memory * 100, 3)

print(f"{trainer_stats.metrics['train_runtime']} seconds used for training.")
print(f"{round(trainer_stats.metrics['train_runtime']/60, 2)} minutes used for training.")
print(f"Peak reserved memory = {used_memory} GB.")
print(f"Peak reserved memory for training = {used_memory_for_lora} GB.")
print(f"Peak reserved memory % of max memory = {used_percentage} %.")
print(f"Peak reserved memory for training % of max memory = {lora_percentage} %.")

Save fine-tuned model

Save the fine-tuned model both locally and to the Hugging Face Hub.

python
trainer.save_model(output_dir)
trainer.push_to_hub(dataset_name=dataset_name)

Inference

Let's run the fine-tuned model using standard transformers generation (model.generate).

python
model = trainer.model
model.eval()

python
messages = [{"role": "user", "content": "Continue the sequence: 1, 1, 2, 3, 5, 8,"}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)

from transformers import TextStreamer

_ = model.generate(
    **tokenizer(text, return_tensors="pt").to("cuda"),
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.8,
    top_k=20,
    streamer=TextStreamer(tokenizer, skip_prompt=True),
)