Hugging Face Jobs lets you run training scripts on fully managed infrastructure—no need to manage GPUs or local environment setup.
In this guide, you'll learn how to run TRL training scripts on Jobs, from simple uv scripts to TRL example scripts and custom Docker images.
For general details about Hugging Face Jobs (hardware selection, job monitoring, etc.), see the Jobs documentation.
Before you start, make sure you're logged in to the Hugging Face Hub (`hf auth login`).

TRL Jobs is a high-level wrapper around Hugging Face Jobs and TRL that streamlines training. It provides optimized default configurations so you can start quickly without manually tuning parameters.
Example:

```bash
pip install trl-jobs
trl-jobs sft --model_name Qwen/Qwen3-0.6B --dataset_name trl-lib/Capybara
```
TRL Jobs supports everything covered in this guide, with additional optimizations to simplify workflows.
For more control, you can run Hugging Face Jobs directly with your own scripts, using uv scripts.
Create a Python script (e.g., `train.py`) containing your training code:

```python
from datasets import load_dataset
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
)
trainer.train()

# Push the trained model to the Hub; the Jobs environment is ephemeral
trainer.push_to_hub("Qwen2.5-0.5B-SFT")
```
Launch the job using either the `hf jobs` CLI or the Python API:

<hfoptions id="script_type">
<hfoption id="bash">

```bash
hf jobs uv run \
    --flavor a100-large \
    --with trl \
    --secrets HF_TOKEN \
    train.py
```

</hfoption>
<hfoption id="python">

```python
from huggingface_hub import run_uv_job

run_uv_job(
    "train.py",
    dependencies=["trl"],
    flavor="a100-large",
    secrets={"HF_TOKEN": "hf_..."},
)
```

</hfoption>
</hfoptions>
To run successfully, the script needs:

- **Dependencies**: specify them with the `--with trl` flag (CLI) or the `dependencies` argument (Python API). uv installs these dependencies automatically before running the script.
- **A Hugging Face token**: pass it with the `--secrets HF_TOKEN` flag or the `secrets` argument so the script can push the model to the Hub.

> [!WARNING]
> When training with Jobs, be sure to:
>
> - Set a sufficient timeout. Jobs time out after 30 minutes by default. If your job exceeds the timeout, it will fail and all progress will be lost. See Setting a custom timeout.
> - Push the model to the Hub. The Jobs environment is ephemeral: files are deleted when the job ends. If you don't push the model, it will be lost.
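As a sketch, a longer timeout can be requested at launch. This assumes the `--timeout` option of `hf jobs uv run` accepts duration strings such as `2h`; confirm the exact flag and accepted formats against the Jobs documentation:

```bash
# Hypothetical sketch: allow up to 2 hours instead of the 30-minute default
hf jobs uv run \
    --flavor a100-large \
    --with trl \
    --secrets HF_TOKEN \
    --timeout 2h \
    train.py
```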
You can also run a script directly from a URL:
<hfoptions id="script_type">
<hfoption id="bash">

```bash
hf jobs uv run \
    --flavor a100-large \
    --with trl \
    --secrets HF_TOKEN \
    "https://gist.githubusercontent.com/qgallouedec/eb6a7d20bd7d56f9c440c3c8c56d2307/raw/69fd78a179e19af115e4a54a1cdedd2a6c237f2f/train.py"
```

</hfoption>
<hfoption id="python">

```python
from huggingface_hub import run_uv_job

run_uv_job(
    "https://gist.githubusercontent.com/qgallouedec/eb6a7d20bd7d56f9c440c3c8c56d2307/raw/69fd78a179e19af115e4a54a1cdedd2a6c237f2f/train.py",
    flavor="a100-large",
    dependencies=["trl"],
    secrets={"HF_TOKEN": "hf_..."},
)
```

</hfoption>
</hfoptions>
To make a script self-contained, declare dependencies at the top:
```python
# /// script
# dependencies = [
#     "trl",
#     "peft",
# ]
# ///
from datasets import load_dataset
from peft import LoraConfig
from trl import SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-0.5B",
    train_dataset=dataset,
    peft_config=LoraConfig(),
)
trainer.train()
trainer.push_to_hub("Qwen2.5-0.5B-SFT")
```
You can then run the script without specifying dependencies:
<hfoptions id="script_type">
<hfoption id="bash">

```bash
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    train.py
```

</hfoption>
<hfoption id="python">

```python
from huggingface_hub import run_uv_job

run_uv_job(
    "train.py",
    flavor="a100-large",
    secrets={"HF_TOKEN": "hf_..."},
)
```

</hfoption>
</hfoptions>
TRL example scripts are fully uv-compatible, so you can run a complete training workflow directly on Jobs. You can customize training with standard script arguments plus hardware and secrets:
<hfoptions id="script_type">
<hfoption id="bash">

```bash
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/prm.py \
    --model_name_or_path Qwen/Qwen2-0.5B-Instruct \
    --dataset_name trl-lib/prm800k \
    --output_dir Qwen2-0.5B-Reward \
    --push_to_hub
```

</hfoption>
<hfoption id="python">

```python
from huggingface_hub import run_uv_job

run_uv_job(
    "https://raw.githubusercontent.com/huggingface/trl/refs/heads/main/examples/scripts/prm.py",
    flavor="a100-large",
    secrets={"HF_TOKEN": "hf_..."},
    script_args=[
        "--model_name_or_path", "Qwen/Qwen2-0.5B-Instruct",
        "--dataset_name", "trl-lib/prm800k",
        "--output_dir", "Qwen2-0.5B-Reward",
        "--push_to_hub",
    ],
)
```

</hfoption>
</hfoptions>
An up-to-date Docker image with all TRL dependencies is available at `huggingface/trl` and can be used directly with Hugging Face Jobs:

<hfoptions id="script_type">
<hfoption id="bash">

```bash
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    --image huggingface/trl \
    train.py
```

</hfoption>
<hfoption id="python">

```python
from huggingface_hub import run_uv_job

run_uv_job(
    "train.py",
    flavor="a100-large",
    secrets={"HF_TOKEN": "hf_..."},
    image="huggingface/trl",
)
```

</hfoption>
</hfoptions>
Jobs run on Docker images from Hugging Face Spaces or Docker Hub, so you can also specify any custom image:

<hfoptions id="script_type">
<hfoption id="bash">

```bash
hf jobs uv run \
    --flavor a100-large \
    --secrets HF_TOKEN \
    --image <docker-image> \
    train.py
```

</hfoption>
<hfoption id="python">

```python
from huggingface_hub import run_uv_job

run_uv_job(
    "train.py",
    flavor="a100-large",
    secrets={"HF_TOKEN": "hf_..."},
    image="<docker-image>",
)
```

</hfoption>
</hfoptions>