YOLO26 Training Recipe

Introduction

This guide documents the exact training recipe used to produce the official YOLO26 pretrained checkpoints on COCO. Every hyperparameter shown here is already embedded in the released .pt weights and can be inspected programmatically.

Knowing what went into the official checkpoints — not just the architecture, but the learning rate schedules, augmentation pipelines, and loss weights that shaped their performance — helps you make better decisions when fine-tuning: which data augmentations to keep, which loss function weights to adjust, and what optimizer settings work best for your dataset size.

Training Overview

All YOLO26 base models were trained on COCO at 640x640 resolution using the MuSGD optimizer with batch size 128. Rather than starting from random weights in a single run, models were initialized from intermediate pretrained weights and refined with hyperparameters found via evolutionary search. Full training logs and metrics for every model size are available on Ultralytics Platform.

Key design choices across all sizes:

End-to-end training (end2end=True) with NMS-free one-to-one head
MuSGD optimizer combining SGD with Muon-style orthogonalized updates for weight matrices (parameters with ndim >= 2, such as conv and linear weights)
Heavy mosaic augmentation (~0.9-1.0 probability) disabled in the last 10 epochs (close_mosaic=10)
Aggressive scale augmentation (0.56-0.95) to handle objects at different sizes
Minimal rotation/shear for most sizes, keeping geometric distortion low
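
The `ndim >= 2` rule from the MuSGD bullet can be illustrated with a minimal sketch. The parameter names and shapes below are hypothetical, and `split_by_ndim` only demonstrates the stated selection rule; it is not the Ultralytics implementation.

```python
# Hypothetical parameter shapes, for illustration only.
param_shapes = {
    "conv1.weight": (16, 3, 3, 3),   # conv kernel, ndim = 4
    "bn1.weight": (16,),             # norm scale, ndim = 1
    "bn1.bias": (16,),               # bias, ndim = 1
    "head.linear.weight": (80, 64),  # linear weight, ndim = 2
}


def split_by_ndim(shapes):
    """Split parameter names into matrix-like (ndim >= 2) and all others."""
    matrix_like = [name for name, shape in shapes.items() if len(shape) >= 2]
    other = [name for name, shape in shapes.items() if len(shape) < 2]
    return matrix_like, other


muon_style, other = split_by_ndim(param_shapes)
print("orthogonalized-update candidates:", muon_style)
print("remaining parameters:", other)
```

Under this rule, conv and linear weights fall into the first group, while one-dimensional parameters such as biases and normalization scales fall into the second.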

Inspecting YOLO26 Checkpoint Training Args

Every Ultralytics checkpoint stores the full training configuration used to produce it, so you can verify each number on this page yourself:

!!! example "Inspect checkpoint training args"

=== "Ultralytics API"

    ```python
    from ultralytics import YOLO

    model = YOLO("yolo26n.pt")
    print(model.ckpt["train_args"])
    ```

=== "PyTorch"

    ```python
    import torch

    # Load any official checkpoint
    ckpt = torch.load("yolo26n.pt", map_location="cpu", weights_only=False)

    # Print all training arguments
    for k, v in sorted(ckpt["train_args"].items()):
        print(f"{k}: {v}")
    ```

The output lists the full configuration of over 100 entries, including every recipe value documented on this page. An excerpt for yolo26n.pt:

batch: 128
...
box: 5.62767
...
close_mosaic: 10
cls: 0.56099
...
dfl: 9.03871
...
epochs: 245
...
lr0: 0.0054
lrf: 0.04952
...
optimizer: MuSGD

This works for any .pt checkpoint — official releases and your own fine-tuned models alike. For the full list of configurable training arguments, see the training configuration reference.
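
Once the arguments are available as a dictionary, comparing two runs is a small step. The sketch below assumes both configurations are plain dictionaries; `diff_train_args` is a hypothetical helper, and the `finetune` values are made up for illustration.

```python
def diff_train_args(reference, candidate):
    """Return {key: (reference_value, candidate_value)} for keys whose values differ."""
    keys = sorted(set(reference) | set(candidate))
    return {
        k: (reference.get(k), candidate.get(k))
        for k in keys
        if reference.get(k) != candidate.get(k)
    }


# Values copied from the yolo26n.pt excerpt on this page.
official = {"batch": 128, "close_mosaic": 10, "epochs": 245, "lr0": 0.0054, "optimizer": "MuSGD"}

# Hypothetical fine-tuning run, for illustration only.
finetune = {"batch": 128, "close_mosaic": 10, "epochs": 100, "lr0": 0.001, "optimizer": "MuSGD"}

changes = diff_train_args(official, finetune)
for key, (ref, new) in changes.items():
    print(f"{key}: {ref} -> {new}")
```

In practice, `reference` and `candidate` would be the `train_args` dictionaries read from two checkpoints.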

YOLO26 Training Hyperparameters per Model Size

The tables below group the recipe by category — optimizer and schedule, loss weights, and augmentation. Every value comes straight from the train_args embedded in the released checkpoints.

Optimizer and Learning Rate

These optimizer and schedule settings drove COCO pretraining for each size; note how the N model stands apart from the rest:

| Setting | N | S | M | L | X |
|---|---|---|---|---|---|
| `optimizer` | MuSGD | MuSGD | MuSGD | MuSGD | MuSGD |
| `lr0` | 0.0054 | 0.00038 | 0.00038 | 0.00038 | 0.00038 |
| `lrf` | 0.0495 | 0.882 | 0.882 | 0.882 | 0.882 |
| `momentum` | 0.947 | 0.948 | 0.948 | 0.948 | 0.948 |
| `weight_decay` | 0.00064 | 0.00027 | 0.00027 | 0.00027 | 0.00027 |
| `warmup_epochs` | 0.98 | 0.99 | 0.99 | 0.99 | 0.99 |
| `epochs` | 245 | 70 | 80 | 60 | 40 |
| `batch` | 128 | 128 | 128 | 128 | 128 |
| `imgsz` | 640 | 640 | 640 | 640 | 640 |

!!! info "Learning rate strategy"

The N model used a higher initial learning rate with steep decay (`lrf=0.0495`, so the final LR is about 5% of `lr0`), while S/M/L/X models used a much lower initial LR with a gentler schedule (`lrf=0.882`, about 88% of `lr0`). This likely reflects the different convergence dynamics of smaller and larger models: smaller models appear to benefit from more aggressive updates early in training.
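
To make the two schedules concrete, the sketch below computes the final learning rate for each group, using the Ultralytics convention that the final LR equals `lr0 * lrf`. The `final_lr` helper is hypothetical, and the shape of the decay between the start and end points is not modeled here.

```python
# lr0 and lrf values from the Optimizer and Learning Rate table.
schedules = {
    "N": {"lr0": 0.0054, "lrf": 0.0495},
    "S/M/L/X": {"lr0": 0.00038, "lrf": 0.882},
}


def final_lr(lr0, lrf):
    """Final learning rate, expressed as a fraction lrf of the initial rate lr0."""
    return lr0 * lrf


for name, cfg in schedules.items():
    end = final_lr(cfg["lr0"], cfg["lrf"])
    print(f"{name}: lr0={cfg['lr0']:.5f} -> final={end:.6f}")
```

Despite the very different starting points, both groups finish in the same range: roughly 2.7e-4 for N and 3.4e-4 for the other sizes.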

Loss Weights

Loss weights balance the three components of the detection loss — bounding box IoU regression (box), classification (cls), and a box-distance regression term (dfl). Note that DFL-free YOLO26 repurposes the dfl gain to weight an L1 loss on normalized box distances rather than distribution focal loss:

| Setting | N | S | M | L | X |
|---|---|---|---|---|---|
| `box` | 5.63 | 9.83 | 9.83 | 9.83 | 9.83 |
| `cls` | 0.56 | 0.65 | 0.65 | 0.65 | 0.65 |
| `dfl` | 9.04 | 0.96 | 0.96 | 0.96 | 0.96 |

The N model prioritizes the dfl distance-regression term, while S/M/L/X models shift emphasis to IoU-based box regression. Classification loss remains relatively consistent across all sizes.
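
A quick way to see this shift is to normalize the three gains so they sum to 1. The sketch below does only that; `gain_shares` is a hypothetical helper, and the shares describe the gains alone, not each term's contribution to the actual loss, which also depends on the raw magnitude of each component.

```python
# box / cls / dfl gains from the Loss Weights table.
gains = {
    "N": {"box": 5.63, "cls": 0.56, "dfl": 9.04},
    "S/M/L/X": {"box": 9.83, "cls": 0.65, "dfl": 0.96},
}


def gain_shares(weights):
    """Normalize gains so they sum to 1, giving each term's share of the total gain."""
    total = sum(weights.values())
    return {name: value / total for name, value in weights.items()}


for size, weights in gains.items():
    shares = gain_shares(weights)
    summary = ", ".join(f"{k}={v:.1%}" for k, v in shares.items())
    print(f"{size}: {summary}")
```

For N, the `dfl` gain accounts for roughly 59% of the total; for S/M/L/X, the `box` gain accounts for roughly 86%.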

Augmentation Pipeline

For a detailed explanation of each technique, see the YOLO Data Augmentation guide.

| Setting | N | S | M | L | X |
|---|---|---|---|---|---|
| `mosaic` | 0.909 | 0.992 | 0.992 | 0.992 | 0.992 |
| `mixup` | 0.012 | 0.05 | 0.427 | 0.427 | 0.427 |
| `copy_paste` | 0.075 | 0.404 | 0.304 | 0.404 | 0.404 |
| `scale` | 0.562 | 0.9 | 0.95 | 0.95 | 0.95 |
| `fliplr` | 0.606 | 0.304 | 0.304 | 0.304 | 0.304 |
| `degrees` | 1.11 | ~0 | ~0 | ~0 | ~0 |
| `shear` | 1.46 | ~0 | ~0 | ~0 | ~0 |
| `translate` | 0.071 | 0.275 | 0.275 | 0.275 | 0.275 |
| `hsv_h` | 0.014 | 0.013 | 0.013 | 0.013 | 0.013 |
| `hsv_s` | 0.645 | 0.353 | 0.353 | 0.353 | 0.353 |
| `hsv_v` | 0.566 | 0.194 | 0.194 | 0.194 | 0.194 |
| `bgr` | 0.106 | 0.0 | 0.0 | 0.0 | 0.0 |

Values shown as ~0 are below 0.01 in the actual checkpoints (for example, degrees=0.00012 for the S model) — the augmentation is effectively disabled.

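S/M/L/X models use more aggressive mixing and geometric augmentation (higher mosaic, mixup, copy-paste, scale, and translate), since they have more capacity and benefit from stronger regularization. The N model is the only size with meaningful rotation, shear, and BGR augmentation, and it also uses stronger saturation and value jitter (hsv_s, hsv_v) and a higher horizontal-flip probability (fliplr).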
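
If you want to reuse one column of the augmentation table in your own run, the values can be collected in a dictionary and turned into overrides. The sketch below builds a `yolo train` command string for the S model; `build_train_command` is a hypothetical helper, and `degrees` and `shear` are left out because the table only reports them as ~0.

```python
# Augmentation values for the S model, copied from the table above.
s_augment = {
    "mosaic": 0.992,
    "mixup": 0.05,
    "copy_paste": 0.404,
    "scale": 0.9,
    "fliplr": 0.304,
    "translate": 0.275,
    "hsv_h": 0.013,
    "hsv_s": 0.353,
    "hsv_v": 0.194,
    "bgr": 0.0,
}


def build_train_command(model, data, overrides):
    """Format a `yolo train` command line with key=value overrides."""
    args = " ".join(f"{key}={value}" for key, value in overrides.items())
    return f"yolo train model={model} data={data} {args}"


command = build_train_command("yolo26s.pt", "your-dataset.yaml", s_augment)
print(command)
```

The same dictionary can be unpacked into the Python API instead, for example `model.train(data="your-dataset.yaml", **s_augment)`.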

Internal Training Parameters

??? note "Advanced: internal pipeline parameters"

The checkpoints also contain parameters that were used in the internal training pipeline but are **not** exposed as user-configurable settings in `default.yaml`:

| Setting | Description | N | S | M | L | X |
|---|---|---|---|---|---|---|
| `muon_w` | Muon update weight in MuSGD | 0.528 | 0.436 | 0.436 | 0.436 | 0.436 |
| `sgd_w` | SGD update weight in MuSGD | 0.674 | 0.479 | 0.479 | 0.479 | 0.479 |
| `cls_w` | Internal classification weight | 2.74 | 3.48 | 3.48 | 3.48 | 3.48 |
| `o2m` | One-to-many head loss weight | 1.0 | 0.705 | 0.705 | 0.705 | 0.705 |
| `topk` | Top-k label assignment | 8 | 5 | 5 | 5 | 5 |

See the [FAQ entry on these parameters](#what-are-muon_w-sgd_w-cls_w-o2m-and-topk-in-the-checkpoint) for what they mean when fine-tuning.

Fine-Tuning YOLO26 on Your Own Dataset

When fine-tuning YOLO26 on your own dataset, you don't need to replicate the full pretraining recipe. The pretrained weights already encode the augmentation and optimization knowledge from COCO training. For more general training best practices, see Tips for Model Training.

Fine-Tune with Default Settings

!!! example "Fine-tune with defaults"

=== "Python"

    ```python
    from ultralytics import YOLO

    model = YOLO("yolo26n.pt")
    results = model.train(data="your-dataset.yaml", epochs=100, imgsz=640)
    ```

=== "CLI"

    ```bash
    yolo train model=yolo26n.pt data=your-dataset.yaml epochs=100 imgsz=640
    ```

Fine-tuning with defaults is a strong baseline. Only adjust hyperparameters if you have a specific reason to.

When to Adjust YOLO26 Hyperparameters

Small datasets (< 1,000 images):

Reduce augmentation strength: mosaic=0.5, mixup=0.0, copy_paste=0.0
Lower learning rate: lr0=0.001
Use fewer epochs with patience: epochs=50, patience=20
Consider freezing backbone layers: freeze=10

Large datasets (> 50,000 images):

Match the pretraining recipe more closely
Consider optimizer=MuSGD for longer runs
Increase augmentation: mosaic=1.0, mixup=0.3, scale=0.9

Domain-specific imagery (aerial, medical, underwater):

Set flipud=0.5 if vertical orientation varies
Increase degrees if objects appear at arbitrary rotations
Adjust hsv_s and hsv_v if lighting conditions differ significantly from COCO
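
The dataset-size guidance above can be captured as a small lookup, sketched below. `suggest_overrides` is a hypothetical helper, not part of Ultralytics; it returns the starting points listed in this section and leaves domain-specific adjustments to you.

```python
def suggest_overrides(num_images):
    """Map dataset size to the starting-point overrides listed in this section."""
    if num_images < 1_000:
        # Small dataset: weaker augmentation, lower LR, early stopping, frozen backbone.
        return {
            "mosaic": 0.5,
            "mixup": 0.0,
            "copy_paste": 0.0,
            "lr0": 0.001,
            "epochs": 50,
            "patience": 20,
            "freeze": 10,
        }
    if num_images > 50_000:
        # Large dataset: move closer to the pretraining recipe.
        return {"optimizer": "MuSGD", "mosaic": 1.0, "mixup": 0.3, "scale": 0.9}
    # In between: keep the defaults.
    return {}


print(suggest_overrides(800))
print(suggest_overrides(120_000))
```

The returned dictionary can be unpacked into a training call such as `model.train(data="your-dataset.yaml", **suggest_overrides(800))`. Treat the values as starting points and confirm each change on your validation set.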

For automated hyperparameter optimization, see the Hyperparameter Tuning guide.

Choose a Model Size

| Model | Best For | Batch Size Guidance |
|---|---|---|
| YOLO26n | Edge devices, mobile, real-time on CPU | Large batches (64-128) on consumer GPUs |
| YOLO26s | Balanced speed and accuracy | Medium batches (32-64) |
| YOLO26m | Higher accuracy with moderate compute | Smaller batches (16-32) |
| YOLO26l | High accuracy when GPU is available | Small batches (8-16) or multi-GPU |
| YOLO26x | Maximum accuracy, server deployment | Small batches (4-8) or multi-GPU |

For export and deployment options, see the Export guide and Model Deployment Options.

Conclusion

The YOLO26 checkpoints ship with their full training recipe embedded, so the exact hyperparameters behind every model size are always one train_args lookup away. Start fine-tuning from the defaults, adjust deliberately using the tables on this page, and verify every change against your own validation set. If questions come up along the way, ask the community on the Ultralytics GitHub repository or the Ultralytics Discord server.

FAQ

How do I see the exact hyperparameters used for any checkpoint?

Load the checkpoint with torch.load() and access the train_args key, or use model.ckpt["train_args"] with the Ultralytics API. See Inspecting YOLO26 Checkpoint Training Args for complete examples.

Why are the epoch counts different for each model size?

Larger models generally needed fewer epochs on COCO because their greater capacity speeds up convergence — the X model trained for 40 epochs versus 245 for N — though the counts are not strictly monotonic (S used 70, M used 80). When fine-tuning on your own dataset, the optimal number of epochs depends on your dataset size and complexity, not the model size. Use early stopping (patience) to find the right stopping point automatically.

Should I use MuSGD for fine-tuning?

Usually you don't need to choose: with the default optimizer=auto, Ultralytics automatically selects MuSGD for longer training runs (>10,000 iterations) and AdamW for shorter ones. You can explicitly set optimizer=MuSGD if you prefer. For more on how MuSGD works, see the training documentation.
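
To get a rough idea of which side of the 10,000-iteration threshold a run falls on, you can estimate the iteration count yourself. The sketch below assumes iterations are approximately `ceil(images / batch) * epochs`; that formula and both helper names are assumptions for illustration, and the library's internal count may differ.

```python
import math


def estimate_iterations(num_images, batch, epochs):
    """Approximate optimizer steps: batches per epoch times number of epochs."""
    return math.ceil(num_images / batch) * epochs


def likely_auto_optimizer(num_images, batch, epochs, threshold=10_000):
    """Guess which optimizer optimizer=auto would pick under the assumed formula."""
    iterations = estimate_iterations(num_images, batch, epochs)
    return "MuSGD" if iterations > threshold else "AdamW"


print(estimate_iterations(2_000, 16, 100), likely_auto_optimizer(2_000, 16, 100))
print(estimate_iterations(500, 16, 100), likely_auto_optimizer(500, 16, 100))
```

If a run lands close to the threshold, setting `optimizer` explicitly removes the ambiguity.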

What are `muon_w`, `sgd_w`, `cls_w`, `o2m`, and `topk` in the checkpoint?

These are internal parameters from the training pipeline that produced the base checkpoints, recorded in train_args for reproducibility. They are not user-configurable settings in default.yaml, and passing them to model.train() raises an invalid-argument error — the public package does not read them. You do not need to set them when fine-tuning; see Internal Training Parameters for their values per model size.
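
If you copy arguments out of a checkpoint to reuse them, these five keys need to be dropped first. A minimal sketch, assuming the arguments are held in a plain dictionary; `strip_internal_args` is a hypothetical helper, and other recorded entries (such as paths) may also need review before reuse.

```python
# Internal keys named in this FAQ entry.
INTERNAL_KEYS = {"muon_w", "sgd_w", "cls_w", "o2m", "topk"}


def strip_internal_args(train_args):
    """Return a copy of train_args without the internal pipeline parameters."""
    return {k: v for k, v in train_args.items() if k not in INTERNAL_KEYS}


# Subset of the N model's recorded values, taken from the tables on this page.
recorded = {
    "lr0": 0.0054,
    "optimizer": "MuSGD",
    "muon_w": 0.528,
    "sgd_w": 0.674,
    "cls_w": 2.74,
    "o2m": 1.0,
    "topk": 8,
}

public_args = strip_internal_args(recorded)
print(public_args)
```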

Can I replicate the exact pretraining from scratch?

Not exactly — the checkpoints were produced using an internal training branch with additional features not in the public codebase (like configurable o2m weights and cls_w). You can get very close results using the hyperparameters documented on this page with the public Ultralytics package, but an exact reproduction requires the internal branch.
