docs/en/guides/finetuning-guide.md
Fine-tuning adapts a pretrained model to recognize new classes by starting from learned weights rather than random initialization. Instead of training from scratch for hundreds of epochs, fine-tuning leverages pretrained COCO features and converges on custom data in a fraction of the time.
This guide covers fine-tuning YOLO26 on custom datasets, from basic usage to advanced techniques like layer freezing and two-stage training.
A pretrained model has already learned general visual features - edge detection, texture recognition, shape understanding - from millions of images. Transfer learning through fine-tuning reuses that knowledge and only teaches the model what the new classes look like, which is why it converges faster and requires less data. Training from scratch discards all of that and forces the model to learn everything from pixel-level patterns up, which demands significantly more resources.
| Aspect | Fine-Tuning | Training from Scratch |
|---|---|---|
| Starting weights | Pretrained on COCO (80 classes) | Random initialization |
| Command | `YOLO("yolo26n.pt")` | `YOLO("yolo26n.yaml")` |
| Convergence | Faster - backbone is already trained | Slower - all layers learn from zero |
| Data requirements | Lower - pretrained features compensate for less data | Higher - model must learn all features from the dataset alone |
| When to use | Custom classes with natural images | Domains fundamentally different from COCO (medical, satellite, radar) |
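In code, the only difference is which file is passed to `YOLO()`:

```python
from ultralytics import YOLO

finetune = YOLO("yolo26n.pt")  # pretrained COCO weights: fine-tuning
scratch = YOLO("yolo26n.yaml")  # architecture definition only: training from scratch
```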
!!! tip "Fine-tuning requires no extra code"
When a `.pt` file is loaded with `YOLO("yolo26n.pt")`, the pretrained weights are stored in the model. Calling `.train(data="custom.yaml")` after that automatically transfers all compatible weights to the new model architecture, reinitializes any layers that don't match (such as the detection head when the number of classes differs), and begins training. No manual weight loading, layer manipulation, or custom transfer learning code is required.
When a pretrained model is fine-tuned on a dataset with a different number of classes (for example, COCO's 80 classes to 5 custom classes), Ultralytics performs shape-aware weight transfer:
- Backbone and neck layers have shapes that do not depend on the class count, so they transfer fully.
- Classification branches in the detection head (`cv3`, `one2one_cv3`) have shapes tied to the class count (80 vs 5), so they cannot transfer and are randomly initialized.
- Box regression layers (`cv2`, `one2one_cv2`) in the head have fixed shapes regardless of class count, so they transfer normally.

For datasets with the same number of classes as the pretrained model (for example, fine-tuning COCO-pretrained weights on another 80-class dataset), 100% of weights transfer, including the detection head.
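The selection rule can be pictured as a name-and-shape comparison between the pretrained checkpoint and the new model's state dict. The sketch below is illustrative only, not the library's actual implementation:

```python
def transfer_matching_weights(pretrained_sd: dict, new_sd: dict) -> dict:
    """Keep a pretrained tensor wherever name and shape match; otherwise keep the new init."""
    merged = {}
    for name, tensor in new_sd.items():
        if name in pretrained_sd and pretrained_sd[name].shape == tensor.shape:
            merged[name] = pretrained_sd[name]  # compatible layer (backbone, neck, box regression)
        else:
            merged[name] = tensor  # class-dependent layer (e.g. cv3 with 5 vs 80 classes) stays random
    return merged
```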
!!! example
=== "Python"
```python
from ultralytics import YOLO
model = YOLO("yolo26n.pt") # load pretrained model
model.train(data="path/to/data.yaml", epochs=50, imgsz=640)
```
=== "CLI"
```bash
yolo detect train model=yolo26n.pt data=path/to/data.yaml epochs=50 imgsz=640
```
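After training finishes, the best checkpoint can be validated to confirm the fine-tuned metrics. The path below assumes the default `runs/detect/train` output directory:

```python
from ultralytics import YOLO

# Load the best checkpoint produced by the run above and validate it
best = YOLO("runs/detect/train/weights/best.pt")
metrics = best.val(data="path/to/data.yaml")
print(metrics.box.map)  # mAP50-95 on the validation split
```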
Larger models have more capacity but also more parameters to update, which can increase the risk of overfitting when training data is limited. Starting with a smaller model (YOLO26n or YOLO26s) and scaling up only if validation metrics plateau is a practical approach. The optimal model size depends on the complexity of the task, the number of classes, the diversity of the dataset, and the hardware available for deployment. See the full YOLO26 model page for available sizes and performance benchmarks.
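If the smaller model plateaus, the same training call works with a larger checkpoint; only the weights file changes (file names assume the usual n/s/m/l/x scale suffixes):

```python
from ultralytics import YOLO

model = YOLO("yolo26s.pt")  # one size up from the nano model
model.train(data="path/to/data.yaml", epochs=50, imgsz=640)
```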
The default `optimizer=auto` setting selects the optimizer and learning rate based on the total number of training iterations.
For most fine-tuning tasks, the default setting works well without any manual tuning. Consider setting the optimizer explicitly when:
- Training is unstable or the loss oscillates: switch to `optimizer=AdamW, lr0=0.001` for more stable convergence.
- The pretrained backbone should change as little as possible: a low explicit `lr0=0.001` helps preserve pretrained features.

!!! warning "Auto optimizer overrides manual lr0"
When `optimizer=auto`, the `lr0` and `momentum` values are ignored. To control the learning rate manually, set the optimizer explicitly: `optimizer=SGD, lr0=0.005`.
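For example, an explicit AdamW run with a conservative fine-tuning learning rate (the values shown are illustrative starting points, not tuned recommendations):

```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")
# Explicit optimizer so lr0 is respected instead of being chosen automatically
model.train(data="custom.yaml", epochs=50, optimizer="AdamW", lr0=0.001)
```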
Freezing prevents specific layers from updating during training. This speeds up training and reduces overfitting when the dataset is small relative to the model capacity.
The `freeze` parameter accepts either an integer or a list. An integer `freeze=10` freezes the first 10 layers (0 through 9, which corresponds to the backbone in YOLO26). A list can contain layer indices like `freeze=[0, 3, 5]` for partial backbone freezing, or module name strings like `freeze=["23.cv2"]` for fine-grained control over specific branches within a layer.
!!! example
=== "Freeze backbone"
```python
model.train(data="custom.yaml", epochs=50, freeze=10)
```
=== "Freeze specific layers"
```python
model.train(data="custom.yaml", epochs=50, freeze=[0, 1, 2, 3, 4])
```
=== "Freeze by module name"
```python
# Freeze the box regression branch of the detection head
model.train(data="custom.yaml", epochs=50, freeze=["23.cv2"])
```
The right freeze depth depends on how similar the target domain is to the pretrained data and how much training data is available:
| Scenario | Recommendation | Rationale |
|---|---|---|
| Large dataset, similar domain | `freeze=None` (default) | Enough data to adapt all layers without overfitting |
| Small dataset, similar domain | `freeze=10` | Preserves backbone features, reduces trainable parameters |
| Very small dataset | `freeze=23` | Only the detection head trains, minimizing overfitting risk |
| Domain far from COCO | `freeze=None` | Backbone features may not transfer well and need retraining |
Freeze depth can also be treated as a hyperparameter - trying a few values (0, 5, 10) and comparing validation mAP is a practical way to find the best setting for a specific dataset.
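One way to run such a comparison, sketched here with illustrative epoch counts and run names:

```python
from ultralytics import YOLO

# Compare a few freeze depths and keep the one with the best validation mAP
for depth in (0, 5, 10):
    model = YOLO("yolo26n.pt")
    model.train(data="custom.yaml", epochs=30, freeze=depth, name=f"freeze_{depth}")
    metrics = model.val(data="custom.yaml")
    print(f"freeze={depth}: mAP50-95={metrics.box.map:.3f}")
```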
Fine-tuning generally requires fewer hyperparameter adjustments than training from scratch. The parameters that matter most are:
- `epochs`: Fine-tuning converges faster than training from scratch. Start with a moderate value and use `patience` to stop early when validation metrics plateau.
- `patience`: The default of 100 is designed for long training runs. Reducing this to 10-20 avoids wasting time on runs that have already converged.
- `warmup_epochs`: The default warmup (3 epochs) gradually increases the learning rate from zero, which prevents large gradient updates from damaging pretrained features in early iterations. Keeping the default is recommended even for fine-tuning.

For the full list of training parameters, see the training configuration reference.
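A typical fine-tuning call combining these parameters might look like this (values are illustrative starting points):

```python
from ultralytics import YOLO

model = YOLO("yolo26n.pt")
model.train(
    data="custom.yaml",
    epochs=100,  # upper bound; early stopping usually ends the run sooner
    patience=15,  # stop if validation metrics stop improving for 15 epochs
    warmup_epochs=3,  # keep the default warmup to protect pretrained features
    imgsz=640,
)
```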
Two-stage fine-tuning splits training into two phases. The first stage freezes the backbone and trains only the neck and head, allowing the detection layers to adapt to the new classes without disrupting pretrained features. The second stage unfreezes all layers and trains the full model with a lower learning rate to refine the backbone for the target domain.
This approach is particularly useful when the target domain differs significantly from COCO (medical images, aerial imagery, microscopy), where the backbone may need adaptation but training everything at once causes instability. For automatic unfreezing with a callback-based approach, see Freezing and Unfreezing the Backbone.
!!! example "Two-stage fine-tuning"
```python
from ultralytics import YOLO
# Stage 1: freeze backbone, train head and neck
model = YOLO("yolo26n.pt")
model.train(data="custom.yaml", epochs=20, freeze=10, name="stage1", exist_ok=True)
# Stage 2: unfreeze all, fine-tune with lower lr
model = YOLO("runs/detect/stage1/weights/best.pt")
model.train(data="custom.yaml", epochs=30, lr0=0.001, name="stage2", exist_ok=True)
```
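The same idea can also be expressed as a single run that unfreezes mid-training with a callback. The sketch below is an illustration under assumptions: it uses the standard `add_callback` hook, flips `requires_grad` back on at an arbitrary epoch, and the threshold value is not a tuned recommendation.

```python
from ultralytics import YOLO

UNFREEZE_AT = 10  # illustrative: epoch at which the backbone is unfrozen


def unfreeze_all(trainer):
    """Re-enable gradients for every parameter once the head has adapted."""
    if trainer.epoch == UNFREEZE_AT:
        for p in trainer.model.parameters():
            p.requires_grad = True
        print(f"Epoch {trainer.epoch}: all layers unfrozen")


model = YOLO("yolo26n.pt")
model.add_callback("on_train_epoch_start", unfreeze_all)
model.train(data="custom.yaml", epochs=30, freeze=10)
```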
Common pitfalls when fine-tuning on custom data:

- Incorrect paths in `data.yaml` silently produce zero labels. Run `yolo detect val model=yolo26n.pt data=your_data.yaml` before training to confirm labels load correctly.
- If a trained model appears to detect nothing, the confidence threshold may simply be too high; try `conf=0.1` during inference (a short example follows below).
- Verify that `nc` in `data.yaml` matches the actual number of classes in the label files.
- For imbalanced classes, use `cls_pw` to apply inverse frequency class weighting (start with `cls_pw=0.25` for moderate imbalance, increase to 1.0 for severe imbalance).
- If aggressive augmentation hurts convergence on a small dataset, reduce it with `mosaic=0.5` or `mosaic=0.0`.
- For small objects, train at a higher resolution such as `imgsz=1280` to preserve detail.

Performance can also degrade on the original COCO classes after fine-tuning. This is known as catastrophic forgetting - the model loses previously learned knowledge when fine-tuned exclusively on new data. Forgetting is mostly unavoidable without including original dataset images alongside new data. To mitigate this:

- Include examples of the original classes in the training data alongside the new classes.
- Freeze more layers (`freeze=10` or higher) so the pretrained weights change less.
- Use a lower learning rate to preserve pretrained knowledge.
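As referenced in the pitfall list above, a quick way to check whether detections are merely below the default confidence threshold (the weights path assumes the default run directory):

```python
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")
results = model.predict("path/to/image.jpg", conf=0.1)  # surface low-confidence detections
print(len(results[0].boxes), "boxes found at conf=0.1")
```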
There is no fixed minimum - results depend on the complexity of the task, the number of classes, and how similar the domain is to COCO. More diverse images (varied lighting, angles, backgrounds) matter more than raw quantity. Start with what you have and scale up if validation metrics are insufficient.
Load a pretrained `.pt` file and call `.train()` with the path to a custom `data.yaml`. Ultralytics automatically handles weight transfer, detection head reinitialization, and optimizer selection. See the Basic Fine-Tuning section for the complete code example.
The most common causes are incorrect paths in `data.yaml` (which silently produce zero labels), a mismatch between `nc` in the YAML and the actual label files, or a confidence threshold that is too high. See Common Pitfalls for a full troubleshooting checklist.
It depends on the dataset size and domain similarity. For small datasets with a domain similar to COCO, freezing the backbone (`freeze=10`) prevents overfitting. For domains very different from COCO, leaving all layers unfrozen (`freeze=None`) allows the backbone to adapt. See Freezing Layers for detailed recommendations.
Include examples of the original classes in the training data alongside the new classes. If that is not possible, freezing more layers (`freeze=10` or higher) and using a lower learning rate helps preserve the pretrained knowledge. See Performance degrades on original classes for more details.