Data preprocessing turns raw, annotated images into the clean and consistent inputs a computer vision model needs to train well. With Ultralytics YOLO26, the core pixel operations — RGB conversion, scaling to [0, 1], and resizing — run automatically inside the training pipeline, so the work that remains is splitting your dataset correctly, balancing classes, and choosing augmentations. This guide covers those essential techniques: resizing, normalization, dataset splitting, data augmentation, and exploratory data analysis (EDA).
This step comes after you've defined your project's goals and collected and annotated your data, and it sits early in the computer vision project workflow.
Preprocessing gets your data into a format that reduces computational load and improves model performance. It addresses three common issues in raw data: noise, inconsistency, and class imbalance.
The main techniques are resizing, normalization, dataset splitting, and augmentation. With YOLO26 the first two are automatic, while splitting and augmentation are where your choices matter most.
Many models require a consistent input size, so resizing makes images uniform and reduces computational complexity. Two common interpolation methods are:
Libraries like OpenCV and PIL (Pillow) provide these functions, but with YOLO26 you usually don't resize manually. The imgsz argument during model training handles it: when set to a value such as 640, YOLO scales each image so its largest dimension is 640 pixels while preserving the aspect ratio, then pads the shorter side (default gray, value 114) to reach a square 640 × 640 input.
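The resize-then-pad behavior described above can be sketched roughly as follows, assuming NumPy is available. The `letterbox` helper is hypothetical and is not the Ultralytics implementation: it uses nearest-neighbor sampling only to stay dependency-free, and it centers the padding, which is an assumption.

```python
import numpy as np


def letterbox(image: np.ndarray, imgsz: int = 640, pad_value: int = 114) -> np.ndarray:
    """Scale an H x W x C image so its longest side equals imgsz, then pad to a square."""
    h, w = image.shape[:2]
    scale = imgsz / max(h, w)
    new_h, new_w = round(h * scale), round(w * scale)

    # Nearest-neighbor resize via index lookup (illustrative, not production quality)
    rows = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    cols = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    resized = image[rows][:, cols]

    # Fill a square canvas with the gray pad value, then paste the resized image
    canvas = np.full((imgsz, imgsz, image.shape[2]), pad_value, dtype=image.dtype)
    top, left = (imgsz - new_h) // 2, (imgsz - new_w) // 2  # centering is an assumption
    canvas[top : top + new_h, left : left + new_w] = resized
    return canvas


frame = np.zeros((480, 1280, 3), dtype=np.uint8)  # a wide, all-black test image
boxed = letterbox(frame)
print(boxed.shape)
```

Here the 480 × 1280 image is scaled by 0.5 to 240 × 640, and the remaining rows are filled with the gray value 114. In practice `imgsz` does this for you, so a helper like this is mainly useful for understanding what the model sees.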
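Normalization scales pixel values to a standard range, which helps the model converge faster during training. Two common techniques are min-max scaling, which rescales pixel values to the range 0 to 1, and z-score normalization, which rescales them based on the mean and standard deviation.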
YOLO26 handles normalization automatically as part of its preprocessing pipeline: it converts images to RGB and scales pixel values to the range [0, 1] by dividing by 255 (min-max scaling). YOLO does not apply ImageNet-style mean/standard-deviation (z-score) normalization by default, so no manual normalization step is required.
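To make the difference between the two techniques concrete, here is a small NumPy sketch. It is only an illustration of the arithmetic on a toy array, not code taken from the YOLO pipeline, and the variable names are made up for this example.

```python
import numpy as np

# Three example 8-bit pixel values: darkest, mid-gray, brightest
pixels = np.array([[0, 128, 255]], dtype=np.uint8)

# Min-max scaling for 8-bit images: divide by 255 to land in [0, 1]
minmax = pixels.astype(np.float32) / 255.0

# Z-score normalization: subtract the mean and divide by the standard deviation
zscore = (minmax - minmax.mean()) / minmax.std()

print(minmax)
print(zscore)
```

Min-max scaling keeps values in a fixed range regardless of the image content, while z-score output depends on the statistics of the data it is computed over.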
Splitting the data into training, validation, and test sets lets you evaluate the model on unseen data and measure its generalization. A common split is 70% for training, 20% for validation, and 10% for testing. Tools like scikit-learn or TensorFlow make this straightforward.
Keep these points in mind when splitting: maintain the class distribution across all three splits, and prevent data leakage between them.
!!! warning "Avoid data leakage"

    Split the dataset **before** applying any augmentation or other preprocessing, and apply those transforms only to the training set. Augmenting before the split lets information from the validation or test images influence training, producing misleadingly high scores that collapse on real-world data.
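As a minimal sketch of a 70/20/10 split that preserves class proportions, the following uses only the Python standard library. The `stratified_split` helper is hypothetical, and it assumes each image has a single dominant class label, which is a simplification for detection datasets where one image can contain several classes.

```python
import random
from collections import defaultdict


def stratified_split(samples, ratios=(0.7, 0.2, 0.1), seed=0):
    """Split (path, class_name) pairs into train/val/test, class by class."""
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)

    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    splits = {"train": [], "val": [], "test": []}
    for cls, paths in sorted(by_class.items()):
        rng.shuffle(paths)
        n_train = round(len(paths) * ratios[0])
        n_val = round(len(paths) * ratios[1])
        splits["train"] += paths[:n_train]
        splits["val"] += paths[n_train : n_train + n_val]
        splits["test"] += paths[n_train + n_val :]  # remainder goes to test
    return splits


# Hypothetical file names: 100 car images and 20 bus images
samples = [(f"car_{i}.jpg", "car") for i in range(100)]
samples += [(f"bus_{i}.jpg", "bus") for i in range(20)]
splits = stratified_split(samples)
print({name: len(paths) for name, paths in splits.items()})
```

Because the split is done per class, the minority class (buses here) keeps the same 70/20/10 proportions as the majority class. Any augmentation would then be applied to the training portion only.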
Data augmentation artificially increases the size of a dataset by creating modified versions of existing images. It helps reduce overfitting and improves generalization, with several benefits:
With YOLO26, augmentation is controlled through training arguments passed to model.train() or the equivalent CLI flags — not by editing the dataset YAML, which defines dataset metadata such as paths, class names, and splits. The built-in augmentations include:
- mosaic, mixup, cutmix: Combine multiple images into one training sample.
- fliplr, flipud: Mirror images horizontally or vertically.
- degrees, translate, scale, shear, perspective: Rotate, shift, zoom, and warp images.
- hsv_h, hsv_s, hsv_v: Vary hue, saturation, and brightness.
- copy_paste: Paste objects between images for segmentation.

!!! example "Set augmentation strength when training"

    === "Python"

        ```python
        from ultralytics import YOLO

        model = YOLO("yolo26n.pt")
        # Augmentation is configured with training arguments, not the dataset YAML
        model.train(data="coco8.yaml", epochs=10, hsv_h=0.015, fliplr=0.5, mosaic=1.0, degrees=10.0)
        ```

    === "CLI"

        ```bash
        yolo detect train model=yolo26n.pt data=coco8.yaml epochs=10 hsv_h=0.015 fliplr=0.5 mosaic=1.0 degrees=10.0
        ```
For the full list of augmentation arguments and their default values, see the augmentation settings reference and the dedicated YOLO data augmentation guide. If the albumentations package is installed, YOLO also enables its built-in Albumentations-based augmentations automatically.
Consider a project to detect and classify vehicles in traffic images with YOLO26, starting from images annotated with bounding boxes and labels. Here is what each preprocessing decision looks like:
- Resizing: handled by imgsz during training.
- Normalization: pixel values are scaled to [0, 1] automatically.
- Augmentation: fliplr for direction invariance, hsv_v for day/night lighting, and mosaic for varied object density.

With these decisions made, the dataset is ready for Exploratory Data Analysis (EDA).
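One way to capture these decisions in code is to collect them as training arguments. This is a sketch under assumptions: `vehicles.yaml` is a hypothetical dataset file, the numeric values are illustrative rather than tuned, and `train_vehicle_detector` is a made-up wrapper around the same `model.train()` call shown earlier in this guide.

```python
# Training arguments for the traffic example (values are illustrative assumptions)
train_args = {
    "data": "vehicles.yaml",  # hypothetical dataset YAML with paths and class names
    "imgsz": 640,  # resizing is handled by this argument
    "fliplr": 0.5,  # horizontal flips for direction invariance
    "hsv_v": 0.4,  # brightness variation for day/night lighting
    "mosaic": 1.0,  # mosaic for varied object density
}


def train_vehicle_detector(args: dict = train_args):
    """Start training; requires the ultralytics package and the dataset to exist."""
    from ultralytics import YOLO  # imported lazily so the config can be inspected without it

    model = YOLO("yolo26n.pt")
    return model.train(**args)
```

Keeping the arguments in one dictionary makes it easy to log or compare augmentation settings between runs.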
EDA uses statistics and visualizations to reveal patterns and distributions in your data, helping you catch issues like class imbalance or outliers before training.
Statistical EDA starts with basic metrics — mean, median, standard deviation, and range — computed over properties such as pixel-intensity distributions. These give a quick overview of your dataset's quality and surface irregularities early.
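The basic metrics mentioned above can be computed with a few NumPy calls. This is a minimal sketch on tiny synthetic arrays; `intensity_stats` is a hypothetical helper, and in a real project the images would be loaded from disk instead.

```python
import numpy as np


def intensity_stats(images):
    """Return mean, median, standard deviation, and range over all pixel values."""
    pixels = np.concatenate([img.ravel() for img in images]).astype(np.float64)
    return {
        "mean": float(pixels.mean()),
        "median": float(np.median(pixels)),
        "std": float(pixels.std()),
        "range": (float(pixels.min()), float(pixels.max())),
    }


# Two tiny synthetic grayscale "images": one dark, one slightly brighter
images = [np.full((2, 2), 10, dtype=np.uint8), np.full((2, 2), 30, dtype=np.uint8)]
stats = intensity_stats(images)
print(stats)
```

A very low mean or a narrow range across a whole dataset can hint at underexposed or low-contrast images worth inspecting before training.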
Visualizations reveal patterns that summary statistics miss, such as class imbalance and outliers. Common tools include:
For a no-code approach to EDA, upload your dataset to Ultralytics Platform. The dataset's Charts tab automatically generates key EDA visualizations: split distribution, top class counts, image width/height histograms, and 2D heatmaps of annotation positions and image dimensions. The Images tab lets you browse your data in grid, compact, or table views with annotation overlays, making it easy to spot mislabeled examples or unbalanced classes without writing any code.
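If you prefer to check class balance in code, a rough sketch is to tally class IDs from YOLO-format label files, where each line starts with the class ID followed by the normalized box coordinates. The `count_classes` helper is hypothetical, and the label contents below are passed as in-memory strings to keep the example self-contained; with real data you would read each `.txt` label file from your labels directory.

```python
from collections import Counter


def count_classes(label_files: dict[str, str]) -> Counter:
    """Count annotated objects per class ID across YOLO-format label texts."""
    counts = Counter()
    for text in label_files.values():
        for line in text.splitlines():
            parts = line.split()
            if parts:  # skip blank lines
                counts[int(parts[0])] += 1
    return counts


# Hypothetical label files: "class x_center y_center width height" per line
labels = {
    "img_001.txt": "0 0.50 0.50 0.20 0.30\n0 0.25 0.40 0.10 0.10\n",
    "img_002.txt": "1 0.60 0.60 0.30 0.30\n0 0.10 0.10 0.05 0.05\n",
}
counts = count_classes(labels)
print(counts.most_common())
```

A strongly skewed count, such as one class with many times more boxes than another, is a signal to collect more data for the rare class or to adjust the split so every class appears in each set.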
Properly split, normalized, and augmented data reduces noise and improves generalization, turning a raw collection of images into a dependable training set. With your dataset preprocessed, the next step is to train your model. If questions come up along the way, ask the community on the Ultralytics GitHub repository or the Ultralytics Discord server.
Preprocessing ensures your data is clean, consistent, and in a format optimized for training. By addressing noise, inconsistency, and class imbalance in raw data, steps like resizing, normalization, augmentation, and dataset splitting reduce computational load and improve model performance. See the steps of a computer vision project for how it fits into the broader workflow.
Configure augmentation through training arguments, not the dataset YAML. Pass arguments such as fliplr, mosaic, hsv_h, and degrees to model.train() (or the equivalent CLI flags) to set the probability and strength of each transform. These are defined in the augmentation settings and explained in the YOLO data augmentation guide.
The two most common techniques are min-max scaling (rescaling pixels to a range of 0 to 1) and z-score normalization (rescaling based on mean and standard deviation). YOLO26 applies min-max scaling automatically — converting images to RGB and dividing pixel values by 255 — so you don't need a manual normalization step. It does not apply z-score normalization by default.
A common practice is 70% for training, 20% for validation, and 10% for testing. Maintain the class distribution across all three splits, and avoid data leakage by applying augmentation only to the training set after the split. Tools like scikit-learn or TensorFlow handle the split efficiently. See the data collection and annotation guide for upstream dataset preparation.
Yes. The imgsz argument resizes images during training and inference so their largest dimension matches the specified size (e.g., 640 pixels) while preserving the aspect ratio, then pads the shorter side. You don't need to resize images yourself — see the model training documentation for details.