docs/en/guides/data-collection-and-annotation.md
Data collection and annotation are the two foundational steps of every computer vision project: you gather representative images or video, then label them so a model can learn from them. The quality of this data directly determines model performance, which is why class definition, unbiased sourcing, and consistent annotation matter before any training begins.
<p align="center"> <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/iBk6S-PHwS0" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen> </iframe><strong>Watch:</strong> How to Build Effective Data Collection and Annotation Strategies for Computer Vision 🚀
</p>This guide covers setting up classes and collecting data, what data annotation is along with the annotation types and formats to choose from, and efficient labeling strategies — every decision aligned with your project's goals.
Collecting images and video for a computer vision project comes down to three decisions: how many classes to define, where to source the data, and how to keep the dataset free of bias.
One of the first questions when starting a computer vision project is how many classes to include. You need to determine the class membership, which involves the different categories or labels that you want your model to recognize and differentiate. The number of classes should be determined by the specific goals of your project.
For example, if you want to monitor traffic, your classes might include "car," "truck," "bus," "motorcycle," and "bicycle." On the other hand, for tracking items in a store, your classes could be "fruits," "vegetables," "beverages," and "snacks." Defining classes based on your project goals helps keep your dataset relevant and focused.
When you define your classes, another important distinction to make is whether to choose coarse or fine class counts. 'Count' refers to the number of distinct classes you are interested in. This decision influences the granularity of your data and the complexity of your model. Here are the considerations for each approach:
Starting with more specific classes can be very helpful, especially in complex projects where details are important. More specific classes let you collect more detailed data, gain deeper insights, and establish clearer distinctions between categories. Not only does it improve the accuracy of the model, but it also makes it easier to adjust the model later if needed, saving both time and resources.
You can use public datasets or gather your own custom data. Public datasets like those on Kaggle and Google Dataset Search Engine offer well-annotated, standardized data, making them great starting points for training and validating models.
Custom data collection, on the other hand, allows you to customize your dataset to your specific needs. You might capture images and videos with cameras or drones, scrape the web for images, or use existing internal data from your organization. Custom data gives you more control over its quality and relevance. Combining both public and custom data sources helps create a diverse and comprehensive dataset.
Bias occurs when certain groups or scenarios are underrepresented or overrepresented in your dataset. It leads to a model that performs well on some data but poorly on others. It's crucial to avoid bias in AI so that your computer vision model can perform well in a variety of scenarios.
Here is how you can avoid bias while collecting data:
Following these practices helps create a more robust and fair model that can generalize well in real-world applications.
Data annotation is the process of labeling data to make it usable for training machine learning models. In computer vision, this means labeling images or videos with the information that a model needs to learn from. Without properly annotated data, models cannot accurately learn the relationships between inputs and outputs.
Depending on the specific requirements of a computer vision task, there are different types of data annotation. Here are some examples:
After selecting a type of annotation, it's important to choose the appropriate format for storing and sharing annotations. The most common formats are:
| Format | File structure | Commonly used for |
|---|---|---|
| COCO | Single JSON file | Object detection, instance segmentation, keypoint detection, stuff and panoptic segmentation, image captioning |
| Pascal VOC | One XML file per image | Object detection |
| YOLO | One .txt file per image | Object detection, segmentation, and pose |
The YOLO format stores one row per object with class indices starting from 0. For object detection the row is class x_center y_center width height with normalized 0–1 coordinates, while segmentation appends normalized polygon points and pose appends keypoint coordinates plus optional visibility values after the box.
With a type of annotation and format chosen, the next step is to establish clear and objective labeling rules. These rules act as a roadmap for consistency and accuracy throughout the annotation process. Key aspects of these rules include:
Regularly reviewing and updating your labeling rules will help keep your annotations accurate, consistent, and aligned with your project goals.
A good annotation tool lets you label every type your task needs, enforces consistent guidelines, and exports labels in a training-ready format. Ultralytics Platform provides a built-in annotation editor covering detection, instance segmentation, pose, OBB, and classification, with SAM-powered smart annotation that turns a single click into a mask for detection, segmentation, and OBB tasks. Because every annotation is saved in YOLO format, your labeled dataset moves straight into training with no conversion step.
Before annotating at scale, it helps to understand accuracy, precision, outliers, and quality control, so you don't label your data in a counterproductive way.
It's important to understand the difference between accuracy and precision and how it relates to annotation. Accuracy refers to how close the annotated data is to the true values. It helps us measure how closely the labels reflect real-world scenarios. Precision indicates the consistency of annotations. It checks if you are giving the same label to the same object or feature throughout the dataset. High accuracy and precision lead to better-trained models by reducing noise and improving the model's ability to generalize from the training data.
<p align="center"> </p>Outliers are data points that deviate quite a bit from other observations in the dataset. With respect to annotations, an outlier could be an incorrectly labeled image or an annotation that doesn't fit with the rest of the dataset. Outliers are concerning because they can distort the model's learning process, leading to inaccurate predictions and poor generalization.
You can use various methods to detect and correct outliers:
Just like other technical projects, quality control is a must for annotated data. It is a good practice to regularly check annotations to make sure they are accurate and consistent. This can be done in a few different ways:
If you are working with multiple people, consistency between different annotators is important. Good inter-annotator agreement means that the guidelines are clear and everyone is following them the same way. It keeps everyone on the same page and the annotations consistent.
While reviewing, if you find errors, correct them and update the guidelines to avoid future mistakes. Provide feedback to annotators and offer regular training to help reduce errors. Having a strong process for handling errors keeps your dataset accurate and reliable.
To make the process of data labeling smoother and more effective, consider implementing these strategies:
These strategies can help maintain high-quality annotations while reducing the time and resources required for the labeling process.
Bouncing your ideas and queries off other computer vision enthusiasts can help accelerate your projects. Here are some great ways to learn, troubleshoot, and network:
Collecting diverse, unbiased data and annotating it consistently with the right tools is the foundation of a reliable computer vision model. With your dataset collected and labeled, continue to the steps of a computer vision project guide to move into training and evaluation.
To minimize bias, collect data from diverse sources, ensure balanced representation across all relevant groups (such as different ages, genders, and ethnicities), regularly review and update your dataset to catch emerging biases, and apply mitigation techniques like oversampling underrepresented classes, data augmentation, and fairness-aware algorithms. Avoiding bias this way keeps your computer vision model performing well across varied real-world scenarios and improves its generalization capability.
Establish clear, objective labeling guidelines with detailed instructions, examples, and illustrations, then apply them uniformly across all data types so every annotation follows the same rules. Train annotators to stay neutral to reduce personal bias, review and update the guidelines regularly, and use automated consistency checks plus inter-annotator feedback to keep accuracy high and aligned with your project goals.
A few hundred annotated objects per class is enough to start experimenting with transfer learning, but for reliable real-world performance Ultralytics recommends at least 1,500 images and 10,000 labeled instances per class. Pair a sufficiently large dataset with a reasonable training schedule — around 300 epochs is a common starting point, reduced if the model overfits early — and keep your annotations rigorous and aligned with your project's specific goals. Explore detailed training strategies in the YOLO26 training guide.
Yes. Ultralytics Platform includes a built-in annotation editor that supports bounding boxes, polygons, keypoints, oriented boxes, and classification labels in a single workspace. SAM-powered smart annotation speeds up labeling for detection, segmentation, and OBB tasks by generating masks from a single click, and every annotation is stored in YOLO format, ready for training.
The most common data annotation types in computer vision are bounding boxes, polygons, masks, and keypoints, each suited to a different task:
Selecting the appropriate annotation type depends on your project's requirements. Learn more about how to implement these annotations and their formats in our data annotation guide.