!!! success "Now Available in Ultralytics"
SAM 3 is fully integrated into the Ultralytics package as of **version 8.3.237** ([PR #22897](https://github.com/ultralytics/ultralytics/pull/22897)). Install or upgrade with `pip install -U ultralytics` to access all SAM 3 features including text-based concept segmentation, image exemplar prompts, and video tracking.
SAM 3 (Segment Anything Model 3) is Meta's foundation model for Promptable Concept Segmentation (PCS). Building upon SAM 2, SAM 3 introduces a fundamentally new capability: detecting, segmenting, and tracking all instances of a visual concept specified by text prompts, image exemplars, or both. Unlike previous SAM versions, which segment a single object per prompt, SAM 3 can find and segment every occurrence of a concept appearing anywhere in images or videos, aligning with open-vocabulary goals in modern instance segmentation.
<p align="center">
  <br>
  <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/BE2Nu3edyOo" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen></iframe>
  <br>
  <strong>Watch:</strong> How to Use Meta Segment Anything 3 with Ultralytics | Text-Prompt Segmentation on Images & Videos
</p>

SAM 3 is now fully integrated into the ultralytics package, providing native support for concept segmentation with text prompts, image exemplar prompts, and video tracking capabilities.
SAM 3 achieves a 2× performance gain over existing systems in Promptable Concept Segmentation while maintaining and improving SAM 2's capabilities for interactive visual segmentation. The model excels at open-vocabulary segmentation, allowing users to specify concepts using simple noun phrases (e.g., "yellow school bus", "striped cat") or by providing example images of the target object. These capabilities complement production-ready pipelines that rely on streamlined predict and track workflows.
The PCS task takes a concept prompt as input and returns segmentation masks with unique identities for all matching object instances. Concept prompts can be:

- **Text prompts**: Simple noun phrases such as "yellow school bus" or "striped cat"
- **Image exemplars**: Bounding boxes around example objects of the target concept
- **Combined prompts**: Text and exemplars together for the most precise specification
This differs from traditional visual prompts (points, boxes, masks) which segment only a single specific object instance, as popularized by the original SAM family.
| Metric | SAM 3 Achievement |
|---|---|
| LVIS Zero-Shot Mask AP | 47.0 (vs previous best 38.5, +22% improvement) |
| SA-Co Benchmark | 2× better than existing systems |
| Inference Speed (H200 GPU) | 30 ms per image with 100+ detected objects |
| Video Performance | Near real-time for ~5 concurrent objects |
| MOSEv2 VOS Benchmark | 60.1 J&F (+25.5% over SAM 2.1, +17% over prior SOTA) |
| Interactive Refinement | +18.6 CGF1 improvement after 3 exemplar prompts |
| Human Performance Gap | Achieves 88% of estimated lower bound on SA-Co/Gold |
For context on model metrics and trade-offs in production, see model evaluation insights and YOLO performance metrics.
SAM 3 consists of a detector and tracker that share a Perception Encoder (PE) vision backbone. This decoupled design avoids task conflicts while enabling both image-level detection and video-level tracking, with an interface compatible with Ultralytics Python usage and CLI usage.
- **Detector**: DETR-based architecture for image-level concept detection
- **Tracker**: Memory-based video segmentation inherited from SAM 2
- **Presence Token**: A learned global token that predicts whether the target concept is present in the image/frame, improving detection by separating recognition from localization
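The presence token's effect can be illustrated with a minimal sketch (illustrative only, not Meta's implementation): the final score of each detection query is the product of a global recognition probability and a per-query localization probability.

```python
import torch


def score_detections(presence_logit: torch.Tensor, query_logits: torch.Tensor) -> torch.Tensor:
    """Gate per-query localization scores by a global concept-presence score.

    Separating "is the concept in this image?" (recognition) from
    "where are the instances?" (localization) lets each head specialize.
    """
    presence_prob = presence_logit.sigmoid()  # global: is the concept present at all?
    query_probs = query_logits.sigmoid()  # per-query: instance localization confidence
    return presence_prob * query_probs


# When the presence token is confident (logit 2.0), well-localized queries score high;
# if the concept is judged absent (logit -2.0), all query scores are suppressed.
print(score_detections(torch.tensor(2.0), torch.tensor([1.5, -0.5])))
print(score_detections(torch.tensor(-2.0), torch.tensor([1.5, -0.5])))
```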
SAM 3 is trained on Segment Anything with Concepts (SA-Co), Meta's largest and most diverse segmentation dataset to date, expanding beyond common benchmarks like COCO and LVIS.
| Dataset Component | Description | Scale |
|---|---|---|
| SA-Co/HQ | High-quality human-annotated image data from 4-phase data engine | 5.2M images, 4M unique noun phrases |
| SA-Co/SYN | Synthetic dataset labeled by AI without human involvement | 38M noun phrases, 1.4B masks |
| SA-Co/EXT | 15 external datasets enriched with hard negatives | Varies by source |
| SA-Co/VIDEO | Video annotations with temporal tracking | 52.5K videos, 24.8K unique noun phrases |
The SA-Co evaluation benchmark contains 214K unique phrases across 126K images and videos, providing over 50× more concepts than existing benchmarks.
SAM 3's scalable human- and model-in-the-loop data engine achieves 2× annotation throughput.
SAM 3 is available in Ultralytics version 8.3.237 and later. Install or upgrade with:

```bash
pip install -U ultralytics
```
!!! warning "SAM 3 Model Weights Required"
Unlike other Ultralytics models, SAM 3 weights (`sam3.pt`) are **not automatically downloaded**. You must first request access for the model weights on the [SAM 3 model page on Hugging Face](https://huggingface.co/facebook/sam3) and then, once approved, download the [`sam3.pt` file](https://huggingface.co/facebook/sam3/resolve/main/sam3.pt?download=true). Place the downloaded `sam3.pt` file in your working directory or specify the full path when loading the model.
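For example, once approved and downloaded, the weights can be loaded from the working directory or from an explicit path (the path below is a placeholder):

```python
from ultralytics import SAM

model = SAM("sam3.pt")  # resolves against the current working directory
# model = SAM("/path/to/downloads/sam3.pt")  # or pass the full path explicitly
```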
!!! warning "TypeError: 'SimpleTokenizer' object is not callable"
If you get the above error during prediction, it means you have the incorrect `clip` package installed. Install the correct `clip` package by running the following:
```bash
pip uninstall clip -y
pip install git+https://github.com/ultralytics/CLIP.git
```
SAM 3 supports both Promptable Concept Segmentation (PCS) and Promptable Visual Segmentation (PVS) tasks through different predictor interfaces:
| Task Type | Prompt Types | Output |
|---|---|---|
| Concept Segmentation (PCS) | Text (noun phrases), image exemplars | All instances matching the concept |
| Visual Segmentation (PVS) | Points, boxes, masks | Single object instance (SAM 2 style) |
| Interactive Refinement | Add/remove exemplars or clicks iteratively | Refined segmentation with improved accuracy |
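As a quick orientation before the examples below, each task maps to a predictor interface (all four are demonstrated in this section):

```python
from ultralytics import SAM
from ultralytics.models.sam import SAM3SemanticPredictor, SAM3VideoPredictor, SAM3VideoSemanticPredictor

# PCS on images (text or exemplar prompts)   -> SAM3SemanticPredictor
# PVS on images (points/boxes, SAM 2 style)  -> SAM("sam3.pt")
# PVS on videos (box prompts with tracking)  -> SAM3VideoPredictor
# PCS on videos (text prompts with tracking) -> SAM3VideoSemanticPredictor
```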
!!! example "Text-based Concept Segmentation"
Find and segment all instances of a concept using a text description. Text prompts require the `SAM3SemanticPredictor` interface.
=== "Python"
```python
from ultralytics.models.sam import SAM3SemanticPredictor
# Initialize predictor with configuration
overrides = dict(
conf=0.25,
task="segment",
mode="predict",
model="sam3.pt",
half=True, # Use FP16 for faster inference
save=True,
)
predictor = SAM3SemanticPredictor(overrides=overrides)
# Set image once for multiple queries
predictor.set_image("path/to/image.jpg")
# Query with multiple text prompts
results = predictor(text=["person", "bus", "glasses"])
# Works with descriptive phrases
results = predictor(text=["person with red cloth", "person with blue cloth"])
# Query with a single concept
results = predictor(text=["a person"])
```
!!! example "Image Exemplar-based Segmentation"
Use bounding boxes as visual prompts to find all similar instances. This also requires `SAM3SemanticPredictor` for concept-based matching.
=== "Python"
```python
from ultralytics.models.sam import SAM3SemanticPredictor
# Initialize predictor
overrides = dict(conf=0.25, task="segment", mode="predict", model="sam3.pt", half=True, save=True)
predictor = SAM3SemanticPredictor(overrides=overrides)
# Set image
predictor.set_image("path/to/image.jpg")
# Provide bounding box examples to segment similar objects
results = predictor(bboxes=[[480.0, 290.0, 590.0, 650.0]])
# Multiple bounding boxes for different concepts
results = predictor(bboxes=[[539, 599, 589, 639], [343, 267, 499, 662]])
```
!!! example "Reusing Image Features for Multiple Queries"
Extract image features once and reuse them for multiple segmentation queries to improve efficiency.
=== "Python"
```python
import cv2
from ultralytics.models.sam import SAM3SemanticPredictor
from ultralytics.utils.plotting import Annotator, colors
# Initialize predictors
overrides = dict(conf=0.50, task="segment", mode="predict", model="sam3.pt", verbose=False)
predictor = SAM3SemanticPredictor(overrides=overrides)
predictor2 = SAM3SemanticPredictor(overrides=overrides)
# Extract features from the first predictor
source = "path/to/image.jpg"
predictor.set_image(source)
src_shape = cv2.imread(source).shape[:2]
# Setup second predictor and reuse features
predictor2.setup_model()
# Perform inference using shared features with text prompt
masks, boxes = predictor2.inference_features(predictor.features, src_shape=src_shape, text=["person"])
# Perform inference using shared features with bounding box prompt
masks, boxes = predictor2.inference_features(predictor.features, src_shape=src_shape, bboxes=[[439, 437, 524, 709]])
# Visualize results
if masks is not None:
masks, boxes = masks.cpu().numpy(), boxes.cpu().numpy()
im = cv2.imread(source)
annotator = Annotator(im, pil=False)
annotator.masks(masks, [colors(x, True) for x in range(len(masks))])
cv2.imshow("result", annotator.result())
cv2.waitKey(0)
```
!!! example "Video Tracking with Visual Prompts"
Detect and track object instances across video frames using bounding box prompts.
=== "Python"
```python
from ultralytics.models.sam import SAM3VideoPredictor
# Create video predictor
overrides = dict(conf=0.25, task="segment", mode="predict", model="sam3.pt", half=True)
predictor = SAM3VideoPredictor(overrides=overrides)
# Track objects using bounding box prompts
results = predictor(source="path/to/video.mp4", bboxes=[[706.5, 442.5, 905.25, 555], [598, 635, 725, 750]], stream=True)
# Process and display results
for r in results:
r.show() # Display frame with segmentation masks
```
!!! example "Video Tracking with Semantic Queries"
Track all instances of concepts specified by text across video frames.
=== "Python"
```python
from ultralytics.models.sam import SAM3VideoSemanticPredictor
# Initialize semantic video predictor
overrides = dict(conf=0.25, task="segment", mode="predict", imgsz=640, model="sam3.pt", half=True, save=True)
predictor = SAM3VideoSemanticPredictor(overrides=overrides)
# Track concepts using text prompts
results = predictor(source="path/to/video.mp4", text=["person", "bicycle"], stream=True)
# Process results
for r in results:
r.show() # Display frame with tracked objects
# Alternative: Track with bounding box prompts
results = predictor(
source="path/to/video.mp4",
bboxes=[[864, 383, 975, 620], [705, 229, 782, 402]],
labels=[1, 1], # Positive labels
stream=True,
)
```
SAM 3 maintains full backward compatibility with SAM 2's visual prompting for single-object segmentation:
!!! example "SAM 2 Style Visual Prompts"
The basic `SAM` interface behaves exactly like SAM 2, segmenting only the specific area indicated by visual prompts (points, boxes, or masks).
=== "Python"
```python
from ultralytics import SAM
model = SAM("sam3.pt")
# Single point prompt - segments object at specific location
results = model.predict(source="path/to/image.jpg", points=[900, 370], labels=[1])
results[0].show()
# Multiple points - segments single object with multiple point hints
results = model.predict(source="path/to/image.jpg", points=[[400, 370], [900, 370]], labels=[1, 1])
# Box prompt - segments object within bounding box
results = model.predict(source="path/to/image.jpg", bboxes=[100, 150, 300, 400])
results[0].show()
```
!!! warning "Visual Prompts vs Concept Segmentation"
Using `SAM("sam3.pt")` with visual prompts (points/boxes/masks) will segment **only the specific object** at that location, just like SAM 2. To segment **all instances of a concept**, use `SAM3SemanticPredictor` with text or exemplar prompts as shown above.
SAM 3 achieves state-of-the-art results across multiple benchmarks, including real-world datasets like LVIS and COCO for segmentation:
| Benchmark | Metric | SAM 3 | Previous Best | Improvement |
|---|---|---|---|---|
| LVIS (zero-shot) | Mask AP | 47.0 | 38.5 | +22.1% |
| SA-Co/Gold | CGF1 | 65.0 | 34.3 (OWLv2) | +89.5% |
| COCO (zero-shot) | Box AP | 53.5 | 52.2 (T-Rex2) | +2.5% |
| ADE-847 (semantic seg) | mIoU | 14.7 | 9.2 (APE-D) | +59.8% |
| PC-59 (semantic seg) | mIoU | 59.4 | 58.5 (APE-D) | +1.5% |
| Cityscapes (semantic seg) | mIoU | 65.1 | 44.2 (APE-D) | +47.3% |
Explore dataset options for quick experimentation in Ultralytics datasets.
SAM 3 shows significant improvements over SAM 2 and prior state-of-the-art across video benchmarks such as DAVIS 2017 and YouTube-VOS:
| Benchmark | Metric | SAM 3 | SAM 2.1 L | Improvement |
|---|---|---|---|---|
| MOSEv2 | J&F | 60.1 | 47.9 | +25.5% |
| DAVIS 2017 | J&F | 92.0 | 90.7 | +1.4% |
| LVOSv2 | J&F | 88.2 | 79.6 | +10.8% |
| SA-V | J&F | 84.6 | 78.4 | +7.9% |
| YTVOS19 | J&F | 89.6 | 89.3 | +0.3% |
SAM 3 excels at adapting to new domains with minimal examples, relevant for data-centric AI workflows:
| Benchmark | 0-shot AP | 10-shot AP | Previous Best (10-shot) |
|---|---|---|---|
| ODinW13 | 59.9 | 71.6 | 67.9 (gDino1.5-Pro) |
| RF100-VL | 14.3 | 35.7 | 33.7 (gDino-T) |
SAM 3's concept-based prompting with exemplars converges much faster than visual prompting:
| Prompts Added | CGF1 Score | Gain vs Text-Only | Gain vs PVS Baseline |
|---|---|---|---|
| Text only | 46.4 | baseline | baseline |
| +1 exemplar | 57.6 | +11.2 | +6.7 |
| +2 exemplars | 62.2 | +15.8 | +9.7 |
| +3 exemplars | 65.0 | +18.6 | +11.2 |
| +4 exemplars | 65.7 | +19.3 | +11.5 (plateau) |
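A sketch of that refinement loop with the `SAM3SemanticPredictor` API from the examples above (the image path and exemplar boxes are placeholders):

```python
from ultralytics.models.sam import SAM3SemanticPredictor

predictor = SAM3SemanticPredictor(overrides=dict(conf=0.25, task="segment", mode="predict", model="sam3.pt"))
predictor.set_image("path/to/image.jpg")  # placeholder image

# Start from a text prompt, then add exemplar boxes one at a time; accuracy
# typically plateaus after 3-4 exemplars per the table above.
results = predictor(text=["striped cat"])
exemplars = []
for box in ([120, 80, 300, 260], [340, 90, 520, 280], [40, 300, 200, 470]):  # placeholder boxes
    exemplars.append(box)
    results = predictor(bboxes=exemplars)  # each added exemplar further refines the concept
```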
SAM 3 provides accurate counting by segmenting all instances, a common requirement in object counting:
| Benchmark | Accuracy | MAE | vs Best MLLM |
|---|---|---|---|
| CountBench | 95.6% | 0.11 | 92.4% (Gemini 2.5) |
| PixMo-Count | 87.3% | 0.22 | 88.8% (Molmo-72B) |
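As an illustration, counting falls out of segmenting all instances. A sketch assuming the returned `Results` object exposes `masks` as in other Ultralytics tasks (the image path is a placeholder):

```python
from ultralytics.models.sam import SAM3SemanticPredictor

predictor = SAM3SemanticPredictor(overrides=dict(conf=0.25, task="segment", mode="predict", model="sam3.pt"))
predictor.set_image("path/to/shelf.jpg")  # placeholder image

results = predictor(text=["bottle"])
count = len(results[0].masks) if results[0].masks is not None else 0  # one mask per instance
print(f"Detected {count} bottles")
```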
Here we compare SAM 3's capabilities with SAM 2 and YOLO26 models:
| Capability | SAM 3 | SAM 2 | YOLO26n-seg |
|---|---|---|---|
| Concept Segmentation | ✅ All instances from text/exemplars | ❌ Not supported | ❌ Not supported |
| Visual Segmentation | ✅ Single instance (SAM 2 compatible) | ✅ Single instance | ✅ All instances |
| Zero-shot Capability | ✅ Open vocabulary | ✅ Geometric prompts | ❌ Closed set |
| Interactive Refinement | ✅ Exemplars + clicks | ✅ Clicks only | ❌ Not supported |
| Video Tracking | ✅ Multi-object with identities | ✅ Multi-object | ✅ Multi-object |
| LVIS Mask AP (zero-shot) | 47.0 | N/A | N/A |
| MOSEv2 J&F | 60.1 | 47.9 | N/A |
| Speed (GPU, ms/im) | 2921 | 857 | 8.4 |
| Model Size | 3.45 GB | 162 MB (base) | 6.4 MB |
Speed benchmarked on NVIDIA RTX PRO 6000 with torch==2.9.1 and ultralytics==8.4.19.
Key Takeaways:

- **SAM 3** uniquely offers open-vocabulary concept segmentation, at the cost of much higher latency and a far larger model.
- **SAM 2** remains a lighter option for interactive, single-object visual prompting.
- **YOLO26n-seg** is orders of magnitude faster and smaller, making it the practical choice for real-time, fixed-vocabulary deployment.
Comparing SAM 3, SAM 2, SAM, MobileSAM, and FastSAM against Ultralytics YOLO segmentation models (YOLOv8, YOLO11, YOLO26) in size, parameters, and GPU inference speed:
| Model                          | Size <sup>(MB)</sup> | Parameters <sup>(M)</sup> | Speed (GPU) <sup>(ms/im)</sup> |
| ------------------------------ | -------------------- | ------------------------- | ------------------------------ |
| Meta SAM-b                     | 375                  | 93.7                      | 1306                           |
| Meta SAM2-b                    | 162                  | 80.8                      | 857                            |
| Meta SAM2-t                    | 78.1                 | 38.9                      | 668                            |
| Meta SAM3                      | 3450                 | 473.6                     | 2921                           |
| MobileSAM                      | 40.7                 | 10.1                      | 605                            |
| FastSAM-s with YOLOv8 backbone | 23.7                 | 11.8                      | 55.9                           |
| Ultralytics YOLOv8n-seg        | 6.7 (515x smaller)   | 3.4 (139.1x less)         | 17.4 (167x faster)             |
| Ultralytics YOLO11n-seg        | 5.9 (585x smaller)   | 2.9 (163.1x less)         | 12.6 (231x faster)             |
| Ultralytics YOLO26n-seg        | 6.4 (539x smaller)   | 2.7 (175.2x less)         | 8.4 (347x faster)              |
This comparison demonstrates the substantial differences in model sizes and speeds between SAM variants and YOLO segmentation models. While SAM provides unique automatic segmentation capabilities, YOLO models, particularly YOLOv8n-seg, YOLO11n-seg and YOLO26n-seg, are significantly smaller, faster, and more computationally efficient.
Tests run on an NVIDIA RTX PRO 6000 with 96GB of VRAM using torch==2.9.1 and ultralytics==8.4.19. To reproduce this test:
!!! example
=== "Python"
```python
from ultralytics import ASSETS, FastSAM, SAM, YOLO
# Profile SAM-b, SAM2-b, SAM2-t, MobileSAM, SAM3
for file in ["sam_b.pt", "sam2_b.pt", "sam2_t.pt", "mobile_sam.pt", "sam3.pt"]:
model = SAM(file)
model.info()
model(ASSETS)
# Profile FastSAM-s
model = FastSAM("FastSAM-s.pt")
model.info()
model(ASSETS)
# Profile YOLO models
for file_name in ["yolov8n-seg.pt", "yolo11n-seg.pt", "yolo26n-seg.pt"]:
model = YOLO(file_name)
model.info()
model(ASSETS)
```
SAM 3 introduces new metrics designed for the PCS task, complementing familiar measures like F1 score, precision, and recall.
The primary metric is CGF1 (classification-gated F1), which combines localization and classification:
CGF1 = 100 × pmF1 × IL_MCC
Where:

- **pmF1 (positive macro F1)**: Measures localization quality on positive examples
- **IL_MCC (image-level Matthews Correlation Coefficient)**: Measures binary recognition accuracy ("is the concept present in the image?")
Traditional AP metrics don't account for calibration, making models difficult to use in practice. By evaluating only predictions above 0.5 confidence, SAM 3's metrics enforce good calibration and mimic real-world usage patterns in interactive predict and track loops.
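A quick numeric check of the formula, using the presence-head ablation row below (pmF1 written as a fraction):

```python
def cgf1(pmf1: float, il_mcc: float) -> float:
    """Classification-gated F1: localization quality (pmF1) gated by recognition (IL_MCC)."""
    return 100 * pmf1 * il_mcc


print(cgf1(0.771, 0.82))  # ~63.2, consistent with the reported 63.3 CGF1 (source values are rounded)
```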
The presence head decouples recognition from localization, providing significant improvements:
| Configuration | CGF1 | IL_MCC | pmF1 |
|---|---|---|---|
| Without presence | 57.6 | 0.77 | 74.7 |
| With presence | 63.3 | 0.82 | 77.1 |
The presence head provides a +5.7 CGF1 boost (+9.9%), primarily improving recognition ability (IL_MCC +6.5%).
| Hard Negatives/Image | CGF1 | IL_MCC | pmF1 |
|---|---|---|---|
| 0 | 31.8 | 0.44 | 70.2 |
| 5 | 44.8 | 0.62 | 71.9 |
| 30 | 49.2 | 0.68 | 72.3 |
Hard negatives are crucial for open-vocabulary recognition, improving IL_MCC by 54.5% (0.44 → 0.68).
| Data Sources | CGF1 | IL_MCC | pmF1 |
|---|---|---|---|
| External only | 30.9 | 0.46 | 66.3 |
| External + Synthetic | 39.7 | 0.57 | 70.6 |
| External + HQ | 51.8 | 0.71 | 73.2 |
| All three | 54.3 | 0.74 | 73.5 |
High-quality human annotations provide large gains over synthetic or external data alone. For background on data quality practices, see data collection and annotation.
SAM 3's concept segmentation capability enables new use cases such as open-vocabulary annotation, instance counting, and concept-level video tracking:
SAM 3 can be combined with Multimodal Large Language Models (MLLMs) to handle complex queries requiring reasoning, similar in spirit to open-vocabulary systems like OWLv2 and T-Rex.
| Benchmark | Metric | SAM 3 Agent (Gemini 2.5 Pro) | Previous Best |
|---|---|---|---|
| ReasonSeg (validation) | gIoU | 76.0 | 65.0 (SoTA) |
| ReasonSeg (test) | gIoU | 73.8 | 61.3 (SoTA) |
| OmniLabel (validation) | AP | 46.7 | 36.5 (REAL) |
| RefCOCO+ | Acc | 91.2 | 89.3 (LISA) |
SAM 3 Agent can handle queries requiring reasoning: the MLLM proposes simple noun phrase queries to SAM 3, analyzes the returned masks, and iterates until satisfied.
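A hedged sketch of that loop; `propose_phrase` and `is_satisfactory` are hypothetical stand-ins for MLLM calls and are not part of the Ultralytics API:

```python
from ultralytics.models.sam import SAM3SemanticPredictor

predictor = SAM3SemanticPredictor(overrides=dict(conf=0.25, task="segment", mode="predict", model="sam3.pt"))
predictor.set_image("path/to/image.jpg")  # placeholder image


def propose_phrase(query: str, rejected: list) -> str:
    """Hypothetical MLLM call: distill a complex query into a simple noun phrase."""
    return "person"  # stub; a real MLLM would reason over the query and prior rejections


def is_satisfactory(query: str, results) -> bool:
    """Hypothetical MLLM call: judge whether the returned masks answer the query."""
    return True  # stub; a real MLLM would inspect the masks


query = "the person sitting closest to the door"
rejected = []
for _ in range(3):  # bounded refinement rounds
    phrase = propose_phrase(query, rejected)
    results = predictor(text=[phrase])
    if is_satisfactory(query, results):
        break
    rejected.append(phrase)  # feed failed proposals back for revision
```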
While SAM 3 represents a major advancement, it has certain limitations, most visibly inference cost: at roughly 3.45 GB and ~2.9 seconds per image on a data-center GPU, it is orders of magnitude heavier than real-time segmentation models (see the comparison above).
!!! quote ""
=== "BibTeX"
```bibtex
@inproceedings{sam3_2025,
title = {SAM 3: Segment Anything with Concepts},
author = {Anonymous authors},
booktitle = {Submitted to ICLR 2026},
year = {2025},
url = {https://openreview.net/forum?id=r35clVtGzw},
note = {Paper ID: 4183, under double-blind review}
}
```
SAM 3 was released by Meta on November 20th, 2025 and is fully integrated into Ultralytics as of version 8.3.237 (PR #22897). Full support is available for predict mode and track mode.
Yes! SAM 3 is fully integrated into the Ultralytics Python package, including concept segmentation, SAM 2–style visual prompts, and multi-object video tracking. SAM 3 also powers the smart annotation feature on Ultralytics Platform, where you can annotate images with just a few clicks.
PCS is a new task introduced in SAM 3 that segments all instances of a visual concept in an image or video. Unlike traditional segmentation that targets a specific object instance, PCS finds every occurrence of a category. For example, the text prompt "yellow school bus" returns masks for every school bus in the scene, not just a single one.
See related background on object detection and instance segmentation.
| Feature | SAM 2 | SAM 3 |
|---|---|---|
| Task | Single object per prompt | All instances of a concept |
| Prompt Types | Points, boxes, masks | + Text phrases, image exemplars |
| Detection Capability | Requires external detector | Built-in open-vocabulary detector |
| Recognition | Geometry-based only | Text and visual recognition |
| Architecture | Tracker only | Detector + Tracker with presence head |
| Zero-Shot Performance | N/A (requires visual prompts) | 47.0 AP on LVIS, 2× better on SA-Co |
| Interactive Refinement | Clicks only | Clicks + exemplar generalization |
SAM 3 maintains backward compatibility with SAM 2 visual prompting while adding concept-based capabilities.
SAM 3 is trained on the Segment Anything with Concepts (SA-Co) dataset:
Training Data:

- **SA-Co/HQ**: 5.2M high-quality human-annotated images with 4M unique noun phrases
- **SA-Co/SYN**: 38M synthetic noun phrases and 1.4B masks labeled without human involvement
- **SA-Co/EXT**: 15 external datasets enriched with hard negatives
- **SA-Co/VIDEO**: 52.5K videos with 24.8K unique noun phrases

Benchmark Data:

- 214K unique phrases across 126K images and videos, over 50× more concepts than existing benchmarks
This massive scale and diversity enables SAM 3's superior zero-shot generalization across open-vocabulary concepts.
SAM 3 and YOLO26 serve different use cases:
SAM 3 Advantages:

- Open-vocabulary segmentation from text or exemplar prompts, with no fixed class list
- Interactive refinement with exemplars and clicks
- State-of-the-art zero-shot accuracy (47.0 LVIS Mask AP)

YOLO26 Advantages:

- Orders of magnitude faster (8.4 ms vs 2921 ms per image) and smaller (6.4 MB vs 3.45 GB)
- Practical for real-time and edge deployment

Recommendation:

- Use SAM 3 for flexible, concept-driven exploration and annotation; use YOLO26 for high-throughput production inference on a known, fixed set of classes.
SAM 3 is designed for simple noun phrases (e.g., "red apple", "person wearing hat"). For complex queries requiring reasoning, combine SAM 3 with an MLLM as SAM 3 Agent:
Simple queries (native SAM 3): "red apple", "person wearing hat", "yellow school bus"

Complex queries (SAM 3 Agent with MLLM): descriptions that require reasoning over attributes, relations, or function, which the MLLM decomposes into simple noun phrases for SAM 3
SAM 3 Agent achieves 76.0 gIoU on ReasonSeg validation (vs 65.0 previous best, +16.9% improvement) by combining SAM 3's segmentation with MLLM reasoning capabilities.
On the SA-Co/Gold benchmark with triple human annotation, SAM 3 reaches 88% of the estimated human lower bound on CGF1. It thus approaches human-level accuracy on open-vocabulary concept segmentation, with the remaining gap concentrated on ambiguous or subjective concepts (e.g., "small window", "cozy room").