This guide walks you through building a semantic image search engine using OpenAI CLIP, Meta FAISS, and Flask. By combining CLIP's visual-language embeddings with FAISS's efficient nearest-neighbor search, you can build a web interface that retrieves relevant images from natural language queries, no labels or categories required.
<p align="center">
  <iframe loading="lazy" width="720" height="405" src="https://www.youtube.com/embed/zplKRlX3sLg" title="YouTube video player" frameborder="0" allow="accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture; web-share" allowfullscreen>
  </iframe>
  <br>
  <strong>Watch:</strong> How Similarity Search Works | Visual Search Using OpenAI CLIP, META FAISS and Ultralytics Package 🎉
</p>

The Ultralytics Python package wraps this entire pipeline behind two classes, so you can launch a working search app or run queries programmatically in a few lines. This guide covers why semantic search is useful, how it works, running the web app, searching programmatically, and configuring parameters.
Building your own semantic image search system with CLIP and FAISS provides several compelling advantages:
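The pipeline combines three components, each handling one stage of turning images and text into ranked results:

- **CLIP** encodes images and text queries into embeddings in a shared vector space.
- **FAISS** indexes the image embeddings and finds the nearest neighbors of a query embedding.
- **Flask** serves the web interface where you type a query and view the ranked results.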
Because both images and text land in the same vector space, retrieval is zero-shot: you don't need labels or categories, just image data and a good prompt.
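To make the shared vector space concrete, the sketch below ranks a few images against a query by cosine similarity. The three-dimensional vectors and filenames are made up for illustration; real CLIP embeddings have hundreds of dimensions and come from the model, and the package may compute similarity differently internally.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_images(query_embedding, image_embeddings):
    """Return (filename, score) pairs sorted from most to least similar."""
    scores = [(name, cosine_similarity(query_embedding, vec)) for name, vec in image_embeddings.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


# Toy embeddings (hypothetical values, not produced by CLIP)
image_embeddings = {
    "dog_on_bench.jpg": [0.9, 0.1, 0.2],
    "city_skyline.jpg": [0.1, 0.9, 0.3],
    "cat_on_sofa.jpg": [0.6, 0.2, 0.7],
}
query_embedding = [1.0, 0.0, 0.1]  # stands in for the embedded text query

for name, score in rank_images(query_embedding, image_embeddings):
    print(f"{name} | Similarity: {score:.4f}")
```

No label is attached to any image here: the ranking comes purely from how close each image vector lies to the query vector.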
The `SearchApp` class launches the full Flask interface. On first run it downloads a sample image set, builds the FAISS index, and serves a page where you can type a query and view ranked results.
??? note "Image Path Warning"

    If you're using your own images, make sure to provide an absolute path to the image directory. Otherwise, the images may not appear on the webpage due to Flask's file serving limitations.
=== "Python"

    ```python
    from ultralytics import solutions

    app = solutions.SearchApp(
        # data = "path/to/img/directory" # Optional, build search engine with your own images
        device="cpu"  # configure the device for processing, e.g., "cpu" or "cuda"
    )

    app.run(debug=False)  # You can also use `debug=True` argument for testing
    ```
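
The absolute-path requirement mentioned in the note above can be handled before constructing the app. The sketch below is one way to do it with `pathlib`; `resolve_image_dir` and `launch_search_app` are hypothetical helpers, not part of the Ultralytics package, and `"images"` in the usage note is a placeholder for your own directory.

```python
from pathlib import Path


def resolve_image_dir(path):
    """Return an absolute path string for an existing image directory."""
    directory = Path(path).expanduser().resolve()
    if not directory.is_dir():
        raise NotADirectoryError(f"Image directory not found: {directory}")
    return str(directory)


def launch_search_app(image_dir, device="cpu"):
    """Start the web app on a directory, passing the resolved absolute path."""
    from ultralytics import solutions  # imported here so the helper above works on its own

    app = solutions.SearchApp(data=resolve_image_dir(image_dir), device=device)
    app.run(debug=False)
```

Calling `launch_search_app("images")` would then fail early with a clear error if the directory is missing, instead of serving a page with broken images.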
The `VisualAISearch` class performs all the backend operations without the web layer.
Call the searcher with a natural language query to get back a list of matching image filenames ranked by similarity:
=== "Python"

    ```python
    from ultralytics import solutions

    searcher = solutions.VisualAISearch(
        # data = "path/to/img/directory" # Optional, build search engine with your own images
        device="cpu"  # configure the device for processing, e.g., "cpu" or "cuda"
    )

    results = searcher("a dog sitting on a bench")

    # Ranked Results:
    #     - 000000546829.jpg | Similarity: 0.3269
    #     - 000000549220.jpg | Similarity: 0.2899
    #     - 000000517069.jpg | Similarity: 0.2761
    #     - 000000029393.jpg | Similarity: 0.2742
    #     - 000000534270.jpg | Similarity: 0.2680
    ```
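
Because the searcher returns filenames rather than full paths, a small helper can map them back to files on disk for display or further processing. The sketch below assumes `results` is a plain list of filename strings ordered best-first, as described above; `top_result_paths` and the `images` directory name are hypothetical and not part of the Ultralytics API.

```python
from pathlib import Path


def top_result_paths(results, image_dir, k=3):
    """Join the first k ranked filenames onto the image directory."""
    base = Path(image_dir)
    return [base / name for name in results[:k]]


# Stand-in for the list returned by `searcher("a dog sitting on a bench")`
results = ["000000546829.jpg", "000000549220.jpg", "000000517069.jpg", "000000029393.jpg"]

for path in top_result_paths(results, "images", k=2):
    print(path)
```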
The table below outlines the available parameters for `VisualAISearch`:

{% from "macros/solutions-args.md" import param_table %}
{{ param_table(["data"]) }}

{% from "macros/track-args.md" import param_table %}
{{ param_table(["device"]) }}
!!! tip "Manage your data in the cloud"

    To search image collections at production scale without managing local files, you can organize and version your images in the [Ultralytics Platform](../platform/data/index.md) before indexing them with CLIP and FAISS.
With CLIP, FAISS, and the Ultralytics Python package, you can stand up a zero-shot semantic image search engine in just a few lines, either as a Flask web app or as a programmatic search backend. From here, point `data` at your own image directory to index it, then explore other Ultralytics Solutions to build on top of your computer vision workflows.
CLIP (Contrastive Language-Image Pre-training) is a model developed by OpenAI that learns to connect visual and linguistic information. It is trained on a massive dataset of images paired with natural language captions. This training allows it to map both images and text into a shared embedding space, so you can compare them directly using vector similarity.
What makes CLIP stand out is its ability to generalize. Instead of being trained just for specific labels or tasks, it learns from natural language itself. This allows it to handle flexible queries like "a man riding a jet ski" or "a surreal dreamscape," making it useful for everything from classification to creative semantic search, without retraining.
FAISS (Facebook AI Similarity Search) is a toolkit for searching high-dimensional vectors efficiently. Once CLIP turns your images into embeddings, FAISS indexes them so the closest matches to a text query can be found quickly, which makes it well suited to real-time image retrieval.
While CLIP and FAISS are developed by OpenAI and Meta respectively, the Ultralytics Python package wraps both into a complete semantic image search pipeline that you can run in a few lines of code:
=== "Python"

    ```python
    from ultralytics import solutions

    searcher = solutions.VisualAISearch(
        # data = "path/to/img/directory" # Optional, build search engine with your own images
        device="cpu"  # configure the device for processing, e.g., "cpu" or "cuda"
    )

    results = searcher("a dog sitting on a bench")
    ```
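This high-level implementation handles:

- Generating CLIP embeddings for images and text queries.
- Building the FAISS index from your image directory.
- Ranking images by similarity to the query.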
Yes. The current setup uses Flask with a basic HTML frontend, but you can replace it with your own HTML or build a more dynamic UI with React, Vue, or another frontend framework. Flask can serve as the backend API for your custom interface.
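
One way to expose the search backend to a custom frontend is a small JSON endpoint. The sketch below is an illustrative outline, not the package's own web layer: the `/search` route, the `q` parameter, and the `build_payload` and `create_app` helpers are hypothetical names, and `searcher` is assumed to be any callable that returns a ranked list of filenames, such as a `VisualAISearch` instance.

```python
def build_payload(searcher, query):
    """Run a query and shape the ranked filenames into a JSON-friendly dict."""
    query = (query or "").strip()
    if not query:
        return {"query": "", "results": [], "error": "empty query"}
    return {"query": query, "results": list(searcher(query))}


def create_app(searcher):
    """Wrap the searcher in a minimal Flask app with one GET endpoint."""
    from flask import Flask, jsonify, request  # requires Flask to be installed

    app = Flask(__name__)

    @app.route("/search")
    def search():
        payload = build_payload(searcher, request.args.get("q"))
        status = 400 if "error" in payload else 200
        return jsonify(payload), status

    return app
```

A React or Vue frontend could then request `/search?q=a+dog+sitting+on+a+bench` and render the returned filenames however it likes.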
Not directly. A simple workaround is to extract individual frames from your videos (e.g., one every second), treat them as standalone images, and feed those into the system. This way, the search engine can semantically index visual moments from your videos.
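
The frame-sampling workaround could look roughly like the sketch below, which assumes OpenCV (`cv2`) is installed. `sampling_step` and `extract_frames` are hypothetical helpers, the 30 FPS fallback is an arbitrary choice, and the output naming scheme is only one possibility.

```python
from pathlib import Path


def sampling_step(fps, every_seconds=1.0):
    """Number of frames to advance between saved frames (at least 1)."""
    return max(1, round(fps * every_seconds))


def extract_frames(video_path, output_dir, every_seconds=1.0):
    """Save one frame per interval as a JPEG and return the written paths."""
    import cv2  # requires opencv-python

    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(str(video_path))
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0  # fall back if the file reports 0 FPS
    step = sampling_step(fps, every_seconds)

    saved, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:  # end of video or read error
            break
        if index % step == 0:
            target = out / f"{Path(video_path).stem}_{index:06d}.jpg"
            cv2.imwrite(str(target), frame)
            saved.append(target)
        index += 1

    cap.release()
    return saved
```

The resulting directory can then be passed as `data` to `VisualAISearch` or `SearchApp`, and the frame index in each filename lets you map a search hit back to a moment in the video.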