docs/mintlify/integrations/embedding-models/google-gemini.mdx
Chroma provides a convenient wrapper around Google's Generative AI embedding API. This embedding function runs remotely on Google's servers and requires an API key.
You can get an API key by signing up for an account at Google AI Studio.
<Tabs>
<Tab title="Python" icon="python">

This embedding function relies on the `google-genai` Python package, which you can install with `pip install google-genai`.
```python
import chromadb
import chromadb.utils.embedding_functions as embedding_functions

client = chromadb.Client()

# The GoogleGeminiEmbeddingFunction expects the API key in the
# GEMINI_API_KEY environment variable.
google_ef = embedding_functions.GoogleGeminiEmbeddingFunction(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
)

# Use directly
google_ef(["document1", "document2"])

# Or pass to a collection, so documents are embedded automatically
# on .add and .query
collection = client.create_collection(name="name", embedding_function=google_ef)
collection = client.get_collection(name="name", embedding_function=google_ef)
```
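Once the collection is created, documents are embedded for you on insert and retrieval. A minimal usage sketch (the IDs and document texts are illustrative):

```python
# Documents are embedded with Gemini automatically on .add
collection.add(
    ids=["doc1", "doc2"],
    documents=["Chroma is a vector database.", "Gemini produces embeddings."],
)

# The query text is embedded with the same function before searching
results = collection.query(query_texts=["What is Chroma?"], n_results=1)
print(results["documents"])
```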
You can optionally specify the `dimension` parameter to control the output dimensionality of the embeddings (supported range: 128–3072):
```python
google_ef = embedding_functions.GoogleGeminiEmbeddingFunction(
    model_name="gemini-embedding-001",
    task_type="RETRIEVAL_DOCUMENT",
    dimension=768,
)
```
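Note that Google's documentation indicates truncated `gemini-embedding-001` vectors are not pre-normalized, so if your similarity metric assumes unit-length vectors you may want to normalize them yourself. A minimal sketch using numpy (assuming the `google_ef` instance from above):

```python
import numpy as np

embeddings = np.array(google_ef(["document1", "document2"]))

# Scale each vector to unit length so cosine and dot-product rankings agree
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
normalized = embeddings / norms
```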
You can view a more complete example of chatting over documents with Gemini embedding and language models.
For more information, please visit the official Google docs.
</Tab>
<Tab title="TypeScript" icon="js">

```typescript
// npm install @chroma-core/google-gemini
import { ChromaClient } from "chromadb";
import { GoogleGeminiEmbeddingFunction } from "@chroma-core/google-gemini";

const client = new ChromaClient();

const embedder = new GoogleGeminiEmbeddingFunction({
  apiKey: "<YOUR API KEY>",
  modelName: "gemini-embedding-001",
});

// Use directly
const embeddings = await embedder.generate(["document1", "document2"]);

// Or pass to a collection, so documents are embedded automatically
// on .add and .query
const collection = await client.createCollection({
  name: "name",
  embeddingFunction: embedder,
});
const collectionGet = await client.getCollection({
  name: "name",
  embeddingFunction: embedder,
});
```
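As in Python, documents are embedded for you on insert and retrieval. A minimal usage sketch (the IDs and document texts are illustrative):

```typescript
// Documents are embedded with Gemini automatically on add
await collection.add({
  ids: ["doc1", "doc2"],
  documents: ["Chroma is a vector database.", "Gemini produces embeddings."],
});

// The query text is embedded with the same function before searching
const results = await collection.query({
  queryTexts: ["What is Chroma?"],
  nResults: 1,
});
console.log(results.documents);
```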
You can view a more complete example using Node.
For more information, please visit the official Google docs.
</Tab>
</Tabs>

The GoogleGeminiEmbeddingFunction supports the new `gemini-embedding-2-preview` model from Google. It is Google's first fully multimodal embedding model, capable of mapping text, images, video, audio, PDFs, and interleaved combinations thereof into a single, unified vector space. By natively handling interleaved data without intermediate processing steps, this model simplifies complex pipelines and unlocks new capabilities for RAG, agentic search, recommendation systems, and more.
Traditional embedding models work with a single modality—typically text. If you wanted to search across images, you'd need a separate image embedding model, and the two vector spaces wouldn't be compatible. Searching for "a red sports car" in a text collection and an image collection would require different queries and different indices.
Multimodal embeddings solve this by projecting different types of content into the same vector space. A text description like "a chef mixing ingredients in a bowl" and an image of that scene will have similar embeddings, so a single query against a single index can retrieve matching content in any modality. This is particularly powerful for applications like multimodal RAG, agentic search, and recommendation systems.
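To make the shared-space idea concrete, here is a toy sketch. The vectors below are random stand-ins for real model output; the point is that once text and images live in one space, cross-modal similarity is a single computation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for real embeddings: in a unified multimodal space, a text
# query and an image frame of the same scene map to nearby vectors.
text_query_vec = rng.normal(size=768)
image_frame_vec = text_query_vec + 0.1 * rng.normal(size=768)  # similar content
unrelated_vec = rng.normal(size=768)                           # different content

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(text_query_vec, image_frame_vec))  # close to 1.0
print(cosine_similarity(text_query_vec, unrelated_vec))    # close to 0.0
```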
In the Chroma Cookbooks repo, we feature an example using multimodal embeddings to search through YouTube videos. The project downloads a video, extracts frames and transcript, embeds everything into a single Chroma collection, and then uses an agentic search loop with Gemini to answer questions about the video.
For example, given a cooking video like this apple tart recipe, you can ask questions like "How many bowls are in the video?"
The agent uses a `semantic_search` tool to query the collection and can actually see the retrieved images, making it capable of answering visual questions that would be impossible with text-only search.
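As a hedged sketch of how such a loop can be wired up with the `google-genai` SDK's automatic function calling (the collection name, model choice, and wiring here are illustrative assumptions, not the cookbook's exact code):

```python
import os

import chromadb
from google import genai
from google.genai import types

# Illustrative wiring; the cookbook's actual implementation may differ.
chroma = chromadb.CloudClient(
    tenant=os.environ["CHROMA_TENANT"],
    database=os.environ["CHROMA_DATABASE"],
    api_key=os.environ["CHROMA_API_KEY"],
)
collection = chroma.get_collection("multimodal-video-example")  # hypothetical name

def semantic_search(query: str) -> list[str]:
    """Search the video's indexed frames and transcript chunks."""
    results = collection.query(query_texts=[query], n_results=5)
    return results["documents"][0]

gemini = genai.Client()  # reads GEMINI_API_KEY from the environment

# google-genai can call plain Python functions as tools automatically
response = gemini.models.generate_content(
    model="gemini-2.5-flash",
    contents="How many bowls are in the video?",
    config=types.GenerateContentConfig(tools=[semantic_search]),
)
print(response.text)
```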
How it works:

- The video is downloaded with `yt-dlp`, frames are extracted at 1-second intervals using `ffmpeg`, and the transcript is fetched via the YouTube API.
- Frames and transcript chunks are embedded with `gemini-embedding-2-preview` into a single collection named `multimodal-video-{video_id}`.
- The agent answers questions using the `semantic_search` tool. When it retrieves image results, the actual images are passed to the model so it can see them.

To run it yourself, clone the repo and create a `.env` file:

```bash
git clone https://github.com/chroma-core/chroma-cookbooks.git
cd chroma-cookbooks/multimodal-video-search
touch .env
```

Add your credentials to `.env`:

```bash
GEMINI_API_KEY=<YOUR GEMINI API KEY>
CHROMA_HOST=api.trychroma.com
CHROMA_API_KEY=<YOUR CHROMA API KEY>
CHROMA_TENANT=<YOUR CHROMA TENANT>
CHROMA_DATABASE=multimodal-video-search
```

Then install dependencies (`ffmpeg` is required for frame extraction):

```bash
uv sync
brew install ffmpeg
```
Run the project with a YouTube URL and a question:

```bash
uv run python main.py "https://youtube.com/shorts/wHI926TlQcM" "How many bowls are in the video?"
```
The first run will download the video, extract frames, embed them, and index everything to Chroma. Subsequent runs with the same video will skip indexing and go straight to answering your question.
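One simple way to implement this skip-if-indexed behavior (a sketch of the general pattern, not necessarily the cookbook's exact approach):

```python
import chromadb

chroma = chromadb.Client()  # or CloudClient(...) when indexing to Chroma Cloud

video_id = "wHI926TlQcM"
collection = chroma.get_or_create_collection(f"multimodal-video-{video_id}")

if collection.count() == 0:
    # First run: download the video, extract frames, embed, and index here
    ...
# Subsequent runs skip straight to the agentic question-answering loop
```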
You can watch the agent's search process in the terminal output—it will show each search query and the number of results found before providing its final answer.