Back to Chroma

Sparse Vector Search Setup

docs/mintlify/cloud/schema/sparse-vector-search.mdx

1.5.97.9 KB
Original Source

import { Callout } from '/snippets/callout.mdx';

What are Sparse Vectors?

Sparse vectors are high-dimensional vectors with mostly zero values, designed for keyword-based retrieval. Unlike dense embeddings which capture semantic meaning, sparse vectors excel at:

  • Exact keyword matching: Finding documents containing specific terms
  • Domain-specific terminology: Better at matching technical terms, proper nouns, and rare words
  • Lexical retrieval: BM25-style retrieval patterns

Sparse vectors use models like SPLADE that assign importance weights to specific tokens, making them complementary to dense semantic embeddings.

Enabling Sparse Vector Index

To use sparse vectors, add a sparse vector index to your schema. The key parameter is the metadata field name where sparse embeddings will be stored - you can name it whatever you want:

<CodeGroup> ```python Python from chromadb import Schema, SparseVectorIndexConfig, K from chromadb.utils.embedding_functions import ChromaCloudSpladeEmbeddingFunction

schema = Schema()

Add sparse vector index for keyword-based search

"sparse_embedding" is just a metadata key name - use any name you prefer

sparse_ef = ChromaCloudSpladeEmbeddingFunction() schema.create_index( config=SparseVectorIndexConfig( source_key=K.DOCUMENT, embedding_function=sparse_ef ), key="sparse_embedding" )


```typescript TypeScript
import { Schema, SparseVectorIndexConfig, K } from 'chromadb';
import { ChromaCloudSpladeEmbeddingFunction } from '@chroma-core/chroma-cloud-splade';

const schema = new Schema();

// Add sparse vector index for keyword-based search
// "sparse_embedding" is just a metadata key name - use any name you prefer
const sparseEf = new ChromaCloudSpladeEmbeddingFunction({
  apiKeyEnvVar: "CHROMA_API_KEY"
});
schema.createIndex(
  new SparseVectorIndexConfig({
    sourceKey: K.DOCUMENT,
    embeddingFunction: sparseEf
  }),
  "sparse_embedding"
);
</CodeGroup> <Callout> The `source_key` specifies which field to generate sparse embeddings from (typically `K.DOCUMENT` for document text), and `embedding_function` specifies the function to generate the sparse embeddings. This example uses `ChromaCloudSpladeEmbeddingFunction`, but you can also use other sparse embedding functions like `HuggingFaceSparseEmbeddingFunction` or `FastembedSparseEmbeddingFunction`. The sparse embeddings are automatically generated and stored in the metadata field you specify as the `key`. </Callout>

Create Collection and Add Data

Create Collection with Schema

<CodeGroup> ```python Python import chromadb

client = chromadb.CloudClient( tenant="your-tenant", database="your-database", api_key="your-api-key" )

collection = client.create_collection( name="hybrid_search_collection", schema=schema )


```typescript TypeScript
import { CloudClient } from 'chromadb';

const client = new CloudClient({
  tenant: "your-tenant",
  database: "your-database",
  apiKey: "your-api-key"
});

const collection = await client.createCollection({
  name: "hybrid_search_collection",
  schema: schema
});
</CodeGroup>

Add Data

When you add documents, sparse embeddings are automatically generated from the source key:

<CodeGroup> ```python Python collection.add( ids=["doc1", "doc2", "doc3"], documents=[ "The quick brown fox jumps over the lazy dog", "A fast auburn fox leaps over a sleepy canine", "Machine learning is a subset of artificial intelligence" ], metadatas=[ {"category": "animals"}, {"category": "animals"}, {"category": "technology"} ] )

Sparse embeddings for "sparse_embedding" are generated automatically

from the documents (source_key=K.DOCUMENT)


```typescript TypeScript
await collection.add({
  ids: ["doc1", "doc2", "doc3"],
  documents: [
    "The quick brown fox jumps over the lazy dog",
    "A fast auburn fox leaps over a sleepy canine",
    "Machine learning is a subset of artificial intelligence"
  ],
  metadatas: [
    { category: "animals" },
    { category: "animals" },
    { category: "technology" }
  ]
});

// Sparse embeddings for "sparse_embedding" are generated automatically
// from the documents (source_key=K.DOCUMENT)
</CodeGroup>

Once configured, you can search using sparse vectors alone or combine them with dense embeddings for hybrid search.

Use sparse vectors for keyword-based retrieval:

<CodeGroup> ```python Python from chromadb import Search, K, Knn

Search using sparse embeddings only

sparse_rank = Knn(query="fox animal", key="sparse_embedding")

Build and execute search

search = (Search() .rank(sparse_rank) .limit(10) .select(K.DOCUMENT, K.SCORE))

results = collection.search(search)

Process results

for row in results.rows()[0]: print(f"Score: {row['score']:.3f} - {row['document']}")


```typescript TypeScript
import { Search, K, Knn } from 'chromadb';

// Search using sparse embeddings only
const sparseRank = Knn({ query: "fox animal", key: "sparse_embedding" });

// Build and execute search
const search = new Search()
  .rank(sparseRank)
  .limit(10)
  .select(K.DOCUMENT, K.SCORE);

const results = await collection.search(search);

// Process results
for (const row of results.rows()[0]) {
  console.log(`Score: ${row.score.toFixed(3)} - ${row.document}`);
}
</CodeGroup>

Hybrid search combines dense semantic embeddings with sparse keyword embeddings for improved retrieval quality. By merging results from both approaches using Reciprocal Rank Fusion (RRF), you often achieve better results than either approach alone.

  • Semantic + Lexical: Dense embeddings capture meaning while sparse vectors catch exact keywords
  • Improved recall: Finds relevant documents that either semantic or keyword search might miss alone
  • Balanced results: Combines the strengths of both retrieval methods

Combining Dense and Sparse with RRF

Use RRF (Reciprocal Rank Fusion) to merge dense and sparse search results:

<CodeGroup> ```python Python from chromadb import Search, K, Knn, Rrf

Create RRF ranking combining dense and sparse embeddings

hybrid_rank = Rrf( ranks=[ Knn(query="fox animal", return_rank=True), # Dense semantic search Knn(query="fox animal", key="sparse_embedding", return_rank=True) # Sparse keyword search ], weights=[0.7, 0.3], # 70% semantic, 30% keyword k=60 )

Build and execute search

search = (Search() .rank(hybrid_rank) .limit(10) .select(K.DOCUMENT, K.SCORE))

results = collection.search(search)

Process results

for row in results.rows()[0]: print(f"Score: {row['score']:.3f} - {row['document']}")


```typescript TypeScript
import { Search, K, Knn, Rrf } from 'chromadb';

// Create RRF ranking combining dense and sparse embeddings
const hybridRank = Rrf({
  ranks: [
    Knn({ query: "fox animal", returnRank: true }),           // Dense semantic search
    Knn({ query: "fox animal", key: "sparse_embedding", returnRank: true })  // Sparse keyword search
  ],
  weights: [0.7, 0.3],  // 70% semantic, 30% keyword
  k: 60
});

// Build and execute search
const search = new Search()
  .rank(hybridRank)
  .limit(10)
  .select(K.DOCUMENT, K.SCORE);

const results = await collection.search(search);

// Process results
for (const row of results.rows()[0]) {
  console.log(`Score: ${row.score.toFixed(3)} - ${row.document}`);
}
</CodeGroup> <Callout> For comprehensive details on RRF parameters, weight tuning, and advanced hybrid search strategies, see the [Search API Hybrid Search documentation](../search-api/hybrid-search). </Callout>

Next Steps