examples/sparse_encoder/applications/semantic_search/README.md
Semantic search refers to search techniques that go beyond traditional keyword-based search. Instead of relying solely on exact matches of keywords, semantic search aims to understand the meaning and context of the query and the documents being searched. This allows for more relevant and accurate search results, even when the exact keywords may not match.
Sparse embeddings are a type of representation where most of the values are zero, and only a small number of dimensions contain non-zero (a.k.a. active) values. This is in contrast to dense embeddings, where all dimensions typically have non-zero values. Traditional sparse embedding solutions are often lexically based, meaning they rely on exact matches of terms or phrases. However, modern sparse encoders like SPLADE and other sparse encoder models can generate embeddings that capture semantic meaning while still being sparse.
These embeddings can allow for extremely efficient semantic search, as long as the search solution takes good advantage of the fact that the large majority of sparse embedding dimensions are 0. This page shows an example demonstrating how to perform semantic search manually, but also how to integrate a SparseEncoder model with popular vector databases/search systems.
If you aren't familiar with Semantic Search, see the Sentence Transformers > Semantic Search for a broader explanation using dense embedding models.
Manually performing semantic search with sparse encoders is straightforward, and only consists of a few steps:
.. sidebar:: Documentation
#. :class:`SparseEncoder <sentence_transformers.sparse_encoder.model.SparseEncoder>`
#. :meth:`SparseEncoder.encode_query <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_query>`
#. :meth:`SparseEncoder.encode_document <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_document>`
#. :meth:`util.semantic_search <sentence_transformers.util.semantic_search>`
#. :meth:`SparseEncoder.similarity <sentence_transformers.sparse_encoder.model.SparseEncoder.similarity>`
#. `naver/splade-cocondenser-ensembledistil <https://huggingface.co/naver/splade-cocondenser-ensembledistil>`_
::
from sentence_transformers import SparseEncoder, util
# 1. Load a pretrained SparseEncoder model
model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 2. Encode a corpus of texts using the SparseEncoder model
corpus = [
"Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed.",
"Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning.",
"Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains.",
"Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments.",
"The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy.",
"SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond.",
"Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities.",
"Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time.",
"Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground.",
]
# Use "convert_to_tensor=True" to keep the tensors on GPU (if available)
corpus_embeddings = model.encode_document(corpus, convert_to_tensor=True)
# 3. Encode the user queries using the same SparseEncoder model
queries = [
"How do artificial neural networks work?",
"What technology is used for modern space exploration?",
"How can we address climate change challenges?",
]
query_embeddings = model.encode_query(queries, convert_to_tensor=True)
# 4. Use the similarity function to compute the similarity scores between the query and corpus embeddings
top_k = min(5, len(corpus)) # Find at most 5 sentences of the corpus for each query sentence
results = util.semantic_search(query_embeddings, corpus_embeddings, top_k=top_k, score_function=model.similarity)
# 5. Sort the results and print the top 5 most similar sentences for each query
for query_id, query in enumerate(queries):
pointwise_scores = model.intersection(query_embeddings[query_id], corpus_embeddings)
print(f"Query: {query}")
for res in results[query_id]:
corpus_id, score = res.values()
sentence = corpus[corpus_id]
pointwise_score = model.decode(pointwise_scores[corpus_id], top_k=10)
token_scores = ", ".join([f'("{token.strip()}", {value:.2f})' for token, value in pointwise_score])
print(f"Score: {score:.4f} - Sentence: {sentence} - Top influential tokens: {token_scores}")
print("")
"""
Query: How do artificial neural networks work?
Score: 16.9053 - Sentence: Neural networks are computing systems vaguely inspired by the biological neural networks that constitute animal brains. - Top influential tokens: ("neural", 5.71), ("networks", 3.24), ("network", 2.93), ("brain", 2.10), ("computer", 0.50), ("##uron", 0.32), ("artificial", 0.27), ("technology", 0.27), ("communication", 0.27), ("connection", 0.21)
Score: 13.6119 - Sentence: Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. - Top influential tokens: ("artificial", 3.71), ("neural", 3.15), ("networks", 1.78), ("brain", 1.22), ("network", 1.12), ("ai", 1.07), ("machine", 0.39), ("robot", 0.20), ("technology", 0.20), ("algorithm", 0.18)
Score: 2.7373 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("machine", 0.78), ("computer", 0.50), ("technology", 0.32), ("artificial", 0.22), ("robot", 0.21), ("ai", 0.20), ("process", 0.16), ("theory", 0.11), ("technique", 0.11), ("fuzzy", 0.06)
Score: 2.1430 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("technology", 0.42), ("function", 0.41), ("mechanism", 0.21), ("sensor", 0.21), ("device", 0.18), ("process", 0.18), ("generator", 0.13), ("detection", 0.10), ("technique", 0.10), ("tracking", 0.05)
Score: 2.0195 - Sentence: Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments. - Top influential tokens: ("robot", 0.67), ("function", 0.34), ("technology", 0.29), ("device", 0.23), ("experiment", 0.20), ("machine", 0.10), ("artificial", 0.08), ("design", 0.04), ("useful", 0.03), ("they", 0.02)
Query: What technology is used for modern space exploration?
Score: 10.4748 - Sentence: SpaceX's Starship is designed to be a fully reusable transportation system capable of carrying humans to Mars and beyond. - Top influential tokens: ("space", 4.40), ("technology", 1.15), ("nasa", 1.06), ("mars", 0.63), ("exploration", 0.52), ("spacecraft", 0.44), ("robot", 0.32), ("rocket", 0.28), ("astronomy", 0.27), ("travel", 0.26)
Score: 9.3818 - Sentence: The James Webb Space Telescope is the largest optical telescope in space, designed to conduct infrared astronomy. - Top influential tokens: ("space", 3.89), ("nasa", 1.09), ("astronomy", 0.93), ("discovery", 0.48), ("instrument", 0.47), ("technology", 0.35), ("device", 0.26), ("spacecraft", 0.25), ("invented", 0.22), ("equipment", 0.22)
Score: 8.5147 - Sentence: Mars rovers are robotic vehicles designed to travel on the surface of Mars to collect data and perform experiments. - Top influential tokens: ("technology", 1.39), ("mars", 0.79), ("exploration", 0.78), ("robot", 0.67), ("used", 0.66), ("nasa", 0.52), ("spacecraft", 0.44), ("device", 0.39), ("explore", 0.38), ("travel", 0.25)
Score: 7.6993 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("technology", 1.99), ("tech", 1.76), ("technologies", 1.74), ("equipment", 0.32), ("device", 0.31), ("technological", 0.28), ("mining", 0.22), ("sensor", 0.19), ("tool", 0.18), ("software", 0.11)
Score: 2.5526 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("technology", 1.52), ("machine", 0.27), ("robot", 0.21), ("computer", 0.18), ("engineering", 0.12), ("technique", 0.11), ("science", 0.05), ("technological", 0.05), ("techniques", 0.02), ("innovation", 0.01)
Query: How can we address climate change challenges?
Score: 9.5587 - Sentence: Global warming is the long-term heating of Earth's climate system observed since the pre-industrial period due to human activities. - Top influential tokens: ("climate", 3.21), ("warming", 2.87), ("weather", 1.58), ("change", 0.46), ("global", 0.41), ("environmental", 0.39), ("storm", 0.19), ("pollution", 0.15), ("environment", 0.11), ("adaptation", 0.08)
Score: 1.3191 - Sentence: Carbon capture technologies aim to collect CO2 emissions before they enter the atmosphere and store them underground. - Top influential tokens: ("warming", 0.39), ("pollution", 0.34), ("environmental", 0.15), ("goal", 0.12), ("strategy", 0.07), ("monitoring", 0.07), ("protection", 0.06), ("greenhouse", 0.05), ("safety", 0.02), ("escape", 0.01)
Score: 1.0774 - Sentence: Renewable energy sources include solar, wind, hydro, and geothermal power that naturally replenish over time. - Top influential tokens: ("conservation", 0.39), ("sustainability", 0.18), ("environmental", 0.18), ("sustainable", 0.13), ("agriculture", 0.13), ("alternative", 0.07), ("recycling", 0.00)
Score: 0.2401 - Sentence: Machine learning is a field of study that gives computers the ability to learn without being explicitly programmed. - Top influential tokens: ("strategy", 0.10), ("success", 0.06), ("foster", 0.04), ("engineering", 0.03), ("innovation", 0.00), ("research", 0.00)
Score: 0.1516 - Sentence: Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. - Top influential tokens: ("strategy", 0.09), ("foster", 0.04), ("research", 0.01), ("approach", 0.01), ("engineering", 0.01)
"""
Alternatively, some vector databases and search engines can be used to perform semantic search with sparse encoders. These systems are designed to efficiently handle large-scale vector data and provide fast retrieval of relevant documents. They can leverage the sparsity of the embeddings to optimize storage and search operations.
The overall structure is similar to the manual search, but the vector database handles the indexing and retrieval of documents. The steps are approximately as follows:
The advantages of Sparse Vectors for search are:
This example demonstrates how to set up Qdrant for sparse vector search by showing how to efficiently encode and index documents with sparse encoders, formulating search queries with sparse vectors, and providing an interactive query interface. See semantic_search_qdrant.py or below:
pip install qdrant-client
.. sidebar:: Documentation
#. :class:`SparseEncoder <sentence_transformers.sparse_encoder.model.SparseEncoder>`
#. :meth:`SparseEncoder.encode_query <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_query>`
#. :meth:`SparseEncoder.encode_document <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_document>`
#. :meth:`semantic_search_qdrant <sentence_transformers.sparse_encoder.search_engines.semantic_search_qdrant>`
#. `naver/splade-cocondenser-ensembledistil <https://huggingface.co/naver/splade-cocondenser-ensembledistil>`_
#. `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
::
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_qdrant
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
# Initially, we don't have a qdrant index yet
corpus_index = None
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
print(f"Encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using qdrant
results, search_time, corpus_index = semantic_search_qdrant(
query_embeddings,
corpus_index=corpus_index,
corpus_embeddings=corpus_embeddings if corpus_index is None else None,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
This example demonstrates how to set up OpenSearch for sparse vector search by showing how to efficiently encode and index documents with sparse encoders, formulating search queries with sparse vectors, and providing an interactive query interface. See semantic_search_opensearch.py or below:
pip install opensearch-py
opensearch v2.15.0+.
.. sidebar:: Documentation
#. :class:`SparseEncoder <sentence_transformers.sparse_encoder.model.SparseEncoder>`
#. :meth:`SparseEncoder.encode_query <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_query>`
#. :meth:`SparseEncoder.encode_document <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_document>`
#. :meth:`semantic_search_opensearch <sentence_transformers.sparse_encoder.search_engines.semantic_search_opensearch>`
#. `opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill <https://huggingface.co/opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill>`_
#. `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
::
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.modules import Router, SparseStaticEmbedding, SpladePooling, Transformer
from sentence_transformers.sparse_encoder.search_engines import semantic_search_opensearch
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
print(f"Finish loading data. Corpus size: {len(corpus)}")
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
model_id = "opensearch-project/opensearch-neural-sparse-encoding-doc-v3-distill"
doc_encoder = Transformer(model_id, transformer_task="fill-mask")
router = Router.for_query_document(
query_modules=[
SparseStaticEmbedding.from_json(
model_id,
tokenizer=doc_encoder.tokenizer,
frozen=True,
),
],
document_modules=[
doc_encoder,
SpladePooling("max", activation_function="log1p_relu"),
],
)
sparse_model = SparseEncoder(modules=[router], similarity_fn_name="dot")
print("Start encoding corpus...")
start_time = time.time()
# 4. Encode the corpus
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=32, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
# 5. Encode the queries using inference-free mode
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
query_embeddings_decoded = sparse_model.decode(query_embeddings)
print(f"Query encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using OpenSearch
results, search_time, corpus_index = semantic_search_opensearch(
query_embeddings_decoded,
corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
corpus_index=corpus_index,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
This example demonstrates how to set up Elasticsearch for sparse vector search by showing how to efficiently encode and index documents with sparse encoders, formulating search queries with sparse vectors, and providing an interactive query interface. See semantic_search_elasticsearch.py or below:
pip install elasticsearch
.. sidebar:: Documentation
#. :class:`SparseEncoder <sentence_transformers.sparse_encoder.model.SparseEncoder>`
#. :meth:`SparseEncoder.encode_query <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_query>`
#. :meth:`SparseEncoder.encode_document <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_document>`
#. :meth:`semantic_search_elasticsearch <sentence_transformers.sparse_encoder.search_engines.semantic_search_elasticsearch>`
#. `naver/splade-cocondenser-ensembledistil <https://huggingface.co/naver/splade-cocondenser-ensembledistil>`_
#. `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
::
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_elasticsearch
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
print("Start encoding corpus...")
start_time = time.time()
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
query_embeddings_decoded = sparse_model.decode(query_embeddings)
print(f"Encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using Elasticsearch
results, search_time, corpus_index = semantic_search_elasticsearch(
query_embeddings_decoded,
corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
corpus_index=corpus_index,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
This example demonstrates how to use Seismic for extremely performant sparse vector search. It does not require running a separate client, but instead performs search directly in memory. The Seismic library was introduced in Bruch et al. (2024), where it's shown to outperform the common inverted file (IVF) approach by an order of magnitude. For more information on building your Seismic Index you can look at the Seismic Guidelines. See semantic_search_seismic.py or below:
pip install pyseismic-lsr
.. sidebar:: Documentation
#. :class:`SparseEncoder <sentence_transformers.sparse_encoder.model.SparseEncoder>`
#. :meth:`SparseEncoder.encode_query <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_query>`
#. :meth:`SparseEncoder.encode_document <sentence_transformers.sparse_encoder.model.SparseEncoder.encode_document>`
#. :meth:`semantic_search_seismic <sentence_transformers.sparse_encoder.search_engines.semantic_search_seismic>`
#. `naver/splade-cocondenser-ensembledistil <https://huggingface.co/naver/splade-cocondenser-ensembledistil>`_
#. `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
::
import time
from datasets import load_dataset
from sentence_transformers import SparseEncoder
from sentence_transformers.sparse_encoder.search_engines import semantic_search_seismic
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("naver/splade-cocondenser-ensembledistil")
# 4. Encode the corpus
print("Start encoding corpus...")
start_time = time.time()
corpus_embeddings = sparse_model.encode_document(
corpus, convert_to_sparse_tensor=True, batch_size=16, show_progress_bar=True
)
corpus_embeddings_decoded = sparse_model.decode(corpus_embeddings)
print(f"Corpus encoding time: {time.time() - start_time:.6f} seconds")
corpus_index = None
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
query_embeddings = sparse_model.encode_query(queries, convert_to_sparse_tensor=True)
query_embeddings_decoded = sparse_model.decode(query_embeddings)
print(f"Encoding time: {time.time() - start_time:.6f} seconds")
# 6. Perform semantic search using Seismic
results, search_time, corpus_index = semantic_search_seismic(
query_embeddings_decoded,
corpus_embeddings_decoded=corpus_embeddings_decoded if corpus_index is None else None,
corpus_index=corpus_index,
top_k=5,
output_index=True,
)
# 7. Output the results
print(f"Search time: {search_time:.6f} seconds")
for query, result in zip(queries, results):
print(f"Query: {query}")
for entry in result:
print(f"(Score: {entry['score']:.4f}) {corpus[entry['corpus_id']]}, corpus_id: {entry['corpus_id']}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]
This example demonstrates how to use splade-index for very fast sparse vector search powered by SciPy sparse matrices, built on top of the excellent bm25s, a fast BM25 implementation. It does not require running a separate client, but instead performs search directly in memory. See semantic_search_splade_index.py or below:
pip install splade-index
.. sidebar:: Documentation
#. :class:`SparseEncoder <sentence_transformers.sparse_encoder.model.SparseEncoder>`
#. `rasyosef/splade-tiny <https://huggingface.co/rasyosef/splade-tiny>`_
#. `rasyosef/splade-mini <https://huggingface.co/rasyosef/splade-mini>`_
#. `sentence-transformers/natural-questions <https://huggingface.co/datasets/sentence-transformers/natural-questions>`_
::
import time
from datasets import load_dataset
from splade_index import SPLADE
from sentence_transformers import SparseEncoder
# 1. Load the natural-questions dataset with 100K answers
dataset = load_dataset("sentence-transformers/natural-questions", split="train")
num_docs = 10_000
corpus = dataset["answer"][:num_docs]
# 2. Come up with some queries
queries = dataset["query"][:2]
# 3. Load the model
sparse_model = SparseEncoder("rasyosef/splade-tiny")
# 4. Encode the corpus & create the index
print("Start encoding corpus and creating index...")
start_time = time.time()
corpus_index = SPLADE()
corpus_index.index(model=sparse_model, documents=corpus, batch_size=16, show_progress=True)
print(f"Encoded corpus and created index in {time.time() - start_time:.6f} seconds")
while True:
# 5. Encode the queries using the full precision
start_time = time.time()
all_doc_ids, all_documents, all_scores = corpus_index.retrieve(queries, k=5)
print(f"Encoding & Search time: {time.time() - start_time:.6f} seconds")
# 7. Output the results
for query, doc_ids, documents, scores in zip(queries, all_doc_ids, all_documents, all_scores):
print(f"Query: {query}")
for doc_id, document, score in zip(doc_ids, documents, scores):
print(f"(Score: {score:.4f}) {document}, corpus_id: {doc_id}")
print("")
# 8. Prompt for more queries
queries = [input("Please enter a question: ")]