Optimized Embedding Model using Optimum-Intel

LlamaIndex supports loading quantized embedding models for Intel hardware through the Optimum-Intel library.

Optimized models are smaller and faster, with minimal loss in accuracy; see the documentation and an optimization guide that uses the IntelLabs/fastRAG library.

Optimization relies on math instructions available in 4th-generation Xeon® processors or newer.
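If you are unsure whether a machine exposes these instructions, you can check the CPU flags directly. The sketch below is our own helper, not part of any library, and is Linux-only since it reads /proc/cpuinfo; it looks for the AVX-512 VNNI and AMX int8 flags used by quantized int8 kernels:

```python
# Minimal sketch (Linux-only): inspect /proc/cpuinfo for the instruction-set
# flags that int8 inference relies on. The helper name is ours, for illustration.
from pathlib import Path


def has_int8_acceleration() -> bool:
    flags = Path("/proc/cpuinfo").read_text()
    return any(flag in flags for flag in ("avx512_vnni", "amx_int8"))


print("int8 acceleration available:", has_int8_acceleration())
```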

To load and use the quantized models, install the required dependencies: `pip install optimum[exporters] optimum-intel neural-compressor intel_extension_for_pytorch`.

Models are loaded with the `IntelEmbedding` class; usage is similar to any other local HuggingFace embedding model. See the example below:

```python
%pip install llama-index-embeddings-huggingface-optimum-intel
```
```python
from llama_index.embeddings.huggingface_optimum_intel import IntelEmbedding

# Load a statically quantized int8 BGE model published by Intel
embed_model = IntelEmbedding("Intel/bge-small-en-v1.5-rag-int8-static")
```
```python
embeddings = embed_model.get_text_embedding("Hello World!")
print(len(embeddings))  # embedding dimension (384 for bge-small)
print(embeddings[:5])
```
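Once created, the model can be used like any other LlamaIndex embedding model. Below is a minimal sketch, assuming the llama-index core package is installed and that a `./data` directory (a placeholder path) contains documents, that registers the quantized model as the default for indexing:

```python
# A minimal sketch: set the quantized model as the global default embedding
# model, then build a vector index that uses it. "./data" is a placeholder path.
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex

Settings.embed_model = embed_model  # the IntelEmbedding instance from above

documents = SimpleDirectoryReader("./data").load_data()
index = VectorStoreIndex.from_documents(documents)
```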