docs/content/Models/embeddings.md
Embedding models are a crucial component of DocsGPT, enabling its powerful document understanding and question-answering capabilities. This guide will explain what embedding models are, why they are essential for DocsGPT, and how to configure them.
In simple terms, an embedding model is a type of language model that converts text into numerical vectors. These vectors, known as embeddings, capture the semantic meaning of the text. Think of it as translating words and sentences into a language that computers can understand mathematically, where similar meanings are represented by vectors that are close to each other in vector space.
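To make the "close in vector space" idea concrete, here is a minimal, self-contained sketch using toy 3-dimensional vectors (hand-picked for illustration, not real model output) and cosine similarity, the comparison commonly used for embeddings:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = same direction, ~0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": in a real system these come from a model such as all-mpnet-base-v2.
emb = {
    "How do I reset my password?": [0.9, 0.1, 0.2],
    "Steps to change your login credentials": [0.8, 0.2, 0.3],
    "Best pizza toppings": [0.1, 0.9, 0.1],
}

query = emb["How do I reset my password?"]
for text, vec in emb.items():
    print(f"{cosine_similarity(query, vec):.3f}  {text}")
```

The two sentences about passwords score far higher against each other than against the pizza sentence, which is exactly the property DocsGPT relies on when matching questions to document passages.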
Why are embedding models important for DocsGPT?
DocsGPT uses embedding models for several key tasks: converting your uploaded documents into vectors when they are ingested, converting your questions into vectors at query time, and comparing the two to retrieve the document passages most relevant to each question.
In essence, embedding models are the bridge that allows DocsGPT to understand the nuances of human language and connect your questions to the relevant information within your documents.
DocsGPT is designed to be flexible and supports a wide range of embedding models out of the box. Currently, DocsGPT provides native support for models from two major sources: Sentence Transformer models from Hugging Face, and OpenAI's text-embedding-ada-002, a powerful and widely used embedding model available through OpenAI's API.

To utilize Sentence Transformer models within DocsGPT, follow these steps:
Download the Model: Sentence Transformer models are typically hosted on Hugging Face Model Hub. You need to download your chosen model and place it in the model/ folder in the root directory of your DocsGPT project.
For example, to use the all-mpnet-base-v2 model, you would set EMBEDDINGS_NAME as described below, and ensure that the model files are available locally (DocsGPT will attempt to download it if it's not found, but local download is recommended for development and offline use).
Set EMBEDDINGS_NAME in .env (or settings.py): You need to configure the EMBEDDINGS_NAME setting in your .env file (or settings.py) to point to the desired Sentence Transformer model.
Using a pre-downloaded model from the model/ folder: You can point EMBEDDINGS_NAME at the downloaded model. For instance, if you downloaded all-mpnet-base-v2 and it is in model/all-mpnet-base-v2, you can use the prefixed form (though the plain model identifier is usually sufficient):
```
EMBEDDINGS_NAME=huggingface_sentence-transformers/all-mpnet-base-v2
```
or simply use the model identifier:
```
EMBEDDINGS_NAME=sentence-transformers/all-mpnet-base-v2
```
Using a model directly from Hugging Face Model Hub: You can directly specify the model identifier from Hugging Face Model Hub:
```
EMBEDDINGS_NAME=huggingface_sentence-transformers/all-mpnet-base-v2
```
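As a rough illustration of how a setting like EMBEDDINGS_NAME can be resolved, the sketch below maps a prefixed identifier to a local path when the model has been pre-downloaded into model/, and otherwise falls back to the Hub identifier. The function name and the exact prefix handling are hypothetical, not DocsGPT's actual code:

```python
from pathlib import Path

def resolve_embeddings_name(name: str, model_dir: str = "model") -> str:
    """Hypothetical helper: prefer a pre-downloaded copy under model/,
    else return the Hub identifier so the library can download it."""
    # Strip an optional "huggingface_" prefix marking Hugging Face models.
    identifier = name.removeprefix("huggingface_")
    local = Path(model_dir) / identifier.split("/")[-1]
    if local.is_dir():
        return str(local)  # use the pre-downloaded copy
    return identifier      # let the library fetch it from the Hub

print(resolve_embeddings_name("huggingface_sentence-transformers/all-mpnet-base-v2"))
```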
To use OpenAI's text-embedding-ada-002 embedding model, you need to set EMBEDDINGS_NAME to openai_text-embedding-ada-002 and ensure you have your OpenAI API key configured correctly via API_KEY in your .env file (if you are not using Azure OpenAI).
Example .env configuration for OpenAI Embeddings:
```
LLM_PROVIDER=openai
API_KEY=YOUR_OPENAI_API_KEY # Your OpenAI API Key
EMBEDDINGS_NAME=openai_text-embedding-ada-002
```
If you wish to use an embedding model that is not supported out-of-the-box, a good starting point for adding custom embedding model support is to examine the base.py file located in the application/vectorstore directory.
Specifically, pay attention to the EmbeddingsWrapper and EmbeddingsSingleton classes. EmbeddingsWrapper provides a way to wrap different embedding model libraries into a consistent interface for DocsGPT. EmbeddingsSingleton manages the instantiation and retrieval of embedding model instances. By understanding these classes and the existing embedding model implementations, you can create your own custom integration for virtually any embedding model library you desire.
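To illustrate the kind of interface such a wrapper provides (this sketch is illustrative only; the real EmbeddingsWrapper in application/vectorstore/base.py differs), a custom backend essentially needs to expose document and query embedding behind a consistent pair of methods:

```python
class MyCustomEmbeddings:
    """Illustrative wrapper: adapts an arbitrary embedding backend to an
    embed_documents / embed_query interface. The backend here is a stand-in;
    plug in any embedding library."""

    def __init__(self, backend):
        self.backend = backend  # any object with an encode(list_of_texts) method

    def embed_documents(self, texts):
        # Embed a batch of document chunks for indexing.
        return self.backend.encode(texts)

    def embed_query(self, text):
        # Embed a single user question for retrieval.
        return self.backend.encode([text])[0]


class FakeBackend:
    """Stand-in backend: maps each text to a trivial 2-d vector
    (character count, vowel count) so the example runs without any model."""

    def encode(self, texts):
        return [[float(len(t)), float(sum(c in "aeiou" for c in t))] for t in texts]


embedder = MyCustomEmbeddings(FakeBackend())
print(embedder.embed_query("hello"))  # [5.0, 2.0]
```

Replacing FakeBackend with a real embedding library, and registering the wrapper so EmbeddingsSingleton can instantiate it, is the core of a custom integration.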