How to generate embeddings

This page shows how to:

  • Generate embeddings with a unified API. TensorZero unifies many LLM APIs (e.g. OpenAI) and inference servers (e.g. Ollama).
  • Use any programming language. You can use any OpenAI SDK (Python, Node, Go, etc.) or the OpenAI-compatible HTTP API (sketched below).
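
For example, here's a minimal sketch of calling the gateway's OpenAI-compatible HTTP API directly with the requests library, with no SDK at all. It assumes a gateway running at localhost:3000, which the steps below set up:

python
import requests

# Call the gateway's OpenAI-compatible embeddings endpoint over plain HTTP.
# Assumes a TensorZero Gateway is running at localhost:3000 (see the steps below).
response = requests.post(
    "http://localhost:3000/openai/v1/embeddings",
    json={
        "input": "Hello, world!",
        "model": "tensorzero::embedding_model_name::openai::text-embedding-3-small",
    },
)
response.raise_for_status()
print(response.json()["data"][0]["embedding"][:5])  # first few dimensions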
<Tip>

You can find a complete runnable example of this guide on GitHub.

</Tip>

Generate embeddings from OpenAI

<Tip>

Our example uses the OpenAI Python SDK, but you can use any OpenAI SDK or call the OpenAI-compatible HTTP API. See Call any LLM for an example using the OpenAI Node SDK.

The TensorZero Python SDK doesn't offer a dedicated embeddings endpoint at the moment.

</Tip> <Tabs> <Tab title="Python (OpenAI SDK)">

You can point the OpenAI Python SDK to a TensorZero Gateway to generate embeddings with a unified API.

<Steps> <Step title="Set up the credentials for your LLM provider">

For example, if you're using OpenAI, you can set the OPENAI_API_KEY environment variable with your API key.

bash
export OPENAI_API_KEY="sk-..."
<Tip>

See the Integrations page to learn how to set up credentials for other LLM providers.

</Tip> </Step> <Step title="Install the OpenAI Python SDK">

You can install the OpenAI SDK with a Python package manager like pip.

bash
pip install openai
</Step> <Step title="Deploy the TensorZero Gateway">

Let's deploy the TensorZero Gateway using Docker. For simplicity, we'll use the gateway without observability or custom configuration.

bash
docker run \
  -e OPENAI_API_KEY \
  -p 3000:3000 \
  tensorzero/gateway \
  --default-config
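
Before sending traffic, you can optionally confirm the gateway is up. A minimal sketch, assuming the gateway exposes a /health route (see the gateway documentation for the authoritative health-check API):

python
import requests

# Liveness check against the gateway (the /health route is an assumption here).
response = requests.get("http://localhost:3000/health")
print("Gateway is healthy:", response.ok)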
<Tip>

See the Deploy the TensorZero Gateway page for more details.

</Tip> </Step> <Step title="Initialize the OpenAI client">

Let's initialize the OpenAI SDK and point it to the gateway we just launched.

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")
</Step> <Step title="Call the LLM">
python
result = client.embeddings.create(
    input="Hello, world!",
    model="tensorzero::embedding_model_name::openai::text-embedding-3-small",
    # or: Azure, or any OpenAI-compatible endpoint (e.g. Ollama, Voyage)
)
<Accordion title="Sample Response">
python
CreateEmbeddingResponse(
    data=[
        Embedding(
            embedding=[
                -0.019143931567668915,
                # ...
            ],
            index=0,
            object='embedding'
        )
    ],
    model='tensorzero::embedding_model_name::openai::text-embedding-3-small',
    object='list',
    usage=Usage(prompt_tokens=4, total_tokens=4)
)
</Accordion> </Step> </Steps> </Tab> </Tabs>
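
The embedding vector itself lives in result.data[0].embedding. As a quick usage sketch, here's one way to compare two embeddings with cosine similarity, reusing the client from the steps above (the cosine_similarity helper is our own, not part of any SDK):

python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(a, b) = dot(a, b) / (||a|| * ||b||)
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Embed two inputs in a single batch request.
result = client.embeddings.create(
    input=["Hello, world!", "Goodbye, world!"],
    model="tensorzero::embedding_model_name::openai::text-embedding-3-small",
)
vec_a, vec_b = (item.embedding for item in result.data)
print(f"cosine similarity: {cosine_similarity(vec_a, vec_b):.4f}")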

Define a custom embedding model

You can define a custom embedding model in your TensorZero configuration file.

For example, let's define a custom embedding model for nomic-embed-text served locally by Ollama.

<Steps> <Step title="Deploy the Ollama embedding model">

Download the embedding model and launch the Ollama server:

bash
ollama pull nomic-embed-text
ollama serve

We assume that Ollama is running on your host machine at http://localhost:11434.
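
You can optionally sanity-check the model before wiring it into TensorZero by calling Ollama's OpenAI-compatible endpoint directly. A minimal sketch:

python
from openai import OpenAI

# Talk to Ollama's OpenAI-compatible API directly (no gateway involved yet).
ollama = OpenAI(base_url="http://localhost:11434/v1", api_key="not-used")

result = ollama.embeddings.create(input="Hello, world!", model="nomic-embed-text")
print(len(result.data[0].embedding))  # nomic-embed-text produces 768-dimensional vectors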

</Step> <Step title="Define your custom embedding model">

Add your custom model and model provider to your configuration file:

toml
[embedding_models.nomic-embed-text]
routing = ["ollama"]

[embedding_models.nomic-embed-text.providers.ollama]
type = "openai"
api_base = "http://host.docker.internal:11434/v1"
model_name = "nomic-embed-text"
api_key_location = "none"
<Tip>

See the Configuration Reference for details on configuring your embedding models.

</Tip> </Step> <Step title="Deploy the TensorZero Gateway with your configuration">

Deploy the TensorZero Gateway with your configuration file. Make sure that the container can reach the Ollama server running on the host (the configuration above uses host.docker.internal for this).

<Tip>

See the Deploy the TensorZero Gateway page for more details.

</Tip> </Step> <Step title="Call your custom embedding model">

Use your custom model by referencing it with tensorzero::embedding_model_name::nomic-embed-text.

For example, using the OpenAI Python SDK:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")

result = client.embeddings.create(
    input="Hello, world!",
    model="tensorzero::embedding_model_name::nomic-embed-text",
)
<Accordion title="Sample Response">
python
CreateEmbeddingResponse(
    data=[
        Embedding(
            embedding=[
                -0.019143931567668915,
                # ...
            ],
            index=0,
            object='embedding'
        )
    ],
    model='tensorzero::embedding_model_name::nomic-embed-text',
    object='list',
    usage=Usage(prompt_tokens=4, total_tokens=4)
)
</Accordion> </Step> </Steps>
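
As a usage sketch, here's a toy semantic search over a few documents with the custom model, reusing the client from the previous step and the cosine_similarity helper sketched earlier (the documents are made up for illustration):

python
documents = [
    "TensorZero unifies many LLM APIs behind one gateway.",
    "Ollama serves open-weights models on your own machine.",
    "The Eiffel Tower is in Paris.",
]

# Embed all documents in one batch request.
doc_vectors = [
    item.embedding
    for item in client.embeddings.create(
        input=documents,
        model="tensorzero::embedding_model_name::nomic-embed-text",
    ).data
]

query_vector = client.embeddings.create(
    input="Which tool runs models locally?",
    model="tensorzero::embedding_model_name::nomic-embed-text",
).data[0].embedding

# Rank documents by cosine similarity to the query and print the best match.
best = max(range(len(documents)), key=lambda i: cosine_similarity(query_vector, doc_vectors[i]))
print(documents[best])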

Cache embeddings

The TensorZero Gateway supports caching embeddings to improve latency and reduce costs. When caching is enabled, identical embedding requests will be served from the cache instead of being sent to the model provider.

python
result = client.embeddings.create(
    input="Hello, world!",
    model="tensorzero::embedding_model_name::openai::text-embedding-3-small",
    extra_body={
        "tensorzero::cache_options": {
            "enabled": "on",  # Enable reading from and writing to cache
            "max_age_s": 3600,  # Optional: cache entries older than 1 hour are ignored
        }
    }
)

Cache reads only apply to single-input requests: batch embedding requests (multiple inputs) write to the cache but won't be served from it.
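
To check that caching is taking effect, you can time two identical single-input requests; the second should return noticeably faster. A rough sketch, reusing the client from above:

python
import time

def timed_embedding() -> float:
    # Issue a cache-enabled embedding request and return its wall-clock latency.
    start = time.perf_counter()
    client.embeddings.create(
        input="Hello, world!",
        model="tensorzero::embedding_model_name::openai::text-embedding-3-small",
        extra_body={"tensorzero::cache_options": {"enabled": "on"}},
    )
    return time.perf_counter() - start

first, second = timed_embedding(), timed_embedding()
print(f"first: {first:.3f}s, second (cached): {second:.3f}s")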

See the Inference Caching guide for more details on cache modes and options.