# Using LLMs
<Aside type="tip"> For a list of our supported LLMs and a comparison of their functionality, check out our [LLM module guide](/python/framework/module_guides/models/llms). </Aside>

One of the first steps when building an LLM-based application is deciding which LLM to use; LLMs have different strengths and price points, and you may wish to use more than one.

LlamaIndex provides a single interface to a large number of different LLMs. Using an LLM can be as simple as installing the appropriate integration:

```bash
pip install llama-index-llms-openai
```

And then calling it in a one-liner:

```python
from llama_index.llms.openai import OpenAI

response = OpenAI().complete("William Shakespeare is ")
print(response)
```

Note that this requires an API key called `OPENAI_API_KEY` in your environment; see the starter tutorial for more details.
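For example, one way to set the key in a macOS or Linux shell (the value shown is a placeholder):

```bash
# Replace the placeholder value with your own OpenAI API key
export OPENAI_API_KEY="sk-..."
```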

`complete` is also available as an async method, `acomplete`.
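For instance, a minimal sketch of the async variant, run with any asyncio entry point:

```python
import asyncio

from llama_index.llms.openai import OpenAI


async def main():
    # acomplete mirrors complete, but can be awaited inside async code
    response = await OpenAI().acomplete("William Shakespeare is ")
    print(response)


asyncio.run(main())
```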

You can also get a streaming response by calling `stream_complete`, which returns a generator that yields tokens as they are produced:

```python
handle = OpenAI().stream_complete("William Shakespeare is ")

for token in handle:
    print(token.delta, end="", flush=True)
```

`stream_complete` is also available as an async method, `astream_complete`.
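A minimal sketch of the async streaming variant, using the same asyncio scaffold as above:

```python
import asyncio

from llama_index.llms.openai import OpenAI


async def main():
    # astream_complete must be awaited; it then yields tokens asynchronously
    handle = await OpenAI().astream_complete("William Shakespeare is ")
    async for token in handle:
        print(token.delta, end="", flush=True)


asyncio.run(main())
```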

## Chat interface

The LLM class also implements a `chat` method, which allows you to have more sophisticated interactions:

```python
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

llm = OpenAI()

messages = [
    ChatMessage(role="system", content="You are a helpful assistant."),
    ChatMessage(role="user", content="Tell me a joke."),
]
chat_response = llm.chat(messages)
print(chat_response.message.content)
```

For streaming responses (where tokens are yielded as they are generated), `stream_chat` and `astream_chat` are available.

Synchronous (`stream_chat`):

Use this in standard Python scripts or notebooks where blocking operations are acceptable. It returns a generator directly.

```python
stream_response = llm.stream_chat(messages)

for token in stream_response:
    print(token.delta, end="", flush=True)
```

Asynchronous (`astream_chat`):

When using async frameworks (like FastAPI), remember to await the method call and iterate over the returned async stream.

```python
stream_response = await llm.astream_chat(messages)

async for token in stream_response:
    print(token.delta, end="", flush=True)
```
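For example, here is a minimal sketch of a streaming endpoint; it assumes FastAPI is installed, and the `/joke` route and prompt are purely illustrative:

```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from llama_index.core.llms import ChatMessage
from llama_index.llms.openai import OpenAI

app = FastAPI()
llm = OpenAI()


@app.get("/joke")
async def joke():
    messages = [ChatMessage(role="user", content="Tell me a joke.")]
    # Await the call to get an async generator, then forward its deltas
    stream = await llm.astream_chat(messages)

    async def token_gen():
        async for token in stream:
            yield token.delta

    return StreamingResponse(token_gen(), media_type="text/plain")
```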

## Specifying models

Many LLM integrations provide more than one model. You can specify a model by passing the `model` parameter to the LLM constructor:

```python
llm = OpenAI(model="gpt-4o-mini")
response = llm.complete("Who is Laurie Voss?")
print(response)
```

## Multi-Modal LLMs

Some LLMs support multi-modal chat messages. This means that you can pass in a mix of text and other modalities (images, audio, video, etc.) and the LLM will handle it.

Currently, LlamaIndex supports text, images, and audio inside `ChatMessage`s using content blocks.

```python
from llama_index.core.llms import ChatMessage, TextBlock, ImageBlock
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o")

messages = [
    ChatMessage(
        role="user",
        blocks=[
            ImageBlock(path="image.png"),
            TextBlock(text="Describe the image in a few sentences."),
        ],
    )
]

resp = llm.chat(messages)
print(resp.message.content)
```
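Audio works the same way via content blocks. A minimal sketch, assuming an `AudioBlock` class is available in your installed version and the model accepts audio input (e.g. `gpt-4o-audio-preview`):

```python
from llama_index.core.llms import AudioBlock, ChatMessage, TextBlock
from llama_index.llms.openai import OpenAI

llm = OpenAI(model="gpt-4o-audio-preview")

messages = [
    ChatMessage(
        role="user",
        blocks=[
            AudioBlock(path="clip.mp3"),
            TextBlock(text="Transcribe this audio clip."),
        ],
    )
]

resp = llm.chat(messages)
print(resp.message.content)
```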

## Tool Calling

Some LLMs (OpenAI, Anthropic, Gemini, Ollama, etc.) support tool calling directly via their APIs; this means tools and functions can be called without special prompting and output-parsing machinery.

```python
from llama_index.core.tools import FunctionTool
from llama_index.llms.openai import OpenAI


def generate_song(name: str, artist: str) -> dict:
    """Generates a song with provided name and artist."""
    return {"name": name, "artist": artist}


tool = FunctionTool.from_defaults(fn=generate_song)

llm = OpenAI(model="gpt-4o")
response = llm.predict_and_call(
    [tool],
    "Pick a random song for me",
)
print(str(response))
```

For more details on advanced tool calling, check out the in-depth guide using OpenAI. The same approaches work for any LLM that supports tools/functions (e.g. Anthropic, Gemini, Ollama, etc.).

You can learn more about tools and agents in the tools guide.

## Available LLMs

We support integrations with OpenAI, Anthropic, Mistral, DeepSeek, Hugging Face, and dozens more. Check out our module guide to LLMs for a full list, including how to run a local model.
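Because every integration implements the same interface, switching providers is typically a one-line change. A sketch, assuming the Anthropic integration is installed (`pip install llama-index-llms-anthropic`), `ANTHROPIC_API_KEY` is set, and the model name is current:

```python
from llama_index.llms.anthropic import Anthropic

# The same complete/chat/stream methods work across providers
llm = Anthropic(model="claude-3-5-sonnet-latest")
response = llm.complete("William Shakespeare is ")
print(response)
```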

<Aside type="tip"> A general note on privacy and LLM usage can be found on the [privacy page](/python/framework/understanding/privacy). </Aside>

## Using a local LLM

LlamaIndex doesn't just support hosted LLM APIs; you can also run a local model such as Meta's Llama 3. For example, if you have Ollama installed and running:

```python
from llama_index.llms.ollama import Ollama

llm = Ollama(
    model="llama3.3",
    request_timeout=60.0,
    # Manually set the context window to limit memory usage
    context_window=8000,
)
```
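Once constructed, the local model exposes the same interface as the hosted ones:

```python
response = llm.complete("William Shakespeare is ")
print(response)
```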

See the custom LLM how-to for more details on using and configuring LLM models.