IBM watsonx.ai

WatsonxLLM is a wrapper for IBM watsonx.ai foundation models.

The aim of these examples is to show how to communicate with watsonx.ai models using the LlamaIndex LLMs API.

Setting up

Install the llama-index-llms-ibm package:

python
!pip install -qU llama-index-llms-ibm

The cell below defines the credentials required to work with watsonx Foundation Model inferencing.

Action: Provide the IBM Cloud user API key. For details, see Managing user API keys.

python
import os
from getpass import getpass

watsonx_api_key = getpass()
os.environ["WATSONX_APIKEY"] = watsonx_api_key

Additionally, you can pass other secrets as environment variables:

python
import os

os.environ["WATSONX_URL"] = "your service instance url"
os.environ["WATSONX_TOKEN"] = "your token for accessing the CPD cluster"
os.environ["WATSONX_PASSWORD"] = "your password for accessing the CPD cluster"
os.environ["WATSONX_USERNAME"] = "your username for accessing the CPD cluster"
os.environ[
    "WATSONX_INSTANCE_ID"
] = "your instance_id for accessing the CPD cluster"

Load the model

You might need to adjust model parameters for different models or tasks. For details, refer to Available MetaNames.

python
temperature = 0.5
max_new_tokens = 50
additional_params = {
    "decoding_method": "sample",
    "min_new_tokens": 1,
    "top_k": 50,
    "top_p": 1,
}
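
These keys correspond to the text generation parameters of the underlying ibm-watsonx-ai SDK. If you prefer named constants over raw strings, here is a sketch assuming the ibm-watsonx-ai package is available (it is installed as a dependency of the integration):

python
from ibm_watsonx_ai.metanames import GenTextParamsMetaNames

# The same parameters spelled with the SDK's constants instead of raw strings.
additional_params = {
    GenTextParamsMetaNames.DECODING_METHOD: "sample",
    GenTextParamsMetaNames.MIN_NEW_TOKENS: 1,
    GenTextParamsMetaNames.TOP_K: 50,
    GenTextParamsMetaNames.TOP_P: 1,
}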

Initialize the WatsonxLLM class with the previously set parameters.

Note:

In this example, we’ll use the project_id and Dallas URL.

You need to specify the model_id that will be used for inferencing. You can find the list of all the available models in Supported foundation models.

python
from llama_index.llms.ibm import WatsonxLLM

watsonx_llm = WatsonxLLM(
    model_id="ibm/granite-13b-instruct-v2",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="PASTE YOUR PROJECT_ID HERE",
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    additional_params=additional_params,
)

Alternatively, you can use Cloud Pak for Data credentials. For details, see watsonx.ai software setup.

python
watsonx_llm = WatsonxLLM(
    model_id="ibm/granite-13b-instruct-v2",
    url="PASTE YOUR URL HERE",
    username="PASTE YOUR USERNAME HERE",
    password="PASTE YOUR PASSWORD HERE",
    instance_id="openshift",
    version="4.8",
    project_id="PASTE YOUR PROJECT_ID HERE",
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    additional_params=additional_params,
)

Instead of model_id, you can also pass the deployment_id of a previously tuned model. The entire model tuning workflow is described in Working with TuneExperiment and PromptTuner.

python
watsonx_llm = WatsonxLLM(
    deployment_id="PASTE YOUR DEPLOYMENT_ID HERE",
    url="https://us-south.ml.cloud.ibm.com",
    project_id="PASTE YOUR PROJECT_ID HERE",
    temperature=temperature,
    max_new_tokens=max_new_tokens,
    additional_params=additional_params,
)

Create a Completion

Call the model directly with a string prompt:

python
response = watsonx_llm.complete("What is a Generative AI?")
print(response)

From the CompletionResponse, you can also retrieve the raw response returned by the service:

python
print(response.raw)
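
The exact payload depends on the service version, but it typically follows the watsonx.ai text generation response format. As an illustration, a hedged sketch that reads the token counts, assuming the usual results structure (print response.raw first if unsure):

python
# Assumes the typical watsonx.ai generation payload; field names may vary
# by service version, so check response.raw before relying on them.
result = response.raw["results"][0]
print(result.get("generated_token_count"), result.get("input_token_count"))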

You can also call the model by providing a prompt template:

python
from llama_index.core import PromptTemplate

template = "What is {object} and how does it work?"
prompt_template = PromptTemplate(template=template)

prompt = prompt_template.format(object="a loan")

response = watsonx_llm.complete(prompt)
print(response)

Calling chat with a list of messages

Create chat completions by providing a list of messages:

python
from llama_index.core.llms import ChatMessage

messages = [
    ChatMessage(role="system", content="You are an AI assistant"),
    ChatMessage(role="user", content="Who are you?"),
]
response = watsonx_llm.chat(
    messages, max_new_tokens=20, decoding_method="greedy"
)
print(response)

Note that we changed the max_new_tokens parameter to 20 and the decoding_method parameter to greedy.
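
Keyword arguments passed at call time apply only to that call and take precedence over the values supplied when the model was created. The same pattern should work for completions as well; a minimal sketch:

python
# Override the generation settings for a single completion call.
response = watsonx_llm.complete(
    "List three use cases for generative AI.",
    max_new_tokens=30,
    decoding_method="greedy",
)
print(response)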

Streaming the model output

Stream the model's response:

python
for chunk in watsonx_llm.stream_complete(
    "Describe your favorite city and why it is your favorite."
):
    print(chunk.delta, end="")

Similarly, to stream chat completions, use the following code:

python
messages = [
    ChatMessage(role="system", content="You are an AI assistant"),
    ChatMessage(role="user", content="Who are you?"),
]

for chunk in watsonx_llm.stream_chat(messages):
    print(chunk.delta, end="")