docs/examples/multi_modal/gpt4v_experiments_cot.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/multi_modal/gpt4v_experiments_cot.ipynb" target="_parent"></a>
GPT-4V has amazed us with its ability to analyze images and even generate website code from visuals.
This tutorial notebook investigates GPT-4V's proficiency in interpreting bar charts, scatter plots, and tables. We aim to assess whether specific questioning and chain of thought prompting can yield better responses than broader inquiries, and whether GPT-4V can overcome its known limitations in chart reading through precise questioning and systematic reasoning techniques.
We observed in these experiments that asking specific questions, rather than general ones, yields better answers. Let's delve into these experiments.
NOTE: This tutorial notebook aims to inform the community about GPT-4V's performance, though the results might not be universally applicable. We strongly advise conducting tests with similar questions on your own dataset before drawing conclusions.
We have put the following images from the Llama2 and Mistral AI papers to the test.
Let's inspect each of these images now.
Let's start analyzing these images by asking our questions in four styles:
1. General Question: simply ask, "Analyse the image".
2. Specific Question: ask a precise question without any additional details.
3. Detailed Question: ask the same question, but include explicit details about the image (for example, which bar colour corresponds to which model).
4. Chain of Thought Prompting: ask the question with step-by-step reasoning instructions.
These guidelines aim to test how different questioning techniques might improve the precision of the information we gather from the images.
%pip install llama-index-multi-modal-llms-openai
!pip install llama-index
import os
OPENAI_API_KEY = "YOUR OPENAI API KEY"
os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY
from llama_index.core import SimpleDirectoryReader
from llama_index.multi_modal_llms.openai import OpenAIMultiModal
openai_mm_llm = OpenAIMultiModal(
model="gpt-4o",
api_key=OPENAI_API_KEY,
max_new_tokens=500,
temperature=0.0,
)
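Each experiment below repeats the same pattern: call `complete()` with a prompt and the loaded image documents, then print the response. If you'd like to avoid the repetition in your own notebook, a minimal convenience wrapper could look like the following sketch (the `ask` name is our own, not a LlamaIndex API); the cells below keep the explicit pattern so each experiment stays self-contained.
# Hypothetical convenience wrapper around the complete()/print() pattern
# repeated below. `ask` is our own helper name, not part of LlamaIndex.
def ask(query, image_documents):
    response = openai_mm_llm.complete(
        prompt=query,
        image_documents=image_documents,
    )
    print(response)
    return response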
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/gpt4_experiments/llama2_mistral.png' -O './llama2_mistral.png'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/gpt4_experiments/llama2_model_analysis.pdf' -O './llama2_model_analysis.png'
!wget 'https://raw.githubusercontent.com/run-llama/llama_index/main/docs/examples/data/gpt4_experiments/llama2_violations_charts.png' -O './llama2_violations_charts.png'
from PIL import Image
import matplotlib.pyplot as plt
img = Image.open("llama2_violations_charts.png")
plt.imshow(img)
# put your local directory here
image_documents = SimpleDirectoryReader(
input_files=["./llama2_violations_charts.png"]
).load_data()
query = "Analyse the image"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
As you can see, it correctly identified the categories (hateful and harmful, illicit and criminal activity, and unqualified advice), but it hallucinated the x-axis values: "Video sharing", "Social networking", "Gaming", "Dating", "Forums & boards", "Commercial Websites", "Media sharing", "P2P/File sharing", "Wiki", and "Other".
query = "Compare Llama2 models vs Vicuna models across categories."
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It answered incorrectly, saying that the Vicuna model generally has a lower violation percentage than the Llama2 model across all subcategories.
query = "which model among llama2 and vicuna models does better in terms of violation percentages in Hateful and harmful category."
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It failed to accurately capture the information, mistakenly identifying the light blue bars as representing Vicuna when, in fact, they represent Llama2.
Now let's provide more detailed information in the prompt and ask the same question.
query = """In the image provided to you depicts about the violation rate performance of various AI models across Hateful and harmful, Illicit and criminal activity, Unqualified advice categories.
Hateful and harmful category is in first column. Bars with light blue are with Llama2 model and dark blue are with Vicuna models.
With this information, Can you compare about Llama2 and Vicuna models in Hateful and harmful category."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It did answer the question correctly.
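Supplying the chart's legend and layout in text is the technique that made the difference here. If you want to build such context-enriched prompts programmatically, a minimal sketch could look like this (the `with_context` helper is hypothetical, not part of LlamaIndex):
# Hypothetical helper for the detailed-prompt technique: prepend explicit
# chart context (legend colours, column layout) to the actual question.
def with_context(context, question):
    return f"{context}\nWith this information, {question}"

detailed_query = with_context(
    "Bars with light blue are with Llama2 model and dark blue are with Vicuna models.",
    "Can you compare Llama2 and Vicuna models in the Hateful and harmful category?",
)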
query = """Based on the image provided. Follow the steps and answer the query - which model among llama2 and vicuna does better in terms of violation percentages in 'Hateful and harmful'.
Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the violation percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
With chain of thought prompting it hallucinated the bar colours, but it answered correctly, concluding that Llama2 has a lower violation percentage than Vicuna in the Hateful and harmful category, even though in one subsection Llama2's violation percentage is higher than Vicuna's.
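The same step-by-step scaffold is reused for the remaining experiments, so here is a sketch of a small helper that wraps any question in it (the `with_cot` name is our own, not a LlamaIndex API):
# Hypothetical helper: wrap a question in the chain of thought scaffold
# used throughout this notebook.
COT_STEPS = """Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the respective percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""


def with_cot(question):
    return (
        "Based on the image provided. Follow the steps and answer the query - "
        f"{question}\n{COT_STEPS}"
    )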
img = Image.open("llama2_mistral.png")
plt.imshow(img)
image_documents = SimpleDirectoryReader(
input_files=["./llama2_mistral.png"]
).load_data()
query = "Analyse the image"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It did answer the query, but it hallucinated: it referred to an NLU task that is actually the MMLU task, and it assumed Mistral is available across all the model parameter sizes.
query = "How well does mistral model compared to llama2 model?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
The answer is incorrect, the percentages are not accurate, and it again assumed Mistral is available across all the parameter sizes.
query = "Assuming mistral is available in 7B series. How well does mistral model compared to llama2 model?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
Now, given the detail that Mistral is available in the 7B series, it is able to answer correctly.
query = """Based on the image provided. Follow the steps and answer the query - Assuming mistral is available in 7B series. How well does mistral model compared to llama2 model?.
Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the respective percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
There is hallucination with the number of model parameters and the percentage points, though the final conclusion is partially correct.
img = Image.open("llm_analysis.png")
plt.imshow(img)
image_documents = SimpleDirectoryReader(
input_files=["./llama2_model_analysis.png"]
).load_data()
query = "Analyse the image"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It did not analyse the image in detail, but it understood the overall data present in the image to some extent.
query = "which model has higher performance in SAT-en?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It did answer correctly, but the numbers are hallucinated.
query = "which model has higher performance in SAT-en in 7B series models?"
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
It picked up the model names and answered correctly, but it recognised the Llama series of models and their values incorrectly.
query = """Based on the image provided. Follow the steps and answer the query - which model has higher performance in SAT-en in 7B series models?
Examine the Image: Look at the mentioned category in the query in the Image.
Identify Relevant Data: Note the respective percentages.
Evaluate: Compare if there is any comparison required as per the query.
Draw a Conclusion: Now draw the conclusion based on the whole data."""
response_gpt4v = openai_mm_llm.complete(
prompt=query,
image_documents=image_documents,
)
print(response_gpt4v)
With chain of thought prompting we are able to get the right conclusion, though it should be noted that it picked up the wrong values.
Observations from these experiments on hallucination and correctness:
(Please note that these observations are specific to the images used and cannot be generalized, as they vary depending on the images.)
In this tutorial notebook, we showcased experiments ranging from general inquiries to specific questions and chain of thought prompting techniques, and we observed hallucination and correctness in the responses.
However, it should be noted that GPT-4V's outputs can be somewhat inconsistent and the level of hallucination is slightly elevated, so repeating the same experiment could result in different answers, particularly with generalized questions.
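If you want to quantify that variability on your own dataset, one simple check is to repeat a query several times and count the distinct answers. A minimal sketch (the `sample_responses` helper is our own, not part of LlamaIndex):
# Hypothetical check for response stability: repeat the same query and
# count how many distinct answers come back. With temperature=0.0 the runs
# should usually agree; several distinct answers indicate inconsistency.
def sample_responses(query, image_documents, n=3):
    return [
        str(openai_mm_llm.complete(prompt=query, image_documents=image_documents))
        for _ in range(n)
    ]


responses = sample_responses("Analyse the image", image_documents)
print(f"{len(set(responses))} distinct answer(s) out of {len(responses)} runs")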