docs/examples/multi_modal/multi_modal_videorag_videodb.ipynb
<a href="https://colab.research.google.com/github/run-llama/llama_index/blob/main/docs/examples/multi_modal/multi_modal_videorag_videodb.ipynb" target="_parent">Open In Colab</a>
Constructing a RAG pipeline for text is relatively straightforward, thanks to the tools developed for parsing, indexing, and retrieving text data.
However, adapting RAG models for video content presents a greater challenge. Videos combine visual, auditory, and textual elements, requiring more processing power and sophisticated video pipelines.
VideoDB is a serverless database designed to streamline the storage, search, editing, and streaming of video content. VideoDB offers random access to sequential video data by building indexes and developing interfaces for querying and browsing video content. Learn more at docs.videodb.io.
To build a truly multimodal search for videos, you need to work with the different modalities of a video, such as spoken content and visual content.
In this notebook, we will develop a multimodal RAG pipeline for video using VideoDB and LlamaIndex ✨.
To connect to VideoDB, simply get the API key and create a connection. This can be done by setting the VIDEO_DB_API_KEY environment variable. You can get it from 👉🏼 VideoDB Console (free for the first 50 uploads, no credit card required!).
Get your OPENAI_API_KEY from the OpenAI platform; it is used by the llama_index response synthesizer.
import os
os.environ["VIDEO_DB_API_KEY"] = ""
os.environ["OPENAI_API_KEY"] = ""
To get started, we'll need to install the following packages: llama-index and videodb.
%pip install videodb
%pip install llama-index
Let's upload our video file first.
You can use any public URL, YouTube link, or local file on your system.
✨ The first 50 uploads are free!
from videodb import connect
# connect to VideoDB
conn = connect()
coll = conn.get_collection()
# upload videos to default collection in VideoDB
print("Uploading Video")
video = conn.upload(url="https://www.youtube.com/watch?v=libKVRa01L8")
print(f"Video uploaded with ID: {video.id}")
# video = coll.get_video("m-56f55058-62b6-49c4-bbdc-43c0badf4c0b")
coll = conn.get_collection(): Returns the default collection object.
coll.get_videos(): Returns a list of all the videos in a collection.
coll.get_video(video_id): Returns a Video object for the given video_id.
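A minimal sketch of these helpers in use, assuming conn is an active VideoDB connection and at least one video has already been uploaded (the video ID below is just the placeholder from the comment above):
# Sketch: browse the default collection
coll = conn.get_collection()
# List every video in the default collection
for v in coll.get_videos():
    print(v.id)
# Fetch a specific video by its ID
# video = coll.get_video("m-56f55058-62b6-49c4-bbdc-43c0badf4c0b")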
First, we need to extract scenes from the video and then use a vision LLM to obtain a description of each scene.
To learn more about Scene Extraction options, explore the guides in the VideoDB documentation (docs.videodb.io).
from videodb import SceneExtractionType
# Specify Scene Extraction algorithm
index_id = video.index_scenes(
extraction_type=SceneExtractionType.time_based,
extraction_config={"time": 2, "select_frames": ["first", "last"]},
prompt="Describe the scene in detail",
)
video.get_scene_index(index_id)
print(f"Scene Extraction successful with ID: {index_id}")
To develop a thorough multimodal search for videos, you need to handle different video modalities, including spoken content and visual elements.
You can retrieve all Transcript Nodes and Visual Nodes of a video using VideoDB and then incorporate them into your LlamaIndex pipeline.
You can fetch transcript nodes using Video.get_transcript()
To configure the segmenter, use the segmenter and length arguments.
Possible values for segmenter are:
Segmenter.time: Segments the video based on the specified length in seconds.
Segmenter.word: Segments the video based on the word count specified by length.
from videodb import Segmenter
from llama_index.core.schema import TextNode
# Fetch all Transcript Nodes
nodes_transcript_raw = video.get_transcript(
segmenter=Segmenter.time, length=60
)
# Convert the raw transcript nodes to TextNode objects
nodes_transcript = [
TextNode(
text=node["text"],
metadata={key: value for key, value in node.items() if key != "text"},
)
for node in nodes_transcript_raw
]
# Fetch all Scenes
scenes = video.get_scene_index(index_id)
# Convert the scenes to TextNode objects
nodes_scenes = [
TextNode(
text=node["description"],
metadata={
key: value for key, value in node.items() if key != "description"
},
)
for node in scenes
]
We index both our Transcript Nodes and Scene Nodes.
✨ For simplicity, we are using a basic RAG pipeline. However, you can integrate more advanced LlamaIndex RAG pipelines here for better results; a small sketch follows the basic example below.
from llama_index.core import VectorStoreIndex
# Index both Transcript and Scene Nodes
index = VectorStoreIndex(nodes_scenes + nodes_transcript)
q = index.as_query_engine()
res = q.query(
"Show me where the narrator discusses the formation of the solar system and visualize the milky way galaxy"
)
print(res)
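As noted above, the basic query engine can be swapped for a more configurable one. A minimal sketch, assuming the default OpenAI models; similarity_top_k and response_mode are standard LlamaIndex options you can tune for your own videos:
# Sketch: a more configurable query engine over the same index
q_advanced = index.as_query_engine(
    similarity_top_k=5,
    response_mode="tree_summarize",
)
res_advanced = q_advanced.query(
    "Show me where the narrator discusses the formation of the solar system"
    " and visualize the milky way galaxy"
)
print(res_advanced)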
Our nodes' metadata includes start and end fields, which represent the start and end times relative to the beginning of the video.
Using this information from the relevant nodes, we can create Video Clips corresponding to these nodes.
from videodb import play_stream
# Helper function to merge overlapping intervals
def merge_intervals(intervals):
if not intervals:
return []
intervals.sort(key=lambda x: x[0])
merged = [intervals[0]]
for interval in intervals[1:]:
if interval[0] <= merged[-1][1]:
merged[-1][1] = max(merged[-1][1], interval[1])
else:
merged.append(interval)
return merged
# Extract relevant timestamps from the source nodes
relevant_timestamps = [
[node.metadata["start"], node.metadata["end"]] for node in res.source_nodes
]
# Create a compilation of all relevant timestamps
stream_url = video.generate_stream(merge_intervals(relevant_timestamps))
play_stream(stream_url)
In this guide, we built a simple multimodal RAG pipeline for videos using VideoDB, LlamaIndex, and OpenAI.
You can further optimize the pipeline by incorporating more advanced retrieval and indexing techniques.
To learn more about Scene Index, explore the guides in the VideoDB documentation (docs.videodb.io).
If you have any questions or feedback, feel free to reach out to us.