examples/conversation_to_knowledge/README.md
Converts YouTube podcast/interview sessions into a structured knowledge graph with CocoIndex
<div align="center">

⭐ Please help star CocoIndex if you like this project!

</div>

Pipeline:

1. Download audio from YouTube with `yt-dlp`.
2. Transcribe with speaker diarization via AssemblyAI.
3. Extract statements and entities with an LLM (`openai/gpt-5.4-mini`).
4. Deduplicate entities with the `entity_resolution` utility (embedding similarity via `faiss` + LLM confirmation via `pydantic-ai`).
5. Store the knowledge graph in SurrealDB.

Prerequisite: `ffmpeg` (used by `yt-dlp` for audio extraction).

Run SurrealDB with persistent storage via Docker:
```shell
docker run -d \
  --name surrealdb \
  --user root \
  -p 8787:8000 \
  -v surrealdb-data:/data \
  surrealdb/surrealdb:latest \
  start --user root --pass root surrealkv:/data/database
```
This persists data in a Docker volume (surrealdb-data) across container restarts.
```shell
# Required
export ASSEMBLYAI_API_KEY="..."
export OPENAI_API_KEY="sk-..."

# Optional (shown with defaults)
export SURREALDB_URL="ws://localhost:8787/rpc"
export SURREALDB_NS="cocoindex"
export SURREALDB_DB="yt_conversations"
export SURREALDB_USER="root"
export SURREALDB_PASS="root"
export INPUT_DIR="./input"
export LLM_MODEL="openai/gpt-5.4-mini"
export RESOLUTION_LLM_MODEL="openai/gpt-5-mini"
```
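If a companion script needs the same settings, the optional variables above can be read with their documented defaults. A minimal sketch (the `setting` helper is hypothetical, not part of the example app):

```python
import os

# Defaults mirror the optional environment variables documented above.
DEFAULTS = {
    "SURREALDB_URL": "ws://localhost:8787/rpc",
    "SURREALDB_NS": "cocoindex",
    "SURREALDB_DB": "yt_conversations",
    "SURREALDB_USER": "root",
    "SURREALDB_PASS": "root",
    "INPUT_DIR": "./input",
    "LLM_MODEL": "openai/gpt-5.4-mini",
    "RESOLUTION_LLM_MODEL": "openai/gpt-5-mini",
}


def setting(name: str) -> str:
    """Return the environment value if set, otherwise the documented default."""
    return os.environ.get(name, DEFAULTS[name])
```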
```shell
pip install -e .
```
Edit `input/sample.txt` (or create new `.txt` files under `input/`). One URL per line, `#` for comments:
```text
# AI podcasts
https://www.youtube.com/watch?v=VIDEO_ID_1
https://www.youtube.com/watch?v=VIDEO_ID_2
```
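The URL-list format can be parsed in a few lines. A sketch, assuming blank lines and trailing `#` comments are ignored (the `read_url_list` helper is illustrative, not part of the pipeline):

```python
def read_url_list(text: str) -> list[str]:
    """Parse an input .txt file: one URL per line, '#' starts a comment,
    blank lines are skipped."""
    urls = []
    for line in text.splitlines():
        # Drop any trailing comment, then surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if line:
            urls.append(line)
    return urls
```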
Build/update the knowledge graph:
```shell
cocoindex update conv_knowledge.app
```
This is incremental: re-running skips sessions that haven't changed.
SurrealDB includes a built-in web UI called Surrealist for exploring and visualizing data.
Connect with:

- Endpoint: `ws://localhost:8787`
- Namespace: `cocoindex`
- Database: `yt_conversations`
- Username: `root` / Password: `root`

Download from [surrealdb.com/surrealist](https://surrealdb.com/surrealist) for a native app with the same features.
See all relationships:
```surql
SELECT * FROM statement_involves, session_statement, person_session;
```
Browse all sessions and their statements:
```surql
SELECT id, name, description, date FROM session;
```
Find all statements a person made:
```surql
SELECT
  <-person_statement<-person.name AS speaker,
  statement
FROM statement;
```
Explore the full graph around a person:
```surql
SELECT
  name,
  ->person_session->session.name AS sessions,
  ->person_statement->statement.statement AS statements
FROM person;
```
Find all entities involved in a statement:
```surql
SELECT
  statement,
  ->statement_involves->person.name AS persons,
  ->statement_involves->tech.name AS techs,
  ->statement_involves->org.name AS orgs
FROM statement;
```
Top N techs mentioned by the most people:
```surql
SELECT
  name,
  array::distinct(
    <-statement_involves<-statement<-person_statement<-person.name
  ) AS persons,
  array::len(array::distinct(
    <-statement_involves<-statement<-person_statement<-person.id
  )) AS person_count
FROM tech
ORDER BY person_count DESC
LIMIT 10;
```
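For intuition, this query counts the distinct people behind each tech by walking `person -> statement -> tech` edges. The same aggregation in plain Python over (person, tech) pairs (the `top_techs` helper and its sample data are illustrative only):

```python
from collections import defaultdict


def top_techs(mentions, n=10):
    """Rank techs by the number of distinct people who mentioned them.

    mentions: iterable of (person, tech) pairs, i.e. the flattened
    person -> statement -> tech paths from the graph.
    """
    people_by_tech = defaultdict(set)
    for person, tech in mentions:
        people_by_tech[tech].add(person)
    ranked = sorted(people_by_tech.items(),
                    key=lambda kv: len(kv[1]), reverse=True)
    return [(tech, len(people)) for tech, people in ranked[:n]]
```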
Nodes: `session`, `statement`, `person`, `tech`, `org`

Edges:

- `session_statement` (session -> statement)
- `person_session` (person -> session)
- `person_statement` (person -> statement)
- `statement_involves` (statement -> person | tech | org)
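The entity-resolution step in the pipeline pairs embedding similarity with an LLM confirmation. A minimal sketch of the similarity half, using plain numpy cosine similarity in place of faiss (the function name and the 0.85 threshold are illustrative, not taken from the example code):

```python
import numpy as np


def candidate_duplicates(names, embeddings, threshold=0.85):
    """Return (name_a, name_b, similarity) pairs whose embeddings are
    close enough to be sent to the LLM for a merge/no-merge decision."""
    vecs = np.asarray(embeddings, dtype=float)
    # Normalize rows so the dot product is cosine similarity.
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
    sims = vecs @ vecs.T
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if sims[i, j] >= threshold:
                pairs.append((names[i], names[j], float(sims[i, j])))
    return pairs
```

In the real pipeline this shortlist would be confirmed by the resolution LLM before merging nodes, which keeps false merges out of the graph.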