Podcasts to Knowledge

Converts YouTube podcast/interview sessions into a structured knowledge graph with CocoIndex

🌟 Please help star CocoIndex if you like this project!

</div>

Pipeline:

Read YouTube URLs from plain text files
Download audio via yt-dlp, transcribe with speaker diarization via AssemblyAI
Extract metadata, speakers, and thematic statements via LLM (openai/gpt-5.4-mini)
Resolve duplicate entities (persons, techs, orgs) using CocoIndex's entity_resolution utility (embedding similarity via faiss + LLM confirmation via pydantic-ai)
Store the knowledge graph in SurrealDB: sessions, statements, persons, techs, orgs, and their relationships

Prerequisites

Python 3.11+
FFmpeg installed (required by yt-dlp for audio extraction)
Docker (for SurrealDB)
An AssemblyAI API key (for transcription with speaker diarization)
An OpenAI API key (for LLM extraction via litellm)

Setup

1. Start SurrealDB

Run SurrealDB with persistent storage via Docker:

docker run -d \
  --name surrealdb \
  --user root \
  -p 8787:8000 \
  -v surrealdb-data:/data \
  surrealdb/surrealdb:latest \
  start --user root --pass root surrealkv:/data/database

This persists data in a Docker volume (surrealdb-data) across container restarts.

2. Set environment variables

# Required
export ASSEMBLYAI_API_KEY="..."
export OPENAI_API_KEY="sk-..."

# Optional (shown with defaults)
export SURREALDB_URL="ws://localhost:8787/rpc"
export SURREALDB_NS="cocoindex"
export SURREALDB_DB="yt_conversations"
export SURREALDB_USER="root"
export SURREALDB_PASS="root"
export INPUT_DIR="./input"
export LLM_MODEL="openai/gpt-5.4-mini"
export RESOLUTION_LLM_MODEL="openai/gpt-5-mini"

3. Install dependencies

pip install -e .

4. Add YouTube URLs

Edit input/sample.txt (or create new .txt files under input/). One URL per line, # for comments:

# AI podcasts
https://www.youtube.com/watch?v=VIDEO_ID_1
https://www.youtube.com/watch?v=VIDEO_ID_2

Run

Build/update the knowledge graph:

cocoindex update conv_knowledge.app

This is incremental — re-running skips sessions that haven't changed.

Visualize the Knowledge Graph

SurrealDB includes a built-in web UI called Surrealist for exploring and visualizing data.

Option A: Surrealist Cloud

Go to app.surrealdb.com
Connect to your local instance:
- Endpoint: ws://localhost:8787
- Namespace: cocoindex
- Database: yt_conversations
- Username: root / Password: root
Use the Explorer tab to browse tables and relations
Use the Query tab to run SurrealQL queries (see below)

Option B: Surrealist Desktop

Download from surrealdb.com/surrealist for a native app with the same features.

Example queries

See all relationships:

surql

SELECT * FROM statement_involves, session_statement, person_session;

Browse all sessions and their statements:

surql

SELECT id, name, description, date FROM session;

Find all statements a person made:

surql

SELECT
  <-person_statement<-person.name AS speaker,
  statement
FROM statement;

Explore the full graph around a person:

surql

SELECT
  name,
  ->person_session->session.name AS sessions,
  ->person_statement->statement.statement AS statements
FROM person;

Find all entities involved in a statement:

surql

SELECT
  statement,
  ->statement_involves->person.name AS persons,
  ->statement_involves->tech.name AS techs,
  ->statement_involves->org.name AS orgs
FROM statement;

Top N techs mentioned by the most people:

surql

SELECT
  name,
  array::distinct(
    <-statement_involves<-statement<-person_statement<-person.name
  ) AS persons,
  array::len(array::distinct(
    <-statement_involves<-statement<-person_statement<-person.id
  )) AS person_count
FROM tech
ORDER BY person_count DESC
LIMIT 10;

Schema

Nodes:  session, statement, person, tech, org
Edges:  session_statement  (session -> statement)
        person_session     (person -> session)
        person_statement   (person -> statement)
        statement_involves (statement -> person | tech | org)