examples/conversation_to_knowledge/spec.md
We want to convert a bunch of podcast sessions into a knowledge base.
Note: Properties listed above are what matters for our business logic. A key field can be added for entities that lack a simple key field, which makes it easier to identify these entities in most databases (e.g. `id` for SurrealDB).
All names should be clear and unambiguous for a general audience. In general, they should read like good Wikipedia entry names. Examples:
In the current version, we only support YouTube.
Users provide a folder containing a list of files, each with a list of source video locations (e.g. YouTube video IDs).
Processing for each session should be mounted as a component and memoized.
The component processes an individual session and declares target states that don't need cross-session entity resolution.
For things that do need cross-session entity resolution, it returns them (i.e. we should use use_mount()) for later stages to consume.
For each video, we fetch the audio transcript (with speaker diarization labels) and all available YouTube metadata (channel name, video title, description, upload date). These are carried forward for extraction.
We use a shared reformat_transcript(transcript, speaker_map) utility to replace raw diarization labels (e.g. "Speaker A") with real names when known, or keep them as (Speaker A) when not. Both extraction steps use this function — Step 1 passes an empty dict (no names known yet), Step 2 passes the mapping from Step 1.
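The spec doesn't pin down the transcript's data shape, so here is a minimal sketch of `reformat_transcript`, assuming each diarized segment is a dict with `speaker` and `text` fields (an assumption, not part of the spec):

```python
def reformat_transcript(transcript: list[dict], speaker_map: dict[str, str]) -> str:
    """Render diarized segments as '(Name) text' lines.

    speaker_map maps raw diarization labels (e.g. "A") to real names;
    labels without a known name fall back to "(Speaker A)".
    """
    lines = []
    for segment in transcript:
        label = segment["speaker"]  # e.g. "A"
        name = speaker_map.get(label, f"Speaker {label}")
        lines.append(f"({name}) {segment['text']}")
    return "\n".join(lines)
```

Step 1 would call `reformat_transcript(segments, {})`; Step 2 would call it with the mapping produced by Step 1, e.g. `reformat_transcript(segments, {"A": "Lex Fridman"})`.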
Using the reformatted transcript (with (Speaker A), (Speaker B) labels since no names are known yet) together with all available metadata (YouTube channel name, video title, description, upload date), we ask the LLM to:
The output of this step gives us the speaker_label -> Person name mapping (e.g. {"A": "Lex Fridman", "B": "Sam Altman"}), plus the session metadata.
This mapping is also used for the person_session relationship — only identified speakers are linked to the session.
We reformat the transcript again, this time with the speaker mapping from Step 1 — recognized speakers get their real names, unrecognized ones stay as (Speaker A).
We then ask the LLM to extract statements and involved entities from the reformatted transcript. For each statement:
Important constraints for extraction quality:
The Sessions and Statements are final, so we can declare them directly as nodes, together with the session_statement relationship, in the target database here; that way we won't need to carry these entities (especially Sessions with large text blobs) through later processing.
The entities involved in the statements above are raw entities; they need to be resolved later.
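A possible shape for the Step 2 extraction output, as Pydantic models. The field names are hypothetical placeholders; the actual entity properties are defined elsewhere in the spec:

```python
from pydantic import BaseModel

class RawEntity(BaseModel):
    # Hypothetical fields; raw entities are resolved in a later stage.
    name: str
    type: str  # e.g. "Person", "Organization", "Concept"

class Statement(BaseModel):
    text: str
    entities: list[RawEntity]

class ExtractionResult(BaseModel):
    statements: list[Statement]
```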
We do entity resolution for each entity type separately. For each one, we leverage in-memory embedding matching. Here's our approach for each type of entity:
The output we want is a deduplication dict of type dict[str, str | None], i.e. name -> canonical_name | None, where None means the name itself is canonical. E.g. {'A': None, 'B': 'A'} means A is identified as the canonical upstream of B.
Note that there can be multiple hops in the chain, e.g. {'A': None, 'B': 'A', 'C': 'B'}. To find the canonical of a given name, we need to iterate until reaching the one whose value is None.
To construct the deduplication dict, we need to:
Get the set of all raw entities (all_raw_entities).
For each item in all_raw_entities, we compute (memoized!) its embedding. Now we have an entity_embedding_map.
Then do a process similar to "bubble sort", i.e. for each entity in all_raw_entities, we:
- Check whether the current entity is already in duplication_dict. If it's already a duplication of another, use its canonical instead.
- Find the N entities with the least distance to the current entity in entity_embedding_map, keeping only those within MAX_DISTANCE_FOR_RESOLUTION.
- For each such match not yet in duplication_dict, insert an entry for it pointing to the current entity's canonical (or None if it is itself the canonical). If the match is already in duplication_dict and comes after the current entity, update the dict entry for the other instead, to mark it as a dup of the current.

Now, with the deduplication dict, we have our canonical entities and our relationships pointing to our canonical entities. We can declare the entire knowledge base.
Use CocoIndex for processing.
Use SurrealDB for target knowledge database. CocoIndex has a target connector for it.
Use Pydantic for various models. Use instructor + LiteLLM to talk to the LLM and get structured output from it.
Use SentenceTransformerEmbedder for embedding.
For YouTube audio fetching and conversion, yt-dlp + pyannote is one option I've heard of. I'm open to other online options, especially ones that offer higher conversion speed and are easy to set up (e.g. we already have an OpenAI API key, so any OpenAI service would be very easy for us to use).
For everything else, please use your own judgment.