examples/rust/conversation_to_knowledge/README.md
Rust port of the Python conversation_to_knowledge example.
Turns podcast/interview sessions into a knowledge graph in SurrealDB: sessions, statements, persons, techs, orgs, and the relationships between them.
Pipeline: read sources → (YouTube: yt-dlp + AssemblyAI diarized transcription | local transcript) → two LLM passes (identify speakers/metadata, then extract statements + mentioned person/tech/org) → cocoindex::entity_resolution (embed names → candidate search → LLM pair resolver) → SurrealDB graph targets.
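The resolution stage of the pipeline (embed names → candidate search → pair resolver) can be pictured with a std-only sketch. This is not the `cocoindex::entity_resolution` API: `resolve_entities`, `cosine`, `find`, the similarity threshold, and the closure standing in for the LLM pair resolver are all hypothetical names chosen for illustration, and the embeddings are toy vectors written by hand rather than model output.

```rust
/// Cosine similarity between two equal-length vectors (0.0 if either is all zeros).
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na: f32 = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb: f32 = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Union-find lookup: follow parent links to the canonical index of `i`.
fn find(parent: &mut [usize], i: usize) -> usize {
    let mut root = i;
    while parent[root] != root {
        root = parent[root];
    }
    // Path compression: point every visited node straight at the root.
    let mut cur = i;
    while parent[cur] != root {
        let next = parent[cur];
        parent[cur] = root;
        cur = next;
    }
    root
}

/// Returns, for each input name, the index of its canonical entity.
/// `same_entity` is a stand-in for the LLM pair resolver (assumption).
fn resolve_entities<F>(
    names: &[&str],
    embeddings: &[Vec<f32>],
    threshold: f32,
    same_entity: F,
) -> Vec<usize>
where
    F: Fn(&str, &str) -> bool,
{
    let mut parent: Vec<usize> = (0..names.len()).collect();
    for i in 0..names.len() {
        for j in 0..i {
            // Candidate search: only sufficiently similar pairs reach the resolver.
            let is_candidate = cosine(&embeddings[i], &embeddings[j]) >= threshold;
            if is_candidate && same_entity(names[i], names[j]) {
                let (ri, rj) = (find(&mut parent, i), find(&mut parent, j));
                // Keep the smaller index as the canonical entity.
                let (lo, hi) = if ri < rj { (ri, rj) } else { (rj, ri) };
                parent[hi] = lo;
            }
        }
    }
    (0..names.len()).map(|i| find(&mut parent, i)).collect()
}

fn main() {
    let names = ["Rust", "rust-lang", "SurrealDB"];
    let embeddings = vec![vec![1.0, 0.0], vec![0.9, 0.1], vec![0.0, 1.0]];
    // Toy resolver: treat names as the same entity when one is a
    // case-insensitive prefix of the other.
    let canonical = resolve_entities(&names, &embeddings, 0.8, |a, b| {
        let (a, b) = (a.to_lowercase(), b.to_lowercase());
        a.starts_with(&b) || b.starts_with(&a)
    });
    println!("{:?}", canonical); // prints [0, 0, 2]
}
```

The brute-force pairwise loop is only reasonable for small name sets; a vector index would replace it at scale, and the resolver call is the expensive step the threshold is meant to limit.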
| Step | Python | Rust (this example) |
|---|---|---|
| Read sources | localfs.walk_dir | cocoindex::walk (*.txt URLs, *.json transcripts) |
| Per-session incremental skip | @coco.fn(memo=True) | #[cocoindex::function(memo)] |
| Audio + transcription | yt-dlp + assemblyai SDK | yt-dlp (subprocess) + AssemblyAI REST (reqwest) |
| LLM extraction (2 passes) | instructor + litellm | reqwest → OpenAI JSON mode |
| Stable ids | IdGenerator | cocoindex::IdGenerator |
| Entity resolution | ops.entity_resolution (faiss + LLM) | cocoindex::entity_resolution + fastembed Snowflake embeddings + LLM pair resolver |
| Graph store | surrealdb connector (TableTarget/RelationTarget) | cocoindex::surrealdb targets over the native surrealdb crate |
| Embedder change-detection | ContextKey(..., detect_change=True) | ContextKey::new_with_state(...) |
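The "Read sources" row distinguishes two input kinds by file extension. A minimal std-only sketch of that dispatch follows; it does not use `cocoindex::walk`, and the `Source` enum and `classify_source` function are hypothetical names. Skipping blank lines in `*.txt` files is an assumption, and the JSON transcript is kept as raw text because its schema is not described here.

```rust
use std::path::Path;

/// The two input kinds this example reads (names are illustrative).
#[derive(Debug, PartialEq)]
enum Source {
    /// `*.txt`: YouTube URLs, one per line.
    YoutubeUrls(Vec<String>),
    /// `*.json`: raw text of a pre-transcribed session, left unparsed.
    Transcript(String),
}

/// Classifies a file by extension; returns `None` for anything else.
fn classify_source(path: &Path, contents: &str) -> Option<Source> {
    match path.extension().and_then(|ext| ext.to_str()) {
        Some("txt") => Some(Source::YoutubeUrls(
            contents
                .lines()
                .map(str::trim)
                // Assumption: blank lines carry no URL and are ignored.
                .filter(|line| !line.is_empty())
                .map(String::from)
                .collect(),
        )),
        Some("json") => Some(Source::Transcript(contents.to_string())),
        _ => None,
    }
}

fn main() {
    // VIDEO_ID is a placeholder, not a real video.
    let urls = "https://www.youtube.com/watch?v=VIDEO_ID\n\n";
    println!("{:?}", classify_source(Path::new("input/talks.txt"), urls));
    println!("{:?}", classify_source(Path::new("input/sample.json"), "{}"));
    println!("{:?}", classify_source(Path::new("input/notes.md"), "ignored"));
}
```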
- TableTarget/RelationTarget declarations backed by CocoIndex target-state reconciliation. The connector is still narrower than Python's full connector surface (for example, Python has richer type inference and vector index helpers).
- Snowflake/snowflake-arctic-embed-xs model used by the Python example, loaded through the model's ONNX artifact. Set EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2 for a smaller local fallback.

- input/*.txt: one YouTube URL per line (real path; needs yt-dlp, ffmpeg, ASSEMBLYAI_API_KEY).
- input/*.json: a pre-transcribed session (cheap, audio-free; see input/sample.json). Great for trying the extract→resolve→graph half without audio.

docker run -d --name surrealdb -p 8787:8000 surrealdb/surrealdb:latest \
start --user root --pass root surrealkv:/data/database

- export OPENAI_API_KEY=sk-... (override model with LLM_MODEL, default gpt-4o-mini).
- yt-dlp + ffmpeg installed and export ASSEMBLYAI_API_KEY=...
- Connection/config via env (defaults shown): SURREALDB_URL=127.0.0.1:8787, SURREALDB_NS=cocoindex, SURREALDB_DB=yt_conversations, SURREALDB_USER=root, SURREALDB_PASS=root.

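Reading these variables with their documented defaults might look like the sketch below. The `Config` struct and its `from_lookup`/`from_env` constructors are hypothetical names, not taken from this example's source. Only the variable names and default values come from the list above. The lookup is passed in as a closure so the defaulting logic can be exercised without touching the process environment.

```rust
use std::env;

/// Connection and model settings (struct and field names are illustrative).
#[derive(Debug, PartialEq)]
struct Config {
    surrealdb_url: String,
    surrealdb_ns: String,
    surrealdb_db: String,
    surrealdb_user: String,
    surrealdb_pass: String,
    llm_model: String,
}

impl Config {
    /// Builds a config from any key lookup, falling back to the documented defaults.
    fn from_lookup(lookup: impl Fn(&str) -> Option<String>) -> Self {
        let get = |key: &str, default: &str| lookup(key).unwrap_or_else(|| default.to_string());
        Self {
            surrealdb_url: get("SURREALDB_URL", "127.0.0.1:8787"),
            surrealdb_ns: get("SURREALDB_NS", "cocoindex"),
            surrealdb_db: get("SURREALDB_DB", "yt_conversations"),
            surrealdb_user: get("SURREALDB_USER", "root"),
            surrealdb_pass: get("SURREALDB_PASS", "root"),
            llm_model: get("LLM_MODEL", "gpt-4o-mini"),
        }
    }

    /// Reads from the process environment; unset or non-UTF-8 values use the default.
    fn from_env() -> Self {
        Self::from_lookup(|key| env::var(key).ok())
    }
}

fn main() {
    let cfg = Config::from_env();
    println!("{:#?}", cfg);
}
```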
# Build the graph from the input directory (default ./input).
cargo run -- index # or: cargo run -- index /path/to/input
Re-running skips fetch+LLM for unchanged sessions (memoized) and reconciles graph target state.
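The idea behind the memoized skip can be illustrated with a small in-memory sketch: fingerprint each session's input and run the expensive step only when the fingerprint changes. This is not how `#[cocoindex::function(memo)]` is implemented; `MemoCache`, `run_if_changed`, and `fingerprint` are hypothetical names. A real memo also persists across runs and caches the function's result, which this sketch leaves out. `DefaultHasher` is used for brevity; its output is not guaranteed stable across Rust releases, so a persistent cache would need a stable hash instead.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hashes the input content; equal content gives an equal fingerprint.
fn fingerprint(content: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    content.hash(&mut hasher);
    hasher.finish()
}

/// Remembers the last fingerprint seen per session key (in memory only).
#[derive(Default)]
struct MemoCache {
    seen: HashMap<String, u64>,
}

impl MemoCache {
    /// Runs `expensive` only when `content` differs from what was last
    /// recorded for `key`. Returns `true` if the work was executed.
    fn run_if_changed(&mut self, key: &str, content: &str, expensive: impl FnOnce()) -> bool {
        let fp = fingerprint(content);
        if self.seen.get(key) == Some(&fp) {
            return false; // unchanged session: skip fetch + LLM
        }
        expensive();
        self.seen.insert(key.to_string(), fp);
        true
    }
}

fn main() {
    let mut cache = MemoCache::default();
    let mut llm_calls = 0;
    // First run processes the session, second run skips it,
    // third run reprocesses because the content changed.
    for content in ["transcript v1", "transcript v1", "transcript v2"] {
        let ran = cache.run_if_changed("input/sample.json", content, || llm_calls += 1);
        println!("content={content:?} ran={ran}");
    }
    println!("llm_calls={llm_calls}"); // prints llm_calls=2
}
```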
curl -s -X POST http://localhost:8787/sql \
-H "surreal-ns: cocoindex" -H "surreal-db: yt_conversations" -u root:root \
-d "SELECT name FROM person; SELECT ->statement_mentions->{tech,org} FROM statement LIMIT 5;"