examples/slides_to_speech/README.md
A deck is a great outline and a terrible thing to listen to or search; this fixes both — in plain async Python.
</p> <p align="center"> <strong>Star us ❤️ →</strong> <a href="https://github.com/cocoindex-io/cocoindex" title="Star CocoIndex on GitHub"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/star-btn-small-light.svg"></picture></a> · <a href="https://cocoindex.io/docs/examples/slides-to-speech/" title="Read the full walkthrough"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/docs-inline-light.svg"></picture></a> · <a href="https://discord.com/invite/zpA9S2DR7s" title="Join the CocoIndex Discord"><picture><source media="(prefers-color-scheme: dark)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-dark.svg"><source media="(prefers-color-scheme: light)" srcset="https://cocoindex.io/blobs/github/homepage/discord-inline-light.svg"></picture></a> </p> <div align="center"> </div>

A slide deck is a great outline and a terrible thing to listen to or search. This pipeline fixes both: for each slide, a vision LLM writes natural speaker notes, Piper synthesizes them to audio locally, and the notes are embedded into LanceDB so you can search the deck by meaning and play back the narration for any hit. You declare the transformation in native Python with your own types, as `target_state = transformation(source_state)`. The vision and TTS steps run on a GPU runner, and the Rust engine handles incremental processing, so adding a deck processes only that deck's slides.
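To make the `target_state = transformation(source_state)` idea concrete, here is a minimal sketch in plain Python. It is not CocoIndex's engine or API: `transformation` and `reconcile` are hypothetical names, the page numbering is assumed to start at 1, and a dict stands in for the LanceDB table. The point is only that you describe the desired rows and a diff decides what to write or delete.

```python
def transformation(source_state: dict[str, list[str]]) -> dict[str, str]:
    """Map {deck filename: [slide text, ...]} to {row id: notes} (toy stand-in)."""
    return {
        f"{filename}#{page}": f"notes for: {text}"
        for filename, slides in source_state.items()
        for page, text in enumerate(slides, start=1)
    }


def reconcile(current: dict[str, str], desired: dict[str, str]) -> tuple[set[str], set[str]]:
    """Return (ids to upsert, ids to delete) needed to turn current into desired."""
    upserts = {rid for rid, row in desired.items() if current.get(rid) != row}
    deletes = set(current) - set(desired)
    return upserts, deletes


source = {"q3.pdf": ["Roadmap", "Priorities"]}
target = transformation(source)  # first build: every row is new

source["launch.pdf"] = ["Go to market"]  # a new deck arrives
upserts, deletes = reconcile(target, transformation(source))
print(sorted(upserts), sorted(deletes))
```

Only the new deck's row shows up in `upserts`; the rows for the existing deck are untouched.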
A deck fans out to slides, and each slide produces text, audio, and a vector:
process_slide runs the vision LLM, then synthesizes audio and embeds the notes with asyncio.gather before declaring the row. Read it in main.py:
```python
@coco.fn
async def process_slide(slide: SlidePage, filename: str, table: lancedb.TableTarget[SlideRecord]) -> None:
    notes = (await extract_speaker_notes(slide.image)).speaker_notes  # vision LLM
    voice, embedding = await asyncio.gather(
        text_to_speech(notes),                    # Piper TTS (local, no API)
        coco.use_context(EMBEDDER).embed(notes),  # sentence-transformer
    )
    table.declare_row(row=SlideRecord(
        id=f"{filename}#{slide.page_number}", filename=filename, page=slide.page_number,
        speaker_notes=notes, voice=voice, embedding=embedding,
    ))


@coco.fn(memo=True)  # unchanged deck is never re-rendered or re-narrated
async def process_file(file: FileLike, table: lancedb.TableTarget[SlideRecord]) -> None:
    slides = await pdf_to_slides(await file.read())
    await coco.map(process_slide, slides, str(file.file_path.path), table)
```
The MP3 audio is stored right in the LanceDB row, so a semantic-search hit comes with playable narration attached.
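The fan-out inside `process_slide` can be tried in isolation. The sketch below uses stand-in coroutines (`fake_tts` and `fake_embed` are hypothetical, not the Piper or sentence-transformer calls) and a plain dataclass whose field names mirror the `SlideRecord(...)` call above; the field types are assumptions.

```python
import asyncio
import time
from dataclasses import dataclass


@dataclass
class SlideRecord:
    # Field names follow the declare_row call; the types are assumed.
    id: str
    filename: str
    page: int
    speaker_notes: str
    voice: bytes            # encoded audio stored inline with the row
    embedding: list[float]


async def fake_tts(notes: str) -> bytes:
    await asyncio.sleep(0.2)             # pretend synthesis takes a while
    return notes.encode("utf-8")         # placeholder bytes, not real MP3


async def fake_embed(notes: str) -> list[float]:
    await asyncio.sleep(0.2)             # pretend the model takes a while
    return [float(len(notes)), 1.0]      # placeholder vector


async def build_record(filename: str, page: int, notes: str) -> SlideRecord:
    # Both coroutines run concurrently; results come back in argument order.
    voice, embedding = await asyncio.gather(fake_tts(notes), fake_embed(notes))
    return SlideRecord(f"{filename}#{page}", filename, page, notes, voice, embedding)


start = time.perf_counter()
record = asyncio.run(build_record("q3.pdf", 2, "Our priorities this quarter."))
elapsed = time.perf_counter() - start
print(record.id, len(record.voice), f"{elapsed:.2f}s")
```

Because the two waits overlap, the elapsed time should land near one sleep rather than the sum of both. Keeping the audio bytes on the record is what lets a search hit return its narration without a second lookup.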
<p align="center"> 📘 <b><a href="https://cocoindex.io/docs/examples/slides-to-speech/">Full Tutorial →</a></b> Step-by-step walkthrough with the vision-LLM speaker notes, local Piper TTS, the per-slide LanceDB row, and searching the deck by meaning.
</p>

- … `SlideRecord`.
- … `@functools.cache`.
- `asyncio.gather` runs TTS and embedding side by side; the heavy vision and TTS steps run on a `coco.GPU` runner.
- `@coco.fn(memo=True)` reprocesses only changed slides; `LLM_MODEL` and `EMBEDDER` are declared with `detect_change=True`, so swapping the model or voice re-runs only the affected steps.

Needs LLM credentials for the vision model (default `gemini/gemini-2.5-flash`, which requires `GEMINI_API_KEY`), a local Piper voice, and ffmpeg for MP3 export.
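The memoization described above can be pictured as caching on a fingerprint of everything a step depends on. The sketch below is a simplified illustration, not CocoIndex's implementation: `fingerprint`, `narrate`, and the in-memory `cache` are hypothetical. It folds the model name into the key, which is one way to get the behaviour where changing the model re-runs the step while unchanged inputs are skipped.

```python
import hashlib

cache: dict[str, str] = {}
calls = 0  # counts how often the expensive step really runs


def fingerprint(content: bytes, model: str) -> str:
    """Hash the input bytes together with the settings the step depends on."""
    h = hashlib.sha256()
    h.update(model.encode("utf-8"))
    h.update(b"\0")  # separator so model and content cannot run together
    h.update(content)
    return h.hexdigest()


def narrate(content: bytes, model: str) -> str:
    """Return cached notes when neither the slide bytes nor the model changed."""
    global calls
    key = fingerprint(content, model)
    if key not in cache:
        calls += 1
        cache[key] = f"[{model}] notes for {len(content)} bytes"  # stand-in for the LLM call
    return cache[key]


slide = b"%PDF-1.7 fake slide bytes"
narrate(slide, "gemini/gemini-2.5-flash")   # first run: computed
narrate(slide, "gemini/gemini-2.5-flash")   # unchanged: served from cache
narrate(slide, "some-other-model")          # model changed: recomputed
print(calls)
```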
1. Download a Piper voice (~60 MB, local):

   ```bash
   python3 -m piper.download_voices en_US-lessac-medium
   ```

2. Configure and install:

   ```bash
   cp .env.example .env   # set GEMINI_API_KEY (or swap LLM_MODEL, e.g. OpenAI)
   pip install -e .
   ```

3. Build the index. Drop a slide-deck PDF into `slides/`, then:

   ```bash
   cocoindex update main   # or: cocoindex update -L main (keep watching the folder)
   ```

   On a 3-slide sample deck this produces three LanceDB rows, each with vision-LLM speaker notes and ~170–280 KB of Piper MP3 audio.

4. Search the deck. Embed a query the same way and search LanceDB:

   ```bash
   python main.py "reducing latency and reliability"
   ```

   On the sample deck, that query ranks the Engineering Priorities slide first, above the roadmap and go-to-market slides, matching the spoken notes by meaning, not keywords. Each hit carries the slide's MP3 narration, ready to play.
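The ranking step can be sketched without any dependencies. This is a toy stand-in for the LanceDB vector search, not its API: the three rows, their ids, and their 3-dimensional vectors are made up for illustration, whereas real embeddings come from the sentence-transformer model and are much longer.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical rows: (id, embedding, narration bytes).
rows = [
    ("deck.pdf#1", [0.9, 0.1, 0.0], b"roadmap audio"),
    ("deck.pdf#2", [0.1, 0.9, 0.2], b"engineering audio"),
    ("deck.pdf#3", [0.0, 0.2, 0.9], b"go-to-market audio"),
]


def search(query_vec: list[float], k: int = 2):
    """Return the k rows most similar to the query vector, best first."""
    scored = [(cosine(query_vec, vec), rid, audio) for rid, vec, audio in rows]
    return sorted(scored, key=lambda hit: hit[0], reverse=True)[:k]


hits = search([0.2, 1.0, 0.1])  # pretend this is the embedded query
for score, rid, audio in hits:
    print(f"{rid}  score={score:.3f}  audio={len(audio)} bytes")
```

Each hit carries its audio bytes alongside the score, mirroring how a row in this example holds both the embedding and the narration.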
<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/slides-to-speech/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>
</p>