<h1 align="center">Turn a slide deck into <em>narrated</em>, searchable audio.</h1>
<p align="center"> <b>A vision LLM writes speaker notes for each slide, Piper synthesizes them to audio <em>locally</em>, and the notes are embedded into LanceDB, so you can search the deck by meaning and play back the narration for any hit.</b>

A deck is a great outline and a terrible thing to listen to or search; this fixes both — in plain async Python.

</p>

A slide deck is a great outline and a terrible thing to listen to or search. This pipeline fixes both: for each slide, a vision LLM writes natural speaker notes, Piper synthesizes them to audio locally, and the notes are embedded into LanceDB so you can search the deck by meaning and play back the narration for any hit. You declare the transformation in native Python with your own types, as `target_state = transformation(source_state)`. The vision and TTS steps run on a GPU runner, and the Rust engine handles incremental processing, so adding a deck processes only that deck's slides.
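
The incremental idea can be pictured as memoization keyed on content. The sketch below is not CocoIndex's engine or API; it is a minimal illustration that assumes a SHA-256 fingerprint per deck, and `IncrementalRunner` and `narrate_deck` are hypothetical names standing in for the real machinery.

```python
import hashlib
from typing import Callable


class IncrementalRunner:
    """Re-run `transform` only for inputs whose bytes changed (illustrative only)."""

    def __init__(self, transform: Callable[[bytes], str]) -> None:
        self._transform = transform
        self._fingerprints: dict[str, str] = {}  # name -> last seen content hash
        self.target_state: dict[str, str] = {}   # name -> derived output
        self.calls = 0                            # how often the expensive step ran

    def update(self, source_state: dict[str, bytes]) -> None:
        # Drop outputs whose source no longer exists.
        for name in list(self.target_state):
            if name not in source_state:
                del self.target_state[name]
                del self._fingerprints[name]
        # Recompute only new or changed sources.
        for name, data in source_state.items():
            fingerprint = hashlib.sha256(data).hexdigest()
            if self._fingerprints.get(name) == fingerprint:
                continue  # unchanged: keep the previous output
            self.calls += 1
            self.target_state[name] = self._transform(data)
            self._fingerprints[name] = fingerprint


def narrate_deck(data: bytes) -> str:
    # Hypothetical stand-in for rendering, narration, TTS, and embedding.
    return f"notes for {len(data)} bytes"
```

Calling `update` twice with the same deck runs `narrate_deck` once; adding a second deck runs it only for the new file.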

How it works

A deck fans out to slides, and each slide produces text, audio, and a vector:

  • Render each slide of the PDF to an image (pymupdf).
  • Narrate — a vision LLM (instructor over LiteLLM) writes natural speaker notes for the slide image.
  • Voice + embed — Piper synthesizes the notes to MP3 while a sentence-transformer embeds them, concurrently.
  • Store one LanceDB row per slide — page, notes, audio (a binary column), and the embedding.
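
As a rough picture of the per-slide row, here is a plain dataclass with the same field names that `main.py` passes to `SlideRecord(...)`. The Python types are assumptions (the real class may use LanceDB-specific vector or binary types), and `SlideRecordSketch` and `make_record` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SlideRecordSketch:
    id: str                 # "<filename>#<page>", unique per slide
    filename: str
    page: int
    speaker_notes: str      # text written by the vision LLM
    voice: bytes            # MP3 audio, stored as a binary column
    embedding: list[float]  # vector used for semantic search


def make_record(
    filename: str, page: int, notes: str, voice: bytes, embedding: list[float]
) -> SlideRecordSketch:
    # Build the id the same way the excerpt from main.py does.
    return SlideRecordSketch(
        id=f"{filename}#{page}",
        filename=filename,
        page=page,
        speaker_notes=notes,
        voice=voice,
        embedding=embedding,
    )
```

Keeping the audio bytes in the same record as the vector is what lets a search hit return playable narration without a second lookup.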

`process_slide` runs the vision LLM, then synthesizes audio and embeds the notes concurrently with `asyncio.gather` before declaring the row. Read it in `main.py`:

```python
@coco.fn
async def process_slide(slide: SlidePage, filename: str, table: lancedb.TableTarget[SlideRecord]) -> None:
    notes = (await extract_speaker_notes(slide.image)).speaker_notes   # vision LLM
    voice, embedding = await asyncio.gather(
        text_to_speech(notes),                       # Piper TTS — local, no API
        coco.use_context(EMBEDDER).embed(notes),     # sentence-transformer
    )
    table.declare_row(row=SlideRecord(
        id=f"{filename}#{slide.page_number}", filename=filename, page=slide.page_number,
        speaker_notes=notes, voice=voice, embedding=embedding,
    ))

@coco.fn(memo=True)   # unchanged deck is never re-rendered or re-narrated
async def process_file(file: FileLike, table: lancedb.TableTarget[SlideRecord]) -> None:
    slides = await pdf_to_slides(await file.read())
    await coco.map(process_slide, slides, str(file.file_path.path), table)
```
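
To see what the `asyncio.gather` call buys, here is a standalone sketch with two stand-in coroutines. `fake_text_to_speech` and `fake_embed` are hypothetical placeholders that only sleep; they are not the Piper or sentence-transformer calls from `main.py`.

```python
import asyncio
import time


async def fake_text_to_speech(notes: str) -> bytes:
    await asyncio.sleep(0.2)  # pretend synthesis takes 200 ms
    return notes.encode("utf-8")


async def fake_embed(notes: str) -> list[float]:
    await asyncio.sleep(0.2)  # pretend embedding takes 200 ms
    return [float(len(notes)), float(notes.count(" "))]


async def voice_and_embed(notes: str):
    # Both coroutines are scheduled together; results come back in argument order.
    return await asyncio.gather(fake_text_to_speech(notes), fake_embed(notes))


def timed_run(notes: str):
    start = time.perf_counter()
    voice, embedding = asyncio.run(voice_and_embed(notes))
    return voice, embedding, time.perf_counter() - start
```

With both waits overlapping, `timed_run` finishes in roughly the time of the slower step rather than the sum of the two.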

The MP3 audio is stored right in the LanceDB row, so a semantic-search hit comes with playable narration attached.

<p align="center"> 📘 <b><a href="https://cocoindex.io/docs/examples/slides-to-speech/">Full Tutorial →</a></b>

Step-by-step walkthrough with the vision-LLM speaker notes, local Piper TTS, the per-slide LanceDB row, and searching the deck by meaning.

</p>

Why it's worth a star ⭐

  • Three modalities, one row. Each slide becomes text (LLM notes), audio (Piper MP3), and a vector (sentence-transformer), declared as a single LanceDB `SlideRecord`.
  • Local TTS, no per-character billing. Piper is a fast, fully local neural voice with no API and no streaming costs; the voice model loads once via `@functools.cache`.
  • Audio travels with the hit. The MP3 lives in a binary LanceDB column, so a search result carries its own playable narration.
  • Concurrent per slide. `asyncio.gather` runs TTS and embedding side by side; the heavy vision and TTS steps run on a `coco.GPU` runner.
  • Incremental & swappable. `@coco.fn(memo=True)` on `process_file` means an unchanged deck is never re-rendered or re-narrated; `LLM_MODEL` and `EMBEDDER` are declared with `detect_change=True`, so swapping the model or voice re-runs only the affected steps.
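
The load-once behavior mentioned above can be sketched with `functools.cache`. `FakeVoice`, `load_voice`, and `text_to_speech_sketch` are hypothetical stand-ins, not Piper's API; the point is only that the expensive loader body runs once per voice name.

```python
import functools

load_calls: list[str] = []  # records every real (uncached) load


class FakeVoice:
    """Hypothetical stand-in for a loaded TTS voice model."""

    def __init__(self, name: str) -> None:
        self.name = name

    def synthesize(self, text: str) -> bytes:
        return f"{self.name}:{text}".encode("utf-8")


@functools.cache
def load_voice(name: str) -> FakeVoice:
    # Runs once per distinct name; later calls return the cached object.
    load_calls.append(name)
    return FakeVoice(name)


def text_to_speech_sketch(text: str, voice_name: str = "en_US-lessac-medium") -> bytes:
    return load_voice(voice_name).synthesize(text)
```

Every slide after the first reuses the cached voice object, so model loading is paid once per process rather than once per slide.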

Run it

Needs LLM credentials for the vision model (default `gemini/gemini-2.5-flash`, which uses `GEMINI_API_KEY`), a local Piper voice, and `ffmpeg` for MP3 export.

1. Download a Piper voice (~60 MB, local):

```sh
python3 -m piper.download_voices en_US-lessac-medium
```

2. Configure & install:

```sh
cp .env.example .env     # set GEMINI_API_KEY (or swap LLM_MODEL, e.g. OpenAI)
pip install -e .
```

3. Build the index — drop a slide-deck PDF into slides/, then:

```sh
cocoindex update main        # or: cocoindex update -L main   (keep watching the folder)
```

On a 3-slide sample deck this produces three LanceDB rows, each with vision-LLM speaker notes and ~170–280 KB of Piper MP3 audio.

4. Search the deck — embed a query the same way and search LanceDB:

```sh
python main.py "reducing latency and reliability"
```

On the sample deck, that query ranks the Engineering Priorities slide first — above the roadmap and go-to-market slides — matching the spoken notes by meaning, not keywords. Each hit carries the slide's MP3 narration, ready to play.
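
Ranking by meaning comes down to comparing the query vector with each row's vector. The sketch below is a pure-Python stand-in for the LanceDB search, assuming cosine similarity as the metric; the three-dimensional vectors and audio bytes are made-up toy values, not the sample deck's real embeddings, and `Hit` and `search` are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass
class Hit:
    title: str
    voice: bytes            # narration stored alongside the vector
    embedding: list[float]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(query_vec: list[float], rows: list[Hit], top_k: int = 3) -> list[Hit]:
    # Highest similarity first; each hit still carries its audio bytes.
    ranked = sorted(
        rows, key=lambda r: cosine_similarity(query_vec, r.embedding), reverse=True
    )
    return ranked[:top_k]


# Toy rows for illustration only.
ROWS = [
    Hit("Roadmap", b"mp3-bytes-2", [0.2, 0.9, 0.1]),
    Hit("Engineering Priorities", b"mp3-bytes-1", [0.9, 0.1, 0.0]),
    Hit("Go-to-Market", b"mp3-bytes-3", [0.0, 0.2, 0.9]),
]
```

Because the audio sits in the same row as the embedding, the top hit can be played directly from `hits[0].voice` without another query.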


<p align="center"> If this gave your decks a voice and a search box, <a href="https://github.com/cocoindex-io/cocoindex"><b>give CocoIndex a star ⭐</b></a> — it helps a lot.

<a href="https://cocoindex.io/docs">Docs</a> · <a href="https://cocoindex.io/docs/examples/slides-to-speech/">Walkthrough</a> · <a href="https://discord.com/invite/zpA9S2DR7s">Discord</a> · <a href="https://github.com/cocoindex-io/cocoindex/tree/main/examples"><b>See all examples →</b></a>

</p>