This tutorial walks through how to build a Voice AI analytics pipeline using Daft and Faster-Whisper, taking raw audio all the way to searchable, multilingual transcripts. In short, you'll learn how Daft simplifies multimodal AI pipelines, letting you process, enrich, and query audio data with the same ease as tabular data.
Behind every AI meeting note, podcast summary, and voice agent lies an AI pipeline that transcribes raw audio and enriches the resulting transcripts so they are easy to retrieve for downstream applications.
Voice AI encompasses a broad range of tasks.
In this tutorial we will focus on Speech-to-Text (STT) and LLM Text Generation, exploring common techniques for preprocessing and enriching speech from audio to support downstream applications like meeting summaries, highlight extraction, and embeddings.
Audio is inherently different from traditional structured data. Since audio isn't stored in neat rows and columns, running frontier models on it comes with some extra challenges.
Before we can run our STT models on audio data, we'll need to preprocess it: decode the files, standardize the sample rate, and get the samples into a model-friendly format. Traditional approaches stitch these steps together from separate tools and data formats; Daft lets you express the whole pipeline on a single dataframe.
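To make that preprocessing concrete, here is a sketch of the usual steps in plain numpy: downmix stereo to mono, resample to the model's 16 kHz rate, and cast to float32. The function name and the linear-interpolation resampler are ours for illustration; they are not Daft or Faster-Whisper APIs.

```python
import numpy as np

def preprocess(samples: np.ndarray, src_rate: int, target_rate: int = 16000) -> np.ndarray:
    """Illustrative audio preprocessing: mono downmix, resample, float32 cast."""
    if samples.ndim == 2:
        # (num_samples, channels) -> mono by averaging channels
        samples = samples.mean(axis=1)
    if src_rate != target_rate:
        # Naive linear-interpolation resample to the target rate
        duration = len(samples) / src_rate
        n_out = int(round(duration * target_rate))
        t_src = np.linspace(0.0, duration, num=len(samples), endpoint=False)
        t_out = np.linspace(0.0, duration, num=n_out, endpoint=False)
        samples = np.interp(t_out, t_src, samples)
    return samples.astype(np.float32)
```

In a real pipeline you'd decode the compressed file first (e.g. with soundfile); here we start from raw samples to keep the sketch self-contained.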
Let's start by importing the necessary libraries and setting up our environment.
First, install the required dependencies:
pip install daft faster-whisper soundfile sentence-transformers python-dotenv openai
Then import the necessary modules:
from dataclasses import asdict
import os
import daft
from daft import DataType, col
from daft.functions import format, file, unnest
from daft.functions.ai import prompt, embed_text
from daft.ai.openai.provider import OpenAIProvider
from faster_whisper import WhisperModel, BatchedInferencePipeline
# Load environment variables
from dotenv import load_dotenv
load_dotenv()
Let's define the parameters we'll use throughout this tutorial.
# Define Constants
SAMPLE_RATE = 16000
DTYPE = "float32"
BATCH_SIZE = 16
# Define Parameters
SOURCE_URI = "hf://datasets/Eventual-Inc/sample-files/audio/*.mp3"
DEST_URI = ".data/voice_ai_analytics"
LLM_MODEL_ID = "openai/gpt-oss-120b"
EMBEDDING_MODEL_ID = "sentence-transformers/all-MiniLM-L6-v2"
CONTEXT = "Daft: Unified Engine for Data Analytics, Engineering & ML/AI (github.com/Eventual-Inc/Daft) YouTube channel video. Transcriptions can have errors like 'DAF' referring to 'Daft'."
PRINT_SEGMENTS = True
Faster-Whisper comes with built-in VAD (Voice Activity Detection) from Silero for segmenting long-form audio into neat chunks. This means we don't need to worry about audio length or handle any windowing ourselves, since Whisper only operates on 30-second chunks. We also want to take full advantage of faster-whisper's BatchedInferencePipeline to improve our throughput.
We'll define a FasterWhisperTranscriber class and decorate it with @daft.cls(). This turns a standard Python class into a massively parallel, distributed user-defined function (UDF), letting us take full advantage of Daft's Rust-backed performance.
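For contrast, here's what naive fixed-window chunking looks like without VAD (an illustrative helper, not part of the pipeline); VAD instead cuts at silence boundaries, so speech isn't split mid-word:

```python
def chunk_audio(samples, sample_rate=16000, window_s=30):
    """Split raw samples into fixed-size windows of window_s seconds."""
    window = window_s * sample_rate
    return [samples[i : i + window] for i in range(0, len(samples), window)]
```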
Key design decisions:
- Load the model once in the __init__ method
- Accept a daft.File and return a dictionary that will be materialized as a daft.DataType.struct()
- Use daft.File for simplified preprocessing

Note: Jump to the bottom of this document to see how TranscriptionResult is defined.
@daft.cls()
class FasterWhisperTranscriber:
    def __init__(self, model="distil-large-v3", compute_type="float32", device="auto"):
        self.model = WhisperModel(model, compute_type=compute_type, device=device)
        self.pipe = BatchedInferencePipeline(self.model)

    @daft.method(return_dtype=TranscriptionResult)
    def transcribe(self, audio_file: daft.File):
        """Transcribe audio files with Voice Activity Detection (VAD) using Faster-Whisper."""
        with audio_file.to_tempfile() as tmp:
            segments_iter, info = self.pipe.transcribe(
                str(tmp.name),
                vad_filter=True,
                vad_parameters=dict(min_silence_duration_ms=500, speech_pad_ms=200),
                word_timestamps=True,
                without_timestamps=False,
                temperature=0,
                batch_size=BATCH_SIZE,
            )
            segments = [asdict(seg) for seg in segments_iter]
        text = " ".join([seg["text"] for seg in segments])
        return {"transcript": text, "segments": segments, "info": asdict(info)}
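The pattern @daft.cls enables can be seen in plain Python: expensive setup (model loading) happens once at construction, and every subsequent call reuses it. This toy stands in for the real class above; FakeModel is ours, not a Daft or Whisper API.

```python
class FakeModel:
    load_count = 0

    def __init__(self):
        # Pretend this downloads and loads model weights.
        FakeModel.load_count += 1

class Transcriber:
    def __init__(self):
        self.model = FakeModel()  # expensive setup, paid once

    def transcribe(self, audio_chunk):
        return f"transcript of {audio_chunk}"

t = Transcriber()
results = [t.transcribe(c) for c in ["a.mp3", "b.mp3", "c.mp3"]]
```

With @daft.cls, Daft applies the same idea per worker, so each process loads Whisper once and then streams rows through it.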
We'll use OpenRouter as our LLM provider for summaries and translations. Let's configure it:
# Create an OpenAI provider, attach, and set as the default
openrouter_provider = OpenAIProvider(
name="OpenRouter",
base_url="https://openrouter.ai/api/v1",
api_key=os.environ.get("OPENROUTER_API_KEY"),
)
daft.attach_provider(openrouter_provider)
daft.set_provider("OpenRouter")
Before we dive into transcription, let's understand why Daft's dataframe interface is powerful:
Daft's execution engine uses a push-based processing model, which lets it optimize each operation by planning the entire pipeline, from the initial query through your logic to the final write to disk.
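As a loose analogy (this is not Daft's actual implementation), a lazy dataframe can simply record operations and defer all work to collect(), which is what gives an optimizer room to plan before any data moves:

```python
class LazyFrame:
    """Toy lazy dataframe: records operations, runs them only on collect()."""

    def __init__(self, rows, ops=()):
        self.rows, self.ops = rows, tuple(ops)

    def where(self, predicate):
        # Nothing executes here; we just extend the recorded plan.
        return LazyFrame(self.rows, self.ops + (("where", predicate),))

    def with_column(self, name, fn):
        return LazyFrame(self.rows, self.ops + (("with_column", name, fn),))

    def collect(self):
        # Only now does data flow through the recorded plan.
        out = list(self.rows)
        for op in self.ops:
            if op[0] == "where":
                out = [row for row in out if op[1](row)]
            else:
                _, name, fn = op
                out = [{**row, name: fn(row)} for row in out]
        return out

rows = [{"x": 1}, {"x": 5}]
plan = LazyFrame(rows).where(lambda r: r["x"] > 2).with_column("y", lambda r: r["x"] * 10)
# No work has happened yet; plan.collect() executes the whole pipeline.
```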
Now let's transcribe our audio files:
# Instantiate Transcription UDF
fwt = FasterWhisperTranscriber()
# Transcribe the audio files
df_transcript = (
    # Discover the audio files
    daft.from_glob_path(SOURCE_URI)
    # Wrap the path as a daft.File
    .with_column("audio_file", file(col("path")))
    # Transcribe the audio file with Voice Activity Detection (VAD) using Faster Whisper
    .with_column("result", fwt.transcribe(col("audio_file")))
    # Unpack Results
    .select("path", "audio_file", unnest(col("result")))
).collect()
print(
    "\n\nRunning Transcription with Voice Activity Detection (VAD) using Faster Whisper..."
)

# Show the transcript
df_transcript.select(
    "path",
    "info",
    "transcript",
    "segments",
).show(3, format="fancy", max_width=40)
╭────────────────────────────────────────┬─────────────────────────┬────────────────────────────────────────┬──────────────╮
│ path ┆ info ┆ transcript ┆ segments │
╞════════════════════════════════════════╪═════════════════════════╪════════════════════════════════════════╪══════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ {language: en, ┆ Hi, I'm Kevin. Let's talk batch infe… ┆ [{id: 1, │
│ ┆ language_probability: … ┆ ┆ seek: 0, │
│ ┆ ┆ ┆ start: 0.09, │
│ ┆ ┆ ┆ end: 2… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ {language: en, ┆ I'm climbing today. Peor. What are… ┆ [{id: 1, │
│ ┆ language_probability: … ┆ ┆ seek: 0, │
│ ┆ ┆ ┆ start: 0.76, │
│ ┆ ┆ ┆ end: 1… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ {language: en, ┆ Hi, I'm Colin. I'm a software engine… ┆ [{id: 1, │
│ ┆ language_probability: … ┆ ┆ seek: 0, │
│ ┆ ┆ ┆ start: 0.15, │
│ ┆ ┆ ┆ end: 2… │
╰────────────────────────────────────────┴─────────────────────────┴────────────────────────────────────────┴──────────────╯
(Showing first 3 of 7 rows)
Great! We've successfully transcribed our audio files. The dataframe now contains:
- path: The source file path
- transcript: The full transcription text
- segments: A list of transcription segments with timestamps
- info: Metadata about the transcription (language, duration, etc.)

Moving on to our downstream enrichment stages: summarization is a common and simple way to leverage an LLM for publishing, socials, or search. With Daft, generating a summary from your transcripts is as simple as adding a column.
We'll also demonstrate how easy it is to add translations: since all the data is organized and accessible, we just need to declare what we want!
# Summarize the transcripts and translate to Chinese
df_summaries = (
    df_transcript
    # Summarize the transcripts
    .with_column(
        "summary",
        prompt(
            format(
                "Summarize the following transcript from a YouTube video belonging to {}: \n {}",
                daft.lit(CONTEXT),
                col("transcript"),
            ),
            model=LLM_MODEL_ID,
        ),
    )
    .with_column(
        "summary_chinese",
        prompt(
            format(
                "Translate the following text to Simplified Chinese: <text>{}</text>",
                col("summary"),
            ),
            system_message="You will be provided with a piece of text. Your task is to translate the text to Simplified Chinese exactly as it is written. Return the translated text only, no other text or formatting.",
            model=LLM_MODEL_ID,
        ),
    )
)
print("\n\nGenerating Summaries...")

# Show the summaries and the transcript
df_summaries.select(
    "path",
    "transcript",
    "summary",
    "summary_chinese",
).show(format="fancy", max_width=40)
╭────────────────────────────────────────┬────────────────────────────────────────┬────────────────────────────────────────┬─────────────────────────────────────────────────────────────╮
│ path ┆ transcript ┆ summary ┆ summary_chinese │
╞════════════════════════════════════════╪════════════════════════════════════════╪════════════════════════════════════════╪═════════════════════════════════════════════════════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin, engineer at Eventual,… ┆ **Video Summary – “Spark Connect for … ┆ **视频摘要 – “Spark Connect for Daft”(Daf… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Colin. I'm a software engine… ┆ **Video Summary – “Unified Engine for… ┆ **视频摘要 – “统一的数据分析、工程与 ML/AI 引擎 (Daft)… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Okay, so I have a cluster running wi… ┆ **Video Summary – “Unified Engine for… ┆ **视频摘要 – “统一的用于数据分析、工程和 ML/AI 的引擎”(Da… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin. Let's talk batch infe… ┆ **Video Summary – “Batch Inference wi… ┆ 3. **执行** – `daft.run()` 执行该操作。 │
│ ┆ ┆ ┆ │
│ ┆ ┆ ┆ - … │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Colin, a software engineer a… ┆ **Video Summary – “Unified Engine for… ┆ **视频摘要 – “统一的数据分析、工程与 ML/AI 引擎”(Daft)… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Real-old data is messy. There's an e… ┆ **Summary of the Daft “Unified Engine… ┆ **Daft “统一的数据分析、工程与 ML/AI 引擎” 视频摘要** │
│ ┆ ┆ ┆ … │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ I'm climbing today. Peor. What are… ┆ **Summary** ┆ **摘要** │
│ ┆ ┆ ┆ │
│ ┆ ┆ The video opens with a b… ┆ 视频以一次简短的随意对话开场,讨论攀岩等级,仅作为轻松的引… │
╰────────────────────────────────────────┴────────────────────────────────────────┴────────────────────────────────────────┴─────────────────────────────────────────────────────────────╯
(Showing first 7 of 7 rows)
Excellent! We now have summaries in both English and Chinese. This demonstrates how easy it is to add multilingual support to your pipeline.
A common downstream task is preparing subtitles. Since our segments come with start and end timestamps, we can easily add another section to our Voice AI pipeline for translation. We'll explode the segments (one row per segment) and translate each segment to Simplified Chinese.
# Explode the segments and translate to Simplified Chinese for subtitles
df_segments = (
    df_transcript.explode("segments")
    .select(
        "path",
        unnest(col("segments")),
    )
    .with_column(
        "segment_text_chinese",
        prompt(
            format(
                "Translate the following text to Simplified Chinese: <text>{}</text>",
                col("text"),
            ),
            system_message="You will be provided with a transcript segment. Your task is to translate the text to Simplified Chinese exactly as it is written. Return the translated text only, no other text or formatting.",
            model=LLM_MODEL_ID,
        ),
    )
)
print("\n\nGenerating Chinese Subtitles...")

# Show the segments and translations
df_segments.select(
    "path",
    col("text"),
    "segment_text_chinese",
).show(format="fancy", max_width=40)
╭────────────────────────────────────────┬────────────────────────────────────────┬──────────────────────────────────────────────────────────────╮
│ path ┆ text ┆ segment_text_chinese │
╞════════════════════════════════════════╪════════════════════════════════════════╪══════════════════════════════════════════════════════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Then we're using DAF's LLM generate … ┆ 然后我们在数据集的 prompts 列上使用 DAF 的 LLM 生成函数… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So here we're using DAF to read a CS… ┆ <text> 所以这里我们使用 DAF 从 Hugging Face 读取… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ With DAF's LLM generate function, th… ┆ 使用 DAF 的 LLM 生成函数,这非常容易实现。… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So let me just run the code first wh… ┆ <text> 所以让我先运行代码,同时解释一下发生了什么。 </text>… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ specifying that we want to run it on… ┆ <text>指定我们想要在 Open AI 提供的 GPT5 Nano 模… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ awesome chat GPT prompts data set. ┆ <text> 超棒的 chat GPT 提示数据集。</text>… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin. Let's talk batch infe… ┆ 嗨,我是凯文。让我们谈谈批量推理。… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Say you have a dataset of prompts th… ┆ <text> 假设你有一个提示数据集,想要将其运行在 GPT 上。</… │
╰────────────────────────────────────────┴────────────────────────────────────────┴──────────────────────────────────────────────────────────────╯
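As an aside, these timestamped segments map directly onto subtitle formats such as SRT. A minimal sketch (the helper names are ours, not Daft or tutorial code):

```python
def to_srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"

def segments_to_srt(segments) -> str:
    """Render a list of {'start', 'end', 'text'} dicts as an SRT document."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        blocks.append(
            f"{i}\n"
            f"{to_srt_timestamp(seg['start'])} --> {to_srt_timestamp(seg['end'])}\n"
            f"{seg['text'].strip()}\n"
        )
    return "\n".join(blocks)
```

The same function works for the translated column: pass segment_text_chinese as the text field to produce Chinese subtitles.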
Perfect! These segments can now be used to make content more accessible for wider audiences, which is a great way to increase reach. Each segment has:
- Timestamps (start, end)

Our final stage is embeddings. If you're going through the trouble of transcription, you might as well make that content available as part of your knowledge base. Meeting notes might not be the most advanced AI use case anymore, but they still provide immense value for tracking decisions and key moments in discussions.
Adding an embeddings stage is as simple as calling embed_text():
# Embed the segments
df_segments = df_segments.with_column(
    "segment_embeddings",
    embed_text(
        col("text"),
        provider="transformers",
        model=EMBEDDING_MODEL_ID,
    ),
)
print("\n\nGenerating Embeddings for Segments...")

# Show the segments with embeddings
df_segments.select(
    "path",
    "text",
    "segment_embeddings",
).show(format="fancy", max_width=40)
╭────────────────────────────────────────┬────────────────────────────────────────┬───────────────────────────╮
│ path ┆ text ┆ segment_embeddings │
╞════════════════════════════════════════╪════════════════════════════════════════╪═══════════════════════════╡
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Hi, I'm Kevin. Let's talk batch infe… ┆ ▁▇▇▇▆▆▄▆▆▇█▇▅▆▆▆▄▅█▄█▇▅▅… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Say you have a dataset of prompts th… ┆ ▃▁▄▄▇▃▅▄▂▆▇▄▅▅▂▃█▄▄▆▂▂▁▂… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ With DAF's LLM generate function, th… ┆ ▄▇▃▂▃▆▄▃▄▁▄▁▂█▄▂▆▃▆▂▄▃▄▁… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So let me just run the code first wh… ┆ ▄▇▄▃█▆▃█▆█▇▆▂▁▅▆▆▄▅█▆▆▂▇… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ So here we're using DAF to read a CS… ┆ ▄▃█▇▄▅▆█▅▅▂▅▅▆▄▁▆▄▄█▃▅▅▅… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ awesome chat GPT prompts data set. ┆ ▄▃▄▄▄▄▃▂▂▅▄▆▅▄▅▃█▅▃▅▄▁▂▃… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ Then we're using DAF's LLM generate … ┆ ▆▅▄▄▅▆▆▄▄▇▆▃▄▇▂▂▇▄▄█▅▂▃▁… │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ hf://datasets/Eventual-Inc/sample-fil… ┆ specifying that we want to run it on… ┆ ▆▃▅▄▃▁▃▅▃▅▂▃▅█▃▂▇▇▇▇▄▂▄▅… │
╰────────────────────────────────────────┴────────────────────────────────────────┴───────────────────────────╯
(Showing first 8 rows)
Excellent! Daft's native embedding DataType intelligently stores embedding vectors for you, regardless of their size. Now you have:
We've successfully built a complete Voice AI Analytics pipeline that:
From here there are several directions you could take:
Leverage the embeddings to host a Q/A chatbot that enables listeners to engage with content across episodes.
Build recommendation engines that surface hidden gems based on semantic similarity rather than just metadata tags.
Create dynamic highlight reels that auto-generate shareable clips based on sentiment spikes and topic density.
Leverage Daft's cosine_distance function to put together a full RAG (Retrieval-Augmented Generation) workflow for an interactive experience.
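To sketch the retrieval step such a workflow relies on, here is cosine similarity over embedding vectors in plain numpy; Daft's cosine_distance computes the analogous distance natively over columns:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors (1.0 = identical direction)."""
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec, corpus_vecs, k=3):
    # Rank corpus entries by similarity to the query, highest first.
    scores = [cosine_similarity(query_vec, v) for v in corpus_vecs]
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

In a RAG loop, query_vec would come from embedding the user's question with the same model used for the segments, and the top-k segment texts would be fed to the LLM as context.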
Use the same tooling to power analytics dashboards showcasing trending topics, or supply content for automated newsletters. Since everything you store is queryable and performant, the only limit is your imagination!
At Eventual, we're simplifying multimodal AI so you don't have to. Managing voice AI pipelines or processing thousands of hours of podcast audio ultimately comes down to a few core needs:
Traditionally, delivering all of this meant juggling multiple tools, data formats, and scaling headaches: a brittle setup that doesn't grow with your workload. With Daft, you get one unified engine to process, store, and query multimodal data efficiently.
Fewer moving parts means fewer failure points, less debugging, and a much shorter path from raw audio to usable insights.
For more examples and to get help, check out:
TranscriptionResult Definition

from daft import DataType

WordStruct = DataType.struct(
    {
        "start": DataType.float64(),
        "end": DataType.float64(),
        "word": DataType.string(),
        "probability": DataType.float64(),
    }
)

SegmentStruct = DataType.struct(
    {
        "id": DataType.int64(),
        "seek": DataType.int64(),
        "start": DataType.float64(),
        "end": DataType.float64(),
        "text": DataType.string(),
        "tokens": DataType.list(DataType.int64()),
        "avg_logprob": DataType.float64(),
        "compression_ratio": DataType.float64(),
        "no_speech_prob": DataType.float64(),
        "words": DataType.list(WordStruct),
        "temperature": DataType.float64(),
    }
)

TranscriptionOptionsStruct = DataType.struct(
    {
        "beam_size": DataType.int64(),
        "best_of": DataType.int64(),
        "patience": DataType.float64(),
        "length_penalty": DataType.float64(),
        "repetition_penalty": DataType.float64(),
        "no_repeat_ngram_size": DataType.int64(),
        "log_prob_threshold": DataType.float64(),
        "no_speech_threshold": DataType.float64(),
        "compression_ratio_threshold": DataType.float64(),
        "condition_on_previous_text": DataType.bool(),
        "prompt_reset_on_temperature": DataType.float64(),
        "temperatures": DataType.list(DataType.float64()),
        "initial_prompt": DataType.python(),
        "prefix": DataType.string(),
        "suppress_blank": DataType.bool(),
        "suppress_tokens": DataType.list(DataType.int64()),
        "without_timestamps": DataType.bool(),
        "max_initial_timestamp": DataType.float64(),
        "word_timestamps": DataType.bool(),
        "prepend_punctuations": DataType.string(),
        "append_punctuations": DataType.string(),
        "multilingual": DataType.bool(),
        "max_new_tokens": DataType.float64(),
        "clip_timestamps": DataType.python(),
        "hallucination_silence_threshold": DataType.float64(),
        "hotwords": DataType.string(),
    }
)

VadOptionsStruct = DataType.struct(
    {
        "threshold": DataType.float64(),
        "neg_threshold": DataType.float64(),
        "min_speech_duration_ms": DataType.int64(),
        "max_speech_duration_s": DataType.float64(),
        "min_silence_duration_ms": DataType.int64(),
        "speech_pad_ms": DataType.int64(),
    }
)

LanguageProbStruct = DataType.struct(
    {
        "language": DataType.string(),
        "probability": DataType.float64(),
    }
)

InfoStruct = DataType.struct(
    {
        "language": DataType.string(),
        "language_probability": DataType.float64(),
        "duration": DataType.float64(),
        "duration_after_vad": DataType.float64(),
        "all_language_probs": DataType.list(LanguageProbStruct),
        "transcription_options": TranscriptionOptionsStruct,
        "vad_options": VadOptionsStruct,
    }
)

TranscriptionResult = DataType.struct(
    {
        "transcript": DataType.string(),
        "segments": DataType.list(SegmentStruct),
        "info": InfoStruct,
    }
)