Introduction

Daft is a high-performance data engine designed for AI and multimodal workloads, providing simple, reliable data processing for images, audio, video, and structured data at any scale.

<div class="stage-box">
  <div class="row">
    <div class="description">
      <div class="stage-header">EXPENSIVE TRANSFORMATIONS</div>
      <p>Build efficient Daft data pipelines involving heavy transformations</p>
      <p>GPU models<br>User-provided Python code<br>External LLM APIs</p>
    </div>
    <div class="pipeline-item">
      <span id="xform" class="swap"></span>
    </div>
  </div>
</div>

<div class="stage-box">
  <div class="row">
    <div class="description">
      <div class="stage-header">WRITE</div>
      <p>Daft lands data into specialized data systems for downstream use cases</p>
      <p>Search (full-text search and vector DBs)<br>Applications (SQL/NoSQL databases)<br>Analytics (data warehouses)<br>Model training (S3 object storage)</p>
    </div>
    <div class="pipeline-item">
      <span id="write" class="swap"></span>
    </div>
  </div>
</div>

</div>
</div>
<script>
// Hierarchical data buckets by modality with transform-to-write mappings
const MODALITIES = {
  "Images": {
    sources: ["*.jpeg files in S3", "URLs in database", "*.parquet on Huggingface"],
    transforms: {
      "OCR for text extraction": {
        details: ["# use Tesseract OCR engine", "# use Azure Computer Vision API"],
        destinations: { "Elasticsearch": "# for full-text search" }
      },
      "Image captioning with LLM": {
        details: ["# use qwen model on H100 GPU"],
        destinations: { "PostgreSQL": "# for querying by webapps", "MongoDB": "# for querying by webapps" }
      },
      "Object detection": {
        details: ["# use YOLOv8 model on GPU", "# use Azure Object Detection APIs"],
        destinations: { "PostgreSQL": "# for querying by webapps", "MySQL": "# for querying by webapps" }
      },
      "Generate embeddings": {
        details: ["# use CLIP model on GPU", "# use OpenAI text-embedding-3"],
        destinations: { "Turbopuffer": "# for vector search", "LanceDB": "# for vector search" }
      }
    }
  },
  "Documents": {
    sources: ["*.pdf files in S3", "*.docx files in GCS", "*.html files in R2", "*.parquet on Huggingface"],
    transforms: {
      "OCR for text extraction": {
        details: ["# use Tesseract OCR engine", "# use EasyOCR with GPU acceleration", "# use Azure Computer Vision API"],
        destinations: { "Elasticsearch": "# for full-text search" }
      },
      "Structured data extraction": {
        details: ["# use OpenAI's API for gpt-4o", "# use Azure Form Recognizer APIs"],
        destinations: { "BigQuery": "# for analytics", "Snowflake": "# for analytics", "Databricks": "# for analytics" }
      },
      "Generate embeddings": {
        details: ["# use OpenAI's API for text-embedding-3", "# use sentence-transformers on GPU"],
        destinations: { "Turbopuffer": "# for vector search", "LanceDB": "# for vector search" }
      },
      "PII detection": {
        details: ["# use spaCy NER model", "# use Azure PII detection"],
        destinations: { "BigQuery": "# for analytics", "Snowflake": "# for analytics", "Databricks": "# for analytics" }
      },
      "Chunking + deduplication": {
        details: ["# use daft default splitting"],
        destinations: { "Parquet": "# for data lake storage" }
      }
    }
  },
  "Video": {
    sources: ["*.mp4 files in S3", "URLs in CSVs", "*.parquet on Huggingface"],
    transforms: {
      "Video captioning": {
        details: ["# custom Python code: extract audio and transcribe"],
        destinations: { "PostgreSQL": "# for querying by webapps" }
      },
      "Scene detection": {
        details: ["# use OpenCV scene detection", "# use PySceneDetect library"],
        destinations: { "AWS S3": "# for object storage" }
      },
      "Audio transcription": {
        details: ["# use Whisper model on GPU", "# use Azure Speech Services API"],
        destinations: { "PostgreSQL": "# for querying by webapps", "MongoDB": "# for querying by webapps", "Elasticsearch": "# for full-text search" }
      },
      "Generate embeddings": {
        details: ["# use CLIP model on CPU", "# use CLIP model on GPU"],
        destinations: { "Turbopuffer": "# for vector search", "LanceDB": "# for vector search" }
      }
    }
  },
  "Audio (WAV/MP3/FLAC)": {
    sources: ["*.wav files in S3", "URLs in database", "*.parquet on Huggingface"],
    transforms: {
      "Transcription with Whisper": {
        details: ["# use Whisper.cpp on CPU", "# use Azure Speech Services API"],
        destinations: { "PostgreSQL": "# for querying by webapps", "MongoDB": "# for querying by webapps", "Elasticsearch": "# for full-text search" }
      },
      "Speaker identification": {
        details: ["# use custom Python code with pyannote.audio", "# use Azure Speaker Recognition API"],
        destinations: { "PostgreSQL": "# for querying by webapps", "MySQL": "# for querying by webapps" }
      },
      "Emotion detection": {
        details: ["# use custom Python code with wav2vec2", "# use Azure Emotion API"],
        destinations: { "BigQuery": "# for analytics", "Snowflake": "# for analytics", "Databricks": "# for analytics" }
      },
      "Generate embeddings": {
        details: ["# use custom Python code with wav2vec2", "# use OpenAI's endpoint for text-embedding-3"],
        destinations: { "Turbopuffer": "# for vector search", "LanceDB": "# for vector search" }
      }
    }
  },
  "AI Agent Logs": {
    sources: ["JSON logs in Kafka", "JSON-lines in S3"],
    transforms: {
      "LLM summarization": {
        details: ["# use OpenAI gpt-4o endpoint", "# use Claude 3.5 Sonnet API endpoint", "# use custom summarization model on GPUs"],
        destinations: { "PostgreSQL": "# for querying by webapps", "MySQL": "# for querying by webapps" }
      },
      "Generate embeddings": {
        details: ["# use OpenAI's endpoint for text-embedding-3", "# use sentence-transformers on GPUs", "# use BERT model on GPUs"],
        destinations: { "Turbopuffer": "# for vector search", "LanceDB": "# for vector search" }
      }
    }
  }
};
// Helpers
const q = (id) => document.getElementById(id);
// Wait for DOM to be ready
function waitForElements() {
  const els = { read: q("read"), xform: q("xform"), write: q("write") };
  if (els.read && els.xform && els.write) {
    return els;
  }
  return null;
}
function pick(list, last) {
  if (list.length < 2) return list[0];
  let choice;
  do choice = list[(Math.random() * list.length) | 0];
  while (choice === last);
  return choice;
}
// Typewriter effect
async function typeTo(el, text) {
  if (!el) return; // Safety check for null elements
  const speed = 12 + Math.random() * 10;
  el.innerHTML = "";
  const span = document.createElement("span");
  span.className = "type";
  el.appendChild(span);
  for (let i = 0; i <= text.length; i++) {
    const currentText = text.slice(0, i);
    const lines = currentText.split('\n');
    const formattedLines = lines.map(line => {
      if (line.startsWith('# ')) {
        return `<span class="source-comment">${line}</span>`;
      }
      return line;
    });
    span.innerHTML = formattedLines.join('\n') + '<span class="cursor">█</span>';
    await new Promise(r => setTimeout(r, speed));
  }
  // Let cursor blink for a moment, then fade away
  const cursor = span.querySelector('.cursor');
  if (cursor) {
    setTimeout(() => {
      cursor.classList.remove('cursor');
      cursor.classList.add('cursor-fade');
    }, 2200); // Keep blinking for N seconds before fading
  }
}
// Cycle logic
let last = { read: null, xform: null, write: null, source: null, detail: null };
async function shuffleAll() {
  const els = waitForElements();
  if (!els) return; // Exit if elements aren't ready
  const modalities = Object.keys(MODALITIES);
  const read = pick(modalities, last.read);
  const modality = MODALITIES[read];
  const source = pick(modality.sources, last.source);
  const transforms = Object.keys(modality.transforms);
  const xform = pick(transforms, last.xform);
  const transformData = modality.transforms[xform];
  const detail = pick(transformData.details, last.detail);
  const destinations = Object.keys(transformData.destinations);
  const write = pick(destinations, last.write);
  const writeUseCase = transformData.destinations[write];
  last = { read, xform, write, source, detail };
  await Promise.all([
    typeTo(els.read, read + "\n" + "# " + source),
    typeTo(els.xform, xform + "\n" + detail),
    typeTo(els.write, write + "\n" + writeUseCase)
  ]);
}
// Auto-advance with delay
async function runCycle() {
  await shuffleAll();
  await new Promise(resolve => setTimeout(resolve, 8000));
}
// Start the cycle
runCycle();
setInterval(runCycle, 4600);
</script>

Why Daft?

:octicons-image-24: Unified multimodal data processing

While traditional dataframes struggle with anything beyond tables, Daft natively handles tables, images, text, and embeddings through a single Python API. No more stitching together specialized tools for different data types.

:material-language-python: Python-native, no JVM required

Built for modern AI/ML workflows with Python at its core and Rust under the hood. Skip the JVM complexity, version conflicts, and memory tuning, and get 20x faster start times without the Java tax.

:fontawesome-solid-laptop: Seamless scaling, from laptop to cluster

Start local, scale global—without changing a line of code. Daft's Rust-powered engine delivers blazing performance on a single machine and effortlessly extends to distributed clusters when you need more horsepower.

Key Features

- Native multimodal processing: Process any data type, from structured tables to unstructured text and rich media, with native support for images, audio, video, and embeddings in a single, unified framework.
- Built-in AI operations: Run LLM prompts with structured outputs, generate embeddings, and classify images or text using models from OpenAI, Transformers, or your own custom providers, all optimized for batch processing.
- Rust-powered performance: A Rust foundation delivers vectorized execution and non-blocking I/O, processing the same queries with 5x less memory while consistently outperforming industry standards by an order of magnitude.
- Universal data connectivity: Access data wherever it lives: cloud storage (S3, Azure, GCS, Hugging Face), modern table formats (Apache Iceberg, Delta Lake, Apache Hudi), or enterprise catalogs (Unity Catalog, AWS Glue), all with zero configuration.
- Push your code to your data: Bring your Python functions directly to your data with zero-copy UDFs powered by Apache Arrow, eliminating data movement overhead and accelerating processing.
- Out-of-the-box reliability: Intelligent memory management prevents OOM errors, and sensible defaults eliminate configuration headaches, so you can focus on results rather than infrastructure.

!!! tip "Looking to get started with Daft ASAP?"

    If you are ready to jump into code, take a look at these resources:

    1. [Quickstart](quickstart.md): Itching to run some Daft code? Hit the ground running with our 10-minute quickstart.

    2. [Examples](examples/index.md): See Daft in action with use cases across text, images, audio, and more.

    3. [API Documentation](api/index.md): Searchable documentation and reference material for Daft's public API.

Contribute to Daft

If you're interested in hands-on learning about Daft internals and would like to contribute to our project, join us on GitHub 🚀

Take a look at the many issues tagged `good first issue` in our repo. If any interest you, feel free to chime in on the issue itself, or join our Distributed Data Slack Community and send us a message in #daft-dev. Daft team members will be happy to assign the issue to you and provide guidance if needed!