Built on Daft

This page highlights projects that depend on or build on top of Daft.

These projects are different from Community Extensions. An extension directly adds reusable Daft APIs, functions, datatypes, or distributed operators. A project built on Daft may instead use Daft as a query engine, DataFrame runtime, compute backend, or pipeline framework.

To propose a project for this page, open a PR with a short description and a link to the relevant Daft usage.

Projects Building on Daft

| Project | Description | How It Uses Daft |
| --- | --- | --- |
| DeltaCat | Portable multimodal lakehouse on Ray, Arrow, and Daft. | Daft is a core dependency: `dc.read()` returns a Daft DataFrame by default, and Daft is a first-class compute backend for reading and writing lake data. |
| SmooSense | Web-based tool for interactively exploring large-scale multimodal tabular data. | Accepts Daft DataFrames as input in the Jupyter widget (`Sense(daft_df)`) and converts them for display alongside Pandas DataFrames. |
| eai-taxonomy | Annotation tools and dataset filters from the Essential-Web project. | Uses Daft in annotation pipelines to read Parquet annotation files and compute inter-annotator agreement. |
| GraphReduce | Feature engineering library for large graphs of tabular data. | Supports Daft as a compute backend via `pip install "graphreduce[daft]"`. |
| Archetype | DataFrame-first, append-only ECS runtime for simulations and AI agents. | Uses Daft DataFrames as the native data structure for world state and processor logic. |
| hypergraph | Python workflow orchestration framework for DAG pipelines and agentic workflows. | Provides a `DaftRunner` that compiles a hypergraph DAG into a Daft query plan using Daft UDFs. |
| Sashimi4Talent | Data pipeline and web app for discovering and ranking GitHub contributors. | Uses Daft as the core processing engine for repository search, commits, contributor analysis, and dataset merges. |
| daft-image-playground | Flask-based intelligent image search engine. | Uses Daft to discover images and run native image decoding, resizing, and encoding. |
| SMiR | Synthetic data pipeline for multi-image reasoning. | Uses Daft with the Ray runner to read Parquet from S3, download images, and run GPU-accelerated embedding UDFs. |
| Derezz | CLI tool for video indexing and semantic search. | Uses Daft to read and write an S3 Tables-backed video frame index and query it. |
| teraflopai-daft | Daft plugin library for Teraflop AI NLP services. | Exposes custom Daft expressions such as `segment_text` and `search_text` that call the Teraflop AI provider API. |
| daft-sql-adapter | CLI adapter that runs Spark SQL queries against Databricks Unity Catalog tables through Daft. | Uses Daft as the execution engine for SQL queries and table writes. |
| zarr-daft-datasource | Prototype compute-storage-separation architecture using Zarr, LanceDB, and Daft. | Implements a custom Daft `DataSource` that reads Zarr arrays into lazy Daft DataFrames. |
| pdf-document-processing-daft | Demonstration pipeline for scalable PDF document processing. | Uses Daft DataFrames to parallelize PDF parsing, text chunking, embedding generation, and Parquet writes. |
| daft-ai-driving | Project for processing the KITTI autonomous driving dataset. | Uses Daft DataFrames for data loading, filtering, aggregation, and Parquet export. |
| sigil | CLI tool for collecting and analyzing AI coding session data. | Uses Daft to write local Parquet session data and query it with filtering and aggregation. |
| embed_qwen_k8s | Kubernetes pipeline for Common Crawl text processing and embeddings. | Uses Daft to read WARC/Parquet data, apply UDFs for chunking and embedding, and write results to S3. |

Usage Patterns and Examples

| Project | Description | How It Uses Daft |
| --- | --- | --- |
| daft-examples | Official Daft examples repository. | Shows idiomatic Daft usage across `daft.read_*`, UDFs, AI functions, multimodal data, SQL analytics, vector search, and integrations. |
| workflow-eventual-inc-daft-distributed-udf-processing | Workflow demonstrating custom Python UDFs for ML inference and GPU transforms. | Uses `@daft.func` and `@daft.cls` with lazy and distributed execution. |
| workflow-eventual-inc-daft-sql-query-analytics | Workflow for SQL analytics with Daft sessions and catalogs. | Creates sessions, attaches catalogs, registers tables, executes SQL, and exports results. |
| workflow-eventual-inc-daft-data-lakehouse-etl | ETL pipeline for Iceberg, Delta Lake, and Hudi lakehouse formats. | Uses Daft lakehouse readers, DataFrame transformations, and write APIs. |
| workflow-eventual-inc-daft-multimodal-ai-batch-inference | Workflow for AI inference over multimodal datasets. | Applies Daft AI functions and embedding workflows to Hugging Face, Parquet, and CSV datasets. |

Daft as a Chosen Backend

| Project | Description | How It Uses Daft |
| --- | --- | --- |
| cuallee | DataFrame-agnostic data quality check library. | Implements a Daft validation backend and test suite for quality checks against Daft DataFrames. |
| Atlan application-sdk | SDK for building data catalog integration applications. | Uses Daft to lazily read and filter JSON table metadata files during incremental sync. |
| narwhals-daft | Narwhals plugin for Daft. | Wraps Daft DataFrames so the Narwhals expression API can execute lazily on Daft. |
| jett | ETL template framework supporting multiple DataFrame engines. | Lists Daft as a supported engine backend for YAML-defined ETL pipelines. |

Ecosystem and Content

| Project | Description | How It Uses Daft |
| --- | --- | --- |
| AWS MCP servers | Collection of open-source MCP servers for AWS services. | The S3 Tables MCP server uses Daft as the SQL query engine for read-only queries against AWS S3 Tables. |
| Fabric_Notebooks_Demo | Microsoft Fabric notebook demos. | Includes ETL notebooks using Daft to load and transform CSVs from Azure storage and write Delta Lake tables. |
| lakevision | Web tool for exploring Apache Iceberg lakehouses. | Uses Daft to query and display sample data from Iceberg tables. |
| general-demos | Data engineering demo projects. | Includes a `daft-quickstart` folder demonstrating Daft with Apache Iceberg. |
| DaftHudi | Demo dashboard application using Apache Hudi, Daft, and Streamlit. | Uses Daft to read Apache Hudi tables and power analytical queries. |
| Engineering Lakehouses with Open Table Formats | Code samples for a lakehouse engineering book. | Includes a notebook demonstrating reading and writing Apache Iceberg tables with Daft. |
| databricks-demos | Databricks demo notebooks. | Includes a notebook using Daft with Unity Catalog to write data to Delta tables. |
| knee-deep-in-the-lake | Hands-on training repository for IoT data lake technologies. | Uses Daft in notebooks alongside PyArrow and Pandas for querying Parquet files with SQL. |