docs/extensions/overview.md
Daft extensions are reusable libraries that add domain-specific functionality on top of Daft without requiring every specialized function, datatype, integration, or workflow to live directly in Daft core.
There are two broad ways to build extensions: pure-Python extensions built with Daft's UDF decorators (@daft.func, @daft.func.batch, @daft.cls, and @daft.method.batch), and native extensions built against Daft's C ABI. Both models can expose clean Python APIs that feel like ordinary Daft expression functions, while the implementation details stay hidden behind functions, classes, and expressions.
The fastest way to build a Daft extension is often pure Python. Using
@daft.func, @daft.func.batch,
@daft.cls, and @daft.method.batch,
contributors can package reusable Python logic that runs inside Daft's
distributed execution engine.
This is ideal for libraries that orchestrate existing Python ecosystems, external services, ML models, GPUs, data maintenance tasks, or domain-specific workflows. These extensions do not require a native shared library or the Arrow C ABI. They are normal Python packages that expose higher-level Daft APIs.
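The shape of this model can be sketched in plain Python. This is an illustrative stand-in, not Daft's actual machinery: the `func` decorator and the list-based "column" below play the roles of @daft.func and expression columns, so the extension's public API hides the per-row logic.

```python
from typing import Callable, List


def func(fn: Callable) -> Callable:
    """Toy stand-in for a row-wise UDF decorator: lifts a scalar
    function so it maps over a whole column (here, a plain list)."""

    def columnar(column: List) -> List:
        return [fn(value) for value in column]

    return columnar


@func
def normalize(text: str) -> str:
    # Per-row logic the extension author writes once.
    return text.strip().lower()


# An extension package would export `normalize` as its public API;
# callers never see the per-row loop.
result = normalize(["  Hello", "WORLD  "])  # -> ['hello', 'world']
```

In a real extension the decorated function runs inside Daft's distributed engine over expression columns rather than Python lists, but the packaging idea is the same: ship ordinary Python, expose a clean callable.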
daft-lance is an example of
this model. It extends Daft with Lance-specific distributed operations such as
file compaction, scalar indexing, column merging, and REST catalog operations.
Internally, it uses Daft's Python UDF and class-UDF APIs to distribute Lance
tasks across a Daft query, while users interact with functions like
compact_files, create_scalar_index, and merge_columns_df.
This pattern also exists inside Daft itself. The
daft.functions.ai module exposes high-level functions like
prompt,
embed_text,
embed_image,
classify_text, and
classify_image. To users, these look
like normal Daft expression functions. Under the hood, they use Daft's UDF and
class-UDF machinery, including batching, concurrency controls, retries, and
GPU resource hints.
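The class-UDF idea behind these functions can be sketched in plain Python. Everything here is illustrative: `TextEmbedder`, `embed_batch`, and the toy character-code "embedding" are invented for the sketch, not Daft's API. The point is the pattern: expensive state is constructed once, then reused across many batched calls.

```python
class TextEmbedder:
    """Toy stand-in for a class UDF: costly setup happens once in
    __init__ (as loading a model or opening a client would), and
    batched calls reuse that state."""

    def __init__(self, dim: int = 4):
        # Stands in for model loading / client construction.
        self.dim = dim

    def embed_batch(self, texts):
        # Deterministic toy "embedding": character-code sums per bucket.
        out = []
        for text in texts:
            vec = [0] * self.dim
            for i, ch in enumerate(text):
                vec[i % self.dim] += ord(ch)
            out.append(vec)
        return out


embedder = TextEmbedder()
vectors = embedder.embed_batch(["hi", "daft"])
```

In Daft's real class-UDF model, the engine additionally controls batch sizes, concurrency, retries, and GPU placement around calls like this.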
The file APIs follow a similar product pattern. daft.File
values and helpers like file,
audio_file,
video_file,
file_path, and
file_size give users expression-level
building blocks for working with files. Those file objects can then flow into
Python UDFs, model pipelines, and domain-specific libraries without users
needing to think about execution details.
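The building-block idea can be sketched with plain local paths. This is a toy stand-in: real daft.File values add lazy and remote I/O on top of this, and the `file_size` helper below is a hypothetical local version of the expression-level helper named above.

```python
import os
import tempfile


def file_size(paths):
    """Toy file_size helper: maps a 'column' of paths to byte sizes."""
    return [os.path.getsize(p) for p in paths]


# Create a sample file to stand in for a file-valued column entry.
tmpdir = tempfile.mkdtemp()
path = os.path.join(tmpdir, "sample.txt")
with open(path, "w") as f:
    f.write("hello")

# A downstream UDF or pipeline can consume these sizes without caring
# how or where the files are stored.
sizes = file_size([path])
```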
For contributors who need lower-level, vectorized performance, Daft also supports native extensions through a C ABI based on the Arrow C Data Interface.
Native ABI extensions can add high-performance scalar functions, aggregate
functions, Python expression wrappers, and extension-backed datatypes. They
ship as pip-installable Python packages that bundle a native shared library.
Users import the package, load it into a Daft Session,
and call ordinary Python wrappers in their DataFrame expressions.
Rust currently has the most ergonomic SDK through
daft-ext,
while C++ is demonstrated through the raw ABI. Other systems languages are
possible if they can produce a shared library, export the expected C ABI, and
read/write Arrow C Data Interface arrays.
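The common mechanism across all of these languages can be illustrated with ctypes: load a shared library and call an exported C-ABI symbol by name. This is a generic sketch, not Daft's loader; libm's cos stands in for an extension's exported entry point.

```python
import ctypes
import ctypes.util

# Locate and load a shared library, then bind an exported C-ABI
# function. Any language that can produce such a library (Rust, C++,
# and others) can be called through the same mechanism.
libm = ctypes.CDLL(ctypes.util.find_library("m"))
libm.cos.restype = ctypes.c_double
libm.cos.argtypes = [ctypes.c_double]

result = libm.cos(0.0)
```

A native Daft extension works the same way at the ABI boundary, except the exported symbols register functions and datatypes with the engine and exchange data as Arrow C Data Interface arrays instead of scalars.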
Daft's own examples include:

- examples/hello: a minimal Rust native extension that registers a greet scalar function.
- examples/dvector: a pgvector-style native extension for vector distance functions such as l2_distance, cosine_distance, inner_product, and jaccard_distance.
- examples/hello_cpp: a pure C++ native extension using Apache Arrow C++ and the raw Daft C ABI.

In short, contributors can build pure-Python extensions with @daft.func, @daft.func.batch, @daft.cls, and @daft.method.batch (as daft-lance does), or native ABI extensions with extension-backed datatypes via DataType.extension(...).