Extensions

Daft extensions are reusable libraries that add domain-specific functionality on top of Daft without requiring every specialized function, datatype, integration, or workflow to live directly in Daft core.

There are two broad ways to build extensions:

  • Python UDF-based extensions, written in pure Python on top of Daft's UDF APIs
  • Native ABI extensions, compiled shared libraries built on the Arrow C Data Interface

Both models can expose clean Python APIs that feel like ordinary Daft expression functions. The implementation details can stay hidden behind functions, classes, and expressions.

Python UDF-Based Extensions

The fastest way to build a Daft extension is often pure Python. Using @daft.func, @daft.func.batch, @daft.cls, and @daft.method.batch, contributors can package reusable Python logic that runs inside Daft's distributed execution engine.

This is ideal for libraries that orchestrate existing Python ecosystems, external services, ML models, GPUs, data maintenance tasks, or domain-specific workflows. These extensions do not require a native shared library or the Arrow C ABI. They are normal Python packages that expose higher-level Daft APIs.

daft-lance is an example of this model. It extends Daft with Lance-specific distributed operations such as file compaction, scalar indexing, column merging, and REST catalog operations. Internally, it uses Daft's Python UDF and class-UDF APIs to distribute Lance tasks across a Daft query, while users interact with functions like compact_files, create_scalar_index, and merge_columns_df.

This pattern also exists inside Daft itself. The daft.functions.ai module exposes high-level functions like prompt, embed_text, embed_image, classify_text, and classify_image. To users, these look like normal Daft expression functions. Under the hood, they use Daft's UDF and class-UDF machinery, including batching, concurrency controls, retries, and GPU resource hints.

The file APIs follow a similar product pattern. daft.File values and helpers like file, audio_file, video_file, file_path, and file_size give users expression-level building blocks for working with files. Those file objects can then flow into Python UDFs, model pipelines, and domain-specific libraries without users needing to think about execution details.

Native ABI Extensions

For contributors who need lower-level, vectorized performance, Daft also supports native extensions through a C ABI based on the Arrow C Data Interface.

Native ABI extensions can add high-performance scalar functions, aggregate functions, Python expression wrappers, and extension-backed datatypes. They ship as pip-installable Python packages that bundle a native shared library. Users import the package, load it into a Daft Session, and call ordinary Python wrappers in their DataFrame expressions.

Rust currently has the most ergonomic SDK through daft-ext, while C++ is demonstrated through the raw ABI. Other systems languages are possible if they can produce a shared library, export the expected C ABI, and read/write Arrow C Data Interface arrays.

Daft's own examples include:

  • examples/hello: a minimal Rust native extension that registers a greet scalar function.
  • examples/dvector: a pgvector-style native extension for vector distance functions such as l2_distance, cosine_distance, inner_product, and jaccard_distance.
  • examples/hello_cpp: a pure C++ native extension using Apache Arrow C++ and the raw Daft C ABI.
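For intuition about what the dvector functions compute, here is a plain-Python reference for two of the distances. The semantics shown here follow the usual pgvector definitions; the native extension operates on Arrow arrays rather than Python lists.

```python
import math


def l2_distance(a, b):
    # Euclidean distance between two equal-length vectors.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    # 1 - cosine similarity: 0.0 for parallel vectors, 1.0 for orthogonal ones.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # → 5.0
```

A native ABI implementation would compute the same values vectorized over Arrow C Data Interface buffers instead of looping in Python.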

What Can Contributors Build?

Contributors can build:

  • Pure-Python UDF extension libraries using @daft.func, @daft.func.batch, @daft.cls, and @daft.method.batch
  • Distributed task libraries like daft-lance
  • GPU-backed model inference extensions
  • AI, file, media, and multimodal processing libraries that hide UDFs behind expression APIs
  • Native scalar functions
  • Native aggregate functions / UDAFs
  • Python expression wrappers
  • Extension-backed logical datatypes such as DataType.extension(...)

Where to Start