tutorials/intro.ipynb
You can install Daft easily with pip:
!pip install daft
You can read data from many sources (including cloud object storage) into a DataFrame.
See the Daft API documentation on Input/Output.
import daft

# Access the public S3 bucket anonymously (no AWS credentials required)
daft.set_planning_config(default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True)))
# Glob a path and return the file listing as a DataFrame
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")
# Daft also supports reading from many other sources:
# df = daft.read_csv(...)
# df = daft.read_parquet(...)
# df = daft.read_json(...)
# df = daft.read_iceberg(...) # <Coming Soon!>
df.show(3)
# Daft supports all the other DataFrame operations that you would expect (see the sketch below):
#
# 1. df.join(...)
# 2. df.sort(...)
# 3. df.with_column(...)
# 4. df.where(...)
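For example, here is a minimal sketch of these operations on small in-memory DataFrames (the data and column names are illustrative, not part of the tutorial dataset):
people = daft.from_pydict({"id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
scores = daft.from_pydict({"id": [1, 2, 3], "score": [0.9, 0.5, 0.7]})
sketch = people.join(scores, on="id")                      # join on a key column
sketch = sketch.with_column("pct", sketch["score"] * 100)  # derive a new column
sketch = sketch.where(sketch["pct"] > 60)                  # filter rows
sketch = sketch.sort(sketch["pct"], desc=True)             # sort descending
sketch.show()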
Daft natively supports representing and operating on complex types such as URLs and images.
These operations are defined in Python, but executed using our Rust core library.
See the Daft documentation on Expressions.
df = df.with_column("data", df["path"].download()) # Utf8 -> Binary
df = df.with_column("image", df["data"].decode_image()) # Binary -> Image
df.show(3)
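Decoded images are just another column, so further image expressions compose on top of them. As a minimal sketch (the thumbnail dimensions are illustrative):
df = df.with_column("thumbnail", df["image"].image.resize(32, 32))  # Image -> Image
df.show(3)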
Daft can execute locally on a native multithreaded runner, or distributed across a Ray cluster.
See the Daft documentation on Distributed Computing.
## Use the Native multithreaded local runner (default behavior)
# daft.set_runner_native()
## Connect to a Ray cluster and use the Ray runner
# daft.set_runner_ray(address="ray://...")
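The runner should be set once, at the start of a session, before any DataFrames are created. A minimal sketch of connecting to Ray via the fully-qualified daft.context path (the address is a placeholder for your cluster's head node; omitting it starts a local Ray instance), left commented so this notebook keeps running locally:
# daft.context.set_runner_ray(address="ray://127.0.0.1:10001")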
Daft's query optimizer automatically speeds up your queries. For example, because the query below selects only two columns, Daft can push the projection into the Parquet scan and avoid reading the rest of the table.
df = daft.read_parquet("s3://daft-public-data/benchmarking/lineitem-parquet/")
df = df.select(df["L_ORDERKEY"], df["L_DISCOUNT"])
# Inspect the query plan (show_all=True also prints the optimized logical and physical plans):
# df.explain()
# df.explain(show_all=True)
%%time
df.show()
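Filters benefit from the same optimizer: as a minimal sketch, the predicate below (the 0.05 discount threshold is illustrative) can be pushed down into the scan alongside the projection, rather than applied after a full read:
%%time
df.where(df["L_DISCOUNT"] > 0.05).show()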