What is Daft?

tutorials/intro.ipynb

Python Library

You can install Daft easily with pip:

python
!pip install daft

Cloud-Native Dataframe API

Daft reads data from many sources, including cloud object storage, directly into a DataFrame.

See (Daft API Documentation: Input/Output)

python
import daft

daft.set_planning_config(default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True)))

# Glob a path and return file listing as a Dataframe
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

# Daft also supports reading from many other sources:
# df = daft.read_csv(...)
# df = daft.read_parquet(...)
# df = daft.read_json(...)
# df = daft.read_iceberg(...)
python
df.show(3)
python
# All the other dataframe operations that you would expect:
#
# 1. df.join(...)
# 2. df.sort(...)
# 3. df.with_column(...)
# 4. df.where(...)

Complex Data Types/Rust Core

Daft supports representing and performing operations on complex types such as URLs and images natively.

These operations are defined in Python, but executed using our Rust core library.

See (Daft Documentation: Expressions)

python
df = df.with_column("data", df["path"].download())  # Utf8 -> Binary
df = df.with_column("image", df["data"].decode_image())  # Binary -> Image
python
df.show(3)

Distributed Execution

Daft can run locally on its native multithreaded runner (the default), or distributed on a Ray cluster.

See (Daft Documentation: Distributed Computing)

python
## Use the Native multithreaded local runner (default behavior)
# daft.set_runner_native()

## Connect to a Ray cluster and use the Ray runner
# daft.set_runner_ray(address="ray://...")

Intelligent Optimizations

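Daft builds queries lazily and optimizes the query plan before running it. In the example below, only two columns are selected from a Parquet dataset, so Daft only needs to read those two columns.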

python
df = daft.read_parquet("s3://daft-public-data/benchmarking/lineitem-parquet/")
df = df.select(df["L_ORDERKEY"], df["L_DISCOUNT"])
python
# df.explain()
# df.explain(show_all=True)
python
%%time

df.show()