What is Daft?

tutorials/talks_and_demos/linkedin-03-05-2024.ipynb

Python Library

You can install Daft easily with pip:

python
!pip install -U 'daft[iceberg,hudi,deltalake]'
!pip install -U ipywidgets
python
CI = False
python
# Skip this notebook execution in CI because it hits non-public data in AWS
if CI:
    import sys

    sys.exit()

Cloud-Native Dataframe API

Daft reads data from many sources, including cloud object storage and open table formats (Iceberg, Hudi, Delta Lake), into a DataFrame.

See the Daft API Documentation: Input/Output.

python
import daft

ANONYMOUS_IO_CONFIG = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True, region_name="us-west-2"))
python
### Iceberg

from pyiceberg.catalog.glue import GlueCatalog

catalog = GlueCatalog("default")
iceberg_table = catalog.load_table("tpch_iceberg_sf100.lineitem")

ice_df = daft.read_iceberg(iceberg_table)
ice_df.show()
python
### Hudi

hudi_df = daft.read_hudi("s3://daft-public-data/hudi/v6_simplekeygen_nonhivestyle/", io_config=ANONYMOUS_IO_CONFIG)
hudi_df.show()
python
### DeltaLake

delta_df = daft.read_deltalake(
    "s3://daft-public-data/nyc-taxi-dataset-2023-jan-deltalake/", io_config=ANONYMOUS_IO_CONFIG
)
delta_df.show()
python
### Daft also supports reading from many other file sources:
# df = daft.read_csv(...)
# df = daft.read_parquet(...)
# df = daft.read_json(...)

### Read from SQL Databases
# df = daft.read_sql("SELECT * FROM table", "mysql://...")

### Glob a path into files
laion_df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")

laion_df.show(3)

Familiar, powerful relational operations + query optimizer

python
# All the other dataframe operations that you would expect:
#
# 1. df.join(...)
# 2. df.sort(...)
# 3. df.with_column(...)
# 4. df.where(...)

import datetime

ice_df = daft.read_iceberg(iceberg_table)
ice_df = ice_df.where(ice_df["L_SHIPDATE"] < datetime.date(1993, 1, 1))
# Print the unoptimized, optimized, and physical plans
ice_df.explain(show_all=True)
python
ice_df.show()

Complex Data Types/Rust Core

Daft natively represents and operates on complex types such as URLs and images.

These operations are defined in Python, but executed using our Rust core library.

See the Daft Documentation: Expressions.

python
laion_df.show(3)
python
laion_df = laion_df.with_column("data", laion_df["path"].download())  # Utf8 -> Binary
laion_df = laion_df.with_column("image", laion_df["data"].decode_image())  # Binary -> Image
python
laion_df.show(3)

Distributed Execution

Daft runs locally on its native multithreaded runner (the default), or distributed on a Ray cluster.

See the Daft Documentation: Distributed Computing.

python
## Use the Native multithreaded local runner (default behavior)
# daft.set_runner_native()

## Connect to a Ray cluster and use the Ray runner
# daft.set_runner_ray(address="ray://...")