tutorials/talks_and_demos/linkedin-03-05-2024.ipynb
You can install Daft easily with pip:
!pip install -U 'daft[iceberg,hudi,deltalake]'
!pip install -U ipywidgets
CI = False
# Skip this notebook execution in CI because it hits non-public data in AWS
if CI:
    import sys
    sys.exit()
You can easily read from various data sources (including cloud object storage and open table formats) into a DataFrame.
See the Daft API Documentation: Input/Output.
import daft
ANONYMOUS_IO_CONFIG = daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True, region_name="us-west-2"))
### Iceberg
from pyiceberg.catalog.glue import GlueCatalog
catalog = GlueCatalog("default")
iceberg_table = catalog.load_table("tpch_iceberg_sf100.lineitem")
ice_df = daft.read_iceberg(iceberg_table)
ice_df.show()
### Hudi
hudi_df = daft.read_hudi("s3://daft-public-data/hudi/v6_simplekeygen_nonhivestyle/", io_config=ANONYMOUS_IO_CONFIG)
hudi_df.show()
### DeltaLake
delta_df = daft.read_deltalake(
    "s3://daft-public-data/nyc-taxi-dataset-2023-jan-deltalake/", io_config=ANONYMOUS_IO_CONFIG
)
delta_df.show()
### Daft also supports reading from many other file sources:
# df = daft.read_csv(...)
# df = daft.read_parquet(...)
# df = daft.read_json(...)
### Read from SQL Databases
# df = daft.read_sql("SELECT * FROM table", "mysql://...")
### Glob a path into files
laion_df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")
laion_df.show(3)
# All the other dataframe operations that you would expect:
#
# 1. df.join(...)
# 2. df.sort(...)
# 3. df.with_column(...)
# 4. df.where(...)
import datetime
ice_df = daft.read_iceberg(iceberg_table)
ice_df = ice_df.where(ice_df["L_SHIPDATE"] < datetime.date(1993, 1, 1))
ice_df.explain(True)
ice_df.show()
Daft supports representing and performing operations on complex types such as URLs and images natively.
These operations are defined in Python, but executed using our Rust core library.
See the Daft Documentation: Expressions.
laion_df.show(3)
laion_df = laion_df.with_column("data", laion_df["path"].url.download())  # Utf8 -> Binary
laion_df = laion_df.with_column("image", laion_df["data"].image.decode())  # Binary -> Image
laion_df.show(3)
Daft can execute locally on a native multithreaded runner, or distributed on a Ray cluster.
See the Daft Documentation: Distributed Computing.
## Use the Native multithreaded local runner (default behavior)
# daft.set_runner_native()
## Connect to a Ray cluster and use the Ray runner
# daft.set_runner_ray(address="ray://...")