tutorials/intro.ipynb
You can install Daft easily with pip:
!pip install daft
You can read data from many sources (including cloud object storage) into a DataFrame.
See the Daft API documentation on Input/Output.
import daft

# Access the public S3 bucket anonymously (no AWS credentials required)
daft.set_planning_config(default_io_config=daft.io.IOConfig(s3=daft.io.S3Config(anonymous=True)))
# Glob a path and return the file listing as a DataFrame
df = daft.from_glob_path("s3://daft-public-data/laion-sample-images/*")
# Daft also supports reading from many other sources:
# df = daft.read_csv(...)
# df = daft.read_parquet(...)
# df = daft.read_json(...)
# df = daft.read_iceberg(...) # <Coming Soon!>
df.show(3)
# Daft supports all the other DataFrame operations that you would expect (see the sketch below):
#
# 1. df.join(...)
# 2. df.sort(...)
# 3. df.with_column(...)
# 4. df.where(...)
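For example, here is a minimal sketch of these operations on small in-memory DataFrames (the data and column names are illustrative, not part of the tutorial dataset):
people = daft.from_pydict({"id": [1, 2, 3], "name": ["ann", "bob", "cat"]})
scores = daft.from_pydict({"id": [1, 2, 3], "score": [0.9, 0.5, 0.7]})
sketch = people.join(scores, on="id")                      # join on a key column
sketch = sketch.with_column("pct", sketch["score"] * 100)  # derive a new column
sketch = sketch.where(sketch["pct"] > 60)                  # filter rows
sketch = sketch.sort(sketch["pct"], desc=True)             # sort descending
sketch.show()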
Daft natively supports representing and operating on complex types such as URLs and images.
These operations are defined in Python, but executed using our Rust core library.
See the Daft documentation on Expressions.
df = df.with_column("data", df["path"].download()) # Utf8 -> Binary
df = df.with_column("image", df["data"].decode_image()) # Binary -> Image
df.show(3)
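Decoded images are just another column, so further image expressions compose on top of them. As a minimal sketch (the thumbnail dimensions are illustrative):
df = df.with_column("thumbnail", df["image"].image.resize(32, 32))  # Image -> Image
df.show(3)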
Daft can execute locally on a native multithreaded runner, or distributed across a Ray cluster.
See the Daft documentation on Distributed Computing.
## Use the Native multithreaded local runner (default behavior)
# daft.set_runner_native()
## Connect to a Ray cluster and use the Ray runner
# daft.set_runner_ray(address="ray://...")
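The runner should be set once, at the start of a session, before any DataFrames are created. A minimal sketch of connecting to Ray via the fully-qualified daft.context path (the address is a placeholder for your cluster's head node; omitting it starts a local Ray instance), left commented so this notebook keeps running locally:
# daft.context.set_runner_ray(address="ray://127.0.0.1:10001")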
Daft's query optimizer automatically speeds up your queries. For example, because the query below selects only two columns, Daft can push the projection into the Parquet scan and avoid reading the rest of the table.
df = daft.read_parquet("s3://daft-public-data/benchmarking/lineitem-parquet/")
df = df.select(df["L_ORDERKEY"], df["L_DISCOUNT"])
# Inspect the query plan (show_all=True also prints the optimized logical and physical plans):
# df.explain()
# df.explain(show_all=True)
%%time
df.show()
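Filters benefit from the same optimizer: as a minimal sketch, the predicate below (the 0.05 discount threshold is illustrative) can be pushed down into the scan alongside the projection, rather than applied after a full read:
%%time
df.where(df["L_DISCOUNT"] > 0.05).show()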