Back to Daft

or --scale medium / --scale large / --all

benchmarking/asof_join/README.md

0.7.141.7 KB
Original Source

Summary

Adds a self-contained benchmarking suite for Daft's join_asof operation:

  • data_generation.py — generates reproducible left/right parquet datasets at three scales (small, medium, large) with clustered timestamps and Zipf-skewed entity distribution, written to benchmarking/data/asof_join/
  • benchmark.py — runs a single asof-join using Daft's native or Ray runner, wrapped in a memray memory tracker, and prints a JSON result with wall time and memray output path
  • cluster.yaml — Ray cluster config for AWS (1 m7i.large head + 4 r7i.4xlarge workers) for distributed runs against S3

How to run

From inside benchmarking/asof_join/:

1. Generate data (one-time)

bash
python data_generation.py --scale small
# or --scale medium / --scale large / --all

2. Run locally (native runner)

bash
python benchmark.py --scale small
# Output: asof_join_memray.bin + JSON result on stdout

3. Inspect memory profile

bash
memray flamegraph asof_join_memray.bin

Run on a Ray cluster

Before running: update DATA_ROOT to your S3 bucket and uncomment daft.set_runner_ray() in benchmark.py. Also update the S3 bucket and IAM settings in cluster.yaml.

Spin up the cluster:

bash
ray up benchmarking/asof_join/cluster.yaml

Forward the dashboard in one terminal:

bash
ray dashboard benchmarking/asof_join/cluster.yaml

Submit the job in another (after updating DATA_ROOT and daft.set_runner_ray() in benchmark.py):

bash
ray job submit --address "http://localhost:8265" --working-dir benchmarking/asof_join -- python benchmark.py --scale small

Tear down when done:

bash
ray down benchmarking/asof_join/cluster.yaml