benchmarking/asof_join/README.md
Adds a self-contained benchmarking suite for Daft's join_asof operation:
- data_generation.py: generates reproducible left/right parquet datasets at three scales (small, medium, large) with clustered timestamps and a Zipf-skewed entity distribution, written to benchmarking/data/asof_join/
- benchmark.py: runs a single asof-join using Daft's native or Ray runner, wrapped in a memray memory tracker, and prints a JSON result with the wall time and the memray output path
- cluster.yaml: Ray cluster config for AWS (1 m7i.large head + 4 r7i.4xlarge workers) for distributed runs against S3

From inside benchmarking/asof_join/:
1. Generate data (one-time)
python data_generation.py --scale small
# or --scale medium / --scale large / --all
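For readers unfamiliar with the data shape, the following is a minimal stdlib-only sketch of what "clustered timestamps and Zipf-skewed entity distribution" can look like. It is not the code in data_generation.py: the function names, the exponent of 1.2, the one-day window, and the 60-second spread are all assumptions chosen for illustration.

```python
import random
from itertools import accumulate


def zipf_weights(n_entities, exponent=1.2):
    """Unnormalized Zipf weights: the entity at 1-based rank k gets 1 / k**exponent."""
    return [1.0 / (rank ** exponent) for rank in range(1, n_entities + 1)]


def generate_rows(n_rows, n_entities, n_clusters, seed):
    """Return (entity_id, timestamp_seconds) tuples (hypothetical schema)."""
    rng = random.Random(seed)  # local RNG: the same seed reproduces the same rows
    cum_weights = list(accumulate(zipf_weights(n_entities)))

    # Cluster centers spread over an assumed one-day window, in seconds.
    centers = [rng.uniform(0, 86_400) for _ in range(n_clusters)]

    # Low-numbered entities are drawn far more often than high-numbered ones.
    entities = rng.choices(range(n_entities), cum_weights=cum_weights, k=n_rows)

    rows = []
    for entity in entities:
        center = rng.choice(centers)
        # Timestamps bunch around a center with an assumed 60-second spread.
        rows.append((entity, rng.gauss(center, 60.0)))
    return rows


if __name__ == "__main__":
    sample = generate_rows(n_rows=10, n_entities=5, n_clusters=2, seed=0)
    for entity, ts in sample:
        print(entity, round(ts, 1))
```

Skewed keys and bunched timestamps stress an asof join more than uniform data does, because a few entities own most of the rows and many rows compete for nearby matches.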
2. Run locally (native runner)
python benchmark.py --scale small
# Output: asof_join_memray.bin + JSON result on stdout
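The flow described above (run one asof join, track memory, print a JSON result) can be sketched roughly as follows. This is an illustration under stated assumptions, not the contents of benchmark.py: the pure-Python asof_join is a stand-in for Daft's join_asof and assumes backward matching (latest right row at or before the left timestamp), and the JSON field names and the run_benchmark and memory_tracker helpers are hypothetical. memray.Tracker is only used when a path is passed and memray is installed.

```python
import bisect
import contextlib
import json
import time


def asof_join(left, right):
    """Stand-in backward asof join.

    left:  iterable of (key, ts)
    right: iterable of (key, ts, value)
    Each left row gets the value of the latest right row with the same key
    and a timestamp <= the left timestamp, or None if there is no such row.
    """
    by_key = {}
    for key, ts, value in sorted(right, key=lambda r: (r[0], r[1])):
        times, values = by_key.setdefault(key, ([], []))
        times.append(ts)
        values.append(value)

    out = []
    for key, ts in left:
        times, values = by_key.get(key, ([], []))
        i = bisect.bisect_right(times, ts)  # count of right rows with time <= ts
        out.append((key, ts, values[i - 1] if i else None))
    return out


def memory_tracker(path):
    """Return memray.Tracker(path) if memray is importable, else a no-op context."""
    try:
        import memray
    except ImportError:
        return contextlib.nullcontext()
    return memray.Tracker(path)


def run_benchmark(left, right, memray_path=None):
    """Time one join and return a JSON-serializable result (hypothetical fields)."""
    tracker = memory_tracker(memray_path) if memray_path else contextlib.nullcontext()
    start = time.perf_counter()
    with tracker:
        rows = asof_join(left, right)
    elapsed = time.perf_counter() - start
    return {"wall_time_s": elapsed, "rows": len(rows), "memray_output": memray_path}


if __name__ == "__main__":
    left = [("a", 5), ("a", 1), ("b", 3)]
    right = [("a", 2, "x"), ("a", 4, "y"), ("b", 9, "z")]
    print(json.dumps(run_benchmark(left, right)))
```

Passing memray_path="asof_join_memray.bin" would write a profile that memray flamegraph can read; note that memray refuses to overwrite an existing output file, so remove the old .bin before a rerun.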
3. Inspect memory profile
memray flamegraph asof_join_memray.bin
Run on a Ray cluster
Before running: update DATA_ROOT to your S3 bucket and uncomment daft.set_runner_ray() in benchmark.py. Also update the S3 bucket and IAM settings in cluster.yaml.
Spin up the cluster (the paths below are relative to the repository root):
ray up benchmarking/asof_join/cluster.yaml
Forward the dashboard in one terminal:
ray dashboard benchmarking/asof_join/cluster.yaml
Submit the job in another terminal (after updating DATA_ROOT and uncommenting daft.set_runner_ray() in benchmark.py):
ray job submit --address "http://localhost:8265" --working-dir benchmarking/asof_join -- python benchmark.py --scale small
Tear down when done:
ray down benchmarking/asof_join/cluster.yaml