docs/optimization/memory.md
Daft is a streaming execution engine where data flows through the pipeline defined by your query. Operators process data in bounded batches, but some steps can inflate data or materialize large intermediate results. When those steps outgrow available memory, workloads slow down, spill to disk, or fail with out-of-memory (OOM) errors. This page walks through the most common sources of memory pressure and the tuning levers you can use to keep pipelines stable.
The majority of OOM incidents we observe come from a handful of patterns: memory-hungry user-defined functions (UDFs), high-concurrency remote downloads, and operations that expand data, such as decoding or exploding. Knowing which of these your query contains points you to the right tuning parameters.
For memory-intensive UDFs:

- Reduce the [batch_size][daft.udf.UDF.batch_size] argument on the UDF so each invocation handles less data at once.
- Limit [concurrency][daft.udf.UDF.concurrency] if each invocation is memory hungry. Fewer concurrent tasks often outperform repeated worker restarts caused by OOM kills.

For remote reads (HTTP, S3, GCS, etc.), reduce the `max_connections` parameter when using [download()][daft.functions.download]. Lower download concurrency can keep memory bounded while still allowing downstream operators to consume data as it arrives.
Daft parallelizes work according to batch size. If the default batch size is too large, explicitly insert [df.into_batches(...)][daft.DataFrame.into_batches] ahead of decoding, exploding, or other expansion steps. Smaller batches prevent any single task from growing beyond memory limits.
!!! warning "Experimental Feature"

    Execution configuration parameters are experimental and subject to change between releases. Use with caution in production environments.
Advanced [execution parameters][daft.context.set_execution_config] in `daft.context` expose finer-grained controls (e.g. default batch sizing, target partition sizes, operator-specific buffers). Record any overrides you rely on and review them during upgrades.
If tuning alone is insufficient and workloads continue to fail, we may be under-estimating required memory, or you may have discovered a gap in our heuristics. Please reach out on Slack or open a GitHub issue so we can help investigate.