Batch Inference

Run prompts, embeddings, and model scoring over large datasets, then stream the results to durable storage. Daft is a reliable engine for expressing batch inference pipelines and scaling them from your laptop to a distributed cluster.

When to use Daft for batch inference

If you’re new to Daft, see the quickstart first. For distributed execution, see our docs on Scaling Out and Deployment.

Core idea

Daft provides first-class APIs for model inference. Under the hood, Daft pipelines data operations so that reading, inference, and writing overlap automatically, keeping throughput high.

Example: Prompt GPT-5 with OpenAI

```python
import daft
from daft.functions import prompt

(
    daft.read_huggingface("fka/awesome-chatgpt-prompts")
    .with_column(  # Generate model outputs in a new column
        "output",
        prompt(
            daft.col("prompt"),
            model="gpt-5",           # Any chat/completions-capable model
            provider="openai",       # Switch providers by changing this, e.g. to "vllm"
            max_output_tokens=256,   # The OpenAI provider uses the Responses API by default
        ),
    )
    .write_parquet("output.parquet/", write_mode="overwrite")  # Write to Parquet as the pipeline runs
)
```

What this does:

  • Uses prompt() to express inference.
  • Streams rows through OpenAI concurrently while reading from Hugging Face and writing to Parquet.
  • Requires no explicit async, batching, rate limiting, or retry code in your script.

Example: Local text embedding with LM Studio

```python
import daft
from daft.ai.provider import load_provider
from daft.functions.ai import embed_text

provider = load_provider("lm_studio")
model = "text-embedding-nomic-embed-text-v1.5"

(
    daft.read_huggingface("Open-Orca/OpenOrca")
    .with_column("embedding", embed_text(daft.col("response"), provider=provider, model=model))
    .show()
)
```

Notes:

  • LM Studio is a local AI model platform that lets you run large language models like Qwen, Mistral, Gemma, or gpt-oss on your own machine. Using Daft with LM Studio, you can run inference with any local model and take advantage of accelerators like Apple's Metal Performance Shaders (MPS).
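
To persist the embeddings instead of previewing them, swap .show() for a streaming write. A minimal sketch, assuming LM Studio is running locally with the model loaded; the output path is a placeholder:

```python
import daft
from daft.ai.provider import load_provider
from daft.functions.ai import embed_text

provider = load_provider("lm_studio")
model = "text-embedding-nomic-embed-text-v1.5"

(
    daft.read_huggingface("Open-Orca/OpenOrca")
    .with_column("embedding", embed_text(daft.col("response"), provider=provider, model=model))
    .write_parquet("embeddings.parquet/", write_mode="overwrite")  # placeholder output path
)
```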

Scaling out on Ray

Turn on distributed execution with a single line; then run the same script on a Ray cluster.

```python
import daft

daft.set_runner_ray()  # Enable Daft's distributed runner
```

Daft partitions the data, schedules remote execution, and orchestrates your workload across the cluster. No pipeline rewrites.
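
If you already have a cluster running, you can point the runner at it instead of starting Ray locally. A minimal sketch, where the Ray Client address is a placeholder for your cluster's endpoint:

```python
import daft

# Connect to an existing Ray cluster via Ray Client (placeholder address).
# Ray Client endpoints typically take the form ray://<head-node-host>:10001.
daft.set_runner_ray(address="ray://head-node:10001")
```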

Patterns that work well

  • Read → Preprocess → Infer → Write: Daft parallelizes and pipelines these stages automatically to maximize throughput and resource utilization (see the sketch after this list).
  • Provider-agnostic pipelines: Switch between OpenAI and local LLMs by changing a single parameter.
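
Putting the first pattern together, a minimal sketch with hypothetical paths and column names; only the provider argument would change to target a local backend:

```python
import daft
from daft.functions import prompt

(
    daft.read_parquet("s3://my-bucket/reviews/")     # Read (hypothetical path)
    .where(daft.col("text").str.length() > 0)        # Preprocess: drop empty rows
    .with_column(                                    # Infer
        "summary",
        prompt(
            daft.col("text"),
            model="gpt-5",
            provider="openai",  # change to a local provider, e.g. "vllm", to swap backends
        ),
    )
    .write_parquet("s3://my-bucket/summaries/")      # Write (hypothetical path)
)
```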

Case Studies

For inspiration and real-world scale:

Next Steps

Ready to explore Daft further? Check out these topics: