CocoIndex Overview

docs/src/content/docs/getting_started/overview.mdx

CocoIndex is an ultra-performant framework for building data processing pipelines for AI workloads, with built-in incremental processing.

Programming model

CocoIndex uses a declarative, state-driven programming model. You specify what your target should look like as a function of your source data, not how to incrementally update it. CocoIndex handles change detection and applies only the necessary updates automatically.

If you've used React, spreadsheets, or materialized views, this will feel familiar:

  • React: declare UI as a function of state → React re-renders what changed
  • Spreadsheets: declare formulas → cells recompute when inputs change
  • CocoIndex: declare target states as a function of source → CocoIndex syncs what changed
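The "target as a function of source" idea can be sketched in a few lines of plain Python. This is a conceptual illustration only, not the CocoIndex API: `target_of` and `sync` are hypothetical names, and the engine's real change detection is far more fine-grained.

```python
# Conceptual sketch (not the CocoIndex API): the target is declared as a
# pure function of the source, and a sync step applies only the diff.

def target_of(source: dict[str, str]) -> dict[str, str]:
    # Declarative spec: the target is the uppercased view of the source.
    return {key: text.upper() for key, text in source.items()}

def sync(source: dict[str, str], target: dict[str, str]) -> list[str]:
    """Bring `target` in line with `target_of(source)`, touching only changed keys."""
    desired = target_of(source)
    ops: list[str] = []
    for key, value in desired.items():
        if target.get(key) != value:
            target[key] = value          # insert or update only what differs
            ops.append(f"upsert {key}")
    for key in list(target):
        if key not in desired:
            del target[key]              # remove entries no longer derivable
            ops.append(f"delete {key}")
    return ops

target: dict[str, str] = {}
sync({"a": "hello", "b": "world"}, target)         # initial build: two upserts
ops = sync({"a": "hello", "b": "mars"}, target)    # only "b" changed
```

You only ever wrote `target_of`; the sync machinery, which in this sketch is a naive full diff, is what a framework like CocoIndex supplies and optimizes for you.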

CocoIndex features

High-performance Rust 🦀 engine

CocoIndex executes pipelines on a high-performance Rust engine, delivering resilient and scalable data processing.

Easy to code

  • Write simple transformations in Python without learning new DSLs
  • Write batch-style code without worrying about deltas: CocoIndex runs it incrementally in both batch and live mode, continuously updating results. No separate DAGs, operators, or orchestration logic required.

Incremental & low-latency

CocoIndex tracks fine-grained dependencies and only recomputes what changed in the input data or the code. End-to-end updates drop from hours/days to seconds while keeping full correctness.

Full lineage & explainability

Every processing step, intermediate result, and execution path is inspectable. This supports the transparency obligations of the EU AI Act and satisfies enterprise auditability and traceability requirements.

Open integration model

Sources and targets plug in through a standard, open interface (no vendor lock-in). Leverage the full Python ecosystem for models, functions, and libraries.

High throughput + controlled concurrency

Pipelines automatically parallelize with managed concurrency and request batching, reducing GPU cost, RPC fan-out, and end-to-end latency.
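Request batching is easy to see in miniature. The sketch below is illustrative only (it is not how the CocoIndex engine is implemented): per-item calls to an expensive backend, here a stand-in for an embedding model, are grouped into fixed-size batches so the backend is invoked once per batch instead of once per item.

```python
# Illustrative request-batching sketch (not CocoIndex internals).

def embed_batch(texts: list[str]) -> list[list[float]]:
    # Stand-in for one RPC to a model server; call count tracked for the demo.
    embed_batch.calls += 1
    return [[float(len(t))] for t in texts]
embed_batch.calls = 0

def embed_all(texts: list[str], batch_size: int = 8) -> list[list[float]]:
    # Group items into batches: fewer round trips, better GPU utilization.
    out: list[list[float]] = []
    for i in range(0, len(texts), batch_size):
        out.extend(embed_batch(texts[i:i + batch_size]))
    return out

vectors = embed_all([f"doc {i}" for i in range(20)], batch_size=8)
# 20 items with batch_size 8 -> 3 backend calls instead of 20
```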

Fault-tolerant runtime

The engine gracefully retries transient failures and resumes from previous progress after interruptions, eliminating manual backfills and replays.

Low operational overhead

CocoIndex removes the need for elaborate plumbing: refreshing datasets, maintaining state, handling backfills, ensuring correctness, coordinating GPUs, scaling workers, and managing infra are all handled by the engine.

Incremental data processing

CocoIndex continuously maintains and tracks state while processing only new or changed data. It is designed to support incremental processing from day zero.

What incremental processing means:

  • Avoid recomputing work unnecessarily, based on multi-level change detection:
    • Component level: only reprocess source items that changed
    • Function level: within an item's processing, memoize expensive function calls and reuse results when possible
    • Target level: apply only the minimum necessary changes (insertions, updates, deletions) to the target
  • Support multiple mechanisms for capturing source changes (CDC, poll-based) out of the box
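The three levels above can be sketched together in plain Python. This is a conceptual illustration under simplified assumptions, not the CocoIndex engine: unchanged items are skipped by content fingerprint (component level), an expensive per-item function is memoized (function level), and only the minimal upsert/delete operations reach the target (target level).

```python
# Conceptual multi-level change detection (illustration only).
import hashlib

def fingerprint(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()

expensive_calls = 0
memo: dict[str, str] = {}

def expensive_transform(text: str) -> str:
    # Function level: memoized, so an already-seen input is never recomputed.
    global expensive_calls
    if text not in memo:
        expensive_calls += 1
        memo[text] = text[::-1]  # stand-in for a costly model call
    return memo[text]

def process(source: dict[str, str], seen: dict[str, str], target: dict[str, str]) -> list[str]:
    ops: list[str] = []
    for item, text in source.items():
        fp = fingerprint(text)
        if seen.get(item) == fp:
            continue  # component level: item unchanged, skip it entirely
        seen[item] = fp
        new_value = expensive_transform(text)
        if target.get(item) != new_value:  # target level: minimal writes
            target[item] = new_value
            ops.append(f"upsert {item}")
    for item in list(target):
        if item not in source:  # target level: delete what no longer exists
            del target[item]
            seen.pop(item, None)
            ops.append(f"delete {item}")
    return ops

seen: dict[str, str] = {}
target: dict[str, str] = {}
process({"a": "alpha", "b": "beta"}, seen, target)        # full initial build
ops = process({"a": "alpha", "b": "gamma"}, seen, target)  # only "b" reprocessed
```

On the second run, item `a` is skipped at the component level, only `b` flows through the transform, and the target receives a single upsert.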

You write simple batch-style code: no delta logic, no state handling. CocoIndex automatically runs your pipeline incrementally and keeps the output up to date for serving, training, or feature computation.

Next steps