packages/shared-skills/skills/programming/references/python/data-processing.md
NEVER pandas. Polars (with numpy) plus DuckDB. Pandas is 10-50x slower, has weaker types, and the modern Python data ecosystem has moved on.
| Operation | Use | Why |
|---|---|---|
.csv / .parquet / .json direct query | DuckDB | Zero memory load, SQL ergonomics |
.duckdb file | DuckDB | Native format |
| Filter (any size) | Polars | 128x faster than DuckDB for filtering |
| Sort | Polars | 12x faster |
| Multi-table join | DuckDB | 3x faster, more join types |
| Heavy GROUP BY aggregation | DuckDB | 4x faster on large datasets |
| Window function | Polars | 3-5x faster |
| Pivot / melt / string ops | Polars | 2x faster |
| Larger than RAM | Polars streaming or DuckDB out-of-core | Both handle OOM |
| Mixed pipeline | Hybrid (zero-copy via Arrow) | Use each tool's strengths |
For the deep version (per-operation benchmarks, OOM strategies, full execution templates), load the data-scientist skill - it lives in this same skill set and is the source of truth for performance numbers.
import numpy as np
import polars as pl
import duckdb
result = duckdb.sql("""
SELECT category, SUM(amount) AS total
FROM 'data.csv'
WHERE date >= '2026-01-01'
GROUP BY category
ORDER BY total DESC
""").pl() # zero-copy → Polars DataFrame
.pl() returns Polars; .df() would return pandas - never use .df().
result = (
pl.scan_csv("data.csv") # lazy, no read yet
.filter(pl.col("amount") > 1000)
.filter(pl.col("status") == "active")
.sort("amount", descending=True)
.head(100)
.collect() # execute optimised plan
)
scan_* over read_* for files; lazy() then collect() for in-memory frames. Polars optimises the entire plan before execution (predicate pushdown, projection pushdown, common subexpression elimination).
result = (
pl.scan_csv("huge.csv")
.filter(pl.col("active"))
.group_by("category")
.agg([
pl.len().alias("count"),
pl.sum("amount").alias("total"),
])
.collect(streaming=True)
)
# Phase 1: DuckDB for the join (3x faster)
joined = duckdb.sql("""
SELECT o.*, c.region, p.category
FROM 'orders.parquet' o
JOIN 'customers.parquet' c ON o.customer_id = c.id
JOIN 'products.parquet' p ON o.product_id = p.id
""").pl()
# Phase 2: Polars for filtering and transformation (128x + 2x faster)
processed = (
joined
.filter(pl.col("amount") > 100)
.with_columns([
(pl.col("amount") * 1.1).alias("amount_with_tax"),
])
)
# Phase 3: DuckDB for final aggregation (4x faster) - register Polars frame by name
duckdb.register("processed", processed)
final = duckdb.sql("""
SELECT region, category, SUM(amount_with_tax) AS revenue
FROM processed
GROUP BY region, category
ORDER BY revenue DESC
""").pl()
Polars supports schema overrides at read time, and .cast() for explicit conversion. Avoid implicit coercion in hot paths.
schema = {"id": pl.Int64, "amount": pl.Float64, "date": pl.Date}
df = pl.read_csv("data.csv", schema_overrides=schema)
basedpyright understands polars-stubs, which ship with polars itself. No extra type stubs to install.
| pandas | polars |
|---|---|
df.iloc[5] | df.row(5) (named tuple) or df[5] (single-row frame) |
df.loc[df["x"] > 5] | df.filter(pl.col("x") > 5) |
df["x"].apply(fn) | df["x"].map_elements(fn) (slow path) or use native expressions |
df.merge(...) | df.join(other, on="key") |
df.groupby(...).agg(...) | df.group_by(...).agg(...) |
pd.read_csv(...).dtypes | pl.read_csv(...).schema |
df.to_dict("records") | df.to_dicts() |
data-scientist skill (load it for the deep version)