Pandas to Polars Migration Guide

This guide helps you migrate from pandas to Polars with comprehensive operation mappings and key differences.

Core Conceptual Differences

1. No Index System

Pandas: Uses a label-based row index

python
df.loc[0, "column"]
df.iloc[0:5]
df.set_index("id")

Polars: Uses integer positions only

python
df[0, "column"]  # Row position, column name
df[0:5]  # Row slice
# No set_index equivalent - use group_by instead
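
Since there is no index, value-based lookups become filters. A minimal sketch of the common set_index + .loc pattern and its Polars counterpart (the id and value columns here are hypothetical):

python
import polars as pl

df = pl.DataFrame({"id": [10, 20, 30], "value": [1.0, 2.0, 3.0]})

# Pandas: df.set_index("id").loc[20, "value"]
# Polars: filter on the key column instead of indexing into it
row_value = df.filter(pl.col("id") == 20).select("value").item()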

2. Memory Format

Pandas: NumPy-backed (data held in 2D blocks managed internally by a BlockManager)

Polars: Columnar Apache Arrow format

Implications:

  • Polars is faster for column operations
  • Polars uses less memory
  • Polars has better data sharing capabilities

3. Parallelization

Pandas: Primarily single-threaded (parallelism requires external tools such as Dask)

Polars: Parallel by default, using Rust's concurrency

4. Lazy Evaluation

Pandas: Eager evaluation only

Polars: Both eager (DataFrame) and lazy (LazyFrame), with query optimization in the lazy engine
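
A minimal sketch of the two modes: any eager DataFrame can be turned into a LazyFrame with .lazy(), and the plan only runs when .collect() is called.

python
import polars as pl

df = pl.DataFrame({"name": ["a", "b", "c"], "age": [22, 31, 45]})

# Eager: executes immediately
eager = df.filter(pl.col("age") > 25)

# Lazy: builds an optimized query plan, executed only on collect()
lazy_result = (
    df.lazy()
    .filter(pl.col("age") > 25)
    .select("name", "age")
    .collect()
)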

5. Type Strictness

Pandas: Allows silent type conversions

Polars: Strict typing; explicit casts required

Example:

python
# Pandas: Silently converts to float
pd_df["int_col"] = [1, 2, None, 4]  # dtype: float64

# Polars: Keeps as integer with null
pl_df = pl.DataFrame({"int_col": [1, 2, None, 4]})  # dtype: Int64
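
When a conversion is actually wanted, it has to be spelled out with cast(). A minimal sketch (column names are illustrative):

python
import polars as pl

df = pl.DataFrame({"int_col": [1, 2, None, 4]})

# Explicit casts instead of silent coercion
df = df.with_columns(
    as_float=pl.col("int_col").cast(pl.Float64),
    as_str=pl.col("int_col").cast(pl.Utf8),
)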

Operation Mappings

Data Selection

| Operation | Pandas | Polars |
| --- | --- | --- |
| Select column | `df["col"]` or `df.col` | `df.select("col")` or `df["col"]` |
| Select multiple | `df[["a", "b"]]` | `df.select("a", "b")` |
| Select by position | `df.iloc[:, 0:3]` | `df.select(pl.col(df.columns[0:3]))` |
| Select by condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |

Data Filtering

| Operation | Pandas | Polars |
| --- | --- | --- |
| Single condition | `df[df["age"] > 25]` | `df.filter(pl.col("age") > 25)` |
| Multiple conditions | `df[(df["age"] > 25) & (df["city"] == "NY")]` | `df.filter(pl.col("age") > 25, pl.col("city") == "NY")` |
| Query method | `df.query("age > 25")` | `df.filter(pl.col("age") > 25)` |
| isin | `df[df["city"].isin(["NY", "LA"])]` | `df.filter(pl.col("city").is_in(["NY", "LA"]))` |
| isna | `df[df["value"].isna()]` | `df.filter(pl.col("value").is_null())` |
| notna | `df[df["value"].notna()]` | `df.filter(pl.col("value").is_not_null())` |
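
Multiple predicates passed to a single filter() call are combined with AND. A small sketch (columns are hypothetical):

python
import polars as pl

df = pl.DataFrame({
    "age": [22, 31, None, 45],
    "city": ["NY", "LA", "SF", "NY"],
})

# All predicates in one filter() are ANDed together
result = df.filter(
    pl.col("age").is_not_null(),
    pl.col("age") > 25,
    pl.col("city").is_in(["NY", "LA"]),
)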

Adding/Modifying Columns

| Operation | Pandas | Polars |
| --- | --- | --- |
| Add column | `df["new"] = df["old"] * 2` | `df.with_columns(new=pl.col("old") * 2)` |
| Multiple columns | `df.assign(a=..., b=...)` | `df.with_columns(a=..., b=...)` |
| Conditional column | `np.where(condition, a, b)` | `pl.when(condition).then(a).otherwise(b)` |

Important difference - Parallel execution:

python
# Pandas: Sequential (lambda sees previous results)
df.assign(
    a=lambda df_: df_.value * 10,
    b=lambda df_: df_.value * 100
)

# Polars: Parallel (all computed together)
df.with_columns(
    a=pl.col("value") * 10,
    b=pl.col("value") * 100
)

Grouping and Aggregation

| Operation | Pandas | Polars |
| --- | --- | --- |
| Group by | `df.groupby("col")` | `df.group_by("col")` |
| Agg single | `df.groupby("col")["val"].mean()` | `df.group_by("col").agg(pl.col("val").mean())` |
| Agg multiple | `df.groupby("col").agg({"val": ["mean", "sum"]})` | `df.group_by("col").agg(pl.col("val").mean(), pl.col("val").sum())` |
| Size | `df.groupby("col").size()` | `df.group_by("col").agg(pl.len())` |
| Count | `df.groupby("col").count()` | `df.group_by("col").agg(pl.col("*").count())` |
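
Aggregations can also be named directly with keyword arguments, which replaces the pandas dict-of-lists style. A small sketch (column names are illustrative):

python
import polars as pl

df = pl.DataFrame({"col": ["a", "a", "b"], "val": [1.0, 2.0, 3.0]})

# Keyword arguments name the output columns directly
result = df.group_by("col").agg(
    mean_val=pl.col("val").mean(),
    total=pl.col("val").sum(),
    n=pl.len(),
)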

Window Functions

| Operation | Pandas | Polars |
| --- | --- | --- |
| Transform | `df.groupby("col").transform("mean")` | `df.with_columns(pl.col("val").mean().over("col"))` |
| Rank | `df.groupby("col")["val"].rank()` | `df.with_columns(pl.col("val").rank().over("col"))` |
| Shift | `df.groupby("col")["val"].shift(1)` | `df.with_columns(pl.col("val").shift(1).over("col"))` |
| Cumsum | `df.groupby("col")["val"].cumsum()` | `df.with_columns(pl.col("val").cum_sum().over("col"))` |

Joins

| Operation | Pandas | Polars |
| --- | --- | --- |
| Inner join | `df1.merge(df2, on="id")` | `df1.join(df2, on="id", how="inner")` |
| Left join | `df1.merge(df2, on="id", how="left")` | `df1.join(df2, on="id", how="left")` |
| Different keys | `df1.merge(df2, left_on="a", right_on="b")` | `df1.join(df2, left_on="a", right_on="b")` |
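
A minimal join sketch with hypothetical frames; note that when joining with left_on/right_on, the right-hand key column is dropped from the result by default:

python
import polars as pl

customers = pl.DataFrame({"id": [1, 2, 3], "name": ["Ann", "Bob", "Cy"]})
orders = pl.DataFrame({"customer_id": [1, 1, 3], "amount": [10.0, 20.0, 5.0]})

# Left join on different key names
joined = customers.join(orders, left_on="id", right_on="customer_id", how="left")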

Concatenation

| Operation | Pandas | Polars |
| --- | --- | --- |
| Vertical | `pd.concat([df1, df2], axis=0)` | `pl.concat([df1, df2], how="vertical")` |
| Horizontal | `pd.concat([df1, df2], axis=1)` | `pl.concat([df1, df2], how="horizontal")` |

Sorting

| Operation | Pandas | Polars |
| --- | --- | --- |
| Sort by column | `df.sort_values("col")` | `df.sort("col")` |
| Descending | `df.sort_values("col", ascending=False)` | `df.sort("col", descending=True)` |
| Multiple columns | `df.sort_values(["a", "b"])` | `df.sort("a", "b")` |

Reshaping

| Operation | Pandas | Polars |
| --- | --- | --- |
| Pivot | `df.pivot(index="a", columns="b", values="c")` | `df.pivot(values="c", index="a", columns="b")` |
| Melt | `df.melt(id_vars="id")` | `df.unpivot(index="id")` |
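
A small reshaping sketch with hypothetical columns. Note that recent Polars releases renamed the pivot columns argument to on and renamed melt to unpivot, so check the version you are running:

python
import polars as pl

df = pl.DataFrame({
    "a": ["x", "x", "y"],
    "b": ["p", "q", "p"],
    "c": [1, 2, 3],
})

# Long to wide (recent releases use on=; older ones used columns=)
wide = df.pivot(on="b", index="a", values="c")

# Wide back to long (older releases used df.melt)
long = wide.unpivot(index="a", variable_name="b", value_name="c")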

I/O Operations

| Operation | Pandas | Polars |
| --- | --- | --- |
| Read CSV | `pd.read_csv("file.csv")` | `pl.read_csv("file.csv")` or `pl.scan_csv()` |
| Write CSV | `df.to_csv("file.csv")` | `df.write_csv("file.csv")` |
| Read Parquet | `pd.read_parquet("file.parquet")` | `pl.read_parquet("file.parquet")` |
| Write Parquet | `df.to_parquet("file.parquet")` | `df.write_parquet("file.parquet")` |
| Read Excel | `pd.read_excel("file.xlsx")` | `pl.read_excel("file.xlsx")` |

String Operations

| Operation | Pandas | Polars |
| --- | --- | --- |
| Upper | `df["col"].str.upper()` | `df.select(pl.col("col").str.to_uppercase())` |
| Lower | `df["col"].str.lower()` | `df.select(pl.col("col").str.to_lowercase())` |
| Contains | `df["col"].str.contains("pattern")` | `df.filter(pl.col("col").str.contains("pattern"))` |
| Replace | `df["col"].str.replace("old", "new")` | `df.select(pl.col("col").str.replace("old", "new"))` |
| Split | `df["col"].str.split(" ")` | `df.select(pl.col("col").str.split(" "))` |

Datetime Operations

| Operation | Pandas | Polars |
| --- | --- | --- |
| Parse dates | `pd.to_datetime(df["col"])` | `df.select(pl.col("col").str.strptime(pl.Date, "%Y-%m-%d"))` |
| Year | `df["date"].dt.year` | `df.select(pl.col("date").dt.year())` |
| Month | `df["date"].dt.month` | `df.select(pl.col("date").dt.month())` |
| Day | `df["date"].dt.day` | `df.select(pl.col("date").dt.day())` |
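
Parsing and component extraction chain naturally in one pipeline. A minimal sketch with a hypothetical string column:

python
import polars as pl

df = pl.DataFrame({"col": ["2024-01-15", "2024-02-20"]})

# Parse strings to dates, then extract components
df = df.with_columns(
    date=pl.col("col").str.strptime(pl.Date, "%Y-%m-%d")
).with_columns(
    year=pl.col("date").dt.year(),
    month=pl.col("date").dt.month(),
)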

Missing Data

| Operation | Pandas | Polars |
| --- | --- | --- |
| Drop nulls | `df.dropna()` | `df.drop_nulls()` |
| Fill nulls | `df.fillna(0)` | `df.fill_null(0)` |
| Check null | `df["col"].isna()` | `df.select(pl.col("col").is_null())` |
| Forward fill | `df.ffill()` | `df.select(pl.col("col").fill_null(strategy="forward"))` |

Other Operations

| Operation | Pandas | Polars |
| --- | --- | --- |
| Unique values | `df["col"].unique()` | `df["col"].unique()` |
| Value counts | `df["col"].value_counts()` | `df["col"].value_counts()` |
| Describe | `df.describe()` | `df.describe()` |
| Sample | `df.sample(n=100)` | `df.sample(n=100)` |
| Head | `df.head()` | `df.head()` |
| Tail | `df.tail()` | `df.tail()` |

Common Migration Patterns

Pattern 1: Chained Operations

Pandas:

python
result = (df
    .assign(new_col=lambda x: x["old_col"] * 2)
    .query("new_col > 10")
    .groupby("category")
    .agg({"value": "sum"})
    .reset_index()
)

Polars:

python
result = (df
    .with_columns(new_col=pl.col("old_col") * 2)
    .filter(pl.col("new_col") > 10)
    .group_by("category")
    .agg(pl.col("value").sum())
)
# No reset_index needed - Polars doesn't have index

Pattern 2: Apply Functions

Pandas:

python
# Row-wise apply; avoid carrying this pattern into Polars - it breaks parallelization
df["result"] = df["value"].apply(lambda x: x * 2)

Polars:

python
# Use expressions instead
df = df.with_columns(result=pl.col("value") * 2)

# If custom function needed
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2, return_dtype=pl.Float64)
)

Pattern 3: Conditional Column Creation

Pandas:

python
df["category"] = np.where(
    df["value"] > 100,
    "high",
    np.where(df["value"] > 50, "medium", "low")
)

Polars:

python
df = df.with_columns(
    # Use pl.lit() for literal strings; bare strings in then()/otherwise()
    # are interpreted as column names
    category=pl.when(pl.col("value") > 100)
        .then(pl.lit("high"))
        .when(pl.col("value") > 50)
        .then(pl.lit("medium"))
        .otherwise(pl.lit("low"))
)

Pattern 4: Group Transform

Pandas:

python
df["group_mean"] = df.groupby("category")["value"].transform("mean")

Polars:

python
df = df.with_columns(
    group_mean=pl.col("value").mean().over("category")
)

Pattern 5: Multiple Aggregations

Pandas:

python
result = df.groupby("category").agg({
    "value": ["mean", "sum", "count"],
    "price": ["min", "max"]
})

Polars:

python
result = df.group_by("category").agg(
    pl.col("value").mean().alias("value_mean"),
    pl.col("value").sum().alias("value_sum"),
    pl.col("value").count().alias("value_count"),
    pl.col("price").min().alias("price_min"),
    pl.col("price").max().alias("price_max")
)

Performance Anti-Patterns to Avoid

Anti-Pattern 1: Sequential Pipe Operations

Bad (disables parallelization):

python
df = df.pipe(function1).pipe(function2).pipe(function3)

Good (enables parallelization):

python
# function1_result() etc. stand for helpers that each return a pl.Expr;
# passing them together lets Polars evaluate all of them in one parallel pass
df = df.with_columns(
    function1_result(),
    function2_result(),
    function3_result()
)

Anti-Pattern 2: Python Functions in Hot Paths

Bad:

python
df = df.with_columns(
    result=pl.col("value").map_elements(lambda x: x * 2)
)

Good:

python
df = df.with_columns(result=pl.col("value") * 2)

Anti-Pattern 3: Using Eager Reading for Large Files

Bad:

python
df = pl.read_csv("large_file.csv")
result = df.filter(pl.col("age") > 25).select("name", "age")

Good:

python
lf = pl.scan_csv("large_file.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age").collect()
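
To confirm the optimizer is doing useful work here, the lazy plan can be inspected before collecting; explain() prints the optimized plan, where the filter should appear pushed down into the CSV scan:

python
import polars as pl

lf = pl.scan_csv("large_file.csv")
query = lf.filter(pl.col("age") > 25).select("name", "age")

# Inspect the optimized plan (predicate/projection pushdown) before running it
print(query.explain())
result = query.collect()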

Anti-Pattern 4: Row Iteration

Bad:

python
for row in df.iter_rows():
    # Process row
    pass

Good:

python
# Use vectorized expressions instead of per-row Python code
# (price and quantity are hypothetical columns)
df = df.with_columns(
    total=pl.col("price") * pl.col("quantity")
)

Migration Checklist

When migrating from pandas to Polars (a combined before/after example follows the checklist):

  1. Remove index operations - Use integer positions or group_by
  2. Replace apply/map with expressions - Use Polars native operations
  3. Update column assignment - Use with_columns() instead of direct assignment
  4. Change groupby.transform to .over() - Window functions work differently
  5. Update string operations - Use .str.to_uppercase() instead of .str.upper()
  6. Add explicit type casts - Polars won't silently convert types
  7. Consider lazy evaluation - Use scan_* instead of read_* for large data
  8. Update aggregation syntax - More explicit in Polars
  9. Remove reset_index calls - Not needed in Polars
  10. Update conditional logic - Use when().then().otherwise() pattern
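
A minimal before/after sketch applying several of these checklist items at once (with_columns, when/then/otherwise with pl.lit, a window via .over(), no reset_index). Column names are hypothetical:

python
import numpy as np
import pandas as pd
import polars as pl

# Pandas version
pd_df = pd.DataFrame({"category": ["a", "b", "a"], "value": [5.0, 60.0, 120.0]})
pd_df["label"] = np.where(pd_df["value"] > 100, "high", "low")
pd_df["group_mean"] = pd_df.groupby("category")["value"].transform("mean")
pd_result = pd_df[pd_df["value"] > 10].reset_index(drop=True)

# Polars version
pl_df = pl.DataFrame({"category": ["a", "b", "a"], "value": [5.0, 60.0, 120.0]})
pl_result = (
    pl_df.with_columns(
        label=pl.when(pl.col("value") > 100)
            .then(pl.lit("high"))
            .otherwise(pl.lit("low")),
        group_mean=pl.col("value").mean().over("category"),
    )
    .filter(pl.col("value") > 10)
)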

Compatibility Layer

For gradual migration, you can use both libraries:

python
import pandas as pd
import polars as pl
import pyarrow as pa

# Convert pandas to Polars
pl_df = pl.from_pandas(pd_df)

# Convert Polars to pandas
pd_df = pl_df.to_pandas()

# Go through Arrow explicitly (zero-copy when the dtypes allow it)
pl_df = pl.from_arrow(pa.Table.from_pandas(pd_df))
pd_df = pl_df.to_arrow().to_pandas()

When to Stick with Pandas

Consider staying with pandas when:

  • Working with time series requiring complex index operations
  • Need extensive ecosystem support (some libraries only support pandas)
  • Team lacks Rust/Polars expertise
  • Data is small and performance isn't critical
  • Using advanced pandas features without Polars equivalents

When to Switch to Polars

Switch to Polars when:

  • Performance is critical
  • Working with large datasets (>1GB)
  • Need lazy evaluation and query optimization
  • Want better type safety
  • Need parallel execution by default
  • Starting a new project