
Polars Core Concepts

scientific-skills/polars/references/core_concepts.md


Expressions

Expressions are the foundation of Polars' API. They are composable units that describe data transformations without executing them immediately.

What are Expressions?

An expression describes a transformation on data. It only materializes (executes) within specific contexts:

  • select() - Select and transform columns
  • with_columns() - Add or modify columns
  • filter() - Filter rows
  • group_by().agg() - Aggregate data

Expression Syntax

Basic column reference:

python
pl.col("column_name")

Computed expressions:

python
# Arithmetic
pl.col("height") * 2
pl.col("price") + pl.col("tax")

# With alias
(pl.col("weight") / (pl.col("height") ** 2)).alias("bmi")

# Method chaining
pl.col("name").str.to_uppercase().str.slice(0, 3)

Expression Contexts

Select context:

python
df.select(
    "name",  # Simple column name
    pl.col("age"),  # Expression
    (pl.col("age") * 12).alias("age_in_months")  # Computed expression
)

With_columns context:

python
df.with_columns(
    age_doubled=pl.col("age") * 2,
    name_upper=pl.col("name").str.to_uppercase()
)

Filter context:

python
df.filter(
    pl.col("age") > 25,
    pl.col("city").is_in(["NY", "LA", "SF"])
)

Group_by context:

python
df.group_by("department").agg(
    pl.col("salary").mean(),
    pl.col("employee_id").count()
)

Expression Expansion

Apply operations to multiple columns at once:

All columns:

python
df.select(pl.all() * 2)

Pattern matching:

python
# All columns ending with "_value"
df.select(pl.col("^.*_value$") * 100)

# All numeric columns (pl.NUMERIC_DTYPES was removed in Polars 1.0;
# use the selectors module instead)
import polars.selectors as cs
df.select(cs.numeric() + 1)

Exclude patterns:

python
df.select(pl.all().exclude("id", "name"))

Expression Composition

Expressions can be stored and reused:

python
# Define reusable expressions
age_expression = pl.col("age") * 12
name_expression = pl.col("name").str.to_uppercase()

# Use in multiple contexts
df.select(age_expression, name_expression)
df.with_columns(age_months=age_expression)

Data Types

Polars has a strict type system based on Apache Arrow.

Core Data Types

Numeric:

  • Int8, Int16, Int32, Int64 - Signed integers
  • UInt8, UInt16, UInt32, UInt64 - Unsigned integers
  • Float32, Float64 - Floating point numbers

Text:

  • Utf8 / String - UTF-8 encoded strings
  • Categorical - Categorized strings (low cardinality)
  • Enum - Fixed set of string values

Temporal:

  • Date - Calendar date (no time)
  • Datetime - Date and time with optional timezone
  • Time - Time of day
  • Duration - Time duration/difference

Boolean:

  • Boolean - True/False values

Nested:

  • List - Variable-length lists
  • Array - Fixed-length arrays
  • Struct - Nested record structures

Other:

  • Binary - Binary data
  • Object - Python objects (avoid in production)
  • Null - Null type

Type Casting

Convert between types explicitly:

python
# Cast to different type
df.select(
    pl.col("age").cast(pl.Float64),
    pl.col("date_string").str.strptime(pl.Date, "%Y-%m-%d"),
    pl.col("id").cast(pl.Utf8)
)

Null Handling

Polars uses consistent null handling across all types:

Check for nulls:

python
df.filter(pl.col("value").is_null())
df.filter(pl.col("value").is_not_null())

Fill nulls:

python
pl.col("value").fill_null(0)
pl.col("value").fill_null(strategy="forward")
pl.col("value").fill_null(strategy="backward")
pl.col("value").fill_null(strategy="mean")

Drop nulls:

python
df.drop_nulls()  # Drop any row with nulls
df.drop_nulls(subset=["col1", "col2"])  # Drop rows with nulls in specific columns

Categorical Data

Use categorical types for string columns with low cardinality (repeated values):

python
# Cast to categorical
df.with_columns(
    pl.col("category").cast(pl.Categorical)
)

# Benefits:
# - Reduced memory usage
# - Faster grouping and joining
# - Optional ordering (physical insertion order by default, or lexical)
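
Categorical encodings are built per column, so combining or joining categoricals created independently triggers an expensive re-mapping. Building them under a shared string cache avoids this; a sketch:

```python
import polars as pl

df1 = pl.DataFrame({"cat": ["a", "b"]})
df2 = pl.DataFrame({"cat": ["b", "c"]})

# Construct both categorical columns under one shared string cache
# so their physical encodings are compatible for the join
with pl.StringCache():
    left = df1.with_columns(pl.col("cat").cast(pl.Categorical))
    right = df2.with_columns(pl.col("cat").cast(pl.Categorical))
    joined = left.join(right, on="cat", how="inner")

print(joined["cat"].to_list())  # ['b']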

Lazy vs Eager Evaluation

Polars supports two execution modes: eager (DataFrame) and lazy (LazyFrame).

Eager Evaluation (DataFrame)

Operations execute immediately:

python
import polars as pl

# DataFrame operations execute right away
df = pl.read_csv("data.csv")  # Reads file immediately
result = df.filter(pl.col("age") > 25)  # Filters immediately
final = result.select("name", "age")  # Selects immediately

When to use eager:

  • Small datasets that fit in memory
  • Interactive exploration in notebooks
  • Simple one-off operations
  • Immediate feedback needed

Lazy Evaluation (LazyFrame)

Operations build a query plan that is optimized before execution:

python
import polars as pl

# LazyFrame operations build a query plan
lf = pl.scan_csv("data.csv")  # Doesn't read yet
lf2 = lf.filter(pl.col("age") > 25)  # Adds to plan
lf3 = lf2.select("name", "age")  # Adds to plan
df = lf3.collect()  # NOW executes optimized plan

When to use lazy:

  • Large datasets
  • Complex query pipelines
  • Only need subset of data
  • Performance is critical
  • Streaming required

Query Optimization

Polars automatically optimizes lazy queries:

Predicate Pushdown: Filter operations are pushed down to the data source when possible:

python
# Only reads rows where age > 25 from CSV
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).collect()

Projection Pushdown: Only the needed columns are read from the data source:

python
# Only reads "name" and "age" columns from CSV
lf = pl.scan_csv("data.csv")
result = lf.select("name", "age").collect()

Query Plan Inspection:

python
# View the optimized query plan
lf = pl.scan_csv("data.csv")
result = lf.filter(pl.col("age") > 25).select("name", "age")
print(result.explain())  # Shows optimized plan

Streaming Mode

Process data larger than memory:

python
# Enable streaming for very large datasets
lf = pl.scan_csv("very_large.csv")
result = lf.filter(pl.col("age") > 25).collect(engine="streaming")
# Older Polars versions used .collect(streaming=True)

Streaming benefits:

  • Process data larger than RAM
  • Lower peak memory usage
  • Chunk-based processing
  • Automatic memory management

Streaming limitations:

  • Not all operations support streaming
  • May be slower for small data
  • Some operations require materializing entire dataset

Converting Between Eager and Lazy

Eager to Lazy:

python
df = pl.read_csv("data.csv")
lf = df.lazy()  # Convert to LazyFrame

Lazy to Eager:

python
lf = pl.scan_csv("data.csv")
df = lf.collect()  # Execute and return DataFrame

Memory Format

Polars uses Apache Arrow columnar memory format:

Benefits:

  • Zero-copy data sharing with other Arrow libraries
  • Efficient columnar operations
  • SIMD vectorization
  • Reduced memory overhead
  • Fast serialization

Implications:

  • Data stored column-wise, not row-wise
  • Column operations very fast
  • Random row access slower than pandas
  • Best for analytical workloads

Parallelization

Polars parallelizes operations automatically using Rust's concurrency:

What gets parallelized:

  • Aggregations within groups
  • Window functions
  • Most expression evaluations
  • File reading (multiple files)
  • Join operations

What to avoid for parallelization:

  • Python user-defined functions (UDFs)
  • Lambda functions in .map_elements()
  • Sequential .pipe() chains

Best practice:

python
# Good: Stays in expression API (parallelized)
df.with_columns(
    pl.col("value") * 10,
    pl.col("value").log(),
    pl.col("value").sqrt()
)

# Bad: Uses Python function (sequential)
df.with_columns(
    pl.col("value").map_elements(lambda x: x * 10)
)

Strict Type System

Polars enforces strict typing:

No silent conversions:

python
# This will error - can't mix types
# df.with_columns(pl.col("int_col") + "string")

# Must cast explicitly
df.with_columns(
    pl.col("int_col").cast(pl.Utf8) + "_suffix"
)

Benefits:

  • Prevents silent bugs
  • Predictable behavior
  • Better performance
  • Clearer code intent

Integer nulls: Unlike pandas, integer columns can have nulls without converting to float:

python
# In pandas: Int column with null becomes Float
# In polars: Int column with null stays Int (with null values)
df = pl.DataFrame({"int_col": [1, 2, None, 4]})
# dtype: Int64 (not Float64)