Back to Claude Scientific Skills

Genomic Interval Operations

scientific-skills/polars-bio/references/interval_operations.md

2.38.012.0 KB
Original Source

Genomic Interval Operations

Overview

polars-bio provides 8 core operations for genomic interval arithmetic. All operations work on Polars DataFrames or LazyFrames containing genomic intervals (columns: chrom, start, end by default) and return a LazyFrame by default. Pass output_type="polars.DataFrame" for eager results.

Operations Summary

OperationInputsDescription
overlaptwo DataFramesFind pairs of overlapping intervals
count_overlapstwo DataFramesCount overlaps per interval in the first set
nearesttwo DataFramesFind nearest intervals between two sets
mergeone DataFrameMerge overlapping/bookended intervals
clusterone DataFrameAssign cluster IDs to overlapping intervals
coveragetwo DataFramesCompute per-interval coverage counts
complementone DataFrame + genomeFind gaps between intervals
subtracttwo DataFramesRemove overlapping portions

overlap

Find pairs of overlapping intervals between two DataFrames.

Functional API

python
import polars as pl
import polars_bio as pb

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1"],
    "start": [1, 5, 22],
    "end":   [6, 9, 30],
})

df2 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [3, 25],
    "end":   [8, 28],
})

# Returns LazyFrame by default
result_lf = pb.overlap(df1, df2, suffixes=("_1", "_2"))
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, suffixes=("_1", "_2"), output_type="polars.DataFrame")

Method-Chaining API (LazyFrame only)

python
result = df1.lazy().pb.overlap(df2, suffixes=("_1", "_2")).collect()

Parameters

ParameterTypeDefaultDescription
df1DataFrame/LazyFrame/strrequiredFirst (probe) interval set
df2DataFrame/LazyFrame/strrequiredSecond (build) interval set
suffixestuple[str, str]("_1", "_2")Suffixes for overlapping column names
on_colslist[str]NoneAdditional columns to join on (beyond genomic coords)
cols1list[str]["chrom", "start", "end"]Column names in df1
cols2list[str]["chrom", "start", "end"]Column names in df2
algorithmstr"Coitrees"Interval algorithm
low_memoryboolFalseLow memory mode
output_typestr"polars.LazyFrame"Output format: "polars.LazyFrame", "polars.DataFrame", "pandas.DataFrame"
projection_pushdownboolTrueEnable projection pushdown optimization

Output Schema

Returns columns from both inputs with suffixes applied:

  • chrom_1, start_1, end_1 (from df1)
  • chrom_2, start_2, end_2 (from df2)
  • Any additional columns from df1 and df2

Column dtypes are String for chrom and Int64 for start/end.

count_overlaps

Count the number of overlapping intervals from df2 for each interval in df1.

python
# Functional
counts = pb.count_overlaps(df1, df2)

# Method-chaining (LazyFrame)
counts = df1.lazy().pb.count_overlaps(df2)

Parameters

ParameterTypeDefaultDescription
df1DataFrame/LazyFrame/strrequiredQuery interval set
df2DataFrame/LazyFrame/strrequiredTarget interval set
suffixestuple[str, str]("", "_")Suffixes for column names
cols1list[str]["chrom", "start", "end"]Column names in df1
cols2list[str]["chrom", "start", "end"]Column names in df2
on_colslist[str]NoneAdditional join columns
output_typestr"polars.LazyFrame"Output format
naive_queryboolTrueUse naive query strategy
projection_pushdownboolTrueEnable projection pushdown

Output Schema

Returns df1 columns with an additional count column (Int64).

nearest

Find the nearest interval in df2 for each interval in df1.

python
# Find nearest (default: k=1, any direction)
nearest = pb.nearest(df1, df2, output_type="polars.DataFrame")

# Find k nearest
nearest = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals from results
nearest = pb.nearest(df1, df2, overlap=False)

# Without distance column
nearest = pb.nearest(df1, df2, distance=False)

Parameters

ParameterTypeDefaultDescription
df1DataFrame/LazyFrame/strrequiredQuery interval set
df2DataFrame/LazyFrame/strrequiredTarget interval set
suffixestuple[str, str]("_1", "_2")Suffixes for column names
on_colslist[str]NoneAdditional join columns
cols1list[str]["chrom", "start", "end"]Column names in df1
cols2list[str]["chrom", "start", "end"]Column names in df2
kint1Number of nearest neighbors to find
overlapboolTrueInclude overlapping intervals in results
distanceboolTrueInclude distance column in output
output_typestr"polars.LazyFrame"Output format
projection_pushdownboolTrueEnable projection pushdown

Output Schema

Returns columns from both DataFrames (with suffixes) plus a distance column (Int64) with the distance to the nearest interval (0 if overlapping). Distance column is omitted if distance=False.

merge

Merge overlapping and bookended intervals within a single DataFrame.

python
import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr1", "chr2"],
    "start": [1, 4, 20, 1],
    "end":   [6, 9, 30, 10],
})

# Functional
merged = pb.merge(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
merged = df.lazy().pb.merge().collect()

# Merge intervals within a minimum distance
merged = pb.merge(df, min_dist=10)

Parameters

ParameterTypeDefaultDescription
dfDataFrame/LazyFrame/strrequiredInterval set to merge
min_distint0Minimum distance between intervals to merge (0 = must overlap or be bookended)
colslist[str]["chrom", "start", "end"]Column names
on_colslist[str]NoneAdditional grouping columns
output_typestr"polars.LazyFrame"Output format
projection_pushdownboolTrueEnable projection pushdown

Output Schema

ColumnTypeDescription
chromStringChromosome
startInt64Merged interval start
endInt64Merged interval end
n_intervalsInt64Number of intervals merged

cluster

Assign cluster IDs to overlapping intervals. Intervals that overlap are assigned the same cluster ID.

python
# Functional
clustered = pb.cluster(df, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
clustered = df.lazy().pb.cluster().collect()

# With minimum distance
clustered = pb.cluster(df, min_dist=5)

Parameters

ParameterTypeDefaultDescription
dfDataFrame/LazyFrame/strrequiredInterval set
min_distint0Minimum distance for clustering
colslist[str]["chrom", "start", "end"]Column names
output_typestr"polars.LazyFrame"Output format
projection_pushdownboolTrueEnable projection pushdown

Output Schema

Returns the original columns plus:

ColumnTypeDescription
clusterInt64Cluster ID (intervals in the same cluster overlap)
cluster_startInt64Start of the cluster extent
cluster_endInt64End of the cluster extent

coverage

Compute per-interval coverage counts. This is a two-input operation: for each interval in df1, count the coverage from df2.

python
# Functional
cov = pb.coverage(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
cov = df1.lazy().pb.coverage(df2).collect()

Parameters

ParameterTypeDefaultDescription
df1DataFrame/LazyFrame/strrequiredQuery intervals
df2DataFrame/LazyFrame/strrequiredCoverage source intervals
suffixestuple[str, str]("_1", "_2")Suffixes for column names
on_colslist[str]NoneAdditional join columns
cols1list[str]["chrom", "start", "end"]Column names in df1
cols2list[str]["chrom", "start", "end"]Column names in df2
output_typestr"polars.LazyFrame"Output format
projection_pushdownboolTrueEnable projection pushdown

Output Schema

Returns columns from df1 plus a coverage column (Int64).

complement

Find gaps between intervals within a genome. Requires a genome definition specifying chromosome sizes.

python
import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [100, 500],
    "end":   [200, 600],
})

genome = pl.DataFrame({
    "chrom": ["chr1"],
    "start": [0],
    "end":   [1000],
})

# Functional
gaps = pb.complement(df, view_df=genome, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
gaps = df.lazy().pb.complement(genome).collect()

Parameters

ParameterTypeDefaultDescription
dfDataFrame/LazyFrame/strrequiredInterval set
view_dfDataFrame/LazyFrameNoneGenome with chrom, start, end defining chromosome extents
colslist[str]["chrom", "start", "end"]Column names in df
view_colslist[str]NoneColumn names in view_df
output_typestr"polars.LazyFrame"Output format
projection_pushdownboolTrueEnable projection pushdown

Output Schema

Returns a DataFrame with chrom (String), start (Int64), end (Int64) columns representing gaps between intervals.

subtract

Remove portions of intervals in df1 that overlap with intervals in df2.

python
# Functional
result = pb.subtract(df1, df2, output_type="polars.DataFrame")

# Method-chaining (LazyFrame)
result = df1.lazy().pb.subtract(df2).collect()

Parameters

ParameterTypeDefaultDescription
df1DataFrame/LazyFrame/strrequiredIntervals to subtract from
df2DataFrame/LazyFrame/strrequiredIntervals to subtract
cols1list[str]["chrom", "start", "end"]Column names in df1
cols2list[str]["chrom", "start", "end"]Column names in df2
output_typestr"polars.LazyFrame"Output format
projection_pushdownboolTrueEnable projection pushdown

Output Schema

Returns chrom (String), start (Int64), end (Int64) representing the remaining portions of df1 intervals after subtraction.

Performance Considerations

Probe-Build Architecture

Two-input operations (overlap, nearest, count_overlaps, coverage, subtract) use a probe-build join:

  • Probe (first DataFrame): Iterated over, row by row
  • Build (second DataFrame): Indexed into an interval tree for fast lookup

For best performance, pass the larger DataFrame as the probe (first argument) and the smaller one as the build (second argument).

Parallelism

By default, polars-bio uses a single execution partition. For large datasets, enable parallel execution:

python
import os
import polars_bio as pb

pb.set_option("datafusion.execution.target_partitions", os.cpu_count())

Streaming Execution

DataFusion streaming is enabled by default for interval operations. Data is processed in batches, enabling out-of-core computation for datasets larger than available RAM.

When to Use Lazy Evaluation

Use scan_* functions and lazy DataFrames for:

  • Files larger than available RAM
  • When only a subset of results is needed
  • Pipeline operations where intermediate results can be optimized away
python
# Lazy pipeline
lf1 = pb.scan_bed("large1.bed")
lf2 = pb.scan_bed("large2.bed")
result = pb.overlap(lf1, lf2).collect()