Back to Claude Scientific Skills

Migrating from bioframe to polars-bio

scientific-skills/polars-bio/references/bioframe_migration.md

2.38.07.2 KB
Original Source

Migrating from bioframe to polars-bio

Overview

polars-bio is a drop-in replacement for bioframe's core interval operations, offering 6.5-38x speedups on real-world genomic benchmarks. The main differences are: Polars DataFrames instead of pandas, a Rust/DataFusion backend instead of pure Python, streaming support for large genomes, and LazyFrame returns by default.

Operation Mapping

bioframepolars-bioNotes
bioframe.overlap(df1, df2)pb.overlap(df1, df2)Returns LazyFrame; .collect() for DataFrame
bioframe.closest(df1, df2)pb.nearest(df1, df2)Renamed; uses k, overlap, distance params
bioframe.count_overlaps(df1, df2)pb.count_overlaps(df1, df2)Default suffixes differ: ("", "_") vs bioframe's
bioframe.merge(df)pb.merge(df)Output includes n_intervals column
bioframe.cluster(df)pb.cluster(df)Output cols: cluster, cluster_start, cluster_end
bioframe.coverage(df1, df2)pb.coverage(df1, df2)Two-input in both libraries
bioframe.complement(df, chromsizes)pb.complement(df, view_df=genome)Genome as DataFrame, not Series
bioframe.subtract(df1, df2)pb.subtract(df1, df2)Same semantics

Key API Differences

DataFrames: pandas vs Polars

bioframe (pandas):

python
import bioframe
import pandas as pd

df1 = pd.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end":   [5, 20],
})

result = bioframe.overlap(df1, df2)
# result is a pandas DataFrame
result["start_1"]  # pandas column access

polars-bio (Polars):

python
import polars_bio as pb
import polars as pl

df1 = pl.DataFrame({
    "chrom": ["chr1", "chr1"],
    "start": [1, 10],
    "end":   [5, 20],
})

result = pb.overlap(df1, df2)  # Returns LazyFrame
result_df = result.collect()   # Materialize to DataFrame
result_df.select("start_1")   # Polars column access

Return Types: LazyFrame by Default

All polars-bio operations return a LazyFrame by default. Use .collect() or output_type="polars.DataFrame":

python
# bioframe: always returns DataFrame
result = bioframe.overlap(df1, df2)

# polars-bio: returns LazyFrame, collect for DataFrame
result_lf = pb.overlap(df1, df2)
result_df = result_lf.collect()

# Or get DataFrame directly
result_df = pb.overlap(df1, df2, output_type="polars.DataFrame")

Genome/Chromsizes

bioframe:

python
chromsizes = bioframe.fetch_chromsizes("hg38")  # Returns pandas Series
complement = bioframe.complement(df, chromsizes)

polars-bio:

python
genome = pl.DataFrame({
    "chrom": ["chr1", "chr2"],
    "start": [0, 0],
    "end":   [248956422, 242193529],
})
complement = pb.complement(df, view_df=genome)

closest vs nearest

bioframe:

python
result = bioframe.closest(df1, df2)

polars-bio:

python
# Basic nearest
result = pb.nearest(df1, df2)

# Find k nearest neighbors
result = pb.nearest(df1, df2, k=3)

# Exclude overlapping intervals
result = pb.nearest(df1, df2, overlap=False)

# Without distance column
result = pb.nearest(df1, df2, distance=False)

Method-Chaining (polars-bio only)

polars-bio adds a .pb accessor on LazyFrame for method chaining:

python
# bioframe: sequential function calls
merged = bioframe.merge(bioframe.overlap(df1, df2))

# polars-bio: fluent pipeline (must use LazyFrame)
# Note: overlap adds suffixes, so rename before merge
merged = (
    df1.lazy()
    .pb.overlap(df2)
    .select(
        pl.col("chrom_1").alias("chrom"),
        pl.col("start_1").alias("start"),
        pl.col("end_1").alias("end"),
    )
    .pb.merge()
    .collect()
)

Performance Comparison

Benchmarks on real-world genomic datasets (from the polars-bio paper, Bioinformatics 2025):

Operationbioframepolars-bioSpeedup
overlap1.0x6.5x6.5x
nearest1.0x38x38x
merge1.0x8.2x8.2x
coverage1.0x12x12x

Speedups come from:

  • Rust-based interval tree implementation
  • Apache DataFusion query engine
  • Apache Arrow columnar memory format
  • Parallel execution (when configured)
  • Streaming/out-of-core support

Migration Code Examples

Example 1: Basic Overlap Pipeline

Before (bioframe):

python
import bioframe
import pandas as pd

df1 = pd.read_csv("peaks.bed", sep="\t", names=["chrom", "start", "end"])
df2 = pd.read_csv("genes.bed", sep="\t", names=["chrom", "start", "end", "name"])

overlaps = bioframe.overlap(df1, df2, suffixes=("_peak", "_gene"))
filtered = overlaps[overlaps["start_gene"] > 10000]
merged = bioframe.merge(filtered[["chrom_peak", "start_peak", "end_peak"]]
    .rename(columns={"chrom_peak": "chrom", "start_peak": "start", "end_peak": "end"}))

After (polars-bio):

python
import polars_bio as pb
import polars as pl

df1 = pb.read_bed("peaks.bed")
df2 = pb.read_bed("genes.bed")

overlaps = pb.overlap(df1, df2, suffixes=("_peak", "_gene"), output_type="polars.DataFrame")
filtered = overlaps.filter(pl.col("start_gene") > 10000)
merged = pb.merge(
    filtered.select(
        pl.col("chrom_peak").alias("chrom"),
        pl.col("start_peak").alias("start"),
        pl.col("end_peak").alias("end"),
    ),
    output_type="polars.DataFrame",
)

Example 2: Large-Scale Streaming

Before (bioframe) — limited to in-memory:

python
import bioframe
import pandas as pd

# Must load entire file into memory
df1 = pd.read_csv("huge_intervals.bed", sep="\t", names=["chrom", "start", "end"])
result = bioframe.merge(df1)  # Memory-bound

After (polars-bio) — streaming:

python
import polars_bio as pb

# Lazy scan, streaming execution
lf = pb.scan_bed("huge_intervals.bed")
result = pb.merge(lf).collect(streaming=True)

pandas Compatibility Mode

For gradual migration, install with pandas support:

bash
pip install polars-bio[pandas]

This enables conversion between pandas and Polars DataFrames:

python
import polars_bio as pb
import polars as pl

# Convert pandas DataFrame to Polars for polars-bio
polars_df = pl.from_pandas(pandas_df)
result = pb.overlap(polars_df, other_df).collect()

# Convert back to pandas if needed
pandas_result = result.to_pandas()

# Or request pandas output directly
pandas_result = pb.overlap(polars_df, other_df, output_type="pandas.DataFrame")

Migration Checklist

  1. Replace import bioframe with import polars_bio as pb
  2. Replace import pandas as pd with import polars as pl
  3. Convert DataFrame creation from pd.DataFrame to pl.DataFrame
  4. Replace bioframe.closest with pb.nearest
  5. Add .collect() after operations (they return LazyFrame by default)
  6. Update column access from df["col"] to df.select("col") or pl.col("col")
  7. Replace pandas filtering df[df["col"] > x] with df.filter(pl.col("col") > x)
  8. Update chromsizes from Series to DataFrame with chrom, start, end; pass as view_df=
  9. Add pb.set_option("datafusion.execution.target_partitions", N) for parallelism
  10. Replace pd.read_csv for BED files with pb.read_bed or pb.scan_bed
  11. Note cluster output column is cluster (not cluster_id), plus cluster_start, cluster_end
  12. Note merge output includes n_intervals column