scientific-skills/polars-bio/references/pileup_operations.md
polars-bio provides the pb.depth() function for computing per-base or per-block read depth from BAM/CRAM files. It uses CIGAR-aware depth calculation to accurately account for insertions, deletions, and clipping. Returns a LazyFrame by default.
Compute read depth from alignment files.
import polars_bio as pb
# Compute depth across entire BAM file (returns LazyFrame)
depth_lf = pb.depth("aligned.bam")
depth_df = depth_lf.collect()
# Get DataFrame directly
depth_df = pb.depth("aligned.bam", output_type="polars.DataFrame")
| Parameter | Type | Default | Description |
|---|---|---|---|
path | str | required | Path to BAM or CRAM file |
filter_flag | int | 1796 | SAM flag filter (default excludes unmapped, secondary, duplicate, QC-fail) |
min_mapping_quality | int | 0 | Minimum mapping quality to include reads |
binary_cigar | bool | True | Use binary CIGAR for faster processing |
dense_mode | str | "auto" | Dense output mode |
use_zero_based | bool | None | Coordinate system (None = use global setting) |
per_base | bool | False | Per-base depth (True) vs block depth (False) |
output_type | str | "polars.LazyFrame" | Output format: "polars.LazyFrame", "polars.DataFrame", "pandas.DataFrame" |
When per_base=False (default), adjacent positions with the same depth are grouped into blocks:
| Column | Type | Description |
|---|---|---|
contig | String | Chromosome/contig name |
pos_start | Int64 | Block start position |
pos_end | Int64 | Block end position |
coverage | Int16 | Read depth |
When per_base=True, each position is reported individually:
| Column | Type | Description |
|---|---|---|
contig | String | Chromosome/contig name |
pos | Int64 | Position |
coverage | Int16 | Read depth at position |
The default filter_flag=1796 excludes reads with these SAM flags:
pb.depth() correctly handles CIGAR operations:
import polars_bio as pb
import polars as pl
# Compute depth genome-wide (block mode)
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
# Summary statistics
depth.select(
pl.col("coverage").cast(pl.Int64).mean().alias("mean_depth"),
pl.col("coverage").cast(pl.Int64).median().alias("median_depth"),
pl.col("coverage").cast(pl.Int64).max().alias("max_depth"),
)
import polars_bio as pb
# Per-base depth (one row per position)
depth = pb.depth("sample.bam", per_base=True, output_type="polars.DataFrame")
import polars_bio as pb
# Only count well-mapped reads
depth = pb.depth(
"sample.bam",
min_mapping_quality=20,
output_type="polars.DataFrame",
)
import polars_bio as pb
# Only exclude unmapped (4) and duplicate (1024) reads
depth = pb.depth(
"sample.bam",
filter_flag=4 + 1024,
output_type="polars.DataFrame",
)
Depth results can be used with polars-bio interval operations. Note that depth output uses contig/pos_start/pos_end column names, so use cols parameters to map them:
import polars_bio as pb
import polars as pl
# Compute depth
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
# Rename columns to match interval operation conventions
depth_intervals = depth.rename({
"contig": "chrom",
"pos_start": "start",
"pos_end": "end",
})
# Find regions with adequate coverage
adequate = depth_intervals.filter(pl.col("coverage") >= 30)
# Merge adjacent adequate-coverage blocks
merged = pb.merge(adequate, output_type="polars.DataFrame")
# Find gaps in coverage (complement)
genome = pl.DataFrame({
"chrom": ["chr1"],
"start": [0],
"end": [248956422],
})
gaps = pb.complement(adequate, view_df=genome, output_type="polars.DataFrame")
import polars_bio as pb
depth = pb.depth("sample.bam", output_type="polars.DataFrame")
targets = pb.read_bed("targets.bed")
# Use cols1 to specify depth column names
overlapping = pb.overlap(
depth, targets,
cols1=["contig", "pos_start", "pos_end"],
output_type="polars.DataFrame",
)