# Bioinformatics File I/O

scientific-skills/polars-bio/references/file_io.md

## Overview

polars-bio provides `read_*`, `scan_*`, `write_*`, and `sink_*` functions for common bioinformatics formats. `read_*` loads data eagerly into a DataFrame, while `scan_*` creates a LazyFrame for streaming/out-of-core processing. `write_*` writes from a DataFrame or LazyFrame and returns a row count, while `sink_*` streams from a LazyFrame.

## Supported Formats

| Format | Read | Scan | Register (SQL) | Write | Sink |
|---|---|---|---|---|---|
| BED | `read_bed` | `scan_bed` | `register_bed` | | |
| VCF | `read_vcf` | `scan_vcf` | `register_vcf` | `write_vcf` | `sink_vcf` |
| BAM | `read_bam` | `scan_bam` | `register_bam` | `write_bam` | `sink_bam` |
| CRAM | `read_cram` | `scan_cram` | `register_cram` | `write_cram` | `sink_cram` |
| GFF | `read_gff` | `scan_gff` | `register_gff` | | |
| GTF | `read_gtf` | `scan_gtf` | `register_gtf` | | |
| FASTA | `read_fasta` | `scan_fasta` | | | |
| FASTQ | `read_fastq` | `scan_fastq` | `register_fastq` | `write_fastq` | `sink_fastq` |
| SAM | `read_sam` | `scan_sam` | `register_sam` | `write_sam` | `sink_sam` |
| Hi-C pairs | `read_pairs` | `scan_pairs` | `register_pairs` | | |
| Generic table | `read_table` | `scan_table` | | | |

## Common Cloud/IO Parameters

All `read_*` and `scan_*` functions share these parameters (instead of a single `storage_options` dict):

| Parameter | Type | Default | Description |
|---|---|---|---|
| `path` | str | required | File path (local, S3, GCS, Azure) |
| `chunk_size` | int | 8 | Number of chunks for parallel reading |
| `concurrent_fetches` | int | 1 | Number of concurrent fetches for cloud storage |
| `allow_anonymous` | bool | True | Allow anonymous access to cloud storage |
| `enable_request_payer` | bool | False | Enable requester-pays for cloud storage |
| `max_retries` | int | 5 | Maximum retries for cloud operations |
| `timeout` | int | 300 | Timeout in seconds for cloud operations |
| `compression_type` | str | "auto" | Compression type (auto-detected from extension) |
| `projection_pushdown` | bool | True | Enable projection pushdown optimization |
| `use_zero_based` | bool | None | Set coordinate system metadata (None = use global setting) |

Not all functions support every parameter: SAM functions lack the cloud parameters, and FASTA/FASTQ lack `predicate_pushdown`.

## BED Format

### `read_bed` / `scan_bed`

Read BED files. Columns are auto-detected (BED3 through BED12). BED files use 0-based half-open coordinates; polars-bio attaches coordinate metadata automatically.

```python
import polars_bio as pb

# Eager read
df = pb.read_bed("regions.bed")

# Lazy scan
lf = pb.scan_bed("regions.bed")
```

### Column Schema (BED3)

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Chromosome name |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |

Extended BED fields (auto-detected) add: `name`, `score`, `strand`, `thickStart`, `thickEnd`, `itemRgb`, `blockCount`, `blockSizes`, `blockStarts`.
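Because BED is 0-based half-open while formats such as GFF/GTF are 1-based inclusive, converting between the two conventions only shifts the start coordinate. A plain-Python sketch of the arithmetic (illustrative only, unrelated to how `use_zero_based` is implemented internally):

```python
def bed_to_one_based(start: int, end: int) -> tuple[int, int]:
    """Convert a 0-based half-open BED interval to 1-based inclusive."""
    return start + 1, end

def one_based_to_bed(start: int, end: int) -> tuple[int, int]:
    """Convert a 1-based inclusive interval to 0-based half-open BED."""
    return start - 1, end

# The first 100 bases of a chromosome in each convention:
print(bed_to_one_based(0, 100))  # → (1, 100)
print(one_based_to_bed(1, 100))  # → (0, 100)
```

In both conventions the interval length is `end - start` for BED and `end - start + 1` for 1-based inclusive coordinates.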

## VCF Format

### `read_vcf` / `scan_vcf`

Read VCF/BCF files. Supports `.vcf`, `.vcf.gz`, and `.bcf`.

```python
import polars_bio as pb

# Read VCF
df = pb.read_vcf("variants.vcf.gz")

# Read with specific INFO and FORMAT fields extracted as columns
df = pb.read_vcf("variants.vcf.gz", info_fields=["AF", "DP"], format_fields=["GT", "GQ"])

# Read specific samples
df = pb.read_vcf("variants.vcf.gz", samples=["SAMPLE1", "SAMPLE2"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `info_fields` | list[str] | None | INFO fields to extract as columns |
| `format_fields` | list[str] | None | FORMAT fields to extract as columns |
| `samples` | list[str] | None | Samples to include |
| `predicate_pushdown` | bool | True | Enable predicate pushdown |

### Column Schema

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Chromosome |
| `start` | UInt32 | Start position |
| `end` | UInt32 | End position |
| `id` | String | Variant ID |
| `ref` | String | Reference allele |
| `alt` | String | Alternate allele(s) |
| `qual` | Float32 | Quality score |
| `filter` | String | Filter status |
| `info` | String | INFO field (raw, unless info_fields specified) |
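When `info_fields` is not given, the `info` column arrives as the raw semicolon-delimited string from the VCF record. A minimal stdlib sketch of splitting it yourself (illustrative only; passing `info_fields` lets polars-bio do this for you):

```python
def parse_info(info: str) -> dict:
    """Split a raw VCF INFO string into a dict; flag entries map to True."""
    fields = {}
    for entry in info.split(";"):
        if not entry:
            continue
        key, _, value = entry.partition("=")
        fields[key] = value if value else True
    return fields

print(parse_info("AF=0.5;DP=30;DB"))  # → {'AF': '0.5', 'DP': '30', 'DB': True}
```

This could be applied to the `info` column with a Polars `map_elements` expression, though extracting fields at read time via `info_fields` avoids the extra pass.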

### `write_vcf` / `sink_vcf`

```python
import polars_bio as pb

# Write DataFrame to VCF
rows_written = pb.write_vcf(df, "output.vcf")

# Stream LazyFrame to VCF
pb.sink_vcf(lf, "output.vcf")
```

## BAM Format

### `read_bam` / `scan_bam`

Read aligned sequencing reads from BAM files. Requires a `.bai` index file.

```python
import polars_bio as pb

# Read BAM
df = pb.read_bam("aligned.bam")

# Scan BAM (streaming)
lf = pb.scan_bam("aligned.bam")

# Read with specific tags
df = pb.read_bam("aligned.bam", tag_fields=["NM", "MD"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tag_fields` | list[str] | None | SAM tags to extract as columns |
| `predicate_pushdown` | bool | True | Enable predicate pushdown |
| `infer_tag_types` | bool | True | Infer tag column types from data |
| `infer_tag_sample_size` | int | 100 | Number of records to sample for type inference |
| `tag_type_hints` | list[str] | None | Explicit type hints for tags |

### Column Schema

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Reference sequence name |
| `start` | Int64 | Alignment start position |
| `end` | Int64 | Alignment end position |
| `name` | String | Read name |
| `flags` | UInt32 | SAM flags |
| `mapping_quality` | UInt32 | Mapping quality |
| `cigar` | String | CIGAR string |
| `sequence` | String | Read sequence |
| `quality_scores` | String | Base quality string |
| `mate_chrom` | String | Mate reference name |
| `mate_start` | Int64 | Mate start position |
| `template_length` | Int64 | Template length |
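The `flags` column is a bitfield defined by the SAM specification. A small stdlib helper to decode the most common bits (the bit names come from the SAM spec, not from a polars-bio API):

```python
# Common SAM flag bits (SAM specification, section 1.4)
SAM_FLAGS = {
    0x1: "paired",
    0x4: "unmapped",
    0x10: "reverse",
    0x100: "secondary",
    0x400: "duplicate",
    0x800: "supplementary",
}

def decode_flags(flags: int) -> list[str]:
    """Return the names of the SAM flag bits set in `flags`."""
    return [name for bit, name in SAM_FLAGS.items() if flags & bit]

print(decode_flags(0x11))  # → ['paired', 'reverse']
```

For filtering at scale, a bitwise expression on the column (e.g. keeping rows where `flags & 0x400 == 0` to drop duplicates) avoids per-row Python calls.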

### `write_bam` / `sink_bam`

```python
rows_written = pb.write_bam(df, "output.bam")
rows_written = pb.write_bam(df, "output.bam", sort_on_write=True)

pb.sink_bam(lf, "output.bam")
pb.sink_bam(lf, "output.bam", sort_on_write=True)
```

## CRAM Format

### `read_cram` / `scan_cram`

CRAM files have separate functions from BAM and require a reference FASTA plus a `.crai` index.

```python
import polars_bio as pb

# Read CRAM (reference required)
df = pb.read_cram("aligned.cram", reference_path="reference.fasta")

# Scan CRAM (streaming)
lf = pb.scan_cram("aligned.cram", reference_path="reference.fasta")
```

Same additional parameters and column schema as BAM, plus:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `reference_path` | str | None | Path to reference FASTA |

### `write_cram` / `sink_cram`

```python
rows_written = pb.write_cram(df, "output.cram", reference_path="reference.fasta")
pb.sink_cram(lf, "output.cram", reference_path="reference.fasta")
```

## GFF/GTF Format

### `read_gff` / `scan_gff` / `read_gtf` / `scan_gtf`

GFF3 and GTF have separate functions.

```python
import polars_bio as pb

# Read GFF3
df = pb.read_gff("annotations.gff3")

# Read GTF
df = pb.read_gtf("genes.gtf")

# Extract specific attributes as columns
df = pb.read_gff("annotations.gff3", attr_fields=["gene_id", "gene_name"])
```

### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `attr_fields` | list[str] | None | Attribute fields to extract as columns |
| `predicate_pushdown` | bool | True | Enable predicate pushdown |

### Column Schema

| Column | Type | Description |
|---|---|---|
| `chrom` | String | Sequence name |
| `source` | String | Feature source |
| `type` | String | Feature type (gene, exon, etc.) |
| `start` | Int64 | Start position |
| `end` | Int64 | End position |
| `score` | Float32 | Score |
| `strand` | String | Strand (+/-/.) |
| `phase` | UInt32 | Phase (0/1/2) |
| `attributes` | String | Attributes string |
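The raw `attributes` column differs by dialect: GFF3 uses `key=value` pairs, GTF uses `key "value"` pairs, both semicolon-separated. A stdlib sketch that handles either (illustrative only; `attr_fields` performs this extraction at read time):

```python
def parse_attributes(attrs: str) -> dict:
    """Parse a GFF3 (key=value) or GTF (key "value") attributes string."""
    result = {}
    for entry in attrs.strip().strip(";").split(";"):
        entry = entry.strip()
        if not entry:
            continue
        if "=" in entry:                      # GFF3 style
            key, _, value = entry.partition("=")
        else:                                 # GTF style
            key, _, value = entry.partition(" ")
        result[key.strip()] = value.strip().strip('"')
    return result

print(parse_attributes("ID=gene0001;Name=BRCA1"))
print(parse_attributes('gene_id "ENSG0001"; gene_name "BRCA1";'))
```

Both calls yield a plain dict of attribute names to values, ready to unpack into columns.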

## FASTA Format

### `read_fasta` / `scan_fasta`

Read reference sequences from FASTA files.

```python
import polars_bio as pb

df = pb.read_fasta("reference.fasta")
```

### Column Schema

| Column | Type | Description |
|---|---|---|
| `name` | String | Sequence name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |

## FASTQ Format

### `read_fastq` / `scan_fastq`

Read raw sequencing reads with quality scores.

```python
import polars_bio as pb

df = pb.read_fastq("reads.fastq.gz")
```

### Column Schema

| Column | Type | Description |
|---|---|---|
| `name` | String | Read name |
| `description` | String | Description line |
| `sequence` | String | Nucleotide sequence |
| `quality` | String | Quality string (Phred+33 encoded) |
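Phred+33 encoding stores one quality score per base as the ASCII character with code `score + 33`, so decoding the `quality` string is a one-liner (plain Python, not a polars-bio function):

```python
def phred_scores(quality: str) -> list[int]:
    """Decode a Phred+33 quality string into per-base Phred scores."""
    return [ord(c) - 33 for c in quality]

# 'I' (ASCII 73) is Q40, '5' (ASCII 53) is Q20, '!' (ASCII 33) is Q0
print(phred_scores("II5!"))  # → [40, 40, 20, 0]
```

A Phred score of Q means an estimated base-call error probability of 10^(-Q/10), so Q40 corresponds to a 1-in-10,000 error rate.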

### `write_fastq` / `sink_fastq`

```python
rows_written = pb.write_fastq(df, "output.fastq")
pb.sink_fastq(lf, "output.fastq")
```

## SAM Format

### `read_sam` / `scan_sam`

Read text-format alignment files. Same column schema as BAM. No cloud parameters.

```python
import polars_bio as pb

df = pb.read_sam("alignments.sam")
```

### Additional Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| `tag_fields` | list[str] | None | SAM tags to extract |
| `infer_tag_types` | bool | True | Infer tag types |
| `infer_tag_sample_size` | int | 100 | Sample size for inference |
| `tag_type_hints` | list[str] | None | Explicit type hints |

### `write_sam` / `sink_sam`

```python
rows_written = pb.write_sam(df, "output.sam")
pb.sink_sam(lf, "output.sam", sort_on_write=True)
```

## Hi-C Pairs

### `read_pairs` / `scan_pairs`

Read Hi-C pairs format files for chromatin contact data.

```python
import polars_bio as pb

df = pb.read_pairs("contacts.pairs")
lf = pb.scan_pairs("contacts.pairs")
```

### Column Schema

| Column | Type | Description |
|---|---|---|
| `readID` | String | Read identifier |
| `chrom1` | String | Chromosome of first contact |
| `pos1` | Int32 | Position of first contact |
| `chrom2` | String | Chromosome of second contact |
| `pos2` | Int32 | Position of second contact |
| `strand1` | String | Strand of first contact |
| `strand2` | String | Strand of second contact |

## Generic Table Reader

### `read_table` / `scan_table`

Read tab-delimited files with a custom schema. Useful for non-standard formats or bioframe-compatible tables.

```python
import polars_bio as pb

df = pb.read_table("custom.tsv", schema={"chrom": str, "start": int, "end": int, "name": str})
lf = pb.scan_table("custom.tsv", schema={"chrom": str, "start": int, "end": int})
```

## Cloud Storage

All `read_*` and `scan_*` functions support cloud storage via individual parameters:

### Amazon S3

```python
df = pb.read_bed(
    "s3://bucket/regions.bed",
    allow_anonymous=False,
    max_retries=10,
    timeout=600,
)
```

### Google Cloud Storage

```python
df = pb.read_vcf("gs://bucket/variants.vcf.gz", allow_anonymous=True)
```

### Azure Blob Storage

```python
df = pb.read_bam("az://container/aligned.bam", allow_anonymous=False)
```

Note: For authenticated access, configure credentials via environment variables or cloud SDK configuration (e.g., `AWS_ACCESS_KEY_ID`, `GOOGLE_APPLICATION_CREDENTIALS`).

## Compression Support

polars-bio transparently handles compressed files:

| Compression | Extension | Parallel Decompression |
|---|---|---|
| GZIP | `.gz` | No |
| BGZF | `.gz` (with BGZF blocks) | Yes |
| Uncompressed | (none) | N/A |

Recommendation: Use BGZF compression (e.g., created with `bgzip`) for large files. BGZF supports parallel block decompression, significantly improving read performance compared to plain GZIP.
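Both BGZF and plain GZIP files carry the `.gz` extension, but they can be told apart from the header: BGZF blocks are GZIP members with the FEXTRA flag set and a `BC` extra subfield (per the SAM/BAM specification). A stdlib sketch of the check (illustrative; polars-bio's `"auto"` detection is its own implementation):

```python
import gzip

def is_bgzf(header: bytes) -> bool:
    """Check whether the first bytes of a file look like a BGZF block."""
    return (
        len(header) >= 14
        and header[:4] == b"\x1f\x8b\x08\x04"  # gzip magic + FEXTRA flag set
        and header[12:14] == b"BC"             # BGZF extra subfield identifier
    )

plain = gzip.compress(b"ACGT")  # ordinary gzip stream: FEXTRA not set
bgzf_eof = bytes.fromhex(       # the standard 28-byte BGZF EOF block
    "1f8b08040000000000ff0600424302001b000300" "0000000000000000"
)
print(is_bgzf(plain), is_bgzf(bgzf_eof))  # → False True
```

In practice you would read the first 14 bytes of the file and pass them to `is_bgzf`; `bgzip` writes that EOF block at the end of every file it produces.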

## Describe Functions

Inspect file structure without reading the full file:

```python
import polars_bio as pb

# Describe file schemas and metadata
schema_df = pb.describe_vcf("samples.vcf.gz")
schema_df = pb.describe_bam("aligned.bam")
schema_df = pb.describe_sam("alignments.sam")
schema_df = pb.describe_cram("aligned.cram", reference_path="ref.fasta")
```