Back to Claude Scientific Skills

SQL Data Processing

scientific-skills/polars-bio/references/sql_processing.md

2.38.06.0 KB
Original Source

SQL Data Processing

Overview

polars-bio integrates Apache DataFusion's SQL engine, enabling SQL queries on bioinformatics files and Polars DataFrames. Register files as tables and query them using standard SQL syntax. All queries return a LazyFrame — call .collect() to materialize results.

Register Functions

Register bioinformatics files as SQL tables. Path is the first argument, name is an optional keyword:

python
import polars_bio as pb

# Register various file formats (path first, name= keyword)
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="regions")
pb.register_bam("aligned.bam", name="alignments")
pb.register_cram("aligned.cram", name="cram_alignments")
pb.register_gff("genes.gff3", name="annotations")
pb.register_gtf("genes.gtf", name="gtf_annotations")
pb.register_fastq("sample.fastq.gz", name="reads")
pb.register_sam("alignments.sam", name="sam_alignments")
pb.register_pairs("contacts.pairs", name="hic_contacts")

Parameters

All register_* functions share these parameters:

ParameterTypeDefaultDescription
pathstrrequired (first positional)Path to file (local or cloud)
namestrNoneTable name for SQL queries (auto-generated if omitted)
chunk_sizeint64Chunk size for reading
concurrent_fetchesint8Concurrent cloud fetches
allow_anonymousboolTrueAllow anonymous cloud access
max_retriesint5Cloud retry count
timeoutint300Cloud timeout in seconds
enable_request_payerboolFalseRequester-pays cloud
compression_typestr"auto"Compression type

Some register functions have additional format-specific parameters (e.g., info_fields on register_vcf).

Note: register_fasta does not exist. Use scan_fasta + from_polars as a workaround.

from_polars

Register an existing Polars DataFrame as a SQL-queryable table:

python
import polars as pl
import polars_bio as pb

df = pl.DataFrame({
    "chrom": ["chr1", "chr1", "chr2"],
    "start": [100, 500, 200],
    "end":   [200, 600, 400],
    "name":  ["peak1", "peak2", "peak3"],
})

pb.from_polars("my_peaks", df)

# Now query with SQL
result = pb.sql("SELECT * FROM my_peaks WHERE chrom = 'chr1'").collect()

Important: register_view takes a SQL query string, not a DataFrame. Use from_polars to register DataFrames.

register_view

Create a SQL view from a query string:

python
import polars_bio as pb

# Create a view from a SQL query
pb.register_view("chr1_variants", "SELECT * FROM variants WHERE chrom = 'chr1'")

# Query the view
result = pb.sql("SELECT * FROM chr1_variants WHERE qual > 30").collect()

Parameters

ParameterTypeDescription
namestrView name
querystrSQL query string defining the view

pb.sql()

Execute SQL queries using DataFusion SQL syntax. Returns a LazyFrame — call .collect() to get a DataFrame.

python
import polars_bio as pb

# Simple query
result = pb.sql("SELECT chrom, start, end FROM regions WHERE chrom = 'chr1'").collect()

# Aggregation
result = pb.sql("""
    SELECT chrom, COUNT(*) as variant_count, AVG(qual) as avg_qual
    FROM variants
    GROUP BY chrom
    ORDER BY variant_count DESC
""").collect()

# Join tables
result = pb.sql("""
    SELECT v.chrom, v.start, v.end, v.ref, v.alt, r.name
    FROM variants v
    JOIN regions r ON v.chrom = r.chrom
        AND v.start >= r.start
        AND v.end <= r.end
""").collect()

DataFusion SQL Syntax

polars-bio uses Apache DataFusion's SQL dialect. Key features:

Filtering

sql
SELECT * FROM variants WHERE qual > 30 AND filter = 'PASS'

Aggregations

sql
SELECT chrom, COUNT(*) as n, MIN(start) as min_pos, MAX(end) as max_pos
FROM regions
GROUP BY chrom
HAVING COUNT(*) > 100

Window Functions

sql
SELECT chrom, start, end,
    ROW_NUMBER() OVER (PARTITION BY chrom ORDER BY start) as row_num,
    LAG(end) OVER (PARTITION BY chrom ORDER BY start) as prev_end
FROM regions

Subqueries

sql
SELECT * FROM variants
WHERE chrom IN (SELECT DISTINCT chrom FROM regions)

Common Table Expressions (CTEs)

sql
WITH filtered_variants AS (
    SELECT * FROM variants WHERE qual > 30
),
chr1_regions AS (
    SELECT * FROM regions WHERE chrom = 'chr1'
)
SELECT f.chrom, f.start, f.ref, f.alt
FROM filtered_variants f
JOIN chr1_regions r ON f.start BETWEEN r.start AND r.end

Combining SQL with Interval Operations

SQL queries return LazyFrames that can be used directly with polars-bio interval operations:

python
import polars_bio as pb

# Register files
pb.register_vcf("samples.vcf.gz", name="variants")
pb.register_bed("target_regions.bed", name="targets")

# SQL to filter (returns LazyFrame)
high_qual = pb.sql("SELECT chrom, start, end FROM variants WHERE qual > 30").collect()
targets = pb.sql("SELECT chrom, start, end FROM targets WHERE chrom = 'chr1'").collect()

# Interval operation on SQL results
overlapping = pb.overlap(high_qual, targets).collect()

Example Workflows

Variant Density Analysis

python
import polars_bio as pb

pb.register_vcf("cohort.vcf.gz", name="variants")
pb.register_bed("genome_windows_1mb.bed", name="windows")

# Count variants per window using SQL
result = pb.sql("""
    SELECT w.chrom, w.start, w.end, COUNT(v.start) as variant_count
    FROM windows w
    LEFT JOIN variants v ON w.chrom = v.chrom
        AND v.start >= w.start
        AND v.start < w.end
    GROUP BY w.chrom, w.start, w.end
    ORDER BY variant_count DESC
""").collect()

Gene Annotation Lookup

python
import polars_bio as pb

pb.register_gff("gencode.gff3", name="genes")

# Find all protein-coding genes on chromosome 1
coding_genes = pb.sql("""
    SELECT chrom, start, end, attributes
    FROM genes
    WHERE type = 'gene'
        AND chrom = 'chr1'
        AND attributes LIKE '%protein_coding%'
    ORDER BY start
""").collect()