Back to Claude Scientific Skills

Overlap Detection and IGD

scientific-skills/gtars/references/overlap.md

2.38.03.6 KB
Original Source

Overlap Detection and IGD

The overlaprs module provides efficient overlap detection between genomic intervals using the Integrated Genome Database (IGD) data structure.

IGD Index

IGD (Integrated Genome Database) is a specialized data structure for fast genomic interval queries and overlap detection.

Building an IGD Index

Create indexes from genomic region files:

python
import gtars

# Build IGD index from BED file
igd = gtars.igd.build_index("regions.bed")

# Save index for reuse
igd.save("regions.igd")

# Load existing index
igd = gtars.igd.load_index("regions.igd")

Querying Overlaps

Find overlapping regions efficiently:

python
# Query a single region
overlaps = igd.query("chr1", 1000, 2000)

# Query multiple regions
results = []
for chrom, start, end in query_regions:
    overlaps = igd.query(chrom, start, end)
    results.append(overlaps)

# Get overlap counts only
count = igd.count_overlaps("chr1", 1000, 2000)

CLI Usage

The overlaprs command-line tool provides overlap detection:

bash
# Find overlaps between two BED files
gtars overlaprs query --index regions.bed --query query_regions.bed

# Count overlaps
gtars overlaprs count --index regions.bed --query query_regions.bed

# Output overlapping regions
gtars overlaprs overlap --index regions.bed --query query_regions.bed --output overlaps.bed

IGD CLI Commands

Build and query IGD indexes:

bash
# Build IGD index
gtars igd build --input regions.bed --output regions.igd

# Query IGD index
gtars igd query --index regions.igd --region "chr1:1000-2000"

# Batch query from file
gtars igd query --index regions.igd --query-file queries.bed --output results.bed

Python API

Overlap Detection

Compute overlaps between region sets:

python
import gtars

# Load two region sets
set_a = gtars.RegionSet.from_bed("regions_a.bed")
set_b = gtars.RegionSet.from_bed("regions_b.bed")

# Find overlaps
overlaps = set_a.overlap(set_b)

# Get regions from A that overlap with B
overlapping_a = set_a.filter_overlapping(set_b)

# Get regions from A that don't overlap with B
non_overlapping_a = set_a.filter_non_overlapping(set_b)

Overlap Statistics

Calculate overlap metrics:

python
# Count overlaps
overlap_count = set_a.count_overlaps(set_b)

# Calculate overlap fraction
overlap_fraction = set_a.overlap_fraction(set_b)

# Get overlap coverage
coverage = set_a.overlap_coverage(set_b)

Performance Characteristics

IGD provides efficient querying:

  • Index construction: O(n log n) where n is number of regions
  • Query time: O(k + log n) where k is number of overlaps
  • Memory efficient: Compact representation of genomic intervals

Use Cases

Regulatory Element Analysis

Identify overlap between genomic features:

python
# Find transcription factor binding sites overlapping promoters
tfbs = gtars.RegionSet.from_bed("chip_seq_peaks.bed")
promoters = gtars.RegionSet.from_bed("promoters.bed")

overlapping_tfbs = tfbs.filter_overlapping(promoters)
print(f"Found {len(overlapping_tfbs)} TFBS in promoters")

Variant Annotation

Annotate variants with overlapping features:

python
# Check which variants overlap with coding regions
variants = gtars.RegionSet.from_bed("variants.bed")
cds = gtars.RegionSet.from_bed("coding_sequences.bed")

coding_variants = variants.filter_overlapping(cds)

Chromatin State Analysis

Compare chromatin states across samples:

python
# Find regions with consistent chromatin states
sample1 = gtars.RegionSet.from_bed("sample1_peaks.bed")
sample2 = gtars.RegionSet.from_bed("sample2_peaks.bed")

consistent_regions = sample1.overlap(sample2)