scientific-skills/geniml/references/consensus_peaks.md
Geniml provides tools for building genomic "universes" — standardized reference sets of consensus peaks from collections of BED files. These universes represent genomic regions where analyzed datasets show significant coverage overlap, serving as reference vocabularies for tokenization and analysis.
Use consensus peak creation when:
Merge all BED files into a single combined file:
cat /path/to/bed/files/*.bed > combined_files.bed
Create bigWig coverage tracks using uniwig with a smoothing window:
uniwig -m 25 combined_files.bed chrom.sizes coverage/
Parameters:
-m 25: Smoothing window size (25bp typical for chromatin accessibility)chrom.sizes: Chromosome sizes file for your genomecoverage/: Output directory for bigWig filesThe smoothing window helps reduce noise and creates more robust peak boundaries.
Use one of four methods to construct the consensus peaks:
The simplest approach using a fixed coverage threshold:
geniml universe build cc \
--coverage-folder coverage/ \
--output-file universe_cc.bed \
--cutoff 5 \
--merge 100 \
--filter-size 50
Parameters:
--cutoff: Coverage threshold (1 = union; file count = intersection)--merge: Distance for merging adjacent peaks (bp)--filter-size: Minimum peak size for inclusion (bp)Use when: Simple threshold-based selection is sufficient
Creates confidence intervals around likelihood cutoffs for boundaries and region cores:
geniml universe build ccf \
--coverage-folder coverage/ \
--output-file universe_ccf.bed \
--cutoff 5 \
--confidence 0.95 \
--merge 100 \
--filter-size 50
Additional parameters:
--confidence: Confidence level for flexible boundaries (0-1)Use when: Uncertainty in peak boundaries should be captured
Builds probabilistic models accounting for region start/end positions:
geniml universe build ml \
--coverage-folder coverage/ \
--output-file universe_ml.bed \
--merge 100 \
--filter-size 50 \
--model-type gaussian
Parameters:
--model-type: Distribution for likelihood estimation (gaussian, poisson)Use when: Statistical modeling of peak locations is important
Models genomic regions as hidden states with coverage as emissions:
geniml universe build hmm \
--coverage-folder coverage/ \
--output-file universe_hmm.bed \
--states 3 \
--merge 100 \
--filter-size 50
Parameters:
--states: Number of HMM hidden states (typically 2-5)Use when: Complex patterns of genomic states should be captured
from geniml.universe import build_universe
# Build using coverage cutoff method
universe = build_universe(
coverage_folder='coverage/',
method='cc',
cutoff=5,
merge_distance=100,
min_size=50,
output_file='universe.bed'
)
| Method | Complexity | Flexibility | Computational Cost | Best For |
|---|---|---|---|---|
| CC | Low | Low | Low | Quick reference sets |
| CCF | Medium | Medium | Medium | Boundary uncertainty |
| ML | High | High | High | Statistical rigor |
| HMM | High | High | Very High | Complex patterns |
Coverage cutoff:
cutoff = 1: Union of all peaks (most permissive)cutoff = n_files: Intersection (most stringent)cutoff = 0.5 * n_files: Moderate consensus (typical choice)Merge distance:
Filter size:
After building, assess universe quality:
from geniml.evaluation import assess_universe
metrics = assess_universe(
universe_file='universe.bed',
coverage_folder='coverage/',
bed_files='bed_files/'
)
print(f"Number of regions: {metrics['n_regions']}")
print(f"Mean region size: {metrics['mean_size']:.1f}bp")
print(f"Coverage of input peaks: {metrics['coverage']:.1%}")
Key metrics:
Consensus peaks are saved as BED files with three required columns:
chr1 1000 1500
chr1 2000 2800
chr2 500 1000
Additional columns may include confidence scores or state annotations depending on the method.
Too few regions: Lower cutoff threshold or reduce filter size
Too many regions: Raise cutoff threshold, increase merge distance, or increase filter size
Noisy boundaries: Use CCF or ML methods instead of CC
Long computation: Start with CC method for quick results, then refine with ML/HMM if needed