This document explains the various normalization methods available in deepTools and when to use each one.
Normalization is essential for:
Without normalization, a sample with 100 million reads will appear to have higher coverage than a sample with 50 million reads, even if the true biological signal is identical.
Formula: (Number of reads) / (Length of region in kb × Total mapped reads in millions)
When to use:
Available in: bamCoverage, bamCompare
Example:
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPKM
Interpretation: RPKM of 10 means 10 reads per kilobase of feature per million mapped reads.
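The RPKM formula can be checked by hand with a small shell helper. This is a minimal sketch: `rpkm` is a hypothetical function name, not a deepTools command, and it assumes `awk` is available.

```shell
# Hypothetical helper (not a deepTools command): RPKM for one bin or region.
# Usage: rpkm READS REGION_LENGTH_BP TOTAL_MAPPED_READS
rpkm() {
    awk -v reads="$1" -v len_bp="$2" -v total="$3" 'BEGIN {
        # reads / (region length in kb * total mapped reads in millions)
        printf "%.2f\n", reads / ((len_bp / 1000) * (total / 1e6))
    }'
}

# 400 reads in a 2,000 bp region, library of 20 million mapped reads
rpkm 400 2000 20000000   # prints 10.00
```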
Pros:
Cons:
Formula: (Number of reads) / (Total mapped reads in millions)
Also known as: RPM (Reads Per Million)
When to use:
Available in: bamCoverage, bamCompare
Example:
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing CPM
Interpretation: CPM of 5 means 5 reads per million mapped reads in that bin.
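To see why depth normalization matters, the sketch below applies the CPM formula to the same underlying signal sequenced at two depths. `cpm` is a hypothetical helper, not a deepTools command, and the read counts are made-up illustration values.

```shell
# Hypothetical helper (not a deepTools command): CPM for one bin.
# Usage: cpm READS_IN_BIN TOTAL_MAPPED_READS
cpm() {
    awk -v reads="$1" -v total="$2" 'BEGIN {
        # reads in the bin / total mapped reads in millions
        printf "%.2f\n", reads / (total / 1e6)
    }'
}

# Same underlying signal sequenced at two depths:
cpm 500 100000000   # 100M-read library, prints 5.00
cpm 250 50000000    # 50M-read library, prints 5.00
```

The raw counts differ by a factor of two, but both bins come out at the same CPM.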
Pros:
Cons:
Formula: (Number of reads in bin) / (Sum of all reads in bins in millions)
Key difference from CPM: Only considers reads that fall within the analyzed bins, not all mapped reads.
When to use:
Available in: bamCoverage, bamCompare
Example:
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing BPM
Interpretation: BPM accounts only for reads in the binned regions.
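The BPM formula can be sketched over a toy set of bins. `bpm` is a hypothetical helper, not a deepTools command; it assumes one read count per input line and that at least one bin contains reads.

```shell
# Hypothetical helper (not a deepTools command): BPM for a list of bins.
# Reads one per-bin read count per line on stdin, prints one BPM per line.
bpm() {
    awk '{ count[NR] = $1; sum += $1 }
         END {
             if (sum == 0) exit 1   # no reads in any bin
             # reads in bin / (sum of reads over all bins, in millions)
             for (i = 1; i <= NR; i++) printf "%.1f\n", count[i] * 1e6 / sum
         }'
}

# Three bins holding 10, 30 and 60 reads
printf '%s\n' 10 30 60 | bpm
```

Because the denominator is the sum over the analyzed bins, the BPM values of all bins always add up to one million.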
Pros:
Cons:
Formula: (Number of reads per bin) / (Scaling factor for 1× average coverage)
Scaling factor: (Total mapped reads × Fragment length) / Effective genome size, i.e. the library's average coverage before normalization
When to use:
Available in: bamCoverage, bamCompare
Requires: --effectiveGenomeSize parameter
Example:
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
Interpretation: Signal value approximates the coverage depth (e.g., value of 2 ≈ 2× coverage).
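A rough sketch of the arithmetic behind 1× normalization, following the description in the deepTools documentation: the library's average coverage is estimated as (mapped reads × fragment length) / effective genome size, and each bin is divided by that value. `mean_coverage` and `rpgc` are hypothetical helpers, and the read count and fragment length are illustrative assumptions.

```shell
# Hypothetical helpers (not deepTools commands).
# Average coverage of the library before normalization.
# Usage: mean_coverage MAPPED_READS FRAGMENT_LENGTH_BP EFFECTIVE_GENOME_SIZE
mean_coverage() {
    awk -v n="$1" -v frag="$2" -v g="$3" 'BEGIN {
        printf "%.2f\n", (n * frag) / g
    }'
}

# RPGC-style value: raw per-bin coverage divided by the library's mean coverage.
# Usage: rpgc RAW_BIN_COVERAGE MAPPED_READS FRAGMENT_LENGTH_BP EFFECTIVE_GENOME_SIZE
rpgc() {
    awk -v cov="$1" -v n="$2" -v frag="$3" -v g="$4" 'BEGIN {
        printf "%.2f\n", cov / ((n * frag) / g)
    }'
}

# 50 million reads, 200 bp fragments, GRCh38 effective genome size
mean_coverage 50000000 200 2913022398   # prints 3.43
# Toy genome where the library is already at 1x: values pass through unchanged
rpgc 2 10000000 100 1000000000          # prints 2.00
```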
Pros:
Cons:
Formula: Raw read counts
When to use:
Available in: All tools (usually default)
Example:
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing None
Interpretation: Raw read counts per bin.
Pros:
Cons:
Method: Signal Extraction Scaling (SES), which estimates the scaling factor from background (non-enriched) bins rather than from total read counts, for comparing ChIP to control
When to use:
Available in: bamCompare only
Example:
bamCompare -b1 chip.bam -b2 input.bam -o output.bw \
--scaleFactorsMethod SES
Note: SES is specifically designed for ChIP-seq data and may work better than simple read count scaling for noisy data.
Method: Scale by ratio of total read counts between samples
When to use:
Available in: bamCompare
Example:
bamCompare -b1 treatment.bam -b2 control.bam -o output.bw \
--scaleFactorsMethod readCount
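How it works: The samples are scaled to a common depth before comparison. If sample1 has 100M reads and sample2 has 50M reads, deepTools scales the larger sample down to the smaller one, so sample1 is multiplied by 0.5 and sample2 is left unchanged.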
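A small sketch of read-count scaling, assuming the larger library is scaled down to the smaller one. What determines the comparison is the relative factor between the two samples. `readcount_factors` is a hypothetical helper, not a deepTools command.

```shell
# Hypothetical helper (not a deepTools command): depth scale factors for
# two samples, scaling the larger library down to the smaller one.
# Usage: readcount_factors MAPPED_READS_1 MAPPED_READS_2
readcount_factors() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        m = (a < b) ? a : b            # size of the smaller library
        printf "%.2f %.2f\n", m / a, m / b
    }'
}

# sample1 has 100M mapped reads, sample2 has 50M
readcount_factors 100000000 50000000   # prints 0.50 1.00
```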
Recommended: RPGC or CPM
bamCoverage --bam chip.bam --outFileName chip.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--extendReads 200 \
--ignoreDuplicates
Reasoning: Accounts for sequencing depth differences; RPGC provides interpretable coverage values.
Recommended: log2 ratio with readCount or SES scaling
bamCompare -b1 chip.bam -b2 input.bam -o ratio.bw \
--operation log2 \
--scaleFactorsMethod readCount \
--extendReads 200 \
--ignoreDuplicates
Reasoning: Log2 ratio shows enrichment (positive) and depletion (negative); readCount adjusts for depth.
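The sign convention of the log2 ratio can be illustrated for a single bin. This sketch assumes both coverages have already been depth-scaled and adds a pseudocount of 1 to each, mirroring the default of bamCompare's `--pseudocount` option; `log2_ratio` is a hypothetical helper and the coverage values are invented.

```shell
# Hypothetical helper (not a deepTools command): log2 enrichment of ChIP
# over input for one bin, using depth-scaled coverages.
# Usage: log2_ratio CHIP_COVERAGE INPUT_COVERAGE [PSEUDOCOUNT]
log2_ratio() {
    awk -v chip="$1" -v ctrl="$2" -v pc="${3:-1}" 'BEGIN {
        # awk log() is the natural log, so divide by log(2)
        printf "%.2f\n", log((chip + pc) / (ctrl + pc)) / log(2)
    }'
}

log2_ratio 31 7    # enriched bin: prints 2.00
log2_ratio 3 15    # depleted bin: prints -2.00
```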
Recommended: CPM or RPKM
# Strand-specific forward
bamCoverage --bam rnaseq.bam --outFileName forward.bw \
--normalizeUsing CPM \
--filterRNAstrand forward
# For gene-level: RPKM accounts for gene length
bamCoverage --bam rnaseq.bam --outFileName output.bw \
--normalizeUsing RPKM
Reasoning: CPM for comparing fixed-width bins; RPKM for genes (accounts for length).
Recommended: RPGC or CPM
bamCoverage --bam atac_shifted.bam --outFileName atac.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398
Reasoning: Similar to ChIP-seq; want comparable coverage across samples.
Recommended: CPM or RPGC
multiBamSummary bins \
--bamfiles sample1.bam sample2.bam sample3.bam \
-o readCounts.npz
plotCorrelation -in readCounts.npz \
--corMethod pearson \
--whatToShow heatmap \
-o correlation.png
Note: multiBamSummary doesn't explicitly normalize, but correlation analysis is robust to scaling. For very different library sizes, consider normalizing BAM files first or using CPM-normalized bigWig files with multiBigwigSummary.
For experiments with spike-in controls (e.g., Drosophila chromatin spike-in for ChIP-seq):
Apply the scaling factor with the --scaleFactor parameter:
# Calculate spike-in factor (example: 0.8)
SCALE_FACTOR=0.8
bamCoverage --bam chip.bam --outFileName chip_spikenorm.bw \
--scaleFactor ${SCALE_FACTOR} \
--extendReads 200
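One way to arrive at such a factor is to scale each sample by the ratio of a reference sample's spike-in read count to its own. This is only one common convention, not something prescribed by deepTools, and both `spikein_factor` and the read counts below are hypothetical.

```shell
# Hypothetical helper (not a deepTools command): spike-in scale factor
# relative to a reference sample (assumed convention: reference / sample).
# Usage: spikein_factor REFERENCE_SPIKEIN_READS SAMPLE_SPIKEIN_READS
spikein_factor() {
    awk -v ref="$1" -v sample="$2" 'BEGIN {
        printf "%.2f\n", ref / sample
    }'
}

# Reference sample has 400,000 spike-in reads, this sample has 500,000
SCALE_FACTOR=$(spikein_factor 400000 500000)
echo "$SCALE_FACTOR"   # prints 0.80
```

A sample with more spike-in reads than the reference gets a factor below 1, so its signal is scaled down accordingly.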
You can apply custom scaling factors:
# Apply 2× scaling
bamCoverage --bam input.bam --outFileName output.bw \
--scaleFactor 2.0
Exclude specific chromosomes from normalization calculations:
bamCoverage --bam input.bam --outFileName output.bw \
--normalizeUsing RPGC \
--effectiveGenomeSize 2913022398 \
--ignoreForNormalization chrX chrY chrM
When to use: Sex chromosomes in mixed-sex samples, mitochondrial DNA, or chromosomes with unusual coverage.
Problem: RPKM accounts for region length, but all bins are the same size
Solution: Use CPM or RPGC instead
Problem: Sample with 2× sequencing depth appears to have 2× signal
Solution: Always normalize when comparing samples
Problem: Using hg19 genome size for hg38 data
Solution: Double-check genome assembly and use correct size
Problem: correctGCBias adds reads to under-represented regions, so removing duplicates afterwards can reintroduce bias
Solution: Never use --ignoreDuplicates after correctGCBias
Problem: RPGC is requested without an effective genome size, so the command fails
Solution: Always specify --effectiveGenomeSize with RPGC
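To reduce the risk of pairing the wrong genome size with an assembly, the lookup can be kept in one place. The sketch below uses the effective genome sizes listed in the deepTools documentation for a few common assemblies; `effective_genome_size` is a hypothetical helper, and the values should be verified for your deepTools version and read length.

```shell
# Hypothetical helper (not a deepTools command): effective genome size for
# a few common assemblies. Verify values against the deepTools docs.
effective_genome_size() {
    case "$1" in
        hg19|GRCh37) echo 2864785220 ;;
        hg38|GRCh38) echo 2913022398 ;;
        mm9|GRCm37)  echo 2620345972 ;;
        mm10|GRCm38) echo 2652783500 ;;
        *) echo "unknown assembly: $1" >&2; return 1 ;;
    esac
}

effective_genome_size hg38   # prints 2913022398
```

The result can then be passed along as `--effectiveGenomeSize "$(effective_genome_size hg38)"`, which fails loudly for an unrecognized assembly name instead of silently using the wrong size.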
Use: RPKM (accounts for region length)
Use: CPM, RPGC, or BPM (accounts for library size)
Use: bamCompare with log2 ratio and readCount/SES scaling
Use: CPM or RPGC normalized bigWig files, then multiBigwigSummary
| Method | Accounts for Depth | Accounts for Length | Best For | Command |
|---|---|---|---|---|
| RPKM | ✓ | ✓ | RNA-seq genes | --normalizeUsing RPKM |
| CPM | ✓ | ✗ | Fixed-size bins | --normalizeUsing CPM |
| BPM | ✓ | ✗ | Specific regions | --normalizeUsing BPM |
| RPGC | ✓ | ✗ | Interpretable coverage | --normalizeUsing RPGC --effectiveGenomeSize X |
| None | ✗ | ✗ | Raw data | --normalizeUsing None |
| SES | ✓ | ✗ | ChIP comparisons | bamCompare --scaleFactorsMethod SES |
| readCount | ✓ | ✗ | ChIP comparisons | bamCompare --scaleFactorsMethod readCount |