ATAC-seq and Chromatin Accessibility Models

This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.

PeakVI

Purpose: Analysis and integration of single-cell ATAC-seq data using peak counts.

Key Features:

Variational autoencoder specifically designed for scATAC-seq peak data
Learns low-dimensional representations of chromatin accessibility
Performs batch correction across samples
Enables differential accessibility testing
Integrates multiple ATAC-seq datasets

When to Use:

Analyzing scATAC-seq peak count matrices
Integrating multiple ATAC-seq experiments
Batch correction of chromatin accessibility data
Dimensionality reduction for ATAC-seq
Differential accessibility analysis between cell types or conditions

Data Requirements:

Peak count matrix (cells × peaks)
Binary or count data for peak accessibility (PeakVI models accessibility as binary, so any count above zero is treated as accessible)
Batch/sample annotations (optional, for batch correction)
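
As a rough illustration of the input described above, the sketch below binarizes a toy cells × peaks count matrix and drops peaks seen in too few cells, which mirrors what a min_cells filter does. It uses NumPy only. The function name binarize_and_filter and the toy values are assumptions made for illustration and are not part of scvi-tools.

```python
import numpy as np


def binarize_and_filter(counts: np.ndarray, min_cells: int = 3):
    """Binarize a cells x peaks count matrix and drop rarely observed peaks."""
    # Any nonzero count is treated as "accessible"
    binary = (counts > 0).astype(np.int8)
    # Number of cells in which each peak is accessible
    cells_per_peak = binary.sum(axis=0)
    keep = cells_per_peak >= min_cells
    return binary[:, keep], keep


# Toy matrix: 4 cells x 3 peaks (synthetic values, for illustration only)
toy_counts = np.array(
    [
        [2, 0, 1],
        [0, 0, 3],
        [1, 0, 0],
        [4, 1, 2],
    ]
)
filtered, kept_mask = binarize_and_filter(toy_counts, min_cells=3)
print(filtered.shape, kept_mask)
```

In a real workflow the matrix would usually be sparse and the filtering would be done on the AnnData object itself, so that peak annotations stay aligned with the matrix.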

Basic Usage:

python

import scanpy as sc
import scvi

# Prepare data (the cells × peaks matrix should be in adata.X)
# Optional: drop rare peaks. filter_genes acts on variables, which are peaks here.
sc.pp.filter_genes(adata, min_cells=3)

# Setup data
scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key="batch"
)

# Train model
model = scvi.model.PEAKVI(adata)
model.train()

# Get latent representation (batch-corrected)
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent

# Differential accessibility
da_results = model.differential_accessibility(
    groupby="cell_type",
    group1="TypeA",
    group2="TypeB"
)

Key Parameters:

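n_latent: Dimensionality of latent space (default: None, which uses the square root of n_hidden)
n_hidden: Number of nodes per hidden layer (default: None, which uses the square root of the number of regions)
n_layers_encoder / n_layers_decoder: Number of hidden layers in the encoder and decoder (default: 2 each)
region_factors: Whether to learn region-specific factors (default: True)
latent_distribution: Distribution for latent space ("normal" or "ln"; default: "normal")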
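
To get a feel for the layer sizes PeakVI picks when n_hidden and n_latent are left unset, the sketch below reproduces the square-root rule described in the scvi-tools documentation as far as I understand it. The helper default_peakvi_sizes is hypothetical and is not a library function; check the API reference for the release you are using.

```python
import math


def default_peakvi_sizes(n_regions: int) -> tuple[int, int]:
    """Estimate (n_hidden, n_latent) under the assumed square-root defaults."""
    # Assumption: n_hidden defaults to floor(sqrt(number of regions))
    n_hidden = int(math.sqrt(n_regions))
    # Assumption: n_latent defaults to floor(sqrt(n_hidden))
    n_latent = int(math.sqrt(n_hidden))
    return n_hidden, n_latent


n_hidden, n_latent = default_peakvi_sizes(100_000)
print(n_hidden, n_latent)
```

Under this rule the latent space grows slowly with the number of peaks, so passing n_latent explicitly is a reasonable choice when a specific dimensionality is wanted.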

Outputs:

get_latent_representation(): Low-dimensional embeddings for cells
get_accessibility_estimates(): Normalized accessibility values
differential_accessibility(): Statistical testing for differential peaks
get_region_factors(): Peak-specific scaling factors

Best Practices:

Filter out low-quality peaks (present in very few cells)
Include batch information if integrating multiple samples
Use latent representations for clustering and UMAP visualization
Keep region_factors=True (the default) for datasets with high technical variation
Store latent embeddings in adata.obsm for downstream analysis with scanpy

PoissonVI

Purpose: Quantitative analysis of scATAC-seq fragment counts, which retain more information than binary peak calls.

Key Features:

Models fragment counts directly (not just peak presence/absence)
Poisson distribution for count data
Captures quantitative differences in accessibility
Enables fine-grained analysis of chromatin state

When to Use:

Analyzing fragment-level ATAC-seq data
Need quantitative accessibility measurements
Higher resolution analysis than binary peak calls
Investigating gradual changes in chromatin accessibility

Data Requirements:

Fragment count matrix (cells × genomic regions)
Count data (not binary)
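
Many preprocessing pipelines report read counts rather than fragment counts, and a single fragment can contribute up to two reads. A common approximation is to halve the read counts and round up. The sketch below shows that conversion with NumPy; the function name reads_to_fragment_counts is hypothetical. scvi-tools appears to ship a helper for this purpose (scvi.data.reads_to_fragments), but treat that name as an assumption and confirm it in the API reference.

```python
import numpy as np


def reads_to_fragment_counts(read_counts: np.ndarray) -> np.ndarray:
    """Approximate fragment counts from read counts as ceil(reads / 2)."""
    return np.ceil(read_counts / 2).astype(np.int64)


# Toy read counts: 2 cells x 3 regions (synthetic values, for illustration only)
toy_reads = np.array([[0, 1, 2], [3, 4, 7]])
toy_fragments = reads_to_fragment_counts(toy_reads)
print(toy_fragments)
```

Zeros stay zero under this conversion, so the sparsity pattern of the matrix is unchanged.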

Basic Usage (PoissonVI lives in scvi.external):

python

scvi.external.POISSONVI.setup_anndata(
    adata,
    batch_key="batch"
)

model = scvi.external.POISSONVI(adata)
model.train()

# Get results
latent = model.get_latent_representation()
accessibility = model.get_normalized_accessibility()

Key Differences from PeakVI:

PeakVI: Best for standard peak matrices treated as presence/absence; typically faster to train
PoissonVI: Best for quantitative fragment counts; preserves the magnitude of each count
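
The practical difference can be seen by comparing the two likelihoods on a single entry. The sketch below is a simplified illustration in plain Python and does not reproduce either model's actual loss: it assumes a Bernoulli likelihood on the binarized value for the presence/absence view and a Poisson likelihood on the raw count. The function names and parameter values are chosen for illustration only.

```python
import math


def bernoulli_loglik(count: int, p_open: float) -> float:
    """Log-likelihood of the binarized observation under a Bernoulli model."""
    observed = 1 if count > 0 else 0
    return math.log(p_open) if observed else math.log(1.0 - p_open)


def poisson_loglik(count: int, rate: float) -> float:
    """Log-likelihood of the raw count under a Poisson model."""
    return count * math.log(rate) - rate - math.lgamma(count + 1)


# Counts of 1 and 5 are indistinguishable after binarization...
b1, b5 = bernoulli_loglik(1, 0.6), bernoulli_loglik(5, 0.6)
# ...but receive different likelihoods under the Poisson model
p1, p5 = poisson_loglik(1, 2.0), poisson_loglik(5, 2.0)
print(b1 == b5, p1, p5)
```

Because the count-based likelihood distinguishes 1 from 5, it can pick up graded differences in accessibility that the binary view discards.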

When to Choose PoissonVI over PeakVI:

Working with fragment counts rather than called peaks
Need to capture quantitative differences
Have high-quality, high-coverage data
Interested in subtle accessibility changes

scBasset

Purpose: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.

Key Features:

Convolutional neural network (CNN) architecture for sequence-based analysis
Models raw DNA sequences, not just peak counts
Enables motif discovery and transcription factor (TF) binding prediction
Provides interpretable feature importance
Performs batch correction

When to Use:

Want to incorporate DNA sequence information
Interested in TF motif analysis
Need interpretable models (which sequences drive accessibility)
Analyzing regulatory elements and TF binding sites
Predicting accessibility from sequence alone

Data Requirements:

Peak sequences (extracted from genome)
Peak accessibility matrix
Genome reference (for sequence extraction)
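
Sequence-based models typically consume DNA as a numeric array. The sketch below shows one common representation, a one-hot encoding with one column per base, using NumPy. It is an illustration of the general idea only: the encoding scvi-tools stores internally may differ, and the function name one_hot_encode and the toy sequence are assumptions.

```python
import numpy as np

BASES = "ACGT"


def one_hot_encode(sequence: str) -> np.ndarray:
    """Return a (length, 4) one-hot array; unknown bases such as N become all zeros."""
    encoded = np.zeros((len(sequence), len(BASES)), dtype=np.float32)
    for position, base in enumerate(sequence.upper()):
        index = BASES.find(base)
        if index >= 0:
            encoded[position, index] = 1.0
    return encoded


# Toy peak sequence (synthetic, for illustration only)
toy_peak_sequence = "ACGTN"
encoded = one_hot_encode(toy_peak_sequence)
print(encoded.shape)
```

Real peak sequences are much longer and are usually padded or trimmed to a fixed length so that every peak yields an array of the same shape.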

Basic Usage (scBasset lives in scvi.external):

python

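import scvi

# scBasset needs per-peak DNA sequences. Add them to the AnnData first;
# this downloads the genome (once) and stores each peak's sequence and its
# encoded form in adata.varm (the "dna_code" key by default). adata.var must
# hold peak coordinates (by default in the "chr", "start" and "end" columns).
scvi.data.add_dna_sequence(
    adata,
    genome_name="hg38",
    install_genome=True,
)

# scBasset expects regions as observations, so transpose to peaks × cells
# and train on a binarized layer (following the scvi-tools scBasset tutorial).
bdata = adata.transpose()
bdata.layers["binary"] = (bdata.X.copy() > 0).astype(float)

# Register the per-peak sequence code, then train
scvi.external.SCBASSET.setup_anndata(bdata, layer="binary", dna_code_key="dna_code")

model = scvi.external.SCBASSET(bdata)
model.train()

# Cell embeddings (low-dimensional latent representation), one row per cell
latent = model.get_latent_representation()
adata.obsm["X_scbasset"] = latent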

Key Parameters:

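n_bottleneck_layer: Size of the bottleneck layer, which sets the dimensionality of the cell embedding (default: 32)
l2_reg_cell_embedding: L2 regularization applied to the cell embedding (default: 0.0)
Convolutional settings (such as the number of filters and repeated blocks) are passed through model keyword arguments; check the API reference for the exact names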

Advanced Features:

In silico mutagenesis: Predict how sequence changes affect accessibility
Motif enrichment: Identify enriched TF motifs in accessible regions
Batch correction: Similar to other scvi-tools models
Transfer learning: Fine-tune on new datasets
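
To make the idea of in silico mutagenesis concrete, the sketch below mutates every position of a short sequence and records how a scoring function responds. The scorer here is a toy stand-in that simply counts occurrences of a motif; it is not scBasset and not an scvi-tools API. With a trained model, the scoring function would instead return the predicted accessibility for the sequence. All names are hypothetical.

```python
from typing import Callable

BASES = "ACGT"


def toy_accessibility_score(sequence: str, motif: str = "GATA") -> float:
    """Stand-in for a trained model: counts (possibly overlapping) motif occurrences."""
    return float(
        sum(
            sequence[i : i + len(motif)] == motif
            for i in range(len(sequence) - len(motif) + 1)
        )
    )


def in_silico_mutagenesis(
    sequence: str, score_fn: Callable[[str], float]
) -> list[dict[str, float]]:
    """For each position, record the score change caused by each alternative base."""
    reference_score = score_fn(sequence)
    effects = []
    for position, ref_base in enumerate(sequence):
        deltas = {}
        for alt_base in BASES:
            if alt_base == ref_base:
                continue
            mutated = sequence[:position] + alt_base + sequence[position + 1 :]
            deltas[alt_base] = score_fn(mutated) - reference_score
        effects.append(deltas)
    return effects


# Toy sequence containing one GATA motif at positions 2-5
effects = in_silico_mutagenesis("TTGATATT", toy_accessibility_score)
for position, deltas in enumerate(effects):
    print(position, deltas)
```

Positions where every substitution lowers the score mark the bases the scorer depends on, which is how mutagenesis maps are usually read.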

Interpretability Tools:

scBasset learns sequence-aware cell and peak embeddings. Transcription-factor activity is assessed by scoring motif sequences against the trained model rather than calling a single importance function. See the scBasset user guide for the current motif-injection / TF-activity workflow.

python

# Cell embeddings for clustering / visualization
cell_embedding = model.get_latent_representation()

Model Selection for ATAC-seq

PeakVI

Choose when:

Standard scATAC-seq analysis workflow
Have peak count matrices (most common format)
Need fast, efficient batch correction
Want straightforward differential accessibility
Prioritize computational efficiency

Advantages:

Fast training and inference
Proven track record for scATAC-seq
Easy integration with scanpy workflow
Robust batch correction

PoissonVI

Choose when:

Have fragment-level count data
Need quantitative accessibility measures
Interested in subtle differences
Have high-coverage, high-quality data

Advantages:

More detailed quantitative information
Better for capturing gradual changes in accessibility
Appropriate statistical model for counts

scBasset

Choose when:

Want to incorporate DNA sequence
Need biological interpretation (motifs, TFs)
Interested in regulatory mechanisms
Have computational resources for CNN training
Want predictive power for new sequences

Advantages:

Sequence-based, biologically interpretable
Motif and TF analysis built-in
Predictive modeling capabilities
In silico perturbation experiments
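
The selection guidance in this section can be condensed into a small helper. This is only a sketch of the decision logic as stated here; the function name suggest_atac_model and its boolean flags are hypothetical, and a real choice will also depend on data quality and available compute.

```python
def suggest_atac_model(
    has_dna_sequences: bool,
    wants_motif_analysis: bool,
    has_fragment_counts: bool,
) -> str:
    """Return a model name following the rules of thumb in this section."""
    # Sequence-aware, motif-oriented questions point to scBasset
    if has_dna_sequences and wants_motif_analysis:
        return "scBasset"
    # Quantitative fragment counts point to PoissonVI
    if has_fragment_counts:
        return "PoissonVI"
    # Standard peak matrices default to PeakVI
    return "PeakVI"


print(suggest_atac_model(False, False, False))
print(suggest_atac_model(False, False, True))
print(suggest_atac_model(True, True, True))
```

The order of the checks encodes a priority (sequence-based questions first), which is itself an assumption and can be rearranged to suit a project.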

Workflow Example: Complete ATAC-seq Analysis

python

import scvi
import scanpy as sc

# 1. Load and preprocess ATAC-seq data
adata = sc.read_h5ad("atac_data.h5ad")

# 2. Filter low-quality peaks
sc.pp.filter_genes(adata, min_cells=10)

# 3. Setup and train PeakVI
scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key="sample"
)

model = scvi.model.PEAKVI(adata, n_latent=20)
model.train(max_epochs=400)

# 4. Extract latent representation
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent

# 5. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_PeakVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")

# 6. Differential accessibility
da_results = model.differential_accessibility(
    groupby="clusters",
    group1="0",
    group2="1"
)

# 7. Save model
model.save("peakvi_model")

Integration with Gene Expression (RNA+ATAC)

For paired multimodal data (RNA+ATAC from the same cells), use MultiVI instead:

python

import scvi
from mudata import MuData

# MultiVI is configured from a MuData object (setup_anndata is deprecated as of v1.3)
mdata = MuData({"rna": rna_adata, "atac": atac_adata})
scvi.model.MULTIVI.setup_mudata(
    mdata,
    batch_key="sample",
    modalities={"rna_layer": "rna", "atac_layer": "atac"},
)

model = scvi.model.MULTIVI(
    mdata,
    n_genes=rna_adata.n_vars,
    n_regions=atac_adata.n_vars,
)
model.train()

# Get joint latent space
latent = model.get_latent_representation()

See models-multimodal.md for more details on multimodal integration.

Best Practices for ATAC-seq Analysis

Quality Control:
- Filter cells with very low or very high peak counts
- Remove peaks present in very few cells
- Filter mitochondrial and sex chromosome peaks if needed

Batch Correction:
- Always include batch_key if integrating multiple samples
- Consider technical covariates (sequencing depth, TSS enrichment)

Feature Selection:
- Unlike RNA-seq, all peaks are often used
- Consider filtering very rare peaks for efficiency

Latent Dimensions:
- Start with n_latent=10-30 depending on dataset complexity
- Larger values for more heterogeneous datasets

Downstream Analysis:
- Use latent representations for clustering and visualization
- Link peaks to genes for regulatory analysis
- Perform motif enrichment on cluster-specific peaks

Computational Considerations:
- ATAC-seq matrices are often very large (many peaks)
- Consider downsampling peaks for initial exploration
- Use GPU acceleration for large datasets
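
Two of the practices above, filtering cells by how many peaks they cover and downsampling peaks for a first exploratory pass, can be sketched with NumPy as follows. The function names, thresholds, and toy matrix are assumptions for illustration; suitable thresholds depend on the dataset.

```python
import numpy as np


def qc_cell_mask(counts: np.ndarray, min_peaks: int, max_peaks: int) -> np.ndarray:
    """Keep cells whose number of accessible peaks lies within [min_peaks, max_peaks]."""
    peaks_per_cell = (counts > 0).sum(axis=1)
    return (peaks_per_cell >= min_peaks) & (peaks_per_cell <= max_peaks)


def subsample_peaks(n_peaks: int, n_keep: int, seed: int = 0) -> np.ndarray:
    """Pick a random, sorted subset of peak indices without replacement."""
    rng = np.random.default_rng(seed)
    return np.sort(rng.choice(n_peaks, size=min(n_keep, n_peaks), replace=False))


# Toy matrix: 4 cells x 6 peaks (synthetic values, for illustration only)
toy_counts = np.array(
    [
        [1, 0, 0, 0, 0, 0],  # 1 accessible peak: below the lower bound
        [1, 1, 0, 1, 0, 0],  # 3 accessible peaks
        [1, 1, 1, 1, 1, 1],  # 6 accessible peaks: above the upper bound
        [0, 2, 1, 0, 0, 3],  # 3 accessible peaks
    ]
)
cell_mask = qc_cell_mask(toy_counts, min_peaks=2, max_peaks=5)
peak_idx = subsample_peaks(toy_counts.shape[1], n_keep=3)
subset = toy_counts[cell_mask][:, peak_idx]
print(cell_mask, subset.shape)
```

Random downsampling is only meant for quick exploration; the final model would normally be trained on the full filtered peak set.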