skills/scvi-tools/references/models-atac-seq.md
This document covers models for analyzing single-cell ATAC-seq and chromatin accessibility data in scvi-tools.
PeakVI
Purpose: Analysis and integration of single-cell ATAC-seq data using peak counts.
Key Features:
When to Use:
Data Requirements:
Basic Usage:
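import scanpy as sc
import scvi
# Prepare data (adata is an AnnData object; peaks should be in adata.X)
# Optional: filter peaks detected in fewer than 3 cells
# (filter_genes acts on the variables of adata, which are peaks here)
sc.pp.filter_genes(adata, min_cells=3)
# Set up data
scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key="batch",
)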
# Train model
model = scvi.model.PEAKVI(adata)
model.train()
# Get latent representation (batch-corrected)
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent
# Differential accessibility
da_results = model.differential_accessibility(
    groupby="cell_type",
    group1="TypeA",
    group2="TypeB",
)
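As a rough illustration of what the optional peak filter in the usage above does, the following NumPy-only sketch keeps peaks detected in at least `min_cells` cells. The toy matrix and the helper name `filter_peaks_by_min_cells` are hypothetical; on a real AnnData object, `sc.pp.filter_genes` performs the filtering in place.

```python
import numpy as np


def filter_peaks_by_min_cells(counts, min_cells=3):
    """Keep peaks (columns) that are detected in at least `min_cells` cells.

    `counts` is a cells x peaks matrix; a peak counts as detected in a cell
    when its entry is greater than zero. Returns (filtered_counts, keep_mask).
    """
    counts = np.asarray(counts)
    cells_per_peak = (counts > 0).sum(axis=0)
    keep = cells_per_peak >= min_cells
    return counts[:, keep], keep


# Toy matrix: 4 cells x 3 peaks (hypothetical values).
toy = np.array([
    [1, 0, 2],
    [3, 0, 1],
    [1, 1, 0],
    [0, 0, 4],
])
filtered, keep = filter_peaks_by_min_cells(toy, min_cells=3)
```

Here the middle peak is seen in only one cell, so it is dropped while the other two are kept.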
Key Parameters:
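- n_latent: Dimensionality of the latent space (default: None, in which case it is set to the square root of n_hidden)
- n_hidden: Number of nodes per hidden layer (default: None, in which case it is set to the square root of the number of regions)
- n_layers_encoder / n_layers_decoder: Number of hidden layers in the encoder and decoder (default: 2 each)
- region_factors: Whether to learn region-specific factors (default: True)
- latent_distribution: Distribution for the latent space ("normal" or "ln")
Outputs: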
- get_latent_representation(): Low-dimensional embeddings for cells
- get_accessibility_estimates(): Normalized accessibility values
- differential_accessibility(): Statistical testing for differential peaks
- get_region_factors(): Peak-specific scaling factors
Best Practices:
- Use region_factors=True for datasets with high technical variation
- Store the latent representation in adata.obsm for downstream analysis with scanpy
PoissonVI
Purpose: Quantitative analysis of scATAC-seq fragment counts (more detailed than peak counts).
Key Features:
When to Use:
Data Requirements:
Basic Usage (PoissonVI lives in scvi.external):
import scvi
# Set up data and train
scvi.external.POISSONVI.setup_anndata(
    adata,
    batch_key="batch",
)
model = scvi.external.POISSONVI(adata)
model.train()
# Get results
latent = model.get_latent_representation()
accessibility = model.get_normalized_accessibility()
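The difference between a binary accessibility signal (as PeakVI models it) and a quantitative fragment-count signal can be sketched with NumPy alone. The toy matrix is hypothetical, and the reads-to-fragments step is only a common approximation (roughly two reads per paired-end fragment), not necessarily what any particular pipeline or scvi-tools does internally.

```python
import numpy as np

# Hypothetical cells x peaks matrix of read counts.
reads = np.array([
    [0, 1, 2],
    [4, 0, 3],
])

# Binary accessibility: 1 if the peak has any signal in the cell, else 0.
binary = (reads > 0).astype(np.int64)

# Approximate fragment counts, assuming about two reads per fragment
# and rounding up (integer equivalent of ceil(reads / 2)).
fragments = (reads + 1) // 2
```

The binary matrix discards how strongly a peak is covered, whereas the fragment matrix keeps that quantitative information.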
Key Differences from PeakVI:
When to Choose PoissonVI over PeakVI:
scBasset
Purpose: Deep learning approach to scATAC-seq analysis with interpretability and motif analysis.
Key Features:
When to Use:
Data Requirements:
Basic Usage (scBasset lives in scvi.external):
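import scvi
# scBasset needs per-peak DNA sequences. Add them to the AnnData first;
# this downloads the genome (once) and stores integer-encoded sequences in
# adata.varm["dna_code"]. adata.var must hold the peak coordinates
# (by default in the columns "chr", "start" and "end").
scvi.data.add_dna_sequence(
    adata,
    genome_name="hg38",
    install_genome=True,
)
# scBasset treats regions as observations, so train on the transposed AnnData
# with a binarized accessibility layer (as in the scvi-tools scBasset tutorial)
bdata = adata.transpose()
bdata.layers["binary"] = (bdata.X.copy() > 0).astype(float)
# Register the per-peak sequence code (now in bdata.obsm), then train
scvi.external.SCBASSET.setup_anndata(bdata, layer="binary", dna_code_key="dna_code")
model = scvi.external.SCBASSET(bdata)
model.train()
# Cell embeddings (low-dimensional latent representation)
latent = model.get_latent_representation()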
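To make the idea of a per-peak sequence code concrete, here is a small NumPy sketch that turns a DNA string into integer codes and then into a one-hot matrix. The A/C/G/T ordering, the handling of unknown bases, and the helper names are assumptions for illustration; the exact encoding that `add_dna_sequence` stores may differ.

```python
import numpy as np

BASES = "ACGT"  # assumed ordering, for illustration only


def encode_dna(seq):
    """Map a DNA string to integer codes (A=0, C=1, G=2, T=3).

    Any other character (for example N) is encoded as -1.
    """
    lookup = {base: i for i, base in enumerate(BASES)}
    return np.array([lookup.get(b, -1) for b in seq.upper()], dtype=np.int64)


def one_hot(codes):
    """Expand integer codes to a (length, 4) one-hot matrix.

    Rows belonging to unknown bases (code -1) stay all-zero.
    """
    out = np.zeros((codes.size, len(BASES)), dtype=np.float32)
    known = codes >= 0
    out[np.flatnonzero(known), codes[known]] = 1.0
    return out


codes = encode_dna("ACGTN")
matrix = one_hot(codes)
```

A sequence model such as scBasset consumes this kind of representation for a fixed-length window around each peak.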
Key Parameters:
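- n_bottleneck_layer: Size of the bottleneck layer, which sets the latent space (cell embedding) dimensionality (default: 32)
- conv_layers: Number of convolutional layers
- n_filters: Number of filters per conv layer
- filter_size: Size of convolutional filters
The last three describe the convolutional architecture; the exact keyword names may differ in scvi.external.SCBASSET, so check the API reference for your installed version.
Advanced Features: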
Interpretability Tools:
scBasset learns sequence-aware cell and peak embeddings. Transcription-factor activity is assessed by scoring motif sequences against the trained model rather than calling a single importance function. See the scBasset user guide for the current motif-injection / TF-activity workflow.
# Cell embeddings for clustering / visualization
cell_embedding = model.get_latent_representation()
Choose when:
Advantages:
Choose when:
Advantages:
Choose when:
Advantages:
import scvi
import scanpy as sc
# 1. Load and preprocess ATAC-seq data
adata = sc.read_h5ad("atac_data.h5ad")
# 2. Filter low-quality peaks
sc.pp.filter_genes(adata, min_cells=10)
# 3. Set up and train PeakVI
scvi.model.PEAKVI.setup_anndata(
    adata,
    batch_key="sample",
)
model = scvi.model.PEAKVI(adata, n_latent=20)
model.train(max_epochs=400)
# 4. Extract latent representation
latent = model.get_latent_representation()
adata.obsm["X_PeakVI"] = latent
# 5. Downstream analysis
sc.pp.neighbors(adata, use_rep="X_PeakVI")
sc.tl.umap(adata)
sc.tl.leiden(adata, key_added="clusters")
# 6. Differential accessibility
da_results = model.differential_accessibility(
    groupby="clusters",
    group1="0",
    group2="1",
)
# 7. Save model
model.save("peakvi_model")
For paired multimodal data (RNA and ATAC measured in the same cells), use MultiVI instead:
import scvi
from mudata import MuData
# MultiVI is configured from a MuData object via setup_mudata
# (the AnnData-based setup_anndata is deprecated in recent scvi-tools releases)
mdata = MuData({"rna": rna_adata, "atac": atac_adata})
scvi.model.MULTIVI.setup_mudata(
    mdata,
    batch_key="sample",
    modalities={"rna_layer": "rna", "atac_layer": "atac"},
)
model = scvi.model.MULTIVI(
    mdata,
    n_genes=rna_adata.n_vars,
    n_regions=atac_adata.n_vars,
)
model.train()
# Get joint latent space
latent = model.get_latent_representation()
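Because the paired setting assumes both modalities describe the same cells, it can help to check that the barcodes match before building the MuData object. The helper below is a hypothetical, dependency-free sketch; `rna_barcodes` and `atac_barcodes` stand in for `rna_adata.obs_names` and `atac_adata.obs_names`, and the barcode strings are made up.

```python
def shared_barcode_order(rna_barcodes, atac_barcodes):
    """Return the barcodes present in both modalities, in RNA order."""
    atac_set = set(atac_barcodes)
    return [bc for bc in rna_barcodes if bc in atac_set]


# Hypothetical barcodes: two cells are shared, one is unique to each modality.
rna_barcodes = ["AAAC-1", "AAAG-1", "AACT-1"]
atac_barcodes = ["AACT-1", "AAAC-1", "TTTG-1"]
shared = shared_barcode_order(rna_barcodes, atac_barcodes)
```

Both AnnData objects could then be subset with the same list (for example `rna_adata[shared].copy()` and `atac_adata[shared].copy()`) so that their rows line up.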
See models-multimodal.md for more details on multimodal integration.
Quality Control:
Batch Correction:
- Set batch_key if integrating multiple samples
Feature Selection:
Latent Dimensions:
- Use n_latent=10-30 depending on dataset complexity
Downstream Analysis:
Computational Considerations: