skills/pydeseq2/references/api_reference.md
This document provides a practical API reference for PyDESeq2 0.5.x classes, methods, and utilities.
DeseqDataSet (pydeseq2.dds) is the main class for differential expression analysis. It handles data processing from normalization through log-fold change fitting.
Purpose: Implements dispersion and log fold-change (LFC) estimation for RNA-seq count data.
Initialization Parameters:
- counts: pandas DataFrame of shape (samples × genes) containing non-negative integer read counts
- metadata: pandas DataFrame of shape (samples × variables) with sample annotations
- design: formulaic/Wilkinson formula string or design matrix specifying the statistical model (e.g., "~condition", "~batch + condition")
- fit_type: dispersion trend fit type, "parametric" or "mean" (default: "parametric")
- size_factors_fit_type: size factor method, "ratio", "poscounts", or "iterative" (default: "ratio")
- control_genes: optional genes used for size factor fitting, useful for invariant housekeeping genes
- refit_cooks: bool, whether to refit parameters after removing Cook's distance outliers (default: True)
- inference: optional inference backend, usually DefaultInference(n_cpus=...)
- quiet: bool, suppress progress messages (default: False)
- low_memory: bool, remove intermediate structures after use (default: False)

Deprecated 0.5.x parameters: avoid design_factors, continuous_factors, and ref_level in new workflows. Continuous variables are detected from the formula; categorical handling should be expressed through formulaic syntax or pandas categorical dtypes.
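The "ratio" option for size_factors_fit_type refers to the median-of-ratios normalization used by DESeq2. As a rough illustration of the idea only (this is a plain-Python sketch, not PyDESeq2's implementation, and the function name `median_of_ratios` is hypothetical), each sample's size factor is the median, over genes with no zero counts, of the ratio between that sample's count and the gene's geometric mean across samples:

```python
import math
from statistics import median


def median_of_ratios(counts):
    """Sketch of median-of-ratios size factors.

    `counts` is a list of samples, each a list of per-gene counts
    (samples x genes), standing in for the counts DataFrame.
    """
    n_samples = len(counts)
    n_genes = len(counts[0])

    # Log geometric mean per gene; genes with a zero in any sample are skipped.
    log_geo_means = []
    for g in range(n_genes):
        column = [counts[s][g] for s in range(n_samples)]
        if all(c > 0 for c in column):
            log_geo_means.append(sum(math.log(c) for c in column) / n_samples)
        else:
            log_geo_means.append(None)

    # Size factor = exp(median of log ratios to the geometric mean).
    size_factors = []
    for s in range(n_samples):
        log_ratios = [
            math.log(counts[s][g]) - lgm
            for g, lgm in enumerate(log_geo_means)
            if lgm is not None
        ]
        size_factors.append(math.exp(median(log_ratios)))
    return size_factors


# Toy data: the second sample was sequenced twice as deeply as the first.
counts = [[10, 20, 30, 0], [20, 40, 60, 5]]
size_factors = median_of_ratios(counts)
```

In this toy example the second size factor is twice the first, reflecting the difference in sequencing depth. The gene with a zero count is ignored, which is why "poscounts" or "iterative" may be preferable when many genes contain zeros.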
Key Methods:
deseq2()
Run the complete DESeq2 pipeline for normalization and dispersion/LFC fitting.
Steps performed:
- Refit parameters after removing Cook's distance outliers, if refit_cooks=True

Returns: None (modifies object in-place)
to_picklable_anndata()
Convert the DeseqDataSet to an AnnData object that can be serialized.
Returns: AnnData object with:
- X: count data matrix
- obs: sample-level metadata (1D)
- var: gene-level metadata (1D)
- varm: gene-level multi-dimensional data (e.g., LFC estimates)

Usage:
dds.to_picklable_anndata().write_h5ad("result_adata.h5ad")
Only load pickle files from trusted sources. Prefer .h5ad or CSV for exchanging results between tools or collaborators.
Attributes (after running deseq2()):
- layers: dict containing various matrices (normalized counts, etc.)
- varm: dict containing gene-level results (log fold changes, dispersions, etc.)
- obsm: dict containing sample-level information
- uns: dict containing global parameters

DeseqStats (pydeseq2.ds) is the class for performing statistical tests and computing p-values for differential expression.
Purpose: Facilitates PyDESeq2 statistical tests using Wald tests and optional LFC shrinkage.
Initialization Parameters:
- dds: DeseqDataSet object that has been processed with deseq2()
- contrast: list or numpy array specifying the contrast for testing
  - Format: [variable, test_level, reference_level]
  - Example: ["condition", "treated", "control"] tests treated vs control
- alpha: float, significance threshold for independent filtering (default: 0.05)
- cooks_filter: bool, whether to filter outliers based on Cook's distance (default: True)
- independent_filter: bool, whether to perform independent filtering (default: True)
- lfc_null: log2 fold-change under the null hypothesis for thresholded tests (default: 0.0)
- alt_hypothesis: optional thresholded-test alternative ("greaterAbs", "lessAbs", "greater", or "less")
- inference: optional inference backend, usually the same DefaultInference object used for DeseqDataSet
- quiet: bool, suppress progress messages (default: False)
- n_cpus: int, number of CPUs for parallel processing (optional)

PyDESeq2 0.5.x no longer supports default contrasts. Always pass contrast.
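Because contrast must always be given explicitly, it can help to check it against the metadata before constructing DeseqStats. The following is a minimal sketch under stated assumptions: `check_contrast` is a hypothetical helper (not part of PyDESeq2), it covers only the three-element list form, and a plain dict of lists stands in for the metadata DataFrame.

```python
def check_contrast(metadata, contrast):
    """Validate a [variable, test_level, reference_level] contrast.

    `metadata` maps column names to lists of per-sample values and stands in
    for the pandas metadata DataFrame. Hypothetical helper for illustration.
    """
    if len(contrast) != 3:
        raise ValueError("contrast must be [variable, test_level, reference_level]")
    variable, test_level, reference_level = contrast
    if variable not in metadata:
        raise ValueError(f"unknown variable: {variable!r}")
    levels = set(metadata[variable])
    for level in (test_level, reference_level):
        if level not in levels:
            raise ValueError(f"{level!r} is not a level of {variable!r}")
    if test_level == reference_level:
        raise ValueError("test and reference levels must differ")
    return variable, test_level, reference_level


metadata = {"condition": ["control", "control", "treated", "treated"]}
checked = check_contrast(metadata, ["condition", "treated", "control"])
```

A misspelled level such as "untreated" raises a ValueError here, before any model fitting has been attempted.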
Key Methods:
summary()
Run Wald tests and compute p-values and adjusted p-values.
Steps performed:
Returns: None (results stored in results_df attribute)
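To make the quantities computed by summary() concrete, here is a simplified plain-Python sketch of a two-sided Wald test against a null of zero, followed by Benjamini-Hochberg adjustment. It is an illustration only, not PyDESeq2's code: the function names are hypothetical, and it leaves out Cook's filtering, independent filtering, and thresholded tests (lfc_null, alt_hypothesis).

```python
from statistics import NormalDist


def wald_test(lfc, lfc_se):
    """Two-sided Wald test of LFC = 0 using a standard normal reference."""
    stat = lfc / lfc_se
    pvalue = 2.0 * NormalDist().cdf(-abs(stat))
    return stat, pvalue


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


stat, pvalue = wald_test(2.0, 1.0)
padj = benjamini_hochberg([0.01, 0.04, 0.03, 0.2])
```

The statistic and p-value correspond to the stat and pvalue columns of results_df, and the adjusted values to padj, for the default test settings.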
Result DataFrame columns:
- baseMean: mean normalized count across all samples
- log2FoldChange: log2 fold change between conditions
- lfcSE: standard error of the log2 fold change
- stat: Wald test statistic
- pvalue: raw p-value
- padj: adjusted p-value (FDR-corrected)

lfc_shrink(coeff, adapt=True)
Apply shrinkage to log fold changes using the apeGLM method.
Purpose: Reduces noise in LFC estimates for better visualization and ranking, especially for genes with low counts or high variability.
Parameters:
- coeff: coefficient name to shrink, matching a column in dds.obsm["design_matrix"] (for example, "condition[T.treated]")
- adapt: whether to adapt the prior scale from MLE estimates (default: True)

Important: Shrinkage is applied only for visualization/ranking purposes. The statistical test results (p-values, adjusted p-values) remain unchanged.
Returns: None (updates results_df with shrunk LFCs)
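Since coeff must match a design-matrix column exactly, a small lookup can catch naming mistakes early. This sketch assumes treatment-coded column names of the form "variable[T.level]", as in the example above. `find_coeff` is a hypothetical helper, and the column list is a stand-in for the columns of dds.obsm["design_matrix"].

```python
def find_coeff(design_columns, variable, level):
    """Return the design-matrix column name for `level` of `variable`.

    Assumes treatment-coded names such as "condition[T.treated]".
    Hypothetical helper for illustration.
    """
    name = f"{variable}[T.{level}]"
    if name not in design_columns:
        raise KeyError(f"{name} not found; available columns: {list(design_columns)}")
    return name


# Stand-in for the design matrix column names of a fitted dataset.
design_columns = ["Intercept", "condition[T.treated]"]
coeff = find_coeff(design_columns, "condition", "treated")
```

Asking for the reference level (here "control") raises a KeyError, because the reference level has no coefficient of its own under treatment coding.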
Attributes:
- results_df: pandas DataFrame containing test results (available after summary())

pydeseq2.utils.load_example_data(modality, dataset="synthetic", debug=False)
Load synthetic example datasets for testing and tutorials.
Parameters:
- modality: data modality to load, commonly "raw_counts" or "metadata"
- dataset: example dataset name, commonly "synthetic"
- debug: whether to load a smaller debug dataset

Returns: a pandas DataFrame for the requested modality. Call the function once per modality to obtain both tables.
- modality="raw_counts": counts_df, pandas DataFrame with synthetic count data
- modality="metadata": metadata_df, pandas DataFrame with sample annotations

The pydeseq2.preprocessing module provides normalization utilities used by the core pipeline.
Common operations:
Inference (pydeseq2.inference): abstract base class defining the interface for DESeq2-related inference methods.
DefaultInference (pydeseq2.default_inference): default implementation of inference methods using scipy, sklearn, and numpy.
Purpose: Provides the mathematical implementations for:
# Only if counts are stored as genes × samples: transpose to samples × genes
counts_df = counts_df.T

from pydeseq2.dds import DeseqDataSet
from pydeseq2.default_inference import DefaultInference
from pydeseq2.ds import DeseqStats
# 1. Initialize dataset
inference = DefaultInference(n_cpus=4)
dds = DeseqDataSet(
    counts=counts_df,
    metadata=metadata,
    design="~condition",
    refit_cooks=True,
    inference=inference,
)
# 2. Fit dispersions and LFCs
dds.deseq2()
# 3. Perform statistical testing
ds = DeseqStats(
    dds,
    contrast=["condition", "treated", "control"],
    alpha=0.05,
    inference=inference,
)
ds.summary()
# 4. Optional: Shrink LFCs for visualization
ds.lfc_shrink(coeff="condition[T.treated]")
# 5. Access results
results = ds.results_df
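Once results_df is available, a common follow-up is to select significant genes. The sketch below uses a small hand-made DataFrame with the same column names as a stand-in for ds.results_df; the gene names, values, and thresholds (padj < 0.05, |log2FoldChange| > 1) are illustrative assumptions rather than recommendations. It also assumes that padj may be missing (NaN) for genes removed by filtering; such rows drop out because comparisons with NaN evaluate to False.

```python
import pandas as pd

# Toy stand-in for ds.results_df (values are made up for illustration).
results = pd.DataFrame(
    {
        "log2FoldChange": [2.5, 0.3, -1.8, 3.0],
        "padj": [0.001, 0.20, 0.04, float("nan")],
    },
    index=["geneA", "geneB", "geneC", "geneD"],
)

# Keep genes passing both the FDR and the effect-size threshold.
mask = (results["padj"] < 0.05) & (results["log2FoldChange"].abs() > 1.0)
significant = results[mask].sort_values("padj")
```

Here geneB fails both thresholds and geneD is excluded because its adjusted p-value is missing, leaving geneA and geneC.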
PyDESeq2 aims to match the default settings of DESeq2 v1.34.0 for single-factor and multi-factor Wald-test workflows. Some differences may exist because it is a from-scratch reimplementation in Python.
Tested with:
Important 0.5.x changes:
- design should be a formulaic formula string or an explicit design matrix.
- design_factors, continuous_factors, and ref_level are deprecated.
- DeseqStats requires an explicit contrast.
- lfc_shrink() requires an explicit coeff.
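When migrating older code that passed a list of factor names, the equivalent formula string can be built mechanically. This is a minimal sketch: `factors_to_formula` is a hypothetical helper, not a PyDESeq2 function, and it handles only additive terms (no interactions or reference-level handling).

```python
def factors_to_formula(factors):
    """Build an additive formula string such as "~batch + condition".

    Hypothetical migration helper; accepts a single name or a list of names.
    """
    if isinstance(factors, str):
        factors = [factors]
    if not factors:
        raise ValueError("at least one factor is required")
    return "~" + " + ".join(factors)


formula = factors_to_formula(["batch", "condition"])
single = factors_to_formula("condition")
```

The resulting strings can be passed as the design argument of DeseqDataSet.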