skills/pathway-enrichment/references/interpretation.md
ORA asks: among my k hits (out of a background of N genes), are more in gene set S (size K) than expected by chance? This is a hypergeometric / Fisher's exact test. It depends entirely on the threshold used to define hits and on the background N. Good when there is a clear, strong hit list.
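The ORA question above reduces to an upper-tail hypergeometric probability. The following is a minimal sketch using only the standard library; the function name `ora_p_value` and the example counts are illustrative assumptions, not part of any enrichment package.

```python
from math import comb


def ora_p_value(N: int, K: int, k: int, x: int) -> float:
    """Upper-tail hypergeometric probability P(X >= x).

    N: background size, K: gene-set size within the background,
    k: number of hits, x: number of hits that fall in the gene set.
    """
    if not (0 <= K <= N and 0 <= k <= N and 0 <= x <= min(K, k)):
        raise ValueError("inconsistent counts")
    # Sum the probability of every overlap at least as large as the observed one.
    tail = sum(comb(K, i) * comb(N - K, k - i) for i in range(x, min(K, k) + 1))
    return tail / comb(N, k)


# Illustrative numbers only: 200 hits from 12,000 tested genes,
# 8 of which fall in a 100-gene set (about 1.7 expected by chance).
p = ora_p_value(N=12_000, K=100, k=200, x=8)
print(f"ORA p-value: {p:.2e}")
```

This is the one-sided Fisher's exact test on the 2×2 table of (hit / not hit) × (in set / not in set). The p-value still needs multiple-testing correction across all gene sets tested.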
GSEA asks: walking down the fully ranked list of all tested genes, is gene set S concentrated near the top (or bottom)? It uses a weighted Kolmogorov–Smirnov-like running sum; significance comes from permutations. No arbitrary threshold; sensitive to coordinated, modest shifts across many genes. Better when effects are broad/subtle or when a hit list would be very short or very long.
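The running sum described above can be sketched in a few lines. This is a simplified illustration with toy data, not the implementation used by GSEA or gseapy: it computes only the raw enrichment score and the leading-edge genes, and omits the permutation step and NES normalization. The function name `enrichment_score` is an assumption made for this example.

```python
def enrichment_score(ranked, gene_set, weight=1.0):
    """Return (ES, peak_index, leading_edge) for one gene set.

    ranked: list of (gene, score) pairs sorted by descending score.
    weight: exponent applied to |score| for hits (0 gives the unweighted KS form).
    """
    is_hit = [gene in gene_set for gene, _ in ranked]
    n_miss = is_hit.count(False)
    hit_mass = sum(abs(s) ** weight for (_, s), hit in zip(ranked, is_hit) if hit)
    if n_miss == 0 or hit_mass == 0:
        raise ValueError("need at least one hit with non-zero weight and one miss")

    running, es, peak = 0.0, 0.0, -1
    for i, ((_, score), hit) in enumerate(zip(ranked, is_hit)):
        # Step up by the hit's share of the total hit weight, down by 1/n_miss.
        running += abs(score) ** weight / hit_mass if hit else -1.0 / n_miss
        if abs(running) > abs(es):
            es, peak = running, i

    if es >= 0:
        # Positive ES: set members ranked at or before the peak.
        leading = [g for (g, _), hit in zip(ranked[: peak + 1], is_hit) if hit]
    else:
        # Negative ES: set members ranked after the trough.
        leading = [g for (g, _), hit in zip(ranked[peak + 1:], is_hit[peak + 1:]) if hit]
    return es, peak, leading


# Toy ranked list (gene, score); not real data.
toy_ranked = [("A", 3.0), ("B", 2.5), ("C", 2.0), ("D", 1.0),
              ("E", 0.5), ("F", -0.5), ("G", -1.0), ("H", -2.0)]
es, peak, leading = enrichment_score(toy_ranked, {"A", "B", "D"})
print(f"ES={es:.3f}, peak index={peak}, leading edge={leading}")
```

In a real analysis, significance would come from recomputing this score on permuted data, which is what the `permutation_num` setting controls in gseapy.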
Rule of thumb: a discrete hit list → ORA; a ranked table with per-gene scores → GSEA. They answer different questions and can legitimately disagree.
The background (the "domain" / universe) is the set of genes that could have appeared as a hit. For RNA-seq that is the set of expressed/tested genes, not all ~20,000 protein-coding genes. Using too large a background makes ordinary housekeeping categories look significant — the most common way ORA results mislead.
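To see how much the background matters, the sketch below runs the same hit list against the same gene set twice: once with a tested-genes background and once with a genome-wide one. All gene identifiers and counts are synthetic, and the helper names (`hypergeom_tail`, `ora_counts`) are assumptions made for this illustration.

```python
from math import comb


def hypergeom_tail(N, K, k, x):
    """P(X >= x) when drawing k genes from N, of which K are in the set."""
    tail = sum(comb(K, i) * comb(N - K, k - i) for i in range(x, min(K, k) + 1))
    return tail / comb(N, k)


def ora_counts(hits, gene_set, background):
    """Restrict hits and gene set to the background, then return (N, K, k, x)."""
    bg = set(background)
    hits_bg = set(hits) & bg
    set_bg = set(gene_set) & bg
    return len(bg), len(set_bg), len(hits_bg), len(hits_bg & set_bg)


# Synthetic identifiers: 20,000 "genome" genes, of which the first 2,000 were tested.
genome = [f"g{i}" for i in range(20_000)]
tested = genome[:2_000]
# A 300-gene set: 250 members are expressed/tested, 50 are not.
gene_set = set(genome[:250]) | set(genome[2_000:2_050])
# 200 hits, all tested; 30 of them belong to the gene set.
hits = set(genome[:30]) | set(genome[250:420])

p_tested = hypergeom_tail(*ora_counts(hits, gene_set, tested))
p_genome = hypergeom_tail(*ora_counts(hits, gene_set, genome))
print(f"tested background: p = {p_tested:.3g}")
print(f"genome background: p = {p_genome:.3g}")
```

With the tested background about 25 overlapping genes are expected by chance, so 30 is unremarkable; with the genome-wide background only about 3 are expected, and the same overlap looks highly significant.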
Set the background explicitly, for example with `domain_scope='custom', background=...` or gseapy `gp.enrich()` with an explicit background.

- Adjusted p-value / FDR (`Adjusted P-value`, `FDR q-val`). Controls the expected false-discovery proportion. Use < 0.05. FDR is computed within a library/run. Running many libraries multiplies the total tests, so report per-library FDR and avoid cherry-picking the one library that produced a hit.
- GSEA FDR threshold: < 0.05, or < 0.25 for exploratory hypothesis generation (the GSEA convention).
- Leading edge (`Lead_genes`): the subset of genes that drive the signal (those ranked before the running-sum peak for a positive enrichment score, or after the trough for a negative one). Report these; they are the concrete biology and are useful for overlap/redundancy analysis.

GO and large pathway sets return many overlapping terms describing the same biology. Don't list 40 near-duplicates. Options:

- Build a term-overlap network with `gp.enrichment_map(...)`; render with networkx (see the networkx skill).
- Restrict gene-set size: the `min_size`/`max_size` filters (15–500) exist for this reason.

Record the run parameters: `permutation_num`, `seed`, `min_size`, `max_size`, `weight`, and the ranking metric (e.g., DESeq2 `stat`).

Report a compact, reviewer-friendly table:

| Term | Source | Direction (NES / Odds Ratio) | Overlap / Set size | FDR | Key genes |
|---|---|---|---|---|---|
| Interferon alpha response | Hallmark | NES +2.1 | 38/97 | 1e-4 | STAT1, IRF7, ISG15 |
For ORA use Odds Ratio + Overlap (k/K); for GSEA use NES + leading-edge size. Note method, library version, background, and correction in the legend.
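A table in this layout can be generated from result records with a small helper. The sketch below is one possible approach; the dictionary keys (`term`, `source`, `direction`, `overlap`, `set_size`, `fdr`, `key_genes`) are hypothetical and would need to be mapped from whatever columns the enrichment tool actually returns. The single example row reuses the values from the table above.

```python
def format_report(rows):
    """Render enrichment results as a Markdown table, most significant first."""
    lines = [
        "| Term | Source | Direction (NES / Odds Ratio) | Overlap / Set size | FDR | Key genes |",
        "|---|---|---|---|---|---|",
    ]
    for r in sorted(rows, key=lambda r: r["fdr"]):
        lines.append(
            f"| {r['term']} | {r['source']} | {r['direction']} "
            f"| {r['overlap']}/{r['set_size']} | {r['fdr']:.0e} "
            f"| {', '.join(r['key_genes'])} |"
        )
    return "\n".join(lines)


# Example record with hypothetical field names.
rows = [
    {
        "term": "Interferon alpha response",
        "source": "Hallmark",
        "direction": "NES +2.1",
        "overlap": 38,
        "set_size": 97,
        "fdr": 1e-4,
        "key_genes": ["STAT1", "IRF7", "ISG15"],
    }
]
print(format_report(rows))
```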