Databases, Gene Sets, and Gene-ID Mapping

skills/pathway-enrichment/references/databases-and-gene-sets.md

Picking libraries by question

Match the database to the biological question instead of running everything:

| Question | Best gene sets |
| --- | --- |
| "What are the broad themes?" | MSigDB Hallmark (50 curated, low redundancy) |
| "What mechanism/process?" | GO Biological Process |
| "Which curated pathways?" | Reactome, KEGG, WikiPathways |
| "Molecular function / localization?" | GO MF / GO CC |
| "Immune signatures?" | MSigDB C7 (ImmuneSigDB) |
| "Oncogenic / perturbation?" | MSigDB C6 (oncogenic), C2:CGP |
| "TF targets / regulons?" | MSigDB C3, ChEA, or decoupler (below) |
| "Disease/phenotype association?" | g:Profiler HP, DisGeNET, GWAS Catalog |

Start narrow (Hallmark plus one of GO:BP or Reactome). Add libraries only if the question needs them: each extra library adds more tested terms and so increases the multiple-testing burden.
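To make the cost of extra libraries concrete, here is a minimal, stdlib-only sketch of Benjamini-Hochberg adjustment applied to one library alone and then to a pool with a second library. The p-values and library sizes are made up for illustration, and pooling all terms into one correction is an assumption; some tools correct per library instead.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values: one strong hit plus null-looking terms.
hallmark = [0.001] + [0.5] * 49      # 50 terms tested
extra_library = [0.5] * 950          # 950 more terms, none significant

alone = bh_adjust(hallmark)[0]
pooled = bh_adjust(hallmark + extra_library)[0]
print(f"adjusted p, Hallmark only:      {alone:.3f}")
print(f"adjusted p, pooled with extras: {pooled:.3f}")
```

The same raw p-value passes a 0.05 cutoff when 50 terms are tested but not when it is pooled with 950 uninformative ones.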

The main gene-set databases

  • GO (Gene Ontology) — three namespaces: Biological Process (BP), Molecular Function (MF), Cellular Component (CC). Hierarchical → highly redundant; collapse terms after testing (see interpretation.md).
  • KEGG — manually curated metabolic & signaling pathways. Compact, well known.
  • Reactome — large, expert-curated, hierarchical human pathway set; good granularity. APIs in database-lookup.
  • WikiPathways — community-curated pathways; complements KEGG/Reactome.
  • MSigDB — collections of collections (Hallmark, curated, GO, immune, etc.); the standard source of GMT files for GSEA.
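Since GMT is the exchange format most of these databases end up in, a small parser can be handy for inspecting a library before running anything. The sketch below assumes the usual GMT layout (one set per line: name, description, then genes, all tab-separated); the example content is made up.

```python
import io

def read_gmt(handle):
    """Parse GMT text into {set_name: [genes]}."""
    gene_sets = {}
    for line in handle:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets

# Tiny made-up GMT content; real files come from MSigDB or similar sources.
example = io.StringIO(
    "SET_A\thttp://example.org/a\tTP53\tMDM2\tCDKN1A\n"
    "SET_B\tna\tIL6\tTNF\n"
)
sets = read_gmt(example)
print({name: len(genes) for name, genes in sets.items()})
```

For a file on disk, pass an open text handle instead of the `io.StringIO` object.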

MSigDB collections

| Collection | Contents |
| --- | --- |
| H (h.all) | Hallmark: 50 refined, non-redundant signatures (best default for GSEA) |
| C2:CP | Canonical Pathways: c2.cp.kegg_medicus, c2.cp.reactome, c2.cp.wikipathways, c2.cp.biocarta |
| C2:CGP | Chemical & genetic perturbations |
| C3 | Regulatory targets (TFT, miRNA) |
| C5 | Ontology: c5.go.bp, c5.go.mf, c5.go.cc, c5.hpo |
| C6 | Oncogenic signatures |
| C7 | ImmuneSigDB |
| C8 | Cell-type signatures |

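Fetch via gseapy (with import gseapy as gp): gp.Msigdb().get_gmt(category="h.all", dbver="2024.1.Hs"). For mouse symbols use a dbver="…Mm" release; the mouse collections carry their own names there (for example mh.all for Hallmark). See gseapy.md.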

g:Profiler

The official client (gprofiler-official) is the best path when you need a custom background, many organisms (~500), or g:Profiler's g:SCS multiple-testing correction. It performs ORA over GO, KEGG, Reactome, WikiPathways, miRTarBase, CORUM, HP, and more in one call.

```python
from gprofiler import GProfiler

gp = GProfiler(return_dataframe=True)
res = gp.profile(
    organism="hsapiens",                      # mmusculus, dmelanogaster, ...
    query=gene_list,                          # symbols, Ensembl, or Entrez IDs; namespace is auto-detected
    sources=["GO:BP", "KEGG", "REAC", "WP"],  # restrict sources
    user_threshold=0.05,
    significance_threshold_method="g_SCS",    # default; or "fdr" / "bonferroni"
    domain_scope="custom",                    # use a custom statistical background
    background=expressed_genes,               # the tested/expressed universe
    no_iea=False,                             # True = drop electronic GO annotations
)
# columns: source, native, name, p_value, term_size, query_size,
#          intersection_size, effective_domain_size
# pass no_evidences=False to also get the intersections (overlapping genes) column
```

gp.convert(organism="hsapiens", query=ids, target_namespace="ENTREZGENE") maps IDs; gp.orth(...) maps orthologs across organisms.
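To show why the custom background matters for ORA, the following is a textbook one-sided hypergeometric test written with the standard library only. It is an illustrative sketch, not g:Profiler's implementation (no g:SCS correction, no annotation filtering), and the gene names are placeholders.

```python
from math import comb

def ora_pvalue(query, term_genes, background):
    """P(overlap >= observed) under the hypergeometric null, within the background."""
    universe = set(background)
    hits = set(query) & universe       # query genes that could have been detected
    term = set(term_genes) & universe  # term members inside the universe
    N, K, n = len(universe), len(term), len(hits)
    k = len(hits & term)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(K, n) + 1))
    return tail / comb(N, n)

# Toy data with hypothetical gene names.
term = [f"g{i}" for i in range(5)]
query = ["g0", "g1", "g2", "g10"]
expressed = [f"g{i}" for i in range(20)]    # genes actually tested
genome = [f"g{i}" for i in range(2000)]     # a much larger, unfiltered universe

p_expressed = ora_pvalue(query, term, expressed)
p_genome = ora_pvalue(query, term, genome)
print(f"expressed background: p = {p_expressed:.4g}")
print(f"whole-genome background: p = {p_genome:.4g}")
```

The same overlap looks far more significant against the larger universe, which is why the background should be the set of genes that were actually tested or expressed.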

Gene-ID types and conversion

Enrichr and MSigDB libraries are keyed by gene symbols. Convert other ID types before ORA/GSEA, or matches silently drop.

| You have | Convert with |
| --- | --- |
| Ensembl gene IDs (ENSG…) | gseapy's Biomart (gp.Biomart with import gseapy as gp), g:Profiler g:Convert, or mygene |
| Entrez IDs | mygene, g:Profiler |
| Mouse symbols → human | g:Profiler g:Orth, mygene (then run human libraries) |

mygene example:

```python
import mygene

mg = mygene.MyGeneInfo()
hits = mg.querymany(ensembl_ids, scopes="ensembl.gene",
                    fields="symbol", species="human", as_dataframe=True)
symbols = hits["symbol"].dropna().tolist()
```

Strip Ensembl version suffixes first (ENSG00000141510.16 → ENSG00000141510). The gget skill (gget info) is another quick ID-mapping path.
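Version stripping can be done with a small regular expression before calling any mapper. This is a minimal sketch that only handles a trailing `.<digits>` suffix; the helper name is hypothetical, and other suffix styles (such as GENCODE `_PAR_Y` tags) are not covered.

```python
import re

_VERSION_SUFFIX = re.compile(r"\.\d+$")

def strip_ensembl_version(gene_id):
    """Drop a trailing '.<digits>' version from an Ensembl ID, if present."""
    return _VERSION_SUFFIX.sub("", gene_id.strip())

ids = ["ENSG00000141510.16", "ENSG00000012048", " ENSG00000139618.19 "]
clean = [strip_ensembl_version(i) for i in ids]
print(clean)  # ['ENSG00000141510', 'ENSG00000012048', 'ENSG00000139618']
```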

Organism handling

  • Human symbols are UPPERCASE (TP53); mouse symbols are Title-case (Trp53).
  • Set organism= for gseapy's gp.enrichr (Enrichr), and use the matching MSigDB dbver (…Hs vs …Mm) or g:Profiler organism= code.
  • Don't run human libraries on mouse symbols — convert or map orthologs first.
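Because unmatched symbols drop silently, a quick overlap check before enrichment can catch an organism or ID-type mismatch. The sketch below uses only the standard library; both helper names, the casing heuristic, and the toy gene set are illustrative assumptions rather than part of any library.

```python
def library_overlap(query, gene_sets):
    """Return (fraction of unique query genes present in the library, missing genes)."""
    library_genes = set().union(*gene_sets.values())
    unique = list(dict.fromkeys(query))   # de-duplicate, keep order
    missing = [g for g in unique if g not in library_genes]
    fraction = (len(unique) - len(missing)) / len(unique) if unique else 0.0
    return fraction, missing

def looks_like_mouse(symbols):
    """Heuristic: most symbols start uppercase and contain lowercase letters (Trp53)."""
    mixed = sum(1 for s in symbols
                if s[:1].isupper() and any(c.islower() for c in s[1:]))
    return bool(symbols) and mixed / len(symbols) > 0.5

toy_human_sets = {"TOY_P53_SET": ["TP53", "MDM2", "CDKN1A"]}  # made-up library
mouse_query = ["Trp53", "Mdm2", "Cdkn1a"]

fraction, missing = library_overlap(mouse_query, toy_human_sets)
print(looks_like_mouse(mouse_query), fraction, missing)

# Naive uppercasing is not ortholog mapping: Trp53 becomes TRP53, not TP53.
upper_fraction, upper_missing = library_overlap(
    [s.upper() for s in mouse_query], toy_human_sets
)
print(upper_fraction, upper_missing)
```

A low overlap fraction is a signal to convert IDs or map orthologs before interpreting any enrichment result.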

Pathway/interaction APIs

For raw pathway content or network context (not enrichment statistics), use the database-lookup skill, which wraps:

  • Reactome content + Analysis Service (submit a gene list, get pathway over-representation).
  • KEGG pathways/compounds.
  • STRING — protein–protein interactions plus its own functional-enrichment endpoint for a submitted gene set; pairs well with networkx for network views.
  • Gene Ontology / QuickGO term metadata.

Activity inference

When the goal is pathway or TF activity (a continuous score per sample/cell) rather than over-representation of a list, use decoupler. It runs multiple enrichment/activity methods (ORA, GSEA, univariate linear models, etc.) against curated priors:

  • PROGENy — 14 signaling pathway responsive signatures.
  • DoRothEA / CollecTRI — TF→target regulons for TF-activity inference.
  • MSigDB priors via its OmniPath integration.

decoupler integrates natively with AnnData/Scanpy (per-cell activities) and with per-sample pseudobulk matrices. APIs evolve between major versions — check the current decoupler docs (https://decoupler-py.readthedocs.io/) for exact function names before writing code.
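To illustrate the idea behind the univariate-linear-model style of activity scoring without depending on any particular decoupler version, here is a library-free sketch: gene-level statistics are regressed on a regulon's signed weights (zero for non-targets), and the slope's t-statistic serves as the activity score. It does not use decoupler's API; the function name, gene names, and numbers are hypothetical.

```python
from math import sqrt

def ulm_score(gene_stats, regulon):
    """t-statistic of the slope when regressing gene statistics on regulon weights."""
    genes = list(gene_stats)
    x = [regulon.get(g, 0.0) for g in genes]   # weight 0 for non-targets
    y = [gene_stats[g] for g in genes]
    n = len(genes)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    se = sqrt(rss / (n - 2) / sxx)             # standard error of the slope
    return slope / se

# Hypothetical per-gene statistics (e.g. log fold changes) for one sample or contrast.
gene_stats = {"A": 2.0, "B": 1.5, "C": -1.8, "D": 0.1, "E": -0.2, "F": 0.3}
tf1 = {"A": 1.0, "B": 1.0, "C": -1.0}    # activates A and B, represses C
tf2 = {"A": -1.0, "B": -1.0, "C": 1.0}   # the opposite regulon

score1 = ulm_score(gene_stats, tf1)
score2 = ulm_score(gene_stats, tf2)
print(f"TF1 activity: {score1:.2f}, TF2 activity: {score2:.2f}")
```

A positive score means the targets move in the direction the regulon predicts; running this per sample or per cell gives the continuous activity matrix described above.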