scientific-skills/hugging-science/references/flagship-resources.md
Sensible "if in doubt, start here" picks per scientific domain. These are the resources users most often want when they describe a task in plain language. Always confirm with the live catalog (fetch_catalog.py topic <slug>) before final recommendation — the catalog evolves and there may be something newer.
The catalog is the source of truth. Treat this file as a fast cheatsheet, not a directory.
DNA foundation models
arcinstitute/evo2_40b — 40B-param DNA LM, 9.3T nucleotide pretraining; zero-shot variant effect prediction, sequence generation. Huge — use Inference Providers or a Space.arcinstitute/evo2_7b — 7B instruction-tuned variant; runs on a single 24 GB GPU.Protein language models
facebook/esm2_t33_650M_UR50D — 650M ESM2; standard for embeddings, structure prediction, mutation scoring. Strong default.*_t30_150M_*, *_t12_35M_*) for laptop/CPU use.Single-cell / transcriptomics
arcinstitute/Stack-Large (STACK) — single-cell foundation model with in-context learning across cell types.Merck/TEDDY — 116M single-cell foundation models for genomics + drug discovery.Antibodies
opig/OAS — Observed Antibody Space, ~1B antibody sequences. The standard antibody ML dataset.Bioacoustics / ecology
EarthSpeciesProject/NatureLM-audio — first audio-LM for animal vocalizations.EarthSpeciesProject/esp-aves2-sl-beats-all — self-supervised bioacoustic encoder.Genomic corpora
arcinstitute/opengenome2 — curated prokaryotic + eukaryotic sequences for foundation-model pretraining.SandboxAQ/AQAffinity — drug-target affinity prediction.chemistry topic for current best — molecular foundation models change frequently.topic materials-science for the live list. Many ship pymatgen.Structure objects rather than tensors.topic physics --filter datasets) — magnetohydrodynamics, fluid dynamics, plasma.topic climate — these are often huge multi-modal models with structured atmospheric inputs (geopotential, temperature on multiple pressure levels).topic earth-science.hugging-science/breast-cancer-detector-2 — image classification baseline from the org itself.topic medicine. Many are gated; check the model card for access requirements.topic scientific-reasoning for LLM-as-scientific-assistant work, paper QA, multi-step reasoning evals.hugging-science/mmu_legacysurvey_dr10_south_21) — image-heavy, FITS format, may need astropy.topic astronomy.hugging-science/boltzgen-demo — protein/peptide/nanobody binder design (the marquee demo).hugging-science/dataset-quest — discover and submit scientific datasets.hugging-science/science-release-heatmap — visualize who's publishing AI4Science resources.fetch_catalog.py topic <slug> and confirm the recommended resource still exists / is current.using-datasets.md / using-models.md / using-spaces.md for the actual code.If nothing in this cheatsheet fits, run fetch_catalog.py search "<keyword>" against the full index. The catalog has hundreds of entries this file doesn't enumerate.