docs/skills.md
llms.txt, llms-full.txt, and per-topic markdown files designed for agent consumption); usage goes through standard Hugging Face APIs. Includes a bundled fetch_catalog.py script for filtered access by topic, type (datasets/models/blogs/spaces), or free-text search, plus reference guides for loading datasets via the datasets library (with streaming for billion-token corpora), running models via transformers or the Inference API/Providers (handling trust_remote_code requirements for custom architectures like Evo-2), and calling Spaces via gradio_client (with a worked BoltzGen example for protein binder design). Authenticates gated resources via HF_TOKEN from .env. Use cases: discovering the right dataset/model for a scientific ML task without trawling the broader Hub, fine-tuning on curated scientific data, citing methodology blogs from dataset/model authors, running interactive scientific demos (binder design, theorem proving, weather modeling) without local GPU setup, and bridging from "I need a model for protein/genome/molecule/climate/materials/astronomy" to working code.

limit for top-N links, and diy() for custom regressors. Core engine for pySCENIC (cisTarget pruning and regulon activity are downstream). Use for bulk or single-cell expression matrices when inferring TF–target regulatory links from co-expression patterns.

get.aggregate(), trajectory inference (PAGA, diffusion maps), and visualization. Key features include: efficient handling of large datasets using sparse matrices and experimental Dask out-of-core support, integration with scvi-tools for advanced analysis, batch correction methods (ComBat), and publication-quality plotting. Optional GPU acceleration via rapids-singlecell.
Use cases: single-cell RNA-seq analysis, cell-type identification, exploratory cluster markers, pseudobulk DE workflows (with pydeseq2), trajectory analysis, and comprehensive single-cell genomics workflows.

zarr.codecs compression (Blosc, gzip, zstd), partial chunk reads, consolidated metadata, sharding, and integration with NumPy, Dask, and Xarray. Use for out-of-core arrays, cloud-native pipelines, and large scientific datasets (genomics, imaging, climate). Skill: zarr-python.

.map() for batch processing, input concurrency and dynamic batching for I/O-bound workloads, and resource configuration (CPU cores, memory, ephemeral disk up to 3 TiB). Supports custom Docker images, Micromamba/Conda environments, integration with Hugging Face/Weights & Biases, and distributed multi-GPU training. Free tier includes $30/month credits. Use cases: ML model deployment and inference (LLMs, image generation, speech, embeddings), GPU-accelerated training and fine-tuning, batch processing large datasets in parallel, scheduled compute-intensive jobs, serverless API deployment with autoscaling, protein folding and computational biology, scientific computing requiring distributed compute or specialized hardware, and data pipeline automation.

deepchem[torch], [tensorflow], [jax]). Use for ADMET/toxicity prediction, materials properties, and transfer learning on small datasets.

ESM_API_KEY) for scalable ESM3/ESM C inference, and Biohub-hosted ESMFold2 for all-atom structure prediction.
Use cases: novel protein design, structure prediction from sequence, sequence design from structure, protein embeddings, function annotation, variant generation, and directed evolution workflows.

Data/HeteroData, 60+ conv layers (GCN, GAT, GraphSAGE, GIN), node/link/graph classification, heterogeneous graphs, neighbor sampling (NeighborLoader, LinkNeighborLoader), OGB and built-in datasets, custom dataset loading, GNN explainability, and scaling via DDP, Lightning, and torch.compile.

matlab -nodisplay -r "run('script.m'); exit;" or octave script.m. Use cases: numerical simulations, signal processing, image processing, control systems, statistical analysis, algorithm prototyping, data visualization, and any scientific computing task requiring matrix operations or numerical methods.

everything(), full-text content indexing, saved search management, and file upload/download. Optional CLI and built-in MCP server (pyzotero 1.12+) for searching local Zotero 7 libraries including full-text PDF search and Semantic Scholar integration. Use cases: building research automation pipelines that integrate with Zotero, bulk importing references, exporting bibliographies programmatically, managing large reference collections, syncing library metadata, enriching bibliographic data, and connecting LLM agents to a local Zotero library.

text_items (coordinates, font metadata, OCR confidence). Built-in Tesseract OCR with optional HTTP OCR servers (EasyOCR/PaddleOCR-compatible API), page subsets, encrypted PDFs, stdin/bytes parsing, and PNG page screenshots for multimodal agents. CLI (lit parse, lit batch-parse, lit screenshot) and Python API (LiteParse, search_items). Targets liteparse 2.0.0, Python 3.10+. Use when you need spatial grounding for RAG, batch literature ingestion, or agent vision, not for Markdown (MarkItDown) or PDF merge/split/forms (pdf skill).

category="research paper" plus academic domain allowlists, and batch URL extraction