scientific-skills/pyhealth/references/installation.md
PyHealth 2.0 requires Python 3.12 or 3.13 (>=3.12,<3.14). The 1.x line supports Python 3.9+ if a downgrade is unavoidable.
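A quick pre-flight check can catch an unsupported interpreter before any install is attempted. The sketch below only encodes the range stated above (>=3.12,<3.14); the function name `supports_pyhealth2` is hypothetical and is not part of PyHealth or uv.

```python
import sys

# Range stated above for PyHealth 2.0: >=3.12,<3.14
MIN_VERSION = (3, 12)
MAX_VERSION_EXCLUSIVE = (3, 14)


def supports_pyhealth2(version=None):
    """Return True if the (major, minor) version falls inside the supported range.

    `version` is any tuple-like such as (3, 12, 0); it defaults to the
    running interpreter.
    """
    major_minor = tuple((version or sys.version_info)[:2])
    return MIN_VERSION <= major_minor < MAX_VERSION_EXCLUSIVE


if __name__ == "__main__":
    if not supports_pyhealth2():
        found = ".".join(map(str, sys.version_info[:3]))
        print(f"Python {found} is outside >=3.12,<3.14; PyHealth 2.0 will not install.")
```

If the check fails, pinning a supported interpreter with uv is usually simpler than downgrading to the 1.x line.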
uv is the recommended tool here: it resolves and installs dependencies far faster than pip, and its lockfile (uv.lock) makes environments reproducible across machines.
uv init my-pyhealth-project
cd my-pyhealth-project
uv python pin 3.12 # writes .python-version
uv add pyhealth # resolves PyTorch + transitive deps, writes uv.lock
uv run python train.py # runs inside the project venv
If a pyproject.toml already exists:
uv add pyhealth
If only requirements.txt exists, either migrate to pyproject.toml (preferred) or:
uv pip install pyhealth
uv run --with pyhealth python script.py
This creates an ephemeral environment, runs the script, and then discards the environment. It is a good fit for quick experiments.
uv add 'pyhealth==1.16' # last 1.x release, Python 3.9+
The 1.x and 2.x APIs differ — examples in this skill target 2.x. If a user is on 1.x, mention the version mismatch before debugging.
PyHealth uses PyTorch under the hood. uv add pyhealth pulls the default PyTorch wheel from PyPI, which is CPU-only on macOS and CUDA-enabled on Linux x86_64. The wheel is selected by platform; uv add does not probe the machine for a GPU or a CUDA installation.
For explicit CUDA control on Linux:
# Replace cu121 with the user's CUDA version
uv add 'torch>=2.1' --index https://download.pytorch.org/whl/cu121
uv add pyhealth
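The index URL in the command above follows the pattern `cu` plus the CUDA major and minor version with the dot removed. A minimal sketch of that mapping is shown below; `cuda_index_url` is a hypothetical helper, and it does not check whether PyTorch actually publishes wheels for the given CUDA version.

```python
def cuda_index_url(cuda_version: str) -> str:
    """Map a CUDA version such as '12.1' to a PyTorch wheel index URL.

    Only builds the string; it does not verify that the index exists.
    Raises ValueError if the version is not in 'major.minor' form.
    """
    parts = cuda_version.strip().split(".")
    if len(parts) < 2:
        raise ValueError(f"expected 'major.minor', got {cuda_version!r}")
    major, minor = int(parts[0]), int(parts[1])
    return f"https://download.pytorch.org/whl/cu{major}{minor}"


if __name__ == "__main__":
    print(cuda_index_url("12.1"))
```

The resulting URL is what would be passed to `--index` in place of the cu121 example.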
On Apple Silicon, the default wheel includes MPS support, but MPS is only used when it is requested explicitly with Trainer(device="mps"). CPU is the safe default if device behavior is unclear.
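When the target machine is not known in advance, the device string can be chosen at runtime. The following is a minimal sketch using PyTorch's availability checks; `pick_device` is a hypothetical helper name, and falling back to CPU when PyTorch is not importable is a choice made for this sketch.

```python
def pick_device() -> str:
    """Return 'cuda', 'mps', or 'cpu' depending on what PyTorch reports."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed; CPU is the only sensible answer.
        return "cpu"

    if torch.cuda.is_available():
        return "cuda"

    # torch.backends.mps is absent on very old PyTorch builds, so look it up defensively.
    mps = getattr(torch.backends, "mps", None)
    if mps is not None and mps.is_available():
        return "mps"

    return "cpu"


if __name__ == "__main__":
    print(pick_device())
```

The returned string can then be passed as the `device` argument of the Trainer.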
PyHealth hosts a synthetic copy of MIMIC-III on Google Cloud Storage that any pipeline can read directly:
root="https://storage.googleapis.com/pyhealth/Synthetic_MIMIC-III/"
Use this for demos, tutorials, and any code that needs to run without PhysioNet credentials.
The real, non-synthetic datasets require completed CITI training and a credentialed PhysioNet account. Once downloaded, point root= (or ehr_root= for MIMIC-IV) at the local directory containing the CSV/CSV.gz files:
from pyhealth.datasets import MIMIC4Dataset

dataset = MIMIC4Dataset(
    ehr_root="/path/to/mimic-iv/2.2/hosp",  # not `root`
    tables=["diagnoses_icd", "procedures_icd", "prescriptions"],
    cache_dir="/path/to/cache",  # cache parsed output
)
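Before constructing the dataset, it can help to confirm that the directory really contains the requested tables. The sketch below assumes each table is stored as `<table>.csv` or `<table>.csv.gz` directly inside the root directory; `missing_tables` is a hypothetical helper, not a PyHealth function.

```python
from pathlib import Path


def missing_tables(root, tables):
    """Return the table names that have no .csv or .csv.gz file under `root`.

    Assumes one file per table, named after the table, directly in `root`.
    """
    root = Path(root)
    return [
        table
        for table in tables
        if not any((root / f"{table}{ext}").is_file() for ext in (".csv", ".csv.gz"))
    ]


if __name__ == "__main__":
    wanted = ["diagnoses_icd", "procedures_icd", "prescriptions"]
    absent = missing_tables("/path/to/mimic-iv/2.2/hosp", wanted)
    if absent:
        print("Missing table files:", ", ".join(absent))
```

An empty list means every requested table has a matching file; anything else points to a wrong path or an incomplete download.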
OMOP CDM datasets follow a standardized schema; point root= at the directory containing the CDM tables (person.csv, condition_occurrence.csv, etc.).
The first call to set_task() is expensive (parses every CSV, applies the task to every patient). Set cache_dir= on the dataset constructor to persist the parsed result:
MIMIC3Dataset(root=..., tables=..., cache_dir="./cache/mimic3")
Subsequent runs reload from disk in seconds. Without cache_dir, every run re-parses from scratch — fine for a one-off script, painful for iteration.
All dataset constructors accept dev=True, which loads only a small subset of patients. Use this while iterating on pipeline shape; switch to dev=False (the default) once the pipeline runs end-to-end.
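One way to switch between dev and full runs without editing code is to build the constructor arguments from an environment variable. This is a sketch under stated assumptions: `PYHEALTH_DEV` is a hypothetical variable name, `dataset_kwargs` is a hypothetical helper, and the separate `-dev` cache directory is a precaution of this sketch rather than documented PyHealth behavior. The keyword names `root`, `tables`, `cache_dir`, and `dev` are the ones used in this document.

```python
import os


def dataset_kwargs(root, tables, cache_dir="./cache", env=os.environ):
    """Build dataset constructor keyword arguments, with dev mode set via env var."""
    # PYHEALTH_DEV is a made-up name for this sketch; "1" turns dev mode on.
    dev = env.get("PYHEALTH_DEV", "0") == "1"
    # Keep dev and full caches apart so a subset cache is never reused for a full run.
    effective_cache = f"{cache_dir}-dev" if dev else cache_dir
    return {
        "root": root,
        "tables": list(tables),
        "cache_dir": effective_cache,
        "dev": dev,
    }


# Usage (requires pyhealth to be installed):
# from pyhealth.datasets import MIMIC3Dataset
# dataset = MIMIC3Dataset(**dataset_kwargs(root, tables))
```

Running with `PYHEALTH_DEV=1` then exercises the pipeline on the small subset, and unsetting it returns to the full dataset.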
[…] uv python pin 3.12 and reinstall.
[…] set_task: set_task is CPU-only; this is almost always a Trainer issue. Reduce batch_size or move to CPU temporarily to localize the problem.
[…] cache_dir= and re-run.
KeyError on table name: table names are case-sensitive and dataset-specific. MIMIC-III uses uppercase (DIAGNOSES_ICD), MIMIC-IV uses lowercase (diagnoses_icd). Check the user's dataset version.
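The table-name casing rule can be applied mechanically before the names reach the dataset constructor. The sketch below encodes only the two conventions stated above (uppercase for MIMIC-III, lowercase for MIMIC-IV); `normalize_table_names` and the `"mimic3"` / `"mimic4"` keys are hypothetical names chosen for illustration.

```python
# Case convention per dataset, as described in the troubleshooting note above.
_TABLE_CASE = {
    "mimic3": str.upper,  # e.g. DIAGNOSES_ICD
    "mimic4": str.lower,  # e.g. diagnoses_icd
}


def normalize_table_names(dataset, tables):
    """Return `tables` with the casing expected for the given dataset key."""
    try:
        fix_case = _TABLE_CASE[dataset]
    except KeyError:
        known = ", ".join(sorted(_TABLE_CASE))
        raise ValueError(f"unknown dataset {dataset!r}; expected one of: {known}") from None
    return [fix_case(name) for name in tables]


if __name__ == "__main__":
    print(normalize_table_names("mimic3", ["diagnoses_icd", "prescriptions"]))
```

Normalizing up front turns a late KeyError into an early, explicit choice of dataset convention.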