scientific-skills/pathml/references/image_loading.md
PathML loads whole-slide images (WSI) from more than 160 proprietary medical imaging formats. Unified slide classes and interfaces hide vendor-specific details, giving consistent access to image pyramids, metadata, and regions of interest across file formats.
PathML supports the following slide formats:
- .svs - Leica Biosystems
- .ndpi - Hamamatsu Photonics
- .scn - Leica Biosystems
- .zvi - Carl Zeiss
- .mrxs - 3DHISTECH Ltd.
- .bif - Roche Ventana
- .tif, .tiff
- .dcm - Digital Imaging and Communications in Medicine
- .ome.tif, .ome.tiff - Open Microscopy Environment
- .qptiff - Multiplex immunofluorescence

PathML leverages OpenSlide and other specialized libraries to handle format-specific nuances automatically.
SlideData is the fundamental class for representing whole-slide images in PathML.
Loading from file:
from pathml.core import SlideData
# Load a whole-slide image
wsi = SlideData.from_slide("path/to/slide.svs")
# Load with specific backend
wsi = SlideData.from_slide("path/to/slide.svs", backend="openslide")
# Load from OME-TIFF
wsi = SlideData.from_slide("path/to/slide.ome.tiff", backend="bioformats")
Key attributes:
- wsi.slide - Backend slide object (OpenSlide, BioFormats, etc.)
- wsi.tiles - Collection of image tiles
- wsi.metadata - Slide metadata dictionary
- wsi.level_dimensions - Image pyramid level dimensions
- wsi.level_downsamples - Downsample factors for each pyramid level

Methods:
- wsi.generate_tiles() - Generate tiles from the slide
- wsi.read_region() - Read a specific region at a given level
- wsi.get_thumbnail() - Get a thumbnail image

SlideType is an enumeration defining supported slide backends:
from pathml.core import SlideType
# Available backends
SlideType.OPENSLIDE # For most WSI formats (SVS, NDPI, etc.)
SlideType.BIOFORMATS # For OME-TIFF and other formats
SlideType.DICOM # For DICOM WSI
SlideType.VectraQPTIFF # For Vectra multiplex IF
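When the backend is not obvious, it can help to derive it from the file extension. The following is a minimal sketch, not part of PathML: `EXTENSION_TO_BACKEND` and `guess_backend` are hypothetical names, the table contains only the pairings stated in this document, and the `"dicom"` and `"vectra"` labels are placeholders rather than confirmed backend strings.

```python
from pathlib import Path

# Hypothetical lookup table built only from the pairings stated in this document.
# "dicom" and "vectra" are placeholder labels, not confirmed PathML backend strings.
EXTENSION_TO_BACKEND = {
    ".ome.tif": "bioformats",
    ".ome.tiff": "bioformats",
    ".svs": "openslide",
    ".ndpi": "openslide",
    ".dcm": "dicom",
    ".qptiff": "vectra",
}


def guess_backend(path):
    """Return a backend label for a slide path, or None if the extension is unknown."""
    name = Path(path).name.lower()
    # Check longer suffixes first so compound extensions such as ".ome.tiff"
    # win over any shorter entry that might be added later.
    for ext in sorted(EXTENSION_TO_BACKEND, key=len, reverse=True):
        if name.endswith(ext):
            return EXTENSION_TO_BACKEND[ext]
    return None


print(guess_backend("path/to/slide.svs"))
print(guess_backend("path/to/slide.ome.tiff"))
print(guess_backend("path/to/slide.xyz"))
```

Returning None for unknown extensions leaves the choice to the caller instead of guessing a backend the file may not support.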
PathML provides specialized slide classes for specific imaging modalities:
CODEXSlide:
from pathml.core import CODEXSlide
# Load CODEX spatial proteomics data
codex_slide = CODEXSlide(
    path="path/to/codex_dir",
    stain="IF",  # Immunofluorescence
    backend="bioformats"
)
VectraSlide:
from pathml.core import SlideData, SlideType
# Load Vectra multiplex IF data
vectra_slide = SlideData.from_slide(
    "path/to/vectra.qptiff",
    backend=SlideType.VectraQPTIFF
)
MultiparametricSlide:
from pathml.core import MultiparametricSlide
# Generic multiparametric imaging
mp_slide = MultiparametricSlide(path="path/to/multiparametric_data")
For large WSI files, tile-based loading enables memory-efficient processing:
from pathml.core import SlideData
# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")
# Generate tiles at specific magnification level
wsi.generate_tiles(
    level=0,        # Pyramid level (0 = highest resolution)
    tile_size=256,  # Tile dimensions in pixels
    stride=256,     # Spacing between tiles (256 = no overlap)
    pad=False       # Whether to pad edge tiles
)
# Iterate over tiles
for tile in wsi.tiles:
    image = tile.image    # numpy array
    coords = tile.coords  # (x, y) coordinates
    # Process tile...
Overlapping tiles:
# Generate tiles with 50% overlap
wsi.generate_tiles(
    level=0,
    tile_size=256,
    stride=128  # 50% overlap
)
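To see how stride affects the tile grid, the sketch below enumerates tile origins in plain Python. It is an illustration of the arithmetic, not PathML's implementation: `tile_origins` and `overlap_fraction` are hypothetical helpers, and the grid assumes edge tiles that do not fit are dropped (the `pad=False` case).

```python
def tile_origins(width, height, tile_size, stride):
    """List top-left (x, y) corners of all tiles that fit fully inside the image."""
    xs = range(0, width - tile_size + 1, stride)
    ys = range(0, height - tile_size + 1, stride)
    return [(x, y) for y in ys for x in xs]


def overlap_fraction(tile_size, stride):
    """Fraction of each tile shared with its neighbour along one axis."""
    return max(0.0, 1.0 - stride / tile_size)


# Illustrative image size, not taken from a real slide.
no_overlap = tile_origins(1024, 512, tile_size=256, stride=256)
half_overlap = tile_origins(1024, 512, tile_size=256, stride=128)

print(len(no_overlap), len(half_overlap))
print(overlap_fraction(256, 128))
```

Halving the stride roughly doubles the tile count along each axis, so 50% overlap costs close to three to four times the processing of a non-overlapping grid.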
Extract specific regions of interest directly:
# Read region at specific location and level
region = wsi.read_region(
    location=(10000, 15000),  # (x, y) in level 0 coordinates
    level=1,                  # Pyramid level
    size=(512, 512)           # Width, height in pixels
)
# Returns numpy array
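Because the location is given in level 0 coordinates while the size is given in pixels at the requested level, the area actually covered on the slide grows with the downsample factor. The sketch below works through that arithmetic. It assumes the OpenSlide-style convention described above; `level0_footprint` is a hypothetical helper, and the downsample of 4.0 is an example value.

```python
def level0_footprint(location, size, downsample):
    """Return (x0, y0, x1, y1): the level-0 bounding box covered by a region.

    location   -- (x, y) top-left corner in level-0 pixels
    size       -- (width, height) of the returned region, in pixels at the chosen level
    downsample -- downsample factor of the chosen level (for example 4.0)
    """
    x, y = location
    w, h = size
    return (x, y, x + round(w * downsample), y + round(h * downsample))


# A 512 x 512 region at a level with downsample 4.0 (example value).
box = level0_footprint((10000, 15000), (512, 512), downsample=4.0)
print(box)
```

Here the 512-pixel region spans 2048 level-0 pixels in each direction, which matters when checking that a region stays inside the slide bounds.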
Whole-slide images are stored in multi-resolution pyramids. Select the appropriate level based on desired magnification:
# Inspect available levels
print(wsi.level_dimensions) # [(width0, height0), (width1, height1), ...]
print(wsi.level_downsamples) # [1.0, 4.0, 16.0, ...]
# Load at lower resolution for faster processing
wsi.generate_tiles(level=2, tile_size=256) # Use level 2 (16x downsampled)
Common pyramid levels:
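The levels actually present depend on the scanner and the file. As a rough sketch, the effective magnification of each level can be derived by dividing the objective power by the level's downsample factor. `level_magnifications` and `best_level_for_downsample` are hypothetical helpers, the input values are examples, and the downsample list is assumed to be in ascending order.

```python
def level_magnifications(objective_power, level_downsamples):
    """Effective magnification of each pyramid level."""
    return [objective_power / d for d in level_downsamples]


def best_level_for_downsample(level_downsamples, target):
    """Index of the most downsampled level whose factor does not exceed target.

    Falls back to level 0 when every level exceeds the target.
    """
    best = 0
    for index, downsample in enumerate(level_downsamples):
        if downsample <= target:
            best = index
    return best


downsamples = [1.0, 4.0, 16.0]  # example values
print(level_magnifications(40, downsamples))
print(best_level_for_downsample(downsamples, target=8))
```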
Generate low-resolution thumbnails for visualization and quality control:
# Get thumbnail
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
# Display with matplotlib
import matplotlib.pyplot as plt
plt.imshow(thumbnail)
plt.axis('off')
plt.show()
Process multiple slides efficiently using SlideDataset:
from pathml.core import SlideDataset
import glob
# Create dataset from multiple slides
slide_paths = glob.glob("data/*.svs")
dataset = SlideDataset(
    slide_paths,
    tile_size=256,
    stride=256,
    level=0
)
# Iterate over all tiles from all slides
for tile in dataset:
    image = tile.image
    slide_id = tile.slide_id
    # Process tile...
With preprocessing pipeline:
from pathml.preprocessing import Pipeline, StainNormalizationHE
# Create pipeline
pipeline = Pipeline([
    StainNormalizationHE(target='normalize')
])
# Apply to entire dataset
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, n_workers=8)
Extract slide metadata including acquisition parameters, magnification, and vendor-specific information:
# Access metadata
metadata = wsi.metadata
# Common metadata fields
print(metadata.get('openslide.objective-power')) # Magnification
print(metadata.get('openslide.mpp-x')) # Microns per pixel X
print(metadata.get('openslide.mpp-y')) # Microns per pixel Y
print(metadata.get('openslide.vendor')) # Scanner vendor
# Slide dimensions
print(wsi.level_dimensions[0]) # (width, height) at level 0
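OpenSlide exposes property values as strings, and fields such as microns per pixel are not present in every file, so it is safer to convert them defensively before doing arithmetic. The sketch below is one possible approach: `physical_size_mm` is a hypothetical helper, and `example_metadata` is made-up data standing in for `wsi.metadata`.

```python
def physical_size_mm(metadata, level0_dimensions):
    """Return (width_mm, height_mm), or None when microns-per-pixel is unavailable."""
    try:
        mpp_x = float(metadata["openslide.mpp-x"])
        mpp_y = float(metadata["openslide.mpp-y"])
    except (KeyError, TypeError, ValueError):
        return None
    width, height = level0_dimensions
    # Microns to millimetres: divide by 1000.
    return (width * mpp_x / 1000.0, height * mpp_y / 1000.0)


# Made-up metadata for illustration; real values come from the slide file.
example_metadata = {
    "openslide.mpp-x": "0.25",
    "openslide.mpp-y": "0.25",
    "openslide.objective-power": "40",
}

print(physical_size_mm(example_metadata, (100000, 80000)))
print(physical_size_mm({}, (100000, 80000)))
```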
PathML supports DICOM WSI through specialized handling:
from pathml.core import SlideData, SlideType
# Load DICOM WSI
dicom_slide = SlideData.from_slide(
    "path/to/slide.dcm",
    backend=SlideType.DICOM
)
# DICOM-specific metadata
print(dicom_slide.metadata.get('PatientID'))
print(dicom_slide.metadata.get('StudyDate'))
OME-TIFF provides an open standard for multi-dimensional imaging:
from pathml.core import SlideData
# Load OME-TIFF
ome_slide = SlideData.from_slide(
    "path/to/slide.ome.tiff",
    backend="bioformats"
)
# Access channel information for multi-channel images
n_channels = ome_slide.shape[2] # Number of channels
For large WSI files (often >1GB), use tile-based loading to avoid memory exhaustion:
# Efficient: Tile-based processing
wsi.generate_tiles(level=1, tile_size=256)
for tile in wsi.tiles:
    process_tile(tile)  # Process one tile at a time (process_tile is user-defined)
# Inefficient: Loading entire slide into memory
full_image = wsi.read_region(location=(0, 0), level=0, size=wsi.level_dimensions[0])  # May crash
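A quick back-of-the-envelope estimate shows why the full-slide read is risky. The sketch below assumes uncompressed 8-bit RGB pixels. `uncompressed_bytes` is a hypothetical helper, and the slide dimensions are example values, not measurements from a real file.

```python
def uncompressed_bytes(width, height, channels=3, bytes_per_sample=1):
    """Memory needed to hold an image as a dense array, in bytes."""
    return width * height * channels * bytes_per_sample


# Example level-0 dimensions; real slides vary widely.
full = uncompressed_bytes(100_000, 80_000)
tile = uncompressed_bytes(256, 256)

print(f"Full slide: {full / 2**30:.1f} GiB")
print(f"One tile:   {tile / 2**10:.0f} KiB")
```

The file on disk is much smaller than this because slide formats store compressed tiles, which is why file size alone understates the memory a full read requires.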
Use Dask for parallel processing across multiple workers:
from pathml.core import SlideDataset
from dask.distributed import Client
# Start Dask client
client = Client(n_workers=8, threads_per_worker=2)
# Process dataset in parallel
dataset = SlideDataset(slide_paths)
dataset.run(pipeline, distributed=True, client=client)
Balance resolution and performance by selecting appropriate pyramid levels:
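One way to make this trade-off concrete is to count how many tiles each level produces. The sketch below does so for example dimensions. `tiles_per_level` is a hypothetical helper, the dimensions are made up, and the count assumes non-overlapping tiles with partial edge tiles included.

```python
import math


def tiles_per_level(level_dimensions, tile_size):
    """Number of non-overlapping tiles per level, counting partial edge tiles."""
    counts = []
    for width, height in level_dimensions:
        columns = math.ceil(width / tile_size)
        rows = math.ceil(height / tile_size)
        counts.append(columns * rows)
    return counts


# Example pyramid with 4x downsampling between levels (made-up dimensions).
dims = [(100_000, 80_000), (25_000, 20_000), (6_250, 5_000)]
print(tiles_per_level(dims, tile_size=256))
```

With 4x downsampling between levels, each step down the pyramid cuts the tile count by roughly a factor of 16.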
Issue: Slide fails to load
- Specify the backend explicitly: backend="bioformats" or backend="openslide"

Issue: Out of memory errors
- Use tile-based loading instead of reading the whole slide, and process at a lower-resolution pyramid level

Issue: Color inconsistencies across slides
- Apply stain normalization (see preprocessing.md)
- Use the StainNormalizationHE transform in the preprocessing pipeline

Issue: Metadata missing or incorrect
- Print wsi.metadata to inspect available fields

Always inspect pyramid structure before processing: Check level_dimensions and level_downsamples to understand available resolutions
Use appropriate pyramid levels: Process at level 1-2 for most tasks; reserve level 0 for final high-resolution analysis
Tile with overlap for segmentation tasks: Use stride < tile_size to avoid edge artifacts
Verify magnification consistency: Check openslide.objective-power metadata when combining slides from different sources
Handle vendor-specific formats: Use specialized slide classes (CODEXSlide, VectraSlide) for multiparametric data
Implement quality control: Generate thumbnails and inspect for artifacts before processing
Use distributed processing for large datasets: Leverage Dask for parallel processing across multiple workers
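The magnification check above can be automated before slides are combined. The sketch below groups slides by their reported objective power. `group_by_objective_power` is a hypothetical helper, and the `example` dictionary is made-up data standing in for the `metadata` of each loaded slide.

```python
from collections import defaultdict


def group_by_objective_power(slide_metadata):
    """Group slide names by objective power; slides without a usable value go under None."""
    groups = defaultdict(list)
    for name, metadata in slide_metadata.items():
        raw = metadata.get("openslide.objective-power")
        try:
            power = float(raw)
        except (TypeError, ValueError):
            power = None
        groups[power].append(name)
    return dict(groups)


# Made-up metadata for three slides.
example = {
    "slide_a": {"openslide.objective-power": "40"},
    "slide_b": {"openslide.objective-power": "20"},
    "slide_c": {},
}

groups = group_by_objective_power(example)
print(groups)
```

More than one key in the result signals that the slides were scanned at different magnifications, or that some files lack the field, and that levels may need to be chosen per slide.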
from pathml.core import SlideData
import matplotlib.pyplot as plt
# Load slide
wsi = SlideData.from_slide("path/to/slide.svs")
# Inspect properties
print(f"Dimensions: {wsi.level_dimensions}")
print(f"Downsamples: {wsi.level_downsamples}")
print(f"Magnification: {wsi.metadata.get('openslide.objective-power')}")
# Generate thumbnail for QC
thumbnail = wsi.get_thumbnail(size=(1024, 1024))
plt.imshow(thumbnail)
plt.title(f"Slide: {wsi.name}")
plt.axis('off')
plt.show()
from pathml.core import SlideDataset
from pathml.preprocessing import Pipeline, TissueDetectionHE
import glob
# Find all slides
slide_paths = glob.glob("data/slides/*.svs")
# Create pipeline
pipeline = Pipeline([TissueDetectionHE()])
# Process all slides
dataset = SlideDataset(
    slide_paths,
    tile_size=512,
    stride=512,
    level=1
)
# Run pipeline with distributed processing
dataset.run(pipeline, distributed=True, n_workers=8)
# Save processed data
dataset.to_hdf5("processed_dataset.h5")
from pathml.core import CODEXSlide
from pathml.preprocessing import Pipeline, CollapseRunsCODEX, SegmentMIF
# Load CODEX slide
codex = CODEXSlide("path/to/codex_dir", stain="IF")
# Create CODEX-specific pipeline
pipeline = Pipeline([
    CollapseRunsCODEX(z_slice=2),  # Select z-slice
    SegmentMIF(
        nuclear_channel='DAPI',
        cytoplasm_channel='CD45',
        model='mesmer'
    )
])
# Process
pipeline.run(codex)