scientific-skills/pathml/references/preprocessing.md
PathML provides a modular preprocessing architecture based on composable transforms organized into pipelines. Transforms are individual operations that modify images, create masks, or extract features. Pipelines chain transforms together to create reproducible, scalable preprocessing workflows for computational pathology.
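The transform/pipeline pattern boils down to applying an ordered list of objects exposing `apply()` to the same data, feeding each output forward. A minimal framework-free sketch (toy classes for illustration, not PathML's actual implementation):

```python
# Toy transform/pipeline sketch (illustrative classes, not PathML's API)
class AddOffset:
    """Transform: add a constant offset."""
    def __init__(self, offset):
        self.offset = offset

    def apply(self, x):
        return x + self.offset


class Scale:
    """Transform: multiply by a constant factor."""
    def __init__(self, factor):
        self.factor = factor

    def apply(self, x):
        return x * self.factor


class MiniPipeline:
    """Apply a sequence of transforms consecutively, feeding outputs forward."""
    def __init__(self, transforms):
        self.transforms = transforms

    def run(self, x):
        for t in self.transforms:
            x = t.apply(x)
        return x


result = MiniPipeline([AddOffset(1), Scale(10)]).run(4)  # (4 + 1) * 10 = 50
```

Because each transform only sees the output of the previous one, order matters: swapping the two transforms above would yield a different result.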
The Pipeline class composes a sequence of transforms applied consecutively:
```python
from pathml.preprocessing import Pipeline, Transform1, Transform2

# Create pipeline
pipeline = Pipeline([
    Transform1(param1=value1),
    Transform2(param2=value2),
    # ... more transforms
])

# Run on a single slide
pipeline.run(slide_data)

# Run on a dataset
pipeline.run(dataset, distributed=True, n_workers=8)
```
All transforms inherit from the `Transform` base class and implement:

- `apply()` - Core transformation logic
- `input_type` - Expected input (tile, mask, etc.)
- `output_type` - Produced output

PathML provides transforms in six major categories:
Apply various blurring kernels for noise reduction:
MedianBlur:
```python
from pathml.preprocessing import MedianBlur

# Apply median filter
transform = MedianBlur(kernel_size=5)
```
GaussianBlur:
```python
from pathml.preprocessing import GaussianBlur

# Apply Gaussian blur
transform = GaussianBlur(kernel_size=5, sigma=1.0)
```
BoxBlur:
```python
from pathml.preprocessing import BoxBlur

# Apply box filter
transform = BoxBlur(kernel_size=5)
```
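To make the blur family concrete, here is what a median filter does, as a naive pure-NumPy sketch (illustrative only; PathML's `MedianBlur` wraps an optimized implementation):

```python
import numpy as np

def median_blur(img, k=3):
    """Naive median filter: replace each interior pixel with the median of
    its k x k neighborhood (border pixels left unchanged for simplicity)."""
    out = img.copy()
    r = k // 2
    for i in range(r, img.shape[0] - r):
        for j in range(r, img.shape[1] - r):
            out[i, j] = np.median(img[i - r:i + r + 1, j - r:j + r + 1])
    return out

noisy = np.zeros((5, 5), dtype=np.uint8)
noisy[2, 2] = 255           # single salt-noise pixel
clean = median_blur(noisy)  # the isolated spike is removed
```

Median filtering is well suited to salt-and-pepper noise because a single outlier pixel cannot shift the median of its neighborhood.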
RescaleIntensity:
```python
from pathml.preprocessing import RescaleIntensity

# Rescale intensity to [0, 255]
transform = RescaleIntensity(
    in_range=(0, 1.0),
    out_range=(0, 255)
)
```
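Intensity rescaling is a plain linear mapping from `in_range` to `out_range`; the underlying arithmetic, as a hypothetical NumPy helper (not PathML API):

```python
import numpy as np

def rescale_intensity(img, in_range=(0.0, 1.0), out_range=(0, 255)):
    """Linearly map values in in_range to out_range, clipping out-of-range input."""
    in_min, in_max = in_range
    out_min, out_max = out_range
    scaled = (img - in_min) / (in_max - in_min)         # normalize to [0, 1]
    return np.clip(scaled, 0, 1) * (out_max - out_min) + out_min

vals = rescale_intensity(np.array([0.0, 0.5, 1.0]))
```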
HistogramEqualization:
```python
from pathml.preprocessing import HistogramEqualization

# Global histogram equalization
transform = HistogramEqualization()
```
AdaptiveHistogramEqualization (CLAHE):
```python
from pathml.preprocessing import AdaptiveHistogramEqualization

# Contrast Limited Adaptive Histogram Equalization
transform = AdaptiveHistogramEqualization(
    clip_limit=0.03,
    tile_grid_size=(8, 8)
)
```
SuperpixelInterpolation:
```python
from pathml.preprocessing import SuperpixelInterpolation

# Divide into superpixels using SLIC
transform = SuperpixelInterpolation(
    n_segments=100,
    compactness=10.0
)
```
TissueDetectionHE:
```python
from pathml.preprocessing import TissueDetectionHE

# Detect tissue regions in H&E slides
transform = TissueDetectionHE(
    use_saturation=True,  # Use HSV saturation channel
    threshold=10,         # Intensity threshold
    min_region_size=500   # Minimum tissue region size in pixels
)
```
The resulting mask is stored as `tile.masks['tissue']`.

NucleusDetectionHE:
```python
from pathml.preprocessing import NucleusDetectionHE

# Detect nuclei in H&E images
transform = NucleusDetectionHE(
    stain='hematoxylin',  # Use hematoxylin channel
    threshold=0.3,
    min_nucleus_size=10
)
```
The resulting mask is stored as `tile.masks['nucleus']`.

BinaryThreshold:
```python
from pathml.preprocessing import BinaryThreshold

# Threshold using Otsu's method
transform = BinaryThreshold(
    method='otsu',  # 'otsu' or manual threshold value
    invert=False
)

# Or specify a manual threshold
transform = BinaryThreshold(threshold=128)
```
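Otsu's method selects the threshold that maximizes between-class variance of the intensity histogram. A self-contained NumPy sketch of the idea (illustrative, not PathML's implementation):

```python
import numpy as np

def otsu_threshold(img):
    """Pick the 8-bit threshold maximizing between-class variance."""
    hist = np.bincount(img.ravel(), minlength=256).astype(np.float64)
    total = hist.sum()
    grand_mean = (np.arange(256) * hist).sum() / total
    best_t, best_var = 0, 0.0
    cum_w, cum_mean = 0.0, 0.0
    for t in range(256):
        cum_w += hist[t]
        cum_mean += t * hist[t]
        if cum_w == 0 or cum_w == total:
            continue
        w0 = cum_w / total                                     # background weight
        mu0 = cum_mean / cum_w                                 # background mean
        mu1 = (grand_mean * total - cum_mean) / (total - cum_w)  # foreground mean
        var = w0 * (1 - w0) * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t

# Bimodal image: dark background (20) and bright foreground (200)
img = np.concatenate([np.full(100, 20), np.full(100, 200)]).astype(np.uint8)
t = otsu_threshold(img)
mask = img > t  # binary foreground mask
```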
ForegroundDetection:
```python
from pathml.preprocessing import ForegroundDetection

# Detect foreground regions
transform = ForegroundDetection(
    threshold=0.5,
    min_region_size=1000,  # Minimum size in pixels
    use_saturation=True
)
```
Apply morphological operations to clean up masks:
MorphOpen:
```python
from pathml.preprocessing import MorphOpen

# Remove small objects and noise
transform = MorphOpen(
    kernel_size=5,
    mask_name='tissue'  # Which mask to modify
)
```
MorphClose:
```python
from pathml.preprocessing import MorphClose

# Fill small holes
transform = MorphClose(
    kernel_size=5,
    mask_name='tissue'
)
```
Normalize H&E staining across slides to account for variations in staining procedure and scanners:
```python
from pathml.preprocessing import StainNormalizationHE

# Normalize to reference slide
transform = StainNormalizationHE(
    target='normalize',                 # 'normalize', 'hematoxylin', or 'eosin'
    stain_estimation_method='macenko',  # 'macenko' or 'vahadane'
    tissue_mask_name=None               # Optional tissue mask for better estimation
)
```
Target modes:

- `'normalize'` - Normalize both stains to reference
- `'hematoxylin'` - Extract hematoxylin channel only
- `'eosin'` - Extract eosin channel only

Stain estimation methods:

- `'macenko'` - Macenko et al. 2009 method (faster, more stable)
- `'vahadane'` - Vahadane et al. 2016 method (more accurate, slower)

Advanced parameters:
```python
transform = StainNormalizationHE(
    target='normalize',
    stain_estimation_method='macenko',
    target_od=None,              # Optical density matrix for reference (optional)
    target_concentrations=None,  # Target stain concentrations (optional)
    regularizer=0.1,             # Regularization for vahadane method
    background_intensity=240     # Background intensity level
)
```
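Both estimation methods operate in optical density space via the Beer-Lambert relation OD = -log10(I / I0). A NumPy sketch of that conversion (hypothetical helper; `background_intensity` mirrors the parameter above):

```python
import numpy as np

def rgb_to_od(img, background_intensity=240):
    """Convert RGB intensities to optical density: OD = -log10(I / I0).
    Pixel values are clipped to at least 1 to avoid log(0)."""
    img = np.clip(img.astype(np.float64), 1, None)
    return -np.log10(img / background_intensity)

# Background pixels have ~zero optical density; darker (stained) tissue has higher OD
od_bg = rgb_to_od(np.array([[[240, 240, 240]]], dtype=np.uint8))
od_dark = rgb_to_od(np.array([[[24, 24, 24]]], dtype=np.uint8))
```

Stain estimation then factorizes the OD image into stain vectors and per-pixel concentrations, which is why masking out background (near-zero OD) improves the estimate.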
Example with tissue mask:
```python
from pathml.preprocessing import Pipeline, TissueDetectionHE, StainNormalizationHE

pipeline = Pipeline([
    TissueDetectionHE(),  # Create tissue mask first
    StainNormalizationHE(
        target='normalize',
        stain_estimation_method='macenko',
        tissue_mask_name='tissue'  # Use tissue mask for better estimation
    )
])
```
LabelArtifactTileHE:
```python
from pathml.preprocessing import LabelArtifactTileHE

# Label tiles containing artifacts
transform = LabelArtifactTileHE(
    pen_threshold=0.5,    # Threshold for pen marking detection
    bubble_threshold=0.5  # Threshold for bubble detection
)
```
LabelWhiteSpaceHE:
```python
from pathml.preprocessing import LabelWhiteSpaceHE

# Label tiles with excessive white space
transform = LabelWhiteSpaceHE(
    threshold=0.9,  # Fraction of white pixels
    mask_name='white_space'
)
```
SegmentMIF:
```python
from pathml.preprocessing import SegmentMIF

# Segment cells using the Mesmer deep learning model
transform = SegmentMIF(
    nuclear_channel='DAPI',    # Nuclear marker channel name
    cytoplasm_channel='CD45',  # Cytoplasm marker channel name
    model='mesmer',            # Deep learning segmentation model
    image_resolution=0.5,      # Microns per pixel
    compartment='whole-cell'   # 'nuclear', 'cytoplasm', or 'whole-cell'
)
```
SegmentMIFRemote:
```python
from pathml.preprocessing import SegmentMIFRemote

# Remote inference using the DeepCell API
transform = SegmentMIFRemote(
    nuclear_channel='DAPI',
    cytoplasm_channel='CD45',
    model='mesmer',
    api_url='https://deepcell.org/api'
)
```
QuantifyMIF:
```python
from pathml.preprocessing import QuantifyMIF

# Quantify marker expression per cell
transform = QuantifyMIF(
    segmentation_mask_name='cell_segmentation',
    markers=['CD3', 'CD4', 'CD8', 'CD20', 'CD45'],
    output_format='anndata'  # or 'dataframe'
)
```
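Conceptually, per-cell quantification averages each marker image over the pixels of every labeled cell in the segmentation mask. A toy NumPy sketch for a single marker (`QuantifyMIF` does this across all channels and returns the result as AnnData or a DataFrame):

```python
import numpy as np

def quantify_per_cell(marker_img, label_mask):
    """Mean marker intensity per segmented cell (label 0 = background)."""
    means = {}
    for cell_id in np.unique(label_mask):
        if cell_id == 0:
            continue  # skip background
        means[int(cell_id)] = float(marker_img[label_mask == cell_id].mean())
    return means

labels = np.array([[1, 1, 0],
                   [0, 2, 2]])
marker = np.array([[10., 20., 0.],
                   [0., 5., 15.]])
per_cell = quantify_per_cell(marker, labels)  # {1: 15.0, 2: 10.0}
```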
CollapseRunsCODEX:
```python
from pathml.preprocessing import CollapseRunsCODEX

# Consolidate multi-run CODEX data
transform = CollapseRunsCODEX(
    z_slice=2,           # Select specific z-slice
    run_order=[0, 1, 2]  # Order of acquisition runs
)
```
CollapseRunsVectra:
```python
from pathml.preprocessing import CollapseRunsVectra

# Process Vectra multiplex IF data
transform = CollapseRunsVectra(
    wavelengths=[520, 570, 620, 670, 780]  # Emission wavelengths
)
```
Example: a complete H&E preprocessing pipeline combining quality control, noise reduction, tissue detection, stain normalization, and nucleus detection:

```python
from pathml.preprocessing import (
    Pipeline,
    TissueDetectionHE,
    StainNormalizationHE,
    NucleusDetectionHE,
    MedianBlur,
    LabelWhiteSpaceHE
)

pipeline = Pipeline([
    # 1. Quality control
    LabelWhiteSpaceHE(threshold=0.9),
    # 2. Noise reduction
    MedianBlur(kernel_size=3),
    # 3. Tissue detection
    TissueDetectionHE(min_region_size=500),
    # 4. Stain normalization
    StainNormalizationHE(
        target='normalize',
        stain_estimation_method='macenko',
        tissue_mask_name='tissue'
    ),
    # 5. Nucleus detection
    NucleusDetectionHE(threshold=0.3)
])
```
Example: a multiplexed immunofluorescence (CODEX) pipeline:

```python
from pathml.preprocessing import (
    Pipeline,
    CollapseRunsCODEX,
    SegmentMIF,
    QuantifyMIF
)

codex_pipeline = Pipeline([
    # 1. Consolidate multi-run data
    CollapseRunsCODEX(z_slice=2),
    # 2. Cell segmentation
    SegmentMIF(
        nuclear_channel='DAPI',
        cytoplasm_channel='CD45',
        model='mesmer',
        image_resolution=0.377
    ),
    # 3. Quantify markers
    QuantifyMIF(
        segmentation_mask_name='cell_segmentation',
        markers=['CD3', 'CD4', 'CD8', 'CD20', 'PD1', 'PDL1'],
        output_format='anndata'
    )
])
```
Example: an advanced H&E pipeline with quality control, mask cleanup, and contrast enhancement:

```python
from pathml.preprocessing import (
    Pipeline,
    LabelWhiteSpaceHE,
    LabelArtifactTileHE,
    TissueDetectionHE,
    MorphOpen,
    MorphClose,
    StainNormalizationHE,
    AdaptiveHistogramEqualization
)

advanced_pipeline = Pipeline([
    # Stage 1: Quality control
    LabelWhiteSpaceHE(threshold=0.85),
    LabelArtifactTileHE(pen_threshold=0.5, bubble_threshold=0.5),
    # Stage 2: Tissue detection
    TissueDetectionHE(threshold=10, min_region_size=1000),
    MorphOpen(kernel_size=5, mask_name='tissue'),
    MorphClose(kernel_size=7, mask_name='tissue'),
    # Stage 3: Stain normalization
    StainNormalizationHE(
        target='normalize',
        stain_estimation_method='vahadane',
        tissue_mask_name='tissue'
    ),
    # Stage 4: Contrast enhancement
    AdaptiveHistogramEqualization(clip_limit=0.03, tile_grid_size=(8, 8))
])
```
Running a pipeline on a single slide:

```python
from pathml.core import SlideData

# Load slide
wsi = SlideData.from_slide("slide.svs")

# Generate tiles
wsi.generate_tiles(level=1, tile_size=256, stride=256)

# Run pipeline
pipeline.run(wsi)

# Access processed data
for tile in wsi.tiles:
    normalized_image = tile.image
    tissue_mask = tile.masks.get('tissue')
    nucleus_mask = tile.masks.get('nucleus')
```
Distributed processing across a dataset with Dask:

```python
from pathml.core import SlideDataset
from dask.distributed import Client
import glob

# Start Dask client
client = Client(n_workers=8, threads_per_worker=2, memory_limit='4GB')

# Create dataset
slide_paths = glob.glob("data/*.svs")
dataset = SlideDataset(
    slide_paths,
    tile_size=512,
    stride=512,
    level=1
)

# Run pipeline in parallel
dataset.run(
    pipeline,
    distributed=True,
    client=client
)

# Save results
dataset.to_hdf5("processed_dataset.h5")
client.close()
```
Execute transforms only on tiles meeting specific criteria:
```python
# Filter tiles before processing
wsi.generate_tiles(level=1, tile_size=256)

# Run pipeline only on tiles with a tissue mask
for tile in wsi.tiles:
    if tile.masks.get('tissue') is not None:
        pipeline.run(tile)
```
```python
# Process large datasets in batches
batch_size = 100
for i in range(0, len(slide_paths), batch_size):
    batch_paths = slide_paths[i:i + batch_size]
    batch_dataset = SlideDataset(batch_paths)
    batch_dataset.run(pipeline, distributed=True)
    batch_dataset.to_hdf5(f"batch_{i}.h5")
```
Certain transforms leverage GPU acceleration when available:
```python
import torch

# Check GPU availability
print(f"CUDA available: {torch.cuda.is_available()}")

# Transforms that benefit from GPU:
# - SegmentMIF (Mesmer deep learning model)
# - StainNormalizationHE (matrix operations)
```
```python
from dask.distributed import Client

# CPU-bound tasks (image processing)
client = Client(
    n_workers=8,
    threads_per_worker=1,  # Use processes, not threads
    memory_limit='8GB'
)

# GPU tasks (deep learning inference)
client = Client(
    n_workers=2,  # Fewer workers for GPU
    threads_per_worker=4,
    processes=True
)
```
Create custom preprocessing operations by subclassing Transform:
```python
from pathml.preprocessing.transforms import Transform
import numpy as np

class CustomTransform(Transform):
    def __init__(self, param1, param2):
        self.param1 = param1
        self.param2 = param2

    def apply(self, tile):
        # Access tile image
        image = tile.image
        # Apply custom operation
        processed = self.custom_operation(image, self.param1, self.param2)
        # Update tile in place and return it
        tile.image = processed
        return tile

    def custom_operation(self, image, param1, param2):
        # Implement custom logic here; placeholder returns the image unchanged
        return image

# Use in pipeline
pipeline = Pipeline([
    CustomTransform(param1=10, param2=0.5),
    # ... other transforms
])
```
- Order transforms appropriately: run transforms that create masks before the transforms that consume them.
- Use tissue masks for stain normalization: `TissueDetectionHE()` followed by `StainNormalizationHE(tissue_mask_name='tissue')`.
- Apply morphological operations to clean masks: `MorphOpen` to remove small false positives, `MorphClose` to fill small gaps.
- Leverage distributed processing for large datasets.
- Save intermediate results.
- Validate preprocessing on sample images.
- Handle edge cases.
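The mask-cleanup step (opening to drop specks, closing to fill gaps) can be sketched with naive NumPy morphology (illustrative helpers only; PathML's `MorphOpen`/`MorphClose` operate on named tile masks with optimized kernels):

```python
import numpy as np

def erode(mask, k=3):
    """Binary erosion with a k x k square structuring element (naive sketch)."""
    r = k // 2
    padded = np.pad(mask, r, constant_values=False)
    out = np.zeros_like(mask)
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].all()
    return out

def dilate(mask, k=3):
    """Binary dilation with a k x k square structuring element (naive sketch)."""
    r = k // 2
    padded = np.pad(mask, r, constant_values=False)
    out = np.zeros_like(mask)
    for i in range(mask.shape[0]):
        for j in range(mask.shape[1]):
            out[i, j] = padded[i:i + k, j:j + k].any()
    return out

# Opening = erosion then dilation: removes isolated single-pixel noise
# while roughly preserving larger regions
mask = np.zeros((7, 7), dtype=bool)
mask[1, 1] = True       # speck of noise
mask[3:6, 3:6] = True   # 3x3 real region
opened = dilate(erode(mask))
```

Closing is the reverse composition (dilation then erosion) and fills small holes instead of removing small objects.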
- Issue: Stain normalization produces artifacts
- Issue: Out of memory during pipeline execution
- Issue: Tissue detection misses tissue regions - try `use_saturation=True`
- Issue: Nucleus detection is inaccurate