GRN Inference Algorithms

Arboreto provides two algorithms for gene regulatory network (GRN) inference, GRNBoost2 and GENIE3, both based on a multiple regression approach.

Algorithm Overview

Both algorithms follow the same inference strategy:

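For each target gene in the dataset, train a regression model that predicts its expression from the expression of the candidate regulators
Identify the most important features (potential regulators) from the trained model
Emit these features as candidate regulators of the target gene, each with an importance score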
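
The three steps above can be sketched with plain numpy. This is only an illustration of the strategy, not Arboreto's implementation: it swaps the tree ensembles for ordinary least squares on standardized data and uses the absolute coefficient as a stand-in for feature importance. The function name infer_links and the toy data are hypothetical.

```python
import numpy as np


def infer_links(expression, gene_names, tf_names):
    """Return (tf, target, importance) tuples, strongest first.

    expression: 2-D array, rows = observations, columns = genes.
    Importance here is |OLS coefficient| on standardized data, a
    stand-in for the tree-based importances of GRNBoost2/GENIE3.
    """
    # Standardize so coefficients are comparable across genes.
    z = (expression - expression.mean(axis=0)) / expression.std(axis=0)
    tf_set = set(tf_names)
    tf_idx = [i for i, g in enumerate(gene_names) if g in tf_set]
    links = []
    for t, target in enumerate(gene_names):
        # Step 1: regress the target on all candidate regulators except itself.
        predictors = [i for i in tf_idx if i != t]
        if not predictors:
            continue
        coef, *_ = np.linalg.lstsq(z[:, predictors], z[:, t], rcond=None)
        # Steps 2-3: score each regulator and emit it as a candidate link.
        for i, c in zip(predictors, coef):
            links.append((gene_names[i], target, abs(float(c))))
    return sorted(links, key=lambda link: link[2], reverse=True)


rng = np.random.default_rng(0)
tf_a = rng.normal(size=200)
tf_b = rng.normal(size=200)
gene_c = 2.0 * tf_a + 0.1 * rng.normal(size=200)  # gene_c is driven by tf_a
toy = np.column_stack([tf_a, tf_b, gene_c])

links = infer_links(toy, ["tf_a", "tf_b", "gene_c"], tf_names=["tf_a", "tf_b"])
print(links[0][:2])  # the strongest link: ('tf_a', 'gene_c')
```

Every gene is treated as a target, and a transcription factor is never used to predict itself, which is why the toy run yields four links rather than six.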

The key differences are the underlying regression method and the resulting computational cost.

GRNBoost2 (Recommended)

Purpose: Fast GRN inference for large-scale datasets using gradient boosting.

When to Use

Large datasets: Tens of thousands of observations (e.g., single-cell RNA-seq)
Time-constrained analysis: Need faster results than GENIE3
Default choice: GRNBoost2 is the flagship algorithm and recommended for most use cases

Technical Details

Method: Stochastic gradient boosting with early-stopping regularization
Performance: Significantly faster than GENIE3 on large datasets
Output: Same format as GENIE3 (a pandas DataFrame of TF-target-importance triplets)

Usage

python

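from arboreto.algo import grnboost2

# expression_matrix: pandas DataFrame (rows = observations, columns = genes)
# tf_names: list of transcription factor gene names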

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42  # For reproducibility
)
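
Once a network has been inferred, it is usually filtered before further analysis. The sketch below assumes the result is a pandas DataFrame with TF, target and importance columns; a small hand-built frame with made-up values stands in for real output so the example runs without Arboreto.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame returned by grnboost2/genie3.
network = pd.DataFrame(
    {
        "TF": ["tf_a", "tf_b", "tf_a", "tf_c"],
        "target": ["gene_1", "gene_1", "gene_2", "gene_2"],
        "importance": [12.5, 0.8, 7.1, 3.3],
    }
)

# Keep the strongest links overall.
top_links = network.sort_values("importance", ascending=False).head(2)

# Keep the single strongest regulator for each target.
best_per_target = network.loc[network.groupby("target")["importance"].idxmax()]

print(top_links)
print(best_per_target)
```

The cut-off of two links is arbitrary here; in practice the threshold depends on the dataset and the downstream analysis.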

Parameters

python

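grnboost2(
    expression_data,              # Required: pandas DataFrame or numpy array
    gene_names=None,              # Gene names; required when expression_data is a numpy array
    tf_names='all',               # List of TF names, or 'all' to use every gene
    client_or_address='local',    # Dask client or scheduler address
    early_stop_window_length=25,  # Window length for early stopping
    limit=None,                   # Optional: number of top links to return
    seed=None,                    # Random seed for reproducibility
    verbose=False                 # Print progress messages
)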
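
The client_or_address parameter is the one most often customised. The sketch below shows one way to run GRNBoost2 on a Dask LocalCluster sized for the machine. The wrapper name run_grnboost2_on_local_cluster and the worker count are assumptions for illustration; the imports sit inside the function so the snippet can be read and loaded without Arboreto or Dask distributed installed.

```python
def run_grnboost2_on_local_cluster(expression_matrix, tf_names, n_workers=4, seed=42):
    """Run GRNBoost2 on a Dask LocalCluster created just for this call."""
    from arboreto.algo import grnboost2
    from distributed import Client, LocalCluster

    cluster = LocalCluster(n_workers=n_workers, threads_per_worker=1)
    client = Client(cluster)
    try:
        return grnboost2(
            expression_data=expression_matrix,
            tf_names=tf_names,
            client_or_address=client,
            seed=seed,
        )
    finally:
        # Always release the workers, even if inference fails.
        client.close()
        cluster.close()


if __name__ == "__main__":
    # The guard matters because Dask may start worker processes that
    # re-import this module.
    # network = run_grnboost2_on_local_cluster(expression_matrix, tf_names)
    pass
```

Passing a scheduler address string instead of a client object follows the same pattern when a cluster is already running elsewhere.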

GENIE3

Purpose: Classic Random Forest-based GRN inference, the conceptual blueprint for GRNBoost2.

When to Use

Smaller datasets: When dataset size allows for longer computation
Comparison studies: When comparing with published GENIE3 results
Validation: To validate GRNBoost2 results

Technical Details

Method: Random Forest or ExtraTrees regression
Foundation: Original multiple regression GRN inference strategy
Trade-off: More computationally expensive but well-established

Usage

python

from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42
)

Parameters

python

genie3(
    expression_data,           # Required: pandas DataFrame or numpy array
    gene_names=None,           # Gene names; required when expression_data is a numpy array
    tf_names='all',            # List of TF names, or 'all' to use every gene
    client_or_address='local', # Dask client or scheduler address
    limit=None,                # Optional: number of top links to return
    seed=None,                 # Random seed for reproducibility
    verbose=False              # Print progress messages
)

Algorithm Comparison

Feature	GRNBoost2	GENIE3
Speed	Fast (optimized for large data)	Slower
Method	Gradient boosting	Random Forest
Best for	Large-scale data (10k+ observations)	Small-medium datasets
Output format	Same	Same
Inference strategy	Multiple regression	Multiple regression
Recommended	Yes (default choice)	For comparison/validation

Advanced: Custom Regressor Parameters

Advanced users can pass custom scikit-learn regressor parameters through the diy function in arboreto.algo. grnboost2 and genie3 do not accept regressor_type or regressor_kwargs; diy takes both, with regressor_type set to 'GBM' (gradient boosting), 'RF' (Random Forest) or 'ET' (ExtraTrees):

python

from arboreto.algo import diy

# Custom gradient boosting parameters (GRNBoost2-style)
custom_gbm_network = diy(
    expression_data=expression_matrix,
    regressor_type='GBM',
    regressor_kwargs={
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1
    }
)

# Custom Random Forest parameters (GENIE3-style)
custom_rf_network = diy(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        'n_estimators': 1000,
        'max_features': 'sqrt'
    }
)

Choosing the Right Algorithm

Decision guide:

Start with GRNBoost2: it is faster and handles large datasets better
Use GENIE3 if:
- You are comparing with existing GENIE3 publications
- The dataset is small to medium sized
- You want to validate GRNBoost2 results

Both algorithms produce comparable regulatory networks in the same output format, so downstream code works with either and the two are interchangeable for most analyses.
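
One simple way to validate a GRNBoost2 network against a GENIE3 network is to compare their strongest links. The sketch below computes the Jaccard index of the top-k edge sets. The helper names top_edges and edge_overlap are hypothetical, the rows are made up, and the importance values are only ranked within each network, never compared across networks.

```python
def top_edges(rows, k):
    """Return the set of (TF, target) pairs for the k highest-importance rows."""
    ranked = sorted(rows, key=lambda row: row[2], reverse=True)
    return {(tf, target) for tf, target, _ in ranked[:k]}


def edge_overlap(rows_a, rows_b, k):
    """Jaccard index of the top-k edge sets of two networks."""
    a, b = top_edges(rows_a, k), top_edges(rows_b, k)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Made-up (TF, target, importance) rows standing in for two inferred networks.
grnboost2_rows = [("tf_a", "g1", 9.0), ("tf_b", "g2", 4.0), ("tf_a", "g3", 1.0)]
genie3_rows = [("tf_a", "g1", 0.6), ("tf_c", "g2", 0.3), ("tf_b", "g2", 0.2)]

print(edge_overlap(grnboost2_rows, genie3_rows, k=2))
```

If the networks are pandas DataFrames with TF, target and importance columns, rows in this shape can be obtained with network[["TF", "target", "importance"]].itertuples(index=False).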