skills/arboreto/references/algorithms.md
Arboreto provides two high-level algorithms for gene regulatory network (GRN) inference, both based on the multiple regression approach.
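Both algorithms follow the same inference strategy: for each target gene, a regression model is trained to predict its expression from the candidate regulators (TFs), and the model's feature importances are reported as regulatory link scores.
The key differences are computational efficiency and the underlying regression method: gradient boosting for GRNBoost2, Random Forest for GENIE3.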
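The per-target loop behind this strategy can be sketched in plain Python. This is not Arboreto's implementation: it swaps the tree-based regressors for absolute Pearson correlation as a stand-in importance score, and the names `pearson` and `infer_links` are hypothetical. It only preserves the loop structure: each gene in turn is treated as a target, every candidate TF is scored against it, and the scores are collected as (TF, target, importance) links.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def infer_links(expression, tf_names):
    """Score every (TF, target) pair.

    expression: dict mapping gene name -> list of values, one per observation.
    The score is a stand-in for a regressor's feature importance.
    """
    links = []
    for target, target_values in expression.items():
        for tf in tf_names:
            if tf == target:
                continue  # a gene is not used to predict itself
            score = abs(pearson(expression[tf], target_values))
            links.append((tf, target, score))
    # Strongest links first, across all targets
    links.sort(key=lambda link: link[2], reverse=True)
    return links


# Toy data: G1 tracks TF1 exactly
expression = {
    "TF1": [1.0, 2.0, 3.0, 4.0],
    "TF2": [4.0, 1.0, 3.0, 2.0],
    "G1": [2.0, 4.0, 6.0, 8.0],
}
links = infer_links(expression, tf_names=["TF1", "TF2"])
for tf, target, score in links:
    print(tf, target, round(score, 3))
```

In the real algorithms the inner scoring step is a fitted ensemble model per target, which is why the choice of regressor dominates the runtime.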
Purpose (GRNBoost2): Fast GRN inference for large-scale datasets using gradient boosting.
```python
from arboreto.algo import grnboost2

network = grnboost2(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,
    limit=5000,
)
```
```python
# Full signature with default values
grnboost2(
    expression_data,               # DataFrame, ndarray, or scipy.sparse.csc_matrix
    gene_names=None,               # Required for ndarray/sparse inputs
    tf_names='all',                # TF list, None/'all' → all genes as regulators
    client_or_address='local',     # 'local', scheduler address, or Dask Client
    early_stop_window_length=25,   # Early-stopping window (GRNBoost2 only)
    limit=None,                    # Return top N links globally
    seed=None,                     # Random seed; None = non-deterministic
    verbose=False,
)
```
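To make the `limit` parameter concrete, here is a minimal sketch of keeping the top N links across all targets rather than per target. The helper name `top_links` and the link values are made up for illustration, and the sketch mimics only the selection logic in plain Python, not how Arboreto stores or returns its results.

```python
import heapq


def top_links(links, limit=None):
    """Return links sorted by importance, strongest first.

    links: iterable of (tf, target, importance) tuples.
    limit: keep only the `limit` strongest links overall; None keeps all.
    """
    if limit is None:
        return sorted(links, key=lambda link: link[2], reverse=True)
    return heapq.nlargest(limit, links, key=lambda link: link[2])


# Hypothetical scores for two TFs and two targets
links = [
    ("TF1", "G1", 0.9),
    ("TF2", "G1", 0.2),
    ("TF1", "G2", 0.5),
    ("TF2", "G2", 0.7),
]

top2 = top_links(links, limit=2)
print(top2)
```

Because the cut is global, a target whose links all score low can disappear from the limited output entirely.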
Purpose (GENIE3): Classic Random Forest-based GRN inference, serving as the conceptual blueprint for the multiple regression strategy.
```python
from arboreto.algo import genie3

network = genie3(
    expression_data=expression_matrix,
    tf_names=tf_names,
    seed=42,
)
```
```python
# Full signature with default values
genie3(
    expression_data,
    gene_names=None,
    tf_names='all',
    client_or_address='local',
    limit=None,
    seed=None,
    verbose=False,
)
```
| Feature | GRNBoost2 | GENIE3 |
|---|---|---|
| Speed | Fast (optimized for large data) | Slower |
| Method | Gradient boosting (GBM) | Random Forest |
| Best for | Large-scale data (10k+ observations) | Small-medium datasets |
| Output format | Same | Same |
| Inference strategy | Multiple regression | Multiple regression |
| Recommended | Yes (default choice) | For comparison/validation |
| Early stopping | Yes (early_stop_window_length) | No |
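
The early-stopping row refers to the window set by `early_stop_window_length`. As a rough sketch of the idea, assume the rule is "stop once the average out-of-bag improvement over the last window of boosting rounds drops below zero". That rule is an assumption about the mechanism, not a copy of Arboreto's code, and the function name and improvement values below are hypothetical.

```python
def first_stop_round(oob_improvements, window_length=25, threshold=0.0):
    """Return the first 0-based boosting round at which training would stop.

    The check starts once a full window of rounds is available and fires when
    the mean improvement over the most recent window falls below `threshold`.
    Returns None if the condition is never met.
    """
    for current_round in range(window_length - 1, len(oob_improvements)):
        window = oob_improvements[current_round - window_length + 1 : current_round + 1]
        if sum(window) / window_length < threshold:
            return current_round
    return None


# Hypothetical per-round improvements: gains shrink, then turn negative
improvements = [0.5, 0.3, 0.1, 0.05, -0.1, -0.2, -0.3, -0.4]
stop_at = first_stop_round(improvements, window_length=3)
print(stop_at)
```

A longer window tolerates more noisy rounds before stopping, at the cost of fitting more trees per target.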
For custom scikit-learn regressor settings, use diy(); grnboost2() and genie3() do not accept regressor kwargs:
```python
from arboreto.algo import diy
from arboreto.core import SGBM_KWARGS, RF_KWARGS

# Custom GRNBoost2-style run
custom_gbm = diy(
    expression_data=expression_matrix,
    regressor_type='GBM',  # 'RF', 'GBM', or 'ET'
    regressor_kwargs={
        **SGBM_KWARGS,
        'n_estimators': 100,
        'max_depth': 5,
        'learning_rate': 0.1,
    },
    tf_names=tf_names,
    seed=42,
)

# Custom GENIE3-style run
custom_rf = diy(
    expression_data=expression_matrix,
    regressor_type='RF',
    regressor_kwargs={
        **RF_KWARGS,
        'n_estimators': 1000,
        'max_features': 'sqrt',
    },
    tf_names=tf_names,
)
```
Import default kwargs from arboreto.core and override only the keys you need.
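The override pattern relies on Python's dict unpacking, where later keys win. The sketch below uses a stand-in dictionary named `DEFAULT_KWARGS` with illustrative values rather than the real SGBM_KWARGS, so it runs without Arboreto installed; check arboreto.core for the actual defaults.

```python
# Stand-in for a defaults dict such as SGBM_KWARGS; values are illustrative only.
DEFAULT_KWARGS = {
    "learning_rate": 0.01,
    "n_estimators": 5000,
    "subsample": 0.9,
}

# Later keys override earlier ones; untouched defaults are carried over,
# and the original dict is left unmodified.
custom_kwargs = {**DEFAULT_KWARGS, "n_estimators": 100, "max_depth": 5}

print(custom_kwargs)
```

Building a new dict this way avoids mutating the shared defaults, which matters if several runs in the same session start from the same base settings.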
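Decision guide:
- GRNBoost2 as the default choice, especially for large-scale data
- GENIE3 for comparison or validation on small to medium datasets
- diy() if you need non-default regressor hyperparameters

Both algorithms produce comparable regulatory networks with the same output format.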