docs/numpy_ml.trees.rst
Tree-based models ################# .. raw:: html
<h2>Decision Trees</h2>Decision trees_ [1]_ are popular nonparametric models that iteratively split a
training dataset into smaller, more homogenous subsets. Each node in the tree
is associated with a decision rule, which dictates how to divide the data the
node inherits from its parent among each of its children. Each leaf node is
associated with at least one data point from the original training set.
.. figure:: img/decision_tree.png :width: 95% :align: center
A binary decision tree trained on the dataset :math:`X = \{ \mathbf{x}_1,
\ldots, \mathbf{x}_{10} \}`. Each example in the dataset is a 4-dimensional
vector of real-valued features labeled :math:`x_1, \ldots, x_4`. Unshaded
circles correspond to internal decision nodes, while shaded circles
correspond to leaf nodes. Each leaf node is associated with a subset of the
examples in `X`, selected based on the decision rules along the path from
root to leaf.
At test time, new examples travel from the tree root to one of the leaves, their path through the tree determined by the decision rules at each of the nodes it visits. When a test example arrives at a leaf node, the targets for the training examples at that leaf node are used to compute the model's prediction.
Training decision trees corresponds to learning the set of decision rules to
partition the training data. This learning process proceeds greedily by
selecting the decision rule at each node that results in the greatest reduction
in an inhomogeneity or "impurity" metric, :math:\mathcal{L}. One popular
metric is the information entropy:
.. math::
-\sum_j P_n(\omega_j) \log P_n(\omega_j)
where :math:P_n(\omega_j) is the fraction of data points at split n that are
associated with category :math:\omega_j. Another useful metric is the Gini
impurity:
.. math::
\sum_{i \neq j} P_n(\omega_i) P_n(\omega_j) = 1 - \sum_{j} P_n(\omega_j)^2
For a binary tree (where each node has only two children), the reduction in impurity after a particular split is
.. math::
\Delta \mathcal{L} = \mathcal{L}(\text{Parent}) -
P_{\text{left}} \mathcal{L}(\text{Left child}) -
(1 - P_{\text{left}})\mathcal{L}(\text{Right child})
where :math:\mathcal{L}(x) is the impurity of the dataset at node x,
and :math:P_{\text{left}}/:math:P_{\text{right}} are the proportion of
examples at the current node that are partitioned into the left / right
children, respectively, by the proposed split.
.. _Decision trees: https://en.wikipedia.org/wiki/Decision_tree_learning
Models
~numpy_ml.trees.DecisionTreeReferences
.. [1] Breiman, L., Friedman, J. H., Olshen, R. A., and Stone, C. J. (1984). Classification and regression trees. Monterey, CA: Wadsworth & Brooks/Cole Advanced Books & Software.
.. raw:: html
<h2>Bootstrap Aggregating</h2>Bootstrap aggregating_ (bagging) methods [2]_ are an ensembling approach_ that
proceeds by creating n bootstrapped samples of a training dataset by sampling
from it with replacement. A separate learner is fit on each of the n
bootstrapped datasets, with the final bootstrap aggregated model prediction
corresponding to the average (or majority vote, for classifiers) across each
of the n learners' predictions for a given datapoint.
The random forest_ model [3]_ [4]_ is a canonical example of bootstrap
aggregating. For this approach, each of the n learners is a different
decision tree. In addition to training each decision tree on a different
bootstrapped dataset, random forests employ a random subspace_ approach [5]_:
each decision tree is trained on a subsample (without replacement) of the full
collection of dataset features.
.. _Bootstrap aggregating: https://en.wikipedia.org/wiki/Bootstrap_aggregating
.. _random forest: https://en.wikipedia.org/wiki/Random_forest
.. _ensembling approach: https://en.wikipedia.org/wiki/Ensemble_learning
.. _random subspace: https://en.wikipedia.org/wiki/Random_subspace_method
Models
~numpy_ml.trees.RandomForestReferences
.. [2] Breiman, L. (1994). "Bagging predictors". Technical Report 421. Statistics Department, UC Berkeley. .. [3] Ho, T. K. (1995). "Random decision forests". Proceedings of the Third International Conference on Document Analysis and Recognition, 1: 278-282. .. [4] Breiman, L. (2001). "Random forests". Machine Learning. 45(1): 5-32. .. [5] Ho, T. K. (1998). "The random subspace method for constructing decision forests". IEEE Transactions on Pattern Analysis and Machine Intelligence. 20(8): 832-844.
.. raw:: html
<h2>Gradient Boosting</h2>Gradient boosting_ [6]_ [7]_ [8]_ is another popular ensembling technique_
that proceeds by iteratively fitting a sequence of m weak learners such that:
.. math::
f_m(X) = b(X) + \eta w_1 g_1 + \ldots + \eta w_m g_m
where b is a fixed initial estimate for the targets, :math:\eta is
a learning rate parameter, and :math:w_{i} and :math:g_{i}
denote the weights and predictions of the :math:i^{th} learner.
At each training iteration a new weak learner is fit to predict the negative
gradient of the loss with respect to the previous prediction,
:math:\nabla_{f_{i-1}} \mathcal{L}(y, \ f_{i-1}(X)). We then use the
element-wise product of the predictions of this weak learner, :math:g_i, with
a weight, :math:w_i, computed via, e.g., line-search_ on the objective
:math:w_i = \arg \min_{w} \sum_{j=1}^n \mathcal{L}(y_j, f_{i-1}(x_j) + w g_i)
, to adjust the predictions of the model from the previous iteration,
:math:f_{i-1}(X):
.. math::
f_i(X) := f_{i-1}(X) + w_i g_i
The current module implements gradient boosting using decision trees as the weak learners.
.. _Gradient boosting: https://en.wikipedia.org/wiki/Gradient_boosting
.. _ensembling technique: https://en.wikipedia.org/wiki/Ensemble_learning
.. _line-search: https://en.wikipedia.org/wiki/Line_search
Models
~numpy_ml.trees.GradientBoostedDecisionTreeReferences
.. [6] Breiman, L. (1997). "Arcing the edge". Technical Report 486. Statistics Department, UC Berkeley. .. [7] Friedman, J. H. (1999). "Greedy function approximation: A gradient boosting machine". IMS 1999 Reitz Lecture. .. [8] Mason, L., Baxter, J., Bartlett, P. L., Frean, M. (1999). "Boosting algorithms as gradient descent" Advances in Neural Information Processing Systems, 12: 512–518.
.. toctree:: :maxdepth: 3 :hidden:
numpy_ml.trees.dt
numpy_ml.trees.rf
numpy_ml.trees.gbdt