# GBDT on Angel
GBDT (Gradient Boosting Decision Tree) is a machine-learning algorithm that produces an ensemble of weak learners (decision trees) for a prediction task. It is a powerful method for solving classification and regression problems.
Figure 1 shows an example GBDT for modeling consumers' purchasing potential. The procedure takes the following steps:
As shown in Figure 2, in order to optimize performance, we need to store the following parameter matrices on the parameter server (PS):
All the above parameter matrices are repeatedly updated and exchanged during the GBDT fitting process.
GBDT's implementation procedure takes the following steps:
The above steps are iterated until all trees are built. Once training is done, we compute the performance metrics (such as accuracy/error) and output the trained model.
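The overall fitting loop can be sketched as follows. This is a minimal single-machine illustration using squared loss and one-feature decision stumps; the helper names and toy data are our own assumptions, not Angel's API:

```python
# Minimal gradient-boosting sketch: decision stumps on 1-D data, squared loss.
# Illustrative only -- not Angel's actual implementation.

def fit_stump(x, residuals):
    """Find the threshold split that minimizes squared error of the residuals."""
    best = None
    for t in sorted(set(x)):
        left = [r for xi, r in zip(x, residuals) if xi <= t]
        right = [r for xi, r in zip(x, residuals) if xi > t]
        if not left or not right:
            continue
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda xi: lm if xi <= t else rm

def fit_gbdt(x, y, n_trees=20, lr=0.2):
    """Each round fits a tree to the negative gradient (residuals for squared loss)."""
    trees = []
    pred = [0.0] * len(y)
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]  # negative gradient
        stump = fit_stump(x, resid)
        trees.append(stump)
        pred = [pi + lr * stump(xi) for pi, xi in zip(pred, x)]
    return lambda xi: sum(lr * t(xi) for t in trees)

model = fit_gbdt([1, 2, 3, 4, 5, 6], [1.0, 1.1, 0.9, 5.0, 5.2, 4.9])
```

In a production system each "tree" is of course deeper than a stump and splits are found via histograms, as described below.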
The next section is dedicated to the details of Step 3, finding the best split (split point and split node) -- a key and challenging step.
Finding the best split is the most challenging and important step in GBDT, and is exactly where the PS architecture pays off. The procedure takes the following steps:
1. **Compute the histograms:** The worker retrieves a tree node from the queue and computes a local gradient histogram and a local hessian histogram for it.
2. **Synchronize and merge the histograms:** The worker pushes its local histograms to the PS through the histogram-update interface. Before pushing, each local histogram is partitioned into P parts, and each part is sent to its corresponding PS node. A PS node, upon receiving local histograms from the workers, determines which tree node they belong to and adds them to the corresponding overall histograms.
3. **Find the best split point:** The worker requests candidate split points from all PS nodes through the split-query interface, compares the objective-function gain of the P candidates, and chooses the split point that maximizes the gain.
4. **Split the node:** The PS returns the split feature, split point, and objective-function gain to the worker; the worker creates two leaf nodes based on the best split and divides the training data between them. If the tree depth is still smaller than the maximum depth, the two leaf nodes are added to the queue.
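The histogram construction and the gain-based scan in steps 1 and 3 can be sketched as below, using the standard second-order split gain computed from gradient/hessian sums. The bin count, the regularization term `lam`, and the function names are illustrative assumptions, not Angel's interfaces:

```python
def build_histograms(feature_values, grads, hessians, n_bins=10):
    """Accumulate per-bin gradient and hessian sums (the 'histograms')."""
    lo, hi = min(feature_values), max(feature_values)
    width = (hi - lo) / n_bins or 1.0
    g_hist = [0.0] * n_bins
    h_hist = [0.0] * n_bins
    for v, g, h in zip(feature_values, grads, hessians):
        b = min(int((v - lo) / width), n_bins - 1)
        g_hist[b] += g
        h_hist[b] += h
    return g_hist, h_hist

def best_split(g_hist, h_hist, lam=1.0):
    """Scan bin boundaries; return (gain, bin) maximizing the second-order gain."""
    G, H = sum(g_hist), sum(h_hist)
    gl = hl = 0.0
    best_gain, best_bin = 0.0, None
    for b in range(len(g_hist) - 1):
        gl += g_hist[b]
        hl += h_hist[b]
        gr, hr = G - gl, H - hl
        gain = gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_gain, best_bin
```

On Angel, `best_split` runs on the PS side over each node's slice of the merged histograms, so only the winning candidates travel back to the worker.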
From the above logic, we can see how PS and Angel are well-suited to model updating and synchronization at scale for GBDT. More specifically, the perks are:
* **Massively complex models:** The gradient/hessian histograms used by GBDT grow with the number of features and can be prohibitively large for single-node computation on big data. Angel partitions the histograms across multiple PS nodes, avoiding the single-point bottleneck when merging the parameters of complex models.
* **Two-stage tree-splitting algorithm:** The best split can be searched in parallel on multiple PS nodes, and only each node's locally best split needs to be returned to the worker, so the communication cost is almost negligible.
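A minimal sketch of the two-stage search: each of P histogram partitions finds its locally best split, and the worker then only compares P (gain, bin) pairs instead of receiving whole histograms. The function name and the contiguous-bin partitioning are our own assumptions:

```python
def two_stage_best_split(g_hist, h_hist, P=2, lam=1.0):
    """Stage 1: each of P partitions scans its slice for the locally best split.
    Stage 2: the worker keeps the global argmax over the P candidates."""
    n = len(g_hist)
    G, H = sum(g_hist), sum(h_hist)
    candidates = []
    step = n // P
    for p in range(P):
        start = p * step
        end = n if p == P - 1 else (p + 1) * step
        # Prefix sums before this slice (small scalars, cheap to communicate).
        gl, hl = sum(g_hist[:start]), sum(h_hist[:start])
        best = (0.0, None)
        for b in range(start, min(end, n - 1)):
            gl += g_hist[b]
            hl += h_hist[b]
            gr, hr = G - gl, H - hl
            gain = gl * gl / (hl + lam) + gr * gr / (hr + lam) - G * G / (H + lam)
            if gain > best[0]:
                best = (gain, b)
        candidates.append(best)  # one (gain, bin) pair per "PS node"
    return max(candidates, key=lambda c: c[0])
```

Only the P candidate pairs cross the network in stage 2, which is why the communication cost stays almost negligible regardless of histogram size.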
Overall, these advantages of Angel's PS are borne out by GBDT's performance on the platform: training is drastically faster than on Spark, and also somewhat faster than MPI-based XGBoost.
The data format is set in "ml.data.type". GBDT on Angel supports the "libsvm" and "dummy" formats. For details, see Angel Data Format.
The feature vector's dimension is set in "ml.feature.num".
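For illustration, the two supported formats can be parsed as follows, assuming the common conventions: a "libsvm" line is `label index:value ...`, and a "dummy" line lists the indices of the 1-valued features. The helper names are ours:

```python
def parse_libsvm(line):
    """'label i1:v1 i2:v2 ...' -> (label, {index: value})"""
    parts = line.split()
    label = float(parts[0])
    feats = {}
    for tok in parts[1:]:
        i, v = tok.split(":")
        feats[int(i)] = float(v)
    return label, feats

def parse_dummy(line):
    """'label i1 i2 ...' -> (label, {index: 1.0}).
    The dummy format lists only the indices of 1-valued features."""
    parts = line.split()
    return float(parts[0]), {int(i): 1.0 for i in parts[1:]}
```

Both formats suit sparse data; all feature indices must be smaller than the value set in "ml.feature.num".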
### Algorithm Parameters
### I/O Parameters
### Resource Parameters
We compare Angel and XGBoost on Tencent's internal datasets:
| Dataset | Size | Number of Samples | Number of Features | Task |
|---|---|---|---|---|
| UserGender1 | 24GB | 12.5M | 2.57K | binary classification |
| UserGender2 | 145GB | 120M | 330K | binary classification |
The task is to classify users' gender. UserGender1 is 24GB, containing 12.5M training samples with 2,570 features each; UserGender2 is 145GB, containing 120M samples with 330K features each. Both datasets are sparse.
## Experimental Environment
The experiment was run on Tencent's Gaia cluster (Yarn), and each instance has the following configuration:
## Variable Setting
Angel and XGBoost used the following settings:
* Number of trees: 20
* Maximum tree depth: 7
* Histogram size: 10
* Learning rate: 0.1 for XGBoost, 0.2 for Angel
* Number of workers: 50
* Number of PS: 10
* Worker node memory: 2GB (UserGender1), 10GB (UserGender2)
## Result
| System | Dataset | Total Training Time | Training Time per Tree | Test Set Error |
|---|---|---|---|---|
| XGBoost | UserGender1 | 36min 48s | 110s | 0.155008 |
| Angel | UserGender1 | 25min 22s | 76s | 0.154160 |
| XGBoost | UserGender2 | 2h 25min | 435s | 0.232039 |
| Angel | UserGender2 | 58min 39s | 175s | 0.243316 |