docs/deploy/resource_config_guide_en.md
Angel is a distributed machine-learning system based on the parameter-server (PS) paradigm. The introduction of PS simplifies the computational complexity and improves the operational speed; in the meantime, however, it increases the complexity of resource configuration. We create this file to help guide users configure the PS system properly to achieve high performance of their algorithms.
We list resource parameters for Angel jobs. Sensible configurations of these parameters, based on the application-specific information such as data size, model design, and machine-learning algorithms, can significantly improve the operational efficiency.
Because Angel master is lightweight, in most cases, the default configuration is sufficient. Since the master does not host the computations, it does not need more resources than the default allocation unless there will be too many workers. In contrast, configurations of the worker and PS resources need more care.
Angel can configure number of workers autonomously to achieve the best result for data parallelization. In general, the number of workers is determined by the total amount of data in the computation. We suggest the typical range of data that a single worker processes to be between 1GB and 5GB . Please note that the estimation is for uncompressed data only, and needs to be multiplied by the data compression ratio for other formats --- we use the same principle for other estimations when applicable.
The comparative usage of worker memory is shown below:
Model Partitions
System Buffer
Data
Equations for Estimation
Let's define the following variables:
N: number of tasks running on a single workerSm: size of model used in trainingSmd: size of model update per iteration, per taskSmmd: size of merged model updatesSt: memory used for training dataSb: memory used by the systemThen, the worker memory can be estimated by
The system memory can be roughly estimated by
Summing the two up, the overall estimation is
In contrast to the memory parameters, number of CPU vcores only affects the job's operational efficiency, but not its accuracy. We suggest adjusting the number of CPU vcores and memory based on the resources on the actual machine, for example:
Assuming the machine has 100G total memory and 50 CPU vcores; if the worker's memory is configured to be 10G/20G, then the number of CPU vcores can be configured to 5/10
Number of PS depends on the number of workers, the model types (see Model Partitions section below for definition) and model complexity.
In general, PS number should increase if the number of workers or model complexity increase. We suggest setting the number of PS to be between 1/5 and 4/5 of the number of workers. In fact, the number of PS should be configured in an algorithm-specific way.
The comparative usage of PS memory is shown below:
Model Partitions: partitions of the model loaded on the PS
System Buffer: ByteBuf pool used by Netty framework, among others
Equations for Estimation
Smp: size of model partitionsSw: size of model being pulled per iteration, per taskN: number of workersPS memory is roughly estimated by:
This is similar to the estimation for workers, but needs additional analyses of the algorithm. In gneral, if there is intense computation on the PS side, number of vcores needs to be raised correspondingly.
We take LR as an example to demonstrate the parameter configurations. By default, LR is dense and uses double arrays to store the model and model partitions.
Assuming 300GB training data and 100M features:
If we configure 100 workers, each of which running one task, then the estimated memory for each single worker is:
We round up the worker memory to 8GB, set the number of PS to 20 (1/5 of the number of workers). Now, each PS loads 5M model partitions, thus the estimated memory requirement for each PS is:
We round the PS memory to 8GB. Assuming each machine has 128G memory and 48 CPU vcores, each worker needs the following CPU vcores:
Number of CPU vcores for the PS is: