docs/design/sync_controller_en.md
As is well-known, in a distributed system, nodes usually vary in progress. So when aggregation for result is needed, Slow nodes will block the entire computing task. And this will greatly waste computation resource.
Considering machine learning's specific characteristic, ML system can actually loosen the synchronization restrictions. It is not a must to wait for all nodes to finish in every iteration. Faster workers can push their computed model delta ahead and proceed with next iteration。In this way, waiting time for each other can be reduced, which make the entire process faster.
Thus in distributed computing system Sync Controller is one of the most important function. Angel provides three levels of sync control: BSP (Bulk Synchronous Parallel),SSP (Stalness Synchronous Parallel) and ASP (Asynchronous Parallel). Among these three protocols, BSP is the most restricted synchronization protocol, whereas ASP is the least restricted. In general, looser synchronization protocol results in better speed.
BSP is the default sync protocol in Angel and widely used in other distributed computing systems. It requires waiting of all tasks to complete in every iteration.
SSP allows the tasks to drift apart up to an upper limit, known as staleness, which is the number of iterations that the fastest task is allowed to be ahead of the slowest task.
angel.staleness=N, where N must be a positive integerTasks can drift apart with no restrictions; once a task finishes, it just continues to the next iteration without waiting.
angel.staleness=-1Configuration of the sync controller is as simple as setting the staleness value, though one needs to keep in mind the tradeoff between convergence quality and speed, and it is always a good practice to adjust the sync control level, together with other related parameters, to the specific machine-learning algorithm and the metrics.
In Angel, sync controller is implemented using the Vector Clock algorithm.
get (or other retrieving operations) from PSModel based on its local clock and stalenessclock method of PSModel to update the vector clockThe default invoke interface for user is simple:
psModel.increment(update)
……
psModel.clock().get()
ctx.incIteration()
Angel's flexible sync controller gives user convenient control over the synchronization protocols, making it possible to avoid serious performance issues in large-scale machine learning caused by few machines failure.