This paper addresses the problem of model synchronization in data-parallel deep-learning systems. In such systems, workers on different machines continuously update their local copies of the model, and the updates must be merged so that the copies remain roughly consistent with one another. In modern GPU-based implementations, workers generate updates at very high rates, which poses significant scalability challenges.
We model this as a distributed-state anti-entropy problem and propose a fully asynchronous, decentralized parameter-sharing protocol that allows machines to reconcile model differences with their connected peers. The protocol provably achieves eventual consistency and recovers gracefully from common failures without invasive intervention. It is also flexible: both the consistency requirements and the connection topology can be reconfigured dynamically. We show that this flexibility is important for achieving a better tradeoff between system performance and model convergence rate. For example, under high update rates, the existing master-slave architecture performs poorly compared with other configurations of larger diameter.
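To make the anti-entropy idea concrete, the following is a minimal sketch of peer-to-peer delta reconciliation. It is not the protocol proposed in this paper. It assumes additive model updates, an acyclic (chain) topology, and reliable in-order delivery, and all names (`Worker`, `pending`, `push_to`, `run_demo`) are hypothetical. Each worker buffers, per neighbor, the sum of updates that neighbor has not yet seen, and a push delivers and clears that buffer.

```python
import random


class Worker:
    """One replica holding a local copy of the model (hypothetical sketch)."""

    def __init__(self, name, dim):
        self.name = name
        self.params = [0.0] * dim
        # neighbor name -> accumulated delta not yet sent to that neighbor
        self.pending = {}

    def connect(self, other):
        # Create an undirected link with an empty buffer in each direction.
        self.pending[other.name] = [0.0] * len(self.params)
        other.pending[self.name] = [0.0] * len(other.params)

    def _absorb(self, delta, source=None):
        # Apply the delta locally, then queue it for every neighbor
        # except the one it came from (avoids echoing it back).
        for i, d in enumerate(delta):
            self.params[i] += d
        for peer, buf in self.pending.items():
            if peer != source:
                for i, d in enumerate(delta):
                    buf[i] += d

    def local_update(self, delta):
        # A locally computed update (e.g. a scaled gradient step).
        self._absorb(delta)

    def push_to(self, peer):
        # Asynchronous reconciliation step: ship everything the peer is missing.
        delta = self.pending[peer.name]
        self.pending[peer.name] = [0.0] * len(delta)
        peer._absorb(delta, source=self.name)


def run_demo(seed=0):
    rng = random.Random(seed)
    a, b, c = Worker("a", 2), Worker("b", 2), Worker("c", 2)
    a.connect(b)
    b.connect(c)  # chain a - b - c; the sketch assumes no cycles
    workers = [a, b, c]
    links = [(a, b), (b, a), (b, c), (c, b)]
    total = [0.0, 0.0]
    for _ in range(50):
        # Interleave local updates and pushes in random order.
        w = rng.choice(workers)
        delta = [float(rng.randint(-3, 3)), float(rng.randint(-3, 3))]
        w.local_update(delta)
        total = [t + d for t, d in zip(total, delta)]
        src, dst = rng.choice(links)
        src.push_to(dst)
    # Once updates stop, a few full rounds of pushes drain every buffer.
    for _ in range(len(workers)):
        for src, dst in links:
            src.push_to(dst)
    return total, workers
```

Calling `run_demo()` returns the sum of all updates together with the three workers. Under the stated assumptions every replica ends with the same parameters once updates stop, which is the eventual-consistency property in miniature. Topologies with cycles, message loss, and bounded-staleness requirements would each need additional machinery that this sketch omits.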