Model Scheduling For Super Large Scale Model Training

Established: August 1, 2016

Machine learning in the big data era faces two challenges: big data and big models (a single machine cannot hold the entire model at runtime). In distributed machine learning, data parallelism addresses the big data problem by partitioning the data across machines while training many replicas of the same model simultaneously. For the big model problem, model parallelism partitions the model across machines so that one model can be trained in a distributed way. However, model parallelism does not scale well, owing to its communication cost and its vulnerability to system failures.

In this project, we aim to find a better solution to big model learning through model scheduling logic. We will propose customized scheduling logic for different training tasks, including deep learning models (DNN, CNN, RNN) and topic models (LDA based on Gibbs sampling). We will also conduct a careful comparative study against traditional model parallelism and model swapping methods, and arrive at a sound solution to the big model training problem.
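To make the two parallelism styles concrete, here is a minimal NumPy sketch (illustrative only, not this project's implementation): data parallelism shards the training batch across workers and averages their local gradients, while model parallelism shards the weight matrix across workers and each worker computes its slice of the output. The function names and worker count are assumptions for illustration.

```python
import numpy as np

def data_parallel_grad(X, y, w, n_workers=2):
    # Data parallelism: each worker holds a full copy of w but only
    # a shard of the batch; local gradients are averaged, as an
    # all-reduce step would do in a real system.
    grads = []
    for Xs, ys in zip(np.array_split(X, n_workers),
                      np.array_split(y, n_workers)):
        pred = Xs @ w                       # local forward pass
        grads.append(Xs.T @ (pred - ys) / len(ys))  # local gradient
    return np.mean(grads, axis=0)

def model_parallel_forward(X, W, n_workers=2):
    # Model parallelism: each worker holds only a column slice of W
    # (the model itself is partitioned); the full batch is broadcast,
    # and the partial outputs are concatenated.
    parts = [X @ Ws for Ws in np.array_split(W, n_workers, axis=1)]
    return np.concatenate(parts, axis=1)
```

With equal-sized shards, both functions reproduce the single-machine result exactly; the trade-off the project studies is in the communication each pattern requires at scale (gradient all-reduce versus activation exchange), not in the math.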

People

Junjie Li

Principal Applied Scientist