Efficient and Scalable Topic Model Training on Distributed Data-Parallel Platform
- Bo Zhao,
- Hucheng Zhou,
- Guoqiang Li,
- Yihua Huang
Distributed Collapsed Gibbs Sampling (CGS) for Latent Dirichlet Allocation (LDA) training usually calls for a “customized” design with sophisticated asynchronization support. However, with both algorithm-level innovation and system-level optimizations, we demonstrate that a “generalized” design on a distributed data-parallel platform can even outperform the dedicated designs. We first present a novel CGS sampling algorithm, ZenLDA, which decomposes the sampling formula differently and thus offers a different performance-accuracy tradeoff than other CGS algorithms. With respect to parallelization, we convert the serial CGS algorithm into a Monte Carlo Expectation-Maximization (MCEM) algorithm, so that it can be parallelized in a fully batched and synchronized way. To push performance to the limit, we also present two approximations, sparse model initialization and “converged” token exclusion, as well as several system-level optimizations. The training corpus is represented as a directed graph with model parameters annotated as the corresponding vertex attributes; we therefore implemented ZenLDA and other well-known CGS algorithms on GraphX in Spark, and the system has been deployed and is used daily in production. We evaluated the efficiency of the presented techniques on multiple datasets, including a web-scale corpus. Experimental results indicate that the MCEM variant converges much faster than the CGS algorithms while reaching similar accuracy, and that ZenLDA is the best performer. When compared with state-of-the-art systems, ZenLDA achieves comparable or even better performance with similar accuracy.
Moreover, ZenLDA demonstrates good scalability when handling a large number of topics and a huge corpus.
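As a rough illustration of the graph-based representation described above, the minimal GraphX sketch below models each token occurrence as an edge between a document vertex and a word vertex, with per-vertex topic count vectors as attributes. All names (`CorpusGraphSketch`, the vertex-id offset, the toy token list) are hypothetical and not taken from the ZenLDA implementation; this is a sketch of the general idea under those assumptions, not the authors' code, and the actual sampling pass is omitted.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object CorpusGraphSketch {
  val numTopics = 100

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("corpus-graph-sketch").setMaster("local[*]"))

    // Toy corpus: each (docId, wordId) pair is one token occurrence.
    val tokens = Seq((0L, 0L), (0L, 1L), (1L, 1L), (1L, 2L))

    // Token occurrences become edges; the edge attribute holds the token's
    // currently assigned topic (randomly initialized here).
    val edges = sc.parallelize(tokens).map { case (docId, wordId) =>
      // Offset word ids so document and word vertices share one id space.
      Edge(docId, wordId + 1000000L, scala.util.Random.nextInt(numTopics))
    }

    // Vertex attribute: a topic count vector (document-topic counts on the
    // document side, word-topic counts on the word side).
    val graph: Graph[Array[Int], Int] =
      Graph.fromEdges(edges, defaultValue = Array.fill(numTopics)(0))

    // Rebuild per-vertex topic counts by aggregating the topic assignments
    // on incident edges back to both endpoints.
    val topicCounts = graph.aggregateMessages[Array[Int]](
      ctx => {
        val msg = Array.fill(numTopics)(0)
        msg(ctx.attr) += 1
        ctx.sendToSrc(msg)
        ctx.sendToDst(msg)
      },
      (a, b) => a.zip(b).map { case (x, y) => x + y }
    )

    topicCounts.take(5).foreach { case (id, counts) =>
      println(s"vertex $id holds ${counts.sum} token assignments")
    }
    sc.stop()
  }
}
```

In a design like the one the abstract describes, each synchronized MCEM iteration would presumably run as a similar edge-parallel pass over this graph, resampling the topic on each edge against the endpoint count vectors and then re-aggregating the counts.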