SparkTree: Push the Limit of Tree Ensemble Learning

  • Cui Li ,
  • Bo Zhao ,
  • Hucheng Zhou ,
  • Hucheng Zhou

Decision trees, when combined with different loss functions and ensemble techniques, are widely used in web applications such as document ranking in web search and click prediction in ads targeting. Therefore, there is an increasing motivation on an integrated framework that supports all loss functions and all ensemble techniques instead of each with a separated system.
The web-scale training is challenging because of big data that has hundreds of millions of samples and each sample with thousands of features, and big model in which thousands of trees are fitted and each tree has hundreds of nodes. Therefore, there is an increasing demand for a scalable distributed learning system that exploits all dimensions of parallelism. However, widely used offers such as XGBoost and Spark MLlib are incomplete without the support of ranking such as LambdaMART, and without the support of the feature parallelism, they are not scalable to support a large number of features. Simply adding these supports does not meet the efficiency requirement needed to balance the training speed and accuracy. In this paper, we present SparkTree which seamlessly integrates all parallelism with completely new tradeoffs and presents a series of optimizations to improve training speed and accuracy. SparkTree has been evaluated against our production workloads against XGBoost and MLlib, the result indicates that SparkTree outperforms MLlib with an 8.3X-71.5X speed increase and outperforms XGBoost with speeds up to 11.8X. The effectiveness of the presented optimization has also been evaluated, with SparkTree outperforming MLLib with 7.70% AUC gain and outperforms XGBoost with 5.77% AUC gain.