Scalable machine learning over big data stored on a cluster of commodity machines with significant communication costs has become important in recent years. In this paper we give a novel approach to the distributed training of linear classifiers (involving smooth losses and L2 regularization) that is designed to reduce communication costs. At each iteration, the nodes minimize approximate objective functions; then the resulting minimizers are combined to form a descent direction to move. Our approach gives a lot of freedom in the formation of the approximate objective function as well as in the choice of methods to solve them. The method is shown to have O(log(1/ϵ)) time convergence. The method can be viewed as an iterative parameter mixing method. A special instantiation yields a parallel stochastic gradient descent method with strong convergence. When communication times between nodes are large, our method is much faster than the SQM method, which computes function and gradient values in a distributed fashion.