Recently, we developed context-dependent deep neural network (DNN) hidden Markov models for large-vocabulary speech recognition. While reducing errors by 33% compared to its discriminatively trained Gaussian-mixture counterpart on the Switchboard benchmark task, the DNN requires many more parameters. In this paper, we report our recent work on improving the DNN's generalization, model size, and computation speed by exploiting parameter sparseness. We formulate the goal of enforcing sparseness as soft regularization and convex constraint optimization problems, and propose solutions under the stochastic gradient ascent setting. We also propose novel data structures that exploit the random sparseness patterns to reduce model size and computation time. The proposed solutions have been evaluated on the voice-search and Switchboard datasets. They decreased the number of nonzero connections to one third while reducing the error rate by 0.2-0.3% over the fully connected model on both datasets. The nonzero connections were further reduced to only 12% and 19% on the two respective datasets without sacrificing speech recognition performance. Under these conditions we can reduce the model size to 18% and 29%, and computation to 14% and 23%, respectively, on these two datasets.
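As a minimal illustration (not the paper's exact formulation), one standard way to enforce sparseness as a soft regularizer under stochastic gradient training is a proximal L1 soft-thresholding step applied after each weight update, which drives sufficiently small weights exactly to zero. The layer size, learning rate, and regularization strength below are hypothetical placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy weight matrix standing in for one DNN layer (hypothetical size).
W = rng.normal(size=(8, 8))

def sgd_step_with_l1(W, grad, lr=0.1, l1=1.0):
    """One SGD step followed by soft-thresholding (proximal L1 step).

    Weights whose magnitude falls below lr * l1 after the gradient
    update are set exactly to zero, producing sparse connections.
    """
    W = W - lr * grad
    return np.sign(W) * np.maximum(np.abs(W) - lr * l1, 0.0)

# Simulate a few updates with random surrogate gradients.
for _ in range(20):
    grad = rng.normal(scale=0.5, size=W.shape)
    W = sgd_step_with_l1(W, grad)

# Fraction of connections driven exactly to zero.
sparsity = float(np.mean(W == 0.0))
```

Once most weights are exactly zero, the remaining nonzero connections can be stored in a compressed sparse format (e.g. CSR-style index/value arrays), which is the kind of data structure that reduces model size and multiplication cost at decode time.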