Context-dependent deep neural network HMMs have been
shown to achieve recognition accuracy superior to Gaussian
mixture models in a number of recent works. Typically, neural
networks are optimized with stochastic gradient descent.
On large datasets, stochastic gradient descent makes rapid progress in the early phase
of the optimization. But since it does not
make use of second-order information, its asymptotic convergence
is slow. In regions of pathological curvature,
stochastic gradient descent may nearly stagnate and thereby
falsely indicate convergence. A further drawback of stochastic
gradient descent is that it can only be parallelized within mini-batches.
The Hessian-free algorithm is a second-order batch optimization
algorithm that does not suffer from these problems.
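To make the contrast concrete, the following NumPy sketch (an illustration under simplifying assumptions, not the implementation studied here) compares plain gradient descent with a single Hessian-free step on an ill-conditioned quadratic. The Hessian-free step solves the Newton system approximately with conjugate gradient, using only Hessian-vector products; in a real network these products are computed without ever forming the Hessian.

```python
import numpy as np

# Ill-conditioned quadratic f(w) = 0.5 w^T A w - b^T w: the curvature differs
# by a factor of 1000 between the two coordinates ("pathological curvature").
A = np.diag([1000.0, 1.0])
b = np.array([1.0, 1.0])
w_star = np.linalg.solve(A, b)          # exact minimizer

def grad(w):
    return A @ w - b

def hess_vec(v):
    # Hessian-vector product; for neural networks this would be computed
    # matrix-free (e.g. via the R-operator or Gauss-Newton products).
    return A @ v

# Gradient descent: the stable step size is limited by the largest
# eigenvalue, so progress along the flat direction is very slow.
w = np.zeros(2)
for _ in range(200):
    w = w - (1.0 / 1000.0) * grad(w)    # largest stable step ~ 1/lambda_max
gd_error = np.linalg.norm(w - w_star)

def cg(hvp, g, iters=10):
    # Conjugate gradient for H p = -g, using only Hessian-vector products.
    p = np.zeros_like(g)
    r = -g.copy()                        # residual of H p = -g
    d = r.copy()
    for _ in range(iters):
        if np.sqrt(r @ r) < 1e-10:      # stop once the system is solved
            break
        Hd = hvp(d)
        alpha = (r @ r) / (d @ Hd)
        p += alpha * d
        r_new = r - alpha * Hd
        beta = (r_new @ r_new) / (r @ r)
        d = r_new + beta * d
        r = r_new
    return p

# A single Hessian-free step from the same starting point.
w_hf = np.zeros(2) + cg(hess_vec, grad(np.zeros(2)))
hf_error = np.linalg.norm(w_hf - w_star)

print(gd_error, hf_error)   # one curvature-aware step beats 200 gradient steps
```

On this toy problem conjugate gradient solves the two-dimensional system essentially exactly, so the single Hessian-free step reaches the minimizer while gradient descent is still crawling along the flat direction.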
In recent work, Hessian-free optimization has been applied
to training deep neural networks with a sequence
criterion, with reported improvements in both accuracy and training
time. In this paper, we analyze the properties
of the Hessian-free optimization algorithm and investigate
whether it is suited for cross-entropy training of deep neural
networks as well.