The widely used maximum likelihood linear regression speaker adaptation procedure suffers from overtraining when used for rapid adaptation tasks in which the amount of adaptation data is severely limited. This is a well known difficulty associated with the estimation maximization algorithm. We use an information geometric analysis of the estimation maximization algorithm as an alternating minimization of a Kullback-Leibler-type divergence to see the cause of this difficulty, and propose a more robust discounted likelihood estimation procedure. This gives rise to a discounted likelihood linear regression procedure, which is a variant of maximum likelihood linear regression suited for small adaptation sets. Our procedure is evaluated on an unsupervised rapid adaptation task defined on the Switchboard conversational telephone speech corpus, where our proposed procedure improves word error rate by 1.6% (absolute) with as little as five seconds of adaptation data, which is a situation in which maximum likelihood linear regression overtrains in the first iteration of adaptation. We compare several realizations of discounted likelihood linear regression with maximum likelihood linear regression and other simple maximum likelihood linear regression variants, and discuss issues that arise in implementing our discounted likelihood procedures.