In this paper we take a modeling approach to studying representation of formant frequencies of spoken speech and speech in noise in the temporal responses of the peripheral anditory system. On the basis of the properties of the representation, we have devised and evaluated a cross-channel correlation algorithm and an interpeak interval analysis for automatic formant extraction of speech which is strongly dynamic in acoustic characteristics and is embedded in noise. The basilar membrane model used in this study contains laterally coupled damping elements, which are made monotonically dependent on the spatial distribution of the short-term power in the outputs of the model. Efficient digital implementation and the related salient numerical properties of the model are described. Simulation results from the model in response to speech and speech in noise illustrate temporal response patterns that are tonotopically organized in relation to speech formant parameters with little influence by the noise level. By utilizing such relations the devised cross-channel correlation algorithm is shown to be capable of accurately tracking formant movements in spoken syllables and sentences.