Power is not everything: two frameworks to overcome limitations of power domain modeling
- Jonathan Le Roux | University of Tokyo
Many audio signal processing techniques, developed for a wide range of applications such as denoising, source separation, or time/pitch-scale modification, operate in the time-frequency power or magnitude domain, but discarding the phase information raises important issues. First, if resynthesis of a time-domain signal is necessary, the phase needs to be estimated so that it is coherent with the magnitude. Second, additivity of signals no longer holds, as the cross-terms in the power of a sum are in general nonzero. Third, although modeling the phase is often considered an intricate problem, the phase may still contain relevant information to exploit, for example in electronic music and, to some extent, for instruments such as piano and percussion, where the waveform of the same note or sound played several times is perfectly or nearly perfectly reproducible from one occurrence to the next.
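The second issue, the non-additivity of powers, is easy to verify numerically. The following minimal sketch (all variable names are my own, not from the talk) checks the identity |X + Y|^2 = |X|^2 + |Y|^2 + 2 Re(X conj(Y)) and shows that the cross-term is generally nonzero:

```python
import numpy as np

# Two sinusoids at the same frequency but different phases: the power
# of their sum differs from the sum of their powers by a cross-term.
n = 1024
t = np.arange(n) / n
x = np.cos(2 * np.pi * 50 * t)
y = np.cos(2 * np.pi * 50 * t + np.pi / 3)

X, Y = np.fft.rfft(x), np.fft.rfft(y)
power_of_sum = np.abs(X + Y) ** 2
sum_of_powers = np.abs(X) ** 2 + np.abs(Y) ** 2
cross_term = 2 * np.real(X * np.conj(Y))

# |X + Y|^2 = |X|^2 + |Y|^2 + 2 Re(X conj(Y)); the cross-term is
# generally nonzero, so powers do not add.
print(np.allclose(power_of_sum, sum_of_powers + cross_term))  # True
print(np.max(np.abs(cross_term)))  # large at the 50 Hz bin
```

The cross-term vanishes only when the two spectra are orthogonal at every bin, which is not the case for overlapping sources in general.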
In this talk, I will present two frameworks to avoid or overcome these issues, one focusing on the complex time-frequency domain and the other on the time domain. The first framework relies on the derivation of general consistency constraints for complex short-time Fourier transform spectrograms. The consistency criterion deduced from these constraints can be used as a cost function in audio signal processing algorithms working in the complex time-frequency domain, or as an objective function to estimate the phase that best corresponds to a given magnitude spectrogram. The second framework, called shift-invariant semi-non-negative matrix factorization, addresses the problem of template matching with unknown templates. It consists of a general model defined in the time domain, which assumes that the observed waveform is the superposition of a limited number of elementary patterns, added with variable latencies and variable but positive amplitudes. The elementary patterns are learned from the data together with the timing and amplitude of their activations. I will show preliminary results on audio data and extracellular recordings.
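The idea behind the consistency criterion can be sketched as follows. This is a minimal illustration using SciPy's `stft`/`istft`, not the exact formulation from the talk; the function names, parameters, and the Griffin-Lim-style phase-estimation loop are my own assumptions. A spectrogram is consistent when it is the STFT of some time-domain signal, so the distance between a spectrogram and the STFT of its inverse STFT measures inconsistency, and minimizing it over phases estimates a phase for a given magnitude:

```python
import numpy as np
from scipy.signal import stft, istft

def consistency_cost(Z, nperseg=256):
    """Squared distance between a complex spectrogram Z and the STFT of
    its inverse STFT: zero exactly when Z is consistent, i.e. when it
    is the STFT of some time-domain signal."""
    _, x = istft(Z, nperseg=nperseg)
    _, _, Z2 = stft(x, nperseg=nperseg)
    return float(np.sum(np.abs(Z - Z2) ** 2))

def estimate_phase(mag, n_iter=50, nperseg=256, seed=0):
    """Estimate a phase for the magnitude spectrogram `mag` by
    alternating projections (Griffin-Lim style): project onto the set
    of consistent spectrograms, then restore the target magnitude."""
    rng = np.random.default_rng(seed)
    Z = mag * np.exp(1j * rng.uniform(0, 2 * np.pi, mag.shape))
    for _ in range(n_iter):
        _, x = istft(Z, nperseg=nperseg)       # project onto consistent set
        _, _, Z = stft(x, nperseg=nperseg)
        Z = mag * np.exp(1j * np.angle(Z))     # restore target magnitude
    return Z

# A consistent spectrogram (the STFT of a real signal) has zero cost.
fs, n = 8000, 8192            # n is a multiple of the hop size (128)
x = np.sin(2 * np.pi * 440 * np.arange(n) / fs)
_, _, Z_true = stft(x, nperseg=256)
print(consistency_cost(Z_true))    # ~0, up to floating-point error

# From the magnitude alone, the iterations drive the cost far below
# that of a naive zero-phase spectrogram.
mag = np.abs(Z_true)
Z0 = mag.astype(complex)           # zero-phase baseline
Z_est = estimate_phase(mag)
print(consistency_cost(Z0), consistency_cost(Z_est))
```

The signal length is chosen as a multiple of the hop so that the STFT/ISTFT round trip preserves the spectrogram shape; with a Hann window at 50% overlap the inversion is exact, which is why the true spectrogram's cost is zero up to numerical error.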
Speaker Details
Jonathan Le Roux received a degree in mathematics from the Ecole Normale Supérieure, Paris, France, an M.Sc. degree in partial differential equations from the University of Paris XI in 2001, and an M.Sc. degree in stochastic processes from the University of Paris VI in 2003. He is currently pursuing a Ph.D. degree jointly at the Graduate School of Computer Science, Telecommunications and Electronics of Paris, University of Paris VI, France, and at the Graduate School of Information Science and Technology, Department of Information Physics and Computing, University of Tokyo, Tokyo, Japan. His research interests include audio signal processing, speech processing, acoustical scene analysis, and language acquisition modeling.