Decoding Auditory Attention (in Real Time) with EEG

  • Malcolm Slaney

Proceedings of the 37th ARO MidWinter Meeting

Published by Association for Research in Otolaryngology (ARO)

Both magnetoencephalography and electrocorticography recordings have been used to decode which of two competing sources a listener is attending to. However, it is unclear whether these techniques work with electroencephalography (EEG), particularly in a real-time system. We therefore set out to decode a listener’s attentional focus from EEG signals in real time, knowledge that could be incorporated into next-generation assistive listening devices.

Offline, we acquired EEG data while a subject listened to a single speech source, from which we estimated a mapping from the EEG data to the perceived speech. The subject then attended to one of two simultaneous speech streams, presented dichotically. The system transfer function previously estimated from the single-source presentation was used to estimate the attended stream in real time. Whichever input speech stream more closely resembled the estimated input was deemed to be the attended stream. Three decoding methods were tested. The first approach, canonical correlation analysis (CCA), is based on measuring the correlation between the audio streams and the EEG signals. The other two approaches estimate the mapping from the EEG to the input stimulus. This mapping can be estimated either one channel at a time, summing the results, or by finding a single multi-channel filter that represents all the signals.
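The decision rule described above (pick whichever candidate stream more closely resembles the estimate) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: Pearson correlation is assumed as the similarity measure, the function name `pick_attended` is hypothetical, and the signals are synthetic noise rather than EEG-derived estimates.

```python
import numpy as np

def pick_attended(reconstructed, envelope_a, envelope_b):
    """Return (choice, r_a, r_b); choice is 0 for stream A, 1 for stream B.

    Assumption: similarity is measured with the Pearson correlation.
    """
    r_a = np.corrcoef(reconstructed, envelope_a)[0, 1]
    r_b = np.corrcoef(reconstructed, envelope_b)[0, 1]
    return (0 if r_a >= r_b else 1), r_a, r_b

# Synthetic stand-ins for the two speech envelopes (illustration only).
rng = np.random.default_rng(0)
n = 6000
envelope_a = rng.standard_normal(n)
envelope_b = rng.standard_normal(n)

# A noisy "estimate" that weakly tracks stream A and ignores stream B.
reconstructed = 0.2 * envelope_a + rng.standard_normal(n)

choice, r_a, r_b = pick_attended(reconstructed, envelope_a, envelope_b)
print(choice, round(r_a, 3), round(r_b, 3))
```

Because the comparison is relative, the rule can succeed even when both correlations are small in absolute terms; what matters is that the attended stream's correlation is reliably the larger of the two over the decoding window.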

We found the best results by estimating a multivariate linear filter that incorporates the channel covariance structure in the least-squares estimation of the impulse response, similar to the approach described by Mesgarani & Chang (2012). Using this approach, we could estimate single-speaker data with high accuracy. Notably, this approach yielded estimates of the speech envelope that were better correlated with the original speech (r ~= 0.08) than the other two methods. Applying this to the attention paradigm (after training on single-speaker data), we could predict the focus of attention with 95% accuracy for a one-minute-long sample of dichotic speech. As we shortened the amount of data used to decode, our accuracy fell almost linearly to about 65% for 10 seconds. Other presentation conditions (i.e., diotic and HRTF) were decoded with lower accuracy than dichotic.
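As a rough illustration of a multi-channel least-squares decoder of the kind described above, the sketch below stacks time-lagged copies of every EEG channel into one design matrix and solves the normal equations, so the channel-by-lag covariance enters the solution directly. It is a sketch under assumptions, not the study's code: the ridge term, the lag count, the function names (`lagged_design`, `fit_decoder`, `reconstruct`), and the synthetic data are all hypothetical choices made for the example.

```python
import numpy as np

def lagged_design(eeg, n_lags):
    """Stack EEG at lags 0..n_lags-1 following each stimulus sample.

    eeg: (n_samples, n_channels) -> (n_samples, n_channels * n_lags)
    """
    n_samples, n_channels = eeg.shape
    X = np.zeros((n_samples, n_channels * n_lags))
    for lag in range(n_lags):
        # The EEG response follows the stimulus, so row t holds eeg[t + lag].
        X[: n_samples - lag, lag * n_channels:(lag + 1) * n_channels] = eeg[lag:]
    return X

def fit_decoder(eeg, envelope, n_lags, ridge=1e-3):
    """Least-squares map from lagged multi-channel EEG to the envelope."""
    X = lagged_design(eeg, n_lags)
    XtX = X.T @ X  # channel-by-lag covariance (unnormalized)
    # Assumption: a small ridge term, scaled to the data, for stability.
    reg = ridge * np.trace(XtX) / XtX.shape[0]
    return np.linalg.solve(XtX + reg * np.eye(XtX.shape[0]), X.T @ envelope)

def reconstruct(eeg, weights, n_lags):
    return lagged_design(eeg, n_lags) @ weights

# Synthetic data: each channel is a delayed, scaled envelope plus noise.
rng = np.random.default_rng(0)
n, n_channels, n_lags = 4000, 8, 5
envelope = rng.standard_normal(n)
gains = np.linspace(0.5, 1.0, n_channels)
delays = np.arange(n_channels) % n_lags
eeg = rng.standard_normal((n, n_channels))
for c in range(n_channels):
    eeg[delays[c]:, c] += gains[c] * envelope[: n - delays[c]]

# Train on the first half, evaluate on the held-out second half.
half = n // 2
weights = fit_decoder(eeg[:half], envelope[:half], n_lags)
estimate = reconstruct(eeg[half:], weights, n_lags)
r = np.corrcoef(estimate, envelope[half:])[0, 1]
print(weights.shape, round(r, 3))
```

The synthetic signal-to-noise ratio here is far higher than in real EEG, so the correlation this toy example prints says nothing about the values reported above; the point is only the structure of the estimate, in which all channels and lags are solved for jointly rather than one channel at a time.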

EEG signals can be decoded in real time to determine which natural speech stream a listener is attending to with relatively high accuracy.