Looking to Listen: Audio-Visual Speech Separation from Google | Unibot

People are remarkably good at focusing their attention on a particular person in a noisy environment, mentally “muting” all other voices and sounds.

Known as the cocktail party effect, this capability comes naturally to us humans. However, automatic speech separation (separating an audio signal into its individual speech sources), while a well-studied problem, remains a significant challenge for computers.

In “Looking to Listen at the Cocktail Party”, we present a deep learning audio-visual model for isolating a single speech signal from a mixture of sounds such as other voices and background noise. In this work, we are able to computationally produce videos in which speech of specific people is enhanced while all other sounds are suppressed. Our method works on ordinary videos with a single audio track, and all that is required from the user is to select the face of the person in the video they want to hear, or to have such a person be selected algorithmically based on context. We believe this capability can have a wide range of applications, from speech enhancement and recognition in videos, through video conferencing, to improved hearing aids, especially in situations where there are multiple people speaking.

A unique aspect of our technique is that it combines both the auditory and visual signals of an input video to separate the speech. Intuitively, movements of a person’s mouth, for example, should correlate with the sounds produced as that person is speaking, which in turn can help identify which parts of the audio correspond to that person. The visual signal not only improves the speech separation quality significantly in cases of mixed speech (compared to speech separation using audio alone, as we demonstrate in our paper), but, importantly, it also associates the separated, clean speech tracks with the visible speakers in the video.
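To make the intuition about mouth movement and sound concrete, here is a toy sketch in Python. It is not the model from the paper, which is a deep network trained on video; it only illustrates the correlation idea on synthetic data. The per-frame "mouth opening" tracks, the loudness envelope, and the names `pearson`, `match_speaker`, `face_a`, and `face_b` are all assumptions made up for this illustration.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def match_speaker(audio_envelope, mouth_tracks):
    """Pick the face whose mouth motion correlates best with the audio.

    Returns the best-matching name and the correlation score of every face.
    """
    scores = {
        name: pearson(audio_envelope, track)
        for name, track in mouth_tracks.items()
    }
    return max(scores, key=scores.get), scores


# Synthetic (hypothetical) per-frame mouth-opening signals for two faces.
rng = random.Random(0)
frames = 200
mouth_a = [abs(math.sin(0.3 * t)) for t in range(frames)]
mouth_b = [abs(math.sin(0.11 * t + 1.0)) for t in range(frames)]

# Loudness envelope dominated by speaker A, with some of B and random noise.
audio_envelope = [
    a + 0.2 * b + rng.gauss(0, 0.1) for a, b in zip(mouth_a, mouth_b)
]

best, scores = match_speaker(
    audio_envelope, {"face_a": mouth_a, "face_b": mouth_b}
)
print(best, scores)
```

In this sketch the face whose mouth motion tracks the loudness of the audio most closely is treated as the likely speaker. A learned model can pick up far richer audio-visual cues than a single correlation score, but the underlying signal is the same kind of relationship.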

Application to Speech Recognition

Our method can also potentially be used as a preprocessing step for speech recognition and automatic video captioning. Handling overlapping speakers is a known challenge for automatic captioning systems, and separating the audio into its different sources could help in presenting more accurate and easy-to-read captions.

You can similarly see and compare the captions before and after speech separation in all the other videos in this post and on our website, by turning on closed captions in the YouTube player when playing the videos (“cc” button at the lower right corner of the player).

