The input to our method is a video with one or more people speaking, where the speech of interest suffers interference from other speakers and/or background noise. The output is a decomposition of the input audio track into clean speech tracks, one for each person detected in the video.
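
To make the input/output contract concrete, the following is a minimal sketch of the data shapes involved, not the method itself. It assumes an additive mixing model (the observed audio is the sum of each speaker's clean waveform plus noise), a 16 kHz sample rate, and sine tones as stand-ins for speech. None of these assumptions come from the text above, and the names `make_mixture`, `speech_tracks`, and `mixture` are hypothetical.

```python
import numpy as np


def make_mixture(speech_tracks: np.ndarray, noise: np.ndarray) -> np.ndarray:
    """Sum per-speaker waveforms and background noise into one audio track.

    Assumption: mixing is additive. speech_tracks has shape
    (num_speakers, num_samples); noise has shape (num_samples,).
    """
    return speech_tracks.sum(axis=0) + noise


rng = np.random.default_rng(0)
sample_rate = 16_000            # assumed sample rate in Hz
num_samples = 3 * sample_rate   # a 3-second clip
t = np.arange(num_samples) / sample_rate

# Stand-ins for the clean speech of two on-screen speakers (plain tones).
speech_tracks = np.stack([
    np.sin(2 * np.pi * 220.0 * t),
    np.sin(2 * np.pi * 330.0 * t),
])
noise = 0.1 * rng.standard_normal(num_samples)

# Input side: the single audio track that accompanies the video.
mixture = make_mixture(speech_tracks, noise)

# Output side: the desired result has one clean track per detected person,
# each the same length as the input audio track.
print(mixture.shape)        # one mixed track
print(speech_tracks.shape)  # one track per speaker
```

In this toy setup the separation target is simply `speech_tracks`. A real system would have to estimate those tracks from `mixture` together with the video frames of each detected face.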