Deep network perceptual losses for speech denoising

Mark R. Saddler*, Andrew Francl*, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott

Example audio from subjective evaluation experiments, in which participants were asked to rate the "naturalness" and "cleanness" of speech clips.

Audio label | Description of denoising model used to process input signal
Input Signal | Noisy input speech signal consisting of clean speech superimposed on background noise (no denoising model)
A123 | Wave-U-Net trained to minimize deep feature losses from three AudioSet-trained DNNs (model producing highest subjective ratings)
A1+W1 | Wave-U-Net trained to minimize deep feature losses from one AudioSet-trained DNN and one Word-trained DNN
Random123 | Wave-U-Net trained to minimize deep feature losses from three untrained DNNs (random weights)
Baseline UNet | Wave-U-Net trained to reconstruct clean speech waveform
Baseline WaveNet | WaveNet trained to reconstruct clean speech waveform
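
The input signals are described as clean speech superimposed on background noise, and each example below states a signal-to-noise ratio (SNR) in dB. As a rough illustration of what such a mixture involves, the sketch below scales a noise waveform so that the speech-to-noise power ratio matches a target SNR. This page does not say how the SNR was computed for these clips, so measuring power as the mean square over the whole clip is an assumption, and the names `mix_at_snr` and `measured_snr_db` are hypothetical helpers, not code from the paper.

```python
import numpy as np


def mix_at_snr(speech, noise, snr_db):
    """Return (mixture, scaled_noise) with speech-to-noise power ratio snr_db.

    Assumption: power is the mean square over the full clip. The page does
    not specify the convention used for the listed examples.
    """
    speech = np.asarray(speech, dtype=np.float64)
    noise = np.asarray(noise, dtype=np.float64)
    if speech.shape != noise.shape:
        raise ValueError("speech and noise must have the same shape")

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        raise ValueError("speech and noise must both be non-silent")

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scaled_noise = noise * np.sqrt(target_noise_power / noise_power)
    return speech + scaled_noise, scaled_noise


def measured_snr_db(speech, noise):
    """SNR in dB from the mean-square power of each waveform."""
    return 10 * np.log10(np.mean(speech ** 2) / np.mean(noise ** 2))


# Synthetic stand-ins for real recordings: a 220 Hz tone plus white noise.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
speech = np.sin(2 * np.pi * 220.0 * t)
noise = rng.standard_normal(t.shape)

# The three SNR levels that appear in the examples on this page.
for snr_db in (5, 0, -5):
    mixture, scaled_noise = mix_at_snr(speech, noise, snr_db)
    print(snr_db, round(measured_snr_db(speech, scaled_noise), 6))
```

At +5 dB the speech carries roughly three times the power of the noise, at 0 dB the two are equal, and at -5 dB the noise carries roughly three times the power of the speech.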


Example 1: recorded auditory scene + female speaker (+5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 2: recorded auditory scene + female speaker (0 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 3: recorded auditory scene + male speaker (-5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 4: speech shaped noise + male speaker (+5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 5: speech shaped noise + male speaker (0 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 6: speech shaped noise + female speaker (-5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 7: instrumental music + female speaker (+5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 8: instrumental music + female speaker (-5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 9: 8-speaker babble + male speaker (+5 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet


Example 10: 8-speaker babble + male speaker (0 dB SNR)

Input Signal A123 A1+W1 Random123 Baseline UNet Baseline WaveNet