Speech Denoising with Auditory Models (arXiv, GitHub)

Mark R. Saddler*, Andrew Francl*, Jenelle Feather, Kaizhi Qian, Yang Zhang, Josh H. McDermott

* denotes equal contribution

Example audio from the subjective evaluation experiments, in which participants rated the "naturalness" of speech clips processed by audio-to-audio transforms optimized for different loss functions. Each example below includes the unprocessed input along with the output of each denoising model listed in the table.

Each audio label below is paired with a description of the denoising model used to process the input signal. Minimal code sketches of the deep feature and cochlear-model losses follow the table.

Unprocessed Input: Noisy input signal consisting of clean speech superimposed on background noise (no denoising model).
A123: Wave-U-Net trained to minimize deep feature losses from three AudioSet-trained DNNs (the deep features that produced the highest subjective ratings).
Random123: Wave-U-Net trained to minimize deep feature losses from three untrained DNNs (random weights).
GermainDeepFeatures: Wave-U-Net trained to minimize the deep feature loss proposed by Germain et al. (2018).
CochlearModel (human): Wave-U-Net trained to minimize a loss derived from a cochlear model with human-like frequency tuning (ERB-spaced filter bank).
CochlearModel (reverse): Wave-U-Net trained to minimize a loss derived from a cochlear model with altered frequency tuning (reverse-ERB-spaced filter bank).
Waveform Wave-U-Net: Wave-U-Net trained to reconstruct the clean speech waveform.
Waveform WaveNet: WaveNet trained to reconstruct the clean speech waveform.
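
The deep feature losses above compare network activations rather than raw waveforms: the denoised output and the clean target are each passed through one or more frozen DNNs, and the distances between their intermediate activations are summed. The sketch below illustrates this idea in PyTorch with a small untrained 1-D convolutional network standing in for the AudioSet-trained (or random-weight) DNNs used in the experiments; the architecture, layer set, and L1 distance here are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class TinyAudioNet(nn.Module):
    """Stand-in 1-D conv net; the actual experiments used AudioSet-trained
    (or untrained, random-weight) DNNs as feature extractors."""
    def __init__(self):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(nn.Conv1d(1, 16, kernel_size=9, stride=2, padding=4), nn.ReLU()),
            nn.Sequential(nn.Conv1d(16, 32, kernel_size=9, stride=2, padding=4), nn.ReLU()),
            nn.Sequential(nn.Conv1d(32, 64, kernel_size=9, stride=2, padding=4), nn.ReLU()),
        ])

    def forward(self, x):
        # Collect the activations of every layer, not just the final output.
        feats = []
        for layer in self.layers:
            x = layer(x)
            feats.append(x)
        return feats

def deep_feature_loss(denoised, clean, feature_nets):
    """Sum of L1 distances between activations of frozen feature networks
    evaluated on the denoised output and on the clean target."""
    loss = 0.0
    for net in feature_nets:
        feats_denoised = net(denoised)
        with torch.no_grad():
            feats_clean = net(clean)
        for fd, fc in zip(feats_denoised, feats_clean):
            loss = loss + torch.mean(torch.abs(fd - fc))
    return loss

# Three random-weight networks loosely mirror the "Random123" condition.
nets = [TinyAudioNet().eval() for _ in range(3)]
for net in nets:
    for p in net.parameters():
        p.requires_grad_(False)  # feature networks stay frozen during training

denoised = torch.randn(1, 1, 16000)  # placeholder denoiser output (batch, channel, samples)
clean = torch.randn(1, 1, 16000)     # placeholder clean speech target
print(deep_feature_loss(denoised, clean, nets))
```

In training, such a loss replaces a waveform reconstruction loss: gradients flow through the frozen feature networks back into the denoising model (here, a Wave-U-Net), so the denoiser is pushed to match the clean signal in feature space rather than sample by sample.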
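
The cochlear-model losses are analogous, but the feature extractor is a model of the auditory periphery rather than a DNN. The sketch below conveys the idea with a crude spectrogram-based filter bank whose center frequencies are spaced evenly on the human ERB-number scale (Glasberg and Moore); the paper's cochlear model and its reverse-ERB variant are more detailed, so the filter shapes, compression exponent, and other parameters here are assumptions for illustration only.

```python
import numpy as np
import scipy.signal

def erb_space(low_hz, high_hz, n_points):
    """Frequencies spaced evenly on the ERB-number scale:
    ERBs(f) = 21.4 * log10(1 + 0.00437 * f)."""
    erb_lo = 21.4 * np.log10(1 + 0.00437 * low_hz)
    erb_hi = 21.4 * np.log10(1 + 0.00437 * high_hz)
    erbs = np.linspace(erb_lo, erb_hi, n_points)
    return (10.0 ** (erbs / 21.4) - 1.0) / 0.00437

def erb_filterbank(n_fft, sr, low_hz=50.0, high_hz=8000.0, n_filters=30):
    """Triangular spectral filters with ERB-spaced center frequencies
    (a rough stand-in for a cochlear filter bank)."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sr)
    centers = erb_space(low_hz, high_hz, n_filters + 2)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, c, hi = centers[i], centers[i + 1], centers[i + 2]
        rising = (freqs - lo) / (c - lo)
        falling = (hi - freqs) / (hi - c)
        fb[i] = np.clip(np.minimum(rising, falling), 0.0, None)
    return fb

def cochlear_model_loss(denoised, clean, sr=16000, n_fft=512, hop=128):
    """Mean absolute difference between compressed ERB subband envelopes."""
    fb = erb_filterbank(n_fft, sr)

    def subbands(x):
        _, _, spec = scipy.signal.stft(x, fs=sr, nperseg=n_fft, noverlap=n_fft - hop)
        mag = np.abs(spec)        # (freq, time) magnitude spectrogram
        return (fb @ mag) ** 0.3  # filter bank followed by compressive nonlinearity

    return np.mean(np.abs(subbands(denoised) - subbands(clean)))

# Usage with placeholder signals at 16 kHz
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
denoised = clean + 0.1 * rng.standard_normal(16000)
print(cochlear_model_loss(denoised, clean))
```

The "reverse" condition in the table corresponds to altering the spacing and bandwidths of the filters so that the frequency-bandwidth relationship no longer matches human hearing; only the human-like spacing is sketched here.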


Example 1: recorded auditory scene + female speaker (+5 dB SNR)

Example 2: recorded auditory scene + male speaker (-5 dB SNR)

Example 3: recorded auditory scene + female speaker (-5 dB SNR)

Example 4: instrumental music + female speaker (0 dB SNR)

Example 5: instrumental music + male speaker (-5 dB SNR)

Example 6: instrumental music + female speaker (-10 dB SNR)

Example 7: speech-shaped noise + male speaker (+5 dB SNR)

Example 8: speech-shaped noise + female speaker (-5 dB SNR)

Example 9: speech-shaped noise + male speaker (-5 dB SNR)

Example 10: 8-speaker babble + male speaker (0 dB SNR)
