TF-Attention-Net: An End To End Neural Network For Singing Voice Separation
arXiv:1909.05746 [cs, eess], vol.
Computer Science - Information Retrieval, Computer Science - Machine Learning, Computer Science - Sound, Electrical Engineering and Systems Science - Audio and Speech Processing
- Tingle Li
- Jiawei Chen
- Haowen Hou
- Ming Li
Deep neural networks for source separation fall into two main types: those that model the spectrogram and those that model the waveform. Most of them use CNNs or LSTMs, but because of the high sampling rate of audio, neither LSTMs with their long-range dependencies nor CNNs with their sliding windows can easily extract long-term input context. We therefore propose an end-to-end network, the Time Frequency Attention Net (TF-Attention-Net), to study the ability of the attention mechanism in the source separation task. We also introduce Slice Attention, which extracts acoustic features at the time and frequency scales under different channels with a lower time complexity than Multi-head Attention. In addition, the attention mechanism can be parallelized efficiently, whereas LSTMs cannot because of their time dependence. The receptive field of the attention mechanism is also larger than that of CNNs, so shallower networks can extract deeper features. Experiments on singing voice separation indicate that our model outperforms the state-of-the-art models, the spectrogram-based U-Net and the waveform-based Wave-U-Net, given the same data.
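
To make the idea of attending over both the time and the frequency scale concrete, the following is a minimal sketch, not the Slice Attention defined in the paper. It assumes plain scaled dot-product self-attention with identity query, key, and value projections, applied once across time frames and once across frequency bins of a small spectrogram-like matrix. The function names and the choice to sum the two results are illustrative assumptions. Because every position attends to every other position in one step, the receptive field spans the whole input, and the per-position computations are independent of each other.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(seq):
    """Scaled dot-product self-attention with identity projections.

    seq is a list of n vectors of dimension d; returns n vectors of dimension d.
    (Assumption: no learned weights, to keep the sketch self-contained.)
    """
    d = len(seq[0])
    out = []
    for q in seq:
        # Similarity of this position to every position in the sequence.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        weights = softmax(scores)
        # Weighted average of all positions: a global receptive field.
        out.append([sum(w * v[j] for w, v in zip(weights, seq)) for j in range(d)])
    return out


def transpose(mat):
    return [list(col) for col in zip(*mat)]


def time_frequency_attention(spec):
    """spec is a list of T frames, each holding F frequency bins.

    Hypothetical combination: attention along time plus attention along frequency.
    """
    along_time = self_attention(spec)  # each frame attends to all frames
    along_freq = transpose(self_attention(transpose(spec)))  # each bin attends to all bins
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(along_time, along_freq)]
```

Calling `time_frequency_attention` on a T x F matrix returns a matrix of the same shape. A practical model would add learned projections, multiple channels, and a batched tensor library in place of nested lists.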