Deep Learning for Music Information Retrieval in Limited Data Scenarios
- Daniel Stoller
While deep learning (DL) models have achieved impressive results in settings where large amounts of annotated training data are available, overfitting often degrades performance when data is more limited. To improve the generalisation of DL models, we investigate “data-driven priors” that exploit additional unlabelled data or labelled data from related tasks. Unlike techniques such as data augmentation, these priors are applicable across a range of machine listening tasks, since their design does not rely on problem-specific knowledge. We first consider scenarios in which parts of samples can be missing, aiming to make more datasets available for model training. In an initial study focusing on audio source separation (ASS), we exploit additionally available unlabelled music and solo source recordings by using generative adversarial networks (GANs), resulting in higher separation quality. We then present a fully adversarial framework for learning generative models with missing data. Our discriminator consists of separately trainable components that can be combined to train the generator with the same objective as in the original GAN framework. We apply our framework to image generation, image segmentation and ASS, demonstrating superior performance compared to the original GAN. To improve performance on any given music information retrieval (MIR) task, we also aim to leverage datasets that are annotated for similar tasks. We use multi-task learning (MTL) to perform singing voice detection and singing voice separation with one model, improving performance on both tasks. Furthermore, we employ meta-learning on a diverse collection of ten MIR tasks to find a weight initialisation for a “universal MIR model”, so that training the model on any MIR task from this initialisation quickly leads to good performance.
Since our data-driven priors encode knowledge shared across tasks and datasets, they are suited to high-dimensional, end-to-end models rather than to small models relying on task-specific feature engineering, such as the fixed spectrogram representations of audio commonly used in machine listening. To this end, we propose “Wave-U-Net”, an adaptation of the U-Net, which can perform ASS directly on the raw waveform while comparing favourably with its spectrogram-based counterpart. Finally, we derive “Seq-U-Net” as a causal variant of Wave-U-Net, which performs comparably to WaveNet and the Temporal Convolutional Network (TCN) on a variety of sequence modelling tasks, while being more computationally efficient.
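To make the term “causal” concrete, the following minimal Python sketch shows a causal one-dimensional convolution, in which each output sample depends only on the current and past input samples. It is an illustration of the causality property alone, not the Seq-U-Net architecture; the function name `causal_conv1d` and the example values are assumptions introduced here.

```python
def causal_conv1d(signal, kernel):
    """Causal 1D convolution (illustrative sketch, not Seq-U-Net itself).

    kernel[0] weights the current sample and kernel[k] weights the sample
    k steps in the past. Left zero-padding keeps the output the same
    length as the input, so output[t] never sees signal[t + 1] or later.
    """
    pad = len(kernel) - 1
    padded = [0.0] * pad + list(signal)
    out = []
    for t in range(len(signal)):
        # padded[t + pad] is signal[t]; padded[t + pad - k] is signal[t - k]
        out.append(sum(kernel[k] * padded[t + pad - k]
                       for k in range(len(kernel))))
    return out


# Hypothetical example values, chosen only for illustration.
x = [1.0, 2.0, 3.0, 4.0]
w = [0.5, 0.25]
y = causal_conv1d(x, w)

# Changing the last (future) sample leaves all earlier outputs unchanged.
x_future = x[:3] + [100.0]
y_future = causal_conv1d(x_future, w)
assert y[:3] == y_future[:3]
```

Because only left padding is used, such a layer can be applied to sequence modelling tasks where a prediction at time t must not depend on later inputs.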