Publications

Evolutionary multi-objective training set selection of data instances and augmentations for vocal detection

Igor Vatolkin, Daniel Stoller

International Conference on Computational Intelligence in Music, Sound, Art and Design (EvoMUSART) • 2019

Ensemble models for spoofing detection in automatic speaker verification

Bhusan Chettri, Daniel Stoller, Veronica Morfi, Marco A. Martínez Ramírez, Emmanouil Benetos, Bob L. Sturm

Proceedings of INTERSPEECH • 2019

GAN-based generation and automatic selection of explanations for neural networks

Saumitra Mishra, Daniel Stoller, Emmanouil Benetos, Bob L. Sturm, Simon Dixon

ICLR Workshop on Safe Machine Learning • 2019

End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model

Daniel Stoller, Simon Durand, Sebastian Ewert

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) • 2019

Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval and intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to combine numerous sub-modules including vocal separation and detection in an effort to break down the problem. Furthermore, training required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with weak, line-level annotations available in the real world. With a mean alignment error of 0.35 s on a standard dataset, our system outperforms the state-of-the-art by an order of magnitude.

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Source Separation

Daniel Stoller, Sebastian Ewert, Simon Dixon

International Society for Music Information Retrieval Conference (ISMIR) • 2018

Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
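
The point about outliers in SDR evaluation can be illustrated with a small sketch. It compares the mean and standard deviation with two rank-based summaries, the median and the median absolute deviation (MAD). The SDR values and the function name `summarize_sdr` are made up for illustration and are not results or code from the paper; the abstract only says "rank-based statistics", so the choice of median and MAD here is an assumption.

```python
import statistics


def summarize_sdr(sdr_values):
    """Summarize per-segment SDR values (in dB) with mean/std and median/MAD."""
    values = list(sdr_values)
    median = statistics.median(values)
    # Median absolute deviation: median of the absolute distances to the median.
    mad = statistics.median([abs(v - median) for v in values])
    return {
        "mean": statistics.mean(values),
        "std": statistics.pstdev(values),
        "median": median,
        "mad": mad,
    }


# Hypothetical per-segment SDR values; the last one mimics a near-silent
# segment, where SDR can become extremely low.
example_sdr = [5.0, 6.0, 4.0, 5.0, -40.0]
summary = summarize_sdr(example_sdr)
print(summary)
```

In this toy example a single outlier pulls the mean far below the typical values, while the median and MAD stay close to what most segments look like.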

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

Daniel Stoller, Sebastian Ewert, Simon Dixon

Latent Variable Analysis and Signal Separation • 2018

Detection of Cut-Points for Automatic Music Rearrangement

D. Stoller, V. Akkermans, S. Dixon

IEEE International Workshop on Machine Learning for Signal Processing (MLSP) • 2018

Existing music recordings are often rearranged, for example to fit their duration and structure to video content. Often an expert is needed to find suitable cut points allowing for imperceptible transitions between different sections. In previous work, the search for these cuts is restricted to the beginnings of beats or measures and only timbre and loudness are taken into account, while melodic expectations and instrument continuity are neglected. We instead aim to learn these features by training neural networks on a dataset of over 300 popular Western songs to classify which note onsets are suitable entry or exit points for a cut. We investigate existing and novel architectures and different feature representations, and find that best performance is achieved using neural networks with two-dimensional convolutions applied to spectrogram input covering several seconds of audio with a high temporal resolution of 23 or 46 ms. Finally, we analyse our best model using saliency maps and find it attends to rhythmical structures and the presence of sounds at the onset position, suggesting instrument activity to be important for predicting cut quality.

Intuitive and efficient computer-aided music rearrangement with optimised processing of audio transitions

Daniel Stoller, Igor Vatolkin, Heinrich Müller

Journal of New Music Research • 2018

A promising approach to create new versions of existing music pieces automatically is to cut out and rearrange sections so that transitions are minimally perceptible and constraints regarding duration or structure are fulfilled. We evaluate previous work and improve on its limitations, particularly the disregard for loudness changes at cuts and the unintuitive control over the musical structure of the output. Our software provides a user-friendly interface, which we make more responsive by greatly accelerating the search for an optimal output track using the A* algorithm. Listening experiments demonstrate an improvement in perceived audio quality.
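
As a rough illustration of how A* search can be applied to rearrangement, the sketch below finds the cheapest way to shorten a toy track to a target number of segments. Everything specific in it is an assumption made for this example and is not taken from the paper: the state encoding `(segment, segments_played)`, the `CUT_COST` table of transition costs, and the simple heuristic.

```python
import heapq

LAST = 4      # index of the final segment of a hypothetical 5-segment track
TARGET = 3    # desired number of segments in the rearranged output
# Hypothetical perceptual cost of cutting from one segment to another.
CUT_COST = {(0, 2): 3.0, (0, 3): 1.0, (1, 3): 2.0, (1, 4): 0.5, (2, 4): 1.5}
MIN_CUT = min(CUT_COST.values())


def neighbors(state):
    """Yield (next_state, cost): continue playback for free, or pay for a cut."""
    seg, count = state
    if count >= TARGET:
        return
    if seg + 1 <= LAST:
        yield (seg + 1, count + 1), 0.0
    for (src, dst), cost in CUT_COST.items():
        if src == seg:
            yield (dst, count + 1), cost


def heuristic(state):
    """Lower bound: at least one cut is needed unless plain playback fits exactly."""
    seg, count = state
    return 0.0 if TARGET - count == LAST - seg else MIN_CUT


def a_star(start, goal):
    """Return (cost, path) of a cheapest path from start to goal, or None."""
    frontier = [(heuristic(start), 0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        _, cost, state, path = heapq.heappop(frontier)
        if state == goal:
            return cost, path
        if cost > best.get(state, float("inf")):
            continue  # stale queue entry, a cheaper route was found already
        for nxt, step in neighbors(state):
            new_cost = cost + step
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(
                    frontier, (new_cost + heuristic(nxt), new_cost, nxt, path + [nxt])
                )
    return None


result = a_star(start=(0, 1), goal=(LAST, TARGET))
print(result)
```

The heuristic never overestimates the remaining cost, so the first time the goal is popped from the queue its path is optimal. A real system would need richer states and costs, for example durations in seconds and structural constraints.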

Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction

Daniel Stoller, Sebastian Ewert, Simon Dixon

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) • 2018

The state of the art in music source separation employs neural networks trained in a supervised fashion on multi-track databases to estimate the sources from a given mixture. With only few datasets available, often extensive data augmentation is used to combat overfitting. Mixing random tracks, however, can even reduce separation performance as instruments in real music are strongly correlated. The key concept in our approach is that source estimates of an optimal separator should be indistinguishable from real source signals. Based on this idea, we drive the separator towards outputs deemed as realistic by discriminator networks that are trained to tell apart real from separator samples. This way, we can also use unpaired source and mixture recordings without the drawbacks of creating unrealistic music mixtures. Our framework is widely applicable as it does not assume a specific network architecture or number of sources. To our knowledge, this is the first adoption of adversarial training for music source separation. In a prototype experiment for singing voice separation, separation performance increases with our approach compared to purely supervised training.

Analysis and classification of phonation modes in singing

Daniel Stoller, Simon Dixon

International Society for Music Information Retrieval Conference (ISMIR) • 2016

Phonation mode is an expressive aspect of the singing voice and can be described using the four categories neutral, breathy, pressed and flow. Previous attempts at automatically classifying the phonation mode on a dataset containing vowels sung by a female professional have been lacking in accuracy or have not sufficiently investigated the characteristic features of the different phonation modes which enable successful classification. In this paper, we extract a large range of features from this dataset, including specialised descriptors of pressedness and breathiness, to analyse their explanatory power and robustness against changes of pitch and vowel. We train and optimise a feed-forward neural network (NN) with one hidden layer on all features using cross validation to achieve a mean F-measure above 0.85 and an improved performance compared to previous work. Applying feature selection based on mutual information and retaining the nine highest ranked features as input to an NN results in a mean F-measure of 0.78, demonstrating the suitability of these features to discriminate between phonation modes. Training and pruning a decision tree yields a simple rule set based only on cepstral peak prominence (CPP), temporal flatness and average energy that correctly categorises 78% of the recordings.
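
Feature selection based on mutual information can be sketched as follows: score each feature by its mutual information with the class label, then keep the top-ranked ones. This is a minimal illustration for discrete (already binned) features, not the paper's procedure. The feature names, the toy data, and the helper functions are hypothetical, and real acoustic features are continuous, so they would need binning or a continuous estimator first.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information in bits between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    # Sum over observed pairs of p(x, y) * log2(p(x, y) / (p(x) * p(y))).
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def top_k_features(columns, labels, k):
    """Rank named feature columns by mutual information with labels; keep top k."""
    scores = {name: mutual_information(vals, labels) for name, vals in columns.items()}
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k], scores


# Toy data: four recordings, two phonation modes, three binned features.
labels = ["breathy", "breathy", "pressed", "pressed"]
columns = {
    "feature_a": [0, 0, 1, 1],  # separates the two classes perfectly
    "feature_b": [0, 1, 0, 1],  # independent of the class
    "feature_c": [0, 0, 0, 1],  # partially informative
}
selected, scores = top_k_features(columns, labels, k=2)
print(selected, scores)
```

With the abstract's setting, `k` would be 9 and the selected columns would then be passed to the classifier.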