Publications

Ensemble models for spoofing detection in automatic speaker verification

Bhusan Chettri, Daniel Stoller, Veronica Morfi, Marco A. Martínez Ramírez, Emmanouil Benetos, Bob L. Sturm

INTERSPEECH • 2019

GAN-based generation and automatic selection of explanations for neural networks

Saumitra Mishra, Daniel Stoller, Emmanouil Benetos, Bob L. Sturm, Simon Dixon

ICLR Workshop on Safe Machine Learning • 2019

End-to-end Lyrics Alignment for Polyphonic Music Using An Audio-to-Character Recognition Model

Daniel Stoller, Simon Durand, Sebastian Ewert

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) • 2019

Abstract

Time-aligned lyrics can enrich the music listening experience by enabling karaoke, text-based song retrieval and intra-song navigation, and other applications. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to combine numerous sub-modules including vocal separation and detection in an effort to break down the problem. Furthermore, training has required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with weak, line-level annotations available in the real world. With a mean alignment error of 0.35 s on a standard dataset, our system outperforms the state-of-the-art by an order of magnitude.
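
As a rough illustration of the approach described above (a minimal sketch, assuming PyTorch; the character set, layer sizes and strides are placeholders rather than the paper's architecture), a convolutional network can map raw audio to per-frame character probabilities and be trained with a CTC loss, which only needs line-level transcripts rather than fine-grained alignments:

```python
# Hedged sketch only, not the paper's Wave-U-Net-based model: raw audio in,
# per-frame character log-probabilities out, trained with CTC so that only
# weak (line-level) transcripts are required.
import torch
import torch.nn as nn

CHARS = " abcdefghijklmnopqrstuvwxyz'"   # characters map to indices 1..len(CHARS)
NUM_CLASSES = len(CHARS) + 1             # index 0 is the CTC blank

class AudioToChar(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=32, stride=16, padding=8), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.Conv1d(128, NUM_CLASSES, kernel_size=1),
        )

    def forward(self, audio):            # audio: (batch, 1, samples)
        logits = self.net(audio)         # (batch, classes, frames)
        return logits.permute(2, 0, 1).log_softmax(dim=-1)   # (frames, batch, classes)

model = AudioToChar()
audio = torch.randn(2, 1, 44100)         # two one-second clips at 44.1 kHz
log_probs = model(audio)

targets = torch.randint(1, NUM_CLASSES, (2, 20))   # dummy character indices for two lyric lines
input_lengths = torch.full((2,), log_probs.shape[0], dtype=torch.long)
target_lengths = torch.full((2,), 20, dtype=torch.long)
loss = nn.CTCLoss(blank=0)(log_probs, targets, input_lengths, target_lengths)
loss.backward()
```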

Wave-U-Net: A Multi-Scale Neural Network for End-to-End Source Separation

Daniel Stoller, Sebastian Ewert, Simon Dixon

International Society for Music Information Retrieval Conference (ISMIR) • 2018

Abstract

Models for audio source separation usually operate on the magnitude spectrum, which ignores phase information and makes separation performance dependent on hyper-parameters for the spectral front-end. Therefore, we investigate end-to-end source separation in the time-domain, which allows modelling phase information and avoids fixed spectral transformations. Due to high sampling rates for audio, employing a long temporal input context on the sample level is difficult, but required for high quality separation results because of long-range temporal correlations. In this context, we propose the Wave-U-Net, an adaptation of the U-Net to the one-dimensional time domain, which repeatedly resamples feature maps to compute and combine features at different time scales. We introduce further architectural improvements, including an output layer that enforces source additivity, an upsampling technique and a context-aware prediction framework to reduce output artifacts. Experiments for singing voice separation indicate that our architecture yields a performance comparable to a state-of-the-art spectrogram-based U-Net architecture, given the same data. Finally, we reveal a problem with outliers in the currently used SDR evaluation metrics and suggest reporting rank-based statistics to alleviate this problem.
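
To make the multi-scale idea concrete, here is a minimal sketch (assuming PyTorch; channel counts, kernel sizes and the simple decimation/linear-interpolation resampling are illustrative choices, not the published implementation). The final layer predicts one source and obtains the other by subtraction from the mixture, so the estimates add up to the input:

```python
# Tiny Wave-U-Net-style 1D encoder-decoder sketch (illustrative only):
# feature maps are repeatedly downsampled and upsampled, skip connections
# combine time scales, and source additivity is enforced at the output.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyWaveUNet(nn.Module):
    def __init__(self, channels=(1, 16, 32, 64)):
        super().__init__()
        self.down = nn.ModuleList(
            nn.Conv1d(c_in, c_out, kernel_size=15, padding=7)
            for c_in, c_out in zip(channels[:-1], channels[1:])
        )
        rev = channels[::-1]
        self.up = nn.ModuleList(
            nn.Conv1d(c_in * 2, c_out, kernel_size=5, padding=2)
            for c_in, c_out in zip(rev[:-1], rev[1:])
        )
        self.out = nn.Conv1d(channels[0], 1, kernel_size=1)

    def forward(self, mix):                       # mix: (batch, 1, samples)
        skips, x = [], mix
        for conv in self.down:
            x = torch.tanh(conv(x))
            skips.append(x)
            x = x[:, :, ::2]                      # decimate: halve the time resolution
        for conv, skip in zip(self.up, reversed(skips)):
            x = F.interpolate(x, size=skip.shape[-1], mode="linear", align_corners=False)
            x = torch.tanh(conv(torch.cat([x, skip], dim=1)))
        vocals = torch.tanh(self.out(x))
        accompaniment = mix - vocals              # output layer enforces source additivity
        return vocals, accompaniment

mix = torch.randn(2, 1, 16384)
vocals, accompaniment = TinyWaveUNet()(mix)
print(vocals.shape, accompaniment.shape)          # both (2, 1, 16384)
```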

Jointly Detecting and Separating Singing Voice: A Multi-Task Approach

Daniel Stoller, Sebastian Ewert, Simon Dixon

Latent Variable Analysis and Signal Separation • 2018

Detection of Cut-Points for Automatic Music Rearrangement

Daniel Stoller, Vincent Akkermans, Simon Dixon

IEEE International Workshop on Machine Learning for Signal Processing (MLSP) • 2018

Abstract

Existing music recordings are often rearranged, for example to fit their duration and structure to video content. Often an expert is needed to find suitable cut points allowing for imperceptible transitions between different sections. In previous work, the search for these cuts is restricted to the beginnings of beats or measures, and only timbre and loudness are taken into account, while melodic expectations and instrument continuity are neglected. We instead aim to learn these features by training neural networks on a dataset of over 300 popular Western songs to classify which note onsets are suitable entry or exit points for a cut. We investigate existing and novel architectures and different feature representations, and find that the best performance is achieved using neural networks with two-dimensional convolutions applied to spectrogram input covering several seconds of audio with a high temporal resolution of 23 or 46 ms. Finally, we analyse our best model using saliency maps and find it attends to rhythmical structures and the presence of sounds at the onset position, suggesting instrument activity to be important for predicting cut quality.
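
The classification setup can be sketched as follows (assuming PyTorch; the layer sizes, the mel-spectrogram input shape and the two-class output are placeholder choices for illustration, not the paper's exact model):

```python
# Illustrative 2-D CNN that scores a spectrogram excerpt around a candidate
# note onset as a suitable or unsuitable entry/exit point for a cut.
import torch
import torch.nn as nn

class CutPointClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d((4, 4)),
        )
        self.classifier = nn.Linear(32 * 4 * 4, 2)   # unsuitable vs. suitable cut point

    def forward(self, spec):                         # spec: (batch, 1, freq_bins, frames)
        h = self.features(spec).flatten(start_dim=1)
        return self.classifier(h)

# e.g. 128 mel bands over a few seconds of audio at a ~23 ms hop -> 256 frames
excerpt = torch.randn(8, 1, 128, 256)
print(CutPointClassifier()(excerpt).shape)           # (8, 2)
```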

Intuitive and efficient computer-aided music rearrangement with optimised processing of audio transitions

Daniel Stoller, Igor Vatolkin, Heinrich Müller

Journal of New Music Research • 2018

Abstract

A promising approach to create new versions of existing music pieces automatically is to cut out and rearrange sections so that transitions are minimally perceptible and constraints regarding duration or structure are fulfilled. We evaluate previous work and address its limitations, particularly the disregard for loudness changes at cuts and the unintuitive control over the musical structure of the output. Our software provides a user-friendly interface, which we make more responsive by greatly accelerating the search for an optimal output track using the A* algorithm. Listening experiments demonstrate an improvement in perceived audio quality.
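
The search itself can be illustrated as a shortest-path problem over beats (a toy sketch, not the actual software; `durations`, `cut_cost` and the zero heuristic are placeholder assumptions, and an informed admissible heuristic is what would turn this into a genuine A* speed-up):

```python
# Rearrangement as graph search: nodes are (current beat, output length so far);
# continuing to the next beat is free, jumping elsewhere costs its perceptual
# cut cost; the goal is an output whose length is close to a target.
import heapq

def rearrange(durations, cut_cost, target, tol=0.5):
    """durations[i]: length of beat i in seconds; cut_cost[(i, j)]: cost of a cut from i to j."""
    n = len(durations)
    frontier = [(0.0, 0.0, 0, durations[0], [0])]   # (f, g, beat, length, path)
    best = {}
    while frontier:
        f, g, i, length, path = heapq.heappop(frontier)
        if abs(length - target) <= tol:
            return path, g                          # beat sequence and total cut cost
        key = (i, round(length, 3))
        if best.get(key, float("inf")) <= g:
            continue
        best[key] = g
        for j in range(n):
            step = 0.0 if j == i + 1 else cut_cost.get((i, j), float("inf"))
            new_len = length + durations[j]
            if step == float("inf") or new_len > target + tol:
                continue
            g2, h2 = g + step, 0.0                  # heuristic left at zero in this sketch
            heapq.heappush(frontier, (g2 + h2, g2, j, new_len, path + [j]))
    return None, float("inf")

beats = [0.5] * 8                                   # eight half-second beats
cuts = {(3, 0): 0.2, (5, 2): 0.1, (7, 4): 0.3}      # made-up cut costs
print(rearrange(beats, cuts, target=3.0))
```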

Adversarial Semi-Supervised Audio Source Separation applied to Singing Voice Extraction

Daniel Stoller, Sebastian Ewert, Simon Dixon

IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) • 2018

Abstract

The state of the art in music source separation employs neural networks trained in a supervised fashion on multi-track databases to estimate the sources from a given mixture. With only a few datasets available, extensive data augmentation is often used to combat overfitting. Mixing random tracks, however, can even reduce separation performance as instruments in real music are strongly correlated. The key concept in our approach is that source estimates of an optimal separator should be indistinguishable from real source signals. Based on this idea, we drive the separator towards outputs deemed as realistic by discriminator networks that are trained to tell apart real from separator samples. This way, we can also use unpaired source and mixture recordings without the drawbacks of creating unrealistic music mixtures. Our framework is widely applicable as it does not assume a specific network architecture or number of sources. To our knowledge, this is the first adoption of adversarial training for music source separation. In a prototype experiment for singing voice separation, separation performance increases with our approach compared to purely supervised training.
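
Schematically, the training alternates between a discriminator update on real versus separated sources and a separator update mixing a supervised loss with an adversarial term (a hedged sketch, not the paper's code; the tiny networks, the 0.01 loss weight and the batch shapes are placeholders, and `separator` stands for any mixture-to-source network):

```python
# Adversarial semi-supervised separation, schematically: the discriminator
# learns to tell real source excerpts from separator outputs, and the separator
# is trained on paired data plus an adversarial term on unpaired mixtures.
import torch
import torch.nn as nn

separator = nn.Sequential(nn.Conv1d(1, 16, 15, padding=7), nn.Tanh(),
                          nn.Conv1d(16, 1, 15, padding=7))
discriminator = nn.Sequential(nn.Conv1d(1, 16, 15, stride=4), nn.LeakyReLU(0.2),
                              nn.Conv1d(16, 1, 15, stride=4),
                              nn.AdaptiveAvgPool1d(1), nn.Flatten())
opt_sep = torch.optim.Adam(separator.parameters(), lr=1e-4)
opt_dis = torch.optim.Adam(discriminator.parameters(), lr=1e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(paired_mix, paired_vocals, unpaired_mix, real_vocals):
    # 1) discriminator: real source samples vs. separator estimates (detached)
    fake = separator(unpaired_mix).detach()
    d_loss = bce(discriminator(real_vocals), torch.ones(real_vocals.shape[0], 1)) + \
             bce(discriminator(fake), torch.zeros(fake.shape[0], 1))
    opt_dis.zero_grad(); d_loss.backward(); opt_dis.step()

    # 2) separator: supervised loss on paired data + adversarial loss on unpaired mixtures
    est, adv = separator(paired_mix), separator(unpaired_mix)
    s_loss = nn.functional.mse_loss(est, paired_vocals) + \
             0.01 * bce(discriminator(adv), torch.ones(adv.shape[0], 1))
    opt_sep.zero_grad(); s_loss.backward(); opt_sep.step()
    return d_loss.item(), s_loss.item()

batch = lambda: torch.randn(4, 1, 16384)
print(train_step(batch(), batch(), batch(), batch()))
```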

Analysis and classification of phonation modes in singing

Daniel Stoller, Simon Dixon

International Society for Music Information Retrieval Conference (ISMIR) • 2016

Abstract

Phonation mode is an expressive aspect of the singing voice and can be described using the four categories neutral, breathy, pressed and flow. Previous attempts at automatically classifying the phonation mode on a dataset containing vowels sung by a female professional have been lacking in accuracy or have not sufficiently investigated the characteristic features of the different phonation modes which enable successful classification. In this paper, we extract a large range of features from this dataset, including specialised descriptors of pressedness and breathiness, to analyse their explanatory power and robustness against changes of pitch and vowel. We train and optimise a feed-forward neural network (NN) with one hidden layer on all features using cross-validation to achieve a mean F-measure above 0.85 and an improved performance compared to previous work. Applying feature selection based on mutual information and retaining the nine highest ranked features as input to a NN results in a mean F-measure of 0.78, demonstrating the suitability of these features to discriminate between phonation modes. Training and pruning a decision tree yields a simple rule set based only on cepstral peak prominence (CPP), temporal flatness and average energy that correctly categorises 78% of the recordings.
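
The classifier side of this can be sketched with scikit-learn (illustrative only; the random feature matrix stands in for the paper's nine selected descriptors such as CPP, temporal flatness and average energy, and the hidden-layer size is an arbitrary choice):

```python
# One-hidden-layer feed-forward classifier for the four phonation modes,
# evaluated with cross-validated macro F-measure on placeholder data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

MODES = ["neutral", "breathy", "pressed", "flow"]

rng = np.random.default_rng(0)
X = rng.normal(size=(240, 9))                 # nine acoustic descriptors per recording (dummy)
y = rng.integers(0, len(MODES), size=240)     # dummy phonation-mode labels

clf = make_pipeline(StandardScaler(),
                    MLPClassifier(hidden_layer_sizes=(32,), max_iter=2000, random_state=0))
scores = cross_val_score(clf, X, y, cv=5, scoring="f1_macro")
print("mean F-measure across folds:", scores.mean())
```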

Impact of Frame Size and Instrumentation on Chroma-Based Automatic Chord Recognition

Daniel Stoller, Matthias Mauch, Igor Vatolkin, Claus Weihs

Data Science, Learning by Latent Structures, and Knowledge Discovery • 2015

Abstract

This paper presents a comparative study of classification performance in automatic audio chord recognition based on three chroma feature implementations, with the aim of distinguishing effects of frame size, instrumentation, and choice of chroma feature. Until recently, research in automatic chord recognition has focused on the development of complete systems. While results have remarkably improved, the understanding of the error sources remains lacking. In order to isolate sources of chord recognition error, we create a corpus of artificial instrument mixtures and investigate (a) the influence of different chroma frame sizes and (b) the impact of instrumentation and pitch height. We show that recognition performance is significantly affected not only by the method used, but also by the nature of the audio input. We compare these results to those obtained from a corpus of more than 200 real-world pop songs from The Beatles and other artists for the case in which chord boundaries are known in advance.
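
To illustrate the kind of frame-size comparison involved (a self-contained sketch, not the study's pipeline; the synthetic C-major input, the template matcher and the two FFT sizes are placeholder choices):

```python
# Chroma features at two analysis frame sizes, matched against simple
# major/minor triad templates; larger frames trade time resolution for
# smoother pitch-class estimates.
import numpy as np
import librosa

sr = 22050
t = np.arange(sr * 2) / sr                               # two seconds
y = sum(np.sin(2 * np.pi * f * t) for f in (261.63, 329.63, 392.00))   # C4 + E4 + G4

def chord_templates():
    names, temps = [], []
    for root in range(12):
        for quality, intervals in (("maj", (0, 4, 7)), ("min", (0, 3, 7))):
            v = np.zeros(12)
            v[[(root + i) % 12 for i in intervals]] = 1.0
            names.append(f"{librosa.midi_to_note(60 + root, octave=False)}:{quality}")
            temps.append(v / np.linalg.norm(v))
    return names, np.stack(temps)

names, templates = chord_templates()
for n_fft in (2048, 8192):                               # small vs. large analysis frame
    chroma = librosa.feature.chroma_stft(y=y, sr=sr, n_fft=n_fft, hop_length=2048)
    chroma = chroma / (np.linalg.norm(chroma, axis=0, keepdims=True) + 1e-8)
    scores = templates @ chroma                          # template match per frame
    print(n_fft, names[np.bincount(scores.argmax(axis=0)).argmax()])   # most frequent chord
```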