Jekyll2022-03-21T00:46:44+01:00https://dans.world/feed.xmlDans WorldMachine learning, music information retrieval, and other thingsDaniel Stollerbusiness@dstoller.netSpotify Internship Report2019-02-21T00:00:00+01:002019-02-21T00:00:00+01:00https://dans.world/Spotify-internship<p>From June to September 2019, I took a break from my ongoing PhD and worked as a Research Intern at Spotify in London.
I was under the supervision of Simon Durand and Tristan Jehan as part of the music intelligence (MIQ) team.
Their work is closely related to my PhD: applying machine learning to music signals so that computers can understand musical properties, for example classifying whether a piece contains singing voice. This made the internship a good opportunity to find out what it is like to work in industry in my field.</p>
<p><img src="https://dans.world/assets/img/2019-02-21-Spotify-internship/jam.png" alt="Jamming night at Spotify" />
<em>Jam night at Spotify New York</em></p>
<h2 id="overall-impressions">Overall impressions</h2>
<p>Overall my experience was very good.
Spotify has lots of energetic, ambitious people that are happy to help out and collaborate with you (on that note, many thanks to Simon, Tristan, Sebastian, Rachel, Andreas, Till, Aparna, and probably more that I forgot!)</p>
<p>Compared to my usual PhD work, this made me especially productive - I worked a lot during these months, but enjoyed it at the same time.
Add to that the many (mostly optional) meetings such as “Research weekly” and talks from invited researchers, as well as a great IT infrastructure that puts powerful compute at your fingertips, and you get a very engaging atmosphere that brings out the best in everyone.</p>
<p>I also had great freedom in terms of selecting the topic I wanted to research. This is not a given at all considering interns are often bound to one particular task during their stay.</p>
<p>Unfortunately, Spotify’s London offices were being restructured at the time, so I worked in a temporary office building that was a bit bland compared to the nicely designed and decorated main offices. The temporary building was still very high quality, though, and a big step up from some of the buildings at my university!</p>
<p>Since much of the MIQ team works in the New York offices, I was also able to visit them for a week, with all expenses sponsored!
It was great to finally meet in person all the people I had only seen in video calls until then.
Offices there are quite big, and even offer daily catering and in-house music events such as a jam night.</p>
<p>During my stay, I investigated methods for</p>
<ul>
<li>separating singing voice from accompaniment</li>
<li>detecting what is being sung in a music piece (lyrics transcription)</li>
<li>and when it is being sung if the lyrics text is already given along with the music piece (lyrics alignment).</li>
</ul>
<h2 id="better-objectives-for-singing-voice-separation">Better objectives for singing voice separation</h2>
<p>For separation, people often use very simple loss functions (e.g. an L2 norm between the predicted output and the real one) to measure the error, which they then minimise to train the system <a class="citation" href="#huangSingingVoiceSeparation2014">[1], [2]</a>.
The problem is that these losses do not necessarily align with how a human listening to, say, a separated vocal track would rate the output quality. In other words, the simple loss can be low while the output quality is rated as bad, and vice versa.
This means we are not optimising our systems to maximise the actual listening quality!
Evaluation metrics such as SDR <a class="citation" href="#vincentPerformanceMeasurement2006">[3]</a> or PEASS <a class="citation" href="#vincentImprovedPerceptual2012">[4]</a> share similar issues, and are also more complicated to compute or possibly unstable to use for training.</p>
<p>The above losses and metrics also assume that for every music input, there exists <em>exactly one true source output</em> as solution for the separation task (uni-modal).
But that might not be the case - if music has background vocals for example, there are <em>two</em> solutions for singing voice separation that both make sense: one that puts the background vocals into the accompaniment track, and one that puts them into the vocal track along with the main vocals:</p>
<p><img src="https://dans.world/assets/img/2019-02-21-Spotify-internship/multi_modal_output.png" alt="Multiple solutions for separation" /></p>
<p>These uni-modal objectives would take whichever option happens to be in the training dataset and reward the separator more the closer it gets to that solution.
The other valid option is punished severely, and instead of representing both solutions, the separator may be encouraged to predict an average of the two - which can itself be a bad output.</p>
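<p>A toy numpy sketch (not our actual training code) makes the averaging problem concrete: with two equally likely valid targets, the prediction minimising the expected L2 loss is their average, even though that average is itself a poor output:</p>

```python
import numpy as np

# Two equally valid "ground truth" vocal stems for the same mixture:
# one including the background vocals, one without them.
target_a = np.array([1.0, 0.0, 1.0, 0.0])
target_b = np.array([1.0, 0.0, 0.0, 0.0])

def expected_l2(pred):
    # Expected L2 loss if either target occurs with probability 0.5.
    return 0.5 * np.sum((pred - target_a) ** 2) + 0.5 * np.sum((pred - target_b) ** 2)

average = 0.5 * (target_a + target_b)

# The average scores better than either valid solution under expected L2,
# so the separator is pushed towards it - even if it sounds bad.
assert expected_l2(average) < expected_l2(target_a)
assert expected_l2(average) < expected_l2(target_b)
```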
<p>We investigated GAN-based training as a potential solution, and also performed perceptual listening experiments (with great help of Aparna!) in an effort to develop a better objective function that more closely resembles how humans would rate the output quality of such systems.</p>
<h2 id="combining-singing-voice-separation-and-lyrics-transcription">Combining singing voice separation and lyrics transcription</h2>
<p>Another idea explored during my internship concerns the interactions between tasks:
Maybe we can separate voice better if we know what is being sung.
And similarly, if we know the isolated voice track, detecting what is being sung should be much easier!</p>
<p>Therefore we looked at <em>multi-task learning</em> to build models that perform separation and lyrics transcription at the same time.
We chose the <em>Wave-U-Net</em> <a class="citation" href="#stollerWaveUNetMultiScale2018">[2]</a> model since it already performs separation, and since we hypothesised that its generic architecture allows capturing many time-varying features on different time-scales, including those useful for transcription. We simply take the output of one of the upsampling blocks whose features have an appropriate time resolution (23 feature vectors per second in our case) to directly predict the lyrics characters over time (at most 23 characters per second). To learn from the start and end times for each lyrical line given in our dataset, we apply the CTC loss in these time intervals (see our <a href="https://arxiv.org/abs/1902.06797">publication on lyrics transcription and alignment</a> for details).</p>
<p><img src="https://dans.world/assets/img/2019-02-21-Spotify-internship/lyrics_waveunet.png" alt="Branching off a Wave-U-Net upsampling block to predict lyrics characters. The later upsampling blocks for separation are not depicted here." />
<em>Branching off a Wave-U-Net upsampling block to predict lyrics characters (<a href="https://arxiv.org/abs/1902.06797">source</a>). The later upsampling blocks for separation are not depicted here.</em></p>
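<p>The character-prediction branch can be sketched as follows - a hypothetical numpy stand-in, where random features replace the real upsampling-block output and the dimensions are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, V = 46, 128, 28  # ~2s at 23 frames/s, feature dim, charset incl. CTC blank (assumed)

features = rng.standard_normal((T, F))  # placeholder for upsampling-block output
W = rng.standard_normal((F, V)) * 0.01  # linear character-prediction head
b = np.zeros(V)

logits = features @ W + b               # one character distribution per frame
log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))

assert log_probs.shape == (T, V)        # at most T character emissions over these 2s
assert np.allclose(np.exp(log_probs).sum(axis=1), 1.0)
```

<p>In training, these per-frame log-probabilities would be fed to a CTC loss restricted to each lyrical line’s start and end times.</p>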
<p>Although we hoped to improve performance this way and tried out many different multi-tasking strategies, the multi-task models were only very slightly better, if at all, than their single-task counterparts.
We don’t know exactly why, since many factors can influence the results: possibly the datasets were so big that the single-task models already underfit, so the extra information from the other task was not beneficial; or the tasks do not overlap as much as we hypothesised; or we used the wrong multi-tasking strategy. So we did not investigate further.</p>
<p>Also, we found that lyrics transcription is very, very hard!
This might not be very surprising: it can be seen as speech recognition - which is already hard - but with a lot of additional noise from the accompaniment, slurred pronunciation, and other unusual effects that arise because singing differs from speech.</p>
<h2 id="lyrics-alignment">Lyrics alignment</h2>
<p>Since very good lyrics transcription still seemed out of reach, I also tried to use the transcription models for aligning given lyrics across time according to when they are sung in the music piece. This turned out to be much more successful, clearly beating the state-of-the-art models that participated in the MIREX lyrics alignment challenge 2017 <a class="citation" href="#imirselMusicInformation2020">[5]</a>.</p>
<p>In conclusion, we find that <strong>you can train a lyrics transcription system from only line-level aligned lyrics annotations to predict characters directly from raw audio</strong>, and <strong>get excellent alignment accuracy even with mediocre transcription performance - see our ICASSP publication (preprint available <a href="https://arxiv.org/abs/1902.06797">here</a>)!</strong></p>
<p>Lyrics transcription, however, can still be considered unsolved for now, and this exciting problem will hopefully attract lots of research in the future.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><div>
<a name="huangSingingVoiceSeparation2014" />
<b>Singing-Voice Separation from Monaural Recordings Using Deep Recurrent Neural Networks.</b> <span style="font-size:15px"> (2014) </span>
<br />
Proc. of the International Society for Music Information Retrieval Conference (ISMIR)
<br />
<i>Huang, Po-Sen and Kim, Minje and Hasegawa-Johnson, Mark and Smaragdis, Paris</i>
</div>
<a download="huangSingingVoiceSeparation2014.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@inproceedings%7BhuangSingingVoiceSeparation2014,%0A%20%20title%20=%20%7BSinging-%7B%7BVoice%20Separation%7D%7D%20from%20%7B%7BMonaural%20Recordings%7D%7D%20Using%20%7B%7BDeep%20Recurrent%20Neural%20Networks%7D%7D.%7D,%0A%20%20booktitle%20=%20%7BProc.%20of%20the%20%7B%7BInternational%20Society%7D%7D%20for%20%7B%7BMusic%20Information%20Retrieval%20Conference%7D%7D%20(%7B%7BISMIR%7D%7D)%7D,%0A%20%20author%20=%20%7BHuang,%20Po-Sen%20and%20Kim,%20Minje%20and%20Hasegawa-Johnson,%20Mark%20and%20Smaragdis,%20Paris%7D,%0A%20%20date%20=%20%7B2014%7D,%0A%20%20pages%20=%20%7B477--482%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/huangSingingVoiceSeparation2014/">Details</a></li>
<li><div>
<a name="stollerWaveUNetMultiScale2018" />
<b>Wave-U-Net: A Multi-Scale Neural Network for End-to-End Source Separation</b> <span style="font-size:15px"> (2018) </span>
<br />
Proc. of the International Society for Music Information Retrieval Conference (ISMIR)
<br />
<i>Stoller, Daniel and Ewert, Sebastian and Dixon, Simon</i>
</div>
<a target="_blank" rel="noopener noreferrer" href="/repository/stollerWaveUNetMultiScale2018.published.pdf"><input class="button0" type="button" value="PDF" /></a>
<a download="stollerWaveUNetMultiScale2018.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@inproceedings%7BstollerWaveUNetMultiScale2018,%0A%20%20title%20=%20%7BWave-%7B%7BU%7D%7D-%7B%7BNet%7D%7D:%20%7B%7BA%20Multi%7D%7D-%7B%7BScale%20Neural%20Network%7D%7D%20for%20%7B%7BEnd%7D%7D-to-%7B%7BEnd%20Source%20Separation%7D%7D%7D,%0A%20%20booktitle%20=%20%7BProc.%20of%20the%20%7B%7BInternational%20Society%7D%7D%20for%20%7B%7BMusic%20Information%20Retrieval%20Conference%7D%7D%20(%7B%7BISMIR%7D%7D)%7D,%0A%20%20author%20=%20%7BStoller,%20Daniel%20and%20Ewert,%20Sebastian%20and%20Dixon,%20Simon%7D,%0A%20%20date%20=%20%7B2018%7D,%0A%20%20volume%20=%20%7B19%7D,%0A%20%20pages%20=%20%7B334--340%7D,%0A%20%20abstract%20=%20%7BModels%20for%20audio%20source%20separation%20usually%20operate%20on%20the%20magnitude%20spectrum,%20which%20ignores%20phase%20information%20and%20makes%20separation%20performance%20dependant%20on%20hyper-parameters%20for%20the%20spectral%20front-end.%20Therefore,%20we%20investigate%20end-to-end%20source%20separation%20in%20the%20time-domain,%20which%20allows%20modelling%20phase%20information%20and%20avoids%20fixed%20spectral%20transformations.%20Due%20to%20high%20sampling%20rates%20for%20audio,%20employing%20a%20long%20temporal%20input%20context%20on%20the%20sample%20level%20is%20difficult,%20but%20required%20for%20high%20quality%20separation%20results%20because%20of%20long-range%20temporal%20correlations.%20In%20this%20context,%20we%20propose%20the%20Wave-U-Net,%20an%20adaptation%20of%20the%20U-Net%20to%20the%20one-dimensional%20time%20domain,%20which%20repeatedly%20resamples%20feature%20maps%20to%20compute%20and%20combine%20features%20at%20different%20time%20scales.%20We%20introduce%20further%20architectural%20improvements,%20including%20an%20output%20layer%20that%20enforces%20source%20additivity,%20an%20upsampling%20technique%20and%20a%20context-aware%20prediction%20framework%20to%20reduce%20output%20artifacts.%20Experiments%20for%20singing%20voice%20s
eparation%20indicate%20that%20our%20architecture%20yields%20a%20performance%20comparable%20to%20a%20state-of-the-art%20spectrogram-based%20U-Net%20architecture,%20given%20the%20same%20data.%20Finally,%20we%20reveal%20a%20problem%20with%20outliers%20in%20the%20currently%20used%20SDR%20evaluation%20metrics%20and%20suggest%20reporting%20rank-based%20statistics%20to%20alleviate%20this%20problem.%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/stollerWaveUNetMultiScale2018/">Details</a></li>
<li><div>
<a name="vincentPerformanceMeasurement2006" />
<b>Performance Measurement in Blind Audio Source Separation</b> <span style="font-size:15px"> (2006) </span>
<br />
<i>Vincent, E. and Gribonval, R. and Fevotte, C.</i>
</div>
<a download="vincentPerformanceMeasurement2006.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@article%7BvincentPerformanceMeasurement2006,%0A%20%20title%20=%20%7BPerformance%20Measurement%20in%20Blind%20Audio%20Source%20Separation%7D,%0A%20%20author%20=%20%7BVincent,%20E.%20and%20Gribonval,%20R.%20and%20Fevotte,%20C.%7D,%0A%20%20date%20=%20%7B2006%7D,%0A%20%20journaltitle%20=%20%7BIEEE%20Transactions%20on%20Audio,%20Speech,%20and%20Language%20Processing%7D,%0A%20%20volume%20=%20%7B14%7D,%0A%20%20pages%20=%20%7B1462--1469%7D,%0A%20%20issn%20=%20%7B1558-7916%7D,%0A%20%20doi%20=%20%7B10.1109/TSA.2005.858005%7D,%0A%20%20keywords%20=%20%7Badditive%20noise,Additive%20noise,algorithmic%20artifacts,audio%20signal%20processing,Audio%20source%20separation,blind%20audio%20source%20separation,blind%20source%20separation,Data%20mining,distortion,Distortion%20measurement,distortions,Energy%20measurement,evaluation,Filters,Image%20analysis,Independent%20component%20analysis,interference,Interference,measure,Microphones,performance,quality,source%20estimation,Source%20separation,time-invariant%20gains,time-varying%20filters%7D,%0A%20%20number%20=%20%7B4%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/vincentPerformanceMeasurement2006/">Details</a></li>
<li><div>
<a name="vincentImprovedPerceptual2012" />
<b>Improved Perceptual Metrics for the Evaluation of Audio Source Separation</b> <span style="font-size:15px"> (2012) </span>
<br />
Latent Variable Analysis and Signal Separation
<br />
<i>Vincent, Emmanuel</i>
</div>
<a download="vincentImprovedPerceptual2012.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@inproceedings%7BvincentImprovedPerceptual2012,%0A%20%20title%20=%20%7BImproved%20%7B%7BPerceptual%20Metrics%7D%7D%20for%20the%20%7B%7BEvaluation%7D%7D%20of%20%7B%7BAudio%20Source%20Separation%7D%7D%7D,%0A%20%20booktitle%20=%20%7BLatent%20%7B%7BVariable%20Analysis%7D%7D%20and%20%7B%7BSignal%20Separation%7D%7D%7D,%0A%20%20author%20=%20%7BVincent,%20Emmanuel%7D,%0A%20%20date%20=%20%7B2012%7D,%0A%20%20pages%20=%20%7B430--437%7D,%0A%20%20publisher%20=%20%7B%7BSpringer%7D%7D,%0A%20%20abstract%20=%20%7BWe%20aim%20to%20predict%20the%20perceived%20quality%20of%20estimated%20source%20signals%20in%20the%20context%20of%20audio%20source%20separation.%20Recently,%20we%20proposed%20a%20set%20of%20metrics%20called%20PEASS%20that%20consist%20of%20three%20computation%20steps:%20decomposition%20of%20the%20estimation%20error%20into%20three%20components,%20measurement%20of%20the%20salience%20of%20each%20component%20via%20the%20PEMO-Q%20auditory-motivated%20measure,%20and%20combination%20of%20these%20saliences%20via%20a%20nonlinear%20mapping%20trained%20on%20subjective%20opinion%20scores.%20The%20parameters%20of%20the%20decomposition%20were%20shown%20to%20have%20little%20influence%20on%20the%20prediction%20performance.%20In%20this%20paper,%20we%20evaluate%20the%20impact%20of%20the%20parameters%20of%20PEMO-Q%20and%20the%20nonlinear%20mapping%20on%20the%20prediction%20performance.%20By%20selecting%20the%20optimal%20parameters,%20we%20improve%20the%20average%20correlation%20with%20mean%20opinion%20scores%20(MOS)%20from%200.738%20to%200.909%20in%20a%20cross-validation%20setting.%20The%20resulting%20improved%20metrics%20are%20used%20in%20the%20context%20of%20the%202011%20Signal%20Separation%20Evaluation%20Campaign%20(SiSEC).%7D,%0A%20%20isbn%20=%20%7B978-3-642-28551-6%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/vincentImprovedPerceptual2012/">Details</a></li>
<li><div>
<a name="imirselMusicInformation2020" />
<b>Music Information Retrieval Exchange (MIREX)</b> <span style="font-size:15px"> (2020) </span>
<br />
<i>IMIRSEL</i>
</div>
<a target="_blank" rel="noopener noreferrer" href="https://www.music-ir.org/mirex/wiki/MIREX_HOME"><input class="button0" type="button" value="Link" /></a>
<a download="imirselMusicInformation2020.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@online%7BimirselMusicInformation2020,%0A%20%20title%20=%20%7BMusic%20%7B%7BInformation%20Retrieval%20Exchange%7D%7D%20(%7B%7BMIREX%7D%7D)%7D,%0A%20%20author%20=%20%7BIMIRSEL%7D,%0A%20%20date%20=%20%7B2020-03-04%7D,%0A%20%20url%20=%20%7Bhttps://www.music-ir.org/mirex/wiki/MIREX_HOME%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/imirselMusicInformation2020/">Details</a></li></ol>ICLR 2020 impressions and paper highlights2019-02-21T00:00:00+01:002019-02-21T00:00:00+01:00https://dans.world/ICLR-2020<p>Having just “visited” my first virtual conference, ICLR 2020, I wanted to talk about my general impression and highlight some papers that stuck out to me from a variety of subfields.</p>
<h2 id="general-impressions">General impressions</h2>
<p>I presented our <a href="https://iclr.cc/virtual/poster_Hye1RJHKwB.html">FactorGAN paper</a> at the conference. Like every other paper, ours had a quick explainer video, an asynchronous chat room for questions, and two two-hour poster session slots, where people could spontaneously join virtual Zoom meetings to discuss the paper.</p>
<p>ICLR organisers did a really good job overall, considering there was so little time to react to the Coronavirus pandemic and to switch from a physical to a virtual conference.
Poster sessions were very useful for getting to know people and discussing specific questions about papers. In my experience they were often surprisingly empty, but that in turn made it easier for everyone to participate. It’s really nice that the explainer videos are now permanently available to everyone, which should help disseminate the latest research efficiently.</p>
<p>However, I found it a bit difficult to get to know people in a more relaxed setting. Just like the poster sessions, the socials on offer were mostly focused on a specific topic, such as AI for environmental issues. There was a “VR” application called “ICLR Town”, where people run around as characters in a 2D top-down view and meet up in this virtual space using webcams. While this suited my needs better, there were barely any people online. Maybe such a virtual meeting space should be promoted more and included as a coffee break in the conference schedule, which this time featured only poster sessions and talks.</p>
<p>Finally, I was surprised that poster sessions were not clearly separated according to topic, which made it quite overwhelming to find relevant papers. But overall it was a nice experience and organisers did the best they could considering the circumstances.</p>
<h2 id="paper-highlights">Paper highlights</h2>
<h3 id="causality">Causality</h3>
<p>Connecting deep learning models, which operate with differentiable operations and loss functions, to causal learning, which deals with discrete graphs, is a very interesting research direction. I want to highlight two papers here.</p>
<p><a href="https://iclr.cc/virtual/poster_ryxWIgBFPS.html">A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms</a></p>
<p>This paper looks at simple two-dimensional distributions $P(A,B)$. The question is: Does A cause B or B cause A?
A probabilistic model could estimate the joint by decomposing it into $P(A)P(B|A)$ or $P(B)P(A|B)$.
The main idea in this paper: If the model uses the “correct” decomposition that reflects the causal structure, then if the cause changes, adapting the model to this new distribution is fast. Why is that? Let’s assume that A causes B. If we model using the correct decomposition, $P(A)P(B|A)$, and $P(A)$ changes, then $P(B|A)$ stays the same, so only one part of the model needs to be adapted.
If we had modelled $P(B)P(A|B)$ instead, then BOTH $P(B)$ and $P(A|B)$ would change.</p>
<p>The authors then construct a clever meta-learning objective: a sum of the likelihoods of both model variants on the new distribution after training for a certain number of steps, weighted by the meta-parameter $\gamma$. After meta-training, $\gamma$ indicates which model, and therefore which causal explanation, is most likely the correct one.</p>
<p>This research seems still in its infancy – two-dimensional distributions are clearly not very practically relevant. But the idea of smoothly interpolating between different generative models using meta-learning might prove valuable in the future in more difficult settings!</p>
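<p>The key mechanism can be illustrated with a tiny numpy sketch; the adaptation losses here are just assumed numbers standing in for the likelihoods of the two adapted model variants, not the paper’s full learned modules:</p>

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

# Assumed toy outcome: after adapting to a shifted distribution, the A->B
# decomposition reaches a lower loss (higher likelihood) than B->A.
loss_ab, loss_ba = 1.0, 3.0
lik_ab, lik_ba = np.exp(-loss_ab), np.exp(-loss_ba)

gamma = 0.0  # meta-parameter mixing the two causal hypotheses
for _ in range(200):
    s = sigmoid(gamma)
    mix = s * lik_ab + (1 - s) * lik_ba
    grad = -(lik_ab - lik_ba) * s * (1 - s) / mix  # d(-log mix)/d gamma
    gamma -= 0.5 * grad

# gamma drifts towards the hypothesis that adapts faster: A causes B.
assert sigmoid(gamma) > 0.9
```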
<p><a href="https://causalrlworkshop.github.io/program/cldm_21.html">Neural Causal Induction from Interventions</a></p>
<p>In this paper, the authors employ a deep learning model to estimate the outcomes of particular interventions and, by observing multiple interventions in a row, to predict the structure of the underlying causal graph that generated the observations.
This paper also makes use of meta-learning, in that in each meta-iteration, a new causal graph is used, with the aim of obtaining a deep learning model that can predict the causal structure between multiple variables even for new, previously unseen distributions.</p>
<p><img src="https://dans.world/assets/img/2020-04-28-ICLR-2020/neural_causal_induction.png" alt="drawing" width="400" /></p>
<p>The structure of the neural network is shown above. At each time step, an intervention is performed on one variable, and the encoder takes the resulting $N$ observed variables plus one that indicates which variable was intervened upon. The output is fed to a sequence model that updates its belief state about the causal graph, given all the information (interventions) we have seen so far. Finally, a graph decoder model is trained to output the correct causal graph.</p>
<p>For more detail on these papers and similar papers, check out the <a href="https://iclr.cc/virtual/workshops_14.html">workshop on causal learning for decision making</a>.</p>
<h3 id="classification-theory">Classification theory</h3>
<p><a href="https://iclr.cc/virtual/poster_Hkxzx0NtDB.html">Your classifier is secretly an energy based model and you should treat it like one
</a></p>
<p>This paper blew me away: Using simple math, it shows very elegantly that a discriminative classifier can also be viewed as an energy-based model, which in turn allows you to detect samples coming from outside the distribution the classifier was originally trained on.</p>
<p>Let’s take a classifier with scalar output $f_{\theta}(x)[y]$ for input $x$ and class index $y$. Class probabilities can be obtained by using the softmax operation, which makes values positive and sum to one over all classes:</p>
<p>$p_{\theta}(y \vert x) = \frac{e^{f_{\theta}(x)[y]}}{\sum_{y'} e^{f_{\theta}(x)[y']}}$.</p>
<p>But the unnormalised outputs can also be used to define an energy based model to define the joint probability over inputs $x$ and labels $y$</p>
<p>$p_{\theta}(x,y) = \frac{e^{f_{\theta}(x)[y]}}{Z(\theta)}$,</p>
<p>where $Z(\theta)$ is the normalising constant obtained by summing the unnormalised probabilities over the whole $(x,y)$ space.
The cool thing is that we can now determine the likelihood of an input, $p(x)$, by marginalising out $y$ from the above equation, which results in</p>
<p>$p_{\theta}(x) = \frac{\sum_y e^{f_{\theta}(x)[y]}}{Z(\theta)}$.</p>
<p>Notice that the numerator simply contains the sum of exponentiated outputs, which is the denominator in the softmax expression.
One can compute $p(y \vert x) = p_{\theta}(x,y) / p_{\theta}(x)$ to perform classification using the usual rules of probability, and, surprisingly, obtain exactly the softmax-based classifier we introduced in the beginning - the intractable $Z(\theta)$ cancels out!</p>
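<p>A minimal numpy check of this identity, with random logits standing in for a trained network:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(10)  # f_theta(x)[y] for one input x and 10 classes

# Energy-based view: p(x, y) = exp(f(x)[y]) / Z, so up to the constant log Z,
# log p(x) is the log-sum-exp of the logits, and Z cancels in the conditional.
log_p_x_unnorm = np.log(np.sum(np.exp(logits)))
p_y_given_x = np.exp(logits - log_p_x_unnorm)

softmax = np.exp(logits) / np.sum(np.exp(logits))
assert np.allclose(p_y_given_x, softmax)  # exactly the softmax classifier
```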
<p>The authors then train models as standard classifiers while simultaneously maximising the input likelihood $p_{\theta}(x)$.</p>
<p>The benefits are numerous:</p>
<ul>
<li>Obtain good classification accuracy, almost as good as purely discriminative training</li>
<li>Models can be used to generate new input samples</li>
<li>Better calibrated classifier output probabilities - NNs are often prone to output probabilities close to 0 or 1, when they should be more uncertain, especially for novel inputs not seen during training. When using the proposed method, samples assigned to a class with a probability of 0.8 would actually end up being from that class 80% of the time.</li>
<li>Out of distribution detection: Simply check an input example $x$ for its likelihood $p(x)$ - if it is too low, reject the sample and return “I don’t know”</li>
<li>More robust to adversarial attacks. Robustness increases even further if the input $x$ is first preprocessed by letting the model perturb it into a version $\hat{x}$ with higher likelihood $p(\hat{x}) > p(x)$, thereby “undoing” the adversarial manipulation and restricting classification to input samples similar to those seen during training</li>
</ul>
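<p>The out-of-distribution check from the list above can be sketched like this; the threshold and logits are hypothetical, and in practice the threshold would be chosen on validation data:</p>

```python
import numpy as np

def logsumexp(v):
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def predict_or_reject(logits, threshold):
    """Return the predicted class, or None ("I don't know") if the
    unnormalised log-likelihood of the input falls below the threshold."""
    score = logsumexp(logits)  # log p(x) up to the constant log Z
    return None if score < threshold else int(np.argmax(logits))

confident = np.array([8.0, 0.5, 0.3])  # in-distribution-like logits (assumed)
flat = np.array([0.1, 0.0, -0.1])      # novel input: low energy everywhere

assert predict_or_reject(confident, threshold=3.0) == 0
assert predict_or_reject(flat, threshold=3.0) is None
```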
<p><a href="https://iclr.cc/virtual/poster_ByxGkySKwH.html">Towards neural networks that provably know when they don’t know</a></p>
<p>In a similar vein, this paper calibrates classifier output probabilities by reformulating the conditional $p(y|x)$.
This approach assumes samples either come from the “in-distribution” (seen during training), or from a specific “out-distribution”, where the classifier should indicate its complete uncertainty by assigning the same probability to all classes.
$p(y|x)$ is then decomposed using Bayes rule:</p>
<p>$p(y \vert x) = \frac{p(y \vert x,i)p(x \vert i)p(i) + p(y \vert x,o)p(x \vert o)p(o)}{p(x \vert i)p(i) + p(x \vert o)p(o)}$</p>
<p>$i$ and $o$ indicate whether a sample comes from the in- or out distribution. $p(y \vert x,i)$ is the classifier of interest, while $p(y \vert x,o)$ is simply set to a uniform distribution over classes, which allows the authors to make uncertainty guarantees.
$p(x \vert i)$ and $p(x \vert o)$ are Gaussian mixture models indicating how likely it is to observe this input sample $x$ assuming it’s drawn from the in- or out distribution, respectively.</p>
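<p>A numpy sketch of the decomposition, with toy 1-D Gaussians in place of the paper’s fitted mixture models (all densities and numbers here are assumptions for illustration):</p>

```python
import numpy as np

def gauss(x, mu, sigma):
    # 1-D Gaussian density.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def calibrated_p_y_given_x(x, p_y_given_x_in, p_i=0.5):
    # Toy stand-ins: in-distribution density is narrow, out-distribution broad.
    p_x_in = gauss(x, mu=0.0, sigma=1.0)
    p_x_out = gauss(x, mu=0.0, sigma=10.0)
    K = len(p_y_given_x_in)
    uniform = np.full(K, 1.0 / K)  # p(y|x,o): complete uncertainty
    num = p_y_given_x_in * p_x_in * p_i + uniform * p_x_out * (1 - p_i)
    return num / (p_x_in * p_i + p_x_out * (1 - p_i))

classifier_out = np.array([0.98, 0.01, 0.01])  # overconfident raw classifier

near = calibrated_p_y_given_x(0.0, classifier_out)   # typical input: trust classifier
far = calibrated_p_y_given_x(30.0, classifier_out)   # far-away input: back off to uniform

assert near[0] > 0.9
assert np.allclose(far, 1.0 / 3, atol=0.01)
```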
<p>While the assumption of a specific out-distribution seems limiting, it is very nice to have mathematically proven guarantees for classifier confidences.</p>
<h3 id="learning-with-small-data-learning-representations">Learning with small data, learning representations</h3>
<p>Research on how to make deep learning generalise in the face of small datasets has reached a new peak in the last few years. Representation learning, self-supervised learning and meta learning are very popular topics, especially given recent breakthroughs in NLP by models such as BERT, and so ICLR also had a good representation (ha) of papers on these topics.</p>
<p>Current meta-learning approaches are often limited to the few-shot setting, where a model is only updated a few times on a task before it is used to make predictions (e.g. MAML <a class="citation" href="#finnModelAgnosticMetaLearning2017">[1], [2]</a>).
<a href="https://iclr.cc/virtual/poster_rkeiQlBFPB.html">WarpGrad</a> aims to extend the applicability of meta learning to settings where more adaptation might be needed.
Instead of directly learning an update rule for gradient descent, or a model initialisation from which training on a new task should start, it introduces so-called warp layers that essentially transform the optimisation landscape itself. The warp layer parameters can then be meta-learned so that normal SGD methods can more easily converge to good solutions.</p>
<p>For representation learning, <a href="https://iclr.cc/virtual/poster_BkeoaeHKDS.html">Gradients as Features for Deep Representation Learning</a> adds another trainable output layer to pre-trained networks that operates on the network’s gradients, in addition to the usual linear output layer that processes intermediate activations from the pre-trained network.</p>
<p>In <a href="https://iclr.cc/virtual/poster_Syx79eBKwr.html">A Mutual Information Maximization Perspective of Language Representation Learning</a>, the authors offer the very interesting theoretical insight that commonly used representation learning techniques such as Deep InfoMax and BERT, while not obviously similar at first glance, all end up optimising a version of a common objective that maximises the mutual information between different parts of the input. This paper might turn out to be critical for developing self-supervised learning techniques that work reliably across different input domains (such as text, audio and video).</p>
<p><a href="https://iclr.cc/virtual/poster_B1esx6EYvr.html">A critical analysis of self-supervision, or what we can learn from a single image</a> investigates what current computer vision models can learn from very few (even single) images under current self-supervision techniques when strong data augmentation is used. The results are quite concerning: self-supervision techniques currently cannot rival standard supervised training, even if millions of unlabelled images are used for self-supervision. Moreover, similar performance can be reached with a single image under heavy data augmentation, as this is sufficient for early network layers to pick up the low-level statistics of natural images. It seems that self-supervision currently suffers from the unsolved problem of finding optimisation objectives that actually encourage modelling high-level, semantically meaningful properties of the input.</p>
<h3 id="audio-processing">Audio processing</h3>
<p>Due to my background in audio processing, I wanted to specifically highlight two audio related papers.</p>
<p>In <a href="https://iclr.cc/virtual/poster_rygjHxrYDB.html">Deep Audio Priors</a>, the authors propose a new convolution kernel for audio spectrograms. They correctly note that normal convolutions are well motivated for images, where nearby pixels are strongly correlated. For spectrograms, this also applies to the time dimension, but not to the frequency dimension, where one can find strong dependencies across the whole frequency band. In particular, many sound sources are harmonic, meaning they are comprised of a sine wave at a certain base frequency (the fundamental frequency), accompanied by additional sine waves at integer multiples of that base frequency (called harmonics). The authors change the convolution kernels to reflect this, obtaining “harmonic convolutions” that attend to a certain base frequency together with the frequency bins representing its harmonics. Experiments in audio source separation and audio denoising show improved performance over spectrogram-based U-Nets and Wave-U-Net, indicating that such convolutions provide a more suitable “audio prior”.</p>
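<p>The core idea can be sketched by gathering spectrogram bins at integer multiples of a base frequency - a simplified illustration, not the paper’s exact kernel definition:</p>

```python
import numpy as np

def harmonic_stack(spec_frame, f0_bin, n_harmonics):
    """Gather magnitudes at integer multiples of a base-frequency bin.

    Instead of a local frequency neighbourhood, the "receptive field"
    for base bin f0 is the set of bins f0, 2*f0, 3*f0, ...
    """
    bins = f0_bin * np.arange(1, n_harmonics + 1)
    bins = bins[bins < len(spec_frame)]
    return spec_frame[bins]

frame = np.zeros(512)
frame[[40, 80, 120, 160]] = 1.0  # a harmonic source with its fundamental at bin 40

assert np.all(harmonic_stack(frame, 40, 4) == 1.0)           # aligned base bin sees all harmonics
assert np.sum(harmonic_stack(frame, 50, 4)) == 0.0           # misaligned base bin sees nothing
```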
<p><a href="https://iclr.cc/virtual/poster_B1x1ma4tDr.html">DDSP - Differentiable Digital Signal Processing</a>
This paper integrates many tools from traditional signal processing, such as synthesisers, with deep learning to gain the benefits of both: DSP provides useful building blocks that can realise complicated audio transformations with just a few control parameters, bringing a lot of prior knowledge to bear on the problem at hand, while deep learning can flexibly learn the desired transformation from the available training data. This can be especially useful in small-data scenarios, where DSP tools alone are not flexible enough to make use of the data to improve results, and where standard deep learning models fail because they have to learn everything from scratch and therefore require much more data.</p>
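<p>As a toy illustration of the kind of DSP building block involved (my own sketch, not the DDSP code), here is an additive harmonic synthesiser: a full waveform is generated from just a fundamental frequency and a handful of harmonic amplitudes, which is exactly the kind of compact control space a network can learn to predict.</p>

```python
import numpy as np

def harmonic_synth(f0, amps, sr=16000, duration=0.1):
    """Additive synthesiser: a sum of sinusoids at integer multiples
    of f0. In a DDSP-style model, f0 and the per-harmonic amplitudes
    `amps` would be predicted by a neural network; here they are fixed."""
    t = np.arange(int(sr * duration)) / sr
    audio = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
                for k, a in enumerate(amps))
    return audio / max(1e-8, np.max(np.abs(audio)))  # normalise to [-1, 1]

# 100 ms of a 220 Hz tone with three decaying harmonics
audio = harmonic_synth(f0=220.0, amps=[1.0, 0.5, 0.25])
```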
<h3 id="a-final-note-on-transformer-efficiency">A final note on transformer efficiency</h3>
<p>There was lots of work trying to make transformer models more computationally efficient <a href="https://iclr.cc/virtual/poster_H1eA7AEtvS.html">[1]</a><a href="https://iclr.cc/virtual/poster_rkgNKkHtvB.html">[2]</a><a href="https://iclr.cc/virtual/poster_SylO2yStDr.html">[3]</a>, since they have a computational complexity of $O(N^2)$ for sequence inputs of length $N$.
This is encouraging to see – while their application was mostly limited to processing a few sentences at a time in the domain of NLP, this might allow for modelling long sequences such as audio signals and other time series data.</p>
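<p>The quadratic cost comes from the attention score matrix itself: for a length-$N$ input, vanilla self-attention materialises an $N \times N$ matrix. A minimal NumPy sketch (my own illustration, with identity query/key projections for simplicity):</p>

```python
import numpy as np

def attention_scores(x):
    """Vanilla self-attention over an input of shape (N, d) builds an
    (N, N) score matrix, so compute and memory grow quadratically in N."""
    q, k = x, x  # identity projections for this sketch
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax

for n in (64, 256):
    w = attention_scores(np.random.randn(n, 8))
    assert w.shape == (n, n)  # 4x longer input -> 16x more score entries
```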
Daniel Stollerbusiness@dstoller.netHaving just “visited” my first virtual conference, ICLR 2020, I wanted to talk about my general impression and highlight some papers that stuck out to me from a variety of subfields.Bounded output regression with neural networks2019-01-23T00:00:00+01:002019-01-23T00:00:00+01:00https://dans.world/Bounded-output-networks<p>Say we have a neural network (or some other model trainable with gradient descent) that performs supervised regression: For an input $x$, it outputs one or more real values $y$ as prediction, and tries to get as close to a given target value $\hat{y}$ as possible. We also know that the targets $\hat{y}$ always lie in a certain interval $[a,b]$.</p>
<p>This sounds like a very standard setting that should not pose any problems.
But as we will see, it is not so obvious how to ensure the output of the network is always in the given interval, and that training based on stochastic gradient descent (SGD) is possible without issues.</p>
<h2 id="application-directly-output-audio-waveforms">Application: Directly output audio waveforms</h2>
<p>This supervised regression occurs in many situations, such as predicting a depth map from an image, or performing audio style transfer.
We will take a look at the <a href="https://github.com/f90/Wave-U-Net">Wave-U-Net</a> model that takes a music waveform as input $x$, and predicts the individual instrument tracks directly as raw audio.
Since audio amplitudes are usually represented as values in the $[-1,1]$ range, I decided to use $\tanh(a)$ as the final activation function with $a \in \mathbb{R}$ as the last layer’s output at a given time-step.
As a result, all outputs are now between -1 and 1:</p>
<p><img src="https://dans.world/assets/img/2019-01-23-Bounded-output-networks/tanh.png" alt="Tanh activation function" /></p>
<p>As loss function for the regression, I simply use the mean squared error (MSE, $(y - \hat{y})^2$) between the prediction $y$ and the target $\hat{y}$.</p>
<h2 id="squashing-the-output-with-tanh-issues">Squashing the output with $\tanh$: Issues</h2>
<p>This looks like a good solution at first glance, since the network always produces valid outputs, so the output does not need to be post-processed.
But there are two potential problems:</p>
<ol>
<li>
<p>The true audio amplitudes $\hat{y}$ are in the range $[-1,1]$, but $\tanh(a) \in (-1, 1)$ and so never reaches -1 and 1 exactly.
If our targets $\hat{y}$ in the training data actually contain these values, the network is forced to output extremely large/small $a$ so that $\tanh(a)$ gets as close to -1 or 1 as possible.
I tested this with the Wave-U-Net in an extreme scenario, where all target amplitudes $\hat{y}$ are 1 for all inputs $x$.
After just a few training steps, activations in the layers began to explode to increase $a$, which confirms that this can actually become a problem (although my training data is a bit unrealistic).
And generally, the network has to drive up activations $a$ (and thus weights) to produce predictions with very high or low amplitudes, potentially making training more unstable.</p>
</li>
<li>
<p>At very small or large $a$ values, the gradient of $\tanh(a)$ with respect to $a$, $\tanh'(a)$, vanishes towards zero, as you can see in the plot below.
<img src="https://dans.world/assets/img/2019-01-23-Bounded-output-networks/tanh_derivative.png" alt="Tanh derivative" />
At any point during training, a large weight update that makes all model outputs $y$ almost $-1$ or $1$ would thus make the gradient of the loss with respect to the weights vanish towards zero, since it contains $\tanh'(a)$ as one factor.
This can actually happen in practice - some people reported to me that for their dataset, the Wave-U-Net suddenly diverged in this fashion after training stably for a long time, and then couldn’t recover.</p>
</li>
</ol>
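<p>Both problems are easy to verify numerically. The following sketch shows how $\tanh$ never reaches the boundary values exactly, and how its derivative collapses once the pre-activation $a$ grows large:</p>

```python
import numpy as np

a = np.array([1.0, 5.0, 10.0])
y = np.tanh(a)        # approaches, but never reaches, 1 (problem 1)
grad = 1.0 - y ** 2   # tanh'(a); vanishes as |a| grows (problem 2)

# To push the output within ~1e-8 of the target amplitude 1, the
# pre-activation must already exceed 10, where the gradient signal
# left for learning is itself on the order of 1e-8.
assert np.all(y < 1.0)   # exact 1.0 is unattainable
assert grad[-1] < 1e-7   # almost no gradient left at a = 10
```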
<h2 id="possible-solutions">Possible solutions</h2>
<p>So what other options are there?
One option is to simply use a linear output, but clipped to the $[-1,1]$ range: $y := \min(\max(a, -1), 1)$.
This solves problem number 1, since $-1$ and $1$ can be output directly.
However, problem number 2 still remains, and is maybe even more pronounced now: Clipping all output values outside $[-1,1]$ means the gradient for these outputs is exactly zero, not just arbitrarily close to it like with $\tanh$, so the network might still diverge and never recover.</p>
<p>Finally, I want to propose a third option: A linear output that is unbounded during training ($y := a$), but at test time, the output is clipped to $[-1,1]$. Compared to always clipping, there is now a significant, non-zero gradient for the network to learn from during training at all times:
If the network predicts for example $1.4$ as amplitude where the target is $1$, the MSE loss will result in the output being properly corrected towards $1$.</p>
<p>I trained a Wave-U-Net variant for singing voice separation that uses this linear output with test-time clipping for each source independently. Apart from that, all settings are equal to the <a href="https://github.com/f90/Wave-U-Net">M5-HighSR</a> model, which uses the $\tanh$ function to predict the accompaniment, and outputs the difference between the input music and the predicted accompaniment signal as voice signal.</p>
<p>Below you can see the waveforms of an instrumental section of a Nightwish song (top), the accompaniment prediction from our new model variant (middle), and from the $\tanh$ model (bottom). Red parts indicate amplitude clipping.
<img src="https://dans.world/assets/img/2019-01-23-Bounded-output-networks/waveform_comparison.png" alt="Waveform comparison" />
We can see the accompaniment from the $\tanh$ model is attenuated, since it cannot reach values close to $-1$ and $1$ easily. In contrast, our model can output the input music almost 1:1, which is desired here since there are no vocals to subtract. The clipping occurs where the original input also has it, so this can be considered a feature, not a bug.</p>
<p>The problem with the accompaniment output also creates more noise in the vocal channel for the $\tanh$ model, since it uses the difference signal as vocal output:</p>
<ul>
<li>Original song</li>
</ul>
<audio controls="">
<source src="https://dans.world/assets/audio/2019-01-23-Bounded-output-networks/nightwish_original.mp3" type="audio/mpeg" />
Your browser does not support audio file playing!
</audio>
<ul>
<li>Tanh model vocal prediction</li>
</ul>
<audio controls="">
<source src="https://dans.world/assets/audio/2019-01-23-Bounded-output-networks/nightwish_tanh_vocals.mp3" type="audio/mpeg" />
Your browser does not support audio file playing!
</audio>
<ul>
<li>Linear output model vocal prediction</li>
</ul>
<audio controls="">
<source src="https://dans.world/assets/audio/2019-01-23-Bounded-output-networks/nightwish_direct_vocals.mp3" type="audio/mpeg" />
Your browser does not support audio file playing!
</audio>
<h2 id="outlook">Outlook</h2>
<p>Although we managed to improve over $\tanh$ for outputting values of bounded range with neural networks, this might not be the perfect solution. Output activations such as $\sin$ or $\cos$ could also be considered, since they squash the output to a desired interval while still allowing the boundary values to be output exactly, but training might be difficult due to their periodic nature.</p>
<p>Also, regression loss functions other than MSE might be useful. Cross-entropy, for example, should provide a better-behaved gradient even with the $\tanh$ activation, so alternative loss functions can also play a role and should be explored in the future.</p>Daniel Stollerbusiness@dstoller.netSay we have a neural network (or some other model trainable with gradient descent) that performs supervised regression: For an input $x$, it outputs one or more real values $y$ as prediction, and tries to get as close to a given target value $\hat{y}$ as possible. We also know that the targets $\hat{y}$ always lie in a certain interval $[a,b]$.ISMIR 2018 - Paper Overviews2018-10-18T00:00:00+02:002018-10-18T00:00:00+02:00https://dans.world/ISMIR-Summary<p>This year’s ISMIR was great as ever, this time featuring</p>
<ul>
<li>lots of deep learning - I suspect because it has become much easier to use with recently developed libraries</li>
<li>lots of new, and surprisingly large, datasets (suited for the new deep learning era)</li>
<li>and a fantastic boat tour through Paris!</li>
</ul>
<p>For those who want a very quick overview of many of the papers (though not all, and the selection is admittedly biased towards my own research interests),
I created “mini-abstracts” designed to capture the core idea or contribution of each paper in a way that should be understandable to anyone familiar with the field, since even abstracts tend to be wordy or unnecessarily, well… abstract! I divided them according to the ISMIR conference session they belong to.</p>
<p>Use <a href="http://ismir2018.ircam.fr/pages/events-main-program.html">this page</a> in parallel to quickly retrieve the links to each paper’s PDF document.</p>
<h1 id="musical-objects">Musical objects</h1>
<h2 id="a-1-a-confidence-measure-for-key-labelling">(A-1) A Confidence Measure For Key Labelling</h2>
<p>Roman B. Gebhardt, Michael Stein and Athanasios Lykartsis</p>
<p>Uncertainty in key classification for songs can be estimated by looking at how much the estimated key varies across the whole song (stability), and by taking the sum of the chroma vector at each time point, averaged over the whole song, as a measure of how much tonality is contained (keyness).</p>
<h2 id="a-2-improved-chord-recognition-by-combining-duration-and-harmonic-language-models">(A-2) Improved Chord Recognition by Combining Duration and Harmonic Language Models</h2>
<p>Filip Korzeniowski and Gerhard Widmer</p>
<p>Use a model for predicting the next chord given the previous ones, combined with a duration model that predicts at which timestep the chord changes, as a language model to facilitate the learning of long-term dependencies that would be otherwise hard to learn with a time-frame based approach.</p>
<h2 id="a-4-a-predictive-model-for-music-based-on-learned-interval-representations">(A-4) A Predictive Model for Music based on Learned Interval Representations</h2>
<p>Stefan Lattner, Maarten Grachten and Gerhard Widmer</p>
<p>Use a gated recurrent autoencoder to encode the relative change in pitch at each timestep, then model these relative changes with an RNN to perform monophonic pitch sequence generation, enabling the RNN to generalise better to repeating melody patterns that continually rise/fall each time.</p>
<h2 id="a-5-an-end-to-end-framework-for-audio-to-score-music-transcription-on-monophonic-excerpts">(A-5) An End-to-end Framework for Audio-to-Score Music Transcription on Monophonic Excerpts</h2>
<p>Miguel A. Román, Antonio Pertusa and Jorge Calvo-Zaragoza</p>
<p>Use a neural network on audio to output a symbolic sequence, using a vocabulary with clefs, keys, pitches etc. required to reconstruct a full score sheet, trained with a CTC loss.</p>
<h2 id="a-6-evaluating-automatic-polyphonic-music-transcription">(A-6) Evaluating Automatic Polyphonic Music Transcription</h2>
<p>Andrew McLeod and Mark Steedman</p>
<p>Proposes five metrics with which to rate the quality of a music transcription system, and combines them to one metric describing overall quality, aiming to penalize each mistake only in exactly one of the five metrics (no multiple penalties). There is however no evaluation as to how this metric correlates with subjective quality ratings by humans.</p>
<h2 id="a-7-onsets-and-frames-dual-objective-piano-transcription">(A-7) Onsets and Frames: Dual-Objective Piano Transcription</h2>
<p>Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore and Douglas Eck</p>
<p>Instead of using a frame-wise cross-entropy loss on the piano roll output for transcription, also predict the onset position of notes to improve performance/reduce spurious note activations. They also predict note velocity separately to further improve the sound of the synthesized transcription.</p>
<h2 id="a-8-player-vs-transcriber-a-game-approach-to-data-manipulation-for-automatic-drum-transcription">(A-8) Player Vs Transcriber: A Game Approach To Data Manipulation For Automatic Drum Transcription</h2>
<p>Carl Southall, Ryan Stables and Jason Hockman</p>
<p>Add another model to the drum transcription setting (player) that can learn to use data augmentation operations on the training set to decrease the resulting transcription accuracy. Player and transcriber are trained together to make the transcriber learn from difficult examples not seen in the training data.</p>
<h2 id="a-10-evaluating-a-collection-of-sound-tracing-data-of-melodic-phrases">(A-10) Evaluating a collection of Sound-Tracing Data of Melodic Phrases</h2>
<p>Tejaswinee Kelkar, Udit Roy and Alexander Refsum Jensenius</p>
<p>Make people move their bodies in response to melodic phrases while motion-capturing them and then try to find out which movement features correlate with/predict the corresponding melody.</p>
<h2 id="a-11-main-melody-estimation-with-source-filter-nmf-and-crnn">(A-11) Main Melody Estimation with Source-Filter NMF and CRNN</h2>
<p>Dogac Basaran, Slim Essid and Geoffroy Peeters</p>
<p>Pretrain a source-filter NMF model to provide useful features for input into a convolutional-recurrent neural network to track the main melody in music pieces. Pretraining helps since it provides a better representation of the dominant fundamental frequency/pitch salience.</p>
<h2 id="a-13-a-single-step-approach-to-musical-tempo-estimation-using-a-convolutional-neural-network">(A-13) A single-step approach to musical tempo estimation using a convolutional neural network</h2>
<p>Hendrik Schreiber and Meinard Mueller</p>
<p>Neural network that predicts the local tempo given a 12 second long audio input, and its aggregated outputs over a whole song can be used for estimating the whole song’s tempo.</p>
<h2 id="a-14-analysis-of-common-design-choices-in-deep-learning-systems-for-downbeat-tracking">(A-14) Analysis of Common Design Choices in Deep Learning Systems for Downbeat Tracking</h2>
<p>Magdalena Fuentes, Brian McFee, Hélène C. Crayencour, Slim Essid and Juan Pablo Bello</p>
<p>Investigation of how downbeat tracking performance changes when state-of-the-art approaches are modified slightly, e.g. the temporal granularity of the input spectrogram, how the output is decoded from the neural network, or convolutional-RNN vs. RNN-only architectures.</p>
<h1 id="generation-visual">Generation, visual</h1>
<h2 id="b-5-bridging-audio-analysis-perception-and-synthesis-with-perceptually-regularized-variational-timbre-spaces">(B-5) Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces</h2>
<p>Philippe Esling, Axel Chemla–Romeu-Santos and Adrien Bitton</p>
<p>Beta-VAE is used on instrument samples. The latent space is additionally regularized such that the distances between samples of different instruments corresponds to the perceived timbral difference according to perceptual ratings. The resulting model’s latent space can be used to classify the instrument, pitch, dynamics and family, and together with the decoder one can synthesize smoothly interpolated new sounds.</p>
<h2 id="b-6-conditioning-deep-generative-raw-audio-models-for-structured-automatic-music">(B-6) Conditioning Deep Generative Raw Audio Models for Structured Automatic Music</h2>
<p>Rachel Manzelli, Vijay Thakkar, Ali Siahkamari and Brian Kulis</p>
<p>Combine symbolic and audio music models: a recurrent network is trained to model symbolic note sequences, and a Wavenet model is separately trained to produce raw audio conditioned on a piano-roll representation. Then the models are combined to synthesize music pieces.</p>
<h2 id="b-7-convolutional-generative-adversarial-networks-with-binary-neurons-for-polyphonic-music-generation">(B-7) Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation</h2>
<p>Hao-Wen Dong and Yi-Hsuan Yang</p>
<p>To adapt GANs for symbolic music generation, which is a discrete problem rather than the continuous one usually handled by GANs, they use straight-through estimators (“stochastic binary neurons”) that produce a binary output (randomly sampled) in the forward pass, but use the real-valued probability in the backward pass to compute gradients.</p>
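<p>For illustration, the straight-through trick can be sketched in a few framework-free lines of NumPy (my own sketch, not the paper's code): sample a hard binary value in the forward pass, but pretend the sampling was the identity when propagating gradients backwards:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_neuron_forward(p):
    """Forward pass: sample a hard 0/1 output with probability p."""
    return (rng.random(p.shape) < p).astype(float)

def binary_neuron_backward(grad_output):
    """Backward pass (straight-through): treat the sampling step as the
    identity and pass the incoming gradient through unchanged."""
    return grad_output

p = np.array([0.1, 0.9, 0.5])   # real-valued probabilities from the network
b = binary_neuron_forward(p)    # hard binary notes in the forward pass
g = binary_neuron_backward(np.ones_like(p))  # gradient w.r.t. p
```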
<h1 id="source-separation">Source separation</h1>
<h2 id="c-2-music-source-separation-using-stacked-hourglass-networks">(C-2) Music Source Separation Using Stacked Hourglass Networks</h2>
<p>Sungheon Park, Taehoon Kim, Kyogu Lee and Nojun Kwak</p>
<p>2D U-Net neural network for source separation applied multiple times in a row using a residual connection, so that the initial estimate can be further refined each time</p>
<h2 id="c-3-the-northwestern-university-source-separation-library">(C-3) The Northwestern University Source Separation Library</h2>
<p>Ethan Manilow, Prem Seetharaman and Bryan Pardo</p>
<p>Library for source separation: Supports using trained separation models easily, offers computation of evaluation metrics</p>
<h2 id="c-4-improving-bass-saliency-estimation-using-transfer-learning-and-label-propagation">(C-4) Improving Bass Saliency Estimation using Transfer Learning and Label Propagation</h2>
<p>Jakob Abeßer, Stefan Balke and Meinard Müller</p>
<p>Detecting bass notes in jazz ensemble recordings. Two techniques are investigated in the face of the small available labelled data: Label propagation - train model on annotated dataset, then predict labels for unlabelled data and retrain - and transfer learning - network is trained on isolated bass recordings first, then on the actual jazz data.</p>
<h2 id="c-5-improving-peak-picking-using-multiple-time-step-loss-functions">(C-5) Improving Peak-picking Using Multiple Time-step Loss Functions</h2>
<p>Carl Southall, Ryan Stables and Jason Hockman</p>
<p>Since many current models that predict a series of events given an audio sequence are trained with frame-wise cross-entropy followed by separate peak picking, the model's activations might not be well suited for the peak-picking procedure. Loss functions that also act on neighbouring outputs are investigated to remedy this.</p>
<h2 id="c-6-zero-mean-convolutions-for-level-invariant-singing-voice-detection">(C-6) Zero-Mean Convolutions for Level-Invariant Singing Voice Detection</h2>
<p>Jan Schlüter and Bernhard Lehner</p>
<p>Singing voice classifiers turn out to be sensitive to the overall volume of the input music, which is undesirable. While data augmentation by random amplification and mixing of voice and instrumentals helps with classification performance, this sensitivity largely remains. The paper shows that you can directly bake in this invariance by constraining the first convolutional layer so that the weights in each filter sum to 0, and get better performance. One potential drawback is that a very quiet music input with singing voice will now be classified as positive, although a listener might not be able to hear anything and say there is no singing voice.</p>
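<p>The mechanism is easy to see in a small NumPy sketch (my own illustration, assuming log-magnitude inputs as is common): a global gain change acts as an additive constant in the log domain, and any filter whose weights sum to zero cancels constant offsets exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.standard_normal(5)
w -= w.mean()                  # constrain filter weights to sum to 0

x = rng.standard_normal(100)   # e.g. one row of a log-magnitude spectrogram
offset = 3.0                   # a global volume change adds a constant
                               # in the log domain

y_quiet = np.convolve(x, w, mode="valid")
y_loud = np.convolve(x + offset, w, mode="valid")

# The filter response is identical at both levels, since the constant
# offset is multiplied by sum(w) = 0.
assert np.allclose(y_quiet, y_loud)
```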
<h2 id="c-8-wave-u-net-a-multi-scale-neural-network-for-end-to-end-audio-source-separation">(C-8) Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation</h2>
<p>Daniel Stoller, Sebastian Ewert and Simon Dixon</p>
<p>This is our own paper ;)</p>
<p>Change the “U-Net”, previously used in biomedical image segmentation and magnitude-based source separation to capture multi-scale features and dependencies, from 2D to 1D convolution (across time), to perform separation directly on the waveform without needing artifact-inducing source reconstruction steps.</p>
<p>For more information, please see the corresponding Github repository <a href="https://github.com/f90/Wave-U-Net">here</a></p>
<h2 id="c-13-music-mood-detection-based-on-audio-and-lyrics-with-deep-neural-net">(C-13) Music Mood Detection Based on Audio and Lyrics with Deep Neural Net</h2>
<p>Rémi Delbouys, Romain Hennequin, Francesco Piccoli, Jimena Royo-Letelier and Manuel Moussallam</p>
<p>Predicting the mood of a music piece by combining audio and lyrics information helps performance.</p>
<h1 id="corpora">Corpora</h1>
<h2 id="d-4-dali-a-large-dataset-of-synchronized-audio-lyrics-and-notes-automatically-created-using-teacher-student-machine-learning-paradigm">(D-4) DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm</h2>
<p>Gabriel Meseguer-Brocal, Alice Cohen-Hadria and Geoffroy Peeters</p>
<p>5000 music pieces with lyrics aligned down to the syllable level, created by matching Karaoke files containing user-aligned lyrics to the corresponding audio tracks according to a singing voice probability vector across the duration of the song. The singing voice detection system can then be retrained with the newly derived dataset to improve its performance, and the cycle can be repeated iteratively.</p>
<h2 id="d-5-openmic-2018-an-open-data-set-for-multiple-instrument-recognition">(D-5) OpenMIC-2018: An open data-set for multiple instrument recognition</h2>
<p>Eric Humphrey, Simon Durand and Brian McFee</p>
<p>Large dataset of 10 sec music snippets with labels indicating which instruments are present.</p>
<h2 id="d-6-from-labeled-to-unlabeled-data--on-the-data-challenge-in-automatic-drum-transcription">(D-6) From Labeled to Unlabeled Data – On the Data Challenge in Automatic Drum Transcription</h2>
<p>Chih-Wei Wu and Alexander Lerch</p>
<p>Investigating feature learning and student-teacher learning paradigms for drum transcription to circumvent the lack of labelled training data. Performance does not clearly increase however, indicating the need for better feature learning/student-teacher learning approaches to enable better transfer.</p>
<h2 id="d-9-vocalset-a-singing-voice-dataset">(D-9) VocalSet: A Singing Voice Dataset</h2>
<p>Julia Wilkins, Prem Seetharaman, Alison Wahl and Bryan Pardo</p>
<p>Solo singing voice recordings by 20 different professional singers with annotated labels of singing style and pitch.</p>
<h2 id="d-10-the-nes-music-database-a-multi-instrumental-dataset-with-expressive-performance-attributes">(D-10) The NES Music Database: A multi-instrumental dataset with expressive performance attributes</h2>
<p>Chris Donahue, Huanru Henry Mao and Julian McAuley</p>
<p>Large dataset of polyphonic music in symbolic form taken from NES games, but with extra performance-related attributes (note velocity, timbre)</p>
<h2 id="d-14-revisiting-singing-voice-detection-a-quantitative-review-and-the-future-outlook">(D-14) Revisiting Singing Voice Detection: A quantitative review and the future outlook</h2>
<p>Kyungyun Lee, Keunwoo Choi and Juhan Nam</p>
<p>Review paper about singing voice detection. Common problems are identified where current methods fail: 1. a low signal-to-noise ratio between vocals and instrumentals, 2. guitars and other instruments that sound “similar” to voice being mistaken for it, and 3. vibrato in other instruments being mistaken for singing voice. These findings give inspiration on how to improve future systems.</p>
<h2 id="d-15-vocals-in-music-matter-the-relevance-of-vocals-in-the-minds-of-listeners">(D-15) Vocals in Music Matter: the Relevance of Vocals in the Minds of Listeners</h2>
<p>Andrew Demetriou, Andreas Jansson, Aparna Kumar and Rachel Bittner</p>
<p>Psychological qualitative and quantitative studies demonstrate that listeners attend very closely to singing voice compared to many other aspects of music, despite the lack of singing voice related attributes in music tags for songs.</p>
<h2 id="d-16-vocal-melody-extraction-with-semantic-segmentation-and-audio-symbolic-domain-transfer-learning">(D-16) Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning</h2>
<p>Wei Tsung Lu and Li Su</p>
<p>For vocal melody extraction, a symbolic vocal segmentation model is first trained on symbolic data. Then the vocal melody extractor is trained from the audio plus the symbolic representation extracted by the other model (assuming both audio and symbolic input are known for each sample). At test time, since symbolic data is not available, a simple filter is applied to the audio to get an estimate of what the symbolic transcription might look like to feed into the symbolic model, before its output is fed into the audio model.</p>
<h2 id="d-17-empirically-weighting-the-importance-of-decision-factors-for-singing-preference">(D-17) Empirically Weighting the Importance of Decision Factors for Singing Preference</h2>
<p>Michael Barone, Karim Ibrahim, Chitralekha Gupta and Ye Wang</p>
<p>Psychological study into how important different factors (familiarity, genre preference,
ease of vocal reproducibility, and overall preference of the song) are for predicting how attractive it is for a person to sing along to a song.</p>
<h1 id="timbre-tagging-similarity-patterns-and-alignment">Timbre, tagging, similarity, patterns and alignment</h1>
<h2 id="e-3-comparison-of-audio-features-for-recognition-of-western-and-ethnic-instruments-in-polyphonic-mixtures">(E-3) Comparison of Audio Features for Recognition of Western and Ethnic Instruments in Polyphonic Mixtures</h2>
<p>Igor Vatolkin and Günter Rudolph</p>
<p>Using evolutionary optimisation to select features most useful for detecting Western or Ethnic instruments. Since these feature sets turn out to be somewhat different, they also search for the best “compromise set” of features that performs reasonably well (but worse than the specialised features) on both types of data.</p>
<h2 id="e-4-instrudive-a-music-visualization-system-based-on-automatically-recognized-instrumentation">(E-4) Instrudive: A Music Visualization System Based on Automatically Recognized Instrumentation</h2>
<p>Takumi Takahashi, Satoru Fukayama and Masataka Goto</p>
<p>Visualising a collection of music pieces by turning each piece into a pie-chart that shows the percentage of time each instrument is active.</p>
<h2 id="e-6-jazz-solo-instrument-classification-with-convolutional-neural-networks-source-separation-and-transfer-learning">(E-6) Jazz Solo Instrument Classification with Convolutional Neural Networks, Source Separation, and Transfer Learning</h2>
<p>Juan S. Gómez, Jakob Abeßer and Estefanía Cano</p>
<p>To classify which jazz solo instrument is playing, source separation is first used to remove the other instruments, which helps classification performance. Transfer learning, on the other hand (using a model pretrained on a different dataset beforehand), does not turn out to work better, but that may be due to the way the model predictions are aggregated to compute the evaluation metrics.</p>
<h2 id="e-9-semi-supervised-lyrics-and-solo-singing-alignment">(E-9) Semi-supervised lyrics and solo-singing alignment</h2>
<p>Chitralekha Gupta, Rong Tong, Haizhou Li and Ye Wang</p>
<p>Usage of the DAMP dataset, containing amateur solo singing recordings together with unaligned lyrics, which are roughly aligned using existing speech recognition technology, to train a lyrics transcription and alignment system. They reach a word error rate of 36%; however, it is not known how much this degrades on normal music with lots of accompaniment.</p>
<h2 id="e-14-end-to-end-learning-for-music-audio-tagging-at-scale">(E-14) End-to-end Learning for Music Audio Tagging at Scale</h2>
<p>Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra</p>
<p>A comparison of spectrogram-based and raw-audio-based classification models for music tagging with varying amounts of training data indicates that spectrograms lead to slightly better performance for small training datasets, but slightly worse performance for very large ones, compared to direct audio input.</p>
<h2 id="e-17-learning-interval-representations-from-polyphonic-music-sequences">(E-17) Learning Interval Representations from Polyphonic Music Sequences</h2>
<p>Stefan Lattner, Maarten Grachten and Gerhard Widmer</p>
<p>Instead of modeling a sequence of pitches directly, the transformation from the previous pitches into the current one is modeled with a gated autoencoder, and an RNN then models the autoencoder embeddings, which makes for key-invariant processing.</p>
<h1 id="session-f---machine-and-human-learning-of-music">Session F - Machine and human learning of music</h1>
<h2 id="f-3-listener-anonymizer-camouflaging-play-logs-to-preserve-users-demographic-anonymity">(F-3) Listener Anonymizer: Camouflaging Play Logs to Preserve User’s Demographic Anonymity</h2>
<p>Kosetsu Tsukuda, Satoru Fukayama and Masataka Goto</p>
<p>Individual users of music streaming services can protect themselves against being identified in terms of nationality, age, etc. through their playback history with this technique, which estimates these attributes internally and then tells users which songs to play in order to confuse the recommendation engine and obfuscate these attributes.</p>
<h2 id="f-7-representation-learning-of-music-using-artist-labels">(F-7) Representation Learning of Music Using Artist Labels</h2>
<p>Jiyoung Park, Jongpil Lee, Jangyeon Park, Jung-Woo Ha and Juhan Nam</p>
<p>Instead of classifying genre directly, which limits training data and introduces label noise, first train to detect the artist, which is an objective, easily obtained label. Then use the learned feature representation in the last layer to perform genre detection on a few different datasets.</p>
<h2 id="f-11-midi-vae-modeling-dynamics-and-instrumentation-of-music-with-applications-to-style-transfer">(F-11) MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer</h2>
<p>Gino Brunner, Andres Konrad, Yuyi Wang and Roger Wattenhofer</p>
<p>Use a VAE on short symbolic music excerpts, but reserve 2 dimensions in the latent space for modeling different musical styles (jazz, classical, …) by ensuring that a classifier using these dimensions can identify the style of the input. The VAE can then be used for style transfer by encoding a given input and changing the style code in the latent space before decoding it.</p>
<h2 id="f-12-understanding-a-deep-machine-listening-model-through-feature-inversion">(F-12) Understanding a Deep Machine Listening Model Through Feature Inversion</h2>
<p>Saumitra Mishra, Bob L. Sturm and Simon Dixon</p>
<p>To understand what information/concept is captured by each layer/neuron in a deep audio model, extra decoder functions are trained at each layer to recover the original input of the network (which gets harder the closer the layer is to the classification output).</p>
<h2 id="f-16-learning-to-listen-read-and-follow-score-following-as-a-reinforcement-learning-game">(F-16) Learning to Listen, Read, and Follow: Score Following as a Reinforcement Learning Game</h2>
<p>Matthias Dorfer, Florian Henkel and Gerhard Widmer</p>
<p>Apply reinforcement learning to score following by defining an agent that looks at the current section of the sheet music and the audio spectrogram and then decides whether to increase or decrease the current scrolling speed through the sheet.</p>
<h1 id="spectrogram-input-normalisation-for-neural-networks">Spectrogram input normalisation for neural networks (2017-11-29)</h1>
<p>In this post, I want to talk about magnitude spectrograms as inputs and outputs of neural networks, and how to normalise them to help the training process.</p>
<h3 id="introduction-time-frequency-representations-magnitude-spectrogram">Introduction: Time-frequency representations, magnitude spectrogram</h3>
<p>When using neural networks for audio tasks, it is often advantageous to input not the audio waveform directly into the network, but a two-dimensional representation describing the energy in the signal at a particular time and frequency. A popular time-frequency representation which we will also use here is obtained by the <a href="https://en.wikipedia.org/wiki/Short-time_Fourier_transform">short-time Fourier transform (STFT)</a>, where the audio is split into overlapping time frames for which the FFT is computed. From an STFT, we obtain a <strong>spectrogram matrix</strong> $\mathbf{S}$ with F rows (number of frequency bins) and T columns (number of time frames), where each entry is a complex number. We can take the radius and the polar angle of each complex number to decompose the spectrogram matrix into a <strong>magnitude matrix</strong> $\mathbf{M}$ and a <strong>phase matrix</strong> $\mathbf{P}$ so that for each entry we have</p>
\[S_{i,j} = M_{i,j} \cdot e^{\mathrm{i} \cdot P_{i,j}}\]
<p>For many audio tasks, only the magnitudes $\mathbf{M}$ are used and the phases $ \mathbf{P}$ are discarded - when using overlapping windows, $ \mathbf{S}$ can be reconstructed from $ \mathbf{M}$ alone, and the phase tends to have a <a href="http://deepsound.io/dcgan_spectrograms.html">minor impact on sound quality</a>. Here you can see the magnitude and phase for an example song, computed from an STFT with $ N=2048$ samples in each window, and a hop size of 512:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/logspectrogram.png" alt="Log-scaled magnitude spectrogram of an example song" />
<figcaption>Magnitudes $ \mathbf{M}$ on logarithmic scale ($log(x+1)$) of an example song, with frequency on the vertical, and time frames on the horizontal axes.</figcaption>
</figure>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/phasespectrogram.png" alt="Phase spectrogram of the example song" />
<figcaption>Phase matrix $ \mathbf{P}$, each value being an angle in radians</figcaption>
</figure>
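<p>This decomposition can be sketched with SciPy (a minimal, illustrative example: random audio stands in for a real song, and <code>scipy.signal.stft</code> is assumed; window length and hop size follow the values above):</p>

```python
import numpy as np
from scipy.signal import stft

# Random audio standing in for a song (illustrative; 1 s at 44.1 kHz).
rng = np.random.default_rng(0)
audio = rng.uniform(-1.0, 1.0, size=44100)

# STFT with N = 2048 samples per window and a hop size of 512, as in the post.
N, hop = 2048, 512
_, _, S = stft(audio, window="hann", nperseg=N, noverlap=N - hop)

M = np.abs(S)    # magnitude matrix, shape (F, T)
P = np.angle(S)  # phase matrix, each entry an angle in radians

# The complex spectrogram is exactly recovered from magnitude and phase:
# S[i, j] = M[i, j] * exp(i * P[i, j])
assert np.allclose(S, M * np.exp(1j * P))
```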
<h3 id="determining-the-value-range-for-spectrogram-magnitudes">Determining the value range for spectrogram magnitudes</h3>
<p>Neural networks tend to converge faster and more stably when the inputs are normally distributed with a mean close to zero and a bounded variance (e.g. 1) so that with the initial weights, the output is already close to the desired one. Similarly, to allow a neural network to output high-dimensional objects, using an output activation function in the last layer that constrains the output range of the network to the real data range can greatly help training and also prevent invalid network predictions. For example, pixels in images are often reduced to a [0,1] interval, and the network output is fed through a sigmoid nonlinearity whose output domain is (0,1).</p>
<p>For these reasons, we need to know the minimum and maximum possible magnitude value in our spectrograms. Since magnitudes measure the length of a 2D vector (polar coordinates), their minimum value is zero. For the maximum value of any magnitude, we take a look at how the complex Fourier coefficient for frequency k is computed in an N-point FFT, which ends up in the complex-valued spectrogram:</p>
\[X_k = \sum_{n=0}^{N-1}{x_n \cdot \left[\cos\left(\frac{2 \pi k n}{N}\right) - \mathrm{i} \cdot \sin\left(\frac{2 \pi k n}{N}\right)\right]}\]
<p>Since the bracketed complex term is bounded by 1 in magnitude, multiplication by the signal amplitude $x_n$, which lies between -1 and 1, results in a complex number still bounded by 1 in magnitude. Taking the sum, the maximum for $X_k$ is thus N. When using a window, we additionally multiply each element of the sum by a window term $w_n$. In the worst case, each element has magnitude 1, so multiplying by the window elements gives us the sum of all window elements as the maximum.</p>
<p>Therefore, <strong>the range of possible magnitudes resulting from an N-point STFT is</strong>: $[0, \sum_{n=0}^{N-1}{w_n}]$. With a Hann window, this turns out to be exactly $\frac{N}{2}$. Now we know the value range of magnitudes and could therefore normalise them to a desired range, for example [0, 1]. If we want our model to output audio, we can use a sigmoid function as output activation function and apply the inverse of this normalisation step, to accelerate learning and ensure that the network outputs are valid.</p>
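<p>This bound is easy to check numerically. A small sketch (assuming the periodic, "fftbins" Hann variant, whose elements sum to exactly N/2; note that NumPy's symmetric <code>np.hanning</code> sums to (N-1)/2 instead; the magnitude matrix here is random placeholder data):</p>

```python
import numpy as np
from scipy.signal import get_window

# For a periodic Hann window, the sum of the window elements -- and hence
# the maximum possible STFT magnitude -- is exactly N/2.
N = 2048
w = get_window("hann", N)   # periodic ("fftbins") Hann window
max_magnitude = w.sum()
assert abs(max_magnitude - N / 2) < 1e-6

# Normalise illustrative magnitudes into [0, 1], and invert the mapping
# afterwards (as one would for sigmoid network outputs).
rng = np.random.default_rng(0)
M = rng.uniform(0.0, max_magnitude, size=(1025, 100))
M_norm = M / max_magnitude
assert M_norm.min() >= 0.0 and M_norm.max() <= 1.0
assert np.allclose(M_norm * max_magnitude, M)
```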
<h3 id="gaussianisation-to-make-spectral-magnitudes-normally-distributed">Gaussianisation to make spectral magnitudes normally distributed</h3>
<p>However, the overall distribution of magnitude values is very non-Gaussian, since many entries in the spectrogram are close to zero, creating a very skewed (heavy-tailed) distribution which can impede learning:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/spectrohistro.png" alt="Histogram of spectrogram magnitude values" />
<figcaption>Histogram of spectrogram magnitude values, with counts on the vertical axis.</figcaption>
</figure>
<p>As the plot shows, the magnitudes roughly follow a steep exponential distribution. I will show both an easy way and a more accurate but more complex way to make this distribution closer to normal.</p>
<h3 id="1-option-a-logarithmic-transformation">1. Option: A logarithmic transformation</h3>
<p>For a roughly exponential distribution, an obvious idea would be to compute $\log(\mathbf{S})$, a simple and dataset-independent transformation, to stretch out the near-zero values. However, magnitudes can be zero, and log(0) is undefined! What if we add a positive constant c and compute $\log(\mathbf{S} + c)$? The problem here is that the value of c critically influences how much low magnitudes close to zero are expanded during the transformation - small values of c such as 0.001 lead to great expansion, and vice versa. It is also not immediately clear how to set c to best approximate a normal distribution. Despite these problems, the transformation at least compresses large values effectively and is used so often that many computing libraries such as Numpy offer the c=1 version with the shorthand “<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.log1p.html">log1p</a>”, and its inverse “<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.expm1.html">expm1</a>”, defined as $\exp(x) - 1$. For our example, we see that c=1 does not change the shape of the distribution much at all, while $c=10^{-7}$ gets us close to a normal distribution:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/logspectrohisto.png" alt="Histogram of magnitude values after the log1p transformation" />
<figcaption>Applying $log1p = \log(x+1)$ (c=1) to the magnitude values does not change the shape of the distribution much, but constrains the input values to a much smaller range, and enlarges differences between small values.</figcaption>
</figure>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/loggspectrohisto.png" alt="Histogram of magnitude values after the log(x + 1e-7) transformation" />
<figcaption>Applying $\log(x + 10^{-7})$ ($c=10^{-7}$) to the magnitude values almost gives us a normally distributed variable, leaving only a small additional peak at around -8.</figcaption>
</figure>
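<p>Both variants can be sketched with NumPy (illustrative, roughly exponential magnitude values stand in for a real spectrogram here):</p>

```python
import numpy as np

# Roughly exponential values, like the magnitude histogram above (illustrative).
rng = np.random.default_rng(0)
M = rng.exponential(scale=0.05, size=100000)

compressed = np.log1p(M)     # c = 1: compresses large values, mild expansion near zero
expanded = np.log(M + 1e-7)  # c = 1e-7: strongly expands near-zero values

# Both transforms are exactly invertible, which matters when mapping
# network outputs back to magnitudes.
assert np.allclose(np.expm1(compressed), M)
assert np.allclose(np.exp(expanded) - 1e-7, M)
```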
<p>In general, we see that the closer c is to zero, the more small values are expanded. But how do we set this factor of expansion to make values as close to normally distributed as possible? This is where the Box-Cox transformation comes in!</p>
<h3 id="2-option-transformation-with-box-cox">2. Option: Transformation with Box-Cox</h3>
<p>A more advanced method for Gaussianisation is the <a href="https://en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation"><strong>Box-Cox-Transformation</strong></a>. It comes in two variants: With only one, or with two parameters. We will need the two-parameter version, as it can handle zero values which can occur in our spectrogram. With parameters $ \lambda_1$ and $ \lambda_2$ the Box-Cox transformation is defined as</p>
\[y_i^{(\boldsymbol{\lambda})} =
\begin{cases}
\dfrac{(y_i + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0, \\
\ln{(y_i + \lambda_2)} & \text{if } \lambda_1 = 0.
\end{cases}\]
<p>Upon closer inspection, we can see similarities to our first option, the logarithmic transformation. $ \lambda_2$ serves the same purpose as the constant c: Making the values non-zero so we can apply further transformations. For $ \lambda_1 = 0$, Box-Cox is even equivalent to our first method and can be seen as an extension of it! The important difference is that <strong>the parameters are estimated from data</strong> so that the resulting distribution is as close to normally distributed as possible, which is more accurate than a simple log transform with a predefined constant c. I have not come across an implementation that estimates both parameters, though. With the boxcox method from Scipy, we can only estimate $ \lambda_1$, so for our audio example we use $ \lambda_2 = 10^{-7}$ just like in our first method. The best parameter found is $ \lambda_1 = 0.043$, which is very close to zero and therefore a very similar transformation:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/boxcoxhisto.png" alt="Histogram of magnitude values after the Box-Cox transformation" />
<figcaption>Histogram of magnitude values after Box-Cox transformation with $\lambda_1 = 0.043, \lambda_2 = 10^{-7}$. The tails of the distribution are more symmetrical and the peak is more centered between them, but the additional peak remains.</figcaption>
</figure>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/boxcoxspectro.png" alt="Box-Cox-transformed magnitude spectrogram" />
<figcaption>Box-cox-transformed magnitude spectrogram. The structures in the higher frequency ranges are now more easily visible, while the fact that lower frequencies have higher energy is less emphasized.</figcaption>
</figure>
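<p>The fitting procedure can be sketched with SciPy under the assumptions above: <code>scipy.stats.boxcox</code> estimates only $\lambda_1$ by maximum likelihood, so $\lambda_2$ is fixed to $10^{-7}$ as in the post, and the magnitude values are illustrative:</p>

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Roughly exponential values standing in for spectrogram magnitudes (illustrative).
rng = np.random.default_rng(0)
M = rng.exponential(scale=0.05, size=10000)

# Shift by a fixed lambda_2 so all values are strictly positive, then let
# scipy estimate lambda_1 (passing no lmbda triggers the ML estimate).
lambda_2 = 1e-7
transformed, lambda_1 = boxcox(M + lambda_2)

# Invert: undo the Box-Cox transform, then subtract the fixed offset again.
recovered = inv_boxcox(transformed, lambda_1) - lambda_2
assert np.allclose(recovered, M)
```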
<h2 id="summary">Summary</h2>
<p>Magnitude spectrograms are tricky to use as input or output of neural networks due to their very skewed, non-normal distribution of values, and because their maximum possible value, which is needed to scale the value range to a desired interval, is not obvious at first glance. The latter is especially important for neural networks that output magnitudes directly, whose training can work better with an output activation function that restricts the network output range to valid values.</p>
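<p>The overall normalisation pipeline can be sketched end-to-end (a minimal sketch assuming log1p as the transform and a maximum magnitude of N/2 = 1024 for a Hann-window STFT with N = 2048; not the post's exact code):</p>

```python
import numpy as np

def normalise(M, max_magnitude=1024.0):
    """Map magnitudes in [0, max_magnitude] into [0, 1] via log1p."""
    return np.log1p(M) / np.log1p(max_magnitude)

def denormalise(y, max_magnitude=1024.0):
    """Inverse mapping, e.g. for sigmoid network outputs."""
    return np.expm1(y * np.log1p(max_magnitude))

M = np.array([0.0, 0.01, 1.0, 512.0, 1024.0])
y = normalise(M)
assert y.min() >= 0.0 and y.max() <= 1.0
assert np.allclose(denormalise(y), M)
```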
<p>The solution is input normalisation: <strong>First, transform the values</strong> either with a simple $x \rightarrow \log(x+1)$ (first option) or a Box-Cox transformation (second option, more advanced), which should expand low values and compress high ones, <strong>making the distribution more Gaussian</strong>. <strong>Then bring the transformed values into the desired interval</strong>. For this we calculate which values 0 and the maximum magnitude are transformed into after applying our particular Gaussianisation, and use this to scale the values to the desired interval.</p>
<h1 id="tensorflow-lstm-for-language-modelling">Tensorflow LSTM for Language Modelling (2016-11-12)</h1>
<p>In this post, I will show you how to build an LSTM network for the task of character-based language modelling (predict the next character based on the previous ones), and apply it to generate lyrics. In general, the model allows you to generate new text as well as auto-complete your own sentences based on any given text database.</p>
<h2 id="the-code">The code</h2>
<p>The code is available <a href="https://github.com/f90/Tensorflow-Char-LSTM/tree/master">here</a>, and the lyrics database can be created by running the crawler I developed <a href="https://github.com/f90/lyrics-crawler">here</a>.</p>
<h2 id="want-to-see-what-lyrics-my-model-composes-click-here">Want to see what lyrics my model composes? <a href="#example-output">Click here!</a></h2>
<p>It is implemented in Tensorflow, which has been rapidly evolving in the last few months. As a result, best practices for common tasks are changing as well. That is why I built my own code to make the best use of <a href="https://www.tensorflow.org/versions/r0.12/how_tos/threading_and_queues/index.html">Queues</a> and new functionality from the <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/contrib.training.md">training-contrib</a> package. In particular, implementing batch training with variable-length sequences on an unrolled LSTM with truncated BPTT, while maintaining the hidden state for each sequence, is greatly simplified and optimised. Without further ado, let’s get started!</p>
<h1 id="input-pipeline">Input pipeline</h1>
<p>First, we are going to load the dataset. I adapted this a bit to my case, but it should be easy for you to change it to your liking:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data, vocab = Dataset.readLyrics(data_settings["input_csv"], data_settings["input_vocab"])
trainIndices, testIndices = Dataset.createPartition(data, train_settings["trainPerc"])
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Dataset</code> is a helper file that manages dataset-related operations. Here I assume <code class="language-plaintext highlighter-rouge">data</code> has the form of a list of entries, each again a list with two entries: the first entry denotes the artist and the second the lyrics content. The first entry is used to prevent having the same artist in both training and test set. <code class="language-plaintext highlighter-rouge">trainIndices</code> and <code class="language-plaintext highlighter-rouge">testIndices</code> are lists of indices referring to the rows in <code class="language-plaintext highlighter-rouge">data</code> that correspond to the training and test set, respectively. The settings variables are just dictionaries that hold user-defined settings. <code class="language-plaintext highlighter-rouge">vocab</code> is a special Vocabulary object that translates characters into integer indices and vice versa, and we will get to know its functionality better along the way.</p>
<p>For creating the batches to train on, we will use <code class="language-plaintext highlighter-rouge">batch_sequences_with_states</code>, as it is very convenient to use. It requires a key, the sequence length, and input and output for a SINGLE sequence in symbolic form. We create these properties here as placeholders, to later feed them from our own input thread. This design makes input processing very fast and ensures it does not reduce overall training speed: model training and input processing run simultaneously in different threads.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>keyInput = tf.placeholder(tf.string) # To identify each sequence
lengthInput = tf.placeholder(tf.int32) # Length of sequence
seqInput = tf.placeholder(tf.int32, shape=[None]) # Input sequence
seqOutput = tf.placeholder(tf.int32, shape=[None]) # Output sequence
</code></pre></div></div>
<p>Then we create a <code class="language-plaintext highlighter-rouge">RandomShuffleQueue</code> and the enqueue and dequeue operations, which means sequences will be randomly selected from the queue to form batches during training. This presents an effective compromise between completely random sample selection, which is often slow for large datasets as data has to be pulled from very different memory locations, and completely sequential reading. Using the dictionaries ensures compatibility with the Tensorflow sequence format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>q = tf.RandomShuffleQueue(input_settings["queue_capacity"], input_settings["min_queue_capacity"],
[tf.string, tf.int32, tf.int32, tf.int32])
enqueue_op = q.enqueue([keyInput, lengthInput, seqInput, seqOutput])
with tf.device("/cpu:0"):
key, contextT, sequenceIn, sequenceOut = q.dequeue()
context = {"length" : tf.reshape(contextT, [])}
sequences = {"inputs" : tf.reshape(sequenceIn, [contextT]),
"outputs" : tf.reshape(sequenceOut, [contextT])}
</code></pre></div></div>
<p>Instead of using the built-in CSV or TFRecord Readers to enqueue samples, I created my own method that can read directly from the <code class="language-plaintext highlighter-rouge">data</code> in the RAM. It endlessly loops over the samples given by <code class="language-plaintext highlighter-rouge">indices</code> and adds them to the queue. It can easily be adapted to read arbitrary files/parts of datasets and perform further preprocessing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Enqueueing method in different thread, loading sequence examples and feeding into FIFO Queue
def load_and_enqueue(indices):
run = True
key = 0 # Unique key for every sample, even over multiple epochs (otherwise the queue could be filled up with two same-key examples)
while run:
for index in indices:
current_seq = data[index][1]
try:
sess.run(enqueue_op, feed_dict={keyInput: str(key),
lengthInput: len(current_seq)-1,
seqInput: current_seq[:-1],
seqOutput: current_seq[1:]},
options=tf.RunOptions(timeout_in_ms=60000))
except tf.errors.DeadlineExceededError as e:
print("Timeout while waiting to enqueue into input queue! Stopping input queue thread!")
run = False
break
key += 1
    print("Finished enqueueing all " + str(len(indices)) + " samples!")
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">load_and_enqueue</code> will be started as a separate thread later. Two important things to note here: One is the <code class="language-plaintext highlighter-rouge">key</code>, which is different even for repeated enqueues of the same sample, to avoid errors caused by clashing keys when the same sample accidentally ends up in the queue twice. The other is the timeout, which is the only way I found to stop training: close the queue, then catch the resulting <code class="language-plaintext highlighter-rouge">DeadlineExceededError</code>. Lastly, input and output are shifted against each other by one step to force the model to predict the upcoming character.</p>
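<p>The one-step shift at the end of <code class="language-plaintext highlighter-rouge">load_and_enqueue</code> can be illustrated in plain Python (a toy sketch with made-up character indices, not the post's code):</p>

```python
# Toy integer indices standing in for an encoded character sequence.
sequence = [5, 12, 9, 9, 22]

inputs = sequence[:-1]       # the network sees all characters but the last
targets = sequence[1:]       # ...and must predict the sequence shifted by one
length = len(sequence) - 1   # the lengthInput fed to the queue

# At every position t, the target is the character following the input.
assert all(targets[t] == sequence[t + 1] for t in range(length))
```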
<h1 id="the-lstm-model">The LSTM model</h1>
<p>Our model is an LSTM with a variable number of layers and configurable dropout, whose hidden states will be maintained separately for each sequence. I created a new class <code class="language-plaintext highlighter-rouge">LyricsPredictor</code> with an inference method that builds the computational graph. The beginning looks like this and sets up the RNN cells:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def inference(self, key, context, sequences, num_enqueue_threads):
# RNN cells and states
cells = list()
initial_states = dict()
for i in range(0, self.num_layers):
cell = tf.contrib.rnn.LSTMBlockCell(num_units=self.lstm_size) # Block LSTM version gives better performance #TODO Add linear projection option
cell = tf.nn.rnn_cell.DropoutWrapper(cell,input_keep_prob=1-self.input_dropout, output_keep_prob=1-self.output_dropout)
cells.append(cell)
initial_states["lstm_state_c_" + str(i)] = tf.zeros(cell.state_size[0], dtype=tf.float32)
initial_states["lstm_state_h_" + str(i)] = tf.zeros(cell.state_size[1], dtype=tf.float32)
cell = tf.nn.rnn_cell.MultiRNNCell(cells)
[...]
</code></pre></div></div>
<p>It receives the key, context, and content of a sequence, and how many threads should be used to dequeue from the <code class="language-plaintext highlighter-rouge">RandomShuffleQueue</code> to provide input to the model. I found that <code class="language-plaintext highlighter-rouge">LSTMBlockCell</code> works slightly faster than the normal <code class="language-plaintext highlighter-rouge">LSTMCell</code> class. Now comes the neat bit: we can use the following code to let Tensorflow form batches from these sequences after splitting them up into chunks according to the unroll length. These batches also come with a context that ensures the hidden state is carried over from the last chunk of a sequence to the next, sparing us a lot of hassle trying to implement and optimise that on our own.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># BATCH INPUT
self.batch = tf.contrib.training.batch_sequences_with_states(
input_key=key,
input_sequences=sequences,
input_context=context,
input_length=tf.cast(context["length"], tf.int32),
initial_states=initial_states,
num_unroll=self.num_unroll,
batch_size=self.batch_size,
num_threads=num_enqueue_threads,
capacity=self.batch_size * num_enqueue_threads * 2)
inputs = self.batch.sequences["inputs"]
targets = self.batch.sequences["outputs"]
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">inputs</code> and <code class="language-plaintext highlighter-rouge">targets</code> are part of the resulting batch and are formed at runtime. New sequences are pulled in as soon as some in the batch are finished. In the following, the inputs are transformed from integer indices into one-hot vectors. Then, they are reshaped from a [batch_size, unroll_length, vocab_size] tensor into a list of unroll_length tensors of shape [batch_size, vocab_size], to conform with the RNN interface. Finally, we can use state_saving_rnn with the state-saving batch we created beforehand to get our outputs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Convert input into one-hot representation (from single integers indicating character)
print(self.vocab_size)
embedding = tf.constant(np.eye(self.vocab_size), dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, inputs)
# Reshape inputs (and targets respectively) into list of length T (unrolling length), with each element being a Tensor of shape (batch_size, input_dimensionality)
inputs_by_time = tf.split(1, self.num_unroll, inputs)
inputs_by_time = [tf.squeeze(elem, squeeze_dims=1) for elem in inputs_by_time]
targets_by_time = tf.split(1, self.num_unroll, targets)
targets_by_time = [tf.squeeze(elem, squeeze_dims=1) for elem in targets_by_time] # num_unroll-list of (batch_size) tensors
self.targets_by_time_packed = tf.pack(targets_by_time) # (num_unroll, batch_size)
# Build RNN
state_name = initial_states.keys()
self.seq_lengths = self.batch.context["length"]
(self.outputs, state) = tf.nn.state_saving_rnn(cell, inputs_by_time, state_saver=self.batch,
sequence_length=self.seq_lengths, state_name=state_name, scope='SSRNN')
</code></pre></div></div>
<p>Here we have to be careful: while loading the data, a special end-of-sequence token is appended to every sequence so the network learns when to stop generating lyrics, and zero is not used as an index for any character, since the zero entry is needed later to mask the outputs when computing the loss. So if there are N distinct characters, <code class="language-plaintext highlighter-rouge">self.vocab_size</code> is N+2. Finally, we put a softmax on top of the outputs by iterating through the list of length num_unroll in which each entry represents one timestep, and return logits and probabilities:</p>
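<p>A toy sketch of this index layout (hypothetical; the actual Vocabulary class lives in the linked repository): index 0 is reserved for padding, 1..N for the N characters, and N+1 for the end-of-sequence token, giving a vocab size of N+2.</p>

```python
# Build a toy vocabulary over the distinct characters of a small corpus.
chars = sorted(set("hello world"))                       # the N distinct characters
char_to_index = {c: i + 1 for i, c in enumerate(chars)}  # 1..N; 0 kept for padding
EOS = len(chars) + 1                                     # end-of-sequence token
vocab_size = len(chars) + 2                              # N characters + padding + EOS

# Encode a sequence and append the end-of-sequence token.
encoded = [char_to_index[c] for c in "hello"] + [EOS]

assert 0 not in encoded            # zero only ever marks padding
assert encoded[-1] == EOS
```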
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create softmax parameters, weights and bias, and apply to RNN outputs at each timestep
with tf.variable_scope('softmax'):
    softmax_w = tf.get_variable("softmax_w", [self.lstm_size, self.vocab_size])
    softmax_b = tf.get_variable("softmax_b", [self.vocab_size])
logits = [tf.matmul(outputStep, softmax_w) + softmax_b for outputStep in self.outputs]
self.logit = tf.pack(logits)
self.probs = tf.nn.softmax(self.logit)
tf.summary.histogram("probabilities", self.probs)
return (self.logit, self.probs)
</code></pre></div></div>
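<p>To make the index convention concrete, here is a standalone NumPy sketch (toy numbers, independent of the model code) of the one-hot lookup and the N+2 vocabulary size:</p>

```python
import numpy as np

# Index convention from above, for a toy alphabet of N = 3 characters:
# 0 is reserved for padding, 1..N are real characters, N+1 is the EOS token,
# so the one-hot table has vocab_size = N + 2 rows.
N = 3
vocab_size = N + 2
embedding = np.eye(vocab_size, dtype=np.float32)

sequence = [2, 1, 3, N + 1]    # "char2 char1 char3 <EOS>"
one_hot = embedding[sequence]  # same effect as tf.nn.embedding_lookup
print(one_hot.shape)  # (4, 5)
```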
<p>To train the model, we also need a loss function. Here we exploit the fact that index 0 is never used as a target: any zero entry in the target list must come from the zero-padding and indicates that the sequence is already over. The following code uses this to mask the loss, considering only non-zero targets. We also add L2 regularisation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def loss(self, l2_regularisation):
    with tf.name_scope('loss'):
        # Compute mean cross entropy loss for each output.
        self.cross_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(self.logit, self.targets_by_time_packed) # (num_unroll, batch_size)
        # Mask losses of outputs at positions t which are outside the length of the respective sequence, so they are not used for backprop
        # Take signum => if target is non-zero (valid char), mask is 1 (valid output), otherwise 0 (padding, no gradient/loss calculation)
        mask = tf.sign(tf.abs(tf.cast(self.targets_by_time_packed, dtype=tf.float32))) # (num_unroll, batch_size), entries in {0,1}
        self.cross_loss = self.cross_loss * mask
        output_num = tf.reduce_sum(mask)
        sum_cross_loss = tf.reduce_sum(self.cross_loss)
        mean_cross_loss = sum_cross_loss / output_num # Sum over masked losses, divided by total number of valid outputs
        # L2 regularisation over all trainable variables
        vars = tf.trainable_variables()
        l2_loss = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(l2_regularisation), weights_list=vars)
        loss = mean_cross_loss + l2_loss
        tf.summary.scalar('mean_batch_cross_entropy_loss', mean_cross_loss)
        tf.summary.scalar('mean_batch_loss', loss)
        return loss, mean_cross_loss, sum_cross_loss, output_num
</code></pre></div></div>
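<p>The masking trick can be illustrated with a standalone NumPy sketch (toy values, not the actual model code):</p>

```python
import numpy as np

# Index 0 marks zero-padding, so sign(|target|) is 1 for valid characters
# and 0 for padding positions.
targets = np.array([[5, 2, 9, 0],
                    [3, 0, 0, 0]])           # 2 sequences, unrolled over 4 steps
per_step_loss = np.full(targets.shape, 2.0)  # stand-in for the cross-entropy values

mask = np.sign(np.abs(targets)).astype(np.float64)
masked_loss = per_step_loss * mask
mean_loss = masked_loss.sum() / mask.sum()   # average only over the 4 valid positions
print(mean_loss)  # 2.0 - the padding does not dilute the mean
```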
<h1 id="training-the-model">Training the model</h1>
<p>Now that we set up the model, we need to train it!</p>
<h2 id="set-up-symbolic-training-operations">Set up symbolic training operations</h2>
<p>First we need the necessary symbolic operations. We set up a step counter and a learning rate variable that decays exponentially depending on the current step:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global_step = tf.get_variable('global_step', [],
                              initializer=tf.constant_initializer(0.0))
# Learning rate
initial_learning_rate = tf.constant(train_settings["learning_rate"])
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, train_settings["learning_rate_decay_epoch"], train_settings["learning_rate_decay_factor"])
tf.summary.scalar("learning_rate", learning_rate)
</code></pre></div></div>
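<p>For intuition, the decay schedule computes <code class="language-plaintext highlighter-rouge">lr = initial * decay_factor ** (step / decay_steps)</code>; a quick sketch with made-up settings values:</p>

```python
# Made-up stand-ins for the train_settings values used above.
initial_lr = 0.001
decay_steps = 1000    # train_settings["learning_rate_decay_epoch"]
decay_factor = 0.9    # train_settings["learning_rate_decay_factor"]

def decayed_lr(step):
    # Same formula as tf.train.exponential_decay (non-staircase mode)
    return initial_lr * decay_factor ** (step / decay_steps)

print(decayed_lr(0))     # 0.001
print(decayed_lr(2000))  # two full decay periods: 0.001 * 0.81
```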
<p>Then we calculate the gradients, clip them by their global norm, and log them for visualisation in Tensorboard - useful for spotting vanishing or exploding gradients in deeper RNNs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Gradient calculation
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars, aggregation_method=2), # Use experimental aggregation to reduce memory usage
                                  5.0)
# Visualise gradients
vis_grads = [0 if i is None else i for i in grads]
for g in vis_grads:
    tf.summary.histogram("gradients_" + str(g), g)
</code></pre></div></div>
<p>We have to replace None entries with 0 for visualisation. Then we choose an optimiser (Adam in this case) and define the training operations.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, tvars),
                                     global_step=global_step)
trainOps = [loss, train_op,
            global_step, learning_rate]
</code></pre></div></div>
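<p>For illustration, here is a NumPy sketch of what clipping by global norm does (not TensorFlow's actual implementation, but the same formula):</p>

```python
import numpy as np

# If the combined L2 norm of all gradients exceeds the threshold,
# every gradient is scaled down jointly by the same factor.
def clip_by_global_norm(grads, clip_norm):
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)  # scale == 1 when under the threshold
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = sqrt(9+16+144) = 13
clipped, norm = clip_by_global_norm(grads, 5.0)
print(norm)  # 13.0
```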
<p>Now we are ready to execute!</p>
<h2 id="performing-training">Performing training</h2>
<p>Start up a session and the QueueRunners associated with it. In our case, these are associated with the RNN input queue (NOT our own RandomShuffleQueue).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Start session
sess = tf.Session()
coord = tf.train.Coordinator()
init_op = tf.global_variables_initializer()
sess.run(init_op)
tf_threads = tf.train.start_queue_runners(sess=sess, coord=coord)
</code></pre></div></div>
<p>In case we crashed or want to import a (partly) trained model for other reasons such as fine-tuning, we check for previous model checkpoints to load the model parameters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># CHECKPOINTING
# TODO: save model directly after every epoch, so that we can safely refill the queues after loading a model (uniform sampling of dataset is still ensured)
# Load pretrained model to continue training, if it exists
latestCheckpoint = tf.train.latest_checkpoint(train_settings["checkpoint_dir"])
if latestCheckpoint is not None:
    restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
    restorer.restore(sess, latestCheckpoint)
    print('Pre-trained model restored')
saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
</code></pre></div></div>
<p>Now we start a thread running our custom <code class="language-plaintext highlighter-rouge">load_and_enqueue</code> method from earlier, which reads from <code class="language-plaintext highlighter-rouge">data</code> and enqueues the sequences into the <code class="language-plaintext highlighter-rouge">RandomShuffleQueue</code>. Preprocessing and data loading are better done on the CPU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Start a thread to enqueue data asynchronously to decouple data I/O from training
with tf.device("/cpu:0"):
    t = threading.Thread(target=load_and_enqueue, args=[trainIndices])
    t.start()
</code></pre></div></div>
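<p>The producer/consumer pattern itself can be sketched in plain Python, with the standard library's <code class="language-plaintext highlighter-rouge">queue.Queue</code> standing in for TensorFlow's RandomShuffleQueue (illustrative only):</p>

```python
import queue
import threading

# A background thread fills the queue while the main thread consumes from it.
q_demo = queue.Queue(maxsize=8)

def load_and_enqueue_demo(indices):
    for index in indices:
        q_demo.put(index)  # blocks whenever the queue is full
    q_demo.put(None)       # sentinel signalling the end of the data

t = threading.Thread(target=load_and_enqueue_demo, args=([0, 1, 2],))
t.start()

consumed = []
while True:
    item = q_demo.get()
    if item is None:
        break
    consumed.append(item)
t.join()
print(consumed)  # [0, 1, 2]
```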
<p>We can set up some logging functions so we can nicely visualise statistics with Tensorboard:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># LOGGING
# Add histograms for trainable variables.
histograms = [tf.summary.histogram(var.op.name, var) for var in tf.trainable_variables()]
summary_op = tf.summary.merge_all()
# Create summary writer
summary_writer = tf.summary.FileWriter(train_settings["log_dir"], sess.graph.as_graph_def(add_shapes=True))
</code></pre></div></div>
<p>Now the training loop runs the training and summary operations, and writes summaries and periodically also model checkpoints to save progress:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>current_time = time.time()
loops = 0
while loops < train_settings["max_iterations"]:
    loops += 1
    [res_loss, _, res_global_step, res_learning_rate, summary] = sess.run(trainOps + [summary_op])
    new_time = time.time()
    print("Chars per second: " + str(float(model_settings["batch_size"] * model_settings["num_unroll"]) / (new_time - current_time)))
    current_time = new_time
    print("Loss: " + str(res_loss) + ", Learning rate: " + str(res_learning_rate) + ", Step: " + str(res_global_step))
    # Write summaries for this step
    summary_writer.add_summary(summary, global_step=int(res_global_step))
    if res_global_step % train_settings["save_model_epoch_frequency"] == 0:
        print("Saving model...")
        saver.save(sess, train_settings["checkpoint_path"], global_step=int(res_global_step))
</code></pre></div></div>
<p>After the maximum desired number of iterations has been reached (or some other criterion), we stop:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Stop our custom input thread
print("Stopping custom input thread")
sess.run(q.close()) # Then close the input queue
t.join(timeout=1)
# Close session, clear computational graph
sess.close()
tf.reset_default_graph()
</code></pre></div></div>
<h1 id="testing">Testing</h1>
<p>After training, we want to evaluate the performance on the test set. The code for this looks similar, so I will only show the differences here. We load the trained model:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># CHECKPOINTING
# Load pretrained model to test
latestCheckpoint = tf.train.latest_checkpoint(train_settings["checkpoint_dir"])
restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, latestCheckpoint)
</code></pre></div></div>
<p>In our custom input queue thread, we close the queue after enqueueing all test samples so the process is stopped after seeing each example exactly once:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Enqueueing method running in a different thread, loading sequence examples and feeding them into the queue
def load_and_enqueue(indices):
    for index in indices:
        current_seq = data[index][1]
        sess.run(enqueue_op, feed_dict={keyInput: str(index),
                                        lengthInput: len(current_seq) - 1,
                                        seqInput: current_seq[:-1],
                                        seqOutput: current_seq[1:]})
    print("Finished enqueueing all " + str(len(indices)) + " samples!")
    sess.run(q.close())
</code></pre></div></div>
<p>The test loop now uses the total cross-entropy returned from our model class along with the number of valid output positions to compute a bit-per-character metric:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inferenceOps = [loss, mean_cross_loss, sum_cross_loss, output_num]
current_time = time.time()
logprob_sum = 0.0
character_sum = 0
iteration = 0
while True:
    try:
        [l, mcl, scl, nb, summary] = sess.run(inferenceOps + [summary_op])
    except tf.errors.OutOfRangeError:
        print("Finished testing!")
        break
    new_time = time.time()
    print("Chars per second: " + str(
        float(model_settings["batch_size"] * model_settings["num_unroll"]) / (new_time - current_time)))
    current_time = new_time
    logprob_sum += scl # Accumulate summed cross-entropy over all valid characters (in nats, as TF uses the natural logarithm)
    character_sum += nb # Add up how many characters were in the batch
    print(l, mcl, scl)
    summary_writer.add_summary(summary, global_step=int(iteration))
    iteration += 1
print("Bit-per-character: " + str(logprob_sum / character_sum / np.log(2))) # Divide by ln(2) to convert nats to bits
</code></pre></div></div>
<p>Note here that we catch an <code class="language-plaintext highlighter-rouge">OutOfRangeError</code> that is thrown as soon as our input queue is empty after the input thread finishes and closes it.</p>
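<p>One subtlety when reporting bits: TensorFlow's softmax cross-entropy uses the natural logarithm, so the accumulated sum is measured in nats and should be divided by ln 2 to obtain bits per character. A standalone sketch with made-up numbers:</p>

```python
import math

# nats -> bits conversion for the bit-per-character metric.
total_cross_entropy_nats = 140.0   # made-up stand-in for the accumulated sum_cross_loss
num_valid_chars = 100              # made-up stand-in for the accumulated output_num

nats_per_char = total_cross_entropy_nats / num_valid_chars
bits_per_char = nats_per_char / math.log(2)
print(round(bits_per_char, 2))  # 2.02
```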
<h1 id="sampling">Sampling</h1>
<p>Unfortunately, sampling as a use case is very different from the train and test setting:</p>
<ul>
<li>We want to consider a single sequence, not a whole batch</li>
<li>The output of the RNN at the current time step is used as input for the next, which makes static unrolling inappropriate</li>
<li>We cannot in parallel evaluate multiple timesteps, but need to keep the hidden state after each input to feed back into the model (rendering the previously used state saver concepts cumbersome)</li>
</ul>
<p>Therefore, I did it the hard way and maintained the RNN states myself. This requires setting up placeholders for the states manually to be able to feed values in during sampling, and defining the initial zero states to use for the prediction of the first character:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load vocab
vocab = Vocabulary.load(data_settings["input_vocab"])
# INPUT PIPELINE
input = tf.placeholder(tf.int32, shape=[None], name="input") # Integers representing characters
# Create state placeholders - 2 for each lstm cell.
state_placeholders = list()
initial_states = list()
for i in range(0, model_settings["num_layers"]):
    state_placeholders.append(tuple([tf.placeholder(tf.float32, shape=[1, model_settings["lstm_size"]], name="lstm_state_c_" + str(i)), # Batch size x State size
                                     tf.placeholder(tf.float32, shape=[1, model_settings["lstm_size"]], name="lstm_state_h_" + str(i))])) # Batch size x State size
    initial_states.append(tuple([np.zeros(shape=[1, model_settings["lstm_size"]], dtype=np.float32),
                                 np.zeros(shape=[1, model_settings["lstm_size"]], dtype=np.float32)]))
state_placeholders = tuple(state_placeholders)
initial_states = tuple(initial_states)
</code></pre></div></div>
<p>The states are represented as nested tuples in TensorFlow. The model itself also has to be adapted accordingly. We use a batch size and unroll length of 1, so we predict exactly one character at a time, feeding in the input along with the state placeholders:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># MODEL
inference_settings = model_settings
inference_settings["batch_size"] = 1 # Only sample from one example simultaneously
inference_settings["num_unroll"] = 1 # Only sample one character at a time
model = LyricsPredictor(inference_settings, vocab.size + 1) # Include EOS token
probs, state = model.sample(input, state_placeholders)
</code></pre></div></div>
<p>This time, we use the <code class="language-plaintext highlighter-rouge">sample</code> method from the <code class="language-plaintext highlighter-rouge">LyricsPredictor</code> class to build the required computational graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def sample(self, input, current_state):
    # RNN cells and states
    cells = list()
    for i in range(0, self.num_layers):
        cell = tf.contrib.rnn.LSTMBlockCell(num_units=self.lstm_size) # Block LSTM version gives better performance #TODO Add linear projection option
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, 1.0, 1.0) # No dropout during sampling
        cells.append(cell)
    cell = tf.nn.rnn_cell.MultiRNNCell(cells)
    self.initial_states = cell.zero_state(batch_size=1, dtype=tf.float32)

    # Convert input into one-hot representation (from single integers indicating the character)
    embedding = tf.constant(np.eye(self.vocab_size), dtype=tf.float32)
    input = tf.nn.embedding_lookup(embedding, input) # 1 x Vocab-size
    inputs_by_time = [input] # List of 1 x Vocab-size tensors (just one element, because we use sequence length 1)
    self.outputs, state = tf.nn.rnn(cell, inputs_by_time, initial_state=current_state, scope='SSRNN')
</code></pre></div></div>
<p>Crucially, we set the scope when setting up the RNN to the same that was used when building the model during training and testing, so that when we load the checkpoint, the RNN variables are set up correctly. Afterwards, the softmax is applied as shown earlier. The function returns the probabilities and the resulting LSTM state after processing the input character, which we store in the following.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inference = [probs, state]
current_seq = "never" # This can be any alphanumeric text
current_seq_ind = vocab.char2index(current_seq)
# Warm up RNN with initial sequence
s = initial_states
for ind in current_seq_ind:
    # Create the feed dict for the states
    feed = dict()
    for i in range(0, model_settings["num_layers"]):
        for c in range(0, len(s[i])):
            feed[state_placeholders[i][c]] = s[i][c]
    feed[input] = [ind] # Add the new input symbol to the feed
    [p, s] = sess.run(inference, feed_dict=feed)
</code></pre></div></div>
<p>In the above code, we set an initial sequence (“never”) and prepare the LSTM to continue the lyrics (e.g. “never gonna give you up”) by feeding in one character after another and carrying over the states. These are nested tuples, organised according to layers, each with a cell and a hidden state (this is due to the LSTM structure). The hidden state now hopefully captures meaningful information about the input text <code class="language-plaintext highlighter-rouge">current_seq</code>, so we can take the current prediction probabilities and sample from them to generate the next character, feed it into the network, and repeat that process until we receive the special end token signalling that the LSTM is finished with the “creative process”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sample until we receive an end-of-lyrics token
iteration = 0
while iteration < 100000: # Just a safety measure in case the model does not stop
    # p contains the probability of the upcoming char, as estimated by the model, and s the last RNN state
    ind_sample = np.random.choice(range(0, vocab.size + 1), p=np.squeeze(p))
    if ind_sample == vocab.size: # EOS token
        print("Model decided to stop generating!")
        break
    current_seq_ind.append(ind_sample)
    # Create the feed dict for the states
    feed = dict()
    for i in range(0, model_settings["num_layers"]):
        for c in range(0, len(s[i])):
            feed[state_placeholders[i][c]] = s[i][c]
    feed[input] = [ind_sample] # Add the new input symbol to the feed
    [p, s] = sess.run(inference, feed_dict=feed)
    iteration += 1
</code></pre></div></div>
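<p>A common extension of this sampling step (not part of the original code) is a temperature parameter that controls how adventurous the model's choices are:</p>

```python
import numpy as np

# Hypothetical extension of the sampling step above: a temperature T sharpens
# (T < 1) or flattens (T > 1) the predicted distribution before sampling.
def apply_temperature(p, temperature):
    logits = np.log(np.asarray(p, dtype=np.float64) + 1e-12) / temperature
    q = np.exp(logits - logits.max())  # subtract max for numerical stability
    return q / q.sum()

p = [0.1, 0.7, 0.2]                # model output for three characters
q = apply_temperature(p, 0.5)      # sharpened towards the most likely char
ind_sample = np.random.choice(len(q), p=q)
```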
<p>Finally, we convert the generated list of integer indices to their character representation, and print out the result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c_sample = vocab.index2char(current_seq_ind)
print("".join(c_sample))
sess.close()
</code></pre></div></div>
<h2 id="example-output">Example output</h2>
<p>And now the fun starts! Feel free to extend the model to your liking. Want to see what my own model generates? Here is the output of a two-layer LSTM with 512 hidden units per layer and 0.2 output dropout, trained for only two hours on Metrolyrics text, when told to start with “never”:</p>
<blockquote>
<p>never yet in you but that letters know a stobalal in you on the brink so to the victory no matter what i might understand the sun where i am with all this phon people theyll get my knife off a girl that it thats forsaken just smiling still welcome to me<br />
its a gangsta good times is like a fesire then im holding fantastine is on though we bring it out to who burn today well make all the lights in his face im here so bright<br />
sos we do what we do we dont know where we harm<br />
and every time we get you now dont need rewith nothing yeah you dont want a sip or just look at dont make it on the 5 dirty doubt then most name yeah dont know about it i know and you dont wanna play with her no no no<br />
come on ah yeah yeah women make you bury tight around rising in stop up in the top looking out of the middle of the sea<br />
youre not drunk and im not in real hard to hit you all around cover up tune whats not there how to make you cry so long your cut money rolling around and the storm ignite it youre peacent burn so fast blue no fading<br />
two number on the home was we praying of happy for your respect a death a lip another day and style niggas keep that an internuted at leven but you was the way you fall and ive been ready at all ive never seen the girls i tried to drive i took a fool from the river instead i just draw your life when my head stays the fellas we dreams to all ill stay<br />
and nobody should be there in here but when i hide all my echo and make you hold me up im going to get more to fall in your sea oh right through a night in your news one two treatnessboy shes passed out in the sky all the real friends are downs light to out here he was rightly word out im not driving in my eyes suddenly reminding me im being that dragong class i wish i was since yours your peace of<br />
pour it like like a record beautiful man i do you on the kid punk its when im attack i know when im smiling taste so i find a little far</p>
</blockquote>
<p>As you can see, the model learned to structure its musings in paragraphs akin to real lyrics, and overall makes some good attempts at coming up with new sentences. Apparently this song is more of a Gangsta rap, as suggested by the words “knife”, “gangsta”, “niggas”, etc. Unfortunately, sentences only sometimes make proper sense. The model also comes up with semi-random new words, like “dragong” and “peacent”, because it has to learn spelling and vocabulary from scratch, as opposed to word-level language models. It also did not learn meaningful long-term dependencies such as verse/chorus structures.</p>

<h1 id="snowfall-a-very-special-video-game-controller">Snowfall - A very special video game controller (2015-11-20)</h1>

<p>Here is a short <a href="https://www.youtube.com/watch?v=TRiBA4o_pBs">Youtube video</a> explaining what this post is about. How did I do it? I will try to go through the main steps in the following.</p>
<h1 id="understanding-the-wiring-and-connecting-the-arduino">Understanding the wiring and connecting the Arduino</h1>
<p>To use the mat as a game controller, I first had to understand its internal wiring before setting up a connection to the Arduino microcontroller. So I freed the PCB and its wiring from the red plastic hull that originally contained it. After almost going insane trying to figure out the bewildering wiring inside the mat by manually checking the connections from the outside with a multimeter, I finally decided to just cut the mat open, which made everything much easier and saved my sanity ;)</p>
<h2 id="part-1---leds">Part 1 - LEDs</h2>
<p>Intuitively (and naively), for ten LEDs you would expect two connections per LED and therefore 20 connections in total. But especially when dealing with cheap toy electronics, it is never that easy: every cable drives up cost, so manufacturers favour more complicated wiring whenever it is cheaper to produce. In this mat, there are 7 cables, each with a different colour (at least they did me that favour!). The setup can be seen in this picture:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/led-wiring.png" alt="LED wiring.png" /></p>
<p>So if you want to light up LED number 2, for example, you have to apply a current to the green cable and pull the brown cable to ground to allow the current to flow. In general, for every LED there is a specific combination of two cables that you have to set. But as if that were not complicated enough, this setup introduces dependencies between the LEDs. What if you wanted to light up LEDs 1 and 6 simultaneously? You would have to supply the yellow cable with current for LED 1 and the green one for LED 6, and the brown and red cables would both have to act as ground. But wait a minute - in this configuration, LEDs 2 and 5 would light up too, as both are now exposed to the same current as LEDs 1 and 6. I worked around this dependency with a “cheap” trick: when multiple LEDs should be active, I light them up one after the other, but switch between them at such a high frequency (about 5 ms per LED) that, with our surprisingly limited visual system, it looks like the LEDs are actually glowing at the same time!</p>
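<p>The switching trick can be sketched in Python (purely illustrative - the real implementation runs on the Arduino):</p>

```python
import itertools

# Time-multiplexing: only one LED is ever on, but cycling through the active
# set every ~5 ms makes them appear simultaneously lit.
def multiplex_schedule(active_leds, slots):
    cycle = itertools.cycle(active_leds)
    return [next(cycle) for _ in range(slots)]

# Six 5 ms slots with LEDs 1 and 6 requested: each is on half the time.
print(multiplex_schedule([1, 6], 6))  # [1, 6, 1, 6, 1, 6]
```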
<h2 id="part-2---buttons">Part 2 - Buttons</h2>
<p>Unfortunately, the wiring of the buttons turned out to be even more confusing than that of the LEDs. Basically, inside the mat there are two layers of foil separated by a layer of foam. The foil has conductive areas at the position of each button, as well as black lines that connect the buttons to the PCB in the toy. When pressure is applied to a button, the two layers are pressed together as the foam gets squashed, and a current can now flow between the two layers of foil. So you can model a button as a resistor whose resistance changes depending on the applied pressure. I hope the picture below makes everything a little clearer: you can see one of the layers with its electrical connections in black, and the foam underneath obstructing the second layer below.</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/img_20151120_180127.jpg" alt="IMG_20151120_180127.jpg" /></p>
<p>As with the LEDs, the wiring was not straightforward, as you can see in the following schematics. For both layers, the connections accessible from the outside are drawn at the top.</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/button-wiring-schematics.png" alt="Button wiring schematics.png" /></p>
<p>On the PCB, I discovered that the buttons 1-4, 5-8 and 9-10 were each short-circuited on the second layer, as shown in red in the schematic. This made matters much more complicated, because it again introduced dependencies between the buttons that make separate measurements much harder. It leads to a matrix setup, with four connections on the first and three connections on the second layer. Here is a diagram showing this matrix:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/button-matrix.png" alt="Button matrix.png" /></p>
<p>It assumes that current is applied on the digital pins 10 to 12 and the resulting voltage measured on the analog pins A0 to A3. For each coloured cable, the buttons connected to it are displayed, and the matrix entries show the specific button addressed by the combination of two of these cables. So the idea is to set one of the three digital pins to HIGH while letting the other connections float, measure the four analog voltages, then set the next digital pin to HIGH and repeat, and finally do the same a third time with the last remaining digital pin. The picture below shows a schematic of the Arduino setup for the analog pin A0 and demonstrates how setting exactly one pin between 10 and 12 to HIGH allows measuring the voltage of exactly one of the three buttons. It works analogously for the other analog pins A1 to A3.</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/button-arduino-setup-example.png" alt="Button Arduino setup example" /></p>
<p>Finally, I programmed my Arduino to periodically read these values, detect button presses using these measured voltages, and send the button states over the serial port, where they are received by the Unity game engine.</p>
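<p>The scan loop can be sketched as a small Python simulation (pin names from the schematic; the voltage readings are faked, and the helper names are made up):</p>

```python
# Drive exactly one of the three digital pins HIGH at a time and read all four
# analog pins, covering every (drive, sense) intersection of the button matrix.
DRIVE_PINS = [10, 11, 12]
SENSE_PINS = ["A0", "A1", "A2", "A3"]

def scan_matrix(read_voltage):
    # read_voltage(drive, sense) stands in for the real pin I/O on the Arduino.
    readings = {}
    for drive in DRIVE_PINS:          # set exactly one drive pin HIGH...
        for sense in SENSE_PINS:      # ...and sample every analog input
            readings[(drive, sense)] = read_voltage(drive, sense)
    return readings

# Fake electronics: pretend only the button at intersection (11, "A2") is pressed.
pressed = {(11, "A2")}
readings = scan_matrix(lambda d, s: 5.0 if (d, s) in pressed else 0.0)
buttons_down = [k for k, v in readings.items() if v > 2.5]
print(buttons_down)  # [(11, 'A2')]
```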
<h1 id="game-development-in-unity">Game development in Unity</h1>
<p>The video game itself was developed in <a href="https://unity3d.com/">Unity</a>, which is a game development platform suited to build your own 3D as well as 2D games without going through the hassle of creating your own game engine and taking care of every little detail yourself. I can really recommend it to people interested in game development, because it represents a nice, accessible starting point for further endeavours.</p>
<h2 id="designing-the-3d-scene">Designing the 3D scene</h2>
<p>One of Unity’s strengths is definitely the modelling of 3D game environments. Although it can be difficult to find the right 3D models and textures for your game (you often find many expensive offers on the internet and only a few free ones, or none at all), designing the terrain and moving, scaling and rotating objects is very intuitive and quickly done. Here is the scene of Snowfall as viewed inside the Unity editor:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/unity-3d-scene.png" alt="Unity 3D scene.png" /></p>
<p>I quickly added some hills and a snow texture to the terrain, placed a lot of trees and obstacles (barely visible in the middle of the screen) and added 10 red skis with their positions relative to the main camera, so they move along with the player. Because the scene is rendered from the perspective of this main camera, for the player the skis do not seem to move at all. On the left side, you can see the scene hierarchy, containing all objects of the scene organised in a tree-like structure. As stated before, the skis are a child of the main camera object so that they do not change their position on the screen. The TreeGroup contains all trees, a directional light acts as the sunlight and the terrain features a set of obstacles in the form of many thin cuboids with a rocky texture. The UI consists of text labels from one to ten for every ski, a play button, a health slider, a title text for the start screen and a loss and win text shown when the player loses or wins. It is also simple to set it up so that a specific function in your own script is called every time the button is pressed or the slider is moved. Finally, the “GameLogic” object contains most of the C# scripts responsible for handling most of the game’s logic. I say most because some scripts are attached to specific game objects like the main camera.</p>
<h2 id="game-logic-with-c-scripts">Game logic with C# scripts</h2>
<p>So now I designed a pretty 3D game world, but nothing is really happening in it and no interaction with the player is taking place - you could hardly call that a game, right? At first, I implemented moving the camera at a certain speed through the terrain, maintaining the same height and orientation, and gradually speeding up as we go through the level. I achieved this with a script attached to the Main Camera object containing the class “CameraController”. It defines a minimum and maximum speed in units per second as well as the acceleration <code class="language-plaintext highlighter-rouge">additionalMinSpeedPerSec</code>. A boolean variable determines if the camera should be moving at the moment - an external script can modify this variable to stop and start camera movement. Every frame, a call to the <code class="language-plaintext highlighter-rouge">Update()</code> function causes the current camera speed to be increased by <code class="language-plaintext highlighter-rouge">Time.deltaTime * additionalMinSpeedPerSec</code>, but limited to the maximum camera speed. Then, the camera is moved according to this <code class="language-plaintext highlighter-rouge">currentCameraSpeed</code> using the following code: <code class="language-plaintext highlighter-rouge">Vector3 camPos = transform.localPosition; camPos.z += Time.deltaTime * currentCameraSpeed; transform.localPosition = camPos;</code> Next, I implemented the desired reaction to detected collisions between skis and the rocky terrain. In the 3D scene, I already added box colliders to the obstacles and the skis and marked them as a trigger. Also, I added rigid bodies to the skis. 
Afterwards I attached a script to every ski whose class <code class="language-plaintext highlighter-rouge">SkiCollisionDetector</code> overrides the <code class="language-plaintext highlighter-rouge">OnTriggerStay</code> function which is called in regular intervals as long as a collision with the corresponding game object is detected. My implementation for every ski simply counts the total duration of the collisions in a variable <code class="language-plaintext highlighter-rouge">collisionTime</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void OnTriggerStay(Collider coll)
{
    collisionTime += Time.deltaTime;
}
</code></pre></div></div>
<p>To maintain all ten skis in a comfortable way, I also implemented a <code class="language-plaintext highlighter-rouge">SkiController</code> that keeps a list of all skis, can add up all the <code class="language-plaintext highlighter-rouge">collisionTime</code> values of all skis to retrieve the total amount of damage, reset these variables back to zero in case a new game is started, and also set the visibility and the ability to trigger collision handling depending on the button states. The latter shall be explained in greater detail. The function <code class="language-plaintext highlighter-rouge">setSkiStates(bool[] buttonStates)</code> receives information about which buttons are currently pressed, goes through the <code class="language-plaintext highlighter-rouge">skiList</code> and makes only the skis belonging to those buttons visible and able to trigger the collision handling functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public void setSkiStates(bool[] buttonStates)
{
    for (int i = 0; i < skiList.Count; ++i) // Go through the ski list...
    {
        GameObject currentSki = ((GameObject)skiList[i]);
        currentSki.GetComponent<Renderer>().enabled = buttonStates[i]; // Ski is visible if its button is pressed
        currentSki.GetComponent<Collider>().enabled = buttonStates[i]; // Ski can trigger calls of the collision function (OnTriggerStay) if its button is pressed
    }
}
</code></pre></div></div>
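<p>The damage summation and reset functionality of the <code class="language-plaintext highlighter-rouge">SkiController</code> could be sketched as follows - only <code class="language-plaintext highlighter-rouge">getSkiCollisionTime</code> is referenced by the main loop later on, the name of the reset method is my own:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public float getSkiCollisionTime()
{
    float total = 0.0f;
    foreach (GameObject ski in skiList) // Sum up the collision times of all skis...
    {
        total += ski.GetComponent<SkiCollisionDetector>().collisionTime;
    }
    return total; // ... to obtain the total amount of damage
}

public void resetCollisionTimes() // method name illustrative: zero all damage before a new game starts
{
    foreach (GameObject ski in skiList)
    {
        ski.GetComponent<SkiCollisionDetector>().collisionTime = 0.0f;
    }
}
</code></pre></div></div>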
<p>A <code class="language-plaintext highlighter-rouge">HealthBar</code> class uses <code class="language-plaintext highlighter-rouge">GUI.DrawTexture</code> with a black and a green texture to draw the health bar, using the black texture as a background and the green texture for the bar itself. Finally, a <code class="language-plaintext highlighter-rouge">MainLoop</code> class ties everything together and also reacts to the play button being pressed and the health slider being varied on the start screen. Having a specific function called when a specific UI element is interacted with is very easy to set up: just select the UI element in the Unity editor and then, in the inspector on the right-hand side, select the function you want to call. As an example, this screenshot shows my play button and how I set it up to call the function <code class="language-plaintext highlighter-rouge">startGame</code> of the class <code class="language-plaintext highlighter-rouge">MainLoop</code> contained in the game object GameLogic whenever it is pressed:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/event-handling.png" alt="Event handling.png" /></p>
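<p>The <code class="language-plaintext highlighter-rouge">HealthBar</code> drawing described above can be sketched like this - only <code class="language-plaintext highlighter-rouge">setPercentage</code> (called from the main loop below) and the use of <code class="language-plaintext highlighter-rouge">GUI.DrawTexture</code> with two textures come from my description, the screen coordinates and field names are illustrative:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using UnityEngine;

public class HealthBar : MonoBehaviour
{
    public Texture2D backgroundTexture; // black texture, assigned in the editor
    public Texture2D barTexture;        // green texture, assigned in the editor

    private float percentage = 1.0f;

    public void setPercentage(float p) // fraction of health remaining, between 0 and 1
    {
        percentage = Mathf.Clamp01(p);
    }

    void OnGUI()
    {
        // Draw the black background over the full width, then the green bar on top,
        // scaled horizontally by the remaining health (positions/sizes illustrative)
        GUI.DrawTexture(new Rect(10, 10, 200, 20), backgroundTexture);
        GUI.DrawTexture(new Rect(10, 10, 200 * percentage, 20), barTexture);
    }
}
</code></pre></div></div>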
<p>I will conclude this post with the code of the main loop during active gameplay. Hopefully, the extensive comments make it easy to follow and help you further develop your own coding skills in Unity.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Update is called once per frame
void Update()
{
    parser.readButtonStates(buttonStates); // Read the current button states
    skiController.setSkiStates(buttonStates); // Set ski visibility and collision triggering according to these button states
    if (gameActive) // If the game is active at the moment
    {
        float currentDamage = skiController.getSkiCollisionTime(); // Determine current collision time ("damage")
        float percentage = currentDamage / maxSeconds; // Determine ratio between damage and total health
        healthBar.setPercentage(1 - percentage); // Set health bar to reflect how much time is still available
        // Check for game ending conditions
        bool lossCondition = (percentage >= 1.0f); // Game lost once the damage ratio reaches 1
        bool winCondition = (camController.transform.position.z >= 5300.0f); // Game won if the camera has reached the end of the level (camera moves along the z axis only, starts at 0, at z=5300 the end of the level is reached)
        if (lossCondition || winCondition) // If a loss or a win occurred
        {
            gameActive = false; // Game is no longer active
            camController.setMoving(false); // Stop camera movement
            StartCoroutine(resetGame()); // Reset the game in 10 seconds
            if (winCondition) // Show winning text when won...
            {
                winText.enabled = true;
                print("WON");
            }
            else if (lossCondition) // ... otherwise show losing text
            {
                lossText.enabled = true;
                print("LOST");
            }
        }
    }
}
</code></pre></div></div>
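<p>One piece referenced above but not shown is the <code class="language-plaintext highlighter-rouge">resetGame</code> coroutine. A minimal sketch, assuming the ten-second delay mentioned in the comment and the fields used in the main loop (the names of the reset helpers on the controllers are illustrative), could look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Collections; // needed for IEnumerator coroutines
using UnityEngine;

IEnumerator resetGame()
{
    yield return new WaitForSeconds(10.0f); // Wait 10 seconds before resetting
    winText.enabled = false; // Hide the end-of-game texts again
    lossText.enabled = false;
    // Zero all ski collision times and move the camera back to the start of the level
    // (helper method names on SkiController and CameraController are illustrative)
    skiController.resetCollisionTimes();
    camController.transform.localPosition = Vector3.zero;
    // Back to the start screen, waiting for the play button to be pressed again
}
</code></pre></div></div>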