Jekyll2022-03-21T00:46:44+01:00https://dans.world/feed.xmlDans WorldMachine learning, music information retrieval, and other thingsDaniel Stollerbusiness@dstoller.netSpotify Internship Report2019-02-21T00:00:00+01:002019-02-21T00:00:00+01:00https://dans.world/Spotify-internship<p>From June to September 2019, I took a break from my ongoing PhD and worked as a Research Intern at Spotify in London.
I was under the supervision of Simon Durand and Tristan Jehan as part of the music intelligence (MIQ) team.
Their work is closely related to my PhD: applying machine learning to music signals so that computers can understand musical properties, for example classifying whether a piece contains singing voice. This made the internship a good opportunity to find out what it is like to work in industry in my field.</p>
<p><img src="https://dans.world/assets/img/2019-02-21-Spotify-internship/jam.png" alt="Jamming night at Spotify" />
<em>Jam night at Spotify New York</em></p>
<h2 id="overall-impressions">Overall impressions</h2>
<p>Overall my experience was very good.
Spotify has lots of energetic, ambitious people that are happy to help out and collaborate with you (on that note, many thanks to Simon, Tristan, Sebastian, Rachel, Andreas, Till, Aparna, and probably more that I forgot!)</p>
<p>Compared to my usual PhD work, this made me especially productive - I worked a lot during these months, but enjoyed it at the same time.
Add to that the many (mostly optional) meetings such as “Research weekly” and talks from invited researchers, as well as a great IT infrastructure that puts powerful compute at your fingertips, and you get a very engaging atmosphere that brings out the best in everyone.</p>
<p>I also had great freedom in terms of selecting the topic I wanted to research. This is not a given at all considering interns are often bound to one particular task during their stay.</p>
<p>Unfortunately, Spotify’s London offices were being restructured at the time, so I worked in a temporary office building that was a bit bland compared to the nicely designed and decorated main offices. The temporary building was still very high quality, though, and a big step up from some of the buildings at my university!</p>
<p>Since much of the MIQ team works in the New York offices, I was also able to visit them for a week, with all expenses sponsored!
It was great to finally meet in person all the people I had only seen in video calls until then.
Offices there are quite big, and even offer daily catering and in-house music events such as a jam night.</p>
<p>During my stay, I investigated methods for</p>
<ul>
<li>separating singing voice from accompaniment</li>
<li>detecting what is being sung in a music piece (lyrics transcription)</li>
<li>and when it is being sung if the lyrics text is already given along with the music piece (lyrics alignment).</li>
</ul>
<h2 id="better-objectives-for-singing-voice-separation">Better objectives for singing voice separation</h2>
<p>For separation, people often use very simple loss functions (e.g. an L2 norm between the predicted output and the real one) to measure the error, which they then minimise to train the system <a class="citation" href="#huangSingingVoiceSeparation2014">[1], [2]</a>.
The problem is that these losses do not necessarily align with how a human listening to, say, a separated vocal track would rate the output quality. In other words, the simple loss can be low while the output quality is rated as bad, and vice versa.
This means we are not optimising our systems to maximise the actual listening quality!
Evaluation metrics such as SDR <a class="citation" href="#vincentPerformanceMeasurement2006">[3]</a> or PEASS <a class="citation" href="#vincentImprovedPerceptual2012">[4]</a> share similar issues, and are also more complicated to compute or possibly unstable to use for training.</p>
<p>The above losses and metrics also assume that for every music input, there exists <em>exactly one true source output</em> as solution for the separation task (uni-modal).
But that might not be the case - if music has background vocals for example, there are <em>two</em> solutions for singing voice separation that both make sense: one that puts the background vocals into the accompaniment track, and one that puts them into the vocal track along with the main vocals:</p>
<p><img src="https://dans.world/assets/img/2019-02-21-Spotify-internship/multi_modal_output.png" alt="Multiple solutions for separation" /></p>
<p>These uni-modal objectives would take whichever option happens to be in the training dataset and reward the separator more the closer it gets to that solution.
The other valid option is punished severely, and instead of representing both solutions, the separator may be encouraged to predict an average of the two - which can itself be a bad output.</p>
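<p>A toy numpy sketch (not our actual training code) makes the averaging problem concrete: with two equally likely valid targets, the prediction minimising the expected L2 loss is their average, even though that average is itself a poor output:</p>

```python
import numpy as np

# Two equally valid "ground truth" vocal stems for the same mixture:
# one including the background vocals, one without them.
target_a = np.array([1.0, 0.0, 1.0, 0.0])
target_b = np.array([1.0, 0.0, 0.0, 0.0])

def expected_l2(pred):
    # Expected L2 loss if either target occurs with probability 0.5.
    return 0.5 * np.sum((pred - target_a) ** 2) + 0.5 * np.sum((pred - target_b) ** 2)

average = 0.5 * (target_a + target_b)

# The average scores better than either valid solution under expected L2,
# so the separator is pushed towards it - even if it sounds bad.
assert expected_l2(average) < expected_l2(target_a)
assert expected_l2(average) < expected_l2(target_b)
```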
<p>We investigated GAN-based training as a potential solution, and also performed perceptual listening experiments (with great help of Aparna!) in an effort to develop a better objective function that more closely resembles how humans would rate the output quality of such systems.</p>
<h2 id="combining-singing-voice-separation-and-lyrics-transcription">Combining singing voice separation and lyrics transcription</h2>
<p>Another idea explored during my internship concerns the interactions between tasks:
Maybe we can separate voice better if we know what is being sung.
And similarly, if we know the isolated voice track, detecting what is being sung should be much easier!</p>
<p>Therefore we looked at <em>multi-task learning</em> to build models that perform separation and lyrics transcription at the same time.
We chose the <em>Wave-U-Net</em> <a class="citation" href="#stollerWaveUNetMultiScale2018">[2]</a> model since it already performs separation, and since we hypothesised that its generic architecture allows capturing many time-varying features on different time-scales, including those useful for transcription. We simply take the output of one of the upsampling blocks whose features have an appropriate time resolution (23 feature vectors per second in our case) to directly predict the lyrics characters over time (at most 23 characters per second). To learn from the start and end times for each lyrical line given in our dataset, we apply the CTC loss in these time intervals (see our <a href="https://arxiv.org/abs/1902.06797">publication on lyrics transcription and alignment</a> for details).</p>
<p><img src="https://dans.world/assets/img/2019-02-21-Spotify-internship/lyrics_waveunet.png" alt="Branching off a Wave-U-Net upsampling block to predict lyrics characters. The later upsampling blocks for separation are not depicted here." />
<em>Branching off a Wave-U-Net upsampling block to predict lyrics characters (<a href="https://arxiv.org/abs/1902.06797">source</a>). The later upsampling blocks for separation are not depicted here.</em></p>
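<p>The character-prediction branch can be sketched as follows - a hypothetical numpy stand-in, where random features replace the real upsampling-block output and the dimensions are illustrative assumptions:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
T, F, V = 46, 128, 28  # ~2s at 23 frames/s, feature dim, charset incl. CTC blank (assumed)

features = rng.standard_normal((T, F))  # placeholder for upsampling-block output
W = rng.standard_normal((F, V)) * 0.01  # linear character-prediction head
b = np.zeros(V)

logits = features @ W + b               # one character distribution per frame
log_probs = logits - np.log(np.sum(np.exp(logits), axis=1, keepdims=True))

assert log_probs.shape == (T, V)        # at most T character emissions over these 2s
assert np.allclose(np.exp(log_probs).sum(axis=1), 1.0)
```

<p>In training, these per-frame log-probabilities would be fed to a CTC loss restricted to each lyrical line’s start and end times.</p>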
<p>Although we hoped to improve performance this way and tried out many different multi-tasking strategies, the multi-task models were only very slightly better, if at all, than their single-task counterparts.
We don’t know exactly why, since many factors can influence the results: possibly the datasets were so big that the single-task models already underfit, so the extra information from the other task was not beneficial; or the tasks do not overlap as much as we hypothesised; or we used the wrong multi-tasking strategy. So we did not investigate further.</p>
<p>Also, we found that lyrics transcription is very, very hard!
This might not be very surprising: it can be seen as speech recognition - which is already hard - but with a lot of additional noise from the accompaniment, slurred pronunciation, and other unusual effects that arise because singing differs from speech.</p>
<h2 id="lyrics-alignment">Lyrics alignment</h2>
<p>Since very good lyrics transcription still seemed out of reach, I also tried to use the transcription models for aligning given lyrics across time according to when they are sung in the music piece. This turned out to be much more successful, clearly beating the state-of-the-art models that participated in the MIREX lyrics alignment challenge 2017 <a class="citation" href="#imirselMusicInformation2020">[5]</a>.</p>
<p>In conclusion, we find that <strong>you can train a lyrics transcription system from only line-level aligned lyrics annotations to predict characters directly from raw audio</strong>, and <strong>get excellent alignment accuracy even with mediocre transcription performance - see our ICASSP publication (preprint available <a href="https://arxiv.org/abs/1902.06797">here</a>)!</strong></p>
<p>Lyrics transcription, however, can still be considered unsolved for now, and this exciting problem will hopefully attract lots of research in the future.</p>
<h2 id="references">References</h2>
<ol class="bibliography"><li><div>
<a name="huangSingingVoiceSeparation2014" />
<b>Singing-Voice Separation from Monaural Recordings Using Deep Recurrent Neural Networks.</b> <span style="font-size:15px"> (2014) </span>
<br />
Proc. of the International Society for Music Information Retrieval Conference (ISMIR)
<br />
<i>Huang, Po-Sen and Kim, Minje and Hasegawa-Johnson, Mark and Smaragdis, Paris</i>
</div>
<a download="huangSingingVoiceSeparation2014.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@inproceedings%7BhuangSingingVoiceSeparation2014,%0A%20%20title%20=%20%7BSinging-%7B%7BVoice%20Separation%7D%7D%20from%20%7B%7BMonaural%20Recordings%7D%7D%20Using%20%7B%7BDeep%20Recurrent%20Neural%20Networks%7D%7D.%7D,%0A%20%20booktitle%20=%20%7BProc.%20of%20the%20%7B%7BInternational%20Society%7D%7D%20for%20%7B%7BMusic%20Information%20Retrieval%20Conference%7D%7D%20(%7B%7BISMIR%7D%7D)%7D,%0A%20%20author%20=%20%7BHuang,%20Po-Sen%20and%20Kim,%20Minje%20and%20Hasegawa-Johnson,%20Mark%20and%20Smaragdis,%20Paris%7D,%0A%20%20date%20=%20%7B2014%7D,%0A%20%20pages%20=%20%7B477--482%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/huangSingingVoiceSeparation2014/">Details</a></li>
<li><div>
<a name="stollerWaveUNetMultiScale2018" />
<b>Wave-U-Net: A Multi-Scale Neural Network for End-to-End Source Separation</b> <span style="font-size:15px"> (2018) </span>
<br />
Proc. of the International Society for Music Information Retrieval Conference (ISMIR)
<br />
<i>Stoller, Daniel and Ewert, Sebastian and Dixon, Simon</i>
</div>
<a target="_blank" rel="noopener noreferrer" href="/repository/stollerWaveUNetMultiScale2018.published.pdf"><input class="button0" type="button" value="PDF" /></a>
<a download="stollerWaveUNetMultiScale2018.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@inproceedings%7BstollerWaveUNetMultiScale2018,%0A%20%20title%20=%20%7BWave-%7B%7BU%7D%7D-%7B%7BNet%7D%7D:%20%7B%7BA%20Multi%7D%7D-%7B%7BScale%20Neural%20Network%7D%7D%20for%20%7B%7BEnd%7D%7D-to-%7B%7BEnd%20Source%20Separation%7D%7D%7D,%0A%20%20booktitle%20=%20%7BProc.%20of%20the%20%7B%7BInternational%20Society%7D%7D%20for%20%7B%7BMusic%20Information%20Retrieval%20Conference%7D%7D%20(%7B%7BISMIR%7D%7D)%7D,%0A%20%20author%20=%20%7BStoller,%20Daniel%20and%20Ewert,%20Sebastian%20and%20Dixon,%20Simon%7D,%0A%20%20date%20=%20%7B2018%7D,%0A%20%20volume%20=%20%7B19%7D,%0A%20%20pages%20=%20%7B334--340%7D,%0A%20%20abstract%20=%20%7BModels%20for%20audio%20source%20separation%20usually%20operate%20on%20the%20magnitude%20spectrum,%20which%20ignores%20phase%20information%20and%20makes%20separation%20performance%20dependant%20on%20hyper-parameters%20for%20the%20spectral%20front-end.%20Therefore,%20we%20investigate%20end-to-end%20source%20separation%20in%20the%20time-domain,%20which%20allows%20modelling%20phase%20information%20and%20avoids%20fixed%20spectral%20transformations.%20Due%20to%20high%20sampling%20rates%20for%20audio,%20employing%20a%20long%20temporal%20input%20context%20on%20the%20sample%20level%20is%20difficult,%20but%20required%20for%20high%20quality%20separation%20results%20because%20of%20long-range%20temporal%20correlations.%20In%20this%20context,%20we%20propose%20the%20Wave-U-Net,%20an%20adaptation%20of%20the%20U-Net%20to%20the%20one-dimensional%20time%20domain,%20which%20repeatedly%20resamples%20feature%20maps%20to%20compute%20and%20combine%20features%20at%20different%20time%20scales.%20We%20introduce%20further%20architectural%20improvements,%20including%20an%20output%20layer%20that%20enforces%20source%20additivity,%20an%20upsampling%20technique%20and%20a%20context-aware%20prediction%20framework%20to%20reduce%20output%20artifacts.%20Experiments%20for%20singing%20voice%20s
eparation%20indicate%20that%20our%20architecture%20yields%20a%20performance%20comparable%20to%20a%20state-of-the-art%20spectrogram-based%20U-Net%20architecture,%20given%20the%20same%20data.%20Finally,%20we%20reveal%20a%20problem%20with%20outliers%20in%20the%20currently%20used%20SDR%20evaluation%20metrics%20and%20suggest%20reporting%20rank-based%20statistics%20to%20alleviate%20this%20problem.%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/stollerWaveUNetMultiScale2018/">Details</a></li>
<li><div>
<a name="vincentPerformanceMeasurement2006" />
<b>Performance Measurement in Blind Audio Source Separation</b> <span style="font-size:15px"> (2006) </span>
<br />
<i>Vincent, E. and Gribonval, R. and Fevotte, C.</i>
</div>
<a download="vincentPerformanceMeasurement2006.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@article%7BvincentPerformanceMeasurement2006,%0A%20%20title%20=%20%7BPerformance%20Measurement%20in%20Blind%20Audio%20Source%20Separation%7D,%0A%20%20author%20=%20%7BVincent,%20E.%20and%20Gribonval,%20R.%20and%20Fevotte,%20C.%7D,%0A%20%20date%20=%20%7B2006%7D,%0A%20%20journaltitle%20=%20%7BIEEE%20Transactions%20on%20Audio,%20Speech,%20and%20Language%20Processing%7D,%0A%20%20volume%20=%20%7B14%7D,%0A%20%20pages%20=%20%7B1462--1469%7D,%0A%20%20issn%20=%20%7B1558-7916%7D,%0A%20%20doi%20=%20%7B10.1109/TSA.2005.858005%7D,%0A%20%20keywords%20=%20%7Badditive%20noise,Additive%20noise,algorithmic%20artifacts,audio%20signal%20processing,Audio%20source%20separation,blind%20audio%20source%20separation,blind%20source%20separation,Data%20mining,distortion,Distortion%20measurement,distortions,Energy%20measurement,evaluation,Filters,Image%20analysis,Independent%20component%20analysis,interference,Interference,measure,Microphones,performance,quality,source%20estimation,Source%20separation,time-invariant%20gains,time-varying%20filters%7D,%0A%20%20number%20=%20%7B4%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/vincentPerformanceMeasurement2006/">Details</a></li>
<li><div>
<a name="vincentImprovedPerceptual2012" />
<b>Improved Perceptual Metrics for the Evaluation of Audio Source Separation</b> <span style="font-size:15px"> (2012) </span>
<br />
Latent Variable Analysis and Signal Separation
<br />
<i>Vincent, Emmanuel</i>
</div>
<a download="vincentImprovedPerceptual2012.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@inproceedings%7BvincentImprovedPerceptual2012,%0A%20%20title%20=%20%7BImproved%20%7B%7BPerceptual%20Metrics%7D%7D%20for%20the%20%7B%7BEvaluation%7D%7D%20of%20%7B%7BAudio%20Source%20Separation%7D%7D%7D,%0A%20%20booktitle%20=%20%7BLatent%20%7B%7BVariable%20Analysis%7D%7D%20and%20%7B%7BSignal%20Separation%7D%7D%7D,%0A%20%20author%20=%20%7BVincent,%20Emmanuel%7D,%0A%20%20date%20=%20%7B2012%7D,%0A%20%20pages%20=%20%7B430--437%7D,%0A%20%20publisher%20=%20%7B%7BSpringer%7D%7D,%0A%20%20abstract%20=%20%7BWe%20aim%20to%20predict%20the%20perceived%20quality%20of%20estimated%20source%20signals%20in%20the%20context%20of%20audio%20source%20separation.%20Recently,%20we%20proposed%20a%20set%20of%20metrics%20called%20PEASS%20that%20consist%20of%20three%20computation%20steps:%20decomposition%20of%20the%20estimation%20error%20into%20three%20components,%20measurement%20of%20the%20salience%20of%20each%20component%20via%20the%20PEMO-Q%20auditory-motivated%20measure,%20and%20combination%20of%20these%20saliences%20via%20a%20nonlinear%20mapping%20trained%20on%20subjective%20opinion%20scores.%20The%20parameters%20of%20the%20decomposition%20were%20shown%20to%20have%20little%20influence%20on%20the%20prediction%20performance.%20In%20this%20paper,%20we%20evaluate%20the%20impact%20of%20the%20parameters%20of%20PEMO-Q%20and%20the%20nonlinear%20mapping%20on%20the%20prediction%20performance.%20By%20selecting%20the%20optimal%20parameters,%20we%20improve%20the%20average%20correlation%20with%20mean%20opinion%20scores%20(MOS)%20from%200.738%20to%200.909%20in%20a%20cross-validation%20setting.%20The%20resulting%20improved%20metrics%20are%20used%20in%20the%20context%20of%20the%202011%20Signal%20Separation%20Evaluation%20Campaign%20(SiSEC).%7D,%0A%20%20isbn%20=%20%7B978-3-642-28551-6%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/vincentImprovedPerceptual2012/">Details</a></li>
<li><div>
<a name="imirselMusicInformation2020" />
<b>Music Information Retrieval Exchange (MIREX)</b> <span style="font-size:15px"> (2020) </span>
<br />
<i>IMIRSEL</i>
</div>
<a target="_blank" rel="noopener noreferrer" href="https://www.music-ir.org/mirex/wiki/MIREX_HOME"><input class="button0" type="button" value="Link" /></a>
<a download="imirselMusicInformation2020.bib" href="data:application/x-bibtex,%7B%25raw%25%7D@online%7BimirselMusicInformation2020,%0A%20%20title%20=%20%7BMusic%20%7B%7BInformation%20Retrieval%20Exchange%7D%7D%20(%7B%7BMIREX%7D%7D)%7D,%0A%20%20author%20=%20%7BIMIRSEL%7D,%0A%20%20date%20=%20%7B2020-03-04%7D,%0A%20%20url%20=%20%7Bhttps://www.music-ir.org/mirex/wiki/MIREX_HOME%7D%0A%7D%0A%7B%25endraw%25%7D"><input class="button0" type="button" value="Bibtex" /></a>
<a class="details" href="/repository/imirselMusicInformation2020/">Details</a></li></ol>ICLR 2020 impressions and paper highlights2019-02-21T00:00:00+01:002019-02-21T00:00:00+01:00https://dans.world/ICLR-2020<p>Having just “visited” my first virtual conference, ICLR 2020, I wanted to talk about my general impression and highlight some papers that stuck out to me from a variety of subfields.</p>
<h2 id="general-impressions">General impressions</h2>
<p>I presented our <a href="https://iclr.cc/virtual/poster_Hye1RJHKwB.html">FactorGAN paper</a> at the conference. Like every other paper, ours had a quick explainer video, an asynchronous chat room for questions, and two two-hour poster session slots, where people could spontaneously join virtual Zoom meetings to discuss the paper.</p>
<p>ICLR organisers did a really good job overall, considering there was so little time to react to the Coronavirus pandemic and to switch from a physical to a virtual conference.
Poster sessions were very useful for getting to know people and discussing specific questions about papers. In my experience they were often surprisingly empty, but that in turn made it easier for everyone to participate. It’s really nice that the explainer videos are now permanently available to everyone, which should help disseminate the latest research efficiently.</p>
<p>However, I found it a bit difficult to get to know people in a more relaxed setting. Just like the poster sessions, the socials on offer were mostly focused on a specific topic, such as AI for environmental issues. There was a “VR” application called “ICLR Town”, where people run around as characters in a 2D top-down view and meet up in this virtual space using webcams. While this suited my needs better, there were barely any people online. Maybe such a virtual meeting space should be promoted more and included as a coffee break in the conference schedule, which this time featured only poster sessions and talks.</p>
<p>Finally, I was surprised that poster sessions were not clearly separated according to topic, which made it quite overwhelming to find relevant papers. But overall it was a nice experience and organisers did the best they could considering the circumstances.</p>
<h2 id="paper-highlights">Paper highlights</h2>
<h3 id="causality">Causality</h3>
<p>Connecting deep learning models, which operate with differentiable operations and loss functions, to causal learning, which deals with discrete graphs, is a very interesting research direction. I want to highlight two papers here.</p>
<p><a href="https://iclr.cc/virtual/poster_ryxWIgBFPS.html">A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms</a></p>
<p>This paper looks at simple two-dimensional distributions $P(A,B)$. The question is: Does A cause B or B cause A?
A probabilistic model could estimate the joint by decomposing it into $P(A)P(B|A)$ or $P(B)P(A|B)$.
The main idea in this paper: If the model uses the “correct” decomposition that reflects the causal structure, then if the cause changes, adapting the model to this new distribution is fast. Why is that? Let’s assume that A causes B. If we model using the correct decomposition, $P(A)P(B|A)$, and $P(A)$ changes, then $P(B|A)$ stays the same, so only one part of the model needs to be adapted.
If we had modelled $P(B)P(A|B)$ instead, then BOTH $P(B)$ and $P(A|B)$ would change.</p>
<p>The authors then construct a clever meta-learning objective: a sum of the likelihoods of both model variants on the new distribution after training for a certain number of steps, weighted by the meta-parameter $\gamma$. After meta-training, $\gamma$ indicates which model, and therefore which causal explanation, is most likely the correct one.</p>
<p>This research seems still in its infancy – two-dimensional distributions are clearly not very practically relevant. But the idea of smoothly interpolating between different generative models using meta-learning might prove valuable in the future in more difficult settings!</p>
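<p>The key mechanism can be illustrated with a tiny numpy sketch; the adaptation losses here are just assumed numbers standing in for the likelihoods of the two adapted model variants, not the paper’s full learned modules:</p>

```python
import numpy as np

def sigmoid(g):
    return 1.0 / (1.0 + np.exp(-g))

# Assumed toy outcome: after adapting to a shifted distribution, the A->B
# decomposition reaches a lower loss (higher likelihood) than B->A.
loss_ab, loss_ba = 1.0, 3.0
lik_ab, lik_ba = np.exp(-loss_ab), np.exp(-loss_ba)

gamma = 0.0  # meta-parameter mixing the two causal hypotheses
for _ in range(200):
    s = sigmoid(gamma)
    mix = s * lik_ab + (1 - s) * lik_ba
    grad = -(lik_ab - lik_ba) * s * (1 - s) / mix  # d(-log mix)/d gamma
    gamma -= 0.5 * grad

# gamma drifts towards the hypothesis that adapts faster: A causes B.
assert sigmoid(gamma) > 0.9
```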
<p><a href="https://causalrlworkshop.github.io/program/cldm_21.html">Neural Causal Induction from Interventions</a></p>
<p>In this paper, the authors employ a deep learning model to estimate the outcomes of particular interventions and, by observing multiple interventions in a row, to predict the structure of the underlying causal graph that generated the observations.
This paper also makes use of meta-learning, in that in each meta-iteration, a new causal graph is used, with the aim of obtaining a deep learning model that can predict the causal structure between multiple variables even for new, previously unseen distributions.</p>
<p><img src="https://dans.world/assets/img/2020-04-28-ICLR-2020/neural_causal_induction.png" alt="drawing" width="400" /></p>
<p>The structure of the neural network is shown above. At each time step, an intervention is performed on one variable, and the encoder takes the resulting $N$ observed variables plus one that indicates which variable was intervened upon. The output is fed to a sequence model that updates its belief state about the causal graph, given all the information (interventions) we have seen so far. Finally, a graph decoder model is trained to output the correct causal graph.</p>
<p>For more detail on these papers and similar papers, check out the <a href="https://iclr.cc/virtual/workshops_14.html">workshop on causal learning for decision making</a>.</p>
<h3 id="classification-theory">Classification theory</h3>
<p><a href="https://iclr.cc/virtual/poster_Hkxzx0NtDB.html">Your classifier is secretly an energy based model and you should treat it like one
</a></p>
<p>This paper blew me away: Using simple math, it shows very elegantly that a discriminative classifier can also be viewed as an energy-based model, which in turn allows you to detect samples coming from outside the distribution the classifier was originally trained on.</p>
<p>Let’s take a classifier with scalar output $f_{\theta}(x)[y]$ for input $x$ and class index $y$. Class probabilities can be obtained by using the softmax operation, which makes values positive and sum to one over all classes:</p>
<p>$p_{\theta}(y \vert x) = \frac{e^{f_{\theta}(x)[y]}}{\sum_{y'} e^{f_{\theta}(x)[y']}}$.</p>
<p>But the unnormalised outputs can also be used to define an energy based model to define the joint probability over inputs $x$ and labels $y$</p>
<p>$p_{\theta}(x,y) = \frac{e^{f_{\theta}(x)[y]}}{Z(\theta)}$,</p>
<p>where $Z(\theta)$ is the normalising constant obtained by summing the unnormalised probabilities over the whole $(x,y)$ space.
The cool thing is that we can now determine the likelihood of an input, $p(x)$, by marginalising out $y$ from the above equation, which results in</p>
<p>$p_{\theta}(x) = \frac{\sum_y e^{f_{\theta}(x)[y]}}{Z(\theta)}$.</p>
<p>Notice that the numerator simply contains the sum of exponentiated outputs, which is the denominator in the softmax expression.
One can compute $p(y \vert x) = p_{\theta}(x,y) / p_{\theta}(x)$ to perform classification using the usual rules of probability, and, surprisingly, obtain exactly the softmax-based classifier we introduced in the beginning - the intractable $Z(\theta)$ cancels out!</p>
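<p>A minimal numpy check of this identity, with random logits standing in for a trained network:</p>

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.standard_normal(10)  # f_theta(x)[y] for one input x and 10 classes

# Energy-based view: p(x, y) = exp(f(x)[y]) / Z, so up to the constant log Z,
# log p(x) is the log-sum-exp of the logits, and Z cancels in the conditional.
log_p_x_unnorm = np.log(np.sum(np.exp(logits)))
p_y_given_x = np.exp(logits - log_p_x_unnorm)

softmax = np.exp(logits) / np.sum(np.exp(logits))
assert np.allclose(p_y_given_x, softmax)  # exactly the softmax classifier
```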
<p>The authors then train models as standard classifiers while simultaneously maximising the input likelihood $p_{\theta}(x)$.</p>
<p>The benefits are numerous:</p>
<ul>
<li>Obtain good classification accuracy, almost as good as purely discriminative training</li>
<li>Models can be used to generate new input samples</li>
<li>Better calibrated classifier output probabilities - NNs are often prone to output probabilities close to 0 or 1, when they should be more uncertain, especially for novel inputs not seen during training. When using the proposed method, samples assigned to a class with a probability of 0.8 would actually end up being from that class 80% of the time.</li>
<li>Out of distribution detection: Simply check an input example $x$ for its likelihood $p(x)$ - if it is too low, reject the sample and return “I don’t know”</li>
<li>More robust to adversarial attacks. Robustness increases even further if the input $x$ is first preprocessed by letting the model perturb it into a version $\hat{x}$ with higher likelihood $p(\hat{x}) > p(x)$, thereby “undoing” the adversarial manipulation and restricting classification to input samples similar to those seen during training</li>
</ul>
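<p>The out-of-distribution check from the list above can be sketched like this; the threshold and logits are hypothetical, and in practice the threshold would be chosen on validation data:</p>

```python
import numpy as np

def logsumexp(v):
    m = np.max(v)
    return m + np.log(np.sum(np.exp(v - m)))

def predict_or_reject(logits, threshold):
    """Return the predicted class, or None ("I don't know") if the
    unnormalised log-likelihood of the input falls below the threshold."""
    score = logsumexp(logits)  # log p(x) up to the constant log Z
    return None if score < threshold else int(np.argmax(logits))

confident = np.array([8.0, 0.5, 0.3])  # in-distribution-like logits (assumed)
flat = np.array([0.1, 0.0, -0.1])      # novel input: low energy everywhere

assert predict_or_reject(confident, threshold=3.0) == 0
assert predict_or_reject(flat, threshold=3.0) is None
```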
<p><a href="https://iclr.cc/virtual/poster_ByxGkySKwH.html">Towards neural networks that provably know when they don’t know</a></p>
<p>In a similar vein, this paper calibrates classifier output probabilities by reformulating the conditional $p(y|x)$.
This approach assumes samples either come from the “in-distribution” (seen during training), or from a specific “out-distribution”, where the classifier should indicate its complete uncertainty by assigning the same probability to all classes.
$p(y|x)$ is then decomposed using Bayes rule:</p>
<p>$p(y \vert x) = \frac{p(y \vert x,i)p(x \vert i)p(i) + p(y \vert x,o)p(x \vert o)p(o)}{p(x \vert i)p(i) + p(x \vert o)p(o)}$</p>
<p>$i$ and $o$ indicate whether a sample comes from the in- or out distribution. $p(y \vert x,i)$ is the classifier of interest, while $p(y \vert x,o)$ is simply set to a uniform distribution over classes, which allows the authors to make uncertainty guarantees.
$p(x \vert i)$ and $p(x \vert o)$ are Gaussian mixture models indicating how likely it is to observe this input sample $x$ assuming it’s drawn from the in- or out distribution, respectively.</p>
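<p>A numpy sketch of the decomposition, with toy 1-D Gaussians in place of the paper’s fitted mixture models (all densities and numbers here are assumptions for illustration):</p>

```python
import numpy as np

def gauss(x, mu, sigma):
    # 1-D Gaussian density.
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def calibrated_p_y_given_x(x, p_y_given_x_in, p_i=0.5):
    # Toy stand-ins: in-distribution density is narrow, out-distribution broad.
    p_x_in = gauss(x, mu=0.0, sigma=1.0)
    p_x_out = gauss(x, mu=0.0, sigma=10.0)
    K = len(p_y_given_x_in)
    uniform = np.full(K, 1.0 / K)  # p(y|x,o): complete uncertainty
    num = p_y_given_x_in * p_x_in * p_i + uniform * p_x_out * (1 - p_i)
    return num / (p_x_in * p_i + p_x_out * (1 - p_i))

classifier_out = np.array([0.98, 0.01, 0.01])  # overconfident raw classifier

near = calibrated_p_y_given_x(0.0, classifier_out)   # typical input: trust classifier
far = calibrated_p_y_given_x(30.0, classifier_out)   # far-away input: back off to uniform

assert near[0] > 0.9
assert np.allclose(far, 1.0 / 3, atol=0.01)
```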
<p>While the assumption of a specific out-distribution seems limiting, it is very nice to have mathematically proven guarantees for classifier confidences.</p>
<h3 id="learning-with-small-data-learning-representations">Learning with small data, learning representations</h3>
<p>Research on how to make deep learning generalise in the face of small datasets has reached a new peak in the last few years. Representation learning, self-supervised learning and meta learning are very popular topics, especially given recent breakthroughs in NLP by models such as BERT, and so ICLR also had a good representation (ha) of papers on these topics.</p>
<p>Current meta-learning approaches are often limited to the few-shot setting, where a model is only updated a few times on a task before it is used to make predictions (e.g. MAML <a class="citation" href="#finnModelAgnosticMetaLearning2017">[1], [2]</a>).
<a href="https://iclr.cc/virtual/poster_rkeiQlBFPB.html">WarpGrad</a> aims to extend the applicability of meta learning to settings where more adaptation might be needed.
Instead of directly learning an update rule for gradient descent, or a model initialisation from which training on a new task should start, it introduces so-called warp layers that essentially transform the optimisation landscape itself. The warp layer parameters can then be meta-learned so that normal SGD methods can more easily converge to good solutions.</p>
<p>For representation learning, <a href="https://iclr.cc/virtual/poster_BkeoaeHKDS.html">Gradients as Features for Deep Representation Learning</a> adds another trainable output layer to pre-trained networks that operates on the network’s gradients, in addition to the usual linear output layer that processes intermediate activations from the pre-trained network.</p>
<p>In <a href="https://iclr.cc/virtual/poster_Syx79eBKwr.html">A Mutual Information Maximization Perspective of Language Representation Learning</a>, the authors offer the very interesting theoretical insight that commonly used representation learning techniques such as Deep InfoMax and BERT, while not obviously similar at first glance, all end up optimising a version of a common objective that maximises the mutual information between different parts of the input. This paper might turn out to be critical for developing self-supervised learning techniques that work reliably across different input domains (such as text, audio and video).</p>
<p><a href="https://iclr.cc/virtual/poster_B1esx6EYvr.html">A critical analysis of self-supervision, or what we can learn from a single image</a> investigates what current computer vision models can learn from very few (even single) images under current self-supervision techniques when strong data augmentation is used. The results are quite concerning: self-supervision techniques currently cannot rival standard supervised training, even if millions of unlabelled images are used for self-supervision. Moreover, similar performance can be reached with a single image under heavy data augmentation, as this is sufficient for early network layers to pick up the low-level statistics of natural images. It seems that self-supervision currently suffers from the unsolved problem of finding optimisation objectives that actually encourage modelling high-level, semantically meaningful properties of the input.</p>
<h3 id="audio-processing">Audio processing</h3>
<p>Due to my background in audio processing, I wanted to specifically highlight two audio related papers.</p>
<p>In <a href="https://iclr.cc/virtual/poster_rygjHxrYDB.html">Deep Audio Priors</a>, the authors propose a new convolution kernel for audio spectrograms. They correctly note that normal convolutions are well motivated for images, where nearby pixels are strongly correlated. For spectrograms, this also applies to the time dimension, but not to the frequency dimension, where one can find strong dependencies across the whole frequency band. In particular, many sound sources are harmonic, meaning they are comprised of a sine wave at a certain base frequency (the fundamental frequency), accompanied by additional sine waves at integer multiples of that base frequency (called harmonics). The authors change the convolution kernels to reflect this, obtaining “harmonic convolutions” that attend to a certain base frequency together with the frequency bins representing its harmonics. Experiments in audio source separation and audio denoising show improved performance over spectrogram-based U-Nets and Wave-U-Net, indicating that such convolutions provide a more suitable “audio prior”.</p>
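<p>The core idea can be sketched by gathering spectrogram bins at integer multiples of a base frequency - a simplified illustration, not the paper’s exact kernel definition:</p>

```python
import numpy as np

def harmonic_stack(spec_frame, f0_bin, n_harmonics):
    """Gather magnitudes at integer multiples of a base-frequency bin.

    Instead of a local frequency neighbourhood, the "receptive field"
    for base bin f0 is the set of bins f0, 2*f0, 3*f0, ...
    """
    bins = f0_bin * np.arange(1, n_harmonics + 1)
    bins = bins[bins < len(spec_frame)]
    return spec_frame[bins]

frame = np.zeros(512)
frame[[40, 80, 120, 160]] = 1.0  # a harmonic source with its fundamental at bin 40

assert np.all(harmonic_stack(frame, 40, 4) == 1.0)           # aligned base bin sees all harmonics
assert np.sum(harmonic_stack(frame, 50, 4)) == 0.0           # misaligned base bin sees nothing
```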
<p><a href="https://iclr.cc/virtual/poster_B1x1ma4tDr.html">DDSP - Differentiable Digital Signal Processing</a>
This paper integrates many tools from traditional signal processing, such as synthesisers, with deep learning to gain the benefits of both: DSP provides useful building blocks that can realise complicated audio transformations with just a few control parameters, bringing a lot of prior knowledge to bear on the problem at hand, while deep learning can flexibly learn the desired transformation from the available training data. This can be especially useful in small-data scenarios, where DSP tools alone are not flexible enough to make use of the data to improve results, and where standard deep learning models fail because they have to learn everything from scratch and therefore require much more data.</p>
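<p>As a toy illustration of the kind of DSP building block involved (my own sketch, not the DDSP code), here is an additive harmonic synthesiser: a full waveform is generated from just a fundamental frequency and a handful of harmonic amplitudes, which is exactly the kind of compact control space a network can learn to predict.</p>

```python
import numpy as np

def harmonic_synth(f0, amps, sr=16000, duration=0.1):
    """Additive synthesiser: a sum of sinusoids at integer multiples
    of f0. In a DDSP-style model, f0 and the per-harmonic amplitudes
    `amps` would be predicted by a neural network; here they are fixed."""
    t = np.arange(int(sr * duration)) / sr
    audio = sum(a * np.sin(2 * np.pi * f0 * (k + 1) * t)
                for k, a in enumerate(amps))
    return audio / max(1e-8, np.max(np.abs(audio)))  # normalise to [-1, 1]

# 100 ms of a 220 Hz tone with three decaying harmonics
audio = harmonic_synth(f0=220.0, amps=[1.0, 0.5, 0.25])
```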
<h3 id="a-final-note-on-transformer-efficiency">A final note on transformer efficiency</h3>
<p>There was lots of work trying to make transformer models more computationally efficient <a href="https://iclr.cc/virtual/poster_H1eA7AEtvS.html">[1]</a><a href="https://iclr.cc/virtual/poster_rkgNKkHtvB.html">[2]</a><a href="https://iclr.cc/virtual/poster_SylO2yStDr.html">[3]</a>, since they have a computational complexity of $O(N^2)$ for sequence inputs of length $N$.
This is encouraging to see – while their application was mostly limited to processing a few sentences at a time in the domain of NLP, this might allow for modelling long sequences such as audio signals and other time series data.</p>
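<p>The quadratic cost comes from the attention score matrix itself: for a length-$N$ input, vanilla self-attention materialises an $N \times N$ matrix. A minimal NumPy sketch (my own illustration, with identity query/key projections for simplicity):</p>

```python
import numpy as np

def attention_scores(x):
    """Vanilla self-attention over an input of shape (N, d) builds an
    (N, N) score matrix, so compute and memory grow quadratically in N."""
    q, k = x, x  # identity projections for this sketch
    scores = q @ k.T / np.sqrt(x.shape[1])
    weights = np.exp(scores - scores.max(axis=1, keepdims=True))
    return weights / weights.sum(axis=1, keepdims=True)  # row-wise softmax

for n in (64, 256):
    w = attention_scores(np.random.randn(n, 8))
    assert w.shape == (n, n)  # 4x longer input -> 16x more score entries
```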
Daniel Stollerbusiness@dstoller.netHaving just “visited” my first virtual conference, ICLR 2020, I wanted to talk about my general impression and highlight some papers that stuck out to me from a variety of subfields.Bounded output regression with neural networks2019-01-23T00:00:00+01:002019-01-23T00:00:00+01:00https://dans.world/Bounded-output-networks<p>Say we have a neural network (or some other model trainable with gradient descent) that performs supervised regression: For an input $x$, it outputs one or more real values $y$ as prediction, and tries to get as close to a given target value $\hat{y}$ as possible. We also know that the targets $\hat{y}$ always lie in a certain interval $[a,b]$.</p>
<p>This sounds like a very standard setting that should not pose any problems.
But as we will see, it is not so obvious how to ensure the output of the network is always in the given interval, and that training based on stochastic gradient descent (SGD) is possible without issues.</p>
<h2 id="application-directly-output-audio-waveforms">Application: Directly output audio waveforms</h2>
<p>This supervised regression occurs in many situations, such as predicting a depth map from an image, or performing audio style transfer.
We will take a look at the <a href="https://github.com/f90/Wave-U-Net">Wave-U-Net</a> model that takes a music waveform as input $x$, and predicts the individual instrument tracks directly as raw audio.
Since audio amplitudes are usually represented as values in the $[-1,1]$ range, I decided to use $\tanh(a)$ as the final activation function with $a \in \mathbb{R}$ as the last layer’s output at a given time-step.
As a result, all outputs are now between -1 and 1:</p>
<p><img src="https://dans.world/assets/img/2019-01-23-Bounded-output-networks/tanh.png" alt="Tanh activation function" /></p>
<p>As loss function for the regression, I simply use the mean squared error (MSE, $(y - \hat{y})^2$) between the prediction $y$ and the target $\hat{y}$.</p>
<h2 id="squashing-the-output-with-tanh-issues">Squashing the output with $\tanh$: Issues</h2>
<p>This looks like a good solution at first glance, since the network always produces valid outputs, so the output does not need to be post-processed.
But there are two potential problems:</p>
<ol>
<li>
<p>The true audio amplitudes $\hat{y}$ are in the range $[-1,1]$, but $\tanh(a) \in (-1, 1)$ and so never reaches -1 and 1 exactly.
If our targets $\hat{y}$ in the training data actually contain these values, the network is forced to output extremely large/small $a$ so that $\tanh(a)$ gets as close to -1 or 1 as possible.
I tested this with the Wave-U-Net in an extreme scenario, where all target amplitudes $\hat{y}$ are 1 for all inputs $x$.
After just a few training steps, activations in the layers began to explode to increase $a$, which confirms that this can actually become a problem (although my training data is a bit unrealistic).
And generally, the network has to drive up activations $a$ (and thus weights) to produce predictions with very high or low amplitudes, potentially making training more unstable.</p>
</li>
<li>
<p>At very small or large $a$ values, the gradient of $\tanh(a)$ with respect to $a$, $\tanh'(a)$, vanishes towards zero, as you can see in the plot below.
<img src="https://dans.world/assets/img/2019-01-23-Bounded-output-networks/tanh_derivative.png" alt="Tanh derivative" />
At any point during training, a large weight update that makes all model outputs $y$ almost $-1$ or $1$ would thus make the gradient of the loss with respect to the weights vanish towards zero, since it contains $\tanh'(a)$ as one factor.
This can actually happen in practice - some people reported to me that for their dataset, the Wave-U-Net suddenly diverged in this fashion after training stably for a long time, and then couldn’t recover.</p>
</li>
</ol>
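<p>Both problems are easy to verify numerically. The following sketch shows how $\tanh$ never reaches the boundary values exactly, and how its derivative collapses once the pre-activation $a$ grows large:</p>

```python
import numpy as np

a = np.array([1.0, 5.0, 10.0])
y = np.tanh(a)        # approaches, but never reaches, 1 (problem 1)
grad = 1.0 - y ** 2   # tanh'(a); vanishes as |a| grows (problem 2)

# To push the output within ~1e-8 of the target amplitude 1, the
# pre-activation must already exceed 10, where the gradient signal
# left for learning is itself on the order of 1e-8.
assert np.all(y < 1.0)   # exact 1.0 is unattainable
assert grad[-1] < 1e-7   # almost no gradient left at a = 10
```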
<h2 id="possible-solutions">Possible solutions</h2>
<p>So what other options are there?
One option is to simply use a linear output, but clipped to the $[-1,1]$ range: $y := \min(\max(a, -1), 1)$.
This solves problem number 1, since $-1$ and $1$ can be output directly.
However, problem number 2 still remains, and is maybe even more pronounced now: Clipping all output values outside $[-1,1]$ means the gradient for these outputs is exactly zero, not just arbitrarily close to it like with $\tanh$, so the network might still diverge and never recover.</p>
<p>Finally, I want to propose a third option: A linear output that is unbounded during training ($y := a$), but at test time, the output is clipped to $[-1,1]$. Compared to always clipping, there is now a significant, non-zero gradient for the network to learn from during training at all times:
If the network predicts for example $1.4$ as amplitude where the target is $1$, the MSE loss will result in the output being properly corrected towards $1$.</p>
<p>I trained a Wave-U-Net variant for singing voice separation that uses this linear output with test-time clipping for each source independently. Apart from that, all settings are equal to the <a href="https://github.com/f90/Wave-U-Net">M5-HighSR</a> model, which uses the $\tanh$ function to predict the accompaniment, and outputs the difference between the input music and the predicted accompaniment signal as voice signal.</p>
<p>Below you can see the waveforms of an instrumental section of a Nightwish song (top), the accompaniment prediction from our new model variant (middle), and from the $\tanh$ model (bottom). Red parts indicate amplitude clipping.
<img src="https://dans.world/assets/img/2019-01-23-Bounded-output-networks/waveform_comparison.png" alt="Waveform comparison" />
We can see the accompaniment from the $\tanh$ model is attenuated, since it cannot reach values close to $-1$ and $1$ easily. In contrast, our model can output the input music almost 1:1, which is desired here since there are no vocals to subtract. The clipping occurs where the original input also has it, so this can be considered a feature, not a bug.</p>
<p>The problem with the accompaniment output also creates more noise in the vocal channel for the $\tanh$ model, since it uses the difference signal as vocal output:</p>
<ul>
<li>Original song</li>
</ul>
<audio controls="">
<source src="https://dans.world/assets/audio/2019-01-23-Bounded-output-networks/nightwish_original.mp3" type="audio/mpeg" />
Your browser does not support audio file playing!
</audio>
<ul>
<li>Tanh model vocal prediction</li>
</ul>
<audio controls="">
<source src="https://dans.world/assets/audio/2019-01-23-Bounded-output-networks/nightwish_tanh_vocals.mp3" type="audio/mpeg" />
Your browser does not support audio file playing!
</audio>
<ul>
<li>Linear output model vocal prediction</li>
</ul>
<audio controls="">
<source src="https://dans.world/assets/audio/2019-01-23-Bounded-output-networks/nightwish_direct_vocals.mp3" type="audio/mpeg" />
Your browser does not support audio file playing!
</audio>
<h2 id="outlook">Outlook</h2>
<p>Although we managed to improve over $\tanh$ for outputting values of bounded range with neural networks, this might not be the perfect solution. Output activations such as $\sin$ or $\cos$ could also be considered, since they squash the output to a desired interval while still allowing the boundary values to be output exactly, but training might be difficult due to their periodic nature.</p>
<p>Also, regression loss functions other than MSE might be useful. Cross-entropy, for example, should provide a better-behaved gradient even with the $\tanh$ activation, so alternative loss functions can also play a role and should be explored in the future.</p>Daniel Stollerbusiness@dstoller.netSay we have a neural network (or some other model trainable with gradient descent) that performs supervised regression: For an input $x$, it outputs one or more real values $y$ as prediction, and tries to get as close to a given target value $\hat{y}$ as possible. We also know that the targets $\hat{y}$ always lie in a certain interval $[a,b]$.ISMIR 2018 - Paper Overviews2018-10-18T00:00:00+02:002018-10-18T00:00:00+02:00https://dans.world/ISMIR-Summary<p>This year’s ISMIR was great as ever, this time featuring</p>
<ul>
<li>lots of deep learning - I suspect because it has become much easier to use with recently developed libraries</li>
<li>lots of new, and surprisingly large, datasets (suited for the new deep learning era)</li>
<li>and a fantastic boat tour through Paris!</li>
</ul>
<p>For those who want a very quick overview of many of the papers (though not all, and the selection is admittedly biased towards my own research interests),
I created “mini-abstracts” designed to capture the core idea or contribution of each paper in a way that should be understandable to anyone familiar with the field, since even abstracts tend to be wordy or unnecessarily, well… abstract! I divided them according to the ISMIR conference session they belong to.</p>
<p>Use <a href="http://ismir2018.ircam.fr/pages/events-main-program.html">this page</a> in parallel to quickly retrieve the links to each paper’s PDF document.</p>
<h1 id="musical-objects">Musical objects</h1>
<h2 id="a-1-a-confidence-measure-for-key-labelling">(A-1) A Confidence Measure For Key Labelling</h2>
<p>Roman B. Gebhardt, Michael Stein and Athanasios Lykartsis</p>
<p>Uncertainty in key classification for songs can be estimated by looking at how much the estimated key varies across the whole song (stability), and by taking the sum of the chroma vector at each time point, averaged over the whole song, as a measure of how much tonality is contained (keyness).</p>
<h2 id="a-2-improved-chord-recognition-by-combining-duration-and-harmonic-language-models">(A-2) Improved Chord Recognition by Combining Duration and Harmonic Language Models</h2>
<p>Filip Korzeniowski and Gerhard Widmer</p>
<p>Use a model for predicting the next chord given the previous ones, combined with a duration model that predicts at which timestep the chord changes, as a language model to facilitate the learning of long-term dependencies that would be otherwise hard to learn with a time-frame based approach.</p>
<h2 id="a-4-a-predictive-model-for-music-based-on-learned-interval-representations">(A-4) A Predictive Model for Music based on Learned Interval Representations</h2>
<p>Stefan Lattner, Maarten Grachten and Gerhard Widmer</p>
<p>Use a gated recurrent autoencoder to encode the relative change in pitch at each timestep, then model these relative changes with an RNN to perform monophonic pitch sequence generation, enabling the RNN to generalise better to repeating melody patterns that continually rise/fall each time.</p>
<h2 id="a-5-an-end-to-end-framework-for-audio-to-score-music-transcription-on-monophonic-excerpts">(A-5) An End-to-end Framework for Audio-to-Score Music Transcription on Monophonic Excerpts</h2>
<p>Miguel A. Román, Antonio Pertusa and Jorge Calvo-Zaragoza</p>
<p>Use a neural network on audio to output a symbolic sequence, using a vocabulary with clefs, keys, pitches etc. required to reconstruct a full score sheet, trained with a CTC loss.</p>
<h2 id="a-6-evaluating-automatic-polyphonic-music-transcription">(A-6) Evaluating Automatic Polyphonic Music Transcription</h2>
<p>Andrew McLeod and Mark Steedman</p>
<p>Proposes five metrics with which to rate the quality of a music transcription system, and combines them to one metric describing overall quality, aiming to penalize each mistake only in exactly one of the five metrics (no multiple penalties). There is however no evaluation as to how this metric correlates with subjective quality ratings by humans.</p>
<h2 id="a-7-onsets-and-frames-dual-objective-piano-transcription">(A-7) Onsets and Frames: Dual-Objective Piano Transcription</h2>
<p>Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore and Douglas Eck</p>
<p>Instead of using a frame-wise cross-entropy loss on the piano roll output for transcription, also predict the onset position of notes to improve performance/reduce spurious note activations. They also predict note velocity separately to further improve the sound of the synthesized transcription.</p>
<h2 id="a-8-player-vs-transcriber-a-game-approach-to-data-manipulation-for-automatic-drum-transcription">(A-8) Player Vs Transcriber: A Game Approach To Data Manipulation For Automatic Drum Transcription</h2>
<p>Carl Southall, Ryan Stables and Jason Hockman</p>
<p>Add another model to the drum transcription setting (player) that can learn to use data augmentation operations on the training set to decrease the resulting transcription accuracy. Player and transcriber are trained together to make the transcriber learn from difficult examples not seen in the training data.</p>
<h2 id="a-10-evaluating-a-collection-of-sound-tracing-data-of-melodic-phrases">(A-10) Evaluating a collection of Sound-Tracing Data of Melodic Phrases</h2>
<p>Tejaswinee Kelkar, Udit Roy and Alexander Refsum Jensenius</p>
<p>Make people move their bodies in response to melodic phrases while motion-capturing them and then try to find out which movement features correlate with/predict the corresponding melody.</p>
<h2 id="a-11-main-melody-estimation-with-source-filter-nmf-and-crnn">(A-11) Main Melody Estimation with Source-Filter NMF and CRNN</h2>
<p>Dogac Basaran, Slim Essid and Geoffroy Peeters</p>
<p>Pretrain a source-filter NMF model to provide useful features for input into a convolutional-recurrent neural network to track the main melody in music pieces. Pretraining helps since it provides a better representation of the dominant fundamental frequency/pitch salience.</p>
<h2 id="a-13-a-single-step-approach-to-musical-tempo-estimation-using-a-convolutional-neural-network">(A-13) A single-step approach to musical tempo estimation using a convolutional neural network</h2>
<p>Hendrik Schreiber and Meinard Mueller</p>
<p>Neural network that predicts the local tempo given a 12 second long audio input, and its aggregated outputs over a whole song can be used for estimating the whole song’s tempo.</p>
<h2 id="a-14-analysis-of-common-design-choices-in-deep-learning-systems-for-downbeat-tracking">(A-14) Analysis of Common Design Choices in Deep Learning Systems for Downbeat Tracking</h2>
<p>Magdalena Fuentes, Brian McFee, Hélène C. Crayencour, Slim Essid and Juan Pablo Bello</p>
<p>Investigation of how downbeat tracking performance changes when state-of-the-art approaches are modified slightly, e.g. the temporal granularity of the input spectrogram, how the output is decoded from the neural network, or convolutional-RNN vs. RNN-only architectures.</p>
<h1 id="generation-visual">Generation, visual</h1>
<h2 id="b-5-bridging-audio-analysis-perception-and-synthesis-with-perceptually-regularized-variational-timbre-spaces">(B-5) Bridging audio analysis, perception and synthesis with perceptually-regularized variational timbre spaces</h2>
<p>Philippe Esling, Axel Chemla–Romeu-Santos and Adrien Bitton</p>
<p>Beta-VAE is used on instrument samples. The latent space is additionally regularized such that the distances between samples of different instruments corresponds to the perceived timbral difference according to perceptual ratings. The resulting model’s latent space can be used to classify the instrument, pitch, dynamics and family, and together with the decoder one can synthesize smoothly interpolated new sounds.</p>
<h2 id="b-6-conditioning-deep-generative-raw-audio-models-for-structured-automatic-music">(B-6) Conditioning Deep Generative Raw Audio Models for Structured Automatic Music</h2>
<p>Rachel Manzelli, Vijay Thakkar, Ali Siahkamari and Brian Kulis</p>
<p>Combine symbolic and audio music models: a recurrent network is trained to model symbolic note sequences, and a Wavenet model is separately trained to produce raw audio conditioned on a piano-roll representation. Then the models are combined to synthesize music pieces.</p>
<h2 id="b-7-convolutional-generative-adversarial-networks-with-binary-neurons-for-polyphonic-music-generation">(B-7) Convolutional Generative Adversarial Networks with Binary Neurons for Polyphonic Music Generation</h2>
<p>Hao-Wen Dong and Yi-Hsuan Yang</p>
<p>To adapt GANs for symbolic music generation, which is a discrete problem rather than the continuous one usually handled by GANs, they use straight-through estimators (“stochastic binary neurons”) that produce a binary output (randomly sampled) in the forward pass, but use the real-valued probability in the backward pass to compute gradients.</p>
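<p>For illustration, the straight-through trick can be sketched in a few framework-free lines of NumPy (my own sketch, not the paper's code): sample a hard binary value in the forward pass, but pretend the sampling was the identity when propagating gradients backwards:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

def binary_neuron_forward(p):
    """Forward pass: sample a hard 0/1 output with probability p."""
    return (rng.random(p.shape) < p).astype(float)

def binary_neuron_backward(grad_output):
    """Backward pass (straight-through): treat the sampling step as the
    identity and pass the incoming gradient through unchanged."""
    return grad_output

p = np.array([0.1, 0.9, 0.5])   # real-valued probabilities from the network
b = binary_neuron_forward(p)    # hard binary notes in the forward pass
g = binary_neuron_backward(np.ones_like(p))  # gradient w.r.t. p
```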
<h1 id="source-separation">Source separation</h1>
<h2 id="c-2-music-source-separation-using-stacked-hourglass-networks">(C-2) Music Source Separation Using Stacked Hourglass Networks</h2>
<p>Sungheon Park, Taehoon Kim, Kyogu Lee and Nojun Kwak</p>
<p>2D U-Net neural network for source separation applied multiple times in a row using a residual connection, so that the initial estimate can be further refined each time</p>
<h2 id="c-3-the-northwestern-university-source-separation-library">(C-3) The Northwestern University Source Separation Library</h2>
<p>Ethan Manilow, Prem Seetharaman and Bryan Pardo</p>
<p>Library for source separation: Supports using trained separation models easily, offers computation of evaluation metrics</p>
<h2 id="c-4-improving-bass-saliency-estimation-using-transfer-learning-and-label-propagation">(C-4) Improving Bass Saliency Estimation using Transfer Learning and Label Propagation</h2>
<p>Jakob Abeßer, Stefan Balke and Meinard Müller</p>
<p>Detecting bass notes in jazz ensemble recordings. Two techniques are investigated in the face of the small available labelled data: Label propagation - train model on annotated dataset, then predict labels for unlabelled data and retrain - and transfer learning - network is trained on isolated bass recordings first, then on the actual jazz data.</p>
<h2 id="c-5-improving-peak-picking-using-multiple-time-step-loss-functions">(C-5) Improving Peak-picking Using Multiple Time-step Loss Functions</h2>
<p>Carl Southall, Ryan Stables and Jason Hockman</p>
<p>Since many current models that predict a series of events given an audio sequence are trained with frame-wise cross-entropy followed by separate peak picking, the model's activations might not be well suited for the peak-picking procedure. Loss functions that also act on neighbouring outputs are investigated to remedy this.</p>
<h2 id="c-6-zero-mean-convolutions-for-level-invariant-singing-voice-detection">(C-6) Zero-Mean Convolutions for Level-Invariant Singing Voice Detection</h2>
<p>Jan Schlüter and Bernhard Lehner</p>
<p>Singing voice classifiers turn out to be sensitive to the overall volume of the input music, which is undesirable. While data augmentation by random amplification and mixing of voice and instrumentals helps with classification performance, this sensitivity largely remains. The paper shows that you can directly bake in this invariance by constraining the first convolutional layer so that the weights in each filter sum to 0, and get better performance. One potential drawback is that a very quiet music input with singing voice will now be classified as positive, although a listener might not be able to hear anything and say there is no singing voice.</p>
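<p>The mechanism is easy to see in a small NumPy sketch (my own illustration, assuming log-magnitude inputs as is common): a global gain change acts as an additive constant in the log domain, and any filter whose weights sum to zero cancels constant offsets exactly:</p>

```python
import numpy as np

rng = np.random.default_rng(0)

w = rng.standard_normal(5)
w -= w.mean()                  # constrain filter weights to sum to 0

x = rng.standard_normal(100)   # e.g. one row of a log-magnitude spectrogram
offset = 3.0                   # a global volume change adds a constant
                               # in the log domain

y_quiet = np.convolve(x, w, mode="valid")
y_loud = np.convolve(x + offset, w, mode="valid")

# The filter response is identical at both levels, since the constant
# offset is multiplied by sum(w) = 0.
assert np.allclose(y_quiet, y_loud)
```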
<h2 id="c-8-wave-u-net-a-multi-scale-neural-network-for-end-to-end-audio-source-separation">(C-8) Wave-U-Net: A Multi-Scale Neural Network for End-to-End Audio Source Separation</h2>
<p>Daniel Stoller, Sebastian Ewert and Simon Dixon</p>
<p>This is our own paper ;)</p>
<p>Change the “U-Net”, previously used in biomedical image segmentation and magnitude-based source separation to capture multi-scale features and dependencies, from 2D to 1D convolution (across time), to perform separation directly on the waveform without needing artifact-inducing source reconstruction steps.</p>
<p>For more information, please see the corresponding Github repository <a href="https://github.com/f90/Wave-U-Net">here</a></p>
<h2 id="c-13-music-mood-detection-based-on-audio-and-lyrics-with-deep-neural-net">(C-13) Music Mood Detection Based on Audio and Lyrics with Deep Neural Net</h2>
<p>Rémi Delbouys, Romain Hennequin, Francesco Piccoli, Jimena Royo-Letelier and Manuel Moussallam</p>
<p>Predicting the mood of a music piece by combining audio and lyrics information helps performance.</p>
<h1 id="corpora">Corpora</h1>
<h2 id="d-4-dali-a-large-dataset-of-synchronized-audio-lyrics-and-notes-automatically-created-using-teacher-student-machine-learning-paradigm">(D-4) DALI: a large Dataset of synchronized Audio, LyrIcs and notes, automatically created using teacher-student machine learning paradigm</h2>
<p>Gabriel Meseguer-Brocal, Alice Cohen-Hadria and Geoffroy Peeters</p>
<p>5000 music pieces with lyrics aligned down to the syllable level, created by matching Karaoke files containing user-aligned lyrics to the corresponding audio tracks according to a singing voice probability vector across the duration of the song. The singing voice detection system can then be retrained with the newly derived dataset to improve its performance, and the cycle can be repeated iteratively.</p>
<h2 id="d-5-openmic-2018-an-open-data-set-for-multiple-instrument-recognition">(D-5) OpenMIC-2018: An open data-set for multiple instrument recognition</h2>
<p>Eric Humphrey, Simon Durand and Brian McFee</p>
<p>Large dataset of 10 sec music snippets with labels indicating which instruments are present.</p>
<h2 id="d-6-from-labeled-to-unlabeled-data--on-the-data-challenge-in-automatic-drum-transcription">(D-6) From Labeled to Unlabeled Data – On the Data Challenge in Automatic Drum Transcription</h2>
<p>Chih-Wei Wu and Alexander Lerch</p>
<p>Investigating feature learning and student-teacher learning paradigms for drum transcription to circumvent the lack of labelled training data. Performance does not clearly increase however, indicating the need for better feature learning/student-teacher learning approaches to enable better transfer.</p>
<h2 id="d-9-vocalset-a-singing-voice-dataset">(D-9) VocalSet: A Singing Voice Dataset</h2>
<p>Julia Wilkins, Prem Seetharaman, Alison Wahl and Bryan Pardo</p>
<p>Solo singing voice recordings by 20 different professional singers with annotated labels of singing style and pitch.</p>
<h2 id="d-10-the-nes-music-database-a-multi-instrumental-dataset-with-expressive-performance-attributes">(D-10) The NES Music Database: A multi-instrumental dataset with expressive performance attributes</h2>
<p>Chris Donahue, Huanru Henry Mao and Julian McAuley</p>
<p>Large dataset of polyphonic music in symbolic form taken from NES games, but with extra performance-related attributes (note velocity, timbre)</p>
<h2 id="d-14-revisiting-singing-voice-detection-a-quantitative-review-and-the-future-outlook">(D-14) Revisiting Singing Voice Detection: A quantitative review and the future outlook</h2>
<p>Kyungyun Lee, Keunwoo Choi and Juhan Nam</p>
<p>Review paper about singing voice detection. Common problems are identified where current methods fail: 1. a low signal-to-noise ratio between vocals and instrumentals, 2. guitars and other instruments that sound “similar” to voice being mistaken for it, and 3. vibrato in other instruments being mistaken for singing voice. These findings give inspiration on how to improve future systems.</p>
<h2 id="d-15-vocals-in-music-matter-the-relevance-of-vocals-in-the-minds-of-listeners">(D-15) Vocals in Music Matter: the Relevance of Vocals in the Minds of Listeners</h2>
<p>Andrew Demetriou, Andreas Jansson, Aparna Kumar and Rachel Bittner</p>
<p>Psychological qualitative and quantitative studies demonstrate that listeners attend very closely to singing voice compared to many other aspects of music, despite the lack of singing voice related attributes in music tags for songs.</p>
<h2 id="d-16-vocal-melody-extraction-with-semantic-segmentation-and-audio-symbolic-domain-transfer-learning">(D-16) Vocal melody extraction with semantic segmentation and audio-symbolic domain transfer learning</h2>
<p>Wei Tsung Lu and Li Su</p>
<p>For vocal melody extraction, a symbolic vocal segmentation model is first trained on symbolic data. Then the vocal melody extractor is trained from the audio plus the symbolic representation extracted by the other model (assuming both audio and symbolic input are known for each sample). At test time, since symbolic data is not available, a simple filter is applied to the audio to get an estimate of what the symbolic transcription might look like to feed into the symbolic model, before its output is fed into the audio model.</p>
<h2 id="d-17-empirically-weighting-the-importance-of-decision-factors-for-singing-preference">(D-17) Empirically Weighting the Importance of Decision Factors for Singing Preference</h2>
<p>Michael Barone, Karim Ibrahim, Chitralekha Gupta and Ye Wang</p>
<p>Psychological study into how important different factors (familiarity, genre preference,
ease of vocal reproducibility, and overall preference of the song) are for predicting how attractive it is for a person to sing along to a song.</p>
<h1 id="timbre-tagging-similarity-patterns-and-alignment">Timbre, tagging, similarity, patterns and alignment</h1>
<h2 id="e-3-comparison-of-audio-features-for-recognition-of-western-and-ethnic-instruments-in-polyphonic-mixtures">(E-3) Comparison of Audio Features for Recognition of Western and Ethnic Instruments in Polyphonic Mixtures</h2>
<p>Igor Vatolkin and Günter Rudolph</p>
<p>Using evolutionary optimisation to select features most useful for detecting Western or Ethnic instruments. Since these feature sets turn out to be somewhat different, they also search for the best “compromise set” of features that performs reasonably well (but worse than the specialised features) on both types of data.</p>
<h2 id="e-4-instrudive-a-music-visualization-system-based-on-automatically-recognized-instrumentation">(E-4) Instrudive: A Music Visualization System Based on Automatically Recognized Instrumentation</h2>
<p>Takumi Takahashi, Satoru Fukayama and Masataka Goto</p>
<p>Visualising a collection of music pieces by turning each piece into a pie-chart that shows the percentage of time each instrument is active.</p>
<h2 id="e-6-jazz-solo-instrument-classification-with-convolutional-neural-networks-source-separation-and-transfer-learning">(E-6) Jazz Solo Instrument Classification with Convolutional Neural Networks, Source Separation, and Transfer Learning</h2>
<p>Juan S. Gómez, Jakob Abeßer and Estefanía Cano</p>
<p>To classify which jazz solo instrument is playing, source separation is first used to remove the other instruments, which helps classification performance. Transfer learning, on the other hand (using a model pretrained on a different dataset beforehand), does not turn out to work better, but that may be due to the way the model predictions are aggregated to compute the evaluation metrics.</p>
<h2 id="e-9-semi-supervised-lyrics-and-solo-singing-alignment">(E-9) Semi-supervised lyrics and solo-singing alignment</h2>
<p>Chitralekha Gupta, Rong Tong, Haizhou Li and Ye Wang</p>
<p>Usage of the DAMP dataset, containing amateur solo singing recordings together with unaligned lyrics, which are roughly aligned using existing speech recognition technology, to train a lyrics transcription and alignment system. They reach a word error rate of 36%; however, it is not known how much this degrades on normal music with lots of accompaniment.</p>
<h2 id="e-14-end-to-end-learning-for-music-audio-tagging-at-scale">(E-14) End-to-end Learning for Music Audio Tagging at Scale</h2>
<p>Jordi Pons, Oriol Nieto, Matthew Prockup, Erik M. Schmidt, Andreas F. Ehmann and Xavier Serra</p>
<p>A comparison of spectrogram-based and raw-audio-based classification models for music tagging with varying amounts of training data indicates that spectrograms lead to slightly better performance for small training datasets, but slightly worse performance for very large ones, compared to direct audio input.</p>
<h2 id="e-17-learning-interval-representations-from-polyphonic-music-sequences">(E-17) Learning Interval Representations from Polyphonic Music Sequences</h2>
<p>Stefan Lattner, Maarten Grachten and Gerhard Widmer</p>
<p>Instead of modeling a sequence of pitches directly, the transformation from the previous pitches into the current one is modeled with a gated autoencoder, and an RNN then models the autoencoder embeddings, which makes for key-invariant processing.</p>
<h1 id="session-f---machine-and-human-learning-of-music">Session F - Machine and human learning of music</h1>
<h2 id="f-3-listener-anonymizer-camouflaging-play-logs-to-preserve-users-demographic-anonymity">(F-3) Listener Anonymizer: Camouflaging Play Logs to Preserve User’s Demographic Anonymity</h2>
<p>Kosetsu Tsukuda, Satoru Fukayama and Masataka Goto</p>
<p>Individual users of music streaming services can protect themselves against being identified in terms of nationality, age, etc. through their playback history with this technique, which estimates these attributes internally and then tells users which songs to play in order to confuse the recommendation engine and obfuscate these attributes.</p>
<h2 id="f-7-representation-learning-of-music-using-artist-labels">(F-7) Representation Learning of Music Using Artist Labels</h2>
<p>Jiyoung Park, Jongpil Lee, Jangyeon Park, Jung-Woo Ha and Juhan Nam</p>
<p>Instead of classifying genre directly, which limits training data and introduces label noise, first train to detect the artist, which is an objective, easily obtained label. Then use the learned feature representation in the last layer to perform genre detection on a few different datasets.</p>
<h2 id="f-11-midi-vae-modeling-dynamics-and-instrumentation-of-music-with-applications-to-style-transfer">(F-11) MIDI-VAE: Modeling Dynamics and Instrumentation of Music with Applications to Style Transfer</h2>
<p>Gino Brunner, Andres Konrad, Yuyi Wang and Roger Wattenhofer</p>
<p>Use a VAE on short symbolic music excerpts, but reserve 2 dimensions in the latent space for modeling different musical styles (jazz, classical, …) by ensuring that a classifier using these dimensions can identify the style of the input. The VAE can then be used for style transfer by encoding a given input and changing the style code in the latent space before decoding it.</p>
<h2 id="f-12-understanding-a-deep-machine-listening-model-through-feature-inversion">(F-12) Understanding a Deep Machine Listening Model Through Feature Inversion</h2>
<p>Saumitra Mishra, Bob L. Sturm and Simon Dixon</p>
<p>To understand what information/concept is captured by each layer/neuron in a deep audio model, extra decoder functions are trained at each layer to recover the original input of the network (which gets harder the closer the layer is to the classification output).</p>
<h2 id="f-16-learning-to-listen-read-and-follow-score-following-as-a-reinforcement-learning-game">(F-16) Learning to Listen, Read, and Follow: Score Following as a Reinforcement Learning Game</h2>
<p>Matthias Dorfer, Florian Henkel and Gerhard Widmer</p>
<p>Apply reinforcement learning to score following by defining an agent that looks at the current section of the sheet music and the audio spectrogram and then decides whether to increase or decrease the current scrolling speed through the sheet.</p>
<h1 id="spectrogram-input-normalisation-for-neural-networks">Spectrogram input normalisation for neural networks (2017-11-29)</h1>
<p>In this post, I want to talk about magnitude spectrograms as inputs and outputs of neural networks, and how to normalise them to help the training process.</p>
<h3 id="introduction-time-frequency-representations-magnitude-spectrogram">Introduction: Time-frequency representations, magnitude spectrogram</h3>
<p>When using neural networks for audio tasks, it is often advantageous to input not the audio waveform directly into the network, but a two-dimensional representation describing the energy in the signal at a particular time and frequency. A popular time-frequency representation which we will also use here is obtained by the <a href="https://en.wikipedia.org/wiki/Short-time_Fourier_transform">short-time Fourier transform (STFT)</a>, where the audio is split into overlapping time frames for which the FFT is computed. From an STFT, we obtain a <strong>spectrogram matrix</strong> $\mathbf{S}$ with F rows (number of frequency bins) and T columns (number of time frames), where each entry is a complex number. We can take the radius and the polar angle of each complex number to decompose the spectrogram matrix into a <strong>magnitude matrix</strong> $\mathbf{M}$ and a <strong>phase matrix</strong> $\mathbf{P}$ so that for each entry we have</p>
\[S_{i,j} = M_{i,j} \cdot e^{\mathrm{i} \cdot P_{i,j}}\]
<p>For many audio tasks, only the magnitudes $\mathbf{M}$ are used and the phases $ \mathbf{P}$ are discarded - when using overlapping windows, $ \mathbf{S}$ can be reconstructed from $ \mathbf{M}$ alone, and the phase tends to have a <a href="http://deepsound.io/dcgan_spectrograms.html">minor impact on sound quality</a>. Here you can see the magnitude and phase for an example song, computed from an STFT with $ N=2048$ samples in each window, and a hop size of 512:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/logspectrogram.png" alt="Log-scaled magnitude spectrogram of an example song" />
<figcaption>Magnitudes $ \mathbf{M}$ on logarithmic scale ($log(x+1)$) of an example song, with frequency on the vertical, and time frames on the horizontal axes.</figcaption>
</figure>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/phasespectrogram.png" alt="Phase spectrogram of the example song" />
<figcaption>Phase matrix $ \mathbf{P}$, each value being an angle in radians</figcaption>
</figure>
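<p>This decomposition can be sketched with SciPy (a minimal, illustrative example: random audio stands in for a real song, and <code>scipy.signal.stft</code> is assumed; window length and hop size follow the values above):</p>

```python
import numpy as np
from scipy.signal import stft

# Random audio standing in for a song (illustrative; 1 s at 44.1 kHz).
rng = np.random.default_rng(0)
audio = rng.uniform(-1.0, 1.0, size=44100)

# STFT with N = 2048 samples per window and a hop size of 512, as in the post.
N, hop = 2048, 512
_, _, S = stft(audio, window="hann", nperseg=N, noverlap=N - hop)

M = np.abs(S)    # magnitude matrix, shape (F, T)
P = np.angle(S)  # phase matrix, each entry an angle in radians

# The complex spectrogram is exactly recovered from magnitude and phase:
# S[i, j] = M[i, j] * exp(i * P[i, j])
assert np.allclose(S, M * np.exp(1j * P))
```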
<h3 id="determining-the-value-range-for-spectrogram-magnitudes">Determining the value range for spectrogram magnitudes</h3>
<p>Neural networks tend to converge faster and more stably when the inputs are normally distributed with a mean close to zero and a bounded variance (e.g. 1) so that with the initial weights, the output is already close to the desired one. Similarly, to allow a neural network to output high-dimensional objects, using an output activation function in the last layer that constrains the output range of the network to the real data range can greatly help training and also prevent invalid network predictions. For example, pixels in images are often reduced to a [0,1] interval, and the network output is fed through a sigmoid nonlinearity whose output domain is (0,1).</p>
<p>For these reasons, we need to know the minimum and maximum possible magnitude value in our spectrograms. Since magnitudes measure the length of a 2D vector (polar coordinates), their minimum value is zero. For the maximum value of any magnitude, we take a look at how the complex Fourier coefficient for frequency k is computed in an N-point FFT, which ends up in the complex-valued spectrogram:</p>
\[X_k = \sum_{n=0}^{N-1}{x_n \cdot \left[\cos\left(\frac{2 \pi k n}{N}\right) - \mathrm{i} \cdot \sin\left(\frac{2 \pi k n}{N}\right)\right]}\]
<p>Since the bracketed complex term is bounded by 1 in magnitude, multiplication by the signal amplitude $x_n$, which lies between -1 and 1, results in a complex number still bounded by 1 in magnitude. Taking the sum, the maximum for $X_k$ is thus N. When using a window, we additionally multiply each element of the sum by a window term $w_n$. In the worst case, each element has magnitude 1, so multiplying by the window elements gives us the sum of all window elements as the maximum.</p>
<p>Therefore, <strong>the range of possible magnitudes resulting from an N-point STFT is</strong>: $[0, \sum_{n=0}^{N-1}{w_n}]$. With a Hann window, this turns out to be exactly $\frac{N}{2}$. Now we know the value range of magnitudes and could therefore normalise them to a desired range, for example [0, 1]. If we want our model to output audio, we can use a sigmoid function as output activation function and apply the inverse of this normalisation step, to accelerate learning and ensure that the network outputs are valid.</p>
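<p>This bound is easy to check numerically. A small sketch (assuming the periodic, "fftbins" Hann variant, whose elements sum to exactly N/2; note that NumPy's symmetric <code>np.hanning</code> sums to (N-1)/2 instead; the magnitude matrix here is random placeholder data):</p>

```python
import numpy as np
from scipy.signal import get_window

# For a periodic Hann window, the sum of the window elements -- and hence
# the maximum possible STFT magnitude -- is exactly N/2.
N = 2048
w = get_window("hann", N)   # periodic ("fftbins") Hann window
max_magnitude = w.sum()
assert abs(max_magnitude - N / 2) < 1e-6

# Normalise illustrative magnitudes into [0, 1], and invert the mapping
# afterwards (as one would for sigmoid network outputs).
rng = np.random.default_rng(0)
M = rng.uniform(0.0, max_magnitude, size=(1025, 100))
M_norm = M / max_magnitude
assert M_norm.min() >= 0.0 and M_norm.max() <= 1.0
assert np.allclose(M_norm * max_magnitude, M)
```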
<h3 id="gaussianisation-to-make-spectral-magnitudes-normally-distributed">Gaussianisation to make spectral magnitudes normally distributed</h3>
<p>However, the overall distribution of magnitude values is very non-Gaussian, since many entries in the spectrogram are close to zero, creating a very skewed (heavy-tailed) distribution which can impede learning:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/spectrohistro.png" alt="Histogram of spectrogram magnitude values" />
<figcaption>Histogram of spectrogram magnitude values, with counts on the vertical axis.</figcaption>
</figure>
<p>As the plot shows, the magnitudes roughly follow a steep exponential distribution. I will show both an easy way and a more accurate but more complex way to make this distribution closer to normal.</p>
<h3 id="1-option-a-logarithmic-transformation">1. Option: A logarithmic transformation</h3>
<p>For a roughly exponential distribution, an obvious idea would be to compute $\log(\mathbf{S})$, a simple and dataset-independent transformation, to stretch out the near-zero values. However, magnitudes can be zero, and log(0) is undefined! What if we add a positive constant c and compute $\log(\mathbf{S} + c)$? The problem here is that the value of c critically influences how much low magnitudes close to zero are expanded during the transformation - small values of c such as 0.001 lead to great expansion, and vice versa. It is also not immediately clear how to set c to best approximate a normal distribution. Despite these problems, the transformation at least compresses large values effectively and is used so often that many computing libraries such as Numpy offer the c=1 version with the shorthand “<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.log1p.html">log1p</a>”, and its inverse “<a href="https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.expm1.html">expm1</a>”, defined as $\exp(x) - 1$. For our example, we see that c=1 does not change the shape of the distribution much at all, while $c=10^{-7}$ gets us close to a normal distribution:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/logspectrohisto.png" alt="Histogram of magnitude values after the log1p transformation" />
<figcaption>Applying $log1p = \log(x+1)$ (c=1) to the magnitude values does not change the shape of the distribution much, but constrains the input values to a much smaller range, and enlarges differences between small values.</figcaption>
</figure>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/loggspectrohisto.png" alt="Histogram of magnitude values after the log(x + 1e-7) transformation" />
<figcaption>Applying $\log(x + 10^{-7})$ ($c=10^{-7}$) to the magnitude values almost gives us a normally distributed variable, leaving only a small additional peak at around -8.</figcaption>
</figure>
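<p>Both variants can be sketched with NumPy (illustrative, roughly exponential magnitude values stand in for a real spectrogram here):</p>

```python
import numpy as np

# Roughly exponential values, like the magnitude histogram above (illustrative).
rng = np.random.default_rng(0)
M = rng.exponential(scale=0.05, size=100000)

compressed = np.log1p(M)     # c = 1: compresses large values, mild expansion near zero
expanded = np.log(M + 1e-7)  # c = 1e-7: strongly expands near-zero values

# Both transforms are exactly invertible, which matters when mapping
# network outputs back to magnitudes.
assert np.allclose(np.expm1(compressed), M)
assert np.allclose(np.exp(expanded) - 1e-7, M)
```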
<p>In general, we see that the closer c is to zero, the more small values are expanded. But how do we set this factor of expansion to make values as close to normally distributed as possible? This is where the Box-Cox transformation comes in!</p>
<h3 id="2-option-transformation-with-box-cox">2. Option: Transformation with Box-Cox</h3>
<p>A more advanced method for Gaussianisation is the <a href="https://en.wikipedia.org/wiki/Power_transform#Box.E2.80.93Cox_transformation"><strong>Box-Cox-Transformation</strong></a>. It comes in two variants: With only one, or with two parameters. We will need the two-parameter version, as it can handle zero values which can occur in our spectrogram. With parameters $ \lambda_1$ and $ \lambda_2$ the Box-Cox transformation is defined as</p>
\[y_i^{(\boldsymbol{\lambda})} =
\begin{cases}
\dfrac{(y_i + \lambda_2)^{\lambda_1} - 1}{\lambda_1} & \text{if } \lambda_1 \neq 0, \\
\ln{(y_i + \lambda_2)} & \text{if } \lambda_1 = 0.
\end{cases}\]
<p>Upon closer inspection, we can see similarities to our first option, the logarithmic transformation. $ \lambda_2$ serves the same purpose as the constant c: Making the values non-zero so we can apply further transformations. For $ \lambda_1 = 0$, Box-Cox is even equivalent to our first method and can be seen as an extension of it! The important difference is that <strong>the parameters are estimated from data</strong> so that the resulting distribution is as close to normally distributed as possible, which is more accurate than a simple log transform with a predefined constant c. I have not come across an implementation that estimates both parameters, though. With the boxcox method from Scipy, we can only estimate $ \lambda_1$, so for our audio example we use $ \lambda_2 = 10^{-7}$ just like in our first method. The best parameter found is $ \lambda_1 = 0.043$, which is very close to zero and therefore a very similar transformation:</p>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/boxcoxhisto.png" alt="Histogram of magnitude values after the Box-Cox transformation" />
<figcaption>Histogram of magnitude values after Box-Cox transformation with $\lambda_1 = 0.043, \lambda_2 = 10^{-7}$. The tails of the distribution are more symmetrical and the peak is more centered between them, but the additional peak remains.</figcaption>
</figure>
<figure>
<img src="https://dans.world/assets/img/2017-11-29-Spectrogram-input-normalisation-for-neural-networks/boxcoxspectro.png" alt="Box-Cox-transformed magnitude spectrogram" />
<figcaption>Box-cox-transformed magnitude spectrogram. The structures in the higher frequency ranges are now more easily visible, while the fact that lower frequencies have higher energy is less emphasized.</figcaption>
</figure>
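<p>The fitting procedure can be sketched with SciPy under the assumptions above: <code>scipy.stats.boxcox</code> estimates only $\lambda_1$ by maximum likelihood, so $\lambda_2$ is fixed to $10^{-7}$ as in the post, and the magnitude values are illustrative:</p>

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# Roughly exponential values standing in for spectrogram magnitudes (illustrative).
rng = np.random.default_rng(0)
M = rng.exponential(scale=0.05, size=10000)

# Shift by a fixed lambda_2 so all values are strictly positive, then let
# scipy estimate lambda_1 (passing no lmbda triggers the ML estimate).
lambda_2 = 1e-7
transformed, lambda_1 = boxcox(M + lambda_2)

# Invert: undo the Box-Cox transform, then subtract the fixed offset again.
recovered = inv_boxcox(transformed, lambda_1) - lambda_2
assert np.allclose(recovered, M)
```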
<h2 id="summary">Summary</h2>
<p>Magnitude spectrograms are tricky to use as input or output of neural networks due to their very skewed, non-normal distribution of values, and because their maximum possible value, which is needed to scale the value range to a desired interval, is not obvious at first glance. The latter is especially important for neural networks that output magnitudes directly, whose training can work better with an output activation function that restricts the network output range to valid values.</p>
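<p>The overall normalisation pipeline can be sketched end-to-end (a minimal sketch assuming log1p as the transform and a maximum magnitude of N/2 = 1024 for a Hann-window STFT with N = 2048; not the post's exact code):</p>

```python
import numpy as np

def normalise(M, max_magnitude=1024.0):
    """Map magnitudes in [0, max_magnitude] into [0, 1] via log1p."""
    return np.log1p(M) / np.log1p(max_magnitude)

def denormalise(y, max_magnitude=1024.0):
    """Inverse mapping, e.g. for sigmoid network outputs."""
    return np.expm1(y * np.log1p(max_magnitude))

M = np.array([0.0, 0.01, 1.0, 512.0, 1024.0])
y = normalise(M)
assert y.min() >= 0.0 and y.max() <= 1.0
assert np.allclose(denormalise(y), M)
```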
<p>The solution is input normalisation: <strong>First, transform the values</strong> either with a simple $x \rightarrow \log(x+1)$ (first option) or a Box-Cox transformation (second option, more advanced), which should expand low values and compress high ones, <strong>making the distribution more Gaussian</strong>. <strong>Then bring the transformed values into the desired interval</strong>. For this we calculate which values 0 and the maximum magnitude are transformed into after applying our particular Gaussianisation, and use this to scale the values to the desired interval.</p>
<h1 id="tensorflow-lstm-for-language-modelling">Tensorflow LSTM for Language Modelling (2016-11-12)</h1>
<p>In this post, I will show you how to build an LSTM network for the task of character-based language modelling (predict the next character based on the previous ones), and apply it to generate lyrics. In general, the model allows you to generate new text as well as auto-complete your own sentences based on any given text database.</p>
<h2 id="the-code">The code</h2>
<p>The code is available <a href="https://github.com/f90/Tensorflow-Char-LSTM/tree/master">here</a>, and the lyrics database can be created by running the crawler I developed <a href="https://github.com/f90/lyrics-crawler">here</a>.</p>
<h2 id="want-to-see-what-lyrics-my-model-composes-click-here">Want to see what lyrics my model composes? <a href="#example-output">Click here!</a></h2>
<p>It is implemented in Tensorflow, which has been rapidly evolving in the last few months. As a result, best practices for common tasks are changing as well. That is why I built my own code to make the best use of <a href="https://www.tensorflow.org/versions/r0.12/how_tos/threading_and_queues/index.html">Queues</a> and new functionality from the <a href="https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/api_docs/python/contrib.training.md">training-contrib</a> package. In particular, implementing batch training with variable-length sequences on an unrolled LSTM with truncated BPTT, while maintaining the hidden state for each sequence, is greatly simplified and optimised. Without further ado, let’s get started!</p>
<h1 id="input-pipeline">Input pipeline</h1>
<p>First, we are going to load the dataset. I adapted this a bit to my case, but it should be easy for you to change it to your liking:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data, vocab = Dataset.readLyrics(data_settings["input_csv"], data_settings["input_vocab"])
trainIndices, testIndices = Dataset.createPartition(data, train_settings["trainPerc"])
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Dataset</code> is a helper file that manages dataset-related operations. Here I assume <code class="language-plaintext highlighter-rouge">data</code> has the form of a list of entries, each again a list with two entries: the first entry denotes the artist and the second the lyrics content. The first entry is used to prevent having the same artist in both training and test set. <code class="language-plaintext highlighter-rouge">trainIndices</code> and <code class="language-plaintext highlighter-rouge">testIndices</code> are lists of indices referring to the rows in <code class="language-plaintext highlighter-rouge">data</code> that correspond to the training and test set, respectively. The settings variables are just dictionaries that hold user-defined settings. <code class="language-plaintext highlighter-rouge">vocab</code> is a special Vocabulary object that translates characters into integer indices and vice versa, and we will get to know its functionality better along the way.</p>
<p>For creating the batches to train on, we will use <code class="language-plaintext highlighter-rouge">batch_sequences_with_states</code>, as it is very convenient to use. It requires a key, the sequence length, and input and output for a SINGLE sequence in symbolic form. We create these properties here as placeholders, to later feed them from our own input thread. This design makes input processing very fast and ensures it does not reduce overall training speed: model training and input processing run simultaneously in different threads.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>keyInput = tf.placeholder(tf.string) # To identify each sequence
lengthInput = tf.placeholder(tf.int32) # Length of sequence
seqInput = tf.placeholder(tf.int32, shape=[None]) # Input sequence
seqOutput = tf.placeholder(tf.int32, shape=[None]) # Output sequence
</code></pre></div></div>
<p>Then we create a <code class="language-plaintext highlighter-rouge">RandomShuffleQueue</code> and the enqueue and dequeue operations, which means sequences will be randomly selected from the queue to form batches during training. This presents an effective compromise between completely random sample selection, which is often slow for large datasets as data has to be pulled from very different memory locations, and completely sequential reading. Using the dictionaries ensures compatibility with the Tensorflow sequence format:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>q = tf.RandomShuffleQueue(input_settings["queue_capacity"], input_settings["min_queue_capacity"],
[tf.string, tf.int32, tf.int32, tf.int32])
enqueue_op = q.enqueue([keyInput, lengthInput, seqInput, seqOutput])
with tf.device("/cpu:0"):
key, contextT, sequenceIn, sequenceOut = q.dequeue()
context = {"length" : tf.reshape(contextT, [])}
sequences = {"inputs" : tf.reshape(sequenceIn, [contextT]),
"outputs" : tf.reshape(sequenceOut, [contextT])}
</code></pre></div></div>
<p>Instead of using the built-in CSV or TFRecord Readers to enqueue samples, I created my own method that can read directly from the <code class="language-plaintext highlighter-rouge">data</code> in the RAM. It endlessly loops over the samples given by <code class="language-plaintext highlighter-rouge">indices</code> and adds them to the queue. It can easily be adapted to read arbitrary files/parts of datasets and perform further preprocessing:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Enqueueing method in different thread, loading sequence examples and feeding into FIFO Queue
def load_and_enqueue(indices):
run = True
key = 0 # Unique key for every sample, even over multiple epochs (otherwise the queue could be filled up with two same-key examples)
while run:
for index in indices:
current_seq = data[index][1]
try:
sess.run(enqueue_op, feed_dict={keyInput: str(key),
lengthInput: len(current_seq)-1,
seqInput: current_seq[:-1],
seqOutput: current_seq[1:]},
options=tf.RunOptions(timeout_in_ms=60000))
except tf.errors.DeadlineExceededError as e:
print("Timeout while waiting to enqueue into input queue! Stopping input queue thread!")
run = False
break
key += 1
    print("Finished enqueueing all " + str(len(indices)) + " samples!")
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">load_and_enqueue</code> will be started as a separate thread later. Two important things to note here: One is the <code class="language-plaintext highlighter-rouge">key</code>, which is different even for repeated enqueues of the same sample, to avoid errors caused by clashing keys when the same sample accidentally ends up in the queue twice. The other is the timeout, which is the only way I found to stop training: close the queue, then catch the resulting <code class="language-plaintext highlighter-rouge">DeadlineExceededError</code>. Lastly, input and output are shifted against each other by one step to force the model to predict the upcoming character.</p>
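<p>The one-step shift at the end of <code class="language-plaintext highlighter-rouge">load_and_enqueue</code> can be illustrated in plain Python (a toy sketch with made-up character indices, not the post's code):</p>

```python
# Toy integer indices standing in for an encoded character sequence.
sequence = [5, 12, 9, 9, 22]

inputs = sequence[:-1]       # the network sees all characters but the last
targets = sequence[1:]       # ...and must predict the sequence shifted by one
length = len(sequence) - 1   # the lengthInput fed to the queue

# At every position t, the target is the character following the input.
assert all(targets[t] == sequence[t + 1] for t in range(length))
```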
<h1 id="the-lstm-model">The LSTM model</h1>
<p>Our model is an LSTM with a variable number of layers and configurable dropout, whose hidden states will be maintained separately for each sequence. I created a new class <code class="language-plaintext highlighter-rouge">LyricsPredictor</code> with an inference method that builds the computational graph. The beginning looks like this and sets up the RNN cells:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def inference(self, key, context, sequences, num_enqueue_threads):
# RNN cells and states
cells = list()
initial_states = dict()
for i in range(0, self.num_layers):
cell = tf.contrib.rnn.LSTMBlockCell(num_units=self.lstm_size) # Block LSTM version gives better performance #TODO Add linear projection option
cell = tf.nn.rnn_cell.DropoutWrapper(cell,input_keep_prob=1-self.input_dropout, output_keep_prob=1-self.output_dropout)
cells.append(cell)
initial_states["lstm_state_c_" + str(i)] = tf.zeros(cell.state_size[0], dtype=tf.float32)
initial_states["lstm_state_h_" + str(i)] = tf.zeros(cell.state_size[1], dtype=tf.float32)
cell = tf.nn.rnn_cell.MultiRNNCell(cells)
[...]
</code></pre></div></div>
<p>It receives the key, context, and content of a sequence, and how many threads should be used to dequeue from the <code class="language-plaintext highlighter-rouge">RandomShuffleQueue</code> to provide input to the model. I found that <code class="language-plaintext highlighter-rouge">LSTMBlockCell</code> works slightly faster than the normal <code class="language-plaintext highlighter-rouge">LSTMCell</code> class. Now comes the neat bit: we can use the following code to let Tensorflow form batches from these sequences after splitting them up into chunks according to the unroll length. These batches also come with a context that ensures the hidden state is carried over from the last chunk of a sequence to the next, sparing us a lot of hassle trying to implement and optimise that on our own.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># BATCH INPUT
self.batch = tf.contrib.training.batch_sequences_with_states(
input_key=key,
input_sequences=sequences,
input_context=context,
input_length=tf.cast(context["length"], tf.int32),
initial_states=initial_states,
num_unroll=self.num_unroll,
batch_size=self.batch_size,
num_threads=num_enqueue_threads,
capacity=self.batch_size * num_enqueue_threads * 2)
inputs = self.batch.sequences["inputs"]
targets = self.batch.sequences["outputs"]
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">inputs</code> and <code class="language-plaintext highlighter-rouge">targets</code> are part of the resulting batch and are formed at runtime. New sequences are pulled in as soon as some in the batch are finished. In the following, the inputs are transformed from integer indices into one-hot vectors. Then, they are reshaped from a [batch_size, unroll_length, vocab_size] tensor into a list of unroll_length tensors of shape [batch_size, vocab_size], to conform with the RNN interface. Finally, we can use state_saving_rnn with the state-saving batch we created beforehand to get our outputs.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Convert input into one-hot representation (from single integers indicating character)
print(self.vocab_size)
embedding = tf.constant(np.eye(self.vocab_size), dtype=tf.float32)
inputs = tf.nn.embedding_lookup(embedding, inputs)
# Reshape inputs (and targets respectively) into list of length T (unrolling length), with each element being a Tensor of shape (batch_size, input_dimensionality)
inputs_by_time = tf.split(1, self.num_unroll, inputs)
inputs_by_time = [tf.squeeze(elem, squeeze_dims=1) for elem in inputs_by_time]
targets_by_time = tf.split(1, self.num_unroll, targets)
targets_by_time = [tf.squeeze(elem, squeeze_dims=1) for elem in targets_by_time] # num_unroll-list of (batch_size) tensors
self.targets_by_time_packed = tf.pack(targets_by_time) # (num_unroll, batch_size)
# Build RNN
state_name = initial_states.keys()
self.seq_lengths = self.batch.context["length"]
(self.outputs, state) = tf.nn.state_saving_rnn(cell, inputs_by_time, state_saver=self.batch,
sequence_length=self.seq_lengths, state_name=state_name, scope='SSRNN')
</code></pre></div></div>
<p>Here we have to be careful: while loading the data, a special end-of-sequence token is appended to every sequence so the network learns when to stop generating lyrics, and zero is not used as an index for any character, since the zero entry is needed later to mask the outputs when computing the loss. So if there are N distinct characters, <code class="language-plaintext highlighter-rouge">self.vocab_size</code> is N+2. Finally, we put a softmax on top of the outputs by iterating through the list of length num_unroll in which each entry represents one timestep, and return logits and probabilities:</p>
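<p>A toy sketch of this index layout (hypothetical; the actual Vocabulary class lives in the linked repository): index 0 is reserved for padding, 1..N for the N characters, and N+1 for the end-of-sequence token, giving a vocab size of N+2.</p>

```python
# Build a toy vocabulary over the distinct characters of a small corpus.
chars = sorted(set("hello world"))                       # the N distinct characters
char_to_index = {c: i + 1 for i, c in enumerate(chars)}  # 1..N; 0 kept for padding
EOS = len(chars) + 1                                     # end-of-sequence token
vocab_size = len(chars) + 2                              # N characters + padding + EOS

# Encode a sequence and append the end-of-sequence token.
encoded = [char_to_index[c] for c in "hello"] + [EOS]

assert 0 not in encoded            # zero only ever marks padding
assert encoded[-1] == EOS
```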
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Create softmax parameters, weights and bias, and apply to RNN outputs at each timestep
with tf.variable_scope('softmax'):
    softmax_w = tf.get_variable("softmax_w", [self.lstm_size, self.vocab_size])
    softmax_b = tf.get_variable("softmax_b", [self.vocab_size])
logits = [tf.matmul(outputStep, softmax_w) + softmax_b for outputStep in self.outputs]
self.logit = tf.pack(logits)
self.probs = tf.nn.softmax(self.logit)
tf.summary.histogram("probabilities", self.probs)
return (self.logit, self.probs)
</code></pre></div></div>
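<p>To make the index convention concrete, here is a standalone NumPy sketch (toy numbers, independent of the model code) of the one-hot lookup and the N+2 vocabulary size:</p>

```python
import numpy as np

# Index convention from above, for a toy alphabet of N = 3 characters:
# 0 is reserved for padding, 1..N are real characters, N+1 is the EOS token,
# so the one-hot table has vocab_size = N + 2 rows.
N = 3
vocab_size = N + 2
embedding = np.eye(vocab_size, dtype=np.float32)

sequence = [2, 1, 3, N + 1]    # "char2 char1 char3 <EOS>"
one_hot = embedding[sequence]  # same effect as tf.nn.embedding_lookup
print(one_hot.shape)  # (4, 5)
```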
<p>To train the model, we also need a loss function. Here we exploit the fact that index 0 is never used as a target: any zero entry in the target list must come from the zero-padding and indicates that the sequence is already over. The following code uses this to mask the loss, considering only non-zero targets. We also add L2 regularisation.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def loss(self, l2_regularisation):
    with tf.name_scope('loss'):
        # Compute mean cross entropy loss for each output.
        self.cross_loss = tf.nn.sparse_softmax_cross_entropy_with_logits(self.logit, self.targets_by_time_packed) # (num_unroll, batch_size)
        # Mask losses of outputs at positions t which are outside the length of the respective sequence, so they are not used for backprop
        # Take signum => if target is non-zero (valid char), mask is 1 (valid output), otherwise 0 (padding, no gradient/loss calculation)
        mask = tf.sign(tf.abs(tf.cast(self.targets_by_time_packed, dtype=tf.float32))) # (num_unroll, batch_size), entries in {0,1}
        self.cross_loss = self.cross_loss * mask
        output_num = tf.reduce_sum(mask)
        sum_cross_loss = tf.reduce_sum(self.cross_loss)
        mean_cross_loss = sum_cross_loss / output_num # Sum over masked losses, divided by total number of valid outputs
        # L2 regularisation over all trainable variables
        vars = tf.trainable_variables()
        l2_loss = tf.contrib.layers.apply_regularization(tf.contrib.layers.l2_regularizer(l2_regularisation), weights_list=vars)
        loss = mean_cross_loss + l2_loss
        tf.summary.scalar('mean_batch_cross_entropy_loss', mean_cross_loss)
        tf.summary.scalar('mean_batch_loss', loss)
        return loss, mean_cross_loss, sum_cross_loss, output_num
</code></pre></div></div>
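<p>The masking trick can be illustrated with a standalone NumPy sketch (toy values, not the actual model code):</p>

```python
import numpy as np

# Index 0 marks zero-padding, so sign(|target|) is 1 for valid characters
# and 0 for padding positions.
targets = np.array([[5, 2, 9, 0],
                    [3, 0, 0, 0]])           # 2 sequences, unrolled over 4 steps
per_step_loss = np.full(targets.shape, 2.0)  # stand-in for the cross-entropy values

mask = np.sign(np.abs(targets)).astype(np.float64)
masked_loss = per_step_loss * mask
mean_loss = masked_loss.sum() / mask.sum()   # average only over the 4 valid positions
print(mean_loss)  # 2.0 - the padding does not dilute the mean
```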
<h1 id="training-the-model">Training the model</h1>
<p>Now that we set up the model, we need to train it!</p>
<h2 id="set-up-symbolic-training-operations">Set up symbolic training operations</h2>
<p>First we need the necessary symbolic operations. We set up a step counter and a learning rate variable that decays exponentially depending on the current step:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>global_step = tf.get_variable('global_step', [],
                              initializer=tf.constant_initializer(0.0))
# Learning rate
initial_learning_rate = tf.constant(train_settings["learning_rate"])
learning_rate = tf.train.exponential_decay(initial_learning_rate, global_step, train_settings["learning_rate_decay_epoch"], train_settings["learning_rate_decay_factor"])
tf.summary.scalar("learning_rate", learning_rate)
</code></pre></div></div>
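<p>For intuition, the decay schedule computes <code class="language-plaintext highlighter-rouge">lr = initial * decay_factor ** (step / decay_steps)</code>; a quick sketch with made-up settings values:</p>

```python
# Made-up stand-ins for the train_settings values used above.
initial_lr = 0.001
decay_steps = 1000    # train_settings["learning_rate_decay_epoch"]
decay_factor = 0.9    # train_settings["learning_rate_decay_factor"]

def decayed_lr(step):
    # Same formula as tf.train.exponential_decay (non-staircase mode)
    return initial_lr * decay_factor ** (step / decay_steps)

print(decayed_lr(0))     # 0.001
print(decayed_lr(2000))  # two full decay periods: 0.001 * 0.81
```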
<p>Then we calculate the gradients, clip them by their global norm, and log them for visualisation in Tensorboard - useful for spotting vanishing or exploding gradients in deeper RNNs:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Gradient calculation
tvars = tf.trainable_variables()
grads, _ = tf.clip_by_global_norm(tf.gradients(loss, tvars, aggregation_method=2), # Use experimental aggregation to reduce memory usage
                                  5.0)
# Visualise gradients
vis_grads = [0 if i is None else i for i in grads]
for g in vis_grads:
    tf.summary.histogram("gradients_" + str(g), g)
</code></pre></div></div>
<p>We have to replace None entries with 0 for visualisation. Then we choose an optimiser (Adam in this case) and define the training operations.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>optimizer = tf.train.AdamOptimizer(learning_rate)
train_op = optimizer.apply_gradients(zip(grads, tvars),
                                     global_step=global_step)
trainOps = [loss, train_op,
            global_step, learning_rate]
</code></pre></div></div>
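<p>For illustration, here is a NumPy sketch of what clipping by global norm does (not TensorFlow's actual implementation, but the same formula):</p>

```python
import numpy as np

# If the combined L2 norm of all gradients exceeds the threshold,
# every gradient is scaled down jointly by the same factor.
def clip_by_global_norm(grads, clip_norm):
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)  # scale == 1 when under the threshold
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 4.0]), np.array([12.0])]  # global norm = sqrt(9+16+144) = 13
clipped, norm = clip_by_global_norm(grads, 5.0)
print(norm)  # 13.0
```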
<p>Now we are ready to execute!</p>
<h2 id="performing-training">Performing training</h2>
<p>Start up a session and the QueueRunners associated with it. In our case, these are associated with the RNN input queue (NOT our own RandomShuffleQueue).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Start session
sess = tf.Session()
coord = tf.train.Coordinator()
init_op = tf.global_variables_initializer()
sess.run(init_op)
tf_threads = tf.train.start_queue_runners(sess=sess, coord=coord)
</code></pre></div></div>
<p>In case we crashed or want to import a (partly) trained model for other reasons such as fine-tuning, we check for previous model checkpoints to load the model parameters:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># CHECKPOINTING
# TODO: save model directly after every epoch, so that we can safely refill the queues after loading a model (uniform sampling of dataset is still ensured)
# Load pretrained model to continue training, if it exists
latestCheckpoint = tf.train.latest_checkpoint(train_settings["checkpoint_dir"])
if latestCheckpoint is not None:
    restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
    restorer.restore(sess, latestCheckpoint)
    print('Pre-trained model restored')
saver = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
</code></pre></div></div>
<p>Now we start a thread running our custom <code class="language-plaintext highlighter-rouge">load_and_enqueue</code> method from earlier, which reads from <code class="language-plaintext highlighter-rouge">data</code> and enqueues the sequences into the <code class="language-plaintext highlighter-rouge">RandomShuffleQueue</code>. Preprocessing and data loading are better done on the CPU.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Start a thread to enqueue data asynchronously to decouple data I/O from training
with tf.device("/cpu:0"):
    t = threading.Thread(target=load_and_enqueue, args=[trainIndices])
    t.start()
</code></pre></div></div>
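<p>The producer/consumer pattern itself can be sketched in plain Python, with the standard library's <code class="language-plaintext highlighter-rouge">queue.Queue</code> standing in for TensorFlow's RandomShuffleQueue (illustrative only):</p>

```python
import queue
import threading

# A background thread fills the queue while the main thread consumes from it.
q_demo = queue.Queue(maxsize=8)

def load_and_enqueue_demo(indices):
    for index in indices:
        q_demo.put(index)  # blocks whenever the queue is full
    q_demo.put(None)       # sentinel signalling the end of the data

t = threading.Thread(target=load_and_enqueue_demo, args=([0, 1, 2],))
t.start()

consumed = []
while True:
    item = q_demo.get()
    if item is None:
        break
    consumed.append(item)
t.join()
print(consumed)  # [0, 1, 2]
```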
<p>We can set up some logging functions so we can nicely visualise statistics with Tensorboard:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># LOGGING
# Add histograms for trainable variables.
histograms = [tf.summary.histogram(var.op.name, var) for var in tf.trainable_variables()]
summary_op = tf.summary.merge_all()
# Create summary writer
summary_writer = tf.summary.FileWriter(train_settings["log_dir"], sess.graph.as_graph_def(add_shapes=True))
</code></pre></div></div>
<p>Now the training loop runs the training and summary operations, and writes summaries and periodically also model checkpoints to save progress:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>current_time = time.time()
loops = 0
while loops < train_settings["max_iterations"]:
    loops += 1
    [res_loss, _, res_global_step, res_learning_rate, summary] = sess.run(trainOps + [summary_op])
    new_time = time.time()
    print("Chars per second: " + str(float(model_settings["batch_size"] * model_settings["num_unroll"]) / (new_time - current_time)))
    current_time = new_time
    print("Loss: " + str(res_loss) + ", Learning rate: " + str(res_learning_rate) + ", Step: " + str(res_global_step))
    # Write summaries for this step
    summary_writer.add_summary(summary, global_step=int(res_global_step))
    if res_global_step % train_settings["save_model_epoch_frequency"] == 0:
        print("Saving model...")
        saver.save(sess, train_settings["checkpoint_path"], global_step=int(res_global_step))
</code></pre></div></div>
<p>After the maximum desired number of iterations has been reached (or some other criterion), we stop:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Stop our custom input thread
print("Stopping custom input thread")
sess.run(q.close()) # Then close the input queue
t.join(timeout=1)
# Close session, clear computational graph
sess.close()
tf.reset_default_graph()
</code></pre></div></div>
<h1 id="testing">Testing</h1>
<p>After training, we want to evaluate the performance on the test set. The code for this looks similar, so I will only show the differences here. We load the trained model:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># CHECKPOINTING
# Load pretrained model to test
latestCheckpoint = tf.train.latest_checkpoint(train_settings["checkpoint_dir"])
restorer = tf.train.Saver(tf.global_variables(), write_version=tf.train.SaverDef.V2)
restorer.restore(sess, latestCheckpoint)
</code></pre></div></div>
<p>In our custom input queue thread, we close the queue after enqueueing all test samples so the process is stopped after seeing each example exactly once:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Enqueueing method running in a different thread, loading sequence examples and feeding them into the queue
def load_and_enqueue(indices):
    for index in indices:
        current_seq = data[index][1]
        sess.run(enqueue_op, feed_dict={keyInput: str(index),
                                        lengthInput: len(current_seq) - 1,
                                        seqInput: current_seq[:-1],
                                        seqOutput: current_seq[1:]})
    print("Finished enqueueing all " + str(len(indices)) + " samples!")
    sess.run(q.close())
</code></pre></div></div>
<p>The test loop now uses the total cross-entropy returned from our model class along with the number of valid output positions to compute a bit-per-character metric:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inferenceOps = [loss, mean_cross_loss, sum_cross_loss, output_num]
current_time = time.time()
logprob_sum = 0.0
character_sum = 0
iteration = 0
while True:
    try:
        [l, mcl, scl, nb, summary] = sess.run(inferenceOps + [summary_op])
    except tf.errors.OutOfRangeError:
        print("Finished testing!")
        break
    new_time = time.time()
    print("Chars per second: " + str(
        float(model_settings["batch_size"] * model_settings["num_unroll"]) / (new_time - current_time)))
    current_time = new_time
    logprob_sum += scl # Accumulate summed cross-entropy over all valid characters (in nats, as TF uses the natural logarithm)
    character_sum += nb # Add up how many characters were in the batch
    print(l, mcl, scl)
    summary_writer.add_summary(summary, global_step=int(iteration))
    iteration += 1
print("Bit-per-character: " + str(logprob_sum / character_sum / np.log(2))) # Divide by ln(2) to convert nats to bits
</code></pre></div></div>
<p>Note here that we catch an <code class="language-plaintext highlighter-rouge">OutOfRangeError</code> that is thrown as soon as our input queue is empty after the input thread finishes and closes it.</p>
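<p>One subtlety when reporting bits: TensorFlow's softmax cross-entropy uses the natural logarithm, so the accumulated sum is measured in nats and should be divided by ln 2 to obtain bits per character. A standalone sketch with made-up numbers:</p>

```python
import math

# nats -> bits conversion for the bit-per-character metric.
total_cross_entropy_nats = 140.0   # made-up stand-in for the accumulated sum_cross_loss
num_valid_chars = 100              # made-up stand-in for the accumulated output_num

nats_per_char = total_cross_entropy_nats / num_valid_chars
bits_per_char = nats_per_char / math.log(2)
print(round(bits_per_char, 2))  # 2.02
```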
<h1 id="sampling">Sampling</h1>
<p>Unfortunately, sampling as a use case is very different from the train and test setting:</p>
<ul>
<li>We want to consider a single sequence, not a whole batch</li>
<li>The output of the RNN at the current time step is used as input for the next, which makes static unrolling inappropriate</li>
<li>We cannot in parallel evaluate multiple timesteps, but need to keep the hidden state after each input to feed back into the model (rendering the previously used state saver concepts cumbersome)</li>
</ul>
<p>Therefore, I did it the hard way and maintained the RNN states myself. This requires setting up placeholders for the states manually to be able to feed values in during sampling, and defining the initial zero states to use for the prediction of the first character:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Load vocab
vocab = Vocabulary.load(data_settings["input_vocab"])
# INPUT PIPELINE
input = tf.placeholder(tf.int32, shape=[None], name="input") # Integers representing characters
# Create state placeholders - 2 for each lstm cell.
state_placeholders = list()
initial_states = list()
for i in range(0, model_settings["num_layers"]):
    state_placeholders.append(tuple([tf.placeholder(tf.float32, shape=[1, model_settings["lstm_size"]], name="lstm_state_c_" + str(i)), # Batch size x State size
                                     tf.placeholder(tf.float32, shape=[1, model_settings["lstm_size"]], name="lstm_state_h_" + str(i))])) # Batch size x State size
    initial_states.append(tuple([np.zeros(shape=[1, model_settings["lstm_size"]], dtype=np.float32),
                                 np.zeros(shape=[1, model_settings["lstm_size"]], dtype=np.float32)]))
state_placeholders = tuple(state_placeholders)
initial_states = tuple(initial_states)
</code></pre></div></div>
<p>The states are represented as nested tuples in TensorFlow. The model itself also has to be adapted accordingly. We use a batch size and unroll length of 1, so we predict exactly one character at a time, feeding in the input along with the state placeholders:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># MODEL
inference_settings = model_settings
inference_settings["batch_size"] = 1 # Only sample from one example simultaneously
inference_settings["num_unroll"] = 1 # Only sample one character at a time
model = LyricsPredictor(inference_settings, vocab.size + 1) # Include EOS token
probs, state = model.sample(input, state_placeholders)
</code></pre></div></div>
<p>This time, we use the <code class="language-plaintext highlighter-rouge">sample</code> method from the <code class="language-plaintext highlighter-rouge">LyricsPredictor</code> class to build the required computational graph:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def sample(self, input, current_state):
    # RNN cells and states
    cells = list()
    for i in range(0, self.num_layers):
        cell = tf.contrib.rnn.LSTMBlockCell(num_units=self.lstm_size) # Block LSTM version gives better performance #TODO Add linear projection option
        cell = tf.nn.rnn_cell.DropoutWrapper(cell, 1.0, 1.0) # No dropout during sampling
        cells.append(cell)
    cell = tf.nn.rnn_cell.MultiRNNCell(cells)
    self.initial_states = cell.zero_state(batch_size=1, dtype=tf.float32)

    # Convert input into one-hot representation (from single integers indicating the character)
    embedding = tf.constant(np.eye(self.vocab_size), dtype=tf.float32)
    input = tf.nn.embedding_lookup(embedding, input) # 1 x Vocab-size
    inputs_by_time = [input] # List of 1 x Vocab-size tensors (just one element, because we use sequence length 1)
    self.outputs, state = tf.nn.rnn(cell, inputs_by_time, initial_state=current_state, scope='SSRNN')
</code></pre></div></div>
<p>Crucially, we set the scope when setting up the RNN to the same that was used when building the model during training and testing, so that when we load the checkpoint, the RNN variables are set up correctly. Afterwards, the softmax is applied as shown earlier. The function returns the probabilities and the resulting LSTM state after processing the input character, which we store in the following.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>inference = [probs, state]
current_seq = "never" # This can be any alphanumeric text
current_seq_ind = vocab.char2index(current_seq)
# Warm up RNN with initial sequence
s = initial_states
for ind in current_seq_ind:
    # Create the feed dict for the states
    feed = dict()
    for i in range(0, model_settings["num_layers"]):
        for c in range(0, len(s[i])):
            feed[state_placeholders[i][c]] = s[i][c]
    feed[input] = [ind] # Add the new input symbol to the feed
    [p, s] = sess.run(inference, feed_dict=feed)
</code></pre></div></div>
<p>In the above code, we set an initial sequence (“never”) and prepare the LSTM to continue the lyrics (e.g. “never gonna give you up”) by feeding in one character after another and carrying over the states. These are nested tuples, organised according to layers, each with a cell and a hidden state (this is due to the LSTM structure). The hidden state now hopefully captures meaningful information about the input text <code class="language-plaintext highlighter-rouge">current_seq</code>, so we can take the current prediction probabilities and sample from them to generate the next character, feed it into the network, and repeat that process until we receive the special end token signalling that the LSTM is finished with the “creative process”:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sample until we receive an end-of-lyrics token
iteration = 0
while iteration < 100000: # Just a safety measure in case the model does not stop
    # p contains the probability of the upcoming char, as estimated by the model, and s the last RNN state
    ind_sample = np.random.choice(range(0, vocab.size + 1), p=np.squeeze(p))
    if ind_sample == vocab.size: # EOS token
        print("Model decided to stop generating!")
        break
    current_seq_ind.append(ind_sample)
    # Create the feed dict for the states
    feed = dict()
    for i in range(0, model_settings["num_layers"]):
        for c in range(0, len(s[i])):
            feed[state_placeholders[i][c]] = s[i][c]
    feed[input] = [ind_sample] # Add the new input symbol to the feed
    [p, s] = sess.run(inference, feed_dict=feed)
    iteration += 1
</code></pre></div></div>
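<p>A common extension of this sampling step (not part of the original code) is a temperature parameter that controls how adventurous the model's choices are:</p>

```python
import numpy as np

# Hypothetical extension of the sampling step above: a temperature T sharpens
# (T < 1) or flattens (T > 1) the predicted distribution before sampling.
def apply_temperature(p, temperature):
    logits = np.log(np.asarray(p, dtype=np.float64) + 1e-12) / temperature
    q = np.exp(logits - logits.max())  # subtract max for numerical stability
    return q / q.sum()

p = [0.1, 0.7, 0.2]                # model output for three characters
q = apply_temperature(p, 0.5)      # sharpened towards the most likely char
ind_sample = np.random.choice(len(q), p=q)
```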
<p>Finally, we convert the generated list of integer indices to their character representation, and print out the result:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>c_sample = vocab.index2char(current_seq_ind)
print("".join(c_sample))
sess.close()
</code></pre></div></div>
<h2 id="example-output">Example output</h2>
<p>And now the fun starts! Feel free to extend the model to your liking. Want to see what my own model generates? Here is the output of a two-layer LSTM with 512 hidden units per layer and 0.2 output dropout, trained for only two hours on Metrolyrics text, when told to start with “never”:</p>
<blockquote>
<p>never yet in you but that letters know a stobalal in you on the brink so to the victory no matter what i might understand the sun where i am with all this phon people theyll get my knife off a girl that it thats forsaken just smiling still welcome to me<br />
its a gangsta good times is like a fesire then im holding fantastine is on though we bring it out to who burn today well make all the lights in his face im here so bright<br />
sos we do what we do we dont know where we harm<br />
and every time we get you now dont need rewith nothing yeah you dont want a sip or just look at dont make it on the 5 dirty doubt then most name yeah dont know about it i know and you dont wanna play with her no no no<br />
come on ah yeah yeah women make you bury tight around rising in stop up in the top looking out of the middle of the sea<br />
youre not drunk and im not in real hard to hit you all around cover up tune whats not there how to make you cry so long your cut money rolling around and the storm ignite it youre peacent burn so fast blue no fading<br />
two number on the home was we praying of happy for your respect a death a lip another day and style niggas keep that an internuted at leven but you was the way you fall and ive been ready at all ive never seen the girls i tried to drive i took a fool from the river instead i just draw your life when my head stays the fellas we dreams to all ill stay<br />
and nobody should be there in here but when i hide all my echo and make you hold me up im going to get more to fall in your sea oh right through a night in your news one two treatnessboy shes passed out in the sky all the real friends are downs light to out here he was rightly word out im not driving in my eyes suddenly reminding me im being that dragong class i wish i was since yours your peace of<br />
pour it like like a record beautiful man i do you on the kid punk its when im attack i know when im smiling taste so i find a little far</p>
</blockquote>
<p>As you can see, the model learned to structure its musings in paragraphs akin to real lyrics, and overall makes some good attempts at coming up with new sentences. Apparently this song is more of a Gangsta rap, as suggested by the words “knife”, “gangsta”, “niggas”, etc. Unfortunately, sentences only sometimes make proper sense. The model also comes up with semi-random new words, like “dragong” and “peacent”, because it has to learn spelling and vocabulary from scratch, as opposed to word-level language models. It also did not learn meaningful long-term dependencies such as verse/chorus structures.</p>

<h1 id="snowfall-a-very-special-video-game-controller">Snowfall - A very special video game controller (2015-11-20)</h1>

<p>Here is a short <a href="https://www.youtube.com/watch?v=TRiBA4o_pBs">Youtube video</a> explaining what this post is about. How did I do it? I will try to go through the main steps in the following.</p>
<h1 id="understanding-the-wiring-and-connecting-the-arduino">Understanding the wiring and connecting the Arduino</h1>
<p>To use the mat as a game controller, I first had to understand its internal wiring before setting up a connection to the Arduino microcontroller. So I freed the PCB and its wiring from the red plastic hull that originally contained it. After almost going insane trying to figure out the bewildering wiring inside the mat by manually checking the connections from the outside with a multimeter, I finally decided to just cut the mat open, which made everything much easier and saved my sanity ;)</p>
<h2 id="part-1---leds">Part 1 - LEDs</h2>
<p>Intuitively (and naively), for ten LEDs you would expect two connections per LED and therefore 20 connections in total. But especially when dealing with cheap toy electronics, it is never that easy: every cable drives up cost, so manufacturers favour more complicated wiring whenever it is cheaper to produce. In this mat, there are 7 cables, each with a different colour (at least they did me that favour!). The setup can be seen in this picture:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/led-wiring.png" alt="LED wiring.png" /></p>
<p>So if you want to light up LED number 2, for example, you have to apply a current to the green cable and pull the brown cable to ground to allow the current to flow. In general, for every LED there is a specific combination of two cables that you have to set. But as if that were not complicated enough, this setup introduces dependencies between the LEDs. What if you wanted to light up LEDs 1 and 6 simultaneously? You would have to supply the yellow cable with current for LED 1 and the green one for LED 6, and the brown and red cables would both have to act as ground. But wait a minute - in this configuration, LEDs 2 and 5 would light up too, as both are now exposed to the same current as LEDs 1 and 6. I worked around this dependency with a “cheap” trick: when multiple LEDs should be active, I light them up one after the other, but switch between them at such a high frequency (about 5 ms per LED) that, with our surprisingly limited visual system, it looks like the LEDs are actually glowing at the same time!</p>
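<p>The switching trick can be sketched in Python (purely illustrative - the real implementation runs on the Arduino):</p>

```python
import itertools

# Time-multiplexing: only one LED is ever on, but cycling through the active
# set every ~5 ms makes them appear simultaneously lit.
def multiplex_schedule(active_leds, slots):
    cycle = itertools.cycle(active_leds)
    return [next(cycle) for _ in range(slots)]

# Six 5 ms slots with LEDs 1 and 6 requested: each is on half the time.
print(multiplex_schedule([1, 6], 6))  # [1, 6, 1, 6, 1, 6]
```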
<h2 id="part-2---buttons">Part 2 - Buttons</h2>
<p>Unfortunately, the wiring of the buttons turned out to be even more confusing than that of the LEDs. Basically, inside the mat there are two layers of foil separated by a layer of foam. The foil has conductive areas at the position of each button, as well as black lines that connect the buttons to the PCB in the toy. When pressure is applied to a button, the two layers are pressed together as the foam gets squashed, and a current can now flow between the two layers of foil. So you can model a button as a resistor whose resistance changes depending on the applied pressure. I hope the picture below makes everything a little clearer: you can see one of the layers with its electrical connections in black, and the foam underneath obstructing the second layer below.</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/img_20151120_180127.jpg" alt="IMG_20151120_180127.jpg" /></p>
<p>As with the LEDs, the wiring was not straightforward, as you can see in the following schematics. For both layers, the connections accessible from the outside are drawn at the top.</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/button-wiring-schematics.png" alt="Button wiring schematics.png" /></p>
<p>On the PCB, I discovered that the buttons 1-4, 5-8 and 9-10 were each short-circuited on the second layer, as shown in red in the schematic. This made matters much more complicated, because it again introduced dependencies between the buttons that make separate measurements much harder. It leads to a matrix setup, with four connections on the first and three connections on the second layer. Here is a diagram showing this matrix:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/button-matrix.png" alt="Button matrix.png" /></p>
<p>It assumes that current is applied on the digital pins 10 to 12 and the resulting voltage measured on the analog pins A0 to A3. For each coloured cable, the buttons connected to it are displayed, and the matrix entries show the specific button addressed by the combination of two of these cables. So the idea is to set one of the three digital pins to HIGH while letting the other connections float, measure the four analog voltages, then set the next digital pin to HIGH and repeat, and finally do the same a third time with the last remaining digital pin. The picture below shows a schematic of the Arduino setup for the analog pin A0 and demonstrates how setting exactly one pin between 10 and 12 to HIGH allows measuring the voltage of exactly one of the three buttons. It works analogously for the other analog pins A1 to A3.</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/button-arduino-setup-example.png" alt="Button Arduino setup example" /></p>
<p>Finally, I programmed my Arduino to periodically read these values, detect button presses using these measured voltages, and send the button states over the serial port, where they are received by the Unity game engine.</p>
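<p>The scan loop can be sketched as a small Python simulation (pin names from the schematic; the voltage readings are faked, and the helper names are made up):</p>

```python
# Drive exactly one of the three digital pins HIGH at a time and read all four
# analog pins, covering every (drive, sense) intersection of the button matrix.
DRIVE_PINS = [10, 11, 12]
SENSE_PINS = ["A0", "A1", "A2", "A3"]

def scan_matrix(read_voltage):
    # read_voltage(drive, sense) stands in for the real pin I/O on the Arduino.
    readings = {}
    for drive in DRIVE_PINS:          # set exactly one drive pin HIGH...
        for sense in SENSE_PINS:      # ...and sample every analog input
            readings[(drive, sense)] = read_voltage(drive, sense)
    return readings

# Fake electronics: pretend only the button at intersection (11, "A2") is pressed.
pressed = {(11, "A2")}
readings = scan_matrix(lambda d, s: 5.0 if (d, s) in pressed else 0.0)
buttons_down = [k for k, v in readings.items() if v > 2.5]
print(buttons_down)  # [(11, 'A2')]
```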
<h1 id="game-development-in-unity">Game development in Unity</h1>
<p>The video game itself was developed in <a href="https://unity3d.com/">Unity</a>, which is a game development platform suited to build your own 3D as well as 2D games without going through the hassle of creating your own game engine and taking care of every little detail yourself. I can really recommend it to people interested in game development, because it represents a nice, accessible starting point for further endeavours.</p>
<h2 id="designing-the-3d-scene">Designing the 3D scene</h2>
<p>One of Unity’s strengths is definitely the modelling of 3D game environments. Although it can be difficult to find the right 3D models and textures for your game (you often find many expensive offers on the internet and only a few free ones, or none at all), designing the terrain and moving, scaling and rotating objects is very intuitive and quickly done. Here is the scene of Snowfall as viewed inside the Unity editor:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/unity-3d-scene.png" alt="Unity 3D scene.png" /></p>
<p>I quickly added some hills and a snow texture to the terrain, placed a lot of trees and obstacles (barely visible in the middle of the screen) and added 10 red skis with their positions relative to the main camera, so they move along with the player. Because the scene is rendered from the perspective of this main camera, for the player the skis do not seem to move at all. On the left side, you can see the scene hierarchy, containing all objects of the scene organised in a tree-like structure. As stated before, the skis are a child of the main camera object so that they do not change their position on the screen. The TreeGroup contains all trees, a directional light acts as the sunlight and the terrain features a set of obstacles in the form of many thin cuboids with a rocky texture. The UI consists of text labels from one to ten for every ski, a play button, a health slider, a title text for the start screen and a loss and win text shown when the player loses or wins. It is also simple to set it up so that a specific function in your own script is called every time the button is pressed or the slider is moved. Finally, the “GameLogic” object contains most of the C# scripts responsible for handling most of the game’s logic. I say most because some scripts are attached to specific game objects like the main camera.</p>
<h2 id="game-logic-with-c-scripts">Game logic with C# scripts</h2>
<p>So now I designed a pretty 3D game world, but nothing is really happening in it and no interaction with the player is taking place - you could hardly call that a game, right? At first, I implemented moving the camera at a certain speed through the terrain, maintaining the same height and orientation, and gradually speeding up as we go through the level. I achieved this with a script attached to the Main Camera object containing the class “CameraController”. It defines a minimum and maximum speed in units per second as well as the acceleration <code class="language-plaintext highlighter-rouge">additionalMinSpeedPerSec</code>. A boolean variable determines if the camera should be moving at the moment - an external script can modify this variable to stop and start camera movement. Every frame, a call to the <code class="language-plaintext highlighter-rouge">Update()</code> function causes the current camera speed to be increased by <code class="language-plaintext highlighter-rouge">Time.deltaTime * additionalMinSpeedPerSec</code>, but limited to the maximum camera speed. Then, the camera is moved according to this <code class="language-plaintext highlighter-rouge">currentCameraSpeed</code> using the following code: <code class="language-plaintext highlighter-rouge">Vector3 camPos = transform.localPosition; camPos.z += Time.deltaTime * currentCameraSpeed; transform.localPosition = camPos;</code> Next, I implemented the desired reaction to detected collisions between skis and the rocky terrain. In the 3D scene, I already added box colliders to the obstacles and the skis and marked them as a trigger. Also, I added rigid bodies to the skis. 
Afterwards I attached a script to every ski whose class <code class="language-plaintext highlighter-rouge">SkiCollisionDetector</code> overrides the <code class="language-plaintext highlighter-rouge">OnTriggerStay</code> function which is called in regular intervals as long as a collision with the corresponding game object is detected. My implementation for every ski simply counts the total duration of the collisions in a variable <code class="language-plaintext highlighter-rouge">collisionTime</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>void OnTriggerStay(Collider coll)
{
    collisionTime += Time.deltaTime;
}
</code></pre></div></div>
<p>To maintain all ten skis in a comfortable way, I also implemented a <code class="language-plaintext highlighter-rouge">SkiController</code> that keeps a list of all skis, can add up all the <code class="language-plaintext highlighter-rouge">collisionTime</code> values of all skis to retrieve the total amount of damage, reset these variables back to zero in case a new game is started, and also set the visibility and the ability to trigger collision handling depending on the button states. The latter shall be explained in greater detail. The function <code class="language-plaintext highlighter-rouge">setSkiStates(bool[] buttonStates)</code> receives information about which buttons are currently pressed, goes through the <code class="language-plaintext highlighter-rouge">skiList</code> and makes only the skis belonging to those buttons visible and able to trigger the collision handling functions:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public void setSkiStates(bool[] buttonStates)
{
    for (int i = 0; i < skiList.Count; ++i) // Go through the ski list...
    {
        GameObject currentSki = ((GameObject)skiList[i]);
        currentSki.GetComponent<Renderer>().enabled = buttonStates[i]; // Ski is visible if its button is pressed
        currentSki.GetComponent<Collider>().enabled = buttonStates[i]; // Ski can trigger calls of the collision function (OnTriggerStay) if its button is pressed
    }
}
</code></pre></div></div>
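<p>The damage summation and reset functionality of the <code class="language-plaintext highlighter-rouge">SkiController</code> could be sketched as follows - only <code class="language-plaintext highlighter-rouge">getSkiCollisionTime</code> is referenced by the main loop later on, the name of the reset method is my own:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>public float getSkiCollisionTime()
{
    float total = 0.0f;
    foreach (GameObject ski in skiList) // Sum up the collision times of all skis...
    {
        total += ski.GetComponent<SkiCollisionDetector>().collisionTime;
    }
    return total; // ... to obtain the total amount of damage
}

public void resetCollisionTimes() // method name illustrative: zero all damage before a new game starts
{
    foreach (GameObject ski in skiList)
    {
        ski.GetComponent<SkiCollisionDetector>().collisionTime = 0.0f;
    }
}
</code></pre></div></div>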
<p>A <code class="language-plaintext highlighter-rouge">HealthBar</code> class uses <code class="language-plaintext highlighter-rouge">GUI.DrawTexture</code> with a black and a green texture to draw the health bar, using the black texture as a background and the green texture for the bar itself. Finally, a <code class="language-plaintext highlighter-rouge">MainLoop</code> class ties everything together and also reacts to the play button being pressed and the health slider being varied on the start screen. Having a specific function called when a specific UI element is interacted with is very easy to set up: just select the UI element in the Unity editor and then, in the inspector on the right-hand side, select the function you want to call. As an example, this screenshot shows my play button and how I set it up to call the function <code class="language-plaintext highlighter-rouge">startGame</code> of the class <code class="language-plaintext highlighter-rouge">MainLoop</code> contained in the game object GameLogic whenever it is pressed:</p>
<p><img src="https://dans.world/assets/img/2015-11-20-Snowfall:-a-very-special-video-game-controller/event-handling.png" alt="Event handling.png" /></p>
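<p>The <code class="language-plaintext highlighter-rouge">HealthBar</code> drawing described above can be sketched like this - only <code class="language-plaintext highlighter-rouge">setPercentage</code> (called from the main loop below) and the use of <code class="language-plaintext highlighter-rouge">GUI.DrawTexture</code> with two textures come from my description, the screen coordinates and field names are illustrative:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using UnityEngine;

public class HealthBar : MonoBehaviour
{
    public Texture2D backgroundTexture; // black texture, assigned in the editor
    public Texture2D barTexture;        // green texture, assigned in the editor

    private float percentage = 1.0f;

    public void setPercentage(float p) // fraction of health remaining, between 0 and 1
    {
        percentage = Mathf.Clamp01(p);
    }

    void OnGUI()
    {
        // Draw the black background over the full width, then the green bar on top,
        // scaled horizontally by the remaining health (positions/sizes illustrative)
        GUI.DrawTexture(new Rect(10, 10, 200, 20), backgroundTexture);
        GUI.DrawTexture(new Rect(10, 10, 200 * percentage, 20), barTexture);
    }
}
</code></pre></div></div>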
<p>I will conclude this post with the code of the main loop during active gameplay. Hopefully, the extensive comments make it easy to follow and help you further develop your own coding skills in Unity.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>// Update is called once per frame
void Update()
{
    parser.readButtonStates(buttonStates); // Read the current button states
    skiController.setSkiStates(buttonStates); // Set ski visibility and collision triggering according to these button states
    if (gameActive) // If the game is active at the moment
    {
        float currentDamage = skiController.getSkiCollisionTime(); // Determine current collision time ("damage")
        float percentage = currentDamage / maxSeconds; // Determine ratio between damage and total health
        healthBar.setPercentage(1 - percentage); // Set health bar to reflect how much time is still available
        // Check for game ending conditions
        bool lossCondition = (percentage >= 1.0f); // Game lost once the damage ratio reaches 1
        bool winCondition = (camController.transform.position.z >= 5300.0f); // Game won if the camera has reached the end of the level (camera moves along the z axis only, starts at 0, at z=5300 the end of the level is reached)
        if (lossCondition || winCondition) // If a loss or a win occurred
        {
            gameActive = false; // Game is no longer active
            camController.setMoving(false); // Stop camera movement
            StartCoroutine(resetGame()); // Reset the game in 10 seconds
            if (winCondition) // Show winning text when won...
            {
                winText.enabled = true;
                print("WON");
            }
            else if (lossCondition) // ... otherwise show losing text
            {
                lossText.enabled = true;
                print("LOST");
            }
        }
    }
}
</code></pre></div></div>
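<p>One piece referenced above but not shown is the <code class="language-plaintext highlighter-rouge">resetGame</code> coroutine. A minimal sketch, assuming the ten-second delay mentioned in the comment and the fields used in the main loop (the names of the reset helpers on the controllers are illustrative), could look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>using System.Collections; // needed for IEnumerator coroutines
using UnityEngine;

IEnumerator resetGame()
{
    yield return new WaitForSeconds(10.0f); // Wait 10 seconds before resetting
    winText.enabled = false; // Hide the end-of-game texts again
    lossText.enabled = false;
    // Zero all ski collision times and move the camera back to the start of the level
    // (helper method names on SkiController and CameraController are illustrative)
    skiController.resetCollisionTimes();
    camController.transform.localPosition = Vector3.zero;
    // Back to the start screen, waiting for the play button to be pressed again
}
</code></pre></div></div>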