Audio-Conditioned U-Net for Position Estimation in Full Sheet Images
Florian Henkel
Institute of Computational Perception, Johannes Kepler University
Linz, Austria
[email protected]
Rainer Kelz
Austrian Research Institute for Artificial Intelligence
Vienna, [email protected]
Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University
Linz, [email protected]
Abstract—The goal of score following is to track a musical performance, usually in the form of audio, in a corresponding score representation. Established methods mainly rely on computer-readable scores in the form of MIDI or MusicXML and achieve robust and reliable tracking results. Recently, multimodal deep learning methods have been used to follow along musical performances in raw sheet images. Among the current limits of these systems is that they require a non-trivial number of preprocessing steps that unravel the raw sheet image into a single long system of staves. The current work is an attempt at removing this particular limitation. We propose an architecture capable of estimating matching score positions directly within entire unprocessed sheet images. We argue that this is a necessary first step towards a fully integrated score following system that does not rely on any preprocessing steps such as optical music recognition.
Index Terms—conditioning, multimodal deep learning, score following
I. INTRODUCTION
A large body of work on score following requires the use of a computer-readable representation of the score, e.g., MusicXML or MIDI [1]–[6]. Such representations can either be created manually, which is tedious and time consuming, or extracted using optical music recognition (OMR). Recently, score following has also been demonstrated with raw sheet images, using multimodal deep (reinforcement) learning [7], [8]. However, these latter approaches still rely on several preprocessing steps; in particular, the score must be 'unrolled' into a single long staff: consecutive staves need to be detected on the score sheet, cut out, and presented to the score following system in sequence.

In this work, we propose an architecture that can directly predict which parts of a complete, unprocessed score page match a given audio excerpt. Our system is based on a U-Net [9] for musical symbol detection [10] and uses Feature-wise Linear Modulation (FiLM) layers [11]. A similar architecture was recently used for the task of music source separation [12]. As a proof of concept, we test our model on simple monophonic music from the Nottingham dataset [13]. While in its current state the system is not yet a full score follower, because it ignores temporal context constraints, the capability to directly process full score sheet images is arguably a necessary first step.

The remainder of this paper is structured as follows. In Section II, we introduce our proposed architecture and its underlying concepts. In Section III, we conduct several experiments to compare different architecture choices and to show that our system is indeed able to predict score positions in sheet images. Finally, in Section IV, we summarize our work and provide an outlook on how to adapt the architecture for more complex scores and how to incorporate temporal context for a full score following setting.

II. AUDIO-CONDITIONED U-NET FOR POSITION ESTIMATION IN SHEET IMAGES
U-Nets are fully convolutional neural networks that were introduced for the task of medical image segmentation [9]. They can be used to segment an image into different parts, e.g., by classifying each pixel into either foreground or background. In [10], Hajič et al. use U-Nets for detecting musical symbols in sheet images. We adapt this architecture to predict positions in sheet images that correspond to a given audio excerpt, i.e., we segment the sheet image into regions that match the audio snippet and regions that do not. To this end, we include Feature-wise Linear Modulation (FiLM) layers [11] as a conditioning mechanism in the U-Net architecture, as shown in Figure 1. Each FiLM layer applies a simple affine transformation to the feature maps it is connected to, conditioned on an external input. In our case, the conditioning input is an encoded representation z of the audio excerpt, which is created by another neural network, depicted in Table I. The FiLM layer is defined as

    FiLM(x) = γ(z) · x + β(z),    (1)

with γ(·) and β(·) being learned functions. Each scalar output component γ_k(·), β_k(·) scales and shifts the feature map x_k, where x_k is the k-th output of a convolutional layer with K filters, after batch normalization is applied. To get a better impression of what our network does, we provide videos at https://github.com/CPJKU/audio_conditioned_unet.

Previous work combining sheet images and audio mainly relies on an embedding space for these two modalities, obtained by combining the representations of two separate networks [7], [8], [14]. FiLM layers, on the other hand, directly intervene in the learned representation of the sheet image by modulating its features, which helps the network to focus only on those parts that are required for a correct prediction. Additionally, their inherent structure allows us to use a fully convolutional architecture for the sheet image, i.e., the network can process scores of arbitrary resolution, which will be useful for exploring generalization capabilities in future work.
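To make the conditioning mechanism concrete, the following is a minimal sketch of a FiLM layer as in Equation (1), written in PyTorch purely for illustration; the framework choice, module name, and parameter names are our assumptions and not taken from the released code. γ(·) and β(·) are realized as linear maps of the conditioning vector z and applied per feature map.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation (Eq. 1): scale and shift each
    feature map x_k with gamma_k(z) and beta_k(z), both predicted from
    the conditioning vector z (the encoded audio excerpt)."""

    def __init__(self, num_features: int, cond_dim: int = 128):
        super().__init__()
        # gamma(.) and beta(.) as learned affine maps of the conditioning input
        self.gamma = nn.Linear(cond_dim, num_features)
        self.beta = nn.Linear(cond_dim, num_features)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, K, H, W) feature maps after batch normalization
        # z: (batch, cond_dim) encoded spectrogram excerpt
        gamma = self.gamma(z).unsqueeze(-1).unsqueeze(-1)  # (batch, K, 1, 1)
        beta = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta
```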
Fig. 1. Audio-Conditioned U-Net architecture. Each block (A-I) consists of two convolutional layers followed by batch normalization and the ELU activation function. The FiLM layer is placed before the last activation function. The audio spectrogram encoding used for conditioning is given by the output of the network shown in Table I. Each pair of symmetric blocks has the same number of filters, starting with 8 in block A and increasing with depth to 128 in block E.

The U-Net has a basic encoder-decoder structure and consists of four down-sampling blocks (A-D), four up-sampling blocks (F-I), and a bottleneck block (E). Down-sampling is done by 2×2 max pooling, and for up-sampling we use 2×2 transposed convolutions with a stride of 2. After each down-sampling step the number of filters is doubled (starting at 8), whereas each up-sampling step halves the number of filters, i.e., block A has 8 filters, E has 128, and I again has 8. Each block consists of two convolutional layers followed by batch normalization [15] and the ELU activation function [16]. The FiLM layer is placed after the last batch normalization layer and before the activation function. Adhering closely to [10], we add residual connections in the form of an element-wise sum between symmetric building blocks, as depicted in Figure 1. The output layer consists of a 1×1 convolution followed by the sigmoid activation function, yielding a per-pixel pseudo-probability map that highlights regions in the score corresponding to a given audio excerpt.
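The placement of the FiLM layer inside a block can be sketched as follows. This is only our reading of the description above: the 3×3 kernel size, the padding, and the per-layer placement of batch normalization and ELU are assumptions, and `FiLM` refers to the sketch in the previous listing.

```python
import torch.nn as nn

class ConditionedUNetBlock(nn.Module):
    """One U-Net block (A-I): two convolutional layers with batch
    normalization; the FiLM layer sits after the last batch norm and
    before the final ELU activation, as described in the text."""

    def __init__(self, in_ch: int, out_ch: int, cond_dim: int = 128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.film = FiLM(out_ch, cond_dim)  # FiLM module from the previous sketch
        self.act = nn.ELU()

    def forward(self, x, z):
        x = self.act(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        x = self.film(x, z)  # condition on the encoded audio excerpt z
        return self.act(x)
```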
III. EXPERIMENTS

In the following, we describe the data and ground truth required for our experiments, introduce the experimental setup, and discuss the performance of different architecture choices.
TABLE I: The Spectrogram Encoder. We use batch normalization (BN) [15] and the ELU activation function [16]. The network structure resembles the one used in [8].

    Audio (Spectrogram) 78 × 40
    Conv 16×3×3 - stride 1 - BN - ELU
    Conv 16×3×3 - stride 1 - BN - ELU
    Conv 32×3×3 - stride 2 - BN - ELU
    Conv 32×3×3 - stride 1 - BN - ELU
    Conv 64×3×3 - stride 2 - BN - ELU
    Conv 96×3×3 - stride 2 - BN - ELU
    Conv 96×1×1 - stride 1 - BN - ELU
    Linear 128 - BN - ELU
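Read as code, Table I corresponds to a small convolutional stack followed by a linear projection to the 128-dimensional conditioning vector z. The sketch below is our PyTorch transcription; the padding choices and the flattening step before the linear layer are assumptions, since the table only specifies filter counts, kernel sizes and strides.

```python
import torch
import torch.nn as nn

def conv_bn_elu(in_ch, out_ch, kernel, stride):
    # One "Conv - BN - ELU" row of Table I; the padding is our assumption.
    return [nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2),
            nn.BatchNorm2d(out_ch),
            nn.ELU()]

def build_spectrogram_encoder(cond_dim: int = 128) -> nn.Sequential:
    layers = (conv_bn_elu(1, 16, 3, 1) + conv_bn_elu(16, 16, 3, 1)
              + conv_bn_elu(16, 32, 3, 2) + conv_bn_elu(32, 32, 3, 1)
              + conv_bn_elu(32, 64, 3, 2) + conv_bn_elu(64, 96, 3, 2)
              + conv_bn_elu(96, 96, 1, 1)
              + [nn.Flatten(),
                 nn.LazyLinear(cond_dim),  # infers the flattened size on first use
                 nn.BatchNorm1d(cond_dim),
                 nn.ELU()])
    return nn.Sequential(*layers)

# Usage: z = build_spectrogram_encoder()(torch.randn(4, 1, 78, 40))  # -> (4, 128)
```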
A. Data
We use a subset of the Nottingham dataset, comprising 274 monophonic melodies of folk music, partitioned into 172 training, 60 validation and 42 test pieces [13]. The sheet music is created by automatically typesetting the MIDI scores with Lilypond (http://lilypond.org/), and the audio is synthesized using a piano sound font. The rendered score images are downscaled by a factor of 3 before being used as input to the convolutional neural network.

To obtain ground-truth annotations between the audio and the sheet music, we perform the same automatic notehead alignment as described in [14]. These notehead alignments yield (x, y) coordinate pairs, which are further adjusted such that the y coordinate corresponds to the middle of the staff the respective note belongs to. As we present the network with isolated fixed-size audio (spectrogram) excerpts, i.e., we disregard the temporal context, the annotations are no longer unambiguous, since an excerpt could match several positions within the sheet image. Thus, we identify all positions in the sheet image that match the audio and create a binary mask with the same shape as the downscaled score page. At positions of a match, this mask contains rectangular regions of height 20 and a width that depends on the distance between the first and last note in the audio excerpt (see Figure 2). The task of the U-Net is to predict a corresponding mask, given a score page and an audio excerpt.

Audio is sampled at 22.05 kHz and processed at a frame rate of 20 frames per second. The DFT is computed for each frame with a window size of 2048 samples and then transformed with a logarithmic filterbank covering frequencies between 60 Hz and 6 kHz, yielding 78 log-frequency bins. The conditioning network is presented with 40 consecutive frames, or roughly two seconds of audio.
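As an illustration of the ground-truth construction described above, the following numpy sketch builds such a binary target mask. The list of matching regions (x coordinates of the first and last matched notehead plus the staff-center y coordinate) is assumed to come from the notehead alignment; the exact coordinate handling is our simplification, not the authors' implementation.

```python
import numpy as np

def build_target_mask(score_shape, matches, height=20):
    """Create the binary ground-truth mask for one audio excerpt.

    score_shape: (H, W) of the downscaled score page.
    matches: list of (x_start, x_end, y_center) tuples, one per position
             in the page that matches the excerpt (hypothetical format).
    """
    mask = np.zeros(score_shape, dtype=np.float32)
    for x_start, x_end, y_center in matches:
        # rectangular region of fixed height, centered on the staff middle
        y0 = max(0, int(y_center) - height // 2)
        y1 = min(score_shape[0], int(y_center) + height // 2)
        mask[y0:y1, int(x_start):int(x_end) + 1] = 1.0
    return mask
```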
B. Experimental Setup

As shown in Figure 1, we introduce FiLM layers in all U-Net blocks. In our experiments, we test and compare several settings with the conditioning mechanism activated in different parts of the architecture. As we cannot test all possible combinations due to computational limitations, we choose a subset that allows us to assess the influence of the FiLM layers on the final model performance.

As an optimization target we minimize the Dice coefficient loss [17], defined as

    D(p, g) = 1 − (2 · Σ_i p_i g_i) / (Σ_i p_i + Σ_i g_i),    (2)

where p and g are vectors containing the predicted probabilities and the ground truth, respectively. The advantage of the Dice coefficient loss compared to, e.g., binary cross-entropy is that it inherently deals with class imbalance. This is important, as only a small portion of the sheet image corresponds to a given audio excerpt. To optimize this target we use Adam [18] with default parameters and L2 weight decay. The weights of the network are initialized orthogonally [19] and the biases are set to zero. If the loss on the validation set does not decrease for 2 epochs, we halve the learning rate; this is repeated five times. The trained model parameters with the lowest validation loss are used for the final evaluation on the test set.

Additionally, we apply data augmentation to the sheet images by shifting them along the x and y axes. Note that our goal is currently not to generalize to scanned or handwritten scores, which would require more sophisticated augmentation techniques (e.g., as shown in [20]), but to show the network that note patterns can occur anywhere in the score image.
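A minimal PyTorch version of the loss in Equation (2) could look as follows; the per-sample flattening, batch averaging, and the small epsilon for numerical stability are our additions and not prescribed by the paper.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice coefficient loss (Eq. 2) over per-pixel probabilities.

    pred: sigmoid outputs of the U-Net, shape (batch, 1, H, W).
    target: binary ground-truth masks of the same shape.
    """
    p = pred.reshape(pred.shape[0], -1)    # flatten to vectors p
    g = target.reshape(target.shape[0], -1)  # flatten to vectors g
    intersection = (p * g).sum(dim=1)
    dice = (2.0 * intersection + eps) / (p.sum(dim=1) + g.sum(dim=1) + eps)
    return (1.0 - dice).mean()
```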
TABLE II: Comparison of different FiLM layer combinations on the Nottingham data (42 test pieces). In each scenario, we evaluate the trained model parameters with the lowest validation loss; (D-F) means that the FiLM layers in blocks D, E and F are active.

    Architecture          Precision   Recall   F1 score
    FiLM Layers (E)       0.8647      0.9216   0.8922
    FiLM Layers (D-F)     0.8785      0.9292   0.9031
    FiLM Layers (C-G)     0.8929      0.9336   0.9128
    FiLM Layers (B-H)     0.8980      0.9261   0.9118
    FiLM Layers (A-I)     0.8903      0.9169   0.9034
    FiLM Layers (A-E)     0.8933      0.9267   0.9097
    FiLM Layers (E-I)     0.8674      0.9104   0.8884

C. Results
Table II reports performance measures for the different conditioning scenarios on the test set. They are defined as

    Precision = tp / (tp + fp),    Recall = tp / (tp + fn),    F1 = 2 · tp / (2 · tp + fp + fn),

with tp the number of pixels correctly predicted as 1, fp the pixels falsely predicted as 1, and fn the pixels falsely predicted as 0. The predictions are binarized with a fixed threshold. Note that the optimization target given in Equation (2) closely relates to the F1 score, as the originally defined Sørensen-Dice coefficient corresponds to the F1 score in the binary case.

Overall, we observe that the performance is high in all tested scenarios, with F1 scores above 0.88. In all cases, the recall is higher than the precision, which could be improved by choosing a higher binarization threshold. The worst performance results when the FiLM layers are only activated in the bottleneck and decoding blocks (E-I). We see a similar performance in terms of precision when we apply the conditioning mechanism only in the bottleneck block (E). This suggests that the FiLM layers might be more effective in the encoding part of the network, which is further substantiated by the performance of the conditioning mechanism in the encoding blocks (A-E). Nevertheless, a marginally higher F1 score is achieved when FiLM layers are applied both during encoding and decoding, in blocks (C-G). This indicates that the amount and location at which conditioning information is supplied to the feature extraction network need to be chosen carefully. A change in the overall resolution and depth of the feature-extracting U-Net will likely necessitate a re-tuning of these hyperparameters.
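For reference, the pixel-wise metrics above can be computed from a binarized probability map as in the short numpy sketch below; the default threshold value is only an illustrative placeholder for the fixed threshold used in the evaluation.

```python
import numpy as np

def pixel_metrics(prob_map: np.ndarray, target: np.ndarray, threshold: float = 0.5):
    """Pixel-wise precision, recall and F1 after binarizing the prediction."""
    pred = prob_map >= threshold
    gt = target.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # pixels correctly predicted as 1
    fp = np.logical_and(pred, ~gt).sum()   # pixels falsely predicted as 1
    fn = np.logical_and(~pred, gt).sum()   # pixels falsely predicted as 0
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
    return precision, recall, f1
```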
Fig. 2. Comparison of (a) a given ground truth that matches the spectrogram excerpt and (b) the corresponding predictions of the network.

IV. CONCLUSION AND FUTURE WORK

We have introduced an architecture capable of inferring corresponding positions for audio excerpts in full score sheet images with a high F1 score. Although this is not a full score following system yet, we believe this work is a necessary first step towards a fully integrated system that can track musical performances in images of sheet music, without the need for cumbersome preprocessing steps such as OMR. The temporal context needed for this last step could come from the hidden state of a recurrent neural network, used as temporal conditioning information in a similar way as described in [11].

Currently, the system has only been tested on monophonic piano music. Using the Multimodal Sheet Music Dataset (MSMD) [14], this can be further extended to more complex scores with polyphonic music. As the scores in this dataset are often several pages long, one could adapt the proposed architecture to either take multiple pages as channel inputs and predict the final position probabilities for all pages at the same time, or predict the positions for one page at a time and move to the next page once the end of the current one is reached. Future work will also explore the generalization performance of the system, both in terms of sheet image variations caused by lower quality scans, and in terms of differences in musical performances, where tempo, volume and timbre vary.

REPRODUCIBILITY
In the interest of reproducible research, we make both code and data available online, along with detailed instructions on how to recreate the reported results: https://github.com/CPJKU/audio_conditioned_unet

ACKNOWLEDGMENT
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement number 670035, project "Con Espressione").
REFERENCES

[1] S. Dixon and G. Widmer, "MATCH: A music alignment tool chest," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), London, UK, 2005, pp. 492–497.
[2] A. Arzt, Flexible and Robust Music Tracking. PhD thesis, Johannes Kepler University Linz, 2016.
[3] N. Orio, S. Lemouton, and D. Schwarz, "Score Following: State of the Art and New Developments," in Proceedings of the International Conference on New Interfaces for Musical Expression (NIME), Montreal, Canada, 2003, pp. 36–41.
[4] A. Cont, "Realtime Audio to Score Alignment for Polyphonic Music Instruments using Sparse Non-Negative Constraints and Hierarchical HMMs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, Toulouse, France, 2006, pp. 245–248.
[5] D. Schwarz, N. Orio, and N. Schnell, "Robust Polyphonic MIDI Score Following with Hidden Markov Models," in International Computer Music Conference (ICMC), Miami, Florida, USA, 2004.
[6] E. Nakamura, P. Cuvillier, A. Cont, N. Ono, and S. Sagayama, "Autoregressive Hidden Semi-Markov Model of Symbolic Music for Score Following," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, 2015, pp. 392–398.
[7] M. Dorfer, A. Arzt, and G. Widmer, "Towards Score Following in Sheet Music Images," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), New York, USA, 2016, pp. 789–795.
[8] M. Dorfer, F. Henkel, and G. Widmer, "Learning to Listen, Read, and Follow: Score Following as a Reinforcement Learning Game," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 784–791.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, 2015, pp. 234–241.
[10] J. Hajič jr., M. Dorfer, G. Widmer, and P. Pecina, "Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 225–232.
[11] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, "FiLM: Visual Reasoning with a General Conditioning Layer," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] G. Meseguer-Brocal and G. Peeters, "Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019.
[13] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling Temporal Dependencies in High-dimensional Sequences: Application to Polyphonic Music Generation and Transcription," in Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, 2012.
[14] M. Dorfer, J. Hajič jr., A. Arzt, H. Frostel, and G. Widmer, "Learning Audio–Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification," Transactions of the International Society for Music Information Retrieval, vol. 1, no. 1, 2018.
[15] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv preprint arXiv:1502.03167, 2015.
[16] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," in Proceedings of the International Conference on Learning Representations (ICLR) (arXiv:1511.07289), 2016.
[17] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation," in Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 565–571.
[18] D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations (ICLR) (arXiv:1412.6980), 2015.
[19] A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," arXiv preprint arXiv:1312.6120, 2013.
[20] E. van der Wel and K. Ullrich, "Optical Music Recognition with Convolutional Sequence-to-Sequence Models," arXiv preprint arXiv:1707.04877, 2017.