Audio-Conditioned U-Net for Position Estimation in Full Sheet Images
Florian Henkel
Institute of Computational Perception, Johannes Kepler University
Linz, Austria
[email protected]
Rainer Kelz
Austrian Research Institute for Artificial Intelligence
Vienna, [email protected]
Gerhard Widmer
Institute of Computational Perception, Johannes Kepler University
Linz, [email protected]
Abstract—The goal of score following is to track a musical performance, usually in the form of audio, in a corresponding score representation. Established methods mainly rely on computer-readable scores in the form of MIDI or MusicXML and achieve robust and reliable tracking results. Recently, multimodal deep learning methods have been used to follow along musical performances in raw sheet images. Among the current limits of these systems is that they require a non-trivial number of preprocessing steps that unravel the raw sheet image into a single long system of staves. The current work is an attempt at removing this particular limitation. We propose an architecture capable of estimating matching score positions directly within entire unprocessed sheet images. We argue that this is a necessary first step towards a fully integrated score following system that does not rely on any preprocessing steps such as optical music recognition.
Index Terms—conditioning, multimodal deep learning, score following
I. INTRODUCTION
A large body of work on score following requires the use of a computer-readable representation of the score, e.g., MusicXML or MIDI [1]–[6]. Such representations can either be created manually, which is tedious and time consuming, or extracted using optical music recognition (OMR). Recently, score following has also been demonstrated with raw sheet images, using multimodal deep (reinforcement) learning [7], [8]. However, these latter approaches still rely on several preprocessing steps; in particular, the score must be 'unrolled' into a single long staff: consecutive staves need to be detected on the score sheet, cut out, and presented to the score following system in sequence.

In this work, we propose an architecture that can directly predict which parts of a complete, unprocessed score page match a given audio excerpt. Our system is based on a U-Net [9] for musical symbol detection [10] and uses Feature-wise Linear Modulation (FiLM) layers [11]. A similar architecture was recently used for the task of music source separation [12]. As a proof of concept, we test our model on simple monophonic music from the Nottingham dataset [13]. While in its current state the system is not yet a full score follower, because it ignores temporal context constraints, the capability to directly process full score sheet images is arguably a necessary first step.

The remainder of this paper is structured as follows. In Section II, we introduce our proposed architecture and its underlying concepts. In Section III, we conduct several experiments to compare different architecture choices and to show that our system is indeed able to predict score positions in sheet images. Finally, in Section IV, we summarize our work and provide an outlook on how to adapt the architecture for more complex scores and how to incorporate temporal context for a full score following setting.

II. AUDIO-CONDITIONED U-NET FOR POSITION ESTIMATION IN SHEET IMAGES
U-Nets are fully convolutional neural networks that were introduced for the task of medical image segmentation [9]. They can be used to segment an image into different parts, e.g., by classifying each pixel into either foreground or background. In [10], Hajič et al. use U-Nets for detecting musical symbols in sheet images. We adapt this architecture to predict positions in sheet images that correspond to a given audio excerpt, i.e., we segment the sheet image into regions that match the audio snippet and regions that do not. To this end, we include Feature-wise Linear Modulation (FiLM) layers [11] as a conditioning mechanism in the U-Net architecture, as shown in Figure 1. Each FiLM layer applies a simple affine transformation to the feature maps it is connected to, conditioned on an external input. In our case, the conditioning input is an encoded representation z of the audio excerpt, which is created by another neural network, depicted in Table I. The FiLM layer is defined as

    FiLM(x) = γ(z) · x + β(z),    (1)

with γ(·) and β(·) being learned functions. Each scalar output component γ_k(·), β_k(·) scales and shifts the feature map x_k, where x_k is the k-th output of a convolutional layer with K filters, after batch normalization is applied. To get a better impression of what our network does, we provide videos at https://github.com/CPJKU/audio_conditioned_unet.

Previous work combining sheet images and audio mainly relies on an embedding space for these two modalities, obtained by combining the representations of two separate networks [7], [8], [14]. FiLM layers, on the other hand, directly intervene in the learned representation of the sheet image by modulating its features, which helps the network to focus only on those parts that are required for a correct prediction. Additionally, their inherent structure allows us to use a fully convolutional architecture for the sheet image, i.e., the network can process scores of arbitrary resolution, which will be useful for exploring generalization capabilities in future work.
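To make the conditioning mechanism concrete, the following is a minimal sketch of a FiLM layer as in Equation (1), written in PyTorch purely for illustration; the framework choice, module name, and parameter names are our assumptions and not taken from the released code. γ(·) and β(·) are realized as linear maps of the conditioning vector z and applied per feature map.

```python
import torch
import torch.nn as nn

class FiLM(nn.Module):
    """Feature-wise Linear Modulation (Eq. 1): scale and shift each
    feature map x_k with gamma_k(z) and beta_k(z), both predicted from
    the conditioning vector z (the encoded audio excerpt)."""

    def __init__(self, num_features: int, cond_dim: int = 128):
        super().__init__()
        # gamma(.) and beta(.) as learned affine maps of the conditioning input
        self.gamma = nn.Linear(cond_dim, num_features)
        self.beta = nn.Linear(cond_dim, num_features)

    def forward(self, x: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        # x: (batch, K, H, W) feature maps after batch normalization
        # z: (batch, cond_dim) encoded spectrogram excerpt
        gamma = self.gamma(z).unsqueeze(-1).unsqueeze(-1)  # (batch, K, 1, 1)
        beta = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        return gamma * x + beta
```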
Fig. 1. Audio-Conditioned U-Net architecture. Each block (A-I) consists of two convolutional layers followed by batch normalization and the ELU activation function. The FiLM layer is placed before the last activation function. The audio spectrogram encoding used for conditioning is given by the output of the network shown in Table I. Each pair of symmetric blocks has the same number of filters, starting with 8 in block A and increasing with depth to 128 in block E.

The U-Net has a basic encoder-decoder structure and consists of four down-sampling blocks (A-D), four up-sampling blocks (F-I), and a bottleneck block (E). Down-sampling is done by 2×2 max pooling, and for up-sampling we use 2×2 transposed convolutions with a stride of 2. After each down-sampling step the number of filters is doubled (starting at 8), whereas each up-sampling step halves the number of filters, i.e., block A has 8 filters, E has 128, and I again has 8. Each block consists of two convolutional layers followed by batch normalization [15] and the ELU activation function [16]. The FiLM layer is placed after the last batch normalization layer and before the activation function. Adhering closely to [10], we add residual connections in the form of an element-wise sum between symmetric building blocks, as depicted in Figure 1. The output layer consists of a 1×1 convolution followed by the sigmoid activation function, yielding a per-pixel pseudo-probability map that highlights regions in the score corresponding to a given audio excerpt.
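The placement of the FiLM layer inside a block can be sketched as follows. This is only our reading of the description above: the 3×3 kernel size, the padding, and the per-layer placement of batch normalization and ELU are assumptions, and `FiLM` refers to the sketch in the previous listing.

```python
import torch.nn as nn

class ConditionedUNetBlock(nn.Module):
    """One U-Net block (A-I): two convolutional layers with batch
    normalization; the FiLM layer sits after the last batch norm and
    before the final ELU activation, as described in the text."""

    def __init__(self, in_ch: int, out_ch: int, cond_dim: int = 128):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.bn1 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.film = FiLM(out_ch, cond_dim)  # FiLM module from the previous sketch
        self.act = nn.ELU()

    def forward(self, x, z):
        x = self.act(self.bn1(self.conv1(x)))
        x = self.bn2(self.conv2(x))
        x = self.film(x, z)  # condition on the encoded audio excerpt z
        return self.act(x)
```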
III. EXPERIMENTS

In the following, we describe the data and ground truth required for our experiments, introduce the experimental setup, and discuss the performance of different architecture choices.
TABLE I: The Spectrogram Encoder. We use batch normalization (BN) [15] and the ELU activation function [16]. The network structure resembles the one used in [8].

    Audio (Spectrogram) 78 × 40
    Conv 16×3×3 - stride 1 - BN - ELU
    Conv 16×3×3 - stride 1 - BN - ELU
    Conv 32×3×3 - stride 2 - BN - ELU
    Conv 32×3×3 - stride 1 - BN - ELU
    Conv 64×3×3 - stride 2 - BN - ELU
    Conv 96×3×3 - stride 2 - BN - ELU
    Conv 96×1×1 - stride 1 - BN - ELU
    Linear 128 - BN - ELU
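Read as code, Table I corresponds to a small convolutional stack followed by a linear projection to the 128-dimensional conditioning vector z. The sketch below is our PyTorch transcription; the padding choices and the flattening step before the linear layer are assumptions, since the table only specifies filter counts, kernel sizes and strides.

```python
import torch
import torch.nn as nn

def conv_bn_elu(in_ch, out_ch, kernel, stride):
    # One "Conv - BN - ELU" row of Table I; the padding is our assumption.
    return [nn.Conv2d(in_ch, out_ch, kernel, stride=stride, padding=kernel // 2),
            nn.BatchNorm2d(out_ch),
            nn.ELU()]

def build_spectrogram_encoder(cond_dim: int = 128) -> nn.Sequential:
    layers = (conv_bn_elu(1, 16, 3, 1) + conv_bn_elu(16, 16, 3, 1)
              + conv_bn_elu(16, 32, 3, 2) + conv_bn_elu(32, 32, 3, 1)
              + conv_bn_elu(32, 64, 3, 2) + conv_bn_elu(64, 96, 3, 2)
              + conv_bn_elu(96, 96, 1, 1)
              + [nn.Flatten(),
                 nn.LazyLinear(cond_dim),  # infers the flattened size on first use
                 nn.BatchNorm1d(cond_dim),
                 nn.ELU()])
    return nn.Sequential(*layers)

# Usage: z = build_spectrogram_encoder()(torch.randn(4, 1, 78, 40))  # -> (4, 128)
```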
A. Data
We use a subset of the Nottingham dataset, comprising 274 monophonic melodies of folk music, partitioned into 172 training, 60 validation and 42 test pieces [13]. The sheet music is created by automatically typesetting the MIDI scores with Lilypond (http://lilypond.org/), and the audio is synthesized using a piano sound font. The rendered score images are downscaled by a factor of 3 before being used as input to the convolutional neural network.

To obtain ground-truth annotations between the audio and the sheet music, we perform the same automatic notehead alignment as described in [14]. These notehead alignments yield (x, y) coordinate pairs, which are further adjusted such that the y coordinate corresponds to the middle of the staff the respective note belongs to. As we present the network with isolated fixed-size audio (spectrogram) excerpts, i.e., we disregard the temporal context, the annotations are no longer unambiguous, since an excerpt could match several positions within the sheet image. Thus, we identify all positions in the sheet image that match the audio and create a binary mask with the same shape as the downscaled score page. At positions of a match, this mask contains rectangular regions of height 20 and a width that depends on the distance between the first and last note in the audio excerpt (see Figure 2). The task of the U-Net is to predict a corresponding mask, given a score page and an audio excerpt.

Audio is sampled at 22.05 kHz and processed at a frame rate of 20 frames per second. The DFT is computed for each frame with a window size of 2048 samples and then transformed with a logarithmic filterbank covering frequencies between 60 Hz and 6 kHz, yielding 78 log-frequency bins. The conditioning network is presented with 40 consecutive frames, or roughly two seconds of audio.
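As an illustration of the ground-truth construction described above, the following numpy sketch builds such a binary target mask. The list of matching regions (x coordinates of the first and last matched notehead plus the staff-center y coordinate) is assumed to come from the notehead alignment; the exact coordinate handling is our simplification, not the authors' implementation.

```python
import numpy as np

def build_target_mask(score_shape, matches, height=20):
    """Create the binary ground-truth mask for one audio excerpt.

    score_shape: (H, W) of the downscaled score page.
    matches: list of (x_start, x_end, y_center) tuples, one per position
             in the page that matches the excerpt (hypothetical format).
    """
    mask = np.zeros(score_shape, dtype=np.float32)
    for x_start, x_end, y_center in matches:
        # rectangular region of fixed height, centered on the staff middle
        y0 = max(0, int(y_center) - height // 2)
        y1 = min(score_shape[0], int(y_center) + height // 2)
        mask[y0:y1, int(x_start):int(x_end) + 1] = 1.0
    return mask
```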
B. Experimental Setup

As shown in Figure 1, we introduce FiLM layers in all U-Net blocks. In our experiments, we test and compare several settings with the conditioning mechanism activated in different parts of the architecture. As we cannot test all possible combinations due to computational limitations, we choose a subset that allows us to assess the influence of the FiLM layers on the final model performance.

As an optimization target we minimize the Dice coefficient loss [17], defined as

    D(p, g) = 1 − (2 · Σ_i p_i g_i) / (Σ_i p_i + Σ_i g_i),    (2)

where p and g are vectors containing the predicted probabilities and the ground truth, respectively. The advantage of the Dice coefficient loss compared to, e.g., binary cross-entropy is that it inherently deals with class imbalance. This is important, as only a small portion of the sheet image corresponds to a given audio excerpt. To optimize this target we use Adam [18] with default parameters and L2 weight decay. The weights of the network are initialized orthogonally [19] and the biases are set to zero. If the loss on the validation set does not decrease for 2 epochs, we halve the learning rate; this is repeated five times. The trained model parameters with the lowest validation loss are used for the final evaluation on the test set.

Additionally, we apply data augmentation to the sheet images by shifting them along the x and y axes. Note that our goal is currently not to generalize to scanned or handwritten scores, which would require more sophisticated augmentation techniques (e.g., as shown in [20]), but to show the network that note patterns can occur anywhere in the score image.
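A minimal PyTorch version of the loss in Equation (2) could look as follows; the per-sample flattening, batch averaging, and the small epsilon for numerical stability are our additions and not prescribed by the paper.

```python
import torch

def dice_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Dice coefficient loss (Eq. 2) over per-pixel probabilities.

    pred: sigmoid outputs of the U-Net, shape (batch, 1, H, W).
    target: binary ground-truth masks of the same shape.
    """
    p = pred.reshape(pred.shape[0], -1)    # flatten to vectors p
    g = target.reshape(target.shape[0], -1)  # flatten to vectors g
    intersection = (p * g).sum(dim=1)
    dice = (2.0 * intersection + eps) / (p.sum(dim=1) + g.sum(dim=1) + eps)
    return (1.0 - dice).mean()
```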
TABLE II: Comparison of different FiLM layer combinations on the Nottingham data (42 test pieces). In each scenario, we evaluate the trained model parameters with the lowest validation loss; (D-F) means that the FiLM layers in blocks D, E and F are active.

    Architecture          Precision   Recall   F1 score
    FiLM Layers (E)       0.8647      0.9216   0.8922
    FiLM Layers (D-F)     0.8785      0.9292   0.9031
    FiLM Layers (C-G)     0.8929      0.9336   0.9128
    FiLM Layers (B-H)     0.8980      0.9261   0.9118
    FiLM Layers (A-I)     0.8903      0.9169   0.9034
    FiLM Layers (A-E)     0.8933      0.9267   0.9097
    FiLM Layers (E-I)     0.8674      0.9104   0.8884

C. Results
Table II reports performance measures for the different conditioning scenarios on the test set. They are defined as

    Precision = tp / (tp + fp),    Recall = tp / (tp + fn),    F1 = 2 · tp / (2 · tp + fp + fn),

with tp the number of pixels correctly predicted as 1, fp the pixels falsely predicted as 1, and fn the pixels falsely predicted as 0. The predictions are binarized with a fixed threshold. Note that the optimization target given in Equation (2) closely relates to the F1 score, as the originally defined Sørensen-Dice coefficient corresponds to the F1 score in the binary case.

Overall, we observe that the performance is high in all tested scenarios, with F1 scores above 0.88. In all cases, the recall is higher than the precision, which could be improved by choosing a higher binarization threshold. The worst performance results when the FiLM layers are only activated in the bottleneck and decoding blocks (E-I). We see a similar performance in terms of precision when we apply the conditioning mechanism only in the bottleneck block (E). This suggests that the FiLM layers might be more effective in the encoding part of the network, which is further substantiated by the performance of the conditioning mechanism in the encoding blocks (A-E). Nevertheless, a marginally higher F1 score is achieved when FiLM layers are applied both during encoding and decoding, in blocks (C-G). This indicates that the amount and location at which conditioning information is supplied to the feature extraction network need to be chosen carefully. A change in the overall resolution and depth of the feature-extracting U-Net will likely necessitate a re-tuning of these hyperparameters.
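For reference, the pixel-wise metrics above can be computed from a binarized probability map as in the short numpy sketch below; the default threshold value is only an illustrative placeholder for the fixed threshold used in the evaluation.

```python
import numpy as np

def pixel_metrics(prob_map: np.ndarray, target: np.ndarray, threshold: float = 0.5):
    """Pixel-wise precision, recall and F1 after binarizing the prediction."""
    pred = prob_map >= threshold
    gt = target.astype(bool)
    tp = np.logical_and(pred, gt).sum()    # pixels correctly predicted as 1
    fp = np.logical_and(pred, ~gt).sum()   # pixels falsely predicted as 1
    fn = np.logical_and(~pred, gt).sum()   # pixels falsely predicted as 0
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp > 0 else 0.0
    return precision, recall, f1
```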
Fig. 2. Comparison of (a) a given ground truth that matches the spectrogram excerpt and (b) the corresponding predictions of the network.

IV. CONCLUSION AND FUTURE WORK

We have introduced an architecture capable of inferring corresponding positions for audio excerpts in full score sheet images with a high F1 score. Although this is not a full score following system yet, we believe this work is a necessary first step towards a fully integrated system that can track musical performances in images of sheet music, without the need for cumbersome preprocessing steps such as OMR. The temporal context needed for this last step could come from the hidden state of a recurrent neural network, used as temporal conditioning information in a similar way as described in [11].

Currently, the system has only been tested on monophonic piano music. Using the Multimodal Sheet Music Dataset (MSMD) [14], this can be further extended to more complex scores with polyphonic music. As the scores in this dataset are often several pages long, one could adapt the proposed architecture to either take multiple pages as channel inputs and predict the final position probabilities for all pages at the same time, or predict the positions for one page at a time and move to the next page once the end of the current one is reached. Future work will also explore the generalization performance of the system, both in terms of sheet image variations caused by lower quality scans, and in terms of differences in musical performances, where tempo, volume and timbre vary.

REPRODUCIBILITY
In the interest of reproducible research, we make both code and data available online, along with detailed instructions on how to recreate the reported results: https://github.com/CPJKU/audio_conditioned_unet

ACKNOWLEDGMENT
This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant agreement number 670035, project "Con Espressione").
REFERENCES

[1] S. Dixon and G. Widmer, "MATCH: A music alignment tool chest," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), London, UK, 2005, pp. 492–497.
[2] A. Arzt, Flexible and Robust Music Tracking. PhD thesis, Johannes Kepler University Linz, 2016.
[3] N. Orio, S. Lemouton, and D. Schwarz, "Score Following: State of the Art and New Developments," in Proceedings of the International Conference on New Interfaces for Musical Expression (NIME), Montreal, Canada, 2003, pp. 36–41.
[4] A. Cont, "Realtime Audio to Score Alignment for Polyphonic Music Instruments using Sparse Non-Negative Constraints and Hierarchical HMMs," in Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), vol. 5, Toulouse, France, 2006, pp. 245–248.
[5] D. Schwarz, N. Orio, and N. Schnell, "Robust Polyphonic MIDI Score Following with Hidden Markov Models," in International Computer Music Conference (ICMC), Miami, Florida, USA, 2004.
[6] E. Nakamura, P. Cuvillier, A. Cont, N. Ono, and S. Sagayama, "Autoregressive Hidden Semi-Markov Model of Symbolic Music for Score Following," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Málaga, Spain, 2015, pp. 392–398.
[7] M. Dorfer, A. Arzt, and G. Widmer, "Towards Score Following in Sheet Music Images," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), New York, USA, 2016, pp. 789–795.
[8] M. Dorfer, F. Henkel, and G. Widmer, "Learning to Listen, Read, and Follow: Score Following as a Reinforcement Learning Game," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 784–791.
[9] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional Networks for Biomedical Image Segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), Springer, 2015, pp. 234–241.
[10] J. Hajič jr., M. Dorfer, G. Widmer, and P. Pecina, "Towards Full-Pipeline Handwritten OMR with Musical Symbol Detection by U-Nets," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Paris, France, 2018, pp. 225–232.
[11] E. Perez, F. Strub, H. De Vries, V. Dumoulin, and A. Courville, "FiLM: Visual Reasoning with a General Conditioning Layer," in Thirty-Second AAAI Conference on Artificial Intelligence, 2018.
[12] G. Meseguer-Brocal and G. Peeters, "Conditioned-U-Net: Introducing a Control Mechanism in the U-Net for Multiple Source Separations," in Proceedings of the International Society for Music Information Retrieval Conference (ISMIR), Delft, The Netherlands, 2019.
[13] N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, "Modeling Temporal Dependencies in High-dimensional Sequences: Application to Polyphonic Music Generation and Transcription," in Proceedings of the 29th International Conference on Machine Learning (ICML), Edinburgh, Scotland, 2012.
[14] M. Dorfer, J. Hajič jr., A. Arzt, H. Frostel, and G. Widmer, "Learning Audio–Sheet Music Correspondences for Cross-Modal Retrieval and Piece Identification," Transactions of the International Society for Music Information Retrieval, vol. 1, no. 1, 2018.
[15] S. Ioffe and C. Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift," arXiv preprint arXiv:1502.03167, 2015.
[16] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs)," in Proceedings of the International Conference on Learning Representations (ICLR) (arXiv:1511.07289), 2016.
[17] F. Milletari, N. Navab, and S.-A. Ahmadi, "V-Net: Fully Convolutional Neural Networks for Volumetric Medical Image Segmentation," in Fourth International Conference on 3D Vision (3DV), IEEE, 2016, pp. 565–571.
[18] D. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," International Conference on Learning Representations (ICLR) (arXiv:1412.6980), 2015.
[19] A. M. Saxe, J. L. McClelland, and S. Ganguli, "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks," arXiv preprint arXiv:1312.6120, 2013.
[20] E. van der Wel and K. Ullrich, "Optical Music Recognition with Convolutional Sequence-to-Sequence Models," arXiv preprint arXiv:1707.04877, 2017.