Improved Handling of Repeats and Jumps in Audio-Sheet Image Synchronization
Mengyi Shan
Harvey Mudd College [email protected]
TJ Tsai
Harvey Mudd College [email protected]
ABSTRACT
This paper studies the problem of automatically generating piano score following videos given an audio recording and raw sheet music images. Whereas previous works focus on synthetic sheet music where the data has been cleaned and preprocessed, we instead focus on developing a system that can cope with the messiness of raw, unprocessed sheet music PDFs from IMSLP. We investigate how well existing systems cope with real scanned sheet music, filler pages and unrelated pieces or movements, and discontinuities due to jumps and repeats. We find that a significant bottleneck in system performance is handling jumps and repeats correctly. In particular, we find that a previously proposed Jump DTW algorithm does not perform robustly when jump locations are unknown a priori. We propose a novel alignment algorithm called Hierarchical DTW that can handle jumps and repeats even when jump locations are not known. It first performs alignment at the feature level on each sheet music line, and then performs a second alignment at the segment level. By operating at the segment level, it is able to encode domain knowledge about how likely a particular jump is. Through carefully controlled experiments on unprocessed sheet music PDFs from IMSLP, we show that Hierarchical DTW significantly outperforms Jump DTW in handling various types of jumps.
1. INTRODUCTION
This paper tackles the problem of generating piano score following videos in a fully automated manner. Given an audio recording of a piano performance, our long-term goal is to build a system that can (a) identify the piece and automatically download the corresponding sheet music PDF from the International Music Score Library Project (IMSLP) website, and (b) generate a video showing the corresponding line of sheet music at each time instant in the audio recording. In this work, we focus exclusively on task (b), assuming that the correct sheet music PDF has been identified. This task requires us to perform audio–sheet music alignment on completely unprocessed PDF files from IMSLP. This paper describes the key insights we have gained in building such a system, along with a novel alignment algorithm developed in the process.

© Mengyi Shan, TJ Tsai. Licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0). Attribution: Mengyi Shan, TJ Tsai. "Improved Handling of Repeats and Jumps in Audio–Sheet Image Synchronization", 21st International Society for Music Information Retrieval Conference, Montréal, Canada, 2020.

Many previous works have studied cross-modal alignment between sheet music images and audio. Two general categories of approaches have been proposed. The first approach is to convert the sheet music images to a symbolic representation using optical music recognition (OMR), to collapse the pitch information across octaves to get a chroma representation, and then to compare this representation to chroma features extracted from the audio. This approach has been applied to synchronizing audio and sheet music [4] [20] [26], identifying audio recordings that correspond to a given sheet music representation [14], and finding the corresponding audio segment given a short segment of sheet music [12]. The second approach is to convert both sheet music and audio into a learned feature space that directly encodes semantic similarity. This has been done using convolutional neural networks combined with canonical correlation analysis [6] [11], pairwise ranking loss [8] [9], or some other suitable loss metric. This approach has been explored in the context of online sheet music score following [5], sheet music retrieval given an audio query [7] [8] [9], and offline alignment of sheet music and audio [8]. Recent works [10] [17] have also shown promising results formulating the score following problem as a reinforcement learning game. See [23] for an overview of work in this area.

The main difference between our current task and previous work is that we are working with totally unprocessed data "in the wild." All of the above works make one or more of the following assumptions, which are untrue in our task. First, many works focus primarily on training and testing with synthetic sheet music. In our case, we are primarily working with digital scans of physical sheet music. Second, most works assume that the data has been cleaned and preprocessed in various ways. For example, it is commonly assumed that unrelated pages of sheet music have been removed. Many works further assume that each page has been segmented into lines, so that the data is presented as a sequence of image strips, each containing a single line of sheet music. In our task, the raw PDF from IMSLP may contain unrelated movements, pieces, or filler pages like the title page or table of contents. We also obviously cannot assume that each page has already been segmented perfectly. Third, all of the above works assume that the music does not have any jumps or repeats. In our task, we have to be able to handle common discontinuities like repeats, D.C. al coda, D.S. al fine, etc.

Figure 1. Architecture of the proposed system. The sheet music and audio are both converted into bootleg scores, and then aligned with the Hierarchical DTW algorithm.

In attempting to build a system that can handle messy, real-world data, we discovered two things. First, we found that most of the above issues can be resolved to a reasonable degree by suitably combining existing tools in the MIR literature. However, we also discovered that a significant bottleneck in system performance was handling jumps and repeats.
In particular, we found that a previously proposed Jump DTW alignment algorithm [13] does not yield satisfactory performance when jump locations are unknown a priori.

There are several existing offline algorithms for aligning two feature sequences in the presence of jumps or repeats. Jump DTW [13] is a variant of dynamic time warping (DTW) in which additional long-range transitions are allowed in the cost matrix at potential jump locations. Mueller and Appelt [22] and Grachten et al. [15] also propose variants of DTW for partial alignment in the presence of structural differences. One limitation of these latter two works is that repeated sections are handled by simply skipping or deleting sections of features, so the actual alignment of the repeated section is not known. Joder et al. [19] frame the alignment problem as a conditional random field with additional transitions inserted at known jump locations. Jiang et al. [18] use a modified Markov model that allows arbitrary jumps in order to follow a musician during a practice session with many do-overs and jumps. There are also several real-time score following algorithms that handle various types of jumps [24] [2] [3] [25], though our focus in this work is on the offline context. In this study, we primarily focus on Jump DTW as the closest match to our target scenario: it is an offline algorithm, targeted at performances rather than practice sessions, and it provides a complete estimated alignment in the presence of jumps.

The main conceptual contribution of this paper is a novel alignment algorithm called Hierarchical DTW. Unlike Jump DTW, it does not require knowledge of jump locations a priori, but instead considers every line transition as a potential jump location. The algorithm is called Hierarchical DTW because it first performs an alignment at the feature level with each sheet music line, and then uses the results to perform a second alignment at the segment level. By performing an alignment at the segment level, we can encode domain knowledge about which types of jumps are likely. The algorithm is very simple and has only two hyperparameters, both of which have clear and intuitive interpretations. Through carefully controlled experiments on unprocessed PDFs from IMSLP, we show that Hierarchical DTW significantly outperforms Jump DTW on the piano score following video generation task.
2. SYSTEM DESCRIPTION
Figure 1 shows a high-level overview of our proposed system. We will explain its design in three parts: feature extraction, alignment, and video generation. Code, data, and example score following videos can be found at https://github.com/HMC-MIR/YoutubeScoreFollowing.
The first step is to convert both the sheet music and audio into bootleg score representations. The bootleg score [27] is a recently proposed feature representation for aligning piano sheet music images and MIDI. For sheet music, it encodes the position of filled noteheads relative to the staff lines. The bootleg score itself is a 62 × N binary matrix, where 62 indicates the total number of possible staff line positions in both the left and right hands, and where N indicates the total estimated number of simultaneous note events. For MIDI files, each note onset can be projected onto the bootleg score using the rules of Western musical notation. Ambiguities due to enharmonic representations or left–right hand attribution are handled by simply setting all possible positions to 1.

We compute the bootleg score representations in the following manner. We convert each PDF into a sequence of PNG images at 300 dpi, compute a bootleg score for each page, and then represent the entire PDF as a sequence of bootleg score fragments, where each fragment corresponds to a single line of music. Note that these fragments may include lines of music from other unrelated movements or pieces in the same PDF, or may even represent nonsense features coming from filler pages. Next, we transcribe the audio recording using the Onsets and Frames [16] automatic music transcription system, and then convert the estimated MIDI into its corresponding bootleg score. In this work, we treat the bootleg score computation and music transcription as fixed feature extractors.

The second main step is to align the bootleg score representations. We propose a novel alignment algorithm called Hierarchical DTW to accomplish this task. Figure 2 shows an overview of the algorithm, which consists of three stages.

Figure 2. Overview of Hierarchical DTW. Subsequence DTW is performed at the feature level on each sheet music line. The results are used to generate the segment-level data matrices, and then a second alignment is performed at the segment level. Only a few selected elements of T_seg are shown for illustration. (The horizontal axis corresponds to the reference, left to right, and the vertical axis to the query, bottom to top.)

The first stage is to perform feature-level alignment. We do this using a variant of DTW called subsequence DTW, which finds the optimal alignment between a short query sequence and any subsequence within a reference sequence. We perform subsequence DTW between each sheet music bootleg score fragment (each corresponding to one line of music) and the entire MIDI bootleg score, as shown on the left side of Figure 2. We use the normalized negative inner product distance metric proposed in [27], along with allowable transitions {(1,1), (1,2), (2,1)} with weights {1, 1, 2}. For a more detailed explanation of subsequence DTW, we refer the reader to [21].

The second stage is to construct the segment-level data matrices. There are two matrices that need to be constructed. The first matrix is formed by taking the last row of every cumulative cost matrix D_i from stage 1 and stacking these rows into a matrix of size L × M, where L indicates the total number of lines of music in the PDF and M indicates the total number of features in the MIDI bootleg score. This matrix contains subsequence path scores and is denoted C_seg in Figure 2. It plays a role analogous to the pairwise cost matrix when we do dynamic programming at the segment level. The second matrix, T_seg, is the same size as C_seg and indicates the allowable transitions at the segment level. Each element T_seg[i, j] is computed by identifying the j-th element in the last row of D_i and backtracking from this element to determine the beginning location of the matching path. T_seg[i, j] thus indicates the starting location of the best matching path in the i-th line of sheet music ending at position j in the MIDI bootleg score. In Figure 2, a few selected elements in T_seg are shown as colored boxes to illustrate this process. Note that, in order to construct T_seg, we need to backtrace from every possible location for every line of sheet music.
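To make stages 1 and 2 concrete, the following is a minimal NumPy sketch of the per-line subsequence DTW and the construction of C_seg and T_seg. This is our own illustration rather than the authors' released code: the function names are hypothetical, and the exact normalization of the negative inner product cost is an assumption.

```python
import numpy as np

def subsequence_dtw(query, ref):
    """Align one line's bootleg score (query, 62 x n) against the full MIDI
    bootleg score (ref, 62 x M). Returns the cumulative cost matrix D and a
    table of step choices for backtracing."""
    n, M = query.shape[1], ref.shape[1]
    # Normalized negative inner product cost (assumed form of the metric in [27]).
    qn = query / (np.linalg.norm(query, axis=0, keepdims=True) + 1e-8)
    rn = ref / (np.linalg.norm(ref, axis=0, keepdims=True) + 1e-8)
    cost = -(qn.T @ rn)                        # n x M pairwise cost matrix
    steps, weights = [(1, 1), (1, 2), (2, 1)], [1.0, 1.0, 2.0]
    D = np.full((n, M), np.inf)
    B = np.zeros((n, M), dtype=int)            # index of the step taken
    D[0, :] = cost[0, :]                       # subsequence DTW: free start in ref
    for i in range(1, n):
        for j in range(M):
            for s, ((di, dj), w) in enumerate(zip(steps, weights)):
                if i - di >= 0 and j - dj >= 0:
                    cand = D[i - di, j - dj] + w * cost[i, j]
                    if cand < D[i, j]:
                        D[i, j], B[i, j] = cand, s
    return D, B, steps

def path_start(D, B, steps, j):
    """Backtrace from the last query row at reference position j to find the
    starting reference position of the best matching subsequence path."""
    i = D.shape[0] - 1
    while i > 0 and j > 0:
        di, dj = steps[B[i, j]]
        i, j = i - di, j - dj
    return max(j, 0)

def build_segment_matrices(line_bootlegs, midi_bootleg):
    """Stack the last row of each D_i into C_seg (L x M) and record the path
    start positions in T_seg, as described in the text."""
    L, M = len(line_bootlegs), midi_bootleg.shape[1]
    C_seg, T_seg = np.zeros((L, M)), np.zeros((L, M), dtype=int)
    for i, line in enumerate(line_bootlegs):
        D, B, steps = subsequence_dtw(line, midi_bootleg)
        C_seg[i] = D[-1]                       # path score ending at each column j
        for j in range(M):
            T_seg[i, j] = path_start(D, B, steps, j)
    return C_seg, T_seg
```

In practice, the inner loops would be vectorized or JIT-compiled, since a backtrace is required from every one of the L × M positions to fill T_seg.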
The third stage is to perform segment-level alignment. Here, we use dynamic programming to find the optimal path through C_seg using the transitions in T_seg. We construct a segment-level cumulative cost matrix D_seg by filling out its entries column by column using dynamic programming. The first column of D_seg is initialized to all zeros, which ensures that the matching path can start on any line of music without penalty. Note that, unlike regular DTW, where the set of allowable transitions and weights is the same at every location, here the set of allowable transitions and weights is different for each element of D_seg. Since the transitions are all unique, we simply encode the previous location rather than the transition type (e.g. the previous location (i−1, j−1) instead of the transition (1,1)). When computing D_seg[i, j], there are two types of allowable transitions. The first type of transition is skipping elements: we transition from (i, j−1) and move directly to the right by one position without accumulating any score. Here, the candidate path score is D_seg[i, j] = D_seg[i, j−1]. The second type of transition is matching the i-th line of music ending at this position. In this case, we can transition from the end of any line of music immediately before the matching segment begins. If we let k ≜ T_seg[i, j] be the beginning of the matching subsequence path, then there are L possible transitions from (n, k−1), n = 0, ..., L−1, where n indicates the line of music. Here, the candidate path scores are D_seg[i, j] = D_seg[n, k−1] + w_{n,i} · C_seg[i, j] + p_{n,i}, where w_{n,i} is a multiplicative weight and p_{n,i} is an additive penalty for jumps. We can summarize the dynamic programming rules for the segment-level alignment as

$$
k \triangleq T_{\mathrm{seg}}[i,j], \qquad
D_{\mathrm{seg}}[i,j] = \min
\begin{cases}
D_{\mathrm{seg}}[i,\,j-1] \\
D_{\mathrm{seg}}[0,\,k-1] + w_{0,i} \cdot C_{\mathrm{seg}}[i,j] + p_{0,i} \\
D_{\mathrm{seg}}[1,\,k-1] + w_{1,i} \cdot C_{\mathrm{seg}}[i,j] + p_{1,i} \\
\qquad\qquad \vdots
\end{cases}
$$

where the minimum is calculated over all sheet music lines n = 0, ..., L−1. When filling out the entries of D_seg using dynamic programming, we also keep track of backtrace information in a separate matrix. Once D_seg has been constructed, we identify the element in the last column of D_seg with the lowest path score, and then backtrace from that position to determine the optimal alignment path. Figure 2 shows the optimal alignment path as a series of black dots, and the induced segmentation of the MIDI bootleg score as gray rectangles.

The real power of Hierarchical DTW comes from setting w_{n,i} and p_{n,i} in an intelligent way that encodes musical domain knowledge. These values can be adapted to allow no jumps, allow arbitrary jumps, or anything in between. For example, disallowing jumps means setting $p_{n,i} = \infty \cdot \mathbb{1}(i \neq n+1)$. The system described below is one possible instantiation based on three assumptions: (a) the performed lines of music will form a contiguous block (e.g. we will not go from page 13 to 34 to 19), (b) backward jumps (from repeats) are to lines of music that have been seen before, and (c) forward jumps (from D.S. al fine) are to one line past the furthest line of music that has been seen before (which we refer to as the "leading edge"). For the allowed jump transitions, multiplicative weights are set to 1 and additive penalties are set to −γ · p_avg, where γ is a hyperparameter and p_avg is the result of calculating the best subsequence path score for each line of sheet music and averaging these scores across all lines. So, if γ = 1, the jump penalty approximately offsets one line of matching music. Note that we can keep track of which lines have been seen before by defining two matrices R_lower and R_upper, which are the same size as C_seg and keep track of the range of lines that have been seen in the optimal path ending at any position (i, j). R_lower and R_upper can be updated along with D_seg and the backtrace matrix during the dynamic programming stage. For regular forward transitions, we allow moving to the next line, staying on the current line (slowing down), or skipping one line (speeding up). These three transitions have multiplicative weights 1, α, and α, respectively, and additive penalties of 0. We found that allowing additional time warping at the segment level with a multiplicative penalty α between 0 and 1 allows the algorithm to recover from large mistakes more easily.

Figure 3. Generating audio with repeats. The original audio recording is segmented by lines of sheet music. We sample k boundary points without replacement, and then splice and concatenate audio segments to generate the data with repeats.

Hierarchical DTW is simple yet flexible. The version described above has only two hyperparameters, which correspond to a multiplicative penalty for speeding up or slowing down (α) and an additive penalty for jumps (γ). Yet the framework of Hierarchical DTW makes it possible to selectively allow very specific types of jumps that obey common musical conventions.
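The following is a minimal sketch of this segment-level dynamic programming, assuming the C_seg and T_seg matrices from the earlier sketch. For simplicity it takes precomputed L × L tables w and p of the weights w_{n,i} and penalties p_{n,i}; the adaptive bookkeeping with R_lower and R_upper that enforces the three assumptions above is omitted.

```python
import numpy as np

def segment_level_dp(C_seg, T_seg, w, p):
    """C_seg, T_seg: L x M segment-level matrices; w, p: L x L tables holding
    the multiplicative weight w[n, i] and additive penalty p[n, i]."""
    L, M = C_seg.shape
    D = np.full((L, M), np.inf)
    back = np.full((L, M, 2), -1, dtype=int)   # previous (line, column) cell
    D[:, 0] = 0.0                              # free start on any line of music
    for j in range(1, M):
        for i in range(L):
            # Transition type 1: skip this column without accumulating score.
            if D[i, j - 1] < D[i, j]:
                D[i, j] = D[i, j - 1]
                back[i, j] = (i, j - 1)
            # Transition type 2: match line i ending at column j. The matching
            # path starts at k = T_seg[i, j], so we may arrive from the end of
            # any line n at column k - 1. (Matches starting at column 0 are
            # omitted here for brevity.)
            k = T_seg[i, j]
            if k >= 1:
                for n in range(L):
                    cand = D[n, k - 1] + w[n, i] * C_seg[i, j] + p[n, i]
                    if cand < D[i, j]:
                        D[i, j] = cand
                        back[i, j] = (n, k - 1)
    # Backtrace from the lowest path score in the last column.
    cell = (int(np.argmin(D[:, -1])), M - 1)
    path = []
    while cell[1] >= 0:
        path.append(cell)
        cell = tuple(back[cell[0], cell[1]])
    return D, path[::-1]
```

Under this interface, disallowing jumps corresponds to setting p[n, i] = np.inf whenever i ≠ n + 1, while the instantiation described above would set p[n, i] = −γ · p_avg for the allowed jumps.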
The third main step is to generate the score following video. In order to translate the predicted segment-level alignment into a score following video, we need additional auxiliary information from the bootleg score feature computation. For the audio recording, we need to keep track of the correspondence between each MIDI bootleg score feature column and its corresponding time in the audio recording. For the sheet music, we need to keep track of the correspondence between each sheet music bootleg score feature column and its corresponding page and pixel range in the sheet music images. We modified the original code provided in [27] to return this information in addition to the bootleg score features. Given this auxiliary information and the predicted segment-level alignment, we can generate the score following video in a very straightforward manner: we simply show the predicted line of sheet music at every time instant in the audio recording.
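A sketch of this last step is shown below, where col_times and line_regions are hypothetical stand-ins for the auxiliary metadata described above, and the frame rate is our choice.

```python
def render_schedule(path, col_times, line_regions, fps=30):
    """path: (line, MIDI column) pairs from the segment-level alignment.
    col_times[j]: audio time (sec) of MIDI bootleg score column j.
    line_regions[i]: (page, y_min, y_max) pixel range of sheet music line i.
    Returns the region of sheet music to display for each video frame."""
    changes = [(col_times[j], i) for i, j in path]   # time-sorted (time, line)
    n_frames = int(col_times[-1] * fps)
    schedule, idx = [], 0
    for f in range(n_frames):
        t = f / fps
        # Advance to the most recent line transition at or before time t.
        while idx + 1 < len(changes) and changes[idx + 1][0] <= t:
            idx += 1
        schedule.append(line_regions[changes[idx][1]])
    return schedule
```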
3. EXPERIMENTAL SETUP
In this section, we explain the datasets and metrics used to evaluate our proposed system.

Our data is a derivative of the Sheet MIDI Retrieval dataset [27]. We will first describe the original dataset, and then explain how we used it to generate the data for the current work. The original dataset contains scanned sheet music from IMSLP for 200 solo piano pieces across 25 composers. The sheet music comes with manual annotations of how many lines of music are on each page and how many measures are on each line. For each of the 200 pieces, there is a corresponding MIDI file and ground truth annotations of measure-level timestamps.

We derived our dataset in the following manner. We synthesize the MIDI files to audio using the FluidSynth library. By combining the sheet music and MIDI annotations, we determine the time intervals in the audio recording that correspond to each line of sheet music. For each sheet music PDF in the Sheet MIDI Retrieval dataset, we retrieved the original PDF from the IMSLP website. The only difference between these two files is that the original IMSLP PDF contains other unrelated movements, pieces, and filler pages that were removed during the preparation of the Sheet MIDI Retrieval dataset. For example, one PDF in the test set contains 127 pages, of which only 17 correspond to the piece of interest. Because we want to test how well our system handles this type of noise, we use the original PDF with no preprocessing or data cleaning whatsoever. We augmented the sheet music annotations by converting the original IMSLP PDFs into PNG files at 300 dpi and manually annotating the vertical pixel range of every line of sheet music played in the audio recording. By combining all of our annotations together, we can determine the page and pixel range of the line of sheet music that is being played at every point in the audio recording. Because there are no repeats or jumps in this sheet music, we call this data the "No Repeat" dataset.

We also generate several synthetic datasets to test how well our system handles jumps and repeats. The process of generating a synthetic dataset consists of three steps, as shown in Figure 3. The first step is to identify the L + 1 boundary positions of the L lines of sheet music that are played in the audio recording. The second step is to randomly sample k boundary points without replacement. The value of k depends on the type of jumps we want to simulate. In this work, we consider four schemas: 1 repeat (k = 2), 2 repeats (k = 3), 3 repeats (k = 4), and D.S. al fine (k = 3). The third step is to splice and concatenate the audio to generate a modified audio recording, as shown in Figure 3. Note that all of the synthetic datasets have the exact same sheet music, but their audio recordings have been spliced to reflect the desired schema. Since the sampling process is random, we generate five different samples for every audio recording, and the ground truth annotations are modified accordingly. A sketch of the splicing procedure is given below.
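To illustrate the splicing step, here is a sketch of the single-repeat schema (k = 2) under our reading of Figure 3; the multi-repeat and D.S. al fine schemas chain the same idea with more sampled boundary points. The function name and interface are ours.

```python
import numpy as np

def splice_one_repeat(audio, sr, boundaries, rng=np.random.default_rng(0)):
    """audio: mono waveform; sr: sample rate; boundaries: the L + 1 line
    boundary times (sec). Samples two interior boundaries p1 < p2 and plays
    [start, p2] followed by [p1, end], so the lines between p1 and p2 are
    performed twice (the k = 2 schema)."""
    p1, p2 = np.sort(rng.choice(boundaries[1:-1], size=2, replace=False))
    seg = lambda a, b: audio[int(a * sr):int(b * sr)]
    return np.concatenate([seg(boundaries[0], p2), seg(p1, boundaries[-1])])
```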
We evaluate system performance using a simple accuracy metric. Because our goal is to generate score following videos, we want an evaluation metric that correlates with user experience. The accuracy simply indicates the percentage of time that the correct line of music is being shown to the user. When calculating accuracy, we use a scoring collar, in which small intervals (t_i − Δt, t_i + Δt) around the ground truth transition timestamps t_i are ignored during scoring (a code sketch of this metric is given at the end of this section). This is standard practice in evaluating time-based segmentation tasks like speech activity detection [1]. By using a range of scoring collar values, we can also gain insight into what fraction of our errors occur very close to the transition boundaries.

Figure 4. Comparison of system performance on benchmarks with various types of jumps. The bar levels indicate accuracy with a scoring collar of 0.5 sec. The short gray lines indicate accuracy with scoring collars of 0 and 1.0 seconds.

For all experiments, we use (the same) 40 pieces for training and 160 pieces for testing. This results in 160 test queries for the No Repeat benchmark and 160 × 5 = 800 test queries for each of the benchmarks with jumps. Since we treat the bootleg score computation and automatic music transcription as fixed feature extractors, our system has no trainable weights and only two hyperparameters (α, γ). Accordingly, we only use a small fraction of the data for developing the algorithm, and we reserve most of the data for testing.
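A small sketch of the accuracy metric with a scoring collar, under the assumption that predictions and ground truth are sampled on a common time grid (the sampling scheme is our choice, not specified in the paper):

```python
import numpy as np

def collar_accuracy(times, pred_lines, true_lines, transition_times, collar):
    """times: sampling instants (sec); pred_lines/true_lines: line index shown
    vs. correct at each instant; transition_times: ground-truth t_i (sec).
    Instants within +/- collar seconds of any transition are excluded."""
    times = np.asarray(times)
    keep = np.ones(times.shape, dtype=bool)
    for t in transition_times:
        keep &= ~((times > t - collar) & (times < t + collar))
    correct = np.asarray(pred_lines)[keep] == np.asarray(true_lines)[keep]
    return float(np.mean(correct))
```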
4. RESULTS
In this section, we present our experimental results on the piano score following video generation task.

We compare our proposed system to three baseline systems. The first baseline system ('bscore-subDTW') is identical to our proposed system in Figure 1 except that it replaces Hierarchical DTW with a simple subsequence DTW. The second baseline system ('bscore-jumpDTW') is also identical to our proposed system except that it replaces Hierarchical DTW with Jump DTW [13]. Because Jump DTW was designed to handle jumps and repeats, we expect this system to provide the most competitive baseline results. The third baseline system ('Dorfer-subDTW') is based on Dorfer et al. [9]. This system approaches the audio–sheet music alignment task by training a multimodal CNN to project chunks of sheet music and chunks of audio spectrogram into the same feature space, where similarity can be computed directly. We used the pretrained CNN provided in [9] as a feature extractor and then applied subsequence DTW. Finally, our proposed Hierarchical DTW system is indicated as 'bscore-hierDTW.'

Figure 4 shows the results of these four systems. The histogram bars indicate the accuracies with a scoring collar of Δt = 0.5 sec. There are four things to notice about these results. First, the Dorfer-subDTW system performs poorly on all benchmarks. This indicates that this system does not generalize well to the scanned sheet music from IMSLP. Second, the bscore-subDTW system performs well on the No Repeat benchmark, but performs poorly on all other benchmarks (e.g. on the Repeat 3 benchmark). This is to be expected, since subsequence DTW cannot handle jumps and repeats. Third, Jump DTW is significantly worse than subsequence DTW on the No Repeat benchmark, but it has consistent performance across the benchmarks with repeats and jumps. This indicates that Jump DTW is able to cope with discontinuities, but at a significant cost in performance. Fourth, the Hierarchical DTW system is only slightly worse than subsequence DTW on the No Repeat benchmark, and its performance decreases only slightly on the other benchmarks. We can see that the Hierarchical DTW system consistently outperforms Jump DTW by 10-13% across all benchmarks. These results indicate that Hierarchical DTW is able to handle repeats and jumps reasonably well, and with a much smaller performance cost than Jump DTW.
5. ANALYSIS
In this section, we conduct two different analyses to gain more insight into system behavior.
The first analysis answers the question, "What are the failure modes for each system?" To answer this question, we identified the individual queries that had the poorest accuracy, and then investigated the reasons for the errors.

The Dorfer system has two primary failure modes. The first failure mode is that the system is not designed to handle jumps, so it performs very poorly on any dataset with jumps or repeats. Note, however, that this system also performs poorly on the No Repeat benchmark. When we investigated the reasons for this, we discovered the second major failure mode: page segmentation. The sub-system for segmenting each page into lines of music performed very poorly on many pages in the dataset. This is perhaps not surprising, since the original system was developed and trained on synthetic sheet music, where staff lines are perfectly horizontal. The assumptions in that work do not translate well to our task of working with IMSLP scanned sheet music.

The subsequence DTW system also has two primary failure modes. The first is (again) that the system cannot handle jumps or repeats. When we investigated the reasons for major errors on the No Repeat benchmark, we found that the failures primarily come from mistakes in the bootleg score representation. The bootleg score does not account for octave markings or clef changes, and it does not detect non-filled noteheads (e.g. half or whole notes). When there are long stretches of sheet music that contain several of these elements at the same time, the bootleg score is a poor representation of the sheet music. For example, three of the pieces in the test set are Erik Satie's Gymnopédies, where the sheet music is almost entirely non-filled noteheads. These pieces had close to 0% accuracy and caused a decrease of several percentage points in the aggregate accuracy score.

The Jump DTW system has one primary failure mode: it often jumps to incorrect lines of music. This occurs when either (a) there are similar lines of music in multiple places (e.g. the recapitulation of a theme), or (b) significant bootleg score errors cause the system to match random lines of music elsewhere in the piece. This problem is most clearly seen in the No Repeat benchmark, where it often takes jumps when none are present.

The Hierarchical DTW system has two primary failure modes. The first failure mode is prolonged bootleg score failures, which cause the algorithm to insert spurious small jumps. Once the bootleg score becomes an accurate representation again, the system is usually able to recover. The second failure mode occurs when the sheet music contains very repetitive measures and lines. This problem is particularly bad when the sheet music is very short (e.g. 2-3 pages long) and has jumps or repeats.

Figure 5 shows a visualization tool that was helpful in diagnosing failure modes. The top half of Figure 5 shows four gray strips, each representing the duration of a single audio recording in the No Repeat benchmark. The topmost strip contains black vertical lines indicating the locations of the ground truth sheet music line transitions. The three strips below it show the predictions of the subsequence DTW, Jump DTW, and Hierarchical DTW systems, where errors are shown in red. The bottom half of Figure 5 shows the same information for a query in the Repeat 3 benchmark. The locations of the jumps are indicated with blue vertical lines. We can see many of the failure modes described above.
For example, Jump DTW has spurious jumps in both queries but is able to follow two of the repeats in the bottom query. Subsequence DTW is unable to handle the jumps in the bottom query, but matches well after the last jump occurs. Finally, we can see that the Hierarchical DTW system is able to follow the correct sequence of sheet music lines, and its errors primarily occur close to line transitions.
The second analysis answers the question, "Where are the errors located?" One way we can answer this question is to calculate system performance across a range of values for the scoring collar. This tells us how close the errors are to the line transition boundaries.

Figure 5. Visualization of system predictions for a query with no repeats (top half) and a query with three repeats (bottom half). Each gray strip indicates the duration of the audio recording. The black vertical lines show the ground truth line transitions, and the red regions indicate times when an incorrect line of sheet music is being shown.

Figure 4 shows the results of each system with various scoring collar values. The histogram bar level indicates the default scoring collar Δt = 0.5 sec, and the results with Δt set to 0 sec and 1.0 sec are shown as short horizontal gray lines directly below and above the histogram bar level, respectively. Note that as Δt increases, the accuracy increases monotonically.

There are two things to notice about the results with various scoring collars. First, we see that even with a generous scoring collar of Δt = 1 sec, the accuracies of all systems only increase by about 1-2%. This indicates that most of the errors are not slight misalignments at the line transitions, but are instead large errors due to total alignment failures. Second, we observe that the performance of Hierarchical DTW on benchmarks with jumps is only marginally worse than on the No Repeat benchmark. This indicates that Hierarchical DTW is able to handle discontinuities reasonably well. Combining these two observations, the failures in the bscore-hierDTW system seem to primarily come from large misalignments due to prolonged bootleg score failures. This strongly suggests that the performance bottleneck is the bootleg score representation, not the Hierarchical DTW alignment.
6. CONCLUSION
We present a method for generating piano score following videos. Our approach uses several recently proposed systems to convert both the sheet music and audio into bootleg score representations. We then apply a novel alignment algorithm called Hierarchical DTW, which performs alignment at both the feature level and the segment level in order to handle repeats, jumps, and unknown offsets in the sheet music. We perform experiments with completely unprocessed sheet music from IMSLP, and we show that Hierarchical DTW significantly outperforms a previously proposed Jump DTW algorithm for handling jumps and repeats. For future work, we would like to extend the system to automatically identify a piece and retrieve the corresponding sheet music from the IMSLP database.

7. REFERENCES

[1]
NIST Open Speech-Activity-Detection Evaluation Plan, 2016 (accessed May 6, 2020).

[2] Andreas Arzt and Gerhard Widmer. Towards effective 'any-time' music tracking. In Proc. of the Starting AI Researchers' Symposium, 2010.

[3] Andreas Arzt, Gerhard Widmer, and Simon Dixon. Automatic page turning for musicians via real-time machine listening. In Proc. of the European Conference on Artificial Intelligence (ECAI), pages 241–245, 2008.

[4] David Damm, Christian Fremerey, Frank Kurth, Meinard Müller, and Michael Clausen. Multimodal presentation and browsing of music. In Proc. of the International Conference on Multimodal Interfaces (ICMI), pages 205–208, 2008.

[5] Matthias Dorfer, Andreas Arzt, Sebastian Böck, Amaury Durand, and Gerhard Widmer. Live score following on sheet music images. In Late Breaking Demos at the International Conference on Music Information Retrieval (ISMIR), 2016.

[6] Matthias Dorfer, Andreas Arzt, and Gerhard Widmer. Towards end-to-end audio-sheet-music retrieval. In Neural Information Processing Systems (NIPS) End-to-End Learning for Speech and Audio Processing Workshop, 2016.

[7] Matthias Dorfer, Andreas Arzt, and Gerhard Widmer. Towards score following in sheet music images. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 789–795, 2016.

[8] Matthias Dorfer, Andreas Arzt, and Gerhard Widmer. Learning audio-sheet music correspondences for score identification and offline alignment. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 115–122, 2017.

[9] Matthias Dorfer, Jan Hajič, Andreas Arzt, Harald Frostel, and Gerhard Widmer. Learning audio-sheet music correspondences for cross-modal retrieval and piece identification. Trans. of the International Society for Music Information Retrieval, 1(1):22–33, 2018.

[10] Matthias Dorfer, Florian Henkel, and Gerhard Widmer. Learning to listen, read, and follow: Score following as a reinforcement learning game. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 784–791, 2018.

[11] Matthias Dorfer, Jan Schlüter, Andreu Vall, Filip Korzeniowski, and Gerhard Widmer. End-to-end cross-modality retrieval with CCA projections and pairwise ranking loss. International Journal of Multimedia Information Retrieval, 7(2):117–128, 2018.

[12] Christian Fremerey, Michael Clausen, Sebastian Ewert, and Meinard Müller. Sheet music-audio identification. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 645–650, 2009.

[13] Christian Fremerey, Meinard Müller, and Michael Clausen. Handling repeats and jumps in score-performance synchronization. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 243–248, 2010.

[14] Christian Fremerey, Meinard Müller, Frank Kurth, and Michael Clausen. Automatic mapping of scanned sheet music to audio recordings. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 413–418, 2008.

[15] Maarten Grachten, Martin Gasser, Andreas Arzt, and Gerhard Widmer. Automatic alignment of music performances with structural differences. In Proc. of the International Society for Music Information Retrieval Conference (ISMIR), pages 607–612, 2013.

[16] Curtis Hawthorne, Erich Elsen, Jialin Song, Adam Roberts, Ian Simon, Colin Raffel, Jesse Engel, Sageev Oore, and Douglas Eck. Onsets and frames: Dual-objective piano transcription. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 50–57, 2018.

[17] Florian Henkel, Stefan Balke, Matthias Dorfer, and Gerhard Widmer. Score following as a multi-modal reinforcement learning problem. Trans. of the International Society for Music Information Retrieval, 2(1):67–81, 2019.

[18] Yucong Jiang, Fiona Ryan, David Cartledge, and Christopher Raphael. Offline score alignment for realistic music practice. In Sound and Music Computing Conference, 2019.

[19] Cyril Joder, Slim Essid, and Gaël Richard. A conditional random field framework for robust and scalable audio-to-score matching. IEEE Trans. on Audio, Speech, and Language Processing, 19(8):2385–2397, 2011.

[20] Frank Kurth, Meinard Müller, Christian Fremerey, Yoon-Ha Chang, and Michael Clausen. Automated synchronization of scanned sheet music with audio recordings. In Proc. of the International Conference on Music Information Retrieval (ISMIR), pages 261–266, 2007.

[21] Meinard Müller. Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015.

[22] Meinard Müller and Daniel Appelt. Path-constrained partial music synchronization. In Proc. of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pages 65–68, 2008.

[23] Meinard Müller, Andreas Arzt, Stefan Balke, Matthias Dorfer, and Gerhard Widmer. Cross-modal music retrieval and applications: An overview of key methodologies. IEEE Signal Processing Magazine, 36(1):52–62, 2019.

[24] Tomohiko Nakamura, Eita Nakamura, and Shigeki Sagayama. Real-time audio-to-score alignment of music performances containing errors and arbitrary repeats and skips. IEEE/ACM Trans. on Audio, Speech, and Language Processing, 24(2):329–339, 2015.

[25] Bryan Pardo and William Birmingham. Modeling form for on-line following of musical performances. In Proc. of the National Conference on Artificial Intelligence, volume 20, pages 1018–1023, 2005.

[26] Verena Thomas, Christian Fremerey, Meinard Müller, and Michael Clausen. Linking sheet music and audio – challenges and new approaches. In Multimodal Music Processing, volume 3, pages 1–22, 2012.

[27] Daniel Yang, Thitaree Tanprasert, Teerapat Jenrungrot, Mengyi Shan, and TJ Tsai. MIDI passage retrieval using cell phone pictures of sheet music. In Proc. of the International Conference on Music Information Retrieval (ISMIR), 2019.