A Computational Analysis of Real-World DJ Mixes using Mix-To-Track Subsequence Alignment
Taejun Kim, Minsuk Choi, Evan Sacks, Yi-Hsuan Yang, Juhan Nam
Graduate School of Culture Technology, KAIST, South Korea
Research Center for IT Innovation, Academia Sinica, Taiwan
{taejun,minsukchoi,juhan.nam}@kaist.ac.kr, [email protected], [email protected]
ABSTRACT
A DJ mix is a sequence of music tracks concatenated seamlessly, typically rendered for audiences in a live setting by a DJ on stage. As DJ mixes are produced in studios or recorded live for music streaming services, computational methods to analyze them, for example, extracting track information or understanding DJ techniques, have drawn research interest. Most previous works, however, are limited to identifying the individual tracks in a mix or segmenting it, and the datasets used are usually small. In this paper, we provide an in-depth analysis of DJ music by aligning a mix to its original music tracks. We set up the subsequence alignment such that the audio features are less sensitive to the tempo or key changes applied to the original track in a mix. This approach provides temporally tight mix-to-track matching from which we can obtain cue points, transition lengths, mix segmentation, and musical changes in DJ performance. Using 1,557 mixes from 1001Tracklists, including 13,728 tracks and 20,765 transitions, we conduct the proposed analysis and present a wide range of statistics, which may elucidate the creative process of DJ music making.
1. INTRODUCTION
A Disc Jockey (DJ) is a musician who plays a sequence of existing music tracks or sound sources seamlessly by manipulating the audio content based on musical elements. The outcomes can be medleys (mixes), mash-ups, remixes, or even new tracks, depending on how much DJs edit the substance of the original music tracks. Among them, creating a mix is the most basic role of DJs. It involves curating music tracks and their sections to play, deciding the order, and modifying them to splice one section to another as a continuous stream. In each step, DJs consider various elements of the tracks such as tempo, key, beat,
chord, rhythm, structure, energy, mood, and genre. These days, DJs create mixes not only for live audiences but also for listeners on music streaming services.

Recently, imitating the tasks of DJs using computational methods has drawn research interest [1–7]. On the other hand, efforts have been made to understand the creative process of DJ music making. From the perspective of reverse engineering, tasks that extract useful information from real-world DJ mixes serve this pursuit. In the literature, at least the following tasks have been studied. (1)
Track identification [8–10]: identifying which tracks are played in DJ music, which can be either a mix or a manipulated track. (2) Mix segmentation [11, 12]: finding boundaries between tracks in a DJ mix. (3) Mix-to-track alignment [13, 14]: aligning the original track to an audio segment in a DJ mix. (4) Cue point extraction [14]: finding when a track starts and ends in a DJ mix. (5)
Transition unmixing [13, 14]: explaining how DJs apply audio effects to make a seamless transition from one track to another. However, the previous studies focused on solving these tasks, usually with small datasets, and did not provide further analysis using the extracted information. For example, Sonnleitner et al. [8] used 18 mixes for track identification. Glazyrin [11] and Scarfe et al. [12] collected 103 and 339 mixes, respectively, with boundary timestamps for mix segmentation. The majority of previous studies concentrated on identification and segmentation, and the few studies on the other three tasks used artificially generated datasets [13, 14].

To address the need for a large-scale study, we collected a total of 1,557 real-world mixes, along with the original tracks played in them, from 1001Tracklists, a community-based DJ music service. The mixes include 13,728 unique tracks and 20,765 transitions. However, tracks used in DJ mixes often exist in multiple versions, so-called "extended mixes", "remixes", or "edits". Also, a few tracks in the track lists of the collected dataset are annotated incorrectly by users. Therefore, an alignment algorithm is required to ensure that the collected tracks are exactly the same versions as the ones used in the mixes. More importantly, the alignment is a foundation for further computational analysis of DJ mixes. With these two motivations, we set up the mix-to-track subsequence dynamic time warping (DTW) [15] such that the mix can be aligned with the original tracks in the presence of possible tempo or key changes. The warping paths from the DTW provide temporally tight mix-to-track matching from which we can obtain cue points, transition lengths, and key/tempo changes in DJ performances in a quantitative way. To evaluate the alignment and the cue point extraction methods simultaneously, we evaluate mix segmentation performance, regarding the extracted cue points as boundaries dividing two adjacent tracks in mixes and comparing them to human-annotated boundaries. Furthermore, by observing how the performance changes depending on the three different types of cue points, we analyze the policy human annotators follow when marking track boundaries.

Although DJ techniques are complicated and vary depending on the characteristics of tracks, there is common knowledge for making seamless DJ mixes. To the best of our knowledge, however, this domain knowledge has never been addressed in the literature with statistical evidence obtained by computational analysis. In this study, we analyze the DJ mixes using the results from the subsequence DTW mentioned above to examine the following hypotheses: 1) DJs tend not to change the tempo and/or key of tracks much, to avoid distorting the original essence of the tracks. 2) DJs make seamless transitions from one track to another considering the musical structures of the tracks.
3) DJs tend to select cue points at similar positions in a single track.

The analysis is performed on the results obtained from the subsequence alignment and provides statistical insights into tempo adjustment, key transposition, track-to-track transition lengths, and the agreement of cue points among DJs. We hope that the proposed analysis and the various statistics may elucidate the creative process of DJ music making. The source code for the mix-to-track subsequence DTW, the cue point analysis, and the mix segmentation is available at https://github.com/mir-aidj/djmix-analysis/
2. THE DATASET
Our study is based on DJ music from 1001Tracklists. We obtained a collection of DJ mix metadata via direct personal communication with 1001Tracklists. Each entry of the metadata contains a track list, boundary timestamps, and a genre label, as well as web links to the audio files of the mix and its tracks. We downloaded the audio separately from the linked media service websites on our own. We found that a small number of web links to tracks are incorrect, so we filtered them out automatically with a mix-to-track alignment method (see Section 3.3). The boundary timestamps of tracks in a mix are annotated by the users of 1001Tracklists.

Table 1 summarizes the statistics of the dataset. The original size of the dataset is denoted as 'All' and the size after filtering as 'Matched'. Note that the number of played tracks is greater than the number of unique tracks, as a track can be played in multiple mixes. The dataset includes a variety of genres but mostly focuses on House and Trance music. More detailed statistics are available on the companion website: https://mir-aidj.github.io/djmix-analysis/

Summary statistic                               All     Matched
Number of mixes                               1,564       1,557
Number of unique tracks                      15,068      13,728
Number of played tracks                      26,776      24,202
Number of transitions                        24,344      20,765
Total length of mixes (hours)                 1,577       1,570
Total length of unique tracks (hours)         1,038         913
Average length of mixes (minutes)              60.5        60.5
Average length of unique tracks (minutes)       4.1         4.0
Average number of played tracks per mix        17.1        15.5
Average number of transitions per mix          14.5        12.9

Table 1. Statistics of the dataset. The original dataset size is denoted as 'All' and the size after filtering as 'Matched'.
3. MIX-TO-TRACK SUBSEQUENCE ALIGNMENT
The objective of mix-to-track subsequence alignment is to find an optimal alignment path between a subsequence of a mix and a track used in the mix. This alignment result is the basis of diverse DJ mix analyses concerning cue points, track boundaries, key/tempo changes, and transition lengths. We also use it for removing non-matching tracks. This section describes the details of the computational process.
3.1 Audio Features

When DJs create a mix, they often adjust the tempo and/or key of the tracks or add audio effects to them. Live mixes contain further changes in timbre and even other sound sources, such as the DJ's voice. In order to address the acoustic and musical variations between the original track and the matched subsequence in the mix, we use beat-synchronous chroma and mel-frequency cepstral coefficients (MFCC). The beat-synchronous feature representations enable tempo invariance and dramatically reduce the computational cost of the alignment. The aggregation of the features from the frame level to the beat level also smooths out local timbre variations. The chroma feature, on the other hand, facilitates key invariance, as a circular shift of the 12-dimensional vector corresponds to a key transposition. The MFCC feature captures general timbre characteristics. We used Librosa (https://librosa.github.io/librosa/) to extract the chroma and MFCC features with the default options, except that the dimensionality of the MFCC was set to 12 and the type of chroma was set to chroma energy normalized statistics (CENS) [16].

3.2 Transposition-Invariant Subsequence DTW

We compute the alignment by applying subsequence DTW to the beat-synchronous features [15]. We used an implementation from Librosa, adopting the transposition-invariant approach from [17]. Specifically, we calculated 12 versions of the chroma features by performing all possible circular shifts on the original track side and selected the one with the lowest matching cost in the subsequence DTW. This returns not only the optimal alignment path but also the key transposition value of the original track.

Figure 1 shows three examples of the alignment results when different combinations of features (MFCC, chroma, and key-invariant chroma) are used. When the alignment path of a subsequence satisfies the match-rate criterion (described in Section 3.3), we put a color strip corresponding to the feature at the bottom of the figure. Since we use beat-synchronous representations, the warping paths become diagonal with a slope of one if a mix and a track are successfully aligned. The top panel in the figure shows a successfully aligned example where, for most tracks and features, the warping paths are straight diagonal lines. The middle panel shows a failing example, caused by crowd noise recorded in the mix. The bottom panel shows an example where chroma with circular shift works distinctively better than the others, as the DJ frequently uses key transposition in the mix. (Audio for the three example mixes: https://1001.tl/14jltnct, https://1001.tl/15fulzc1, https://1001.tl/bcx2z0t)

Figure 1. Visualizations of the result of DTW-based mix-to-track subsequence alignment between a mix and the original tracks played in that mix. The colored solid lines show the warping paths of the alignment depending on the input feature and on whether the transposition-invariant method is applied in the subsequence DTW. The numbers tagged on the warping paths and ground-truth boundaries indicate played and timestamped track indices in the mix, respectively. A colored bar at the bottom is added if the alignment of the method is considered successful according to the match rate. (Top) A correctly matched example. (Middle) An unsuccessful example, due to the low sound quality of the mix. (Bottom) An example where the alignment is improved using the key-invariant chroma. Best viewed in color.

3.3 Match Rate and Filtering

As stated above, we can measure the quality of the alignment from the warping path. Ideally, when every single move on the path is diagonal, that is, one beat at a time on both the track and mix axes, we obtain a perfectly straight diagonal line. However, acoustic and musical changes deform the path. We define the ratio of diagonal moves in a mix (one move per beat) as the match rate and use it for filtering out incorrectly annotated tracks. We experimentally chose 0.4 as the threshold. The size of the dataset after the filtering is denoted as "Matched" in Table 1. We only use the matched tracks for the analysis in this paper.
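To make Sections 3.1–3.3 concrete, the following Python sketch outlines the pipeline with Librosa: beat-synchronous feature extraction, the transposition-invariant subsequence DTW, and the match rate used for filtering. It is a minimal illustration under our own naming, not the paper's released implementation (which is at https://github.com/mir-aidj/djmix-analysis/); details such as hop sizes follow Librosa defaults.

```python
import numpy as np
import librosa

def beat_sync_features(path):
    """Beat-synchronous CENS chroma and 12-dim MFCCs for one audio file."""
    y, sr = librosa.load(path)
    _, beats = librosa.beat.beat_track(y=y, sr=sr)
    chroma = librosa.feature.chroma_cens(y=y, sr=sr)    # 12 x n_frames
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)  # 12 x n_frames
    # Aggregate frame-level features to the beat level (tempo invariance).
    return (librosa.util.sync(chroma, beats, aggregate=np.median),
            librosa.util.sync(mfcc, beats, aggregate=np.mean))

def align_track_to_mix(track_chroma, mix_chroma):
    """Transposition-invariant subsequence DTW: try all 12 circular
    shifts of the track-side chroma and keep the cheapest alignment."""
    best_cost, best_shift, best_path = np.inf, 0, None
    for shift in range(12):
        shifted = np.roll(track_chroma, shift, axis=0)
        D, wp = librosa.sequence.dtw(X=shifted, Y=mix_chroma, subseq=True)
        cost = D[-1].min()  # cheapest end position of the track in the mix
        if cost < best_cost:
            best_cost, best_shift, best_path = cost, shift, wp
    return best_path, best_shift  # path (end-first) and semitone shift

def match_rate(wp):
    """Fraction of path steps that are strictly diagonal (one beat on
    both axes); the paper keeps tracks with a rate of at least 0.4."""
    steps = np.diff(wp[::-1], axis=0)
    return float(np.all(steps == 1, axis=1).mean())
```

Filtering then amounts to discarding a collected track whenever `match_rate` for its best warping path falls below the 0.4 threshold.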
4. CUE POINT EXTRACTION
Cue points are timestamps in a track that indicate where the track starts and ends in a mix. Determining the cue points of the played tracks is an essential task of DJ mixing. This section describes how we extract cue points using the warping paths obtained from the aforementioned mix-to-track subsequence alignment.
We first define terms related to cue points. In the context of a track-to-track transition, the cue-out point is the timestamp at which the previous track starts fading out and the next track starts fading in, and the cue-in point is the timestamp at which the previous track is fully faded out and only the next track is being played. The transition region is defined as the time interval from the cue-out point of the previous track to the cue-in point of the next track. Additionally, we define the cue-mid point as the middle of a transition, which can technically be considered the boundary of the transition.
The mix-to-track alignment results naturally yield the cue points of matched tracks. Figure 2 shows an example of extracted cue points (a zoomed-in view of the top panel of Figure 1). The two alignment paths drift from the diagonal lines in the transition region (between beats 2310 and 2324 on the mix axis) because the two tracks cross-fade. Based on this observation, we detect the cue-out point of the previous track by finding the last beat whose preceding 32 beats have diagonal moves in the alignment path. Likewise, we detect the cue-in point of the next track by finding the first beat whose succeeding 32 beats have diagonal moves in the alignment path.

Figure 2. A zoomed-in view of a mix-to-track subsequence alignment illustrating the three types of extracted cue points. The two solid lines are warping paths representing the alignment between the mix and the tracks (the fourth and fifth warping paths from the top panel of Figure 1). The vertical colored dotted lines mark the extracted cue points on the mix and the horizontal dotted lines mark the corresponding points on each track. The vertical black dotted line is a human-annotated ground-truth boundary between the two tracks. Best viewed in color.
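A minimal sketch of this detection rule, assuming a beat-level warping path as produced by the alignment sketch in Section 3 (the function name and the forward-ordering convention are ours):

```python
import numpy as np

def extract_cue_points(wp, run=32):
    """Detect cue-in/cue-out beats from one track's warping path:
    cue-in is the first beat followed by `run` diagonal moves and
    cue-out is the last beat preceded by `run` diagonal moves."""
    path = np.asarray(wp)[::-1]           # librosa returns the path end-first
    steps = np.diff(path, axis=0)
    diag = np.all(steps == 1, axis=1)     # True where both axes advance 1 beat
    cue_in = cue_out = None
    for i in range(len(diag) - run + 1):  # first start of a stable diagonal
        if diag[i:i + run].all():
            cue_in = path[i]              # (track beat, mix beat)
            break
    for i in range(len(diag), run - 1, -1):  # last end of a stable diagonal
        if diag[i - run:i].all():
            cue_out = path[i]
            break
    return cue_in, cue_out
```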
5. MIX SEGMENTATION
The goal of mix segmentation is to divide a continuous DJ mix into individual tracks, which can enhance the listening experience and serve as a foundation for further analysis or learning of DJ mixes. Since DJs make seamless transitions, it is difficult to notice that a track is fading in or out. To quantify this difficulty, a previous study analyzed how accurately humans create boundary timestamps and found that the standard deviation of human disagreement on track boundaries in mixes is about 9 seconds, which implies that finding the optimal boundaries is difficult even for humans [12]. Furthermore, the ambiguous definition of a boundary and the long lengths of transitions make it difficult to annotate boundary timestamps [8].
Given the extracted cue points, we can estimate the track boundaries with three possible choices. The first is the position where the next track fully appears (cue-in point), the second is the position where the previous track starts to disappear (cue-out point), and the last is the middle of the transition (cue-mid point). By comparing each of them with human-annotated boundary timestamps, we can measure which type of cue point humans tend to consider a boundary.

Figure 3 shows three histograms, each computed from the differences, in beats, between human-annotated boundary timestamps and one of the cue point types. The overall trend shows that the distribution for the cue-in point is the most skewed towards zero. Interestingly, the distribution for the cue-out point has more distinctive peaks around every 32 beats than the distribution for the cue-in point. Considering that the histogram of transition lengths has peaks at every 32 beats, as shown in Figure 6, this reflects that human annotators tend to label cue-in points rather than cue-out points as boundaries (note that the transition length is computed by subtracting the cue-out point of the previous track from the cue-in point of the next track). On the other hand, the distribution for the cue-mid point is a gradually decreasing curve without peaks. While this distribution looks like a better estimate than the cue-out point, Table 3 shows the opposite result: in terms of the number of cue points closest to the human annotations, the cue-out point is second and the cue-mid point is the worst among the three types. These results indicate that the cue-mid point is a safe choice; although it is the least likely to be a boundary, as shown in Table 3, the difference between the estimate and the human annotation is relatively small because it lies in the middle of the transition region.

Table 2 shows, on its left side, the difference between human-annotated boundary timestamps and each cue point type in terms of median time (in seconds). The overall trend confirms that the cue-in point is the best estimate of a track boundary and that the cue-mid point is a safer choice than the cue-out point. The table also shows the result of "cue-best", computed by taking the minimum difference among the three cue point types for each transition region. In this case, the median time differences decrease dramatically, to 4-5 seconds. On its right side, Table 2 shows the difference between human-annotated boundary timestamps and the cue-in point in terms of hit rates. A hit rate is computed as the ratio of correct estimates given a tolerance window: if an estimate is within the tolerance window around the human-annotated boundary timestamp, it is regarded as correct. We set three tolerance windows (15, 30, and 60 seconds); given that the average tempo of tracks in the dataset is 127 beats per minute (BPM), these windows approximately correspond to 32, 64, and 128 beats (multiples of a phrase unit). The best hit rate with the 30-second window (about 64 beats) is above 80%. Given the long transition times shown in Figure 6, the cue-in point may be considered a reasonable choice.
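As a concrete sketch of this evaluation (our own helper, with per-transition timestamps assumed to be given in seconds), the median difference and hit rates can be computed as follows:

```python
import numpy as np

def segmentation_scores(cue_points, boundaries, tolerances=(15, 30, 60)):
    """Median absolute time difference and hit rates for one cue type.
    cue_points, boundaries: parallel arrays of timestamps in seconds,
    one entry per transition (a sketch of the paper's metrics)."""
    diff = np.abs(np.asarray(cue_points) - np.asarray(boundaries))
    hit_rates = {tol: float((diff <= tol).mean()) for tol in tolerances}
    return float(np.median(diff)), hit_rates
```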
Table 2 also compares the median time difference between human-annotated boundary timestamps and each cue point type for the different audio features used in the subsequence DTW. In general, the chroma features are a better choice than MFCC (p-value of t-test < 0.001 for chroma with or without key invariance). When chroma and MFCC are combined, the median time difference is slightly reduced, but the difference is statistically insignificant (p-value of t-test > 0.1). However, we observed that the subsequence DTW does not work well for some genres, such as Techno, that contain only drum and ambient sounds. This might be improved by using MFCCs with a larger number of bins or by using mel-spectrograms. The use of key-invariant chroma generally does not make much difference because key transposition is not performed frequently, as discussed in Section 6.2.

Figure 3. Histograms of distances to ground-truth boundaries, in beats, for each type of cue point (cue-out, cue-in, cue-mid). The dotted lines are plotted at every 32 beats, which is usually considered a phrase in the context of dance music.

             Median time difference (in seconds)      Cue-in hit rate
Feature      Cue-out   Cue-in   Cue-mid   Cue-best†   15 sec   30 sec   60 sec
MFCC         27.92     14.27    13.55     5.340       0.5187   0.7591   0.9023
Chroma       23.85     11.80    12.33

Table 2. Mix segmentation performance depending on the type of cue point and the input feature used to obtain the warping paths. Median time differences between cue points and ground truths are shown on the left and hit rates of cue-in points with thresholds in seconds on the right. "Key-invariant" indicates applying the key-transposition-invariant method in the DTW. † indicates scores computed using the best of the three cue types.

Cue-out        Cue-in          Cue-mid
6,151 (30%)    10,844 (52%)    3,770 (18%)

Table 3. The number of ground-truth boundary timestamps closest to each type of cue point.
6. MUSICOLOGICAL ANALYSIS OF DJ MIXES
We hypothesize that DJs share common practices in the creative process in terms of tempo change, key transposition, track-to-track transitions, and cue point selection. In this section, we validate these hypotheses using the results from the mix-to-track subsequence alignment and the cue point extraction.
6.1 Tempo Adjustment

We compare the estimated tempo of each original track to the tempo of the audio segment where the track is played in a mix. Figure 4 shows a histogram of the percentage differences in tempo between the original tracks and the corresponding segments in the mixes. For example, a difference of 5% indicates that the tempo of the original track is increased by 5% while played in the mix. As shown in the histogram, the adjusted tempo has a double exponential distribution, meaning the adjustment values are strongly skewed towards zero. In detail, 86.1% of the tempo adjustments are less than 5%, 94.5% are less than 10%, and 98.6% are less than 20%. For anyone implementing a track identification system for DJ mixes that is robust to tempo adjustment, this distribution could serve as a reference.
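One rough way to estimate the tempo change of a matched segment is to compare inter-beat intervals on the two sides of the warping path. The helper below is our own illustration, not the paper's method, and assumes beat times in seconds plus a forward-ordered beat-level path:

```python
import numpy as np

def tempo_change_percent(track_beat_times, mix_beat_times, path):
    """Estimate the tempo adjustment (in percent) of an aligned segment
    by comparing median inter-beat intervals on the track and mix sides."""
    path = np.asarray(path)
    track_ibi = np.median(np.diff(track_beat_times[np.unique(path[:, 0])]))
    mix_ibi = np.median(np.diff(mix_beat_times[np.unique(path[:, 1])]))
    # Shorter beats in the mix than in the track mean the track was sped up.
    return (track_ibi / mix_ibi - 1.0) * 100.0
```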
6.2 Key Transposition

A function called "master tempo" or "key lock", which preserves pitch despite tempo adjustments, is activated by default in modern DJ systems such as stand-alone DJ systems, DJ software, and even turntables for vinyl records. Therefore, a key transposition usually happens only when a DJ intentionally wants to change the key of a track. As mentioned in Section 3.2, the transposition-invariant DTW provides the number of transposed semitones as a by-product. We computed the statistics of key transposition using these values (from the DTW taking both MFCCs and key-invariant chroma). Figure 5 shows a histogram of key transpositions between the original tracks and the audio segments in the mixes. Only 2.5% of the total 24,202 played tracks are transposed and, among those, 94.3% are transposed by only one semitone. This result indicates that DJs generally do not perform key transpositions and leave the "master tempo" function turned on in most cases.
Figure 4. A histogram of the adjusted tempo of tracks in mixes (percentage change, from -20% to +20%).

Figure 5. The number of tracks depending on the number of transposed semitones in mixes.
6.3 Transition Length

Once we have extracted the cue-in and cue-out points of a transition region, we can calculate the transition length. This provides basic hints on how DJs make track-to-track transitions in a mix. Figure 6 shows a histogram of transition lengths in beats, with dotted lines at every 32 beats, a span often considered a phrase in the context of dance music. The histogram has peaks at every phrase. This indicates that DJs consider the repetitive structures of the dominant genres of music when they make transitions or set cue points.
6.4 Cue Point Agreement

Deciding the cue points of the played tracks is a creative choice in DJ mixing. Observing the agreement of cue points on a single track among DJs may reveal common rules. To this end, we collected all extracted cue points for each track and computed the statistics of the deviations in cue-in points and cue-out points among DJs. Specifically, we computed all possible pairs and their distances, separately for cue-in points and cue-out points. Since the two distributions were almost equal, we combined them into the single distribution shown in Figure 7. From the results, 23.6% of all cue point pairs have zero deviation, 40.4% are within one measure (4 beats), 73.6% are within 8 measures, and 86.2% are within 16 measures. This indicates that there are some rules that DJs share in deciding cue points. It would be interesting to perform a detailed pattern analysis to estimate cue points from this data in future work.
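The agreement statistic reduces to pairwise distances between the cue points extracted for the same track across mixes; a small sketch (the names are ours):

```python
from itertools import combinations

def cue_point_deviations(track_to_cues):
    """Pairwise absolute distances (in beats) between cue points chosen
    by different DJs for the same track, pooled over all tracks.
    track_to_cues: dict mapping a track id to a list of cue positions."""
    deviations = []
    for cues in track_to_cues.values():
        deviations.extend(abs(a - b) for a, b in combinations(cues, 2))
    return deviations
```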
Figure 6. A histogram of the transition lengths in beats. The dotted lines are plotted at every 32 beats.

Figure 7. A histogram of distances between cue points of a single track, in beats.
7. CONCLUSIONS
We presented various statistics and analyses of 1,557 real-world DJ mixes from 1001Tracklists. Based on the mix-to-track subsequence DTW, we conducted a cue point analysis of the individual tracks in the mixes and showed the possibility of common rules that DJs share in music making. We also investigated mix segmentation by comparing the three types of cue point to human-annotated boundary timestamps and showed that humans tend to recognize the cue-in points of the next tracks as boundaries. Finally, we showed the statistics of tempo and key changes of the original tracks in DJ performances. We believe this large-scale statistical analysis of DJ mixes can be beneficial for computer-based research on DJ music. The cue point analysis can ground a precise definition of cue points, and the tempo and key analysis can provide a guideline for the musical changes made during DJ mixing.

As future work, we plan to estimate cue points within a track as a step towards automatically generating a mix [3, 4]. Cue point estimation has many applications, such as DJ software and playlist generation on music streaming services. This will require structure analysis or segmentation of a single music track, which is an important topic in MIR. Furthermore, we plan to analyze the transition regions in mixes to investigate DJ mixing techniques. For example, it is possible to estimate the gain changes in the cross-faded region by comparing the two adjacent original tracks and the mix [13, 14]. The methods can be extended to the spectral domain. Such a detailed analysis of mixing techniques will allow us to understand how DJs seamlessly concatenate music tracks and provide a guide for developing automatic DJ systems.

8. ACKNOWLEDGEMENTS
We greatly appreciate 1001Tracklists for offering us the mix metadata employed in this study. We note that the metadata used for this analysis was obtained with permission from 1001Tracklists, and we suggest that people who are interested in the data contact 1001Tracklists directly. This research was supported by the BK21 Plus Postgraduate Organization for Content Science (BK21 Plus Program), the Basic Science Research Program through the National Research Foundation of Korea (NRF-2019R1F1A1062908), and a grant from the Ministry of Science and Technology, Taiwan (MOST107-2221-E-001-013-MY2).
9. REFERENCES

[1] Y.-T. Lin, C.-L. Lee, J.-S. Jang, and J.-L. Wu, "Bridging music via sound effects," IEEE, 2014, pp. 116–122.

[2] R. M. Bittner, M. Gu, G. Hernandez, E. J. Humphrey, T. Jehan, H. McCurry, and N. Montecchio, "Automatic playlist sequencing and transitions," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2017, pp. 442–448.

[3] D. Schwarz, D. Schindler, and S. Spadavecchia, "A heuristic algorithm for DJ cue point estimation," in Proc. Sound and Music Computing (SMC) Conference, 2018.

[4] L. V. Veire and T. De Bie, "From raw audio to a seamless mix: creating an automated DJ system for drum and bass," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2018, no. 1, p. 13, 2018.

[5] A. Kim, S. Park, J. Park, J.-W. Ha, T. Kwon, and J. Nam, "Automatic DJ mix generation using highlight detection," in International Society for Music Information Retrieval Conference (ISMIR), Late-Breaking Paper, 2017.

[6] Y.-S. Huang, S.-Y. Chou, and Y.-H. Yang, "Generating music medleys via playing music puzzle games," in Proc. AAAI Conference on Artificial Intelligence, 2018.

[7] Y.-S. Huang, S.-Y. Chou, and Y.-H. Yang, "DJnet: A dream for making an automatic DJ," in International Society for Music Information Retrieval Conference (ISMIR), Late-Breaking Paper, 2017.

[8] R. Sonnleitner, A. Arzt, and G. Widmer, "Landmark-based audio fingerprinting for DJ mix monitoring," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2016, pp. 185–191.

[9] P. S. Manzano, "Audio fingerprinting techniques for sample identification in electronic music," Master's thesis, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany, 2016.

[10] P. López Serrano, "Analyzing sample-based electronic music using audio processing techniques," Ph.D. dissertation, Friedrich-Alexander-Universität Erlangen-Nürnberg (FAU), Erlangen, Germany, 2019.

[11] N. Glazyrin, "Towards automatic content-based separation of DJ mixes into single tracks," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2014, pp. 149–154. Source code available at https://github.com/nglazyrin/MixSplitter.

[12] T. Scarfe, W. Koolen, and Y. Kalnishkan, "Segmentation of electronic dance music," International Journal of Engineering Intelligent Systems for Electrical Engineering and Communications, vol. 22, no. 3, p. 4, 2014. Source code available at https://github.com/ecsplendid/DanceMusicSegmentation.

[13] L. Werthen-Brabants, "Ground truth extraction & transition analysis of DJ mixes," Master's thesis, Ghent University, Ghent, Belgium, 2018.

[14] D. Schwarz and D. Fourer, "Methods and datasets for DJ-mix reverse engineering," in Proc. International Symposium on Computer Music Multidisciplinary Research (CMMR), 2019, pp. 426–437.

[15] M. Müller, Fundamentals of Music Processing: Audio, Analysis, Algorithms, Applications. Springer, 2015, ch. 7.2.3, Subsequence DTW.

[16] M. Müller and S. Ewert, "Chroma Toolbox: MATLAB implementations for extracting variants of chroma-based audio features," in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2011.

[17] M. Müller and M. Clausen, "Transposition-invariant self-similarity matrices," in Proc. International Conference on Music Information Retrieval (ISMIR), 2007.