Deep Music Information Dynamics
Shlomo Dubnov
Department of Music, University of California San Diego
Abstract.
Music comprises a set of complex simultaneous events organized in time. In this paper we introduce a novel framework that we call Deep Musical Information Dynamics, which combines two parallel streams: a low-rate latent representation stream that is assumed to capture the dynamics of a thought process, contrasted with higher-rate information dynamics derived from the musical data itself. Motivated by rate-distortion theories of human cognition, we propose a framework for exploring possible relations between imaginary anticipations existing in the listener's mind and the information dynamics of the musical surface itself. The model is demonstrated on symbolic (MIDI) data, as accounting for the acoustic surface would require many more layers to capture instrument properties and expressive performance inflections. The mathematical framework is based on variational encoding that first establishes a high-rate representation of the musical observations, which is then reduced using a bit-allocation method into a parallel low-rate data stream. The combined loss considered here includes both the information rate of the time evolution of each stream and the fidelity of encoding, measured in terms of mutual information between the high- and low-rate representations. In the simulations presented in the paper we are able to juxtapose aspects of latent (imaginary) surprisal versus surprisal of the musical surface in a manner that is quantifiable and computationally tractable. The set of computational tools is discussed in the paper, suggesting that the trade-off between compression and prediction is an important factor in the analysis and design of time-based music generative models.
Music Information Dynamics is a field of music analysis inspired by theories of musical anticipation (Meyer, 1956; Huron, 2006), which deals with quantifying the amount of information passing over time between the past and future of a musical signal (Dubnov, 2006; Abdallah & Plumbley, 2009). Music Information Dynamics can be estimated in terms of Information Rate, defined as the mutual information between the past and future of a musical signal. Generative models that maximize information rate were shown to provide good results in machine improvisation systems (Pasquier, Eigenfeldt, Bown, & Dubnov, 2017). Since music is constantly changing, the ability to capture structure in time depends on the way similarity is computed over time. The underlying motivation in proposing a deep information model is to acknowledge the fact that imagination, for the composer, the improviser, and the listener alike, plays an important and possibly even crucial role in experiencing and creating music. Music generation and listening are active processes that involve simultaneous processing of the incoming musical information in order to extract salient features, while at the same time predicting the evolution of those features over time, an aspect that builds anticipation and allows the creation of surprise, the validation or violation of expectation, and the building of tension and resolution in a musical narrative.
In order to allow a quantitative approach to the analysis of what is going on in the musical mind, we propose an information-theoretic model of the relations between four factors: the signal past X, the signal present Y, and their internal or mental representations in terms of past and present latent variables Z and T, respectively. This highly simplified model assumes a set of Markov-chain relations, shown in Figure 1, between the past of the signal X that is encoded into a latent representation Z, the future of the signal Y that depends on its past X, and its approximation by a latent representation T that is predicted from the past latent representation Z.

Fig. 1. Graph of the statistical dependencies between the model variables. The letter "e" represents an encoding that will be parametrized at different complexities by changing the bit-rate between the encoder and decoder of the VAE.

Using the Markov relation Z − X − Y, we see that X "shields" the future Y from the past latent Z: once the past musical surface X is considered, there is no additional information or statistical dependency between the next musical surface Y and the past internal state Z. According to this rule we can formulate a mathematical expression for the goals underlying the learning process of such a musicing system (Elliott, 1993). Our expression for the optimization goal comprises a trade-off between simplicity of representation and its prediction ability. Accordingly, we are looking for a representation that minimizes the discrepancy, or statistical distortion, measured by the Kullback–Leibler divergence D_KL, between the prediction of Y using complete information about the past X and its prediction using a simplified encoding of the past Z. Using I(X, Y) = H(Y) − H(Y|X) to denote mutual information, with H(·) the entropy, the overall quality of this error averaged over all possible encoding pairs X, Z of the musical surface and its latent code becomes

\[ \langle D_{\mathrm{KL}}\!\left( p(Y|X) \,\|\, p(Y|Z) \right) \rangle_{p(X,Z)} = I(X, Y | Z) = I(X, Y) - I(Z, Y). \tag{1} \]

Since I(X, Y) is independent of p(Z|X), minimizing the error between the true conditional probability of the future given its past, p(Y|X), and the probability of the future conditioned on the latent representation, p(Y|Z), requires minimizing −I(Z, Y), that is, maximizing the mutual information between the encoding of the past and the signal future. Using the Markov relations between the past and present latent variables according to the diagram shown in Fig. 1, we express I(Z, Y) = I(Z, T) − I(Z, T|Y).
This shows that the ability to predict the future of the musical surface Y from the latent (mental) representation of the past Z, measured by their mutual information, comprises the difference between the information dynamics of the latent representation, measured in terms of the mutual information involved in imagining the next latent representation T from the past latent state Z, and the residual or redundant information between the latent states Z and T once the actual musical surface Y is revealed or heard by the listener. In other words, the amount of information between latent past and latent future states is reduced once the actual next instance of the musical surface is revealed to the listener, and this difference between the imagined musical future "in the brain" and the actual surprise in hearing the next musical event amounts to the quality factor in Equation 1, which represents the ability of the system to predict the next musical surface from its past internal state. If we assume that the latent representation fully captures the surface, in other words if T = Y, then full knowledge about the musical surface is already contained in the imaginary latent state sequence, resulting in zero residual information I(Z, T|Y) = 0. In such a case we may ignore the right side of the equation, which creates an exceptional situation where there is no need to actually listen to the music, and the best experience is achieved simply by imaginary prediction. When the mental representation is not perfect, a non-zero surprisal factor I(Z, T|Y) = I(Z, T) − I(Z, Y) allows musical tension to emerge during listening, or to be deliberately inserted by the creator during composition or improvisation.
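Equation 1 is purely a consequence of the Markov chain Z − X − Y, and can be checked numerically on a toy discrete model (the distribution sizes and the Dirichlet sampling below are arbitrary illustrative choices, not the paper's model):

```python
import numpy as np

rng = np.random.default_rng(0)

# A random discrete joint with the Markov structure Z - X - Y:
# p(z, x, y) = p(z) p(x|z) p(y|x)
nz, nx, ny = 3, 4, 5
pz = rng.dirichlet(np.ones(nz))
px_z = rng.dirichlet(np.ones(nx), size=nz)   # row z holds p(x|z)
py_x = rng.dirichlet(np.ones(ny), size=nx)   # row x holds p(y|x)
p = pz[:, None, None] * px_z[:, :, None] * py_x[None, :, :]  # p(z, x, y)

def entropy(q):
    q = q[q > 0]
    return -np.sum(q * np.log2(q))

def mi(q2d):
    """I(A; B) in bits from a 2-D joint probability table."""
    return entropy(q2d.sum(1)) + entropy(q2d.sum(0)) - entropy(q2d)

pxy, pzy = p.sum(axis=0), p.sum(axis=1)

# Left side of Eq. (1): I(X; Y | Z) = sum_z p(z) I(X; Y | Z = z)
lhs = sum(pz[z] * mi(p[z] / pz[z]) for z in range(nz))
# Right side of Eq. (1): I(X; Y) - I(Z; Y)
rhs = mi(pxy) - mi(pzy)
```

Both sides agree to machine precision; without the Markov factorization the identity fails, which is easy to verify by sampling an unstructured joint p(z, x, y).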
In our information-theoretic approach one needs to know the probability distributions of the relevant variables in order to compute the appropriate information measures. Since we do not know the true probability dynamics of the musical surface p(X, Y), we substitute for it a variational approximation, encoding it into latent codes Z and T using a Variational Auto-Encoder (VAE). It was shown that minimization of I(X, Z) under a distortion constraint D(X, Z) is equivalent to learning a VAE representation by minimizing the Evidence Lower Bound (ELBO) (Alemi et al., 2017). Combining the ELBO and the prediction objective gives a combination of latent encoding quality and temporal information:

\[ L = I(X, Z) + \beta \langle D(X, Z) \rangle - \gamma \left( I(Z, T) - I(Z, T | Y) \right). \tag{2} \]

It should be noted that the above expression contains separate variables referring to the past, X and Z, and to the future, T and Y. In the following we first train a neural network on past data so as to minimize the X, Z part of the loss, and then manipulate Z, T by bit-rate-limited encoding. We will assume that X = Z_full-rate and Y = T_full-rate, and compare them to lower bit-rate encodings.

Using a noisy channel between the encoder and decoder, we are able to control the complexity of the encoding through bit allocation. Rate-distortion theory quantifies the trade-off between the amount of information shared by two variables, measured by their mutual information, and the distortion or error between them. This theory is the basis for lossy compression, where fewer bits need to be transmitted for lower-quality signals. According to this theory, for a simple Gaussian information source of variance \(\sigma^2\), the rate R at a given distortion level D is

\[ R(D) = \begin{cases} \tfrac{1}{2} \log \dfrac{\sigma^2}{D}, & 0 \le D \le \sigma^2, \\[4pt] 0, & D > \sigma^2. \end{cases} \tag{3} \]

One can see that for distortions above the variance level, no bits need to be transmitted. The bit-allocation algorithm uses the above equation to allocate different numbers of bits to multiple variables, which in our case are the latent variables of the VAE encoder. Starting from the highest-variance, or strongest, variable, it iteratively allocates bits in an optimal manner until the bit pool is exhausted. The lower-resolution variables are then used to generate new outputs through the decoder. A schematic representation of the channel is given in Figure 2. Encoding at a lower bit-rate is given by the following optimal channel (Berger, 1971):

\[ Q(z_d | z_e) = \mathcal{N}(\mu_d, \sigma_d^2), \tag{4} \]
\[ \mu_d = z_e + 2^{-2R} (\mu_e - z_e), \tag{5} \]
\[ \sigma_d^2 = 2^{-2R} \left( 1 - 2^{-2R} \right) \sigma_e^2. \tag{6} \]

Fig. 2.
Noisy channel between encoder and decoder
For each latent variable, these equations give the mean and variance of the decoder's conditional probability. One can see that a channel with zero rate transmits the deterministic mean value of that element, while a channel with infinite rate transmits the input value with no added noise.
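A minimal sketch of the bit-allocation loop and of the test channel of Equations 4-6 (the greedy one-bit-at-a-time allocation and all function names are illustrative; note that 2^{-2R} = 4^{-R}):

```python
import numpy as np

def allocate_bits(variances, bit_pool):
    """Greedily hand out one bit at a time to the latent variable whose
    current distortion sigma^2 * 2**(-2R) is largest."""
    R = np.zeros(len(variances), dtype=int)
    for _ in range(bit_pool):
        distortion = variances * 4.0 ** (-R)
        R[np.argmax(distortion)] += 1
    return R

def noisy_channel(z_e, mu_e, sigma_e, R, rng):
    """Sample the decoder-side latent from the Gaussian test channel:
    rate 0 sends the deterministic mean, large R sends z_e almost exactly."""
    a = 4.0 ** (-R)                        # 2^{-2R}
    mu_d = z_e + a * (mu_e - z_e)          # Eq. (5)
    var_d = a * (1.0 - a) * sigma_e ** 2   # Eq. (6)
    return rng.normal(mu_d, np.sqrt(var_d))
```

Higher-variance latents receive bits first; once a variable's distortion drops below the others', allocation moves on, mirroring the reverse water-filling solution implied by Equation 3.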
The computation of information dynamics is done in terms of the Information Rate, IR = I(X, Y), a measure of the mutual information between past and present in a time series. This requires a predictive model that can capture the joint information between past and present by learning from examples. Deep models such as RNNs can be used to model time sequences, and predictive information measures can be implemented using estimators of the mutual information between the last hidden variable of the RNN, which summarizes the whole sequence, and the predicted variable. One of the difficulties in using RNNs is their limited history, or poor modeling of long sequences.

The Variable Markov Oracle (VMO) (Wang & Dubnov, 2014) is a method based on the Factor Oracle (FO) string-matching algorithm that first quantizes a signal x^T = x_1, x_2, ..., x_T into a symbolic sequence s^T = s_1, s_2, ..., s_T over a finite alphabet s ∈ S, with X = x_past = x^{T−1} and Y = x_present = x_T. IR is estimated by applying a string-compression method C and using the approximation I(X, Y) = H(Y) − H(Y|X) ≈ C(Y) − C(Y|X), where we substitute the compression C for the entropy H, with C(Y) = log(|S|) being the number of bits required to encode an individual symbol over the alphabet S, and C(Y|X) a block-wise encoding that recursively points to repeated sub-sequences, as in the Lempel-Ziv or Compror string-compression algorithms (Wang & Dubnov, 2015). In the following we apply the VMO to estimate the information dynamics of the latent representation, I(Z, T), at different bit-rates.
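A crude stand-in for this estimator can be built with any off-the-shelf Lempel-Ziv coder. Below, zlib replaces Compror, and the incremental compressed length of a growing prefix approximates the conditional code length C(Y|X); this is only a sketch of the idea and does not model the VMO's quantization or suffix structure:

```python
import zlib
import numpy as np

def information_rate_profile(symbols, alphabet_size):
    """IR_t ~= C(s_t) - C(s_t | past), with C(s_t) = log2|S| and the
    conditional cost taken as the extra bits zlib spends on symbol t."""
    data = bytes(symbols)
    ir = []
    prev = len(zlib.compress(data[:1])) * 8
    for t in range(1, len(symbols)):
        cur = len(zlib.compress(data[: t + 1])) * 8
        ir.append(np.log2(alphabet_size) - (cur - prev))
        prev = cur
    return np.array(ir)
```

Because zlib emits whole bytes, the profile is lumpy and can dip below zero, but its average behaves as expected: a highly repetitive symbol sequence yields a much higher mean IR than a random sequence over the same alphabet.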
In the VAE training step we minimized I(X, Z) (and likewise I(Y, T)), which gives an optimal instantaneous representation. Assuming Y = T_full-rate and Z = Z_limited-rate, we use MINE (Mutual Information Neural Estimation) (Belghazi, Rajeswar, Baratin, Hjelm, & Courville, 2018) as a method for estimating I(Z, Y). The network used in the experiments comprises two parallel networks with shared weights, one receiving ordered pairs Z, Y and the other receiving pairs of Z with a shuffled version of Y, both mapped through two fully connected layers with 30 hidden states and a dropout layer.

Fig. 3.
Estimation of the predictive quality of the bit-limited representation. Sub-figures A, B, C show the bit-rate allocation at rates 10, 50, and 10000, respectively. The x-axis corresponds to the 500 latent state variables; portions with no bit allocation are not transmitted or accounted for in the latent representation. Sub-figures D, E, F show the MINE estimate as a function of the training epochs for these rates.
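What MINE maximizes is the Donsker-Varadhan lower bound I(Z, Y) ≥ E_joint[T] − log E_shuffled[e^T] over a critic function T. The sketch below evaluates that bound on a synthetic correlated Gaussian pair, plugging in the closed-form optimal critic instead of training a network (ρ, the sample size, and the Gaussian stand-ins for Z and Y are all illustrative assumptions, not the experimental setup):

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n = 0.8, 200_000

# Correlated standard-normal pair standing in for (latent past, surface future).
z = rng.standard_normal(n)
y = rho * z + np.sqrt(1.0 - rho**2) * rng.standard_normal(n)

def critic(z, y):
    """log p(z,y) / (p(z) p(y)) for a standard bivariate normal: the function
    a trained MINE network would converge to at its optimum."""
    c = 1.0 - rho**2
    return -0.5 * np.log(c) + (rho * z * y - 0.5 * rho**2 * (z**2 + y**2)) / c

# Donsker-Varadhan bound: ordered pairs vs. shuffled (product-of-marginals) pairs.
t_joint = critic(z, y)
t_shuf = critic(z, rng.permutation(y))
mi_estimate = t_joint.mean() - np.log(np.mean(np.exp(t_shuf)))
mi_true = -0.5 * np.log(1.0 - rho**2)   # analytic value in nats
```

With the optimal critic the bound is tight, so the Monte Carlo estimate lands close to the analytic −½ log(1 − ρ²); a learned critic approaches this value from below as training converges, which is the plateau seen in Figure 3.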
In order to demonstrate the utility of the proposed model for music analysis, we show the results of a deep information-rate analysis of a single musical piece. The concept of surprisal, I(Z, T|Y) = I(Z, T) − I(Z, Y), was defined as the difference between the ability to imagine the next latent state and the ability to predict the next musical surface. From a purely mathematical perspective both factors require averaging over the complete data. In practice, the VMO provides an instantaneous measure of information rate, since we have access to the compression rate at every step of the time series. The MINE method averages over all data pairs Z, Y, outputting a single number. In future work we plan to train a predictor in order to compute an instantaneous signal (surface) prediction error, so that both factors of surprisal can be considered in time.

The experiments reported below were conducted on a MIDI file of the Prelude and Fugue No. 2 in C minor, BWV 847, by J. S. Bach. The full-rate representation was obtained by training a VAE encoder on a set of MIDI files from LABROSA. The VAE used for the encoding had a single fully connected hidden layer of size 500, trained with an ELBO loss function combining a cross-entropy reconstruction term and a KL reparameterization loss, using an Adam optimizer, which are the standard settings for a VAE. Figure 3 shows the predictive quality, measured as the mutual information I(Z, Y) between a bit-reduced latent representation Z and the music signal Y one bar into the future (recall that Y is represented by T_full-rate of the VAE encoder). The bit-allocation regime for the 500 latent states is shown in the top row: the x-axis gives the indices of the latent vectors and the y-axis the number of bits. The bottom row shows the MINE optimization process, which converges to an estimate of the mutual information I(Z, Y) when it reaches a plateau after about 100 epochs. The results show that at rate 10 the amount of predictive information is around 3 bits, at rate 50 around 5 bits, and at the full (10000-bit) rate between 8 and 9 bits. Such a result is to be expected, since a more complex (higher bit-rate) latent representation Z carries more information about the future of the signal Y.

Figure 4 shows the information dynamics, IR = I(Z, T), of the latent representations themselves at different representation complexity levels. This analysis is meant to capture the imaginary expectation of the music based on the dynamics of the latent representation alone. A bit-allocation algorithm was used to reduce the representation complexity of the VAE encoding to the desired bit-rate.
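The per-frame ELBO of a VAE of the kind described above can be sketched in a few lines (forward pass only, with untrained random weights; the input size, the binary piano-roll framing, and all names are illustrative assumptions, not the exact experimental code):

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

class TinyVAE:
    """One fully connected encoding layer into 500 latents, mirrored decoder."""
    def __init__(self, n_in=128, n_lat=500, scale=0.01):
        self.w_mu = rng.normal(0.0, scale, (n_in, n_lat))
        self.w_logvar = rng.normal(0.0, scale, (n_in, n_lat))
        self.w_dec = rng.normal(0.0, scale, (n_lat, n_in))

    def elbo(self, x):
        mu = x @ self.w_mu
        logvar = x @ self.w_logvar
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        x_hat = sigmoid(z @ self.w_dec)
        # Bernoulli (cross-entropy) reconstruction term
        rec = np.sum(x * np.log(x_hat) + (1.0 - x) * np.log(1.0 - x_hat))
        # KL(q(z|x) || N(0, I)), closed form for a diagonal Gaussian posterior
        kl = 0.5 * np.sum(np.exp(logvar) + mu**2 - 1.0 - logvar)
        return rec - kl
```

Training would ascend this objective with Adam; the bit-allocation step then operates on the per-dimension statistics of the trained latent code.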
For musical reference we provide plots of the harmonic and thematic analysis of the piece in sub-figure B, and a score rendering in sub-figure C.

Fig. 4. Analysis of Information Rate using the VMO at A: full rate, D: rate 50, E: rate 10. Sub-figure B shows a harmonic analysis of the Prelude and a thematic analysis of the Fugue; sub-figure C shows the musical score.

In this paper we presented work in progress on a framework for modeling musical surprisal, formulated in terms of information-theoretic relations between a full-rate (high-fidelity) encoding of the musical data and a lower-complexity latent encoding that models mental or imaginary musical representations. This formalizes the notions of musical anticipation proposed by various researchers in terms of information dynamics and representation learning, taking into account the limited capacity of cognitive processing and the trade-off in the fidelity of its representation of sensory input. It is evident from the experiments that lowering the bit-rate of the encoding has a dramatic effect on the information dynamics of the latent representation. Considering the information dynamics of latent states as expectations formed by our imagination, the points where the expectations differ at different bit-rates are assumed to carry creative or experiential significance. In the experimental results one can see that the transition to new material in bars 25-27 causes a drop in Information Rate. Also, the development of thematic material in the Fugue starting around bars 44-45 increases IR at both the full rate and rate 50. For rate 10 we see that, except for a short initial period where all materials are still new (low IR), the rest of the music is perceived as one long repetition (high IR). It should be noted that these results may also be due to the nature of the piece itself and the quality of the encodings. Additional analyses on multiple pieces and with different encoding and predictive architectures are anticipated in the future.

(MIDI training data: https://labrosa.ee.columbia.edu/projects/piano/ ; analysis of the piece: http://bachwelltemperedclavier.org/pf-c-minor.html . The units of bit-rate reduction are total bits per measure.)

The motivation for this type of modeling comes from the cognitive idea that musical creativity and musical perception obey a trade-off between an abstracted or simplified representation of music that captures its more salient or structural aspects, and perceptual sensitivity to the musical surface, which is abundant in detail. To the best of our knowledge, no such conceptual or computational framework has previously been offered.
Acknowledgment
I would like to thank the reviewers for their detailed and insightful comments. This work is partially supported by Cygames, Inc.
References
Abdallah, S., & Plumbley, M. (2009, June). Information dynamics: Patterns of expectation and surprise in the perception of music. Connect. Sci., (2-3), 89–117.
Alemi, A. A., Poole, B., Fischer, I., Dillon, J. V., Saurous, R. A., & Murphy, K. (2017). An information-theoretic analysis of deep latent-variable models. CoRR, abs/1711.00464.
Belghazi, I., Rajeswar, S., Baratin, A., Hjelm, R. D., & Courville, A. C. (2018). MINE: Mutual information neural estimation. CoRR, abs/1801.04062. Retrieved from http://arxiv.org/abs/1801.04062
Berger, T. (1971). Rate distortion theory; a mathematical basis for data compression [Book]. Prentice-Hall, Englewood Cliffs, N.J.
Dubnov, S. (2006). Spectral anticipations. Computer Music Journal, (2), 63-83.
Elliott, D. J. (1993). Musicing, listening, and musical understanding. Contributions to Music Education, (20), 64-83.
Huron, D. (2006). Sweet anticipation: Music and the psychology of expectation. MIT Press.
Meyer, L. B. (1956). Emotion and meaning in music. Univ. of Chicago Press.
Pasquier, P., Eigenfeldt, A., Bown, O., & Dubnov, S. (2017). An introduction to musical metacreation. Computers in Entertainment, (2).
Wang, C.-i., & Dubnov, S. (2014). Variable Markov Oracle: A novel sequential data points clustering algorithm with application to 3D gesture query-matching. In International Symposium on Multimedia (pp. 215–222).
Wang, C.-i., & Dubnov, S. (2015). Pattern discovery from audio recordings by Variable Markov Oracle: A music information dynamics approach. In (pp. 215–222).