Inference of stochastic time series with missing data
Sangwon Lee, Vipul Periwal,∗ and Junghyo Jo†

Department of Physics and Astronomy, Seoul National University, Seoul 08826, Korea
Laboratory of Biological Modeling, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
Department of Physics Education and Center for Theoretical Physics and Artificial Intelligence Institute, Seoul National University, Seoul 08826, Korea
School of Computational Sciences, Korea Institute for Advanced Study, Seoul 02455, Korea

∗ Corresponding author: [email protected]
† Corresponding author: [email protected]

(Dated: January 29, 2021)

Inferring dynamics from time series is an important objective in data analysis. In particular, it is challenging to infer stochastic dynamics given incomplete data. We propose an expectation-maximization (EM) algorithm that alternates between two steps: the E-step restores missing data points, while the M-step infers an underlying network model from the restored data. Using synthetic data generated by a kinetic Ising model, we confirm that the algorithm restores missing data points and infers the underlying model. At early iterations of the EM algorithm, the model inference shows better model-data consistency for observed data points than for missing data points. As the iterations proceed, however, the missing data points show better model-data consistency. We find that demanding equal consistency of observed and missing data points provides an effective stopping criterion for the iteration, preventing overshooting of the most accurate model inference. Armed with this EM algorithm and stopping criterion, we infer missing data points and an underlying network from time-series data of real neuronal activities. Our method recovers collective properties of neuronal activities, such as time correlations and firing statistics, which have never previously been optimized to fit.
I. INTRODUCTION
System identification is one of the most important tasks in data science [1]. To be specific, suppose we have observations $\{\vec{\sigma}(t)\}$ of a sample $\vec{\sigma} = (\sigma_1, \sigma_2, \cdots, \sigma_N)$, where $t$ can denote a time index for time-series data, or a sample index for independent data. For stochastic systems, statistical mechanics has been adopted to construct the least structured probabilistic model of $P(\vec{\sigma})$ from observations. In particular, Ising-like models of $P(\vec{\sigma}) \propto \exp\big(\sum_{i,j} W_{ij} \sigma_i \sigma_j\big)$ incorporate the pairwise interactions between variables. Their applicability ranges from neuroscience and biology [2-7] to economics and sociology [8-10]. Such pairwise interactions are sufficient to explain complex higher-order patterns in many cases [2, 11], and the efficacy and limitations of these models have also been extensively studied [12]. To consider time-dependent data arising from kinetic interactions, another type of Ising model has been proposed [13]. Unlike the equilibrium model of $P(\vec{\sigma})$, the kinetic Ising model posits a probabilistic relation between $\sigma_i(t+1)$ and $\sigma_j(t)$ through the conditional probability $P(\sigma_i(t+1) \,|\, \vec{\sigma}(t)) \propto \exp\big(\sum_j W_{ij} \sigma_i(t+1) \sigma_j(t)\big)$. The kinetic model has been used to reconstruct neural networks from temporal neuronal activities [14, 15].

Both equilibrium and kinetic (or non-equilibrium) models have network parameters $W_{ij}$. In the equilibrium model, the matrix is symmetric ($W_{ij} = W_{ji}$) to represent the undirected correlation between $\sigma_i$ and $\sigma_j$. In the kinetic model, however, $W_{ij}$ is not necessarily symmetric ($W_{ij} \neq W_{ji}$), as it represents directed causality from $\sigma_j(t)$ to $\sigma_i(t+1)$. Given the wide applications of these models, inferring $W_{ij}$ from observations $\{\vec{\sigma}(t)\}$ is an important inverse problem. As concretely established for the equilibrium model [16], the kinetic model also has many inference methods, including various mean-field approximations [14, 17], maximum likelihood estimation [14, 15], and the recent expectation-reflection method [18].

In real-world problems, it is common that only a part of a network is observable. For example, it is impossible to observe every neuron in the brain. Therefore, one should consider not only observed visible units but also unobserved hidden units. Much effort has been devoted to inferring both hidden variables and the network parameters [19-23]. These methods basically rely on the celebrated expectation-maximization (EM) algorithm [24]. The EM method is composed of two iterative steps: the E-step predicts hidden variables by using mean-field approaches [19, 20], replica methods [21, 22], or a likelihood-based method [23], and the M-step optimizes the network parameters from the reconstructed data.

In addition to the issue of hidden units, another practical issue is that even visible units are not always observable throughout an experiment. At each time point, some units become accessible to observers and the others become inaccessible. For example, a large-scale neural network can be only partially scannable in neuroscience experiments [25]. This scenario is also common in finance and social science [26]. For a trading network, trade records are available only when traders are active. Although some remarkable works exist, research on the kinetic Ising model with missing or partially masked data is still in its infancy. Campajola et al. [26] addressed this issue, inspired by the mean-field approach of Dunn et al. [19] that was originally developed to address the issue of hidden units. The mean-field approach imposes an a priori assumption that interactions are weak and dense [20]. Thus it is imperative to develop an approach free from such constraints. Of course, any such method has to be tested against real-world problems beyond proof-of-concept with synthetic data.

Here we develop a general method to infer the parameters of a kinetic Ising model from data with sporadic missing values of any measured variable, extending our recent study on hidden variables [23]. The algorithm uses the EM method: the E-step restores missing time points stochastically using a likelihood ratio, while the M-step infers network parameters from observed and restored data. Alternating these two steps refines the inferred network couplings as we iterate, but excessive iteration finally ends up overestimating the parameters. It is therefore crucial to find a criterion for stopping the EM iterations that can be applied to real-world data. We find that the optimal number of iterations is reached when one obtains equal inference loss for both observed and restored data. Using the EM method with the optimal number of iterations, we demonstrate that our method successfully infers synthetic data of the kinetic Ising model. We then apply the method to infer the kinetics of a neuronal network from real neuronal activity data [27, 28], and confirm that the inference reproduces collective behaviors such as time correlations and firing statistics of neuronal activities.

This paper is organized as follows. In Sec. II, we describe our EM-based method and the stopping criterion for optimal iterations. In Sec. III, we validate this method with simulated data using the kinetic Ising model and examine the inference performance depending on the fraction of missing observations, interaction strengths, and sample size. We then apply the method to infer dynamics from real neuronal activity data, and compare the inference performance with equilibrium and null models. Finally, we summarize and discuss our findings in Sec. IV. To keep the paper relatively self-contained, we explain the equilibrium model in Appendix A and describe additional experimental data on neuronal activities in Appendix B.

II. METHOD
We consider the stochastic dynamics of the kinetic Ising model with $N$ binary variables. The $i$-th spin $\sigma_i(t+1)$ at time $t+1$ is probabilistically determined by the conditional probability

$$P(\sigma_i(t+1) = \pm 1 \,|\, \vec{\sigma}(t)) = \frac{\exp[\pm H_i(t)]}{\exp[H_i(t)] + \exp[-H_i(t)]}, \quad (1)$$

where the local field $H_i(t) = \sum_j W_{ij} \sigma_j(t) + b_i$ integrates the influence of connected spins as well as an external bias $b_i$. Here we choose a synchronous update for simplicity; refer to [15, 29] for asynchronous updates. Given binary time-series data $\{\vec{\sigma}(t)\}_{t=1}^{L}$ of length $L$, various methods exist to infer $W_{ij}$ in Eq. (1) [14, 16-18]. In particular, the expectation-reflection method provides faster inference than standard maximum likelihood estimation, since its iterative algorithm is based on a multiplicative and parallelizable parameter update [23]. In this study, however, we simply used the logistic regression function included in the Python package scikit-learn [30], because logistic regression shows performance similar to the expectation-reflection method, with high accuracy and fast computation. Note that Eq. (1) implies the logistic function

$$P(\sigma_i(t+1) = 1 \,|\, \vec{\sigma}(t)) = \frac{1}{1 + \exp[-2H_i(t)]}, \quad (2)$$

so that each row of $W_{ij}$, together with $b_i$, can be fitted by logistic regression.

Now suppose that some of the data points are missing. Let $\mathcal{M}$ denote the set of missing data values. To recover these missing values, we first assign random binary values to $\sigma_i(t)$ for every $(i, t) \in \mathcal{M}$. Then, from the observed and randomly assigned values of $\{\vec{\sigma}(t)\}_{t=1}^{L}$, we can infer a first approximation to $W_{ij}$ using the logistic regression in Eq. (2). Next, we define a likelihood function for $\sigma_i(t) = \pm 1$ ($t \neq 1$ or $L$) as

$$\mathcal{L}^{\pm}_{i,t} = P(\sigma_i(t) = \pm 1 \,|\, \vec{\sigma}(t-1)) \prod_{j=1}^{N} P(\sigma_j(t+1) \,|\, F^{\pm}_i(\vec{\sigma}(t))), \quad (3)$$

where $F^{\pm}_i(\vec{\sigma}(t)) = (\sigma_1(t), \cdots, \sigma_i(t) = \pm 1, \cdots, \sigma_N(t))$. Note that $\mathcal{L}^{\pm}_{i,t}$ is a product of the likelihoods determined by the one-step backward state $\vec{\sigma}(t-1)$ and the one-step forward state $\vec{\sigma}(t+1)$. The likelihoods for $t = 1$ and $t = L$ involve only forward and backward states, respectively: $\mathcal{L}^{\pm}_{i,1} = \prod_{j=1}^{N} P(\sigma_j(2) \,|\, F^{\pm}_i(\vec{\sigma}(1)))$ and $\mathcal{L}^{\pm}_{i,L} = P(\sigma_i(L) = \pm 1 \,|\, \vec{\sigma}(L-1))$. We then assign $\sigma_i(t) = \pm 1$ with probability $\mathcal{L}^{\pm}_{i,t} / (\mathcal{L}^{+}_{i,t} + \mathcal{L}^{-}_{i,t})$ for the missing data points $(i, t) \in \mathcal{M}$. We update the missing data points in ascending order of $t$ and in random order of $i$. Note that the update of one missing point is affected by other missing points that may or may not have been updated in that step. After updating all the missing data points (E-step), we optimize $W_{ij}$ from the observed and restored data (M-step). Repeating these E- and M-steps, $W_{ij}$ is expected to converge to the true set of coupling strengths.

However, excessive iteration ends up overestimating $W_{ij}$, especially when a larger fraction of the data is missing. Real-world problems do not come with the true $W_{ij}$ known, so how can we determine the number of iterations that maximizes the goodness of fit of the inferred $W_{ij}$? As a quality measure of the inference, we consider the data-model consistency between $\sigma_i(t+1)$ and its expectation value,

$$\langle \sigma_i(t+1) \rangle = \sigma_i(t+1) P(\sigma_i(t+1) \,|\, \vec{\sigma}(t)) - \sigma_i(t+1) \big[ 1 - P(\sigma_i(t+1) \,|\, \vec{\sigma}(t)) \big] = \sigma_i(t+1) \big[ 2P(\sigma_i(t+1) \,|\, \vec{\sigma}(t)) - 1 \big]. \quad (4)$$

The expectation value is simply $\langle \sigma_i(t+1) \rangle = \tanh H_i(t)$ for the kinetic Ising model, following the conditional probability of Eq. (1). The discrepancy between $\sigma_i(t+1)$ and its expectation $\langle \sigma_i(t+1) \rangle$ can then be quantified as

$$D = \sum_{i,t} \big[ \sigma_i(t+1) - \langle \sigma_i(t+1) \rangle \big]^2 = 4 \sum_{i,t} \sigma_i(t+1)^2 \big[ 1 - P(\sigma_i(t+1) \,|\, \vec{\sigma}(t)) \big]^2 = 4 \sum_{i,t} \big[ 1 - P(\sigma_i(t+1) \,|\, \vec{\sigma}(t)) \big]^2 \quad (5)$$

over every data point. For the second equality in Eq. (5), we used Eq. (4). Thus the discrepancy $D$ measures the loss of the likelihood $P(\sigma_i(t+1) \,|\, \vec{\sigma}(t))$ of the data $\{\vec{\sigma}(t)\}_{t=1}^{L}$. To separately examine the data-model consistency for observed and missing data points, we split $D$ as $D = D_{\rm obs} + D_{\rm mis}$, where $D_{\rm obs}$ and $D_{\rm mis}$ represent the discrepancy contributions from observed data points ($(i,t) \notin \mathcal{M}$) and missing data points ($(i,t) \in \mathcal{M}$), respectively. During the EM iterations, both $D_{\rm obs}$ and $D_{\rm mis}$ keep decreasing because the M-step minimizes the data-model discrepancy. However, the relative size of the two discrepancies $D_{\rm obs}$ and $D_{\rm mis}$ changes with the iterations. Therefore, the separate measures $D_{\rm obs}$ and $D_{\rm mis}$ provide a good stopping criterion for the EM iterations, as follows. At the beginning of the iterations, $D_{\rm mis}$ is larger because the missing data points $\sigma_i(t)$ are randomly assigned. After some iterations, however, the missing data points become excessively fine-tuned, whereas the observed data points are always fixed. This makes $D_{\rm mis}$ smaller than $D_{\rm obs}$. Therefore, it is natural to halt the iterations when the $D_{\rm obs}$ and $D_{\rm mis}$ curves cross each other, to prevent overfitting.

Finally, we summarize the overall procedure; a minimal code sketch follows the list.

(i) Initialize the missing data $\sigma_i(t)$ for $(i,t) \in \mathcal{M}$ with random binary values;
(ii) M-step: infer the interaction parameters $W_{ij}$ from the whole data using the logistic regression in Eq. (2);
(iii) E-step: re-assign the missing data based on the likelihood ratio $\mathcal{L}^{\pm}_{i,t} / (\mathcal{L}^{+}_{i,t} + \mathcal{L}^{-}_{i,t})$ in Eq. (3);
(iv) Repeat (ii) and (iii) until $D_{\rm mis} - D_{\rm obs} < \epsilon$ holds, with a small threshold $\epsilon$.
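To make the procedure concrete, here is a minimal Python sketch of steps (i)-(iv), assuming the data are stored in an $(L, N)$ array of $\pm 1$ spins with a boolean mask of missing entries. The function names are ours, not from an existing package. Two bookkeeping assumptions are made explicit in the comments: the fitted logistic coefficients equal $2W_{ij}$ under the convention of Eq. (1), and $D_{\rm obs}$ and $D_{\rm mis}$ are computed as per-point averages so the two curves are directly comparable. The sketch also assumes every spin takes both values in the data, so that each regression is well posed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def m_step(sigma):
    """Fit W_ij and b_i row by row via the logistic function, Eq. (2)."""
    X, Y = sigma[:-1], sigma[1:]
    N = sigma.shape[1]
    W, b = np.zeros((N, N)), np.zeros(N)
    for i in range(N):
        clf = LogisticRegression(C=1e4)        # weak regularization
        clf.fit(X, (Y[:, i] > 0).astype(int))  # map -1/+1 to 0/1
        # P(sigma=+1) = 1/(1+exp(-2H)) under Eq. (1), so the fitted
        # coefficients equal 2*W_ij and 2*b_i; divide by 2.
        W[i], b[i] = clf.coef_[0] / 2.0, clf.intercept_[0] / 2.0
    return W, b

def e_step(sigma, mask, W, b, rng):
    """Resample each missing sigma_i(t) from the likelihood ratio, Eq. (3)."""
    L = len(sigma)
    for t in range(L):                         # ascending order in t
        idx = np.flatnonzero(mask[t])
        rng.shuffle(idx)                       # random order in i
        for i in idx:
            loglik = np.zeros(2)
            for k, s in enumerate((1.0, -1.0)):
                sigma[t, i] = s
                ll = 0.0
                if t > 0:                      # backward factor
                    Hi = W[i] @ sigma[t - 1] + b[i]
                    ll -= np.logaddexp(0.0, -2.0 * s * Hi)
                if t < L - 1:                  # forward factors over all j
                    H = W @ sigma[t] + b
                    ll -= np.logaddexp(0.0, -2.0 * sigma[t + 1] * H).sum()
                loglik[k] = ll
            p_plus = 1.0 / (1.0 + np.exp(loglik[1] - loglik[0]))
            sigma[t, i] = 1.0 if rng.random() < p_plus else -1.0

def discrepancy(sigma, sel, W, b):
    """Per-point average of 4[1 - P(sigma_i(t+1)|sigma(t))]^2, Eq. (5)."""
    H = sigma[:-1] @ W.T + b
    P = 1.0 / (1.0 + np.exp(-2.0 * sigma[1:] * H))
    return np.mean(4.0 * (1.0 - P[sel[1:]]) ** 2)

def em_infer(sigma, mask, eps=0.01, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    sigma = sigma.astype(float).copy()
    sigma[mask] = rng.choice([-1.0, 1.0], size=mask.sum())  # step (i)
    for _ in range(max_iter):
        W, b = m_step(sigma)                   # step (ii)
        e_step(sigma, mask, W, b, rng)         # step (iii)
        d_obs = discrepancy(sigma, ~mask, W, b)
        d_mis = discrepancy(sigma, mask, W, b)
        if d_mis - d_obs < eps:                # step (iv)
            break
    return W, b, sigma
```

Because the E-step updates one missing spin at a time, each update sees the current values of all other restored spins, exactly as described above.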
III. RESULTS

We test the performance of our method with simulated data using the kinetic Ising model. After checking the concordance of inferred and true couplings with this simulated data, we apply the method to examine experimental recordings of neuronal activities.
FIG. 1. (a) The evolution of the root-mean-square error (RMSE) of $W_{ij}$ (left panels) and the two quality measures of the inference, $D_{\rm obs}$ and $D_{\rm mis}$ (right panels), over the iteration of E- and M-steps. Black dotted lines denote the time when the algorithm stops according to our stopping criterion. (b) Comparison between the true and inferred $W_{ij}$ at different times: 5 (left), 20 (center), and 60 (right) steps after the iteration starts. A fraction of missing observations $p$, a system size $N = 100$, and a data length $L = 10000$ were used.

A. Inference of kinetic Ising model
We simulated a stochastic time series using the Sherrington-Kirkpatrick (SK) model [31]. The SK model follows the conditional probability in Eq. (1), where the local field is defined as $H_i = \sum_j W_{ij} \sigma_j + b_i$. We turn off the bias, $b_i = 0$, for simplicity, and consider asymmetric interactions $W_{ij} \neq W_{ji}$. In the SK model, each value of $W_{ij}$ is independently drawn from a Gaussian distribution $\mathcal{N}(0, g^2/N)$ with zero mean and variance $g^2/N$. We set the overall scale of the interactions to $g = 1$. With the selected $W^{\rm true}_{ij}$, we obtain $L = 10000$ time points of $\vec{\sigma}(t) = (\sigma_1(t), \sigma_2(t), \cdots, \sigma_N(t))$ with a system size of $N = 100$. The total set $\mathcal{A}$ of data values $(i, t)$ then has size $|\mathcal{A}| = LN$. We then hide some data points $(i, t) \in \mathcal{M}$, as well as the true $W^{\rm true}_{ij}$. The fraction of missing data is defined as $p = |\mathcal{M}|/|\mathcal{A}|$. Our task is twofold: to restore the missing data points and to infer the true $W^{\rm true}_{ij}$.

As the EM algorithm iterates, we evaluate the quality of the inference by measuring the root-mean-square error of $W_{ij}$, ${\rm RMSE} = N^{-1} \sqrt{\sum_{i,j} (W_{ij} - W^{\rm true}_{ij})^2}$, as in [26]. Too many iterations usually lead to an increasing RMSE (Fig. 1(a)). In particular, the overfitting becomes more evident as the fraction of missing data points increases. At early iterations, $W_{ij}$ is underestimated because the missing data restored with random values reduce the correlation between variables (Fig. 1(b), left panels). On the other hand, at excessive iterations, $W_{ij}$ is overestimated because the discrete values of the restored missing data are even better fitted with an exaggerated $H_i$, i.e., larger values of $W_{ij}$ (right panels). An optimal iteration gives an unbiased estimation of $W_{ij}$ (center panels).

Together with the RMSE, we examine the model-data consistency measures $D_{\rm obs}$ and $D_{\rm mis}$ for the observed and missing data points. As expected, $D_{\rm mis}$ is larger than $D_{\rm obs}$ at the beginning of the iterations, when the restoration of the missing data points is more or less random. However, after some iterations, $D_{\rm mis}$ becomes smaller than $D_{\rm obs}$, since the restoration is excessively fine-tuned. Interestingly, the equality of $D_{\rm mis}$ and $D_{\rm obs}$ occurs at the optimal iteration that minimizes the RMSE. In practice, the crossing iteration can be monitored by the inequality condition $D_{\rm mis} - D_{\rm obs} < \epsilon$ with a small positive $\epsilon$. Here we used $\epsilon = 0.01$, since smaller thresholds did not qualitatively affect the results.

After justifying the stopping criterion, we examined the performance of our inference process with varying $p$. Figure 2(a) is a scatter plot of the true versus inferred $W_{ij}$ for $p = 0$ (intact data), 0.3, 0.5, and 0.7. It is evident that the inferred $W_{ij}$ deviates farther from its true value, and the scatter plot becomes broader, as the fraction of missing values $p$ increases, as expected. However, even the result for $p = 0.7$ retains the overall structure of $W_{ij}$. To further assess the restoration, we considered the one-step lagged correlation $D_{ij} = \langle \sigma_i(t) \sigma_j(t+1) \rangle_t - \langle \sigma_i(t) \rangle_t \langle \sigma_j(t) \rangle_t$. Here $\langle f(t) \rangle_t \equiv 1/(L-1) \sum_{t=1}^{L-1} f(t)$ represents the time-averaged value of $f(t)$. Note that $\langle \sigma_j(t+1) \rangle_t \approx \langle \sigma_j(t) \rangle_t$ for large $L$. We compared $D_{ij}$ computed from the complete and restored data. Figure 2(b) shows that the one-step lagged correlations are successfully recovered by our inference even if a large fraction ($p = 0.7$) of the data is missing.
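For reference, a short sketch of this synthetic-data setup follows: it draws asymmetric SK couplings, runs the kinetic dynamics of Eq. (1) with $b_i = 0$, masks a fraction $p$ of the entries, and computes $D_{ij}$. The function names are illustrative rather than taken from an existing codebase, and the mean-subtraction in the last function uses $\langle \sigma_j(t+1) \rangle_t$, which is justified by the large-$L$ approximation noted above.

```python
import numpy as np

def generate_sk_series(N=100, L=10000, g=1.0, seed=0):
    """Draw asymmetric SK couplings and run the kinetic dynamics of Eq. (1)."""
    rng = np.random.default_rng(seed)
    W_true = rng.normal(0.0, g / np.sqrt(N), size=(N, N))  # variance g^2/N
    sigma = np.empty((L, N))
    sigma[0] = rng.choice([-1.0, 1.0], size=N)
    for t in range(L - 1):
        H = W_true @ sigma[t]                    # local field, b_i = 0
        p_up = 1.0 / (1.0 + np.exp(-2.0 * H))    # P(sigma_i(t+1) = +1)
        sigma[t + 1] = np.where(rng.random(N) < p_up, 1.0, -1.0)
    return W_true, sigma

def mask_fraction(shape, p, seed=1):
    """Boolean mask marking a fraction p of all (i, t) entries as missing."""
    return np.random.default_rng(seed).random(shape) < p

def lagged_correlation(sigma):
    """D_ij = <sigma_i(t) sigma_j(t+1)>_t - <sigma_i(t)>_t <sigma_j(t)>_t."""
    s0, s1 = sigma[:-1], sigma[1:]
    return s0.T @ s1 / len(s0) - np.outer(s0.mean(0), s1.mean(0))
```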
To evaluate the quality of the data restoration, we defined the restoration accuracy (RA) as the ratio of masked data points that are correctly restored; these evaluation metrics are sketched below.
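The metrics are one-liners. The following sketch (with our own function names) implements RA, the RMSE defined above, and the rescaled RRMSE = RMSE/$g$ of [26] that is used later in this section.

```python
import numpy as np

def restoration_accuracy(sigma_true, sigma_restored, mask):
    """Fraction of masked points whose restored sign matches the truth."""
    return np.mean(sigma_true[mask] == sigma_restored[mask])

def rmse(W, W_true):
    """RMSE = N^{-1} sqrt(sum_ij (W_ij - W_true_ij)^2)."""
    return np.sqrt(np.sum((W - W_true) ** 2)) / len(W_true)

def rrmse(W, W_true, g):
    """Rescaled RMSE, as defined in [26]."""
    return rmse(W, W_true) / g
```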
FIG. 2. Inference of the coupling parameters $W_{ij}$ and reconstruction of missing data. (a) Comparison between the true and inferred $W_{ij}$ for different fractions of missing observations $p$. (b) Comparison between the true and restored one-step lagged correlations $D_{ij}$. (c) Restoration accuracy (RA) of missing data and (d) RMSE of $W_{ij}$ as a function of $p$. The black dotted line in (d) refers to the RMSE when the full data have been used. A system size $N = 100$ and a data length $L = 10000$ were used.

Figure 2(c) shows that RA decreases with $p$. Specifically, when 10% of the data was missing, nearly 80% was correctly restored. However, if 80% of the data was missing, the restoration was more or less the same as a random restoration. While RA decreases with $p$, the RMSE increases with $p$ (Fig. 2(d)). Even the RMSE for $p = 0.7$ reflects scatter rather than a systematic bias of the inferred $W_{ij}$ (see Fig. 2(a)). In contrast, the mean-field approach reported that when a considerable fraction of data is missing, the inferred $W_{ij}$ tends to be biased towards zero [26].

The inference performance usually depends greatly on the interaction strength $g$ and the length $L$ of the time series. Thus we examined their effects on the restoration and inference by varying $g$ and $L$ with a fixed system size $N = 80$. First, we found that RA increases as $g$ increases (Fig. 3(a)). This is expected because strong interactions lower the stochasticity in the kinetic Ising model.

FIG. 3. (a) RA of missing data and (b) rescaled RMSE of $W_{ij}$ as a function of the scale of couplings $g$. A system size $N = 80$ and a data length $L = 6400$ were used, while the fraction of missing observations $p$ was in the range from 0.3 to 0.7.
For large $g \simeq 10$ and $p \leq 0.6$, almost all missing data can be correctly restored. On the other hand, when $p \geq 0.7$, the RA becomes significantly less than 1 even for large values of $g$. When $g$ is too small, the restoration also degrades. To quantify the error of the inferred $W_{ij}$ across coupling scales, we used the rescaled RMSE defined in [26], ${\rm RRMSE} = {\rm RMSE}/g$. In general, RRMSE is a decreasing function of $g$ (Fig. 3(b)). As with the poorer restoration (Fig. 3(a)), this is because a small $g$ induces larger stochasticity, with a flip probability close to 50%, which makes inference more difficult. When the fraction of missing data is small ($p \leq 0.6$), the RRMSE varies smoothly with $g$; when $p \geq 0.7$, the RRMSE suddenly becomes larger than the result for $p \leq 0.6$, especially when $g$ is large. Taken together, these results indicate that our algorithm is applicable to a wide range of $g$, provided $p$ is less than 0.7.

Next, we examined the dependence on the data length $L$. Both the restoration and the inference improve with increasing $L$, which is not surprising. In particular, from Fig. 4(b) we can observe that the error of $W_{ij}$ follows a power-law-like behavior over the wide range of $L/N$ from 1/8 to 8, as also reported by Campajola et al. [26]. We can also see that the power-law exponent does not depend on $p$. These results can be used to guide the appropriate sample size for applying our inference algorithm.

B. Inference of neuronal dynamics
Having confirmed that our algorithm works for synthetic data, we now apply the algorithm to explore real data.
FIG. 4. (a) RA of missing data and (b) RMSE of $W_{ij}$ as a function of $L/N$ from 0.125 to 8, where $L$ is the data length and the system size was $N = 80$. The fraction of missing observations $p$ was 0.3, 0.5, or 0.7.

We examine temporal data of neuronal activities recorded in salamander retinal ganglion cells [32]. The data consist of neuronal spike trains of 160 neurons, with a bin size of 20 ms. The state of each bin is considered active ($\sigma_i(t) = 1$) when at least one spike is present in the time bin, and inactive ($\sigma_i(t) = -1$) otherwise. The total length of the time series is $L = 283041$. In this study, we considered 60 active neurons among the 160 neurons, to exclude silent neurons that show low firing rates most of the time. Figure 5(a) summarizes our problem setting. We have neuronal activity recording data (top panel). After decimating 50% of the data randomly (center panel), we seek to reconstruct the missing data using inference algorithms (bottom panel).

We consider three different inference methods: NEQ, NULL, and EQ. First, NEQ is the kinetic or non-equilibrium Ising model that we explained in the previous sections. Second, NULL is an unstructured model that assumes independent firing of neurons. This model restores missing data with random binary values whose mean activities are constrained to follow the data statistics. Third, EQ is the standard equilibrium Ising model that has been widely used to model the collective behavior of neural networks [2, 3, 27]. This algorithm uses an EM-based approach similar to NEQ, whereas EQ considers the neural activities at different times separately (see Appendix A for the details of EQ). Note that, to account for intrinsically silent neurons, NEQ and EQ include the external bias $b_i$ as well as the neighboring interactions $W_{ij}$.

FIG. 5. Test of our algorithm on neuronal activity data published by Tkačik et al. [32]. (a) Raster plot of activities of 60 neurons (top). We erased the data randomly with a probability $p = 0.5$ (center) and restored them (bottom). (c) Probability distribution $P(K)$ of $K$ simultaneous spikes. (d-f) Comparison between true and restored mean activity $m_i$ (d), one-step lagged correlation $D_{ij}$ (e), and equal-time correlation $C_{ij}$ (f) for the three inference algorithms.

All three methods could restore the missing data (here we used $p = 0.5$) with high accuracy, largely because the neurons are mostly inactive (see Fig. 5(a), top panel). Indeed, if we simply assign all the missing data as inactive ($\sigma_i(t) = -1$), the restoration accuracy is already high. A more stringent test of the restoration is the statistics of collective firing. We therefore consider the number of simultaneously active neurons, $K(t) = \sum_{i=1}^{N} \big(\sigma_i(t) + 1\big)/2$, and its distribution $P(K)$ (Fig. 5(c)). Figure 5(c) clearly indicates that NEQ matches the original $P(K)$ much better than the other two methods. NULL underestimates the probability of large $K$, while EQ predicts a heavier tail for $P(K)$ than the original one. These tendencies of NULL and EQ have also been observed by Tkačik et al. [27], who tried to resolve this issue by using an equilibrium model with pairwise interactions and an additional potential energy depending on $K$. Here we emphasize that NEQ, implementing non-equilibrium dynamics in the system, naturally captures the pattern of simultaneous firing without further assumptions.

Next, we examined the mean activities $m_i = \langle \sigma_i(t) \rangle_t$ (Fig. 5(d)), the one-step lagged correlations $D_{ij} = \langle \sigma_i(t) \sigma_j(t+1) \rangle_t - \langle \sigma_i(t) \rangle_t \langle \sigma_j(t) \rangle_t$ (Fig. 5(e)), and the equal-time correlations $C_{ij} = \langle \sigma_i(t) \sigma_j(t) \rangle_t - \langle \sigma_i(t) \rangle_t \langle \sigma_j(t) \rangle_t$ (Fig. 5(f)). NEQ correctly infers all the statistics of $m_i$, $C_{ij}$, and $D_{ij}$, although it slightly underestimates $C_{ij}$. The correct estimation of $C_{ij}$ is surprising because NEQ matches $m_i$ and $D_{ij}$ by tuning $b_i$ and $W_{ij}$ [14, 15], and therefore does not directly constrain $C_{ij}$. Similarly, the correct estimation of $m_i$ and $C_{ij}$ by EQ is not surprising because EQ models $m_i$ and $C_{ij}$ by tuning $b_i$ and $W_{ij}$. However, EQ severely underestimates $D_{ij}$, perhaps because neuronal firing is far from an equilibrium process in that it is a manifestation of information transmission.
Finally, NULL by definition fits only $m_i$ and obviously fails to account for the pairwise correlations between spins, both $C_{ij}$ and $D_{ij}$. As a further corroboration of the validity of NEQ, we examined another data set of neuronal activities [33], and confirmed that NEQ reproduces the collective behavior of neurons as shown here (see Appendix B). To sum up, these results clearly demonstrate that our algorithm (NEQ) is an effective approach for understanding real experimental data.
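For completeness, the statistics compared in Fig. 5 can be computed from the binarized spike trains in a few lines. The sketch below assumes an array sigma of shape $(L, N)$ with entries $\pm 1$ (20 ms bins, as described above) and uses our own function names.

```python
import numpy as np

def mean_activity(sigma):
    """m_i = <sigma_i(t)>_t."""
    return sigma.mean(axis=0)

def equal_time_correlation(sigma):
    """C_ij = <sigma_i(t) sigma_j(t)>_t - <sigma_i(t)>_t <sigma_j(t)>_t."""
    m = sigma.mean(axis=0)
    return sigma.T @ sigma / len(sigma) - np.outer(m, m)

def lagged_correlation(sigma):
    """D_ij = <sigma_i(t) sigma_j(t+1)>_t - <sigma_i(t)>_t <sigma_j(t)>_t."""
    s0, s1 = sigma[:-1], sigma[1:]
    return s0.T @ s1 / len(s0) - np.outer(s0.mean(0), s1.mean(0))

def spike_count_distribution(sigma):
    """P(K) for K(t) = sum_i (sigma_i(t) + 1) / 2 simultaneous spikes."""
    K = ((sigma + 1) / 2).sum(axis=1).astype(int)
    return np.bincount(K, minlength=sigma.shape[1] + 1) / len(K)
```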
IV. DISCUSSION
We developed an expectation-maximization-based algorithm that infers the kinetic Ising model from stochastic time series with missing data. The algorithm alternates between an E-step restoring missing data points and an M-step optimizing model parameters. We found an effective scheme for determining the optimal number of iterations that provides the best model inference. At the optimal iteration, the data-model consistency of the restored data (measured by $D_{\rm mis}$) is equivalent to the data-model consistency ($D_{\rm obs}$) of the observed data. Using the EM method with the optimal number of iterations, we could successfully infer model parameters without under- or over-estimation even when up to 70% of the data is missing. We demonstrated the performance of the inference with synthetic data from simulations of the kinetic Ising model, and with real neuronal data. In particular, our algorithm, based on a non-equilibrium model, outperforms equilibrium models in reproducing collective behavior in neuronal activities.

Despite its good inference performance, our algorithm has plenty of room for improvement. The main drawback of our EM algorithm is that it does not converge unless a sufficient amount of data is provided. As seen in Fig. 1, the data-model discrepancy $D$ keeps decreasing with iterations even after the error of $W_{ij}$ starts to increase. Even if we initialize the model parameters with the true $W_{ij}$, the RMSE ends up diverging after several iterations (data not shown). This underscores the extent to which inference and prediction are different problems. Especially given a finite amount of data, overestimation of $W_{ij}$ can sometimes provide better prediction of observed data (or data-model consistency) and more accurate restoration of missing data. To address this issue of inference versus prediction, we introduced the effective halting scheme of $D_{\rm obs} = D_{\rm mis}$. Although this idea works quite well, care has to be exercised in selecting the hyperparameter $\epsilon$ that determines the inequality $D_{\rm mis} - D_{\rm obs} < \epsilon$. The specific value of $\epsilon$ does not change the inference performance if it resides within a reasonable range. Unlike for the synthetic data, however, when we dealt with real data, sometimes too small an $\epsilon$ could not detect the crossing of $D_{\rm obs}$ and $D_{\rm mis}$.

This study applied the EM algorithm to the kinetic Ising model. However, other non-equilibrium models can be considered as the underlying stochastic model. For instance, Marre et al. [34] proposed an Ising model incorporating both spatial and temporal correlations among neurons, and showed that this model predicts spatio-temporal patterns of neuronal activities significantly better than the standard Ising model (see also [35, 36]). Moreover, Tyrcha et al. [37] showed that applying a non-stationary external bias can explain the structure of neuronal spike trains even without direct interactions between neurons.

Our methodology for dealing with incomplete kinetic Ising data can be applied to other fundamental scenarios. For example, we can apply our algorithm to data with irregular observation times. Unequally spaced time-series data are common in astronomy [38, 39], paleoclimatology [40], biology [41], and many other fields in which regular observations are infeasible. In the context of the kinetic Ising model, sequences at unobserved times can be inferred using exactly the same approach as in this paper.
Other applications would be the reconstruction of interaction networks from missing data, as in [25]. Various techniques, such as $\ell_1$-regularization [42] or decimation [43, 44], can be utilized with our approach as in [26]. In conclusion, we expect that our new method can be generalized for many applications.

FIG. A1. The same as Fig. 1(a) but with the equilibrium Ising model.
ACKNOWLEDGMENT
This work was supported by the Intramural Research Program of the National Institutes of Health, NIDDK (V.P.), and by the New Faculty Startup Fund from Seoul National University and the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (Grant No. 2019R1F1A1052916) (J.J.).
Appendix A: Equilibrium Ising Model
In this Appendix, we describe the data restoration algorithm based on the equilibrium Ising model, which we briefly mentioned in Sec. III B. In the standard Ising model, the probability of a spin configuration $\vec{\sigma} = (\sigma_1, \sigma_2, ..., \sigma_N)$ with $\sigma_i = \pm 1$ is determined by the energy $E(\vec{\sigma})$ as $P(\vec{\sigma}) = Z^{-1} \exp[-E(\vec{\sigma})]$, where $Z = \sum_{\vec{\sigma}} \exp[-E(\vec{\sigma})]$ is the normalization factor. The energy is given by $E(\vec{\sigma}) = -\sum_{i \neq j} W_{ij} \sigma_i \sigma_j - \sum_i b_i \sigma_i$, where $W_{ij}$ is an interaction parameter and $b_i$ an external bias. The task is to reconstruct samples and find the true $W_{ij}$ when some portion of the data points has not been observed.

We again split the algorithm into E- and M-steps. First, we initialize the missing $\sigma_i(t)$ with random binary values (in the equilibrium model, the time index $t$ merely acts as a label to distinguish different sequences of $\vec{\sigma}$).

FIG. A2. The same as Fig. 2 but with the equilibrium Ising model. Also, the one-step lagged correlation $D_{ij}$ in Fig. 2(b) is replaced by the equal-time correlation $C_{ij}$.

From the randomly assigned data, we find an error-prone $W_{ij}$ (see below). It is then natural to mimic the Metropolis-Hastings algorithm [45] to update the missing data points (E-step). Indeed, at each turn, we choose to flip or not to flip $\sigma_i(t)$ using the ratio of the two probabilities $P(F_i(\vec{\sigma}(t)))/P(\vec{\sigma}(t))$, where $F_i(\vec{\sigma}(t)) = (\sigma_1(t), \cdots, -\sigma_i(t), \cdots, \sigma_N(t))$. As in the main text, we updated all of the missing data one by one for each $t$. We checked that increasing the number of updates in one step did not affect the results.

Given the reconstructed samples, we can infer the best $W_{ij}$ (M-step) by maximizing the data likelihood, as in the Boltzmann machine [46]. However, this requires an extensive calculation of the normalization factor $Z$, which entails a sum over $2^N$ configurations. Physicists have developed a number of methods to circumvent the exact computation of $Z$, including mean-field [47, 48], Bethe approximation [49], Sessak-Monasson [50], and adaptive cluster expansion [51] approaches (see also the pedagogical review [16]). In this study, we chose a maximum pseudo-likelihood approach [52] that maximizes the local pseudo-likelihood $\mathcal{L}_i = \prod_t P(\sigma_i(t) \,|\, \vec{\sigma}_{\backslash i}(t))$ for each $i$, where $\vec{\sigma}_{\backslash i}$ denotes all spins besides the $i$-th spin. Since $P(\sigma_i(t) \,|\, \vec{\sigma}_{\backslash i}(t))$ is determined by the local field $H_i(t) = \sum_{j \neq i} W_{ij} \sigma_j(t) + b_i$, the same logistic regression function [30] can be utilized, allowing fast and accurate inference. After optimizing each $\mathcal{L}_i$ separately, we symmetrize $W_{ij} \leftarrow (W_{ij} + W_{ji})/2$; a code sketch of these E- and M-steps follows.
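Below is a minimal sketch of the equilibrium E- and M-steps under the conventions just stated. The function names are ours; the flip rule is written with the local-field convention above (so the partition function $Z$ cancels in the ratio), and we again assume every spin takes both values so that each regression is well posed.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def eq_e_step(sigma, mask, W, b, rng):
    """Metropolis-Hastings-like update of missing spins: flip sigma_i(t)
    with probability min(1, P(F_i(sigma)) / P(sigma))."""
    L, N = sigma.shape
    for t in range(L):
        for i in np.flatnonzero(mask[t]):
            # Local field H_i = sum_{j != i} W_ij sigma_j + b_i (W_ii = 0 here).
            Hi = W[i] @ sigma[t] - W[i, i] * sigma[t, i] + b[i]
            ratio = np.exp(-2.0 * sigma[t, i] * Hi)  # P(flipped) / P(current)
            if rng.random() < min(1.0, ratio):
                sigma[t, i] *= -1.0

def eq_m_step(sigma):
    """Maximum pseudo-likelihood: logistic regression of each spin on the rest."""
    N = sigma.shape[1]
    W, b = np.zeros((N, N)), np.zeros(N)
    for i in range(N):
        rest = np.delete(np.arange(N), i)
        clf = LogisticRegression(C=1e4)
        clf.fit(sigma[:, rest], (sigma[:, i] > 0).astype(int))
        # P(sigma_i=+1|rest) = 1/(1+exp(-2 H_i)), so fitted coefficients
        # equal 2*W_ij and 2*b_i; divide by 2.
        W[i, rest], b[i] = clf.coef_[0] / 2.0, clf.intercept_[0] / 2.0
    return (W + W.T) / 2.0, b  # symmetrize W_ij <- (W_ij + W_ji)/2
```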
To monitor convergence, we use the same quality measures ($D_{\rm obs}$ and $D_{\rm mis}$) as in the main text, with $\sigma_i(t+1)$ replaced by $\sigma_i(t)$. We tracked the evolution of $D_{\rm obs}$ and $D_{\rm mis}$ and halted the algorithm when the stopping condition $D_{\rm mis} - D_{\rm obs} < \epsilon$ was satisfied (we again set $\epsilon = 0.01$), and then examined the error of the inferred $W_{ij}$.

Figure A1 illustrates our stopping criterion for EQ. Samples were generated using the Metropolis-Hastings algorithm with a system size $N = 100$ and a sample size $L = 10000$. We found that the algorithm stops several steps after reaching the minimum error of $W_{ij}$, which means that the inference of $W_{ij}$ is suboptimal. This is different from the case of NEQ, where we could see the coincidence between the minimum of the RMSE (Fig. 1, left panels) and the intersection of $D_{\rm obs}$ and $D_{\rm mis}$ (right panels). Even with this perhaps improper stopping condition, we observed that the algorithm is capable of inferring the true $W_{ij}$ even with a fraction of missing data as large as $p = 0.7$ (Fig. A2(a)). A good agreement between the equal-time correlations $C_{ij}$ of the true and restored data was also found (Fig. A2(b)), in contrast to the relatively large deviation of the inferred $W_{ij}$.

To assess the quality of the data restoration, we defined the restoration accuracy (RA) as in the main text. Figure A2(c) shows that RA stays above 0.6 even when 80% of the data have been decimated. Figure A2(d) displays a behavior similar to Fig. 2(d). Predictably, the quality of the inference gradually decreases with $p$. When $p$ is smaller than 0.4, we can obtain $W_{ij}$ with an error of less than 5%.

Appendix B: Other Experimental Data
In addition to Sec. III B, we tested our algorithms on other neuronal spike-train data published by Kohn and Smith [33]. They recorded the spiking activities of 70-100 neurons in the visual cortex of adult monkeys under various visual stimuli. Here, we chose a "spontaneous" dataset in which a uniform gray screen had been used. We binned the neural activity with 20 ms bins and selected the 60 most active neurons. The data length was $L = 123404$. For the detailed experimental procedure, see [53, 54]. We summarize the results in Fig. A3, in the same format as Fig. 5. The results are similar.

FIG. A3. The same as Fig. 5 but using neuronal activity data published by Kohn and Smith [33].
[1] S. Brunton and N. Kutz, Data-driven science and engineering: machine learning, dynamical systems, and control (Cambridge University Press, Cambridge, England, UK, 2019).
[2] E. Schneidman, M. J. Berry II, R. Segev, and W. Bialek, Nature, 1007 (2006).
[3] J. Shlens, G. D. Field, J. L. Gauthier, M. I. Grivich, D. Petrusca, A. Sher, A. M. Litke, and E. J. Chichilnisky, J. Neurosci., 8254 (2006).
[4] A. Lapedes, B. Giraud, and C. Jarzynski, arXiv (2012), 1207.2484.
[5] F. Morcos, A. Pagnani, B. Lunt, A. Bertolino, D. S. Marks, C. Sander, R. Zecchina, J. N. Onuchic, T. Hwa, and M. Weigt, Proc. Natl. Acad. Sci. U.S.A., E1293 (2011).
[6] T. Mora, A. M. Walczak, W. Bialek, and C. G. Callan, Jr., Proc. Natl. Acad. Sci. U.S.A., 5405 (2010).
[7] T. R. Lezon, J. R. Banavar, M. Cieplak, A. Maritan, and N. V. Fedoroff, Proc. Natl. Acad. Sci. U.S.A., 19033 (2006).
[8] T. Bury, Physica A, 1375 (2013).
[9] Y. Shemesh, Y. Sztainberg, O. Forkosh, T. Shlapobersky, A. Chen, and E. Schneidman, eLife (2013), 10.7554/eLife.00759.
[10] E. D. Lee, C. P. Broedersz, and W. Bialek, J. Stat. Phys., 275 (2015).
[11] M. Figliuzzi, P. Barrat-Charlaix, and M. Weigt, Mol. Biol. Evol., 1018 (2018).
[12] Y. Roudi, S. Nirenberg, and P. E. Latham, PLoS Comput. Biol., e1000380 (2009).
[13] B. Derrida, E. Gardner, and A. Zippelius, EPL, 167 (1987).
[14] Y. Roudi and J. Hertz, Phys. Rev. Lett., 048702 (2011).
[15] H.-L. Zeng, M. Alava, E. Aurell, J. Hertz, and Y. Roudi, Phys. Rev. Lett., 210601 (2013).
[16] H. C. Nguyen, R. Zecchina, and J. Berg, Adv. Phys., 197 (2017).
[17] M. Mézard and J. Sakellariou, J. Stat. Mech.: Theory Exp., L07001 (2011).
[18] D.-T. Hoang, J. Song, V. Periwal, and J. Jo, Phys. Rev. E, 023311 (2019).
[19] B. Dunn and Y. Roudi, Phys. Rev. E, 022127 (2013).
[20] J. Tyrcha and J. Hertz, Mathematical Biosciences & Engineering, 149 (2014).
[21] L. Bachschmid-Romano and M. Opper, J. Stat. Mech.: Theory Exp., P06013 (2014).
[22] C. Battistin, J. Hertz, J. Tyrcha, and Y. Roudi, J. Stat. Mech.: Theory Exp., P05021 (2015).
[23] D.-T. Hoang, J. Jo, and V. Periwal, Phys. Rev. E, 042114 (2019).
[24] A. P. Dempster, N. M. Laird, and D. B. Rubin, Journal of the Royal Statistical Society: Series B (Methodological), 1 (1977).
[25] D. Soudry, S. Keshri, P. Stinson, M.-h. Oh, G. Iyengar, and L. Paninski, PLoS Comput. Biol., e1004464 (2015).
[26] C. Campajola, F. Lillo, and D. Tantari, Phys. Rev. E, 062138 (2019).
[27] G. Tkačik, O. Marre, D. Amodei, E. Schneidman, W. Bialek, and M. J. Berry II, PLoS Comput. Biol., e1003408 (2014).
[28] X. Chen, F. Randi, A. M. Leifer, and W. Bialek, Phys. Rev. E, 052418 (2019).
[29] H.-L. Zeng, E. Aurell, M. Alava, and H. Mahmoudi, Phys. Rev. E, 041135 (2011).
[30] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, Journal of Machine Learning Research, 2825 (2011).
[31] D. Sherrington and S. Kirkpatrick, Phys. Rev. Lett., 1792 (1975).
[32] O. Marre, G. Tkacik, D. Amodei, E. Schneidman, W. Bialek, and M. Berry, Multi-electrode array recording from salamander retinal ganglion cells, IST Austria (2017).
[33] A. Kohn and M. A. Smith, Utah array extracellular recordings of spontaneous and visually evoked activity from anesthetized macaque primary visual cortex (V1), CRCNS.org (2016).
[34] O. Marre, S. El Boustani, Y. Frégnac, and A. Destexhe, Phys. Rev. Lett., 138101 (2009).
[35] H. Nasser, O. Marre, and B. Cessac, J. Stat. Mech.: Theory Exp., P03006 (2013).
[36] H. Nasser and B. Cessac, Entropy, 2244 (2014).
[37] J. Tyrcha, Y. Roudi, M. Marsili, and J. Hertz, J. Stat. Mech.: Theory Exp., P03005 (2013).
[38] N. R. Lomb, Astrophys. Space Sci., 447 (1976).
[39] J. D. Scargle, Astrophys. J., 835 (1982).
[40] M. Schulz and M. Mudelsee, Comput. Geosci., 421 (2002).
[41] T. Ruf, Biol. Rhythm Res., 178 (1999).
[42] P. Ravikumar, M. J. Wainwright, and J. D. Lafferty, Ann. Stat., 1287 (2010).
[43] A. Decelle and F. Ricci-Tersenghi, Phys. Rev. Lett., 070603 (2014).
[44] A. Decelle and P. Zhang, Phys. Rev. E, 052136 (2015).
[45] W. K. Hastings, Biometrika, 97 (1970).
[46] D. H. Ackley, G. E. Hinton, and T. J. Sejnowski, Cognitive Science, 147 (1985).
[47] H. Kappen and F. B. Rodríguez, in Advances in Neural Information Processing Systems (The MIT Press, 1998) pp. 280-286.
[48] T. Tanaka, Phys. Rev. E, 2302 (1998).
[49] F. Ricci-Tersenghi, J. Stat. Mech.: Theory Exp., P08015 (2012).
[50] V. Sessak and R. Monasson, J. Phys. A: Math. Theor., 055001 (2009).
[51] J. P. Barton, E. De Leonardis, A. Coucke, and S. Cocco, Bioinformatics, 3089 (2016).
[52] E. Aurell and M. Ekeberg, Phys. Rev. Lett., 090201 (2012).
[53] M. A. Smith and A. Kohn, J. Neurosci., 12591 (2008).
[54] R. C. Kelly, M. A. Smith, R. E. Kass, and T. S. Lee, J. Comput. Neurosci. 29.