Parametric Copula-GP model for analyzing multidimensional neuronal and behavioral relationships
Nina Kudryashova, Theoklitos Amvrosiadis, Nathalie Dupuy, Nathalie Rochefort, Arno Onken
School of Informatics, University of Edinburgh
Centre for Discovery Brain Sciences, University of Edinburgh
Simons Initiative for the Developing Brain, University of Edinburgh
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract
One of the main challenges in current systems neuroscience is the analysis of high-dimensional neuronal and behavioral data that are characterized by different statistics and timescales of the recorded variables. We propose a parametric copula model which separates the statistics of the individual variables from their dependence structure, and escapes the curse of dimensionality by using vine copula constructions. We use a Bayesian framework with Gaussian Process (GP) priors over copula parameters, conditioned on a continuous task-related variable. We validate the model on synthetic data and compare its performance in estimating mutual information against the commonly used non-parametric algorithms. Our model provides accurate information estimates when the dependencies in the data match the parametric copulas used in our framework. When exact density estimation with a parametric model is not possible, our Copula-GP model is still able to provide reasonable information estimates, close to the ground truth and comparable to those obtained with a neural network estimator. Finally, we apply our framework to real neuronal and behavioral recordings obtained in awake mice. We demonstrate the ability of our framework to 1) produce accurate and interpretable bivariate models for the analysis of inter-neuronal noise correlations or behavioral modulations; 2) expand to more than 100 dimensions and measure information content in the whole-population statistics. These results demonstrate that the Copula-GP framework is particularly useful for the analysis of complex multidimensional relationships between neuronal, sensory and behavioral data.
1 Introduction

Recent advances in imaging and recording techniques have enabled monitoring the activity of hundreds to several thousands of neurons simultaneously [1–3]. These recordings can be made in awake animals engaged in specifically designed tasks or natural behavior [4–6], which further augments these already large datasets with a variety of behavioral variables. These complex high-dimensional datasets necessitate the development of novel analytical approaches [7–10] to address two central questions of systems and behavioral neuroscience: how do populations of neurons encode information? And how does this neuronal activity correspond to the observed behavior? In machine
learning terms, both of these questions translate into understanding the high-dimensional multivariate dependencies between the recorded variables [11–14].

There are two major methods suitable for recording the activity of large populations of neurons from behaving animals: multi-electrode probes that provide millisecond precision for recordings of electrical activity [1], and calcium imaging methods [2, 3, 15] that use changes in intracellular calcium concentration as a proxy for neuronal spiking activity at a lower (tens of milliseconds) temporal precision. As a result, the recorded neuronal and behavioral variables may operate at different timescales and exhibit different statistics, which further complicates the statistical analysis.

The natural approach to modeling statistical dependencies between variables with drastically different statistics is based on copulas, which separate marginal statistics from the dependence structure [16]. For this reason, copula models are particularly effective for mutual information estimation [17, 18]. They can also escape the 'curse of dimensionality' by factorising the multi-dimensional dependence into pair-copula constructions called vines [19, 20]. Copula models have been successfully applied to spiking activity [21–24], 2-photon calcium recordings [25] and multi-modal neuronal datasets [26]. However, these models assumed that the dependence between variables was static, whereas in neuronal recordings it may be dynamic or modulated by behavioral context [27, 14]. Therefore, it might be helpful to explicitly model the continuous time- or context-dependent changes in the relationships between variables, which reflect changes in an underlying computation.

Here, we extend a copula-based approach by adding explicit conditional dependence to the parameters of the copula model, approximating these latent dependencies with Gaussian Processes (GP). It was previously shown that such a combination of parametric copula models with GP priors outperforms static copula models [28] and even dynamic copula models on many real-world datasets, including weather forecasts, geological data and stock market data [29]. Yet, this method has never been applied to neuronal recordings before.

In this work, we improve the scalability of the method by using stochastic variational inference. We also increase the complexity of the copula models in order to adequately describe the complex dependencies commonly observed in neuronal data. In particular, we use mixtures of parametric copula models to account for changes in tail dependencies. We develop model selection algorithms based on the fully-Bayesian Watanabe–Akaike information criterion (WAIC). Finally and most importantly, we demonstrate that our model is suitable for estimating mutual information. It performs especially well when the parametric model can closely approximate the target distribution. When that is not the case, our copula mixture model demonstrates sufficient flexibility and provides close information estimates, comparable to the best state-of-the-art non-parametric information estimators.

We first introduce the copula mixture models and propose model selection algorithms (Sec. 2). We then validate our model on synthetic data and compare its performance against other commonly used information estimators (Sec. 3). Next, we demonstrate the utility of the method on real neuronal and behavioral data (Sec. 4).
We show that our Copula-GP method can produce interpretable bivariate models that emphasize the qualitative changes in tail dependencies and estimate mutual information that exposes the structure of the task without providing any explicit cues to the model. Finally, we apply the vine Copula-GP model to measure information content in the whole dataset with 5 behavioral variables and more than 100 neurons.

2 Methods

Our model is based on copulas: multivariate distributions with uniform marginals. Sklar's theorem [30] states that any multivariate joint distribution can be written in terms of univariate marginal distribution functions p(Y_i) and a unique copula which characterizes the dependence structure:

p(Y_1, ..., Y_N) = c(F_1(Y_1), ..., F_N(Y_N)) × ∏_{i=1}^N p(Y_i).

Here, F_i(·) are the marginal cumulative distribution functions (CDF) and, as a result, each F_i(Y_i) is uniformly distributed on [0,1].

Figure 1: Copula families used in the mixture models in our framework. The percentage in the upper-left corner shows how often each of the families was selected for a copula mixture for pairwise relationships in the real neuronal data from Pakan et al. [5] (see Sec. 4).

For high-dimensional datasets (high dim Y), maximum likelihood estimation for copula parameters may become computationally challenging. The two-stage inference for margins (IFM) training scheme is typically used in this case [31]. First, the univariate marginals are estimated and used to map the data onto a multidimensional unit cube. Second, the parameters of the copula model are inferred.
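The two IFM stages can be made concrete with a short sketch. The snippet below (an illustration, not the paper's implementation) uses rank-based empirical CDFs for stage one and, as a stand-in for the copula families introduced below, a moment-matched Gaussian copula for stage two.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Stage 1 of IFM: estimate each marginal and map the data onto the unit cube
# via the probability integral transform. Here we use empirical CDFs (ranks).
y = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.7], [0.7, 1.0]], size=2000)
y[:, 1] = np.exp(y[:, 1])                        # give one variable a skewed marginal
u = (stats.rankdata(y, axis=0) - 0.5) / len(y)   # U_i = F_i(Y_i), approx. uniform on [0,1]

# Stage 2 of IFM: fit a copula on the uniformized variables. As a stand-in for
# the parametric families in Table 1, fit a Gaussian copula by moment matching:
z = stats.norm.ppf(u)
rho = np.corrcoef(z.T)[0, 1]                     # Gaussian copula parameter estimate
print(f"estimated copula parameter rho = {rho:.3f}")  # close to 0.7, despite the skew
```

Note that the copula parameter recovers the dependence strength even though one marginal was distorted, which is exactly the separation of marginals and dependence structure that Sklar's theorem provides.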
Conditional copulas
Following the approach of Hernández-Lobato et al. [29], we use Gaussian Processes (GP) to model the conditional dependencies of the copula parameters:

p(Y|X) = c( F_1(Y_1|X), ..., F_N(Y_N|X) | X ) × [ ∏_{i=1}^N p(Y_i|X) ].   (1)

In the most general case, the marginal PDFs p(Y_i|X) and CDFs F_i(Y_i|X) and the copula c(...|X) itself can all be conditioned on X. In our framework, X is assumed to be one-dimensional. A Gaussian Process is ideally suited for copula parametrization, as it provides an estimate of the uncertainty in the model parameters, which we utilize in our model selection process (Sec. 2.1).
Conditional marginals   In order to estimate the marginal CDFs F_i(Y_i|X), we use the non-parametric fastKDE [32] algorithm, which allows for direct estimation of the conditional distributions. The conditional distribution is then used to map the data onto a unit hypercube using the probability integral transform: F_i(Y_i|X) → U_i ~ U[0,1], such that U_i is uniformly distributed for any X.
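fastKDE provides the conditional estimates used here; as a self-contained stand-in, the sketch below implements a kernel-weighted empirical conditional CDF and uses it for the conditional probability integral transform. The Gaussian kernel and the bandwidth h are assumptions of this illustration, not the fastKDE algorithm itself.

```python
import numpy as np

def conditional_pit(y, x, h=0.05):
    """Map y_i -> u_i = F(y_i | x_i) with a kernel-weighted empirical CDF.

    A stand-in for fastKDE's conditional estimate: each sample j is weighted by
    a Gaussian kernel in the conditioning variable, and the weighted empirical
    CDF of y is evaluated at y_i.
    """
    u = np.empty_like(y, dtype=float)
    for i in range(len(y)):
        w = np.exp(-0.5 * ((x - x[i]) / h) ** 2)   # kernel weights in X
        w /= w.sum()
        u[i] = np.sum(w * (y < y[i])) + 0.5 * np.sum(w * (y == y[i]))
    return u  # approximately U[0,1] at every value of x

# Example: the marginal mean of Y drifts with X, but U stays uniform for all X.
rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 3000)
y = 2.0 * x + rng.normal(size=3000)
u = conditional_pit(y, x)
```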
Bivariate copula families   We use 4 copula families as the building blocks for our copula models: Gaussian, Frank, Clayton and Gumbel copulas (Figure 1). All of these families have a single parameter, corresponding to the rank correlation (Table 1). We also use rotated variants (90°, 180°, 270°) of the Clayton and Gumbel copula families in order to express upper tail dependencies and negative correlation.

Table 1: Bivariate copula families and their GPLink functions

  Copula        Domain      GPLink(f): R → dom(c_j)
  Independence  –           –
  Gaussian      [−1, 1]     Erf(f / k)
  Frank         (−∞, ∞)     k₁ f + sign(f) (k₂ f)²
  Clayton       [0, ∞)      Exp(k f)
  Gumbel        [1, ∞)      1 + Exp(k f)

The scaling coefficients k in the GPLink functions are fixed positive constants; they are part of the model hyper-parameters and were tuned to equalize convergence rates across families (see Supplemental Material).

Since we are primarily focused on the analysis of neuronal data, we first visualized the dependencies in calcium signal recordings after a probability integral transform, yielding empirical conditional copulas. As a distinct feature of neuronal datasets, we observed changes in tail dependencies with regard to the conditioning variable. Since none of the aforementioned families alone could describe such a conditional dependency, we combined multiple copulas into a linear mixture model (which is also a copula [33]):

c(U|X) = Σ_{j=1}^K φ_j(X) c_j(U; θ_j(X)),   (2)

where K is the number of elements, φ_j(X) is the concentration of the j-th copula in the mixture, c_j is the pdf of the j-th copula, and θ_j is its parameter.

Each of the copula families includes the Independence copula as a special case. To resolve this overcompleteness, we add the Independence copula as a separate model with zero parameters (Table 1). For independent variables Y_ind, the Independence model will be preferred over the other models in our model selection algorithm (Sec. 2.1), since it has the smallest number of parameters.
Gaussian Process priors   We parametrize the mixture model (2) with independent latent GPs: f ~ N(μ·1, K_λ(X,X)). For each copula family, we constructed GPLink functions (Table 1) that map the GP variable onto the copula parameter domain: θ_j = GPLink_{c_j}(f_j), R → dom(c_j). Next, we also use GPs to parametrize the concentrations φ_j(X), which are defined on a simplex (Σ φ_j = 1):

φ_j = (1 − t_j) ∏_{m=1}^{j−1} t_m,   t_m = Φ( f̃_m + Φ⁻¹( (M−m)/(M−m+1) ) ),   t_M = 0,

where Φ is the CDF of a standard normal distribution and f̃_m ~ N(μ̃_m·1, K̃_{λ̃_m}(X,X)). This parametrization ensures that when all GP variables f̃_m = 0, all of the concentrations φ_j are equal to 1/M. We use the RBF kernel K_λ(X,X) with bandwidth parameter λ. Therefore, the whole mixture model with M copula elements requires 2M − 1 hyperparameters: {λ_j}_{j=1}^M for θ and {λ̃_m}_{m=1}^{M−1} for φ.
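This stick-breaking parametrization can be checked numerically; a minimal sketch, assuming the Φ⁻¹((M−m)/(M−m+1)) offsets written above, confirms that zero-valued GPs yield equal concentrations:

```python
import numpy as np
from scipy.stats import norm

def concentrations(f_tilde):
    """Map GP values (f~_1 ... f~_{M-1}) to mixture concentrations on the simplex."""
    M = len(f_tilde) + 1
    t = np.empty(M)
    for m in range(1, M):                     # t_m for m = 1 .. M-1
        t[m - 1] = norm.cdf(f_tilde[m - 1] + norm.ppf((M - m) / (M - m + 1)))
    t[M - 1] = 0.0                            # t_M = 0
    phi = np.empty(M)
    for j in range(1, M + 1):                 # phi_j = (1 - t_j) * prod_{m<j} t_m
        phi[j - 1] = (1.0 - t[j - 1]) * np.prod(t[: j - 1])
    return phi                                # non-negative, sums to 1

print(concentrations(np.zeros(3)))  # all f~_m = 0  ->  [0.25 0.25 0.25 0.25]
```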
Approximate Inference   Since our model has latent variables with GP priors and an intractable posterior distribution, direct maximum likelihood Type-II estimation is not possible and approximate inference is needed. Such an inference problem with copula models has previously been solved with the expectation propagation (EP) algorithm [29]. Considering the recent developments in high-performance parallel computing and stochastic optimization algorithms, we chose to use stochastic variational inference (SVI) instead. In order to scale the SVI to a large number of inputs, we use Kernel Interpolation for Scalable Structured Gaussian Processes (KISS-GP) [34]. For an efficient implementation of these methods on GPU, we use the PyTorch [35] and GPyTorch [36] libraries.
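The sketch below shows how one such latent GP with KISS-GP variational inference can be set up in GPyTorch. The `copula_log_prob` likelihood is a hypothetical placeholder (a single Clayton copula with a softplus link) standing in for the paper's mixture likelihood and GPLink functions; the GPyTorch class and method names follow the library's public API, but treat the whole block as a sketch rather than the authors' implementation.

```python
import torch
import gpytorch

class LatentGP(gpytorch.models.ApproximateGP):
    """One latent function f(X) with KISS-GP (grid-interpolated) SVI."""
    def __init__(self, grid_size=128, grid_bounds=[(0.0, 1.0)]):
        var_dist = gpytorch.variational.CholeskyVariationalDistribution(grid_size)
        var_strat = gpytorch.variational.GridInterpolationVariationalStrategy(
            self, grid_size=grid_size, grid_bounds=grid_bounds,
            variational_distribution=var_dist)
        super().__init__(var_strat)
        self.mean_module = gpytorch.means.ConstantMean()
        self.covar_module = gpytorch.kernels.ScaleKernel(gpytorch.kernels.RBFKernel())

    def forward(self, x):
        return gpytorch.distributions.MultivariateNormal(
            self.mean_module(x), self.covar_module(x))

def copula_log_prob(theta, u):
    """Hypothetical placeholder likelihood: log c(u; theta) for a Clayton copula,
    with theta = softplus(f) as a stand-in GPLink."""
    theta = torch.nn.functional.softplus(theta)
    u1, u2 = u[:, 0], u[:, 1]
    return (torch.log1p(theta) - (theta + 1) * (torch.log(u1) + torch.log(u2))
            - (2 + 1 / theta) * torch.log(u1 ** (-theta) + u2 ** (-theta) - 1))

# SVI loop: maximize E_q[log c(u; GPLink(f))] - KL[q(u) || p(u)]
model = LatentGP()
opt = torch.optim.Adam(model.parameters(), lr=0.01)
x = torch.linspace(0, 1, 500)
u = torch.rand(500, 2).clamp(1e-4, 1 - 1e-4)
for step in range(100):
    opt.zero_grad()
    q_f = model(x)
    f_samples = q_f.rsample(torch.Size([8]))           # MC samples of the latent GP
    ell = copula_log_prob(f_samples, u).mean(0).sum()  # expected log-likelihood
    kl = model.variational_strategy.kl_divergence().sum()
    loss = -(ell - kl)
    loss.backward(); opt.step()
```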
2.1 Model selection

We use the Watanabe–Akaike information criterion (WAIC [37]) for model selection. WAIC is a fully Bayesian approach to estimating the Akaike information criterion (AIC) (see Eq. 31 in the original paper [37]). The main advantage of the method is that it avoids the empirical estimation of the effective number of parameters, which is often used for approximating the out-of-sample bias. It starts with the estimation of the log pointwise posterior predictive density (lppd) [38]:

lppd̂ = Σ_{i=1}^N log( (1/S) Σ_{s=1}^S p(y_i | θ_s) ),   p_WAIC = Σ_{i=1}^N V_{s=1}^S ( log p(y_i | θ_s) ),

where {θ_s}_{s=1}^S is a draw from the posterior distribution, which must be large enough to represent the posterior, and p_WAIC approximates the bias correction, with V_{s=1}^S representing the sample variance. The bias-corrected estimate of the log pointwise posterior predictive density is then given by:

elppd̂_WAIC = lppd̂ − p_WAIC = −N · WAIC.

In the model selection process, we aim to choose the model with the lowest WAIC. Since our copula probability densities are continuous, their values can exceed 1 and the resulting WAIC is typically negative. Zero WAIC corresponds to the Independence model (pdf = 1 on the whole unit square).

Since the total number of combinations of the 10 copula elements (Fig. 1, considering rotations) is large, exhaustive search for the optimal model is not feasible. In our framework, we propose two model selection algorithms for constructing close-to-optimal copula mixtures: greedy and heuristic (see Supplemental Material for details). The greedy algorithm is universal and can be used with any other copula families without adjustment, while the heuristic algorithm is fine-tuned to the specific copula families used in this paper (Fig. 1). Both model selection algorithms were able to select the correct 1- and 2-component models on simulated data and at least find a close approximation (within a tolerance WAIC_tol) for more complex models (see validation of model selection in Supplemental Material).
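Given a matrix of pointwise log-densities evaluated at posterior draws, the WAIC defined above reduces to a few lines; a minimal sketch:

```python
import numpy as np

def waic(log_p):
    """WAIC from a matrix of pointwise log-densities.

    log_p[s, i] = log p(y_i | theta_s) for S posterior draws and N data points.
    Returns the per-point WAIC used for model selection above, so that
    elppd_WAIC = lppd - p_WAIC = -N * WAIC (lower is better, 0 = independence).
    """
    S, N = log_p.shape
    lppd = np.sum(np.logaddexp.reduce(log_p, axis=0) - np.log(S))  # log mean_s p
    p_waic = np.sum(np.var(log_p, axis=0, ddof=1))                 # sample variance over s
    return -(lppd - p_waic) / N
```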
2.2 Entropy and mutual information estimation

Our framework provides tools for efficient sampling from the conditional distribution and for calculating the probability density p(Y|X). Therefore, for each X = x the entropy H(Y|X = x) can be estimated using Monte Carlo (MC) integration:

H(Y|X=x) = −E_{p(Y|X=x)} log p(Y|X=x).   (3)

p(Y|X=x) factorizes into the conditional copula density and the marginal densities (1), hence for each x the entropy also factorizes [17] as H(Y|X=x) = Σ_i H(Y_i|X=x) + H_c(U^X|X=x), where U^X = F(Y|X). The conditional entropy can be integrated as

H(Y|X) = Σ_{i=1}^N H(Y_i|X) + ∫ H_c(U^X|X=x) p(x) dx,

separating the entropy of the marginals {Y_i}_{i=1}^N from the copula entropy.

Now, I(X,Y) = I(X, G(Y)) if G(Y) is 1) a homeomorphism, 2) independent of X [39]. If the marginal statistics are independent of X, then the probability integral transform U = F(Y) satisfies both requirements, and I(X,Y) = I(X,U). Then, in order to calculate the mutual information I(X,U) := H(U) − H(U|X), we must also rewrite it using only the conditional distribution p(U|X), which is modelled with our conditional Copula-GP model. This can be done as follows:

I(X,U) = H(U) − ∫ H(U|X=x) p(x) dx = E_{p(U,X)} log p(U|X) − E_{p(U)} log E_{p(X)} p(U|X).   (4)

The last term in (4) involves nested integration, which is computationally difficult and does not scale well with N = dim U. Therefore, we propose an alternative way of estimating I(X,Y), which avoids the double integration and allows us to use marginals conditioned on X (U^X = F(Y|X)), providing a better estimate of H(Y|X). We can use two separate copula models, one for estimating p(Y) and calculating H(Y), and another one for estimating p(Y|X) and calculating H(Y|X):

I(X,Y) = Σ_{i=1}^N I(X,Y_i) + H_c(u_1, ..., u_N) − ∫ H_c(u^x_1, ..., u^x_N | X = x) p(x) dx,   (5)

where both entropy terms are estimated with MC (3). Here we only integrate over the unit cube [0,1]^N and then over dom X, whereas (4) required integration over [0,1]^N × dom X.

The performance of both (4) and (5) critically depends on the approximation of the dependence structure, i.e. how well the parametric copula approximates the true copula probability density. If the joint distribution p(Y_1 ... Y_N) has a complex dependence structure, as we will see in the synthetic examples, then the mixture of parametric copulas may provide a poor approximation of p(Y) and overestimate H_c(u_1, ..., u_N), thereby overestimating I(X,Y). The direct integration (4), on the other hand, typically underestimates I(X,Y) due to the imperfect approximation of p(Y|X), but it is only applicable if the marginals can be considered independent of X.

We further refer to the direct integration approach (4) as "Copula-GP integrated" and to the alternative approach (5) as "Copula-GP estimated", and assess both of them on synthetic and real data.
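A sketch of the "estimated" approach (5), assuming hypothetical `uncond` and `cond(x)` handles on trained unconditional and conditional copula models that expose `log_prob` and `sample`, and assuming X is uniform over the evaluation grid:

```python
import numpy as np

def copula_entropy_mc(log_prob, sample, n=10000):
    """H_c = -E[log c(U)] by Monte Carlo, given a copula's sampler and log-pdf."""
    u = sample(n)
    return -np.mean(log_prob(u))

def mutual_information_estimated(marginal_mi, uncond, cond, xs, n=10000):
    """Eq. (5): I(X,Y) = sum_i I(X,Y_i) + H_c(U) - mean_x H_c(U^x | X=x).

    `uncond` models p(Y); `cond(x)` returns the conditional copula at X = x
    (both interfaces are hypothetical stand-ins for trained Copula-GP models).
    """
    h_uncond = copula_entropy_mc(uncond.log_prob, uncond.sample, n)
    h_cond = np.mean([copula_entropy_mc(cond(x).log_prob, cond(x).sample, n)
                      for x in xs])
    return sum(marginal_mi) + h_uncond - h_cond
```

Note how the conditional term only requires sampling on the unit cube at each grid point of X, avoiding the nested integral of (4).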
2.3 Vine copulas

High-dimensional copulas can be constructed from bivariate copulas by organizing them into hierarchical structures called copula vines [19]. In this paper, we focus on the canonical vine or C-vine, which factorizes the high-dimensional copula probability density function as follows:

c(U) = [ ∏_{i=2}^N c_{1i}(U_1, U_i) ] × ∏_{i=2}^N ∏_{j=i+1}^N c_{ij|{k}_{k<i}}( F(U_i | U_1, ..., U_{i−1}), F(U_j | U_1, ..., U_{i−1}) ),

where the copulas in the higher trees act on conditional distribution functions obtained recursively from the copulas in the lower trees (see Supplemental Material). While constructing the C-vine, we order the variables by the sum of the absolute values of their pairwise rank correlations (Kendall's τ), an ordering that was shown to facilitate C-vine modeling [40].
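The factorization can be illustrated with a self-contained C-vine log-density built from Gaussian pair copulas and their h-functions (conditional CDFs). Static pair-copula parameters are used here for brevity; in the Copula-GP model each parameter would instead be a GPLink-transformed function of the conditioning variable:

```python
import numpy as np
from scipy.stats import norm

def gauss_pair_logpdf(u, v, rho):
    """Log density of a bivariate Gaussian copula c(u, v; rho)."""
    a, b = norm.ppf(u), norm.ppf(v)
    return (-0.5 * np.log(1 - rho**2)
            - (rho**2 * (a**2 + b**2) - 2 * rho * a * b) / (2 * (1 - rho**2)))

def gauss_h(u, v, rho):
    """h-function: conditional CDF F(u | v) of the Gaussian pair copula."""
    return norm.cdf((norm.ppf(u) - rho * norm.ppf(v)) / np.sqrt(1 - rho**2))

def cvine_logpdf(U, R):
    """C-vine log-density with Gaussian pair copulas; R[t][k] is the parameter
    of the k-th pair copula in tree t. U is n x N on the unit cube (interior)."""
    n, N = U.shape
    logp = np.zeros(n)
    V = U.copy()
    for t in range(N - 1):              # tree t conditions on the first t variables
        for k in range(N - t - 1):      # pairs (root of tree, k-th remaining variable)
            logp += gauss_pair_logpdf(V[:, 0], V[:, k + 1], R[t][k])
        # conditional distributions become the nodes of the next tree
        V = np.column_stack([gauss_h(V[:, k + 1], V[:, 0], R[t][k])
                             for k in range(N - t - 1)])
    return logp

# Example: 3 variables, two trees with parameters R[0] = [rho_12, rho_13], R[1] = [rho_23|1]
U = np.random.default_rng(2).uniform(0.01, 0.99, size=(1000, 3))
logp = cvine_logpdf(U, R=[[0.6, 0.4], [0.3]])
```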
Code will be made available on GitHub upon paper acceptance.

Figure 2: Conditional entropy H(Y|X) and mutual information I(X,Y) measured by different methods on synthetic data. The upper row shows the dependence structures p(U) and the conditional dependence structures p(U|X=0) and p(U|X=1) at the beginning and the end of dom X = [0,1]. A Multivariate Gaussian samples. B Multivariate Student T samples. C Multivariate Gaussian samples Y (same as A), morphed into another distribution p(Ŷ) with a tail dependence, while I(X,Y) = I(X,Ŷ). The compared methods are the true value, Copula-GP (integrated), Copula-GP (estimated), BI-KSG, KSG, and MINE with 100, 200 or 500 hidden units. Gray intervals show either the standard error of the mean (SE, 5 repetitions), or √(SE² + MC_tol²) for the integrated variables.

3 Validation on synthetic data

We compare our method with other commonly used non-parametric algorithms for mutual information estimation: Kraskov-Stögbauer-Grassberger (KSG [39]), Bias-Improved-KSG by Gao et al. (BI-KSG [41]) and the Mutual Information Neural Estimator (MINE [42]).

First, we test these estimators on a dataset sampled from a multivariate Gaussian distribution, with cov(Y_i, Y_j) = ρ + (1 − ρ)δ_ij, where δ_ij is Kronecker's delta and ρ varies linearly from −0.1 to 1.0 with X ∈ [0,1]. Our algorithm selects a Gaussian copula on these data, which perfectly matches the true distribution. Therefore, Copula-GP measures both entropy and mutual information exactly (within the integration tolerance, see Fig. 2A). The performance of the non-parametric methods on this dataset is lower. It was shown before that KSG and MINE both severely underestimate the MI for high-dimensional Gaussians with high correlation (e.g. see Fig. 1 in Belghazi et al. [42]). The Copula-GP model (integrated) provides accurate estimates for highly correlated Gaussian distributions, at least up to 20 dimensions (see Supplemental Material).

Next, we test the Copula-GP performance on the Student T distribution, which can only be approximated by the Copula-GP model, but does not exactly match any of the parametric copula families in Table 1. We keep the correlation coefficient fixed at ρ = 0.7, and only change the number of degrees of freedom from 2 to 150 exponentially: df = exp(5X) + 1, X ∈ [0,1]. This makes the dataset particularly challenging, as all of the mutual information I(X,Y) is encoded in the tail dependencies of p(Y|X). The true H(Y|X) of the Student T distribution was calculated analytically (see Eq. A.12 in [43]) and I(X,Y) was integrated numerically according to (4), given the true p(Y|X).

Figure 2B shows that most of the methods underestimate I(X,Y). Copula-GP (integrated) and MINE (with 100 hidden units) provide the closest estimates. The training curve for MINE with more hidden units (200, 500) showed signs of overfitting (abrupt changes in loss at certain permutations) and the resulting estimate was higher than the true I(X,Y) at higher dimensions. It was shown before that MINE provides inaccurate and inconsistent results on datasets with low I(X,Y) [44].
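The Gaussian benchmark of Fig. 2A can be reproduced with the sketch below. The linear schedule for ρ(X) is an assumption consistent with the stated range, and the closed-form conditional entropy follows from the determinant of the equicorrelation matrix:

```python
import numpy as np

def sample_conditional_gaussian(n, N, rng):
    """Synthetic benchmark (Fig. 2A): equicorrelated Gaussian whose correlation
    rho(X) grows with the conditioning variable X (linear schedule assumed)."""
    x = rng.uniform(0.0, 1.0, n)
    rho = -0.1 + 1.1 * x                  # valid only while rho > -1/(N-1)
    y = np.empty((n, N))
    for i in range(n):
        cov = rho[i] + (1 - rho[i]) * np.eye(N)   # cov = rho + (1-rho)*delta_ij
        y[i] = rng.multivariate_normal(np.zeros(N), cov)
    return x, y

def gaussian_cond_entropy(rho, N):
    """Ground truth H(Y|X=x) in bits: 0.5*log2((2*pi*e)^N * det(cov))."""
    det = (1 - rho) ** (N - 1) * (1 + (N - 1) * rho)  # equicorrelation determinant
    return 0.5 * (N * np.log2(2 * np.pi * np.e) + np.log2(det))

x, y = sample_conditional_gaussian(5000, N=5, rng=np.random.default_rng(3))
```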
We also demonstrate I(X,Y) estimation with a combination of two copula models for H(Y) and H(Y|X): "Copula-GP estimated" (see Eq. 5). In lower dimensions, it captures less information than "Copula-GP integrated", but it starts overestimating the true MI at higher dimensions, when the inaccuracy of the density estimation for p(Y) builds up. This shows the limitation of the "estimated" method, which can either underestimate or overestimate the correct value due to parametric model mismatch, whereas the "integrated" method consistently underestimates the correct value. We conclude that Copula-GP and MINE demonstrate similar performance in this example, while KSG-based methods significantly underestimate I(X,Y) in higher dimensions.

Finally, we created another artificial dataset that is not related to any of the copula models used in our framework (Table 1). We achieved that by applying a homeomorphic transformation F(Y) to a multivariate Gaussian distribution. Since the transformation is independent of the conditioning variable, it does not change the mutual information: I(X,Y) = I(X,F(Y)) [39]. Therefore, we possess the true I(X,Y), which is the same as for the first example in Figure 2A. Note, however, that the entropy of the samples changes: H(Y) ≠ H(F(Y)). So, there is no ground truth for the conditional entropy. We transform the Gaussian copula samples Y ∈ [0,1]^N from the first example as Ỹ_i = Y_i + (∏_{j=1}^N Y_j)^{1/N} and again transform the marginals using the empirical probability integral transform U = F(Ỹ). Both the conditional p(U|X) and unconditional p(U) densities here do not match any of the parametric copulas from Table 1. As a result, "Copula-GP estimated" overestimated the correct value, while "Copula-GP integrated" underestimated it, similarly to the MINE estimator with 100 hidden units.

Figure 2 demonstrates that the performance of the parametric Copula-GP model critically depends on the match between the true probability density and the best mixture of parametric copula elements. When the parametric distribution matches the true distribution (Fig. 2A), our Copula-GP framework predictably outperforms all non-parametric methods. Nonetheless, even when the exact reconstruction of the density is not possible (Figs. 2B-C), the mixtures of the copula models (2) are still able to model the changes in tail dependencies, at least qualitatively. As a result, our method performs similarly to the neural-network based method (MINE) and still outperforms KSG-like methods.

4 Application to neuronal and behavioral data

We investigate the dependencies observed in neuronal and behavioral data and showcase possible applications of the Copula-GP framework. We used two-photon calcium imaging data of neuronal population activity in the primary visual cortex of mice engaged in a visuospatial navigation task in virtual reality (data from Henschke et al. [45]). Briefly, the mice learned to run through a virtual corridor with vertical gratings on the walls (Fig. 3A, 0-120 cm) until they reached a reward zone (Fig. 3A, 120-140 cm), where they could get a reward by licking a reward spout. We conditioned our Copula-GP model on the position X in the virtual environment and studied the joint distribution of the behavioral (Ỹ_1 ... Ỹ_5) and neuronal (Ỹ_6 ... Ỹ_109) variables (dim Y = 109). Figure 3B shows a part of the dataset (trials 25-35 out of 130). The traces demonstrate changes in the position X of the mouse as well as the activity of 3 selected neurons and the licking rate.

Figure 3: Applications of the Copula-GP framework to neuronal and behavioral data from the visual cortex. A Schematic of the experimental task [5, 45] in virtual reality (VR); B Example traces from ten example trials: X is the position in VR, Y is a vector of neuronal recordings (blue) and behavioral variables (red); C-D Density plots for the noise correlation (C) and the behavioral modulation (D) examples; E-G Conditional entropy for the bivariate examples (E-F) and the population-wide statistics (G); H Comparison of Copula-GP vs. non-parametric estimators on subsets of variables.

These variables have different patterns of activity depending on X and different signal-to-noise ratios. Both differences are reflected in the marginal statistics, which are shown on the right with density plots of equal area.
Constructing interpretable bivariate models   We first studied bivariate relationships between neurons. In order to do this, we transformed the raw signals (shown in Fig. 3B) with the probability integral transform U = F(Y). We observed strong non-trivial changes in the dependence structure c(U|X) subject to the position X in the virtual reality and the related visual information (Fig. 3C). Such stimulus-related changes in the joint variability of two neuronal signals are commonly described as noise correlations. The Copula-GP model provides a more detailed description of the joint probability that goes beyond linear correlation analysis. In this example, the dependence structure is best characterized by a combination of a Gaussian and a Clayton copula (rotated by 90°). The density plots in Fig. 3C demonstrate the match between the true density (outlines) and the copula model density (blue shades) for each part of the task. We measure the accuracy of the density estimation with the proportion of variance explained R², which shows how much of the variance of the variable Y_2 can be predicted given the variable Y_1 (see Eq. (S1) in Supplemental Material). The average R² for all Y_1 is provided in the upper right corner of the density plots.

Next, we show that our model can be applied not only to the neuronal data, but also to any of the behavioral variables. Fig. 3D shows the dependence structure between one of the neurons and the licking rate. The best selected mixture model here is Frank + Clayton 0° + Gumbel 270°, which again provides an accurate estimate of the conditional dependence between the variables. Therefore, Figs. 3C-D demonstrate that our Copula-GP model provides both an accurate fit for the probability distribution and an interpretable visualization of the dependence structure.

Figs. 3E-F show the absolute value of the conditional entropy |H(U^X|X)|, which is equivalent to the mutual information between the two variables I(U^X_1, U^X_2). For both examples, the MI peaks in the reward zone. The bivariate Copula-GP models were agnostic of the reward mechanism in this task, yet they revealed the position of the reward zone as an anomaly in the mutual information.
Measuring information content in a large neuronal population   Finally, we constructed a C-vine describing the distribution of all neuronal and behavioral variables (dim U^X = 109) and measured the conditional entropy H(U^X|X) for all variables in the dataset {U^X_1 ... U^X_109}. The conditional entropy in Fig. 3G peaks in the reward zone (similarly to Figs. 3E-F) and also at the beginning of the trial. Now the model is informed on the velocity of the animal and the reward events, so the first peak can be attributed to the acceleration of the mouse at the start of a new trial [46, 6].

While constructing the C-vine, we ordered the variables according to their pairwise rank correlations (see Sec. 2.3). We considered subsets of the first N variables and measured the MI with the position for each subset. We compared the performance of our Copula-GP method on these subsets of U^X vs. KSG and MINE. Fig. 3H shows that all 3 methods provide similar results on subsets of up to 10 variables, yet in higher dimensions both MINE and KSG show smaller I(X, {U^X_i}) estimates.

Broader Impact

We envision a wide range of impacts resulting from the use of the Copula-GP framework in computational neuroscience as well as in machine learning research, information technology and economics. In computational neuroscience, this approach has the potential to reveal high-dimensional context-dependent relationships between neuronal activity and behavioral variables. Therefore, our model can provide novel insights into the principles of neural circuit-level computation, both under physiological conditions and during the aberrant network function observed in many neuropsychiatric disorders [47–49]. Furthermore, understanding context-dependent processing in the brain might suggest ways to mimic the same principles in artificial neural networks.

The proposed framework also has some potentially far reaching applications, not only in computational neuroscience but also in information technology and economics. Current problems of information flow in computer networks as well as high-speed trading in micro- and macro-markets can be phrased as non-stationary relationship problems that require proper stochastic representations, which, in turn, can benefit economic success. On the other hand, there is a possible risk of a one-sided adoption of the method by malicious actors benefiting from market manipulation, which may give them an unfair advantage and thus reduce market transparency and cause economic damage.

We encourage the researchers adapting our method to understand the limitations of parametric copula models. The choice of the bivariate copula models and the vine structures introduces our beliefs about the particular conditional dependency or independency into the model. The misuse of parametric copula models has had a negative impact on the insurance industry in the past [50]. Thus, these limitations must be taken into consideration when designing new applications in the future.

One potential negative societal impact related to the use of our framework may be a relatively large energy footprint from training the models on graphics processing units (GPUs). The models used for creating the figures in this paper took about 2 weeks of computational time on 8 GPUs. The mitigation strategy should be focused on careful planning of the simulations, creating checkpoints and backups to prevent data loss, reuse of the trained models and other practices that reduce the energy-consuming computation.

Acknowledgments and Disclosure of Funding

We thank the GENIE Program and the Janelia Research Campus, specifically V.
Jayaraman, R. Kerr, D. Kim, L. Looger, and K. Svoboda, for making GCaMP6 available. This work was funded by the Engineering and Physical Sciences Research Council (grant EP/S005692/1 to A.O.), the Wellcome Trust and the Royal Society (Sir Henry Dale fellowship to N.R.), the Marie Curie Actions of the European Union's FP7 program (MC-CIG 631770 to N.R.), the Simons Initiative for the Developing Brain (to N.R.), and the Precision Medicine Doctoral Training Programme (MRC, the University of Edinburgh) (to T.A.).

References

[1] James J Jun, Nicholas A Steinmetz, Joshua H Siegle, Daniel J Denman, Marius Bauza, Brian Barbarits, Albert K Lee, Costas A Anastassiou, Alexandru Andrei, Çağatay Aydın, et al. Fully integrated silicon probes for high-density recording of neural activity. Nature, 551(7679):232–236, 2017.
[2] Fritjof Helmchen. Two-photon functional imaging of neuronal activity. CRC Press, 2009.
[3] Daniel A Dombeck, Anton N Khabbaz, Forrest Collman, Thomas L Adelman, and David W Tank. Imaging large-scale neural activity with cellular resolution in awake, mobile mice. Neuron, 56(1):43–57, 2007.
[4] Carsen Stringer, Marius Pachitariu, Nicholas Steinmetz, Charu Bai Reddy, Matteo Carandini, and Kenneth D Harris. Spontaneous behaviors drive multidimensional, brainwide activity. Science, 364(6437):eaav7893, 2019.
[5] Janelle MP Pakan, Stephen P Currie, Lukas Fischer, and Nathalie L Rochefort. The impact of visual cues, reward, and motor feedback on the representation of behaviorally relevant spatial locations in primary visual cortex. Cell Reports, 24(10):2521–2528, 2018.
[6] Janelle MP Pakan, Valerio Francioni, and Nathalie L Rochefort. Action and learning shape the activity of neuronal circuits in the visual cortex. Current Opinion in Neurobiology, 52:88–97, 2018.
[7] Benjamin Staude, Stefan Rotter, and Sonja Grün. CuBIC: cumulant based inference of higher-order correlations in massively parallel spike trains. Journal of Computational Neuroscience, 29(1-2):327–350, 2010.
[8] Emery N Brown, Robert E Kass, and Partha P Mitra. Multiple neural spike train data analysis: state-of-the-art and future challenges. Nature Neuroscience, 7(5):456–461, 2004.
[9] Shreya Saxena and John P. Cunningham. Towards the neural population doctrine, 2019. ISSN 18736882.
[10] Ian H. Stevenson and Konrad P. Kording. How advances in neural recording affect data analysis. Nature Neuroscience, 14(2):139–142, 2011. doi: 10.1038/nn.2731.
[11] Robin AA Ince, Riccardo Senatore, Ehsan Arabzadeh, Fernando Montani, Mathew E Diamond, and Stefano Panzeri. Information-theoretic methods for studying population codes. Neural Networks, 23(6):713–727, 2010.
[12] Maoz Shamir and Haim Sompolinsky. Nonlinear population codes. Neural Computation, 16(6):1105–1136, 2004.
[13] Adam Kohn, Ruben Coen-Cagli, Ingmar Kanitscheider, and Alexandre Pouget. Correlations and neuronal population information. Annual Review of Neuroscience, 39:237–256, 2016.
[14] Hideaki Shimazaki, Shun-ichi Amari, Emery N Brown, and Sonja Grün. State-space analysis of time-varying higher-order spike correlation for multiple neural spike train data. PLoS Computational Biology, 8(3), 2012.
[15] Christine Grienberger, Xiaowei Chen, and Arthur Konnerth. Dendritic function in vivo. Trends in Neurosciences, 38(1):45–54, 2015.
[16] Harry Joe. Dependence modeling with copulas. CRC Press, 2014.
[17] Rick L. Jenison and Richard A. Reale. The shape of neural dependence. Neural Computation, 16(4):665–672, 2004. doi: 10.1162/089976604322860659.
[18] Rafael S Calsaverini and Renato Vicente. An information-theoretic approach to statistical dependence: Copula information. EPL (Europhysics Letters), 88(6):68003, 2009.
[19] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2):182–198, 2009.
[20] Claudia Czado. Pair-copula constructions of multivariate copulas. In Copula Theory and Its Applications, pages 93–109. Springer, 2010.
[21] Pietro Berkes, Frank Wood, and Jonathan W Pillow. Characterizing neural dependencies with copula models. In Advances in Neural Information Processing Systems, pages 129–136, 2009.
[22] Arno Onken, Steffen Grünewälder, Matthias HJ Munk, and Klaus Obermayer. Analyzing short-term noise dependencies of spike-counts in macaque prefrontal cortex using copulas and the flashlight transformation. PLoS Computational Biology, 5(11), 2009.
[23] Meng Hu, Kelsey L Clark, Xiajing Gong, Behrad Noudoost, Mingyao Li, Tirin Moore, and Hualou Liang. Copula regression analysis of simultaneously recorded frontal eye field and inferotemporal spiking activity during object-based working memory. Journal of Neuroscience, 35(23):8745–8757, 2015.
[24] Babak Shahbaba, Bo Zhou, Shiwei Lan, Hernando Ombao, David Moorman, and Sam Behseta. A semiparametric Bayesian model for detecting synchrony among multiple neurons. Neural Computation, 26(9):2025–2051, 2014. doi: 10.1162/NECO_a_00631. PMID: 24922500.
[25] H. Safaai, A. Wang, S. Panzeri, and C. D. Harvey. Characterizing information processing of parietal cortex projections using vine copulas. In Bernstein Conference 2019, 2019. doi: 10.12751/nncn.bc2019.0067.
[26] Arno Onken and Stefano Panzeri. Mixed vine copulas as joint models of spike counts and local field potentials. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 1333–1341, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
[27] Brent Doiron, Ashok Litwin-Kumar, Robert Rosenbaum, Gabriel K Ocker, and Krešimir Josić. The mechanics of state-dependent neural correlations. Nature Neuroscience, 19(3):383, 2016.
[28] David Lopez-Paz, Jose Miguel Hernández-Lobato, and Zoubin Ghahramani. Gaussian process vine copulas for multivariate dependence. In International Conference on Machine Learning, pages 10–18, 2013.
[29] José Miguel Hernández-Lobato, James R Lloyd, and Daniel Hernández-Lobato. Gaussian process conditional copulas with applications to financial time series. In Advances in Neural Information Processing Systems 26, pages 1736–1744. Curran Associates, Inc., 2013.
[30] Abe Sklar. Fonctions de répartition à n dimensions et leurs marges. Publ. Inst. Statist. Univ. Paris, 8:229–231, 1959.
[31] Harry Joe. Asymptotic efficiency of the two-stage estimation method for copula-based models. Journal of Multivariate Analysis, 94(2):401–419, 2005. doi: 10.1016/j.jmva.2004.06.003.
[32] Travis A. O'Brien, Karthik Kashinath, Nicholas R. Cavanaugh, William D. Collins, and John P.
O'Brien. A fast and objective multidimensional kernel density estimation method: fastKDE. Computational Statistics & Data Analysis, 101:148–160, 2016. doi: 10.1016/j.csda.2016.02.014.
[33] Roger B Nelsen. An Introduction to Copulas. Springer Science & Business Media, 2007.
[34] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
[35] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems, 2017.
[36] Jacob Gardner, Geoff Pleiss, Kilian Q Weinberger, David Bindel, and Andrew G Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.
[37] Sumio Watanabe. A widely applicable Bayesian information criterion. Journal of Machine Learning Research, 14(Mar):867–897, 2013.
[38] Andrew Gelman, Jessica Hwang, and Aki Vehtari. Understanding predictive information criteria for Bayesian models. Statistics and Computing, 24(6):997–1016, 2014.
[39] Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69:066138, 2004. doi: 10.1103/PhysRevE.69.066138.
[40] Claudia Czado, Ulf Schepsmeier, and Aleksey Min. Maximum likelihood estimation of mixed C-vines with application to exchange rates. Statistical Modelling, 12(3):229–255, 2012.
[41] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators, 2016.
[42] Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeswar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and R Devon Hjelm. MINE: Mutual information neural estimation, 2018.
[43] R. S. Calsaverini and R. Vicente. An information-theoretic approach to statistical dependence: Copula information. EPL (Europhysics Letters), 88(6):68003, 2009. doi: 10.1209/0295-5075/88/68003.
[44] Jiaming Song and Stefano Ermon. Understanding the limitations of variational mutual information estimators, 2019.
[45] Julia U Henschke, Evelyn Dylda, Danai Katsanevaki, Nathalie Dupuy, Stephen P Currie, Theoklitos Amvrosiadis, Janelle MP Pakan, and Nathalie L Rochefort. Reward association enhances stimulus-specific representations in primary visual cortex. Current Biology, 2020.
[46] Adil G Khan and Sonja B Hofer. Contextual signals in visual cortex. Current Opinion in Neurobiology, 52:131–138, 2018.
[47] Peter J Uhlhaas and Wolf Singer. Neuronal dynamics and neuropsychiatric disorders: toward a translational paradigm for dysfunctional large-scale networks. Neuron, 75(6):963–980, 2012.
[48] Danielle S Bassett, Cedric Huchuan Xia, and Theodore D Satterthwaite. Understanding the emergence of neuropsychiatric disorders with network neuroscience. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 3(9):742–753, 2018.
[49] Urs Braun, Axel Schaefer, Richard F Betzel, Heike Tost, Andreas Meyer-Lindenberg, and Danielle S Bassett. From maps to multi-dimensional network mechanisms of mental disorders. Neuron, 97(1):14–31, 2018.
[50] Catherine Donnelly and Paul Embrechts. The devil is in the tails: Actuarial mathematics and the subprime mortgage crisis.
ASTIN Bulletin, 40(1):1–33, 2010. doi: 10.2143/AST.40.1.2049222.

Supplemental Material for "Parametric Copula-GP model for analyzing multidimensional neuronal and behavioral relationships"

Contents

S1 Methods
  S1.1 Goodness-of-fit
  S1.2 Variational inference
  S1.3 Bayesian model selection
  S1.4 Model selection algorithms
  S1.5 Vine copulas
  S1.6 Algorithmic complexity
S2 More validation on synthetic data
  S2.1 Model selection for bivariate copulas
  S2.2 Accuracy of entropy estimation
S3 Model parameters for the bivariate neuronal and behavioural examples

S1 Methods

S1.1 Goodness-of-fit

We measure the accuracy of the density estimation with the proportion of variance explained R². We compare the empirical conditional CDF ecdf(U_2|U_1 = y) vs. the estimated conditional CDF ccdf(U_2|U_1 = y) and calculate:

R²(y) = 1 − Σ_{U_2} ( ecdf(U_2|U_1=y) − ccdf(U_2|U_1=y) )² / Σ_{U_2} ( ecdf(U_2|U_1=y) − U_2 )²,   (S1)

where R²(y) quantifies the portion of the total variance of U_2 that our copula model can explain given U_1 = y, with U_2 = F_2(Y_2) and the denominator measuring the deviation from the independence baseline. The sum was calculated over an equally spaced grid of U_2 values.

Next, we select all of the samples from a certain interval of the task (X ∈ [X_1, X_2]) matching one of those shown in Figure 3 in the paper. We split these samples U_1 ∈ [0,1] into 20 equally sized bins {I_i}. For each bin I_i, we calculate (S1). We evaluate ccdf(U_2|U_1 = y_i) ≈ ccdf(U_2|U_1 ∈ I_i) using a copula model from the center of mass of the considered interval of X: X_μ = mean(X) for samples X ∈ [X_1, X_2]. We use the average measure:

R² = E_{p(U_1 ∈ I_i)} R²( mean(U_1 ∈ I_i) ),   (S2)

to characterize the goodness of fit for a bivariate copula model. Since U_1 is uniformly distributed on [0,1], the probabilities for each bin p(U_1 ∈ I_i) are equal to 1/20, and the resulting measure R² is just an average R² over all bins. The results were largely insensitive to the number of bins (e.g. 20 vs. 100).
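Reading (S1) as the ratio of residual to total sums of squares around the independence baseline, the computation for one bin is short; a sketch under that assumption:

```python
import numpy as np

def r_squared_bin(ecdf, ccdf, u2):
    """Eq. (S1) for one bin: share of the variance of the empirical conditional
    CDF (around the independence baseline u2) explained by the copula model."""
    ss_res = np.sum((ecdf - ccdf) ** 2)   # model residuals
    ss_tot = np.sum((ecdf - u2) ** 2)     # deviation from independence
    return 1.0 - ss_res / ss_tot

# Eq. (S2): average over 20 equally sized bins of U1, e.g.
# r2 = np.mean([r_squared_bin(ecdf_i, ccdf_i, grid) for ecdf_i, ccdf_i in bins])
```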
S1.2 Variational inference

Since our model has latent variables with GP priors and an intractable posterior distribution, direct maximum likelihood Type-II estimation is not possible and approximate inference is needed. We used stochastic variational inference (SVI) with a single evidence lower bound [1]:

L_ELBO = Σ_{i=1}^N E_{q(f_i)}[ log p(y_i | f_i) ] − KL[ q(u) || p(u) ],   (S3)

implemented as VariationalELBO in GPyTorch [2]. Here N is the number of data samples, u are the inducing points, q(u) is the variational distribution and q(f) = ∫ p(f|u) q(u) du.

Following the Wilson and Nickisch [3] approach (KISS-GP), we then constrain the inducing points to a regular grid, which imposes a deterministic relationship between f and u. As a result, we only need to infer the variational distribution q(u), but not the positions of u. The number of grid points is one of the model hyper-parameters: grid_size.

Equation S3 enables joint optimization of the GP hyper-parameters (constant mean μ and two kernel parameters: scale and bandwidth) and the parameters of the variational distribution q (mean and covariance at the inducing points: u ~ N(μ_u·1, Σ_u)) [1]. We have empirically discovered, by studying the convergence on synthetic data, that the best results are achieved when the learning rate for the GP hyper-parameters (base_lr) is much greater than the learning rate for the variational distribution parameters (var_lr, see Table S1).

Priors   For both the neuronal and the synthetic data, we use a standard normal prior p(u) ~ N(0, I) for the variational distribution. Note that the parametrization for mixture models was chosen such that this zero-mean prior corresponds to a priori equal mixing concentrations φ_j = 1/M for j = 1 ... M. In our experiments with the simulated and real neuronal data, we observed that the GP hyper-parameter optimization problem often had 2 minima (which is a common situation, see Figure 5.5 on page 116 in Williams and Rasmussen [4]). One of those corresponds to a short kernel lengthscale (λ) and low noise (min_f σ), which we interpret as overfitting. To prevent overfitting, we used a normal prior on the RBF kernel lengthscale parameter that allows the optimizer to approach the minima from the region of higher λ, ending up in the minimum with the larger lengthscale.

Optimization   We use the Adam optimizer with two learning rates for the GP hyper-parameters (base_lr) and the variational distribution parameters (var_lr). We monitor the loss (averaged over 50 steps) and its change over the last 50 steps: ∆loss = mean(loss[-100:-50]) - mean(loss[-50:]). If the change becomes smaller than check_waic, then we evaluate the model WAIC and check whether it is lower than −WAIC_tol. If it is higher, we consider that either the variables are independent, or the model does not match the data. Either way, this indicates that further optimization is counterproductive. If WAIC < −WAIC_tol, we proceed with the optimization until the change of loss in 50 steps ∆loss becomes smaller than loss_tol (see Table S1).
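A minimal sketch of this optimization schedule, with two Adam parameter groups and the convergence/WAIC checks described above; `model`, `elbo_loss` and `model_waic` are hypothetical handles (e.g. the SVI sketch in the main text), and the numeric constants are illustrative rather than the paper's values:

```python
import torch

# Illustrative values; the paper's exact settings are given in Table S1.
base_lr, var_lr = 1e-2, 1e-3                  # base_lr >> var_lr, as recommended above
check_waic, loss_tol, waic_tol = 1e-2, 1e-4, 5e-3
max_steps = 5000

# Split parameters into GP hyper-parameters and variational parameters.
hyper = [p for n, p in model.named_parameters() if "variational" not in n]
variational = [p for n, p in model.named_parameters() if "variational" in n]
opt = torch.optim.Adam([{"params": hyper, "lr": base_lr},
                        {"params": variational, "lr": var_lr}])

losses = []
for step in range(max_steps):
    opt.zero_grad()
    loss = elbo_loss(model)                   # hypothetical: -ELBO on the training data
    loss.backward(); opt.step()
    losses.append(loss.item())
    if step >= 100 and step % 50 == 0:
        delta = (sum(losses[-100:-50]) - sum(losses[-50:])) / 50.0
        if delta < check_waic and model_waic(model) > -waic_tol:
            break                             # independence or model mismatch: stop early
        if delta < loss_tol:
            break                             # converged
```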
Effective learning rates for different families   The coefficients in the GPLink functions for the different copula families are also a part of the model hyper-parameters. The choice of these coefficients affects the gradients of the log probability function. Since the GPLink functions are nonlinear, they affect the gradients in various parameter ranges to a different extent. This results in variable convergence rates depending on the true copula parameters. To address the problem of setting these hyper-parameters, we created tests on synthetic data with different copula parameters. Using these tests, we manually adjusted the hyper-parameters such that the GP parameter inference converged in around 1000-2000 iterations for every copula family and parameter range. We also multiplied the GP values corresponding to the mixture coefficients by a factor smaller than one, to effectively slow down the learning of the mixture coefficients φ compared to the copula coefficients θ, which also facilitates convergence.

Hyper-parameter selection   The hyper-parameters of our model (Table S1) were manually tuned, often considering the trade-off between model accuracy and evaluation time. A more systematic hyper-parameter search might yield improved results and better determine the limits of model accuracy.

Table S1: Hyper-parameters of the bivariate Copula-GP model

  Hyper-parameter   Description
  base_lr           Learning rate for GP parameters
  var_lr            Learning rate for the variational distribution
  grid_size         Number of inducing points for KISS-GP (set to 128)
  waic_tol          Tolerance for the WAIC-based independence check
  loss_tol          Loss tolerance that indicates convergence
  check_waic        Loss-change threshold at which the WAIC is evaluated

...and the GPLink parameters listed in Table 1.

S1.3 Bayesian model selection

In model selection, we aim to construct a model with the lowest possible WAIC. Since our copula probability densities are continuous, their values can exceed 1 and the resulting WAIC is typically negative. Zero WAIC corresponds to the Independence model (pdf = 1 on the whole unit square). We also set up a tolerance (WAIC_tol), and models with WAIC ∈ [−WAIC_tol, WAIC_tol] are considered indistinguishable from the independence model.

Since the total number of combinations of the 10 copula elements (Fig. 1) is large, exhaustive search for the optimal model is not feasible. In our framework, we propose two algorithms for constructing close-to-optimal copula mixtures: greedy and heuristic.

S1.4 Model selection algorithms

Algorithm 1: Greedy algorithm for copula mixture selection

  M, M_old ← [ ], [ ];
  S_c ← [Independence, Gauss, Frank, 4× Clayton, 4× Gumbel];  // 4× includes all rotations
  // while every update of the model yields a new best
  while WAIC(M) ≤ WAIC(M_old) and size(S_c) > 0 do
      M_old ← M;
      select c from S_c such that WAIC(prepend(c, M)) is minimal;
      M ← prepend(c, M);
      remove c from S_c;
  end
  M_best ← reduce(M_old);
  return M_best;

The greedy algorithm (Algorithm 1) starts by comparing the WAIC of all possible single-copula models (from Table 1, in all rotations) and selecting the model with the lowest WAIC. After that, we add one more copula (from another family or in another rotation) to the first selected copula, and prepend the element that yields the lowest WAIC of the mixture. We repeat the process until the WAIC stops decreasing. After the best model is selected, we remove the inessential elements using the reduce(.) function. This function removes those elements whose average concentration stays below a small threshold everywhere on X ∈ [0,1]. This step is added to improve the interpretability of the models and the computation time for entropy estimation (at a small accuracy cost) and can, in principle, be omitted.

The greedy algorithm can be improved by adding model reduction after each attempt to add an element. In this case, the number of elements can increase and decrease multiple times during the model selection process, which then must also be terminated if the algorithm returns to a previously observed solution.
Even though it complicates the algorithm, this modification reduces the maximal execution time (observed on the real neuronal data) from ~90 minutes down to ~40 minutes.

The heuristic algorithm (Algorithm 2) focuses on the tail dependencies.

Algorithm 2: Heuristic algorithm for copula mixture selection

  G ← [Gauss];
  if WAIC(G) > −waic_tol then return [Independence]; end
  M_Cl ← [Independence, Gauss, 4× Clayton];
  M_Gu ← [Independence, Gauss, 4× Gumbel];
  M_best, M_worst ← (M_Cl, M_Gu) sorted by WAIC;
  if WAIC(G) < WAIC(M_best) then return G; end
  for i ← 1 ... size(M_best) do
      M ← M_best with i-th element replaced by M_worst[i];
      M_best ← M if WAIC(M) < WAIC(M_best);
  end
  M_best ← reduce(M_best);
  if Gauss ∈ M_best then
      M ← M_best with Gauss replaced by Frank;
      M_best ← M if WAIC(M) < WAIC(M_best);
  end
  // Gauss often gets confused with pairs of rotated Clayton/Gumbel copulas
  if size(M_best) > 1 then
      for i ← 1 ... (size(M_best) − 1) do
          for j ← (i + 1) ... size(M_best) do
              M ← M_best with i-th and j-th elements removed;
              M ← prepend(Gauss, M);
              if WAIC(M) < WAIC(M_best) then M_best ← M; break; end
          end
      end
  end
  M_best ← reduce(M_best);
  return M_best;

First, we try a single Gaussian copula. If the variables are not independent, we next compare 2 combinations of 6 elements, which are organized as follows: an Independence copula together with a Gaussian copula and either 4 Clayton or 4 Gumbel copulas in all 4 rotations (0°, 90°, 180°, 270°). We select the combination with the lowest WAIC. After that, we take the remaining Clayton/Gumbel copulas one by one and attempt to switch the copula type (Clayton to Gumbel or vice versa). If this switching decreases the WAIC, we keep the better copula type for that rotation and proceed to the next element.

Here we make the assumption that, because Clayton and Gumbel copulas have most of their probability density concentrated in one corner of the unit square (the heavy tail), we can choose the best model for each of the 4 corners independently. When the best combination of Clayton/Gumbel copulas is selected, we can (optionally) reduce the model.

At this point we have not yet used a Frank copula in the heuristic algorithm. We therefore attempt to substitute the Gaussian copula with a Frank copula (if it is still a part of the reduced mixture, see lines 16-19 in Alg. 2). Sometimes, a Gaussian copula can be mistakenly modeled as a Clayton & Gumbel or two Gumbel copulas. So, as a final step (lines 20-31, Alg. 2), we select all pairwise combinations of the remaining elements, and attempt to substitute each of the pairs with a Gaussian copula, selecting the model with the lowest WAIC. Despite the large number of steps in this algorithm, the selection process takes only up to 25 minutes (in case all elements in all rotations are required).

The procedure was designed after observing the model selection process on a variety of synthetic and real neuronal datasets.

S1.5 Vine copulas

Vine models provide a way to factorize the high-dimensional copula probability density into a hierarchical set of bivariate copulas [5]. There are many possible decompositions based on different assumptions about conditional independence of specific elements in a model, which can be classified using graphical models called regular vines [6, 7]. A regular vine can be represented using a hierarchical set of trees, where each node corresponds to a conditional distribution function (e.g. F(U_2|U_1)) and each edge corresponds to a bivariate copula (e.g. c(U_2, U_3|U_1)).
The copula models from the lower trees are used to obtain new conditional distributions (new nodes) with additional conditioning variables for the higher trees: e.g., the ccdf of a copula c(U_2, U_3|U_1) and a marginal conditional distribution F(U_2|U_1) from the 1st tree provide a new conditional distribution F(U_3|U_1, U_2) for the 2nd tree. Therefore, bivariate copula parameters are estimated sequentially, starting from the lowest tree and moving up the hierarchy. The total number of edges in all trees (= the number of bivariate copula models) for an m-dimensional regular vine equals m(m−1)/2.

Regular vines often assume that the conditional copulas c(U_i, U_j|{U_k}) themselves are independent of their conditioning variables {U_k}, but depend on them indirectly through the conditional distribution functions (nodes) [8]. This is known as the simplifying assumption for vine copulas [9], which, when applicable, allows one to escape the curse of dimensionality in high-dimensional copula construction.

In this study, we focus on the canonical vine or C-vine, which has a unique node in each tree, connected to all of the edges in that tree (for an illustration, see Figure 2 in Aas et al. [5]). The C-vine was shown to be a good choice for neuronal datasets [10], as they often include some proxy of neuronal population activity as an outstanding variable, strongly correlated with the rest. This variable provides a natural choice for the first conditioning variable in the lowest tree. In the neuronal datasets from Henschke et al. [11], this outstanding variable is the global fluorescence signal in the imaged field of view (global neuropil).

To construct a C-vine for describing the neuronal and behavioural data from Henschke et al. [11], we used a heuristic element ordering based on the sum of the absolute values of Kendall's τ of a given element with all of the other elements. It was shown by Czado et al. [12] that this ordering facilitates C-vine modeling. For all of the animals and most of the recordings (14 out of 16), including the one used in Figure 3, the first variable after such ordering was the global neuropil activity. This again confirms that a C-vine with the global neuropil activity as the first variable is an appropriate model for the dependencies in neuronal datasets.

S1.6 Algorithmic complexity

In this section, we discuss the algorithmic complexity of the parameter inference for a C-vine copula model. The parameter inference for each of the bivariate Copula-GP models scales as O(n), where n is the number of samples, since we use scalable kernel interpolation (KISS-GP [3]). As we mentioned in Sec. S1.5, a full m-dimensional C-vine model requires m(m−1)/2 bivariate copulas, trained sequentially. As a result, the O(n) GP parameter inference has to be repeated m(m−1)/2 times, which yields O(n·m²) complexity.

In practice, the computational cost (in terms of time) of the parameter inference for each bivariate model varies from tens of seconds to tens of minutes. The heuristic model selection is designed in such a way that it discards independent variables in just around 20 seconds (line 3 in Alg. 2). As a result, most of the models are quickly skipped and further treated as Independence models, and their contribution to the total computational cost can be neglected. When the model is evaluated, the Independence components are also efficiently 'skipped' during sampling, as the ppcf function is not called for them.
S1.6 Algorithmic complexity

In this section, we discuss the algorithmic complexity of the parameter inference for a C-vine copula model. The parameter inference for each of the bivariate Copula-GP models scales as O(n), where n is the number of samples, since we use scalable kernel interpolation KISS-GP [34]. As we mentioned in Sec. S1.5, a full m-dimensional C-vine model requires m(m−1)/2 bivariate copulas, trained sequentially. As a result, the O(n) GP parameter inference has to be repeated m(m−1)/2 times, which yields O(n · m²) complexity.

In practice, the computational cost (in terms of time) of the parameter inference for each bivariate model varies from tens of seconds to tens of minutes. The heuristic model selection is designed in such a way that it discards independent variables in just around 20 seconds (line 3 in Alg. 2). As a result, most of the models are quickly skipped and further considered as Independence models, and their contribution to the total computational cost can be neglected. When the model is evaluated, the Independence components are also efficiently 'skipped' during sampling, as the ppcf function is not called for them. The Independence models also add zero to the C-vine log probability, so they are also 'skipped' during the log probability calculation. They also reduce the total memory storage, as no GP parameters, which dominate the memory requirements, are stored for these models.

In a conditional C-vine trained on a real neuronal dataset with 109 variables, 5253 out of 5886 (89%) bivariate models were Independence, which leaves only 633 non-Independence models.

In practice, this means that the algorithmic complexity of the model is much better than the naïve theoretical prediction O(n · m²) based on the structure of the graphical model. Suppose that the actual number of non-Independence models N_nI in a vine model is much smaller than m(m−1)/2 and can be characterized by an effective number of dimensions m_eff ∼ √N_nI. In this case, instead of the O(m²) scaling with the number of variables, the complexity highly depends on the sparsity of the dependencies in the graphical model and scales as O(n · N_nI) ∼ O(n · m_eff²). For the dataset above, m_eff ≈ √633 ≈ 25, compared to m = 109.

Therefore, our method is especially efficient on datasets with a low effective dimensionality m_eff, such as neuronal data. The number of variables m itself has little effect on the computational cost and memory storage.

Computing infrastructure We developed our framework and ran the majority of our experiments (described both in the paper and in the Supplemental Material) on an Ubuntu 18.04 LTS machine with 2 x Intel(R) Xeon(R) Gold 6142 CPU @ 2.60GHz and 1 x GeForce RTX 2080 + 1 x GeForce RTX 2080 Ti GPUs. For training C-vine models, we used another Scientific Linux 7.6 machine with 1 x Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz and 8 x GeForce RTX 2080 Ti GPUs.

Code availability Code will be made available on GitHub upon paper acceptance.

S2.1 Model selection for bivariate copulas

Synthetic data We generate artificial data by sampling from a copula mixture, parametrized in two different ways:

1. The mixing concentrations of all copulas were constant and equal to 1/N (N = number of copulas), but the copula parameters θ were parametrized by phase-shifted sine functions:

θ_i = A_i sin(πmi/N + 2πx) + B_i,  x ∈ [0, 1],  (S4)

where i is the index of the copula in the mixture and m = 1. For Clayton and Gumbel copulas, the absolute value of the sine was used. The amplitudes A_i were chosen to cover most of the range of parameters, except for extremely low or high θ, for which all copula families become indistinguishable (from independence or from deterministic dependence, respectively).

2. The copula parameters θ were constant, but the mixing concentrations φ were parametrized by phase-shifted sine functions (same as Eq. S4, with A_i = B_i = 1/N and m = 2). Such a parametrization ensures that the sum of all mixing concentrations remains equal to one (∑_{i=1}^N φ_i = 1). Yet, each φ_i turns to zero somewhere along this trajectory, allowing us to discriminate the models and infer the correct mixture.

Identifiability tests We tested the ability of the model selection algorithms to select the correct mixture of copula models, the same as the one from which the data was generated. We generated 5000 samples with equally spaced unique inputs on [0, 1]. Both model selection algorithms were able to correctly select all of the 1-component and most of the 2-component models on simulated data.
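The Eq. (S4) parametrization is easy to reproduce. The NumPy sketch below (the function name and the choice N = 3 are ours) generates the phase-shifted sine trajectories and checks the sum-to-one property of the second parametrization:

```python
import numpy as np

def sine_params(x, N, A, B, m=1):
    """theta_i(x) = A_i * sin(pi*m*i/N + 2*pi*x) + B_i for i = 1..N (Eq. S4)."""
    x = np.asarray(x)[None, :]              # shape (1, n_inputs)
    i = np.arange(1, N + 1)[:, None]        # copula index, shape (N, 1)
    return A[:, None] * np.sin(np.pi * m * i / N + 2 * np.pi * x) + B[:, None]

x = np.linspace(0.0, 1.0, 5000)             # equally spaced unique inputs
# Parametrization 2: constant thetas, varying concentrations (A = B = 1/N, m = 2).
N = 3
phi = sine_params(x, N, A=np.full(N, 1 / N), B=np.full(N, 1 / N), m=2)
assert np.allclose(phi.sum(axis=0), 1.0)      # concentrations sum to one
assert np.isclose(phi.min(), 0.0, atol=1e-3)  # each phi_i touches zero
```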
For simulated data with larger numbers of components (or 2 very similar components), the WAIC of the selected model was either lower than that of the correct parametric model (which is possible given a limited number of samples) or close to it. In other words, the difference between the WAIC of the correct model and that of the best selected model never exceeded the tolerance WAIC_test_tol, which we set up as the criterion for passing the test: ∆WAIC < WAIC_test_tol. Since all the tests were passed successfully, we conclude that both algorithms are capable of finding optimal or close-to-optimal solutions for copula mixtures.

A more detailed report on the model identifiability tests Tables S2-S6 below illustrate the search for the best model. The copula model names in these tables are shortened to the first two letters, e.g. Gumbel becomes 'Gu', Frank becomes 'Fr'. The information in these tables provides some intuition on the model selection process and on the range of WAICs for the correct or incorrect models. The final selected models are shown in bold.

Table S2 demonstrates that both the greedy and the heuristic algorithms can identify the correct single copula model. Some key intermediate models (M in Alg. 1-2) with their WAICs are listed in the table, along with the total duration of the simulations (T, in minutes) on an RTX 2080 Ti for both algorithms.

Table S3 shows the identification of the mixtures with 2 components, where the copula parameters θ were constant (independent of X) and the mixing concentrations φ were parametrized by the phase-shifted sine functions (Eq. S4). All of these models were correctly identified with both algorithms. The mixtures with 2 components, where the copula parameters θ varied harmonically (as in Eq. S4) but the mixing concentrations φ were constant, were harder to identify. Table S4 shows that a few times, each of the algorithms selected a model that was better than the true model (WAIC_best − WAIC_true < 0). The greedy algorithm made one mistake, yet the model it selected was very close to optimal. Such misidentification happens due to the limited number of samples in a given synthetic dataset.

Tables S5-S6 show the model selection for 3-component models. Again, as in Tables S3-S4, either θ or φ was constant. Here, the model selection algorithms could rarely identify the correct model (due to the overcompleteness of the mixture models), but they always selected one that was very close to optimal: WAIC_best − WAIC_true ≪ WAIC_test_tol.

Note that WAIC_test_tol is different from waic_tol. We set waic_tol for the comparison against the Independence model to a much smaller value (10x smaller than WAIC_test_tol) because we want to avoid making false assumptions about conditional independences in the model. Also note that the WAIC of the true model depends on the particular synthetic dataset generated in each test. Therefore, the final WAICs in the left and in the right columns of Tables S2-S6 can be slightly different (yet, well within WAIC_test_tol).
S2.2 Accuracy of entropy estimation

[Figure S4: Accuracy of the entropy estimation for multivariate Gaussian distributions (estimators: True, Copula-GP, BI-KSG). A: Entropy H(Y), in bits, of the 20-dimensional multivariate Gaussian distribution for different correlation coefficients ρ. B: Estimation error ∆H(Y) = H_true − H_est, in bits, for the entropy shown in A. C: Entropy of the multivariate Gaussian distributions with fixed ρ and varying dimensionality dim Y. D: Estimation error for the entropy shown in C.]

In this section, we consider a fixed copula mixture model with known parameters θ and test the reliability of the entropy estimation with Monte Carlo (MC) integration. We test the accuracy of the entropy estimation on a multivariate Gaussian distribution with cov(Y_i, Y_j) = ρ + (1 − ρ)δ_ij, where δ_ij is Kronecker's delta and ρ ranges from 0 up to a maximal tested value. Given a known Gaussian copula, we estimate the entropy with MC integration and compare it to the analytically calculated true value, with a tolerance set to a small fixed fraction of H(Y). As a result, for every correlation ρ (Fig. S4A-B) and every number of dimensions dim Y (Fig. S4C-D), Copula-GP provides an accurate result, within the error margin. In Figure S4, the BI-KSG estimates [13] obtained on a dataset with 10k samples are shown for comparison. This experiment 1) validates the MC integration; and 2) validates the numerical stability of the probability density function of the Gaussian copula up to the maximal tested ρ (for larger ρ, the model becomes indistinguishable from the deterministic dependence U_1 = U_2).
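The same sanity check is easy to reproduce for the multivariate Gaussian itself. The sketch below (the sample size, ρ and the function name are our own choices) compares a plain MC entropy estimate against the closed-form Gaussian entropy:

```python
# MC vs. analytic entropy (in bits) of an equicorrelated d-dimensional
# Gaussian with cov(Y_i, Y_j) = rho + (1 - rho) * delta_ij.
import numpy as np
from scipy.stats import multivariate_normal

def entropy_check(d=20, rho=0.5, n_samples=10_000, seed=0):
    rng = np.random.default_rng(seed)
    cov = rho + (1.0 - rho) * np.eye(d)
    mvn = multivariate_normal(np.zeros(d), cov)
    y = rng.multivariate_normal(np.zeros(d), cov, size=n_samples)
    h_mc = -mvn.logpdf(y).mean() / np.log(2)   # MC estimate of -E[log2 p(Y)]
    _, logdet = np.linalg.slogdet(cov)
    h_true = 0.5 * (d * np.log(2 * np.pi * np.e) + logdet) / np.log(2)
    return h_mc, h_true

print(entropy_check())  # the two values should agree within the MC error
```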
S3 Model parameters for the bivariate neuronal and behavioural examples

[Figure S5: Parameters of the copula mixture models. From left to right: copula probability densities (same as Fig. 3C-D); a list of selected copula elements; copula parameters θ (with Pearson's ρ for reference); mixing concentrations φ, all plotted against the position in VR [cm]. These plots are provided for: A the noise correlation example (Gaussian and Clayton 90° elements); B the behavioral modulation example (Frank, Clayton 0° and Gumbel 270° elements).]

In this section, we provide visualisations of the parameters of the bivariate copula models from Figure 3C-F and discuss the interpretability of these models.

Figure S5 shows the probability density of the joint distribution of two variables and the parameters of the corresponding Copula-GP mixture model. The plots on the left repeat Fig. 3C-D and represent the true density (outlines) and the copula model density (blue shades) for each part of the task.

In the noise correlation example (Fig. S5A), we observe tail dependencies between the variables (i.e. a concentration of the probability density in a corner of the unit square) around [0-60] cm and [140-160] cm of the virtual corridor. There is only one element with a tail dependency in this mixture: the Clayton 90° copula. On the right-most plot in Fig. S5A, we see the mixing concentrations for the elements of the mixture model. The concentration of the Clayton 90° copula (orange line) is close to 100% around 20 cm and 150 cm, which agrees with our observations from the density plots.

The confidence intervals (±σ) for the parameters approximated with Gaussian processes are shown with shaded areas in the parameter plots. These intervals provide a measure of uncertainty in the model parameters. For instance, when the concentration of the Gaussian copula in the mixture is close to 0% (X around 20 cm and 150 cm), the confidence intervals for the Gaussian copula parameter (θ, blue shade) in Fig. S5A become very wide (from almost 0 to 1). Since this copula element does not affect the mixture for those values of X, its θ parameter has no effect on the mixture model log probability. Therefore, this parameter is not constrained to any particular value. In a similar manner, we see that the variables are almost independent between 60 and 120 cm (see the density plots on the left in Fig. S5). Both copula elements can describe this independence. As a result, the mixing concentrations for both elements have high uncertainty in that interval of X. Yet, a Gaussian copula with a slightly positive correlation is still somewhat more likely to describe the data in that interval.

The copula parameter plot in Fig. S5A also shows Pearson's ρ, which does not change much in this example and remains close to zero. This illustrates that a traditional linear noise correlation analysis would ignore (or downplay) this pair of neurons as one with no dependence. This happens because Pearson's ρ only captures the linear correlation and ignores the tail dependencies, whereas our model provides a more detailed description of the joint bivariate distribution.

In the behavioural modulation example (Fig. S5B), we observe more complicated tail dependencies in the density plots. The best selected model supports this observation and provides a mixture model with 3 components, 2 of which have various tail dependencies. The Clayton 0° copula (orange) describes the lower tail dependence observed in the second part of the virtual corridor with gratings (around [60-120] cm, see Fig. 3A for the task structure). This dependence can be verbally interpreted as follows: when there is no licking, Neuron 60 is certainly silent; but when the animal is licking, the activity of Neuron 60 is slightly positively correlated with the licking rate.

These examples illustrate that by analysing the copula parameters and the mixing concentrations of the Copula-GP mixture model, one can interpret the changes in the bivariate dependence structure. Just like traditional tuning curves characterize the response of a single neuron, our mixture model characterizes the 'tuning' of the dependence structure between pairs of variables to a given stimulus or context. Knowing the qualitative properties of the copula elements that constitute a copula mixture, one can focus on the dominant element of the copula mixture for every given conditioning variable X and describe the shape of the dependence.
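This kind of interpretation can be partially automated. As a minimal illustration (with made-up numbers rather than the framework's API), the dominant copula element at each value of the conditioning variable can be read off the GP-predicted mixing concentrations:

```python
# Pick the dominant copula element of the mixture at each input value,
# given the predicted mixing concentrations phi (n_elements x n_inputs).
import numpy as np

def dominant_element(phi, names):
    return [names[k] for k in np.argmax(phi, axis=0)]

phi = np.array([[0.9, 0.4, 0.05],      # e.g. Gaussian concentration
                [0.1, 0.6, 0.95]])     # e.g. Clayton 90deg concentration
print(dominant_element(phi, ['Gauss', 'Clayton90']))
# -> ['Gauss', 'Clayton90', 'Clayton90']
```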
Supplementary references

[1] James Hensman, Alexander Matthews, and Zoubin Ghahramani. Scalable variational Gaussian process classification. In Artificial Intelligence and Statistics, 2015.
[2] Jacob Gardner, Geoff Pleiss, Kilian Q. Weinberger, David Bindel, and Andrew G. Wilson. GPyTorch: Blackbox matrix-matrix Gaussian process inference with GPU acceleration. In Advances in Neural Information Processing Systems, pages 7576–7586, 2018.
[3] Andrew Wilson and Hannes Nickisch. Kernel interpolation for scalable structured Gaussian processes (KISS-GP). In International Conference on Machine Learning, pages 1775–1784, 2015.
[4] Christopher K. I. Williams and Carl Edward Rasmussen. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
[5] Kjersti Aas, Claudia Czado, Arnoldo Frigessi, and Henrik Bakken. Pair-copula constructions of multiple dependence. Insurance: Mathematics and Economics, 44(2):182–198, 2009.
[6] Tim Bedford and Roger M. Cooke. Probability density decomposition for conditionally dependent random variables modeled by vines. Annals of Mathematics and Artificial Intelligence, 32(1-4):245–268, 2001.
[7] Tim Bedford and Roger M. Cooke. Vines: A new graphical model for dependent random variables. Annals of Statistics, pages 1031–1068, 2002.
[8] Elif F. Acar, Christian Genest, and Johanna Nešlehová. Beyond simplified pair-copula constructions. Journal of Multivariate Analysis, 110:74–90, 2012.
[9] Ingrid Hobæk Haff, Kjersti Aas, and Arnoldo Frigessi. On the simplified pair-copula construction—simply useful or too simplistic? Journal of Multivariate Analysis, 101(5):1296–1310, 2010.
[10] Arno Onken and Stefano Panzeri. Mixed vine copulas as joint models of spike counts and local field potentials. In Proceedings of the 30th International Conference on Neural Information Processing Systems, NIPS'16, pages 1333–1341, Red Hook, NY, USA, 2016. Curran Associates Inc. ISBN 9781510838819.
[11] Julia U. Henschke, Evelyn Dylda, Danai Katsanevaki, Nathalie Dupuy, Stephen P. Currie, Theoklitos Amvrosiadis, Janelle M. P. Pakan, and Nathalie L. Rochefort. Reward association enhances stimulus-specific representations in primary visual cortex. Current Biology, 2020.
[12] Claudia Czado, Ulf Schepsmeier, and Aleksey Min. Maximum likelihood estimation of mixed C-vines with application to exchange rates. Statistical Modelling, 12(3):229–255, 2012.
[13] Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators, 2016.

(In Tables S2-S6 below, each block corresponds to one true model; the Greedy and Heuristic rows list the successive search attempts with their WAICs, and T is the total duration of the simulations in minutes.)

Table S2: The model selection histories for single copula models.

True: Ga
  Greedy (T = 25 m): Ga -0.1619; GaFr -0.1610; Ga -0.1619
  Heuristic (T = 3 m): Ga -0.1513; InGaGu Gu Gu Gu -0.1499; InGaCl Cl Cl Cl -0.1498; Ga -0.1513
True: Fr
  Greedy (T = 57 m): Fr -0.1389; FrCl -0.1395; FrCl Gu -0.1396; FrCl Gu Gu -0.1396; Fr -0.1389
  Heuristic (T = 3 m): Ga -0.1400; InGaGu Gu Gu Gu -0.1391; InGaCl Cl Cl Cl -0.1391; Fr -0.1509
True: Cl
  Greedy (T = 37 m): Cl -0.5225; Cl Gu -0.5226; Cl Gu Cl -0.5225; Cl -0.5224
  Heuristic (T = 5 m): Ga -0.3825; InGaGu Gu Gu Gu -0.4943; InGaCl Cl Cl Cl -0.5303; Cl -0.5311
True: Gu
  Greedy (T = 43 m): Gu -0.6267; Gu Cl -0.6268; Gu Cl Gu -0.6267; Gu -0.6230
  Heuristic (T = 7 m): Ga -0.5555; InGaGu Gu Gu Gu -0.5988; InGaCl Cl Cl Cl -0.5946; GaGu -0.6040; Gu -0.6050
True: Cl
  Greedy (T = 22 m): Cl -0.5389; Cl Cl -0.5389; Cl -0.5389
  Heuristic (T = 5 m): Ga -0.3922; InGaGu Gu Gu Gu -0.5047; InGaCl Cl Cl Cl -0.5409; Cl -0.5410
True: Gu
  Greedy (T = 55 m): Gu -0.6137; Gu Gu -0.6144; Gu Gu Cl -0.6145; Gu Gu Cl Cl -0.6144; Gu -0.6137
  Heuristic (T = 7 m): Ga -0.5501; InGaGu Gu Gu Gu -0.5893; InGaCl Cl Cl Cl -0.5831; GaGu -0.5887; Gu -0.5950
True: Cl
  Greedy (T = 36 m): Cl -0.5566; Cl Cl -0.5582; Cl Cl In -0.5582; Cl -0.5565
  Heuristic (T = 7 m): Ga -0.3932; InGaGu Gu Gu Gu -0.4956; InGaCl Cl Cl Cl -0.5493; Cl -0.5489
True: Gu
  Greedy (T = 43 m): Gu -0.6131; Gu Cl -0.6164; Gu Cl Fr -0.6163; Gu -0.6131
  Heuristic (T = 6 m): Ga -0.5553; InGaGu Gu Gu Gu -0.6091; InGaCl Cl Cl Cl -0.6045; Gu -0.6154
True: Cl
  Greedy (T = 23 m): Cl -0.5434; Cl Gu -0.5433; Cl -0.5434
  Heuristic (T = 5 m): Ga -0.3909; InGaGu Gu Gu Gu -0.5094; InGaCl Cl Cl Cl -0.5535; Cl -0.5548
True: Gu
  Greedy (T = 51 m): Gu -0.5928; Gu Cl -0.5934; Gu Cl In -0.5935; Gu Cl InCl -0.5931; Gu -0.5928
  Heuristic (T = 6 m): Ga -0.5763; InGaGu Gu Gu Gu -0.6277; InGaCl Cl Cl Cl -0.6179; Gu -0.6300

Table S3: The model selection histories for 2-element mixtures with constant θ and variable φ.

True: Gu + Ga
  Greedy (T = 101 m): Ga -0.1877; GaGu -0.2855; GaGu Cl -0.2855; GaGu Cl Fr -0.2856; GaGu Cl FrGu -0.2856; GaGu Cl FrGu -Cl -0.2856; Gu Ga -0.2854
  Heuristic (T = 11 m): Ga -0.1922; InGaGu Gu Gu Gu -0.3070; InGaCl Cl Cl Cl -0.2996; GaCl Gu Gu -0.3082; GaCl Cl Gu -0.3076; GaGu -0.3091
True: Ga + Cl
  Greedy (T = 87 m): Fr -0.1635; FrCl -0.2707; FrCl Ga -0.2747; FrCl GaGu -0.2782; FrCl GaGu Cl -0.2781
  Heuristic (T = 5 m): Ga -0.1600; InGaGu Gu Gu Gu -0.2687; InGaCl Cl Cl Cl -0.2835; GaCl -0.2845; GaCl -0.2821
True: Gu + Fr
  Greedy (T = 99 m): Gu -0.1681; Gu Fr -0.2099; Gu FrCl -0.2101; Gu FrCl Cl -0.2105; Gu FrCl Cl In -0.2106; Gu FrCl Cl In-Gu -0.2099
  Heuristic (T = 8 m): Ga -0.1534; InGaGu Gu Gu Gu -0.1993; InGaCl Cl Cl Cl -0.1977; InGaGu -0.2074; FrGu -0.2104; FrGu -0.2099
True: Cl + Cl
  Greedy (T = 92 m): Fr -0.1587; FrCl -0.2600; FrCl Cl -0.3173; FrCl Cl Gu -0.3176; FrCl Cl Gu In -0.3176; FrCl Cl Gu InCl -0.3175
  Heuristic (T = 5 m): Ga -0.1652; InGaGu Gu Gu Gu -0.3142; InGaCl Cl Cl Cl -0.3430; Cl Cl -0.3448; Cl Cl -0.3190
True: Cl + Gu
  Greedy (T = 103 m): Fr -0.2204; FrCl -0.3488; FrCl Gu -0.3874; FrCl Gu Cl -0.3877; FrCl Gu Cl Ga -0.3878; FrCl Gu Cl Ga-Gu -0.3878
  Heuristic (T = 7 m): Ga -0.1965; InGaGu Gu Gu Gu -0.3591; InGaCl Cl Cl Cl -0.3688; GaGu Cl -0.3771; Gu Cl -0.3772

Table S4: The model selection histories for 2-element mixtures with constant φ and variable θ.

True: Gu + Ga
  Greedy (T = 60 m): Gu -0.1419; Gu Fr -0.2022; Gu FrCl -0.2024; Gu FrCl Ga -0.2024; FrGu -0.2021
  Heuristic (T = 10 m): Ga -0.1538; InGaGu Gu Gu Gu -0.2320; InGaCl Cl Cl Cl -0.2218; GaCl Gu Gu -0.2321; GaGu -0.2326
  WAIC_best − WAIC_true: -0.0013
True: Ga + Cl
  Greedy (T = 56 m): Gu -0.1495; Gu Fr -0.1894; Gu FrCl -0.1915; Gu FrCl In -0.1902; FrGu -0.1915
  Heuristic (T = 7 m): Ga -0.1062; InGaGu Gu Gu Gu -0.1747; InGaCl Cl Cl Cl -0.1783; GaGu Cl -0.1812; GaCl -0.1801
True: Gu + Fr
  Greedy (T = 58 m): Gu -0.1600; Gu Fr -0.2191; Gu FrCl -0.2195; Gu FrCl Cl -0.2190; FrGu -0.2190
  Heuristic (T = 8 m): Ga -0.1331; InGaGu Gu Gu Gu -0.1944; InGaCl Cl Cl Cl -0.1936; GaGu Cl Gu Gu -0.1945; GaGu -0.1992
  WAIC_best − WAIC_true: -0.0094
True: Cl + Cl
  Greedy (T = 62 m): Gu -0.0253; Gu Cl -0.2383; Gu Cl Cl -0.2506; Gu Cl Cl In -0.2509; Gu Cl Cl InFr -0.2508
  Heuristic (T = 5 m): Ga -0.0079; InGaGu Gu Gu Gu -0.1904; InGaCl Cl Cl Cl -0.2330; Cl Cl -0.2361; Cl Cl -0.2586
True: Cl + Gu
  Greedy (T = 69 m): Gu -0.0242; Gu Cl -0.2499; Gu Cl Gu -0.2517; Gu Cl Gu In -0.2518; Gu Cl Gu InCl -0.2518; Gu Cl Gu InCl Fr -0.2518
  Heuristic (T = 6 m): Ga -0.0083; InGaGu Gu Gu Gu -0.2277; InGaCl Cl Cl Cl -0.2535; GaCl Cl -0.2549; Cl Gu -0.2500
  WAIC_best − WAIC_true: -0.0098

Table S5: The model selection histories for 3-element mixtures with constant θ and variable φ.

True: Ga + Cl + Gu
  Greedy (T = 44 m): Gu -0.1399; Gu Cl -0.2494; Gu Cl Cl -0.2519; Gu Cl Cl Fr -0.2518; Cl Gu -0.2494
  Heuristic (T = 6 m): Ga -0.1252; InGaGu Gu Gu Gu -0.2481; InGaCl Cl Cl Cl -0.2565; GaCl Cl -0.2564
  WAIC_best − WAIC_true: -0.0036
True: Fr + Cl + Gu
  Greedy (T = 77 m): Fr -0.0591; FrCl -0.1460; FrCl Gu -0.1730; FrCl Gu Cl -0.1736; FrCl Gu Cl In -0.1734; Gu Cl Fr -0.1731
  Heuristic (T = 6 m): Ga -0.0489; InGaGu Gu Gu Gu -0.1573; InGaCl Cl Cl Cl -0.1578; GaCl Cl -0.1621
True: Fr + Cl + Gu
  Greedy (T = 87 m): Fr -0.0741; FrCl -0.1513; FrCl Gu -0.1707; FrCl Gu Cl -0.1708; FrCl Gu Cl Gu -0.1711; FrCl Gu Cl Gu -Cl -0.1710; Gu Cl Fr -0.1703
  Heuristic (T = 9 m): Ga -0.0618; InGaGu Gu Gu Gu -0.1567; InGaCl Cl Cl Cl -0.1670; InGaGu Cl Cl -0.1680; InGaGu Cl Gu -0.1695; InGu Cl -0.1735
  WAIC_best − WAIC_true: -0.0011
True: Gu + Gu + Cl
  Greedy (T = 47 m): Gu -0.1695; Gu Cl -0.3040; Gu Cl Gu -0.3234; Gu Cl Gu Cl -0.3233; Gu Cl Gu -0.3234
  Heuristic (T = 11 m): Ga -0.1477; InGaGu Gu Gu Gu -0.2986; InGaCl Cl Cl Cl -0.3033; GaGu Cl Cl -0.3054; GaGu Cl Gu -0.3111; Gu Cl Gu -0.3113

Table S6: The model selection histories for 3-element mixtures with constant φ and variable θ.

True: Ga + Cl + Gu
  Greedy (T = 66 m): Fr -0.0177; FrGu -0.1284; FrGu Gu -0.1407; FrGu Gu Cl -0.1423; FrGu Gu Cl Cl -0.1435; FrGu Gu Cl Cl In -0.1432; Gu Gu -0.1451
  Heuristic (T = 13 m): Ga -0.0142; InGaGu Gu Gu Gu -0.1291; InGaCl Cl Cl Cl -0.1289; InCl Gu Gu -0.1317; InCl Cl Gu -0.1346; InCl Cl Cl -0.1301; GaCl Cl -0.1313
True: Fr + Cl + Gu
  Greedy (T = 71 m): Fr -0.0265; FrGu -0.1290; FrGu Gu -0.1445; FrGu Gu Cl -0.1450; FrGu Gu Cl Cl -0.1466; FrGu Gu Cl Cl In -0.1468; Gu Gu -0.1451
  Heuristic (T = 9 m): Ga -0.0192; InGaGu Gu Gu Gu -0.1411; InGaCl Cl Cl Cl -0.1429; InGaGu Cl Cl -0.1474; InGaGu Cl Gu -0.1472; InCl Gu -0.1477
True: Fr + Cl + Gu
  Greedy (T = 61 m): Fr -0.0129; FrGu -0.1105; FrGu Gu -0.1237; FrGu Gu Cl -0.1254; FrGu Gu Cl Gu -0.1248; Gu Gu Fr -0.1234
  Heuristic (T = 6 m): Ga -0.0185; InGaGu Gu Gu Gu -0.1309; InGaCl Cl Cl Cl -0.1326; InGu Cl -0.1393; InGu Gu -0.1334; InGu Gu -0.1326
True: Gu + Gu + Cl
  Greedy (T = 55 m): Gu -0.0756; Gu Cl -0.2380; Gu Cl Cl -0.2556; Gu Cl Cl Ga -0.2591; Gu Cl Cl GaCl -0.2590; Cl Cl Gu -0.2555
  Heuristic (T = 7 m): Ga -0.0454; InGaGu Gu Gu Gu -0.2476; InGaCl Cl Cl Cl -0.2459; GaCl Gu Gu -0.2493; GaCl Cl Gu -0.2559; Cl Cl Gu -0.2538
  WAIC_best − WAIC_true: 0.0006