Inferring food intake from multiple biomarkers using a latent variable model
IInferring food intake from multiple biomarkersusing a latent variable model
Silvia D’Angelo, Lorraine Brennan and Isobel Claire GormleyJune 5, 2020
Abstract
Metabolomic based approaches have gained much attention in recent years due to their promising potentialto deliver objective tools for assessment of food intake. In particular, multiple biomarkers have emerged forsingle foods. However, there is a lack of statistical tools available for combining multiple biomarkers to inferfood intake. Furthermore, there is a paucity of approaches for estimating the uncertainty around biomarkerbased prediction of intake.Here, to facilitate inference on the relationship between multiple metabolomic biomarkers and foodintake in an intervention study conducted under the A-DIET research programme, a latent variable model,multiMarker, is proposed. The proposed model draws on factor analytic and mixture of experts models,describing intake as a continuous latent variable whose value gives raise to the observed biomarker values.We employ a mixture of Gaussian distributions to flexibly model the latent variable. A Bayesian hierarchicalmodelling framework provides flexibility to adapt to different biomarker distributions and facilitates predictionof the latent intake along with its associated uncertainty.Simulation studies are conducted to assess the performance of the proposed multiMarker framework, priorto its application to the motivating application of quantifying apple intake.
Keywords: Latent variable models, factor analysis, mixture of experts, metabolomics, ordinal regression
Dietary biomarkers have emerged in recent years as objective measures of food intake (Baldrick et al. (2011)). Adietary biomarker is a small molecule called a metabolite that can provide information on the level of intake of afood (Gao et al. (2017), Dragsted et al. (2018)). In recent years, many biomarkers have emerged for a range offoods. The importance of such biomarkers stems from the fact that classical dietary assessment approaches relyon self-reported data which can be subjective and biased, and such issues are well documented in the literature(e.g. Bingham (2002), Kipnis et al. (2002), Subar et al. (2003), Lloyd et al. (2019), Siddique et al. (2019)). Asa consequence, dietary biomarkers have emerged as a potential objective tool to aid food intake assessment.As the biomarker field progressed and the analytical tools improved, the number of biomarkers identified aspotential biomarkers of intake has increased. Indeed, there are now multiple biomarkers for individual foods.However, there is a paucity of statistical tools for modelling the relationship between multiple biomarkers1 a r X i v : . [ s t a t . M E ] J un nd food intake. The work to date has employed biomarkers as panels to classify intake into categories, e.g.consumers and non-consumers. For example, Garcia-Aloy et al. (2017) propose a panel of biomarkers for cocoaintake, which is employed to estimate cocoa consumption or non-consumption through a forward stepwiselogistic regression model. Rothwell et al. (2014) propose a partial least squares discriminant analysis (PLS-DA)approach to distinguish between low and high consumption of coffee in a sample of individuals, using a panel ofthree coffee-specific biomarkers. Recent work by Vázquez-Manjarrez et al. (2019) on discovery and validationof banana intake biomarkers also employs a PLS-DA approach to detect low, medium and high consumptionof banana. The proposed panel of biomarkers proved to be effective in distinguishing between low and highconsumers, while medium consumers were difficult to separate from the two extremal groups. Gürdeniz et al.(2016) present a PLS-DA approach to detect beer intake (consumers versus non-consumers) using a panel ofaggregated biomarkers. While the aforementioned approaches avail of a panel of biomarkers, they provide acategorical quantification of intake, not an estimation of the quantity of intake, nor its associated uncertainty.The A-DIET research programme ( ) aims to identify new metabolomicbiomarkers of dietary intake, and here provides the motivating context. Data from a panel of four novelmetabolomic biomarkers were collected from an intervention study, where a group of participants consumedthree different food quantities (of apple), over a three week period. The statistical challenge is to infer therelationship between the biomarkers and food intake, and to predict intake from the panel of biomarkers alone.Here, to infer the relationship between multiple biomarkers and food intake we have developed the “multi-Marker” model. MultiMarker relies on a factor analytic latent variable construct (Knott & Bartholomew 1999) tocapture the relationship between the panel of observed biomarkers and the unobserved intake. The distributionof the latent factor (i.e. the unobserved food intake) is expressed through a flexible mixture model. Moreover,to improve prediction, multiMarker avails of a mixture of experts framework (Jacobs et al. (1991), Gormley &Frühwirth-Schnatter (2019)), such that the biomarker data first informs the predicted level of intake (relative tothe most likely food quantity), with the more refined predicted intake then resulting from the factor analyticaspect of multiMarker. The model is developed in a Bayesian framework, naturally allowing for uncertaintyquantification and therefore providing more informed quantification of intake.Traditional factor analysis models assume that the correlation structure between a collection of observedvariables can be represented in terms of a linear combination of a lower number of latent variables, the factors(Knott & Bartholomew 1999). The latent variable(s) are often assumed to follow a Gaussian distribution, andmany extensions have been proposed. Allowing an infinite number of factors to facilitate greater modellingflexibility has been proposed by Bhattacharya & Dunson (2011), and Murphy et al. (2020) extend the frameworkto an infinite mixture of infinite factor analysers. When in the presence of an heterogeneous population, Montanari& Viroli (2010) proposed an heteroscedastic factor mixture analysis model, where factors are distributed accordingto a mixture of multivariate Gaussian distributions. Relaxations of the Gaussianity assumption for the factorshave been considered by many authors, see for example McLachlan et al. (2007), Murray et al. (2014) and Linet al. (2016). An extension of the factor mixture analysis approach to multivariate binary response data hasbeen considered by Cagnone & Viroli (2012). Galimberti et al. (2009) propose an approach to dimensionalityreduction in factor mixture analysis models, where factor loadings are shrunk through a penalized likelihoodapproach. Robustification of factor analysis models was addressed by Pison et al. (2003), who introduced a2ethod to address outliers. A further issue with factor analytic models is non-identifiability, see for exampleLopes & West (2004), Ročková & George (2016) and Frühwirth-Schnatter & Lopes (2018).While much factor analytic research has focused on two common issues, the number of factors to employand the distribution to adopt, attempting to model the scale of the latent factors has received little attention.Typically the latent factors are perceived as instrumental tools to achieve a lower-dimensional representation ofthe data at hand. On the contrary, motivated by the application, our approach is focused on a single latentvariable that is a proxy for the latent intake. Information on the scale of this latent variable is available andnecessary in order to provide practically useful quantification of intake. While the availability of such informationalso ensures the multiMarker model is identifiable, unlike general factor analytic models, it adds complexity toinferring the latent variable which is no longer a simple instrument. Modelling the latent factor via a mixturemodel allows for a flexible framework, but introduces the issue of properly modelling mixture weights. Indeed,mixture components have the role of locating different regions in the intake range for the latent variable, wherethe order of such regions is relevant and should not be ignored. To this end, we embed the model for thelatent variable in a mixture of experts framework, where the weights are modelled as functions of the observedbiomarkers. Further, when modelling the weights we directly account for the ordinal nature of the food quantitydata via an ordinal regression model (Agresti 1999), employing the robust Cauchit link function (Morgan &Smith (1992)). Prediction of the latent intake and its associated uncertainty is available through the latentvariable’s posterior predictive distribution.In what follows, Section 2 details the motivating application of inferring food (specifically apple) intake.Section 3 outlines our proposed multiMarker framework and develops an efficient Metropolis-within-Gibbssampling strategy for Bayesian inference. Section 4 provides details of a thorough simulation study exploringthe performance of the proposed framework across a series of realistic settings. The multiMarker framework isapplied in the motivating context of inferring apple intake in Section 5, with the concluding Section 6 discussingthe application outcomes, the multiMarker framework and possible extensions. An R package, multiMarker,is freely available through to facilitate widespread use of the method, and with which allresults presented herein were produced. In the present work we develop a model to estimate the relationship between apple intake and P = 4 uri-nary biomarkers, identified using an untargeted metabolomics approach (McNamara et al. 2020). The foururinary biomarkers are: Xylose, Epicatechin Sulfate, [(4- { } -2-methylbut-2-en-1-yl)oxy] sulfonic acid, and Glucodistylin. Throughout the paper we willrefer to the third biomarker as (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHM P M B − SA ).The data were collected as part of an intervention study, where a group of 32 participants consumed differentquantities of apple daily, over a three week intervention period. The intervention study was conducted as part ofthe A-DIET research programme ( ), which aims to identify new metabolomicbiomarkers of dietary intake. Data was available following consumption of D = 3 apple quantities: 50, 100 and300 grams. Each intervention week was followed by a resting week. Biomarker data was available followingconsumption of 50, 100 and 300 grams, respectively for 29, 28 and 29 participants, leading to a total of n = 86observations. Throughout the paper we will treat the n = 86 observations as independent and this assumption isassessed in Section 5. The original values for Epicatechin Sulfate, (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHM P M B − SA ) and Glucodistylin caused computational instability (most values were largerthan 10 ) and consequently were scaled (see Appendix C). However, the transformation did not alter thecorrelations between the four biomarkers. A visualization of the data is given in Figure 1. Three of the foururinary biomarkers have similar median values for the first two apple quantities, and all biomarkers are highlyvariable for the 300 grams apple quantity. Also, the boxplots corresponding to different apple quantities partiallyoverlap, indicating that such quantities can not be completely separated by biomarker values.The intervention data are used to model the relationship between the panel of biomarkers and apple intake(in grams). We also examine the ability of the model to predict apple consumption at a participant level, usingonly the metabolomic biomarker measurements. l l l l l l l l l l l l l l l l l Apple quantity (grams) B i o m a r k e r l e v e l
50 100 300 50 100 300 50 100 300 50 100 300
Xylose Epicatechin Sulfate 4−3−[2−(2,4−dihydroxyphenyl)−2−oxoethyl]−DHMPMB−SA Glucodistylin
Figure 1: Biomarker levels following consumption of different quantities of apple. Boxplots for the n = 86 observations onthe four biomarkers, for the three apple quantities ( , and grams). The multiMarker model
To quantify food (in this case apple) intake from a panel of biomarkers we propose a factor analytic frameworktermed multiMarker. The proposed framework learns the relationship between the multiple biomarkers and foodquantity data and facilitates prediction of intake when only biomarker data are observed.
Consider a biomarker matrix Y of dimension ( n × P ), storing P different biomarkers measurements on a setof n independent observations. The D food quantities considered in the intervention study are denoted by X = { X , . . . , X d , . . . , X D } . Elements in X are ordered such that X d < X d +1 . We assume that the biomarkermeasurements are related to an unobserved, continuous intake value, for which the food quantities are proxies,leading to the following factor analytic model: y ip = α p + β p z i + (cid:15) ip , ∀ i = 1 , . . . , n, p = 1 , . . . , P, (1)where the (one dimensional) latent variable z i denotes the latent intake of observation i .The α p and β p parameters characterize, respectively, the intercept and the scaling effect for the p th biomarker.We assume a truncated Gaussian prior distribution, α p ∼ N [0 , ∞ ] (cid:0) µ α , σ α (cid:1) , for the intercept parameters. Thenon-negative assumption is required as biomarker levels can not be negative. The p th scaling parameter β p captures the effect of an increment in consumption of a given food on the observed level of biomarker p and atruncated Gaussian prior distribution is also assumed, β p ∼ N [0 , ∞ ] (cid:0) µ β , σ β (cid:1) .The error term (cid:15) p is the variability associated with the p th biomarker; a precise biomarker will have a valueclose to zero. As is common in factor analytic models, we assume that (cid:15) ip ∼ N (0 , σ p ), where σ p serves as aproxy for the precision of the p th biomarker. Hence, the likelihood function conditional on z = ( z , . . . , z n ) T is: ‘ ( Y | α, β, z , Σ) = P X p =1 n X i =1 −
12 log (cid:0) πσ p (cid:1) − σ p (cid:0) y ip − α p − β p z i (cid:1) ! , (2)where Σ = diag ( σ , . . . , σ P ). As stated, the quantities of interest z = ( z , . . . , z n ) T cannot be directly measured,only their discretizations (the food quantities from the intervention study) are available which are thereforeemployed to inform the distribution of the latent intakes. Thus, differently from standard factor analytic models,here the scale of the latent variable plays a central role, and its accurate recovery is a central requirement of theanalysis. To specify a prior distribution for the latent intakes, we exploit the information available in the food quantitiesfrom the intervention study. It is reasonable to assume that the latent intake of the i th observation, z i , willbe distributed around its corresponding consumed food quantity. Thus we assume a mixture of D truncatedGaussian distributions as the prior distribution for the latent intakes: z i ∼ D X d =1 π d N [0 , ∞ ] (cid:0) X d , σ d τ d (cid:1) (3)5he use of truncated distributions follows naturally from the definition of intake, which is non-negative. The d th component in the mixture model represents the distribution of the intakes of the observations who wereadministered the d th food quantity, and is therefore centered at X d . The variances σ , . . . , σ D represent foodquantity-specific intake variability, with lower values suggesting higher consumption-compliance. As the focushere is on recovering the latent variable in its innate scale, the τ d parameters serve as food quantity-specificscaling factors. These parameters account for large or small gaps between subsequent food quantities and,also, indirectly, for the possible difference in scale between the latent intakes and the biomarkers. Given theapplication context, here the mixture weights π = ( π , . . . , π D ) are known as they represent the proportion ofobservations that have consumed each food quantity. We adopt a Bayesian approach to estimate the hierarchical model’s parameters, implemented through a Metropoliswithin Gibbs Markov chain Monte Carlo (MCMC) algorithm. At each iteration of the algorithm, latent intakesand model parameters are sequentially updated from full conditional distributions. Hyperprior distributions areassumed on the prior parameters with the corresponding hyperparameter values fixed based on the data at hand,following an empirical Bayes approach. Hyperparameter specifications are reported in the Appendix A.1. Figure2 provides a graphical representation of the multiMarker model, with the update steps of the MCMC algorithmdetailed below.
Yα β z σ p µ α σ α µ β σ β τ d σ d π d ν P ν P X d m α m β τ β ν β ν β ν z , ν z Figure 2: Hierarchical structure of the multiMarker model. Shaded circles represent the data. Parameters and latentvariables are indicated with transparent circles and hyperparameters are indicated with no circles.
For the p th biomarker the truncated Gaussian prior distributions of both α p and β p are combined with thelikelihood function leading to the full conditional distributional distributions of α p and β p : α p ∼ N [0 , ∞ ] (cid:0) µ ∗ α p , σ ∗ α p (cid:1) ; β p ∼ N [0 , ∞ ] (cid:0) µ ∗ β p , σ ∗ β p (cid:1) σ ∗ α p = σ p σ α nσ α + σ p ; µ ∗ α p = σ ∗ α p " P ni =1 (cid:0) y ip − β p z i (cid:1) σ p + µ α σ α ; σ ∗ β p = σ p σ β σ β P ni =1 z i + σ p ; µ ∗ β p = σ ∗ β p " P ni =1 z i (cid:0) y ip − α p (cid:1) σ p + µ β σ β Assuming an inverse gamma prior distribution for the p th variance term σ p , with shape parameter ν P and scaleparameter ν P , leads to the following full conditional distribution: σ p ∼ Inv Γ ν ∗ P = n ν P , ν ∗ P = 12 n X i =1 ( y ip − α p − β p z i ) + ν P ! To derive the full conditional distribution for the latent intake, we exploit the complete data representation ofits prior distribution: z i | · · · ∼ D Y d =1 h π d N [0 , ∞ ] ( X d , σ d τ d ) i c id , c id = i consumed food quantity X d , . where the observation labels c i = { c i , . . . , c iD } and the mixture weights π d are known, given the structure of theintervention study. Thus for observation i who consumed quantity X d and has biomarkers y i , we may samplethe corresponding latent intake value from: z i | · · · ∼ N [0 , ∞ ] (cid:0) µ ∗ id , σ ∗ id (cid:1) , with σ ∗ id = P X p =1 β p τ d σ d + σ p /Pτ d σ d σ p ! − , µ ∗ id = σ ∗ id " P X p =1 β p (cid:0) y ip − α p (cid:1) σ p + X d τ d σ d Further, assuming an inverse gamma prior distribution on the components’ variance parameters, σ d ∼ Inv Γ (cid:0) ν z , ν z (cid:1) , the full conditional distribution is: σ d ∼ Inv Γ (cid:0) ν ∗ z , ν ∗ z (cid:1) , with ν ∗ z = n d ν z , ν ∗ z = ν z + τ d P ni =1 ( x i = X d )( z i − x d ) τ d , where n d = P ni =1 c id . The τ d parameters are fixed according to food quantity values ( τ d = x d d ) to account fordifferent gaps between these and allow for good coverage of the latent intakes’ range. To allow for the uncertainty in the parameters of the prior distributions of α p and β p , hyperprior distributionsare specified for µ α , µ β and σ β . Specifically, µ α ∼ N [0 , ∞ ] (cid:0) m α , τ α σ α (cid:1) , µ β ∼ N [0 , ∞ ] (cid:0) m β , τ β σ β (cid:1) , and σ β ∼ Inv Γ (cid:0) ν β , ν β (cid:1) , leading to the following full conditionals: µ g ∼ N [0 , ∞ ] (cid:0) µ ∗ g , σ ∗ g (cid:1) , with σ ∗ g = τ g σ g τ g P + 1 , µ ∗ g = σ ∗ g h τ g P Pp =1 g p + m g τ g σ g i ,σ β ∼ Inv Γ (cid:0) ν ∗ β , ν ∗ β (cid:1) , with ν ∗ β = P + 1 + 2 ν β , ν ∗ β = ν β + τ β P Pp =1 ( β p − µ β ) + ( µ β − m β ) τ β , where g = α, β . The set of hyperparameters (cid:0) m α , m β , τ α , τ β , ν β , ν β (cid:1) needs to be fixed in advance; somepractical guidelines are outlined in Appendix A.1. Finally, we fix the value of σ α = 1, for identifiability.7 .4 Predicting latent intakes While the multiMarker framework infers the relationship between biomarker and food quantity data from anintervention study, the aim is then to predict food intake when only biomarker data are available.Let us assume there are n ∗ observations for which biomarker data have been measured but no information ontheir consumed food quantity is available. To predict the latent intake z ∗ j for observation j , j = 1 , . . . , n ∗ , withbiomarker data y ∗ j = ( y ∗ j , . . . . , y ∗ jP ) T , we reconsider the log-likelihood (2) as a function of the latent intakesgiven the biomarker data: p ( z ∗ j | α, β, Σ , y ∗ j ) = P X p =1 −
12 log (cid:0) πσ p (cid:1) − β p σ p (cid:18) z ∗ j − (cid:0) y ∗ jp − α p β p (cid:1)(cid:19) ! (4)That is, p ( z ∗ j | . . . ) is the truncated Gaussian distribution z ∗ j ∼ N [0 , ∞ ] (cid:0) µ z , σ z (cid:1) , where σ z = (cid:0)P Pp =1 β p σ p (cid:1) − and µ z = σ z (cid:0)P Pp =1 β p ( y ∗ jp − α p ) σ p (cid:1) . Thus we now treat the latent intakes as the response data, regressed on the biomarkerdata. Then, combining (4) and (3) we derive the sampling distribution for z ∗ j : p ( z ∗ j | Ω) = N [0 , ∞ ] (cid:0) µ z , σ z (cid:1) D X d =1 π d N [0 , ∞ ] (cid:0) X d , σ d τ d (cid:1) = D X d =1 π d N [0 , ∞ ] (cid:18) µ z σ d τ d + X d σ z σ d τ d + σ z , (cid:0) σ d τ d + 1 σ z (cid:1) − (cid:19) (5)where Ω = { π , . . . , π D , µ z , . . . , µ zD , σ zq , . . . , σ zD } , µ zd = µ z σ d τ d + X d σ z σ d τ d + σ z and σ zd = (cid:0) σ d τ d + σ z (cid:1) − , for d =1 , . . . , D . Combining (5) with the posterior distribution of the multiMarker model (see Appendix A.3) providesthe posterior predictive distribution for z ∗ j : p ( z ∗ j | z ) ∝ Z p ( z ∗ j | Ω , z ) p (Ω | z ) d Ω = Z p ( z ∗ j | Ω) p (Ω | z ) d Ω (6)where p (Ω | z ) is the posterior distribution. Here p ( z ∗ j | Ω) is a mixture distribution with weights π = { π d } Dd =1 .As in the prediction setting only biomarker data are available, no food quantity information can be used toassign participants to different mixture components and so the weights are unknown. Replacing the weightswith the fixed values from the intervention study would bias the predictions, as the food quantity consumptionpattern of the n ∗ participants may be very different from that of the n participants used to train the multiMarkermodel. Thus in the multiMarker model we model the weights as a function of the observed biomarker values, tofacilitate flexible and adaptive predictions. Given the inherent ordering of the food quantities X , . . . , X D in the intervention study, we employ an ordinalregression model with Cauchit link function to model the weights π . The Cauchit link function is more robustthan the standard logit or probit link functions and thus is suited to the often highly variable biomarker data.The parameters of the ordinal regression model are inferred from the intervention study data using a slightlymodified specification of the multiMarker model, and then used later in the prediction step. Specifically, in themultiMarker model with ordinal regression coefficients denoted θ and γ , we now assume: p ( π | θ, γ, Y , c ) = n Y i =1 D Y d =1 π d ( θ d , γ, y i ) c id = n Y i =1 D Y d =1 h P r ( x i ≤ X d | θ d , γ, y i ) − P r ( x i ≤ X d − | θ d − , γ, y i ) i c id (7)8here P r ( x i ≤ X d | θ d , γ, y i ) = π h arctan (cid:16) ( θ d + γy i ) (cid:17) + π i is the probability that participant i consumed lessthan or equal to the food quantity X d , given its P biomarker measurements y i , expressed through a Cauchit link.The vector θ = {−∞ = θ , θ , . . . , θ d , . . . , θ D = ∞} contains food quantity-specific intercepts. The γ parameteris a P dimensional scaling coefficient for the biomarkers, expressing the contribution of each to the determinationof the mixture weights. These ordinal regression parameters are updated via a Metropolis Hastings step usingrandom walk proposal distributions within the MCMC algorithm. Details on their initialization are deferred toAppendix A.2. Estimation of these weights-specific parameters requires intervention study data, and hence it isperformed simultaneously with, yet independently of, that of multiMarker model parameters (see Section 3.3).We refer to this joint procedure as the learning phase of the multiMarker framework.The model in (7) is similar to a mixture of experts model, a type of mixture model in which parameters aremodelled as a function of observed covariates. However, here the mixture weights are modelled as a function ofthe observed response variable, as in the prediction step this variable changes its role to serve as a predictor.Given the parametrization for the mixture weights π in (7), we update the sampling distribution for z ∗ j (5) to: p ( z ∗ j | Ω , y ∗ j ) = D X d =1 π d ( θ d , γ, y ∗ j ) N [0 , ∞ ] ( µ zd , σ zd ) (8)where the definition of Ω is updated to Ω = { θ, γ, c j , µ z , . . . , µ zD , σ zq , . . . , σ zD } , and y ∗ j are the biomarkers’data for observation j . Finally, considering the updated definition of Ω and that of p ( z ∗ j | Ω , y ∗ j ), we can rewritethe posterior predictive distribution (6) as: p ( z ∗ j | z , y ∗ j ) ∝ Z p ( z ∗ j | Ω , y ∗ j ) p (Ω | z ) d Ω (9)We sample from the posterior predictive distribution (9) in three steps. First, relevant model parameters, Ω, aresampled from their full conditional and proposal distributions, the parameters of which are estimated in thelearning phase. Second, the labels c jd , d = 1 , . . . , D , are computed, using current sampled weights parametervalues and novel biomarker data y ∗ j , for j = 1 , . . . , n ∗ . Specifically, at the t th iteration we have that c ( t ) jd = 1,if arg max s π s ( θ ( t ) s , γ ( t ) , y ∗ j ) = π d ( θ ( t ) d , γ ( t ) , y ∗ j ), and c ( t ) jd = 0 otherwise. Last, predicted latent intake values aresampled from (8), conditioning on the current labelsand sampled model parameters Ω. We refer to the abovethree step procedure as the prediction phase of the multiMarker framework. This three-step sampler results inmore refined latent intake predictions. Indeed, not inferring the n ∗ labels and obtaining predictions by averagingacross the whole mixture distribution (8) leads to predicted intake values closer to the food quantities’ samplemean ¯ X d = D P d X d . This results in poor predictions for observations that are more likely to belong to acomponent located near small or large food quantity values. A schematic representation of the model for thecomponents’ weights and of that for latent intake predictions is given in Figure 3. Several simulation scenarios have been constructed to analyze the properties of the proposed approach. Perfor-mance is assessed both from estimation and prediction perspectives. The terms “train” and “test” are used torefer, respectively, to different simulated data employed in the learning and the prediction phases. To representa variety of real-world scenarios, different sample sizes, numbers of biomarkers and food quantity values have9 d θ γ X d Y (a) z j α β σ p µ α σ α µ β σ β σ d X d c jd π d θ γy j z ν P ν P (b) Figure 3: Hierarchical structure of (a) the components’ weights model and (b) that for latent intake predictions. Shadedcircles represent the data. Parameters and latent variables are indicated with transparent circles. No circles indicatehyperparameters, except for z and X d , which are conditioned on from the learning phase. been considered. To depict common experimental settings, the number of biomarkers considered is P = (3 , D = (3 , n = (30 , , , b . × n c observations.Three different specifications are considered for the α p and β p parameters, to represent different typesof biomarker measures. Also, across the P biomarkers, three different specifications are considered for σ p ,to explore the impact of increasing biomarker variability: small ( scenario 1 ), mixed ( scenario 2 ) or large( scenario 3 ) range of variance values. Finally, the mixture components’ parameters ( X d , σ d , π d ) employed reflectreal-world intervention study scenarios: “food quantities with stable increments” (“stable incremenents”), “foodquantities with increasing increments” (“increasing increments”), and “food quantities with decreasing increments”(“decreasing increments”). Two settings are implemented for the components’ variances σ d , corresponding tolow and high variability. We set π d ≈ D when simulating training datasets. Combinations of these settings (20simulated datasets for each) are used to investigate different aspects of performance in three simulation studiesdetailed below. Figure 4 illustrates two simulated training datasets with low and high biomarkers’ variabilitylevels.For the MCMC algorithm (Sections 3.3 and 3.4.1), 30000 iterations were run, both in the learning andthe prediction phases, with the first 6000 being discarded. The τ d parameters are fixed such that τ d = X d d toaccount for different gaps between the intervention studies’ food quantities, and to allow for good coverage ofthe latent intakes range. As discussed in Section 3.3 hyperparameters are fixed according to the observed datawith automatic procedures, allowing non-statisticians to easily use the proposed approach. Further details onthe simulation settings, hyperparameter specifications and initializations are deferred to Appendix A.1 andAppendix B.In the absence of comparable approaches for intake quantification in the presence of multiple biomarker data,we compare our results with those obtained using Bayesian linear regression (BLR) and partial least squares(PLS) regression. In BLR, the food quantities are regressed on a linear combination of the biomarkers in thetraining data with the resulting parameters used to predict intake in the test data. In PLS, the food quantities10 l l l l l l l l l l l l Portion B i o m a r k e r l e v e l
40 130 250 40 130 250 40 130 250
Food portion
Food quantity (a)
Scenario 1. l l l l l l l l l l l l l
Portion B i o m a r k e r l e v e l
45 95 239 45 95 239 45 95 239
Food portion
Food quantity (b)
Scenario 2.
Figure 4: Simulation study I. Examples of simulated biomarker data ( P = 3 , D = 3 ) under (a) the small (scenario 1) and(b) large (scenario 3) biomarkers’ variability scenarios. Here, n = 60 with (a) “stable increments” and (b) “increasingincrements” . are regressed on a dimensionally reduced representation of the biomarker data in the training set. Simulation study I assesses performance under increasing noise levels in the P biomarkers. In the case where P = D = 3, we compare results from the three biomarker variability scenarios. Table 1 (Simulation Study Icolumns) reports the median absolute errors in grams between true and estimated and predicted latent intakes.Overall, for estimated intakes, absolute error values are quite low, with median values ranging from 7gin scenario 1 (small biomarker variability) to 14g in scenario 3 (large biomarker variability). Notably, thewithin-scenario error variability is low, and in all cases is commensurate with the median error. Moreover, theabsolute errors are relatively stable across increasing noise levels σ p , suggesting robustness of the approach tobiomarker variability. Table 1 also shows that the estimated absolute errors under BLR and PLS are higher andmore uncertain than those obtained using the multiMarker approach. Also, the BLR and PLS errors increasewith increasing biomarker variability σ p .In terms of prediction the multiMarker approach performs well, but with larger errors in Scenario 3; however,the data simulated under Scenario 3 are extreme and unlikely to be seen in real applications (see Figure 4).The BLR prediction errors are much larger and more variable. Under PLS, the errors are comparable withthe proposed approach however, as is borne out in the application (see Section 5), error values under PLStend to be relatively low as predicted values tend to be close to the food quantities’ sample mean. While suchmean-prediction tendency yields good results in terms of median absolute error values, it corresponds to a lack11f precision in the predictions, with more extreme intakes being predicted in the middle of the intake range. While a small number of food quantities (e.g. D = 3) is practical and cost efficient a larger D could increase thecoverage of the intake range. Thus Simulation Study II investigates the impact of larger D on intake estimationand prediction. We analyse simulated data with P = 3 biomarkers and D = 6 food quantities, and compareresults with Simulation Study I where D = 3.Table 1 (Simulation Study II columns) reports the absolute errors computed between estimated and predictedlatent intakes and the truth where D = 6. As observed in Simulation Study I, the errors under BLR are large anduncertain when compared to the multiMarker and PLS approaches which perform similarly when the number offood quantities is large.The multiMarker approach performs similarly in Simulation Studies I and II where D = 3 or D = 6 respectively.This suggests that the predictive benefit of employing of a larger number of food quantities is minimal. Also,recalling that the the errors in Table 1 are summaries of a large variety of simulation settings, with differentfood quantities and sample sizes, suggests that modelling the latent intake range with a mixture distribution is aflexible and adaptive approach. Finally, in Simulation Studies I and II all of the other multiMarker parameters(see Section 3) have been recovered within a 99% credible interval. The multiMarker model (3) assumes that the relationship between the latent intakes and the observed biomarkersis linear. To assess the performance of the proposed approach in the presence of model misspecification, we havesimulated data according to the settings used in Simulation Studies I and II, but this time using a non-lineargenerating relationship between the intakes and the biomarkers: y ip = α p + β p z i + (cid:15) ip , ∀ i = 1 , . . . , n, p = 1 , . . . , P (10)Here the scaling coefficients β p have been re-scaled to 10% of the values used in Simulation Studies I and II (seeAppendix B), to obtain simulated biomarker data ranges similar to those in the two previous Simulation Studies.Table 1 (Simulation Study III columns) again reports the absolute errors between estimated and predictedlatent intakes and the true values. Here, error values are low and similar to those computed under SimulationStudies I and II. Indeed, median error values are always lower than 50g in the prediction setting, indicatingthat true intakes can still be recovered quite well, even when the underlying model is misspecified. This isdue to the role of the mixture distribution on the latent intake, which anchors the intakes around the foodquantities. However, the model parameters are hardly ever recovered and often overestimated, indicating thatthe model misspecification is absorbed by their estimates. The price of biased parameters estimates is a relativelylow one to pay in this context, as our goal is to obtain reliable predictions of intake. Further, the range oferrors confidence intervals is narrower in Simulation Study III than in the previous two studies. However,the results of the present study and those from Simulation Study I and II can only be partially compared.Indeed, recall that hyperparameters values are fixed across all Simulation Studies to allow comparability betweenparameter estimates. For a given set of hyperparameters values, the quadratic relationship between intakes12nd biomarkers in Simulation Study III produces biomarker data which tend to exhibit greater separation inbiomarker distributions between food quantities that are far apart from each other, when compared to biomarkerdata from Simulation Studies I and II (see Appendix B).Finally, comparison with the BLR and PLS results yields similar conclusions to those presented in SimulationStudies I and II. Further discussion of the Simulation Studies is deferred to Appendix B. Table 1: Simulation studies’ results. Median (
CI width) absolute error values (in grams) computed between true andestimated (E) or predicted (P) latent intakes. The values are reported for the three simulation studies (I,II,III) and thethree biomarkers’ variability scenarios (S.1,S.2,S.3). Results for the multiMarker (MM) model are reported, as well asthose from Bayesian linear regression (BLR) and PLS (PLS) regression.
Simulation StudyI II IIIModel S.1 S.2 S.3 S.1 S.2 S.3 S.1 S.2 S.3MM E 7(7) 8(6) 14(11) 8(7) 7(6) 10(6) 8(10) 9 (8) 16(5)P 7(16) 9(21) 31(57) 8(27) 10(38) 41(48) 8(15) 9(13) 36(9)BLR E 8(25) 10(33) 40(47) 9(25) 11(33) 40(41) 14(3) 14(7) 39(5)P 29(151) 69(255) 135(103) 43(153) 82(257) 125(142) 98(28) 69(82) 108(20)PLS E 8(25) 14(34) 40(47) 9(25) 15(35) 41(41) 14(3) 14(7) 47(5)P 9(26) 15(37) 42(50) 9(29) 15(38) 44(42) 14(3) 15(9) 50(5)
As detailed in Section 2, the motivating application was to infer the relationship between four biomarkers andapple intake from an intervention study, with a view to predicting intake when only the four biomarkers areavailable. The multiMarker framework was employed to infer the relationship between the biomarkers and foodintake, using 30000 MCMC iterations and discarding the first 6000, on a Intel(R) Core(TM) [email protected]. Model hyperparameters were fixed as discussed in Appendix A.1.Here we use leave-one-out cross validation to assess predictive performance on the apple data. Thus, n = 86models were fitted, each with a different set of ( n −
1) observations. In each case, the intake for the left outobservation was then predicted using its biomarker data only. The MCMC algorithms were run several timesand demonstrated no substantial difference in the parameters and latent intake predictions. The computationtime was approximately 3 .
47 minutes per model in the learning phase, and 6 .
78 seconds per data set in the13rediction phase. Details of Expected Sample Size (ESS) values are available in Appendix C.Results from the leave-one-out cross validation procedure are shown in Figure 5. Figure 5(a) shows boxplots ofthe median predicted intakes, computed from the latent intake posterior predictive distributions, split accordingto the true allocated apple quantity. Intake predictions are concentrated around the true apple intake quantities,with increasing variability in the predictions observed for those allocated the third quantity of 300 grams. Thisis not unexpected, as biomarker values related to the third apple quantity are quite variable (see Figure 1). Onlya few observations have predictions far from their true apple quantities. Indeed, the median absolute differencebetween predicted intakes and actual intake is 62 grams for the 50 grams apple quantity group, 33 grams for the100 grams quantity and 123 grams for the 300 grams quantity. Differences between apple quantities and medianpredicted intake values are further visualized in Figure 5(b). The 95% credible intervals for these differences areillustrated, showing that in general the latent intake posterior predictive distributions include the true applequantity values. Overall, there is good agreement between apple quantities and predicted intake, especiallywithin observations allocated the 50 and 100 grams quantities.To further visualize the results, Figure 6 presents posterior predictive intake distributions, for 6 observations,two for each apple quantity. Plots on the left side of Figure 6 correspond to accurately predicted intakes, whilethose on the right are poorly predicted. When intakes are correctly inferred, the range of the posterior predictivedistribution does not incorporate apple quantities far from the truth (e.g. observations 13, 32 and 82). For poorlypredicted intakes from the 50 or 100 gram apple quantities, the estimated posterior predictive distributionshave much higher uncertainty (e.g. observations 1 and 39). For observations allocated to the 300 grams applequantity, those with poor predictions tend to have notably underestimated predicted intakes (e.g. observation73), with a relatively narrow associated posterior predictive distribution range. Similar posterior predictivedistributions to those presented in Figure 6 have been obtained for the remaining observations. Notably theobservations with accurate predictions had largest posterior mean apple quantity probability of the correct applequantity while for those poorly predicted this was not the case.For comparative purposes, BLR and PLS regression were also used to predict apple intake via leave-one-outcross validation. Similar to the observed performance in the simulation studies, both methods proved unableto give precise intake predictions, with most predictions around the average of the three apple quantities( ≈
150 grams). Furthermore, 95% credible intervals obtained using BLR are more than twice the size ofthose obtained under the multiMarker framework, indicating low reliability of the predictions. Further, theuncertainty quantification under PLS regression was not useful, as the constructed 95% confidence intervals viacross validation yielded very narrow ranges. Further details on these results are deferred to Appendix C.To assess whether treating the n = 86 observations as independent was appropriate, we repeated the learningphase using slightly modified versions of the dataset. Specifically, we constructed modified versions of theoriginal dataset such that the same participant appeared only once i.e. each subject is present in only in oneapple quantity, which was selected at random. This procedure was repeated 100 times, and each time themultiMarker model was fitted to the modified dataset. Table 2 presents the posterior median parameter estimatesobtained from fitting the model to the original data and to the modified datasets. For the α p , β p , σ p parameters,dimensions 1 to 4 correspond to the four biomarkers, while for the σ d parameter the dimensions correspondto the three apple quantities. It is clear that the median estimates are similar. Also, posterior medians of the14arameters from the original data are always contained in the corresponding 95% credible intervals inferred fromthe modified datasets. These results suggest the inference is robust to our assumption of independence betweenthe n = 86 observations. Table 2: Posterior median (
CI width) parameter values inferred from the original apple data, and average posteriormedian (average
CI width) parameter values inferred from the modified apple datasets.
ParameterData Dimension α p β p σ p σ d Original 1 0.351 (0.725) 0.002 (0.004) 1.486 (0.411) 4.145 (2.763)2 0.673 (0.855) 0.003 (0.005) 1.501 (0.430) 6.734 (6.774)3 0.875 (1.009) 0.005 (0.005) 1.652 (0.497) 10.429 (13.220)4 1.005 (0.943) 0.004 (0.005) 1.588 (0.505) -Modified 1 0.409 (0.953) 0.002 (0.005) 1.503 (0.784) 3.834 (3.446)2 0.659 (1.231) 0.003 (0.005) 1.509 (0.781) 5.857 (7.026)3 0.851 (1.526) 0.005 (0.006) 1.689 (0.882) 9.634 (18.109)4 0.933 (1.527) 0.004 (0.008) 1.624 (0.856) -
Motivated by the need to infer the relationship between metabolomic biomarkers and food intake to faciltate futureintake predictions, we have introduced the general and flexible multiMarker framework. The proposed frameworkbuilds upon two classical regression models, multiple linear regression and ordinal regression, combining them inthe wider frameworks of factor analysis and mixture of experts models. The developed multiMarker frameworkfacilitates inference of the relationship between multiple biomarkers and food quantities, and allows for predictionof intake from multiple biomarkers data. The multiMarker framework advances on current approaches whichfocus on categorical intake quantification by providing more detailed predictions. Moreover, as the method isembedded in a Bayesian framework, uncertainty quantification is available, which is an important concern inboth the application domain and in a prediction context.Although the proposed approach has been motivated by the specifications of the A-DIET study, the overallmultiMarker framework is general and designed to be accessible to non-statisticians. In the multiMarkerframework, the extra layer of complexity introduced by the latent variable and associated model parametersbestows the advantage of more refined estimates and allows the model to adapt to different scenarios in the data.Furthermore, modelling the weights of the mixture distribution using a mixture of experts framework guidesthe quantitative prediction, informing the latter on which part of the intake range is to be considered whenpredicting intake. Unlike previous approaches, we explicitly model the ordinal nature of the food quantities.The multiMarker framework ability to model the relationship between biomarkers and food intake and15 lll
50 100 300
Apple quantity (grams) M ed i an p r ed i c t ed i n t a k e ( g r a m s )
50 gr100 gr300 gr llll
50 100 300 (a) − − − Observation P r ed i c t ed i n t a k e − A pp l e quan t i t y ( g r a m s ) median95%CIApple quantity:50 gr.100 gr.300 gr. (b) Figure 5: Apple intake predictions (in grams), obtained using the multiMarker framework. (a) Boxplots for the medianpredicted intakes, grouped by observed apple quantities. For clarity, horizontal red lines indicate the apple quantity values.(b) Dots represent the difference between posterior median predicted intake and known apple quantity, with vertical linesillustrating credible intervals. Horizontal lines represent overall median and credible intervals for the predictionerrors, and corresponding numerical values are reported in bold. Colours correspond to different apple quantities. ntake (grams) P o s t e r i o r p r ed i c t i v e d i s t r i bu t i on 020406080100120140160180200220240260280300320340360380400420440460480500520540 Observation 13
50 gr.100 gr.300 gr.medianportion (a)
Intake (grams) D en s i t y Observation 1
50 gr.100 gr.300 gr.medianapple quantity (b)
Intake (grams) D en s i t y Observation 32
50 gr.100 gr.300 gr.medianapple quantity (c)
Intake (grams) D en s i t y Observation 39
50 gr.100 gr.300 gr.medianapple quantity (d)
Observation 39.
Intake (grams) D en s i t y Observation 82
50 gr.100 gr.300 gr.medianapple quantity (e)
Intake (grams) D en s i t y Observation 73
50 gr.100 gr.300 gr.medianapple quantity (f)
Figure 6: Posterior predictive distributions of apple intake (in grams) for 6 observations. Orange and purple lines denotemedian predicted intake and known apple quantity, respectively. Posterior median apple quantity probabilities are alsoreported in the top horizontal bar in each plot.
17o predict food intake from biomarker data was demonstrated on apple data collected under the motivatingA-DIET research programme. Leave-one-cross validation results showed that generally intake predictions wereconcentrated around the true apple quantities. Comparison with existing regression models showed that themultiMarker framework was able not only to provide useful uncertainty quantification, but also much morereliable apple intake predictions.The proposed approach was also assessed in an extensive simulation study, where a large variety of biomarkerdata have been generated to check model performances under, among other things, different levels of biomarkeror intake variability. Furthermore, performances under model misspecification have been verified. In all of thecases the framework performed well, as intake values and their range were recovered with low error, even in amodel misspecification context.The proposed framework allows quantification of any unobserved quantity of interest for which priorinformation on its scale is observed (here, food quantities) and for which proxies can be measured (here,biomarkers). Furthermore, the observed variables’ scales are not relevant to recover that of the latent variable.Indeed, a collection of hyperparameters has been introduced to directly regulate different ranges between thebiomarkers. Such a feature is appealing for researchers, as biomarkers for the same intake could correspond toquite different measurement ranges, due to their sources (blood, urine, etc), the instrument used for collectionand so on.Although designed for a particular problem, the quantification of apple intake from a panel of urinarybiomarkers, the proposed method has general applicability outside of nutrition. Indeed, the framework couldhave applicability in any scenario where multiple outcomes are associated with an unobserved variable of interest,such as in toxicology or social science studies. Possible extensions to the proposed model are many and varied,including explicitly modelling the repeated biomarker measurements in the same group of participants. Further,the introduction of subject-specific covariates in the latent intake mixture weights, to allow a more flexible fit ofthe model, would be feasible. Thus there is much potential in the use of latent variable models to infer foodintake.
Acknowledgements
Supported by a research grant from the European Research Council (ERC)(647783) and a Science FoundationIreland grant (SFI / / RC / References
Agresti, A. (1999), ‘Modelling ordered categorical data: recent advances and future challenges’,
Statistics inmedicine , 2191–2207.Baldrick, F., Woodside, J., Elborn, J., Young, I. & McKinley, M. (2011), ‘Biomarkers of fruit and vegetable18ntake in human intervention studies: a systematic review’, Critical Reviews in Food Science and Nutrition , 795––815.Bhattacharya, A. & Dunson, D. (2011), ‘Sparse Bayesian infinite factor models’, Biometrika (2), 291––306.Bingham, S. (2002), ‘Biomarkers in nutritional epidemiology’, Public Health Nutrition (6(A)), 821–827.Cagnone, S. & Viroli, C. (2012), ‘A factor mixture analysis model for multivariate binary data’, StatisticalModelling (3), 257––277.Dragsted, L. et al. (2018), ‘Validation of biomarkers of food intake—critical assessment of candidate biomarkers’, Genes and Nutrition (14).Frühwirth-Schnatter, S. & Lopes, H. (2018), ‘Sparse Bayesian factor analysis when the number of factors isunknown’, arXiv:1804.04231 .Galimberti, G., Montanari, A. & Viroli, C. (2009), ‘Penalized factor mixture analysis for variable selection inclustered data’, Computational Statistics and Data Analysis (12), 4301–4310.Gao, Q. et al. (2017), ‘A scheme for a flexible classification of dietary and health biomarkers’, Genes andNutrition (34).Garcia-Aloy, M., Rabassa, M., Casas-Agustench, P., Hidalgo-Liberona, N., Llorach, R. & Andres-Lacueva,C. (2017), ‘Novel strategies for improving dietary exposure assessment: Multiple-data fusion is a moreaccurate measure than the traditional single-biomarker approach’, Trends in Food Science and Technology (B), 220–229.Gormley, I. & Frühwirth-Schnatter, S. (2019), Handbook of Mixture Analysis. Chapter 12: Mixture of ExpertsModels , Chapman and Hall/CRC.Gürdeniz, G. et al. (2016), ‘Detecting beer intake by unique metabolite patterns’,
Journal of Proteome Research (12), 4544–4556.Jacobs, R., Jordan, M., Nowlan, S. & Hinton, G. (1991), ‘Adaptive mixtures of local experts’, Neural Computation , 79–87.Kipnis, V. et al. (2002), ‘Bias in dietary-report instruments and its implications for nutritional epidemiology’, Public Health Nutrition (6(A)), 915–923.Knott, M. & Bartholomew, D. (1999), Latent variable models and factor analysis. Number 7. , Edward Arnold,London, second edition.Lin, T., McLachlan, G. & Lee, S. (2016), ‘Extending mixtures of factor models using the restricted multivariateskew-normal distribution’,
Journal of Multivariate Analysis , 398–413.Lloyd, A., Willis, N., Wilson, T., Zubair, H., Chambers, E., Garcia-Perez, I., Xie, L., Tailliart, K., Beckmann,M., Mathers, J. & Draper, J. (2019), ‘Addressing the pitfalls when designing intervention studies to discoverand validate biomarkers of habitual dietary intake’,
Metabolomics .19opes, H. & West, M. (2004), ‘Bayesian model assessment in factor analysis’, Statistica Sinica pp. 41—-67.McLachlan, G., Bean, R. & Jones, L. (2007), ‘Extension of the mixture of factor analyzers model to incorporatethe multivariate t-distribution’,
Computational Statistics and Data Analysis (11), 5327–5338.McNamara, A., Collins, C., Sri Harsha, P., González-Peña, D., Gibbons, H., McNulty, B., Nugent, A., Walton,J., Flynn, A. & Brennan, L. (2020), ‘Metabolomic-based approach to identify biomarkers of apple intake’, Molecular Nutrition and Food Research .Montanari, A. & Viroli, C. (2010), ‘Heteroscedastic factor mixture analysis’,
Statistical Modelling (4), 441—-460.Morgan, B. & Smith, D. (1992), ‘A note on wadley’s problem with overdispersion’, Applied Statistics (2), 287–497.Murphy, K., Viroli, C. & Gormley, I. (2020), ‘Infinite mixtures of infinite factor analysers’, Bayesian Analysis .Murray, P., Browne, R. & McNicholas, P. (2014), ‘Mixtures of skew-t factor analyzers’,
Computational Statisticsand Data Analysis , 326–335.Pison, G., Rousseeuw, P., Filzmoser, P. & Croux, C. (2003), ‘Robust factor analysis’, Journal of MultivariateAnalysis , 145—-172.Ročková, V. & George, E. (2016), ‘Fast Bayesian factor analysis via automatic rotations to sparsity’, Journal ofthe American Statistical Association , 1608—-1622.Rothwell, J. et al. (2014), ‘New biomarkers of coffee consumption identified by the non-targeted metabolomicprofiling of cohort study subjects’,
PLOS One .Siddique, J., Daniels, M., Carroll, R., Raghunathan, T., Stuart, E. & Freedman, L. (2019), ‘Measurement errorcorrection and sensitivity analysis in longitudinal dietary intervention studies using an external validationstudy’,
Biometrics (3), 927–937.Subar, A., Kipnis, V., Troiano, R., Midthune, D., Schoeller, D., Bingham, S., Sharbaugh, C., Trabulsi, J.,Runswick, S., Ballard-Barbash, R., Sunshine, J. & Schatzkin, A. (2003), ‘Using intake biomarkers to evaluatethe extent of dietary misreporting in a large sample of adults: The open study’, American Journal ofEpidemiology (11), 1–13.Vázquez-Manjarrez, N. et al. (2019), ‘Discovery and validation of banana intake biomarkers using untargetedmetabolomics in human intervention and cross-sectional studies’,
The Journal of Nutrition (10), 1701–1713.20
Posterior distribution and MCMC algorithm
Considering the multiMarker model’s likelihood, prior and hyperprior distributions, the posterior distribution is: P (Ω | Y ) = L ( Y | α, β, z , Σ) p ( α | µ α , σ α ) p ( µ α | m α , τ α , σ α ) p ( β | µ β , σ β ) p ( µ β | m β , τ β , σ β ) p ( σ β | ν β , ν β ) p ( z | π, X , . . . , X D , π, σ , . . . , σ D ) p ( π | θ, γ, Y , c ) p ( γ ) p ( θ ) p ( σ , . . . , σ D | ν z , ν z ) p (Σ | ν p , ν p ) (11)where for brevity Ω denotes the set of model parameters. A.1 Hyperparameter settings
Given an observed dataset, hyperparameter values are fixed automatically according to the following procedures.The overall means m α and m β are fixed, respectively, as the estimated intercept and slope coefficient of themultiple linear regression defined using biomarkers as response variable and food quantity values as predictor.Variances’ hyperparameters are fixed differently according to the parameter to which they refer: ( ν β , ν β ) = (2 , ν p , ν p ) = (1 , n ), and ( ν z , ν z ) = ( { D, . . . , } , n ). Regarding the α and β vector of parameters, their values areinitialized solving the following system of equations: X ( x i = X d ) ( y ip ) ≈ α p + β p x d for p = 1 , . . . , P and d = 1 , . . . , D. Biomarkers’ error variances are initialized exploiting the definition of estimated error variances under thefactor analytic model, adjusted for the extra variability brought in by the latent intakes prior distribution:Σ = (cid:92) V ( Y ) − D ββ T . Last, mixture components’ variance parameters are initialized with the following values: σ d = 1 P P X p =1 (cid:92) V ( Y pd ) − σ α − σ p σ β where (cid:92) V ( Y pd ) = ˆ var (cid:0) ( x i = X d ) y ip (cid:1) . A.2 Components’ weights
The prior distributions for the θ = { θ , θ , . . . , θ D } weights’ parameters are defined to represent the constrainedcharacteristic of ordinal data, that is: X < · · · < X d < · · · < X D . These constraints correspond to thefollowing in terms of model parameters: θ < · · · < θ d < · · · < θ D . A natural choice for the corresponding priordistributions is the following: θ d ∼ N ( θ d − ,θ d +1) (cid:0) , σ θ d (cid:1) , for d = 1 , . . . D − θ d parameters are updated with a random walk Metropolis Hastings stepinside the MCMC algorithm. Hyperparameters σ θ d , d = 1 , . . . , ( D − σ θ d < = 2. Smallvariations around this threshold have been tested and did not produce any substantial difference in the θ d parameter estimates. Further, biomarkers’ intercepts γ = { γ , . . . , γ P } have been given standard Gaussian priordistributions. Parameter estimates are initialized via their corresponding estimates obtained with the ordinalNet package, available on CRAN ( https://CRAN.R-project.org/package=ordinalNet ).21 .3 Latent intake posterior predictive distribution In Section 3.4 we have introduced a sampling distribution for the latent intakes, to be used in a predictionframework. This distribution was derived as the product of two terms, the first being the log-likelihood of themodel, expressed as a function of the latent intakes: ‘ ( y ∗ j | α, β, z ∗ j , Σ) = p ( z ∗ j | α, β, y ∗ j , Σ) = P X p =1 p ( z ∗ j | α p , β p , y ∗ j , σ p )= P X p =1 −
12 log (cid:0) πσ p (cid:1) − σ p (cid:18) z ∗ j β p − (cid:0) y ∗ jp − α p (cid:1)(cid:19) ! = P X p =1 −
12 log (cid:0) πσ p (cid:1) − β p σ p (cid:18) z ∗ j − (cid:0) y ∗ jp − α p β p (cid:1)(cid:19) ! ∝ P X p =1 − β p σ p (cid:18) z ∗ j − (cid:0) y ∗ jp − α p β p (cid:1)(cid:19) ! ∝ − z ∗ j (cid:18) P X p =1 β p σ p (cid:19) + z ∗ j (cid:18) P X p =1 β p ( y ∗ jp − α p ) σ p (cid:19) = − σ z z ∗ j + µ z σ z z ∗ j , The second term is the mixture distribution presented in Section 3.2. The product of these two terms can beexpressed as a mixture distribution: p ( z ∗ j | Ω) = N [0 , ∞ ] (cid:0) µ z , σ z (cid:1) D X d =1 π d N [0 , ∞ ] (cid:0) X d , σ d τ d (cid:1) = D X d =1 π d N [0 , ∞ ] (cid:0) µ z , σ z (cid:1) N [0 , ∞ ] (cid:0) X d , σ d τ d (cid:1) = D X d =1 π d N [0 , ∞ ] (cid:18) µ z σ d τ d + X d σ z σ d τ d + σ z , (cid:0) σ d τ d + 1 σ z (cid:1) − (cid:19) = D X d =1 π d N [0 , ∞ ] ( µ zd , σ zd ) (12)where Ω = { µ z , . . . , µ zD , σ zq , . . . , σ zD } . Combining p ( z ∗ j | Ω) with the posterior distribution of the multiMarkermodel provides the posterior predictive distribution for z ∗ j : p ( z ∗ j | z ) ∝ R p ( z ∗ j | Ω) p (Ω | z ) d Ω, where p (Ω | z ) = L ( Y | α, β, z , Σ) p ( α | µ α , σ α ) p ( β | µ β , σ β ) p ( z | { X d , σ d , τ d , π d } Dd =1 ) p (Σ | ν p , ν p )When the parametrization for the components weights introduced in Section 3.4.1 is considered, the posteriorpredictive distribution for z ∗ j is updated to: p ( z ∗ j | z , y ∗ j ) ∝ R p ( z ∗ j | Ω , y ∗ j ) p (Ω | z ) d Ω, where p (Ω | z ) = L ( Y | α, β, z , Σ) p ( α | µ α , σ α ) p ( β | µ β , σ β ) p (Σ | ν p , ν p ) p ( z | { X d , σ d , τ d , π d } Dd =1 ) p ( { π d } Dd =1 | θ, γ, Y , c ) p ( θ ) p ( γ )22 Additional simulation study details
In the simulation studies described in the paper, biomarkers’ intercepts and scale coefficients ( α, β ) are sampledfrom their prior distributions with hyperparameters: { µ α , µ β , σ α , σ β } = { (1 , . , , . , (20 , . , , . , (100 , , , } These hyperparameter specifications correspond to small, mixed and large range biomarker values, and areused to represent different types of biomarker measures (as for example measurements coming from differentinstruments or measurements non normalised by osmolality). Regarding the error variance terms σ p , thesevalues are sampled from Inverse Gamma distributions with expected values dependent on the biomarkers’ rangeconsidered: (1 , ,
15) for small variances and (3 , , X d are sampled from D zero-truncated Gaussian distributionswith means µ X d ranging in between 30 and 250, with µ X d < µ X d +1 . Three different settings are explored torepresent “food quantities with stable increments”, “food quantities with increasing increments”, and “foodquantities with decreasing increments”. In the first setting (“stable increments”), means µ X d are equispaced,that is d ( µ X d − , µ X d ) = d ( µ X d , µ X d +1 ). Instead, in the second and third settings we have that d ( µ X d − , µ X d )
Figure 7 reports an example of a comparison between biomarker data simulated in Simulation Study I (linearrelationship) and Simulation Study III (non-linear relationship), obtained using similar specifications of themodel parameters. In the example, n = 60 is considered, under the “increasing increments” food quantitiessetting. B.2 Comparison with PLS and Bayesian linear regression models
To compare the three Simulation Studies, Figure 8 presents the median prediction absolute error ratios obtainedusing either PLS or Bayesian linear regression models, and the multiMarker model. In approximately 50% ofthe cases, prediction error values from a Bayesian linear regression model are up to ten times those from themultiMarker model. For the other half of the cases, Bayesian linear regression performs up to 50 times worsethan the proposed one in terms of prediction error, with greater differences observed in Simulation Study I. PLSregression performs worse than the multiMarker model in approximately 85% of the simulations, with predictionerror values up to two to four times those of the proposed model. In the 15% of cases in which the error ratiovalues favor the PLS model, error values from the multiMarker framework are always less than twice thoseobtained with PLS regression. Absolute errors are computed between the estimated or predicted intakes and the true ones. igure 7: Simulation study I and III. Examples of simulated biomarker data ( P = 3 , D = 3 ) under the small (scenario 1)biomarkers’ variability scenario, and Simulation Study I (black boxplots) and Simulation Study III (grey boxplots). Here, n = 60 with “increasing increments” in food quantities. l l l l l Simulation study P r ed i c t ed i n t a k e ab s o l u t e e rr o r r a t i o ( B l m vs mM ) I II III (a)
Bayesian linear regression vs. the multiMarker model. l l l l l
Simulation study P r ed i c t ed i n t a k e ab s o l u t e e rr o r r a t i o ( P L S vs mM ) I II III (b)
PLS regression vs. the multiMarker model.
Figure 8: Absolute error ratios between (a) the Bayesian linear regression and (b) PLS regression models and themultiMarker model. Absolute error values are computed between true and predicted latent intakes. Results are presentedfor the three different Simulation Studies. The red line in the plots corresponds to an equivalence of the models considered,with values above this line indicating a better performance by the multiMarker model. Apple intake data
C.1 Data scaling
The original values for Epicatechin Sulfate, (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHM P M B − SA ) and Glucodistylin biomarkers caused computational instability (most values were larger than 10 ) andconsequently were scaled. Given a biomarker p , the corresponding scaled values ˜ y ip are computed as follows:˜ y ip = y ip − ¯ y p sd ( y p ) + 2 (cid:12)(cid:12)(cid:12)(cid:12) min i =1 ,...,n (cid:16) y ip − ¯ y p sd ( y p ) (cid:17)(cid:12)(cid:12)(cid:12)(cid:12) where y ip are the original measurements, i = 1 , . . . , n . The original measurements mean and standard deviationare denoted with ¯ y p and sd ( y p ), respectively. The transformation did not alter the correlations between the fourbiomarkers. C.2 Comparison to BLR and PLS regression
Figure 9 reports plots that are analogous to those in Figure 5 of the paper, obtained using either BLR or PLSregression, for a comparison.
C.3 MCMC diagnostics
Table 3 reports a summary of the ESS (Expected Sample Size) values for the model parameters (30000 MCMCiterations considered). Figures 10 and 11 report, respectively, the estimated posterior distributions for the α p and β p parameters, p = 1 , . . . , P , in the apple intake data. In addition, Figures 12 and 13 report, respectively,the trace plots for the estimated α p and β p parameters, p = 1 , . . . , P , in the apple intake data. Table 3: Apple data. Summary of the ESS (Expected Sample Size) values for the model parameters (
MCMCiterations considered).
Parameter µ α µ β σ β α β σ p σ d z γ θ min 1813 14786 14586 5233 5381 18105 1008 19380 12469 11202median 1813 14786 14586 6524 6659 19069 1094 26310 12554 11575Max 1813 14786 14586 18238 18917 20001 1293 28141 12639 1281225 l
50 100 300
Apple quantity (grams) M ed i an p r ed i c t ed i n t a k e ( g r a m s )
50 gr100 gr300 gr l l
50 100 300 (a) − − − Observation P r ed i c t ed i n t a k e − P o r t i on ( g r a m s )
34 gr−81 gr60 gr median95%CIApple quantity:50 gr.100 gr.300 gr. (b) llll lll
50 100 300
Apple quantity (grams) M ed i an p r ed i c t ed i n t a k e ( g r a m s )
50 gr100 gr300 gr llll lll
50 100 300 (c) − − − Observation P r ed i c t ed i n t a k e − A pp l e quan t i t y ( g r a m s ) median95%CIApple quantity:50 gr.100 gr.300 gr. (d) Figure 9: Plots (a) and (b): intake predictions (in grams), obtained fitting the PLS regression model and using leave oneout cross validation. Plots (c) and (d): intake predictions (in grams), obtained fitting the Bayesian linear regression modeland using leave one out cross validation. These plots are analogous to those in Figure 5 of the paper. alue D en s i t y .
224 0 .
448 0 .
672 0 .
896 1 . (a) Xylose.
Value D en s i t y . . . . . (b) Epicatechin Sulfate . Value D en s i t y .
396 0 .
792 1 .
188 1 .
584 1 . (c) (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHMP MB − SA ). Value D en s i t y . . . . (d) Glucodistylin . Figure 10: Apple data. Estimated posterior distributions for the α p parameters, p = 1 , . . . , P . alue D en s i t y .
001 0 .
002 0 .
004 0 .
005 0 . (a) Xylose.
Value D en s i t y .
002 0 .
003 0 .
005 0 .
006 0 . (b) Epicatechin Sulfate . Value D en s i t y .
002 0 .
004 0 .
007 0 .
009 0 . (c) (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHMP MB − SA ). Value D en s i t y .
002 0 .
004 0 .
006 0 .
008 0 . (d) Glucodistylin . Figure 11: Apple data. Estimated posterior distributions for the β p parameters, p = 1 , . . . , P . . . . . . . . Iteration (after burnIn) V a l ue (a) Xylose . . . . . . . . Iteration (after burnIn) V a l ue (b) Epicatechin Sulfate . . . . . . . . Iteration (after burnIn) V a l ue (c) (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHMP MB − SA ). . . . . . . . Iteration (after burnIn) V a l ue (d) Glucodistylin . Figure 12: Apple data. Trace plots for the estimated α p parameters, p = 1 , . . . , P . Correlation values computed betweenthe four intercept parameters’ chains lie in ( − . , . . . . . . Iteration (after burnIn) V a l ue (a) Xylose. . . . . Iteration (after burnIn) V a l ue (b) Epicatechin Sulfate . . . . . Iteration (after burnIn) V a l ue (c) (4 − − [2 − (2 , − dihydroxyphenyl ) − − oxoethyl ] − DHMP MB − SA ). . . . . Iteration (after burnIn) V a l ue (d) Glucodistylin . Figure 13: Apple data. Trace plots for the estimated β p parameters, p = 1 , . . . , P . Correlation values computed betweenthe four scaling coefficients parameters’ chains lie in ( − . , . ..