Parsimonious Bayesian Factor Analysis for modelling latent structures in spectroscopy data
Alessandro Casa, Tom F. O'Callaghan, and Thomas Brendan Murphy
School of Mathematics & Statistics, University College Dublin
School of Food & Nutritional Sciences, University College Cork
Vistamilk SFI Research Centre
Abstract
In recent years, within the dairy sector, animal diet and management practices have been receiving increased attention, in particular examining the impact of pasture-based feeding strategies on the composition and quality of milk and dairy products, in line with the increased prevalence of premium grass-fed dairy products appearing on market shelves. To date, there are limited testing methods available for the verification of grass-fed dairy and, as a consequence, these products are susceptible to food fraud and adulteration. Therefore, with this in mind, enhanced statistical tools studying potential differences among milk samples coming from animals on different feeding systems are required, thus providing increased security around the authenticity of the products.

Infrared spectroscopy techniques are widely used to collect data on milk samples and to predict milk-related traits and characteristics. While these data are routinely used to predict the composition of the macro components of milk, each spectrum also provides a reservoir of unharnessed information about the sample. The accumulation and subsequent interpretation of these data present a number of challenges due to their high dimensionality and the relationships amongst the spectral variables.

In this work, directly motivated by a dairy application, we propose a modification of standard factor analysis to induce a parsimonious summary of spectroscopic data. The proposed procedure maps the observations into a low-dimensional latent space while simultaneously clustering the observed variables. The proposed method indicates possible redundancies in the data and it helps disentangle the complex relationships among the wavelengths. A flexible Bayesian estimation procedure is proposed for model fitting, providing reasonable values for the number of latent factors and clusters.
The method is applied to milk mid-infrared (MIR) spectroscopy data from dairy cows on distinctly different pasture and non-pasture based diets, providing accurate modelling of the data correlation, a clustering of the variables in the data, and information on differences between milk samples from cows on different diets.
Keywords:
Food authenticity studies, dairy science, spectroscopy, chemometrics, factor analysis, redundant variables, clustering, Gibbs sampling
1 Introduction

In recent years, the food industry has gone through rapid changes, partially due to ever evolving consumer preferences and increased consumer awareness of food and health. We are currently at the forefront of growing demand for detailed and accurate knowledge concerning food quality, authentication and security. The sectors potentially most vulnerable to these changing trends are those processing and preparing foodstuffs of animal origin. As a consequence, the dairy sector has been particularly involved in this transition, with increasing attention towards product quality, traceability and adherence to procedures respectful of animal welfare and the environment.

In this scenario, one of the aspects gaining more attention and relevance is concerned with the cattle feeding regimen. In general, consumers regard pasture-based feeding as more respectful of animal well-being, producing products that are more natural and healthy (Elgersma, 2012). These perceptions appear to have some basis in fact, as mounting evidence has demonstrated that outdoor pasture-based feeding of cows results in milk and dairy products with enhanced beneficial nutrients compared to indoor total mixed ration (TMR) like systems (O'Callaghan et al., 2016b,a; O'Callaghan et al., 2017). Furthermore, pasture-based feeding has been demonstrated to lead to ameliorations in the organoleptic characteristics of dairy products, providing signature-like traits (Faulkner et al., 2018; Alothman et al., 2019; Garvey et al., 2020) and improved quality. As such, in many markets, grass-fed dairy products usually command a premium price from consumers and, as often happens with expensive products, milk produced by grass-fed cows is susceptible to food adulteration and fraud.
As a direct consequence, there has been an increased requirement for the development of methods capable of detecting the presence of adulterants and authenticating the traceability of the milk (see Kamal and Karoui, 2015, for a recent review). Several of the different approaches proposed in the literature rely on the identification of one or more useful biomarkers in order to build meaningful authentication methods (see Capuano et al., 2014, and references therein). Furthermore, O'Callaghan et al. (2016b, 2018) highlighted the ability of fatty acid profiling, coupled with multivariate analysis and H-NMR, to distinguish between milk from pasture and TMR based diets. Nonetheless, these techniques are considered to be expensive and time consuming since they require laboratory extraction routines to collect the data, compromising their widespread effective utility.

On the other hand, vibrational spectroscopy techniques, such as Fourier transform near-infrared (NIR) and mid-infrared (MIR) spectroscopy, are known to be cheap, rapid and non-destructive alternatives to collect large amounts of data and to analyze different biological materials. Such methods have already been proven useful in food authenticity studies (e.g. Murphy et al., 2010); readers may refer to Downey (1996) and Reid et al. (2006) for more thorough reviews. When a material is analyzed via MIR spectroscopy, light is passed through a sample of that material at a sequence of wavelengths in the mid-infrared region (900 to 5000 cm⁻¹). The passage of the light activates the sample's chemical bonds, leading to an absorption of energy from the light itself.
The amount of energy absorbed or transmitted by the sample, at different wavelengths, creates the spectrum of that sample, which might be subsequently used to analyze its characteristics (see Figure 1 for a graphical illustration of some MIR spectra).

Vibrational spectroscopy techniques have already been fruitfully used in the dairy framework to determine milk characteristics such as protein, lactose and casein concentration (De Marchi et al., 2014), as well as to predict milk and animal related traits such as milk fatty acids (Bonfatti et al., 2017) and energy efficiency and intake (McParland et al., 2014; McParland and Berry, 2016). However, the usefulness of infrared spectroscopy data to authenticate cow feeding regimens has been less widely explored, even though standard classification tools have produced promising results in Coppa et al. (2012) and Valenti et al. (2013).

Therefore, a thorough exploration of the features of spectroscopy data, when used to analyze samples coming from differently fed animals, is somehow still missing. From a statistical point of view, NIR and MIR spectroscopy data introduce some challenges that have to be carefully addressed. The first one is concerned with their high dimensionality since, usually, each single observation consists of more than 1000 transmittance or absorbance values over the MIR or NIR regions. Moreover, these data are highly correlated, with underlying chemical processes entailing rather complex correlation structures, as can be seen in Figure 2. From the figure it is clear how, albeit adjacent wavelengths tend to be highly correlated, strong relationships are witnessed among regions of the spectrum far from each other. The information contained in each spectrum is indeed known to be structured in a rather complicated way and possibly spread over different locations.
Figure 1: The mid-infrared spectra recorded for a subset of the analyzed milk samples corresponding to different diet regimens. The samples are colored by diet: pasture diet in red, total mixed ration diet in blue.
For these reasons, statistical methodologies able to provide a parsimonious representation of the correlation structures in spectroscopy data turn out to be particularly useful. On one hand, they can mitigate high-dimensionality related issues, by summarizing parsimoniously the information in a lower dimensional representation. On the other hand, a proper reconstruction of the relations seen in Figure 2 may help in identifying which wavelength regions are carrying similar information, when the aim is to discriminate between samples coming from differently fed animals. This identification, when coupled with subject-matter knowledge, can highlight which chemical structures are responsible for the structures seen in the milk sample spectra. Moreover, it can be useful to identify the main nutritional differences implied in the milk from different diet regimens, serving as a stepping stone for classification purposes.

When analyzing spectroscopy data, latent variable models are considered to be the state of the art to tackle some of the challenges mentioned above. Techniques such as Partial Least Squares (PLS) and principal component analysis (PCA) are widely used both for predictive purposes and to reduce the dimensionality of the data, by summarizing the information in a lower number of newly built features. In a similar fashion, factor analysis (FA; Everitt, 1984; Bartholomew et al., 2011) provides a parsimonious representation of the observed data, by building new variables called factors, while simultaneously explaining the correlation among high-dimensional observations. For this reason, when the aim is to reconstruct structures such as the ones in Figure 2, factor analysis represents a suitable strategy to follow. Nonetheless, even if FA effectively reduces the dimensionality of the data, standard FA does not provide information about possible redundancies in the observed features. For this reason, in this work, we propose a suitable modification of the standard factor analysis model which allows the detection of redundant variables and which produces a partition of the variables themselves, thus possibly gaining useful insights about similarly behaving spectral regions.

The rest of the paper is structured as follows. In Section 2 we describe the mid-infrared spectroscopy data which motivate our proposal. In Section 3 we outline the proposed methodology, with a specific focus on the proposed Bayesian estimation procedure and on the involved model selection steps. Some analyses on synthetic datasets are reported in Section 4, while in Section 5 we present the results obtained on the milk spectroscopy data. Finally, in Section 6, we conclude with some final remarks and highlight some advantageous avenues for future research.
Figure 2: Sample correlation matrices computed on the milk samples produced by pasture fed cows (onthe left) and total mixed ration fed cows (on the right).
2 Mid-infrared spectroscopy data

The data we consider in this study consist of a collection of mid-infrared spectra from 4320 milk samples produced from cattle on three dietary treatments over a three year period on the Teagasc Moorepark Dairy Research Farm (Fermoy, Co. Cork, Ireland). The data are comprised of spectra extracted from morning (am) and evening (pm) milk samples collected weekly from Holstein-Friesian cows using a ProFoss FT6000 series instrument (FOSS, Ireland) between the period of May and August in 2015, 2016 and 2017. In each year, 54 cows were randomly assigned to each of the dietary treatments for the entire lactation period of that year. Treatments included grass (GRS), which consisted of cows maintained outdoors on a perennial ryegrass sward only; clover (CLV), whereby cows were maintained outdoors on a perennial ryegrass with 20% white clover sward only; and total mixed ration (TMR), where cows were maintained indoors year round and nutrients are combined together in a single nutritional mix consisting of grass silage, maize silage and concentrates. For further information on the experimental design and dietary treatments, see Faulkner et al. (2018), O'Callaghan et al. (2016a), O'Callaghan et al. (2016b), O'Callaghan et al. (2017) and O'Callaghan et al. (2018). More specifically, 2931 samples come from cows being fed with pasture (GRS and CLV), while the remaining 1389 come from cows being fed with TMR. Note that the original milk samples may be grouped according to the feeding system into three different classes (GRS, CLV and TMR) rather than two (pasture and TMR). In practice, in our work the first two classes have been merged together into a general pasture-based diet group, because of their strong similarities from a compositional perspective.

The total number of cows involved in the experiment is equal to 120, thus implying that multiple animal measurements are available, with a mean number of 36 samples per cow.
The samples have been collected following a yearly balanced scheme and they represent a balance of different parities. The samples considered in this work have been restricted to the ones collected mainly in the summer months, since this represents a period of milk production with the highest prevalence of grass growth. Note that, for each sample, a spectrum consists of 1060 transmittance measurements in the region going from 925 cm⁻¹ to 5010 cm⁻¹. Finally, we have some additional information about fat, protein and lactose content in the available milk samples, obtained using channels on the FT6000 calibrated against wet chemistry results.

3 Factor analysis with redundant variables
As mentioned in the introduction, standard factor analysis (denoted as FA in the following) provides a convenient and parsimonious representation of the dependence structure among high-dimensional observations by mapping them into a low-dimensional latent space.

Let X = {x_1, ..., x_n}, with x_i ∈ R^p, be the set of observed data. Factor analysis models each observation x_i as a linear combination of latent variables, called factors, as follows:

x_i = µ + Λ u_i + ε_i,   i = 1, ..., n,   (1)

where µ ∈ R^p is the mean vector, Λ = {λ_jk}_{j=1,...,p; k=1,...,K} is a p × K factor loadings matrix with K being the number of factors, u_i ∈ R^K denotes the factor scores vector, while ε_i ∼ N_p(0, Ψ) is an idiosyncratic error term, where Ψ = diag(ψ_1, ..., ψ_p), with the ψ_j's often referred to as the uniquenesses. Without loss of generality, we assume in the following that the data have been centered, hence µ = 0. Moreover, the latent factors are assumed to be normally distributed with zero mean and covariance equal to the identity matrix. Consequently, we have that (x_i | u_i) ∼ N_p(Λ u_i, Ψ), while marginally x_i is distributed according to a Gaussian distribution with zero mean and covariance matrix

Σ = Λ Λ^T + Ψ.   (2)

In practical applications the number of variables p is considerably higher than the number of factors K. Therefore the decomposition in (2) introduces a convenient and parsimonious representation of the relationships among the observed features in high-dimensional settings. From (2), and recalling that Ψ is a diagonal matrix, it follows that, in standard FA, the correlation between the original variables is modelled via the loading matrix Λ. Therefore, in the literature, attention has been focused on how to model and estimate Λ appropriately. In recent years a lot of different solutions have been proposed, both from a frequentist (see e.g.
Hirose and Konishi, 2012; Hirose and Yamamoto, 2015) and from a Bayesian (see e.g. Bhattacharya and Dunson, 2011; Legramanti et al., 2020) standpoint, to obtain sparse estimates of the loading matrix. Setting some values of Λ exactly equal to zero allows an even more parsimonious representation of the original covariance structure. Moreover, it can be convenient from an interpretative point of view, since relating each factor to a smaller number of observed variables helps to give meaning to the factors themselves. Lastly, note that if all the elements in the j-th row of the loading matrix are equal to zero, we obtain an indication of the uninformative nature of the j-th variable, it being uncorrelated with all the other features and essentially represented as noise.

The detection of uninformative features in the FA framework is tackled in the aforementioned references but, to the best of our knowledge, possible redundancy has not been tackled yet. A variable is defined as redundant when it carries information similar to the one provided by another variable (or variables), usually due to the strong correlation between them. The effective detection of redundancies, which is a challenging task, can lead to more parsimonious modelling. Note that, as briefly introduced in the previous sections, and as graphically represented in Figure 2, redundancy can be a complex issue when analyzing spectroscopy data. Therefore, in order to properly account for it, in this work we introduce a model in which some of the variables are mapped into the latent space by means of the same loading coefficients, thus giving an indication of possible grouping structures in the observed features themselves. The proposed model is then defined as follows:

x_i = Z Λ_c u_i + ε_i = Λ̃ u_i + ε_i,   i = 1, ..., n,   (3)

where x_i, u_i and ε_i are defined as before, while Z = {z_j}_{j=1,...,p}, with z_j = (z_{j1}, ..., z_{jG}), is a p × G latent allocation matrix, where G is the number of variable clusters.
Here the standard binary partition is adopted for Z; therefore z_{jg} = 1 if the j-th variable belongs to the g-th group, and 0 otherwise. Lastly, Λ_c = {Λ_{c,g}}_{g=1,...,G}, with Λ_{c,g} = (λ_{c,g1}, ..., λ_{c,gK}), is a G × K matrix whose g-th row contains the unique and representative loading values for the g-th variable cluster. Note that, as a direct consequence of the specification of model (3), Λ̃ has duplicate row values. We believe this constitutes a sensible way to account for redundancy in the observed features, by simply constraining the relations with the latent factors to be equal for those variables belonging to the same cluster.

The distributional properties highlighted above for the standard FA model are still valid. Therefore, we have that (x_i | u_i, z) ∼ N_p(Λ̃ u_i, Ψ), while (x_i | z) ∼ N_p(0, Σ̃), where Σ̃ = Λ̃ Λ̃^T + Ψ. It is straightforward to see how the proposed model induces an even more parsimonious decomposition of the covariance matrix. In fact, the specification of our model entails a possibly drastic reduction in the total number of covariance parameters to estimate: for model (3) this number is equal to (G × K) + p, while for model (1) it is equal to (p × K) + p; due to rotational invariance in FA, the number of identifiable parameters is fewer than this. Clearly, the smaller the number of variable clusters G, hence the more redundancy observed in the data, the greater the reduction will be.

From an interpretative point of view, the estimation of the allocation matrix Z allows us to obtain a clustering of the variables. This partition gives insights into the redundancy phenomenon under study, by clearly highlighting which variables are strongly correlated and provide similar information.

Finally, note that our proposal may be adapted in order to detect both redundant and uninformative variables, by a priori forcing all the elements in a single specific row of Λ_c to be exactly equal to zero.
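To make the constructions in (1) and (3) concrete, the following is a minimal numpy sketch with toy dimensions (illustrative only, not taken from the milk application): it builds the duplicated-row loading matrix Λ̃ = ZΛ_c, the implied covariance Σ̃ = Λ̃Λ̃^T + Ψ, and compares the covariance parameter counts of the two models.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): p variables, K factors, G variable clusters.
p, K, G = 12, 3, 4

# Cluster-level loadings Lambda_c (G x K) and a hard binary allocation Z (p x G).
Lambda_c = rng.normal(size=(G, K))
labels = rng.integers(0, G, size=p)   # cluster membership of each variable
Z = np.zeros((p, G))
Z[np.arange(p), labels] = 1.0

# Model (3): variables in the same cluster share a row of the loading matrix.
Lambda_tilde = Z @ Lambda_c           # p x K, with duplicated rows
psi = rng.uniform(0.1, 0.4, size=p)   # uniquenesses
Sigma_tilde = Lambda_tilde @ Lambda_tilde.T + np.diag(psi)

# Simulating from the model: x_i = Lambda_tilde u_i + eps_i, so that
# marginally x_i ~ N_p(0, Sigma_tilde), mirroring the decomposition (2).
n = 2000
U = rng.normal(size=(n, K))
X = U @ Lambda_tilde.T + rng.normal(size=(n, p)) * np.sqrt(psi)

# Covariance parameters: (G*K) + p under model (3) vs (p*K) + p under model (1).
print(G * K + p, p * K + p)   # 24 vs 48 in this toy setting
```

In this construction, forcing a whole row of Λ_c to zero zeroes out every corresponding row of Λ̃, which is the device mentioned just above for handling uninformative variables.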
This would imply that all the variables assigned to the corresponding group are modelled as noise and are uncorrelated with all the other variables.

Under the specification of model (3), and recalling that (x_i | u_i, z) ∼ N_p(Z Λ_c u_i, Ψ), the corresponding likelihood function is given by

L(X | Λ_c, Ψ, Z, U) = ∏_{i=1}^n (2π)^{-p/2} |Ψ|^{-1/2} exp{ -(1/2) (x_i − Z Λ_c u_i)^T Ψ^{-1} (x_i − Z Λ_c u_i) }
                    ∝ |Ψ|^{-n/2} exp{ -(1/2) tr[ Ψ^{-1} (X − U Λ_c^T Z^T)^T (X − U Λ_c^T Z^T) ] },   (4)

where U = {u_i}_{i=1,...,n}, with u_i = (u_{i1}, ..., u_{iK}), is the n × K factor scores matrix, while X, Λ_c, Z and Ψ are defined as in the previous section. Lastly, |A| and tr[A] denote respectively the determinant and the trace of a generic matrix A.

Different strategies might be adopted to estimate the parameters involved in (3). From a frequentist perspective, the maximum likelihood estimates are usually obtained via iterative algorithms such as the ones proposed in Jöreskog (1967), Jennrich and Robinson (1969) and Rubin and Thayer (1982). Conversely, in this work we adopt a Bayesian approach to factor analysis estimation (see e.g. Press and Shigemasu, 1989; Arminger and Muthén, 1998; Song and Lee, 2001). More specifically, we assume independent prior distributions for the model parameters as follows:

Λ_{c,g} ∼ N_K(0, σ²_λ I_K)   for g = 1, ..., G,   (5)
u_i ∼ N_K(0, I_K)   for i = 1, ..., n,   (6)
ψ_j ∼ IG(α, β_j)   for j = 1, ..., p,   (7)
z_j ∼ PPM(α_z)   for j = 1, ..., p.   (8)

The choice of the hyperparameters for the inverse gamma prior on the uniquenesses is guided by the suggestions in Frühwirth-Schnatter and Lopes (2010); there the authors avoid incurring in the Heywood problem by choosing α and β_j so that ψ_j tends to be bounded away from zero. More specifically, in our analyses we set α = 2.5 and β_j = (α − 1)/S^{-1}_{jj}, where S^{-1} represents the inverse of the sample covariance matrix. On the other hand, σ²_λ might be chosen to be subjectively large in order to consider an uninformative prior for the rows of Λ_c.

Some words of caution are required for the prior in (8). Let c be a clustering of the indices {1, ..., p}; even if different representations might be possible, we consider c = {C_1, ..., C_G} as a collection of disjoint subsets such that C_g contains all the indices of the variables belonging to the g-th cluster.
A product partition model (PPM; Hartigan, 1990; Barry and Hartigan, 1992) assumes that the prior probability for c is expressed as follows:

π(c = {C_1, ..., C_G}) ∝ ∏_{g=1}^G ρ(C_g),   (9)

where ρ(·) is known as the cohesion function. Since we have a one-to-one correspondence between the representation of the partition c as a collection of blocks and the one via the allocation matrix Z, with a slight abuse of notation we specify the prior as in (8) even in our framework. Several different specifications for ρ(·) have been proposed in the literature; in this work we consider π(c) ∝ α_z^G ∏_{g=1}^G (|C_g| − 1)!, where |C_g| denotes the cardinality of the g-th cluster, sharing strong connections with the Dirichlet process that is widely used in the Bayesian clustering framework (Quintana and Iglesias, 2003). Note that in our analyses we set α_z = 1 as is standard; however, this choice did not seem influential. When specifying a prior over the set of partitions, another reasonable approach within our framework would consist of borrowing ideas from the Bayesian spatial clustering literature, or of considering some additional information such as distances between the objects to be clustered (see e.g. Blei and Frazier, 2011; Page et al., 2016; Dahl et al., 2017; Wehrhahn et al., 2020, and references therein). In such a way, contiguous groups of wavelengths would be favored, leading to some advantages in the modelling process for some specific applications.

Given the likelihood function in (4) and the specification of the priors outlined above, the posterior distribution is defined as follows:

π(Λ_c, Ψ, Z, U | X) ∝ L(X | Λ_c, Ψ, Z, U) π(Λ_c | σ²_λ) π(Ψ | α, β_j) π(Z | α_z) π(U)   (10)
                   ∝ L(X | Λ_c, Ψ, Z, U) ∏_{g=1}^G φ^{(K)}(Λ_{c,g}; 0, σ²_λ I_K) ∏_{j=1}^p IG(α, β_j) × α_z^G ∏_{g=1}^G (|C_g| − 1)! ∏_{i=1}^n φ^{(K)}(u_i; 0, I_K),

where φ^{(K)}(·; µ, Σ) denotes the pdf of a K-dimensional Gaussian random variable with mean vector µ and covariance matrix Σ.
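The PPM prior with the Dirichlet-process-style cohesion can be evaluated up to the normalizing constant with a few lines of stdlib Python; the snippet below is a sketch illustrating that, with α_z = 1, the cohesion (|C_g| − 1)! places more prior mass on few large clusters than on many singletons.

```python
import math
from collections import Counter

def log_ppm_prior(labels, alpha_z=1.0):
    # Unnormalized log prior: log pi(c) = G*log(alpha_z) + sum_g log((|C_g|-1)!),
    # matching pi(c) proportional to alpha_z^G * prod_g (|C_g| - 1)!.
    sizes = Counter(labels).values()
    G = len(sizes)
    # math.lgamma(n_g) equals log((n_g - 1)!) for integer n_g >= 1.
    return G * math.log(alpha_z) + sum(math.lgamma(n_g) for n_g in sizes)

# Grouping four variables together gets log-mass log(3!) = log(6),
# versus 0 for the partition into four singletons.
together = log_ppm_prior([0, 0, 0, 0])   # one cluster of size 4
apart = log_ppm_prior([0, 1, 2, 3])      # four singleton clusters
print(together, apart)
```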
Model parameter estimation is carried out in a Bayesian framework, using Markov chain Monte Carlo. Due to the conditionally conjugate nature of the various prior distributions adopted, samples from (10) are obtained using a Gibbs sampling scheme, with the exception of the latent variable allocation matrix Z, which is sampled via a Metropolis-Hastings step. The full conditional distributions are listed below (more details on their derivation and on the involved parameters are provided in the Appendix):

vec(Λ_c) | ... ∼ N_{G×K}(µ_λ, Σ_λ),   (11)
u_i | ... ∼ N_K(µ_u, Σ_u),   (12)
ψ_j | ... ∼ IG(α + n/2, β*_j).   (13)

For the Metropolis-Hastings step to sample the allocation matrix Z, we adapt to our case one of the moves proposed by Nobile and Fearnside (2007) in the so-called allocation sampler. Each single move attempts to reallocate to cluster g2 a group of variables previously assigned to cluster g1; in this way, by possibly reallocating blocks of variables, big moves are proposed so that the space will be explored faster. The detailed steps of the procedure are outlined hereafter:

1. Draw, from the total G variable clusters, a group g1. If n_{g1} = |C_{g1}| = 0, where |C_{g1}| denotes the cardinality of group g1, the move fails;
2. Compute d(Λ_{c,g1}, Λ_{c,g}), the Euclidean distance between Λ_{c,g1} and Λ_{c,g}, for all g ∈ {1, ..., G} with g ≠ g1. Afterwards, a second group g2 is drawn from the set {1, ..., G} \ {g1} with P(g2 = g) ∝ d(Λ_{c,g1}, Λ_{c,g})^{-1};
3. Draw M from the set {1, ..., n_{g1}} with P(M = m) ∝ 1/m, for all m = 1, ..., n_{g1};
4. Select randomly M observations among the n_{g1} belonging to group g1 and reallocate them to group g2;
5. Denote with Z and Z' respectively the starting latent allocation matrix and the one after the reallocation move. The move itself is then accepted with probability min{1, R}, where R is given by

R = [π(Λ_c, Ψ, Z', U | X) / π(Λ_c, Ψ, Z, U | X)] × [P(Z' → Z) / P(Z → Z')].
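Steps 1-4 of the move can be sketched as follows (a sketch only: the acceptance step 5 would additionally require evaluating the posterior and proposal ratios R; the numerical guard on zero distances is our addition).

```python
import numpy as np

rng = np.random.default_rng(7)

def propose_block_move(Lambda_c, labels, rng):
    """Steps 1-4 of the block-reallocation proposal: draw a source cluster g1,
    draw g2 with probability inversely proportional to the Euclidean distance
    between cluster loadings, draw a block size M with P(M = m) ~ 1/m, and
    move M random variables from g1 to g2. Returns the proposed allocation,
    or None if the move fails because g1 is empty."""
    G = Lambda_c.shape[0]
    g1 = rng.integers(G)                                   # step 1
    members = np.flatnonzero(labels == g1)
    if members.size == 0:
        return None                                        # empty cluster: move fails
    d = np.linalg.norm(Lambda_c - Lambda_c[g1], axis=1)    # step 2
    w = np.where(np.arange(G) == g1, 0.0, 1.0 / np.maximum(d, 1e-12))
    g2 = rng.choice(G, p=w / w.sum())
    m_vals = np.arange(1, members.size + 1)                # step 3
    M = rng.choice(m_vals, p=(1.0 / m_vals) / (1.0 / m_vals).sum())
    moved = rng.choice(members, size=M, replace=False)     # step 4
    proposed = labels.copy()
    proposed[moved] = g2
    return proposed

labels = np.array([0, 0, 0, 1, 1, 2, 2, 2])   # toy allocation of 8 variables
Lambda_c = rng.normal(size=(3, 2))            # toy cluster loadings
proposed = propose_block_move(Lambda_c, labels, rng)
```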
It can be easily shown that the proposal ratio is

P(Z' → Z) / P(Z → Z') = [ ( ∑_{m=1}^{n_{g1}} 1/m ) / ( ∑_{m=1}^{n_{g2}+M} 1/m ) ] × [ n_{g1}! n_{g2}! / ( (n_{g1} − M)! (n_{g2} + M)! ) ].

Our modification of the procedure proposed by Nobile and Fearnside (2007) consists in changing the probabilities involved in the selection of g2 and M. The rationale lies in the need to propose, at each step, the reallocation of blocks of variables, while increasing the acceptance ratio by proposing moves involving similar clusters and by keeping a reasonable size of the blocks.

Lastly, note that the model we are proposing, having a factor analytic structure, inherits the standard identifiability issues related to the rotational invariance property. A common solution consists in considering some constraints on the factor loadings (see e.g. Arminger and Muthén, 1998; Lopes and West, 2004). In our framework, where the factor analytic structure of the model may be seen as a tool to reconstruct Σ in a parsimonious way, identification is not strictly necessary. Moreover, it has been noted (Bhattacharya and Dunson, 2011) that identifiability constraints may lead to order dependence among the variables and general inefficiencies. As a direct consequence, we decided not to consider such constraints in our modelling strategy.

In the previous sections, the number of factors K has been considered as fixed; in practice, inference on K constitutes one of the most challenging and trickiest issues to tackle when considering factor analytic models. Several different approaches have been proposed in the literature, with a standard one resorting to widely known information criteria, such as the AIC and BIC, as selection tools. Nonetheless, these criteria might not be reliable in high-dimensional settings and they are not theoretically justified in the FA framework.
From a Bayesian perspective, it is worth mentioning the work by Lopes and West (2004), where the authors propose a reversible jump Markov chain Monte Carlo algorithm moving between models having different numbers of factors. Another viable approach comes from the nonparametric literature, where models with an infinite number of factors have been widely used in combination with shrinkage priors on the loading matrices (Bhattacharya and Dunson, 2011; Durante, 2017; Schiavon and Canale, 2020); such a strategy allows the automatic choice of the number of active factors, being the ones with non-negligible loading values.

Note that, in the framework developed herein, the model selection step is even more troublesome, since the choice of K is coupled with that of the number of variable clusters G. A similar problem arises in the context of mixtures of factor analyzers (Ghahramani and Hinton, 1996), where usually different models, corresponding to different configurations of factors and mixture components, are compared by means of information criteria. In a Bayesian framework, a solution has been proposed by Fokoué and Titterington (2003), where the authors adopt a stochastic model search to jointly select the optimal number of clusters and factors.

In order to address the issues mentioned above, different strategies may be adopted in the considered settings. A viable alternative to the information criteria usually adopted, and more coherent with the estimation routine outlined in Section 3.3, may consist of considering the BICM (BIC Monte-Carlo) or the AICM (AIC Monte-Carlo) proposed by Raftery et al. (2007), defined as BICM = 2 log L̃ − s_l log(n) and AICM = 2 log L̄ − s_l, where log L̃, log L̄ and s_l are respectively the maximum observed value, the mean and the variance of the log-likelihood computed over the posterior samples.
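A small stdlib helper transcribing the two Monte-Carlo criteria exactly as stated above (the penalty forms follow the text's statement of Raftery et al., 2007, not an independent derivation):

```python
import math

def bicm_aicm(loglik_draws, n):
    # BICM = 2*max(log L) - s_l*log(n); AICM = 2*mean(log L) - s_l,
    # where s_l is the variance of the log-likelihood values over the
    # posterior draws. Higher values favour the model.
    m = len(loglik_draws)
    mean_ll = sum(loglik_draws) / m
    s_l = sum((ll - mean_ll) ** 2 for ll in loglik_draws) / m
    bicm = 2.0 * max(loglik_draws) - s_l * math.log(n)
    aicm = 2.0 * mean_ll - s_l
    return bicm, aicm

# Three made-up posterior log-likelihood values, sample size n = 100.
b, a = bicm_aicm([-110.0, -112.0, -111.0], n=100)
```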
Another, somewhat heuristic, approach is given by the so-called BIC-MCMC (Frühwirth-Schnatter, 2011), with BIC-MCMC = 2 log L̃ − ν log(n), where ν is the total number of parameters in the model. Note that, not depending on s_l, the BIC-MCMC turns out to be less influenced by possible fluctuations and jumps in the log-likelihood values across the MCMC draws.

Furthermore, an exhaustive search of the model space is computationally expensive, if not infeasible, considering that in our scenario both K and G might span a wide range of values. Moreover, in the given framework, the focus is on models providing good and parsimonious reconstructions of the covariance matrices, jointly with indications about which variables provide similar information through redundancy, rather than on finding the optimal number of factors and groups. For these reasons, in this work, we consider an ad hoc initialization strategy which yields a promising configuration (K_init, G_init) for the number of factors and variable clusters. The procedure consists in the following steps:

1. Estimate a standard FA model as defined in (1) for k = 1, ..., K_max, with K_max chosen sufficiently large. This yields the loading matrices Λ_k, k = 1, ..., K_max;
2. Use a model-based clustering strategy (see Fraley and Raftery (2002) or Bouveyron et al. (2019) for a recent review) to obtain a partition of the rows of Λ_k, for k = 1, ..., K_max, into G_k groups, with G_k automatically selected by means of the Bayesian Information Criterion (BIC);
3. Build new loading matrices Λ̃_k, for k = 1, ..., K_max, where the rows of Λ_k are replaced with the mean of the cluster they belong to. In such a way, the repeated row values structure of Λ̃ in (3) is mimicked;
4. Considering the distributional properties of FA models, compute the BIC for all the models corresponding to different configurations (k, G_k), with k = 1, ..., K_max.
Select as (K_init, G_init) the configuration which attains the highest value of the BIC.

This approach allows us to find reasonable values that might be used as the starting point of a local search. More specifically, once (K_init, G_init) have been obtained by running the initialization procedure illustrated above, we consider a greedy search model selection strategy. From a practical point of view, we fit four different models corresponding to (K_init ± 1, G_init ± 1) and we compare them by means of the BIC-MCMC. The model with the best value of the information criterion is then selected and its neighboring models are subsequently estimated. These two steps are iterated until no improvements in the BIC-MCMC values are found. The best model according to the BIC-MCMC is then selected and subsequently used to reconstruct the covariance structure and to obtain a partition of the wavelengths.

Lastly, note that some sensitivity analyses, conducted and reported in the next section, confirm that running a global and exhaustive search is not strictly necessary if covariance reconstruction is the final aim.
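The greedy search just described can be sketched as a small generic routine; this is an illustration under assumptions (the `score` function stands in for the BIC-MCMC of a fitted model, and the four neighbours are taken as single-coordinate moves, one possible reading of "(K_init ± 1, G_init ± 1)").

```python
def greedy_search(score, k_init, g_init, k_max=25, g_max=40):
    """From (k_init, g_init), repeatedly score the neighbouring configurations
    and move to the best one, stopping when no neighbour improves the
    criterion (higher score is better)."""
    current = (k_init, g_init)
    best = score(*current)
    while True:
        k, g = current
        neighbours = [(k - 1, g), (k + 1, g), (k, g - 1), (k, g + 1)]
        neighbours = [(kk, gg) for kk, gg in neighbours
                      if 1 <= kk <= k_max and 1 <= gg <= g_max]
        top_score, top_cfg = max((score(kk, gg), (kk, gg))
                                 for kk, gg in neighbours)
        if top_score <= best:
            return current, best
        current, best = top_cfg, top_score

# Toy criterion peaking at K = 3, G = 5 (purely illustrative).
toy = lambda k, g: -((k - 3) ** 2 + (g - 5) ** 2)
print(greedy_search(toy, k_init=6, g_init=9))   # ((3, 5), 0)
```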
In this section we investigate the performance of the proposed procedure on some synthetic datasets. The aim of the analyses reported hereafter is twofold. On one hand we want to quantify the deterioration of the results when a wrong model, having different (K, G) values with respect to the model generating the data, is employed. The possible deterioration is studied in terms of the quality of the variable partitions, measured by the Adjusted Rand Index (ARI, Hubert and Arabie, 1985), and in terms of correlation reconstruction. The latter is evaluated considering two different criteria measuring the dissimilarity between the true correlation matrix R = D^{-1/2} Σ D^{-1/2} and the estimated one R̂ = D̂^{-1/2} Σ̂ D̂^{-1/2}, with D = diag(σ_1^2, ..., σ_p^2) and D̂ = diag(σ̂_1^2, ..., σ̂_p^2). The first criterion we consider is the Mean Squared Error (MSE), defined as

MSE(R, R̂) = \frac{2}{p(p+1)} \sum_{j=1}^{p} \sum_{j' \geq j} (R_{jj'} - \hat{R}_{jj'})^2,

where R_{jj'} represents the (j, j')-th element of the matrix R. The second criterion adopted is the RV coefficient (Abdi, 2007), expressed as

RV(R, R̂) = \frac{\mathrm{tr}(R^T \hat{R})}{\sqrt{\mathrm{tr}(R^T R)\,\mathrm{tr}(\hat{R}^T \hat{R})}}

and taking values between 0 and 1, where values closer to 1 denote a greater similarity between the matrices. Note that we consider the correlation matrices rather than the covariance ones in order to have more interpretable indications from an MSE perspective.

On the other hand, the second aim of the simulation study consists in the numerical exploration of the quality of the initialization strategy proposed in Section 3.4 for finding promising configurations for the number of factors K and variable clusters G.

A total of B = 200 samples have been drawn, with sample size n = 500 and p = 40 variables. The data are generated according to the probabilistic mechanism underlying model (3), so that the sampled vectors x_i are distributed as Gaussian random variables with zero mean and covariance matrix Σ_true = Λ_true Λ_true^T + Ψ_true. The true numbers of factors and of variable clusters have been fixed to K_true = 3 and G_true = 5, respectively. Prior to running the Gibbs sampler outlined in Section 3.3, the involved parameters are initialized by estimating a standard FA model where the factor loadings are obtained as the cluster centroids of a k-means clustering procedure, which also provides the starting values for the loadings partition.
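The data-generating mechanism and the two reconstruction criteria above can be sketched as follows. The dimensions follow the text (p = 40, K_true = 3, G_true = 5); the loading values, allocations, and uniquenesses are arbitrary illustrative choices.

```python
# Sketch of the simulation mechanism of model (3) and of the MSE and RV
# criteria used to compare correlation matrices.
import numpy as np

rng = np.random.default_rng(0)
n, p, K, G = 500, 40, 3, 5

# repeated-row loading matrix: variables in the same cluster share the
# same row of loadings (the structure Lambda-tilde = Z Lambda_c)
Lambda_c = rng.normal(size=(G, K))            # cluster-level loadings
z = rng.integers(0, G, size=p)                # variable allocations
Lambda = Lambda_c[z]                          # p x K with repeated rows
Psi = np.diag(rng.uniform(0.5, 1.5, size=p))  # uniquenesses
Sigma = Lambda @ Lambda.T + Psi

X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def to_corr(S):
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)                 # D^{-1/2} S D^{-1/2}

def mse(R, R_hat):
    iu = np.triu_indices_from(R)              # the p(p+1)/2 unique entries
    return np.mean((R[iu] - R_hat[iu]) ** 2)

def rv(R, R_hat):
    num = np.trace(R.T @ R_hat)
    den = np.sqrt(np.trace(R.T @ R) * np.trace(R_hat.T @ R_hat))
    return num / den

R_true = to_corr(Sigma)
R_sample = to_corr(np.cov(X, rowvar=False))
print(round(mse(R_true, R_sample), 4), round(rv(R_true, R_sample), 4))
```

With n = 500 the sample correlation matrix is already close to the truth, so the printed MSE is small and the RV coefficient is close to 1; the interest in the simulations lies in how much the model-based estimates degrade under wrong (K, G).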
The hyperparameters have been selected according to the indications given in Section 3.2, with σ_λ = 5 to entail a priori uninformativeness about the dispersion of the factor loadings. All the reported analyses have been carried out in the R environment (R Core Team, 2020) with the aid of the mclust package (Scrucca et al., 2016).

Table 1: Means, over the B simulated samples, of the ARI values (and their standard errors) comparing the true and the estimated variable partitions for varying values of K and G. The bold cell represents the true model generating the data.

Table 2: Means, over the B simulated samples, of the MSE values (and their standard errors) comparing the true and the estimated correlation matrices for varying values of K and G. The bold cell represents the true model generating the data.

Table 3: Means, over the B simulated samples, of the RV coefficients (and their standard errors) comparing the true and the estimated correlation matrices for varying values of K and G. The bold cell represents the true model generating the data.

Results are reported in Tables 1, 2, 3 and 4. First of all, note that variable partitions might be erroneously seen as a mere byproduct of the procedure proposed in Section 3.1, helping to reduce even further the number of free parameters when resorting to factor analysis. Nonetheless, obtaining variable clusters can represent the final aim of the analyses, since it produces relevant insights about the phenomenon under study, as will be clear for the application in Section 5. For this reason a proper evaluation of the clustering performances, even in simulated scenarios, is crucial to validate our procedure. From Table 1 we can see how, in these scenarios, the obtained variable partitions are generally close to the true groupings, as the ARI generally achieves good values.
A more careful investigation of the results shows how, as expected, the values are strongly influenced by the number of clusters G used in the model fitting procedure. Nonetheless, this behaviour turns out not to be symmetrical: while an underestimation of the number of clusters appears to be harmful, overestimating G looks harmless if not beneficial; this might give a clue about the spuriousness of some of the clusters when G > G_true. The impact of the number of factors is less evident, as the ARI values appear robust with respect to changes in K. A relevant behaviour to highlight consists in the slight degradation of the results when increasing the number of factors; note that, as K increases, the dimensionality of the space in which cluster searches are conducted increases too, possibly enhancing the sparsity of the data and deteriorating clustering performances.

In Tables 2 and 3 the results concerning correlation matrix reconstruction are reported. First of all it is relevant to highlight how both the MSE and the RV coefficient tend to provide very similar indications. Generally speaking, the obtained performances are good overall, regardless of the chosen values of K and G. A more detailed look reveals how, coherently with what happens for the clustering quality, underestimation of the number of variable clusters G seems to have a more visible impact on the degradation of the results. On the other hand, overestimation of G seems to have little impact on the values of the MSE and RV coefficient, and the same holds for different choices of the number of factors.

Table 4: Proportion of times, over the B = 200 simulated samples, that a specific configuration (K, G) has been selected by the initialization strategy outlined in Section 3.4. The bold cell represents the true model generating the data.

Finally, in Table 4 the performances of the model selection initialization strategy outlined in Section 3.4 are displayed.
A first promising result is given by the fact that the true model generating the data is the one selected most often by this strategy. Moreover, the right number of factors is chosen in more than 80% of the cases. On the other hand, the correct number of clusters looks harder to detect, with G = 5 being selected once every two samples. Nonetheless, a tendency to overestimate G is witnessed, with this behaviour being generally beneficial according to the results commented above on the quality of the partitions and the correlation reconstruction.

We strongly believe that the results reported in this section have to be considered as a whole, in order to obtain useful indications about reasonable paths to take when analyzing real datasets. From Table 4 we can see how the proposed initialization strategy never selected values of G smaller than G_true. These indications, if coupled with the ones obtained from Tables 1, 2 and 3, suggest that the initialization strategy tends to select models producing satisfactory performances both from a clustering and from a correlation reconstruction perspective. As commented above, a more pronounced deterioration of the results is witnessed especially when G = 3, while overestimation of both the number of clusters and of the number of factors generally leads to an amelioration of the performances. As a consequence, we believe that the proposed strategy might be fruitfully used as a fast and effective replacement of more intensive and time-consuming grid searches over K and G coupled with the reliance on some information criterion that has to be carefully selected.

In fact, note that some further analyses, not reported in the paper, generally showed the unreliability of the information criteria introduced in Section 3.4. As a general indication, it has been noted that the BIC-MCMC is more stable than the AICM and BICM, since the latter can be strongly influenced by jumps of the likelihood, usually due to updates of the allocation matrix Z, at a given step of the Gibbs updating procedure.
Nonetheless, in our analyses all three criteria selected the true model generating the data less often than the suggested initialization strategy.

In this section, the proposed method is applied to the milk MIR spectroscopy data described in Section 2. The initialization of the parameters involved in the model and the specification of the hyperparameters have been carried out coherently with what was done for the synthetic data in the previous section. Moreover, prior to running the proposed methodology, we removed from each single spectrum three wavelength regions supposed to be highly noisy (Hewavitharana and van Brakel, 1997), namely the ones from 1592 cm^{-1} to 1720 cm^{-1}, from 2996 cm^{-1} to 3698 cm^{-1} and from 3818 cm^{-1} to 5010 cm^{-1}. Consequently, we end up working with a dataset having n = 4320 milk samples and p = 533 measured wavelengths.

The initialization procedure outlined in Section 3.4 selects different (K, G) configurations for the Pasture and for the TMR samples. More specifically, in the first case it selects a number of factors K equal to 4 and a number of variable clusters G equal to 25, while in the latter one K = 3 and G = 19. This might give a first rough indication of the more complex relations among the wavelengths underlying the samples coming from pasture fed cows, since in order to capture those relations a higher number of clusters, lying in a higher dimensional reduced subspace, is needed.

Figure 3: Estimated correlation matrices computed on the milk samples produced by pasture fed cows (on the left) and TMR fed cows (on the right).

In Figure 3 the estimated correlation matrices, obtained by running the proposed model with the mentioned (K, G) values, are reported. By comparing these matrices with the sample correlation ones displayed in Figure 2, we can obtain an indication of the capability of our methodology to map the data into lower dimensional subspaces while retaining the relevant correlation structures. Denoting by R the sample correlation matrix and by R̃ the estimated one, we have that, for the pasture samples, MSE(R_Pasture, R̃_Pasture) = 0.021, while for the TMR samples MSE(R_TMR, R̃_TMR) = 0.035, with RV coefficients in both cases close to 1, despite the strong dimension reduction induced by the selected values of K and G.

Furthermore, the graphical inspection of Figure 3 suggests how our variable clustering mechanism tends to favour the appearance of blocky structures, thus possibly simplifying the practical interpretation of the relations among wavelengths. Moreover, even in the face of a rather similar correlation structure, this characteristic of our proposal allows one to highlight even more the differences, between milk samples coming from pasture fed and TMR fed cows, in the correlation among different wavelength regions. These graphical differences may serve as an interesting starting point to study how the diet regimens can impact the chemical processes underlying the spectral behaviour.

Coherently with the comments we made for the synthetic data analyses, another way we consider to assess the performances of our methodology consists in the investigation of the partitions of the variables. In Table 5, we report the confusion matrix comparing the two clusterings of the wavelengths obtained on the pasture and on the TMR milk samples. It stands out how, despite having different numbers of clusters, the two partitions are similar, as the table shows an almost diagonal structure. A confirmation of their similarity is given by the quite high ARI value, equal to 0.651.

Table 5: Confusion matrix comparing wavelength partitions obtained on samples from pasture fed (on the rows) and TMR fed (on the columns) cows. Blank spaces are used instead of zeroes.

The agreement between the two partitions is somewhat expected, since we are examining milk samples where the only different experimental condition consists in the different diet regimens. Moreover, this behaviour may be seen as a strong signal of the presence of a real clustering structure in the measured wavelengths, thus entailing a traceable redundancy in the information they provide.
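The partition comparison just described, a confusion matrix coupled with the Adjusted Rand Index (Hubert and Arabie, 1985), can be sketched as follows; the two short label vectors are illustrative stand-ins for the pasture and TMR wavelength partitions.

```python
# Sketch: confusion matrix of two partitions and the ARI between them.
from math import comb
import numpy as np

def confusion_matrix(a, b):
    la, lb = sorted(set(a)), sorted(set(b))
    M = np.zeros((len(la), len(lb)), dtype=int)
    for x, y in zip(a, b):
        M[la.index(x), lb.index(y)] += 1
    return M

def ari(a, b):
    # Adjusted Rand Index computed from the contingency table
    M = confusion_matrix(a, b)
    sum_cells = sum(comb(int(nij), 2) for nij in M.ravel())
    sum_rows = sum(comb(int(ni), 2) for ni in M.sum(axis=1))
    sum_cols = sum(comb(int(nj), 2) for nj in M.sum(axis=0))
    total = comb(len(a), 2)
    expected = sum_rows * sum_cols / total
    max_index = (sum_rows + sum_cols) / 2
    return (sum_cells - expected) / (max_index - expected)

part_pasture = [0, 0, 1, 1, 2, 2, 2, 3]
part_tmr     = [0, 0, 1, 1, 2, 2, 3, 3]   # one cluster split in two
print(confusion_matrix(part_pasture, part_tmr))
print(round(ari(part_pasture, part_tmr), 3))  # → 0.604
```

An almost diagonal confusion matrix, as in the example, corresponds to a high ARI even when the two partitions have different numbers of clusters.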
Nonetheless, a careful analysis of the results in Table 5 reveals how the different numbers of clusters in the two partitions generally imply that the large TMR variable clusters are split into two pasture variable clusters, as happens for example for TMR clusters 11 and 17. This behaviour might provide some initial indications, possibly deserving further exploration, about how the different diets can impact chemical features in the milk, in turn modifying the structures we see in the spectral data. Similar indications can be drawn from Figure 4, where the partitions of the wavelengths for the two different diet regimens are visually represented.

Furthermore, note that the insights obtained from a clustering perspective may be exploited to build variable selection tools, possibly useful both for exploratory or graphical analyses and for classification purposes. In fact, the indication about the strong redundancy implied by the witnessed clustering structures can be used to build new features, defined as summaries of the groups themselves, possibly highlighting differences among pasture and TMR samples.

In order to gain some further useful and practical knowledge about the phenomenon we are studying, we considered a cluster-specific predictive analysis. Therefore, we run different linear regression models, separately for pasture and TMR samples, where the covariates are given by the spectral measurements at the wavelengths belonging to a cluster, while the response variables are the contents of fat, protein and lactose in the samples. These analyses, as briefly mentioned in the introduction, are important in order to understand whether spectroscopy data may be fruitfully used to predict some important features of the milk in a rapid and inexpensive way.
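The cluster-specific predictive analysis can be sketched as follows: for each variable cluster, regress a milk trait on the wavelengths in that cluster only and report the adjusted R-squared. The spectra, the partition, and the trait below are synthetic stand-ins, with the signal deliberately placed in one cluster.

```python
# Sketch of the cluster-specific regressions and adjusted R-squared.
import numpy as np

rng = np.random.default_rng(1)
n, p, G = 200, 30, 3
clusters = rng.integers(0, G, size=p)       # illustrative wavelength partition
X = rng.normal(size=(n, p))                 # stand-in spectra
beta = np.where(clusters == 0, 1.0, 0.0)    # trait driven by cluster 0 only
y = X @ beta + rng.normal(scale=0.5, size=n)

def adjusted_r2(Xg, y):
    n, q = Xg.shape
    Xd = np.column_stack([np.ones(n), Xg])  # add intercept
    coef, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ coef
    r2 = 1 - resid @ resid / ((y - y.mean()) ** 2).sum()
    return 1 - (1 - r2) * (n - 1) / (n - q - 1)

for g in range(G):
    print(f"cluster {g}: adj. R^2 = {adjusted_r2(X[:, clusters == g], y):.3f}")
```

Only the cluster actually carrying the signal reaches a high adjusted R-squared, which mirrors the way the milk-trait regressions are used in the text to compare the informativeness of spectral regions.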
Moreover, in our case, the predictions are based only on a small subset of variables, namely those assigned to a specific cluster, thus alleviating high-dimensionality induced issues.

Figure 4: Wavelength partitions obtained on samples from pasture fed (on the top) and TMR fed (on the bottom) cows. The grey shaded areas correspond to the removed noisy regions.

In Figure 5 we report the results we obtained in terms of the
Adjusted R-squared index. At first glance, it seems that the content of fat in the milk samples is easy to predict, regardless of the specific spectral region considered and of the diet regimens. Moreover, the predictive performances are generally higher for the TMR samples in comparison to the pasture ones. A more careful examination of the results, if paired with the suggestions obtained about the wavelength clustering structures, allows us to give some further indications about how the information carried by some spectral regions differs depending on the diet. For example, from Table 5 we can see how TMR cluster 15 finds its correspondence with pasture clusters 16, 17 and 19. It is straightforward to see how, in terms of the Adjusted R-squared, the wavelengths in the corresponding regions seem to produce better predictions of the lactose content for the TMR than for the pasture milk samples. Similar indications can be found by carefully studying jointly the results shown in Table 5 and Figure 5.

The clustering results obtained, if paired with previously conducted studies, can lead to other relevant insights. For example, the work by Picque et al. (1993) suggests that the measurements in the region spanning from 1515 cm^{-1} to 1593 cm^{-1} are characteristic of the lactate ion. On the other hand, the regions from 1040 cm^{-1} to 1100 cm^{-1} and from 1298 cm^{-1} to 1470 cm^{-1} are related to the galactose component of milk. Note that, from a practical standpoint, both lactate and galactose can be seen as indicators of milk quality.

Figure 5: Adjusted R-squared values of the regression models where only the wavelengths in a specific cluster are used to predict the content of three different milk traits, namely fat, protein and lactose. On the left the results for the pasture samples, on the right the ones for the TMR samples.

As is visually clear in Figure 4, the wavelengths pertaining to the lactate ion region mainly belong to pasture cluster 8 and to TMR cluster 7; a closer inspection of Table 5 reveals how these groups are strongly related in the two partitions, giving an additional indication of the coherency of the clustering results. On the other hand, the wavelengths belonging to the galactose regions are split into groups 2, 5 and 7 for the pasture samples, while they are mainly associated with groups 3 and 6 for the TMR samples, again with a strong correspondence among these clusters visible in the confusion matrix. Moreover, the results obtained from the cluster-specific regression analyses show how these groups are among the best ones for predicting the lactose content in the milk samples. Since lactose is a disaccharide formed by the linking of the galactose and glucose monosaccharide units, these results serve as a confirmation of the practical utility of the variable partitions obtained from the model.

Lastly, note that the correlation matrices estimated by using the proposed methodology can be used as an exploratory tool to deepen our knowledge about the relations among the wavelengths in a specific spectral region. Consider, for example, the correlations among the first 50 wavelengths. From a visual inspection of Figure 3, the wavelengths in this region seem to relate to one another differently depending on the diet; this region is shown more closely in Figure 6. From Figure 6, we can see how the relationships among the spectral values at these wavelengths are highly dependent on the diet.
If we compare the two sub-matrices using the metrics considered previously, the resulting values of MSE(R̂^{(1:50)}_Pasture, R̂^{(1:50)}_TMR) and RV(R̂^{(1:50)}_Pasture, R̂^{(1:50)}_TMR) confirm the dissimilarity between the two estimated structures in this region.

Figure 6: Estimated correlation sub-matrices corresponding to the first 50 wavelengths, computed on the milk samples produced by pasture fed cows (on the left) and TMR fed cows (on the right).
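The sub-matrix comparison can be sketched with the same metrics used in the simulation section; the two matrices below are synthetic stand-ins for the estimated pasture and TMR correlation matrices, and 50 is the size of the examined block.

```python
# Sketch: extract the block of the first 50 wavelengths from two
# correlation matrices and compare the blocks with MSE and RV.
import numpy as np

def to_corr(S):
    d = np.sqrt(np.diag(S))
    return S / np.outer(d, d)

def block(R, idx):
    return R[np.ix_(idx, idx)]

def mse(R, S):
    iu = np.triu_indices_from(R)
    return np.mean((R[iu] - S[iu]) ** 2)

def rv(R, S):
    return np.trace(R.T @ S) / np.sqrt(np.trace(R.T @ R) * np.trace(S.T @ S))

rng = np.random.default_rng(2)
A, B = rng.normal(size=(60, 5)), rng.normal(size=(60, 5))
R1 = to_corr(A @ A.T + np.eye(60))   # stand-in "pasture" estimate
R2 = to_corr(B @ B.T + np.eye(60))   # stand-in "TMR" estimate
idx = np.arange(50)                  # first 50 wavelengths
print(mse(block(R1, idx), block(R2, idx)), rv(block(R1, idx), block(R2, idx)))
```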
In this paper we have presented a modification of the standard Factor Analysis model in which the factor loadings matrix is reparameterized so that redundancy in the originally observed variables can be detected. Moreover, to estimate the proposed model, a flexible Metropolis-within-Gibbs sampler has been implemented. Our method yields a parsimonious representation of strongly dependent high-dimensional data with complex correlation structures. In fact, the specification we propose entails a huge reduction of the number of parameters to be estimated with respect to a standard FA model. At the same time, as a direct consequence of the specification itself, the model yields a grouping of the original variables when mapping them into the lower-dimensional subspace. The resulting partition throws light on the relations among the observed features and on their possible redundancies. These indications, when supported by subject matter knowledge, can be translated into practical knowledge about the phenomenon under study.

Our proposal was directly motivated by an application to vibrational spectroscopy data analysis and showed good performances on the dairy feed experiment data under investigation, both in terms of correlation reconstruction and in terms of interpretability of the results. Spectroscopy data usually present some recurring challenges from a statistical perspective, as they are high-dimensional with strong and peculiar correlation structures among the wavelengths, possibly entailing complex redundancies. The model we introduced has proven particularly useful in the given context, since it has provided a parsimonious characterization of the correlation matrix. From a practical point of view, this allowed us to gain relevant knowledge and to highlight differences among different milk samples, thus possibly helping in authenticity assessment and in preventing food adulteration in the future. Moreover, by clustering the variables, it provided interesting insights about which spectral regions carry the same information, implying that, even if possibly far from each other, they may be influenced by similar chemical processes. Lastly, note that our proposal can be thought of as a starting point when building classification tools aiming to discriminate samples according to the relations occurring among the observed features.

Note that the proposed methodology has been applied to MIR spectroscopy data but, in principle, its use may be extended to other data sharing some characteristics with the ones considered in this work. A possible interesting extension and research direction consists in the exploration of different possibilities concerning π(Z), the prior distribution for the allocation matrix. When accounting for peculiar correlation structures in the data it can be interesting, as we briefly mentioned in Section 3.2, to explore prior distributions incorporating information about specific relations and constraints, such as spatial or temporal ones, for the variables to be clustered. Another aspect worth examining concerns model selection. In Section 3.4 we introduced an initialization strategy that provides good indications about reasonable values for the number of factors K and clusters G. Nonetheless, several different approaches may be adopted, and a thorough exploration of different model selection tools may be beneficial. As briefly mentioned in Section 3.4, a possible extension consists in allowing K to go toward infinity and in considering shrinkage priors on the factor loadings, as proposed in Bhattacharya and Dunson (2011) and Murphy et al. (2018). This strategy, possibly fruitful even for the determination of G, allows one to circumvent the issues related to the selection of K by automating the choice of the active factors, i.e. the ones with non-negligible loading values, in characterizing the covariance structure. Finally, as we mentioned above, our proposal can be thought of as a stepping stone when building new classification tools. A possible straightforward strategy pointing in this direction would consist in embedding the model we introduced in a Mixture of Factor Analyzers (MFA, Ghahramani and Hinton, 1996) framework, thus allowing one to perform classification and clustering of high-dimensional data.

Posterior conditional distributions
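The three full conditionals derived in this appendix can be combined into one sweep of the Gibbs sampler. The following is a compact numerical sketch, not the paper's actual implementation (which is in R): the allocation matrix Z is held fixed, and all dimensions, data, and hyperparameter values are illustrative assumptions.

```python
# Illustrative one-sweep Gibbs update for U, Lambda_c and the psi_j's.
import numpy as np

rng = np.random.default_rng(0)
n, p, K, G = 100, 12, 2, 4
alpha, beta, sigma2_lambda = 2.0, 1.0, 25.0   # illustrative hyperparameters

z = rng.integers(0, G, size=p)
Z = np.eye(G)[z]                      # p x G allocation matrix (fixed here)
X = rng.normal(size=(n, p))           # stand-in data
Lambda_c = rng.normal(size=(G, K))    # cluster-level loadings, G x K
psi = np.ones(p)                      # uniquenesses

def gibbs_sweep(X, Z, Lambda_c, psi):
    n, p = X.shape
    G, K = Lambda_c.shape
    Lam = Z @ Lambda_c                # Lambda-tilde = Z Lambda_c, p x K
    Psi_inv = np.diag(1.0 / psi)
    # factor scores: u_i ~ N(S Lam' Psi^{-1} x_i, S), S = (I + Lam' Psi^{-1} Lam)^{-1}
    S = np.linalg.inv(np.eye(K) + Lam.T @ Psi_inv @ Lam)
    S = (S + S.T) / 2                 # enforce symmetry numerically
    U = X @ Psi_inv @ Lam @ S + rng.multivariate_normal(np.zeros(K), S, size=n)
    # loadings: vec(Lambda_c) ~ N(mu, Sigma), column-major vec convention
    prec = np.kron(U.T @ U, Z.T @ Psi_inv @ Z) + np.eye(G * K) / sigma2_lambda
    Sigma = np.linalg.inv(prec)
    Sigma = (Sigma + Sigma.T) / 2
    mu = Sigma @ (Z.T @ Psi_inv @ X.T @ U).flatten(order="F")
    Lambda_c = rng.multivariate_normal(mu, Sigma).reshape((G, K), order="F")
    # uniquenesses: psi_j ~ InvGamma(alpha + n/2, beta_j + M_jj/2)
    resid = X - U @ Lambda_c.T @ Z.T
    M_diag = (resid ** 2).sum(axis=0)
    psi = 1.0 / rng.gamma(alpha + n / 2, 1.0 / (beta + M_diag / 2), size=p)
    return U, Lambda_c, psi

U, Lambda_c, psi = gibbs_sweep(X, Z, Lambda_c, psi)
print(U.shape, Lambda_c.shape, psi.min() > 0)
```

In the full sampler Z would also be updated (by the Metropolis step mentioned in the main text) and the sweep iterated for many draws; the sketch only illustrates the conjugate updates derived below.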
Full conditional for the factor scores U

Denote, as in Section 3, Λ̃ = ZΛ_c. The full conditional (12) for the factor scores U is obtained following standard FA results:

π(U | ...) ∝ L(X | Λ_c, Ψ, Z, U) π(U)
∝ exp{ -1/2 tr[Ψ^{-1}(X - UΛ̃^T)^T(X - UΛ̃^T)] } exp{ -1/2 tr[U^T U] }
∝ ... ∝ exp{ -1/2 tr[(I + Λ̃^T Ψ^{-1} Λ̃)(U - Ũ)^T(U - Ũ)] }.

Therefore U is distributed as a matrix-normal random variable, U ∼ MN_{n,K}(Ũ, I_n, (I_K + Λ̃^T Ψ^{-1} Λ̃)^{-1}). Focusing on a single row of the factor scores matrix we obtain, for i = 1, ..., n,

π(u_i | ...) ∼ N_K( (I + Λ̃^T Ψ^{-1} Λ̃)^{-1} Λ̃^T Ψ^{-1} x_i , (I + Λ̃^T Ψ^{-1} Λ̃)^{-1} ),

so that

μ_u = (I + Λ̃^T Ψ^{-1} Λ̃)^{-1} Λ̃^T Ψ^{-1} x_i,
Σ_u = (I + Λ̃^T Ψ^{-1} Λ̃)^{-1}.

Full conditional for the unique factor loadings matrix Λ_c

Using the properties of the trace operator and of the vectorization several times, the full conditional (11) is obtained as follows:

π(Λ_c | ...) ∝ L(X | Λ_c, Ψ, Z, U) π(Λ_c)
∝ exp{ -1/2 tr[Ψ^{-1}(X - UΛ_c^T Z^T)^T(X - UΛ_c^T Z^T)] } exp{ -1/2 σ_λ^{-2} tr[Λ_c^T Λ_c] }
= exp{ -1/2 ( tr[(X - UΛ_c^T Z^T)Ψ^{-1}(X - UΛ_c^T Z^T)^T] + σ_λ^{-2} tr[Λ_c^T Λ_c] ) }
= exp{ -1/2 [ tr(UΛ_c^T Z^T Ψ^{-1} ZΛ_c U^T) - 2 tr(XΨ^{-1}ZΛ_c U^T) + σ_λ^{-2} tr(Λ_c^T Λ_c) ] }
= exp{ -1/2 [ tr(UΛ_c^T Z^T Ψ^{-1} ZΛ_c U^T) - 2 vec(Λ_c)^T vec(Z^T Ψ^{-1} X^T U) + σ_λ^{-2} tr(Λ_c^T Λ_c) ] }
= exp{ -1/2 [ vec(Ψ^{-1/2} ZΛ_c U^T)^T vec(Ψ^{-1/2} ZΛ_c U^T) + σ_λ^{-2} vec(Λ_c)^T vec(Λ_c) - 2 vec(Λ_c)^T vec(Z^T Ψ^{-1} X^T U) ] }
= exp{ -1/2 [ ((U ⊗ Ψ^{-1/2} Z) vec(Λ_c))^T (U ⊗ Ψ^{-1/2} Z) vec(Λ_c) + σ_λ^{-2} vec(Λ_c)^T vec(Λ_c) - 2 vec(Λ_c)^T vec(Z^T Ψ^{-1} X^T U) ] }
∝ exp{ -1/2 [ vec(Λ_c)^T (U^T U ⊗ Z^T Ψ^{-1} Z + σ_λ^{-2} I) vec(Λ_c) - 2 vec(Λ_c)^T vec(Z^T Ψ^{-1} X^T U) ] },

which is proportional to the pdf of a multivariate Gaussian random variable. Therefore vec(Λ_c) ∼ N_{G×K}(μ_Λ, Σ_Λ), where

μ_Λ = (U^T U ⊗ Z^T Ψ^{-1} Z + σ_λ^{-2} I)^{-1} vec(Z^T Ψ^{-1} X^T U),
Σ_Λ = (U^T U ⊗ Z^T Ψ^{-1} Z + σ_λ^{-2} I)^{-1}.

Full conditional for the uniquenesses ψ_j's

Define M = (X - UΛ_c^T Z^T)^T(X - UΛ_c^T Z^T) and let M_jj be the (j, j)-th element of M. The full conditional (13) for the uniquenesses ψ_j, j = 1, ..., p, is obtained as

π(Ψ | ...) ∝ L(X | Λ_c, Ψ, Z, U) π(Ψ)
∝ |Ψ|^{-n/2} exp{ -1/2 tr[Ψ^{-1} M] } ∏_{j=1}^{p} (β_j^α / Γ(α)) ψ_j^{-(α+1)} exp(-β_j / ψ_j)
∝ ∏_{j=1}^{p} ψ_j^{-n/2} ψ_j^{-(α+1)} exp( -1/2 ∑_{j=1}^{p} ψ_j^{-1} M_jj - ∑_{j=1}^{p} β_j ψ_j^{-1} )
= ∏_{j=1}^{p} ψ_j^{-(α + n/2 + 1)} exp( -∑_{j=1}^{p} ψ_j^{-1} (M_jj/2 + β_j) ),

so that ψ_j | ... ∼ InvGamma(α + n/2, β_j^*) with β_j^* = β_j + M_jj/2.

Acknowledgements
This publication has emanated from research conducted with the financial support of Science Foundation Ireland (SFI) and the Department of Agriculture, Food and the Marine on behalf of the Government of Ireland under grant number (16/RC/3835) and the SFI Insight Research Centre under grant number (SFI/12/RC/2289_P2).
References
Abdi, H. (2007). RV coefficient and congruence coefficient. Encyclopedia of Measurement and Statistics, pages 849-853.

Alothman, M., Hogan, S. A., Hennessy, D., Dillon, P., Kilcawley, K. N., O'Donovan, M., Tobin, J., Fenelon, M. A., and O'Callaghan, T. F. (2019). The "grass-fed" milk story: understanding the impact of pasture feeding on the composition and quality of bovine milk. Foods, 8(8):350.

Arminger, G. and Muthén, B. O. (1998). A Bayesian approach to nonlinear latent variable models using the Gibbs sampler and the Metropolis-Hastings algorithm. Psychometrika, 63(3):271-300.

Barry, D. and Hartigan, J. A. (1992). Product partition models for change point problems. The Annals of Statistics, 20(1):260-279.

Bartholomew, D. J., Knott, M., and Moustaki, I. (2011). Latent Variable Models and Factor Analysis: A Unified Approach. John Wiley & Sons.

Bhattacharya, A. and Dunson, D. B. (2011). Sparse Bayesian infinite factor models. Biometrika, 98(2):291-306.

Blei, D. M. and Frazier, P. I. (2011). Distance dependent Chinese restaurant processes. Journal of Machine Learning Research, 12(8):2461-2488.

Bonfatti, V., Tiezzi, F., Miglior, F., and Carnier, P. (2017). Comparison of Bayesian regression models and partial least squares regression for the development of infrared prediction equations. Journal of Dairy Science, 100(9):7306-7319.

Bouveyron, C., Celeux, G., Murphy, T. B., and Raftery, A. E. (2019). Model-Based Clustering and Classification for Data Science: With Applications in R. Cambridge University Press.

Capuano, E., Van der Veer, G., Boerrigter-Eenling, R., Elgersma, A., Rademaker, J., Sterian, A., and Van Ruth, S. M. (2014). Verification of fresh grass feeding, pasture grazing and organic farming by cows farm milk fatty acid profile. Food Chemistry, 164:234-241.

Coppa, M., Martin, B., Agabriel, C., Chassaing, C., Sibra, C., Constant, I., Graulet, B., and Andueza, D. (2012). Authentication of cow feeding and geographic origin on milk using visible and near-infrared spectroscopy. Journal of Dairy Science, 95(10):5544-5551.

Dahl, D. B., Day, R., and Tsai, J. W. (2017). Random partition distribution indexed by pairwise information. Journal of the American Statistical Association, 112(518):721-732.

De Marchi, M., Toffanin, V., Cassandro, M., and Penasa, M. (2014). Invited review: Mid-infrared spectroscopy as phenotyping tool for milk traits. Journal of Dairy Science, 97(3):1171-1186.

Downey, G. (1996). Authentication of food and food ingredients by near infrared spectroscopy. Journal of Near Infrared Spectroscopy, 4(1):47-61.

Durante, D. (2017). A note on the multiplicative gamma process. Statistics & Probability Letters, 122:198-204.

Elgersma, A. (2012). New developments in the Netherlands: dairies reward grazing because of public perception. Grassland Science in Europe, 17:420-422.

Everitt, B. S. (1984). An Introduction to Latent Variable Models. Chapman and Hall.

Faulkner, H., O'Callaghan, T. F., McAuliffe, S., Hennessy, D., Stanton, C., O'Sullivan, M. G., Kerry, J. P., and Kilcawley, K. N. (2018). Effect of different forage types on the volatile and sensory properties of bovine milk. Journal of Dairy Science, 101(2):1034-1047.

Fokoué, E. and Titterington, D. (2003). Mixtures of factor analysers. Bayesian estimation and inference by stochastic simulation. Machine Learning, 50(1-2):73-94.

Fraley, C. and Raftery, A. E. (2002). Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97(458):611-631.

Frühwirth-Schnatter, S. (2011). Label switching under model uncertainty. In Mixtures: Estimation and Application, pages 213-239. John Wiley & Sons.

Frühwirth-Schnatter, S. and Lopes, H. F. (2010). Parsimonious Bayesian factor analysis when the number of factors is unknown. Technical report, University of Chicago Booth School of Business.

Garvey, E. C., Sander, T., O'Callaghan, T. F., Drake, M., Fox, S., O'Sullivan, M. G., Kerry, J. P., and Kilcawley, K. N. (2020). A cross-cultural evaluation of liking and perception of salted butter produced from different feed systems. Foods, 9(12):1767.

Ghahramani, Z. and Hinton, G. E. (1996). The EM algorithm for mixtures of factor analyzers. Technical Report CRG-TR-96-1, University of Toronto.

Hartigan, J. A. (1990). Partition models. Communications in Statistics - Theory and Methods, 19(8):2745-2756.

Hewavitharana, A. K. and van Brakel, B. (1997). Fourier transform infrared spectrometric method for the rapid determination of casein in raw milk. Analyst, 122(7):701-704.

Hirose, K. and Konishi, S. (2012). Variable selection via the weighted group lasso for factor analysis models. Canadian Journal of Statistics, 40(2):345-361.

Hirose, K. and Yamamoto, M. (2015). Sparse estimation via nonconcave penalized likelihood in factor analysis model. Statistics and Computing, 25(5):863-875.

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1):193-218.

Jennrich, R. I. and Robinson, S. M. (1969). A Newton-Raphson algorithm for maximum likelihood factor analysis. Psychometrika, 34(1):111-123.

Jöreskog, K. G. (1967). Some contributions to maximum likelihood factor analysis. Psychometrika, 32(4):443-482.

Kamal, M. and Karoui, R. (2015). Analytical methods coupled with chemometric tools for determining the authenticity and detecting the adulteration of dairy products: A review. Trends in Food Science & Technology, 46(1):27-48.

Legramanti, S., Durante, D., and Dunson, D. B. (2020). Bayesian cumulative shrinkage for infinite factorizations. Biometrika, 107(3):745-752.

Lopes, H. F. and West, M. (2004). Bayesian model assessment in factor analysis. Statistica Sinica, 14:41-67.

McParland, S. and Berry, D. (2016). The potential of Fourier transform infrared spectroscopy of milk samples to predict energy intake and efficiency in dairy cows. Journal of Dairy Science, 99(5):4056-4070.

McParland, S., Lewis, E., Kennedy, E., Moore, S. G., McCarthy, B., O'Donovan, M., Butler, S. T., Pryce, J., and Berry, D. P. (2014). Mid-infrared spectrometry of milk as a predictor of energy intake and efficiency in lactating dairy cows. Journal of Dairy Science, 97(9):5863-5871.

Murphy, K., Viroli, C., and Gormley, I. C. (2018). Infinite mixtures of infinite factor analysers. Bayesian Analysis.

Murphy, T. B., Dean, N., and Raftery, A. E. (2010). Variable selection and updating in model-based discriminant analysis for high dimensional data with food authenticity applications. Annals of Applied Statistics, 4(1):396-421.

Nobile, A. and Fearnside, A. T. (2007). Bayesian finite mixtures with an unknown number of components: The allocation sampler. Statistics and Computing, 17(2):147-162.

O'Callaghan, T. F., Mannion, D. T., Hennessy, D., McAuliffe, S., O'Sullivan, M. G., Leeuwendaal, N., Beresford, T. P., Dillon, P., Kilcawley, K. N., Sheehan, J. J., Ross, R. P., and Stanton, C. (2017). Effect of pasture versus indoor feeding systems on quality characteristics, nutritional composition, and sensory and volatile properties of full-fat cheddar cheese. Journal of Dairy Science, 100(8):6053-6073.

O'Callaghan, T. F., Faulkner, H., McAuliffe, S., O'Sullivan, M. G., Hennessy, D., Dillon, P., Kilcawley, K. N., Stanton, C., and Ross, R. P. (2016a). Quality characteristics, chemical composition, and sensory properties of butter from cows on pasture versus indoor feeding systems. Journal of Dairy Science, 99(12):9441-9460.

O'Callaghan, T. F., Hennessy, D., McAuliffe, S., Kilcawley, K. N., O'Donovan, M., Dillon, P., Ross, R. P., and Stanton, C. (2016b). Effect of pasture versus indoor feeding systems on raw milk composition and quality over an entire lactation.
Journal of Dairy Science , 99(12):9424–9440.O’Callaghan, T. F., Vázquez-Fresno, R., Serra-Cayuela, A., Dong, E., Mandal, R., Hennessy,D., McAuliffe, S., Dillon, P., Wishart, D. S., Stanton, C., and Ross, R. (2018). Pasture feedingchanges the bovine rumen and milk metabolome.
Metabolites , 8(2):27.Page, G. L., Quintana, F. A., et al. (2016). Spatial product partition models.
Bayesian Analysis ,11(1):265–298.Picque, D., Lefier, D., Grappin, R., and Corrieu, G. (1993). Monitoring of fermentation byinfrared spectrometry: Alcoholic and lactic fermentations.
Analytica Chimica Acta , 279(1):67–72.Press, S. J. and Shigemasu, K. (1989). Bayesian inference in factor analysis. In
Contributionsto probability and statistics , pages 271–287. Springer.Quintana, F. A. and Iglesias, P. L. (2003). Bayesian clustering and product partition models.
Journal of the Royal Statistical Society: Series B (Statistical Methodology) , 65(2):557–574.22 Core Team (2020).
R: A Language and Environment for Statistical Computing . R Foundationfor Statistical Computing, Vienna, Austria.Raftery, A. E., Newton, M. A., Satagopan, J. M., and Krivitsky, P. N. (2007). Estimating theintegrated likelihood via posterior simulation using the harmonic mean identity.
Bayesianstatistics 8 , pages 1–45.Reid, L. M., O’Donnell, C. P., and Downey, G. (2006). Recent technological advances for thedetermination of food authenticity.
Trends in Food Science & Technology , 17(7):344–353.Rubin, D. B. and Thayer, D. T. (1982). EM algorithms for ML factor analysis.
Psychometrika ,47(1):69–76.Schiavon, L. and Canale, A. (2020). On the truncation criteria in infinite factor models.
Stat ,9(1):e298.Scrucca, L., Fop, M., Murphy, T. B., and Raftery, A. E. (2016). mclust 5: Clustering, classifi-cation and density estimation using gaussian finite mixture models.
The R journal , 8(1):289.Song, X.-Y. and Lee, S.-Y. (2001). Bayesian estimation and test for factor analysis model withcontinuous and polytomous data in several populations.
British Journal of Mathematical andStatistical Psychology , 54(2):237–263.Valenti, B., Martin, B., Andueza, D., Leroux, C., Labonne, C., Lahalle, F., Larroque, H.,Brunschwig, P., Lecomte, C., and Brochard, M. (2013). Infrared spectroscopic methods forthe discrimination of cows’ milk according to the feeding system, cow breed and altitude ofthe dairy farm.
International Dairy Journal , 32(1):26–32.Wehrhahn, C., Leonard, S., Rodriguez, A., Xifara, T., et al. (2020). A Bayesian approach to dis-ease clustering using restricted chinese restaurant processes.