A Family of Mixture Models for Biclustering
Wangshu Tu∗   Sanjeena Subedi†

September 14, 2020

∗ Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902. e: [email protected]
† Department of Mathematical Sciences, Binghamton University, State University of New York, 4400 Vestal Parkway East, Binghamton, NY, USA 13902. e: [email protected]
Abstract
Biclustering is used for simultaneous clustering of the observations and variables when there is no group structure known a priori. It is being increasingly used in bioinformatics, text analytics, etc. Previously, biclustering has been introduced in a model-based clustering framework by utilizing a structure similar to a mixture of factor analyzers. In such models, the observed variables $\mathbf{X}$ are modelled using a latent variable $\mathbf{U}$ that is assumed to be from $N(\mathbf{0}, \mathbf{I})$. Clustering of the variables is introduced by constraining the entries of the factor loading matrix to be 0 or 1, which results in block-diagonal covariance matrices. However, this approach is overly restrictive, as the off-diagonal elements in the blocks of the covariance matrices can only be 1, which can lead to unsatisfactory model fit on complex data. Here, the latent variable $\mathbf{U}$ is assumed to be from $N(\mathbf{0}, \mathbf{T})$, where $\mathbf{T}$ is a diagonal matrix. This ensures that the off-diagonal terms in the block matrices within the covariance matrices are non-zero and not restricted to be 1, which leads to a superior model fit on complex data. A family of models is developed by imposing constraints on the components of the covariance matrix. For parameter estimation, an alternating expectation conditional maximization (AECM) algorithm is used. Finally, the proposed method is illustrated using simulated and real datasets.

Keywords: Model-based clustering, Biclustering, AECM, Factor analysis, Mixture models
Introduction

Cluster analysis, also known as unsupervised classification, assigns observations into clusters or groups without any prior information on the group labels of any of the observations. It differs from supervised classification, where training data with known labels are used to build models with the aim of classifying observations with no labels. In many situations, labels for all observations are not available in advance or are missing. In such cases, the observations are assigned to groups (or clusters) based on some measure of similarity (e.g., distance) (Saxena et al., 2017). Using a similarity measure, the goal in clustering is to identify subgroups in a heterogeneous population.

Abbreviations: AECM, alternating expectation conditional maximization; ALL, acute lymphoblastic leukemia; AML, acute myeloid leukemia; ARI, adjusted Rand index; BIC, Bayesian information criterion; EM, expectation-maximization; MFA, mixtures of factor analyzers; FDR, false discovery rate.
Clustering approaches can broadly be divided into nonparametric approaches such as k-means (McQueen, 1967) and parametric approaches such as model-based clustering. k-means partitions a dataset into k distinct, non-overlapping clusters using a predefined criterion. However, k-means and other similar approaches are highly dependent on starting values, do not take the correlation among variables into account for multivariate data, and can be sensitive to outliers (Sisodia et al., 2012). Model-based clustering algorithms utilize finite mixture models and provide a probabilistic framework for clustering data. Such models assume that the data come from a finite collection of subpopulations or components, where each subpopulation can be represented by a distribution function depending on the nature of the data. In particular, a $K$-component finite mixture density can be written as
\[
f(\mathbf{y}_i \mid \boldsymbol{\vartheta}) = \sum_{k=1}^{K} \pi_k f_k(\mathbf{y}_i \mid \boldsymbol{\theta}_k), \qquad \pi_k > 0, \quad \sum_{k=1}^{K} \pi_k = 1,
\]
where $f_k(\mathbf{y}_i \mid \boldsymbol{\theta}_k)$ is the density function of each component and $\boldsymbol{\vartheta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2, \ldots, \boldsymbol{\theta}_K)$ represents the model parameters. In the last three decades, there has been an explosion in model-based approaches for clustering different types of data (Banfield and Raftery, 1993; Fraley and Raftery, 2002; Subedi and McNicholas, 2014; Franczak et al., 2014; Dang et al., 2015; Melnykov and Zhu, 2018; Silva et al., 2019; Subedi and McNicholas, 2020).

Traditional clustering algorithms, here referred to as one-way clustering methods, aim to group observations based on similarities across all variables at the same time. This can be too restrictive, as observations may be similar under some variables but different under others (Padilha and Campello, 2017). This limitation motivated the development of biclustering algorithms that simultaneously cluster both rows and columns, i.e., partition a data matrix into small homogeneous blocks (Mirkin, 1996). The idea of biclustering was first introduced by Hartigan (1972), who proposed a partition-based algorithm to find constant biclusters in a data matrix. Cheng and Church (2000) proposed another approach for biclustering that aimed to find homogeneous submatrices using a similarity score through iterative addition/deletion of rows/columns. A similar approach was also proposed by Yang et al. (2002). In Kluger et al. (2003), the authors developed a biclustering approach that utilized a singular value decomposition of the data matrix to find biclusters. While this approach is computationally efficient compared to the previous approaches, the algorithm and the interpretation of the biclusters rely on the choice of normalization. The authors of Ben-Dor et al. (2002), Murali and Kasif (2003), and Liu and Wang (2003) focused on finding a coherent trend across the rows/columns of the data matrix, regardless of their exact values, rather than trying to find blocks with similar values. In Tanay et al. (2002), the authors introduced a biclustering method based on graph theory; this approach converts the rows and columns into a bipartite graph and tries to find the densest subgraphs in the bipartite graph.
However, these approaches were often computationally intensive, and they lacked a statistical model on which inference could be based. In Govaert and Nadif (2008), the authors introduced a model-based co-clustering algorithm using a latent block model for binary data by introducing additional latent variables that are column membership indicators. Nadif and Govaert (2010) extended this approach to contingency tables. A similar framework using mixtures of univariate Gaussian distributions for biclustering continuous data was utilized by Singh Bhatia et al. (2017). Alternatively, Martella et al. (2008) proposed a model-based biclustering framework based on the latent factor analyzer structure. A factor analyzer model (Spearman, 1904; Bartlett, 1953) assumes that an observed high-dimensional variable $\mathbf{Y}$ can be modelled using a much smaller-dimensional latent variable $\mathbf{U}$. Incorporating this factor analyzer structure into mixtures of Gaussian distributions, mixtures of factor analyzers were developed by Ghahramani et al. (1996); Tipping and Bishop (1999); McLachlan and Peel (2000a). Since then, mixtures of factor analyzers have been widely used for various data types (Andrews and McNicholas, 2011; Subedi et al., 2013; Murray et al., 2014; Subedi et al., 2015; Lin et al., 2016; Tortora et al., 2016). Martella et al. (2008) replaced the factor loading matrix by a binary, row-stochastic matrix and imposed constraints on the components of the covariance matrices, resulting in a family of four models for model-based biclustering. In Wong et al. (2017), the authors imposed additional constraints on the components of the covariance matrices and on the number of latent factors, resulting in a family of eight models. However, one major limitation of both Martella et al. (2008) and Wong et al. (2017) is that these models can only recover a restrictive covariance structure in which the off-diagonal elements in the block structure of the covariance matrices are restricted to be 1.

In this paper, we modify the assumptions on the latent factors in the factor analyzer structure used by Martella et al. (2008) and Wong et al. (2017) to capture a wider range of covariance structures. This modification allows for more flexibility in the off-diagonal elements of the block structure of the covariance matrix. Furthermore, a family of parsimonious models is presented. The paper is organized as follows. Details of the generalization are provided in Section 2, along with details on parameter estimation and model selection. In Section 3, we show through simulation studies and real data analyses that these extensions allow for better recovery of the underlying group structure and can recover the sparsity in the covariance matrix. The paper concludes with a discussion and future directions in Section 4.
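Before turning to the methodology, note that the finite mixture density above is straightforward to evaluate directly. The following R sketch computes $f(\mathbf{y}_i \mid \boldsymbol{\vartheta})$ for Gaussian components; the two-component parameter values are hypothetical, and only the mvtnorm package is assumed.

```r
library(mvtnorm)

# f(y) = sum_k pi_k * N(y; mu_k, Sigma_k) for a K-component Gaussian mixture
dmix <- function(y, pi_k, mu, Sigma) {
  sum(sapply(seq_along(pi_k),
             function(k) pi_k[k] * dmvnorm(y, mu[[k]], Sigma[[k]])))
}

# Hypothetical two-component example in two dimensions
dmix(c(0, 0),
     pi_k  = c(0.6, 0.4),
     mu    = list(c(0, 0), c(3, 3)),
     Sigma = list(diag(2), diag(2)))
```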
Methodology

In the factor analysis model (Spearman, 1904; Bartlett, 1953), a $p$-dimensional variable $\mathbf{Y}_i$ can be written as
\[
\mathbf{Y}_i = \boldsymbol{\mu} + \mathbf{V}\mathbf{U}_i + \boldsymbol{\epsilon}_i, \qquad i = 1, \ldots, n,
\]
where $\mathbf{U}_i \sim N_q(\mathbf{0}, \mathbf{I}_q)$ is a $q$-dimensional ($q \ll p$) vector of latent factors, $\boldsymbol{\epsilon}_i \sim N_p(\mathbf{0}, \mathbf{D})$ with $\mathbf{D}$ a diagonal matrix and $\boldsymbol{\epsilon}_i$ independent of $\mathbf{U}_i$, $\mathbf{V}$ is a $p \times q$ matrix of factor loadings, and $\boldsymbol{\mu}$ is a $p$-dimensional mean vector. Then
\[
\mathbf{Y}_i \sim N_p(\boldsymbol{\mu}, \mathbf{V}\mathbf{V}^T + \mathbf{D}),
\]
and, conditional on $\mathbf{U}_i$,
\[
\mathbf{Y}_i \mid \mathbf{u}_i \sim N_p(\boldsymbol{\mu} + \mathbf{V}\mathbf{u}_i, \mathbf{D}).
\]
In the mixture of factor analyzers model with $K$ components (Ghahramani and Hinton, 1997; McLachlan and Peel, 2000b; McNicholas and Murphy, 2008), the $p$-dimensional variable $\mathbf{Y}_i$ can be modelled as
\[
\mathbf{Y}_i = \boldsymbol{\mu}_k + \mathbf{V}_k\mathbf{U}_{ik} + \boldsymbol{\epsilon}_{ik} \quad \text{with probability } \pi_k, \qquad k = 1, \ldots, K; \; i = 1, \ldots, n,
\]
where $\mathbf{U}_{ik} \sim N(\mathbf{0}, \mathbf{I}_{q_k})$ is a $q_k$-dimensional vector of latent factors in the $k$th component, $\boldsymbol{\mu}_k$ is the mean of the $k$th component, $\mathbf{V}_k$ is the $p \times q_k$ matrix of factor loadings of the $k$th component, and $\mathbf{I}_{q_k}$ is an identity matrix of size $q_k$. Ghahramani and Hinton (1997), McLachlan and Peel (2000b), and McNicholas and Murphy (2008) assume the same number of latent variables for all $K$ components (i.e., $q_1 = \ldots = q_K = q$).

In Martella et al. (2008) and Wong et al. (2017), the authors proposed a family of models by replacing the factor loading matrix $\mathbf{V}$ with a binary, row-stochastic matrix $\mathbf{B}$. This $p \times q_k$-dimensional matrix $\mathbf{B}$ can be regarded as a cluster membership indicator matrix for the variable clusters (i.e., column clusters), such that $\mathbf{B}[i, j] = 1$ if the $i$th variable belongs to the $j$th column cluster and $\mathbf{B}[i, k] = 0$ for all $k \neq j$. Under their framework, all clusters have the same number of latent variables and a fixed covariance ($\mathbf{I}$) for the latent variables. By constraining the number and the covariance of the latent variables in this way, the attainable cluster covariance structure is very limited: the correlation between variables depends on $\mathbf{D}$ only. For complex real data, this is overly restrictive, as different clusters could have different numbers of latent variables, and the covariance of the latent variables does not have to be $\mathbf{I}$. To overcome these limitations, we propose a modified MFA model and extend it for biclustering.

Here, we utilize a modified factor analyzer structure such that the $p$-dimensional variable $\mathbf{Y}_i$ can be modelled as
\[
\mathbf{Y}_i = \boldsymbol{\mu}_k + \mathbf{V}_k\mathbf{U}_{ik} + \boldsymbol{\epsilon}_{ik} \quad \text{with probability } \pi_k, \qquad k = 1, \ldots, K; \; i = 1, \ldots, n,
\]
where $\mathbf{U}_{ik} \sim N(\mathbf{0}, \mathbf{T}_{q_k})$ and $\mathbf{T}_{q_k}$ is a diagonal matrix with entries $\{t_1, t_2, \ldots, t_{q_k}\}$. Additionally, we allow different clusters to have different numbers of latent variables. Hence,
\[
\mathbf{Y}_i \sim N_p(\boldsymbol{\mu}_k, \mathbf{V}_k\mathbf{T}_{q_k}\mathbf{V}_k^T + \mathbf{D}_k),
\]
and, conditional on $\mathbf{U}_{ik}$,
\[
\mathbf{Y}_i \mid \mathbf{u}_{ik} \sim N_p(\boldsymbol{\mu}_k + \mathbf{V}_k\mathbf{u}_{ik}, \mathbf{D}_k).
\]
In order to do biclustering, similar to Martella et al. (2008), we replace the loading matrix $\mathbf{V}_k$ by a sparsity matrix $\mathbf{B}_k$ with entries $\mathbf{B}_k[i, j] = 1$ if the $i$th variable belongs to the $j$th group, and 0 otherwise. Under this assumption, we are clustering the variables (i.e., columns) according to the underlying latent factors, as each variable can only be represented by one factor, and variables represented by the same factor are clustered together. In Martella et al. (2008), the authors assumed $\mathbf{T}_{q_k} = \mathbf{I}_q$ and, as stated in Section 2.1, this imposes a stricter restriction on the structure of the component-specific covariance matrices. This restriction not only affects recovery of the true component-specific covariance but also affects the clustering of the observations (i.e., rows).
By assuming $\mathbf{T}_{q_k}$ to be a diagonal matrix with entries $\{t_1, t_2, \ldots, t_{q_k}\}$, the component-specific covariance matrix becomes a block-diagonal matrix and, within the blocks, the off-diagonal elements are not restricted to be 1. For illustration, suppose we have
\[
\mathbf{B}_k = \begin{pmatrix} 1&0&0\\ 1&0&0\\ 1&0&0\\ 0&1&0\\ 0&1&0\\ 0&1&0\\ 0&0&1\\ 0&0&1 \end{pmatrix}_{8 \times 3}, \qquad \mathbf{T}_{q_k} = \mathrm{diag}(t_1, t_2, t_3), \qquad \mathbf{D}_k = \mathrm{diag}(d_1, d_2, \ldots, d_8);
\]
the resulting component-specific covariance matrix becomes
\[
\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k =
\begin{pmatrix}
t_1 + d_1 & t_1 & t_1 & 0 & 0 & 0 & 0 & 0\\
t_1 & t_1 + d_2 & t_1 & 0 & 0 & 0 & 0 & 0\\
t_1 & t_1 & t_1 + d_3 & 0 & 0 & 0 & 0 & 0\\
0 & 0 & 0 & t_2 + d_4 & t_2 & t_2 & 0 & 0\\
0 & 0 & 0 & t_2 & t_2 + d_5 & t_2 & 0 & 0\\
0 & 0 & 0 & t_2 & t_2 & t_2 + d_6 & 0 & 0\\
0 & 0 & 0 & 0 & 0 & 0 & t_3 + d_7 & t_3\\
0 & 0 & 0 & 0 & 0 & 0 & t_3 & t_3 + d_8
\end{pmatrix}.
\]
Therefore, with different combinations of the $t$'s and $d$'s, each block in the block-diagonal covariance matrix can capture:
- large variance, low correlation;
- small variance, high correlation;
- large variance, high correlation; and
- small variance, low correlation.

Recall that in Martella et al. (2008) and in Wong et al. (2017), the $t$'s are restricted to be 1 and, therefore, the model only allows for large variance with low correlation or small variance with high correlation. Additionally, Wong et al. (2017) imposed the further restriction that all components must have the same number of latent factors (i.e., $q_1 = q_2 = \ldots = q_K = q$).
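The block structure above is easy to verify numerically. The following R sketch, with hypothetical values for the $t$'s and $d$'s, constructs $\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k$ for the $8 \times 3$ example and inspects the implied correlations; only base R is used.

```r
# Column-cluster membership matrix B: variables 1-3 load on factor 1,
# variables 4-6 on factor 2, and variables 7-8 on factor 3.
B <- cbind(rep(c(1, 0, 0), c(3, 3, 2)),
           rep(c(0, 1, 0), c(3, 3, 2)),
           rep(c(0, 0, 1), c(3, 3, 2)))

t_diag <- c(4, 0.5, 1)    # hypothetical entries t_1, t_2, t_3 of T
d_diag <- rep(0.5, 8)     # hypothetical entries d_1, ..., d_8 of D

Sigma <- B %*% diag(t_diag) %*% t(B) + diag(d_diag)
round(cov2cor(Sigma), 2)  # within-block correlation is t_j / (t_j + d) here
```

Varying the size of $t_j$ relative to the $d_i$'s moves each block through the four variance/correlation regimes listed above, which is not possible when every $t_j$ is fixed at 1.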
Parameter estimation

Parameter estimation for mixture models is typically done using an expectation-maximization (EM) algorithm (Dempster et al., 1977). This is an iterative approach used when the data are incomplete or are treated as incomplete. It involves two main steps: an expectation step (E-step), in which the expected value of the complete-data log-likelihood is computed using the current parameter estimates, and a maximization step (M-step), in which the expected complete-data log-likelihood is maximized with respect to the model parameters. The E- and M-steps are iterated until convergence. Herein, we utilize an alternating expectation conditional maximization (AECM) algorithm (Meng and Van Dyk, 1997), an extension of the EM algorithm that uses different specifications of the missing data at each stage/cycle and replaces the maximization step by a series of conditional maximization (CM) steps. Here, the observed data $\mathbf{Y}_1, \ldots, \mathbf{Y}_n$ are viewed as incomplete data, and the missing data arise from two sources: the unobserved latent factors $\mathbf{U}_1, \ldots, \mathbf{U}_n$ and the component indicator variables $\mathbf{Z}_1, \ldots, \mathbf{Z}_n$, where
\[
z_{ik} = \begin{cases} 1 & \text{if observation } i \text{ belongs to the } k\text{th group},\\ 0 & \text{otherwise}. \end{cases}
\]
In the first cycle, we treat $z_{ik}$ as the missing data. Hence, the complete-data log-likelihood is
\[
l(\mathbf{Y}, \mathbf{Z}) = \sum_{i=1}^{n} \log f(\mathbf{y}_i, \mathbf{z}_i) = \sum_{i=1}^{n} \log \prod_{k=1}^{K} \left\{ \pi_k f_k(\mathbf{y}_i; \boldsymbol{\mu}_k, \mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k) \right\}^{z_{ik}} = \sum_{i=1}^{n} \sum_{k=1}^{K} z_{ik} \left\{ \log \pi_k + \log f_k(\mathbf{y}_i; \boldsymbol{\mu}_k, \mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k) \right\}.
\]
In the E-step, we compute the expected value of the complete-data log-likelihood, in which the unknown memberships are replaced by their conditional expected values:
\[
\hat{z}_{ik} = E(Z_{ik} \mid \mathbf{y}_i) = \frac{\hat{\pi}_k f_k(\mathbf{y}_i; \hat{\boldsymbol{\mu}}_k, \hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}_k)}{\sum_{k=1}^{K} \hat{\pi}_k f_k(\mathbf{y}_i; \hat{\boldsymbol{\mu}}_k, \hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}_k)}.
\]
Therefore, the expected complete-data log-likelihood becomes
\[
Q(\boldsymbol{\mu}_k, \pi_k) = \sum_{k=1}^{K} n_k \log \pi_k - \frac{np}{2}\log(2\pi) - \frac{1}{2}\sum_{k=1}^{K} n_k \log\left|\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k\right| - \frac{1}{2}\sum_{k=1}^{K} n_k\,\mathrm{tr}\left\{\mathbf{S}_k(\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k)^{-1}\right\},
\]
where $n_k = \sum_{i=1}^{n}\hat{z}_{ik}$ and $\mathbf{S}_k = \frac{1}{n_k}\sum_{i=1}^{n}\hat{z}_{ik}(\mathbf{y}_i - \hat{\boldsymbol{\mu}}_k)(\mathbf{y}_i - \hat{\boldsymbol{\mu}}_k)^T$. In the M-step, maximizing the expected complete-data log-likelihood with respect to $\pi_k$ and $\boldsymbol{\mu}_k$ yields
\[
\hat{\pi}_k = \frac{n_k}{n}, \qquad \hat{\boldsymbol{\mu}}_k = \frac{\sum_{i=1}^{n}\hat{z}_{ik}\mathbf{y}_i}{n_k}.
\]
In the second cycle, we consider both $\mathbf{Z}$ and $\mathbf{U}$ as missing, and the complete-data log-likelihood in this cycle has the following form:
\[
\begin{aligned}
l(\mathbf{Y}, \mathbf{U}, \mathbf{Z}) &= \sum_{i=1}^{n} \log f(\mathbf{y}_i, \mathbf{u}_i, \mathbf{z}_i) = \sum_{i=1}^{n} \log \prod_{k=1}^{K}\left\{\pi_k f_k(\mathbf{y}_i \mid \mathbf{u}_i; \boldsymbol{\mu}_k + \mathbf{B}_k\mathbf{u}_{ik}, \mathbf{D}_k)\, f_k(\mathbf{u}_i; \mathbf{0}, \mathbf{T}_{q_k})\right\}^{z_{ik}}\\
&= \sum_{i=1}^{n}\sum_{k=1}^{K} z_{ik}\left\{\log \pi_k + \log f_k(\mathbf{y}_i \mid \mathbf{u}_i; \boldsymbol{\mu}_k + \mathbf{B}_k\mathbf{u}_{ik}, \mathbf{D}_k) + \log f_k(\mathbf{u}_i; \mathbf{0}, \mathbf{T}_{q_k})\right\}\\
&= C + \sum_{k=1}^{K}\Bigg[\sum_{i=1}^{n} z_{ik}\left(\log \pi_k + \tfrac{1}{2}\log|\mathbf{D}_k^{-1}| + \tfrac{1}{2}\log|\mathbf{T}_{q_k}^{-1}|\right) - \tfrac{1}{2}\,\mathrm{tr}\left\{\mathbf{T}_{q_k}^{-1}\sum_{i=1}^{n} z_{ik}\mathbf{u}_{ik}\mathbf{u}_{ik}^T\right\}\\
&\qquad - \tfrac{1}{2}\,\mathrm{tr}\left\{\mathbf{D}_k^{-1}\sum_{i=1}^{n} z_{ik}(\mathbf{y}_i - \boldsymbol{\mu}_k)(\mathbf{y}_i - \boldsymbol{\mu}_k)^T\right\} + \sum_{i=1}^{n} z_{ik}(\mathbf{y}_i - \boldsymbol{\mu}_k)^T\mathbf{D}_k^{-1}\mathbf{B}_k\mathbf{u}_{ik} - \tfrac{1}{2}\,\mathrm{tr}\left\{\mathbf{B}_k^T\mathbf{D}_k^{-1}\mathbf{B}_k\sum_{i=1}^{n} z_{ik}\mathbf{u}_{ik}\mathbf{u}_{ik}^T\right\}\Bigg],
\end{aligned}
\]
where $C$ is some value that does not depend on $\mathbf{B}_k$, $\mathbf{D}_k$, $\mathbf{T}_{q_k}$, $\mathbf{u}_{ik}$, $\mathbf{z}_i$, or $\pi_k$. Therefore, to compute the expected complete-data log-likelihood, we must calculate the following expectations: $E(Z_{ik} \mid \mathbf{y}_i)$, $E(Z_{ik}\mathbf{U}_{ik} \mid \mathbf{y}_i)$, and $E(Z_{ik}\mathbf{U}_{ik}\mathbf{U}_{ik}^T \mid \mathbf{y}_i)$. We have
\[
\begin{pmatrix}\mathbf{y}_i\\ \mathbf{u}_{ik}\end{pmatrix}\Bigg|\; z_{ik} = 1 \;\sim\; MVN\left(\begin{pmatrix}\boldsymbol{\mu}_k\\ \mathbf{0}\end{pmatrix},\; \begin{pmatrix}\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k & \mathbf{B}_k\mathbf{T}_{q_k}\\ \mathbf{T}_{q_k}\mathbf{B}_k^T & \mathbf{T}_{q_k}\end{pmatrix}\right),
\]
so that
\[
\begin{aligned}
E(\mathbf{U}_{ik} \mid \mathbf{y}_i, z_{ik} = 1) &= \mathbf{T}_{q_k}\mathbf{B}_k^T(\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k)^{-1}(\mathbf{y}_i - \boldsymbol{\mu}_k) =: \hat{\mathbf{u}}_{ik},\\
E(\mathbf{U}_{ik}\mathbf{U}_{ik}^T \mid \mathbf{y}_i, z_{ik} = 1) &= \mathbf{T}_{q_k} - \mathbf{T}_{q_k}\mathbf{B}_k^T(\mathbf{B}_k\mathbf{T}_{q_k}\mathbf{B}_k^T + \mathbf{D}_k)^{-1}\mathbf{B}_k\mathbf{T}_{q_k} + \hat{\mathbf{u}}_{ik}\hat{\mathbf{u}}_{ik}^T =: \boldsymbol{\theta}_{ik},
\end{aligned}
\]
and we write $\boldsymbol{\theta}_k = \frac{1}{n_k}\sum_{i=1}^{n}\hat{z}_{ik}\boldsymbol{\theta}_{ik}$. Then the expected complete-data log-likelihood $Q$ can be written as
\[
Q(\mathbf{B}_k, \mathbf{D}_k, \mathbf{T}_{q_k}) = C + \frac{1}{2}\sum_{k=1}^{K} n_k\Big[\log|\mathbf{D}_k^{-1}| + \log|\mathbf{T}_{q_k}^{-1}| - \mathrm{tr}\{\mathbf{D}_k^{-1}\mathbf{S}_k\} - \mathrm{tr}\{\mathbf{T}_{q_k}^{-1}\boldsymbol{\theta}_k\} + \frac{2}{n_k}\sum_{i=1}^{n}\hat{z}_{ik}(\mathbf{y}_i - \hat{\boldsymbol{\mu}}_k)^T\mathbf{D}_k^{-1}\mathbf{B}_k\hat{\mathbf{u}}_{ik} - \mathrm{tr}\{\mathbf{D}_k^{-1}\mathbf{B}_k\boldsymbol{\theta}_k\mathbf{B}_k^T\}\Big], \tag{1}
\]
where $C$ stands for terms that are independent of $\mathbf{B}_k$, $\mathbf{D}_k$, $\mathbf{T}_{q_k}$, and $\mathbf{u}_{ik}$, and $n_k = \sum_{i=1}^{n}\hat{z}_{ik}$. In the M-step, maximizing the expected complete-data log-likelihood with respect to $\mathbf{D}_k$ and $\mathbf{T}_{q_k}$ yields
\[
\begin{aligned}
\hat{\mathbf{D}}_k^{(t+1)} &= \mathrm{diag}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}_k^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\},\\
\hat{\mathbf{T}}_{q_k}^{(t+1)} &= \mathrm{diag}(\boldsymbol{\theta}_k) = \mathrm{diag}\left(\hat{\mathbf{T}}_{q_k}^{(t)} - \hat{\mathbf{T}}_{q_k}^{(t)}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}^{(t)}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}_k\big)^{-1}\hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}^{(t)} + \frac{\sum_{i=1}^{n}\hat{z}_{ik}\hat{\mathbf{u}}_{ik}\hat{\mathbf{u}}_{ik}^T}{n_k}\right).
\end{aligned}
\]
When estimating $\mathbf{B}_k$, we choose $\mathbf{B}_k[i, j] = 1$ for the value of $j$ that maximizes $Q$, under the constraint $\sum_{j=1}^{q_k}\mathbf{B}_k[i, j] = 1$ for all $k$.

Overall, the AECM algorithm consists of the following steps:
1. Determine the number of clusters $K$ and the $q_k$; then give initial guesses for $\mathbf{B}_k$, $\mathbf{D}_k$, $\mathbf{T}_{q_k}$, and $z_{ik}$.
2. First cycle:
   (a) E-step: update $\hat{z}_{ik}$.
   (b) CM-step: update $\hat{\pi}_k$ and $\hat{\boldsymbol{\mu}}_k$.
3. Second cycle:
   (a) E-step: update $\hat{z}_{ik}$ again and $\hat{\mathbf{u}}_{ik}$.
   (b) CM-step: update $\mathbf{S}_k$, $\hat{\mathbf{D}}_k$, $\hat{\mathbf{T}}_{q_k}$, and $\hat{\mathbf{B}}_k$.
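To make the two cycles concrete, the following R sketch performs a single AECM iteration for the unconstrained UUUU model. It assumes an $n \times p$ data matrix `Y` and lists of current component parameters, and it uses the mvtnorm package for the Gaussian density; the combinatorial row-by-row update of $\mathbf{B}_k$ is omitted, so this is a schematic of the E- and CM-steps above rather than a full implementation.

```r
library(mvtnorm)

aecm_iteration <- function(Y, pi_k, mu, B, T_list, D_list) {
  n <- nrow(Y); K <- length(pi_k)
  # Component covariances Sigma_k = B_k T_k B_k^T + D_k (held fixed within the iteration)
  Sigma <- lapply(1:K, function(k) B[[k]] %*% T_list[[k]] %*% t(B[[k]]) + D_list[[k]])

  ## First cycle: E-step for z, CM-step for pi and mu
  dens <- sapply(1:K, function(k) pi_k[k] * dmvnorm(Y, mu[[k]], Sigma[[k]]))
  z    <- dens / rowSums(dens)                    # responsibilities z_hat[i, k]
  n_k  <- colSums(z)
  pi_k <- n_k / n
  mu   <- lapply(1:K, function(k) colSums(z[, k] * Y) / n_k[k])

  ## Second cycle: E-step for u, CM-steps for D_k and T_k
  for (k in 1:K) {
    Yc   <- sweep(Y, 2, mu[[k]])                  # y_i - mu_k
    Sk   <- crossprod(sqrt(z[, k]) * Yc) / n_k[k] # weighted sample covariance S_k
    beta <- T_list[[k]] %*% t(B[[k]]) %*% solve(Sigma[[k]])  # T_k B_k' Sigma_k^{-1}
    U_hat <- Yc %*% t(beta)                        # rows are u_hat_ik
    theta <- T_list[[k]] - beta %*% B[[k]] %*% T_list[[k]] +
             crossprod(sqrt(z[, k]) * U_hat) / n_k[k]
    D_list[[k]] <- diag(diag(Sk - B[[k]] %*% T_list[[k]] %*% t(B[[k]]) %*%
                               solve(Sigma[[k]]) %*% Sk +
                               B[[k]] %*% theta %*% t(B[[k]])))
    T_list[[k]] <- diag(diag(theta), nrow = nrow(theta))
  }
  list(z = z, pi_k = pi_k, mu = mu, T_list = T_list, D_list = D_list)
}
```

A complete fitting routine would wrap this in a loop, update each $\mathbf{B}_k$, and monitor the log-likelihood with the stopping rule described below.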
To introduce parsimony, constraints can be imposed on the components of the covariance matrices $\mathbf{B}_k$, $\mathbf{T}_{q_k}$, and $\mathbf{D}_k$, resulting in a family of 16 different models with varying numbers of parameters (see Table 1). Here, "U" stands for unconstrained and "C" stands for constrained. This allows for a flexible set of models with covariance structures ranging from extremely constrained to completely unrestricted. Note that the biclustering model by Martella et al. (2008) can be recovered by imposing the constraints $q_1 = q_2 = \ldots = q_K = q$ and $\mathbf{T}_{q_k} = \mathbf{I}_q$. Details on the parameter estimates for the entire family are provided in Appendix A.

Table 1: Parsimonious family of models obtained by imposition of constraints on $\mathbf{B}_k$, $\mathbf{T}_{q_k}$, and $\mathbf{D}_k$.

Model | B_k = B | T_{q_k} = T | D_k = D | D_k = d_k I | Total number of parameters
UUUU  |   U   |   U   |   U   |   U   | pK + Σ_{k=1}^{K} q_k + pK + (K−1) + pK
UUUC  |   U   |   U   |   U   |   C   | pK + Σ_{k=1}^{K} q_k + K + (K−1) + pK
UUCU  |   U   |   U   |   C   |   U   | pK + Σ_{k=1}^{K} q_k + p + (K−1) + pK
UUCC  |   U   |   U   |   C   |   C   | pK + Σ_{k=1}^{K} q_k + 1 + (K−1) + pK
UCUU  |   U   |   C   |   U   |   U   | pK + q + pK + (K−1) + pK
UCUC  |   U   |   C   |   U   |   C   | pK + q + K + (K−1) + pK
UCCU  |   U   |   C   |   C   |   U   | pK + q + p + (K−1) + pK
UCCC  |   U   |   C   |   C   |   C   | pK + q + 1 + (K−1) + pK
CUUU  |   C   |   U   |   U   |   U   | p + Σ_{k=1}^{K} q_k + pK + (K−1) + pK
CUUC  |   C   |   U   |   U   |   C   | p + Σ_{k=1}^{K} q_k + K + (K−1) + pK
CUCU  |   C   |   U   |   C   |   U   | p + Σ_{k=1}^{K} q_k + p + (K−1) + pK
CUCC  |   C   |   U   |   C   |   C   | p + Σ_{k=1}^{K} q_k + 1 + (K−1) + pK
CCUU  |   C   |   C   |   U   |   U   | p + q + pK + (K−1) + pK
CCUC  |   C   |   C   |   U   |   C   | p + q + K + (K−1) + pK
CCCU  |   C   |   C   |   C   |   U   | p + q + p + (K−1) + pK
CCCC  |   C   |   C   |   C   |   C   | p + q + 1 + (K−1) + pK

Initialization

Mixture models are known to be heavily dependent on model initialization. Here, the initial values are chosen as follows:
1. $z_{ik}^{(ini)}$: the row cluster membership indicator variable $z_{ik}$ can be initialized by performing an initial partition using k-means, hierarchical clustering, random partitioning, or fitting a traditional mixture-model-based clustering. Here, we chose the initial partition obtained via Gaussian mixture models using the R package "mclust" (Scrucca et al., 2016).
2. $\mathbf{D}_k^{(ini)}$, $\mathbf{T}_{q_k}^{(ini)}$: similar to McNicholas and Murphy (2008), we estimate the sample covariance matrix $\mathbf{S}_k$ for each group and then use the first $q_k$ principal components to form $\mathbf{V}_k\boldsymbol{\lambda}_k\mathbf{V}_k^T$, where $\mathbf{V}_k$ is the matrix of the first $q_k$ principal component loadings and $\boldsymbol{\lambda}_k$ contains the variances of the first $q_k$ principal components. Then $\mathbf{D}_k^{(ini)} = \mathrm{diag}\{\mathbf{S}_k - \mathbf{V}_k\boldsymbol{\lambda}_k\mathbf{V}_k^T\}$ and $\mathbf{T}_{q_k}^{(ini)} = \boldsymbol{\lambda}_k$.
3. $\mathbf{B}_k^{(ini)}$: similar to Step 2, but we use a scaled version of PCA to obtain the loading matrix $\mathbf{L}_k$. Then, for each row $i$, we set $\mathbf{B}_k[i, j] = 1$ if $\mathbf{L}_k[i, j] = \max_h\{\mathbf{L}_k[i, h]\}$, and 0 otherwise.

Convergence and model selection

For assessing convergence, Aitken's convergence criterion (Aitken, 1926) is used. The Aitken acceleration at iteration $t$ is defined as
\[
a^{(t)} = \frac{l^{(t+1)} - l^{(t)}}{l^{(t)} - l^{(t-1)}},
\]
where $l^{(t+1)}$ stands for the log-likelihood value at iteration $t+1$. Then the asymptotic estimate of the log-likelihood at iteration $t+1$ is
\[
l_{\infty}^{(t+1)} = l^{(t)} + \frac{l^{(t+1)} - l^{(t)}}{1 - a^{(t)}}.
\]
The AECM algorithm can be considered converged when $|l_{\infty}^{(t+1)} - l_{\infty}^{(t)}| < \epsilon$, where $\epsilon$ is a small number (Böhning et al., 1994).

In the clustering context, the true number of components is unknown. The EM algorithm or its variants are typically run for a range of possible numbers of clusters, and model selection is done a posteriori using a model selection criterion. Here, the number of latent factors $q_k$ is also unknown. Therefore, the AECM algorithm is run for all possible combinations of the number of clusters and the number of latent variables, and the best model is chosen using the Bayesian information criterion (BIC; Schwarz, 1978). Mathematically,
\[
\mathrm{BIC} = 2L(\mathbf{y}, \hat{\boldsymbol{\vartheta}}) - m\log(n),
\]
where $L(\mathbf{y}, \hat{\boldsymbol{\vartheta}})$ is the log-likelihood evaluated using the estimated parameters, $m$ is the number of free parameters, and $n$ is the number of observations. For performance evaluation, we use the adjusted Rand index (ARI; Hubert and Arabie, 1985) when the true labels are known. The ARI is 1 for perfect agreement, while the expected value of the ARI is 0 under random classification.

In one-way clustering, label switching refers to the invariance of the likelihood when the mixture component labels are relabelled (Stephens, 2000), and it is typically dealt with by imposing identifiability constraints on the model parameters. In biclustering, both the row and the column memberships could be relabelled. The identifiability of the row memberships is ensured by imposing the constraint $\pi_1 \geq \pi_2 \geq \ldots \geq \pi_K$ on the mixing proportions. For column clusters, interchanging the columns of $\mathbf{B}_k$ does not change the column cluster memberships; however, the associated diagonal elements of the matrix $\mathbf{T}_{q_k}$, as well as of the error matrix $\mathbf{D}_k$, need to be permuted accordingly in order to recover the covariance matrix correctly. Failure to do so may trap the algorithm or decrease the overall likelihood. To overcome this issue, if the overall likelihood decreases at the $(t+1)$th iteration, we set $\mathbf{B}_k^{(t+1)} = \mathbf{B}_k^{(t)}$; otherwise, the updated $\mathbf{B}_k^{(t+1)}$ is retained.
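The stopping rule and the model-selection criterion are compact enough to state in code. A minimal R sketch follows; the tolerance value is illustrative, since the exact value used here is not fixed in this description.

```r
# Aitken-based stopping rule: given the vector `ll` of log-likelihoods from
# successive AECM iterations, compare consecutive asymptotic estimates l_inf.
aitken_converged <- function(ll, eps = 1e-5) {
  t <- length(ll)
  if (t < 4) return(FALSE)
  l_inf <- function(s) {                 # asymptotic estimate at iteration s
    a <- (ll[s] - ll[s - 1]) / (ll[s - 1] - ll[s - 2])
    ll[s - 1] + (ll[s] - ll[s - 1]) / (1 - a)
  }
  abs(l_inf(t) - l_inf(t - 1)) < eps
}

# BIC as defined above; m is the free-parameter count from Table 1,
# shown here for the unconstrained UUUU model.
m_uuuu <- function(p, K, q_k) p * K + sum(q_k) + p * K + (K - 1) + p * K
bic    <- function(loglik, m, n) 2 * loglik - m * log(n)
```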
Simulation studies

We did two sets of simulation studies. For each simulation study, we generated one hundred eight-dimensional datasets, each of size $n = 1000$, and ran all of our sixteen proposed models for a range of values of $K$ and for $q_k = 1, \ldots, 4$. We also ran the unconstrained UUU model by Wong et al. (2017). Note that this model can be obtained as a special case of our UUUU model by constraining $q_1 = q_2 = \ldots = q_K = q$ and $\mathbf{T}_{q_k} = \mathbf{I}$. For each dataset, the model with the highest BIC was chosen a posteriori among all the models, including the model by Wong et al. (2017).

For the first simulation study, 100 datasets were generated from the most constrained CCCC model with $K = 3$ and $q = [3, 3, 3]$. Table 2 shows the true parameters along with the averages and standard errors of the estimated parameters.

Table 2: True parameters along with the averages (standard errors) of the estimated parameters over the 100 datasets in Simulation Study 1.

Component 1 ($n = 500$, $\pi_1 = 0.5$):
  $\boldsymbol{\mu}$ true: [-5, -4, -3, -2, -1, 0, 1, 2]
  average estimate: [-5.00, -4.00, -3.00, -1.99, -1.01, -0.01, 1.01, 2.01]
  (0.09, 0.10, 0.10, 0.09, 0.09, 0.09, 0.09, 0.09)

Component 2 ($n = 300$, $\pi_2 = 0.3$):
  $\boldsymbol{\mu}$ true: [0, 1, 2, 3, 4, 5, 6, 7]
  average estimate: [0.00, 1.01, 2.01, 3.02, 4.02, 5.01, 6.03, 7.00]
  (0.12, 0.12, 0.11, 0.12, 0.12, 0.13, 0.13, 0.12)

Component 3 ($n = 200$, $\pi_3 = 0.2$):
  $\boldsymbol{\mu}$ true: [5, 6, 7, 8, 9, 10, 11, 12]
  average estimate: [5.00, 6.00, 7.01, 8.00, 9.00, 9.99, 10.98, 11.99]
  (0.15, 0.19, 0.17, 0.16, 0.16, 0.16, 0.16, 0.14)

Average estimated $\hat{\Sigma}$ (common to all components under the CCCC model):
[ 4.47 1.98 1.98    0    0    0    0    0 ]
[ 1.98 4.47 1.98    0    0    0    0    0 ]
[ 1.98 1.98 4.47    0    0    0    0    0 ]
[    0    0    0 4.47 1.98 1.98    0    0 ]
[    0    0    0 1.98 4.47 1.98    0    0 ]
[    0    0    0 1.98 1.98 4.47    0    0 ]
[    0    0    0    0    0    0 4.47 1.98 ]
[    0    0    0    0    0    0 1.98 4.47 ]
The elementwise standard errors of the within-block off-diagonal entries of $\hat{\Sigma}$ were approximately 0.09.
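Data from the generating model of Simulation Study 1 can be drawn directly from the hierarchical formulation $\mathbf{Y}_i = \boldsymbol{\mu}_k + \mathbf{B}\mathbf{u}_{ik} + \boldsymbol{\epsilon}_{ik}$. The R sketch below uses base R only; the values $t = 1.98$ and $d = 4.47 - 1.98 = 2.49$ are back-calculated from the block entries of Table 2, and the exact values used to generate the data are an assumption.

```r
set.seed(123)
# Draw from the biclustering model: Y_i = mu_k + B u + e with
# u ~ N(0, T) and e ~ N(0, D).
rmix_block <- function(n, pi_k, mu, B, t_diag, d_diag) {
  p <- nrow(B); q <- ncol(B)
  z <- sample(length(pi_k), n, replace = TRUE, prob = pi_k)
  Y <- t(sapply(z, function(k) {
    mu[[k]] + as.vector(B %*% rnorm(q, sd = sqrt(t_diag))) +
      rnorm(p, sd = sqrt(d_diag))
  }))
  list(Y = Y, z = z)
}

B   <- cbind(rep(c(1, 0, 0), c(3, 3, 2)),   # common B with 3-3-2 block structure
             rep(c(0, 1, 0), c(3, 3, 2)),
             rep(c(0, 0, 1), c(3, 3, 2)))
mu  <- list(c(-5, -4, -3, -2, -1, 0, 1, 2), 0:7, 5:12)
sim <- rmix_block(1000, c(0.5, 0.3, 0.2), mu, B,
                  t_diag = rep(1.98, 3), d_diag = rep(2.49, 8))
```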
Figure 1 shows the scatterplot matrix for one of the hundred datasets from the first simulation study; the clusters are not well-separated.

Figure 1: Scatterplot matrix for one of the hundred datasets for Simulation Study 1. The sub-plots above the diagonal contain the sample correlations between the respective observed variables for each cluster.

For the second simulation study, 100 datasets were generated from the completely unconstrained UUUU model with $K = 3$ and $q = [3, 3, 2]$. Figure 2 shows the scatterplot matrix for one of the hundred datasets and, again, the clusters are not well-separated. In 83 out of the 100 datasets, the BIC selected a three-component model, with some variation in $q$ and in model type, with an average ARI of 0.99 (standard error of 0.01); in the remaining 17 datasets, a four-component model was selected. In 44 out of those 83 datasets, the correct model (i.e., a three-component UUUU model with $q = [3, 3, 2]$) was selected.

Figure 2: Scatterplot matrix for one of the hundred datasets for Simulation Study 2. The sub-plots above the diagonal contain the sample correlations between the respective observed variables for each cluster.

Table 3: True parameters along with the averages and standard errors of the estimated parameters from the 83 out of the 100 datasets where a $K = 3$ model was selected. (Entries marked "·" below are not available.)

Component 1 ($n = 500$):
  $\boldsymbol{\mu}$ true: [-5, -4, -3, -2, -1, 0, 1, 2]
  average estimate: [-5.01, -4.00, -3.01, -2.00, -1.00, -0.01, 1.01, 2.01]
  (0.06, 0.09, 0.10, 0.06, 0.06, 0.07, 0.08, 0.08)
  Average estimated $\hat{\Sigma}$ (blocks {1-3}, {4-6}, {7-8}):
  [    · 0.50 0.50    0    0    0    0    0 ]
  [ 0.50 3.51 0.50    0    0    0    0    0 ]
  [ 0.50 0.50 4.49    0    0    0    0    0 ]
  [    0    0    0 2.00 1.01 1.01    0    0 ]
  [    0    0    0 1.01 2.00 1.01    0    0 ]
  [    0    0    0 1.01 1.01 2.00    0    0 ]
  [    0    0    0    0    0    0 3.48 2.99 ]
  [    0    0    0    0    0    0 2.99    · ]
  sd($\hat{\Sigma}$):
  [ 0.18 0.12 0.12    0    0    0    0    0 ]
  [ 0.12 0.20 0.12    0    0    0    0    0 ]
  [ 0.12 0.12 0.26    0    0    0    0    0 ]
  [    0    0    0 0.13 0.08 0.08    0    0 ]
  [    0    0    0 0.08 0.14 0.08    0    0 ]
  [    0    0    0 0.08 0.08 0.13    0    0 ]
  [    0    0    0    0    0    0 0.25 0.27 ]
  [    0    0    0    0    0    0 0.27    · ]

Component 2 ($n = 300$):
  $\boldsymbol{\mu}$ true: [0, 1, 2, 3, 4, 5, 6, 7]
  average estimate: [-0.01, 1.00, 2.00, 3.02, 4.01, 5.02, 6.03, 7.01]
  Average estimated $\hat{\Sigma}$ (blocks {1-3}, {4-5}, {6-8}):
  [    · 3.74 3.74    0    0    0    0    0 ]
  [ 3.74 4.16 3.74    0    0    0    0    0 ]
  [ 3.74 3.74 4.59    0    0    0    0    0 ]
  [    0    0    0 4.33 2.35    0    0    0 ]
  [    0    0    0 2.35 4.31    0    0    0 ]
  [    0    0    0    0    0 4.08 3.06 3.06 ]
  [    0    0    0    0    0 3.06 3.54 3.06 ]
  [    0    0    0    0    0 3.06 3.06    · ]
  sd($\hat{\Sigma}$):
  [ 0.55 0.54 0.54    0    0    0    0    0 ]
  [ 0.54 0.55 0.54    0    0    0    0    0 ]
  [ 0.54 0.54 0.58    0    0    0    0    0 ]
  [    0    0    0 0.70 0.63    0    0    0 ]
  [    0    0    0 0.63 0.73    0    0    0 ]
  [    0    0    0    0    0 0.31 0.30 0.30 ]
  [    0    0    0    0    0 0.30 0.32 0.30 ]
  [    0    0    0    0    0 0.30 0.30    · ]

Component 3 ($n = 200$):
  $\boldsymbol{\mu}$ true: [5, 6, 7, 8, 9, 10, 11, 12]
  average estimate: [5.01, 6.01, 7.02, 8.01, 9.01, 9.99, 10.97, 11.99]
  (0.10, 0.13, 0.14, 0.16, 0.15, 0.11, 0.14, 0.10)
  Average estimated $\hat{\Sigma}$ (blocks {1-5}, {6-8}):
  [    · 1.89 1.89 1.89 1.89    0    0    0 ]
  [ 1.89 2.40 1.89 1.89 1.89    0    0    0 ]
  [ 1.89 1.89 2.90 1.89 1.89    0    0    0 ]
  [ 1.89 1.89 1.89 4.89 1.89    0    0    0 ]
  [ 1.89 1.89 1.89 1.89 3.94    0    0    0 ]
  [    0    0    0    0    0 2.13 1.15 1.15 ]
  [    0    0    0    0    0 1.15 3.16 1.15 ]
  [    0    0    0    0    0 1.15 1.15    · ]
  sd($\hat{\Sigma}$):
  [ 0.26 0.28 0.28 0.28 0.28    0    0    0 ]
  [ 0.28 0.31 0.28 0.28 0.28    0    0    0 ]
  [ 0.28 0.28 0.32 0.28 0.28    0    0    0 ]
  [ 0.28 0.28 0.28 0.43 0.28    0    0    0 ]
  [ 0.28 0.28 0.28 0.28 0.38    0    0    0 ]
  [    0    0    0    0    0 0.26 0.28 0.28 ]
  [    0    0    0    0    0 0.28 0.37 0.28 ]
  [    0    0    0    0    0 0.28 0.28    · ]

Real data analyses

We applied our method to three datasets:
1. The Alon data (Alon et al., 1999) contain the gene expression measurements of 6500 genes, obtained using an Affymetrix oligonucleotide Hum6000 array, for 62 samples (40 tumor samples, 22 normal samples) from colon-cancer patients. We started with the preprocessed version of the data from McNicholas and Murphy (2010), which comprised 461 genes. As $p \gg n$ and our algorithm is currently not designed for high-dimensional data, a $t$-test followed by a false discovery rate (FDR) threshold of 0.1% was used to reduce the dimensionality, yielding 22 differentially expressed genes. Hence, the resulting ratio of the dimensionality of the dataset to the sample size is $p/n \approx 0.35$.
2. The Golub data (Golub et al., 1999) contain gene expression values of 7129 genes from 72 samples: 47 patients with acute lymphoblastic leukemia (ALL) and 25 patients with acute myeloid leukemia (AML). We started with the preprocessed version of the data from McNicholas and Murphy (2010), which comprised 2030 genes. Again, as $p \gg n$ and our algorithm is currently not designed for high-dimensional data, we selected the top 40 most differentially expressed genes by setting the FDR threshold to 0.00001%. Hence, the resulting ratio of the dimensionality of the dataset to the sample size is $p/n \approx 0.56$.
3. The Wine data, available in the R package rattle (Williams, 2012), contain information on 13 different attributes from the chemical analysis of wines grown in specific areas of Italy. The dataset comprises 178 samples of wine that can be categorized into three types: "Barolo", "Grignolino", and "Barbera". Since the sample size $n \gg p$, we use all 13 variables here. Since the chemical measurements are on different scales, the dataset was scaled before running the biclustering methods.
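The gene-filtering step described above can be sketched as follows, assuming an expression matrix `X` with samples in rows and genes in columns and a two-level class factor; the Benjamini-Hochberg adjustment is used here as one standard way of imposing an FDR threshold, since the adjustment method is not specified above.

```r
# Per-gene two-sample t-tests followed by FDR control; genes whose
# adjusted p-value falls below the chosen threshold are retained.
select_genes <- function(X, class, fdr = 0.001) {  # 0.001 = 0.1% (Alon data)
  pvals <- apply(X, 2, function(g) t.test(g ~ class)$p.value)
  X[, p.adjust(pvals, method = "BH") < fdr, drop = FALSE]
}
```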
For each dataset, we ran all of the 16 proposed models for a range of values of $K$ and $q_k = 1, \ldots, 8$. For comparison, we show the clustering results of the following biclustering approaches applied to the above real datasets:
1. U-OSGaBi family (unsupervised version of the OSGaBi family): here, we run the approach proposed by Wong et al. (2017). In Wong et al. (2017), the authors performed one-way supervision (treating the observations' memberships as unknown and the variables' group memberships as known); in our analysis, however, we perform unsupervised clustering of both rows and columns. We ran all eight models by Wong et al. (2017) over corresponding ranges of $K$ and $q$, with $q_1 = q_2 = \ldots = q_K = q$ and $\mathbf{T}_{q_k} = \mathbf{I}$.
2. Block-cluster: here, we also run the biclustering models proposed by Singh Bhatia et al. (2017) for continuous data. This model utilizes mixtures of univariate Gaussian distributions. All four models obtained via imposition of constraints on the mixing proportions and variances to be equal or different across groups were run using the R package "blockcluster" (Singh Bhatia et al., 2017).

The performance of all three methods on the real datasets is summarized in Table 4. Our proposed model outperforms both the U-OSGaBi family and the block-cluster method on the Alon data and the Wine data. However, both the block-cluster method and our proposed method provide the same clustering performance on the Golub data. It is interesting to note that, on the Wine data, the model selected by our approach is the UUCU model and the model selected from the U-OSGaBi family is the UCU model. Both models have the same constraints on $\mathbf{B}_k$ and $\mathbf{D}$; however, in the U-OSGaBi family, there is the additional restriction that $\mathbf{T}_{q_k} = \mathbf{I}$. Removing the restriction $\mathbf{T}_{q_k} = \mathbf{I}$ in our approach gives a substantial increase in the ARI (i.e., from 0.74 to 0.93). Additionally, on the Golub dataset, fitting the U-OSGaBi family results in the selection of $q = 3$ while our proposed algorithm chooses $q = 6$, both with the same constraint on $\mathbf{B}$. The CUU model selected from the U-OSGaBi family has a constrained $\mathbf{B}$ matrix and fixed $\mathbf{T} = \mathbf{I}$, whereas our proposed approach also selects a model with a constrained $\mathbf{B}$ matrix but a group-specific anisotropic matrix $\mathbf{T}_k$. However, the ARI from our proposed method (i.e., ARI = 0.94) is much higher than the ARI from the U-OSGaBi family (i.e., ARI = 0.69). Also, notice that for the Wine data, our proposed model selects different values of $q$ for different groups. Hence, this improvement in clustering performance could be due to removal of the restriction $\mathbf{T}_{q_k} = \mathbf{I}$ for all $k = 1, \ldots, K$, removal of the restriction $q_1 = q_2 = \ldots = q_K = q$, or both.

Here, we will use the Alon data for a detailed illustration. While the ARI can be used for evaluating the agreement of the row cluster memberships with a reference class indicator variable, a heatmap of the observations is typically used to visualize the bicluster structure. Figure 3 shows that our proposed method is able to recover the underlying bicluster structure fairly well.

Figure 3: Heatmap of the observations from the Alon data.

We also visualize component correlation matrices to gain insight into the observed column clusters. As evident from the heatmaps of the observed and estimated correlation structures in Figure 4, variables that are highly correlated are placed together in the same column clusters.

Figure 4: Heatmaps of the cluster-specific correlation structures in the two row clusters of the Alon data: (a) observed correlation structures; (b) recovered correlation structures.

Table 4: Summary of the clustering performances by the best model selected using BIC for all three approaches. (Entries marked "·" are not available.)

Data (true K) | Method | Model | K | q_k | ARI
Alon (2)  | Proposed          | ·                 | 2 | ·         | ·
Alon (2)  | U-OSGaBi family   | CCU               | 2 | [5, 5]    | 0.64
Alon (2)  | Block-cluster     | pik_rhol_sigma2   | 4 | [2, 2]    | 0.33
Golub (2) | Proposed          | ·                 | 2 | [6, 6]    | 0.94
Golub (2) | U-OSGaBi family   | CUU               | 2 | [3, 3]    | 0.69
Golub (2) | Block-cluster     | pi_rho_sigma2kl   | 2 | [4, ·]    | 0.94
Wine (3)  | Proposed          | UUCU              | 3 | ·         | 0.93
Wine (3)  | U-OSGaBi family   | UCU               | 3 | [5, 5, 5] | 0.74
Wine (3)  | Block-cluster     | pik_rhol_sigma2kl | 3 | [7, ·, 7] | 0.88
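The ARI values in Table 4 can be reproduced for any fitted model by comparing the maximum a posteriori row labels with the reference classes. A minimal sketch using the mclust package, where the responsibility matrix `z` and reference labels `truth` are assumed to exist:

```r
library(mclust)
# Row-cluster labels from the responsibility matrix z (n x K), compared
# with the reference labels via the adjusted Rand index.
z_hat <- apply(z, 1, which.max)
adjustedRandIndex(z_hat, truth)
```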
Conclusion
In this paper, we propose an extended family of 16 models for model-based biclustering. Parsimony is introduced in two ways. First, as the factor loading matrix $\mathbf{B}$ is binary and row-stochastic, the resulting covariance matrix is block-diagonal. Second, we also impose constraints on the elements of the covariance matrices, resulting in a family of models with varying numbers of parameters. Our proposed method builds on the work by Martella et al. (2008) and Wong et al. (2017), which utilized a factor analyzer structure for developing a model-based biclustering framework. However, those works restricted the covariance matrix of the latent variable to be an identity matrix and assumed that the number of latent variables was the same for all components. The restriction on the covariance matrix of the latent variable imposes a restriction on the structure of the block-diagonal part of the covariance matrix, thereby allowing only for high variance with low covariance or low variance with high covariance. Here, we propose a modified factor analyzer structure that assumes the covariance of the latent variable is a diagonal matrix and is therefore able to recover a wide variety of covariance structures. Our simulations demonstrate good parameter recovery. Additionally, we also allow different components to have different $q_k$, thereby allowing different groupings of the variables in different clusters. Using simulated data, we demonstrate that these models give good clustering performance and can recover the underlying covariance structure. On real datasets, we show through comparison with the method by Wong et al. (2017) that easing the restrictions on $\mathbf{T}$ and on the number of latent variables can provide a substantial improvement in clustering performance. We also compared our approach to the block-cluster method by Singh Bhatia et al. (2017) and showed that our proposed method provides competitive performance.

Although our proposed models can capture an extended range of covariance structures compared to Martella et al. (2008) and Wong et al. (2017), they still only allow positive correlations within the column clusters. However, typically in biology, it may be of interest to group variables based on the magnitude of the correlation regardless of its sign. For example, suppose a particular pathway plays a crucial role in tumor development and, consequently, genes involved in that pathway show changes in their expression levels. Some genes may be up-regulated while others may be down-regulated; hence, these genes will be divided into multiple column clusters, which will lead to overestimation of the number of column clusters. Additionally, each column cluster will then only provide a partial and incomplete view of the pathway's involvement. Furthermore, investigation of approaches for efficient update of $\mathbf{B}_k$ is warranted. While the current approach of updating it row by row provided satisfactory clustering performance, it is not computationally efficient and can sometimes miss the true underlying structure. Our proposed approach only allows a variable to be in one column cluster, whereas, from a practical viewpoint, introducing soft margins and allowing variables to be involved in more than one column cluster might be informative.

References
Aitken, A. C. (1926), 'On Bernoulli's numerical solution of algebraic equations', Proceedings of the Royal Society of Edinburgh 46, 289–305.

Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D. and Levine, A. (1999), 'Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays', Proceedings of the National Academy of Sciences 96(12), 6745–6750.

Andrews, J. L. and McNicholas, P. D. (2011), 'Mixtures of modified t-factor analyzers for model-based clustering, classification, and discriminant analysis', Journal of Statistical Planning and Inference 141(4), 1479–1486.

Banfield, J. D. and Raftery, A. E. (1993), 'Model-based Gaussian and non-Gaussian clustering', Biometrics 49, 803–821.

Bartlett, M. (1953), Factor analysis in psychology as a statistician sees it, in 'Uppsala Symposium on Psychological Factor Analysis', number 3 in 'Nordisk Psykologi's Monograph Series', Almquist and Wiksell, Uppsala, Sweden, pp. 23–34.

Ben-Dor, A., Chor, B., Karp, R. and Yakhini, Z. (2002), Discovering local structure in gene expression data, pp. 49–57.

Böhning, D., Dietz, E., Schaub, R., Schlattmann, P. and Lindsay, B. (1994), 'The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family', Annals of the Institute of Statistical Mathematics 46, 373–388.

Cheng, Y. and Church, G. (2000), 'Biclustering of expression data', in 'Proceedings of the International Conference on Intelligent Systems for Molecular Biology (ISMB)', pp. 93–103.

Dang, U. J., Browne, R. P. and McNicholas, P. D. (2015), 'Mixtures of multivariate power exponential distributions', Biometrics 71(4), 1081–1089.

Dempster, A. P., Laird, N. M. and Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', Journal of the Royal Statistical Society: Series B 39, 1–38.

Fraley, C. and Raftery, A. E. (2002), 'Model-based clustering, discriminant analysis, and density estimation', Journal of the American Statistical Association 97(458), 611–631.

Franczak, B. C., Browne, R. P. and McNicholas, P. D. (2014), 'Mixtures of shifted asymmetric Laplace distributions', IEEE Transactions on Pattern Analysis and Machine Intelligence 36(6), 1149–1157.

Ghahramani, Z. and Hinton, G. E. (1997), The EM algorithm for mixtures of factor analyzers, Technical report.

Ghahramani, Z., Hinton, G. E. et al. (1996), The EM algorithm for mixtures of factor analyzers, Technical Report CRG-TR-96-1, University of Toronto.

Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., Loh, M. L., Downing, J. R., Caligiuri, M. A. and Bloomfield, C. D. (1999), 'Molecular classification of cancer: class discovery and class prediction by gene expression monitoring', Science 286, 531–537.

Gonzales-Barron, U. and Butler, F. (2006), 'A comparison of seven thresholding techniques with the k-means clustering algorithm for measurement of bread-crumb features by digital image analysis', Journal of Food Engineering 74(2), 268–278.

Govaert, G. and Nadif, M. (2008), 'Block clustering with Bernoulli mixture models: Comparison of different approaches', Computational Statistics & Data Analysis 52(6), 3233–3245.

Hartigan, J. A. (1972), 'Direct clustering of a data matrix', Journal of the American Statistical Association 67(337), 123–129. URL: http://dx.doi.org/10.2307/2284710

Houdard, A., Bouveyron, C. and Delon, J. (2018), 'High-dimensional mixture models for unsupervised image denoising (HDMI)', SIAM Journal on Imaging Sciences 11(4), 2815–2846.

Hubert, L. and Arabie, P. (1985), 'Comparing partitions', Journal of Classification 2(1), 193–218.

Jiang, D., Tang, C. and Zhang, A. (2004), 'Cluster analysis for gene expression data: a survey', IEEE Transactions on Knowledge and Data Engineering 16(11), 1370–1386.

Johnson, S. C. (1967), 'Hierarchical clustering schemes', Psychometrika 32(3), 241–254.

Kluger, Y., Basri, R., Chang, J. T. and Gerstein, M. (2003), 'Spectral biclustering of microarray data: Coclustering genes and conditions', Genome Research 13(4), 703–716.

Lin, T., McLachlan, G. J. and Lee, S. X. (2016), 'Extending mixtures of factor models using the restricted multivariate skew-normal distribution', Journal of Multivariate Analysis 143, 398–413.

Liu, J. and Wang, W. (2003), 'Op-cluster: Clustering by tendency in high dimensional space'.

Martella, F., Alfò, M. and Vichi, M. (2008), 'Biclustering of gene expression data by an extension of mixtures of factor analyzers', The International Journal of Biostatistics 4(1), Article 3.

McLachlan, G. and Peel, D. (2000a), Mixtures of factor analyzers, in 'Proceedings of the Seventeenth International Conference on Machine Learning', Citeseer.

McLachlan, G. and Peel, D. (2000b), Mixtures of factor analyzers, in 'Proceedings of the Seventeenth International Conference on Machine Learning', Morgan Kaufmann, pp. 599–606.

McNicholas, P. D. and Murphy, T. B. (2008), 'Parsimonious Gaussian mixture models', Statistics and Computing 18(3), 285–296. URL: https://doi.org/10.1007/s11222-008-9056-0

McNicholas, P. D. and Murphy, T. B. (2010), 'Model-based clustering of microarray expression data via latent Gaussian mixture models', Bioinformatics 26(21), 2705–2712.

McNicholas, P. D. and Subedi, S. (2012), 'Clustering gene expression time course data using mixtures of multivariate t-distributions', Journal of Statistical Planning and Inference 142(5), 1114–1127.

McQueen, J. (1967), 'Some methods for classification and analysis of multivariate observations', in 'Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability', pp. 281–297.

Melnykov, V. and Zhu, X. (2018), 'On model-based clustering of skewed matrix data', Journal of Multivariate Analysis 167, 181–194.

Meng, X.-L. and Van Dyk, D. (1997), 'The EM algorithm: an old folk-song sung to a fast new tune', Journal of the Royal Statistical Society: Series B (Statistical Methodology) 59(3), 511–567. URL: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-9868.00082

Mirkin, B. (1996), Mathematical Classification and Clustering, Kluwer Academic Publishers.

Murali, T. and Kasif, S. (2003), 'Extracting conserved gene expression motifs from gene expression data', Pacific Symposium on Biocomputing 8, 77–88.

Murray, P. M., McNicholas, P. D. and Browne, R. P. (2014), 'A mixture of common skew-t factor analysers', Stat 3(1), 68–82.

Nadif, M. and Govaert, G. (2010), 'Latent block model for contingency table', Communications in Statistics - Theory and Methods 39(3), 416–425.

Padilha, V. and Campello, R. (2017), 'A systematic comparative evaluation of biclustering techniques', BMC Bioinformatics.

Saunders, J. (1980), 'Cluster analysis for market segmentation', European Journal of Marketing.

Saxena, A., Prasad, M., Gupta, A., Bharill, N., Patel, O. P., Tiwari, A., Er, M. J., Ding, W. and Lin, C.-T. (2017), 'A review of clustering techniques and developments', Neurocomputing 267, 664–681.

Schwarz, G. (1978), 'Estimating the dimension of a model', The Annals of Statistics 6, 461–464.

Scrucca, L., Fop, M., Murphy, T. B. and Raftery, A. E. (2016), 'mclust 5: clustering, classification and density estimation using Gaussian finite mixture models', The R Journal 8(1), 205–233. URL: https://journal.r-project.org/archive/2016-1/scrucca-fop-murphy-etal.pdf

Silva, A., Rothstein, S. J., McNicholas, P. D. and Subedi, S. (2019), 'A multivariate Poisson-log normal mixture model for clustering transcriptome sequencing data', BMC Bioinformatics 20(1), 394.

Singh Bhatia, P., Iovleff, S. and Govaert, G. (2017), 'blockcluster: An R package for model-based co-clustering', Journal of Statistical Software 76(9), 1–24.

Sisodia, D., Singh, L., Sisodia, S. and Saxena, K. (2012), 'Clustering techniques: a brief survey of different clustering algorithms', International Journal of Latest Trends in Engineering and Technology (IJLTET) (3), 82–87.

Spearman, C. (1904), 'The proof and measurement of association between two things', The American Journal of Psychology 15(1), 72–101.

Stephens, M. (2000), 'Dealing with label switching in mixture models', Journal of the Royal Statistical Society: Series B 62, 795–809.

Subedi, S. and McNicholas, P. D. (2014), 'Variational Bayes approximations for clustering via mixtures of normal inverse Gaussian distributions', Advances in Data Analysis and Classification 8(2), 167–193.

Subedi, S. and McNicholas, P. D. (2020), 'A variational approximations-DIC rubric for parameter estimation and mixture model selection within a family setting', Journal of Classification, pp. 1–20.

Subedi, S., Punzo, A., Ingrassia, S. and McNicholas, P. D. (2013), 'Clustering and classification via cluster-weighted factor analyzers', Advances in Data Analysis and Classification 7(1), 5–40.

Subedi, S., Punzo, A., Ingrassia, S. and McNicholas, P. D. (2015), 'Cluster-weighted t-factor analyzers for robust model-based clustering and dimension reduction', Statistical Methods & Applications 24(4), 623–649.

Tanay, A., Sharan, R. and Shamir, R. (2002), 'Discovering statistically significant biclusters in gene expression data', Bioinformatics 18(suppl 1), S136–S144. URL: https://doi.org/10.1093/bioinformatics/18.suppl_1.S136

Tipping, M. E. and Bishop, C. M. (1999), 'Mixtures of probabilistic principal component analyzers', Neural Computation 11(2), 443–482.

Tortora, C., McNicholas, P. D. and Browne, R. P. (2016), 'A mixture of generalized hyperbolic factor analyzers', Advances in Data Analysis and Classification 10(4), 423–440.

Ward Jr, J. H. (1963), 'Hierarchical grouping to optimize an objective function', Journal of the American Statistical Association 58(301), 236–244.

Williams, G. (2012), Data Mining with Rattle and R: The Art of Excavating Data for Knowledge Discovery, Springer.

Wong, M. H., Mutch, D. M. and McNicholas, P. D. (2017), 'Two-way learning with one-way supervision for gene expression data', BMC Bioinformatics 18(1), 150.

Yang, J., Wang, W., Wang, H. and Yu, P. (2002), δ-clusters: Capturing subspace correlation in a large data set, pp. 517–528.

Appendix A: Estimation for the 16 models in the family
Here, we provide the parameter estimates for the components of the covariance matrices for all 16 proposed models. Recall the following:
\[
\mathbf{S}_k = \frac{\sum_{i=1}^{n}\hat{z}_{ik}(\mathbf{y}_i - \boldsymbol{\mu}_k)(\mathbf{y}_i - \boldsymbol{\mu}_k)^T}{n_k}, \qquad n_k = \sum_{i=1}^{n}\hat{z}_{ik}, \qquad \boldsymbol{\theta}_k = \frac{\sum_{i=1}^{n}\hat{z}_{ik}\,E(\mathbf{U}_{ik}\mathbf{U}_{ik}^T \mid \mathbf{y}_i)}{n_k}.
\]
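These building blocks translate directly into code. A minimal R sketch, under the assumption that the responsibilities $\hat{z}_{ik}$ and the current $\mathbf{B}_k$, $\mathbf{T}_{q_k}$, $\mathbf{D}_k$ are available:

```r
# n_k, S_k and theta_k for component k, given an n x p data matrix Y,
# a responsibility vector z_k, mean mu_k, and current B_k, T_k, D_k.
moments_k <- function(Y, z_k, mu_k, B_k, T_k, D_k) {
  Sigma_k <- B_k %*% T_k %*% t(B_k) + D_k
  n_k     <- sum(z_k)
  Yc      <- sweep(Y, 2, mu_k)
  S_k     <- crossprod(sqrt(z_k) * Yc) / n_k
  beta_k  <- T_k %*% t(B_k) %*% solve(Sigma_k)   # T_k B_k' Sigma_k^{-1}
  U_hat   <- Yc %*% t(beta_k)                    # rows are u_hat_ik
  theta_k <- T_k - beta_k %*% B_k %*% T_k + crossprod(sqrt(z_k) * U_hat) / n_k
  list(n_k = n_k, S_k = S_k, theta_k = theta_k)
}
```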
1. UUUU model: Here, we assume no constraints on $\mathbf{B}_k$, $\mathbf{T}_{q_k}$, and $\mathbf{D}_k$. Details on the parameter estimates are provided in Section 2.3.
2. UUUC model: Here, we assume $\mathbf{D}_k = d_k\mathbf{I}_p$ and no constraints on $\mathbf{B}_k$ and $\mathbf{T}_{q_k}$. The estimates of $\mathbf{B}_k$ and $\mathbf{T}_{q_k}$ are the same as for the UUUU model. Replacing $\mathbf{D}_k$ with $d_k\mathbf{I}_p$ in equation (1) and taking the derivative with respect to $d_k$ yields
\[
\hat{d}_k = \frac{1}{p}\,\mathrm{tr}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}_k^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\}.
\]
3. UUCU model: Here, we assume $\mathbf{D}_k = \mathbf{D}$ and no constraints on $\mathbf{B}_k$ and $\mathbf{T}_{q_k}$. The estimates of $\mathbf{B}_k$ and $\mathbf{T}_{q_k}$ are the same as for the UUUU model. The estimate of $\mathbf{D}$ is
\[
\hat{\mathbf{D}} = \sum_{k=1}^{K}\hat{\pi}_k\,\mathrm{diag}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\}.
\]
4. UUCC model: Here, we assume $\mathbf{D}_k = d\mathbf{I}_p$ and no constraints on $\mathbf{B}_k$ and $\mathbf{T}_{q_k}$. The estimates of $\mathbf{B}_k$ and $\mathbf{T}_{q_k}$ are the same as for the UUUU model. The estimate of $d$ is
\[
\hat{d} = \sum_{k=1}^{K}\frac{\hat{\pi}_k}{p}\,\mathrm{tr}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}_{q_k}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\}.
\]
5. UCUU model: Here, we assume $\mathbf{T}_{q_k} = \mathbf{T}$ and no constraints on $\mathbf{B}_k$ and $\mathbf{D}_k$. The estimates of $\mathbf{B}_k$ and $\mathbf{D}_k$ are the same as for the UUUU model, and the estimate of $\mathbf{T}$ is
\[
\hat{\mathbf{T}} = \sum_{k=1}^{K}\hat{\pi}_k\,\mathrm{diag}(\boldsymbol{\theta}_k).
\]
6. UCUC model: Here, we assume $\mathbf{T}_{q_k} = \mathbf{T}$, $\mathbf{D}_k = d_k\mathbf{I}_p$, and no constraint on $\mathbf{B}_k$. The estimate of $\mathbf{B}_k$ is the same as for the UUUU model, and the estimates of $\mathbf{T}$ and $d_k$ are
\[
\hat{\mathbf{T}} = \sum_{k=1}^{K}\hat{\pi}_k\,\mathrm{diag}(\boldsymbol{\theta}_k), \qquad
\hat{d}_k = \frac{1}{p}\,\mathrm{tr}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}_k^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\}.
\]
7. UCCU model: Here, we assume $\mathbf{T}_{q_k} = \mathbf{T}$, $\mathbf{D}_k = \mathbf{D}$, and no constraint on $\mathbf{B}_k$. The estimate of $\mathbf{B}_k$ is the same as for the UUUU model, and the estimates of $\mathbf{T}$ and $\mathbf{D}$ are
\[
\hat{\mathbf{T}} = \sum_{k=1}^{K}\hat{\pi}_k\,\mathrm{diag}(\boldsymbol{\theta}_k), \qquad
\hat{\mathbf{D}} = \sum_{k=1}^{K}\hat{\pi}_k\,\mathrm{diag}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\}.
\]
8. UCCC model: Here, we assume $\mathbf{T}_{q_k} = \mathbf{T}$, $\mathbf{D}_k = d\mathbf{I}_p$, and no constraint on $\mathbf{B}_k$. The estimate of $\mathbf{B}_k$ is the same as for the UUUU model, and the estimates of $\mathbf{T}$ and $d$ are
\[
\hat{\mathbf{T}} = \sum_{k=1}^{K}\hat{\pi}_k\,\mathrm{diag}(\boldsymbol{\theta}_k), \qquad
\hat{d} = \sum_{k=1}^{K}\frac{\hat{\pi}_k}{p}\,\mathrm{tr}\left\{\mathbf{S}_k - \hat{\mathbf{B}}_k\hat{\mathbf{T}}\hat{\mathbf{B}}_k^T\big(\hat{\mathbf{B}}_k\hat{\mathbf{T}}\hat{\mathbf{B}}_k^T + \hat{\mathbf{D}}^{(t)}\big)^{-1}\mathbf{S}_k + \hat{\mathbf{B}}_k\boldsymbol{\theta}_k\hat{\mathbf{B}}_k^T\right\}.
\]
9. CUUU model: Here, we assume $\mathbf{B}_k = \mathbf{B}$ and no constraints on $\mathbf{T}_{q_k}$ and $\mathbf{D}_k$. The parameter estimates for $\mathbf{T}_{q_k}$ and $\mathbf{D}_k$ are exactly the same as for the UUUU model. For the estimation of $\mathbf{B}$, we define $Q^*$ as
\[
Q^* = \sum_{k=1}^{K} n_k\left(\mathrm{tr}\left\{\mathbf{D}_k^{-1}\mathbf{B}\mathbf{T}_{q_k}\mathbf{B}^T\big(\mathbf{B}\mathbf{T}_{q_k}\mathbf{B}^T + \mathbf{D}_k\big)^{-1}\mathbf{S}_k\right\} - \mathrm{tr}\left\{\mathbf{B}^T\mathbf{D}_k^{-1}\mathbf{B}\boldsymbol{\theta}_k\right\}\right).
\]
When estimating $\mathbf{B}$, we choose $\mathbf{B}[i, j] = 1$ for the value of $j$ that maximizes $Q^*$, under the constraint $\sum_{j=1}^{q_k}\mathbf{B}[i, j] = 1$.
10. CUUC model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{D}_k = d_k\mathbf{I}_p$, and no constraint on $\mathbf{T}_{q_k}$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $d_k$ and $\mathbf{T}_{q_k}$ is the same as for the UUUC model.
11. CUCU model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{D}_k = \mathbf{D}$, and no constraint on $\mathbf{T}_{q_k}$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $\mathbf{D}$ and $\mathbf{T}_{q_k}$ is the same as for the UUCU model.
12. CUCC model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{D}_k = d\mathbf{I}_p$, and no constraint on $\mathbf{T}_{q_k}$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $d$ and $\mathbf{T}_{q_k}$ is the same as for the UUCC model.
13. CCUU model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{T}_{q_k} = \mathbf{T}$, and no constraint on $\mathbf{D}_k$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $\mathbf{D}_k$ and $\mathbf{T}$ is the same as for the UCUU model.
14. CCUC model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{T}_{q_k} = \mathbf{T}$, and $\mathbf{D}_k = d_k\mathbf{I}_p$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $d_k$ and $\mathbf{T}$ is the same as for the UCUC model.
15. CCCU model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{T}_{q_k} = \mathbf{T}$, and $\mathbf{D}_k = \mathbf{D}$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $\mathbf{D}$ and $\mathbf{T}$ is the same as for the UCCU model.
16. CCCC model: Here, we assume $\mathbf{B}_k = \mathbf{B}$, $\mathbf{T}_{q_k} = \mathbf{T}$, and $\mathbf{D}_k = d\mathbf{I}_p$. The estimation of $\mathbf{B}$ is exactly the same as for the CUUU model, and the estimation of $d$ and $\mathbf{T}$ is the same as for the UCCC model.

Appendix B: Special cases

Here, we show that the family of models proposed by Martella et al. (2008) and Wong et al. (2017) can be obtained as special cases of our proposed models.
Models by Martella et al. (2008)
If we impose the constraint $\mathbf{T}_{q_k} = \mathbf{I}$, the four models proposed by Martella et al. (2008) can be obtained as follows:

Table 5: Models by Martella et al. (2008) and the equivalent constraints for the proposed models ($\mathbf{T}_{q_k} = \mathbf{I}$ throughout).

Model | B_k = B | T_{q_k} | D_k = D | D_k = d_k I | Total number of parameters
UU    |   U   |   I   |   U   |   U   | 3pK + K − 1
UC    |   U   |   I   |   C   |   U   | 2pK + p + K − 1
CU    |   C   |   I   |   U   |   U   | 2pK + p + K − 1
CC    |   C   |   I   |   C   |   U   | pK + 2p + K − 1

Models by Wong et al. (2017)
If we further allow the constraint $\mathbf{D}_k = d_k\mathbf{I}_p$ and impose $q_1 = q_2 = \ldots = q_K = q$, then we obtain the eight models by Wong et al. (2017).

Table 6: Models by Wong et al. (2017) and the equivalent constraints for the proposed models ($\mathbf{T}_{q_k} = \mathbf{I}$ and $q_1 = \ldots = q_K = q$ throughout).

Model | B_k = B | T_{q_k} | D_k = D | D_k = d_k I | Total number of parameters
UUU   |   U   |   I   |   U   |   U   | 3pK + K − 1
UUC   |   U   |   I   |   U   |   C   | 2pK + 2K − 1
UCU   |   U   |   I   |   C   |   U   | 2pK + p + K − 1
UCC   |   U   |   I   |   C   |   C   | 2pK + K
CUU   |   C   |   I   |   U   |   U   | 2pK + p + K − 1
CUC   |   C   |   I   |   U   |   C   | pK + p + 2K − 1
CCU   |   C   |   I   |   C   |   U   | pK + 2p + K − 1
CCC   |   C   |   I   |   C   |   C   | pK + p + K