Identifying Interpretable Discrete Latent Structures from Discrete Data
Yuqi Gu∗ and David B. Dunson†
Department of Statistical Science, Duke University
Abstract
High-dimensional categorical data are routinely collected in biomedical and social sciences. It is of great importance to build interpretable models that perform dimension reduction and uncover meaningful latent structures from such discrete data. Identifiability is a fundamental requirement for valid modeling and inference in such scenarios, yet is challenging to address when there are complex latent structures. In this article, we propose a class of interpretable discrete latent structure models for discrete data and develop a general identifiability theory. Our theory is applicable to various types of latent structures, ranging from a single latent variable to deep layers of latent variables organized in a sparse graph (termed a Bayesian pyramid). The proposed identifiability conditions can ensure Bayesian posterior consistency under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. Simulation results for this model corroborate identifiability and estimability of the model parameters. Applications of the methodology to DNA nucleotide sequence data uncover discrete latent features that are both interpretable and highly predictive of sequence types. The proposed framework provides a recipe for interpretable unsupervised learning of discrete data, and can be a useful alternative to popular machine learning methods.
Keywords:
Bayesian inference, deep belief networks, identifiability, interpretable machine learning, latent class, multivariate categorical data, tensor decomposition.

∗ Email: [email protected]. Department of Statistical Science, Duke University, Durham, NC 27708.
† Email: [email protected].

1 Introduction
High-dimensional unordered categorical data are ubiquitous in many scientific disciplines, including the DNA nucleotides A, G, C, T in genetics (Pokholok et al., 2005; Nguyen et al., 2016), occurrences of various species in ecological studies of biodiversity (Ovaskainen et al., 2016; Ovaskainen and Abrego, 2020), responses from psychological and educational assessments or social science surveys (Eysenck et al., 2020; Skinner, 2019), and document data gathered from huge text corpora or publications (Blei et al., 2003; Erosheva et al., 2004). Modeling and extracting information from multivariate discrete data require different statistical methods and theoretical understanding from those for continuous data. In an unsupervised setting, it is an important task to uncover meaningful latent patterns from the potentially high-dimensional and heterogeneous discrete observations. Ideally, the inferred lower-dimensional latent representations should not only provide scientific insights on their own, but also aid downstream statistical analyses through effective dimension reduction.

Recently, there has been a surge of interest in interpretable machine learning; see Doshi-Velez and Kim (2017), Rudin (2019), and Murdoch et al. (2019), among many others. However, latent variable approaches have received limited attention in this emerging literature, likely due to the associated complexities. Indeed, deep learning models with many layers of latent variables are usually considered as uninterpretable black boxes. For example, the deep belief network (Hinton et al., 2006; Lee et al., 2009) is a very popular deep learning architecture, but its inferred latent structure cannot be reliably interpreted. However, for high-dimensional data, it is highly desirable to perform dimension reduction to extract the key signals in the form of lower-dimensional latent representations. If the latent representations are themselves reliable, then they can be viewed as surrogate features of the data and passed along to existing interpretable machine learning methods for downstream tasks. A key to success in such modeling and analysis processes is the interpretability of the latent structure. This in turn relies crucially on the identifiability of the statistical latent variable model being used.

For linear factor models with continuous latent variables, research such as Anandkumar et al. (2013), Leung et al. (2016), and Chen et al. (2020b) studied model identifiability. Discrete latent variable models are an appealing alternative in terms of their combination of interpretability and flexibility. Finite mixture models (McLachlan and Basford, 1988), routinely used for model-based clustering, provide a canonical example involving a single categorical latent variable. Such approaches are insufficiently flexible for complex data sets. Extensions with multiple latent variables and/or multilayer structure have distinct advantages in such settings, but come with the challenge of how to uniquely identify the potentially complex latent structure.

There is a rich literature on latent variable models for multivariate discrete data; refer, for example, to Dunson and Xing (2009), Dobra and Lenkoski (2011), and Zhou et al. (2015). However, the focus has been largely on using latent variables to induce a flexible model for observables, and issues of identifiability of the latent structure have rarely been investigated. Although Allman et al.
(2009) established an important identifiability result for certain latent variable models leveraging results in Kruskal (1977), their result does not apply to complex latent structures with parameter constraints. Recently, Xu (2017) and Gu and Xu (2020) studied identifiability of a family of binary latent variable models with binary responses under constraints, but their results again do not apply to modeling of broader discrete data with discrete latent structure.

In this paper, we develop a general theory for identifiability of discrete latent structure underlying high-dimensional discrete data. The identifiability results are applicable to various types of latent structures, ranging from a single latent variable to multiple layers of dependent latent variables. In particular, we establish identifiability in cases where multiple layers of discrete latent variables form a "pyramid"-structured Bayesian network with observed variables at the bottom. Various sparse connecting patterns are allowed between layers, and our identifiability conditions are imposed on these connecting graph structures. The identifiability of this modeling framework provides a theoretical basis for learning potentially deep and interpretable latent structure from high-dimensional discrete data. We show that some popular machine learning models, such as deep belief networks and latent tree models (Choi et al., 2011), can be viewed as special cases of the proposed class.

A nice consequence of the identifiability results is the posterior consistency of Bayesian procedures under suitable priors. As an illustration, we consider the two-latent-layer model and propose a Bayesian shrinkage estimation approach. We develop a Gibbs sampler with data augmentation for computation. Simulation studies corroborate identifiability and show good performance of Bayes estimators. We apply the proposed method to DNA nucleotide sequence data from the UCI machine learning repository. For the splice junction data, when using latent representations learned from our two-latent-layer model in downstream classification of nucleotide sequence types, we achieve an accuracy comparable to fine-tuned convolutional neural networks (i.e., in Nguyen et al., 2016). This suggests that the developed recipe of unsupervised learning of discrete data may serve as a useful alternative to popular machine learning methods.

The rest of this paper is organized as follows. Section 2 proposes a general constrained probabilistic tensor factorization framework, connects it to an interpretable model with one discrete latent variable, and establishes strict identifiability. Section 3 generalizes the model to multiple discrete latent variables in multiple layers, and proposes transparent identifiability conditions on the graphs between layers. Then, to illustrate the proposed framework, Section 4 focuses on a two-latent-layer model and proposes a Bayesian estimation approach. Section 5 provides simulation studies that examine the performance of the proposed methodology and corroborate the identifiability theory. Section 6 applies the method to real data of nucleotide sequences. Finally, Section 7 discusses implications and future directions. Technical proofs of key theoretical results are presented in the Appendix. Additional proofs, posterior computation details, and additional data analysis are included in the Supplementary Material.
2 Constrained Latent Class Models

2.1 Model Setup

We focus on modeling the joint distribution of multivariate unordered categorical data. For an integer $m$, denote $[m] = \{1, 2, \ldots, m\}$. Suppose for each subject there are $p$ observed variables $\mathbf{y} = (y_1, \ldots, y_p)^\top$, where $y_j \in [d_j]$ for each variable $j \in [p]$. The traditional latent class model (LCM) (Lazarsfeld, 1950; Goodman, 1974) posits the following joint distribution of $\mathbf{y}$:
$$\mathbb{P}(y_1 = c_1, \ldots, y_p = c_p) = \sum_{h=1}^{k} \nu_h \prod_{j=1}^{p} \lambda^{(j)}_{c_j, h}, \quad \forall\, c_j \in [d_j],\ j \in [p]. \tag{1}$$
The above equation specifies a finite mixture model with one discrete latent variable having $k$ categories (i.e., $k$ latent classes). In particular, given a latent variable $z \in [k]$, $\nu_h = \mathbb{P}(z = h)$ denotes the probability of $z$ belonging to the $h$th latent class, and $\lambda^{(j)}_{c_j, h} = \mathbb{P}(y_j = c_j \mid z = h)$ denotes the conditional probability of response $c_j$ for variable $y_j$ given latent class membership $h$. Expression (1) implies the $p$ observed variables $y_1, \ldots, y_p$ are conditionally independent given the latent $z$. In the Bayesian literature, the model in Dunson and Xing (2009) is a nonparametric generalization of (1), which allows an infinite number of latent classes.

Latent class modeling under (1) is widely used in the social and biomedical sciences, where researchers often hope to infer subpopulations of individuals having different profiles (Collins and Lanza, 2009). However, the overly simplistic form of (1) can lead to poor performance in inferring distinct and interpretable subpopulations. In particular, the model assumes that individuals in different subpopulations have completely different probabilities $\lambda^{(j)}_{c_j, h}$ for all $c_j \in [d_j]$ and $j \in [p]$, and conditionally on subpopulation membership all the variables are independent. These restrictions can force the number of classes to increase in order to provide an adequate fit to the data. This can degrade interpretability, and moreover the linear growth in the number of parameters in (1) with $k$ can lead to large uncertainty in parameter estimation.

We introduce some notation before proceeding. Denote a $p \times k$ matrix whose entries all equal one by $\mathbf{1}_{p \times k}$. For a matrix $S$ with $p$ rows and a set $A \subseteq [p]$, denote by $S_{A,:}$ the submatrix of $S$ consisting of the rows indexed by $A$.

We propose a family of constrained latent class models, which enable learning of a potentially large number of interpretable, identifiable, and diverse latent classes. A key idea is sharing of parameters within certain latent classes for each observed variable. We introduce a binary constraint matrix $S = (S_{j,h})$ of size $p \times k$, which has rows indexed by the $p$ observed variables and columns indexed by the $k$ latent classes. The binary entry $S_{j,h}$ indicates whether the conditional probability table $\lambda^{(j)}_{1:d_j, h} = (\lambda^{(j)}_{1,h}, \ldots, \lambda^{(j)}_{d_j,h})$ is free or instead equal to some unknown baseline. Specifically, if $S_{j,h} = 1$ then $\lambda^{(j)}_{1:d_j, h}$ is free, while for those latent classes $h \in [k]$ with $S_{j,h} = 0$, the conditional probability tables $\lambda^{(j)}_{1:d_j, h}$ are constrained to be equal. Hence, $S$ puts the following equality constraints on the parameters of (1):
$$\text{if } S_{j,h_1} = S_{j,h_2} = 0, \ \text{ then } \ \lambda^{(j)}_{1:d_j,\, h_1} = \lambda^{(j)}_{1:d_j,\, h_2}. \tag{2}$$
We also enforce a natural inequality constraint for identifiability:
$$\text{if } S_{j,h_1} \neq S_{j,h_2}, \ \text{ then } \ \lambda^{(j)}_{c_j, h_1} \neq \lambda^{(j)}_{c_j, h_2} \quad \forall\, c_j \in [d_j]. \tag{3}$$
If $S = \mathbf{1}_{p \times k}$ then there are no active constraints and the original latent class model (1) is recovered.
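To make the construction concrete, the following minimal sketch (Python/NumPy; ours, not from the paper, with toy dimensions and random Dirichlet draws as placeholder parameter values) builds the joint probability tensor in (1) under the equality constraint (2), tying every conditional probability column with $S_{j,h} = 0$ to a shared baseline column.

```python
import numpy as np

rng = np.random.default_rng(1)
p, k = 3, 4
d = [2, 3, 2]                        # d_j categories for each observed variable
S = np.array([[1, 0, 1, 0],
              [0, 1, 1, 0],
              [1, 1, 0, 0]])         # p x k binary constraint matrix

nu = rng.dirichlet(np.ones(k))       # latent class proportions nu_h

# Conditional probability tables Lambda^{(j)}: columns h with S[j, h] = 0
# share one baseline column, which implements the equality constraint (2).
Lambda = []
for j in range(p):
    baseline = rng.dirichlet(np.ones(d[j]))
    cols = [rng.dirichlet(np.ones(d[j])) if S[j, h] else baseline
            for h in range(k)]
    Lambda.append(np.column_stack(cols))

# The p-way probability tensor Pi from the CP-style mixture (1).
Pi = np.zeros(d)
for h in range(k):
    outer = np.array(nu[h])
    for j in range(p):
        outer = np.multiply.outer(outer, Lambda[j][:, h])
    Pi += outer
print(Pi.shape, Pi.sum())            # (2, 3, 2); sums to 1
```

Each entry of `Pi` is one $\pi_{c_1 \cdots c_p}$; the tensor sums to one because $\boldsymbol{\nu}$ and every column of each $\Lambda^{(j)}$ are probability vectors.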
We generally denote the conditional probability parameters by $\Lambda = \{\lambda^{(j)}_{c_j,h}:\ j \in [p],\ c_j \in [d_j],\ h \in [k]\}$, where $\lambda^{(j)}_{c_j,h} := \lambda^{(j)}_{c_j,0}$, the baseline, whenever $S_{j,h} = 0$. Our motivation for imposing the equality constraint (2) is to learn interpretable latent structure that can match insights of domain experts or even generate new insights.

Viewed from a different perspective, a latent class model (1) specifies a decomposition of the $p$-way probability tensor $\mathbf{\Pi} = (\pi_{c_1 \cdots c_p})$, where $\pi_{c_1 \cdots c_p} = \mathbb{P}(y_1 = c_1, \ldots, y_p = c_p \mid \Lambda, S, \boldsymbol{\nu})$. This corresponds to the PARAFAC/CANDECOMP (CP) decomposition in the tensor literature (Kolda and Bader, 2009), which can be used to factorize general real-valued tensors, while our focus is on non-negative probability tensors. The proposed equality constraint (2) induces a family of constrained CP decompositions. This family is connected with the sparse CP decomposition of Zhou et al. (2015), with both having equality constraints summarized by a $p \times k$ binary matrix. However, Zhou et al. (2015) encourage different observed variables to share parameters, while our proposed model encourages different latent classes to share parameters through (2). Constraint (2) is more flexible and suitable for learning interpretable and complex latent structures, especially when different observed variables represent distinct features so that parameter sharing among them may be restrictive.

In Section 3 we show that our proposed model (1)-(2) naturally lends itself to an interesting generalization to deep layers of latent variables organized in a sparse graph. We refer to this generalized family as the multilayer sparse graph LCM (SG-LCM), or a Bayesian pyramid. The SG-LCM is related to deep belief networks (e.g., Hinton et al., 2006; Lee et al., 2007, 2009), and also includes a popular family of psychometric models (Henson et al., 2009; von Davier and Lee, 2019). In the remainder of this section, we study theoretical properties of model (1)-(2).

2.2 Identifiability and Posterior Consistency

Although latent class models are appealing in terms of model flexibility, dimensionality reduction, and interpretability, identifiability issues need to be overcome to realize these advantages. The classic latent class model in (1) was shown by Gyllenberg et al. (1994) to not be strictly identifiable. Strict identifiability generally requires one to establish a parameterization in which the parameters can be expressed as one-to-one functions of observables. As a weaker notion, generic identifiability requires that the map is one-to-one except on a measure zero subset with respect to the Lebesgue measure. In a seminal paper, Allman et al. (2009) leveraged a theorem from Kruskal (1977) to show generic identifiability for (1). Unfortunately, constraint (2) makes it impossible to apply Allman et al. (2009)'s approach directly.

Below we establish strict identifiability of model (1)-(2) by carefully examining the algebraic structure imposed by the constraint matrix $S$. We first introduce some notation. Denote by $\otimes$ the Kronecker product of matrices and by $\odot$ the Khatri-Rao product of matrices. In particular, consider matrices $A = (a_{i,j}) \in \mathbb{R}^{m \times r}$ and $B = (b_{i,j}) \in \mathbb{R}^{s \times t}$, and matrices $C = (c_{i,j}) = (\mathbf{c}_{:,1} \mid \cdots \mid \mathbf{c}_{:,k}) \in \mathbb{R}^{n \times k}$ and $D = (d_{i,j}) = (\mathbf{d}_{:,1} \mid \cdots \mid \mathbf{d}_{:,k}) \in \mathbb{R}^{\ell \times k}$; then $A \otimes B \in \mathbb{R}^{ms \times rt}$ and $C \odot D \in \mathbb{R}^{n\ell \times k}$ with
$$A \otimes B = \begin{pmatrix} a_{1,1} B & \cdots & a_{1,r} B \\ \vdots & \ddots & \vdots \\ a_{m,1} B & \cdots & a_{m,r} B \end{pmatrix}, \qquad C \odot D = \left( \mathbf{c}_{:,1} \otimes \mathbf{d}_{:,1} \mid \cdots \mid \mathbf{c}_{:,k} \otimes \mathbf{d}_{:,k} \right).$$
The above definitions show that the Khatri-Rao product is a column-wise Kronecker product; see Kolda and Bader (2009) for more details. We first establish the following technical proposition, which is useful for the later theorem on identifiability.
Proposition 1.
For the constrained latent class model, define the following $p$ parameter matrices subject to constraints (2) and (3) with some constraint matrix $S$:
$$\Lambda^{(j)} = \begin{pmatrix} \lambda^{(j)}_{1,1} & \lambda^{(j)}_{1,2} & \cdots & \lambda^{(j)}_{1,k} \\ \vdots & \vdots & \ddots & \vdots \\ \lambda^{(j)}_{d_j,1} & \lambda^{(j)}_{d_j,2} & \cdots & \lambda^{(j)}_{d_j,k} \end{pmatrix}, \quad j = 1, \ldots, p.$$
Denote the Khatri-Rao product of the above $p$ matrices by $K = \odot_{j=1}^{p} \Lambda^{(j)}$, which has size $\prod_{j=1}^{p} d_j \times k$. The following two conclusions hold.

(a) If the $k$ column vectors of the constraint matrix $S$ are distinct, then $K$ must have full column rank $k$.

(b) If $S$ contains identical column vectors, then $K$ can be rank-deficient.
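The rank dichotomy in Proposition 1 can be checked numerically. The following sketch (ours, not the paper's code; the dimensions and Dirichlet draws are arbitrary toy choices) builds tables respecting constraint (2), forms $K = \odot_{j=1}^{p} \Lambda^{(j)}$, and compares ranks in the two cases; the rank-deficient case additionally imposes the extra equality condition of Remark 1 below.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_cpt(d, k, s_row):
    """Random d x k conditional probability table respecting (2):
    all columns h with s_row[h] == 0 share one baseline column."""
    baseline = rng.dirichlet(np.ones(d))
    cols = [rng.dirichlet(np.ones(d)) if s else baseline for s in s_row]
    return np.column_stack(cols)

def khatri_rao(mats):
    """Column-wise Kronecker (Khatri-Rao) product of matrices with k columns."""
    K = mats[0]
    for M in mats[1:]:
        K = np.vstack([np.kron(K[:, h], M[:, h]) for h in range(K.shape[1])]).T
    return K

p, d, k = 3, 2, 4
S = np.array([[0, 1, 0, 1],
              [0, 0, 1, 1],
              [1, 0, 0, 1]])                        # distinct columns
K1 = khatri_rao([random_cpt(d, k, S[j]) for j in range(p)])
print(np.linalg.matrix_rank(K1))                    # full column rank k = 4, part (a)

S[:, 1] = S[:, 0]                                   # columns 0 and 1 now identical
mats = [random_cpt(d, k, S[j]) for j in range(p)]
mats = [np.column_stack([M[:, 0], M[:, 0], M[:, 2], M[:, 3]]) for M in mats]
print(np.linalg.matrix_rank(khatri_rao(mats)))      # rank < 4, part (b)
```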
Remark 1. Proposition 1 implies that $S$ having distinct columns is sufficient (in part (a)) and almost necessary (in part (b)) for the Khatri-Rao product $K = \odot_{j=1}^p \Lambda^{(j)}$ to be full rank. To see the "almost necessary" part, consider a special case where, besides constraint (2) that $\lambda^{(j)}_{1:d_j, h_1} = \lambda^{(j)}_{1:d_j, h_2}$ if $S_{j,h_1} = S_{j,h_2} = 0$, the parameters also satisfy $\lambda^{(j)}_{1:d_j, h_1} = \lambda^{(j)}_{1:d_j, h_2}$ if $S_{j,h_1} = S_{j,h_2} = 1$. In this case, our proof shows that whenever the binary matrix $S$ contains identical column vectors in columns $h_1$ and $h_2$, the matrix $K$ also contains identical column vectors in columns $h_1$ and $h_2$ and hence is surely rank-deficient.

In the Khatri-Rao product matrix $K$ defined in Proposition 1, each column characterizes the conditional distribution of the vector $\mathbf{y}$ given a particular latent class. Therefore, Proposition 1 reveals an interesting and nontrivial algebraic structure: whether $S \in \{0,1\}^{p \times k}$ has distinct column vectors is linked to whether the $k$ conditional distributions of $\mathbf{y}$ given each latent class are linearly independent. The matrix $S$ does not need to have full column rank in order to have distinct column vectors. For example, a $3 \times 3$ matrix $S$ with the three columns $(1, 0, 0)^\top$, $(0, 1, 0)^\top$, and $(1, 1, 0)^\top$ is rank-deficient but has distinct column vectors. Indeed, it is not hard to see that a binary matrix $S$ with $m$ rows can have as many as $2^m$ distinct column vectors.

Although the linear independence of $K$'s columns does not itself lead to identifiability, this nice structure uncovered in Proposition 1 provides a basis for investigating strict identifiability of our model. We introduce the definition of strict identifiability under the current model setup and then give the strict identifiability result.

Definition 1 (Strict Identifiability). The constrained latent class model with (2) and (3) is said to be strictly identifiable if, for any alternative valid parameters $(\bar{\Lambda}, \bar{S}, \bar{\boldsymbol{\nu}})$, the following equality holds if and only if $(\Lambda, S, \boldsymbol{\nu})$ and $(\bar{\Lambda}, \bar{S}, \bar{\boldsymbol{\nu}})$ are identical up to a latent class permutation:
$$\mathbb{P}(\mathbf{y} = \mathbf{c} \mid \Lambda, S, \boldsymbol{\nu}) = \mathbb{P}(\mathbf{y} = \mathbf{c} \mid \bar{\Lambda}, \bar{S}, \bar{\boldsymbol{\nu}}), \quad \forall\, \mathbf{c} \in \times_{j=1}^{p} [d_j]. \tag{4}$$

For an arbitrary subset $A \subseteq [p]$, denote by $S_{A,:}$ the submatrix of $S$ that consists of those rows indexed by variables belonging to $A$; the matrix $S_{A,:}$ has size $|A| \times k$.

Theorem 1 (Strict Identifiability). Consider the proposed constrained latent class model under (2) and (3) with true parameters $\boldsymbol{\nu} = \{\nu_h\}_{h \in [k]}$, $\Lambda = \{\Lambda^{(j)}\}_{j \in [p]}$, and $S$. Suppose there exists a partition of the $p$ variables $[p] = A_1 \cup A_2 \cup A_3$ such that

(a) the submatrix $S_{A_i,:}$ has distinct column vectors for $i = 1$ and $2$; and

(b) for any $h_1 \neq h_2 \in [k]$, there is $\lambda^{(j)}_{c,h_1} \neq \lambda^{(j)}_{c,h_2}$ for some $j \in A_3$ and some $c \in [d_j]$.

Also suppose $\nu_h > 0$ for each $h \in [k]$. Then $(\Lambda, S, \boldsymbol{\nu})$ are strictly identifiable up to a latent class permutation.

In the above theorem, the constraint matrix $S$ is not assumed to be fixed and known. This implies that both the matrix $S$ and the parameters can be uniquely identified from the data. We next give a corollary of Theorem 1, which replaces the condition on the parameters $\Lambda$ in part (b) with a slightly more stringent but also more transparent condition on the binary matrix $S$.
Corollary 1. Consider the proposed constrained latent class model under (2) and (3). Suppose $\nu_h > 0$ for each $h \in [k]$. If there is a partition of the $p$ variables $[p] = A_1 \cup A_2 \cup A_3$ such that each submatrix $S_{A_i,:}$ has distinct column vectors for $i = 1, 2, 3$, then the parameters $(\boldsymbol{\nu}, \Lambda, S)$ are strictly identifiable up to a latent class permutation.

The conditions of Corollary 1 are more easily checkable than those in Theorem 1 because they only depend on the structure of the constraint matrix $S$. It requires that, after some row rearrangement, the matrix $S$ vertically stacks three submatrices, each of which has distinct column vectors.

The conclusions of Theorem 1 and Corollary 1 both concern strict identifiability, which is the strongest possible conclusion on parameter identifiability up to label permutation. If we consider the slightly weaker notion of generic identifiability as proposed in Allman et al. (2009), the conditions in Theorem 1 and Corollary 1 can be relaxed. Given a constraint matrix $S$, denote the constrained parameter space for $(\boldsymbol{\nu}, \Lambda)$ by
$$\mathcal{T}_S = \{(\boldsymbol{\nu}, \Lambda):\ \Lambda \text{ satisfies the constraints specified by } S\}; \tag{5}$$
and define the following subset of $\mathcal{T}_S$:
$$\mathcal{N}_S = \big\{(\boldsymbol{\nu}, \Lambda) \in \mathcal{T}_S:\ \exists\, (\bar{\boldsymbol{\nu}}, \bar{\Lambda}), \text{ not identical to } (\boldsymbol{\nu}, \Lambda) \text{ up to a latent class permutation, satisfying the constraints specified by some } \bar{S} \text{ such that } \mathbb{P}(\mathbf{y} \mid \boldsymbol{\nu}, \Lambda) = \mathbb{P}(\mathbf{y} \mid \bar{\boldsymbol{\nu}}, \bar{\Lambda})\big\}. \tag{6}$$
With the above notation, the generic identifiability of the proposed constrained latent class model is defined as follows.
Definition 2 (Generic Identifiability). Parameters $(\Lambda, S, \boldsymbol{\nu})$ are said to be generically identifiable if $\mathcal{N}_S$ defined in (6) has measure zero with respect to the Lebesgue measure on $\mathcal{T}_S$ defined in (5).

Theorem 2 (Generic Identifiability). Consider the constrained latent class model under (2) and (3). Suppose for $i = 1, 2$, changing some entries of $S_{A_i,:}$ from "1" to "0" yields an $\widetilde{S}_{A_i,:}$ having distinct columns. Also suppose for any $h_1 \neq h_2 \in [k]$, there is $\lambda^{(j)}_{c,h_1} \neq \lambda^{(j)}_{c,h_2}$ for some $j \in A_3$ and some $c \in [d_j]$. Then $(\Lambda, S, \boldsymbol{\nu})$ are generically identifiable.
Remark 2. Note that altering some $S_{j,h}$ from one to zero corresponds to adding one more equality constraint, namely that the distribution of the $j$th variable given the $h$th latent class is set to the baseline through (2). Therefore, Theorem 2 intuitively implies that if enforcing more parameters in $\mathcal{T}_S$ to be equal can give rise to a strictly identifiable model, then the parameters that make the original model unidentifiable only occupy a negligible subset of $\mathcal{T}_S$.

Theorem 2 relaxes the conditions on $S$ for strict identifiability presented earlier. In particular, here the submatrices $S_{A_i,:}$ need not have distinct column vectors; rather, it suffices that altering some entries of $S_{A_i,:}$ from one to zero yields distinct column vectors. As pointed out by Allman et al. (2009), generic identifiability is often sufficient for real data analyses.

So far, we have focused on discussing model identifiability. Next, we show that our identifiability results guarantee Bayesian posterior consistency under suitable priors. Given a sample of size $n$, denote the observations by $\mathbf{y}_1, \ldots, \mathbf{y}_n$, which are $n$ vectors each of dimension $p$. Recall that under (1), the distribution of the vector $\mathbf{y}$ under the considered model can be represented by a $p$-way probability tensor $\mathbf{\Pi} = (\pi_{c_1 \cdots c_p})$. When adopting a Bayesian approach, one can specify prior distributions for the parameters $\Lambda$, $S$, and $\boldsymbol{\nu}$, which induce a prior distribution for the probability tensor $\mathbf{\Pi}$. Within this context, we are now ready to state the following theorem.

Theorem 3 (Posterior Consistency). Denote the collection of model parameters by
$\Theta = (\Lambda, S, \boldsymbol{\nu})$, with true values $\Theta_0 = (\Lambda_0, S_0, \boldsymbol{\nu}_0)$. Suppose the prior distributions for the parameters $\Lambda$, $S$, and $\boldsymbol{\nu}$ all have full support around the true values. If the true latent structure $S_0$ and model parameters $\Lambda_0$ satisfy the proposed strict identifiability conditions in Theorem 1 or Corollary 1, we have
$$\mathbb{P}\left(\Theta \in \mathcal{N}^c_{\epsilon}(\Theta_0) \mid \mathbf{y}_1, \ldots, \mathbf{y}_n\right) \to 0 \quad \text{almost surely},$$
where $\mathcal{N}^c_{\epsilon}(\Theta_0)$ is the complement of an $\epsilon$-neighborhood of the true parameters $\Theta_0$ in the parameter space.

Theorem 3 implies that, under an identifiable model and with appropriately specified priors, the posterior distribution places increasing probability in arbitrarily small neighborhoods of the true parameters of the constrained latent class model as the sample size increases. These parameters include the mixture proportions and the class-specific conditional probabilities. The ability to reliably estimate the parameters and sparsity structure is key to interpreting the latent class structure.
3 Multilayer Latent Structures: Bayesian Pyramids

The proposed model with equality constraint (2) can accommodate much more complex latent structure than characterized by a single latent variable. We next show an interesting generalization to potentially deep latent structures. We focus mainly on multiple layers of binary latent variables, motivated by their simpler interpretability, with each variable encoding the presence or absence of a latent trait. The proposed model is closely related to deep belief networks (Hinton et al., 2006), which are also probabilistic models with multiple layers of binary latent variables, but our proposed model is more general in terms of distributional assumptions. In the following, we first describe our proposed multilayer sparse graph LCM (SG-LCM) in Section 3.1, and then connect it to our previous constrained latent class model and establish model identifiability in Section 3.2.

3.1 Multilayer Sparse Graph Latent Class Models
We use a Bayesian network consisting of multiple layers of discrete latent variables to model the joint distribution of high-dimensional unordered categorical data. Bayesian networks (Pearl, 2014) are directed acyclic graphs of random variables that can encode rich conditional independence information. We propose a "pyramid"-like Bayesian network with one latent variable at the root and more and more latent variables in downward layers, where the bottom layer consists of the $p$ observed variables $y_1, \ldots, y_p$. Denote the number of latent layers in this Bayesian network by $D$, which can be viewed as the depth. Specifically, let the layer of latent variables consecutive to the observed $\mathbf{y}$ be $\boldsymbol{\alpha}^{(1)} = (\alpha_1, \ldots, \alpha_{K_1}) \in \{0,1\}^{K_1}$ with $K_1$ variables, and let the deeper layer of latent variables consecutive to $\boldsymbol{\alpha}^{(m)} \in \{0,1\}^{K_m}$ be $\boldsymbol{\alpha}^{(m+1)} \in \{0,1\}^{K_{m+1}}$ for $m = 1, 2, \ldots, D - 1$.
Finally, at the top and deepest layer of the pyramid, we specify a single discrete latent variable $z$, or equivalently $z^{(D)}$, which has two or more categories. In this Bayesian network, all the directed edges point in the top-down direction only between consecutive layers, and there are no edges between variables within a layer. This gives the following factorization of the distribution of $\mathbf{y}$:
$$\mathbb{P}(\mathbf{y}, \{\boldsymbol{\alpha}^{(d)}\}, z^{(D)}) = \mathbb{P}(\mathbf{y} \mid \boldsymbol{\alpha}^{(1)}) \prod_{m=1}^{D-2} \mathbb{P}(\boldsymbol{\alpha}^{(m)} \mid \boldsymbol{\alpha}^{(m+1)})\, \mathbb{P}(\boldsymbol{\alpha}^{(D-1)} \mid z^{(D)})\, \mathbb{P}(z^{(D)}); \tag{7}$$
$$\mathbb{P}(\mathbf{y} \mid \boldsymbol{\alpha}^{(1)}) = \prod_{j=1}^{p} \mathbb{P}\Big(y_j \,\Big|\, \boldsymbol{\alpha}^{(1)}_{\mathrm{pa}(j)}\Big), \qquad \mathbb{P}(\boldsymbol{\alpha}^{(m)} \mid \boldsymbol{\alpha}^{(m+1)}) = \prod_{k=1}^{K_m} \mathbb{P}\Big(\alpha^{(m)}_k \,\Big|\, \boldsymbol{\alpha}^{(m+1)}_{\mathrm{pa}(\alpha^{(m)}_k)}\Big). \tag{8}$$
In the above display, for variable $j \in [p]$, $\boldsymbol{\alpha}^{(1)}_{\mathrm{pa}(j)}$ is the subvector of $\boldsymbol{\alpha}^{(1)}$ that only contains variable $j$'s parent variables in the first latent layer, that is, the latent variables $\alpha^{(1)}_k$ which have a directed edge to variable $y_j$. Similarly, $\boldsymbol{\alpha}^{(m+1)}_{\mathrm{pa}(\alpha^{(m)}_k)}$ is the subvector of $\boldsymbol{\alpha}^{(m+1)}$ that contains the parent variables of $\alpha^{(m)}_k$ in the $(m+1)$th latent layer. To make clear the sparsity structure of this graph, we introduce binary matrices $G^{(1)}, G^{(2)}, \ldots, G^{(D-1)}$, termed graphical matrices, to encode the connecting patterns between consecutive layers. That is, $G^{(d)}$ summarizes the parent-child graphical patterns between the two consecutive layers $d$ and $d+1$. Specifically, matrix $G^{(1)} = (g^{(1)}_{j,k})$ has size $p \times K_1$ and matrix $G^{(m)} = (g^{(m)}_{k,k'})$ has size $K_{m-1} \times K_m$ for $m = 2, \ldots, D - 1$, with entries
$$g^{(1)}_{j,k} = 1 \ \text{ if } \alpha^{(1)}_k \text{ is a parent of } y_j; \qquad g^{(m)}_{k,k'} = 1 \ \text{ if } \alpha^{(m)}_{k'} \text{ is a parent of } \alpha^{(m-1)}_k, \quad 2 \leq m \leq D - 1.$$
Fig. 1 gives a graphical visualization of the proposed model. Each variable in the graph is subject-specific; that is, if there are $n$ subjects in the sample, each of them has its own realizations of $\mathbf{y}$ and the $\boldsymbol{\alpha}^{(d)}$'s. The proposed directed acyclic graph is not necessarily a tree, as shown in Fig. 1. That is, each variable can have multiple parent variables in the layer above it, while in a directed tree each variable can have only one parent.

Figure 1: Multiple layers of binary latent traits $\boldsymbol{\alpha}^{(d)}$ model the distribution of the observed $\mathbf{y}$. Binary matrices $G^{(1)}, G^{(2)}, \ldots$ encode the sparse connection patterns between consecutive layers. Dotted arrows emanating from the root variable $z$ summarize omitted layers $\{\boldsymbol{\alpha}^{(d)}\}$.

Interestingly, we have the following proposition linking the proposed multilayer sparse graph LCM to the constrained latent class model under equality constraint (2). For two vectors $\mathbf{a}, \mathbf{b}$ of the same length $L$, write $\mathbf{a} \succeq \mathbf{b}$ if $a_i \geq b_i$ for all $i \in [L]$. Denote by $I(\cdot)$ the binary indicator function, which equals one if the statement inside is true and zero otherwise.
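Before stating the proposition, here is a minimal ancestral-sampling sketch of the generative process (7)-(8) (ours, not the paper's code): a toy depth-$D = 2$ pyramid whose conditionals are simple logistic models respecting the graphical matrices; the logistic form and all numbers are placeholders for whatever component models one chooses.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_layer(parents, G, beta0=-1.0, beta1=2.0):
    """Sample one binary layer given its parent layer: a placeholder logistic
    conditional in which only parents flagged in G enter the predictor."""
    logits = beta0 + beta1 * (G @ parents)
    return rng.binomial(1, 1.0 / (1.0 + np.exp(-logits)))

K2, K1, p = 2, 4, 12
eta = np.array([[0.9, 0.2],
                [0.1, 0.8]])                     # P(alpha2_k = 1 | z), K2 x 2
G2 = np.array([[1, 0], [1, 0], [0, 1], [0, 1]])  # K1 x K2 graphical matrix
G1 = np.vstack([np.eye(K1, dtype=int)] * 3)      # p x K1, three identity blocks

z = rng.integers(0, 2)                 # deepest (root) latent variable
alpha2 = rng.binomial(1, eta[:, z])    # second latent layer given z
alpha1 = sample_layer(alpha2, G2)      # first latent layer given alpha2
y = sample_layer(alpha1, G1)           # observed binary layer given alpha1
print(z, alpha2, alpha1, y)
```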
Proposition 2. Consider the multilayer sparse graph LCM with binary graphical matrices $G^{(1)}, G^{(2)}, \ldots, G^{(D-1)}$.

(a) Upon marginalizing out all the latent variables except $\boldsymbol{\alpha}^{(1)}$, the distribution of $\mathbf{y}$ is a constrained latent class model with $2^{K_1}$ latent classes, where each latent class is characterized by one configuration of the $K_1$-dimensional binary vector $\boldsymbol{\alpha}^{(1)} \in \{0,1\}^{K_1}$. The corresponding $p \times 2^{K_1}$ constraint matrix $S^{(1)}$ is determined by the bipartite graph structure between the $\boldsymbol{\alpha}^{(1)}$-layer and the $\mathbf{y}$-layer, with entries
$$S^{(1)}_{j, \boldsymbol{\alpha}^{(1)}} = 1 - I\Big(\alpha^{(1)}_k \geq G^{(1)}_{j,k} \text{ for all } k = 1, \ldots, K_1\Big) = 1 - I\Big(\boldsymbol{\alpha}^{(1)} \succeq G^{(1)}_{j,:}\Big), \quad j \in [p],\ \boldsymbol{\alpha}^{(1)} \in \{0,1\}^{K_1}. \tag{9}$$

(b) Further, considering the distribution of $\boldsymbol{\alpha}^{(m)} \in \{0,1\}^{K_m}$ and marginalizing out all the latent variables deeper than $\boldsymbol{\alpha}^{(m)}$ except $\boldsymbol{\alpha}^{(m+1)}$, the distribution of $\boldsymbol{\alpha}^{(m)}$ is also a constrained latent class model with $2^{K_{m+1}}$ latent classes, where each latent class is characterized by one configuration of the $K_{m+1}$-dimensional binary vector $\boldsymbol{\alpha}^{(m+1)} \in \{0,1\}^{K_{m+1}}$. Its corresponding constraint matrix $S^{(m+1)}$ is determined by the bipartite graph structure between the $m$th and the $(m+1)$th latent layers, with entries
$$S^{(m+1)}_{k, \boldsymbol{\alpha}^{(m+1)}} = 1 - I\Big(\boldsymbol{\alpha}^{(m+1)} \succeq G^{(m+1)}_{k,:}\Big), \quad k \in [K_m],\ \boldsymbol{\alpha}^{(m+1)} \in \{0,1\}^{K_{m+1}}. \tag{10}$$

We present a toy example to illustrate Proposition 2 and discuss its implications.
Example 1. Consider a multilayer sparse graph LCM with $p = 6$ and $K_1 = 3$, with the graph between $\boldsymbol{\alpha}^{(1)}$ and $\mathbf{y}$ displayed in Fig. 2. The $6 \times 3$ graphical matrix $G^{(1)}$ is presented below. Proposition 2(a) states that there is a $p \times 2^{K_1}$ constraint matrix $S^{(1)}$ taking the form
$$G^{(1)} = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}, \qquad S^{(1)} = \begin{pmatrix} 1 & 0 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 1 & 1 & 0 & 1 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \end{pmatrix},$$
where the columns of $S^{(1)}$ are indexed from left to right by the configurations $(000), (100), (010), (001), (110), (101), (011), (111)$ of $\boldsymbol{\alpha}^{(1)}$. Entries of $S^{(1)}$ are determined according to (9); for example, $S^{(1)}_{1,(000)} = 1 - I((000) \succeq G^{(1)}_{1,:}) = 1 - I((000) \succeq (100)) = 1$, and $S^{(1)}_{6,(111)} = 1 - I((111) \succeq G^{(1)}_{6,:}) = 1 - I((111) \succeq (011)) = 0$. Each column of the constraint matrix $S^{(1)}$ is indexed by a latent class characterized by a configuration of a $K_1$-dimensional binary vector. This implies that if only considering the first latent layer of variables $\boldsymbol{\alpha}^{(1)}$, all subjects are naturally divided into $2^{K_1}$ latent classes, each endowed with a binary pattern.

Proposition 2 gives a nice structural characterization of multilayer sparse graph LCMs. This characterization is achieved by relating the multilayer sparse graph to the constrained latent class model in a layer-wise manner. This proposition provides a basis for investigating identifiability of multilayer SG-LCMs, as presented in the following subsection.
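Before moving on, note that the mapping (9) from $G^{(1)}$ to $S^{(1)}$ is mechanical; the sketch below (ours) reproduces the entries of Example 1 by enumerating all configurations of $\boldsymbol{\alpha}^{(1)}$.

```python
import numpy as np
from itertools import product

def constraint_matrix(G):
    """Build S^{(1)} from G^{(1)} via (9): S[j, alpha] = 1 - I(alpha >= G[j, :]),
    with one column per binary configuration alpha in {0,1}^K."""
    p, K = G.shape
    alphas = list(product([0, 1], repeat=K))
    S = np.zeros((p, 2 ** K), dtype=int)
    for col, alpha in enumerate(alphas):
        S[:, col] = 1 - np.all(np.asarray(alpha) >= G, axis=1)
    return S, alphas

G1 = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1],
               [1, 1, 0], [1, 0, 1], [0, 1, 1]])   # the G^{(1)} of Example 1
S1, alphas = constraint_matrix(G1)
print(S1)  # the 6 x 8 matrix S^{(1)} above, up to reordering of the alpha columns
```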
3.2 Identifiability of Multilayer Sparse Graph LCMs

According to Proposition 2, for a multilayer sparse graph LCM with a $p \times K_1$ binary graphical matrix $G^{(1)}$ between the two bottom layers, one can follow (9) to construct a $p \times 2^{K_1}$ constraint matrix $S^{(1)}$, as illustrated in Example 1. We next provide transparent identifiability conditions that directly depend on the binary graphical matrices $G^{(m)}$. With the next theorem, one only needs to examine the structure of the between-layer connecting graphs to establish identifiability.
Theorem 4. Consider the multilayer latent variable model specified in (7)-(8). Suppose the numbers $K_1, \ldots, K_{D-1}$ are known. Suppose each binary graphical matrix $G^{(m)}$, of size $K_{m-1} \times K_m$ (size $p \times K_1$ if $m = 1$), takes the following form after some row permutation:
$$G^{(m)} = \begin{pmatrix} I_{K_m} \\ I_{K_m} \\ I_{K_m} \\ G^{(m)}_\star \end{pmatrix}, \quad m = 1, \ldots, D - 1, \tag{11}$$
where $G^{(m)}_\star$ denotes a submatrix of $G^{(m)}$ that can take an arbitrary form. Further suppose that the conditional distributions of the variables satisfy the inequality constraint in (3). Then the following parameters are identifiable up to a latent variable permutation within each layer: the probability distribution tensor $\boldsymbol{\tau}$ of the deepest latent variable $z^{(D)}$, the conditional probability table of each variable (observed and latent) given its parents, and the binary graphical matrices $\{G^{(m)};\ m = 1, \ldots, D-1\}$.
Remark 3. The proof of Theorem 4 provides a nice layer-wise argument for identifiability; that is, one can examine the structure of the Bayesian pyramid in the bottom-up direction. As long as, for some $\ell$, the matrices $G^{(1)}, \ldots, G^{(\ell)}$ take the form of (11), the parameters associated with the conditional distributions of $\mathbf{y}$ and $\boldsymbol{\alpha}^{(1)}, \ldots, \boldsymbol{\alpha}^{(\ell-1)}$ are identifiable, and the marginal distribution of $\boldsymbol{\alpha}^{(\ell)}$ is also identifiable.
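Condition (11) is purely combinatorial, so it can be verified mechanically for a candidate graphical matrix. A small sketch (ours, not from the paper): each latent variable needs at least three single-parent child rows, i.e., three rows of $G^{(m)}$ equal to the corresponding standard basis vector.

```python
import numpy as np

def contains_three_identities(G):
    """Check condition (11): after some row permutation, G stacks three
    copies of I_K on top of an arbitrary remaining submatrix."""
    K = G.shape[1]
    counts = np.zeros(K, dtype=int)
    for row in G:
        if row.sum() == 1:               # a child with exactly one parent
            counts[np.argmax(row)] += 1  # usable as a row of an identity block
    return bool(np.all(counts >= 3))     # need three such rows per column

G = np.vstack([np.eye(3), np.eye(3), np.eye(3),
               [[1, 1, 0], [0, 1, 1]]]).astype(int)
print(contains_three_identities(G))      # True
```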
Theorem 4 implies the requirement that $p \geq 3K_1$ and that $K_{m-1} \geq 3K_m$ for every $m = 2, \ldots, D - 1$, through the form of $G^{(m)}$ in (11). That is, the number of latent variables per layer decreases as the layer goes deeper up the pyramid. The condition (11) in Theorem 4 requires that each latent variable $\alpha^{(m)}_k$ has at least three children in the layer directly below it that do not have any other parents. Our identifiability conditions hold regardless of the specific models chosen for the conditional distributions $y_j \mid \boldsymbol{\alpha}^{(1)}_{\mathrm{pa}(j)}$ and each $\alpha^{(m)}_k \mid \boldsymbol{\alpha}^{(m+1)}_{\mathrm{pa}(\alpha^{(m)}_k)}$, as long as the graphical structure is enforced and these component models are not over-parameterized in a naive manner. We next give two concrete examples which model the distribution of $\boldsymbol{\alpha}^{(m)} \mid \boldsymbol{\alpha}^{(m+1)}$ differently but both respect the graph given by $G^{(m+1)}$.
Example 2. We first consider modeling the effects of the parent variables of each $\alpha^{(m)}_k$ as
$$\mathbb{P}(\alpha^{(m)}_k = 1 \mid \boldsymbol{\alpha}^{(m+1)}, \boldsymbol{\beta}^{(m+1)}, G^{(m+1)}) = f\Big(\beta^{(m+1)}_{k,0} + \sum_{k':\, g^{(m+1)}_{k,k'} = 1} \beta^{(m+1)}_{k,k'}\, \alpha^{(m+1)}_{k'}\Big), \tag{12}$$
where $f: \mathbb{R} \to (0,1)$ is a link function. The number of $\beta$-parameters in (12) equals $\sum_{k'=1}^{K_{m+1}} g^{(m+1)}_{k,k'}$, which is the number of edges pointing to $\alpha^{(m)}_k$. Choosing $f(x) = 1/(1 + \exp(-x))$ leads to a model similar to sparse deep belief networks (Hinton et al., 2006; Lee et al., 2007).
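A direct transcription of (12) follows (a sketch under our own toy numbers; the sigmoid is the link mentioned above):

```python
import numpy as np

def parent_effect_prob(alpha_parents, g_row, beta0, beta,
                       f=lambda x: 1.0 / (1.0 + np.exp(-x))):
    """P(alpha_k^{(m)} = 1 | parent layer) under the main-effect model (12):
    only parents flagged in g_row enter the linear predictor."""
    active = g_row == 1
    return f(beta0 + beta[active] @ alpha_parents[active])

g_row = np.array([1, 0, 1, 0])            # alpha_k has parents 1 and 3
beta = np.array([2.0, 0.0, -1.0, 0.0])    # coefficients for flagged parents
print(parent_effect_prob(np.array([1, 1, 0, 1]), g_row, -0.5, beta))
```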
Example 3. To obtain a more parsimonious alternative to (12), let
$$\mathbb{P}(\alpha^{(m)}_k = 1 \mid \boldsymbol{\alpha}^{(m+1)}, \boldsymbol{\theta}^{(m+1)}, G^{(m+1)}) = \begin{cases} \theta^{(m+1)}_{k,1}, & \text{if } I\big(\alpha^{(m+1)}_{k'} = g^{(m+1)}_{k,k'} = 1 \text{ for at least one } k' \in [K_{m+1}]\big) = 1; \\ \theta^{(m+1)}_{k,0}, & \text{otherwise}. \end{cases} \tag{13}$$
Model (13) satisfies the conditional independence encoded by $G^{(m+1)}$, since $I(\alpha^{(m+1)}_{k'} = g^{(m+1)}_{k,k'} = 1$ for at least one $k' \in [K_{m+1}]) \equiv 1 - \prod_{k':\, g^{(m+1)}_{k,k'}=1} (1 - \alpha^{(m+1)}_{k'})$, implying that the distribution of $\alpha^{(m)}_k$ only depends on its parents in the $(m+1)$th latent layer. This model provides a probabilistic version of Boolean matrix factorization (Miettinen and Vreeken, 2014). The binary indicator equals the Boolean product of the two binary vectors $G^{(m+1)}_{k,:}$ and $\boldsymbol{\alpha}^{(m+1)}$. The $1 - \theta^{(m+1)}_{k,1}$ and $\theta^{(m+1)}_{k,0}$ quantify the two probabilities that the entry $\alpha^{(m)}_k$ does not equal the Boolean product.

Since Examples 2 and 3 satisfy the conditional independence constraints encoded by the graphical matrices $G^{(m+1)}$, they satisfy the equality constraint in (2) with the constraint matrix $S^{(m+1)}$. Therefore, our identifiability conclusion in Theorem 4 applies to both examples with appropriate inequality constraints on the $\beta$-parameters or the $\theta$-parameters; for example, see Proposition 3 in Section 4.

Besides Examples 2 and 3, there are many other models that respect the graphical structure. For example, (12) can be extended to include interaction effects of the parents of $\alpha^{(m)}_k$ as follows:
$$\mathbb{P}(\alpha^{(m)}_k = 1 \mid \boldsymbol{\alpha}^{(m+1)}, \boldsymbol{\beta}^{(m+1)}, G^{(m+1)}) = f\Big(\beta^{(m+1)}_{k,0} + \sum_{k':\, g^{(m+1)}_{k,k'}=1} \beta^{(m+1)}_{k,k'}\, \alpha^{(m+1)}_{k'} + \sum_{k_1 \neq k_2} g^{(m+1)}_{k,k_1} g^{(m+1)}_{k,k_2}\, \beta^{(m+1)}_{k,k_1 k_2}\, \alpha^{(m+1)}_{k_1} \alpha^{(m+1)}_{k_2} + \cdots + \beta^{(m+1)}_{k,\mathrm{all}} \prod_{\ell:\, g^{(m+1)}_{k,\ell}=1} \alpha^{(m+1)}_{\ell}\Big). \tag{14}$$
In (14), if $\alpha^{(m)}_k$ has $M := \sum_{k'=1}^{K_{m+1}} g^{(m+1)}_{k,k'}$ parents, then the number of $\beta$-parameters equals $2^M$.

We next briefly review the connections between the proposed SG-LCM and existing statistical and machine learning models. In educational measurement research, Tatsuoka (1983) proposed the rule-space approach and Haertel (1989) first used the term restricted latent class models; these models are essentially binary latent skill models where each of a set of test items may require different combinations of skills. Further developments along this line in the psychometrics literature led to a popular family of cognitive diagnostic models (Rupp and Templin, 2008; von Davier and Lee, 2019). Recently, there have been emerging studies on the identifiability and estimation of such diagnostic models (e.g., Chen et al., 2015; Xu, 2017; Fang et al., 2019; Gu and Xu, 2019, 2020; Chen et al., 2020a). However, to our knowledge, there have not been works that model multilayer latent structure behind the data and establish identifiability in such scenarios.

The proposed SG-LCM is also related to sparse deep belief networks (DBNs, Hinton et al., 2006), sum-product networks (SPNs, Poon and Domingos, 2011), and latent tree graphical models (LTMs, Mourad et al., 2013) in the machine learning literature. One distinction between SG-LCMs and DBNs is that DBNs have undirected edges between the deepest two layers, while our proposed model is fully generative with all edges pointing top-down. An SPN is a rooted directed acyclic graph consisting of sum nodes and product nodes. Zhao et al.
(2015) show that an SPN can be converted to a bipartite Bayesian network, where the number of discrete latent variables equals the number of sum nodes in the SPN. Our model is more general in the sense that, in addition to modeling a bipartite network between the latent layer and the observed layer, we further model the dependence among the latent variables instead of assuming them independent. LTMs are a special case of our model because all the variables in an LTM form a tree, while our model allows for more general DAGs beyond trees. Although DBNs, SPNs, and LTMs are extremely popular, their identifiability has received essentially no attention; an exception is Zwiernik (2018), which briefly discussed identifiability for relatively simple LTMs.
4 Bayesian Inference for a Two-Latent-Layer Model

As discussed earlier, the proposed multilayer sparse graph LCM in Section 3 has universal identifiability arguments covering many different model structures. In this section, as an illustration, we focus on a two-latent-layer model with depth $D = 2$ and use a Bayesian approach to infer the latent structure and model parameters. We specify the conditional distribution of each variable using a multinomial or binomial logistic model with its parent variables as linear predictors. Formally, we consider the following model:
$$\mathbb{P}(y_j = c \mid \boldsymbol{\alpha}^{(1)} = \boldsymbol{\alpha}) = \frac{\exp\big(\beta_{j,c,0} + \sum_{k=1}^{K} \beta_{j,c,k}\, g^{(1)}_{j,k}\, \alpha_k\big)}{\sum_{m=1}^{d_j} \exp\big(\beta_{j,m,0} + \sum_{k=1}^{K} \beta_{j,m,k}\, g^{(1)}_{j,k}\, \alpha_k\big)}, \quad j \in [p],\ c \in [d_j];$$
$$\mathbb{P}(\boldsymbol{\alpha}^{(1)} = \boldsymbol{\alpha}) = \sum_{b=1}^{B} \tau_b \prod_{k=1}^{K} \eta_{k,b}^{\alpha_k} (1 - \eta_{k,b})^{1 - \alpha_k}, \quad \boldsymbol{\alpha} \in \{0,1\}^{K}. \tag{15}$$
Here we assume $\beta_{j,d_j,0} \equiv 0$ and $\beta_{j,d_j,k} \equiv 0$ for $k \in [K]$, as conventionally done in logistic models. The parameter $\beta_{j,c,k}$ can be viewed as the weight associated with the potential directed edge from latent variable $\alpha^{(1)}_k$ to observed variable $y_j$, for response category $c$. The $\beta_{j,c,k}$ only impacts the likelihood if there is an edge from $\alpha^{(1)}_k$ to $y_j$, with $g^{(1)}_{j,k} = 1$. Denote the collection of $\beta$ parameters by $\boldsymbol{\beta}$. The second line of (15) specifies a usual latent class model in the second latent layer, with $B$ latent classes for the $K$-dimensional latent vector $\boldsymbol{\alpha}^{(1)}$. This layer of the model has latent class proportion parameters $\boldsymbol{\tau} = (\tau_1, \ldots, \tau_B)^\top$ and conditional probability parameters $\boldsymbol{\eta} = (\eta_{k,b})_{K \times B}$. The full set of model parameters across layers is $(G^{(1)}, \boldsymbol{\beta}, \boldsymbol{\tau}, \boldsymbol{\eta})$, and the model structure is shown in Fig. 2. We can denote the conditional probability $\mathbb{P}(y_j = c \mid \boldsymbol{\alpha}^{(1)} = \boldsymbol{\alpha})$ in (15) by $\lambda_{j,c,\boldsymbol{\alpha}}$.

Figure 2: Two-latent-layer model, with $z \in [B]$ at the root, binary traits $\boldsymbol{\alpha}^{(1)} \in \{0,1\}^K$ in the middle layer governed by $\boldsymbol{\eta} = (\eta_{k,b})$, and the observed $\mathbf{y} \in [d_1] \times [d_2] \times \cdots \times [d_p]$ linked to $\boldsymbol{\alpha}^{(1)}$ through $G^{(1)} = (g^{(1)}_{j,k})$ and $\boldsymbol{\beta} = \{\beta_{j,c,k}\}$. Latent variables $z, \alpha_1, \ldots, \alpha_K$ and observed variables $y_1, \ldots, y_p$ are subject-specific, while the model parameters $G^{(1)}, \boldsymbol{\beta}, \boldsymbol{\tau}, \boldsymbol{\eta}$ are population quantities.
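For concreteness, the categorical conditional in the first line of (15) is a masked softmax, sketched below with made-up parameter values (ours); the baseline category $d_j$ carries zero coefficients, as stated above.

```python
import numpy as np

def cond_prob_y(beta0, beta, g_row, alpha):
    """P(y_j = c | alpha) under (15): softmax over categories, where latent
    variable k contributes only when the edge g_{j,k} = 1 is present."""
    logits = beta0 + (beta * g_row) @ alpha   # one logit per non-baseline c
    logits = np.append(logits, 0.0)           # baseline category d_j
    expl = np.exp(logits - logits.max())
    return expl / expl.sum()

beta0 = np.array([-1.0, -1.0, -1.0])                    # intercepts, c = 1, 2, 3
beta = np.array([[3.0, 0.0], [3.0, 0.0], [3.0, 0.0]])   # main effects, (d_j-1) x K
g_row = np.array([1, 0])                                # y_j has one parent
print(cond_prob_y(beta0, beta, g_row, np.array([1, 0])))
```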
For the two-latent-layer model in (15), the following proposition presents strict identifiability conditions in terms of explicit inequality constraints on the $\beta$ parameters.

Proposition 3. Consider model (15) with true parameters $(G^{(1)}, \boldsymbol{\beta}, \boldsymbol{\tau}, \boldsymbol{\eta})$.

(a) Suppose $G^{(1)} = (I_K;\ I_K;\ I_K;\ (G^\star)^\top)^\top$, and $\beta_{j,c,j} \neq 0$, $\beta_{j+K,c,j} \neq 0$, and $\beta_{j+2K,c,j} \neq 0$ for $j \in [K]$, $c \in [d_j - 1]$. Then $G^{(1)}$, $\boldsymbol{\beta}$, and the probability tensor $\boldsymbol{\nu}^{(1)}$ of $\boldsymbol{\alpha} \in \{0,1\}^K$ are strictly identifiable.

(b) Under the conditions of part (a), if further $K \geq \lceil \log_2 B \rceil + 1$, then the parameters $\boldsymbol{\tau}$ and $\boldsymbol{\eta}$ are generically identifiable from $\boldsymbol{\nu}^{(1)}$.
As mentioned in the last paragraph of Section 3, the identifiability results in the previous section apply under general distributional assumptions for variables organized in a sparse multilayer SG-LCM. When considering specific models, properties of the model can be leveraged to weaken the identifiability conditions. The next proposition illustrates this, establishing generic identifiability for the model in (15). Before stating the identifiability result, we formally define the allowable constrained parameter space for $\boldsymbol{\beta}$ under a graphical matrix $G^{(1)}$ as
$$\Omega(\boldsymbol{\beta};\ G^{(1)}) = \big\{\boldsymbol{\beta} = (\beta_{j,c,k}):\ \beta_{j,c,k} \neq 0 \text{ if } g^{(1)}_{j,k} = 1;\ \text{and } \beta_{j,c,k} = 0 \text{ if } g^{(1)}_{j,k} = 0\big\}. \tag{16}$$
Proposition 4. Consider model (15) with $\boldsymbol{\beta}$ belonging to $\Omega(\boldsymbol{\beta};\ G^{(1)})$ in (16). Suppose the graphical matrix $G^{(1)}$ contains three non-overlapping $K \times K$ submatrices $G^{(1)}_1$, $G^{(1)}_2$, and $G^{(1)}_3$, each of which has all its diagonal elements equal to one (and each off-diagonal element can be either one or zero). Also suppose $K \geq \lceil \log_2 B \rceil + 1$. Then $(G^{(1)}, \boldsymbol{\beta}, \boldsymbol{\tau}, \boldsymbol{\eta})$ are generically identifiable.

We propose a Bayesian inference procedure for the two-latent-layer model. We apply a Gibbs sampler employing the Polya-Gamma data augmentation of Polson et al. (2013) together with the auxiliary variable method for multinomial logit models in Holmes and Held (2006). First consider the case where the true number of binary latent variables $K$ in the middle layer is fixed. Inferring the latent sparse graph $G^{(1)}$ is equivalent to inferring the sparsity structure of the continuous parameters $\beta_{j,c,k}$. Let the prior for the intercepts $\beta_{j,c,0}$ be $N(\mu_0, \sigma_0^2)$, where the hyperparameters $\mu_0, \sigma_0^2$ can be set to give weakly informative priors for the logistic regression coefficients (Gelman et al., 2008). Prior specifications for the other parameters are:

(i) for fixed $K$ and $k \in [K]$:
$$\beta_{jck} \mid (\sigma^2_{ck},\ g_{j,k} = 1) \sim N(0, \sigma^2_{ck}); \qquad \beta_{jck} \mid (\sigma^2_{ck},\ g_{j,k} = 0) \sim N(0, v_0);$$
$$\sigma^2_{ck} \sim \mathrm{InvGa}(a_\sigma, b_\sigma); \qquad \mathbb{P}(g_{j,k} = 1) = 1 - \mathbb{P}(g_{j,k} = 0) = \gamma. \tag{17}$$
Here $v_0$ is a small positive number specifying the "pseudo"-variance of the essentially inactive coefficients, and we take $v_0$ close to zero. The $\gamma$ in (17) is further given a noninformative prior $\gamma \sim \mathrm{Beta}(1, 1)$, and we set $a_\sigma = b_\sigma = 2$. In the data augmentation part, for each subject $i \in [n]$, each observed variable $j \in [p]$, and each non-baseline category $c = 1, \ldots, d_j - 1$,
we introduce a Polya-Gamma random variable $w_{ijc} \sim \mathrm{PG}(1, 0)$. Given data $y_{ij} \in [d_j]$, introduce binary indicators $y_{ijc} = I(y_{ij} = c)$. For each variable $j$, the posterior of its coefficients satisfies
$$p(\boldsymbol{\beta} \mid \mathbf{y}^n) \propto \prod_{i=1}^{n} \prod_{c=1}^{d_j - 1} \frac{\big[\exp\big(\beta_{jc0} + \sum_{k=1}^{K} \beta_{jck}\, g_{j,k}\, a_{i,k}\big)\big]^{y_{ijc}}}{\sum_{c'=1}^{d_j} \exp\big(\beta_{jc'0} + \sum_{k=1}^{K} \beta_{jc'k}\, g_{j,k}\, a_{i,k}\big)} \propto \prod_{i=1}^{n} \prod_{c=1}^{d_j - 1} \frac{\exp(\phi_{ijc})^{y_{ijc}}}{1 + \exp(\phi_{ijc})},$$
where $a_{i,k}$ denotes subject $i$'s realization of the latent trait $\alpha^{(1)}_k$ and we introduce
$$\phi_{ijc} = \beta_{jc0} + \sum_{k=1}^{K} \beta_{jck}\, g_{j,k}\, a_{i,k} - C_{ij}(c); \qquad C_{ij}(c) = \log \sum_{1 \leq \ell \leq d_j,\ \ell \neq c} \exp\Big(\beta_{j\ell 0} + \sum_{k=1}^{K} \beta_{j\ell k}\, g_{j,k}\, a_{i,k}\Big). \tag{18}$$
Next, by a property of the Polya-Gamma random variables (Polson et al., 2013), we have
$$\frac{\exp(\phi_{ijc})^{y_{ijc}}}{1 + \exp(\phi_{ijc})} = 2^{-1} \exp\{(y_{ijc} - 1/2)\,\phi_{ijc}\} \int_0^{\infty} \exp\{-w_{ijc}\,\phi_{ijc}^2/2\}\, p_{\mathrm{PG}}(w_{ijc} \mid 1, 0)\, \mathrm{d}w_{ijc},$$
where $p_{\mathrm{PG}}(w_{ijc} \mid 1, 0)$ denotes the density function of $\mathrm{PG}(1, 0)$. Conditionally on the augmented $\{w_{ijc}\}$, the full conditional posteriors of the $\beta_{jc0}$'s and $\beta_{jck}$'s are still Gaussian, and the conditional posterior of each $w_{ijc}$ is still Polya-Gamma with $(w_{ijc} \mid -) \sim \mathrm{PG}\big(1,\ \beta_{jc0} + \sum_{k=1}^{K} \beta_{jck}\, g_{j,k}\, a_{i,k} - C_{ij}(c)\big)$; these full conditional distributions are easy to sample from. As for the binary entries $g_{j,k}$ indicating the presence or absence of edges in the SG-LCM, we sample each $g_{j,k}$ individually from its posterior Bernoulli distribution. The detailed steps of such a Gibbs sampler with known $K$ are presented in the Supplementary Material.
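The core update can be sketched as follows for the simpler binary-logistic case (the multinomial case subtracts $C_{ij}(c)$ from the linear predictor exactly as in (18)). This is an illustration, not the paper's implementation, and it assumes the third-party `polyagamma` package for drawing $\mathrm{PG}(1, \cdot)$ variables.

```python
import numpy as np
from polyagamma import random_polyagamma  # assumed third-party PG sampler

def pg_gibbs_step(X, y, prior_prec, beta, rng):
    """One Gibbs sweep for y_i ~ Bernoulli(sigmoid(x_i' beta)) with a
    N(0, prior_prec^{-1}) prior, via Polya-Gamma augmentation."""
    phi = X @ beta                                    # linear predictors phi_i
    w = random_polyagamma(1, phi, random_state=rng)   # w_i | - ~ PG(1, phi_i)
    kappa = y - 0.5                                   # the (y_i - 1/2) terms
    prec = X.T @ (w[:, None] * X) + prior_prec        # Gaussian full conditional
    cov = np.linalg.inv(prec)
    mean = cov @ (X.T @ kappa)
    return rng.multivariate_normal(mean, cov)         # draw beta | w, y
```

In the model above, the role of `X` is played by an intercept column together with the masked latent traits $g_{j,k} a_{i,k}$, so the same Gaussian update applies block by block.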
On top of the sampling algorithm described above for fixed $K$, we propose a method for simultaneously inferring $K$. In the context of mixture models, Rousseau and Mengersen (2011) defined over-fitted mixtures with more than enough latent components and used shrinkage priors to effectively delete the unnecessary ones. In a similar spirit, but motivated by Gaussian linear factor models, Legramanti et al. (2020) proposed the cumulative shrinkage process (CSP) prior, which has a spike and slab structure with increasing mass on the spike as the component index increases. We use a CSP prior on the variances $\{\sigma^2_{ck}\}$ of $\{\beta_{jck}\}$ to infer the number of latent binary variables $K$ in a two-layer SG-LCM. Letting $\beta_{jck} \sim N(0, \sigma^2_{ck})$, we choose:

(ii) for unknown $K < K_{\mathrm{upper}} - 1$ and $k \in [K_{\mathrm{upper}}]$:
$$\sigma^2_{ck} \mid \pi_k \sim (1 - \pi_k)\,\mathrm{InvGa}(a_\sigma, b_\sigma) + \pi_k\, \delta_{\theta_\infty}; \tag{19}$$
$$\pi_k = \sum_{\ell=1}^{k} v_\ell \prod_{m=1}^{\ell - 1} (1 - v_m), \tag{20}$$
where $\pi_k$ has a stick-breaking representation as in (20) with $v_1, v_2, \ldots, v_{K_{\mathrm{upper}} - 1}$ following $\mathrm{Beta}(1, \alpha)$ and $v_{K_{\mathrm{upper}}} \equiv 1$, and $\mathrm{InvGa}(a_\sigma, b_\sigma)$ denotes the inverse gamma distribution with shape $a_\sigma$ and scale $b_\sigma$. The $\delta_{\theta_\infty}$ is the spike distribution and $\mathrm{InvGa}(a_\sigma, b_\sigma)$ is the slab distribution, with $\theta_\infty$ a prespecified spike variance that is close to zero. The $K_{\mathrm{upper}}$ in (19) is an upper bound for $K + 1$ since $\pi_{K_{\mathrm{upper}}} \equiv 1$.
Since $\pi_k$ in (19) denotes the probability that the $k$th component is in the spike, the effective number of latent variables $K^\star$ has posterior mean
$$\mathbb{E}[K^\star \mid \mathbf{Y}] = \mathbb{E}\left[\sum_{k=1}^{K_{\mathrm{upper}}} (1 - \pi_k) \,\middle|\, \mathbf{Y}\right]. \tag{21}$$
The above CSP prior leads to straightforward Gibbs updating steps, as shown in the Supplementary Material.

5 Simulation Studies

We conducted replicated simulation studies to assess the Bayesian procedure proposed in Section 4 and examine whether the model parameters are indeed estimable, as implied by Theorem 3. Consider a two-latent-layer SG-LCM with $p = 20$ observed variables, $d = 4$ response categories for each observed variable, $K = 4$ binary latent variables in the middle layer, and one binary latent variable in the deep layer. Let the true $p \times K$ binary graphical matrix be $G^{(1)} = (I_4;\ I_4;\ I_4;\ 1100;\ 0110;\ 0011;\ 1001;\ 1010;\ 1001;\ 0101;\ 1110;\ 0111)$. Such a $G^{(1)}$ satisfies the conditions for strict identifiability in Theorem 4, since it contains three copies of the identity matrix $I_4$ as submatrices. Let the true intercept parameters for categories $c = 1, 2, 3$ be $(\beta_{j,1,0}, \beta_{j,2,0}, \beta_{j,3,0}) = (-1, -1, -1)$ for each $j$,
and for any $g^{(1)}_{j,k} = 1$, let the corresponding true main-effect parameters of the binary latent variables be $\beta_{j,c,k} = 3$ if variable $y_j$ has a single parent and $\beta_{j,c,k} = 2$ if $y_j$ has multiple parents. See Fig. 3(c) for a heatmap of the sparse matrix of the main-effect parameters $(\beta_{j,1,k};\ j \in [p],\ k \in [K])$ for category $c = 1$.

We use the estimation procedure proposed in Section 4 with the CSP prior for posterior inference and specify $K_{\mathrm{upper}} = 7$. For the CSP prior, we take $\alpha = 5$ and set the spike variance to $\theta_\infty = 0.07$.
The results are not very sensitive to the choices of $\alpha$ and $\theta_\infty$ in our experience. Throughout the Gibbs sampling iterations, we enforce the identifiability constraint that $\beta_{j,c,k} > 0$ whenever $g_{j,k} = 1$. For each sample size $n$ we conduct 50 independent simulation replications. In each replication, we ran the chain for 15000 iterations, discarding the first 5000 iterations as burn-in, and retaining every fifth sample post burn-in to thin the chain. Since the model is identifiable up to a permutation of the latent variables in each layer, we post-process the posterior samples to find a column permutation of $G^{(1)}$ that best matches the simulation truth; the columns of the other parameter matrices are then permuted accordingly.

Fig. 3 presents posterior means of the main-effect parameters $(\beta_{j,1,k};\ j \in [p],\ k \in [K])$ for category $c = 1$, averaged across the 50 independent replications. The leftmost plot of Fig. 3 shows that for a relatively small sample size $n = 500$, the posterior means of $(\beta_{j,1,k})$ already exhibit similar structure to the ground truth in the rightmost plot. Also, the true number $K = 4$ of binary latent variables in the middle latent layer is revealed in Fig. 3(a)-(b). For the sample size $n = 2000$, Fig. 3(b) shows the posterior means of $(\beta_{j,1,k})$ are very close to the ground truth. The posterior means for the other categories $c = 2, 3$ behave similarly to those for $c = 1$. The estimated $(\beta_{j,1,k})$ are slightly biased toward zero for the smaller sample size ($n = 500$), with less bias for the larger sample size ($n = 2000$). Our results suggest that the binary graphical matrix that underlies $\{\beta_{j,c,k}\}$ (that is, the sparsity structure of the continuous parameters) is easier to estimate than the nonzero $\{\beta_{j,c,k}\}$ themselves. That is, with finite samples, the existence of a link between each observed variable $y_j$ and each binary latent variable $\alpha_k$ is easier to estimate than the strength of such a link (if it exists).

Figure 3: Posterior means of the main-effect parameters $\{\beta_{j,1,k}\}$ for response category $c = 1$, averaged across 50 independent replications, for $n = 500$ or $2000$. The $K_{\mathrm{upper}} = 7$ is taken in the CSP prior for posterior inference, while the true $K$ is 4, as shown in the rightmost plot.

To assess how the approach performs with an increasing sample size, we consider eight sample sizes $n = 250 \cdot i$ for $i = 1, 2, \ldots, 8$. Matching the truth $K = 4$, we retain exactly four binary latent variables among $k \in [K_{\mathrm{upper}}]$, choosing those having the largest average posterior variance $\frac{1}{d}\sum_{c=1}^{d} \sigma^2_{ck}$. For the discrete structure of the binary graphical matrix $G^{(1)}$, we present mean estimation errors in Fig. 4. Specifically, Fig. 4(a) plots the errors of recovering the entire matrix $G^{(1)}$, Fig. 4(b) plots the errors of recovering the row vectors of $G^{(1)}$, and Fig. 4(c) plots the errors of recovering the individual entries of $G^{(1)}$. For sample sizes as small as $n = 750$, estimation of $G^{(1)}$ is very accurate across the simulation replications. Notably, $n = 750$ is much smaller than the total number of cells $d^p = 4^{20} \approx 1.1 \times 10^{12}$ in the contingency table for the observed variables $\mathbf{y}$. For the continuous parameters $\boldsymbol{\beta}_0 = \{\beta_{jc0}\}$, $\boldsymbol{\beta} = \{\beta_{jck}\}$, and $\boldsymbol{\eta} = \{\eta_{kb}\}$, in Fig. 5 we plot average root-mean-square errors (RMSE) of the posterior means versus the sample size $n$. For each of $\boldsymbol{\beta}_0$, $\boldsymbol{\beta}$, and $\boldsymbol{\eta}$, Fig. 5 shows the RMSE decreases with sample size, as expected given our posterior consistency result in Theorem 3.
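The label-switching post-processing mentioned above amounts to an assignment problem; here is a sketch (ours) using SciPy's Hungarian-algorithm solver to permute the estimated columns toward the simulation truth.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_columns(G_est, G_true):
    """Permute columns of an estimated binary matrix to best match the
    truth, resolving latent-variable label switching."""
    cost = np.array([[np.sum(G_est[:, a] != G_true[:, b])
                      for b in range(G_true.shape[1])]
                     for a in range(G_est.shape[1])])
    rows, cols = linear_sum_assignment(cost)  # minimize total disagreement
    perm = np.empty_like(cols)
    perm[cols] = rows                         # column b <- estimated column a
    return G_est[:, perm]
```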
Figure 4: Mean estimation errors of the binary graphical matrix $G^{(1)}$ at the matrix level (in (a)), row level (in (b)), and entry level (in (c)). The median, 25% quantile, and 75% quantile based on the 50 independent simulation replications are shown by the circle, lower bar, and upper bar, respectively.

Figure 5: Average root-mean-square errors (RMSE) of the posterior means of the main effects $\boldsymbol{\beta}$, intercepts $\boldsymbol{\beta}_0$, and deeper tensor arms $\boldsymbol{\eta}$ versus sample size $n = 250 \cdot i$ for $i \in \{1, 2, \ldots, 8\}$. The median, 25% quantile, and 75% quantile based on the 50 independent simulation replications are shown by the circle, lower bar, and upper bar, respectively.

6 Application to Nucleotide Sequence Data

We apply our proposed Bayesian two-latent-layer SG-LCM with the CSP prior to the Splice Junction dataset of DNA nucleotide sequences available from the UCI machine learning repository. We also analyze another dataset of nucleotide sequences, the Promoter data, and present the results in the Supplementary Material. The Splice Junction data consist of A, C, G, T nucleotides ($d = 4$) at $p = 60$ positions for $n = 3175$ DNA sequences. There are two types of genomic regions called introns and exons; junctions between these two are called splice junctions (Nguyen et al., 2016). Splice junctions can be further categorized as (a) exon-intron junctions and (b) intron-exon junctions. The $n = 3175$ samples in the Splice dataset each belong to one of three types: Exon-Intron junction ("EI", 762 samples); Intron-Exon junction ("IE", 765 samples); and Neither EI nor IE ("N", 1648 samples). Previous studies have used supervised learning methods for predicting sequence type (e.g., Li and Wong, 2003; Nguyen et al., 2016). Here we fit the proposed two-latent-layer SG-LCM to the data in a completely unsupervised manner, with the sequence type information held out. We use the nucleotide sequences to learn discrete latent representations of each sequence, and then investigate whether the learned lower-dimensional discrete latent features are interpretable and predictive of the sequence type.

We let the variable $z$ in the deepest latent layer have $B = 3$ categories, inspired by the fact that there are three types of sequences (EI, IE, and N); but we do not use any information about which sequence belongs to which type when fitting the model to data. In the CSP prior, we take $K_{\mathrm{upper}} = 7$ and $\alpha = 2$. We ran the Gibbs sampler for 15000 iterations, discarding the first 5000 as burn-in, and retaining every fifth sample post burn-in to thin the chain. Based on examining traceplots, the sampler has good mixing behavior. Our method selects $K^\star = 5$ binary latent variables. We index the saved posterior samples by $r \in \{1, \ldots, R = 2000\}$, and for each $r$ denote the samples of $G$ by $(g^{(r)}_{j,k};\ j \in [60],\ k \in [5])$. Similarly, for each $r$, denote the posterior samples of the nucleotide sequences' latent binary traits by $(a^{(r)}_{i,k};\ i \in [3175],\ k \in [5])$, and denote those of the nucleotide sequences' deep latent categories by $(z^{(r)}_i;\ i \in [3175])$, where $z^{(r)}_i \in \{1, 2, 3\}$.
Define our final estimators $\widehat{G} = (\widehat{g}_{j,k})$, $\widehat{A} = (\widehat{a}_{i,k})$, and $\widehat{Z} = (\widehat{z}_{i,b})$ as
$$\widehat{g}_{j,k} = I\left(\frac{1}{R}\sum_{r=1}^{R} g^{(r)}_{j,k} > \frac{1}{2}\right), \qquad \widehat{a}_{i,k} = I\left(\frac{1}{R}\sum_{r=1}^{R} a^{(r)}_{i,k} > \frac{1}{2}\right),$$
$$\widehat{z}_{i,b} = \begin{cases} 1, & \text{if } b = \arg\max_{b' \in [B]} \frac{1}{R}\sum_{r=1}^{R} I\big(z^{(r)}_i = b'\big); \\ 0, & \text{otherwise}. \end{cases}$$
The $\widehat{g}_{j,k}$, $\widehat{a}_{i,k}$, and $\widehat{z}_{i,b}$ summarize the element-wise posterior modes of the discrete latent structures in our model. The $60 \times 5$ matrix $\widehat{G}$ depicts how the loci load on the binary latent traits, the $3175 \times 5$ matrix $\widehat{A}$ depicts the presence or absence of each binary latent trait in each nucleotide sequence, and the $3175 \times 3$ matrix $\widehat{Z}$ depicts which deep latent group each nucleotide sequence belongs to. The $\widehat{G}$, $\widehat{A}$, and $\widehat{Z}$ are all binary matrices, but the first two are binary feature matrices, while each row of $\widehat{Z}$ has exactly one entry equal to one, indicating group membership. In Fig. 6, the first three plots display the three estimated matrices $\widehat{G}$, $\widehat{A}$, and $\widehat{Z}$, respectively, and the last plot shows the held-out gene type information for reference. As for the estimated loci loading matrix $\widehat{G}$, Fig. 6(a) provides information on how the $p = 60$ loci depend on the five binary latent traits. Specifically, we found that the first 27 loci show somewhat similar loading patterns and mainly load on the first four binary traits. Also, the middle 10 loci (from locus 28 to locus 37) are similar in loading on all five traits, and the last 23 loci (from locus 38 to locus 60) are similar in exhibiting sparser loading structures. On the other hand, Fig. 6(b)-(d) show that the two matrices $\widehat{A}$ and $\widehat{Z}$ corresponding to the $n = 3175$ nucleotide sequences exhibit a clear pattern of clustering, which aligns well with the known but held-out junction types EI, IE, and N.

Figure 6: Splice junction data analysis under the CSP prior with $K_{\mathrm{upper}} = 7$. Plots are presented with the $K^\star = 5$ binary latent traits selected by our proposed method: (a) $\widehat{G}$ ($60 \times 5$); (b) $\widehat{A}$ ($3175 \times 5$); (c) $\widehat{Z}$ ($3175 \times 3$); (d) the held-out labels. After applying the rule-lists approach to deterministically match the latent features to the gene types as in (22), the accuracies for predicting the gene types EI, IE, and N are all above 95%.

To formally assess how the latent discrete features learned by the proposed method perform in downstream prediction, we apply the "rule lists" classification approach in Angelino et al. (2017) to the estimated latent features in $\widehat{A}$ and $\widehat{Z}$ for the $n = 3175$ nucleotide sequences. The rule-lists approach is an interpretable classification method based on a categorical feature space; it finds simple and deterministic rules of the categorical features for predicting a (binary) class label. For each instance $i \in \{1, \ldots, n = 3175\}$, we define the categorical features to be the 8-dimensional vector $\widehat{\mathbf{x}}_i = (\widehat{z}_{i,1}, \widehat{z}_{i,2}, \widehat{z}_{i,3}, \widehat{a}_{i,1}, \ldots, \widehat{a}_{i,5})$. The $\widehat{\mathbf{x}}_i$ is concatenated from $\widehat{\mathbf{z}}_i$ and $\widehat{\mathbf{a}}_i$ and is a feature vector with binary entries. Denote the ground-truth nucleotide sequence types by $\mathbf{t} = (t_i;\ 1 \leq i \leq n)$, where $t_i = \mathrm{EI}$ for $i = 1, \ldots, 762$; $t_i = \mathrm{IE}$ for $i = 763, \ldots, 1527$; and $t_i = \mathrm{N}$ for $i = 1528, \ldots, 3175$. The $t_i$'s are not used in fitting our SG-LCM to obtain the latent features $\widehat{\mathbf{x}}_i$'s.
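The element-wise posterior-mode summaries defined above are simple majority votes over the retained MCMC draws; a small sketch (ours), assuming the draws are stacked in arrays:

```python
import numpy as np

def majority_vote(samples):
    """Element-wise posterior mode of binary draws of shape (R, n, K):
    an entry is 1 when it equals 1 in more than half of the draws."""
    return (samples.mean(axis=0) > 0.5).astype(int)

def modal_class(z_samples, B):
    """One-hot matrix of each unit's most frequent sampled class, from
    draws of shape (R, n) with classes in {1, ..., B}."""
    R, n = z_samples.shape
    Z = np.zeros((n, B), dtype=int)
    for i in range(n):
        counts = np.bincount(z_samples[:, i], minlength=B + 1)[1:]
        Z[i, np.argmax(counts)] = 1     # class b maps to column b - 1
    return Z
```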
We use the Pythonpackage CORELS for the rule-lists method with x i ’s and t i ’s as input, and find rules thatmatch (cid:98) x i ’s to t i ’s. Specifically, these deterministic rules given by CORELS are: (cid:98) t EI i = I ( (cid:98) a i, = 1 and (cid:98) a i, = 1); (22) (cid:98) t IE i = I ( (cid:98) z i, = 1); (cid:98) t N i = I ( (cid:98) z i, = 0 and (cid:98) a i, = 0)Eq. (22) gives a very simple and explicit rule depending on (cid:98) z i, , (cid:98) a i, , and (cid:98) a i, for each nu-cleotide sequence i . This simple rule can be empirically verified by comparing the middletwo plots to the rightmost plot in Fig. 6. In (22), the rules for EI and IE are not mutuallyexclusive, but those for EI and N are mutually exclusive and the same holds for IE andN. Therefore, to obtain mutually exclusive rules for the three class labels EI, IE, and Nbased on (22), we can simply define the following two types of labels (cid:98) t (cid:63) = ( (cid:98) t (cid:63)i ; i ∈ [ n ]) and (cid:98) t † = ( (cid:98) t † i ; i ∈ [ n ]), (cid:98) t (cid:63)i = EI , if (cid:98) t EI i = 1 and (cid:98) t IE i = 0;EI , if (cid:98) t IE i = 1;N , if (cid:98) t N i = 1; or (cid:98) t † i = EI , if (cid:98) t EI i = 1;EI , if (cid:98) t IE i = 1 and (cid:98) t EI i = 0;N , if (cid:98) t N i = 1 . (23)For the two types of labels (cid:98) t (cid:63) and (cid:98) t † in (23), we provide their corresponding normalizedconfusion matrices in Fig. 7 along with their heatmaps. Fig. 7 shows that both (cid:98) t (cid:63) and (cid:98) t † havevery high prediction accuracies for the sequence types t , with all the diagonal entries above95%. The superb prediction performance as shown in Fig. 7 implies that the learned discretelatent features of nucleotide sequences are interpretable and very useful in the downstreamtask of classification. For the Splice Juntion dataset, such accuracies are even comparable tothe state-of-the-art prediction performance given by fine-tuned convolutional neural networks28 a) confusion matrix for the labels given by (cid:98) t (cid:63) (b) confusion matrix for the labels given by (cid:98) t † Figure 7: Downstream interpretable classification results for splice junction data following fit-ting the two-latent-layer SG-LCM. Prediction performance of the learned lower-dimensionaldiscrete latent features are summarized in two confusion matrices, corresponding to (cid:98) t (cid:63) and (cid:98) t † defined in (23).given in Nguyen et al. (2016), which are also between 95% and 97%. In this article, we investigated how to identify interpretable discrete latent structures frommultivariate categorical data. The proposed multilayer sparse graph LCM is general and cov-ers multiple existing statistical and machine learning models as special cases, while allowingfor various assumptions on the conditional distributions. Our identifiability theory is keyin providing reassurance that one can reliably and reproducibly learn the latent structure,guarantees that are lacking from almost all of the related literature.The proposed Bayesian approach had excellent performance in our simulations and dataanalyses, showing good computational performance and a surprising ability to accuratelyinfer interpretable and meaningful latent structure. There are immediate applications ofthe proposed SG-LCM approach in many other disciplines. For example, in ecology andbiodiversity research, joint species distribution modeling is a very active area (see a recentbook Ovaskainen and Abrego, 2020). 
The data consist of high-dimensional binary indicatorsof presence or absence of different species at each sampling location. In this contest, SG-LCMs provide a useful approach for inferring latent community structure of biologic and29cological interest.To realize the applied potential of the proposed framework, there are several immedi-ate research directions. The first is to develop algorithms and inferential frameworks forimplementing SG-LCMs beyond the two layer case; unlike for deep neural networks, ouridentifiability theory provides a natural restriction on the number of layers and number oflatent variables per layer depending on the dimensionality of the data. Conceptually, theapproach we used for inferring the number of latent variables in the second layer can be usedat each of multiple layers, but it remains to test and refine the associated algorithms. Wesuspect that interesting computational hurdles will arise in considering such extensions, par-ticularly due to the need to enforce identifiability while designing computational algorithms.More broadly, the proposed constrained CP decomposition and identifiability theory mayalso have applicability beyond models for multivariate discrete data. Consider the case wherethe data take the form of a non-negative tensor. Then the proposed constraint (2) on topof a nonnegative CP decomposition can also serve as a flexible tensor factorization modelfor finding interpretable latent structure underlying nonnegative continuous data. In suchscenarios, it should be possible to extend our theory on identifiability to that on uniquenessof constrained-CP decomposition with some modifications.
Acknowledgements
This work was partially supported by grants R01ES027498 and R01ES028804 of the Na-tional Institutes of Environmental Health Sciences of the United States National Institutesof Health. This project has received funding from the European Research Council underthe European Union’s Horizon 2020 research and innovation program (grant agreement No856506). 30 ppendix: Technical Proofs of the Main Results
In this appendix, we give technical proofs of key theoretical results in the article, whichinclude the proofs of all the results in Sections 2-3. The proofs of additional results inSection 4, proofs of some technical lemmas, posterior computation details, and additionaldata analysis are deferred to the Supplementary Material.
A.1 Proof of Proposition 1
We first prove part (a) of the proposition. Denote the Khatri-Rao product of the p matrices { Λ ( j ) , j ∈ [ p ] } by K = (cid:12) pj =1 Λ ( j ) , then K is of size (cid:81) pj =1 d j × k with rows indexed by thevariable pattern y ∈ × pj =1 [ d j ] and columns indexed by the latent category h ∈ [ k ]. We firstintroduce a useful lemma about the Khatri-Rao product. The proof of the lemma is providedin the Supplementary Material. Lemma 1.
For matrices A j , B j for j ∈ { , . . . , p } , there is (cid:12) pj =1 { A j B j } = (cid:8) ⊗ pj =1 A j (cid:9) · (cid:8) (cid:12) pj =1 B j (cid:9) , (A.1) where A j , B j have compatible dimensions such that the left hand side is well defined. By Lemma 1, K can be written as K = p (cid:79) j =1 · · · · · · · · · − − · · · − · λ ( j )1 , λ ( j )2 , · · · λ ( j ) k, λ ( j )1 , λ ( j )2 , · · · λ ( j ) k, ... ... ... ... λ ( j )1 ,d j − λ ( j )2 ,d j − · · · λ ( j ) k,d j − · · · = p (cid:79) j =1 · · · · · · · · · − − · · · − · p (cid:75) j =1 λ ( j )1 , λ ( j )2 , · · · λ ( j ) k, λ ( j )1 , λ ( j )2 , · · · λ ( j ) k, ... ... ... ... λ ( j )1 ,d j − λ ( j )2 ,d j − · · · λ ( j ) k,d j − · · · p (cid:79) j =1 B j · K =: B · K , where B is the Kronecker product of p lower-triangular matrices B j ’s whose diagonal entriesall equal to one. So by the property of the Kronecker product, rank( B ) = (cid:81) pj =1 rank( B j ) = (cid:81) pj =1 d j . Therefore B is invertible, and K has full column rank if and only if K has fullcolumn rank. We next constructively find k rows of this (cid:81) pj =1 d j × k matrix to form a k × k invertible submatrix, which will establish the full-rankness of matrix K and hence thefull-rankness of K .We next introduce the lexicographic order between binary vectors of the same lengthto facilitate the proof. For two binary vectors x = ( x , . . . , x p ) , y = ( y , . . . , y p ) both oflength p , we say x is of greater lexicographic order than y and denote by x (cid:31) lex y if either x > y , or x (cid:96) > y (cid:96) for some (cid:96) ∈ { , . . . , p } and x m = y m for all m = 1 , . . . , (cid:96) −
1. Underthe assumption of the lemma, the k columns of the constraint matrix S are distinct. Animportant consequence of this is that the k columns of S can be arranged in a decreasinglexicographic order. Without loss of generality, we assume S : , (cid:31) lex S : , (cid:31) lex . . . (cid:31) lex S : ,k . (A.2)Next we construct the k × k invertible submatrix E of K as follows. Since the rows of K are indexed by y ∈ [ d j ] p , we focus on constructing k different patterns y , . . . , y k ∈ [ d j ] p toform E . Denote by e j a J -dimensional standard basis vector with the j th entry being oneand the rest being zeros. We define y = (cid:88) j : S j, =1 y j e j + (cid:88) j : S j, =0 d j e j ; y = (cid:88) j : S j, =1 y j e j + (cid:88) j : S j, =0 d j e j ;... y k = (cid:88) j : S j,k =1 y kj e j + (cid:88) j : S j,k =0 d j e j , (A.3)where y hj ∈ { , . . . , d j − } for each h ∈ [ k ] can be arbitrary; as long as y hj (cid:54) = d j . We claim32hat E := K ( y , y ,..., y k ) , : is invertible. We next prove this claim. First, there is E (cid:96),h = K y (cid:96) ,h = (cid:89) j : S j,(cid:96) =1 λ ( j ) y (cid:96)j ,h (cid:89) j : S j,(cid:96) =0 (cid:89) j : S j,(cid:96) =1 λ ( j ) y (cid:96)j ,h , (cid:96) ∈ [ k ] , h ∈ [ k ] . (A.4)The matrix K can be viewed as a map taking the k matrices Λ = { Λ ( j ) : j ∈ [ p ] } as the input. Consider an arbitrary collection of p vectors ∆ := { δ ( j ) : j ∈ [ p ] } , where δ ( j ) = { δ ( j ) c : c ∈ [ d j ] and δ ( j ) d j = 0 } is d j -dimensional vector with the last element equal tozero for each j . Then Λ ( j ) and δ ( j ) · (cid:62) k have the same size d j × k . We introduce the followinguseful lemma, whose proof is in the Supplementary Material. Lemma 2.
There exists a (cid:81) pj =1 d j × (cid:81) pj =1 d j invertible square matrix B := B ( { δ ( j ) } j ∈ [ p ] ) depending only on { δ ( j ) } j ∈ [ p ] such that K ( { Λ ( j ) − δ ( j ) · (cid:62) k } j ∈ [ p ] ) = B ( { δ ( j ) } j ∈ [ p ] ) · K ( { Λ ( j ) } j ∈ [ p ] ) . (A.5)Lemma 2 implies that we can show K ( { Λ ( j ) } j ∈ [ p ] ) has full column rank by showing thatmatrix K ( { Λ ( j ) − δ ( j ) · (cid:62) k } j ∈ [ p ] ) has full column rank for some δ ( j ) ’s. In particular, weconstruct δ ( j ) to be δ ( j ) c = λ ( j ) c, for c ∈ [ d j − δ ( j ) d j = 0 . Recall the y , . . . , y k defined in (A.3), then the ( y (cid:96) , h )th entry of K ( { Λ ( j ) − δ ( j ) · (cid:62) k } j ∈ [ p ] ) is E (cid:63)(cid:96),h = (cid:89) j : S j,(cid:96) =1 (cid:16) λ ( j ) y (cid:96)j ,h − δ ( j ) y (cid:96)j (cid:17) = (cid:89) j : S j,(cid:96) =1 (cid:16) λ ( j ) y (cid:96)j ,h − λ ( j ) y (cid:96)j , (cid:17) , (cid:96) ∈ [ k ] , h ∈ [ k ] . Now consider if (cid:96) < h , then because S : ,(cid:96) (cid:31) lex S : ,h under our assumption in (A.2), there mustbe some variable j ∈ [ p ] such that S j,(cid:96) = 1 and S j,h = 0, so λ ( j ) y (cid:96)j ,h = λ ( j ) y (cid:96)j , for this j . Thismeans for (cid:96) < h the E (cid:63)(cid:96),h must contain a factor of (cid:16) λ ( j ) y (cid:96)j , − λ ( j ) y (cid:96)j , (cid:17) = 0, hence E (cid:63)(cid:96),h = 0 for any (cid:96) < h . So far we have obtained that E (cid:63) is a lower-triangular matrix. We further look at itsdiagonal entries. For any h ∈ [ k ] there is E (cid:63)h,h = (cid:89) j : S j,h =1 (cid:16) λ ( j ) y hj ,h − λ ( j ) y hj ,h (cid:17) (cid:54) = 033ue to the inequality constraint (3) that λ ( j ) y j ,h (cid:54) = λ ( j ) y j , if S j,h = 1. Now we have shown E (cid:63) is alower-triangular matrix with all the diagonal entries being nonzero, therefore E (cid:63) is invertible.Since E (cid:63) is a submatrix of K ( { Λ ( j ) − δ ( j ) · (cid:62) k } j ∈ [ p ] ) containing k of its rows, the latter musthave full column rank. So by Lemma 2, K ( { Λ ( j ) } j ∈ [ p ] ) also has full column rank. This provespart (a) of the proposition.We next prove part (b) of the proposition. It suffices to show that in the special casedescribed in Remark 1 the Khatri-Rao product matrix K = (cid:12) pj =1 Λ ( j ) does not have fullcolumn rank. Specifically, suppose that besides constraint (2) that λ ( j )1: d j , h = λ ( j )1: d j , h if S j,h = S j,h = 0, the parameters also satisfy λ ( j )1: d j , h = λ ( j )1: d j , h if S j,h = S j,h = 1. Inthis case, we claim that if S : ,h = S : ,h then there must also be K : ,h = K : ,h . To see this,first note that now the constraint put by the matrix S becomes λ ( j )1: d j , h = λ ( j )1: d j , h as long as S j,h = S j,h . So for each c = ( c , . . . , c p ) (cid:62) ∈ × pj =1 [ d j ], there is K c ,h = p (cid:89) j =1 λ ( j ) c j ,h = p (cid:89) j =1 λ ( j ) c j ,h = K c ,h , (A.6)where the second equality above follows from S j,h = S j,h for each j ∈ [ p ]. Now that (A.6)holds for all possible c ∈ × pj =1 [ d j ], we have K : ,h = K : ,h . Therefore, the matrix K is rank-deficient since it contains identical column vectors. This completes the proof of Proposition1. A.2 Proof of Theorem 1
For a matrix M , the Kruskal’s rank is the maximal number r such that every r columns of M are linear independent. Denote the Kruskal’s rank of M by rank K ( M ). We next restate auseful version of the Kruskal’s theorem on three-way tensor decomposition here to facilitatethe proof; see more discussion on how this theorem can be invoked to show identifiability fora variety of latent variable models in Allman et al. (2009). Lemma 3 (Kruskal (1977)) . Suppose M , M , M are three matrices of dimension a i × k for i = 1 , , , N , N , N are three matrices each with k columns, and (cid:12) i =1 M i = (cid:12) i =1 N i . If rank K ( M ) + rank K ( M ) + rank K ( M ) ≥ k + 2 , then there exists a permutation matrix P nd three invertible diagonal matrices D i with D D D = I k and N i = M i D i P . Under assumption (a) in the theorem, we apply Lemma 3 to establish that K i := K ( { Λ ( j ) } j ∈A i ) has full column rank for i = 1 and 2. Now consider K := K ( { Λ ( j ) } j ∈A )for the set of variables A . We first claim that for each column h of K , the sum of all theentries of K in this column equals one. To see this, denote the indices of variables in A by { j , . . . , j m } with m = |A | , then (cid:88) c ∈× j ∈A [ d j ] K c ,h = (cid:88) ( cj ,...,cjm ) ∈ [ dj ×···× [ djm ] (cid:32) m (cid:89) (cid:96) =1 λ ( j (cid:96) ) h,c j(cid:96) (cid:33) = m (cid:89) (cid:96) =1 (cid:88) c j(cid:96) ∈ [ d j(cid:96) ] λ ( j (cid:96) ) h,c j(cid:96) = m (cid:89) (cid:96) =1 . From the above it is not hard to see that K c ,h = P ( y j = c j , . . . , y j m = c j m | z = h ) where z isthe latent class indicator variable. So K : ,h is a vector living in the ( (cid:81) j ∈A d j − h , characterizing the conditional joint distribution of { y j } j ∈A given z = h .Under assumption (b) in the theorem, we claim that the matrix K must have Kruskalrank at least two. We show this via proof by contradiction. Suppose there exist two differentcolumns of K indexed by h , h that are linear dependent, then there are some a, b ∈ R such that a · K : ,h + b · K : ,h = 0 , (A.7)with a (cid:54) = 0 or b (cid:54) = 0. Since in the last paragraph we established (cid:80) c K c ,h = 1 for each h ,we can take the sum over all the c ∈ × j ∈A [ d j ] and obtain a · b · a (cid:54) = 0, then K : ,h = − b/a K : ,h = K : ,h . Now we fix an arbitrary (cid:96) ∈ A and c (cid:96) ∈ [ d (cid:96) ] in (A.7) and sumover all the remaining indices, (cid:88) m ∈ [ p ] m (cid:54) = j (cid:88) c m ∈ [ d m ] K c ,h = (cid:88) m ∈ [ p ] m (cid:54) = j (cid:88) c m ∈ [ d m ] p (cid:89) j =1 λ ( j ) h ,c j = λ ( (cid:96) ) h ,c (cid:96) (cid:89) m ∈ [ p ] m (cid:54) = (cid:96) (cid:88) c m ∈ [ d m ] λ ( m ) h ,c m = λ ( (cid:96) ) h ,c (cid:96) , so K : ,h = K : ,h indeed indicates that λ ( (cid:96) ) h ,c (cid:96) = λ ( (cid:96) ) h ,c (cid:96) for any (cid:96) ∈ A and c (cid:96) ∈ [ d (cid:96) ]. Thiscontradicts the assumption (b) in the theorem that for any h (cid:54) = h there exists j ∈ A and c ∈ [ d j ] for which λ ( j ) h ,c (cid:54) = λ ( j ) h ,c . This contradiction implies that K must have Kruskal rankat least two. 35inally, since the distribution for the observed y = ( y , . . . , y j ) (cid:62) can be written as thetensor Π = ( π c ··· c p ) and further Π = (cid:8) (cid:12) pj =1 Λ j (cid:9) · ν = (cid:8) (cid:12) i =1 K i (cid:9) · ν = K (cid:12) K (cid:12) (cid:8) K · diag( ν ) (cid:9) , (A.8)where ν = ( ν , . . . , ν k ) (cid:62) belongs to the ( k − ν ) denotes a k × k diagonal matrix with diagonal entries being ν h ’s. 
Recall that we have established K , K both have full column rank and K has Kruskal rank at least two. Under the assumptionthat ν h > h ∈ [ k ] in the theorem, it is not hard to see that matrix K · diag( ν )also has full column rank. Now suppose an alternative set of parameters ( ν , Λ ) and the trueparameters ( ν , Λ ) lead to the same distribution, that is, lead to the same Π . We similarlydenote K i := K ( { Λ ( j ) } j ∈A i ) for i = 1 , ,
3. Then K (cid:12) K (cid:12) (cid:8) K · diag( ν ) (cid:9) = K (cid:12) K (cid:12) (cid:110) K · diag( ν ) (cid:111) . Now we apply Lemma 3 to this three-way decomposition and obtain K i = K i D i P for i = 2 , K · diag( ν ) = K · diag( ν ) D P (A.9)for a permutation matrix P and three invertible diagonal matrices D i . Now note that both K i and K i have each column characterizing the conditional joint distribution of { y j } j ∈A i given the h th latent class z = h . Therefore the sum of each column of K i or K i equals one,which implies the diagonal matrix D i is an identity matrix for i = 2 or 3. Since Lemma3 ensures D D D = I k , we also obtain D = I k . So far we have obtained K i = K i P for i = 2 , K · diag( ν ) = K · diag( ν ) P . Now proceeding in the same way as the argumentafter Eq. (A.7), the K i = K i P for i = 2 , Λ j = Λ j · P for all j ∈ A ∪ A . For i = 3,consider h ∈ [ k ] and assume without loss of generality that the ( h , h )th entry of matrix P is P h ,h = 1. Then the h th column of the equality K · diag( ν ) = K · diag( ν ) P gives K c ,h · ¯ ν h = K c ,h · ν h ; summing over the index c ∈ × j ∈A [ d j ] gives ¯ ν h = ν h . Note we havegenerally established ¯ ν h = ν h whenever P h ,h = 1, which means ν (cid:62) = ν (cid:62) · P . Combiningthis with (A.9), we obtain K = K · P and Λ ( j ) = Λ ( j ) for j ∈ A as a last step. Thisestablishes the strict identifiability of all the parameters ν and Λ up to a relabeling of the36atent classes. A.3 Proof of Corollary 1
Following a similar argument as the proof of Theorem 1, we can establish that each K i = K ( { Λ ( j ) } j ∈A i ) has full column rank for i = 1 , ,
3. Since if a matrix with at least twocolumns has full column rank, it must also have Kruskal rank at least two, the assumptionsin Theorem 1 are satisfied and the conclusion of the corollary directly follows.
A.4 Proof of Theorem 2
We next introduce a useful concept in algebraic geometry, the algebraic variety , to facilitatethe proof. An algebraic variety V is defined to be the simultaneous zero-set of a finite set ofmultivariate polynomials { f i } ni =1 ⊆ R [ x , x , . . . , x d ], V = V ( f , . . . , f n ) = { x ∈ R d | f i ( x ) =0 , ≤ i ≤ n } . An algebraic variety V equals the entire space R d if all of the polynomials f i defining it are zero polynomials. Otherwise, V is called a proper subvariety of dimension lessthan d , in which case it must have Lebesgue measure zero in R d . The same argument stillholds if the above R d is replaced by a parameter space T ⊆ R d that contains an open ball offull dimension in R d . For the constrained latent class model under constraints (2) and (3),we have the following parameter space for Λ subject to the constraint of S (denoted by Λ S )and mixture proportions ν ,Ω S = (cid:110) ( Λ S , ν ) : ∀ j ∈ [ p ] , ∀ c j ∈ [ d j ] , max h : S j,h =0 λ ( j ) c j ,h = min h : S j,h =0 λ ( j ) c j ,h ∀ j ∈ [ p ] , ∀ c j ∈ [ d j ] , λ ( j ) c j ,h (cid:54) = λ ( j ) c j ,h if S j,h (cid:54) = S j,h ; k (cid:88) h =1 ν h = 1 , ν h > ∀ h ∈ [ k ] (cid:111) . On Ω S , altering some entries in S from one to zero is equivalent to imposing more equalityconstraints on Λ and forces the resulting parameters to be in Ω (cid:101) S instead of Ω S . Under thecondition that the altered submatrix (cid:101) S A i , : has distinct column vectors, Proposition 1 givesthat for any Λ ∈ Ω (cid:101) S the matrix (cid:12) j ∈A i Λ ( j ) has full column rank k for i = 1 ,
2. Note that37he statement that the (cid:81) j ∈A i d j × k matrix (cid:12) j ∈A i Λ ( j ) has full column rank k is equivalentto the following statement,(M) the maps sending the matrix (cid:12) j ∈A i Λ ( j ) to all of its C i := ( (cid:81) j ∈A i d j )! / ( k !( (cid:81) j ∈A i d j − k )!)possible k × k minors M i , M i , . . . , M iC i yields at least one nonzero minor.Here each matrix determinant M i(cid:96) is a polynomial of the parameters { Λ ( j ) } j ∈A i , so we cansimply write each of them as M i(cid:96) ( { Λ ( j ) } j ∈A i ). We now define N = (cid:91) i =1 , (cid:110) C i (cid:92) (cid:96) =1 { ( Λ , ν ) ∈ Ω S : M i(cid:96) ( { Λ ( j ) } j ∈A i ) = 0 } (cid:111) , then N ⊆ Ω S and N is an algebraic variety defined by polynomials of the model parameters.An important observation is that this V is a proper subvariety of Ω S . We next prove thisclaim. First, note that for any Λ ∈ Ω (cid:101) S , there exists some Λ (cid:63) ∈ Ω S that is arbitrarilyclose to Λ . This is because the altered parameter space Ω (cid:101) S is obtained by changing someinequality constraints in the original Ω S to equality constraints. Therefore, fixing a Λ ∈ Ω (cid:101) S and considering a corresponding Λ (cid:63) ∈ Ω S close enough to Λ , since the statement (M)ensures M i(cid:96) i ( { Λ ( j ) } j ∈A i ) (cid:54) = 0 for some (cid:96) i ∈ [ C i ], i = 1 ,
2, we also have M i(cid:96) i ( { Λ (cid:63), ( j ) } j ∈A i ) (cid:54) = 0for (cid:96) i ∈ [ C i ], i = 1 ,
2. The above argument holds because the polynomials M i(cid:96) i ( · )’s arecontinuous functions of the Λ -parameters and Λ (cid:63) ∈ Ω S and Λ are close enough. Thisindicates that the polynomials defining V contain at least one nonzero polynomial on Ω S .This proves the earlier claim that V is a proper subvariety of Ω S , so V has measure zero withrespect to the Lebesgue measure on Ω S . Then we can simply consider parameters falling inΩ S \ N and proceed in the same way as in the proof of Theorem 1 to establish identifiability.The argument above establishes generic identifiability of the parameters by Definition 2.This completes the proof of Theorem 2. A.5 Proof of Theorem 3
First, based on Theorem 2 in Dunson and Xing (2009) we obtain that when the priors for Λ S (denoting the Λ subject to the constraints imposed by the constraint matrix S ) and ν both have full support, then the induced prior on the probability tensor Π for y i also has38ull support on the ( (cid:81) pj =1 d j − (cid:15) (cid:63) > N such that for all n > N there is P (Θ ∈ N c(cid:15) (Θ ) | y , . . . , y n ) < (cid:15) (cid:63) a.s. P . Define δ = inf Θ ∈N c(cid:15) (Θ ) (cid:107) Π Θ − Π Θ (cid:107) , where Π Θ denotes the probability tensor characterizing the distribution of y i under param-eters Θ . According to identifiability, there is P Θ (cid:54) = P Θ for Θ ∈ Θ \ N (cid:15) (Θ ). Since N (cid:15) is anopen set and the entire parameter space Θ is a compact set, the complement N c(cid:15) (Θ ) is alsocompact. Therefore the previously defined δ is greater than zero. and there exists some N such that for n > N , P (Θ ∈ N c(cid:15) (Θ ) | y , . . . , y n ) ≤ P ( (cid:107) Π Θ − Π Θ (cid:107) > δ/ | y , . . . , y n ) < (cid:15) (cid:63) , where the last inequality above results from the assumption the prior has full support on thespace of the probability tensor Π . This proves the conclusion of the theorem. A.6 Proof of Proposition 2
We first prove part (a). Recall the model (7) gives the following joint distribution of theobserved y and all the latent variables, P ( y , α (1) , . . . , α ( D − , z ( D ) ) = P ( y | α (1) ) · D − (cid:89) m =1 P ( α ( m ) | α ( m +1) ) · P ( α ( D − | z ( D ) ) · P ( z ( D ) ) . In the above display, if marginalizing out all the latent variables except the α (1) in the firstlatent layer, then we obtain P ( y , α (1) ) = (cid:88) α (2) · · · (cid:88) α ( D − (cid:88) z ( D ) P ( y , α (1) , . . . , α ( D − , z ( D ) )= P ( y | α (1) ) (cid:88) α (2) · · · (cid:88) α ( D − (cid:88) z ( D ) P ( α (1) | α (2) ) · · · P ( α ( D − | z ( D ) ) P ( z ( D ) )39 P ( y | α (1) ) · P ( α (1) ) , where the P ( α (1) ) can be understood as the probability mass function for the K -dimensionalbinary vector α (1) . We introduce a notation ν (1) = (cid:16) ν (1) a ; a ∈ { , } K (cid:17) to denote the pa-rameters of this categorical distribution for α (1) . Then ν (1) lives in the (2 K − α (1) ∈ { , } K , the vector y has the following probabilitymass function for each c ∈ × pj =1 [ d j ], P ( y = c | ν (1) , L (1) , G (1) ) = (cid:88) a ∈{ , } K ν (1) a p (cid:89) j =1 P ( y j = c j | α (1) = a , L (1) , G (1) ) , (A.10)where the L (1) = { Λ ( j ) ; j ∈ [ p ] } denotes the collection of conditional probability tables of y j given α (1) . First, note that the above display gives a latent class model for y with 2 K latentclasses. We next show that the above constrained latent class model satisfies the equalityconstraint (2) under the constraint matrix S (1) constructed in (9). To prove this conclusion,consider two latent classes characterized by a (cid:54) = a (cid:63) ∈ { , } K with a = ( a , . . . , a K ) and α (cid:63) = ( a (cid:63) , . . . , a (cid:63)K ). For some variable j ∈ [ p ], if S (1) j, a = S (1) j, a (cid:63) = 0, then by construction (9)we have I ( a (cid:23) G (1) j, : ) = I ( a (cid:63) (cid:23) G (1) j, : ) = 1. This means there are a (cid:23) G (1) j, : and a (cid:63) (cid:23) G (1) j, : .Then by the definition of the graphical matrix G (1) , the binary vector G (1) j, : specifies the setof parent variables of variable y j . According to the factorization of the distribution of y ,the conditional distribution of y j given a (1) only depends on the parent variables of variable y j in the first latent layer. In other words, the conditional probability distribution vector λ ( j )1: d j , a only depends on those k ∈ [ K ] such that g (1) j,k = 1. Note that for all such k ∈ [ K ]where g (1) j,k = 1, we have a k = a (cid:63)k = 1. This implies a and a (cid:63) must have the same conditionalprobability table for this variable y j , which means λ ( j )1: d j , a = λ ( j )1: d j , a (cid:63) . This is exactly theequality constraint (2) and thus proves the conclusion of part (a).We next prove the conclusion of part (b) of the proposition. For an arbitrary m =1 , . . . , D −
2, the Bayesian network structure implies that the joint distribution of α ( m ) and α ( m +1) is P ( α ( m ) , α ( m +1) ) 40 (cid:88) α ( m +2) · · · (cid:88) α ( D − (cid:88) z ( D ) P ( α ( m ) | α ( m +1) ) P ( α ( m +1) | α ( m +2) ) · · · P ( α ( D − | z ( D ) ) P ( z ( D ) )= P ( α ( m ) | α ( m +1) ) (cid:88) α ( m +2) · · · (cid:88) α ( D − (cid:88) z ( D ) P ( α ( m +1) | α ( m +2) ) · · · P ( α ( D − | z ( D ) ) P ( z ( D ) )= P ( α ( m ) | α ( m +1) ) · P ( α ( m +1) ) , where P ( α ( m +1) ) is the probability mass function for the K m +1 -dimensional binary vector α ( m +1) . Similar to the proof of part (a), we use ν ( m +1) = (cid:16) ν ( m +1) α ; α ∈ { , } K m +1 (cid:17) todenote the parameters of this categorical distribution for α ( m +1) . Then ν ( m +1) lives in the(2 K m +1 − α ( m ) can be written as follows, P ( α ( m ) = a ( m ) | ν ( m +1) , Λ ( m +1) , G ( m +1) ) (A.11)= (cid:88) a ∈{ , } Km +1 ν ( m +1) a K m (cid:89) k =1 P ( α ( m ) k = a ( m ) k | α ( m +1) = a , L ( m +1) , G ( m +1) ) , where a ( m ) = ( a ( m )1 , . . . , a ( m ) K m ) is an arbitrary binary vector of length K m . The above distribu-tion of α ( m ) takes a similar form as that in (A.10). Therefore, we can use a similar argumentas that in the proof of part (a) of the proposition to proceed with the proof. Specifically,(A.11) shows that after marginalizing out the distribution of all the latent variables deeperthan α ( m ) other than α ( m +1) , the distribution of α ( m ) becomes a constrained latent classmodel with 2 K m +1 latent classes. Next it suffices to show such a constrained latent classmodel satisfies the constraint (2) with S ( m +1) defined in (10). Consider two latent classes a (cid:54) = a (cid:63) ∈ { , } K m +1 with a = ( a , . . . , a K m +1 ) and α (cid:63) = ( a (cid:63) , . . . , a (cid:63)K m +1 ). For some k ∈ [ K m ],if S ( m +1) k, a = S ( m +1) k, a (cid:63) = 0, then by construction (10) we have a (cid:23) G ( m +1) k, : , a (cid:63) (cid:23) G ( m +1) k, : (A.12)Then by definition of G ( m +1) , the vector G ( m +1) k, : specifies the set of parent variables ofvariable α ( m ) k in the ( m + 1)-the latent layer. According to the factorization of distribution of α ( m ) , the conditional distribution of α ( m ) k given α ( m +1) only depends on its parent variables.Equivalently, the conditional probability of P ( α ( m ) k = 1 | α ( m +1) ) only depends on those α ( m +1) k (cid:48) for which g ( m +1) k,k (cid:48) = 1. Note that for all such k (cid:48) ∈ [ K m +1 ] where g ( m +1) k,k (cid:48) = 1, we have41 k = a (cid:63)k = 1 by (A.12). This implies a and a (cid:63) must have the same conditional probabilityof P ( α ( m ) k = 1 | α ( m +1) = a ) = P ( α ( m ) k = 1 | α ( m +1) = a (cid:63) ). Therefore there also is P ( α ( m ) k = 0 | α ( m +1) = a ) = P ( α ( m ) k = 0 | α ( m +1) = a (cid:63) ). The above two equalities areexactly the equality constraint (2) and this proves the conclusion of part (b). The proof ofProposition 2 is now complete. A.7 Proof of Theorem 4
We build on the conclusions of Proposition 2 and Corollary 1 and prove the identifiabil-ity of parameters layer by layer in the “bottom-up” direction. Recall that the values of K , . . . , K D − are known. We rewrite the form of G ( m ) as in (11) in the theorem, G ( m ) = I K m I K m I K m G ( m ) ,(cid:63) . First consider the K × K submatrix of G (1) consisting of its first K rows (which is I K )and denote it by G (1)1: K , : . The G (1)1: K , : will have a corresponding K × K constraint matrix S (1)1: K , : , which consists of the first K rows of S (1) . We next prove that this S (1)1: K , : has2 K distinct column vectors. Note that each column of S (1)1: K , : is indexed by a latent classcharacterized by a K -dimensional binary pattern α (1) ∈ { , } K . So we only need to showthat if α = ( α , . . . , α K ) (cid:54) = α (cid:63) = ( α (cid:63) , . . . , α (cid:63)K ), then S (1)1: K , α (cid:54) = S (1)1: K , α (cid:63) . Without loss ofgenerality, suppose α k (cid:54) = α (cid:63)k for certain k ∈ [ K ], with α k = 1 and α (cid:63)k = 0. Since the k th rowof G (1)1: K , : is the standard basis vector e k which equals one in the k th coordinate and zero inother coordinates, according to Proposition 2 we have S (1) k, α = 1 − I ( α (cid:23) e k ) = 1 − S (1) k, α (cid:63) = 1 − I ( α (cid:63) (cid:23) e k ) = 1 − . So S (1) k, α (cid:54) = S (1) k, α (cid:63) , and hence S (1)1: K , α (cid:54) = S (1)1: K , α (cid:63) . Since α and α (cid:63) are arbitrary, we have shownthat S (1)1: K , : has 2 K distinct column vectors. A similar argument applies to S (1)( K +1):(2 K ) , : S (1)(2 K +1):(3 K ) , : , since the second K rows and the third K rows of G (1) also both forman identity matrix I m by (11). This means the constraint matrix S (1) vertically stacks threesubmatrices S (1)1: K , : , S (1)( K +1):(2 K ) , : , S (1)(2 K +1):(3 K ) , : , each having distinct column vectors. ByCorollary 1, this guarantees the identifiability of the following quantities: the S (1) , the pa-rameters of each conditional distribution of y j | α (1)pa( j ) (denoted by L (1) ), and the parametersof the categorical distribution of α (1) (denoted by ν (1) ). Here ν (1) = { ν α ; α ∈ { , } K } is a 2 K -dimensional vector with non-negative entries summing to one. We claim that theidentifiability of S (1) also implies the identifiability of G (1) here. As shown earlier in theprevious paragraph, the existence of an identity submatrix I K in G (1) actually guaranteesthat the resulting S (1) contains 2 K distinct column vectors. This means each column of G (1) can indeed be read off from each column of S (1) . This proves the claim of the identifiabilityof G (1) .Now recall that part (b) of Proposition 2 states the following: the distribution of α (1) can be considered as a constrained latent class model with 2 K latent classes, which arecharacterized by configurations of latent variable α (2) . Since the distribution of α (1) is al-ready identified with parameters ν (1) identified, we can proceed as if α (1) are the observedvariables with a known probability tensor. This means we can further examine the struc-ture of the current constraint matrix S (2) to establish identifiability of parameters for thissecond constrained latent class model. By (11), G (2) also contains three copies of identitysubmatrices I K , so the same argument as in the last paragraph gives that S (2) verticallystacks three submatrices S (2)1: K , : , S (2)( K +1):(2 K ) , : , S (2)(2 K +1):(3 K ) , : , each having distinct columnvectors. 
Invoking Corollary 1 again gives the identifiability of the G (2) , the parameters ofeach conditional distribution of α (1) k | α (2)pa( α (1) k ) (denoted by L (2) ), and the parameters of thecategorical distribution of α (2) (denoted by ν (2) ). Then in a similar and recursive fashion,we can establish identifiability for G ( m ) , L ( m ) , and ν ( m ) for each m = 2 , . . . , D −
1. Thiscompletes the proof of Theorem 4.
References
Allman, E. S., Matias, C., and Rhodes, J. A. (2009). Identifiability of parameters in latentstructure models with many observed variables.
The Annals of Statistics , 37(6A):3099–3132. 43nandkumar, A., Hsu, D., Javanmard, A., and Kakade, S. (2013). Learning linear Bayesiannetworks with latent variables. In
International Conference on Machine Learning , pages249–257. PMLR.Angelino, E., Larus-Stone, N., Alabi, D., Seltzer, M., and Rudin, C. (2017). Learning certi-fiably optimal rule lists for categorical data.
The Journal of Machine Learning Research ,18(1):8753–8830.Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent Dirichlet allocation.
Journal ofMachine Learning Research , 3(Jan):993–1022.Chen, Y., Culpepper, S., and Liang, F. (2020a). A sparse latent class model for cognitivediagnosis.
Psychometrika , 85:121–153.Chen, Y., Li, X., and Zhang, S. (2020b). Structured latent factor analysis for large-scale data:Identifiability, estimability, and their implications.
Journal of the American StatisticalAssociation , 115(432):1756–1770.Chen, Y., Liu, J., Xu, G., and Ying, Z. (2015). Statistical analysis of Q -matrix based diagnos-tic classification models. Journal of the American Statistical Association , 110(510):850–866.Choi, M. J., Tan, V. Y., Anandkumar, A., and Willsky, A. S. (2011). Learning latent treegraphical models.
Journal of Machine Learning Research , 12:1771–1812.Collins, L. M. and Lanza, S. T. (2009).
Latent class and latent transition analysis: Withapplications in the social, behavioral, and health sciences , volume 718. John Wiley & Sons.Dobra, A. and Lenkoski, A. (2011). Copula Gaussian graphical models and their applicationto modeling functional disability data.
The Annals of Applied Statistics , 5(2A):969–993.Doshi-Velez, F. and Kim, B. (2017). Towards a rigorous science of interpretable machinelearning. arXiv preprint arXiv:1702.08608 .Dunson, D. B. and Xing, C. (2009). Nonparametric Bayes modeling of multivariate categor-ical data.
Journal of the American Statistical Association , 104(487):1042–1051.Erosheva, E., Fienberg, S., and Lafferty, J. (2004). Mixed-membership models of scientificpublications.
Proceedings of the National Academy of Sciences , 101(suppl 1):5220–5227.Eysenck, S. B., Barrett, P. T., and Saklofske, D. H. (2020). The junior Eysenck personalityquestionnaire.
Personality and Individual Differences , page 109974.44ang, G., Liu, J., and Ying, Z. (2019). On the identifiability of diagnostic classificationmodels.
Psychometrika , 84(1):19–40.Gelman, A., Jakulin, A., Pittau, M. G., and Su, Y.-S. (2008). A weakly informative defaultprior distribution for logistic and other regression models.
The Annals of Applied Statistics ,2(4):1360–1383.Goodman, L. A. (1974). Exploratory latent structure analysis using both identifiable andunidentifiable models.
Biometrika , 61(2):215–231.Gu, Y. and Xu, G. (2019). Learning attribute patterns in high-dimensional structured latentattribute models.
Journal of Machine Learning Research , 20(115):1–58.Gu, Y. and Xu, G. (2020). Partial identifiability of restricted latent class models.
Annals ofStatistics , 48(4):2082–2107.Gyllenberg, M., Koski, T., Reilink, E., and Verlaan, M. (1994). Non-uniqueness in proba-bilistic numerical identification of bacteria.
Journal of Applied Probability , 31:542–548.Haertel, E. H. (1989). Using restricted latent class models to map the skill structure ofachievement items.
Journal of Educational Measurement , 26(4):301–321.Henson, R. A., Templin, J. L., and Willse, J. T. (2009). Defining a family of cognitivediagnosis models using log-linear models with latent variables.
Psychometrika , 74(2):191.Hinton, G. E., Osindero, S., and Teh, Y.-W. (2006). A fast learning algorithm for deep beliefnets.
Neural Computation , 18(7):1527–1554.Holmes, C. C. and Held, L. (2006). Bayesian auxiliary variable models for binary andmultinomial regression.
Bayesian Analysis , 1(1):145–168.Kolda, T. G. and Bader, B. W. (2009). Tensor decompositions and applications.
SIAMReview , 51(3):455–500.Kruskal, J. B. (1977). Three-way arrays: rank and uniqueness of trilinear decompositions,with application to arithmetic complexity and statistics.
Linear Algebra and its Applica-tions , 18(2):95–138.Lazarsfeld, P. F. (1950). The logical and mathematical foundation of latent structure anal-ysis.
Studies in Social Psychology in World War II Vol. IV: Measurement and Prediction ,pages 362–412. 45ee, H., Ekanadham, C., and Ng, A. (2007). Sparse deep belief net model for visual area v2.
Advances in Neural Information Processing Systems , 20:873–880.Lee, H., Grosse, R., Ranganath, R., and Ng, A. Y. (2009). Convolutional deep belief networksfor scalable unsupervised learning of hierarchical representations. In
Proceedings of the 26thannual International Conference on Machine Learning , pages 609–616.Legramanti, S., Durante, D., and Dunson, D. B. (2020). Bayesian cumulative shrinkage forinfinite factorizations.
Biometrika , 107(3):745–752.Leung, D., Drton, M., and Hara, H. (2016). Identifiability of directed Gaussian graphicalmodels with one latent source.
Electronic Journal of Statistics , 10(1):394–422.Li, J. and Wong, L. (2003). Using rules to analyse bio-medical data: a comparison betweenc4. 5 and pcl. In
International Conference on Web-Age Information Management , pages254–265. Springer.McLachlan, G. J. and Basford, K. E. (1988).
Mixture models: Inference and applications toclustering , volume 38. M. Dekker New York.Miettinen, P. and Vreeken, J. (2014). Mdl4bmf: Minimum description length for boolean ma-trix factorization.
ACM transactions on knowledge discovery from data (TKDD) , 8(4):1–31.Mourad, R., Sinoquet, C., Zhang, N. L., Liu, T., and Leray, P. (2013). A survey on latenttree models and applications.
Journal of Artificial Intelligence Research , 47:157–203.Murdoch, W. J., Singh, C., Kumbier, K., Abbasi-Asl, R., and Yu, B. (2019). Interpretablemachine learning: definitions, methods, and applications.
Proceedings of the NationalAcademy of Sciences , 116(44):22071–22080.Nguyen, N. G., Tran, V. A., Ngo, D. L., Phan, D., Lumbanraja, F. R., Faisal, M. R., Abapihi,B., Kubo, M., Satou, K., et al. (2016). DNA sequence classification by convolutional neuralnetwork.
Journal of Biomedical Science and Engineering , 9(05):280.Ovaskainen, O. and Abrego, N. (2020).
Joint Species Distribution Modelling: With Applica-tions in R . Ecology, Biodiversity and Conservation. Cambridge University Press.Ovaskainen, O., Abrego, N., Halme, P., and Dunson, D. B. (2016). Using latent variablemodels to identify large networks of species-to-species associations at different spatialscales.
Methods in Ecology and Evolution , 7(5):549–555.46earl, J. (2014).
Probabilistic reasoning in intelligent systems: networks of plausible infer-ence . Elsevier.Pokholok, D. K., Harbison, C. T., Levine, S., Cole, M., Hannett, N. M., Lee, T. I., Bell,G. W., Walker, K., Rolfe, P. A., Herbolsheimer, E., et al. (2005). Genome-wide map ofnucleosome acetylation and methylation in yeast.
Cell , 122(4):517–527.Polson, N. G., Scott, J. G., and Windle, J. (2013). Bayesian inference for logistic modelsusing P´olya–Gamma latent variables.
Journal of the American Statistical Association ,108(504):1339–1349.Poon, H. and Domingos, P. (2011). Sum-product networks: A new deep architecture. In ,pages 689–690. IEEE.Rousseau, J. and Mengersen, K. (2011). Asymptotic behaviour of the posterior distributionin overfitted mixture models.
Journal of the Royal Statistical Society: Series B (StatisticalMethodology) , 73(5):689–710.Rudin, C. (2019). Stop explaining black box machine learning models for high stakes decisionsand use interpretable models instead.
Nature Machine Intelligence , 1(5):206–215.Rupp, A. A. and Templin, J. L. (2008). Unique characteristics of diagnostic classificationmodels: A comprehensive review of the current state-of-the-art.
Measurement , 6(4):219–262.Skinner, C. (2019). Analysis of categorical data for complex surveys.
International StatisticalReview , 87:S64–S78.Tatsuoka, K. K. (1983). Rule space: An approach for dealing with misconceptions based onitem response theory.
Journal of Educational Measurement , pages 345–354.von Davier, M. and Lee, Y.-S. (2019).
Handbook of diagnostic classification models . Springer.Xu, G. (2017). Identifiability of restricted latent class models with binary responses.
TheAnnals of Statistics , 45(2):675–707.Zhao, H., Melibari, M., and Poupart, P. (2015). On the relationship between sum-productnetworks and Bayesian networks. In
International Conference on Machine Learning , pages116–124. 47hou, J., Bhattacharya, A., Herring, A. H., and Dunson, D. B. (2015). Bayesian factoriza-tions of big sparse tensors.
Journal of the American Statistical Association , 110(512):1562–1576.Zwiernik, P. (2018). Latent tree models. In
Handbook of Graphical Models , pages 283–306.CRC Press. 48 upplement to “Identifying Interpretable
Latent Structures from Discrete Data”
In this Supplementary Material, we present additional technical proofs in Section S.1,provide the posterior computation details in Section S.2, and present the analysis of anotherdataset of nucleotide sequences in Section S.3.
S.1 Additional Technical Proofs
S.1.1 Proof of Proposition 3
Rewrite the distribution of each y j given α (1) under (15) as λ ( j ) c, α = P ( y j = c | α (1) = α )= exp (cid:16) β j,c, + (cid:80) K k =1 β j,c,k g (1) j,k α k (cid:17)(cid:80) d j m =1 exp (cid:16) β j, ,m + (cid:80) K k =1 β j,k,m g (1) j,k α k (cid:17) , c ∈ [ d j ] , α ∈ { , } K , where β j, ,d j = β j,k,d j = 0 for all k ∈ [ K ]. For each variable y j , consider a d j × K conditional probability matrix Λ j = ( λ ( j ) c, α ) whose rows are indexed by the d j categoriesand columns by the 2 K possible binary latent pattern configurations. Under the condition G (1)1: K , : = G (1)( K +1):2 K , : = G (1)(2 K +1):3 K , : = I K , consider the two Khatri-Rao product matrices K = (cid:12) K j =1 Λ ( j ) , K = (cid:12) K j = K +1 Λ ( j ) , K = (cid:12) K j =2 K +1 Λ ( j ) . We claim that each of K , K , and K has full column rank 2 K . We next focus on provingthis claim for K without loss of generality. Note K is of size (cid:81) K j =1 d j × K , where each ofits row is indexed by a K -dimensional response pattern y K ∈ × K j =1 [ d j ]. There is d j ≥ j , so to show K has full column rank, it suffices to find a 2 K × K submatrix ofit that is invertible. To this end, we next specifically consider the rows of K indexed byresponse pattern y = ( y , . . . , y K ) where each y j ∈ { , d j } . These 2 K response patternsnaturally give rise to a 2 K × K submatrix of K , which we denote by K , sub . Now note49hat for each j ∈ [ K ] there is g (1) j, : = e j , the j th standard basis vector, therefore when α varies all the λ ( j )1 , α only takes two possible values: λ ( j )1 , α = exp ( β j, , ) (cid:80) d j m =1 exp ( β j, ,m ) , if α (cid:15) e j ;exp ( β j, , + β j,j, ) (cid:80) d j m =1 exp ( β j, ,m + β j,j,m ) , if α (cid:23) e j . (S.1)Similarly there are λ ( j ) d j , α = (cid:80) d j m =1 exp ( β j, ,m ) , if α (cid:15) e j ;1 (cid:80) d j m =1 exp ( β j, ,m + β j,j,m ) , if α (cid:23) e j . (S.2)Now an important observation is that the conditional probability tensor K , sub can be writtenas a Kronecker product of K matrices each of size 2 × K , sub = K (cid:79) j =1 exp ( β j, , ) (cid:80) d j m =1 exp ( β j, ,m ) exp ( β j, , + β j,j, ) (cid:80) d j m =1 exp ( β j, ,m + β j,j,m )1 (cid:80) d j m =1 exp ( β j, ,m ) 1 (cid:80) d j m =1 exp ( β j, ,m + β j,j,m ) =: K (cid:79) j =1 B ( j ) . (S.3)Under the assumptions β j,k, (cid:54) = 0 for any g (1) j,k = 1 stated in the theorem, we have β j,j, (cid:54) = 0for all j ∈ [ K ]. Therefore, the 2 × B ( j ) in (S.3) has full rank. According tothe property of the Kronecker product, the K , sub then must have full rank 2 K . The sameconclusion also holds for K , K and we have proved the earlier claim that each of K , K ,and K has full column rank. Note that after marginalizing out the deep latent variable z ,the model for y is simply a constrained latent class model with 2 K latent classes indexedby α ∈ { , } K . Now invoking Corollary 1 gives the identifiability of Λ (1) , G (1) , and ν (1) .This proves part (a) of the proposition.We next prove part (b) of the proposition.The conclusion of part (a) guarantees theidentifiability of ν (1) := { P ( α (1) = α ); α ∈ { , } K } , so now the α (1) can be viewed asif they are observed variables with known probability mass function ν (1) . Then it sufficesto note that the distribution of α (1) is simply an unconstrained latent class model with H K ≥ (cid:100) log B (cid:101) + 1, Corollary 5 in Allman et al. (2009) directly givesthe generic identifiability conclusion for τ and η . 
The proof of the proposition is complete. S.1.2 Proof of Proposition 4
Denote the three sets of row indices forming G (1) , , G (1) , , and G (1) , by A , A , and A ;that is G (1) ,i = A i for each i = 1 , ,
3. We claim that for each A i with i = 1 , ,
3, the inducedKhatri-Rao product matrix K i = (cid:12) j ∈A i Λ ( j ) has full column rank for generic parametersunder the equality constraint (2). If this claim is true, then we can invoke Corollary 1 toestablish the desired generic identifiability result. So we next focus on proving this claim.The statement in the corollary says that each G (1) ,i takes the following form, G (1) ,i = ∗ · · · ∗∗ · · · ∗ ... ... . . . ... ∗ ∗ · · · . (S.4)We next consider a 2 K × K submatrix of the Khatri-Rao product matrix K A i = (cid:12) j ∈A Λ ( j ) .Similarly as in the proof of Proposition 3, we consider the rows of K A i indexed by responsepatterns y A i ∈ × j ∈A i { , d j } . First, in a special case where all the off-diagonal entries of G (1) ,i in (S.4) are equal to zero, there is G (1) ,i = I K and Eq. (S.3) in the proof of Proposition 3states that the matrix K A i ( β ) := ⊗ j ∈A i B ( j ) ( β ) with B ( j ) ( β ) = exp ( β j, , ) (cid:80) d j m =1 exp ( β j, ,m ) exp ( β j, , + β j,j, ) (cid:80) d j m =1 exp ( β j, ,m + β j,j,m )1 (cid:80) d j m =1 exp ( β j, ,m ) 1 (cid:80) d j m =1 exp ( β j, ,m + β j,j,m ) (S.5)has full rank 2 K when the β -parameters vary in the I K -constrained parameter spaceΩ( β ; G (1) ,i = I K ) := { β K , K , d j − ; β j,j,c (cid:54) = 0 and β j,c,k = 0 if k (cid:54) = j } . Now an important observation is that for an arbitrary G (1) ,i =: G (cid:63) taking the form of (S.4),its corresponding constrained parameter space Ω( β ; G (1) ,i = G (cid:63) ) must contain some β that51s arbitrarily close to some β (cid:63) belonging to Ω( β ; G (1) ,i = I K ). Mathematically, this means:(C) For any small positive number (cid:15) > β ∈ Ω( β ; G (1) ,i = I K ), there must besome β (cid:63) ∈ Ω( β ; G (1) ,i = G (cid:63) ) such that (cid:107) β − β (cid:63) (cid:107) < (cid:15) .The reasoning behind this claim is as follows. Recall the definition of the parameter spaceΩ( β ; G (1) ) for a general graphical matrix G (1) given in (16) in the main text,Ω( β ; G (1) ) = { β p, K , d j − ; β j,c,k (cid:54) = 0 if g (1) j,k = 1; and β j,c,k = 0 if g (1) j,k = 0 } . By this definition, considering the structure of G (cid:63) with diagonal elements being one andsome off-diagonal elements potentially also being one, the space Ω( β ; G (1) ,i = G (cid:63) ) can beobtained from Ω( β ; G (1) ,i = I K ) by performing the following operation:(P) If g (1) ,(cid:63)j,k = 1 in G (cid:63) , then convert the equality constraint β j,c,k = 0 in Ω( β ; G (1) ,i = I K )into the equality constraint β j,c,k (cid:54) = 0.Based on the above (P) that Ω( β ; G (1) ,i = G (cid:63) ) is obtained through changing certain equalityconstraints in Ω( β ; G (1) ,i = I K ) to inequality constraints, it is not hard to see that theprevious claim (C) is true. Recall for any β ∈ Ω( β ; G (1) ,i = G (cid:63) ), the K A i defined earlierhas full rank, that is, det (cid:0) K A i ( β ) (cid:1) (cid:54) = 0, where det( M ) denotes the determinant of matrix M . Since the det (cid:0) K A i ( β ) (cid:1) can be viewed as a continuous function of parameters β , weobtain that there exists some β (cid:63) ∈ Ω( β ; G (1) ,i = G (cid:63) ) close enough to β (see claim (C))such that det (cid:0) K A i ( β (cid:63) ) (cid:1) (cid:54) = 0. 
According to the multinomial logistic model specified in (15),the det (cid:0) K A i ( β ) (cid:1) is actually an analytic function of β , and det (cid:0) K A i ( β (cid:63) ) (cid:1) (cid:54) = 0 for some β (cid:63) ∈ Ω( β ; G (1) ,i = G (cid:63) ) implies that the following subset { β (cid:63) ∈ Ω( β ; G (1) ,i = G (cid:63) ) : det (cid:0) K A i ( β ) (cid:1) = 0 } has measure zero with respect to the Lebesgue measure on Ω( β ; G (1) ,i = G (cid:63) ). Thus wehave proved that for generic parameters, the (cid:12) j ∈A i Λ ( j ) has full column rank 2 K . Havingthree full rank structures (cid:12) j ∈A i Λ ( j ) for i = 1 , ,
3, we can invoke the Kruskal’s theorem inLemma 3 to conclude that the parameters G (1) , β , and ν (1) := { P ( α (1) = α ); α ∈ { , } K } are generically identifiable. Then combined with the condition that K ≥ (cid:100) log B (cid:101) + 1, as52hown in the last part of the proof of Proposition 3 we also have the generic identifiabilityof η and τ . This concludes the proof of Proposition 4. Proof of Lemma 1.
We prove the equality (A.1) by showing the corresponding elements ofthe two matrices on the left and right hand sides (LHS and RHS) are equal. Suppose each A j = ( a jm,(cid:96) ) has dimension x j × y j and each B j = ( b jm,(cid:96) ) has dimension y j × k . Then bothLHS and RHS have dimension (cid:81) pj =1 x j × k . For arbitrary c = ( c , . . . , c p ) (cid:62) with each c j ∈ [ x j ]and arbitrary h ∈ [ k ], there isLHS c ,h = p (cid:89) j =1 { A j B j } c j ,h = p (cid:89) j =1 (cid:32) y j (cid:88) m =1 a jc j ,m b jm,h (cid:33) = y (cid:88) m =1 · · · y p (cid:88) m p =1 p (cid:89) j =1 a jc j ,m j b jm j ,h ;RHS c ,h = (cid:88) m =( m ,...,mp ) ,mj ∈ [ yj ] (cid:8) ⊗ pj =1 A j (cid:9) c , m × (cid:8) (cid:12) pj =1 B j (cid:9) m ,h = y (cid:88) m =1 · · · y p (cid:88) m p =1 p (cid:89) j =1 a jc j ,m j b jm j ,h , therefore LHS c ,h = RHS c ,h for any c and h . This establishes the equality in the lemma andcompletes the proof. Proof of Lemma 2.
We can write out the explicit form of K ( { Λ ( j ) − δ ( j ) · (cid:62) k } j ∈ [ p ] ) as follows, (cid:12) pj =1 λ ( j )1 , − δ ( j )1 λ ( j )2 , − δ ( j )1 · · · λ ( j ) k, − δ ( j )1 ... ... ... ... λ ( j )1 ,d j − − δ ( j ) d j − λ ( j )2 ,d j − − δ ( j ) d j − · · · λ ( j ) k,d j − − δ ( j ) d j − · · · = (cid:12) pj =1 · · · − δ ( j )1 · · · − δ ( j )2 ... ... . . . ... ...0 0 · · · − δ ( j ) d j − · · · · λ ( j )1 , λ ( j )2 , · · · λ ( j ) k, ... ... ... ... λ ( j )1 ,d j − λ ( j )2 ,d j − · · · λ ( j ) k,d j − · · · p (cid:79) j =1 · · · − δ ( j )1 · · · − δ ( j )2 ... ... . . . ... ...0 0 · · · − δ ( j ) d j − · · · · (cid:12) pj =1 λ ( j )1 , λ ( j )2 , · · · λ ( j ) k, ... ... ... ... λ ( j )1 ,d j − λ ( j )2 ,d j − · · · λ ( j ) k,d j − · · · = : (cid:40) p (cid:79) j =1 ∆ ( j ) (cid:41) · K ( { Λ ( j ) } j ∈ [ p ] ) , where the second equality above also results from Lemma 1. Because each ∆ ( j ) in the abovedisplay is an upper-triangular matrix with all the diagonal entries equal to one, the Kroneckerproduct of them is an invertible matrix of size (cid:81) pj =1 d j × (cid:81) pj =1 d j . Therefore we can take B ( { δ ( j ) } j ∈ [ p ] ) = (cid:78) pj =1 ∆ ( j ) and this completes the proof of the lemma. S.2 Posterior Computation Details for the Two-layerSG-LCA
S.2.1 Gibbs Sampler for a Fixed K = K (Number of Binary La-tent Traits in the Middle Layer) The posterior distribution is p ( β , G (1) , τ , η , A , Z | Y ) (S.6) ∝ n (cid:89) i =1 p (cid:89) j =1 d (cid:89) c =1 K (cid:89) k =1 (cid:104) exp (cid:16) β jc + (cid:80) K k =1 β jck g j,k a i,k (cid:17)(cid:105) I ( y ij = c ) (cid:80) dl =1 exp (cid:16) β jl + (cid:80) K k =1 β jlk g j,k a i,k (cid:17) × n (cid:89) i =1 B (cid:89) b =1 (cid:34) τ b K (cid:89) k =1 η a ik kb (1 − η kb ) − a ik (cid:35) z ib × d − (cid:89) c =1 p (cid:89) j =1 exp (cid:8) − / σ − c ( β jc − µ c ) (cid:9) × K (cid:89) k =1 p (cid:89) j =1 d − (cid:89) c =1 (cid:2) σ − ck exp (cid:0) − / σ − ck β jck (cid:1)(cid:3) g j,k (cid:2) v − exp (cid:0) − / v − β jck (cid:1)(cid:3) − g j,k × p (cid:89) j =1 K (cid:89) k =1 γ g j,k (1 − γ ) − g j,k × K (cid:89) k =1 d − (cid:89) c =1 σ − a σ − ck exp (cid:0) − b σ σ − ck (cid:1) n (cid:89) i =1 p (cid:89) j =1 d − (cid:89) c =1 PG( w ijc | , × Dirichlet( τ ; α τ ) × B (cid:89) b =1 K (cid:89) k =1 Beta( η kb ; α η ) × Beta( γ ; α γ ) . Here the PG( · | ,
0) denotes the density of the Polya-Gamma distribution (Polson et al.,2013). We simply take α τ = B and α η = α γ = (1 , (cid:62) . (1) Conditional distribution for β . We use the auxiliary variable approach in Holmesand Held (2006) for multinomial regression to derive the conditional distribution of each β jck . Recall the posterior p ( β , W | y n ) in Section 4.For an arbitrary given graphical vector g j, : = ( g j, , . . . , g j,K ), define K j = { k ∈ [ K ] : g j,k = 1 } , the set of parent latent variables of variable y j . Then denote β jc, eff = ( β jck ; k ∈K j ) (cid:62) , (cid:101) β jc, eff = ( β jc , β (cid:62) jc, eff ) (cid:62) ; also denote a i, eff = ( a i,k ; k ∈ K j ) (cid:62) , (cid:101) a i, eff = (1 , a (cid:62) i, eff ) (cid:62) . Thenwe can write φ ijl = (cid:101) a (cid:62) i (cid:101) β jc, : − C ij ( l ) . By the property of the Polya-Gamma augmentation(Polson et al., 2013), the conditional distribution of (cid:101) β jc, : is p ( (cid:101) β jc, eff | Y , W )= p ( (cid:101) β jc, eff ) n (cid:89) i =1 L ( y i | β ) ∝ p ( (cid:101) β jc, eff ) n (cid:89) i =1 ξ y ijc ijc (1 − ξ ijc ) − y ijc = p ( (cid:101) β jc, eff ) n (cid:89) i =1 [exp( φ ijc )] y ijc φ ijc ) ∝ p ( (cid:101) β jc, eff ) n (cid:89) i =1 exp { ( y ijc − / φ ijc } exp (cid:8) − / w ijc φ ijc (cid:9) ∝ p ( (cid:101) β jc, eff ) n (cid:89) i =1 exp (cid:40) − w ijc (cid:18) φ ijc − y ijc − / w ijc (cid:19) (cid:41) (use φ ijc = (cid:101) a (cid:62) i (cid:101) β jc, eff − C ij ( c ) and define x ijc = y ijc − / w ijc + C ij ( c ) ) ∝ p ( (cid:101) β jc, eff ) n (cid:89) i =1 exp (cid:26) − w ijc (cid:16)(cid:101) a (cid:62) i, eff (cid:101) β jc, eff − x ijc (cid:17) (cid:27) ∝ p ( (cid:101) β jc, eff ) exp (cid:26) − (cid:16) x : ,jc − (cid:101) A : , eff (cid:101) β jc, eff (cid:17) (cid:62) diag( W : ,jc ) (cid:16) x : ,jc − (cid:101) A : , eff (cid:101) β jc, eff (cid:17)(cid:27) ∝ p ( (cid:101) β jc, eff ) exp (cid:26) − (cid:16)(cid:101) β jc, eff − (cid:101) A − : , eff x : ,jc (cid:17) (cid:62) (cid:101) A (cid:62) : , eff diag( W : ,jc ) (cid:101) A : , eff (cid:16)(cid:101) β jc, eff − (cid:101) A − : , eff x : ,jc (cid:17)(cid:27) , where (cid:101) A : , eff is a n × ( |K j | +1) matrix with the i th row being (cid:101) a i, eff . Define µ c = ( µ c , , . . . , (cid:62) ,55 vector of length K + 1, and Σ jc = diag ( σ c , { σ ck ; k ∈ K j } ), a diagonal matrix of size( |K j | + 1) × ( |K j | + 1). Considering the prior for (cid:101) β jc, eff , we have p ( (cid:101) β jc, eff | − ) ∝ exp (cid:26) − (cid:16)(cid:101) β jc, eff − µ c (cid:17) (cid:62) Σ − jc (cid:16)(cid:101) β jc, eff − µ c (cid:17)(cid:27) × exp (cid:26) − (cid:16)(cid:101) β jc, eff − (cid:101) A − : , eff x : ,jc (cid:17) (cid:62) (cid:101) A (cid:62) : , eff diag( W : ,jc ) (cid:101) A (cid:16)(cid:101) β jc, eff − (cid:101) A − : , eff x : ,jc (cid:17)(cid:27) , so ( (cid:101) β jc, eff | − ) ∼ N ( µ jc , Σ jc ) where µ jc = Σ jc (cid:16) (cid:101) A (cid:62) diag( W : ,jc ) x : ,jc + Σ − jc µ c (cid:17) , Σ jc = (cid:16) (cid:101) A (cid:62) diag( W : ,jc ) (cid:101) A + Σ − jc (cid:17) − . (S.7)As for those k / ∈ K j with g j,k = 0, we sample β jkc from the pseudo prior N (0 , v ). Insummary, for any j ∈ [ p ], we sample β j, : , : = ( (cid:101) β jc, eff , β jc, ine ) as follows (cid:101) β jc, eff ∼ N ( µ jc , Σ jc ); β jc, ine ∼ N (0 , v I p −|K j | ) . (2) Conditional distribution for Polya-Gamma random variables w ijc . 
(2) Conditional distribution for the Polya-Gamma random variables $w_{ijc}$. We have
$$ p(w_{ijc} \mid -) \propto \mathrm{PG}\bigg(w_{ijc}\; \Big|\; 1,\; \beta_{jc0} + \sum_{k=1}^K \beta_{jck}\, g_{j,k}\, a_{i,k} - C_{ij}(c)\bigg). $$

(3) Conditional distribution for $g_{j,k}$. We have
$$ P(g_{j,k} = 1 \mid -) = \frac{\gamma}{\gamma + (1-\gamma)\, O_{jk}}, \qquad O_{jk} = \prod_{c=1}^{d-1} \frac{v_0^{-1/2}\exp\big(-\tfrac{1}{2}v_0^{-1}\beta_{jck}^2\big)}{\sigma_{ck}^{-1}\exp\big(-\tfrac{1}{2}\sigma_{ck}^{-2}\beta_{jck}^2\big)} \prod_{i=1}^n \prod_{c=1}^{d} \frac{p(y_{ij}=c \mid -,\, g_{j,k}=0)}{p(y_{ij}=c \mid -,\, g_{j,k}=1)}. $$
Also, $\gamma \sim \mathrm{Beta}\big(1 + \sum_{j=1}^p\sum_{k=1}^K g_{j,k},\; 1 + pK - \sum_{j=1}^p\sum_{k=1}^K g_{j,k}\big)$.

(4) Conditional distribution for $\sigma_{ck}^2$. We have
$$ p(\sigma_{ck}^2 \mid -) \propto \prod_{j=1}^p \big[\sigma_{ck}^{-1}\exp\big(-\tfrac{1}{2}\sigma_{ck}^{-2}\beta_{jck}^2\big)\big]^{g_{j,k}} \times \sigma_{ck}^{-2(a_\sigma+1)}\exp\big(-b_\sigma\sigma_{ck}^{-2}\big) \propto \sigma_{ck}^{-2\left(\frac{1}{2}\sum_{j=1}^p g_{j,k} + a_\sigma + 1\right)} \exp\bigg\{-\sigma_{ck}^{-2}\bigg(\frac{1}{2}\sum_{j=1}^p g_{j,k}\beta_{jck}^2 + b_\sigma\bigg)\bigg\}, $$
so $\sigma_{ck}^2 \mid - \sim \mathrm{InvGamma}\big(a_\sigma + \tfrac{1}{2}\sum_{j=1}^p g_{j,k},\; b_\sigma + \tfrac{1}{2}\sum_{j=1}^p g_{j,k}\beta_{jck}^2\big)$.

(5) Conditional distributions for the parameters of the deeper latent layer. For $\boldsymbol\tau$ and $\boldsymbol\eta$ underlying the deeper latent layer,
$$ (\tau_1, \ldots, \tau_B) \sim \mathrm{Dirichlet}\bigg(\alpha_{\tau,1} + \sum_{i=1}^n z_{i1},\; \ldots,\; \alpha_{\tau,B} + \sum_{i=1}^n z_{iB}\bigg); \qquad \eta_{kb} \sim \mathrm{Beta}\bigg(\alpha_{\eta,1} + \sum_{i=1}^n a_{ik} z_{ib},\; \alpha_{\eta,2} + \sum_{i=1}^n (1-a_{ik}) z_{ib}\bigg). $$

(6) Conditional distributions for the subject-specific local parameters $\mathbf{A}$ and $\mathbf{Z}$. We have
$$ P(a_{ik} = 1 \mid -) = \frac{\prod_b \eta_{kb}^{z_{ib}} \prod_j \prod_c p(y_{ij}=c \mid -,\, a_{ik}=1)}{\prod_b \eta_{kb}^{z_{ib}} \prod_j \prod_c p(y_{ij}=c \mid -,\, a_{ik}=1) + \prod_b (1-\eta_{kb})^{z_{ib}} \prod_j \prod_c p(y_{ij}=c \mid -,\, a_{ik}=0)}; $$
$$ (z_{i1}, \ldots, z_{iB}) \sim \mathrm{Categorical}\bigg(\tau_1 \prod_{k=1}^K \eta_{k1}^{a_{ik}}(1-\eta_{k1})^{1-a_{ik}},\; \ldots,\; \tau_B \prod_{k=1}^K \eta_{kB}^{a_{ik}}(1-\eta_{kB})^{1-a_{ik}}\bigg), $$
with the category probabilities normalized to sum to one.

Based on the full conditionals in (1)-(6) above, we obtain a Gibbs sampler that cycles through these steps. All the conditional distributions are standard distributions that are easy to sample from.

S.2.2 Gibbs Sampler under a Cumulative Shrinkage Process Prior when $K$ is Unknown

Introduce the weights $w_\ell := v_\ell \prod_{m=1}^{\ell-1}(1-v_m)$ and latent indicators $z_k$ with $P(z_k = \ell \mid w_\ell) = w_\ell$ for $\ell \in [K]$. The joint conditional posterior of $\{\sigma_{ck}\}$, $\{v_k\}$, and $\{z_k\}$ is
\begin{align*}
p(\{\sigma_{ck}\}, \{z_k\}, \{v_k\} \mid -) \propto\;& \prod_{k=1}^K \prod_{j=1}^p \bigg[\prod_{c=1}^{d-1} \sigma_{ck}^{-1}\exp\big(-\tfrac{1}{2}\sigma_{ck}^{-2}\beta_{jck}^2\big)\bigg]^{g_{j,k}} \\
&\times \prod_{k=1}^K \bigg[\prod_{c=1}^{d-1} (\sigma_{ck}^2)^{-(a_\sigma+1)} \exp\big(-b_\sigma\sigma_{ck}^{-2}\big)\bigg]^{I(z_k > k)} \bigg[\prod_{c=1}^{d-1} I(\sigma_{ck}^2 = \theta_\infty)\bigg]^{I(z_k \le k)} \\
&\times \prod_{\ell=1}^K \bigg[v_\ell \prod_{m=1}^{\ell-1}(1-v_m)\bigg]^{\sum_{k=1}^K I(z_k = \ell)}.
\end{align*}
Based on the above display, we can derive the full conditional distribution of each quantity as follows, where $\beta_{\text{eff},ck} = (\beta_{jck};\, g_{j,k}=1)^\top$ and $p_{\text{eff}} = \sum_{j=1}^p g_{j,k}$:
\begin{align*}
(\sigma_{ck}^2 \mid -) &\sim \mathrm{InvGamma}\Big(a_\sigma + \tfrac{1}{2}\textstyle\sum_{j=1}^p g_{j,k},\; b_\sigma + \tfrac{1}{2}\sum_{j=1}^p g_{j,k}\beta_{jck}^2\Big) \;\text{ if } z_k > k; \qquad \sigma_{ck}^2 = \theta_\infty \;\text{ if } z_k \le k; \\
(v_k \mid -) &\sim \mathrm{Beta}\bigg(1 + \sum_{\ell=1}^K I(z_\ell = k),\; \alpha + \sum_{\ell=1}^K I(z_\ell > k)\bigg); \\
P(z_k = \ell \mid -) &\propto \begin{cases} w_\ell \prod_{c=1}^{d-1} N_{p_{\text{eff}}}\big(\beta_{\text{eff},ck};\, \mathbf{0},\, \theta_\infty \mathbf{I}_{p_{\text{eff}}}\big), & \text{if } \ell \le k; \quad \text{(spike)} \\[4pt] w_\ell \prod_{c=1}^{d-1} t_{2a_\sigma}\big\{\beta_{\text{eff},ck};\, \mathbf{0},\, (b_\sigma/a_\sigma)\, \mathbf{I}_{p_{\text{eff}}}\big\}, & \text{if } \ell > k. \quad \text{(slab)} \end{cases}
\end{align*}
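The spike-versus-slab comparison in the $z_k$ update is the only non-routine computation here. The sketch below, again with hypothetical names and using `scipy.stats` for the Gaussian and multivariate-$t$ log densities, evaluates the two marginal likelihoods on the log scale before normalizing; it assumes at least one active coefficient ($p_{\text{eff}} \ge 1$).

```python
import numpy as np
from scipy import stats

def draw_z_k(k, beta_eff_ck, log_w, theta_inf, a_sigma, b_sigma, rng):
    """Sample z_k from its full conditional under the CSP prior.

    k           : index of the latent trait (1-based)
    beta_eff_ck : (d-1, p_eff) active coefficients beta_{eff,ck}, one row per category c
    log_w       : (K,) log stick-breaking weights log w_ell
    """
    K = len(log_w)
    p_eff = beta_eff_ck.shape[1]
    # spike marginal: independent N(0, theta_inf) coordinates
    log_spike = stats.norm.logpdf(beta_eff_ck, scale=np.sqrt(theta_inf)).sum()
    # slab marginal: one multivariate t_{2 a_sigma}(0, (b_sigma/a_sigma) I) per category c
    log_slab = sum(
        stats.multivariate_t.logpdf(row, loc=np.zeros(p_eff),
                                    shape=(b_sigma / a_sigma) * np.eye(p_eff),
                                    df=2 * a_sigma)
        for row in beta_eff_ck)
    # spike applies for ell <= k, slab for ell > k
    logp = log_w + np.where(np.arange(1, K + 1) <= k, log_spike, log_slab)
    probs = np.exp(logp - logp.max())
    return rng.choice(K, p=probs / probs.sum()) + 1   # return ell in [K]
```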
With the indicator variables $\{z_k;\, k \in [K]\}$, the posterior mean of the inferred number of effective latent variables takes the form
$$ E[K^\star \mid \mathbf{Y}] = E\bigg[\sum_{k=1}^K \sum_{\ell=k+1}^K I(z_k = \ell)\; \bigg|\; \mathbf{Y}\bigg] = \sum_{k=1}^K P(z_k > k \mid \mathbf{Y}). \tag{S.8} $$

S.3 Additional Data Analysis of the Promoter Dataset

The promoter data consist of the A, C, G, T nucleotides at $p = 57$ positions for $n = 106$ DNA nucleotide sequences, along with a binary variable for each sequence indicating whether it is a promoter or a non-promoter sequence. In this dataset, the first 53 sequences are promoters and the remaining 53 are non-promoter sequences. We let the variable $z$ in the deepest latent layer have $B = 2$ categories, since there are two types of sequences; we do not use any information about which sequence belongs to which type when fitting the two-latent-layer SG-LCM to the data.

Similarly to the analysis of the splice junction data, we take $K_{\text{upper}} = 7$ and $\alpha = 2$ in the CSP prior. We ran the chain for 15,000 iterations, discarding the first 5,000 as burn-in and retaining every fifth sample post burn-in to thin the chain. Our method selects $K^\star = 4$ binary latent variables in the middle layer. Denote the collected posterior samples of the nucleotide sequences' latent binary traits by $(a^{(r)}_{i,k};\, i \in [106],\, k \in [4])$, and denote those of the nucleotide sequences' deep latent category by $(z^{(r)}_i;\, i \in [106])$, where $z^{(r)}_i \in \{1, 2\}$ and $R$ is the number of retained draws. Similarly as for the splice junction data, we define the final estimators $\widehat{\mathbf{G}} = (\widehat g_{j,k})$, $\widehat{\mathbf{A}} = (\widehat a_{i,k})$, and $\widehat{\mathbf{Z}} = (\widehat z_{i,b})$ to be
$$ \widehat g_{j,k} = I\bigg(\frac{1}{R}\sum_{r=1}^R g^{(r)}_{j,k} > 0.5\bigg), \qquad \widehat a_{i,k} = I\bigg(\frac{1}{R}\sum_{r=1}^R a^{(r)}_{i,k} > 0.5\bigg), \qquad \widehat z_{i,b} = \begin{cases} 1, & \text{if } b = \arg\max_{b' \in [B]} \frac{1}{R}\sum_{r=1}^R I\big(z^{(r)}_i = b'\big); \\ 0, & \text{otherwise}; \end{cases} $$
a short sketch of computing these estimators from the retained draws is given after Figure S.1 below. In Fig. S.1, the three plots display the three estimated matrices $\widehat{\mathbf{G}}$, $\widehat{\mathbf{A}}$, and $\widehat{\mathbf{Z}}$, respectively. As for the estimated loci loading matrix $\widehat{\mathbf{G}}$, Fig. S.1(a) provides information on how the $p = 57$ loci depend on the four selected binary latent traits.

The gray reference lines in Fig. S.1(b)-(c) mark the $x$-axis value of $i = 53$, which separates the two nucleotide sequence types in the sample: the promoter sequences $i \in \{1, \ldots, 53\}$ and the non-promoter sequences $i \in \{54, \ldots, 106\}$. As for the learned binary latent representations of the nucleotide sequences shown in Fig. S.1(b), we found that the first 53 promoter sequences generally possess more binary latent traits, while the last 53 non-promoter sequences are estimated to not possess any binary latent traits. In particular, the following eight promoter sequences, indexed $i = 10, 11, \ldots, 17$, possess the densest binary trait representations:
RRNAB_P1, RRNAB_P2, RRNDEX_P2, RRND_P1, RRNE_P1, RRNG_P1, RRNG_P2, RRNX_P1.
In particular, Fig. S.1(b) shows that seven out of these eight sequences, all except the fourth one
RRND_P1, are the only seven sequences in the dataset that possess at least three binary traits. Very interestingly, the aforementioned eight sequences indexed $i = 10, 11, \ldots, 17$ are exactly the only ones among all the $n = 106$ sequences that have their names starting with "RRN". This result shows the usefulness of the proposed methodology in learning interpretable and meaningful latent features. Such insight also suggests that domain scientists could conduct future research to investigate how these eight promoter sequences function differently from the other promoter sequences.

[Figure S.1 here, with three panels: (a) $\widehat{\mathbf{G}}$ ($57 \times 4$); (b) $\widehat{\mathbf{A}}$ ($106 \times 4$); (c) $\widehat{\mathbf{Z}}$ ($106 \times 2$).]

Figure S.1: Promoter data analysis under the CSP prior with $K_{\text{upper}} = 7$. Plots are presented with the $K^\star = 4$ binary latent traits selected by our proposed method. The gray reference lines in (b) and (c) mark the $x$-axis value of $i = 53$, which separates the two nucleotide sequence types in the sample: the promoter sequences $i \in \{1, \ldots, 53\}$ and the non-promoter sequences $i \in \{54, \ldots, 106\}$.
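Finally, the majority-vote estimators defined above reduce to a few array operations. A minimal sketch, assuming the $R$ retained draws are stacked as `G_draws` ($R \times p \times K$), `A_draws` ($R \times n \times K$), and `Z_draws` ($R \times n$ with entries in $\{1, \ldots, B\}$); these array names are hypothetical:

```python
import numpy as np

def point_estimates(G_draws, A_draws, Z_draws, B):
    """Majority-vote estimators G_hat, A_hat and posterior-mode Z_hat."""
    G_hat = (G_draws.mean(axis=0) > 0.5).astype(int)   # (p, K) binary loading matrix
    A_hat = (A_draws.mean(axis=0) > 0.5).astype(int)   # (n, K) binary latent traits
    # posterior frequency of each deep category b, then take the mode per subject
    counts = np.stack([(Z_draws == b + 1).mean(axis=0) for b in range(B)], axis=1)
    Z_hat = np.eye(B, dtype=int)[counts.argmax(axis=1)]  # (n, B) one-hot indicators
    return G_hat, A_hat, Z_hat
```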