Benchmarks, Algorithms, and Metrics for Hierarchical Disentanglement
Andrew Slavin Ross, Finale Doshi-Velez
Harvard University, Cambridge, Massachusetts, USA. Correspondence to: Andrew Ross <andrew [email protected]>. Preliminary work. Under review.

Abstract
In representation learning, there has been recent interest in developing algorithms to disentangle the ground-truth generative factors behind data, and metrics to quantify how fully this occurs. However, these algorithms and metrics often assume that both representations and ground-truth factors are flat, continuous, and factorized, whereas many real-world generative processes involve rich hierarchical structure, mixtures of discrete and continuous variables with dependence between them, and even varying intrinsic dimensionality. In this work, we develop benchmarks, algorithms, and metrics for learning such hierarchical representations.
1. Introduction
Autoencoders aim to learn structure in data by compressing it to a lower-dimensional representation with minimal loss of information. Although this has proven useful in many applications (LeCun et al., 2015), the individual dimensions of autoencoder representations are often inscrutable, even when the underlying data is generated by simple processes. Motivated by needs for interpretability (Alvarez-Melis & Jaakkola, 2018; Marx et al., 2019), fairness (Creager et al., 2019), and generalizability (Bengio et al., 2013), as well as a basic intuition that representations should model the data correctly, a subfield has emerged which applies representation learning algorithms to synthetic datasets and checks how well representation dimensions "disentangle" the known ground-truth factors behind the dataset.

Perhaps the most common disentanglement approach has been to learn flat, continuous vector representations whose dimensions are statistically independent (and evaluate them using metrics that assume ground-truth factors are independent), reasoning that factorization is a useful proxy (Ridgeway, 2016; Higgins et al., 2017; Chen et al., 2018; Kim & Mnih, 2018). However, this problem is not identifiable (Locatello et al., 2018), and it seems unlikely that continuous, factorized, flat representations are the optimal choice for modeling many real-world generative processes, which are often highly structured. To address aspects of this problem, some approaches generalize to partially discrete representations (Jeong & Song, 2019), or encourage independence only conditionally, based on hierarchies or causal graphs (Esmaeili et al., 2019; Träuble et al., 2020). Almost all approaches require side-information, either about specific instances or about the global structure of the dataset.

As a concrete example, consider the problem of learning representations of medical phenotypes of patients with and without diabetes mellitus, a complex disease with multiple types and subtypes (American Diabetes Association, 2005). Some underlying factors of phenotype variation (as well as the intrinsic complexity of these variations) are likely specific to the disease, its types, or its subtypes (Ahlqvist et al., 2018). A representation that faithfully modeled the true factors of variation would need to be deeply hierarchical, with some dimensions only active for certain subtypes.

Our approach in this paper is ambitious: we introduce (1) a flexible framework for modeling deep hierarchical structure in datasets, (2) novel algorithms for learning both structure and structured autoencoders entirely from data, which we apply to (3) novel benchmark datasets, and evaluate with (4) novel hierarchical disentanglement metrics. Our framework is based on the idea that data may lie on multiple manifolds with different intrinsic dimensionalities, and that certain (hierarchical groups of) dimensions may only become active for a subset of the data. Though at first glance this approach seems like it should worsen, not improve, identifiability, our assumption of geometric structure also serves as an inductive bias that empirically helps us learn representations that more faithfully (and explicitly) model ground-truth generative processes.
2. Related Work
Though interest in disentanglement is longstanding (Schmidhuber, 1992; Comon, 1994; Bengio, 2013), a relatively recent resurgence has focused on flat factorized representations. Ridgeway (2016) provides an influential survey of such representations, arguing for their usefulness. Higgins et al. (2017) develop β-VAE, which tries to encourage factorization in variational autoencoders (VAEs, Kingma & Welling (2013)) by increasing the KL divergence penalty. Chen et al. (2018) and Kim & Mnih (2018) factorize by directly penalizing the total correlation (TC) between dimensions. Mixed discrete-continuous extensions are developed for KL by Dupont (2018) and TC by Jeong & Song (2019).

Although these innovations are specific to flat representations, there has been work on certain forms of hierarchy. Esmaeili et al. (2019) encourage different degrees of factorization within and across subgroups, which can be nested. However, they only apply their method in shallow contexts, and subgroups must be provided rather than learned from data. Choi et al. (2020) learn mixed discrete-continuous representations where some continuous dimensions are "public" in scope (and globally independent), while others are "private" to a categorical (and conditionally independent of siblings). However, they do not support deep hierarchies, and the structure of categorical, public, and private variables is provided rather than learned. GINs (Sorrenson et al., 2020) can infer the dimensionality of such categorical-specific continuous dimension groups, but still must be given the (shallow) categorical structure. FineGAN (Singh et al., 2019) learns a kind of structure, but enforces a specific shallow hierarchy of background, shape, and pose. Adams et al. (2010) model data with arbitrarily wide and deep trees using Bayesian nonparametric methods. However, there is no explicit encoder (representations are inferred via MCMC), and all features are binary. Our method attempts to provide the best of all these worlds; from data alone, we learn autoencoders whose representations have mixed discrete-continuous structure of arbitrary width and depth.

Our approach is also complementary to recent shifts in the disentanglement literature, especially from causality researchers. Parascandolo et al. (2018) and Träuble et al. (2020) argue for learning representations that disentangle causal mechanisms rather than statistically independent factors. Locatello et al. (2018) show that disentangling globally independent factors is non-identifiable, and suggest a shift in focus to inductive biases, weak supervision, and datasets with ground-truth interactions. The literature on learning representations with side-information is rich (Mathieu et al., 2016; Siddharth et al., 2017), and recent advances in disentanglement with extremely weak supervision are notable: Locatello et al.
(2020a) learn disentangled representations given instance-pairs that differ only by sparse sets of ground-truth factors, and Klindt et al. (2021) use similar principles to disentangle factors that vary sparsely over time. In this work, we return to the problem of learning disentangled representations from data alone. As our inductive bias to reduce (though not eliminate) non-identifiability, we assume the data contains discrete hierarchical structure that can be inferred geometrically. Though this introduces challenging new problems, it also creates opportunities to learn interpretable global summaries of the data.

Other related approaches not directly in this line of research include relational autoencoders (Wang et al., 2014), which model structure between non-iid flat data, and graph neural networks (Defferrard et al., 2016), which learn flat representations of structured data. In contrast, we model structure within flat inputs. Also relevant are advances in object representations, such as slot attention (Locatello et al., 2020b). While this area has generally not focused on hierarchically nested objects, it does learn structure and seamlessly handles sets; we view our method as complementary. Finally, our hierarchy detection method is closely related to work in manifold learning. We build on work in multiple- and robust manifold learning (Mahapatra & Chandola, 2017; Mahler, 2020), contributing new innovations on top of them.
3. Hierarchical Disentanglement Framework
In this section, we outline our framework for modeling hierarchical structure in representations. In our framework, we associate individual data points with paths down a dimension hierarchy (examples in Fig. 1). Dimension hierarchies consist of dimension group nodes (shown as boxes), each of which can have any number of continuous dimensions (shown as ovals) and an optional categorical variable (diamonds) that leads to other groups based on its value. For any data point, we "activate" only the dimensions along its corresponding path. Notation-wise, root(Z) denotes the group at the root of a hierarchy, and children(Z_j) denotes the child groups of a categorical dimension Z_j. In the context of a dataset, for a dimension Z_j or a dimension group g, on(Z_j) or on(g) denotes the subset of the dataset where that Z_j or g is active.

This framework can be readily extended to support multiple categorical variables per node (e.g. recursing on both segment halves in the timeseries dataset defined below) or even DAGs, such that instances can be associated with directed flows down multiple paths. For simplicity, however, we narrow our scope to tree structures in this work.
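To make this concrete, the following is a minimal Python sketch of a dimension group node and the path-activation rule (the class and field names here are our own illustrative choices, not identifiers from our implementation):

    from dataclasses import dataclass, field
    from typing import Dict, List, Optional

    @dataclass
    class DimensionGroup:
        # Continuous dimensions owned by this group (ovals in Fig. 1),
        # stored as indices into the flat representation vector.
        continuous_dims: List[int] = field(default_factory=list)
        # Optional categorical variable (diamond): maps each of its
        # values to the child group activated when that value is chosen.
        children: Optional[Dict[int, "DimensionGroup"]] = None

    def active_dims(root: DimensionGroup, choices: List[int]) -> List[int]:
        # Walk one path down the hierarchy, collecting the continuous
        # dimensions that are "on" for a point with these categorical choices.
        dims, node = list(root.continuous_dims), root
        for choice in choices:
            if node.children is None:
                break
            node = node.children[choice]
            dims += node.continuous_dims
        return dims

For Spaceshapes (defined below), the root group would hold x and y plus a three-way shape categorical, whose moon, star, and ship children hold phase, shine, and angle respectively.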
4. Hierarchical Disentanglement Benchmarks
For new frameworks, it is especially important to have synthetic benchmarks for which the true structure is known and ground-truth disentanglement scores can be computed. Below we further describe the two benchmarks from Fig. 1.
Our first benchmark dataset is Spaceshapes, a binary 64x64 image dataset meant to hierarchically extend dSprites, a shape dataset common in the disentanglement literature (Matthey et al., 2017). Like dSprites, Spaceshapes images have location variables x and y, as well as a categorical shape with three options (in our case, moon, star, and ship). However, depending on shape, we add other continuous variables with very different effects: moons have a phase; stars have a sharpness to their shine; and ships have an angle. Finally, ships can optionally have a jet, which has a length (jetlen), but this is only defined at the deepest level of the hierarchy. The presence of jetlen alters the intrinsic dimensionality of the representation; it can be either 3D or 4D depending on the path. As in dSprites, variables are sampled from continuous or discrete uniforms.

Figure 1. Examples and ground-truth variable hierarchies for Spaceshapes and two different variants of Chopsticks. Continuous variables are shown as circles and discrete variables are shown as diamonds. Discrete variables have subhierarchies of additional variables that are only active for particular discrete values.

Our second benchmark, Chopsticks, is actually a family of arbitrarily deep timeseries datasets. Chopsticks samples are 64D linear segments. Each segment can have a uniform-sampled slope and/or intercept; different Chopsticks variants can have one, the other, both, or either but not both. For all variants, segments initially span the whole interval. However, a coin is then flipped to determine whether to chop the segment, in which case we add a uniform offset to the slope and/or intercept of the right half. We repeat this process recursively up to a configurable maximum depth, biasing probabilities so that we have equal probability of stopping at each level. Each chop requires increasing local dimensionality to track additional slopes and intercepts. Although the underlying process is quite simple, the structure can be made arbitrarily deep, making it a useful benchmark for testing structure learning; a rough generator sketch follows below.

Although these datasets are designed to have clear hierarchical structure, in certain cases, there are multiple dimension hierarchies that could arguably describe the same dataset. See Fig. 14 for more and §6.1 for how we handle them.
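As an illustration of the chopping process, here is a rough sketch of a generator for the both variant (the sampling ranges, the fixed chop probability, and all names are illustrative assumptions; the real benchmark biases the chop coin per level so that stopping at each depth is equally likely):

    import numpy as np

    def chopsticks_sample(max_depth, n_points=64, p_chop=0.5,
                          rng=np.random.default_rng()):
        # One 'both'-variant sample: a line whose right portion is
        # recursively re-chopped, each chop adding uniform offsets
        # to the slope and intercept of the right half.
        t = np.linspace(0.0, 1.0, n_points)
        slope = np.full(n_points, rng.uniform(-1, 1))      # assumed range
        intercept = np.full(n_points, rng.uniform(-1, 1))  # assumed range
        lo = 0.0
        for _ in range(max_depth - 1):
            if rng.random() >= p_chop:
                break                  # stop chopping at this level
            lo = (lo + 1.0) / 2.0      # chop at the midpoint of what remains
            right = t >= lo
            slope[right] += rng.uniform(-1, 1)
            intercept[right] += rng.uniform(-1, 1)
        return intercept + slope * t

Each executed chop adds two more locally active dimensions (the right half's slope and intercept offsets), which is exactly the varying intrinsic dimensionality MIMOSA must detect.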
5. Hierarchical Disentanglement Algorithms
We next present a method for learning hierarchical disentangled representations from data alone. We split the problem into two brunch-themed algorithms, MIMOSA (which infers hierarchies) and COFHAE (which trains autoencoders).
The goal of our first algorithm, MIMOSA (Multi-manifold IsoMap On Smooth Autoencoder), is to learn a hierarchy Ĥ from data, as well as an assignment vector Â_n mapping each data point to a hierarchy leaf. MIMOSA consists of the following steps (see Appendix for Algorithms 3-7 and complexity):

Dimensionality Reduction (Algorithm 1, line 1):
We start by performing an initial reduction of X to Z using a flat autoencoder. While we could start with Z = X, performing this reduction saves computation as later steps (e.g. finding neighbors) scale linearly with |Z|. Although this requires choosing |Z|, we find the exact value is not critical as long as it exceeds the (max) intrinsic dimensionality of the data. We also find it important to use differentiable activation functions (e.g. Softplus rather than ReLU) to keep latent manifolds smooth; see Fig. 6 for more.
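For instance, a minimal Keras version of such a smooth initial autoencoder might look like the following (the 256-wide hidden layers match our Chopsticks architecture in §A.1; everything else here is an illustrative assumption rather than our exact implementation):

    import tensorflow as tf

    def smooth_autoencoder(input_dim, latent_dim):
        # Softplus rather than ReLU keeps the encoded manifolds free of
        # the sharp corners that break the local SVD steps (see Fig. 6).
        encoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="softplus",
                                  input_shape=(input_dim,)),
            tf.keras.layers.Dense(256, activation="softplus"),
            tf.keras.layers.Dense(latent_dim),
        ])
        decoder = tf.keras.Sequential([
            tf.keras.layers.Dense(256, activation="softplus",
                                  input_shape=(latent_dim,)),
            tf.keras.layers.Dense(256, activation="softplus"),
            tf.keras.layers.Dense(input_dim),
        ])
        autoencoder = tf.keras.Sequential([encoder, decoder])
        autoencoder.compile(optimizer="adam", loss="mse")
        return encoder, decoder, autoencoder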
Manifold Decomposition (Algorithms 3-6): We decompose Z into a set of manifold "components" by computing SVDs locally around each point and merging neighboring points with sufficiently similar subspaces. We then perform a second merging step over longer lengthscales, combining equal-dimensional components with similar local SVDs along their nearest boundary points and discarding small outliers, which we found was necessary to handle interstitial gaps when two manifolds intersect. The core of this step is based on a multi-manifold learning method (Mahapatra & Chandola, 2017), but we make efficiency as well as robustness improvements by combining ideas from RANSAC (Fischler & Bolles, 1981) and contagion dynamics (Mahler, 2020). The merging step is a new contribution.

It bears emphasis that manifold decomposition, which groups points based on the similarity of local principal components, is distinct from clustering, which groups points based on proximity. On our benchmarks, even hierarchical iterative clustering methods like OPTICS (Ankerst et al., 1999) will not suffice, as nearby points may lie on different manifolds.

Algorithm 1 MIMOSA(X)
  1. Encode the data X using a smooth autoencoder to reduce the initial dimensionality. Store as Z.
  2. Construct a neighborhood graph on Z using a Ball Tree (Omohundro, 1989).
  3. Run LocalSVD (Algorithm 3) on each point in Z and its neighbors to identify local manifold directions.
  4. Run BuildComponent (Algorithm 5) to successively merge neighboring points with similar local manifold directions.
  5. Run MergeComponents (Algorithm 6) to combine similar components over longer distances and discard outliers.
  6. Run ConstructHierarchy (Algorithm 7) to create a dimension hierarchy based on which components enclose others.
  7. Return the hierarchy and component assignments.

Hierarchy Identification (Algorithm 7):
We construct a tree by drawing edges from low-dimensional components to the higher-dimensional components that best "enclose" them, which we define using a ratio of inter-component to intra-component nearest-neighbor distances; we believe this is novel. We use this tree and the component dimensionalities to construct a dimension hierarchy and a set of assignments from points to paths, which we output.
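As a minimal sketch of the enclosure score (assuming scikit-learn's BallTree; the function and its argument names are our own):

    import numpy as np
    from sklearn.neighbors import BallTree

    def enclosure_ratio(component_pts, candidate_pts):
        # Intra-component lengthscale: mean distance from each point to
        # its nearest *other* point within the component.
        d_self, _ = BallTree(component_pts).query(component_pts, k=2)
        lengthscale = d_self[:, 1].mean()
        # Inter-component distance: mean distance from the component's
        # points to their nearest neighbors in the candidate encloser.
        d_cross, _ = BallTree(candidate_pts).query(component_pts, k=1)
        # Small ratios indicate the candidate hugs the component closely.
        return d_cross[:, 0].mean() / lengthscale

In Algorithm 7, an enclosure edge is drawn when this ratio falls below the neighbor_lengthscale_mult threshold.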
Hyperparameters:
Each of these steps has several hyperparameters, and we provide a full listing and sensitivity study in §A.3. The one we found most critical was the minimum SVD similarity to merge neighboring points.
Algorithm 2 COFHAE(X)
  hierarchy, assignments = MIMOSA(X)
  HAE_θ = init_hierarchical_autoencoder(hierarchy)
  D_ψ = init_discriminator()
  for x, a ∼ minibatch(X, assignments) do
    a′, z = HAE_θ.encode(x; τ)
    x′ = HAE_θ.decode(concat(a′, z))
    z′ = copy(z)
    for i = 1..|z| do
      shuffle z′_{:,i} over minibatch indices where on(z_{:,i})
    end for
    L_θ = Σ_n [ L_x(x′_n, x_n) + λ₁ L_a(a′_n, a_n) − λ₂ log( D_ψ(z_n) / (1 − D_ψ(z_n)) ) ]
    L_ψ = Σ_n [ −log D_ψ(z′_n) − log(1 − D_ψ(z_n)) ]
    θ = descent_step(θ, L_θ)
    ψ = descent_step(ψ, L_ψ)
  end for
  return HAE_θ

Our first stage, MIMOSA, gives us the hierarchy and assignments of data down it. In the second stage, COFHAE (COnditionally Factorized Hierarchical AutoEncoder, Algorithms 2 and 8), we learn an autoencoder that respects this hierarchy via (differentiable) masking operations that impose structure on flat representations.

Hierarchical Encoding:
Instances x pass through a neural network encoder to an initial vector z_pre, whose dimensions correspond to all continuous variables in the hierarchy as well as the one-hot encoded categorical variables. Categorical dimensions (denoted a′) pass through a softmax with temperature τ to softly mask z_pre based on the hierarchy.
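A minimal sketch of this soft masking for a single categorical variable (tensor layouts and names are illustrative assumptions; Algorithm 8 gives the full procedure):

    import tensorflow as tf

    def soft_mask(z_pre, cat_logits, option_masks, tau):
        # cat_logits:   [batch, n_options] scores for one categorical.
        # option_masks: [n_options, dim] float 0/1 rows marking which
        #               entries of z_pre lie below each option.
        a = tf.nn.softmax(cat_logits / tau, axis=-1)        # soft choice
        under_option = tf.reduce_max(option_masks, axis=0)  # [dim], 0/1
        # Dimensions under some option are gated by that option's softmax
        # weight; all other dimensions stay fully active.
        mask = (1.0 - under_option) + tf.matmul(a, option_masks)
        return a, z_pre * mask

As τ shrinks, the softmax approaches a one-hot choice and the mask hardens, so the gating stays differentiable during training while approximating discrete path selection.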
Supervising Assignments: Hierarchical encoding outputs estimated assignments a′. We add a penalty L_a(a′, a), weighted by λ₁, to make these close to the MIMOSA values a.

Conditional Factorization:
Kim & Mnih (2018) penalize the total correlation (TC) between dimensions of flat continuous representations z with two tricks. First, noting that TC is the KL divergence between q(z) (the joint distribution of the encoded z) and q̄(z) ≡ ∏_{j=1}^{|z|} q(z_j) (the product of its marginals), they approximate samples from q̄(z) by randomly permuting the values of each z_i across batches (Arcones & Giné, 1992). Second, they approximate the KL divergence between the two distributions using the density ratio trick (Sugiyama et al., 2012) on an auxiliary discriminator D_ψ(z), where KL(q(z) ‖ q̄(z)) ≈ E[log((1 − D_ψ(z)) / D_ψ(z))] if D_ψ(z) outputs accurate probabilities of z having been sampled from q̄. We adopt a similar approach, except instead of permuting each z_i across the full batch B, we only permute it where it is active, i.e. B ∩ on(z_i) (defined using the hardened version of the mask). This approximates a hierarchical version of q̄(z) where each dimension distribution is a mixture of 0 and the product of its active marginals. D_ψ(z) then lets us estimate the KL between this distribution and q(z), which we penalize and weight with λ₂.

This approach presumes ground-truth continuous variables should be conditionally independent given categorical values, which is a major assumption. However, it is less strict than the assumption taken by many disentanglement methods, i.e. that continuous variables are independent marginally, and it may remain useful as an inductive bias.
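In NumPy terms, the conditional permutation amounts to the following sketch (names are ours; our actual implementation is in Tensorflow):

    import numpy as np

    def conditionally_permute(z, active, rng=np.random.default_rng()):
        # z:      [batch, dim] continuous codes, inactive entries masked to 0.
        # active: [batch, dim] hard boolean mask of which entries are on.
        z_perm = z.copy()
        for i in range(z.shape[1]):
            rows = np.flatnonzero(active[:, i])
            # Shuffle column i only across the rows where it is active,
            # approximating the hierarchical product-of-marginals.
            z_perm[rows, i] = z[rng.permutation(rows), i]
        return z_perm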
6. Hierarchical Disentanglement Metrics
In this section, we develop metrics for quantifying how well learned representations and hierarchies match ground-truth.
6.1. Desiderata

Our goal in designing metrics is to measure whether we have learned the "right representation," both in terms of global structure and specific variable correspondences. In an ideal world, we would measure whether a learned representation Z is identically equal to a ground-truth V. However, most existing disentanglement metrics are invariant to permutations, so that dimensions V_i can be reordered to match different Z_j, as well as univariate transformations, so that the values of Z_j do not need to be identical to V_i. In the case of methods like the SAP score (Kumar et al., 2017), these univariate transformations must be linear, but as the uniformity of scaling can be arbitrary, we permit general nonlinear transformations, as long as they are 1:1, or invertible. Also, in the hierarchical case, there are certain ambiguities about the right vertical placement of continuous variables. For example, on Spaceshapes, the phase, shine, and angle variables could all be "merged up" to a single top-level variable whose effect changes based on shape. Alternatively, x and y position could be "pushed down" and duplicated for each shape despite their analogous effects (see Fig. 14 for an illustration). Such "merge up" and "push down" transformations change the vector representation, but leave local dimensionality and the group structure of the hierarchy unchanged. We defer the problem of deciding the most natural vertical placement of continuous variables to future work, and make our main metrics invariant to them.

6.2. H-error, Purity, and Coverage

The first metric we use to evaluate MIMOSA is the H-error, which measures whether a learned hierarchy Ĥ has the same essential structure as the ground-truth hierarchy H. To compute the H-error, we iterate over all possible paths p and p′ down both H and Ĥ, and attempt to pair them based on whether the minimum downstream dimensionality of p and p′ matches at each respective node. The number of unpaired paths in either hierarchy is taken to be the H-error. This metric can only be 0 if both hierarchies have the same dimensionality structure, but is invariant to the "merge up" and "push down" operations described in §6.1.

The second MIMOSA metric is purity, which measures whether the assignments output by MIMOSA match ground-truth. To compute purity, we iterate through points assigned to each path p̂ in Ĥ, find the path p in H to which most of them belong, and then compute the fraction of points in p̂ that belong to the majority p. Then we average these purity scores across Ĥ, weighting by the number of points in p̂. This metric only falls below 1 when we group together points with different ground-truth assignments.

The final metric we use to evaluate MIMOSA is coverage. Since MIMOSA discards small outlier components, it is possible that the final set of assignments will not cover the full training set. If almost all points are discarded this way, the other metrics may not be meaningful. As such, we measure coverage as the fraction of the training set which is not discarded. We note that hyperparameters can be tuned to ensure high coverage without knowing ground-truth assignments.
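For reference, purity reduces to a few lines; a minimal sketch (assuming, purely for illustration, that both learned and ground-truth assignments are encoded as integer path IDs):

    import numpy as np

    def purity(learned_paths, true_paths):
        # Weighted average over learned paths of the fraction of member
        # points whose ground-truth path matches the majority.
        majority_total = 0
        for p_hat in np.unique(learned_paths):
            members = true_paths[learned_paths == p_hat]
            _, counts = np.unique(members, return_counts=True)
            majority_total += counts.max()
        return majority_total / len(learned_paths)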
6.3. R² and R²_c Scores

Per our desiderata, we seek to check whether every ground-truth variable V_i can be mapped invertibly to some learned dimension Z_j. As a preliminary definition, we say that a learned Z_j corresponds to a ground-truth V_i over some set S ⊆ ℝ if a bijection between them exists; that is,

\[
\exists\, f(\cdot): S \to \mathbb{R} \ \text{ s.t. } \ f(V_i) = Z_j \ \text{ and } \ f^{-1}(Z_j) = V_i. \tag{1}
\]

We say that Z disentangles V if all V_i have a corresponding Z_j. To measure the extent to which bijections exist, we can simply try to learn them (over random splits of many paired samples of V_i and Z_j). Concretely, for each pair of learned and true dimensions, we train univariate models to map in both directions, compute their coefficients of determination (R²), and take their geometric mean:

\[
\begin{aligned}
f^* &\equiv \arg\min_{f \in \mathcal{F}} \mathbb{E}_{\mathrm{train}}\big[(f(X) - Y)^2\big] \\
R^2(X \to Y) &\equiv \mathbb{E}_{\mathrm{test}}\left[1 - \frac{\sum (f^*(X) - Y)^2}{\sum (\mathbb{E}[Y] - Y)^2}\right] \\
R^2(X \leftrightarrow Y) &\equiv \sqrt{\lfloor R^2(X \to Y) \rfloor_+ \cdot \lfloor R^2(Y \to X) \rfloor_+},
\end{aligned} \tag{2}
\]

where we average over train/test splits (we use 5), assume F is sufficiently flexible to contain the optimal bijection (we use gradient-boosted decision trees), and assume our dataset is large enough to reliably identify f* ∈ F. In the limit, R²(X ↔ Y) can only be 1 if a bijection exists, as any region of non-zero mass in the joint distribution of X and Y where this is false would imply E[(f(X) − Y)²] > 0 or E[(f(Y) − X)²] > 0. In the special case that Y is discrete rather than continuous, we use classifiers for f and accuracy instead of R², but the same argument holds.

To measure whether a set of variables Z disentangles another set of variables V, we check if, for each V_i, there is at least one Z_j for which R²(V_i ↔ Z_j) = 1:

\[
R^2(V, Z) \equiv \frac{1}{|V|} \sum_i \max_j R^2(V_i \leftrightarrow Z_j). \tag{3}
\]

We call this the "right-representation" R², or R² score. Note that this metric is related to the existing SAP score (Kumar et al., 2017), except we allow for nonlinearity, require high R² in both directions, and take the maximum over each score column rather than the difference between the top two entries (which avoids assuming ground-truth is factorized).
Figure 2. MIMOSA results for the depth-2 either version of Chopsticks, colored by ground-truth assignments. MIMOSA learns an initial 4D softplus AE representation (left), decomposes it into lower-dimensional components (middle), and infers a hierarchy (right). In this case, correspondence to ground-truth is very close (99.8% component purity, covering 93.7% of the training set, with the correct hierarchical relationships). Similar examples are shown for other datasets in Figs. 15-20 of the Appendix.
Although R² is useful for measuring correspondence between sets of variables that are both always active, it does not immediately apply to hierarchical representations unless inactive variables are represented somehow, e.g. as 0 (an arbitrary implementation decision that affects R² by changing E[Y]). It also lacks invariance to merge-up and push-down operations. Instead, we seek conditional correspondence between V_i and a set of dimensions in Z, defined as

\[
\begin{aligned}
&\text{for each } V_i, \ \exists\, \mathcal{Z}_i = \{Z_j, Z_k, \ldots\} \ \text{ s.t.} \\
&(a)\ V_i \text{ corresponds to } Z_j \text{ over } \mathrm{on}(V_i) \cap \mathrm{on}(Z_j), \\
&(b)\ \mathrm{on}(Z_j) \cap \mathrm{on}(Z_k) = \emptyset \ \text{ for all } j \neq k, \ \text{and} \\
&(c)\ \textstyle\bigcup_{z \in \mathcal{Z}_i} \mathrm{on}(z) = \mathrm{on}(V_i),
\end{aligned} \tag{4}
\]

or rather that we can find some tiling of on(V_i) into regions where it corresponds 1:1 with different Z_j which are never active simultaneously. This allows for one Z_j to correspond to non-overlapping elements of V (e.g. merging up), as well as for one V_i to be modeled by non-overlapping elements of Z (e.g. pushing down).

We can then formulate a conditional R²_c score which quantifies how closely conditional correspondence holds:

\[
R^2_c(V_i, g) \equiv \max_{j \in g} \max\!\Big( R^2\big(V_i \leftrightarrow Z_j \,\big|\, \mathrm{on}(V_i) \cap \mathrm{on}(g)\big),\ \sum_{g' \in \mathrm{children}(Z_j)} \frac{|\mathrm{on}(V_i) \cap \mathrm{on}(g')|}{|\mathrm{on}(V_i)|}\, R^2_c(V_i, g') \Big)
\]

for a dimension group g; the overall disentanglement is:

\[
R^2_c(V \leftrightarrow Z) \equiv \frac{1}{|V|} \sum_{i=1}^{|V|} R^2_c(V_i, \mathrm{root}(Z)). \tag{5}
\]

In the special case that V and Z are flat, R²_c reduces to R². We note that even for flat representations, the R² score may be a useful measure of disentanglement when ground-truth variables are not factorized.
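To make the R² computation concrete, here is a minimal sketch using scikit-learn (the gradient-boosted trees match the estimator choice mentioned above; split handling and all names are otherwise our own illustrative choices, and for discrete variables one would swap in a classifier and accuracy):

    import numpy as np
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.model_selection import train_test_split

    def r2_one_way(x, y, seed):
        # Fit f: x -> y on a train split; return test R^2, clamped at 0.
        x_tr, x_te, y_tr, y_te = train_test_split(x, y, random_state=seed)
        f = GradientBoostingRegressor().fit(x_tr.reshape(-1, 1), y_tr)
        resid = np.sum((f.predict(x_te.reshape(-1, 1)) - y_te) ** 2)
        total = np.sum((y_te - y_te.mean()) ** 2)
        return max(0.0, 1.0 - resid / total)

    def r2_bidirectional(v_i, z_j, n_splits=5):
        # Geometric mean of clamped R^2 in both directions, averaged
        # over random train/test splits, as in Eq. (2).
        return float(np.mean([
            np.sqrt(r2_one_way(v_i, z_j, s) * r2_one_way(z_j, v_i, s))
            for s in range(n_splits)]))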
7. Experimental Setup
Benchmarks:
We ran experiments on nine benchmark datasets: Spaceshapes, and eight variants of Chopsticks (varying slope, intercept, both, and either at recursion depths of 2 and 3). See §4 for more details, and Fig. 8 for preliminary experiments on noisy data.

Algorithms:
In addition to COFHAE with MIMOSA, we trained the following baselines: autoencoders (AE), variational autoencoders (Kingma & Welling, 2013) (VAE), the β-total-correlation autoencoder (Chen et al., 2018) (β-TCVAE), and FactorVAE (Kim & Mnih, 2018). We also ran COFHAE ablations using the ground-truth hierarchy and assignments, testing all possible combinations of loss terms and comparing conditional vs. marginal TC penalties; results are in Fig. 4. See §A.1 for training and model details.

Metrics:
To evaluate hierarchies, we computed purity, coverage, and H-error (§6.2). Results are in Table 1. To measure disentanglement, we primarily use R²_c (§6.3); results for all datasets and models are in Fig. 3. We also compute the following baseline metrics: the SAP score (Kumar et al., 2017) (SAP), the mutual information gap (Chen et al., 2018) (MIG, estimated using 2D histograms), the FactorVAE score (Kim & Mnih, 2018) (FVAE), and the DCI disentanglement score (Eastwood & Williams, 2018) (DCI). Most implementations were adapted from disentanglement_lib (Locatello et al., 2018). We also compute our marginal R² score. Results across metrics are shown for a subset of datasets and models in Fig. 5.

Hyperparameters:
COFHAE is only given instances X, which complicates cross-validation. However, we can still tune parameters to ensure assignments a′ match MIMOSA outputs a and reconstruction loss for x is low (which can fail to happen if the adversarial term dominates). Over a grid of three values each for τ, λ₁, and λ₂, we select the model with the lowest training reconstruction loss L_x from the third of models with the lowest assignment loss L_a. For MIMOSA, hyperparameters can be tuned to ensure high coverage (purity and H-error require side-information); see §A.3 for more. For β-TCVAE and FactorVAE, we show results for β = 5 and γ = 10, but sweep both over a grid of four values in Fig. 11.
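The selection rule itself is simple; a sketch over the grid of trained runs (argument names are hypothetical):

    import numpy as np

    def select_run(recon_losses, assign_losses):
        # Keep the third of runs with the lowest assignment loss, then
        # pick the lowest training reconstruction loss among them.
        assign = np.asarray(assign_losses)
        recon = np.asarray(recon_losses)
        k = max(1, len(assign) // 3)
        candidates = np.argsort(assign)[:k]
        return candidates[np.argmin(recon[candidates])]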
Figure 3. Hierarchical disentanglement results for representation learning methods (baselines and COFHAE + MIMOSA) over all nine datasets. COFHAE almost perfectly disentangles ground-truth on the six simplest versions of Chopsticks, with some degradations on the two most complex versions (with very deep hierarchies) and on Spaceshapes (with a shallower hierarchy, but higher-dimensional inputs). Baseline methods were generally much more entangled, though β-TCVAE is competitive on Spaceshapes.
Figure 4. Ablation study for COFHAE on the depth-2 both version of Chopsticks (over 5 restarts). Hierarchical disentanglement is low for flat AEs (Flat); adding the ground-truth hierarchy H improves it (Hier H), as does also adding supervision for ground-truth assignments A (H+A). Adding a FactorVAE-style marginal TC penalty (H+A+TC(Z)) does not appear to help disentanglement, but making that TC penalty conditional (H+A+TC(Z|on), i.e. COFHAE) brings it close to the near-optimal disentanglement of a hierarchical model whose latent representation is fully supervised (H+A+Z). However, the hierarchical conditional TC penalty fails to produce this same disentanglement without any supervision over assignments (H+TC(Z)).
8. Results and Discussion
MIMOSA consistently recovered the right hierarchies.
Per Table 1, we consistently found the right hierarchy for all datasets except depth-3 either-Chopsticks, but even there results were close, generally recovering 12 of 14 possible hierarchy paths (see Fig. 20 for more details). Purity and coverage were also high, often near perfect as in Spaceshapes or depth-2 Chopsticks.

Figure 5. Comparison of disentanglement metrics across two datasets and four models. Only R² and R²_c correctly and consistently award near-optimal scores to the supervised H+A+Z model.

COFHAE significantly outperformed baselines.
Table 1. MIMOSA results across all datasets, with means and standard deviations across 5 restarts. In general, MIMOSA components contained points only from single ground-truth sets of paths (purity), were inclusive of most points in the training set (coverage), and resulted in perfectly accurate hierarchies (0 H-errors), with the greatest or only exception being the Chopsticks depth-3 either dataset (where we tended to recover only 12 of its 14 possible paths).

Per Fig. 3, COFHAE R²_c scores were near-perfect for 6 out of 9 datasets, and better than baselines on all. On Spaceshapes and the depth-3 either and both versions of Chopsticks, scores were slightly worse. Part of this suboptimality could be due to non-identifiability. For Spaceshapes and the both versions of Chopsticks, dimension group nodes contain multiple continuous variables, which even conditionally can be modeled by multiple factorized distributions (Locatello et al., 2018). However, optimization issues could also be at fault, as we do not see suboptimal R²_c on Chopsticks until a depth of 3, and even supervised H+A+Z models occasionally fail to converge on Spaceshapes. Kim & Mnih (2018) note that the relatively low-dimensional discriminator used by FactorVAE is easier to optimize than the generally high-dimensional discriminators used in GANs, which are notoriously tricky to train (Mescheder et al., 2018). In our case, flattened hierarchy vectors can be high-dimensional (e.g. Fig. 21), and in any given batch, instances corresponding to different paths down the hierarchy may have different numbers of samples (potentially requiring larger batch sizes or stratified sampling to ensure sufficient coverage). Finally, alongside non-identifiability and optimization issues, MIMOSA errors (e.g. merge-up/push-down differences for Spaceshapes and suboptimal purity and coverage for Chopsticks) also may play a role, as evidenced by performance improvements in our full COFHAE ablations in Fig. 10. Despite all of these issues, COFHAE is still closer to optimal than any of our baseline algorithms.

R²_c provides more insight into disentanglement than baselines. One way to evaluate an evaluation metric is to test it against a precisely known quantity. In this case, we know the H+A+Z model, whose encoder is supervised to match ground-truth, should receive a near-perfect score. The only metrics to do this consistently are R² and R²_c. Note that the DCI disentanglement score, based on the entropy of normalized feature importances from an estimator predicting single ground-truth factors from all learned dimensions, comes close. Intuitively, this metric could behave similarly to R² if its estimator was trained to be sparse (placing importance on as few dimensions as possible). However, using the R²s of univariate estimators is more direct, and also incorporates information from the DCI informativeness score.

Another way to evaluate an evaluation metric is to test whether quantitative differences capture salient qualitative differences. To this point, specifically to compare R² and R²_c, we consider several examples in Fig. 12 and Fig. 13. First, we see that for the Spaceshapes COFHAE model in Fig. 12c, its R²_c score (0.89) is higher than its R² (0.79). This increase is due to the fact that R² penalizes the "push-down" differences (§6.1) between the learned and true factors representing x and y position, while R²_c is invariant to them. However, the overall increase is less dramatic than one might expect due to modest decreases in correspondence scores for other dimensions (e.g. for jetlen), which occur because R²_c is not biased by spurious equality between dimensions which are both inactive. Another example of a difference between R² and R²_c (illustrating invariance to "merging up" rather than "pushing down") is for the Spaceshapes β-TCVAE in Fig. 12b. In this case, histograms show that one β-TCVAE dimension corresponds closely to both moon phase and star shine (and to a lesser extent, jetlen), only one of which is active at a time. The R² score (0.47) assigns low scores to these correspondences, but R²_c (0.69) properly factors them in.

COFHAE and MIMOSA subcomponents improve performance.
Though COFHAE contains many moving parts, results in Fig. 4 and Fig. 10 suggest they all count. Autoencoders only achieve optimal disentanglement if provided with the hierarchy, assignments, and a conditional (not marginal) penalty on the TC of continuous variables; no partial subset suffices. In the Appendix, Fig. 9 shows ablations and sensitivity analyses for MIMOSA that validate that its subcomponents are important as well.
9. Conclusion
In this work, we introduced the problem of hierarchical disentanglement, where ground-truth representation dimensions are organized into a tree and activated or deactivated based on the values of categorical dimensions. We presented benchmarks, algorithms, and metrics for learning and evaluating such hierarchical representations.

There are a number of promising avenues for future work. One is extending the method to handle a wider variety of underlying structures, e.g. non-hierarchical dimension DAGs, or integrating our method with object representation techniques to better model generative processes involving ordinal variables or unordered sets (Locatello et al., 2020b). Another is to better solve or understand hierarchical disentanglement as we have already formulated it, e.g. by improving robustness to noise, or providing a better theoretical understanding of identifiability and when we can guarantee methods will succeed. Finally, there are ample opportunities to apply these techniques to real-world cases that we expect to have hierarchical structure, such as causal inference, patient phenotyping, or population genetics datasets.

More generally, we feel it is important for representation learning to move beyond flat vectors, and work towards explicitly modeling the rich structure contained in the real world. For a long time, many symbolic AI and cognitive science researchers have argued that AI progress should be evaluated not by improvements in accuracy or reconstruction error, but by how well we can build models that build their own interpretable models of the world (Lake et al., 2017). Our work takes steps in this direction.
References
Adams, R., Wallach, H., and Ghahramani, Z. Learning the structure of deep sparse graphical models. In International Conference on Artificial Intelligence and Statistics, 2010.
Ahlqvist, E., Storm, P., Käräjämäki, A., Martinell, M., Dorkhan, M., Carlsson, A., Vikman, P., Prasad, R. B., Aly, D. M., Almgren, P., et al. Novel subgroups of adult-onset diabetes and their association with outcomes: a data-driven cluster analysis of six variables. The Lancet Diabetes & Endocrinology, 6(5):361–369, 2018.
Alvarez-Melis, D. and Jaakkola, T. S. Towards robust interpretability with self-explaining neural networks. In Advances in Neural Information Processing Systems, 2018.
American Diabetes Association. Diagnosis and classification of diabetes mellitus. Diabetes Care, 2005.
Ankerst, M., Breunig, M. M., Kriegel, H.-P., and Sander, J. OPTICS: Ordering points to identify the clustering structure. ACM Sigmod Record, 28(2):49–60, 1999.
Arcones, M. A. and Giné, E. On the bootstrap of U and V statistics. The Annals of Statistics, pp. 655–674, 1992.
Bengio, Y. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pp. 1–37. Springer, 2013.
Bengio, Y., Courville, A., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2013.
Burgess, C. P., Higgins, I., Pal, A., Matthey, L., Watters, N., Desjardins, G., and Lerchner, A. Understanding disentangling in beta-VAE. arXiv:1804.03599, 2018.
Chen, T. Q., Li, X., Grosse, R. B., and Duvenaud, D. K. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
Choi, J., Hwang, G., and Kang, M. DisCond-VAE: disentangling continuous factors from the discrete. arXiv:2009.08039, 2020.
Comon, P. Independent component analysis, a new concept? Signal Processing, 1994.
Creager, E., Madras, D., Jacobsen, J.-H., Weis, M., Swersky, K., Pitassi, T., and Zemel, R. Flexibly fair representation learning by disentanglement. In International Conference on Machine Learning, 2019.
Defferrard, M., Bresson, X., and Vandergheynst, P. Convolutional neural networks on graphs with fast localized spectral filtering. Advances in Neural Information Processing Systems, 2016.
Dupont, E. Learning disentangled joint continuous and discrete representations. In Advances in Neural Information Processing Systems, 2018.
Eastwood, C. and Williams, C. K. A framework for the quantitative evaluation of disentangled representations. In International Conference on Learning Representations, 2018.
Esmaeili, B., Wu, H., Jain, S., Bozkurt, A., Siddharth, N., Paige, B., Brooks, D. H., Dy, J., and van de Meent, J.-W. Structured disentangled representations. International Conference on Artificial Intelligence and Statistics, 2019.
Fischler, M. A. and Bolles, R. C. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 24(6):381–395, 1981.
Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., and Lerchner, A. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations, 2017.
Jeong, Y. and Song, H. O. Learning discrete and continuous factors of data via alternating disentanglement. arXiv:1905.09432, 2019.
Kim, H. and Mnih, A. Disentangling by factorising. In International Conference on Machine Learning, 2018.
Kingma, D. P. and Welling, M. Auto-encoding variational Bayes. arXiv:1312.6114, 2013.
Klindt, D. A., Schott, L., Sharma, Y., Ustyuzhaninov, I., Brendel, W., Bethge, M., and Paiton, D. Towards nonlinear disentanglement in natural data with temporal sparse coding. In International Conference on Learning Representations, 2021.
Kumar, A., Sattigeri, P., and Balakrishnan, A. Variational inference of disentangled latent concepts from unlabeled observations. arXiv:1711.00848, 2017.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.
LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Nature, 521(7553):436–444, 2015.
Little, A. V., Lee, J., Jung, Y.-M., and Maggioni, M. Estimation of intrinsic dimensionality of samples from noisy low-dimensional manifolds in high dimensions with multiscale SVD. In IEEE/SP 15th Workshop on Statistical Signal Processing, 2009.
Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., and Bachem, O. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv:1811.12359, 2018.
Locatello, F., Poole, B., Rätsch, G., Schölkopf, B., Bachem, O., and Tschannen, M. Weakly-supervised disentanglement without compromises. arXiv:2002.02886, 2020a.
Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., and Kipf, T. Object-centric learning with slot attention. Advances in Neural Information Processing Systems, 2020b.
Mahapatra, S. and Chandola, V. S-Isomap++: multi manifold learning from streaming data. In IEEE International Conference on Big Data, 2017.
Mahler, B. I. Contagion dynamics for manifold learning. arXiv:2012.00091, 2020.
Marx, C., Phillips, R., Friedler, S., Scheidegger, C., and Venkatasubramanian, S. Disentangling influence: Using disentangled representations to audit model predictions. Advances in Neural Information Processing Systems, 2019.
Mathieu, M. F., Zhao, J. J., Zhao, J., Ramesh, A., Sprechmann, P., and LeCun, Y. Disentangling factors of variation in deep representation using adversarial training. Advances in Neural Information Processing Systems, 2016.
Matthey, L., Higgins, I., Hassabis, D., and Lerchner, A. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.
Mescheder, L., Geiger, A., and Nowozin, S. Which training methods for GANs do actually converge? In International Conference on Machine Learning, 2018.
Omohundro, S. M. Five balltree construction algorithms. International Computer Science Institute Berkeley, 1989.
Parascandolo, G., Kilbertus, N., Rojas-Carulla, M., and Schölkopf, B. Learning independent causal mechanisms. In International Conference on Machine Learning, 2018.
Ridgeway, K. A survey of inductive biases for factorial representation-learning. arXiv:1612.05299, 2016.
Schmidhuber, J. Learning factorial codes by predictability minimization. Neural Computation, 4(6):863–879, 1992.
Siddharth, N., Paige, B., Van de Meent, J.-W., Desmaison, A., Goodman, N., Kohli, P., Wood, F., and Torr, P. Learning disentangled representations with semi-supervised deep generative models. In Advances in Neural Information Processing Systems, 2017.
Singh, K. K., Ojha, U., and Lee, Y. J. FineGAN: Unsupervised hierarchical disentanglement for fine-grained object generation and discovery. In IEEE Conference on Computer Vision and Pattern Recognition, 2019.
Sorrenson, P., Rother, C., and Köthe, U. Disentanglement by nonlinear ICA with general incompressible-flow networks (GIN). arXiv:2001.04872, 2020.
Sugiyama, M., Suzuki, T., and Kanamori, T. Density-ratio matching under the Bregman divergence: a unified framework of density-ratio estimation. Annals of the Institute of Statistical Mathematics, 64(5):1009–1044, 2012.
Träuble, F., Creager, E., Kilbertus, N., Goyal, A., Locatello, F., Schölkopf, B., and Bauer, S. Is independence all you need? On the generalization of representations learned from correlated data. arXiv:2006.07886, 2020.
Wang, W., Huang, Y., Wang, Y., and Wang, L. Generalized autoencoder: A neural network framework for dimensionality reduction. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2014.
A. Appendix
A.1. Training and Architecture Details
For Chopsticks, our encoders and decoders used two hidden layers of width 256, and our loss function L_x was defined as a zero-centered Gaussian negative log likelihood with a fixed σ. For Spaceshapes, encoders and decoders used the 7-layer convolutional architecture from Burgess et al. (2018), and our loss function L_x was Bernoulli negative log likelihood. All models were implemented in Tensorflow. Code to reproduce experiments will be made available at https://github.com/dtak/hierarchical-disentanglement.

For both models, the assignment loss L_a was set to mean-squared error, but only for assignments that were defined. This was implemented by setting undefined assignment components to -1, and then defining L_a(a, a′) = Σ_i 1[a_i ≥ 0](a_i − a′_i)².

All activation functions were set to ReLU (max(0, x)), except in the case of the initial smooth autoencoder, where they were replaced with Softplus (ln(1 + e^x)). This initial autoencoder was trained with dimensionality equal to one plus the maximum intrinsic dimensionality of the dataset. We investigate varying this parameter in Fig. 9 and find it can be much larger, and perhaps would have produced better results (though nearest neighbor calculation and local SVD computations would have been slower).

All models were trained for 50 epochs with a batch size of 256 on a dataset of size 100,000, split 90%/10% into train/test. We used the Adam optimizer with a learning rate starting at 0.001 and decaying at the halfway and three-quarters points of training.

For COFHAE, we selected the softmax temperature τ, the assignment penalty strength λ₁, and the adversarial penalty strength λ₂ based on training set reconstruction error and MIMOSA assignment accuracy. Splitting off a separate validation set was not necessary, as the most common problem we faced was poor convergence, not overfitting; the adversarial penalty would dominate and prevent the procedure from learning a model that could reconstruct X or A. Specifically, for each restart, we ran COFHAE over a grid of three values each for τ, λ₁, and λ₂. We then selected the model with the lowest training MSE Σ_n ‖x_n − x′_n‖², but restricting ourselves to the 33.3% of models with the lowest assignment loss Σ_n L_a(a_n, a′_n).

For evaluating R² and R²_c, we used gradient-boosted decision trees, which were faster to train than neural networks.
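In Tensorflow, this masked assignment loss is essentially a one-liner; a sketch using the −1 convention above:

    import tensorflow as tf

    def assignment_loss(a_true, a_pred):
        # Undefined assignment components are encoded as -1 in a_true,
        # so we only penalize squared error where the target is defined.
        defined = tf.cast(a_true >= 0, tf.float32)
        return tf.reduce_sum(defined * tf.square(a_true - a_pred), axis=-1)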
A.2. Complexity and Runtimes

Per Fig. 7, the total runtime of our method is dominated by COFHAE, an adversarial autoencoder method which has the same complexity as FactorVAE (Kim & Mnih, 2018) (linear in dataset size N and number of training epochs, and strongly affected by GPU speed).

MIMOSA could theoretically take more time, however, as the complexity of constructing a ball tree (Omohundro, 1989) for nearest-neighbor queries is O(|Z| N log N). As such, initial dimensionality reduction is critical; in our Spaceshapes experiments, |Z| is 7, whereas |X| is 4096.

Other MIMOSA steps can also take time. With a num_nearest_neighbors of k, the complexity of running local SVD on every point in the dataset is O(N(|Z|²k + |Z|k² + k³)), providing another reason to reduce initial dimensionality and keep neighborhood size manageable (though ideally k should increase with |Z| to robustly learn local manifold directions). Iterating over the dataset in BuildComponent and computing cosine similarity will also have complexity at least O(Nkd(d + |Z|)) for components of local dimensionality d, and detecting component boundaries can actually have complexity O(Nke^d) (if this is implemented, as in our experiments, by checking if projected points are contained in their neighbors' convex hulls, though we also explored a much cheaper O(Nkd) strategy of checking for the presence of neighbors in all principal component directions that worked almost as well).

Although these scaling issues are worth noting, MIMOSA was still relatively fast in our experiments, where runtimes were dominated by neural network training (Fig. 7).
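For example, the neighborhood-graph step reduces to a single ball tree query; a sketch assuming scikit-learn:

    from sklearn.neighbors import BallTree

    def neighborhoods(Z, k):
        # Build a ball tree on the reduced codes Z, then fetch each
        # point's k nearest neighbors; the first hit is the point
        # itself, so we drop it.
        dist, idx = BallTree(Z).query(Z, k=k + 1)
        return dist[:, 1:], idx[:, 1:]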
A.3. MIMOSA Hyperparameters

In this section, we list and describe all hyperparameters for MIMOSA, along with values that we used for our main results. We also present sensitivity analyses in Fig. 9.
MIMOSA initial autoencoder (Algorithm 1, line 1)

• initial_dim - the dimensionality of the initial smooth autoencoder. As Fig. 9 shows, this can be larger than the intrinsic dimensionality of the data, which MIMOSA will estimate. We defaulted to using the max intrinsic dimensionality plus 1; in a real-world context where this information is not available, it can be estimated by reducing from initial_dim = |X| until reconstruction error starts increasing.
• Training and architectural details appropriate for the data modality (e.g. convolutional layers for images). See §A.1 for our choices.

LocalSVD (Algorithm 3)

• num_nearest_neighbors - the neighborhood size for local SVD and later traversal. Must be larger than initial_dim; could also be replaced with a search radius.
• ransac_frac - the fraction of neighbors on which to refit the SVD. Note that we do not run traditional multi-step RANSAC (Fischler & Bolles, 1981), but a two-step approximation, where we define loss by aggregating reconstruction errors across dimensions. Another (slower but potentially more robust) option would be to iteratively refit SVD on the points with lowest reconstruction error at each dimension, and check if the resulting eigenvalues meet our cutoff criteria.
• eig_cumsum_thresh - the minimum fraction of variance SVD dimensions must explain to determine local dimensionality. For noisy or sparse data, it might be useful to reduce this parameter.
• eig_decay_thresh - the minimum multiplicative factor by which SVD eigenvalues must decay to determine local dimensionality. It might also be useful to reduce this parameter for sparse data.

Note that our LocalSVD algorithm can be seen as a faster version of Multiscale SVD (Little et al., 2009), which is used in an analogous way by Mahapatra & Chandola (2017), but would require repeatedly computing singular value decompositions over different search radii for each point.

BuildComponent (Algorithm 5)

• cos_simil_thresh - neighbors' local SVDs must be this similar to add to the component. This corresponds to the ε parameter from Mahapatra & Chandola (2017). We used different values for Chopsticks and Spaceshapes; in general, we feel this is one of the most important parameters to tune, and it should generally be reduced in the presence of noise or data scarcity.
• contagion_num - only add similar points to a manifold component when a threshold number of their neighbors have already been added. This is useful for robustness, and corresponds to the T parameter from Mahler (2020) (but expressed as a number rather than a fraction). Values that are too large a fraction of num_nearest_neighbors will likely produce poor results, and we found the greatest increases in robustness just going from 1 (or no contagion dynamics) to 2.

MergeComponents (Algorithm 6)

• min_size_init - discard initial components smaller than this. We used 0.02% of the dataset size, or about 20 points. This parameter helps speed up the algorithm (by reducing the number of pairwise comparisons) and avoids incorrect merges through single-point components.
• min_size_merged - discard merged components smaller than this. We used 2% of the dataset size, or about 2000 points. This parameter helps exclude spurious high-dimensional interstitial points that appear at boundaries where low-dimensional components intersect.
• min_common_edge_frac(d) - the minimum fraction of edges that two manifold components must share in common to merge, as a function of dimensionality d. We used 2^(−d−1) + 2^(−d−2); this is based on the idea that two neighboring (possibly distorted) hypercubes of dimension d should match on one of their sides; since they have 2d sides, the fraction of matching edge points would be 2^(−d). However, for robustness (as not all components will be hypercubes, and even then some edge points may not match), we average this with the smaller fraction that a (d+1)-dimensional hypercube would need, or 2^(−d−1), for our resulting 2^(−d−1) + 2^(−d−2). We found that this choice was not critical in preliminary experiments, as matches were common for components with the same true assignments and rare for others, but it could become more important for sparse or noisy data.
Algorithm 3 LocalSVD(Z)
  Run SVD on Z (a design matrix of dimension num_nearest_neighbors by initial_dim)
  if ransac_frac < 1 then
    for each dimension d from 1 to initial_dim − 1 do
      for each point z_n do
        Compute the reconstruction error for z_n using only the first d SVD dimensions
      end for
    end for
    Take the norm of reconstruction errors across dimensions, giving a vector of length num_nearest_neighbors
    Re-fit SVD on points whose error-norms are less than the 100 × ransac_frac percentile value.
  end if
  for each dimension d from 1 to initial_dim − 1 do
    Check if the cumulative sum of the first d eigenvalues is at least eig_cumsum_thresh
    Check if the ratio of the d-th to the (d+1)-st eigenvalue is at least eig_decay_thresh
    if both of these conditions are true then
      return only the first d SVD components
    end if
  end for
  return the full set of SVD components otherwise
Algorithm 4 TangentPlaneCos(U, V)
  if U and V are equal-dimensional then
    return |det(U · Vᵀ)|
  else
    return 0
  end if

Algorithm 5 BuildComponent(z_i, neighbors, svds)
  Initialize component to z_i and neighbors z_j not already in other components where TangentPlaneCos(svds_i, svds_j) ≥ cos_simil_thresh (Algorithm 4).
  while the component is still growing do
    Add all points z_k for which at least contagion_num of their neighbors z_ℓ are already in the component with TangentPlaneCos(svds_k, svds_ℓ) ≥ cos_simil_thresh. Skip adding any z_k already in another component.
  end while
  return the set of points in the component
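In code, TangentPlaneCos is essentially one determinant; a sketch assuming U and V store orthonormal tangent-basis rows of shape d × |Z|:

    import numpy as np

    def tangent_plane_cos(U, V):
        # Similarity of two local tangent planes: |det(U V^T)| when
        # their dimensionalities match (Algorithm 4), 0 otherwise.
        if U.shape[0] != V.shape[0]:
            return 0.0
        return float(abs(np.linalg.det(U @ V.T)))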
Algorithm 6 MergeComponents(components, svds)
  Discard components smaller than min_size_init.
  for each component c_i do
    Construct a local ball tree for the points in c_i.
    Set c_i.edges to points not contained in the convex hull of their neighbors in local SVD space.
  end for
  Initialize edge overlap matrix M of size |components| by |components| to 0.
  for each ordered pair of equal-dimensional components (c_i, c_j) do
    Set M_ij to the fraction of points in c_i.edges for which the closest point in c_j.edges has local SVD tangent plane similarity above cos_simil_thresh.
  end for
  Average M with its transpose to symmetrize.
  Merge all components c_i ≠ c_j of equal dimensionality d where M_ij ≥ min_common_edge_frac(d).
  Discard components smaller than min_size_merged.
  return the merged set of components
Algorithm 7 ConstructHierarchy(components)
  for each component c_i do
    Set c_i.neighbor_lengthscale to the average distance of points to their nearest neighbors inside the component (computed using the local ball tree from Algorithm 6)
  end for
  for each pair of different-dimensional components (c_i, c_j), c_i higher-dimensional do
    Compute the average distance from points in c_i to their nearest neighbors in c_j (via ball tree).
    Divide this average distance by c_i.neighbor_lengthscale.
    if the resulting ratio ≤ neighbor_lengthscale_mult then
      Set c_j ∈ c_i (c_j is enclosed by c_i)
    end if
  end for
  Create a root node with edges to all components which do not enclose others.
  Transform the component enclosure DAG into a tree (where enclosing components are children of enclosed components) by deleting edges which:
    1. are redundant because an intermediate edge exists, e.g. if c₁ ∈ c₂ ∈ c₃, we delete the edge between c₁ and c₃.
    2. are ambiguous because a higher-dimensional component encloses multiple lower-dimensional components (i.e. has multiple parents). In that case, preserve only the edge with the lowest distance ratio.
  Convert the resulting component enclosure tree into a dimension hierarchy:
    1. If the root node has only one child, set it to be the root. Otherwise, begin with a dimension group with a single categorical dimension whose options point to groups containing each child.
    2. For the rest of the component tree, add continuous dimensions until the total number of continuous dimensions up to the root equals the component's dimensionality.
    3. If a component has children, add a categorical dimension that includes those child groups as options (recursing down the tree), along with an empty group (∅) option.
  return the dimension hierarchy
Algorithm 8 HAE_θ.encode(x; τ)
  Encode x using any neural network architecture as a flat vector z_pre, with size equal to the number of continuous variables plus the number of categorical options in HAE_θ.hierarchy.
  Associate each group of dimensions in the flat vector with variables in the hierarchy.
  For each categorical variable, pass its options through a softmax with temperature τ.
  Use the resulting vector to mask all components of z_pre corresponding to variables below each option in HAE_θ.hierarchy.
  return the masked representation, separated into discrete a′ and continuous z, along with the mask m (for determining active dimensions later).
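As a concrete illustration of the masking step, here is a minimal numpy sketch for a single-level hierarchy; the real encoder applies the same idea recursively inside a differentiable network, and the layout of z_pre and all names here (hae_encode_mask, a hierarchy dict mapping each option to its continuous dimension indices) are illustrative assumptions rather than our implementation.

import numpy as np

def hae_encode_mask(z_pre, hierarchy, tau=1.0):
    # hierarchy: {option: indices of continuous dims below that option};
    # the first len(hierarchy) entries of z_pre are treated as the
    # categorical's logits.
    n_opts = len(hierarchy)
    logits = z_pre[:n_opts] / tau
    a = np.exp(logits - logits.max())
    a /= a.sum()  # softmax with temperature tau
    mask = np.ones_like(z_pre)
    for k, dims in enumerate(hierarchy.values()):
        mask[list(dims)] = a[k]  # scale each option's subtree by its weight
    z = z_pre * mask
    # discrete a', continuous z, and the mask m (for active dimensions)
    return a, z[n_opts:], mask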
Figure 6. Comparison of the latent spaces learned by MIMOSA initial autoencoders with ReLU (top) vs. Softplus (bottom) activations. Each plot shows encoded Chopsticks data samples colored by their ground-truth location in the dimension hierarchy. Because ReLU activations are non-differentiable at 0, the resulting latent manifolds contain sharp corners, which makes it difficult for MIMOSA's local SVD procedure to merge points together into the correct components.
Figure 7. Mean runtimes and percentage breakdowns for COFHAE and MIMOSA on Chopsticks and Spaceshapes, based on TensorFlow implementations running on single GPUs (exact model varies between Tesla K80, Tesla V100, GeForce RTX 2080, etc.). Runtimes tend to be dominated by COFHAE, which is similar in complexity to existing adversarial methods (e.g. FactorVAE).

• contagion_num - the number of neighbors already in the component that a point needs before it can be added (a number rather than a fraction). We used small values for both Chopsticks and Spaceshapes; values above a substantial fraction of num_nearest_neighbors will likely produce poor results, and we found the greatest increases in robustness just going from 1 (or no contagion dynamics) to 2.

MergeComponents (Algorithm 6)
• min_size_init - discard initial components smaller than this. We used a fixed fraction of the dataset size, or about 20 points. This parameter helps speed up the algorithm (by reducing the number of pairwise comparisons) and avoids incorrect merges through single-point components.
• min_size_merged - discard merged components smaller than this. We used a larger fraction of the dataset size, or about 2000 points. This parameter helps exclude spurious high-dimensional interstitial points that appear at boundaries where low-dimensional components intersect.
• min_common_edge_frac(d) - the minimum fraction of edges that two manifold components must share in common to merge, as a function of dimensionality d. We used 2^{-d-1} + 2^{-d-2}; this is based on the idea that two neighboring (possibly distorted) hypercubes of dimension d should match on one of their sides; since they have 2^d sides, the fraction of matching edge points would be 2^{-d}. However, for robustness (as not all components will be hypercubes, and even then some edge points may not match), we average this with the smaller fraction that a (d+1)-dimensional hypercube would need, or 2^{-d-1}, for our resulting 2^{-d-1} + 2^{-d-2}. We found that this choice was not critical in preliminary experiments, as matches were common for components with the same true assignments and rare for others, but it could become more important for sparse or noisy data.
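Assuming the reconstruction of the formula above, the threshold function is a single line of Python:

def min_common_edge_frac(d):
    # Average of the single-side matching fractions for d- and
    # (d+1)-dimensional hypercubes: (2**-d + 2**-(d+1)) / 2.
    return 2.0 ** -(d + 1) + 2.0 ** -(d + 2)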
(a) Chopsticks Xs corrupted by Gaussian noise; (b) effect of noise on initial autoencoder Zs; (c) effect of noise on MIMOSA for two variants.
Figure 8. Illustration of the sensitivity of MIMOSA to data noise. In preliminary experiments, we find that noise poses the greatest problem for identifying the lowest-dimensional components, e.g. the 1D components in (b) that end up being classified as 2D or 3D. Tuning parameters would help, but we lack labels to cross-validate against.
ConstructHierarchy (Algorithm 7)
• neighbor_lengthscale_mult - the threshold for deciding whether a higher-dimensional component "encloses" a lower-dimensional component, expressed as a ratio of (1) the average distance from lower-dimensional component points to their nearest neighbors in the higher-dimensional component (inter-component distance) to (2) the average distance of points in the higher-dimensional component to their nearest neighbors in that same component (intra-component distance). The value we used was robust across our benchmarks, though it may need to be increased if ground-truth components are higher-dimensional than those in our benchmarks; see the sketch below.
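For concreteness, here is a sketch of the enclosure test using scikit-learn's BallTree; the function name and the default multiplier are placeholders, not the value we used.

import numpy as np
from sklearn.neighbors import BallTree

def encloses(c_hi, c_lo, neighbor_lengthscale_mult=5.0):
    # c_hi, c_lo: arrays of points in the higher- and lower-dimensional
    # components, respectively. Returns True if c_hi encloses c_lo.
    tree_hi = BallTree(c_hi)
    # Intra-component lengthscale of c_hi (k=2 because each point's
    # nearest neighbor is itself).
    d_intra, _ = tree_hi.query(c_hi, k=2)
    lengthscale = d_intra[:, 1].mean()
    # Inter-component distance: lower-dim points to their nearest c_hi points.
    d_inter, _ = tree_hi.query(c_lo, k=1)
    return d_inter[:, 0].mean() / lengthscale <= neighbor_lengthscale_mult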
Figure 9. Effect of varying different hyperparameters (and of ablating different robustness techniques) on MIMOSA. Default values are shown with vertical gray dotted lines, and for each hyperparameter (top to bottom), the average coverage (left), purity (middle), and H-error (right) when deviating from the defaults are shown for three versions of the Chopsticks dataset. Results suggest a degree of robustness to changes (degradations tend not to be severe for small changes), but also the usefulness of the various components; for example, results markedly improve on some datasets with contagion_num > 1 and ransac_frac < 1 (implying that contagion dynamics and RANSAC both help). Many parameters exhibit tradeoffs between component purity and dataset coverage.
Figure 10. A fuller version of main paper Fig. 4 showing COFHAE ablations on all datasets. Hierarchical disentanglement tends to be low for flat AEs (Flat), better with the ground-truth hierarchy H (Hier H), and better still after adding supervision for the ground-truth assignments A (H + A). Adding a FactorVAE-style marginal TC penalty (H + A + TC(Z)) sometimes helps disentanglement, but making that TC penalty conditional (H + A + TC(Z | on), i.e. COFHAE) tends to help more, bringing it close to the near-optimal disentanglement of a hierarchical model whose latent representation is fully supervised (H + A + Z). Partial exceptions include the hardest three datasets (Spaceshapes and depth-3 compound Chopsticks), where disentanglement is not consistently near 1; this may be due to non-identifiability or to adversarial optimization difficulties.
Figure 11. Varying disentanglement penalty hyperparameters for the baseline algorithms (TCVAE and FactorVAE). In contrast to COFHAE, no setting produces near-optimal disentanglement, even sporadically.
(a) AE pairwise histograms and R/R_c scores; (b) TCVAE pairwise histograms and R/R_c scores; (c) COFHAE pairwise histograms and R/R_c scores.
Figure 12. Pairwise histograms of ground-truth vs. learned variables for a flat autoencoder (a, top left), β-TCVAE (b, top right), and the best-performing run of COFHAE (c, bottom) on Spaceshapes. Histograms are conditioned on both variables being active, and the dimension-wise components of the R_c score are shown on the right. β-TCVAE does a markedly better job of disentangling certain components than the flat autoencoder, but in this case COFHAE is able to fully disentangle the ground truth by modeling the discrete hierarchical structure. See Fig. 13 for a latent traversal visualization.
Figure 13. Hierarchical latent traversal plot for the Spaceshapes COFHAE model shown in Fig. 12c. Individual latent traversals show the effects of linearly sweeping each active dimension from its 1st to its 99th percentile value (the center column shows the same input with intermediate values for all active dimensions). Consistent with Fig. 12c, the model is not perfectly disentangled, though the primary correspondences are clear: star shine, moon phase, ship angle, and ship jetlen are each modeled by a single latent dimension, and (x, y) is modeled by a distinct pair of latent dimensions for each shape.
Figure 14. Three different potential hierarchies for Spaceshapes which all have the same structure of variable groups and dimensionalities, but different distributions of continuous variables across groups. The ambiguity in this case is that the continuous variable that modifies each shape (phase, shine, angle) could either be a child of the corresponding shape category, or be "merged up" and combined into a single top-level continuous variable that controls the shape in different ways based on the category. Alternatively, the location variables x and y could instead be "pushed down" from the top level and duplicated across each shape category. In each of these cases, the learned representation still arguably disentangles the ground-truth factors, in the sense that for any fixed categorical assignment there is still a 1:1 correspondence between all learned and ground-truth continuous factors. We deliberately design our R_c and H-error metrics in §6 to be invariant to these transformations, leaving this specific disambiguation to future work.
Figure 15. MIMOSA-learned initial encoding (left), components (middle), and hierarchy (right) for Spaceshapes. Initial points are in 7 dimensions and projected to 3D for plotting. Three of the identified components are 3D and one is 4D. Analogue of Fig. 2 in the main text.
Figure 16.
MIMOSA-learned initial encoding (left), 2D and 1D components (middle), and hierarchy (right) for depth-2 Chopsticks varying the slope. Analogue of Fig. 2 in the main text.
Figure 17.
MIMOSA-learned initial encoding (left), 2D, 1D, and 3D components (middle), and hierarchy (right) for depth-3 Chopsticks varying the slope. Initial points are in 4 dimensions and projected to 3D for plotting. Analogue of Fig. 2 in the main text.
Figure 18.
MIMOSA-learned initial encoding (left), 2D and 4D components (middle), and hierarchy (right) for depth-2 Chopsticks varying both slope and intercept. Initial points are in 5 dimensions and projected to 3D for plotting. Analogue of Fig. 2 in the main text.
Figure 19.
MIMOSA-learned initial encoding (left), 2D, 4D, and 6D components (middle), and hierarchy (right) for depth-2 Chopsticks varying both slope and intercept. Initial points are in 7 dimensions and projected to 3D for plotting. Analogue of Fig. 2 in the main text.
Figure 20.
MIMOSA-learned initial encoding (left), 1D-3D components (middle), and hierarchy (right) for depth-3 Chopsticks varying either slope or intercept. Note that the learned hierarchy is not quite correct (two nodes at the deepest level are missing). Initial points are in 5 dimensions and projected to 3D. Analogue of Fig. 2.
Figure 21.
Pairwise histograms of ground-truth vs. learned variables for COFHAE on the most complicated hierarchical benchmark (Chopsticks at a recursion depth of 3, varying either slope or intercept). Histograms are conditioned on both variables being active, and the dimension-wise components of the R_c score are shown on the right.