Hierarchical VAEs Know What They Don't Know
Jakob D. Havtorn, Jes Frellsen, Søren Hauberg, Lars Maaløe
Abstract
Deep generative models have shown themselves to be state-of-the-art density estimators. Yet, recent work has found that they often assign a higher likelihood to data from outside the training distribution. This seemingly paradoxical behavior has caused concerns over the quality of the attained density estimates. In the context of hierarchical variational autoencoders, we provide evidence to explain this behavior by out-of-distribution data having in-distribution low-level features. We argue that this is both expected and desirable behavior. With this insight in hand, we develop a fast, scalable and fully unsupervised likelihood-ratio score for OOD detection that requires data to be in-distribution across all feature levels. We benchmark the method on a vast set of data and model combinations and achieve state-of-the-art results on out-of-distribution detection.
1. Introduction
The reliability and safety of machine learning systems applied in the real world is contingent on the ability to detect when an input is different from the training distribution. Supervised classifiers built as deep neural networks are well known to misclassify such out-of-distribution (OOD) inputs to known classes with high confidence (Goodfellow et al., 2015; Nguyen et al., 2015). Several approaches have been suggested to equip deep classifiers with OOD detection capabilities (Hendrycks & Gimpel, 2017; Lakshminarayanan et al., 2017; Hendrycks et al., 2019; DeVries & Taylor, 2018). But such methods are inherently supervised and require in-distribution labels or examples of OOD data, limiting their applicability and generality.

Unsupervised generative models that estimate an explicit likelihood should understand what it means to be in- and out-of-distribution without requiring labels or examples of OOD data.

Department of Applied Mathematics and Computer Science, Technical University of Denmark, Kongens Lyngby, Denmark. Corti AI, Copenhagen, Denmark. Correspondence to: Jakob D. Havtorn <[email protected]>, Lars Maaløe <[email protected]>. Preprint. Under review.
Figure 1. Reconstructions using a hierarchical VAE trained on FashionMNIST. Reconstruction quality of OOD data is comparable to in-distribution data, resulting in high likelihoods and poor OOD discrimination. By sampling the k bottom-most latent variables from the conditional prior distribution p(z_{≤k}|z_{>k}) (latent reconstructions) instead of the approximate posterior q(z_{≤k}|x), reconstructions of OOD data become increasingly similar to the training data.

In this paper, we examine the failure cases of deep generative models on OOD detection tasks within the context of hierarchical VAEs, and make the following contributions:

(i) We provide evidence that the root cause of OOD failures is that learned low-level features generalize well across datasets and dominate the estimated likelihoods.
(ii) We then propose a fast, scalable, and fully unsupervised likelihood-ratio score for OOD detection that is explicitly developed to ensure that data should be in-distribution across all feature levels, which prevents the low-level features from dominating.
(iii) With the likelihood-ratio score, we demonstrate state-of-the-art performance across a wide range of known OOD failure cases.

Figure 2. Absolute correlations between data representations in all layers of the inference network of a hierarchical VAE trained on FashionMNIST and of another trained on MNIST. We compute the correlation between the representations of the two different models given the same data, FashionMNIST (top) and MNIST (bottom).

2. Why does OOD detection fail?

The inability to detect out-of-distribution data with deep generative models is surprising. Before the advent of deep generative models, this was not considered a major issue for probabilistic models (Bishop, 1994). Is the failure due to increased model flexibility, overfitting, or something different?

Deep learning models are generally believed to form hierarchies of representations that range from low-level features to more conceptual ones related to semantics (Bengio et al., 2013). This has also been observed within deep generative models (Maaløe et al., 2019; Child, 2021). For image data there is a trend that the low-level features are quite similar across models (edge detectors, etc.), which raises the question to which extent such features are relevant when detecting OOD data, as also suggested by Nalisnick et al. (2019a). To investigate, we train two hierarchical VAEs (subsection 3.2) on FashionMNIST and MNIST, respectively, and compute the between-model correlation of the extracted features on in-distribution data and OOD data. The result appears in Figure 2. We observe that features extracted in the early layers (low-level features) correlate strongly between the two models, and that this correlation drops as we get into later layers. This suggests that low-level features do not carry much information for OOD detection.
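The between-model correlation underlying Figure 2 can be computed in more than one way. The sketch below shows one simplified variant in PyTorch: it correlates matching layers of the two inference networks unit by unit across shared inputs. The `encode(x)` method returning a list of per-layer feature tensors is an assumed interface for this sketch and is not part of the paper's code.

```python
import torch

@torch.no_grad()
def layerwise_abs_correlation(model_a, model_b, data_loader):
    """Mean absolute per-unit Pearson correlation between matching layers of two
    models evaluated on the same data (cf. Figure 2). Assumes both models expose
    encode(x) -> list of per-layer feature tensors with identical shapes."""
    feats_a, feats_b = None, None
    for x, _ in data_loader:  # loader assumed to yield (image, label) pairs
        fa = [f.flatten(1) for f in model_a.encode(x)]
        fb = [f.flatten(1) for f in model_b.encode(x)]
        feats_a = fa if feats_a is None else [torch.cat([p, c]) for p, c in zip(feats_a, fa)]
        feats_b = fb if feats_b is None else [torch.cat([p, c]) for p, c in zip(feats_b, fb)]
    corrs = []
    for a, b in zip(feats_a, feats_b):
        a = (a - a.mean(0)) / (a.std(0) + 1e-8)  # standardize each unit over examples
        b = (b - b.mean(0)) / (b.std(0) + 1e-8)
        corrs.append((a * b).mean(0).abs().mean().item())
    return corrs  # one value per layer; high for early layers, lower for later ones
```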
To shed further light on the impact of semantic versus low-level features, we look at model reconstructions of images with a hierarchical VAE (Figure 3).

Figure 3. Reconstructions of in-distribution data (CelebA) by the BIVA model using higher latent variables (Maaløe et al., 2019). The higher the latent variable, the more the reconstructions fall into the mode of the learned distribution. It is more common to wear regular glasses than sunglasses, but most common not to wear glasses at all. A man with long hair collapses into the mode of the more common long-haired woman.

To study the feature hierarchy, we replace the inference distribution with the corresponding conditional prior in the first layers of the model to see what information is lost. We observe that as more layers rely on the prior, more details are lost. Sunglasses, which are uncommon, are first replaced by more common glasses, and then finally disappear. This suggests that as we fall back to the conditional priors of each layer, we are pushed closer to local modes of the modeled distribution.

Finally, we look at reconstructions of out-of-distribution data. Figure 1 illustrates that MNIST data is surprisingly well reconstructed by a hierarchical VAE trained on FashionMNIST. Similar results have been found elsewhere (Xiao et al., 2020). We repeat the previous experiment and replace inference distributions by their corresponding conditional prior, and now observe that reconstructions from higher latent layers become increasingly similar to the data on which the model was trained. The reliance on conditional priors seems to prevent accurate reconstruction of out-of-distribution data. Some details are lost on in-distribution data too, but the distinction between that and out-of-distribution data becomes clearer.

These observations lead to our main hypothesis. We hypothesize that the lowest latent variables in a hierarchical VAE learn generic features that can be used to describe a wide range of data. This enables the model to achieve high rates of compression, and hence high likelihoods, even on out-of-distribution data as long as the learned low-level features are appropriate. We further suggest that OOD data are in-distribution with respect to these low-level features, but not with respect to semantic ones.

3. Background and related work

3.1. Variational autoencoders

The variational autoencoder (VAE) (Kingma & Welling, 2014; Rezende et al., 2014) is a framework for constructing deep generative models defined by an observed variable x and a stochastic latent variable z. Typically, a neural network with parameters θ is chosen to parameterize the generative distribution p_θ(x, z) = p_θ(x|z) p(z), where the prior p(z) is commonly a standard Gaussian N(0, I). The true posterior p(z|x) is generally not analytically tractable and is approximated by a variational distribution q_φ(z|x) parameterized by another neural network with parameters φ. The approximate posterior q_φ(z|x) is most often a diagonal-covariance Gaussian. The model parameters θ and variational parameters φ are jointly optimized by maximizing the evidence lower bound (ELBO),

log p_θ(x) ≥ E_{q_φ(z|x)}[ log p_θ(x, z) / q_φ(z|x) ] ≡ L(x; θ, φ).   (1)

For brevity, we will denote L(x; θ, φ) as L(x) or L. The reparameterization trick is used to backpropagate gradients through the stochastic latent variables with low variance.

The VAE is defined with a single latent variable, which limits its ability to learn a high-likelihood representation of complex input distributions, e.g. natural images. There exist two complementary approaches to make the VAE more flexible: (i) model a more expressive prior distribution (Rezende & Mohamed, 2015; Kingma et al., 2016) and (ii) learn a deeper hierarchy of latent variables (Burda et al., 2016; Sønderby et al., 2016). We here focus on the latter.
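For concreteness, the single-latent-variable ELBO in (1) can be written in a few lines of PyTorch. The following is a minimal sketch of our own (a small fully connected encoder and decoder with a Bernoulli likelihood), not the architecture used in the experiments:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyVAE(nn.Module):
    """Minimal single-latent-variable VAE with a Bernoulli likelihood (illustrative only)."""
    def __init__(self, x_dim=784, z_dim=32, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, 2 * z_dim))
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(), nn.Linear(h_dim, x_dim))

    def elbo(self, x):  # x assumed binarized, shape (batch, x_dim)
        mu, logvar = self.enc(x).chunk(2, dim=-1)                # q(z|x) = N(mu, diag(exp(logvar)))
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)  # reparameterization trick
        logits = self.dec(z)                                     # p(x|z) = Bernoulli(logits)
        log_px_z = -F.binary_cross_entropy_with_logits(logits, x, reduction="none").sum(-1)
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1.0).sum(-1)  # KL(q(z|x) || N(0, I))
        return log_px_z - kl                                     # per-example ELBO, Eq. (1)
```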
3.2. Hierarchical VAEs

Hierarchical VAEs are a family of probabilistic latent variable models which extend the basic VAE by introducing a hierarchy of L latent variables z = z_1, ..., z_L. The most common generative model is defined from the top down as

p_θ(x, z) = p_θ(x|z_1) p_θ(z_1|z_2) ··· p_θ(z_{L−1}|z_L) p_θ(z_L).

The inference model can then be defined in two ways, respectively referred to as bottom-up (Burda et al., 2016),

q_φ(z|x) = q_φ(z_1|x) ∏_{i=2}^{L} q_φ(z_i|z_{i−1}),   (2)

and top-down (Sønderby et al., 2016),

q_φ(z|x) = q_φ(z_L|x) ∏_{i=1}^{L−1} q_φ(z_i|z_{i+1}).   (3)

Regardless of the choice of inference model, a hierarchical VAE is still trained using the ELBO (1).

Until recently, hierarchical VAEs gave inferior likelihoods compared to state-of-the-art autoregressive (Salimans et al., 2017) and flow-based (Ho et al., 2019) models. This was changed by Maaløe et al. (2019), Vahdat & Kautz (2020), and Child (2021), who introduced complementary methods to extend the number of latent variables to a very deep hierarchy, resulting in state-of-the-art likelihood performance. In this paper we employ a simple hierarchical VAE with bottom-up inference paths and the more powerful BIVA variant with a bidirectional (top-down and bottom-up) inference model (Maaløe et al., 2019). We employ skip connections between latent variables but omit them from the notation for brevity.

3.3. OOD detection

So far, no reliable direct likelihood-based method has been found for fully unsupervised deep generative model OOD detection. A major line of work considers developing new scores that are more reliable than the likelihood. This includes the typicality test presented by Nalisnick et al. (2019b), which is an OOD detection test based on the typicality of a batch of potentially OOD examples. This approach however requires a batch of examples from the same class (OOD or not), which limits its practical applicability. In Ren et al. (2019), the likelihood ratio between a primary model and a background model was shown to be an effective score for OOD detection. However, to train the background model, the in-distribution data is perturbed via a data augmentation technique that is designed with knowledge about the confounding factors between the in-distribution data and the OOD data. Furthermore, it is tuned towards high performance on a known OOD dataset. Serrà et al. (2020) take a similar approach and attribute the failure to detect OOD data to the high influence of the input complexity on the likelihood, choosing a generic lossless compression algorithm as the background model. Although this method gives good results, no single best choice of compression algorithm exists for all types of OOD data, and any particular choice encodes prior knowledge about the data into the detection method. Both these methods can be seen as correcting for low-level features of the OOD data being assigned high model likelihood by using a second model focused exclusively on these features.

Similar to these methods, the majority of the approaches to OOD detection make assumptions about the nature of the OOD data. The assumptions encompass using labels on the in-distribution data (Hendrycks & Gimpel, 2017; Liang et al., 2018; Alemi et al., 2018; Lee et al., 2018; Lakshminarayanan et al., 2017), examples of OOD data (Hendrycks et al., 2019), augmenting in-distribution data to mimic it (Ren et al., 2019), or assuming a certain data type (Serrà et al., 2020).
Any of these assumptions encode implicit biases into the model about the attributes of OOD data which, in turn, might impair performance on truly unknown data examples (unknown unknowns).

While some of these methods achieve very good results on OOD detection with autoregressive models (Oord et al., 2016b; Salimans et al., 2017) and invertible flow-based models (Kingma & Dhariwal, 2018), it was recently shown that they can be much less effective for VAEs (Xiao et al., 2020), highlighting the need for a more reliable OOD score for VAEs. Although VAEs have the same failure cases as autoregressive and flow-based models, the caveat is that the difference in the likelihood is generally not as big and reconstructions of OOD data can be surprisingly good (Xiao et al., 2020). Xiao et al. (2020) alleviate this by refitting the inference network, as previously proposed by Cremer et al. (2018) and Mattei & Frellsen (2018), to a potentially OOD example and measuring the so-called likelihood regret. However, refitting the inference network can be computationally expensive, especially for the large hierarchical VAEs that are used to model complex data (Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2021). Furthermore, this scales poorly to large amounts of potentially OOD examples, as the optimization is done per example.

A few methods have approached OOD detection in a completely unsupervised fashion (Maaløe et al., 2019; Choi et al., 2019; Xiao et al., 2020). The work of Maaløe et al. (2019) is the most related to ours. They introduce BIVA, a deep hierarchy of stochastic latent variables with a top-down and bottom-up inference model, and achieve state-of-the-art likelihood scores. They also provide early results indicating that a looser likelihood bound may have value in OOD detection. In this paper, we provide an explanation of those results and significantly improve upon them.

4. OOD detection with hierarchical VAEs

If the first latent variable in the VAE hierarchy codes for a large part of the low-level features required to reconstruct the input with high accuracy, as exemplified in Figures 1-3, then p_θ(x|z_1) will be high for both in- and out-of-distribution data. Hence, any OOD detection capability based on the ELBO L = E_{q_φ(z|x)}[log p_θ(x|z_1)] − D_KL(q_φ(z|x) || p(z)) from (1) relies on the KL term for OOD detection. For a bottom-up hierarchical VAE, the KL term can be expressed as the hierarchical sum

D_KL(q_φ(z|x) || p_θ(z)) = E_{q_φ(z|x)}[ Σ_{i=1}^{L−1} log q_φ(z_i|z_{i−1})/p_θ(z_i|z_{i+1}) + log q_φ(z_L|z_{L−1})/p_θ(z_L) ],   (4)

with z_0 ≡ x. In general, the absolute log-ratios grow with dim(z_i), as the individual log-probability terms are computed by summing over the dimensionality of z_i. This means that the value of the KL term is dominated by terms where z_i is high-dimensional. Since hierarchical VAEs are generally constructed with a bottleneck-type structure, the terms corresponding to latent variables towards the top of the hierarchy will have a vanishing influence on the value of the KL term. However, as the semantic information most relevant for OOD detection has a tendency to be represented in the top-most latent variables, this makes OOD detection using the regular ELBO difficult, even for state-of-the-art models. This behavior has also been reported by Xiao et al. (2020).
To shift the ELBO from primarily being based on the approximate posterior of the first latent variables to instead focus on the conditional prior, Maaløe et al. (2019) introduced another version of the ELBO defined as

L^{>k} = E_{p_θ(z_{≤k}|z_{>k}) q_φ(z_{>k}|x)}[ log p_θ(x|z) p_θ(z_{>k}) / q_φ(z_{>k}|x) ],   (5)

where k ∈ {0, 1, ..., L−1} (see Appendix D for the derivation). We note that L^{>0} is the regular ELBO (1) and that empirically we always observe that L ≥ L^{>k} for all k, although this need not hold in general. The core idea behind this variation on the ELBO is to sample the k lowest latent variables from the conditional prior, z_1, ..., z_k ∼ p_θ(z_{≤k}|ẑ_{>k}), and only the L − k highest from the approximate posterior, z_{k+1}, ..., z_L ∼ q_φ(z_{>k}|x). Importantly, this has the effect that the data likelihood p(x|z_1) depends on the approximate posterior through a latent variable z_{k+1} different from z_1 for all k > 0. Thereby, the likelihood can be evaluated with a reconstruction from each of the latent variables z_k of the hierarchical VAE. Hence, we can now test how well the input x is reconstructed from each latent variable. The notation L^{>k} highlights that for latent variables z_{>k} the bound is the regular ELBO, while for the latent variables z_{≤k} the bound is evaluated using the (conditional) prior rather than the approximate posterior as the proposal distribution.

For a different interpretation of the bound (5), consider that in an ideal VAE, the KL divergence between q_φ and p_θ is zero. In practice, this indicates posterior collapse (Maaløe et al., 2017), but in the ideal case it would mean that our approximate posterior has perfectly fitted the true model posterior, without collapsing into the prior, such that q_φ(z_i|z_{i−1}) = p_θ(z_i|z_{i+1}) for all i with z_0 ≡ x. In that case, (5) is equal to the ELBO for all values of k, L = L^{>k} for all k.

While the L^{>k} bound provides a score for performing semantic OOD detection, it still relies on the likelihood function (see equation (7) below), which is known to be problematic for OOD detection (section 3.3). To alleviate this, we phrase OOD detection as a likelihood-ratio test of being semantically in-distribution. A standard likelihood-ratio test (Buse, 1982) suggests considering the ratio between the associated likelihoods, which we can approximate on a log scale by the corresponding lower bounds L and L^{>k},

LLR^{>k}(x) = L(x) − L^{>k}(x).   (6)

Since, empirically, L ≥ L^{>k}, the ratio is always positive as is standard for likelihood-ratio tests. A low value of LLR^{>k}(x) means that the ELBO and L^{>k} are almost equally tight for the data. On the contrary, a high value indicates that L^{>k} is looser on the data than the ELBO, which indicates that the data may be OOD.

We can gather further insights about this score if we write the regular ELBO and the L^{>k} bounds in the exact form that includes the intractable KL divergence between the approximate and true posteriors,

L = log p_θ(x) − D_KL( q_φ(z|x) || p_θ(z|x) ),   (7)
L^{>k} = log p_θ(x) − D_KL( p_θ(z_{≤k}|z_{>k}) q_φ(z_{>k}|x) || p_θ(z|x) ).

Subtracting these cancels the two data likelihood terms log p_θ(x), and only the KL divergences from the approximate to the true posterior remain,

LLR^{>k}(x) = − D_KL( q_φ(z|x) || p_θ(z|x) ) + D_KL( p_θ(z_{≤k}|z_{>k}) q_φ(z_{>k}|x) || p_θ(z|x) ).   (8)

Hence, compared to the likelihood bound L^{>k}, this likelihood ratio measures divergence exclusively in the latent space, whereas L^{>k} includes the log p_θ(x) term similar to the ELBO. Therefore, the LLR^{>k} score should be an improved method for semantic OOD detection compared to L^{>k}.
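To make the construction concrete, the sketch below computes L, L^{>1} and LLR^{>1} for a toy two-layer bottom-up hierarchical VAE with fully connected layers. It is a simplified model of our own (assuming binarized inputs), not the HVAE or BIVA configuration used in the experiments; following Appendix D, the mode of q(z_1|x) is forwarded when sampling z_2 for the L^{>1} bound:

```python
import torch
import torch.nn as nn
from torch.distributions import Normal, Bernoulli

def gaussian_mlp(i, o, h=256):
    # Small MLP that outputs the mean and log-variance of a diagonal Gaussian.
    return nn.Sequential(nn.Linear(i, h), nn.ReLU(), nn.Linear(h, 2 * o))

class ToyHVAE(nn.Module):
    """Two-layer bottom-up HVAE: q(z1|x) q(z2|z1) and p(z2) p(z1|z2) p(x|z1)."""
    def __init__(self, x_dim=784, z1_dim=32, z2_dim=16):
        super().__init__()
        self.q_z1, self.q_z2 = gaussian_mlp(x_dim, z1_dim), gaussian_mlp(z1_dim, z2_dim)
        self.p_z1 = gaussian_mlp(z2_dim, z1_dim)
        self.p_x = nn.Sequential(nn.Linear(z1_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))

    def _gauss(self, net, h):
        mu, logvar = net(h).chunk(2, dim=-1)
        return Normal(mu, torch.exp(0.5 * logvar))

    def scores(self, x):  # x assumed binarized, shape (batch, x_dim)
        # Regular ELBO L: both latents sampled from the approximate posterior.
        q1 = self._gauss(self.q_z1, x);  z1 = q1.rsample()
        q2 = self._gauss(self.q_z2, z1); z2 = q2.rsample()
        p2 = Normal(torch.zeros_like(z2), torch.ones_like(z2))
        p1 = self._gauss(self.p_z1, z2)
        log_px = Bernoulli(logits=self.p_x(z1)).log_prob(x).sum(-1)
        elbo = (log_px + p1.log_prob(z1).sum(-1) + p2.log_prob(z2).sum(-1)
                - q1.log_prob(z1).sum(-1) - q2.log_prob(z2).sum(-1))

        # L>1: forward the mode of q(z1|x), sample z2 from q(z2|.), then sample z1 from
        # the conditional prior p(z1|z2) and reconstruct x from that z1 (Eq. (5), k = 1).
        q2_det = self._gauss(self.q_z2, q1.mean)
        z2p = q2_det.rsample()
        p2p = Normal(torch.zeros_like(z2p), torch.ones_like(z2p))
        z1p = self._gauss(self.p_z1, z2p).rsample()
        log_px_prior = Bernoulli(logits=self.p_x(z1p)).log_prob(x).sum(-1)
        l_gt1 = log_px_prior + p2p.log_prob(z2p).sum(-1) - q2_det.log_prob(z2p).sum(-1)

        llr_gt1 = elbo - l_gt1   # Eq. (6): large values indicate OOD inputs
        return elbo, l_gt1, llr_gt1
```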
Finally, it can be noted that if we replace the regular ELBO L in (7) with the strictly tighter importance-weighted bound (Burda et al., 2016),

L_S = E_{z^{(1)}, ..., z^{(S)} ∼ q(z|x)}[ log (1/S) Σ_{s=1}^{S} p(x, z^{(s)}) / q(z^{(s)}|x) ],   (9)

then, in the limit S → ∞, we have L_S → log p_θ(x) and consequently the likelihood ratio reduces to

LLR_S^{>k}(x) → D_KL( p(z_{≤k}|z_{>k}) q(z_{>k}|x) || p(z|x) )  as S → ∞.   (10)

In practice, this limit is well-approximated for a finite S.

5. Experimental setup

Tasks: We follow existing literature (Nalisnick et al., 2019a; Hendrycks et al., 2019) and evaluate our method by setting up OOD detection tasks from FashionMNIST (Xiao et al., 2017) to MNIST (LeCun et al., 1998) and from CIFAR10 (Krizhevsky, 2009) to SVHN (Netzer et al., 2011). For each experiment we train our model on the train split of the former dataset and test its ability to recognize the test split of the latter dataset as OOD from the test split of the former dataset. We use the standard train/test splits for the datasets. More details on the datasets can be found in the Appendix.

Models: For each OOD task, we train a simple bottom-up hierarchical VAE with L stochastic layers, which we will refer to as "HVAE". To alleviate posterior collapse we include skip connections that connect z_i to z_{i+2} for i ∈ {0, ..., L−2}, with z_0 ≡ x, in both the inference and generative models (Dieng et al., 2019), and employ the free bits scheme with λ = 2 (Kingma et al., 2016). We use weight normalization (Salimans & Kingma, 2016) on all weights and residual networks in the deterministic paths. A graphical representation of this model can be seen in Figure 4. We use a Bernoulli output distribution for FashionMNIST/MNIST and a discretized mixture of logistics output distribution (Salimans et al., 2017) for CIFAR10/SVHN. We use L = 3 for grey-scale images and L = 4 for natural images. For CIFAR/SVHN, we also train a BIVA model (Maaløe et al., 2019) with L = 10 and a similar configuration as used by the original paper (code: github.com/larsmaaloee/BIVA and github.com/vlievin/biva-pytorch). We implement our models in PyTorch (Paszke et al., 2017). Full model details are in the Appendix.

Figure 4. The inference and generative models, q_φ and p_θ, for an L = 2 layered bottom-up hierarchical VAE like the one used in our experiments. Dashed lines indicate deterministic skip connections, which are employed in both networks. Skip connections are found to be useful for optimizing latent variable models (Dieng et al., 2019; Maaløe et al., 2019).

Baselines: We group baselines into those that use prior knowledge about OOD data, those that use labels associated with the in-distribution data, and purely unsupervised approaches that do not make such assumptions. Our method falls into the latter category. For more information on each baseline, we refer to the original literature.
Evaluation: Following previous work (Hendrycks & Gimpel, 2017; Hendrycks et al., 2019; Alemi et al., 2018; Ren et al., 2019; Choi et al., 2019), we use the threshold-independent evaluation metrics of area under the receiver operating characteristic (AUROC↑), area under the precision-recall curve (AUPRC↑), and false positive rate at 80% true positive rate (FPR80↓), where the arrow indicates the direction of improvement. Note that these metrics are only computable given examples of OOD data, but faced with truly OOD data (unknown unknowns), there are many ways to select thresholds to use in practice, e.g. as the one that yields a specific tolerable false positive rate on the in-distribution test data (a minimal sketch of such a rule is given at the end of this section). To compute the metrics, we use an equal number of samples from the in-distribution and OOD datasets by including all examples in the smallest of the two sets and randomly sampling equally many from the larger. We compute the LLR^{>k} score with one and with S importance samples, the latter denoted LLR_S^{>k}.

Selection of k: To determine whether an example is OOD in practice, the value of LLR^{>k} is computed on the in-distribution test set for all k and the resulting empirical distribution is used as reference. If, for any value of k, the LLR^{>k} score of a new input differs significantly from the empirical distribution, it is regarded as OOD. If it differs for multiple values of k, the value for which it differs the most is selected. In our experiments, we consider an entire dataset at a time and report the results of LLR^{>k} with the value of k that yielded the highest AUROC↑ for that dataset in a threshold-free manner. In practice, slightly better performance may be achieved by choosing k per example. This would not exclude the use of batching in our method, since LLR^{>k} is computed after the forward pass.

Table 1. Average bits per dimension of different datasets for models trained on FashionMNIST and CIFAR10. For the hierarchical models we include the L^{>k} bounds. The likelihoods of training and test splits of the in-distribution data are in all cases close. Since we train on dynamically binarized FashionMNIST, our bits/dim are smaller than for Glow. As k is increased for the L^{>k} bound, the bound gets looser but the model eventually assigns higher likelihood to the in-distribution data than to the OOD data. Glow refers to Kingma & Dhariwal (2018); Nalisnick et al. (2019a). BIVA refers to our implementation of Maaløe et al. (2019).

Method        Dataset         log p(x)   L>1     L>2     L>3
Trained on FashionMNIST
Glow          FashionMNIST    2.96       -       -       -
Glow          MNIST           1.83       -       -       -
HVAE (ours)   FashionMNIST    0.420      0.476   0.579   -
HVAE (ours)   MNIST           0.317      0.601   0.881   -
Trained on CIFAR10
Glow          CIFAR10         3.46       -       -       -
Glow          SVHN            2.39       -       -       -
HVAE (ours)   CIFAR10         3.74       17.8    54.3    75.7
HVAE (ours)   SVHN            2.62       10.2    64.0    93.9
BIVA (ours)   CIFAR10         3.46       8.74    19.7    37.3
BIVA (ours)   SVHN            2.35       6.62    25.1    59.0
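As noted in the Evaluation paragraph, one simple way to act on the score without any OOD examples is to threshold at a tolerable false positive rate estimated on the in-distribution test set. A minimal sketch with hypothetical helper names:

```python
import numpy as np

def ood_threshold(reference_llr_scores, max_fpr=0.05):
    """Threshold such that at most max_fpr of the in-distribution reference
    scores would be flagged as OOD."""
    return float(np.quantile(reference_llr_scores, 1.0 - max_fpr))

def is_ood(llr_scores, threshold):
    # Flag an example as OOD when its LLR>k score exceeds the reference threshold.
    return np.asarray(llr_scores) > threshold
```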
6. Results

The likelihoods for our trained models are shown in Table 1 alongside baseline results for in-distribution and OOD data. The main results of the paper on the OOD tasks can be seen, along with comparisons to the baseline methods, in Table 2. We note that for all our results, the value of the score (L^{>k} and LLR^{>k}) for the training and test splits of the in-distribution data was observed to have the same empirical distribution up to sampling error, hence yielding an AUROC score of ≈ 0.5, as expected. Results on additional commonly used datasets are found in the Appendix.

Table 2. AUROC↑, AUPRC↑ and FPR80↓ for OOD detection for models trained on FashionMNIST and CIFAR10, using scores on the respective in-distribution test sets as reference. HVAE (ours) refers to our hierarchical bottom-up VAE. BIVA (ours) refers to our implementation of the hierarchical BIVA model (Maaløe et al., 2019). [1] is Ren et al. (2019), [2] is Hendrycks & Gimpel (2017), [3] is Liang et al. (2018), [4] is Alemi et al. (2018), [5] is Lee et al. (2018), [6] is Lakshminarayanan et al. (2017), [7] is Choi et al. (2019), [8] is Xiao et al. (2020), [9] is Hendrycks et al. (2019), [10] is Serrà et al. (2020).

FashionMNIST (in) / MNIST (out)
Method                                        AUROC↑   AUPRC↑   FPR80↓
Use prior knowledge of OOD
  Backgr. contrast. LR (PixelCNN) [1]         0.994    0.993    -
  Backgr. contrast. LR (VAE) [7]              -        -        -
  Binary classifier [1]                       0.455    0.505    -
  p(ŷ|x) with OOD as noise class [1]          0.877    0.871    -
  p(ŷ|x) with calibration on OOD [1]          0.904    0.895    -
  Input complexity (S, Glow) [9]              -        -        -
  Input complexity (S, PixelCNN++) [9]        -        -        -
  Input complexity (S, HVAE) (ours)           -        -        -
Use in-distribution data labels y
  p(ŷ|x) [1, 2]                               0.734    0.702    -
  Entropy of p(y|x) [1]                       0.746    0.726    -
  ODIN [1, 3]                                 0.752    0.763    -
  VIB [4, 7]                                  -        -        -
  Mahalanobis distance, CNN [1]               0.942    0.928    -
  Mahalanobis distance, DenseNet [5]          -        -        -
  Ensemble, 20 classifiers [1, 6]             0.857    0.849    -
No OOD-specific assumptions
  WAIC, 5 models, PixelCNN [1]                0.221    0.401    -
  WAIC, 5 models, Glow [7]                    -        -        -
  Likelihood regret [8]                       -        -        -
  L^{>0} + HVAE (ours)                        0.268    0.363    -
  L^{>1} + HVAE (ours)                        0.593    0.591    -
  L^{>2} + HVAE (ours)                        0.712    0.750    -
  LLR^{>1} + HVAE (ours)                      0.964    0.961    -
  LLR^{>2} + HVAE (ours)                      0.984    0.984    -

CIFAR10 (in) / SVHN (out)
Method                                        AUROC↑   AUPRC↑   FPR80↓
Use prior knowledge of OOD
  Backgr. contrast. LR (PixelCNN) [1]         0.930    0.881    -
  Backgr. contrast. LR (VAE) [8]              -        -        -
  Outlier exposure [9]                        -        -        -
  Input complexity (S, Glow) [10]             -        -        -
  Input complexity (S, PixelCNN++) [10]       -        -        -
  Input complexity (S, HVAE) (ours) [10]      0.833    0.855    -
Use in-distribution data labels y
  Mahalanobis distance [5]                    -        -        -
No OOD-specific assumptions
  Likelihood regret [8]                       -        -        -
  LLR^{>2} + HVAE (ours)                      0.811    0.837    -
  LLR^{>2} + BIVA (ours)                      0.891    0.875    -

Note: the input complexity method of Serrà et al. (2020) performs the best when high likelihoods are assigned to OOD data such that the overlap with in-distribution data is low. Performance is worse when the overlap is high, cf. Serrà et al. (2020, Table 1), as seen with complex images.

We first report the results of the different variations of the L^{>k} bound for OOD detection. We reconfirm the results of Nalisnick et al. (2019a) by observing that our hierarchical latent variable models also assign a higher L^{>0} to the OOD dataset in the FashionMNIST/MNIST and CIFAR10/SVHN cases, resulting in an AUROC↑ inferior to random (Table 2). Switching the in-distribution data for the OOD data in both cases results in correctly detecting the OOD data; an asymmetry also reported by Nalisnick et al. (2019a). Figure 5a shows the density of L^{>0} in bits per dimension (Theis et al., 2016) for the model trained on FashionMNIST when evaluated on the FashionMNIST and MNIST test sets. We observe a high degree of overlap, with less separation of the OOD data compared to similar results for autoregressive and flow-based models, as reported by Xiao et al. (2020).
Figure 5. Empirical densities of FashionMNIST (in-distribution) and MNIST (OOD) scores using the raw likelihood L^{>0} (a), the L^{>2} bound (b), and the LLR^{>1} score (c). All densities are computed using the HVAE model. For the regular likelihood, MNIST is very clearly more likely on average than the FashionMNIST test data, while with the L^{>2} bound separation is better but significant overlap remains. The LLR^{>1} score provides a high degree of separation. Likelihoods are reported in units of the natural log of the number of bits per dimension.

We then evaluate the looser L^{>k} (5) for k ∈ {1, ..., L−1}. Figure 5b shows the result for L^{>2}, which yielded the highest AUROC↑, only slightly better than random. Like Maaløe et al. (2019), we see that increasing the value of k generally leads to improved OOD detection. However, we also observe that the two empirical distributions never cease to overlap. Importantly, depending on the OOD dataset, the amount of remaining overlap can be high, which limits the discriminatory power of the likelihood-based L^{>k} bound. This is in line with the pathological behavior of the raw likelihood of latent variable models when used for OOD detection (Xiao et al., 2020). Since a high degree of overlap also seems present in Maaløe et al. (2019), and we see the same problem for our BIVA model trained on CIFAR10, we do not expect this to be due to the less expressive HVAE.

We now move to the likelihood ratio-based score. We find that LLR^{>k} separates the OOD MNIST data from in-distribution FashionMNIST to a higher degree than the likelihood estimates, as can be seen from the empirical densities of the score in Figure 5c. We note that the likelihood ratio between the ELBO and the L^{>k} bound provides the highest degree of separation of MNIST and FashionMNIST, as measured by the AUROC↑, for a value of k smaller than the maximum. This is not surprising, since the value of k that provides the maximal separation to the reference in-distribution dataset need not be the one for which LLR^{>k} is overall maximal for the OOD dataset. We also visualize the ROC curves resulting from using the LLR^{>k} score for OOD detection on both FashionMNIST/MNIST and CIFAR10/SVHN and compare them to the ROC curves resulting from the different L^{>k} bounds in Figures 6 and 7, respectively. On both datasets we see significantly better discriminatory performance when using the LLR^{>k} score.

Figure 6. ROC curves with AUROC scores for detecting MNIST as OOD with the HVAE model trained on FashionMNIST. A ROC curve is plotted for each of the L^{>k} bounds, including the ELBO, along with one for the best-performing log likelihood-ratio: L^{>0} (AUROC = 0.268), L^{>1} (0.593), L^{>2} (0.712), LLR^{>1} (0.984).

Figure 7. ROC curves with AUROC scores for detecting SVHN as OOD with the BIVA model trained on CIFAR10. A ROC curve is plotted for each of the L^{>k} bounds, including the ELBO, along with one for the best-performing log likelihood-ratio: L^{>0} (AUROC = 0.090), L^{>1} (0.183), L^{>2} (0.719), LLR^{>2} (0.891).

Table 2 shows that BIVA improves upon the HVAE model for OOD detection on CIFAR10, while Table 1 shows that the BIVA model also improves upon the HVAE in terms of likelihood. We hypothesize that models larger than our implementation of BIVA, with better likelihood scores, may perform even better (Maaløe et al., 2019; Vahdat & Kautz, 2020; Child, 2021).
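The AUROC↑ values reported in Table 2 and Figures 6 and 7 can be reproduced from the raw scores with scikit-learn. The helper below is our own sketch; it assumes labeled OOD data, which is used for evaluation only, and can also be used to pick the best-performing k:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def auroc_per_k(scores_in, scores_out):
    """scores_in / scores_out: dicts mapping k -> array of LLR>k scores on the
    in-distribution and OOD test sets. OOD is treated as the positive class."""
    aurocs = {}
    for k in scores_in:
        y_true = np.concatenate([np.zeros(len(scores_in[k])), np.ones(len(scores_out[k]))])
        y_score = np.concatenate([scores_in[k], scores_out[k]])  # higher LLR>k means "more OOD"
        aurocs[k] = roc_auc_score(y_true, y_score)
    return aurocs

# Example: best_k = max(aurocs, key=aurocs.get) selects the k reported in Table 2.
```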
Comparison with baselines: Table 2 summarizes our results compared to baselines based on the commonly used AUROC↑, AUPRC↑ and FPR80↓ metrics. Our method outperforms other generative model-based methods such as WAIC (Choi et al., 2019) with a Glow model and performs similarly to the likelihood regret method of Xiao et al. (2020). Furthermore, our method performs similarly to the background contrastive likelihood ratio method of Ren et al. (2019) on FashionMNIST/MNIST but, contrary to the failure of that method on CIFAR10/SVHN reported by Xiao et al. (2020), our method performs very well on this task too. Our approach outperforms all supervised approaches that use in-distribution labels or synthetic examples of OOD data derived from the in-distribution data, including ODIN (Liang et al., 2018) and the predictive distribution of a classifier p(ŷ|x) trained and evaluated in various ways (see Ren et al. (2019)).

Runtime: For a full evaluation of a single example across all feature levels of a model with L stochastic layers, our method requires L − 1 forward passes through the inference and generative networks as well as computing the likelihood ratio, of which the forward passes are dominant. For a typical forward pass that is linear in the input dimensionality, D, and the number of stochastic layers, L, this amounts to computation of O(DL). Compared to related work that either requires a batch of M > 1 inputs of which either all or none are OOD (Nalisnick et al., 2019b) or cannot be applied to batches due to the required per-example optimization (Xiao et al., 2020), our method is additionally applicable to batches of any size that may consist of both OOD and in-distribution examples, which provides drastic speed-ups via vectorization and parallelization. Furthermore, the method of Xiao et al. (2020) requires refitting the inference network of a VAE, which can be computationally demanding. Compared to the likelihood ratio proposed in Ren et al. (2019), our method requires training only a single model on a single dataset.

7. Discussion

Deep generative models are state-of-the-art density estimators, but the OOD failures reported in recent years have raised concerns about the limitations of such density estimates. Recent work on improving OOD detection has largely sidestepped this concern by relying on additional assumptions that strictly should not be needed for models with explicit likelihoods. While the engineering challenge of building reliable OOD detection schemes is important, it is of more fundamental importance to understand why the naive likelihood test fails. We have provided evidence that low-level features of the neural nets dominate the likelihood, which gives a cause to the why. The fact that a simple score for measuring the importance of semantic features yields state-of-the-art results on OOD detection without access to additional information gives validity to our hypothesis.

The findings from, amongst others, Nalisnick et al. (2019a) and Serrà et al. (2020) have a clear relation to information theory and compression. Semantically complex in-distribution data yields models with diverse low-level feature sets that enable generalization across datasets. Simpler datasets can only yield models with less diverse low-level feature sets compared to complex training data. Hence, there can be an asymmetry where the likelihoods of simple OOD data can be high for a model trained on complex data, but not the other way around.
Loosely put, the minimal number of bits required to losslessly compress data sampled from some distribution is the entropy of the generating process (Shannon, 1948; MacKay, 2003). Townsend et al. (2019) recently showed that VAEs can be used for lossless compression at rates superior to more generic algorithms.

We also note that since the hierarchical VAE is a probabilistic graphical latent variable model, it lends itself very naturally to manipulation at the feature level (Kingma et al., 2014; Maaløe et al., 2016; 2017). This property sets it apart from other generative models that do not explicitly define such a hierarchy of features. This in turn enables reliable OOD detection with our methodology while making no explicit assumptions about the nature of OOD data and only using a single model. This has not been achieved with autoregressive or flow-based models.

8. Conclusion

In this paper we study unsupervised out-of-distribution detection using hierarchical variational autoencoders. We provide evidence that highly generalizable low-level features contribute greatly to estimated likelihoods, resulting in poor OOD detection performance. We proceed to develop a likelihood-ratio based score for OOD detection and define it to explicitly ensure that data must be in-distribution across all feature levels to be regarded as in-distribution. This ratio is mathematically shown to perform OOD detection in the latent space of the model, removing the reliance on the troublesome input-space likelihood. We point out that, contrary to much recent literature on OOD detection, our approach is fully unsupervised and does not make assumptions about the nature of OOD data. Finally, we demonstrate state-of-the-art performance on a wide range of OOD failure cases.

Acknowledgements

References

Alemi, A. A., Fischer, I., and Dillon, J. V. Uncertainty in the Variational Information Bottleneck. arXiv:1807.00906, 2018.
Bengio, Y., Courville, A. C., and Vincent, P. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798-1828, 2013.
Bishop, C. M. Novelty Detection and Neural-Network Validation. IEE Proceedings - Vision, Image and Signal Processing, 141(4):217-222, 1994.
Bulatov, Y. notMNIST dataset, 2011. URL http://yaroslavvb.blogspot.com/2011/09/notmnist-dataset.html.
Burda, Y., Grosse, R., and Salakhutdinov, R. R. Importance Weighted Autoencoders. In International Conference on Learning Representations (ICLR), 2016.
Buse, A. The likelihood ratio, Wald, and Lagrange multiplier tests: An expository note. The American Statistician, 36(3a):153-157, 1982.
Child, R. Very Deep VAEs Generalize Autoregressive Models and Can Outperform Them on Images. In International Conference on Learning Representations (ICLR), 2021.
Choi, H., Jang, E., and Alemi, A. A. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. arXiv:1810.01392, 2019.
Clanuwat, T., Bober-Irizar, M., Kitamoto, A., Lamb, A., Yamamoto, K., and Ha, D. Deep Learning for Classical Japanese Literature. arXiv:1812.01718, 2018.
Cremer, C., Li, X., and Duvenaud, D. Inference Suboptimality in Variational Autoencoders. In International Conference on Machine Learning (ICML), 2018.
DeVries, T. and Taylor, G. W. Learning Confidence for Out-of-Distribution Detection in Neural Networks. arXiv:1802.04865, 2018.
Dieng, A. B., Kim, Y., Rush, A. M., and Blei, D. M. Avoiding latent variable collapse with generative skip models. In Artificial Intelligence and Statistics (AISTATS), 2019.
Fukushima, K. Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position. Biological Cybernetics, 36(4):193-202, 1980.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In International Conference on Learning Representations (ICLR), 2015.
Hendrycks, D. and Gimpel, K. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations (ICLR), 2017.
Hendrycks, D., Mazeika, M., and Dietterich, T. G. Deep anomaly detection with outlier exposure. In International Conference on Learning Representations (ICLR), 2019.
Ho, J., Chen, X., Srinivas, A., Duan, Y., and Abbeel, P. Flow++: Improving Flow-Based Generative Models with Variational Dequantization and Architecture Design. In International Conference on Machine Learning (ICML), 2019.
Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015.
Kingma, D. P. and Ba, J. L. Adam: A Method for Stochastic Optimization. arXiv:1412.6980, 2014.
Kingma, D. P. and Dhariwal, P. Glow: Generative Flow with Invertible 1×1 Convolutions. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Kingma, D. P. and Welling, M. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
Kingma, D. P., Rezende, D. J., Mohamed, S., and Welling, M. Semi-Supervised Learning with Deep Generative Models. In Advances in Neural Information Processing Systems (NeurIPS), 2014.
Kingma, D. P., Salimans, T., Jozefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
Kipf, T. N. and Welling, M. Variational Graph Auto-Encoders. arXiv:1611.07308, 2016.
Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. PhD thesis, University of Toronto, 2009.
Lake, B. M., Salakhutdinov, R., and Tenenbaum, J. B. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332-1338, 2015.
Lakshminarayanan, B., Pritzel, A., and Blundell, C. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
LeCun, Y., Huang, F., and Bottou, L. Learning methods for generic object recognition with invariance to pose and lighting. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), 2004.
LeCun, Y. A., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2323, 1998.
Lee, K., Lee, K., Lee, H., and Shin, J. A Simple Unified Framework for Detecting Out-of-Distribution Samples and Adversarial Attacks. In Advances in Neural Information Processing Systems (NeurIPS), 2018.
Liang, S., Li, Y., and Srikant, R. Enhancing the reliability of out-of-distribution image detection in neural networks. In International Conference on Learning Representations (ICLR), 2018.
Maaløe, L., Sønderby, C. K., Sønderby, S. K., and Winther, O. Auxiliary deep generative models. In International Conference on Machine Learning (ICML), 2016.
Maaløe, L., Fraccaro, M., and Winther, O. Semi-Supervised Generation with Cluster-aware Generative Models. arXiv:1704.00637, 2017.
Maaløe, L., Fraccaro, M., Liévin, V., and Winther, O. BIVA: A Very Deep Hierarchy of Latent Variables for Generative Modeling. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
MacKay, D. J. C. Information theory, inference, and learning algorithms. Cambridge University Press, 2003.
Mattei, P.-A. and Frellsen, J. Refit your encoder when new data comes by. 2018.
Muirhead, R. J. Aspects of multivariate statistical theory, volume 197. John Wiley & Sons, 2009.
Nair, V. and Hinton, G. E. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning (ICML), 2010.
Nalisnick, E., Matsukawa, A., Teh, Y. W., Gorur, D., and Lakshminarayanan, B. Do Deep Generative Models Know What They Don't Know? In International Conference on Learning Representations (ICLR), 2019a.
Nalisnick, E., Matsukawa, A., Teh, Y. W., and Lakshminarayanan, B. Detecting Out-of-Distribution Inputs to Deep Generative Models Using Typicality. arXiv:1906.02994, 2019b.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading Digits in Natural Images with Unsupervised Feature Learning. In NeurIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
Nguyen, A., Yosinski, J., and Clune, J. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015.
Oord, A. v. d., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A., and Kavukcuoglu, K. WaveNet: A Generative Model for Raw Audio. In 9th ISCA Speech Synthesis Workshop, 2016a.
Oord, A. v. d., Kalchbrenner, N., and Kavukcuoglu, K. Pixel Recurrent Neural Networks. In International Conference on Machine Learning (ICML), 2016b.
Paszke, A., Chanan, G., Lin, Z., Gross, S., Yang, E., Antiga, L., and DeVito, Z. Automatic differentiation in PyTorch. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
Ren, J., Liu, P. J., Fertig, E., Snoek, J., Poplin, R., DePristo, M., Dillon, J., and Lakshminarayanan, B. Likelihood Ratios for Out-of-Distribution Detection. In Advances in Neural Information Processing Systems (NeurIPS), 2019.
Rezende, D. J. and Mohamed, S. Variational Inference with Normalizing Flows. In International Conference on Machine Learning (ICML), 2015.
Rezende, D. J., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. In International Conference on Machine Learning (ICML), 2014.
Salimans, T. and Kingma, D. P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with Discretized Logistic Mixture Likelihood and Other Modifications. In International Conference on Learning Representations (ICLR), 2017.
Serrà, J., Álvarez, D., Gómez, V., Slizovskaia, O., Núñez, J. F., and Luque, J. Input complexity and out-of-distribution detection with likelihood-based generative models. In International Conference on Learning Representations (ICLR), 2020.
Shannon, C. E. A Mathematical Theory of Communication. The Bell System Technical Journal, 27:379-423, 1948.
Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder Variational Autoencoders. In Advances in Neural Information Processing Systems (NeurIPS), 2016.
Theis, L., Oord, A. v. d., and Bethge, M. A note on the evaluation of generative models. In International Conference on Learning Representations (ICLR), 2016.
Townsend, J., Bird, T., and Barber, D. Practical Lossless Compression With Latent Variables Using Bits Back Coding. In International Conference on Learning Representations (ICLR), 2019.
Vahdat, A. and Kautz, J. NVAE: A Deep Hierarchical Variational Autoencoder. In Advances in Neural Information Processing Systems (NeurIPS), 2020.
Xiao, H., Rasul, K., and Vollgraf, R. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv:1708.07747, 2017.
Xiao, Z., Yan, Q., and Amit, Y. Likelihood Regret: An Out-of-Distribution Detection Score for Variational Auto-Encoder. In Advances in Neural Information Processing Systems (NeurIPS), 2020.

A. Datasets

Table 3 lists the datasets used in the paper. We use the predefined train/test splits for the datasets. For SmallNORB and Omniglot we resize the original grey-scale images to 28 × 28 with ordinary bilinear interpolation. For each of these datasets, we also create a version where the grey-scale is inverted. We do this because the overall white nature of the images tends to make detecting them as OOD from FashionMNIST artificially easy. The inversion is done via the simple transformation x_inverted = 255 − x_original, since images are encoded as 8-bit unsigned integers.

Table 3. Overview of the used datasets (columns: Dataset, Dimensionality, Examples).

B. Model details

In Table 4 we specify the hyperparameters used when training our models.

B.1. Hierarchical VAE

Our hierarchical VAE (HVAE) model uses bottom-up inference and top-down generative paths as specified in the paper. For grey-scale images, the output is parameterized by a Bernoulli distribution, while for natural images we use a discretized logistic mixture (Salimans et al., 2017). The latent variables are parameterized by stochastic layers that output the mean and log-variance of a diagonal-covariance Gaussian. The prior distribution on the top-most latent is a standard Gaussian.
For grey-scale images, the first latent space is parameterized by a convolutional neural network, with dimensions interpreted as (height × width × latent dimension), while the last two latent variables are parameterized by dense transformations. For natural images, all latent variables z_1, z_2 and z_3 are parameterized by convolutional neural networks.

Each stochastic layer is preceded by a deterministic transformation. For both grey-scale and natural images, each deterministic transformation consists of three residual blocks of the same type used by Maaløe et al. (2019). The structure of a residual block is y = Conv(Act(Conv_s(Act(x)))) + x, where "Conv" refers to a same-padded convolution and "Act" to the activation function. Within a residual block, the first convolution always has stride 1 while the second convolution has stride s (a code sketch is given at the end of this appendix). In a deterministic transformation, any non-unit stride is performed in the third residual block. For grey-scale images, we stride by 2 in the first and second deterministic transformations but not the third. For natural images, we stride by 2 in all three deterministic blocks. In both cases, the first deterministic block uses a kernel size of 5 and the latter two a kernel of size 3. We use the ReLU activation function (Fukushima, 1980; Nair & Hinton, 2010).

Since the benefits and drawbacks of using batch normalization (Ioffe & Szegedy, 2015) in hierarchical VAEs are still the matter of some debate (Sønderby et al., 2016; Vahdat & Kautz, 2020; Child, 2021), we choose to use weight normalization (Salimans & Kingma, 2016) as in other work (Maaløe et al., 2019) and initialize the model using the originally proposed data-dependent initialization. To have the stochastic layers initialize to standard Gaussian distributions (zero mean, unit variance) with this initialization, we select the activation function for the variance as a Softplus,

Softplus(x) = (1/β) log(1 + exp(β x)),

with β = log(2) ≈ 0.69 so that it outputs 1 for x = 0.

Training took approximately two days on a single NVIDIA GTX 1080 Ti graphics card.

B.2. BIVA

For the BIVA model (Maaløe et al., 2019), we use a specification that is very similar to that of the HVAE above, and to that of the original paper. The model has 3 spatial latent variables, with the rest being densely connected, in order to have an architecture similar to the HVAE. The model uses an overall stride of 8, achieved by striding by 2 in the first, fourth and sixth deterministic transformations.

Training took approximately a week on a single NVIDIA GTX 1080 Ti graphics card.

Table 4. Selection of the most important hyperparameters and their settings. See Appendix B for more details.

Hyperparameter            Setting/Range
All
  Optimization            Adam (Kingma & Ba, 2014)
  Learning rate           -
  Batch size              128
  Epochs                  2000
  Free bits per z         2
  Free bits constant      200 epochs
  Free bits annealed      200 epochs
  Activation              ReLU
  Initialization          Data-dependent (Salimans & Kingma, 2016)
HVAE
  Convolution kernels     5-3-3
  Strides                 2-2-2 (natural) / 2-2-1 (grey)
  Warmup anneal period    200 epochs
BIVA
  Convolution kernels     5-3-3-3-3-3-3-3-3-3
  Strides                 2-1-1-2-1-2-1-1-1-1
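A minimal PyTorch sketch of the residual block described in Appendix B.1 and of the Softplus variance activation with β = log 2 is given below. The strided 1×1 convolution on the skip path is our own assumption, since the text does not specify how the residual connection handles non-unit strides:

```python
import math
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Residual block of Appendix B.1: y = Conv(Act(Conv_s(Act(x)))) + x (sketch)."""
    def __init__(self, channels, kernel_size=3, stride=1):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv2d(channels, channels, kernel_size, stride=1, padding=pad)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size, stride=stride, padding=pad)
        self.act = nn.ReLU()
        # Assumption: a strided 1x1 convolution matches the spatial size on the skip path.
        self.skip = nn.Identity() if stride == 1 else nn.Conv2d(channels, channels, 1, stride=stride)

    def forward(self, x):
        return self.conv2(self.act(self.conv1(self.act(x)))) + self.skip(x)

# Softplus with beta = log 2 outputs exactly 1 at input 0, so the stochastic layers
# start out with (approximately) unit variance under data-dependent initialization.
std_activation = nn.Softplus(beta=math.log(2.0))
print(std_activation(torch.zeros(1)))  # tensor([1.])
```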
C.1. Model specification

For an arbitrary hierarchical latent variable model, we have a prior p(z_L) and a generative mapping f : R^d → R^D, such that x = f(z_L) and D > d. Note that we assume f to be deterministic, such that we are effectively working with p(x|z) = δ_{f(z)}(x). This is a limiting assumption, but it allows us to work through the analysis below. For shorthand we will simply write z = z_L.

Let f have a bottleneck architecture, i.e.

    f(z) = f_1(· · · f_{L−1}(f_L(z)) · · ·),                                    (11)

where

    f_i : R^{d_i} → R^{d_{i−1}},    i = L, . . . , 1.                           (12)

Here we use the notation d_0 = D = |x| and d_L = d = |z|, and further assume d_0 ≥ d_1 ≥ . . . ≥ d_{L−1} ≥ d_L, which gives the bottleneck.

Assuming x is such that a corresponding latent variable exists, i.e. that there exists z such that x = f(z), we can write the likelihood of x through a standard change of variables (similar to flow-based models),

    p(x) = p(z) ∏_{i=1}^{L} det(J_i^T J_i)^{−1/2},                              (13)

where J_i is the Jacobian of f_i, i.e.

    J_i = ∂f_i / ∂z_i ∈ R^{d_{i−1} × d_i}.                                      (14)

Here we use the notation that z_i is the representation at layer i. Note that J_i^T J_i is a d_i × d_i symmetric positive semidefinite matrix (determinant ≥ 0).

The log-likelihood can be written as

    log p(x) = log p(z) − (1/2) ∑_{i=1}^{L} log det(J_i^T J_i).                 (15)

By construction of determinants, we can generally expect these determinants to grow with the dimensionality of the matrix: we should expect the determinant of a d × d matrix to be of the order O(λ^d) for some number λ > 1. With that in mind, we should generally expect that

    det(J_i^T J_i) > det(J_{i+1}^T J_{i+1}),                                    (16)

since J_i^T J_i is of larger dimension than J_{i+1}^T J_{i+1} under the bottleneck assumption. If so, we see that the marginal likelihood p(x) will be dominated by det(J_1^T J_1)^{−1/2}, i.e. low-level features have a higher influence on the likelihood than the more important semantic ones.

C.2. The Gaussian case

The previous remarks can be made more precise if we make distributional assumptions on the Jacobians. Here we will assume that the Jacobians of each layer follow a Gaussian distribution. Specifically, we will assume that each entry in J_i is distributed as N(0, σ²). The analysis below extends to nonzero means and more general covariance structures, but this comes at the cost of less transparent notation. In this setting, J_i^T J_i follows a Wishart distribution (in the general setting it would follow a non-central Wishart distribution). Muirhead (2009) tells us that the expected multiplicative contribution of each layer to the likelihood is

    E[ det(J_i^T J_i)^{−1/2} ] = σ^{−d_i} 2^{−d_i/2} Γ_{d_i}((d_{i−1} − 1)/2) / Γ_{d_i}(d_{i−1}/2)
                               = σ^{−d_i} 2^{−d_i/2} Γ((d_{i−1} − d_i)/2) / Γ(d_{i−1}/2),          (17)

where Γ_d is the multivariate Gamma function. Assuming that the increase in dimension between consecutive layers, d_{i−1} − d_i, is constant, we see that (17) goes to zero as d_i goes to infinity, since the Γ function in the denominator grows super-exponentially. This super-exponential growth further implies that the first layers dominate the marginal likelihood p(x). This is also visually evident in Figure 8.

Figure 8. The expected inverse volume change for Gaussian Jacobians (17), shown on a log scale as a function of the layer input dimensionality for different values of σ.
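As a quick numerical illustration of the behavior described by (17), the following Python snippet (an illustrative sketch, not part of the paper's code) compares the telescoped closed-form expectation with a Monte Carlo estimate for Jacobians with i.i.d. Gaussian entries, using only NumPy and SciPy.

```python
import numpy as np
from scipy.special import gammaln


def expected_inverse_volume(d_out: int, d_in: int, sigma: float) -> float:
    """Closed-form E[det(J^T J)^{-1/2}] for J with iid N(0, sigma^2) entries.

    J has shape (d_out, d_in) with d_out > d_in, so J^T J follows a Wishart
    distribution with d_out degrees of freedom. Uses the telescoped form
    sigma^{-d_in} 2^{-d_in/2} Gamma((d_out - d_in)/2) / Gamma(d_out/2).
    """
    log_val = (-d_in * np.log(sigma)
               - 0.5 * d_in * np.log(2.0)
               + gammaln(0.5 * (d_out - d_in))
               - gammaln(0.5 * d_out))
    return float(np.exp(log_val))


def monte_carlo_estimate(d_out: int, d_in: int, sigma: float, n_samples: int = 20000) -> float:
    """Monte Carlo estimate of the same expectation."""
    rng = np.random.default_rng(0)
    vals = []
    for _ in range(n_samples):
        J = sigma * rng.standard_normal((d_out, d_in))
        _, logdet = np.linalg.slogdet(J.T @ J)
        vals.append(np.exp(-0.5 * logdet))
    return float(np.mean(vals))


if __name__ == "__main__":
    for d_in in [2, 4, 8, 16]:
        d_out = d_in + 2  # constant increase in layer dimension
        print(d_in,
              expected_inverse_volume(d_out, d_in, sigma=1.0),
              monte_carlo_estimate(d_out, d_in, sigma=1.0))
```

Running the loop shows the expected contribution shrinking rapidly as the layer dimensionality grows, which is the qualitative behavior plotted in Figure 8.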
Derivation of the L >k bound In this section we present the derivation of L >k and showthat it is a lower bound on the marginal likelihood.First, we consider a two-layered VAE with bottom-up in-ference. We proceed very similarly to the derivation of theregular ELBO and also use Jensen’s inequality. log p ( x ) = log (cid:90) (cid:90) p ( x | z ) p ( z | z ) p ( z ) d z d z (18) = log (cid:90) (cid:90) q ( z | x ) q ( z | x ) p ( x | z ) p ( z | z ) p ( z ) d z d z = log (cid:90) (cid:90) q ( z | x ) p ( z | z ) p ( x | z ) p ( z ) q ( z | x ) d z d z ≥ E p ( z | z ) q ( z | x ) (cid:20) log p ( x | z ) p ( z ) q ( z | x ) (cid:21) ≡ L > . Here, we have introduced the variational distribution q ( z | x ) which, naively, is different from any of the availablevariational distributions q ( z | x ) and q ( z | z ) . However, it’seasy to see that we can simply define q ( z | x ) = q ( z | d ( x )) where d ( x ) = E [ q ( z | x )] . I.e. we compute the distribu-tion over z via the mode of q ( z | x ) . This is possible sincewe exclusively manipulate the variational proposal distribu-tion without altering the generative model p ( x , z ) .In general, the derivation of L >k for an L -layered hierarchi- cal VAE with z = z , . . . , z L is as follows: log p ( x ) = log (cid:90) p ( x | z ) p ( z ) d z (19) = log (cid:90) q ( z >k | x ) q ( z >k | x ) p ( x | z ) p ( z ) d z = log (cid:90) q ( z >k | x ) p ( z ) p ( x | z ) q ( z >k | x ) d z = log (cid:90) q ( z >k | x ) p ( z ≤ k | z >k ) p ( z >k ) p ( x | z ) q ( z >k | x ) d z = log (cid:90) q ( z >k | x ) p ( z ≤ k | z >k ) p ( x | z ) p ( z >k ) q ( z >k | x ) d z ≥ E p ( z ≤ k | z >k ) (cid:20) log q ( z >k | x ) p ( x | z ) p ( z >k ) q ( z >k | x ) (cid:21) ≥ E p ( z ≤ k | z >k ) q ( z >k | x ) (cid:20) log p ( x | z ) p ( z >k ) q ( z >k | x ) (cid:21) ≡ L >k . Similar to the L = 2 case above, we have defined q ( z >k | x ) = q ( z >k | d k ( x )) with d k defined recursively as d k ( x ) = E [ q ( z k | d k − ( x ))] , d ( x ) = x . That is, we simply consider the inference network below z k +1 to be a deterministic encoder and forward pass themode of each preceding variational distribution.Additionally, we obtain p ( z ≤ k | z >k ) p ( z >k ) by splitting p ( z ) = p ( z L ) p ( z L − | z L ) · · · p ( z | z ) at index k . Importantly, we then evaluate p ( z >k ) = p ( z L ) p ( z L − | z L ) · · · p ( z k +1 | z k +2 ) with samples from q ( z >k | x ) while p ( z ≤ k | z >k ) = p ( z k | z k +1 ) p ( z k − | z k ) · · · p ( z | z ) is evaluated for z k with z k +1 ∼ q ( z >k | x ) and for z In this research we choose model parameterizations relyingon bottom-up inference (Burda et al., 2016), q φ ( z | x ) = q φ ( z | x ) (cid:81) Li =2 q φ ( z i | z i − ) . (22)We do this because bottom-up inference enables the modelto learn covariance between the latent variables in the hierar-chy. In the inference model, any latent variable is dependenton the latent variables below it in the hierarchy and, impor-tantly, the top most latent variable is dependent on all otherlatent variables.In contrast, a top-down inference model (Sønderby et al.,2016) has a topmost latent variable z L that is independentof the other latent variables and is directly given by x . q φ ( z | x ) = q φ ( z L | x ) (cid:81) i = L − q φ ( z i | z i +1 ) . (23)This, in essence, makes z L a mean-field approximationwithout any covariance structure tying it to the other latentvariables, Cov ( z L,i , z k,j ) = 0 for k < L . 
F. Bottom-up versus top-down inference

In this research we choose model parameterizations relying on bottom-up inference (Burda et al., 2016),

    q_φ(z|x) = q_φ(z_1|x) ∏_{i=2}^{L} q_φ(z_i|z_{i−1}).                                        (22)

We do this because bottom-up inference enables the model to learn covariance between the latent variables in the hierarchy. In the inference model, any latent variable is dependent on the latent variables below it in the hierarchy and, importantly, the top-most latent variable is dependent on all other latent variables.

In contrast, a top-down inference model (Sønderby et al., 2016) has a top-most latent variable z_L that is independent of the other latent variables and is inferred directly from x,

    q_φ(z|x) = q_φ(z_L|x) ∏_{i=1}^{L−1} q_φ(z_i|z_{i+1}).                                      (23)

This, in essence, makes z_L a mean-field approximation without any covariance structure tying it to the other latent variables, Cov(z_{L,i}, z_{k,j}) = 0 for k < L. Furthermore, since the approximate posterior (and the prior) typically have diagonal covariance, z_L is also mean-field within its own elements, Cov(z_{L,i}, z_{L,j}) = 0 for i ≠ j.

We hypothesize that the covariance of the latent variables towards the top of the hierarchy with the other latent variables is important for learning semantic representations. However, top-down inference models are easier to optimize, as has recently been demonstrated (Sønderby et al., 2016; Vahdat & Kautz, 2020; Child, 2021).

In the following, we inspect the differences between the ELBO used for bottom-up inference and the ELBO used for top-down inference and show that it is not generally possible to decompose the total KL divergence into separate KL divergences per latent variable. Specifically, for top-down inference it is possible to obtain a KL divergence at the top-most latent variable and an expectation of a KL divergence for the other latent variables. For bottom-up inference, the resulting terms are no longer KL divergences, except at the top-most latent variable.

We ask whether models relying on top-down inference are impeded in their use for semantic OOD detection, or whether they still learn to assign a more semantic representation to the top-most variables simply due to the flexibility of the deterministic neural network layers. This remains an open research question.

F.1. Bottom-up inference

By splitting up the expectation, we can write the ELBO of a two-layer bottom-up hierarchical VAE as

    log p(x) ≥ E_{q(z_1,z_2|x)}[ log p(x|z_1) ]                                                (24)
             + E_{q(z_1,z_2|x)}[ log p(z_1|z_2) − log q(z_1|x) ]
             + E_{q(z_1,z_2|x)}[ log p(z_2) − log q(z_2|z_1) ].

We can write out the expectations in order to derive the KL divergence terms of the bottom-up ELBO:

    log p(x) ≥ ∫∫ q(z_1|x) q(z_2|z_1) log p(x|z_1) dz_2 dz_1                                   (25)
             + ∫ q(z_1|x) ∫ q(z_2|z_1) log [ p(z_1|z_2) / q(z_1|x) ] dz_2 dz_1
             + ∫ q(z_1|x) ∫ q(z_2|z_1) log [ p(z_2) / q(z_2|z_1) ] dz_2 dz_1.

From the above we can see that, since the decomposition is in reverse order, we cannot derive a KL divergence for the second term. This holds in general for L-layered models for the latent variables z_1, . . . , z_{L−1}:

    log p(x) ≥ E_{q(z_1,z_2|x)}[ log p(x|z_1) ]                                                (26)
             + E_{q(z_1|x)}[ E_{q(z_2|z_1)}[ log p(z_1|z_2) / q(z_1|x) ] ]
             + E_{q(z_1|x)}[ −D_KL[ q(z_2|z_1) || p(z_2) ] ].

F.2. Top-down inference

By splitting up the expectation, we can write the ELBO of a two-layer top-down hierarchical VAE as

    log p(x) ≥ E_{q(z_1,z_2|x)}[ log p(x|z_1) ]                                                (27)
             + E_{q(z_1,z_2|x)}[ log p(z_2) − log q(z_2|x) ]
             + E_{q(z_1,z_2|x)}[ log p(z_1|z_2) − log q(z_1|z_2) ].

We can write out the expectations in order to derive the KL divergence terms:

    log p(x) ≥ ∫∫ q(z_2|x) q(z_1|z_2) log p(x|z_1) dz_1 dz_2                                   (28)
             + ∫ q(z_2|x) log [ p(z_2) / q(z_2|x) ] dz_2
             + ∫ q(z_2|x) ∫ q(z_1|z_2) log [ p(z_1|z_2) / q(z_1|z_2) ] dz_1 dz_2.
The KL divergence terms can now easily be identified:

    log p(x) ≥ E_{q(z_1,z_2|x)}[ log p(x|z_1) ]                                                (29)
             − D_KL[ q(z_2|x) || p(z_2) ]
             − E_{q(z_2|x)}[ D_KL[ q(z_1|z_2) || p(z_1|z_2) ] ].

Note that the KL divergence for the lower latent variable is not exact in the sense that it is evaluated in expectation over samples of the latent variable it is conditioned on. An exact solution can only be derived if the latent variables z are all conditionally independent. However, this comes at the cost of not learning a covariance structure.

G. Additional results

We provide additional results for a model trained on FashionMNIST in Table 7, a model trained on MNIST in Table 8, a model trained on CIFAR10 in Table 6, and a model trained on SVHN in Table 5.

We note that while the likelihood is highly unreliable across the datasets, the proposed log likelihood-ratio score is consistent and always allows correct OOD detection with high AUROC. A short sketch of how these threshold-free metrics can be computed from raw scores is given after the result tables.

OOD dataset   Metric   AUROC↑   AUPRC↑   FPR80↓
Trained on SVHN
CIFAR10       L>       0.992    0.993
CIFAR10       L>       0.988    0.990
CIFAR10       L>       0.746    0.756
CIFAR10       LLR>     0.939    0.950
SVHN          L>       0.599    0.587
SVHN          L>       0.555    0.543
SVHN          L>       0.403    0.431
SVHN          LLR>     0.489    0.484

Table 5. Additional results for the HVAE model trained on SVHN. All results computed with 1000 importance samples.

OOD dataset   Metric   AUROC↑   AUPRC↑   FPR80↓
Trained on CIFAR10
SVHN          L>
SVHN          L>
SVHN          L>
SVHN          LLR>

Table 6. Additional results for the HVAE model trained on CIFAR10. All results computed with 1000 importance samples.

OOD dataset   Metric   AUROC↑   AUPRC↑   FPR80↓
Trained on FashionMNIST
MNIST         L>
MNIST         L>
MNIST         L>
MNIST         LLR>

Table 7. Additional results for the HVAE model trained on FashionMNIST. All results computed with 1000 importance samples.

OOD dataset              Metric   AUROC↑   AUPRC↑   FPR80↓
Trained on MNIST
FashionMNIST             L>       1.000    1.000
FashionMNIST             L>       1.000    1.000
FashionMNIST             L>       0.981    0.983
FashionMNIST             LLR>     0.999    0.999
notMNIST                 L>       1.000    1.000
notMNIST                 L>       1.000    1.000
notMNIST                 L>       1.000    1.000
notMNIST                 LLR>     1.000    0.999
KMNIST                   L>       1.000    1.000
KMNIST                   L>       1.000    1.000
KMNIST                   L>       0.987    0.987
KMNIST                   LLR>     0.999    0.999
Omniglot28x28            L>       1.000    1.000
Omniglot28x28            L>       1.000    1.000
Omniglot28x28            L>       1.000    1.000
Omniglot28x28            LLR>     1.000    1.000
Omniglot28x28Inverted    L>       0.862    0.902
Omniglot28x28Inverted    L>       0.923    0.943
Omniglot28x28Inverted    L>       0.749    0.691
Omniglot28x28Inverted    LLR>     0.944    0.953
SmallNORB28x28           L>       1.000    1.000
SmallNORB28x28           L>       1.000    1.000
SmallNORB28x28           L>       1.000    1.000
SmallNORB28x28           LLR>     1.000    1.000
SmallNORB28x28Inverted   L>       1.000    1.000
SmallNORB28x28Inverted   L>       1.000    1.000
SmallNORB28x28Inverted   L>       0.977    0.980
SmallNORB28x28Inverted   LLR>     0.985    0.987
MNIST                    L>       0.488    0.486
MNIST                    L>       0.469    0.469
MNIST                    L>       0.514    0.505
MNIST                    LLR>     0.515
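For completeness, the following is a small sketch (not taken from the paper's code) of how the three threshold-free metrics reported in Tables 5–8 can be computed from raw OOD scores with scikit-learn. It assumes the convention that larger scores indicate out-of-distribution inputs; if the opposite convention is used, the scores can simply be negated. FPR80 is taken here as the false positive rate at the threshold where the true positive rate first reaches 80%.

```python
import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score, roc_curve


def ood_metrics(scores_in: np.ndarray, scores_out: np.ndarray) -> dict:
    """AUROC, AUPRC and FPR80 for OOD detection.

    Convention assumed here: OOD examples are the positive class and larger
    scores mean "more likely OOD".
    """
    y_true = np.concatenate([np.zeros(len(scores_in)), np.ones(len(scores_out))])
    y_score = np.concatenate([scores_in, scores_out])

    auroc = roc_auc_score(y_true, y_score)
    auprc = average_precision_score(y_true, y_score)

    # FPR80: false positive rate at the first threshold reaching 80% TPR.
    fpr, tpr, _ = roc_curve(y_true, y_score)
    fpr80 = float(fpr[np.argmax(tpr >= 0.8)])

    return {"AUROC": float(auroc), "AUPRC": float(auprc), "FPR80": fpr80}


# Example usage with dummy scores:
# in_scores = np.random.normal(0.0, 1.0, size=1000)
# out_scores = np.random.normal(1.5, 1.0, size=1000)
# print(ood_metrics(in_scores, out_scores))
```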