Likelihood Assignment for Out-of-Distribution Inputs in Deep Generative Models is Sensitive to Prior Distribution Choice
Ryo Kamoi, Kei Kobayashi
Keio University, Japan
[email protected], [email protected]
Abstract
Recent work has shown that deep generative models assign higher likelihood to out-of-distribution inputs than to training data. We show that a factor underlying this phenomenon is a mismatch between the nature of the prior distribution and that of the data distribution, a problem found in widely used deep generative models such as VAEs and Glow. While a typical choice for a prior distribution is a standard Gaussian distribution, properties of distributions of real data sets may not be consistent with a unimodal prior distribution. This paper focuses on the relationship between the choice of a prior distribution and the likelihoods assigned to out-of-distribution inputs. We propose the use of a mixture distribution as a prior to make likelihoods assigned by deep generative models sensitive to out-of-distribution inputs. Furthermore, we explain the theoretical advantages of adopting a mixture distribution as the prior, and we present experimental results to support our claims. Finally, we demonstrate that a mixture prior lowers the out-of-distribution likelihood with respect to two pairs of real image data sets: Fashion-MNIST vs. MNIST and CIFAR-10 vs. SVHN.
1. Introduction
Out-of-distribution detection is an important area of study that has attracted considerable attention [28, 11, 21, 31] to improve the safety and reliability of machine learning systems. Detection methods based on density estimation using a parametric model have been studied for low-dimensional data [28], and deep generative models seem to be a reasonable choice when dealing with high-dimensional data. However, recent work [23, 12, 31, 24, 5] has shown that deep generative models such as VAEs [18], PixelCNN [34], and flow-based models [8, 16] cannot distinguish training data from out-of-distribution inputs in terms of the likelihood. For instance, deep generative models trained on Fashion-MNIST assign higher likelihoods to MNIST than to Fashion-MNIST, and those trained on CIFAR-10 assign higher likelihood to SVHN than to CIFAR-10 [23]. Methods for mitigating this problem have been proposed from various perspectives [12, 2, 5, 24].

We focus on the influence of the prior distribution of deep generative models on the likelihood assigned to out-of-distribution data. Although the typical choice is a standard normal distribution, various studies have analyzed alternatives [7, 4, 33, 35]. However, present work mainly focuses on the representative ability and the likelihood assigned to in-distribution data when evaluating prior distributions. To the best of our knowledge, no existing work has analyzed the effect that the prior distribution has on the likelihood assigned to out-of-distribution inputs. Here, we consider data sets that can be naturally partitioned into clusters, so the underlying distribution can be approximated by a multimodal distribution with modes apart from each other. This assumption is reasonable for many data sets found in the wild, such as Fashion-MNIST, which contains different types of images, including T-shirts, shoes, and bags.
If a unimodal prior distribution is used to train generative models on such data sets, the models are forced to learn the mapping between unimodal and multimodal distributions. We consider this inconsistency an important factor underlying the assignment of high likelihood to out-of-distribution areas.

We use untrainable mixture prior distributions and manually allocate similar data to each component before training by using labels of data sets or k-means clustering. Under these conditions, the models trained on Fashion-MNIST successfully assign lower likelihoods to MNIST. Our approach also lowers the likelihoods assigned to SVHN by models trained on CIFAR-10. We provide three explanations for our observations. First, as mentioned above, a multimodal prior distribution can alleviate the inconsistency between a prior and a data distribution, which is a possible factor underlying the out-of-distribution problem. Second, allocating similar data to each component can reduce the possibility of accidentally assigning undesirable out-of-distribution points to high likelihood areas. Our second order analysis can theoretically justify this intuition in a manner similar to the work of Nalisnick et al. [23]. Third, out-of-distribution points are forced out of high likelihood areas of the prior distribution when a multimodal prior is used. Somewhat surprisingly, the out-of-distribution phenomenon still occurs when a model with a unimodal prior is trained only on data that would be allocated to one component in the multimodal case. This is a novel observation that motivates further investigation of designing the latent variable space to mitigate the out-of-distribution phenomenon.
2. Related Work
Our work is directly motivated by the recent observation that deep generative models can assign higher likelihoods to out-of-distribution inputs [23, 5]. The use of prior distributions has been studied independently of this line of work.
Although model likelihood is often used to evaluate deep generative models, Theis et al. [32] showed that high likelihood is neither sufficient nor necessary for models to generate high-quality images. Remarkably, Nalisnick et al. [23] reported that deep generative models such as VAEs, flow-based models, and PixelCNN can assign higher likelihoods to out-of-distribution inputs. Similar phenomena have also been reported in parallel studies [5, 12].

Solutions have been proposed from various perspectives. Hendrycks et al. [12] proposed "outlier exposure", a technique that uses carefully chosen outlier data sets during training to lower the likelihood assigned to out-of-distribution inputs. Bütepage et al. [2] focused on VAEs and reported that the methods for evaluating likelihood and the assumption of a visual distribution on pixels influence the likelihood assigned to out-of-distribution inputs. Another line of study is to use alternative metrics. Choi et al. [5] proposed using the Watanabe-Akaike Information Criterion (WAIC) as an alternative. Nalisnick et al. [24] hypothesized that out-of-distribution points are not located in the model's "typical set", and thus proposed the use of a hypothesis test to check whether an input resides in the model's typical set.
A typical choice for a prior distribution for deep generative models such as VAEs and flow-based models is a standard Gaussian distribution. However, various studies have proposed the use of different alternatives. One line of study selects more expressive prior distributions, such as multimodal distributions [15, 7, 33, 25, 14], stochastic processes [25, 10, 3], and autoregressive models [4, 35]. Another option is to use discrete latent variables [29, 35]. Previous work on the choice of the prior distribution for deep generative models has focused on the representative ability, natural fit to data sets, and the likelihood or reconstruction of in-distribution inputs. To the best of our knowledge, no previous study has focused on the relationships between the prior distribution and the likelihood assigned to out-of-distribution data.

Figure 1: Motivation for using a multimodal prior distribution from a topological point of view. If the prior distribution is mapped to a distribution with a different topology, the mapped distribution will inevitably have undesirable high likelihood areas. The black and red areas represent the typical sets of the prior and the data distribution, respectively. The gray and yellow areas represent high likelihood areas of the prior and the data distribution, respectively. While the distributions are shown in two dimensions in this figure, this inconsistency between high likelihood areas and typical sets is a problem observed in high-dimensional data.
3. Motivation
In this section, we discuss the theoretical motivations for using a multimodal prior distribution: topology mismatch and second order analysis. On a related note, we have observed that a multimodal prior distribution can force out-of-distribution points out of high likelihood areas. We explain this effect in Section 5.3.
We focus on data sets that have "clusters", and adopt an assumption that the underlying distribution can be approximated as a multimodal distribution with components located far away from each other. We analyze deep generative models by approximating them as topology-preserving invertible mappings between a prior distribution and a data distribution. Nalisnick et al. [24] focused on the "typical set" [6] of deep generative models and the data distribution. As suggested by Nalisnick et al. [24], here we assume that deep generative models learn mappings from the typical set of the prior distribution to the typical set of the data distribution. Figure 1 visualizes the intuition of the mappings from a bimodal data distribution to two different types of prior distributions under our assumptions. If the bimodal data distribution is mapped to a unimodal prior distribution, we cannot eliminate the possibility of the model mapping out-of-distribution inputs to the typical set or high likelihood areas of the prior distribution. We will refer to this issue as the topology mismatch problem. This simple analysis explains the out-of-distribution phenomenon and the results of prior work [24], implying that out-of-distribution inputs can even reside in the typical set. By contrast, if a prior distribution is topologically consistent with the data distribution, there exists a mapping that decreases the possibility of the out-of-distribution phenomenon. Note that we cannot say that the modification of a prior distribution can single-handedly solve the problem, as the probability density of latent variables on a prior distribution is not the only factor influencing the likelihood of deep generative models such as VAEs and Glow. In addition, it has been reported that deep generative models trained on similar images can generate dissimilar images [27], and thus our analysis on topology mismatch cannot explain this result.

Figure 2: Visualization of the topology mismatch problem on two-dimensional Gaussian mixture data. (a, b) Contours of the log-likelihoods assigned by flow-based generative models using a standard Gaussian prior and a bimodal Gaussian mixture prior. The 10 contour lines in the images range from -10 to -1. The model with a standard Gaussian prior assigns high likelihoods outside the high probability areas of the true distribution. (c) Training data (blue) and out-of-distribution inputs (orange) used in this experiment.
However, we later experimentally show that the choice of the prior distribution nonetheless has a significant influence on the likelihoods assigned to out-of-distribution inputs (Section 5).

To justify our analysis, we conduct experiments on some simple artificial data sets. Figure 2 shows the likelihoods assigned by flow-based deep generative models trained on points sampled from a bimodal Gaussian mixture distribution. We use a simple model architecture with four affine coupling layers and reverse the features after each layer. We compare a unimodal Gaussian prior and a bimodal Gaussian mixture prior. Figure 2 shows that the contours of the log-likelihood assigned by the model using a standard Gaussian prior distribution have high likelihood areas outside the region where data points reside. Because the prior distribution is mapped to a distribution with a different topology, the mapped distribution will inevitably have undesirable high likelihood areas. By contrast, the contours of the model using a Gaussian mixture prior successfully separate the two modes, and do not have high likelihood areas in out-of-distribution areas. To show that a model with a standard Gaussian prior can assign high likelihood to out-of-distribution inputs even in the low-dimensional case, we compare the likelihoods assigned to out-of-distribution inputs that are points sampled from a Gaussian distribution with mean zero and variance . . As shown in Figure 2c, the out-of-distribution inputs have minimal overlap with the in-distribution data. However, the mean of the log-likelihoods assigned by the model using a standard Gaussian prior to in-distribution inputs is − . , which is similar to the log-likelihood assigned to out-of-distribution inputs ( − . ). By contrast, the mean of the log-likelihoods assigned to out-of-distribution inputs by the model using a multimodal prior is much lower ( − . ).
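The prior-side intuition behind Figure 2 can be sketched numerically: a standard Gaussian prior places its highest density at the origin, which lies between two well-separated modes, while a uniform-weight mixture prior heavily penalizes that region. The following NumPy sketch uses illustrative modes at (±5, 0) with unit variance; these values are not the exact settings of our experiments.

```python
import numpy as np
from scipy.special import logsumexp

def log_gauss(x, mu, var=1.0):
    """Log-density of an isotropic Gaussian N(mu, var * I)."""
    x, mu = np.atleast_2d(x), np.asarray(mu, dtype=float)
    d = x.shape[-1]
    return -0.5 * (d * np.log(2 * np.pi * var)
                   + ((x - mu) ** 2).sum(-1) / var)

def log_mixture(x, mus, var=1.0):
    """Log-density of a uniform-weight Gaussian mixture (logsumexp over components)."""
    comps = np.stack([log_gauss(x, mu, var) for mu in mus])  # shape (K, N)
    return logsumexp(comps, axis=0) - np.log(len(mus))

mus = [np.array([-5.0, 0.0]), np.array([5.0, 0.0])]
origin = np.zeros((1, 2))        # point between the two modes
mode = np.array([[5.0, 0.0]])    # one of the modes

lp_uni_origin = log_gauss(origin, np.zeros(2))   # ≈ -1.84: the unimodal prior's most likely point
lp_uni_mode = log_gauss(mode, np.zeros(2))       # ≈ -14.34
lp_mix_origin = log_mixture(origin, mus)         # ≈ -14.34: the mixture penalizes the gap
lp_mix_mode = log_mixture(mode, mus)             # ≈ -2.53
```

Under the unimodal prior, the region between the modes is the most likely region of the latent space, which is exactly where undesirable high likelihood areas appear after mapping; the mixture prior removes this effect.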
As the dimensionality of the data increases, this phenomenon becomes more pronounced; however, using a multimodal distribution as a prior can significantly alleviate this problem. Further details are presented in Appendix C.

Nalisnick et al. [23] provided a second order analysis with implications consistent with their experimental observations, although they made some strong assumptions. One implication of their analysis is that deep generative models with unimodal prior distributions assign higher likelihood if out-of-distribution images have lower variance over image pixels. However, since they use a unimodal prior distribution, their analysis does not apply here. To explain why our proposition may help, we perform a similar analysis under assumptions corresponding to our models. Although we still adopt some strong assumptions and apply coarse approximations in a similar manner as the original analysis, our analysis provides an intuitive explanation for our experimental results. The value we are interested in evaluating is

$$\mathbb{E}_q[\log p(x; \theta)] - \mathbb{E}_{p^*}[\log p(x; \theta)] \quad (1)$$

where $p$ is a given generative model, $q$ is the adversarial distribution (out-of-distribution), and $p^*$ is the training distribution. If this value is positive, the adversarial distribution is assigned a higher likelihood by the generative model. In the following analysis, we assume that $p$ reasonably approximates $p^*$. Note that the analysis for unimodal prior models suggests that $q$ can be assigned higher likelihood even if $p$ perfectly approximates $p^*$.

Nalisnick et al. [23] approximate the probability distribution function of the given generative model $p$ as

$$\log p(x; \theta) \simeq \log p(x_0; \theta) + \nabla_{x_0} \log p(x_0; \theta)^T (x - x_0) + \frac{1}{2} \mathrm{Tr}\left\{ \nabla^2_{x_0} \log p(x_0; \theta) \, (x - x_0)(x - x_0)^T \right\},$$

which is equivalent to assuming that the generative model can be approximated with a Gaussian distribution.
In this work, however, we focus on data sets with an underlying distribution which can be approximated by a mixture distribution. Therefore, we assume that $p$ can be approximated as $\log p(x; \theta) \simeq \log \frac{1}{K} \sum_{i=1}^{K} p_i(x; \theta)$, where $p_i$ corresponds to each component approximated by a Gaussian distribution. We assume that each component of the generative model $p_i(x; \theta)$ corresponds to a component of the prior distribution $p_i(z; \psi)$.

Here, we adopt the assumption that the generative model is constant-volume Glow (CV-Glow), as is done in [23]. The derivation and detailed assumptions are given in Appendix A. Finally, we derive the following formula:

$$\begin{aligned}
&\mathbb{E}_q[\log p(x; \theta)] - \mathbb{E}_{p^*}[\log p(x; \theta)] \\
&\quad\simeq -\frac{1}{2\sigma_\psi} \sum_{c=1}^{C} \left( \prod_{l=1}^{L} \sum_{j=1}^{C_l} u_{l,c,j} \right)^2 \sum_{h,w} \sum_{i=1}^{K} \left( w_i \, \sigma_{\mathcal{D}_i, h,w,c} - w^*_i \, \sigma_{\mathcal{D}^*_i, h,w,c} \right) \\
&\quad\leq -\frac{1}{2\sigma_\psi} \sum_{c=1}^{C} \left( \prod_{l=1}^{L} \sum_{j=1}^{C_l} u_{l,c,j} \right)^2 \sum_{h,w} \left( \sigma_{\mathcal{D}_{\min}, h,w,c} - \sigma_{\mathcal{D}^*_{\max}, h,w,c} \right), \quad (2)
\end{aligned}$$

where $\sigma_\psi$ is the variance of the prior distribution (we assume that all the components have identical variance), $\mathcal{D}^*_i$ and $\mathcal{D}_i$ correspond to the in-distribution and out-of-distribution data allocated to the $i$-th component, and $w^*_i$, $w_i$ are the ratios of data allocated to the $i$-th component, satisfying $\sum_{i=1}^{K} w^*_i = 1$ and $\sum_{i=1}^{K} w_i = 1$. $u_{l,c,j}$ is the weight of the $l$-th 1x1 convolution, which is fixed for any inputs. Further, $h$ and $w$ index the input spatial dimensions, $c$ indexes the input channel dimensions, $l$ indexes the series of flows, and $j$ indexes the column dimensions of the $C_l \times C_l$ kernel. $\sigma_{\mathcal{D}_i, h,w,c}$ and $\sigma_{\mathcal{D}^*_i, h,w,c}$ are diagonal elements of $\Sigma_{\mathcal{D}_i} = \mathbb{E}[(x - \bar{x}_i)(x - \bar{x}_i)^T \mid \mathcal{D}_i]$ and $\Sigma_{\mathcal{D}^*_i} = \mathbb{E}[(x - \bar{x}_i)(x - \bar{x}_i)^T \mid \mathcal{D}^*_i]$, where $\bar{x}_i$ is the elementwise mean of the images generated from the $i$-th component, and the two matrices are assumed to be diagonal. $\mathcal{D}_{\min}$ and $\mathcal{D}^*_{\max}$ are chosen so the final expression is maximized.
Expanding the formula using CV-Glow may not seem reasonable, as we do not use CV-Glow in our experiments. However, Nalisnick et al. [23] reported that the out-of-distribution phenomenon occurs even on CV-Glow, one of the simplest deep generative models, as with many other more complex deep generative models such as general Glow and VAEs. Therefore, it is worth considering CV-Glow to analyze the problem of general deep generative models.

Roughly speaking, we can say that if $\sigma_{\mathcal{D}_{\min}}$ takes smaller values than $\sigma_{\mathcal{D}^*_{\max}}$, the likelihood assigned to out-of-distribution data can be larger than that assigned to in-distribution data. However, if this is the case, it indicates that one of the out-of-distribution modes has a mean that is close to the mean of one of the modes of the generative model with small variance. If out-of-distribution data satisfies this condition, such a mode can no longer be considered out-of-distribution, as inputs corresponding to it must be similar to the images corresponding to the mode of the generative model. Note that a mode of the generative model has a small variance and contains similar images under our assumptions. By contrast, the analysis by Nalisnick et al. [23] assumed that the data distribution can be approximated by a unimodal Gaussian distribution with possibly large variance. Therefore, low-variance out-of-distribution data with mean identical to in-distribution data can contain completely different images.

Our analysis indicates that the squared distance from the mean of each mode is an important factor of likelihood assignment. We later show that our experimental results are consistent with this analysis. Note that this analysis does not provide an exhaustive explanation for our results, as our experiments show that the squared distance is not the only important factor underlying the likelihood assigned to out-of-distribution inputs (Section 5.3).
However, our simple analysis provides an intuitive interpretation of our experimental results, similar to the suggestion from the analysis by Nalisnick et al. [23].
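The inequality step in Eq. (2) relies only on the weights forming convex combinations ($\sum_i w_i = \sum_i w^*_i = 1$): the inner sum over components is bounded below by the extreme-variance pair, and the negative prefactor turns this into an upper bound. A small numeric sanity check with made-up weights and per-component variances (all values here are illustrative, not taken from our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
K = 5
w_q = rng.dirichlet(np.ones(K))   # out-of-distribution component weights, sum to 1
w_p = rng.dirichlet(np.ones(K))   # in-distribution component weights, sum to 1
s_q = rng.uniform(0.1, 2.0, K)    # stand-ins for sigma_{D_i} per component
s_p = rng.uniform(0.1, 2.0, K)    # stand-ins for sigma_{D*_i} per component

# Inner sum of Eq. (2) for one (h, w, c), and its extreme-pair lower bound.
inner = (w_q * s_q).sum() - (w_p * s_p).sum()
bound = s_q.min() - s_p.max()     # sigma_{D_min} - sigma_{D*_max}

# Since w_q and w_p are convex combinations, inner >= bound always holds;
# multiplying by the negative prefactor flips this into the "<=" of Eq. (2).
assert inner >= bound
```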
4. Proposed Model
We replace the prior distributions of deep generative models with mixture distributions $\sum_{i=1}^{K} p_i / K$ that are not trainable, and we assume that all components are uniformly weighted. Although some previous studies have performed clustering using VAEs with a trainable multimodal prior distribution [7, 33], we manually assign each input to a component of the prior distribution before training. We simply use the labels of the data sets or apply k-means clustering on the training data to decide which components to assign to. During training, the likelihood for each input is evaluated using a different unimodal prior distribution $p_i$ (using a different index $i$ for each input), which is the component of the multimodal prior distribution assigned to each input. The test likelihood is evaluated on a mixture prior distribution $\sum_{i=1}^{K} p_i / K$ without the component assignment used during training.

Figure 3: Probability density functions of a standard Gaussian distribution (blue) and a generalized Gaussian distribution with parameters $\alpha = \sqrt{\Gamma(1/\beta)/\Gamma(3/\beta)}$, $\beta = 4$ (orange).

We use Gaussian distributions and generalized Gaussian distributions for the components of the mixture distributions. The probability density function of a univariate generalized Gaussian distribution is

$$p(x; \alpha, \beta) = \frac{\beta}{2\alpha \Gamma(1/\beta)} \exp\left( -\left( \frac{|x - \mu|}{\alpha} \right)^{\beta} \right) \quad (3)$$

where $\Gamma$ is the Gamma function and $\alpha, \beta \in (0, +\infty)$ are parameters. The assumption made in our analysis in Section 3.2 suggests that the components of prior distributions should be far from each other and have a small overlap. We observe that Gaussian distributions are too heavy-tailed, and components must be placed far away from each other in order to lower the out-of-distribution likelihood. To deal with this problem, we propose the use of a generalized Gaussian distribution. A generalized Gaussian distribution with parameters $\beta = 2$ and $\alpha = \sqrt{2}$ is equal to a standard Gaussian distribution.
A generalized Gaussian distribution with large $\beta$ is light-tailed, so we use a generalized Gaussian distribution with parameters $\alpha = \sqrt{\Gamma(1/\beta)/\Gamma(3/\beta)}$, $\beta = 4$ (Section 5.3). Note that the variance of a generalized Gaussian distribution using $\alpha = \sqrt{\Gamma(1/\beta)/\Gamma(3/\beta)}$ is one.
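The component density in Eq. (3) and the unit-variance parameterization can be checked numerically. The sketch below assumes SciPy's `gennorm`, whose shape and scale parameters correspond to $\beta$ and $\alpha$ in Eq. (3):

```python
import numpy as np
from scipy.special import gammaln, gamma
from scipy.stats import gennorm

def gen_gauss_logpdf(x, mu=0.0, alpha=1.0, beta=2.0):
    """Log of Eq. (3): beta / (2 * alpha * Gamma(1/beta)) * exp(-(|x - mu| / alpha)**beta)."""
    return (np.log(beta) - np.log(2 * alpha) - gammaln(1 / beta)
            - (np.abs(x - mu) / alpha) ** beta)

# beta = 2 with alpha = sqrt(2) recovers the standard Gaussian log-density.
x = np.linspace(-3, 3, 7)
assert np.allclose(gen_gauss_logpdf(x, alpha=np.sqrt(2), beta=2),
                   -0.5 * np.log(2 * np.pi) - 0.5 * x ** 2)

# alpha = sqrt(Gamma(1/beta) / Gamma(3/beta)) gives unit variance for any beta;
# check the light-tailed beta = 4 component.
beta = 4.0
alpha = np.sqrt(gamma(1 / beta) / gamma(3 / beta))
assert np.isclose(gennorm(beta, scale=alpha).var(), 1.0)
```

The unit-variance constraint makes the comparison between Gaussian and generalized Gaussian components fair: only the tail behavior changes, not the spread.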
5. Experiments
We evaluate the effect of a multimodal prior distribution on the likelihoods assigned to out-of-distribution inputs on two pairs of real image data sets: Fashion-MNIST vs. MNIST, and CIFAR-10 vs. SVHN.
Figure 4: Histograms of the log-likelihoods assigned by VAEs and Glow trained on Fashion-MNIST (label 1 and 7): (a) VAE, (b) Glow. "uni" denotes a standard Gaussian prior and "multi" denotes a bimodal Gaussian mixture prior. For Fashion-MNIST, we report likelihoods evaluated on test data. Bimodal priors mitigate the out-of-distribution problem.
We use two pairs of image data sets. The first pair is Fashion-MNIST [36] (training data) and MNIST [20] (out-of-distribution inputs). The second pair is CIFAR-10 [19] (training data) and SVHN [26] (out-of-distribution inputs). For training, we use a small subset of the data sets, as using all images requires a large number of clusters to lower the out-of-distribution likelihood. For CIFAR-10, 10% random width and height shifting is applied during training as data augmentation.
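The pre-training component allocation we use (labels, or k-means on the training images) can be sketched as follows. The synthetic data and cluster count below are illustrative stand-ins for the real image sets:

```python
import numpy as np
from sklearn.cluster import KMeans

def assign_components(images, n_components, seed=0):
    """Flatten images and fix each one to a prior component via k-means."""
    flat = images.reshape(len(images), -1).astype(np.float64)
    km = KMeans(n_clusters=n_components, n_init=10, random_state=seed)
    # Component indices are fixed before training and used to pick the
    # unimodal component p_i that evaluates each input's training likelihood.
    return km.fit_predict(flat)

# Synthetic stand-in for two visually distinct groups of 28x28 images.
rng = np.random.default_rng(0)
imgs = np.concatenate([rng.normal(0.2, 0.05, (50, 28, 28)),
                       rng.normal(0.8, 0.05, (50, 28, 28))])
labels = assign_components(imgs, n_components=2)
```

At test time the assignment is discarded and the likelihood is evaluated under the full mixture, so the clustering only shapes how the latent space is organized during training.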
Our implementation of VAE is based on the architecture described in [30, 23]. Both the encoder and the decoder are convolutional neural networks. Our implementation of Glow is based on the authors' code hosted at OpenAI's open source repository (https://github.com/openai/glow). To remove spatial dependencies on the latent variables, we do not use the multi-scale architecture, and apply a convolution over the three dimensions (width, height, channel) after the decoder. Further details are discussed in Appendix B.

We first analyze our model on simple data sets of images. Here, models are trained on images in label 1 (Trouser) and 7 (Sneaker) in Fashion-MNIST. We compare two types of prior distributions: a standard Gaussian distribution and a bimodal Gaussian mixture distribution. The means of the bimodal priors are [± , 0, . . . , 0] for VAE and [± , 0, . . . , 0] for Glow. The variances are diag([1, . . . , 1]) on all the components. Figure 4 shows that the models using multimodal prior distributions correctly assign low likelihood to MNIST, the out-of-distribution data, while the models using unimodal prior distributions assign high likelihood to MNIST.

Table 1: "(i)" indicates that the model is trained on the images in the i-th label. "uni" denotes a standard Gaussian prior and "multi" denotes a bimodal Gaussian mixture prior. The unimodal prior models trained only on images for one label of Fashion-MNIST still exhibit the out-of-distribution phenomenon for MNIST when compared to multimodal prior models.

Multi-Modal Priors Force Out Out-of-Distribution Points
Our analysis in Section 3.2 suggests that multimodal prior models mitigate the out-of-distribution problem because each component is trained on simpler data. However, although the complexity of data allocated to each component is important, unimodal prior models trained on data allocated to a single component still assign high likelihoods to out-of-distribution inputs when compared to multimodal prior models (Table 1). Figure 5 shows that the model with the multimodal prior correctly places MNIST in an out-of-distribution area within the latent variable space. In contrast, MNIST and Fashion-MNIST (label 7) have large overlap within the latent variable space of the model with a unimodal prior trained only on label 7 of Fashion-MNIST. These results imply that separating in-distribution data in the latent variable space by using a multimodal prior distribution has a strong effect of forcing out-of-distribution points out of high-likelihood areas. This observation suggests a new approach for mitigating the out-of-distribution phenomenon: improving latent variable design.
Figure 5: Histograms of one dimension of the latent variables: (a) the first dimension of the Glow model trained on label 1 and 7, and (b) the 613th dimension of the Glow model trained only on label 7 of Fashion-MNIST. For the unimodal prior model, we select the dimension of the latent variable with the largest absolute mean for MNIST. Further results are reported in Appendix F.3.

Figure 6: Relationships between the distance between two components and the mean log-likelihoods assigned to MNIST by models trained on Fashion-MNIST (label 1 and 7). While likelihoods assigned to out-of-distribution inputs are sensitive to the distance between components regardless of component choice, the Gaussian mixture priors require much larger distances to lower the likelihood assigned to out-of-distribution inputs. The histograms and the mean values of the log-likelihoods are reported in Appendix F.1.
Distance between Two Components
We analyze the relationship between the likelihoods assigned to out-of-distribution inputs and the distance between two components using two types of distributions: Gaussian and generalized Gaussian mixture distributions. Figure 6 shows the mean log-likelihoods assigned to MNIST by the models trained on Fashion-MNIST (label 1 and 7) with different types and component distances of prior distributions. Likelihoods assigned to out-of-distribution inputs are sensitive to the distance between components regardless of the component choice. However, models using Gaussian mixture priors require larger distances to lower the out-of-distribution likelihoods. A generalized Gaussian prior ($\beta = 4$) is particularly effective at assigning low likelihoods to out-of-distribution inputs even with much smaller distances between components. The means of the bimodal distributions are $[\pm d/2, 0, \ldots, 0]$, and the variance is $\mathrm{diag}([1, \ldots, 1])$ for all the components. Note that the likelihoods assigned to the test data of Fashion-MNIST (label 1 and 7) are relatively unaffected by the distance between two components (Appendix F.1).

Figure 7: Comparison of the experimental results with the suggestion based on the analysis in Section 3.2. (a), (b) The per-dimensional empirical mean of the squared distance from the mean image of each component of the VAE with a bimodal prior distribution: (a) distances to images in Fashion-MNIST (label 1 and 7), (b) distances to images in MNIST. The model is trained on Fashion-MNIST (label 1 and 7), and test images in Fashion-MNIST (label 1 and 7) and MNIST are assumed to be allocated to the nearest component in the latent variable space. "fashion i" and "mnist i" denote the data allocated to the i-th component. (c) The per-dimensional variances over pixels of images in MNIST and Fashion-MNIST. The y-axis is clipped for visualization.

Second Order Analysis
Our analysis in Section 3.2 suggests that the conditional expectation of the squared distances from the mean of the images generated from the modes, i.e., $\sigma_{\mathcal{D}^*_i, h,w,c}$ and $\sigma_{\mathcal{D}_i, h,w,c}$, influences the assigned likelihoods. We show that our experimental results are consistent with the analysis. Note that this is not the only influencing factor that lowers the out-of-distribution likelihood, as mentioned above. Figures 7a and 7b show histograms of the per-dimensional empirical mean of the squared distance from the mean of the images generated from each component of the VAE with a bimodal prior distribution, which are the empirical values of $\sigma_{\mathcal{D}^*_i, h,w,c}$ and $\sigma_{\mathcal{D}_i, h,w,c}$, respectively. The pixel values are scaled to $[0, 1]$. As explained in Section 3.2, we consider each image as allocated to the closest component in the latent variable space, and the squared distance from the mean is calculated over the pixel space. These results are evaluated on the VAE using a bimodal Gaussian mixture prior with means [± , 0, . . . , 0] trained on Fashion-MNIST (label 1 and 7). Figures 7a and 7b show that Fashion-MNIST has lower empirical values of $\sigma_{\mathcal{D}^*_i, h,w,c}$ compared to those of MNIST ($\sigma_{\mathcal{D}_i, h,w,c}$). The results are consistent with the implication of our analysis that higher likelihood is assigned if the values are small.

Figure 7c shows the per-dimensional variance of images in Fashion-MNIST (label 1 and 7) and MNIST. In contrast to our analysis, the analysis by Nalisnick et al. [23] for unimodal prior models suggests that the models assign higher likelihood to an adversarial distribution if the per-dimensional variance is small.
As has been reported in [23], most pixels of images found in MNIST have low variance, and this is consistent with the result that MNIST is assigned higher likelihood when a standard Gaussian prior is used. The differences between these two types of histograms provide an intuitive explanation for the difference between likelihoods assigned to out-of-distribution inputs by models using unimodal and multimodal prior distributions.

We evaluate our proposition on more complex data sets. While multimodal priors assign lower likelihoods, the effect is especially limited on Glow. The results on Glow may be affected by the spatial dependencies on the latent variables, and our efforts to remove the dependencies may not be sufficient. Our observations suggest that Glow requires further modifications to solve this problem, so we leave this problem for future work. More investigation into latent variable space as well as separation of data sets is required for further performance. Alternatively, our method can be used in tandem with other techniques such as [12, 24].
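The per-dimensional statistic compared in Figure 7 (the empirical $\sigma_{\mathcal{D}_i, h,w,c}$) is a conditional mean of squared pixel-space distances from each component's mean image. A minimal NumPy sketch, where the nearest-component allocation in latent space is taken as a given index array rather than recomputed from a trained model:

```python
import numpy as np

def per_dim_sq_dist(images, alloc, mean_images):
    """Empirical diagonal of E[(x - x_bar_i)(x - x_bar_i)^T | D_i] per component.

    images:      (N, H, W) array with pixel values scaled to [0, 1]
    alloc:       (N,) component index per image (nearest component in latent space)
    mean_images: (K, H, W) mean image generated from each component
    """
    K = len(mean_images)
    return np.stack([((images[alloc == i] - mean_images[i]) ** 2).mean(axis=0)
                     for i in range(K)])   # shape (K, H, W)
```

Histogramming the result per component for in-distribution vs. out-of-distribution images reproduces the kind of comparison shown in Figures 7a and 7b.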
Fashion-MNIST (label 0, 1, 7, 8) vs MNIST
We evaluate our method on a VAE trained on label 0, 1, 7, and 8 (T-shirt, Trouser, Sneaker, and Bag). We compare a standard Gaussian prior and a quad-modal Gaussian mixture prior. For the VAE, the means are $[150, 0, \ldots, 0]$, $[0, 150, 0, \ldots, 0]$, $[0, 0, 150, 0, \ldots, 0]$, and $[0, 0, 0, 150, 0, \ldots, 0]$. However, we observe that this scheme of placing modes on differing dimensions does not work on Glow. Hence, we use $[200 \times i, 0, \ldots, 0]$ ($i = 0, 1, 2, 3$) for Glow. The variances are $\mathrm{diag}([1, \ldots, 1])$ for all components. Figure 8 shows that the models using standard Gaussian priors produce the out-of-distribution phenomenon on MNIST, while the models using Gaussian mixture priors do so to a lesser degree.

Figure 8: Likelihoods assigned by the models trained on Fashion-MNIST (label 0, 1, 7, 8): (a) VAE, (b) Glow. The models with unimodal priors assign higher likelihood to MNIST, whereas those using multimodal priors mitigate this problem.

CIFAR-10 (label 0 and 4) vs SVHN
We find that images for one label in CIFAR-10 are still too diverse for our method. Therefore, we apply k-means to separate the images in label 0 and 4 of CIFAR-10 into four clusters each, and we use images in one cluster of each label. We compare the models using a standard Gaussian prior and a bimodal Gaussian mixture prior with means [± , 0, . . . , 0] for VAE and [± , 0, . . . , 0] for Glow. Figure 9 shows that the models with multimodal priors assign lower likelihoods to SVHN compared to the models using unimodal prior distributions. However, this effect is limited on Glow. We hypothesize that images in each cluster are still too diverse for Glow with our settings. These results imply that it is difficult to adopt our method if a data set does not consist of low-variance and distant clusters, and further study is required, particularly for Glow.

Figure 9: Likelihoods assigned by models trained on CIFAR-10 (label 0 and 4): (a) VAE, (b) Glow. The models using standard Gaussian priors assign higher likelihood to SVHN. The models using multimodal priors mitigate this problem, while the effect is limited on Glow.
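The VAE mode-placement scheme above (one nonzero coordinate per component) can be written generically. The sketch below uses illustrative `offset` and `dim` values; its one useful property is that every pair of component means comes out equidistant, at `offset * sqrt(2)`:

```python
import numpy as np

def modes_on_axes(n_components, dim, offset):
    """Mean vectors placing `offset` on a different latent coordinate per component."""
    means = np.zeros((n_components, dim))
    means[np.arange(n_components), np.arange(n_components)] = offset
    return means

means = modes_on_axes(4, dim=100, offset=150.0)
# All pairwise distances between distinct components equal offset * sqrt(2).
dists = np.linalg.norm(means[:, None] - means[None, :], axis=-1)
```

By contrast, the scheme we use for Glow places all modes on a single coordinate ($[200 \times i, 0, \ldots, 0]$), which makes distances between adjacent components uniform but not all pairwise distances equal.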
6. Conclusion and Discussion
We analyzed the influence of the choice of the prior distribution of deep generative models on the likelihoods assigned to out-of-distribution inputs. Recent work [23, 5] on deep generative models with unimodal prior distributions has shown that these models can assign higher likelihoods to out-of-distribution inputs than to training data. In this paper, we showed that models using multimodal prior distributions lower the likelihoods assigned to out-of-distribution inputs for Fashion-MNIST vs. MNIST and CIFAR-10 vs. SVHN. We also provided theoretical explanations for the advantages of the use of multimodal prior distributions.

Unfortunately, our experimental results suggested that it is difficult to apply our method to complex data sets even when we use prior knowledge. Thus, our work demonstrates the limitations of high-dimensional likelihoods yet again, and encourages future work on alternative metrics such as [5, 24]. Nevertheless, our work is the first to show that the likelihoods assigned to out-of-distribution inputs are affected by the choice of the prior distribution, which has previously been studied mainly as a way to improve the representative ability of deep generative models for in-distribution data. Our observations motivate further study on the prior distributions of deep generative models, as well as on methods to control the structure of the latent variables to make the model likelihood sensitive to out-of-distribution inputs.

Acknowledgements
This paper has benefited from advice and English language editing from Masayuki Takeda. This work was supported by JSPS KAKENHI (JP19K03642, JP19K00912) and RIKEN AIP Japan.
References

[1] David Arthur and Sergei Vassilvitskii. k-means++: The Advantages of Careful Seeding. In ACM-SIAM Symposium on Discrete Algorithms, 2007.
[2] Judith Bütepage, Petra Pokulukar, and Danica Kragic. Modeling assumptions and evaluation schemes: On the assessment of deep latent variable models. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2019.
[3] Francesco Paolo Casale, Adrian V. Dalca, Luca Saglietti, Jennifer Listgarten, and Nicolo Fusi. Gaussian process prior variational autoencoders. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
[4] Xi Chen, Diederik P. Kingma, Tim Salimans, Yan Duan, Prafulla Dhariwal, John Schulman, Ilya Sutskever, and Pieter Abbeel. Variational Lossy Autoencoder. In International Conference on Learning Representations (ICLR), 2017.
[5] Hyunsun Choi, Eric Jang, and Alexander A. Alemi. WAIC, but Why? Generative Ensembles for Robust Anomaly Detection. arXiv preprint arXiv:1810.01392, 2019.
[6] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[7] Nat Dilokthanakul, Pedro A. M. Mediano, Marta Garnelo, Matthew C. H. Lee, Hugh Salimbeni, Kai Arulkumaran, and Murray Shanahan. Deep Unsupervised Clustering with Gaussian Mixture Variational Autoencoders. arXiv preprint arXiv:1611.02648, 2016.
[8] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear Independent Components Estimation. arXiv preprint arXiv:1410.8516, 2015.
[9] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR), 2017.
[10] Prasoon Goyal, Zhiting Hu, Xiaodan Liang, Chenyu Wang, and Eric P. Xing. Nonparametric Variational Auto-encoders for Hierarchical Representation Learning. In IEEE International Conference on Computer Vision (ICCV), 2017.
[11] Dan Hendrycks and Kevin Gimpel. A Baseline for Detecting Misclassified and Out-of-Distribution Examples in Neural Networks. In International Conference on Learning Representations (ICLR), 2017.
[12] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep Anomaly Detection with Outlier Exposure. In International Conference on Learning Representations (ICLR), 2019.
[13] Sergey Ioffe and Christian Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning (ICML), 2015.
[14] Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-Supervised Learning with Normalizing Flows. In Workshop on Invertible Neural Networks and Normalizing Flows (ICML 2019), 2019.
[15] Matthew J. Johnson, David Duvenaud, Alexander B. Wiltschko, Sandeep R. Datta, and Ryan P. Adams. Composing graphical models with neural networks for structured representations and fast inference. In Conference on Neural Information Processing Systems (NIPS), 2016.
[16] Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative Flow with Invertible 1x1 Convolutions. In Conference on Neural Information Processing Systems (NeurIPS), 2018.
[17] Diederik P. Kingma and Jimmy Lei Ba. Adam: A Method for Stochastic Optimization. In International Conference on Learning Representations (ICLR), 2014.
[18] Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In International Conference on Learning Representations (ICLR), 2014.
[19] Alex Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Technical report, University of Toronto, 2009.
[20] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[21] Shiyu Liang, Yixuan Li, and R. Srikant. Enhancing The Reliability of Out-of-distribution Image Detection in Neural Networks. In International Conference on Learning Representations (ICLR), 2018.
[22] Vinod Nair and Geoffrey E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In International Conference on Machine Learning (ICML), 2010.
[23] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do Deep Generative Models Know What They Don't Know? In International Conference on Learning Representations (ICLR), 2019.
[24] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, and Balaji Lakshminarayanan. Detecting Out-of-Distribution Inputs to Deep Generative Models Using a Test for Typicality. arXiv preprint arXiv:1906.02994, 2019.
[25] Eric Nalisnick and Padhraic Smyth. Stick-Breaking Variational Autoencoders. In International Conference on Learning Representations (ICLR), 2017.
[26] Yuval Netzer, Tao Wang, Adam Coates, Alessandro Bissacco, Bo Wu, and Andrew Y. Ng. Reading Digits in Natural Images with Unsupervised Feature Learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.
[27] Pramuditha Perera, Ramesh Nallapati, and Bing Xiang. OCGAN: One-class Novelty Detection Using GANs with Constrained Latent Representations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
[28] Marco A. F. Pimentel, David A. Clifton, Lei Clifton, and Lionel Tarassenko. A review of novelty detection. Signal Processing, 99:215–249, 2014.
[29] Jason Rolfe. Discrete variational autoencoders. In International Conference on Learning Representations (ICLR), 2017.
[30] Mihaela Rosca, Balaji Lakshminarayanan, and Shakir Mohamed. Distribution Matching in Variational Inference. arXiv preprint arXiv:1802.06847, 2018.
[31] Alireza Shafaei, Mark Schmidt, and James J. Little. A Less Biased Evaluation of Out-of-distribution Sample Detectors. arXiv preprint arXiv:1809.04729, 2018.
[32] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A Note on the Evaluation of Generative Models. In International Conference on Learning Representations (ICLR), 2016.
[33] Jakub M. Tomczak and Max Welling. VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[34] Aäron van den Oord, Nal Kalchbrenner, Oriol Vinyals, Lasse Espeholt, Alex Graves, and Koray Kavukcuoglu. Conditional Image Generation with PixelCNN Decoders. In Conference on Neural Information Processing Systems (NIPS), 2016.
[35] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural Discrete Representation Learning. In Conference on Neural Information Processing Systems (NIPS), 2017.
[36] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a Novel Image Dataset for Benchmarking Machine Learning Algorithms. arXiv preprint arXiv:1708.07747, 2017.

Supplementary Material of "Likelihood Assignment for Out-of-Distribution Inputs in Deep Generative Models is Sensitive to Prior Distribution Choice"

A. Second Order Analysis

In this section, we present detailed explanations for the analysis in Section 3.2. We adopt the assumption that the probability density function of the given generative model $p$ can be approximated by a mixture distribution $\log p(x;\theta) \simeq \log \frac{1}{K}\sum_{i=1}^{K} p_i(x;\theta)$, where $p_i$ corresponds to each component, which can be approximated by a Gaussian distribution. For simplicity, we assume that the components are assigned uniform weights and have equal variances. In addition, corresponding to the nature of the data sets that we are considering, we assume that the components of the distribution are far from each other and have small variances. Under these assumptions, we can approximate the probability density for each input by taking the value from the component that yields the maximum value for the data: $\log p(x;\theta) \simeq \log \frac{1}{K}\sum_{i=1}^{K} p_i(x;\theta) \simeq \max_i \log \frac{1}{K} p_i(x;\theta)$. Thus, we can write the expectation over the training data distribution $p^*$ as

$$\begin{aligned}
\mathbb{E}_{p^*}[\log p(x;\theta)] &= \sum_{i=1}^{K} w^*_i \, \mathbb{E}[\log p(x;\theta) \mid \mathcal{D}^*_i] \\
&\simeq \sum_{i=1}^{K} w^*_i \, \mathbb{E}\!\left[\log \frac{1}{K}\sum_{j=1}^{K} p_j(x;\theta) \,\middle|\, \mathcal{D}^*_i\right] \\
&\simeq \sum_{i=1}^{K} w^*_i \, \mathbb{E}[\log p_i(x;\theta) \mid \mathcal{D}^*_i] - \log K \qquad (4)
\end{aligned}$$

where $\mathcal{D}^*_i$ represents the in-distribution data allocated to the $i$-th component and $w^*_i$ is the ratio of data allocated to the $i$-th component, satisfying $\sum_{i=1}^{K} w^*_i = 1$.
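The approximation $\log \frac{1}{K}\sum_i p_i(x;\theta) \simeq \max_i \log \frac{1}{K} p_i(x;\theta)$ for well-separated, small-variance components can be checked numerically. The sketch below uses illustrative one-dimensional values, not the paper's settings:

```python
import numpy as np

# Two well-separated 1-D Gaussian components with equal weights (K = 2).
K = 2
means = np.array([-5.0, 5.0])
var = 0.1


def log_component(x, mu):
    # log-density of N(mu, var) evaluated at x
    return -0.5 * ((x - mu) ** 2 / var + np.log(2 * np.pi * var))


x = 4.9  # a point close to the second mode
log_comps = np.array([log_component(x, mu) for mu in means])

# exact mixture log-density log((1/K) * sum_i p_i(x)), computed stably
m = log_comps.max()
log_mix = m + np.log(np.exp(log_comps - m).sum()) - np.log(K)

# max approximation: keep only the dominant component
log_max = log_comps.max() - np.log(K)
```

Because the far component contributes a factor of roughly $e^{-490}$ here, the two quantities agree to machine precision; the approximation degrades as the modes move closer together or their variances grow.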
We can also expand the expectation over the adversarial distribution $q$ as $\mathbb{E}_q[\log p(x;\theta)] \simeq \sum_{i=1}^{K} w_i \, \mathbb{E}[\log p_i(x;\theta) \mid \mathcal{D}_i] - \log K$, where $\mathcal{D}_i$ represents the out-of-distribution data allocated to the $i$-th component, and $w_i$ is the ratio of data allocated to the $i$-th component, satisfying $\sum_{i=1}^{K} w_i = 1$.

Since we assume that each component can be approximated by a Gaussian distribution, we use a second-order approximation for each component:
$$\log p_i(x;\theta) \simeq \log p_i(\bar{x}_i;\theta) + \nabla_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)^T (x-\bar{x}_i) + \frac{1}{2}\operatorname{Tr}\{\nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)(x-\bar{x}_i)(x-\bar{x}_i)^T\}.$$
Here, $\bar{x}_i$ is the mean of images generated from each component. Therefore, $\nabla_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)^T(x-\bar{x}_i) \simeq 0$ since $\operatorname{argmax}_x \log p_i(x;\theta) \simeq \bar{x}_i$. Thus, we can expand the conditional expectation as
$$\begin{aligned}
\mathbb{E}[\log p_i(x;\theta)\mid\mathcal{D}^*_i] &\simeq \mathbb{E}\!\left[\log p_i(\bar{x}_i;\theta) + \frac{1}{2}\operatorname{Tr}\{\nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)(x-\bar{x}_i)(x-\bar{x}_i)^T\} \,\middle|\, \mathcal{D}^*_i\right] \\
&= \log p_i(\bar{x}_i;\theta) + \frac{1}{2}\operatorname{Tr}\{\nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)\,\Sigma_{\mathcal{D}^*_i}\} \qquad (5)
\end{aligned}$$
where $\Sigma_{\mathcal{D}^*_i} = \mathbb{E}[(x-\bar{x}_i)(x-\bar{x}_i)^T\mid\mathcal{D}^*_i]$, which is assumed to be diagonal as in [23]. Similarly, $\mathbb{E}[\log p_i(x;\theta)\mid\mathcal{D}_i] = \log p_i(\bar{x}_i;\theta) + \frac{1}{2}\operatorname{Tr}\{\nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)\,\Sigma_{\mathcal{D}_i}\}$, where $\Sigma_{\mathcal{D}_i} = \mathbb{E}[(x-\bar{x}_i)(x-\bar{x}_i)^T\mid\mathcal{D}_i]$. Note that $\Sigma_{\mathcal{D}^*_i}$ and $\Sigma_{\mathcal{D}_i}$ are not variance matrices, as $\bar{x}_i$ are the mean images of the generative model.

Because we assume that the variances of all components are identical, $\log p_i(\bar{x}_i;\theta)$ can be approximated to be identical for all $i$.
Finally, we can write the difference of the two expected log-likelihoods (Equation 1) in a relatively simple form, in parallel with the first line of Equation 5 in [23]:
$$\begin{aligned}
\mathbb{E}_q[\log p(x;\theta)] - \mathbb{E}_{p^*}[\log p(x;\theta)] &\simeq \sum_{i=1}^{K} w_i\,\mathbb{E}[\log p_i(x;\theta)\mid\mathcal{D}_i] - \sum_{i=1}^{K} w^*_i\,\mathbb{E}[\log p_i(x;\theta)\mid\mathcal{D}^*_i] \\
&= \frac{1}{2}\operatorname{Tr}\!\left\{\sum_{i=1}^{K} w_i \nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)\,\Sigma_{\mathcal{D}_i} - \sum_{i=1}^{K} w^*_i \nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)\,\Sigma_{\mathcal{D}^*_i}\right\}. \qquad (6)
\end{aligned}$$
If we assume that each $p_i$ is precisely a Gaussian distribution, we can simply compute the Hessian in Equation 6. However, because this assumption is too strong, Nalisnick et al. [23] expanded this formula by adopting the assumption that the generative model is a constant-volume Glow (CV-Glow). Although we do not use CV-Glow in our experiments, we apply the expression derived by Nalisnick et al. [23]:
$$\operatorname{Tr}\{\nabla^2_{\bar{x}_i}\log p_i(\bar{x}_i;\theta)\,\Sigma_{\mathcal{D}_i}\} = -\frac{1}{\sigma^2_\psi}\sum_{c=1}^{C}\left(\prod_{l=1}^{L}\sum_{j=1}^{C_l} u_{l,c,j}\right)^2 \sum_{h,w}\sigma^2_{\mathcal{D}_i,h,w,c} \qquad (7)$$
where $\sigma^2_\psi$ is the variance of a component of the prior distribution (we assume all components have identical variances) and $\sigma^2_{\mathcal{D}_i,h,w,c}$ are the diagonal elements of $\Sigma_{\mathcal{D}_i}$. $u_{l,c,j}$ is the weight of the $l$-th 1x1 convolution of Glow, which is fixed for any input. $h$ and $w$ index the input spatial dimensions, $c$ indexes the input channel dimensions, $l$ indexes the series of flows, and $j$ indexes the column dimensions of the $C_l \times C_l$ kernel. We assume that each component of the generative model $p_i(x;\theta)$ corresponds to a component of the prior distribution $p_i(z;\psi)$. Finally, we arrive at Equation 2.

B. Experimental Settings
We present the model architectures and training settings of the experiments shown in Section 5.
VAE
Our implementation of the VAE [18] is based on the architecture described in [30, 23]. Both the encoder and the decoder are convolutional neural networks, described in Tables 2 and 3. We use batch normalization [13] after every convolutional layer except for the last layers of the encoder and the decoder. All the convolutional layers in the decoder use ReLU [22] as an activation function after batch normalization. After the final layer of the decoder, we apply the softmax function and assume i.i.d. categorical distributions on pixels as the visible distributions.

We perform training for 1,000 epochs using the Adam optimizer [17] with a constant learning rate. We use 5,000 sample points to approximate test likelihoods.

Table 2: Model architecture of the VAE for Fashion-MNIST and MNIST. (a) The encoder is a stack of convolutional layers whose outputs are the means (50) and log variances (50) of the posterior. (b) The decoder consists of a fully-connected layer (3136 units), a reshape of the latent variables to 7×7×64, and transposed convolutional layers. "Channels" denotes the size of the output channel, and "Pad" denotes paddings.
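A minimal sketch of the pixel log-likelihood under this i.i.d. categorical visible distribution (the shapes and the 256 intensity levels here are illustrative assumptions; in the model, the logits are produced by the decoder):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: a 28x28 grayscale image with 256 intensity levels.
H, W, LEVELS = 28, 28, 256
logits = rng.normal(size=(H, W, LEVELS))        # stand-in for decoder outputs (pre-softmax)
image = rng.integers(0, LEVELS, size=(H, W))    # observed pixel intensities

# log-softmax over the intensity dimension, computed stably
m = logits.max(axis=-1, keepdims=True)
log_probs = logits - m - np.log(np.exp(logits - m).sum(axis=-1, keepdims=True))

# i.i.d. categorical log-likelihood: sum the log-probability of each
# observed intensity over all pixels
ll = log_probs[np.arange(H)[:, None], np.arange(W)[None, :], image].sum()
```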
Glow
For our experiments with Glow [16], our implementation is based on the code hosted at OpenAI's open source repository (https://github.com/openai/glow). For Fashion-MNIST vs. MNIST, we use 1 block of 32 affine coupling layers, squeezing the spatial dimension after the 16th layer. For CIFAR-10 vs. SVHN, we use 1 block of 24 affine coupling layers, squeezing the spatial dimension after the 8th and 16th layers.

To alleviate the spatial dependencies on the latent variables, we do not use the multi-scale architecture, which splits the latent variables after squeezing [9]. In addition, we apply 1x1 convolution over the three dimensions (width, height, channel) after the encoder, and apply the inverse operation before the decoder. In our implementation, we add the code in Listing 1 after the encoder, and add the inverse operation before the decoder. Moreover, we add a small positive value (0.1 in our implementation) to the scale of the affine coupling layers to stabilize training, as suggested at https://github.com/openai/glow/issues/40. While Nalisnick et al. [23] remove actnorm and apply their own initialization scheme, we use actnorm and apply the original initialization scheme in the OpenAI code.

We perform training for 1,000 epochs using the Adam optimizer in accordance with the OpenAI code. We use a learning rate that is linearly annealed from zero over the first 10 epochs.

Table 3: Model architecture of the VAE for CIFAR-10 and SVHN. (a) The encoder is a stack of convolutional layers whose outputs are the means (50) and log variances (50) of the posterior. (b) The decoder consists of transposed convolutional layers applied after reshaping the latent variables.
"Channels" denotes the size of the output channel, and "Pad" denotes paddings.

Listing 1: Code added after the encoder of Glow to remove the spatial dependencies on the latent variables. Each invertible 1x1 convolution operates on the last axis, so the tensor is transposed to bring the height, width, and channel dimensions there in turn. The inverse operation is added before the decoder.

    # mix along the height dimension (move it to the last axis and back)
    z = tf.transpose(z, perm=[0, 3, 2, 1])
    z, logdet = invertible_1x1_conv("invconv1", z, logdet)
    z = tf.transpose(z, perm=[0, 3, 2, 1])
    # mix along the width dimension
    z = tf.transpose(z, perm=[0, 1, 3, 2])
    z, logdet = invertible_1x1_conv("invconv2", z, logdet)
    z = tf.transpose(z, perm=[0, 1, 3, 2])
    # standard channel-wise invertible 1x1 convolution
    z, logdet = invertible_1x1_conv("invconv3", z, logdet)
C. Simple Artificial Data
For the artificial data used in Section 3.1, we compare the likelihoods assigned to in-distribution and out-of-distribution data to show that a standard Gaussian prior can assign high likelihoods to out-of-distribution inputs. The in-distribution data are generated from a two-dimensional Gaussian mixture distribution whose two means are placed symmetrically about the origin along the first dimension with a small diagonal variance, and the out-of-distribution data are sample points from a two-dimensional Gaussian distribution with zero mean and 0.01 variance. Figure 2c shows that the out-of-distribution inputs do not have any overlap with the in-distribution data. However, Figure 10a shows that the log-likelihoods assigned to in-distribution and out-of-distribution inputs by the model using a standard Gaussian prior are similar. On the contrary, the model using a multimodal prior distribution assigns much lower likelihoods to out-of-distribution inputs.

This phenomenon is more serious for high-dimensional data. Here, the in-distribution data are generated from a 10-dimensional Gaussian mixture distribution whose two means differ only in the first dimension, with small diagonal variances for both components. The out-of-distribution data are generated from a 10-dimensional Gaussian distribution with zero mean and 0.01 variance. Figures 10b and 10c show that the log-likelihoods that the model using a standard Gaussian prior assigns to out-of-distribution inputs are much higher than those assigned to in-distribution data, whereas the model using a multimodal prior assigns much lower likelihoods to out-of-distribution inputs.

D. Mean Images of Clusters
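The data-generating process above can be sketched as follows. The mixture offset (1.0) is an illustrative placeholder, while the 0.01 variance of the out-of-distribution Gaussian follows the text:

```python
import numpy as np

rng = np.random.default_rng(0)


def sample_in_distribution(n, dim=2, offset=1.0, var=0.01):
    """Sample from a bimodal Gaussian mixture whose two means differ
    only in the first dimension: [+offset, 0, ...] and [-offset, 0, ...]."""
    signs = rng.choice([-1.0, 1.0], size=n)          # pick a component per sample
    means = np.zeros((n, dim))
    means[:, 0] = signs * offset
    return means + rng.normal(scale=np.sqrt(var), size=(n, dim))


def sample_out_of_distribution(n, dim=2, var=0.01):
    """Sample from an isotropic Gaussian centered at the origin,
    which lies between the two in-distribution modes."""
    return rng.normal(scale=np.sqrt(var), size=(n, dim))


x_in = sample_in_distribution(1000)
x_out = sample_out_of_distribution(1000)
```

With these small variances, the out-of-distribution cloud sits in the gap between the two in-distribution modes without overlapping either of them.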
Figure 11 shows the images corresponding to the means of the components of the bimodal prior distributions of the VAE and Glow trained on labels 0 and 7 of Fashion-MNIST. The means are placed symmetrically on the first latent dimension for both the VAE and Glow. Figure 12 shows the images corresponding to the means of the unimodal prior distributions of the VAE and Glow. The mean images of the bimodal-prior VAE are similar to the images in each cluster. However, the mean image of the standard-Gaussian-prior VAE is different from the training data. For Glow with a standard Gaussian prior, while the mean image is similar to the in-distribution data, some images from random sampling of Glow are different from the training data. The results suggest that the models with unimodal prior distributions can assign high likelihoods to out-of-distribution inputs because they can contain out-of-distribution inputs in their high-likelihood areas or typical sets.

Figure 10: (a) Histograms of the log-likelihoods assigned to training and out-of-distribution data by flow-based generative models trained on the simple two-dimensional Gaussian mixture data in Section 3.1. "uni" denotes a unimodal prior, and "multi" denotes a multimodal prior. While the model with a unimodal prior assigns relatively high likelihoods to out-of-distribution inputs, the model with a multimodal prior assigns much lower likelihoods to out-of-distribution inputs. (b, c) Histograms of the log-likelihoods assigned by flow-based generative models for 10-dimensional data, with a standard Gaussian prior (b) and a multimodal prior (c). The out-of-distribution problem is more serious for high-dimensional data.

Figure 11: Images corresponding to the means of the components of the bimodal prior distributions of the VAE and Glow trained on labels 0 and 7 of Fashion-MNIST (left). Images in the data set allocated to each cluster (right).

Figure 12: Images corresponding to the means of the unimodal prior distributions of the VAE and Glow trained on labels 0 and 7 of Fashion-MNIST (left), and images generated by random sampling (right): (a) VAE with a unimodal prior, (b) Glow with a unimodal prior. The mean image of the VAE is dissimilar to the training data, while the images from random sampling are similar to the training data. While the mean image of Glow is similar to the training data, some images from random sampling of Glow are dissimilar to the training data.
E. K-means Clustering for CIFAR-10
In the experiments reported in Section 5.4, we separate the images in labels 0 and 4 of CIFAR-10 into four clusters each by k-means clustering (k = 4) initialized with k-means++ [1]. Figure 13 shows sample images from the clusters and the per-dimensional variance of the images in each cluster. The histograms show that k-means clustering successfully decreases the per-dimensional variance. In our experiments, we use the images in the cluster corresponding to the second rows.

Figure 13: K-means clustering for CIFAR-10 labels 0 and 4. (a) Sample images in the four clusters of label 0; each row corresponds to one cluster, and we use the images in the second row. (b) Per-dimensional variance of the images in each cluster of label 0. (c) Sample images in the four clusters of label 4; each row corresponds to one cluster, and we use the images in the second row. (d) Per-dimensional variance of the images in each cluster of label 4.
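A self-contained sketch of this clustering step, assuming the images are flattened to vectors. The paper uses k-means with k-means++ initialization [1]; the plain NumPy implementation below is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)


def kmeans_pp_init(x, k):
    """k-means++ seeding [1]: pick each new center with probability
    proportional to the squared distance to the nearest chosen center."""
    centers = [x[rng.integers(len(x))]]
    for _ in range(k - 1):
        d2 = np.min([np.sum((x - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(x[rng.choice(len(x), p=d2 / d2.sum())])
    return np.stack(centers)


def kmeans(x, k=4, n_iters=50):
    """Plain Lloyd iterations on flattened images (rows of x)."""
    centers = kmeans_pp_init(x, k)
    for _ in range(n_iters):
        # assign each point to its nearest center
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # update each center as the mean of its cluster
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return labels, centers
```

Images belonging to a single low-variance cluster are then kept as the training set, matching the selection of the second rows in Figure 13.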
F. Additional Experimental Results
We show additional materials for the results reported in Section 5.
F.1. Distance between Two Components
Figures 14 and 15 show the histograms of the log-likelihoods assigned to MNIST and to the test data of Fashion-MNIST (labels 1, 7) by Glow and VAEs trained on Fashion-MNIST (labels 1, 7) using different distances between the two components. Figure 14 shows the histograms corresponding to the results reported in Figure 6. Compared to the likelihoods assigned to MNIST, those assigned to the test data of Fashion-MNIST are not significantly affected by the distance between the two components.

Figure 14: Distances between the two components and the log-likelihoods assigned to MNIST by models trained on Fashion-MNIST (labels 1 and 7): (a) VAE, Gaussian mixture; (b) VAE, generalized Gaussian mixture; (c) Glow, Gaussian mixture; (d) Glow, generalized Gaussian mixture. The results in these images correspond to those in Figure 6.
Figure 15: Distances between the two components and the log-likelihoods assigned to the test data (Fashion-MNIST (1, 7)) by models trained on Fashion-MNIST (1, 7): (a) VAE, Gaussian mixture; (b) VAE, generalized Gaussian mixture; (c) Glow, Gaussian mixture; (d) Glow, generalized Gaussian mixture. The likelihoods assigned to the test data are not significantly affected by the distance between the two components.
F.2. Unimodal-Prior Models Trained on Simpler Data
Figure 16 shows the log-likelihoods assigned to MNIST and to the test data by models with standard Gaussian priors trained on Fashion-MNIST (label 1 or 7). Although these models assign lower likelihoods to MNIST, the effect of alleviating the out-of-distribution behaviour is not significant compared to that of the models using a multimodal prior.

Figure 16: Histograms of the log-likelihoods assigned by models with standard Gaussian priors trained on Fashion-MNIST (label 1 or 7): (a) VAE, label 1; (b) VAE, label 7; (c) Glow, label 1; (d) Glow, label 7. The results correspond to those in Table 1. While the models trained on a simpler data set assign lower likelihoods to out-of-distribution inputs, the models using multimodal distributions assign much lower likelihoods.
F.3. Histograms of the Latent Variables
Figure 17 shows the histograms of the latent variables of the models with bimodal prior distributions trained on Fashion-MNIST (labels 1, 7). For the results on the VAE, we show the histograms of the means of the posterior distributions. Figure 19 shows the histograms of the latent variables of the models with standard Gaussian priors trained on Fashion-MNIST (label 1 or 7). In the latent variable spaces of the models with multimodal prior distributions, MNIST resides in out-of-distribution areas and does not have a large overlap with Fashion-MNIST. By contrast, while we select the

Figure 17: Latent variables of the models with bimodal Gaussian priors trained on Fashion-MNIST (labels 1, 7): (a) VAE, (b) Glow. The latent variables of MNIST reside in out-of-distribution areas.

Figure 18: Latent variables of the models with bimodal Gaussian priors trained on CIFAR-10 (labels 0, 4): (a) VAE, (b) Glow. The latent variables of SVHN reside near the in-distribution areas.