Why Normalizing Flows Fail to Detect Out-of-Distribution Data
Polina Kirichenko∗, Pavel Izmailov∗, Andrew Gordon Wilson
New York University
Abstract
Detecting out-of-distribution (OOD) data is crucial for robust machine learning systems. Normalizing flows are flexible deep generative models that often surprisingly fail to distinguish between in- and out-of-distribution data: a flow trained on pictures of clothing assigns higher likelihood to handwritten digits. We investigate why normalizing flows perform poorly for OOD detection. We demonstrate that flows learn local pixel correlations and generic image-to-latent-space transformations which are not specific to the target image dataset. We show that by modifying the architecture of flow coupling layers we can bias the flow towards learning the semantic structure of the target data, improving OOD detection. Our investigation reveals that properties that enable flows to generate high-fidelity images can have a detrimental effect on OOD detection.
Normalizing flows [39, 9, 10] seem to be ideal candidates for out-of-distribution detection, since they are simple generative models that provide an exact likelihood. However, Nalisnick et al. [27] revealed the puzzling result that flows often assign higher likelihood to out-of-distribution data than to the data used for maximum likelihood training. In Figure 1(a), we show the log-likelihood histogram for a RealNVP flow model [10] trained on the ImageNet dataset [35] subsampled to a lower resolution. The flow assigns higher likelihood to both the CelebA dataset of celebrity photos and the SVHN dataset of images of house numbers, compared to the target ImageNet dataset.

While there has been empirical progress in improving OOD detection with flows [27, 7, 28, 36, 37, 45], the fundamental reasons for why flows fail at OOD detection in the first place are not fully understood. In this paper, we show how the inductive biases [26, 44] of flow models (implicit assumptions in the architectures and training procedures) can hinder OOD detection. In particular, our contributions are the following:

• We show that flows learn latent representations for images largely based on local pixel correlations, rather than semantic content, making it difficult to detect data with anomalous semantics.

• We identify mechanisms through which normalizing flows can simultaneously increase likelihood for all structured images. For example, in Figure 1(b, c), we show that the coupling layers of RealNVP transform the in-distribution ImageNet in the same way as the OOD CelebA.

∗ Equal contribution.

Figure 1:
RealNVP flow on in- and out-of-distribution images. (a): A histogram of log-likelihoods that a RealNVP flow trained on ImageNet assigns to ImageNet, SVHN and CelebA. The flow assigns higher likelihood to out-of-distribution data. (b, c): A visualization of the intermediate layers of a RealNVP model on (b) an in-distribution image and (c) an OOD image. The first row shows the coupling layer activations; the second and third rows show the scale s and shift t parameters predicted by a neural network applied to the corresponding coupling layer input. On both in-distribution and out-of-distribution images, s and t accurately approximate the structure of the input, even though the model has not observed inputs (images) similar to the OOD image during training. Flows learn generic image-to-latent-space transformations that leverage local pixel correlations and graphical details rather than the semantic content needed for OOD detection.

• We show that by changing the architectural details of the coupling layers, we can encourage flows to learn transformations specific to the target data, improving OOD detection.

• We show that OOD detection is improved when flows are trained on high-level features which contain semantic information extracted from image datasets.

We also provide code at https://github.com/PolinaKirichenko/flows_ood .

We briefly introduce normalizing flows based on coupling layers. For a more detailed introduction, see Papamakarios et al. [31] and Kobyzev et al. [23].
Normalizing flows
Normalizing flows [39] are a flexible class of deep generative models that model a target distribution p^*(x) as an invertible transformation f of a base distribution p_Z(z) in the latent space. Using the change of variables formula, the likelihoods for an input x and a dataset \mathcal{D} are

p_X(x) = p_Z(f^{-1}(x)) \left| \det \frac{\partial f^{-1}}{\partial x} \right|, \qquad p(\mathcal{D}) = \prod_{x \in \mathcal{D}} p_X(x). \quad (1)

The latent space distribution p_Z(z) is commonly chosen to be a standard Gaussian. Flows are typically trained by maximizing the log-likelihood (1) of the training data with respect to the parameters of the invertible transformation f.
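As a concrete illustration of Eq. (1), the sketch below evaluates the log-likelihood of an input under a toy elementwise affine flow with a standard Gaussian base distribution. The map f and all numeric values here are illustrative placeholders, not the paper's model.

```python
import numpy as np

# Toy illustration of the change of variables formula in Eq. (1).
# Here f(z) = exp(s) * z + t elementwise, so f^{-1}(x) = (x - t) * exp(-s)
# and log |det df^{-1}/dx| = -sum(s). The values of s and t are made up.
def gaussian_logpdf(z):
    # log density of a standard Gaussian, summed over dimensions
    return -0.5 * np.sum(z ** 2 + np.log(2 * np.pi))

def flow_loglik(x, s, t):
    z = (x - t) * np.exp(-s)      # z = f^{-1}(x)
    log_det = -np.sum(s)          # log-determinant of the inverse Jacobian
    return gaussian_logpdf(z) + log_det

x = np.array([0.5, -1.0])
s = np.array([0.1, 0.2])
t = np.array([0.0, 0.3])
print(flow_loglik(x, s, t))  # ≈ -2.8066
```

Since x = f(z) with z standard Gaussian makes x Gaussian with mean t and scale exp(s) coordinatewise, the result agrees with the closed-form Gaussian log-density, which is an easy sanity check on the change of variables bookkeeping.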
Coupling layers

We focus on normalizing flows based on affine coupling layers. In these flows, the transformation performed by each layer is given by

f^{-1}_{\text{aff}}(x_{\text{id}}, x_{\text{change}}) = (y_{\text{id}}, y_{\text{change}}), \quad y_{\text{id}} = x_{\text{id}}, \quad y_{\text{change}} = (x_{\text{change}} + t(x_{\text{id}})) \odot \exp(s(x_{\text{id}})), \quad (2)

where x_id and x_change are disjoint parts of the input x, y_id and y_change are disjoint parts of the output y, and the scale and shift parameters s(·) and t(·) are usually implemented by a neural network (which we will call the st-network). The split of the input into x_id and x_change is defined by a mask: a coupling layer transforms the masked part x_change = mask(x) of the input based on the remaining part x_id. The transformation (2) is invertible and allows for efficient Jacobian computation in (1):

\log \left| \det \frac{\partial f^{-1}_{\text{aff}}}{\partial x} \right| = \sum_{i=1}^{\dim(x_{\text{change}})} s(x_{\text{id}})_i. \quad (3)
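A minimal sketch of Eq. (2) and (3) on a flat vector, with a tiny hand-written function standing in for the real st-network (the tanh and 0.5·x_id expressions are placeholders, not the paper's architecture):

```python
import numpy as np

# One affine coupling layer, following Eq. (2): the masked part x_change is
# shifted and scaled using s, t computed from the untouched part x_id.
def st_network(x_id):
    # toy stand-in for a neural network: any function of x_id alone works
    return np.tanh(x_id), 0.5 * x_id   # s, t

def coupling_to_latent(x, mask):
    # mask == True marks x_id (copied through); mask == False marks x_change
    s, t = st_network(x[mask])
    y = x.copy()
    y[~mask] = (x[~mask] + t) * np.exp(s)   # Eq. (2)
    log_det = np.sum(s)                     # Eq. (3)
    return y, log_det

def coupling_to_data(y, mask):
    # exact inverse: y_id equals x_id, so s and t can be recomputed
    s, t = st_network(y[mask])
    x = y.copy()
    x[~mask] = y[~mask] * np.exp(-s) - t
    return x

x = np.array([0.2, -0.4, 1.0, 0.7])
mask = np.array([True, False, True, False])
y, log_det = coupling_to_latent(x, mask)
assert np.allclose(coupling_to_data(y, mask), x)   # invertibility
```

Stacking several such layers with alternating masks gives a flexible composed flow, and the log-determinants in Eq. (3) simply add up across layers.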
Flows with coupling layers

Coupling layers can be stacked together into flexible normalizing flows: f = f_K ∘ f_{K-1} ∘ ⋯ ∘ f_1. Examples of flows with coupling layers include NICE [9], RealNVP [10], Glow [22], and many others [e.g., 4, 5, 11, 15, 16, 21, 25, 32].
Out-of-distribution detection using likelihood

Flows can be used for out-of-distribution detection based on the likelihood they assign to the inputs. One approach is to choose a likelihood threshold ε on a validation dataset, e.g. to satisfy a desired false positive rate, and at test time identify inputs with likelihood lower than ε as OOD. Qualitatively, we can estimate the performance of flows for OOD detection by plotting a histogram of the log-likelihoods, as in Figure 1(a): the likelihoods for in-distribution data should generally be higher than for OOD data. Alternatively, we can treat OOD detection as a binary classification problem using likelihood scores, and compute accuracy with a fixed likelihood threshold ε, or AUROC (area under the receiver operating characteristic curve).
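The thresholding and AUROC evaluation described above can be sketched as follows; the log-likelihood arrays are made-up placeholders for scores that would come from a trained flow.

```python
import numpy as np

# OOD detection from per-input log-likelihoods. The threshold eps is picked
# on (illustrative) in-distribution validation scores for a 5% FPR.
def auroc(loglik_in, loglik_ood):
    # rank-based AUROC: probability that a random in-distribution input
    # scores higher than a random OOD input (ties count as 0.5)
    diff = loglik_in[:, None] - loglik_ood[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()

def flag_ood(loglik, eps):
    # inputs with likelihood below the threshold are labelled OOD
    return loglik < eps

ll_val = np.array([-3600.0, -3500.0, -3400.0, -3300.0])   # in-distribution
ll_test_ood = np.array([-3900.0, -3800.0, -3450.0])        # OOD
eps = np.quantile(ll_val, 0.05)
print(flag_ood(ll_test_ood, eps))   # two of the three OOD inputs detected
print(auroc(ll_val, ll_test_ood))   # ≈ 0.833
```

The failure mode discussed in this paper is exactly that, for flows trained on images, the OOD scores often sit *above* the in-distribution ones, so any such threshold misfires.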
Recent works have shown that normalizing flows, among other deep generative models, can assign higher likelihood to out-of-distribution data [27, 7]. The work on OOD detection with deep generative models falls into two distinct categories. In group anomaly detection (GAD), the task is to label a batch of n > 1 datapoints as in- or out-of-distribution. Point anomaly detection (PAD) involves the more challenging task of labelling single points as out-of-distribution.
Group anomaly detection

Nalisnick et al. [28] introduce the typicality test, which distinguishes between the high density set and the typical set of the distribution induced by a model. However, the typicality test cannot detect OOD data if the flow assigns it a likelihood distribution similar to that of the in-distribution data. Song et al. [37] showed that out-of-distribution datasets have lower likelihoods when batch normalization statistics are computed from the current batch instead of accumulated over the train set, and proposed a test based on this observation. Zhang et al. [45] introduce a GAD algorithm based on measuring correlations of the flow's latent representations corresponding to the input batch. The main limitation of GAD methods is that for most practical applications the assumption that the data comes in batches of inputs that are all in-distribution or all OOD is not realistic.
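A hedged sketch of a typicality-style group test in the spirit of Nalisnick et al. [28]: a batch is flagged as OOD when its average log-likelihood deviates from the typical value estimated on training data, so even an OOD batch with *higher* likelihood is caught. The threshold tau and all score arrays below are illustrative placeholders, not the paper's setup.

```python
import numpy as np

# Group test: compare the batch's mean log-likelihood to the typical value.
def group_is_ood(batch_loglik, train_loglik, tau):
    typical = train_loglik.mean()   # estimate of the typical log-likelihood
    return abs(np.mean(batch_loglik) - typical) > tau

rng = np.random.default_rng(0)
train_ll = rng.normal(-3500.0, 30.0, size=10000)   # "training" scores
in_batch = rng.normal(-3500.0, 30.0, size=64)
# an OOD batch may score HIGHER than training data, but it is still atypical
ood_batch = rng.normal(-3300.0, 30.0, size=64)

print(group_is_ood(in_batch, train_ll, tau=20.0))   # typical batch
print(group_is_ood(ood_batch, train_ll, tau=20.0))  # atypical batch
```

As the surrounding text notes, this only works when a whole batch shares one label, and it fails when the flow assigns the OOD data a likelihood distribution overlapping the in-distribution one.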
Point anomaly detection
Choi et al. [7] proposed to estimate the Watanabe-Akaike Information Criterion using an ensemble of generative models, showing accurate OOD detection on some of the challenging dataset pairs. Ren et al. [33] explain the poor OOD detection performance of deep generative models by the fact that the likelihood is dominated by background statistics. They propose a test based on the ratio of the likelihood for the image and the background likelihood estimated using a separate background model. Serrà et al. [36] show that normalizing flows assign higher likelihoods to simpler datasets and propose to normalize the flow's likelihood by an image complexity score.

In this work we argue that it is the inductive biases of the model that determine its OOD performance. While most work treats flows as black-box density estimators, we conduct a careful study of the latent representations and image-to-latent-space transformations learned by the flows. Throughout the paper, we connect our findings with prior work and provide new insights.
Why flows fail to detect OOD data
Normalizing flows consistently fail at out-of-distribution detection when applied to common benchmark datasets (see Appendix D). In this paper, we discuss the reasons behind this surprising phenomenon. We summarize our thesis as follows:

The maximum likelihood objective has a limited influence on OOD detection, relative to the inductive biases of the flow, captured by the modelling assumptions of the architecture.
Why should flows be able to detect OOD inputs?
Flows are trained to maximize the likelihood of the training data. Likelihood is a probability density function p(D) defined on the image space and hence has to be normalized. Thus, likelihood cannot be simultaneously increased for all inputs (images). In fact, the optimal maximizer of (1) would only assign positive density to the datapoints in the training set and, in particular, would not even generalize to the test set of the same dataset. In practice, flows do not seem to overfit, assigning similar likelihood distributions to train and test data (see e.g. Figure 1(a)). Thus, despite their flexibility, flows are not maximizing the likelihood (1) to values close to the global optimum.
What is OOD data?

There are infinitely many distributions that give rise to any value of the likelihood objective in (1) except the global optimum. Indeed, any non-optimal solution assigns probability mass outside of the training data distribution; we can arbitrarily re-assign this probability mass to get a new solution with the same value of the objective (see Appendix A for a detailed discussion). Therefore, the inductive biases of a model determine which specific solution is found through training. In particular, the inductive biases will affect which data is assigned high likelihood (in-distribution) and which data is not (OOD).
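The re-assignment argument can be made concrete with a toy density model. The two mixtures below are invented for illustration and are not flows, but the same accounting applies to any normalized density: they agree on the training data, and therefore on the likelihood objective, yet place their remaining probability mass in different regions and so disagree about what counts as OOD.

```python
import numpy as np

# Two Gaussian mixtures that achieve the same training likelihood but
# assign the "leftover" probability mass to different regions.
def logpdf_mixture(x, means, weights, sigma=1.0):
    # pointwise log density of a 1D Gaussian mixture
    comps = np.exp(-0.5 * ((x[:, None] - means[None, :]) / sigma) ** 2)
    comps /= np.sqrt(2 * np.pi) * sigma
    return np.log((comps * weights[None, :]).sum(axis=1))

train = np.array([0.0, 0.5, -0.5])          # "training data" near zero
model_a = dict(means=np.array([0.0, 30.0]), weights=np.array([0.5, 0.5]))
model_b = dict(means=np.array([0.0, -30.0]), weights=np.array([0.5, 0.5]))

ll_a = logpdf_mixture(train, **model_a).sum()
ll_b = logpdf_mixture(train, **model_b).sum()
# same objective value, yet x = 30 is "in-distribution" only for model A
print(np.isclose(ll_a, ll_b))
```

Nothing in the maximum likelihood objective prefers model A over model B; only the inductive biases of the model family decide which kind of solution training finds.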
What inductive biases are needed for OOD detection?
The datasets in computer vision are typically defined by the semantic content of the images. For example, the CelebA dataset consists of images of faces, and SVHN contains images of house numbers. In order to detect OOD data, the inductive biases of the model have to be aligned with learning the semantic structure of the data, i.e. what objects are represented in the data.
What are the inductive biases of normalizing flows?
In the remainder of the paper, we explore the inductive biases of normalizing flows. We argue that flows are biased towards learning graphical properties of the data, such as local pixel correlations (e.g. nearby pixels usually have similar colors), rather than semantic properties of the data (e.g. what objects are shown in the image).
Flows have capacity to distinguish datasets
In Appendix B, we show that if we explicitly train flows to distinguish between a pair of datasets, they can assign large likelihood to one dataset and low likelihood to the other. However, when trained with the standard maximum likelihood objective, flows do not learn to make this distinction. The inductive biases of the flows prefer solutions that assign high likelihood to most structured datasets simultaneously.
Normalizing flows learn highly non-linear image-to-latent-space mappings, often using hundreds of millions of parameters. One could imagine that the learned latent representations have a complex structure, encoding high-level semantic information about the inputs. In this section, we visualize the learned latent representations on both in-distribution and out-of-distribution data and demonstrate that they encode simple graphical structure rather than semantic information.
Observation: There exists a correspondence between the coordinates in an image and in its learned representation. We can recognize edges of the inputs in their latent representations.
Significance for OOD detection:
In order to detect OOD images, a model has to assign likelihood based on the semantic content of the image (see Sec. 4). Flows do not represent images based on their semantic contents, but rather directly encode their visual appearance.

In the first four columns of Figure 2, we show latent representations of a RealNVP model trained on FashionMNIST for an in-distribution FashionMNIST image and an out-of-distribution MNIST digit. The first column shows the original image x, and the second column shows the corresponding latent z. The latent representations appear noisy both for in- and out-of-distribution samples, but the edges of the MNIST digit can be recognized in the latent. In the third column of Figure 2, we show latent representations averaged over K = 40 samples of dequantization noise ε_k: (1/K) Σ_{k=1}^{K} f^{-1}(x + ε_k). In the averaged representation, we can clearly see the edges from the original image. Finally, in the fourth column of Figure 2, we visualize the latent representations (for a single sample of dequantization noise) from a flow whose batch normalization layers are in train mode [18]. In train mode, batch normalization layers use the activation statistics of the current batch; in evaluation mode, they use the statistics accumulated over the train set. While for in-distribution data there is no structure visible in the latent representation, the out-of-distribution latent clearly preserves the shape of the digit from the input image. In the remaining panels of Figure 2, we show an analogous visualization for a RealNVP trained on CelebA using an SVHN image as OOD. In the third panel of this group, we visualize the blue channel of the latent representations. Again, the OOD input can be recognized in the latent representation; some of the edges from the in-distribution CelebA image can also be seen in the corresponding latent variable. Additional visualizations (e.g. for Glow) are in Appendix F.
Insights into prior work

The group anomaly detection algorithm proposed in Zhang et al. [45] uses correlations of the latent representations as an OOD score. Song et al. [37] showed that normalizing flows with batch normalization layers in train mode assign much lower likelihood to out-of-distribution images than they do in evaluation mode, while for in-distribution data the difference is not significant. Our visualizations explain the presence of correlations in the latent space and shed light on the difference between the behaviour of the flows in train and test mode.
To better understand the inductive biases of coupling-layer-based flows, we study the transformations learned by individual coupling layers.
What are coupling layers trained to do?
Each coupling layer updates the masked part x_change of the input x as x_change ← (x_change + t(x_id)) · exp(s(x_id)), where x_id is the non-masked part of x, and s and t are the outputs of the st-network given x_id (see Section 2). The flow is encouraged to predict high values for s, since for a given coupling layer the Jacobian term in the likelihood of Eq. (1) is given by Σ_j s(x_id)_j (see Section 4). Intuitively, to afford large values for the scale s without making the latent representations large in norm, and hence decreasing the density term p_Z(z) in (1), the shift −t has to be an accurate approximation of the masked input x_change. For example, in Figure 1(b, c), the −t outputs of the first coupling layers are a very close estimate of the input to the coupling layer. The likelihood for a given image will be high whenever the coupling layers can accurately predict masked pixels. To the best of our knowledge, this intuition has not been discussed in any previous work.

For the details of the visualization procedure and the training setup, please see Appendices E and C. When training flow models on images or other discrete data, we use dequantization to avoid pathological solutions [43, 41]: we add uniform noise ε ∼ U[0, 1] to each pixel x_i. Every time we pass an image through the flow f(·), the resulting latent representation z will be different.

Figure 2: Latent spaces. Visualization of latent representations for the RealNVP model on in-distribution and out-of-distribution inputs. Panels 1-4: original images, latent representations, latent representations averaged over samples of dequantization noise, and latent representations with batch normalization in train mode, for a flow trained on FashionMNIST using MNIST for OOD data. Panels 5-8: same as 1-4 but for a model trained on CelebA with SVHN for OOD, except in panel 7 we show the blue channel of the latent representation from panel 6 instead of an averaged latent representation. For both dataset pairs, we can recognize the shape of the input image in the latent representations. The flow represents images based on their graphical appearance rather than semantic content.

Observation: We describe two mechanisms through which coupling layers learn to predict the masked pixels: (1) leveraging local color correlations, and (2) using information about the masked pixels encoded by the previous coupling layer (coupling layer co-adaptation).
Significance for OOD detection: These mechanisms allow the flows to predict the masked pixels equally accurately on in- and out-of-distribution datasets. As a result, flows assign high likelihood to OOD data.
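Mechanism (1) can be illustrated without any flow at all: on a locally correlated image, a checkerboard-masked pixel is predicted almost perfectly by averaging its observed neighbours, while on an image without local structure it is not. The images below are synthetic stand-ins (a linear ramp and white noise), not the paper's data.

```python
import numpy as np

# Predict each masked pixel as the mean of its 4 neighbours; under a
# checkerboard mask, all 4 neighbours of a masked pixel are observed.
def neighbour_predict(img, mask):
    p = np.pad(img, 1, mode="edge")
    neigh = (p[:-2, 1:-1] + p[2:, 1:-1] + p[1:-1, :-2] + p[1:-1, 2:]) / 4
    pred = img.copy()
    pred[mask] = neigh[mask]
    return pred

n = 16
checker = (np.indices((n, n)).sum(axis=0) % 2).astype(bool)
ramp = np.add.outer(np.arange(n), np.arange(n)) / (2 * n - 2.0)  # smooth image
noise = np.random.default_rng(0).normal(size=(n, n))             # no local structure

err_ramp = np.abs(neighbour_predict(ramp, checker) - ramp)[checker].mean()
err_noise = np.abs(neighbour_predict(noise, checker) - noise)[checker].mean()
print(err_ramp < err_noise)  # local correlations make masked pixels easy
```

A flow whose st-networks can realize this kind of local averaging will achieve a large Jacobian term on any image with smooth regions, in-distribution or not, which is exactly the failure mode described above.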
In Figure 3(a, b), we visualize intermediate coupling layer activations of a small RealNVP model with 2 coupling layers and checkerboard masks trained on FashionMNIST. For the masked inputs, the outputs of the st-network are shown in black. Even though the flow was trained on FashionMNIST and has never seen an MNIST digit, the st-networks can easily predict masked pixels from observed pixels on both FashionMNIST and MNIST. Figure 1 shows the same behaviour in the first coupling layers of RealNVP trained on ImageNet.

With the checkerboard mask, the st-networks predict the masked pixels from neighbouring pixels (see Appendix G for a discussion of different masks). Natural images have local structure and correlations: with high probability, a particular pixel value will be similar to its neighbouring pixels. The checkerboard mask creates an inductive bias for the flow to pick up on these local correlations. In Figure 3, we can see that the outputs of the s-network are especially large for the background pixels and large patches of the same color (larger values are shown with lighter color), where the flow simply predicts, for example, that a pixel surrounded by black pixels will itself be black.

Figure 3: Coupling layers. Visualization of RealNVP's intermediate coupling layer activations, as well as the scales s and shifts t predicted by each coupling layer, on in-distribution (panels a, c) and out-of-distribution inputs (panels b, d). RealNVP was trained on FashionMNIST. (a), (b): RealNVP with standard checkerboard masks. The st-networks are able to predict the masked pixels well from neighbouring pixels, both on in-distribution and OOD inputs. (c), (d): RealNVP with a horizontal mask. Despite being trained on FashionMNIST, the st-networks are able to correctly predict the bottom half of MNIST digits in the second coupling layer due to coupling layer co-adaptation.

In addition to the checkerboard mask, RealNVP and Glow also use channel-wise masks. These masks are applied after a squeeze layer, which puts different subsampled versions of the image in different channels. As a result, the st-network is again trained to predict pixel values from neighbouring pixels. We provide additional visualizations for RealNVP and Glow in Appendix H.

To better understand the transformations learned by the coupling layers, we replaced the standard masks in RealNVP with a sequence of horizontal masks that cover one half of the image (either top or bottom). For example, the first coupling layer of the flow shown in panels (c, d) of Figure 3 transforms the bottom half of the image based on the top half, the second layer transforms the top half based on the bottom half, and so on. In Figure 3(c, d), we visualize the coupling layers of a RealNVP with horizontal masks on in-distribution (FashionMNIST) and OOD (MNIST) data.

In the first coupling layer, the shift output −t of the st-network predicts the bottom half of the image poorly, and the layer does not seem to transform the input significantly. In the second and third layers, −t presents an almost ideal reconstruction of the masked part of the image on both the in-distribution and, surprisingly, the OOD input. It is not possible for an st-network that was only trained on FashionMNIST to predict the top half of an MNIST digit based on the other half. The resolution is that the first layer encodes information about the top half into the bottom half of the image; the second layer then decodes this information to accurately predict the top half. Similarly, the third layer leverages information about the bottom half of the image encoded by the second layer. We refer to this phenomenon as coupling layer co-adaptation. Additional visualizations are in Appendix H.

Horizontal masks allow us to conveniently visualize coupling layer co-adaptation, but we hypothesize that the same mechanism applies to the standard checkerboard and channel-wise masks in combination with local color correlations.
Figure 4:
Effect of st-network capacity. Histograms of log-likelihoods of in- and out-of-distribution data for RealNVP trained on FashionMNIST, varying the dimension l of the bottleneck in the st-networks. Flows with lower l work better for OOD detection: the baseline assigns higher likelihood to the out-of-distribution MNIST images, while the flows with l = 50 and l = 10 assign much higher likelihood to the in-distribution FashionMNIST data. With l = 100, the flow assigns higher likelihood to in-distribution data, but the overlap of the likelihood distribution with OOD MNIST is higher than for l = 50 and l = 10.
Prior work showed that the likelihood score is heavily affected by input complexity [36] and background statistics [33]; however, prior work does not explain why flows exhibit this behavior. Simpler images (e.g. SVHN compared to CIFAR-10) and backgrounds often contain large patches of the same color, which makes it easy to predict masked pixels from their neighbours and to encode and decode the information via coupling layer co-adaptation.
Our observations in Sections 5 and 6 suggest that normalizing flows are biased towards learning transformations that increase likelihood simultaneously for all structured images. We discuss two simple ways of changing the inductive biases for better OOD detection.

By changing the masking strategy or the architecture of the st-networks in flows, we can improve OOD detection based on likelihood.
We consider three types of masks. We introduced the horizontal mask in Section 6.2: in each coupling layer, the flow updates the bottom half of the image based on the top half, or vice versa. With a horizontal mask, flows cannot simply use the information from neighbouring pixels when predicting a given pixel, but they exhibit coupling layer co-adaptation (see Section 6.2). To combat coupling layer co-adaptation, we additionally introduce the cycle-mask, a masking strategy where the information about a part of the image has to travel through three coupling layers before it can be used to update the same part of the image (details in Appendix I.1). To compare the performance of the checkerboard mask, horizontal mask and cycle-mask, we construct flows of exactly the same size and architecture (RealNVP with 8 coupling layers and no squeeze layers) with each of these masks, trained on CelebA and FashionMNIST. We present the results in Appendix I.1. As expected, for the checkerboard mask, the flow assigns higher likelihood to the simpler OOD datasets (SVHN for CelebA and MNIST for FashionMNIST). With the horizontal mask, the OOD data still has higher likelihood on average, but the relative ranking of the in-distribution data is improved. Finally, for the cycle-mask, on FashionMNIST the likelihood is on average higher than on MNIST. On CelebA the likelihood is similar but slightly lower compared to SVHN.

st-networks with bottleneck

Another way to force the flow to learn global structure rather than local pixel correlations, and to prevent coupling layer co-adaptation, is to restrict
Figure 5:
Image embeddings.
Log-likelihood histograms for RealNVP trained on raw pixel data (first three panels) and on embeddings extracted for the same image datasets using an EfficientNet trained on ImageNet (last three panels). On raw pixels, the flow assigns the highest likelihood to SVHN regardless of its training dataset. On image embeddings, flows always assign higher likelihood to in-distribution data. When trained on features capturing the semantic content of the input, flows can detect OOD.

the capacity of the st-networks. To do so, we introduce a bottleneck into the st-networks: a pair of fully-connected layers projecting to a space of dimension l and back to the original input dimension. We insert these layers after the middle layer of the st-network. If the latent dimension l is small, the st-network cannot simply reproduce its input as its output, and thus cannot exploit the local pixel correlations discussed in Section 6. Passing information through multiple layers with a low-dimensional bottleneck also reduces the effect of coupling layer co-adaptation. We train a RealNVP flow varying the latent dimension l on CelebA and on FashionMNIST. We present the results in Figure 4 and Appendix I. On FashionMNIST, introducing the bottleneck forces the flow to assign lower likelihood to OOD data (Figure 4). Furthermore, as we decrease l, the likelihood of the OOD data decreases while the FashionMNIST likelihood stays the same. On CelebA, the relative ranking of likelihood for in-distribution data is similarly improved when we decrease the dimension l of the bottleneck, but SVHN is still assigned slightly higher likelihood than CelebA. See Appendix I for detailed results.

While the proposed modifications do not completely resolve the issue of OOD data having higher likelihood, the experiments support our observations in Section 6: by preventing the flows from leveraging local color correlations and coupling layer co-adaptation, we improve the relative likelihood ranking for in-distribution data.
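The bottleneck modification can be sketched as follows. The layer sizes, tanh nonlinearity and random weights are placeholders (an untrained stand-in); only the shape of the computation, squeezing the hidden representation through l dimensions, mirrors the idea in the text.

```python
import numpy as np

# st-network with a fully-connected bottleneck of dimension l: two extra
# linear maps project the hidden activations down to R^l and back up,
# limiting how much of x_id (e.g. neighbouring-pixel detail) can pass
# through. Weights are random placeholders, not trained parameters.
rng = np.random.default_rng(0)

def make_bottleneck_st(d, l, hidden=64):
    W_in = rng.normal(size=(hidden, d)) / np.sqrt(d)
    W_down = rng.normal(size=(l, hidden)) / np.sqrt(hidden)   # project to R^l
    W_up = rng.normal(size=(hidden, l)) / np.sqrt(l)          # project back
    W_out = rng.normal(size=(2 * d, hidden)) / np.sqrt(hidden)

    def st(x_id):
        h = np.tanh(W_in @ x_id)
        h = np.tanh(W_up @ (W_down @ h))   # all information squeezed through l dims
        out = W_out @ h
        return out[:d], out[d:]            # scale s and shift t for x_change
    return st

st = make_bottleneck_st(d=8, l=2)
s, t = st(np.ones(8))
```

With small l, the map from x_id to (s, t) factors through an l-dimensional space, so the network cannot copy its input through to the output, which is the property the text relies on to suppress local-correlation shortcuts.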
In Section 4, we argued that in order to detect OOD data, the model has to assign likelihood based on high-level semantic features of the data, which flows fail to do when trained on images. In this section, we test out-of-distribution detection using image representations from a deep neural network.

Normalizing flows can detect OOD images when trained on high-level semantic representations instead of raw pixels.

We extract embeddings for CIFAR-10, CelebA and SVHN using an EfficientNet [40] pretrained on ImageNet [35], which yields 1792-dimensional features. We train RealNVP on each of the representation datasets, considering the other two datasets as OOD. We present the likelihood histograms for all datasets in Figure 5(b). Additionally, we report AUROC scores in Appendix Table 2. For the models trained on SVHN and CelebA, both OOD datasets have lower likelihood and the AUROC scores are close to 100%. For the model trained on CIFAR-10, CelebA has lower likelihood. Moreover, the likelihood distribution on SVHN,

The dimension of the embeddings is only about two times smaller than that of the original images. Thus, the inability to detect OOD images cannot be explained just by the high dimensionality of the data.
Non-image data
In Appendix K, we evaluate flows on tabular UCI datasets, where the features are relatively high-level compared to images. On these datasets, normalizing flows assign higher likelihood to in-distribution data.
Many of the puzzling phenomena in deep learning can be boiled down to a matter of inductive biases. Neural networks in many cases have the flexibility to overfit datasets, but they do not, because the biases of the architecture and training procedures can guide us towards reasonable solutions. In performing OOD detection, the biases of normalizing flows can be more of a curse than a blessing. Indeed, we have shown that flows tend to learn representations that achieve high likelihood through generic graphical features and local pixel correlations, rather than discovering semantic structure that would be specific to the training distribution.

To provide insights into prior results [e.g., 27, 7, 28, 37, 45, 36], part of our discussion has focused on an in-depth exploration of the popular class of normalizing flows based on affine coupling layers. We hypothesize that many of our conclusions about coupling layers extend at a high level to other types of normalizing flows [e.g., 3, 6, 12, 20, 13, 30, 38, 17, 8]. A full study of these other types of flows is a promising direction for future work.
Acknowledgements
PK, PI, and AGW are supported by an Amazon Research Award, Amazon Machine Learning Research Award, Facebook Research, NSF I-DISRE 193471, NIH R01 DA048764-01A1, NSF IIS-1910266, and NSF 1922658 NRT-HDR: FUTURE Foundations, Translation, and Responsibility for Data Science. We thank Marc Finzi, Greg Benton, Wesley Maddox, and Alex Wang for helpful discussions.
References

[1] Andrei Atanov, Alexandra Volokhova, Arsenii Ashukha, Ivan Sosnovik, and Dmitry Vetrov. Semi-conditional normalizing flows for semi-supervised learning. arXiv preprint arXiv:1905.00505, 2019.
[2] Pierre Baldi, Kyle Cranmer, Taylor Faucett, Peter Sadowski, and Daniel Whiteson. Parameterized machine learning for high-energy physics. arXiv preprint arXiv:1601.07913, 2016.
[3] Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
[4] Apratim Bhattacharyya, Shweta Mahajan, Mario Fritz, Bernt Schiele, and Stefan Roth. Normalizing flows with multi-scale autoregressive priors. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2020.
[5] Jianfei Chen, Cheng Lu, Biqi Chenli, Jun Zhu, and Tian Tian. VFlow: More expressive generative flows with variational data augmentation. arXiv preprint arXiv:2002.09741, 2020.
[6] Ricky T. Q. Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. arXiv preprint arXiv:1906.02735, 2019.
[7] Hyunsun Choi, Eric Jang, and Alexander A. Alemi. WAIC, but why? Generative ensembles for robust anomaly detection. arXiv preprint arXiv:1810.01392, 2018.
[8] Nicola De Cao, Ivan Titov, and Wilker Aziz. Block neural autoregressive flow. arXiv preprint arXiv:1904.04676, 2019.
[9] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[10] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[11] Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, pages 7509–7520, 2019.
[12] Marc Finzi, Pavel Izmailov, Wesley Maddox, Polina Kirichenko, and Andrew Gordon Wilson. Invertible convolutional networks. In Workshop on Invertible Neural Nets and Normalizing Flows, International Conference on Machine Learning, 2019.
[13] Will Grathwohl, Ricky T. Q. Chen, Jesse Bettencourt, Ilya Sutskever, and David Duvenaud. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.
[14] Dan Hendrycks, Mantas Mazeika, and Thomas Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.
[15] Jonathan Ho, Xi Chen, Aravind Srinivas, Yan Duan, and Pieter Abbeel. Flow++: Improving flow-based generative models with variational dequantization and architecture design. arXiv preprint arXiv:1902.00275, 2019.
[16] Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. arXiv preprint arXiv:1901.11137, 2019.
[17] Chin-Wei Huang, David Krueger, Alexandre Lacoste, and Aaron Courville. Neural autoregressive flows. arXiv preprint arXiv:1804.00779, 2018.
[18] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
[19] Pavel Izmailov, Polina Kirichenko, Marc Finzi, and Andrew Gordon Wilson. Semi-supervised learning with normalizing flows. In International Conference on Machine Learning, 2020.
[20] Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, and Daniel Duckworth. Invertible convolutional flow. In Advances in Neural Information Processing Systems 32, pages 5635–5645. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/8801-invertible-convolutional-flow.pdf.
[21] Sungwon Kim, Sang-gil Lee, Jongyoon Song, Jaehyeon Kim, and Sungroh Yoon. FloWaveNet: A generative flow for raw audio. arXiv preprint arXiv:1811.02155, 2018.
[22] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1 × 1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
[23] Ivan Kobyzev, Simon Prince, and Marcus A. Brubaker. Normalizing flows: Introduction and ideas. arXiv preprint arXiv:1908.09257, 2019.
[24] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
[25] Xuezhe Ma, Xiang Kong, Shanghang Zhang, and Eduard Hovy. MaCow: Masked convolutional generative flow. In Advances in Neural Information Processing Systems, pages 5891–5900, 2019.
[26] Tom M. Mitchell. The need for biases in learning generalizations. Department of Computer Science, Laboratory for Computer Science Research . . . , 1980.
[27] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, Dilan Gorur, and Balaji Lakshminarayanan. Do deep generative models know what they don't know? arXiv preprint arXiv:1810.09136, 2018.
[28] Eric Nalisnick, Akihiro Matsukawa, Yee Whye Teh, and Balaji Lakshminarayanan. Detecting out-of-distribution inputs to deep generative models using a test for typicality. arXiv preprint arXiv:1906.02994, 2019.
[29] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
[30] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pages 2338–2347, 2017.
[31] George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.
[32] Ryan Prenger, Rafael Valle, and Bryan Catanzaro. WaveGlow: A flow-based generative network for speech synthesis. In ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 3617–3621. IEEE, 2019.
[33] Jie Ren, Peter J. Liu, Emily Fertig, Jasper Snoek, Ryan Poplin, Mark Depristo, Joshua Dillon, and Balaji Lakshminarayanan. Likelihood ratios for out-of-distribution detection. In Advances in Neural Information Processing Systems, pages 14680–14691, 2019.
[34] Byron P. Roe, Hai-Jun Yang, Ji Zhu, Yong Liu, Ion Stancu, and Gordon McGregor. Boosted decision trees as an alternative to artificial neural networks for particle identification. Nuclear Instruments and Methods in Physics Research Section A: Accelerators, Spectrometers, Detectors and Associated Equipment, 543(2-3):577–584, 2005.
[35] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.
[36] Joan Serrà, David Álvarez, Vicenç Gómez, Olga Slizovskaia, José F. Núñez, and Jordi Luque. Input complexity and out-of-distribution detection with likelihood-based generative models. arXiv preprint arXiv:1909.11480, 2019.
[37] Jiaming Song, Yang Song, and Stefano Ermon. Unsupervised out-of-distribution detection with batch normalization. arXiv preprint arXiv:1910.09115, 2019.
[38] Yang Song, Chenlin Meng, and Stefano Ermon. MintNet: Building invertible neural networks with masked convolutions. In Advances in Neural Information Processing Systems, pages 11002–11012, 2019.
[39] Esteban G. Tabak and Cristina V. Turner. A family of nonparametric density estimation algorithms. Communications on Pure and Applied Mathematics, 66(2):145–164, 2013.
[40] Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
[41] Lucas Theis, Aäron van den Oord, and Matthias Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
[42] Antonio Torralba, Rob Fergus, and William T. Freeman. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11):1958–1970, 2008.
[43] Benigno Uria, Iain Murray, and Hugo Larochelle. RNADE: The real-valued neural autoregressive density-estimator. In Advances in Neural Information Processing Systems, pages 2175–2183, 2013.
[44] Andrew Gordon Wilson and Pavel Izmailov. Bayesian deep learning and a probabilistic perspective of generalization. arXiv preprint arXiv:2002.08791, 2020.
[45] Yufeng Zhang, Wanwei Liu, Zhenbang Chen, Ji Wang, Zhiming Liu, Kenli Li, Hongmei Wei, and Zuoning Chen. Out-of-distribution detection with distance guarantee in deep generative models. arXiv preprint arXiv:2002.03328, 2020.

Appendix outline
This appendix is organized as follows.
• In Section A, we provide additional discussion and a formal statement of the argument presented in Section 4.
• In Section B, we show that normalizing flows can be trained to assign high likelihood to the target data and low likelihood to a given OOD dataset.
• In Section C, we provide the hyper-parameters that we used for the experiments in this paper.
• In Section D, we report the log-likelihood histograms and OOD detection AUROC scores for the baseline RealNVP and Glow models on various datasets.
• In Section E, we explain the procedure that we use to visualize the latent representations and coupling layers of normalizing flows.
• In Section F, we provide additional latent representation visualizations.
• In Section G, we explain the different masking strategies for coupling layers of normalizing flows.
• In Section H, we provide additional coupling layer visualizations.
• In Section I, we provide additional details on the experiments of Section 7.
• In Section J, we provide samples from baseline models on various datasets. We also discuss an experiment on resampling parts of the latent variables corresponding to different images with normalizing flows.
• In Section K, we provide additional details and results for the experiments on image embeddings and tabular data from Section 8.
A Maximum likelihood objective is agnostic to what data is OOD
In Section 4 we argued that the maximum likelihood objective by itself does not define the out-of-distribution detection performance of a normalizing flow. Instead, it is the inductive biases of the flow that determine which data will be assigned high or low likelihood. We illustrate this point in Figure 6.

The yellow and red shaded regions illustrate the high-probability regions of two distributions defined on the image space $\mathcal{X}$. The distribution in yellow assigns high likelihood to the train (CelebA) images corrupted by small levels of noise or brightness adjustments. This distribution represents how a human could describe the target dataset. The red distribution, on the other hand, assigns high likelihood to all structured images, including those from ImageNet and SVHN, but does not support noisy train images. The red distribution represents a distribution learned by a normalizing flow.

For simplicity, we can think of the distributions as uniform on the highlighted sets, with both sets having the same volume. Then both distributions assign equally high likelihood to the training data, but they split the data into in-distribution and OOD differently.

Figure 6: Inductive biases define what data is OOD. A conceptual visualization of two distributions in the image space (shown in yellow and red); training CelebA data is shown with crosses, and other images are shown with circles. The distribution shown in yellow could represent the inductive biases of a human: it assigns high likelihood to all images of human faces, regardless of small levels of noise and small brightness changes. The second distribution, shown in red, could represent a normalizing flow: it assigns high likelihood to all smooth structured images, including images from SVHN and ImageNet. Both distributions assign the same likelihood to the training set, but their high-probability sets are different.

As both distributions provide the same density to the target data, the value of the maximum likelihood objective in Equation (1) would be the same for the corresponding models. More generally, for any distribution that assigns finite density to the train set, we can construct another distribution that assigns the same density to the train data, but also high density to a given set of (OOD) datapoints. In particular, the new distribution achieves the same value of the maximum likelihood objective in Equation (1). We formalize our reasoning in the following simple proposition.
Proposition 1.
Let $p(\cdot)$ be a probability density on the space $\mathcal{X}$, and let $\mathcal{D} = \{x_i\}_{i=1}^{N}$ be the training dataset, where $x_i \in \mathcal{X}$ for $i = 1, \ldots, N$. Assume for simplicity that $p$ is upper bounded: $p(x) \leq u$ for any $x$. Let $\mathcal{D}_{\mathrm{OOD}}$ be an arbitrary finite set of points. Then, for any $c \geq 0$ there exists a distribution with density $p'(\cdot)$ such that $p'(x) = p(x)$ for all $x \in \mathcal{D}$, and $p'(x') \geq c$ for all $x' \in \mathcal{D}_{\mathrm{OOD}}$.

Proof.
Consider the set $S(r) = \cup_{x_i \in \mathcal{D}} B(x_i, r)$, where $B(x, r)$ is a ball of radius $r$ centered at $x$. The probability mass of this set is $P(S(r)) = \int_{x \in S(r)} p(x)\,dx$. As $r \to 0$, the volume $V(S(r))$ of the set $S(r)$ goes to zero. We have
$$P(S(r)) = \int_{x \in S(r)} p(x)\,dx \leq V(S(r)) \cdot u \xrightarrow{\,r \to 0\,} 0. \qquad (4)$$
Hence, there exists $r$ such that $P(S(r)) \leq 1/2$.

Now, define a neighborhood of the set $\mathcal{D}_{\mathrm{OOD}}$ as
$$S_{\mathrm{OOD}} = \cup_{x' \in \mathcal{D}_{\mathrm{OOD}}} B(x', \hat{r}), \qquad (5)$$
where $\hat{r}$ is selected so that the total volume of the set $S_{\mathrm{OOD}}$ is $1/(2c)$. Then, we can define a new density $p'$ by redistributing the mass in $p(\cdot)$ from outside the set $S(r)$ to the neighborhood $S_{\mathrm{OOD}}$ as follows:
$$p'(x) = \begin{cases} p(x), & \text{if } x \in S(r), \\ 2c \cdot \left(1 - P(S(r))\right), & \text{if } x \in S_{\mathrm{OOD}}, \\ 0, & \text{otherwise.} \end{cases} \qquad (6)$$
The density $p'(\cdot)$ integrates to one, coincides with $p$ on the training data, and assigns density of at least $c$ to points in $\mathcal{D}_{\mathrm{OOD}}$.

[Figure 7 panels: (a) CIFAR ↑, SVHN ↓; (b) SVHN ↑, CIFAR ↓; (c) CIFAR ↑, CelebA ↓; (d) CelebA ↑, CIFAR ↓; (e) Fashion ↑, MNIST ↓; (f) MNIST ↑, Fashion ↓]
Figure 7:
Negative training.
The histograms of log-likelihoods for RealNVP when, during training, likelihood is maximized on one dataset and minimized on another: (a) maximized on CIFAR, minimized on SVHN; (b) maximized on SVHN, minimized on CIFAR; (c) maximized on CIFAR, minimized on CelebA; (d) maximized on CelebA, minimized on CIFAR; (e) maximized on FashionMNIST, minimized on MNIST; (f) maximized on MNIST, minimized on FashionMNIST.
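The construction in the proof of Proposition 1 can be checked numerically in one dimension. The sketch below is our own toy setup (a uniform base density with hypothetical train points), not code from the paper:

```python
# Toy 1-D check of Proposition 1: keep p on small balls around the train points,
# move all remaining mass onto a neighborhood of the OOD points, and verify that
# the new density p' integrates to one while exceeding the target value c there.
import numpy as np

u = 1.0                                  # upper bound on p; here p = Uniform[0, 1]
train = np.array([0.2, 0.5])             # hypothetical training points
c = 50.0                                 # desired density at the OOD points

r = 0.05                                 # ball radius around each train point
mass_S = 2 * r * len(train) * u          # P(S(r)); here the balls are disjoint
assert mass_S <= 0.5                     # the proof picks r small enough for this

vol_ood = 1.0 / (2 * c)                  # total volume of the OOD neighborhood
p_ood = 2 * c * (1.0 - mass_S)           # density assigned on S_OOD, Equation (6)

total_mass = mass_S + vol_ood * p_ood    # mass kept on S(r) + mass moved to S_OOD
print(round(total_mass, 12), p_ood >= c)  # 1.0 True
```

The check mirrors the proof term by term: because $P(S(r)) \leq 1/2$, the density $2c(1 - P(S(r)))$ placed on $S_{\mathrm{OOD}}$ is at least $c$, and the volume $1/(2c)$ makes the total mass exactly one.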
B Flows have capacity to distinguish datasets
Normalizing flows are unable to detect OOD image data when trained to maximize likelihood on the train set. It is natural to ask whether these models are at all capable of distinguishing different image datasets. In this section we demonstrate the following:
Observation: Flows can assign high likelihood to the train data and low likelihood to a given OOD dataset if they are explicitly trained to do so.
Relevance to OOD detection: While flows have sufficient capacity to distinguish different data, they are biased towards learning solutions that assign high likelihood to all structured data and consequently fail to detect OOD inputs.

We introduce an objective that encourages the flow to maximize likelihood on the target dataset and to minimize likelihood on a specific OOD dataset. The objective we used is
$$\frac{1}{N_{\mathcal{D}}} \sum_{x \in \mathcal{D}} \log p(x) \;-\; \frac{1}{N_{\mathrm{OOD}}} \sum_{x \in \mathcal{D}_{\mathrm{OOD}}} \log p(x) \cdot \mathbb{I}[\log p(x) > c], \qquad (7)$$
where $\mathbb{I}[\cdot]$ is an indicator function and the constant $c$ allows us to encourage the flow to only push the likelihood of OOD data down to a threshold rather than decreasing it to $-\infty$; $N_{\mathcal{D}}$ is the number of train datapoints and $N_{\mathrm{OOD}} = \sum_{x \in \mathcal{D}_{\mathrm{OOD}}} \mathbb{I}[\log p(x) > c]$ is the number of OOD datapoints that have likelihood above the threshold $c$.

We trained a RealNVP flow with the objective (7) using different pairs of target and OOD datasets: CIFAR-10 vs CelebA, CIFAR-10 vs SVHN, and FashionMNIST vs MNIST. We present the results in Figure 7. In each case, the flow is able to push the likelihood of the OOD dataset to very low values and simultaneously maximize the likelihood on the target dataset, creating a clear separation between the two.

Hyper-parameters: For the flow architecture and training we used the same hyper-parameters as for the baselines, described in Appendix C. For the CelebA, CIFAR, and SVHN models we set c = − , and for the MNIST, FashionMNIST, and NotMNIST models we set c = − .

Connection with prior work:
Flows can be used as classifiers separating different classes of the same dataset [28, 19, 1], which further highlights the fact that flows can distinguish images based on their content when trained to do so. A similar experiment for the PixelCNN model [29] was presented in Hendrycks et al. [14]. The authors maximized the likelihood of CIFAR-10 and minimized the likelihood of the TinyImages dataset [42]. In their experiments, this procedure consistently led to CIFAR-10 having higher likelihood than any of the other benchmark datasets. In Figure 7, for each experiment we show, in addition to the two datasets used in training, the log-likelihood distribution on another OOD dataset. For example, when we train the flow to separate CIFAR-10 from CelebA (panels c, d), the flow successfully does so but assigns SVHN likelihood similar to that of CIFAR. When we train the flow to separate CIFAR-10 from SVHN (panels a, b), the flow successfully does so but assigns CelebA likelihood similar to that of CIFAR. Similar observations can be made for MNIST, FashionMNIST, and NotMNIST. At least for normalizing flows, minimizing the likelihood on a single OOD dataset does not lead to all the other OOD datasets receiving low likelihood.
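For concreteness, the objective in Equation (7) can be sketched directly from per-example log-likelihoods. This is our own minimal numpy version; the function name and toy values are illustrative and not from the paper's code:

```python
# A minimal sketch of the negative-training objective in Equation (7), assuming
# log p(x) has already been computed by the flow for in-distribution and OOD
# batches. (Our own illustration; names and values are hypothetical.)
import numpy as np

def negative_training_loss(logp_in, logp_ood, c):
    """Negated objective (7): maximize likelihood on target data while pushing
    OOD log-likelihoods down only to the threshold c, not to -infinity."""
    above = logp_ood > c                        # indicator I[log p(x) > c]
    n_ood = max(int(above.sum()), 1)            # N_OOD: OOD points above threshold
    ll_in = logp_in.mean()                      # (1 / N_D) sum_x log p(x)
    ll_ood = (logp_ood * above).sum() / n_ood   # thresholded OOD term
    return -(ll_in - ll_ood)                    # negate so it can be minimized

loss = negative_training_loss(np.array([-3.0, -2.0]), np.array([-1.0, -9.0]), c=-5.0)
print(loss)  # 1.5
```

Note that the OOD example at −9.0 lies below the threshold c = −5.0 and therefore contributes nothing to the loss, which is exactly the role of the indicator in Equation (7).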
C Details of the experiments
RealNVP
For all RealNVP models, we generally follow the architecture design of Dinh et al. [10]. We use a multi-scale architecture where, after a block of coupling layers, half of the variables are factored out and copied forward directly to the latent representation. Each scale consists of 3 coupling layers with a checkerboard mask, followed by a squeeze operation and 3 coupling layers with a channel-wise mask (see Figure 9). For the st-network we use deep convolutional residual networks with additional skip connections, following Dinh et al. [10]. In all experiments, we use the Adam optimizer. On grayscale images (MNIST, FashionMNIST), we used 2 scales in RealNVP, 6 blocks in the residual st-network, learning rate × − , batch size 32, and trained the model for 80 epochs. On CIFAR-10, CelebA, and SVHN, we used 3 scales, 8 blocks in the st-network, learning rate − , batch size 32, weight decay × − , and trained the model for 100 epochs. On ImageNet, we used 5 scales, 2 blocks in the st-network, learning rate − , batch size 64, weight decay × − , and trained the model for 42 epochs. On CelebA 64 × 64, we used 4 scales, 4 blocks in the st-network, learning rate − , batch size 64, weight decay × − , and trained the model for 100 epochs.
We follow the training details of Nalisnick et al. [27] for multi-scale Glow models. Each scale consists of a sequence of actnorm, invertible 1 × 1 convolution, and coupling layers [22]. The squeeze operation is applied before each scale, and half of the variables are factored out after each scale. In all experiments, we use the RMSprop optimizer. On grayscale images (MNIST, FashionMNIST), we used 2 scales with 16 coupling layers, a 3-layer Highway network with 200 hidden units for the st-network, learning rate × − , batch size 32, and trained the model for 80 epochs. On color images (CIFAR-10, CelebA, SVHN), we used 3 scales with 8 coupling layers, a 3-layer Highway network with 400 hidden units for the st-network, learning rate × − , batch size 32, and trained the model for 80 epochs.

[Figure 8 panels: (a) RNVP, ImageNet; (b) RNVP, CelebA-HQ; (c) RNVP, MNIST; (d) RNVP, Fashion; (e) Glow, MNIST; (f) Glow, Fashion; (g) RNVP, CelebA; (h) RNVP, CIFAR-10; (i) RNVP, SVHN; (j) Glow, CelebA; (k) Glow, CIFAR-10; (l) Glow, SVHN]
Figure 8:
Baseline log-likelihoods.
The histograms of log-likelihoods for RealNVP and Glow models trained on various datasets. Both flows consistently assign similar or higher likelihood to OOD data compared to the target dataset. The likelihood distributions for the train and test sets of the target data are typically very similar.
D Baseline models' likelihood distributions and AUROC scores
In Figure 8, we plot the histograms of the log-likelihoods on in-distribution and out-of-distribution datasets for RealNVP and Glow models. In Table 1, we report AUROC scores for OOD detection with these models. As reported in prior work, Glow and RealNVP consistently fail at OOD detection.
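The AUROC numbers reported in Table 1 can be computed directly from per-image log-likelihoods by ranking in-distribution examples (positives) against OOD examples (negatives). Below is our own self-contained sketch with made-up toy values; a library routine such as scikit-learn's roc_auc_score computes the same quantity:

```python
# Computing OOD-detection AUROC from log-likelihood scores (a sketch; the toy
# numbers are illustrative only, not from the paper's experiments).
import numpy as np

def auroc(pos, neg):
    """Probability that a random in-distribution example outranks a random OOD
    example (ties count 1/2); equal to the area under the ROC curve."""
    diff = pos[:, None] - neg[None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())

# Toy log-likelihoods in which the OOD set tends to score *higher*, as the
# baseline flows in Table 1 often do.
logp_in = np.array([-3000.0, -3100.0, -2900.0])
logp_ood = np.array([-2800.0, -3050.0, -2700.0])
print(auroc(logp_in, logp_ood))  # ~0.22: OOD is ranked above in-distribution
```

An AUROC near 0.5 means the likelihoods are uninformative, and values far below 0.5 (as for the CelebA- and CIFAR-trained models in Table 1) mean the flow systematically ranks OOD data above its own training distribution.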
E Visualization implementation
Normalizing flows such as RealNVP and Glow consist of a sequence of coupling layers, which change the content of the input, and squeeze layers (see Figure 9), which reshape it. Due to the presence of squeeze layers, the latent representations of the flow have a different shape compared to the input. In order to visualize latent representations, we revert all squeezing operations of the flow and visualize unsqueeze(z). Similarly, to visualize coupling layer activations and the scale and shift parameters predicted by the st-networks, we revert all squeezing operations and join all factored-out tensors in the case of the multi-scale architecture (i.e., we feed the corresponding tensor through the inverse sub-flow without applying coupling layers or invertible convolutions).

Table 1: Baseline AUROC. AUROC scores on OOD detection for RealNVP and Glow models trained on various image data. Flows consistently assign higher likelihoods to OOD datasets except when trained on MNIST and SVHN. The AUROC scores for RealNVP and Glow are close.

Model    Train data   OOD: CelebA   CIFAR-10   SVHN
RealNVP  CelebA       –             67.7       6.3
RealNVP  CIFAR-10     56.0          –          6.0
RealNVP  SVHN         99.0          98.4       –
Glow     CelebA       –             69.1       6.4
Glow     CIFAR-10     52.9          –          5.5
Glow     SVHN         99.9          99.1       –

Model    Train data   OOD: MNIST   Fashion   NotMNIST
RealNVP  MNIST        –            99.99     99.99
RealNVP  Fashion      10.8         –         72.1
Glow     MNIST        –            99.96     100.0
Glow     Fashion      13.3         –         80.2

Figure 9: Squeeze layers and masks. (a): A squeeze layer squeezes an image of size c × h × w into 4c × h/2 × w/2. The first panel shows the mask, where each color corresponds to a channel added by the squeeze layer (for visual clarity we show the mask for a small example image). The second panel shows a 1 × 28 × 28 MNIST digit, and the last panel shows the channels produced by the squeeze layer. The colors of the boundaries of the channel visualizations correspond to the colors of the pixels in the mask. Each channel produced by the squeeze layer is a subsampled version of the input image. (b)-(d): Checkerboard, channel-wise, and horizontal masks applied to the same input image. Masked regions are shown in red. The channel-wise mask is obtained by applying a squeeze layer and masking two of the channels (e.g. the last two); here we show the masked pixels in the un-squeezed image. Masks are typically alternated: in the subsequent layers the masked and observed positions are swapped.

F Additional latent representation visualizations

In Figure 10, we plot additional latent representations for RealNVP and Glow trained on FashionMNIST with MNIST as the OOD dataset, and for RealNVP trained on CelebA with SVHN as OOD. The results agree with Section 5: we can recognize edges from the original inputs in their latent representations.

[Figure 10 panels: (a) RealNVP trained on FashionMNIST; (b) Glow trained on FashionMNIST; (c) RealNVP trained on CelebA]
Figure 10:
Latent spaces.
Visualization of latent representations for RealNVP and Glow models on in-distribution and out-of-distribution inputs.
Rows 1-3 in (a) and (b): original images, latent representations, and latent representations averaged over samples of dequantization noise, for RealNVP and Glow models trained on FashionMNIST and using MNIST for OOD data. Row 4 in (a): latent representations for batch normalization in train mode.
Rows 1-4 in (c): original images, latent representations, the blue channel of the latent representation, and the latent representations for batch normalization in train mode, for a RealNVP model trained on CelebA and using SVHN as OOD data. For both dataset pairs, we can recognize the shape of the input image in the latent representations. The flow represents images based on their graphical appearance rather than semantic content.
Figure 11:
Coupling layer visualizations.
Visualization of intermediate coupling layer activations and st-network predictions for (a): RealNVP trained on FashionMNIST; (b): Glow trained on FashionMNIST; (c): RealNVP trained on CelebA. The top half of each subfigure shows the visualizations for an in-distribution image (FashionMNIST or CelebA), while the bottom half shows the visualizations for an OOD image (MNIST or SVHN). For all models, the shape of the input, both for in- and out-of-distribution images, is clearly visible in the s and t predictions of the coupling layers.

G Masking strategies
In Figure 9, we visualize the checkerboard, channel-wise, and horizontal masks on a single-channel image. The checkerboard and channel-wise masks are commonly used in RealNVP, Glow, and other coupling-layer-based flows for image data. We use the horizontal mask to better understand the transformations learned by the coupling layers in Section 6.
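The squeeze operation and the masks of Figure 9 are easy to state in code. The sketch below uses our own conventions (mask parity and channel ordering may differ from the paper's implementation):

```python
# A sketch of the squeeze operation and two of the masks from Figure 9.
# Conventions (parity, channel order) are ours, not necessarily the paper's.
import numpy as np

def squeeze(x):
    """Reshape a c x h x w image into 4c x h/2 x w/2; each output channel is a
    2x-subsampled copy of the input."""
    c, h, w = x.shape
    x = x.reshape(c, h // 2, 2, w // 2, 2)
    return x.transpose(0, 2, 4, 1, 3).reshape(4 * c, h // 2, w // 2)

def checkerboard_mask(h, w):
    """1 on observed pixels, 0 on masked pixels, alternating like a checkerboard."""
    return (np.indices((h, w)).sum(axis=0) + 1) % 2

def horizontal_mask(h, w):
    """Observe the top half of the image; mask the bottom half."""
    m = np.zeros((h, w), dtype=int)
    m[: h // 2] = 1
    return m

# A channel-wise mask is obtained by squeezing and masking some output channels.
x = np.arange(16).reshape(1, 4, 4)
assert np.array_equal(squeeze(x)[0], x[0, ::2, ::2])  # channel 0: even rows/cols
print(squeeze(x).shape)  # (4, 2, 2)
```

The assertion makes the point of Figure 9(a) concrete: every channel produced by the squeeze layer is just the input image subsampled at one of the four pixel offsets.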
H Additional coupling layer visualizations
In Figure 11, we plot additional visualizations of coupling layer activations and of the scale s and shift t parameters predicted by the st-networks. In Figure 12, we visualize the coupling layer activations for the small flow with a horizontal mask from Section 6.2 on several additional OOD inputs. These visualizations provide additional empirical support for Section 6.

Figure 12: Coupling layer co-adaptation.
Visualization of intermediate coupling layer activations, as well as scales s and shifts t predicted by each coupling layer of a RealNVP model with a horizontal mask, on out-of-distribution MNIST inputs. Although RealNVP was trained on FashionMNIST, the st-networks are able to correctly predict the bottom half of MNIST digits in the second coupling layer due to coupling layer co-adaptation.

I Changing biases in flow models for better OOD detection
I.1 Cycle-mask
In Section 6 we identified two mechanisms through which normalizing flows learn to predict masked pixels from observed pixels on OOD data: leveraging local color correlations and coupling layer co-adaptation. We reduce the applicability of these mechanisms with cycle-mask, a new masking strategy for the coupling layers illustrated in Figure 13.

With cycle-mask, the coupling layers do not have access to neighbouring pixels when predicting the masked pixels, similarly to the horizontal mask. Furthermore, cycle-mask reduces the effect of coupling layer co-adaptation: the information about a part of the image has to travel through four coupling layers before it can be used to update the same part of the image.

Changing masking strategy:
In Figure 14, we show the log-likelihood histograms and samples for a RealNVP of a fixed size with checkerboard, horizontal, and cycle masks.
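The cycle-mask coupling pattern of Figure 13 can be sketched as follows. A plain additive update stands in for the learned st-networks, and the clockwise quadrant order is our own assumption; the point is the information flow, not the transform itself:

```python
# A rough sketch of the cycle-mask coupling pattern (Figure 13): each layer
# updates one quadrant using only the previous quadrant in the cycle, so a
# quadrant can influence itself only after four layers.
import numpy as np

def quadrant(x, i):
    """Return a view of quadrant i (0..3) of an h x w image, in cyclic order."""
    h, w = x.shape
    r, c = [(0, 0), (0, 1), (1, 1), (1, 0)][i]  # clockwise cycle (our convention)
    return x[r * h // 2:(r + 1) * h // 2, c * w // 2:(c + 1) * w // 2]

def cycle_coupling(x, layer):
    """Layer `layer` updates quadrant `layer % 4` from quadrant `(layer - 1) % 4`.
    A real coupling layer would apply a learned affine transform instead of +=."""
    x = x.copy()
    quadrant(x, layer % 4)[:] += quadrant(x, (layer - 1) % 4)
    return x

x = np.ones((4, 4))
for layer in range(4):
    x = cycle_coupling(x, layer)
print([float(quadrant(x, i)[0, 0]) for i in range(4)])  # [2.0, 3.0, 4.0, 5.0]
```

Tracing the values shows the intended delay: quadrant 0 is updated once at layer 0, and any change it induces only returns to it after the full four-layer cycle, which is exactly the co-adaptation barrier described above.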
Figure 13:
Cycle-mask.
A new sequence of masks for coupling layers in RealNVP that we evaluate in Section 7. We separate the input image of size c × h × w into four quadrants of size c × h/2 × w/2 each. Each coupling layer transforms one quadrant based on the previous quadrant. Cycle-mask prevents the co-adaptation between subsequent coupling layers discussed in Section 6: the information from a quadrant has to propagate through four coupling layers before reaching the same quadrant.

[Figure 14 panels: (a) Checkerboard mask; (b) Horizontal mask; (c) Cycle-mask (FashionMNIST and CelebA log-likelihood histograms); (d) Checkerboard mask; (e) Horizontal mask; (f) Cycle-mask (samples)]
Figure 14:
Effect of masking strategy
The first two rows show log-likelihood distributions for RealNVP models trained on FashionMNIST and CelebA with (a) checkerboard mask; (b) horizontal mask; and (c) cycle-mask. The third and fourth rows show samples produced by the corresponding models.

[Figure 15 panels: (a) Baseline; (b) l = 150; (c) l = 80; (d) l = 30; (e) Baseline; (f) l = 30; (g) l = 80; (h) l = 150; (i) Baseline; (j) l = 10; (k) l = 50; (l) l = 100; (m) RealNVP with l = 10 trained on FashionMNIST]
Figure 15:
Effect of st-network capacity. The first row shows the histograms of log-likelihoods for a RealNVP model trained on the CelebA dataset: (a) for a baseline model, and (b)-(d) for models with different bottleneck dimensions l in the st-network. The second and third rows show samples from RealNVP models trained on CelebA and FashionMNIST, respectively: (e) and (i) for baseline models, and (f)-(h) and (j)-(l) for models with different bottleneck dimensions l. In (m), we show the visualization of the coupling layer activations and st-network predictions for a RealNVP model trained on FashionMNIST with a bottleneck of dimension l = 10. The top half shows the visualizations for an in-distribution FashionMNIST image, while the bottom half shows the visualizations for an OOD MNIST image. An st-network with restricted capacity cannot accurately predict masked pixels of the OOD image in the intermediate coupling layers. Moreover, in the middle coupling layers, for the MNIST input the activations resemble FashionMNIST images in the s and t predictions.

Changing the architecture of st-networks: In Figure 15, we show likelihood distributions, samples, and coupling layer visualizations for RealNVP models with a bottleneck st-network trained on the FashionMNIST and CelebA datasets. The considered bottleneck dimensions for FashionMNIST are {10, 50, 100}, and for CelebA the dimensions are {30, 80, 150}. In the baseline RealNVP model, we use a standard deep convolutional residual network without the additional skip connections from the intermediate layers to the output which were used in Dinh et al. [10].

J Samples
In Figure 17, we show samples for RealNVP and Glow models trained on CelebA, CIFAR-10, SVHN, FashionMNIST, and MNIST, and for a RealNVP model trained on ImageNet 64 × 64 and CelebA-HQ 64 × 64.

J.1 Latent variable resampling
To further understand the structure of the latent representations learned by the flow, we study the effect of re-sampling part of the latent representations corresponding to images from different datasets from the base Gaussian distribution. In Figure 16, using a RealNVP model trained on CelebA, we compute the latent representations corresponding to input images from the CelebA, SVHN, and CIFAR-10 datasets, and randomly re-sample the subset of latent variables corresponding to a square in the center of the image (to find the corresponding latent variables, we apply the squeeze layers of the flow to the mask). We then invert the flow and compute the reconstructed images from the altered latent representations.

Both for in-distribution and out-of-distribution data, the model almost ideally preserves the part of the image other than the center, confirming the alignment between the latent space and the original input space discussed in Section 5. The model adds a face to the re-sampled part of the image, preserving consistency with the background to some extent.

[Figure 16 panels: (a) Celeb-A; (b) CIFAR-10; (c) SVHN]
Figure 16:
Latent variable resampling.
Original images (top row) and reconstructions with the latent variables corresponding to a square in the center of the image randomly re-sampled, for a RealNVP model trained on Celeb-A (bottom row). The model adds faces (as it was trained on Celeb-A) to the part of the image that is being re-sampled.

[Figure 17 panels: (a) RNVP, CelebA; (b) RNVP, CIFAR-10; (c) RNVP, SVHN; (d) RNVP, FashionMNIST; (e) RNVP, MNIST; (f) RNVP, CelebA-HQ; (g) RNVP, ImageNet; (h) Glow, CelebA; (i) Glow, CIFAR-10; (j) Glow, SVHN; (k) Glow, Fashion; (l) Glow, MNIST]
Figure 17:
Baseline Samples.
Samples from baseline RealNVP and Glow models. For ImageNet and CelebA-HQ we used the datasets at 64 × 64 resolution.

K Out-of-distribution detection on tabular data

(a) Image embeddings

Train data   OOD: CelebA   CIFAR-10   SVHN
CelebA       –             99.99      99.99
CIFAR-10     99.99         –          73.31
SVHN         100.0         99.98      –

(b) Tabular data

Train class (OOD class)   HEPMASS   MINIBOONE
Background (Signal)       83.78     72.71
Signal (Background)       70.73     87.56
Table 2:
Image embedding and UCI AUROC. (a): AUROC scores on OOD detection for a RealNVP model trained on image embeddings extracted from EfficientNet. The model is trained on one of the embedding datasets while the remaining two are considered OOD. The models consistently assign higher likelihood to in-distribution data, and in particular the AUROC scores are significantly better compared to flows trained on the original images (see Table 1). (b): AUROC scores on OOD detection for RealNVP trained on one class of the HEPMASS and MINIBOONE datasets while the other class is treated as OOD data.
[Figure 18 panels: (a) Miniboone dataset (Background Class, Signal Class); (b) Hepmass dataset (Background Class, Signal Class); legend: Background, Background Train, Signal, Signal Train]
Figure 18:
UCI datasets.
The histograms of log-likelihoods for RealNVP on the Hepmass and Miniboone tabular datasets when trained on one class while the other class is viewed as OOD. The train and test likelihood distributions are almost identical when trained on either class, and the OOD class receives lower likelihoods on average. There is, however, a significant overlap between the likelihoods for in- and out-of-distribution data.
K.1 Model
We use RealNVP with 8 coupling layers, fully-connected st-networks, and masks that split the input vector in half in an alternating manner. For the UCI experiments, we use 1 hidden layer and 256 hidden units in the st-networks, learning rate − , batch size 32, and train the model for 100 epochs. For the image embedding experiments, we use 3 hidden layers and 512 hidden units in the st-networks, learning rate − , batch size 1024, and train the model for 120 epochs. For all experiments, we use the AdamW optimizer [24] and weight decay − .
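The vector-valued coupling layer described above can be sketched in a few lines. This is our own numpy illustration, with a fixed random linear map standing in for the trained fully-connected st-networks:

```python
# One affine coupling layer on vectors with an alternating half mask (a sketch;
# the fixed linear "networks" W_s, W_t are our stand-ins for trained MLPs).
import numpy as np

rng = np.random.default_rng(0)
d = 4
W_s, W_t = 0.1 * rng.standard_normal((2, d // 2, d // 2))  # toy s- and t-maps

def coupling(x, flip=False):
    """y1 = x1; y2 = x2 * exp(s(x1)) + t(x1). Returns (y, log|det J| = sum s(x1)).
    `flip` swaps which half is transformed, giving the alternating mask."""
    x1, x2 = (x[d // 2:], x[: d // 2]) if flip else (x[: d // 2], x[d // 2:])
    s, t = W_s @ x1, W_t @ x1
    y2 = x2 * np.exp(s) + t
    return np.concatenate([y2, x1] if flip else [x1, y2]), s.sum()

def inverse(y, flip=False):
    """Exact inverse of `coupling`: recover x2 = (y2 - t) * exp(-s)."""
    y1, y2 = (y[d // 2:], y[: d // 2]) if flip else (y[: d // 2], y[d // 2:])
    s, t = W_s @ y1, W_t @ y1
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([x2, y1] if flip else [y1, x2])

x = rng.standard_normal(d)
y, logdet = coupling(x, flip=True)
print(np.allclose(inverse(y, flip=True), x))  # True: the layer is invertible
```

Stacking such layers with `flip` alternating between False and True gives every coordinate a chance to be transformed, which is the role of the alternating half masks mentioned above.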
We train RealNVP model on image embeddings for CIFAR-10, CelebA and SVHN extractedfrom EfficientNet train on ImageNet, and report AUROC scores in Table 2(a).
K.3 UCI datasets
In this experiment, we use 2 UCI classification datasets which were used for unsupervisedmodeling in prior works on normalizing flows [30, 11, 13]: HEPMASS [2] and MINIBOONE[34]. HEPMASS and MINIBOONE are both binary classification datasets originating fromphysics, and the two classes represent background and signalsignal