RG-Flow: A hierarchical and explainable flow model based on renormalization group and sparse prior
Hong-Ye Hu, Dian Wu, Yi-Zhuang You, Bruno Olshausen, Yubei Chen
Under review
Hong-Ye Hu∗, Dian Wu∗, Yi-Zhuang You
Department of Physics, UC San Diego
{hyhu, yzyou}@ucsd.edu

Bruno Olshausen, Yubei Chen
Redwood Center, Berkeley AI Research, UC Berkeley
{baolshausen, yubeic}@berkeley.edu

∗ H.-Y. Hu and D. Wu contributed equally to this work. Correspondence to: [email protected]

ABSTRACT
Flow-based generative models have become an important class of unsupervised learning approaches. In this work, we incorporate the key ideas of the renormalization group (RG) and a sparse prior distribution to design a hierarchical flow-based generative model, RG-Flow, which can separate information at different scales of images, with disentangled representations at each scale. We demonstrate our method mainly on the CelebA dataset and show that the disentangled representations at different scales enable semantic manipulation and style mixing of the images. To visualize the latent representations, we introduce receptive fields for flow-based models and find that the receptive fields learned by RG-Flow are similar to those in convolutional neural networks. In addition, we replace the widely adopted Gaussian prior distribution by a sparse prior distribution to further enhance the disentanglement of representations. From a theoretical perspective, the proposed method has O(log L) complexity for image inpainting, compared to previous generative models with O(L^2) complexity.

1 INTRODUCTION
One of the most important unsupervised learning tasks is to learn the data distribution and build generative models. Over the past few years, various types of generative models have been proposed. Flow-based generative models are a particular family of generative models with tractable distributions (Dinh et al., 2017; Kingma & Dhariwal, 2018; Chen et al., 2018b; 2019; Behrmann et al., 2019; Hoogeboom et al., 2019; Brehmer & Cranmer, 2020; Rezende et al., 2020; Karami et al., 2019). Yet the latent variables are on equal footing and mixed globally. Here, we propose a new flow-based model, RG-Flow, which is inspired by the idea of the renormalization group in statistical physics. RG-Flow imposes locality and hierarchical structure in its bijective transformations. It allows us to access information at different scales in the original images through latent variables at different locations, which offers better explainability. Combined with sparse priors (Olshausen & Field, 1996; 1997; Hyvärinen & Oja, 2000), we show that RG-Flow achieves hierarchical disentangled representations.

The renormalization group (RG) is a powerful tool to analyze statistical mechanics models and quantum field theories in physics (Kadanoff, 1966; Wilson, 1971). It progressively extracts coarse-scale statistical features of the physical system and decimates irrelevant fine-grained statistics at each scale. Typically, the local transformations used in RG are designed by human physicists, and they are not bijective. On the other hand, flow-based models use cascaded invertible global transformations to progressively turn a complicated data distribution into a Gaussian distribution. Here, we would like to combine the key ideas from RG and flow-based models. The proposed RG-Flow enables the machine to learn the optimal RG transformation from data by constructing local invertible transformations, and builds a hierarchical generative model for the data distribution. Latent representations are introduced at different scales, which capture the statistical features at the corresponding scales. Together, the latent representations of all scales can be jointly inverted to generate the data. This method was recently proposed in the physics community as NeuralRG (Li & Wang, 2018; Hu et al., 2020).

Our main contributions are two-fold. First, RG-Flow naturally separates the signal statistics of different scales in the input distribution, and represents information at each scale in its latent variables z. Those hierarchical latent variables live on a hyperbolic tree. Taking the CelebA dataset (Liu et al., 2015) as an example, the network will not only find high-level representations, such as the gender factor and the emotion factor for human faces, but also mid-level and low-level representations. To visualize representations of different scales, we adopt the concept of the receptive field from convolutional neural networks (CNNs) (LeCun, 1988; LeCun et al., 1989) and visualize the hidden structures in RG-Flow. In addition, since the statistics are separated in a hierarchical fashion, we show that the representations can be mixed at different scales, which achieves an effect similar to style mixing. Second, we introduce a sparse prior distribution for the latent variables.
We find that the sparse prior distribution is helpful to further disentangle representations and make them more explainable. The widely adopted Gaussian prior is rotationally symmetric; as a result, each of the latent variables in a flow model usually does not have a clear semantic meaning. By using a sparse prior, we demonstrate clear semantic meaning in the latent space.

2 RELATED WORK
Some flow-based generative models also possess a multi-scale latent space (Dinh et al., 2017; Kingma & Dhariwal, 2018), and recently hierarchies of features have been utilized in Schirrmeister et al. (2020), where the top-level feature is shown to perform strongly in the out-of-distribution (OOD) detection task. Yet, previous models do not impose a hard locality constraint in the multi-scale structure. In Appendix C, the differences between globally connected multi-scale flows and RG-Flow are discussed, and we see that semantically meaningful receptive fields do not show up in the globally connected cases. Recently, other more expressive bijective maps have been developed (Hoogeboom et al., 2019; Karami et al., 2019; Durkan et al., 2019), and those methods can be incorporated into the proposed structure to further improve the expressiveness of RG-Flow.

Some other classes of generative models rely on a separate inference model to obtain the latent representation. Examples include variational autoencoders (Kingma & Welling, 2014), adversarial autoencoders (Makhzani et al., 2015), InfoGAN (Chen et al., 2016), and BiGAN (Donahue et al., 2017; Dumoulin et al., 2017). Those techniques typically do not use hierarchical latent variables, and the inference of latent variables is approximate. Notably, recent advances suggest that having hierarchical latent variables may be beneficial (Vahdat & Kautz, 2020). In addition, the coarse-to-fine fashion of the generation process has also been discussed in other generative models, such as the Laplacian pyramid of adversarial networks (Denton et al., 2015) and multi-scale autoregressive models (Reed et al., 2017).
Disentangled representation (Tenenbaum & Freeman, 2000; DiCarlo & Cox, 2007; Bengio et al., 2013) is another important aspect in understanding how a model generates images (Higgins et al., 2018). In particular, disentangled high-level representations have been discussed and improved from information-theoretic principles (Cheung et al., 2015; Chen et al., 2016; 2018a; Higgins et al., 2017; Kipf et al., 2020; Kim & Mnih, 2018; Locatello et al., 2019; Ramesh et al., 2018). Apart from the high-level representations, a multi-scale structure also lies in the distribution of natural images. If a model can separate information of different scales, then its multi-scale representations can be used to perform other tasks, such as style transfer (Gatys et al., 2016; Zhu et al., 2017), face mixing (Karras et al., 2019; Gambardella et al., 2019; Karras et al., 2020), and texture synthesis (Bergmann et al., 2017; Jetchev et al., 2016; Gatys et al., 2015; Johnson et al., 2016; Ulyanov et al., 2016).

Typically, in flow-based generative models, a Gaussian distribution is used as the prior for the latent space. Due to the rotational symmetry of the Gaussian prior, an arbitrary rotation of the latent space would lead to the same likelihood. Sparse priors (Olshausen & Field, 1996; 1997; Hyvärinen & Oja, 2000) were proposed as an important tool for unsupervised learning, and they lead to better explainability in various domains (Ainsworth et al., 2018; Arora et al., 2018; Zhang et al., 2019). To break the symmetry of the Gaussian prior and further improve explainability, we introduce a sparse prior to flow-based models. Please refer to Figure 12 for a quick illustration of the difference between the Gaussian prior and the sparse prior, where the sparse prior leads to better disentanglement.
The renormalization group (RG) has a broad impact ranging from particle physics to statistical physics. Apart from the analytical studies in field theories (Wilson, 1971; Fisher, 1998; Stanley, 1999), RG has also been useful in numerically simulating quantum states. The multi-scale entanglement renormalization ansatz (MERA) (Vidal, 2008; Evenbly & Vidal, 2014) implements the hierarchical structure of RG in tensor networks to represent quantum states. The exact holographic mapping (EHM) (Qi, 2013; Lee & Qi, 2016; You et al., 2016) further extends MERA to a bijective (unitary) flow between latent product states and visible entangled states. Recently, Li & Wang (2018) and Hu et al. (2020) incorporated the MERA structure and deep neural networks to design a flow-based generative model that allows the machine to learn the EHM from statistical physics and quantum field theory actions. In quantum machine learning, the recent development of quantum convolutional neural networks (Cong et al., 2019) also utilizes the MERA structure. The similarity between RG and deep learning has been discussed in several works (Bény, 2013; Mehta & Schwab, 2014; Bény & Osborne, 2015; Oprisa & Toth, 2017; Lin et al., 2017; Gan & Shu, 2017). Information-theoretic objectives that guide machine-learned RG transformations have been proposed in recent works (Koch-Janusz & Ringel, 2018; Hu et al., 2020; Lenggenhager et al., 2020). The meaning of the emergent latent space has been related to quantum gravity (Swingle, 2012; Pastawski et al., 2015), which leads to the exciting development of machine learning holography (You et al., 2018; Hashimoto et al., 2018; Hashimoto, 2019; Akutagawa et al., 2020; Hashimoto et al., 2020).
3 METHODS
Flow-based generative models.
Flow-based generative models are a family of generative models with tractable distributions, which allow efficient sampling and exact evaluation of the probability density (Dinh et al., 2015; 2017; Kingma & Dhariwal, 2018; Chen et al., 2019). The key idea is to build a bijective map G(z) = x between the visible variables x and the latent variables z. The visible variables x are the data that we want to generate, which may follow a complicated probability distribution. The latent variables z usually have a simple distribution that can be easily sampled, for example an i.i.d. Gaussian distribution. In this way, the data can be efficiently generated by first sampling z and mapping it to x through x = G(z). In addition, we can obtain the probability associated with each data sample x,

$$\log p_X(x) = \log p_Z(z) - \log \left| \det \frac{\partial G(z)}{\partial z} \right| . \tag{1}$$

The bijective map G(z) = x is usually composed of a series of bijectors, G(z) = G_1 ∘ G_2 ∘ ··· ∘ G_n(z), such that each bijector layer G_i has a tractable Jacobian determinant and can be inverted efficiently. The two key ingredients in flow-based models are the design of the bijective map G and the choice of the prior distribution p_Z(z). A minimal sketch of this change-of-variables computation is given below.
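To make the bookkeeping of Eq. (1) concrete, here is a minimal, hypothetical PyTorch-style sketch of how a composed flow evaluates the log-likelihood; the class and method names (`AffineBijector`, `inverse_and_log_det`) are illustrative placeholders, not the paper's code.

```python
import torch
import torch.nn as nn

class AffineBijector(nn.Module):
    """Toy bijector: elementwise affine map x = z * exp(s) + t."""
    def __init__(self, dim):
        super().__init__()
        self.s = nn.Parameter(torch.zeros(dim))
        self.t = nn.Parameter(torch.zeros(dim))

    def inverse_and_log_det(self, x):
        # z = (x - t) * exp(-s); log|det dz/dx| = -sum(s)
        z = (x - self.t) * torch.exp(-self.s)
        log_det = (-self.s.sum()).expand(x.shape[0])
        return z, log_det

def log_likelihood(bijectors, x, prior):
    """log p_X(x) = log p_Z(z) + accumulated inverse log-dets (Eq. 1)."""
    z, total_log_det = x, 0.0
    for b in reversed(bijectors):      # invert G = G_1 ∘ ... ∘ G_n
        z, log_det = b.inverse_and_log_det(z)
        total_log_det = total_log_det + log_det
    return prior.log_prob(z).sum(dim=-1) + total_log_det

# usage: two toy layers with a standard normal prior
dim = 4
flows = [AffineBijector(dim), AffineBijector(dim)]
prior = torch.distributions.Normal(0.0, 1.0)
x = torch.randn(8, dim)
print(log_likelihood(flows, x, prior).shape)  # torch.Size([8])
```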
Structure of RG-Flow networks.

Much of the prior research has focused on designing more powerful bijective blocks for the generator G to improve its expressive power and achieve better approximations of complicated probability distributions. Here, we instead focus on designing an architecture that arranges the bijective blocks in a hierarchical structure, to separate features of different scales in the data and to disentangle latent representations.

Our design is motivated by the idea of RG in physics, which progressively separates coarse-grained data statistics from fine-grained statistics by local transformations at different scales. Let x be the visible variables, i.e., the input image (level-0), denoted as x^(0) ≡ x. A step of the RG transformation extracts the coarse-grained information x^(1) to send to the next layer (level-1), and splits out the rest of the fine-grained information as auxiliary variables z^(0). The procedure can be described by the following recursive equation (at level-h for example),

$$x^{(h+1)}, z^{(h)} = R_h(x^{(h)}), \tag{2}$$

which is illustrated in Fig. 1(a), where dim(x^{(h+1)}) + dim(z^{(h)}) = dim(x^{(h)}), and the RG transformation R_h can be made invertible. At each level, the transformation R_h is a local bijective map, which is constructed by stacking trainable bijective blocks. We will specify its details later. The split-out information z^(h) can be viewed as latent variables arranged at different scales. The inverse RG transformation G_h ≡ R_h^{-1} then simply generates the fine-grained image,

$$x^{(h)} = R_h^{-1}(x^{(h+1)}, z^{(h)}) = G_h(x^{(h+1)}, z^{(h)}). \tag{3}$$

The highest-level image x^{(h_L)} = G_{h_L}(z^{(h_L)}) can be considered as generated directly from latent variables z^{(h_L)} without referring to any higher-level coarse-grained image, where h_L = log_2 L − log_2 m for an original image of size L × L with local transformations acting on kernels of size m × m. Therefore, given the latent variables z = {z^{(h)}} at all levels h, the original image can be restored by the following nested maps, as illustrated in Fig. 1(b),

$$x \equiv x^{(0)} = G_0(G_1(G_2(\cdots, z^{(2)}), z^{(1)}), z^{(0)}) \equiv G(z), \tag{4}$$

where z = {z^{(0)}, ..., z^{(h_L)}}. RG-Flow is a flow-based generative model that uses the above composite bijective map G as its generator.

Figure 1: (a) The forward RG transformation splits out decimated features at different scales. (b) The inverse RG transformation generates the fine-grained image from latent variables.

To model the RG transformation, we arrange the bijective blocks in a hierarchical network architecture. Fig. 2(a) shows the side view of the network, where each green or yellow block is a local bijective map. Following the notation of MERA networks, the green blocks are the disentanglers, which reparametrize local variables to reduce their correlations, and the yellow blocks are the decimators, which separate the decimated features out as latent variables. The blue dots at the bottom are the visible variables x from the data, and the red crosses are the latent variables z. We omit the color channels of the image in the illustration, since we keep the number of color channels unchanged through the transformation.

Fig. 2(b) shows the top-down view of a step of the RG transformation.
The green and yellow blocks (disentanglers and decimators) are interwoven on top of each other. The covering area of a disentangler or decimator is defined by the kernel size m × m of the bijector. For example, in Fig. 2(b), the kernel size is 4 × 4. After the decimator, three fourths of the degrees of freedom are decimated into latent variables (red crosses in Fig. 2(a)), so the edge length of the image is halved.

As a mathematical description of the single-step RG transformation R_h, in each block (p, q) labeled by p, q = 0, 1, ..., L_h/m − 1 (with L_h = L/2^h the linear size at level h), the mapping from x^(h) to (x^{(h+1)}, z^{(h)}) is given by

$$\left\{ y^{(h)}_{2^h(mp+\frac{m}{2}+a,\, mq+\frac{m}{2}+b)} \right\}_{(a,b)\in\square^1_m} = R^{\mathrm{dis}}_h\!\left( \left\{ x^{(h)}_{2^h(mp+\frac{m}{2}+a,\, mq+\frac{m}{2}+b)} \right\}_{(a,b)\in\square^1_m} \right),$$

$$\left\{ x^{(h+1)}_{2^h(mp+a,\, mq+b)} \right\}_{(a,b)\in\square^2_m},\; \left\{ z^{(h)}_{2^h(mp+a,\, mq+b)} \right\}_{(a,b)\in\square^1_m\setminus\square^2_m} = R^{\mathrm{dec}}_h\!\left( \left\{ y^{(h)}_{2^h(mp+a,\, mq+b)} \right\}_{(a,b)\in\square^1_m} \right), \tag{5}$$

where $\square^k_m = \{(ka, kb) \mid a, b = 0, 1, \ldots, m/k - 1\}$ denotes the set of pixels in an m × m square with stride k, and y is the intermediate result after the disentangler but before the decimator. The notation x^{(h)}_{(i,j)} stands for the variable (a vector of all channels) at pixel (i, j) and RG level h (similarly for y and z). The disentanglers R^dis_h and decimators R^dec_h can be any bijective neural networks. In practice, we use the coupling layer proposed in the Real NVP networks (Dinh et al., 2017) to build them, with a detailed description in Appendix A. By specifying the RG transformation R_h = R^dec_h ∘ R^dis_h above, the generator G_h ≡ R_h^{-1} is automatically specified as the inverse transformation. A minimal sketch of one such RG step follows Figure 2.

Figure 2: (a) Side view of the hierarchical network: blue dots are the visible variables x, red crosses the latent variables z. (b) Top-down view of a single RG step. (c) The generation causal cone of a latent variable. (d) The inference causal cone of a visible variable.
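To make the bookkeeping of Eq. (5) concrete, here is a minimal, hypothetical sketch of one RG step on a square feature map. The squeeze-and-split arithmetic mirrors the decimator's role (keep one quarter as coarse variables, emit three quarters as latents); `disentangler` and `decimator` stand in for the trained Real NVP coupling blocks, and the kernel is taken as 2 × 2 for simplicity rather than the paper's 4 × 4.

```python
import torch

def rg_step(x, disentangler, decimator):
    """One RG step on x of shape (B, C, H, W), toy kernel m = 2.

    Returns the coarse image x_next of shape (B, C, H/2, W/2) and the
    latent tensor z holding the decimated 3/4 of the degrees of freedom.
    """
    B, C, H, W = x.shape
    y = disentangler(x)                       # bijective, shape-preserving
    # squeeze each 2x2 patch into the channel dimension: (B, 4C, H/2, W/2)
    y = y.reshape(B, C, H // 2, 2, W // 2, 2)
    y = y.permute(0, 1, 3, 5, 2, 4).reshape(B, 4 * C, H // 2, W // 2)
    y = decimator(y)                          # bijective mixing within each patch
    x_next, z = y[:, :C], y[:, C:]            # keep 1/4 coarse, split 3/4 latent
    return x_next, z

# usage with identity placeholder blocks
x = torch.randn(8, 3, 32, 32)
x1, z0 = rg_step(x, disentangler=lambda t: t, decimator=lambda t: t)
print(x1.shape, z0.shape)  # (8, 3, 16, 16) (8, 9, 16, 16)
```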
Training objective.
After decomposing the statistics into multiple scales, we want the latent features to be decoupled. We therefore assume that the latent variables z are independent random variables, described by a factorized prior distribution

$$p_Z(z) = \prod_l p(z_l), \tag{6}$$

where l labels every element of z, including the RG level, the pixel position, and the channel. This prior gives the network the incentive to minimize the mutual information between latent variables. This minimal bulk mutual information (minBMI) principle was previously proposed as the information-theoretic principle that defines the RG transformation (Li & Wang, 2018; Hu et al., 2020).

Starting from a set of independent latent variables z, the generator G should build up correlations locally at different scales, such that the multi-scale correlation structure can emerge in the resulting image x to model the correlated probability distribution of the data. To achieve this goal, we maximize the log-likelihood of x drawn from the data set. The loss function to minimize reads

$$\mathcal{L} = -\mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \log p_X(x) = -\mathbb{E}_{x \sim p_{\mathrm{data}}(x)} \left( \log p_Z(R(x)) + \log \left| \det \frac{\partial R(x)}{\partial x} \right| \right), \tag{7}$$

where R(x) ≡ G^{-1}(x) = z denotes the RG transformation, which contains the trainable parameters. By optimizing the parameters, the network learns the optimal RG transformation from the data. A minimal training-loop sketch is given below.
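As an illustration of Eq. (7), here is a minimal, hypothetical training step for any flow exposing an `inverse_and_log_det` method (as in the earlier sketch); the optimizer choice and interfaces are assumptions for illustration, not the paper's exact training configuration (see Appendix A for that).

```python
import torch

def nll_loss(flow, prior, x):
    """Negative log-likelihood of Eq. (7): -log p_Z(R(x)) - log|det dR/dx|."""
    z, log_det = flow.inverse_and_log_det(x)         # R(x) and its log-Jacobian
    log_prob = prior.log_prob(z).flatten(1).sum(-1)  # factorized prior, Eq. (6)
    return -(log_prob + log_det).mean()

def train_step(flow, prior, optimizer, batch):
    optimizer.zero_grad()
    loss = nll_loss(flow, prior, batch)
    loss.backward()
    optimizer.step()
    return loss.item()
```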
Due to the locality of the transformations in our hierarchical network, we can define the generation causal cone of a latent variable as the area affected when that latent variable is changed. This is illustrated as the red cone in Fig. 2(c).

To visualize the latent space representation, we define the receptive field of a latent variable z_l as

$$\mathrm{RF}_l = \mathbb{E}_{z \sim p_Z(z)} \left| \frac{\partial G(z)}{\partial z_l} \right|_c, \tag{8}$$

where |·|_c denotes the L1-norm over the color channels. The receptive field reflects the response of the generated image to an infinitesimal change of the latent variable z_l, averaged over p_Z(z). Therefore, the receptive field of a latent variable is always contained in its generation causal cone. Higher-level latent variables have larger receptive fields than lower-level ones. In particular, if the receptive fields of two latent variables do not overlap, which is often the case for lower-level latent variables, they automatically become disentangled in the representation. A sketch of this computation is given below.
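A minimal, hypothetical autograd sketch of Eq. (8): it estimates RF_l by Monte Carlo over prior samples and assumes a generator `G` mapping a flat latent vector to an image tensor (names and shapes are illustrative).

```python
import torch

def receptive_field(G, prior, latent_index, n_samples=64):
    """Monte Carlo estimate of Eq. (8): E_z |dG(z)/dz_l|, L1 over color channels.

    G maps a flat latent vector z of shape (D,) to an image of shape (C, H, W).
    The full Jacobian is used here for clarity; a forward-mode JVP along the
    latent_index direction would be far cheaper for a single column.
    """
    rf = None
    for _ in range(n_samples):
        z = prior.sample()
        jac = torch.autograd.functional.jacobian(G, z)   # (C, H, W, D)
        col = jac[..., latent_index].abs().sum(dim=0)    # (H, W), L1 over colors
        rf = col if rf is None else rf + col
    return rf / n_samples
```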
Image inpainting and error correction.

Another advantage of the network's locality can be demonstrated in the inpainting task. Similar to the generation causal cone, we can define the inference causal cone, shown as the blue cone in Fig. 2(d). If we perturb a pixel at the bottom of the blue cone, all the latent variables within the blue cone can be affected, whereas the latent variables outside the cone cannot be. An important property of the hyperbolic tree-like network is that each higher level contains exponentially fewer latent variables. Even though the inference causal cone expands as we go to higher levels, the number of latent variables dilutes exponentially as well, resulting in a constant number of latent variables covered by the inference causal cone on each level. Therefore, if a small local region of an image is corrupted, only O(log L) latent variables need to be modified, where L is the edge length of the entire image. For globally connected networks, in contrast, all O(L^2) latent variables have to be varied. A toy count illustrating this scaling follows.
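As a sanity check on the O(log L) claim, here is a toy count under the stated assumptions: each RG step halves the edge length, and the inference cone covers a constant-size neighborhood of latent variables per level. The patch and kernel sizes below are illustrative, not the exact network bookkeeping.

```python
import math

def latents_in_inference_cone(L, patch=4, kernel=4):
    """Toy count: a constant number of covered latents per level,
    summed over ~log2(L / kernel) + 1 levels."""
    levels = int(math.log2(L // kernel)) + 1
    per_level = (patch + kernel) ** 2   # crude constant-size cover per level
    return levels * per_level           # grows as O(log L), not O(L^2)

print(latents_in_inference_cone(32), latents_in_inference_cone(1024))
```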
Sparse prior distribution.

We have chosen to hard-code the RG information principle by using a factorized prior distribution, i.e., p_Z(z) = ∏_l p(z_l). The common practice is to choose p(z_l) to be the standard Gaussian distribution, which is spherically symmetric: if we apply any rotation to z, the distribution remains the same. Therefore, we cannot prevent different features from being mixed under an arbitrary rotation.

To overcome this issue, we use an anisotropic sparse prior distribution for p_Z(z). In our implementation, we choose the Laplacian distribution p(z_l) = (1/2b) exp(−|z_l|/b), which is sparser than the Gaussian distribution and breaks the spherical symmetry of the latent space. In Appendix E, we show a two-dimensional pinwheel example to illustrate this intuition. This heuristic encourages the model to find more semantically meaningful representations by breaking the spherical symmetry. A small sketch contrasting the two priors follows.
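A minimal sketch, assuming unit scale b = 1, of swapping the Gaussian prior for the Laplacian one; torch.distributions provides both, so in a flow implementation the change is essentially a one-liner.

```python
import torch
from torch.distributions import Laplace, Normal

gaussian_prior = Normal(loc=0.0, scale=1.0)
sparse_prior = Laplace(loc=0.0, scale=1.0)   # p(z) = (1/2b) exp(-|z|/b), b = 1

z = torch.randn(5)
# The Laplacian penalizes large |z| linearly rather than quadratically,
# concentrating mass near zero (sparsity) and breaking rotational symmetry.
print(gaussian_prior.log_prob(z).sum(), sparse_prior.log_prob(z).sum())
```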
4 EXPERIMENTS

Synthetic multi-scale datasets.
To illustrate RG-Flow's ability to disentangle representations at different scales, as well as spatially separated representations, we propose two synthetic datasets with multi-scale features, named MSDS1 and MSDS2. Their samples are shown in Appendix B. In each image, there are 16 ovals with different colors and orientations. In MSDS1, all ovals in an image have almost the same color, while their orientations are randomly distributed; the color is thus a global feature in MSDS1, and the orientation is a local feature. In MSDS2, on the contrary, the orientation is a global feature, and the color is a local one.

We implement RG-Flow as shown in Fig. 2. After training, we find that RG-Flow easily captures the characteristics of those datasets: the ovals in each image from MSDS1 have almost the same color, and those from MSDS2 have the same orientation. In particular, in Fig. 3 we plot the effect of varying latent variables at different levels, together with their receptive fields. For MSDS1, if we vary a high-level latent variable, the color of the whole image changes, which shows that the network has captured the global feature of the dataset; if we vary a low-level latent variable, only the orientation of the corresponding oval changes. As the ovals are spatially separated, the low-level representations of different ovals are disentangled. Similarly, for MSDS2, varying a high-level latent variable changes the orientations of all ovals, and varying a low-level latent variable changes the color of only the corresponding oval.

For comparison, we also trained Real NVP on our synthetic datasets. We find that Real NVP fails to learn the global and local characteristics of those datasets. Details can be found in Appendix B.
Human face dataset.
Next, we apply RG-Flow to more complicated multi-scale datasets. Most of our experiments use the human face dataset CelebA (Liu et al., 2015), and we crop and scale the images to 32 × 32 pixels. Details of the network and the training procedure can be found in Appendix A. Experiments on other datasets, such as CIFAR-10 (Krizhevsky et al.), and quantitative evaluations can also be found in Appendix G.

After training, the network learns to progressively generate finer-grained images, as shown in Fig. 4(a). The colors in the coarse-grained images are not necessarily the same as those at the same positions in the fine-grained images, because there is no constraint preventing the RG transformation from mixing the color channels.
Figure 3: Multi-scale latent representations for MSDS1 and MSDS2 (high-level and low-level variations).

Figure 4: Subplot (a) shows the progressive generation of images during the inverse RG, from coarse-grained to fine-grained. Subplot (b) shows some receptive fields of latent variables from low level (h = 0) to high level (h = 3). The strength of each receptive field is rescaled to one for better visualization. Subplot (c) shows the statistics of the receptive fields' strength.
Receptive fields.
To visualize the latent space representation, we calculate the receptive field of each latent variable and list some of them in Fig. 4(b). The receptive field is small for low-level variables and large for high-level ones, as expected from the generation causal cone. At the lowest level (h = 0), the receptive fields are merely small dots. At the second lowest level (h = 1), small structures emerge, such as an eyebrow, an eye, or a part of the hair. At the middle level (h = 2), structures such as eyebrows, eyes, and forehead bangs emerge. At the highest level (h = 3), each receptive field grows to cover the whole image. We investigate those explainable latent representations in the next section. For comparison, we show the receptive fields of Real NVP in Appendix C. Even though Real NVP has a multi-scale structure, it is not locally constrained, and semantic representations at different scales do not emerge.

Learned features on different scales.
In this section, we show that some of these emergent structures correspond to explainable latent features. A flow-based generative model is a maximal encoding procedure: its core is a bijective map, which preserves the dimensionality between input and encoding. Usually, the images in the dataset live on a low-dimensional manifold, and we do not need all of the dimensions to encode such data. In Fig. 4(c) we show the statistics of the strength of the receptive fields. Most of the latent variables have receptive fields with relatively small strength, meaning that changing the values of those latent variables does not affect the generated images much. We focus on the latent variables with receptive-field strength greater than one, which have visible effects on the generated images. We use h to label the RG level of latent variables; for example, the lowest-level latent variables have h = 0, whereas the highest-level latent variables have h = 4. In addition, we focus on h = 1 (low-level), h = 2 (mid-level), and h = 3 (high-level) latent variables. There are a few latent variables with h = 0 that have visible effects, but their receptive fields are only small dots with no emergent structures.

For the high-level latent representations, we found a number of latent variables with visible effects, and six of them are identified with disentangled and explainable meanings. Those factors are gender, emotion, light angle, azimuth, hair color, and skin color. In Fig. 5(a), we plot the effect of varying those six high-level variables, together with their receptive fields. For the mid-level latent representations, we plot the four leading variables together with their receptive fields in Fig. 5(b); they control the eye, eyebrow, upper-right bang, and collar, respectively. For the low-level representations, some leading variables control an eyebrow and an eye, as shown in Fig. 5(c). They achieve better disentangled representations when their receptive fields do not overlap.
Figure 5: Semantic factors found on different levels.
Figure 6: Image mixing in the hyperbolic tree-like latent space (panels (a)–(c); high-level source and mid-level source).
Image mixing in the scaling direction.
Given two images x_A and x_B, conventional image mixing takes a linear combination between z_A = G^{-1}(x_A) and z_B = G^{-1}(x_B) by z = λ z_A + (1 − λ) z_B with λ ∈ [0, 1], and generates the mixed image x = G(z). In our model, each latent variable z^{(h)}_{(i,j)} is indexed by the pixel position (i, j) and the RG level h. Direct access to the latent variable at each point enables us to mix the latent variables in a different manner, which may be dubbed a "hyperbolic mixing". We consider mixing the large-scale (high-level) features of x_A and the small-scale (low-level) features of x_B by combining their corresponding latent variables via

$$z^{(h)} = \begin{cases} z^{(h)}_A, & h \ge \Theta, \\ z^{(h)}_B, & h < \Theta, \end{cases} \tag{9}$$

where Θ serves as a dividing line between the scales. As shown in Fig. 6(a), as we change Θ from 0 to 3, more low-level information from the blonde-hair image is mixed with the high-level information of the black-hair image. In particular, when Θ = 3, the mixed face has eyes, nose, eyebrows, and mouth similar to the blonde-hair image, while the high-level information, such as the face orientation and hair color, is taken from the black-hair image. In addition, this mixing is not symmetric under the interchange of z_A and z_B; see Fig. 6(b) for comparison. This hyperbolic mixing achieves an effect similar to StyleGAN (Karras et al., 2019; 2020), in that we can take mid-level information from one image and mix it with the high-level information of another image. In Fig. 6(c), we show more examples of mixed faces. A minimal sketch of this mixing rule is given below.
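A minimal sketch of Eq. (9), assuming the per-level latents are stored as a list indexed by RG level h (a hypothetical layout, chosen for illustration):

```python
def hyperbolic_mix(z_a, z_b, theta):
    """Eq. (9): take levels h >= theta from image A, levels h < theta from B.

    z_a, z_b: lists of latent tensors, indexed by RG level h.
    """
    return [z_a[h] if h >= theta else z_b[h] for h in range(len(z_a))]

# usage: x_mixed = G(hyperbolic_mix(R(x_a), R(x_b), theta=2))
```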
Figure 7: Inpainting locally corrupted images (rows: ground truth, corrupted image, RG-Flow, constrained Real NVP, Real NVP).
Image inpainting and error correction.
The existence of the inference causal cone ensures that at most O(log L) latent variables are affected if we have a small local corrupted region to be inpainted. In Fig. 7, we show that RG-Flow can faithfully recover the corrupted region (marked in red) using only the latent variables located inside the inference causal cone, which are around one third of all latent variables. For comparison, if we randomly pick the same number of latent variables to modify in Real NVP, it fails to inpaint, as shown in Fig. 7 (Constrained Real NVP). To achieve a recovery of similar quality in Real NVP, as shown in Fig. 7 (Real NVP), all latent variables need to be modified, which are of order O(L^2). See Appendix F for more details about the inpainting task and its quantitative evaluations. A minimal sketch of the inpainting optimization is given below.
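A minimal, hypothetical sketch of the inpainting procedure described above (and detailed in Appendix F): optimize only a masked subset of latent variables, corresponding to the inference causal cone, to maximize the model log-likelihood of the filled image. The `flow.inverse`/`flow.forward` interfaces, the cone mask, and the `log_likelihood` helper (as in the earlier sketches) are assumptions for illustration.

```python
import torch

def inpaint(flow, prior, x_corrupt, region_mask, cone_mask, steps=200, lr=0.05):
    """Fill the pixels in `region_mask` by optimizing the latents in `cone_mask`."""
    z0 = flow.inverse(x_corrupt).detach()
    free = z0[cone_mask].clone().requires_grad_(True)  # latents inside the cone
    opt = torch.optim.Adam([free], lr=lr)
    x_fill = x_corrupt
    for _ in range(steps):
        opt.zero_grad()
        z = z0.clone()
        z[cone_mask] = free
        x_gen = flow.forward(z)
        x_fill = torch.where(region_mask, x_gen, x_corrupt)  # paste the patch
        loss = -log_likelihood(flow, prior, x_fill)          # maximize log p_X
        loss.backward()
        opt.step()
    return x_fill.detach()
```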
5 DISCUSSION AND CONCLUSION

In this paper, we combined the ideas of the renormalization group and sparse prior distributions to design RG-Flow, a probabilistic flow-based generative model. This versatile architecture can incorporate any bijective map to achieve an expressive flow-based generative model. We have shown that RG-Flow can separate information at different scales and encode it in latent variables living on a hyperbolic tree. To visualize the latent representations in RG-Flow, we defined receptive fields for flow-based models in analogy to those in CNNs. Taking the CelebA dataset as our main example, we have shown that RG-Flow finds not only high-level representations, but also mid-level and low-level ones. The receptive fields serve as visual guidance for finding explainable representations. In contrast, semantic representations of mid-level and low-level structures do not emerge in globally connected multi-scale flow models, such as Real NVP. We have also shown that the latent representations can be mixed at different scales, which achieves an effect similar to style mixing. In our model, if the receptive fields of two latent representations do not overlap, they are naturally disentangled. For high-level representations, we propose to utilize a sparse prior to encourage disentanglement. We find that if the dataset contains only a few high-level factors, such as the 3D Chair dataset (Aubry et al., 2014) shown in Appendix G, it is hard to find explainable high-level disentangled representations, because of the redundant nature of the encoding in flow-based models. Incorporating information-theoretic criteria to disentangle high-level representations in the redundant encoding procedure is an interesting future direction.

REFERENCES
Samuel K. Ainsworth, Nicholas J. Foti, Adrian K. C. Lee, and Emily B. Fox. oi-VAE: Output interpretable VAEs for nonlinear group factor analysis. In International Conference on Machine Learning, pp. 119–128, 2018.

Tetsuya Akutagawa, Koji Hashimoto, and Takayuki Sumimoto. Deep learning and AdS/QCD. Physical Review D, 102(2), Jul 2020. doi: 10.1103/physrevd.102.026020.

Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018.

Mathieu Aubry, Daniel Maturana, Alexei A. Efros, Bryan C. Russell, and Josef Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3762–3769. IEEE Computer Society, 2014.

Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pp. 573–582. PMLR, 2019.

Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

Cédric Bény and Tobias J. Osborne. The renormalization group via statistical inference. New Journal of Physics, 17(8):083005, Aug 2015. doi: 10.1088/1367-2630/17/8/083005.

Urs Bergmann, Nikolay Jetchev, and Roland Vollgraf. Learning texture manifolds with the periodic spatial GAN. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pp. 469–477. PMLR, 2017.

Johann Brehmer and Kyle Cranmer. Flows for simultaneous manifold learning and density estimation. CoRR, abs/2003.13913, 2020.

Cédric Bény. Deep learning and the renormalization group, 2013.

Tian Qi Chen, Xuechen Li, Roger B. Grosse, and David Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 2615–2625, 2018a.

Tian Qi Chen, Yulia Rubanova, Jesse Bettencourt, and David Duvenaud. Neural ordinary differential equations. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 6572–6583, 2018b.

Tian Qi Chen, Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Residual flows for invertible generative modeling. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 9913–9923, 2019.

Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), pp. 2172–2180, 2016.

Brian Cheung, Jesse A. Livezey, Arjun K. Bansal, and Bruno A. Olshausen. Discovering hidden factors of variation in deep networks. In International Conference on Learning Representations (ICLR), Workshop Track, 2015.

Iris Cong, Soonwon Choi, and Mikhail D. Lukin. Quantum convolutional neural networks. Nature Physics, 15(12):1273–1278, Dec 2019. doi: 10.1038/s41567-019-0648-8.

Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), pp. 1486–1494, 2015.

James J. DiCarlo and David D. Cox. Untangling invariant object recognition. Trends in Cognitive Sciences, 11(8):333–341, 2007.

Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. In International Conference on Learning Representations (ICLR), Workshop Track, 2015.

Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations (ICLR). OpenReview.net, 2017.

Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In International Conference on Learning Representations (ICLR). OpenReview.net, 2017.

Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martín Arjovsky, Olivier Mastropietro, and Aaron C. Courville. Adversarially learned inference. In International Conference on Learning Representations (ICLR). OpenReview.net, 2017.

Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 7509–7520, 2019.

G. Evenbly and G. Vidal. Class of highly entangled many-body states that can be efficiently simulated. Physical Review Letters, 112:240502, Jun 2014. doi: 10.1103/PhysRevLett.112.240502.

Michael E. Fisher. Renormalization group theory: Its basis and formulation in statistical physics. Reviews of Modern Physics, 70:653–681, Apr 1998. doi: 10.1103/RevModPhys.70.653.

Andrew Gambardella, Atılım Güneş Baydin, and Philip H. S. Torr. Transflow learning: Repurposing flow models without retraining. CoRR, abs/1911.13270, 2019.

Wen-Cong Gan and Fu-Wen Shu. Holography as deep learning. International Journal of Modern Physics D, 26(12):1743020, Oct 2017. doi: 10.1142/s0218271817430209.

L. A. Gatys, A. S. Ecker, and M. Bethge. Image style transfer using convolutional neural networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2414–2423, 2016.

Leon A. Gatys, Alexander S. Ecker, and Matthias Bethge. Texture synthesis using convolutional neural networks. In Advances in Neural Information Processing Systems 28 (NeurIPS 2015), pp. 262–270, 2015.

Koji Hashimoto. AdS/CFT correspondence as a deep Boltzmann machine. Physical Review D, 99:106017, May 2019. doi: 10.1103/PhysRevD.99.106017.

Koji Hashimoto, Sotaro Sugishita, Akinori Tanaka, and Akio Tomiya. Deep learning and holographic QCD. Physical Review D, 98:106014, Nov 2018. doi: 10.1103/PhysRevD.98.106014.

Koji Hashimoto, Hong-Ye Hu, and Yi-Zhuang You. Neural ODE and holographic QCD, 2020.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE Computer Society, 2016.

Irina Higgins, Loïc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR). OpenReview.net, 2017.

Irina Higgins, David Amos, David Pfau, Sébastien Racanière, Loïc Matthey, Danilo J. Rezende, and Alexander Lerchner. Towards a definition of disentangled representations. CoRR, abs/1812.02230, 2018.

Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pp. 2771–2780. PMLR, 2019.

Hong-Ye Hu, Shuo-Hui Li, Lei Wang, and Yi-Zhuang You. Machine learning holographic mapping by neural network renormalization group. Physical Review Research, 2:023369, Jun 2020.

Aapo Hyvärinen and Erkki Oja. Independent component analysis: algorithms and applications. Neural Networks, 13(4):411–430, 2000.

Nikolay Jetchev, Urs Bergmann, and Roland Vollgraf. Texture synthesis with spatial generative adversarial networks. CoRR, abs/1611.08207, 2016.

Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In Computer Vision – ECCV 2016, 14th European Conference, Proceedings, Part II, volume 9906 of Lecture Notes in Computer Science, pp. 694–711. Springer, 2016.

Leo P. Kadanoff. Scaling laws for Ising models near T_c. Physics Physique Fizika, 2:263–272, Jun 1966.

Mahdi Karami, Dale Schuurmans, Jascha Sohl-Dickstein, Laurent Dinh, and Daniel Duckworth. Invertible convolutional flow. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 5636–5646, 2019.

Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2019), pp. 4401–4410. Computer Vision Foundation / IEEE, 2019.

Tero Karras, Samuli Laine, Miika Aittala, Janne Hellsten, Jaakko Lehtinen, and Timo Aila. Analyzing and improving the image quality of StyleGAN. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR 2020), pp. 8107–8116. IEEE, 2020.

Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In Proceedings of the 35th International Conference on Machine Learning (ICML 2018), volume 80 of Proceedings of Machine Learning Research, pp. 2654–2663. PMLR, 2018.

Diederik P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems 31 (NeurIPS 2018), pp. 10236–10245, 2018.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR), 2014.

Thomas N. Kipf, Elise van der Pol, and Max Welling. Contrastive learning of structured world models. In International Conference on Learning Representations (ICLR). OpenReview.net, 2020.

Maciej Koch-Janusz and Zohar Ringel. Mutual information, neural networks and the renormalization group. Nature Physics, 14(6):578–582, Jun 2018. doi: 10.1038/s41567-018-0081-4.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. CIFAR-10 (Canadian Institute for Advanced Research).

Y. LeCun. A theoretical framework for back-propagation. 1988.

Yann LeCun, Bernhard E. Boser, John S. Denker, Donnie Henderson, Richard E. Howard, Wayne E. Hubbard, and Lawrence D. Jackel. Handwritten digit recognition with a back-propagation network. In Advances in Neural Information Processing Systems 2, pp. 396–404. Morgan Kaufmann, 1989.

Ching Hua Lee and Xiao-Liang Qi. Exact holographic mapping in free fermion systems. Physical Review B, 93:035112, Jan 2016. doi: 10.1103/PhysRevB.93.035112.

Patrick M. Lenggenhager, Doruk Efe Gökmen, Zohar Ringel, Sebastian D. Huber, and Maciej Koch-Janusz. Optimal renormalization group transformation from information theory. Physical Review X, 10:011037, Feb 2020. doi: 10.1103/PhysRevX.10.011037.

Shuo-Hui Li and Lei Wang. Neural network renormalization group. Physical Review Letters, 121:260601, Dec 2018.

Henry W. Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168(6):1223–1247, Sep 2017. doi: 10.1007/s10955-017-1836-5.

Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In IEEE International Conference on Computer Vision (ICCV), pp. 3730–3738. IEEE Computer Society, 2015. doi: 10.1109/ICCV.2015.425. URL https://doi.org/10.1109/ICCV.2015.425.

Francesco Locatello, Stefan Bauer, Mario Lucic, Gunnar Rätsch, Sylvain Gelly, Bernhard Schölkopf, and Olivier Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. In Proceedings of the 36th International Conference on Machine Learning (ICML 2019), volume 97 of Proceedings of Machine Learning Research, pp. 4114–4124. PMLR, 2019.

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations (ICLR). OpenReview.net, 2019.

Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, and Ian J. Goodfellow. Adversarial autoencoders. CoRR, abs/1511.05644, 2015.

Pankaj Mehta and David J. Schwab. An exact mapping between the variational renormalization group and deep learning, 2014.

Bruno A. Olshausen and David J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583):607, 1996.

Bruno A. Olshausen and David J. Field. Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision Research, 37(23):3311–3325, 1997.

Dan Oprisa and Peter Toth. Criticality & deep learning II: Momentum renormalisation group, 2017.

Fernando Pastawski, Beni Yoshida, Daniel Harlow, and John Preskill. Holographic quantum error-correcting codes: toy models for the bulk/boundary correspondence. Journal of High Energy Physics, 2015(6):149, Jun 2015. doi: 10.1007/JHEP06(2015)149.

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32 (NeurIPS 2019), pp. 8024–8035, 2019.

Xiao-Liang Qi. Exact holographic mapping and emergent space-time geometry. arXiv: High Energy Physics – Theory, 2013.

Aditya Ramesh, Youngduck Choi, and Yann LeCun. A spectral regularizer for unsupervised disentanglement. CoRR, abs/1812.01161, 2018.

Scott E. Reed, Aäron van den Oord, Nal Kalchbrenner, Sergio Gomez Colmenarejo, Ziyu Wang, Yutian Chen, Dan Belov, and Nando de Freitas. Parallel multiscale autoregressive density estimation. In Proceedings of the 34th International Conference on Machine Learning (ICML 2017), volume 70 of Proceedings of Machine Learning Research, pp. 2912–2921. PMLR, 2017.

Danilo Jimenez Rezende, George Papamakarios, Sébastien Racanière, Michael S. Albergo, Gurtej Kanwar, Phiala E. Shanahan, and Kyle Cranmer. Normalizing flows on tori and spheres. CoRR, abs/2002.02428, 2020.

Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems 29 (NeurIPS 2016), pp. 901, 2016.

Robin Tibor Schirrmeister, Yuxuan Zhou, Tonio Ball, and Dan Zhang. Understanding anomaly detection with deep invertible networks through hierarchies of distributions and features. CoRR, abs/2006.10848, 2020.

H. Eugene Stanley. Scaling, universality, and renormalization: Three pillars of modern critical phenomena. Reviews of Modern Physics, 71:S358–S366, Mar 1999. doi: 10.1103/RevModPhys.71.S358.

Brian Swingle. Entanglement renormalization and holography. Physical Review D, 86:065007, Sep 2012. doi: 10.1103/PhysRevD.86.065007.

Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2818–2826. IEEE Computer Society, 2016. doi: 10.1109/CVPR.2016.308. URL https://doi.org/10.1109/CVPR.2016.308.

Joshua B. Tenenbaum and William T. Freeman. Separating style and content with bilinear models. Neural Computation, 12(6):1247–1283, 2000.

Dmitry Ulyanov, Vadim Lebedev, Andrea Vedaldi, and Victor S. Lempitsky. Texture networks: Feed-forward synthesis of textures and stylized images. In Proceedings of the 33rd International Conference on Machine Learning (ICML 2016), volume 48 of JMLR Workshop and Conference Proceedings, pp. 1349–1357. JMLR.org, 2016.

Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. CoRR, abs/2007.03898, 2020.

G. Vidal. Class of quantum many-body states that can be efficiently simulated. Physical Review Letters, 101:110501, Sep 2008. doi: 10.1103/PhysRevLett.101.110501.

Kenneth G. Wilson. Renormalization group and critical phenomena. I. Renormalization group and the Kadanoff scaling picture. Physical Review B, 4:3174–3183, Nov 1971.

Yi-Zhuang You, Xiao-Liang Qi, and Cenke Xu. Entanglement holographic mapping of many-body localized system by spectrum bifurcation renormalization group. Physical Review B, 93:104205, Mar 2016. doi: 10.1103/PhysRevB.93.104205.

Yi-Zhuang You, Zhao Yang, and Xiao-Liang Qi. Machine learning spatial geometry from entanglement features. Physical Review B, 97:045153, Jan 2018. doi: 10.1103/PhysRevB.97.045153.

Juexiao Zhang, Yubei Chen, Brian Cheung, and Bruno A. Olshausen. Word embedding visualization via dictionary learning. CoRR, abs/1910.03833, 2019.

Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.
Appendices for RG-Flow
A DETAILS OF THE NETWORK AND THE TRAINING PROCEDURE

B DETAILS OF THE SYNTHETIC MULTI-SCALE DATASETS
To illustrate RG-Flow's ability to disentangle representations at different scales, as well as spatially separated representations, we propose two synthetic toy datasets with multi-scale features, named MSDS1 and MSDS2, as shown in Fig. 9. Each dataset contains images of 32 × 32 pixels. In each image, there are 16 ovals with different colors and orientations, and their positions have small random variations to deform the 4 × 4 grid. In MSDS1, all ovals in an image have almost the same color, while their orientations are randomly distributed; the color is thus a global feature in MSDS1, and the orientation is a local feature. In MSDS2, on the contrary, the orientation is a global feature, and the color is a local one.
Figure 9: Samples from MSDS1 and MSDS2.

We trained RG-Flow on those datasets, with n_layer = 4 and n_res = 4, and the other hyperparameters as described in Appendix A. For comparison, we also trained Real NVP on those datasets, with approximately the same number of trainable parameters. Their generated images are shown in Fig. 10, where we can intuitively see that RG-Flow has learned the characteristics of the two datasets: in each image from MSDS1 the ovals have almost the same color, and in each image from MSDS2 the same orientation. In contrast, Real NVP fails to capture the global and local features of those datasets. The metrics of bits per dimension (BPD) and Fréchet Inception distance (FID) are listed in Table 1. Note that FID may not reflect much semantic property for such synthetic datasets.
Figure 10: Samples from RG-Flow and Real NVP trained on MSDS1 and MSDS2.

              BPD ↓             FID ↓
            MSDS1  MSDS2     MSDS1  MSDS2
  RG-Flow    n/a    n/a       n/a    n/a
  Real NVP   1.07   1.14      50.4   78.0

Table 1: BPD and FID from RG-Flow and Real NVP trained on MSDS1 and MSDS2.
C DISCUSSION ON REAL NVP
As a comparison to Fig. 4, we plot the receptive fields of Real NVP in Fig. 11(a), together with the statistics of their strength in Fig. 11(b). Without local constraints on the bijective maps, there is no generation causal cone for Real NVP and other globally connected multi-scale models, and in general we cannot find semantic factors separated at different scales. In contrast, RG-Flow can separate semantic factors at different scales, as shown in Fig. 4.
Figure 11: Subplot (a) shows randomly picked receptive fields (at levels h = 0 to h = 3) from Real NVP trained on CelebA. Subplot (b) shows the statistics of the receptive fields' strength.
D IMAGE GENERATION USING EFFECTIVE AND MIXED TEMPERATURES
Just like other generative models, we can define an effective temperature for RG-Flow. Our prior distribution is

$$p_Z(z) = \prod_l p(z_l), \tag{11}$$

where l = (i, j, h) labels every latent variable by its coordinate on the hyperbolic tree; in particular, h is the RG level. A model with effective temperature T (T > 0) changes the prior distribution to p_{Z,T}(z), with

$$p_{Z,T}(z) \propto (p_Z(z))^{1/T}. \tag{12}$$

For the Laplacian prior, the effective temperature is implemented as

$$p_{Z,T}(z) = \prod_l \frac{1}{2T} \exp\left(-\frac{|z_l|}{T}\right). \tag{13}$$

Moreover, we can define a mixed-temperature model on the hyperbolic tree by

$$p_{Z,\mathrm{mix}}(z) \propto \prod_l (p(z_l))^{1/T_l}, \tag{14}$$

where T_l can be coordinate-dependent.

In our training procedure, we find that the bijective maps on the lowest level take a longer time to converge. Therefore, when plotting the results in Fig. 5, we use the mixed-temperature scheme in which T_{h=0} differs from T_{h=1} = T_{h=2} = T_{h=3} (both below 1), where T_h is the effective temperature on level h. We then vary each latent variable over a fixed range, chosen separately for high- and mid-level variables and for low-level ones. A minimal sampling sketch follows.
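A minimal sketch of sampling with a per-level temperature as in Eqs. (13)–(14); the list-of-levels latent layout is an assumption carried over from the earlier mixing sketch, and the shapes and temperature values in the usage comment are purely illustrative.

```python
import torch
from torch.distributions import Laplace

def sample_latents(shapes_per_level, temps_per_level):
    """Sample z^(h) ~ Laplace(0, T_h) for each RG level h (Eq. 13 with b = T_h)."""
    return [
        Laplace(0.0, t).sample(shape)
        for shape, t in zip(shapes_per_level, temps_per_level)
    ]

# usage (illustrative shapes and temperatures): x = G(sample_latents(
#     shapes_per_level=[(3, 16, 16), (3, 8, 8), (3, 4, 4), (3, 4, 4)],
#     temps_per_level=[0.3, 0.8, 0.8, 0.8]))
```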
E A TOY MODEL FOR SPARSE PRIOR DISTRIBUTION
Figure 12: Two-dimensional pinwheel model (latent space, visible space, and target density for the Laplacian and Gaussian priors).

In the first column of Fig. 12, the density plots of the Laplacian distribution and the Gaussian distribution are shown. As we can see, the Gaussian distribution has rotational symmetry, whereas the Laplacian distribution breaks the SO(2) rotational symmetry down to a C_4 symmetry. A flow-based generative model is trained to match the target distribution, which is a four-leg pinwheel, as shown in the last column of Fig. 12. Given either the Gaussian distribution or the Laplacian distribution as the prior, the model can learn the target distribution. In the second column, we sample points and color them by the quadrant they live in under the prior distribution. Then we map them to the visible space through the trained model, as shown in the third column. We see that the four quadrants are approximately mapped to the four legs of the pinwheel when the Laplacian prior is used. For the Gaussian case, since it has rotational symmetry, the points from different quadrants are mixed more in the visible space, which makes it harder to interpret the mapping.
F DETAILS OF INPAINTING EXPERIMENTS
For the inpainting experiments shown in Fig. 7, we randomly choose a small square corrupted region on the ground truth image, marked as the red patch in the second row of Fig. 7. We generate an image x_g from latent variables z_g, and use its corresponding region to fill in the corrupted region. Then we map the filled image x_f back to the latent variables z_f, and compute the log-likelihood. To recover the ground truth image, we optimize z_g to maximize the log-likelihood.

For RG-Flow, we only vary the latent variables living inside the inference causal cone, which is about 1200 out of 3072 latent variables. For the constrained Real NVP, we randomly pick the same number of latent variables and allow them to vary, and we find that it generally fails to inpaint the image. As a check, we find that Real NVP can successfully inpaint the images if we optimize all latent variables, as shown in the last row of Fig. 7.

We use the conventional Adam optimizer for the optimization. During the optimization procedure, we find that the optimizer can be trapped in local minima. Therefore, for all experiments, we first randomly draw several initial samples of the latent variables that are allowed to vary, then pick the one with the largest log-likelihood as the initialization.

To quantitatively assess the quality of the inpainted images, we compute their peak signal-to-noise ratio (PSNR) against the ground truth images, and take the average over the samples shown in Fig. 7. To further incorporate semantic properties in the assessment, we can also use the Inception-v3 network (Szegedy et al., 2016): we first scale the images to 299 × 299, then feed them into the Inception-v3 network, extract the features from the "pool3" layer (as in the FID score), and compute the PSNR. The results are listed in Table 2, and a minimal PSNR sketch follows the table.
                          PSNR ↑   Inception-PSNR ↑
  RG-Flow                  n/a          n/a
  Constrained Real NVP     23.2         15.7
  Real NVP                 37.0         24.8

Table 2: PSNR and Inception-PSNR of inpainted images.
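A minimal sketch of the PSNR computation used above, for images scaled to [0, 1]; the Inception-PSNR variant would apply the same formula to the "pool3" feature vectors instead of pixels.

```python
import torch

def psnr(x, y, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

# Inception-PSNR (sketch): extract pool3 features with a pretrained
# Inception-v3 on 299x299 inputs, then apply psnr() to the feature vectors.
```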
G EXPERIMENTS ON OTHER DATASETS
For the CIFAR-10 dataset, we use the same hyperparameters as described in Appendix A. For the 3D Chair dataset, we use n_layer = 8 for all RG steps, because there is not as much low-level information as in CelebA and CIFAR-10. We also trained Real NVP on those datasets for comparison, with approximately the same number of trainable parameters. The metrics of bits per dimension (BPD) and Fréchet Inception distance (FID) are listed in Table 3.
              BPD ↓                          FID ↓
            CelebA  CIFAR-10  3D Chair    CelebA  CIFAR-10  3D Chair
  RG-Flow    n/a     3.47      n/a         n/a     n/a       n/a
  Real NVP   n/a     n/a       n/a         n/a     126       73.8

Table 3: Bits per dimension (BPD) from RG-Flow and Real NVP trained on various datasets.

Figure 13: Samples from RG-Flow trained on the CelebA dataset, sampled at an effective temperature T below 1.

Figure 14: Samples from RG-Flow trained on the CIFAR-10 dataset.

Figure 15: Samples from RG-Flow trained on the 3D Chair dataset.
H RECEPTIVE FIELDS OF THE LATENT REPRESENTATIONS
More examples of randomly picked receptive fields are plotted in Fig. 16, Fig. 17, Fig. 18, and Fig. 19. For better visualization, we normalize the strength of all receptive fields to one.

Figure 16: Receptive fields of high-level latent variables (h = 3).

Figure 17: Receptive fields of mid-level latent variables (h = 2).

Figure 18: Receptive fields of low-level latent variables (h = 1).

Figure 19: Receptive fields of the lowest-level latent variables (h = 0).