Dirichlet Variational Autoencoder
Weonyoung Joo, Wonsung Lee, Sungrae Park & Il-Chul Moon
Department of Industrial and Systems Engineering
Korea Advanced Institute of Science and Technology
Daejeon, South Korea
{es345,aporia,sungraepark,icmoon}@kaist.ac.kr

Abstract
This paper proposes the Dirichlet Variational Autoencoder (DirVAE), which uses a Dirichlet prior for a continuous latent variable that exhibits the characteristics of categorical probabilities. To infer the parameters of the DirVAE, we utilize the stochastic gradient method by approximating the Gamma distribution, which is a component of the Dirichlet distribution, with the inverse Gamma CDF approximation. Additionally, we reframe the component collapsing issue by investigating two problem sources, decoder weight collapsing and latent value collapsing, and we show that DirVAE has no component collapsing, while the Gaussian VAE exhibits decoder weight collapsing and the Stick-Breaking VAE shows latent value collapsing. The experimental results show that 1) DirVAE models the latent representation with the best log-likelihood compared to the baselines; and 2) DirVAE produces more interpretable latent values with none of the collapsing issues that the baseline models suffer from. Also, we show that the learned latent representation from the DirVAE achieves the best classification accuracy in the semi-supervised and the supervised classification tasks on MNIST, OMNIGLOT, and SVHN compared to the baseline VAEs. Finally, we demonstrate that DirVAE-augmented topic models show better performances in most cases.
1 Introduction

A Variational Autoencoder (VAE) (Kingma & Welling, 2014c) brought success in deep generative models (DGMs) with a Gaussian distribution as a prior distribution (Jiang et al., 2017; Miao et al., 2016; 2017; Srivastava & Sutton, 2017). The VAE assumes the prior distribution to be N(0, I) and learns the approximated posterior parameters μ̂ and Σ̂. Also, the Stick-Breaking VAE (SBVAE) (Nalisnick & Smyth, 2017) is a nonparametric version of the VAE, which models the latent dimension to be infinite using a stick-breaking process (Ishwaran & James, 2001).

While these VAEs assume the prior distribution of the latent variables to be continuous random variables, recent studies introduce approximations of discrete priors with continuous random variables (Jang et al., 2017; Maddison et al., 2017; Rolfe, 2017). The key of these approximations is enabling backpropagation with the reparametrization technique, or the stochastic gradient variational Bayes (SGVB) estimator, while the modeled prior follows a discrete distribution. The applications of these approximations on discrete priors include the prior modeling of a multinomial distribution, which is frequently used in probabilistic graphical models (PGMs). Inherently, the multinomial distribution takes a Dirichlet distribution as a conjugate prior, and the demand for such a prior has motivated works like Jang et al. (2017); Maddison et al. (2017); Rolfe (2017) that support a multinomial posterior without explicit modeling of a Dirichlet prior.

When we survey the work with explicit modeling of the Dirichlet prior, we find a frequent approach of utilizing a softmax Laplace approximation (Srivastava & Sutton, 2017). We argue that this approach has a limitation from the multi-modality perspective. The Dirichlet distribution can exhibit a multi-modal distribution under certain parameter settings, see Figure 1, which is infeasible to generate with a Gaussian distribution followed by a softmax function. Therefore, the previous continuous-domain VAEs cannot be a perfect substitute for the direct approximation of the Dirichlet distribution.

Figure 1: Illustrated probability simplex with Gaussian-Softmax, GEM, and Dirichlet distributions. Unlike the Gaussian-Softmax or the GEM distribution, the Dirichlet distribution is able to capture the multi-modality that manifests as multiple peaks at the vertices of the probability simplex.

Utilizing a Dirichlet distribution as a conjugate prior to a multinomial distribution has an advantage compared to the usage of a softmax function on a Gaussian distribution. For instance, Figure 1 illustrates the potential difficulties in utilizing the softmax function with the Gaussian distribution. Given the three-dimensional probability simplex, the Gaussian-Softmax distribution cannot generate the illustrated case of the Dirichlet distribution with a high probability measure at the vertices of the simplex, i.e. the multi-modality whose necessity was emphasized in Hoffman & Johnson (2016). Additionally, the Griffiths-Engen-McCloskey (GEM) distribution (Pitman, 2002), which is the prior distribution of the SBVAE, has difficulty modeling the multi-modality because the sampling procedure of the GEM distribution is affected by the rich-get-richer phenomenon, so a few components tend to dominate the weight of the samples.
This is different from the Dirichlet distribution, which does not exhibit such a phenomenon: the Dirichlet distribution can distribute the weights fairly among the components, and it is more likely to capture the multi-modality by controlling the prior hyper-parameter (Blei et al., 2003). Hence, we conjecture that enhanced modeling of the Dirichlet prior is still needed 1) because there are cases that the Gaussian-Softmax approaches, or the softmax Laplace approximation, cannot imitate the Dirichlet distribution; and 2) because the nonparametric approaches can be influenced by biases that the Dirichlet distribution does not suffer from.

Given these motivations for modeling the Dirichlet distribution with the SGVB estimator, this paper introduces the Dirichlet Variational Autoencoder (DirVAE) that shows the same characteristics as the Dirichlet distribution. The DirVAE is able to model the multi-modal distribution that was not possible with the Gaussian-Softmax and the GEM approaches. These characteristics allow the DirVAE to serve as the prior of a discrete latent distribution, as the original Dirichlet distribution does.

Introducing the DirVAE requires the configuration of the SGVB estimator on the Dirichlet distribution. Specifically, the Dirichlet distribution is a composition of Gamma random variables, so we approximate the inverse Gamma cumulative distribution function (CDF) with an asymptotic approximation. This approximation of the inverse Gamma CDF becomes the component for approximating the Dirichlet distribution. We compare this approach to the previously suggested approximations, i.e. approaches with the Weibull distribution and with the softmax Gaussian distribution, and our approximation shows the best log-likelihood among the compared approximations.

Moreover, we report that we had to investigate component collapsing along with the research on DirVAE. It has been known that the component collapsing issue is resolved by the SBVAE because of the meaningful decoder weights from the latent layer to the next layer. However, we found that the SBVAE has a latent value collapsing issue, resulting in many near-zero values on the latent dimensions, which leads to incomplete utilization of the latent dimension. Hence, we argue that the Gaussian VAE (GVAE) suffers from decoder weight collapsing, previously narrowly defined as component collapsing; and the SBVAE has a problem of latent value collapsing. Finally, we suggest that the definition of component collapsing should be expanded to cover both decoder weight and latent value collapsing. The proposed DirVAE shows neither near-zero decoder weights nor near-zero latent values, so the reconstruction uses the full latent dimension information in most cases. We investigated this issue because our performance gain comes from resolving the expanded version of component collapsing. Due to the component collapsing issues, the existing VAEs have less meaningful latent values or cannot effectively use their latent representation. Meanwhile, DirVAE does not exhibit component collapsing due to the multi-modal prior, which possibly leads to superior qualitative and quantitative performances. We experimentally show that the DirVAE has more meaningful and disentangled latent representations through image generation and latent value visualizations.

Technically, the new approximation provides the closed-form loss function derived from the evidence lower bound (ELBO) of the DirVAE.
The optimization of the ELBO enables representation learning with the DirVAE, and we test the learned representation from the DirVAE in two ways. First, we test the representation learning quality by performing supervised and semi-supervised classification tasks on MNIST, OMNIGLOT, and SVHN. These classification tasks conclude that DirVAE achieves the best classification performance with its learned representation. Second, we test the applicability of DirVAE to existing models, such as topic models with DirVAE priors, on 20Newsgroups and RCV1-v2. This experiment shows that the augmentation of DirVAE to the existing neural variational topic models improves the perplexity and the topic coherence, and most of the best performers were DirVAE-augmented.
2 Preliminaries
2.1 Variational Autoencoders
A VAE is composed of two parts: a generative sub-model and an inference sub-model. In the generative part, a probabilistic decoder reproduces x̂ close to an observation x from a latent variable z ∼ p(z), i.e. x ∼ p_θ(x|z) = p_θ(x|ζ), where ζ = MLP(z) is obtained from the latent variable z by a multilayer perceptron (MLP). In the inference part, a probabilistic encoder outputs a latent variable z ∼ q_φ(z|x) = q_φ(z|η), where η = MLP(x) is computed from the observation x by an MLP. The model parameters, θ and φ, are jointly learned by optimizing the ELBO below with the stochastic gradient method, using backpropagation as in ordinary neural networks together with the SGVB estimators on the random nodes:

$$\log p(\mathbf{x}) \ge \mathcal{L}(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big). \quad (1)$$

In the GVAE (Kingma & Welling, 2014c), the prior distribution p(z) is assumed to be a standard Gaussian distribution. In the SBVAE (Nalisnick & Smyth, 2017), the prior distribution becomes a GEM distribution that produces samples with a Beta distribution and a stick-breaking algorithm.
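To make Equation (1) concrete, the following is a minimal NumPy sketch of a single-sample ELBO evaluation for the Gaussian-prior case. The Bernoulli decoder, the toy data, and the function names are illustrative assumptions, not the paper's implementation.

```python
# A minimal NumPy sketch (illustrative assumptions, not the paper's code) of the
# single-sample ELBO in Eq. (1) for a Gaussian VAE: a Bernoulli reconstruction
# term plus the analytic KL between N(mu, diag(sigma^2)) and the N(0, I) prior.
import numpy as np

def gaussian_vae_elbo(x, mu, log_var, decode, rng=np.random.default_rng()):
    """x: binary data vector; mu, log_var: encoder outputs; decode: z -> Bernoulli means."""
    eps = rng.normal(size=mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps                     # reparametrized sample
    p = np.clip(decode(z), 1e-7, 1 - 1e-7)                   # decoder Bernoulli means
    recon = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))  # E_q[log p(x|z)] (1 sample)
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return recon - kl

# Toy usage with a hypothetical constant "decoder".
x = (np.random.rand(784) > 0.5).astype(float)
mu, log_var = np.zeros(50), np.zeros(50)
print(gaussian_vae_elbo(x, mu, log_var, lambda z: np.full(784, 0.5)))
```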
2.2 Dirichlet Distribution as a Composition of Gamma Random Variables

The Dirichlet distribution is a composition of multiple Gamma random variables. Note that the probability density functions (PDFs) of the Dirichlet and Gamma distributions are as follows:

$$\mathrm{Dirichlet}(\mathbf{x};\boldsymbol{\alpha}) = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)} \prod_k x_k^{\alpha_k - 1}, \qquad \mathrm{Gamma}(x;\alpha,\beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1} e^{-\beta x}, \quad (2)$$

where α_k, α, β > 0. In detail, if there are K independent random variables following Gamma distributions, X_k ∼ Gamma(α_k, β), or equivalently X ∼ MultiGamma(α, β·1_K), where α_k, β > 0 for k = 1, ..., K, then Y ∼ Dirichlet(α) where Y_k = X_k / Σ_i X_i. It should be noted that the rate parameter β must be the same for every Gamma distribution in the composition. Then, the KL divergence can be derived as the following:

$$\mathrm{KL}(Q \,\|\, P) = \sum_k \log\Gamma(\alpha_k) - \sum_k \log\Gamma(\hat{\alpha}_k) + \sum_k (\hat{\alpha}_k - \alpha_k)\,\psi(\hat{\alpha}_k) \quad (3)$$

for P = MultiGamma(α, β·1_K) and Q = MultiGamma(α̂, β·1_K), where ψ is the digamma function. The detailed derivation is provided in Appendix B.
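The Gamma composition and the KL term of Equation (3) can be written compactly in code. The sketch below is an illustration under stated assumptions (NumPy/SciPy, a symmetric prior of 0.98 as in the paper's MNIST setting, and a random stand-in for an encoder output), not the authors' released implementation.

```python
# A minimal NumPy sketch of the Gamma composition in Section 2.2: sampling a
# Dirichlet vector by normalizing independent Gamma draws, and the closed-form
# KL divergence of Eq. (3) between two MultiGamma distributions sharing beta.
import numpy as np
from scipy.special import gammaln, digamma

def sample_dirichlet_via_gamma(alpha, beta=1.0, rng=np.random.default_rng()):
    """Draw y ~ Dirichlet(alpha) as y_k = x_k / sum_i x_i with x_k ~ Gamma(alpha_k, beta)."""
    x = rng.gamma(shape=alpha, scale=1.0 / beta)   # rate beta corresponds to scale 1/beta
    return x / x.sum()

def kl_multigamma(alpha_hat, alpha):
    """KL(Q || P) of Eq. (3): Q = MultiGamma(alpha_hat, beta), P = MultiGamma(alpha, beta)."""
    return np.sum(gammaln(alpha) - gammaln(alpha_hat)
                  + (alpha_hat - alpha) * digamma(alpha_hat))

alpha_prior = np.full(50, 0.98)                    # symmetric prior used for MNIST (K = 50)
alpha_post = np.abs(np.random.randn(50)) + 0.5     # stand-in for an encoder output
print(sample_dirichlet_via_gamma(alpha_post).sum())  # ~1.0, a point on the simplex
print(kl_multigamma(alpha_post, alpha_prior))
```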
2.3 SGVB for Gamma Random Variables and Approximation of the Dirichlet Distribution

This section discusses several ways of approximating the Dirichlet random variable, or equivalently the SGVB estimators for the Gamma random variables that compose a Dirichlet distribution. Utilizing SGVB requires a differentiable non-centered parametrization (DNCP) of the distribution (Kingma & Welling, 2014d). The main SGVB estimator for Gamma random variables used in DirVAE relies on the inverse Gamma CDF approximation explained in the next section. Prior works include two approaches, the use of the Weibull distribution and of the softmax Gaussian distribution, which are explained in this section.
Approximation with Weibull distribution.
Because of the similarity between the PDFs of the Weibull distribution and the Gamma distribution, some prior works used the Weibull distribution as a posterior distribution for the Gamma prior (Zhang et al., 2018):

$$\mathrm{Weibull}(x;k,\lambda) = \frac{k}{\lambda}\left(\frac{x}{\lambda}\right)^{k-1} e^{-(x/\lambda)^k} \quad \text{where } k, \lambda > 0. \quad (4)$$

Zhang et al. (2018) pointed out two useful characteristics when approximating the Gamma distribution with the Weibull distribution: the KL divergence is expressed in closed form, and the reparametrization trick is simple because the inverse CDF of the Weibull distribution has a closed form. However, we note that the Weibull distribution has a component e^{-(x/λ)^k}, whereas the Gamma distribution does not have the additional power term k in the exponent. Since k is placed in the exponential component, small changes in k can cause a significant difference that limits the optimization.
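For reference, the closed-form Weibull inverse CDF mentioned above gives the simple reparametrization sketched below; this is a hedged illustration with assumed shape and scale values, not the WHAI implementation of Zhang et al. (2018).

```python
# A minimal sketch of the Weibull reparametrization trick: the Weibull inverse
# CDF is available in closed form, so a sample can be written as a deterministic
# function of (k, lambda) and a uniform draw.
import numpy as np

def weibull_reparam_sample(k, lam, rng=np.random.default_rng()):
    """x = lambda * (-log(1 - u))**(1/k) with u ~ Uniform(0, 1)."""
    u = rng.uniform(size=np.shape(k))
    return lam * (-np.log(1.0 - u)) ** (1.0 / k)

# Example: a 50-dimensional draw with illustrative shape/scale values.
k = np.full(50, 1.2)
lam = np.full(50, 0.8)
print(weibull_reparam_sample(k, lam)[:5])
```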
Approximation with softmax Gaussian distribution. As in MacKay (1998); Srivastava & Sutton (2017), a Dirichlet distribution can be approximated by a softmax Gaussian distribution using a softmax Laplace approximation. The relation between the Dirichlet parameter α and the Gaussian parameters μ, Σ is the following:

$$\mu_k = \log\alpha_k - \frac{1}{K}\sum_i \log\alpha_i, \qquad \Sigma_{kk} = \frac{1}{\alpha_k}\Big(1-\frac{2}{K}\Big) + \frac{1}{K^2}\sum_i \frac{1}{\alpha_i}, \quad (5)$$

where Σ is assumed to be a diagonal matrix, and the reparametrization trick of the usual GVAE is used for the SGVB estimator.
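A small sketch of the mapping in Equation (5) follows. It also checks the correspondence used later in the paper, where a symmetric α = (1 − 1/K)·1 yields μ = 0 and unit diagonal variance; the function name and values are illustrative assumptions.

```python
# A sketch of the softmax Laplace approximation in Eq. (5): mapping a Dirichlet
# parameter vector alpha to the diagonal Gaussian (mu, Sigma) whose softmax
# approximates it.
import numpy as np

def laplace_approx_params(alpha):
    """Return (mu, sigma_diag) of the softmax Gaussian approximating Dirichlet(alpha)."""
    K = alpha.shape[0]
    mu = np.log(alpha) - np.mean(np.log(alpha))
    sigma_diag = (1.0 / alpha) * (1.0 - 2.0 / K) + np.sum(1.0 / alpha) / K**2
    return mu, sigma_diag

# With symmetric alpha = (1 - 1/K) * 1, this recovers mu = 0 and Sigma_kk = 1,
# which is the correspondence the paper uses to match the DirVAE and GVAE priors.
K = 50
mu, sig = laplace_approx_params(np.full(K, 1.0 - 1.0 / K))
print(mu[:3], sig[:3])   # approximately [0, 0, 0] and [1, 1, 1]
```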
3 Model Description

Along with the inverse Gamma CDF approximation, we describe two sub-models in this section: the generative sub-model and the inference sub-model. Figure 2 shows the graphical notations of various VAEs and the neural network view of our model.

Figure 2 panels: (a) GVAE, (b) SBVAE, (c) DirVAE, (d) DirVAE in the neural network view.
Figure 2: Sub-figures 2a, 2b, and 2c are the graphical notations of the VAEs as latent variable models. The solid lines indicate the generative sub-models, and the waved lines denote a prior distribution of the latent variables. The dotted lines indicate the inference sub-models. Sub-figure 2d denotes the neural network structure corresponding to Sub-figure 2c. Red nodes denote the random nodes that allow the backpropagation flows to the input.
Generative sub-model.
The key difference between the generative models of the DirVAE and the GVAE is the prior distribution assumed on the latent variable z. Instead of using the standard Gaussian distribution, we use the Dirichlet distribution, which is the conjugate prior distribution of the multinomial distribution:

$$\mathbf{z} \sim p(\mathbf{z}) = \mathrm{Dirichlet}(\boldsymbol{\alpha}), \qquad \mathbf{x} \sim p_\theta(\mathbf{x}|\mathbf{z}). \quad (6)$$

Inference sub-model.
The probabilistic encoder with the approximate posterior distribution q_φ(z|x) is designed to be Dirichlet(α̂). The approximated posterior parameter α̂ is produced by an MLP from the observation x with a softplus output function, so the outputs satisfy the positivity constraint of the Dirichlet distribution. Here, we do not directly sample z from the Dirichlet distribution. Instead, we use the Gamma composition method described in Section 2.2: we first draw v ∼ MultiGamma(α̂, β·1_K), and then normalize v by its summation Σ_i v_i.

The objective function used to optimize the model parameters, θ and φ, is composed of Equations (1) and (3); Equation (7) is the resulting loss function. The inverse Gamma CDF method explained in the next paragraph enables the backpropagation flows to the input with the stochastic gradient method. Here, for a fair comparison in expressing the Dirichlet distribution between the inverse Gamma CDF approximation and the softmax Gaussian method, we set α_k = 1 − 1/K, which corresponds to μ_k = 0 and Σ_kk = 1 through Equation (5), and β = 1.

$$\mathcal{L}(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z})] - \Big(\sum_k \log\Gamma(\alpha_k) - \sum_k \log\Gamma(\hat{\alpha}_k) + \sum_k (\hat{\alpha}_k - \alpha_k)\,\psi(\hat{\alpha}_k)\Big) \quad (7)$$

Approximation with inverse Gamma CDF.
A previous work (Knowles, 2015) suggested that if X ∼ Gamma(α, β) and F(x; α, β) is the CDF of the random variable X, the inverse CDF can be approximated as F^{-1}(u; α, β) ≈ β^{-1}(uαΓ(α))^{1/α}. Hence, we can introduce an auxiliary variable u ∼ Uniform(0, 1) to take over all the randomness of X, and we treat the Gamma sample X as a deterministic function of α and β.

It should be noted that there has been a practice of combining the decomposition of a Dirichlet distribution with the approximation of each Gamma component by the inverse Gamma CDF. However, such practices have not been examined regarding their learning properties and applicability. The following section shows a new aspect of component collapsing that can be remedied by this combination on the Dirichlet prior in a VAE, and the section illustrates the performance gains in a certain set of applications, i.e. topic modeling.
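Putting the pieces together, the sketch below illustrates the DirVAE reparametrization: uniform noise is mapped through the approximate inverse Gamma CDF per dimension and the result is normalized onto the simplex. It is a NumPy illustration under assumptions (stand-in encoder outputs, β = 1), not the authors' TensorFlow code; in an autodiff framework the same expression is differentiable with respect to α̂.

```python
# A minimal NumPy sketch of the DirVAE reparametrization: each Gamma component
# is a deterministic function of its parameters and a uniform draw via Knowles'
# (2015) inverse CDF approximation, and the components are normalized into a
# Dirichlet sample.
import numpy as np
from scipy.special import gamma as gamma_fn

def dirichlet_reparam_sample(alpha_hat, beta=1.0, rng=np.random.default_rng()):
    """z ~ approx. Dirichlet(alpha_hat) using F^{-1}(u; a, b) ~= (u a Gamma(a))**(1/a) / b."""
    u = rng.uniform(size=alpha_hat.shape)                        # auxiliary noise
    v = (u * alpha_hat * gamma_fn(alpha_hat)) ** (1.0 / alpha_hat) / beta
    return v / v.sum()                                           # Gamma composition -> simplex

alpha_hat = np.abs(np.random.randn(50)) + 0.5   # stand-in for softplus encoder outputs
z = dirichlet_reparam_sample(alpha_hat)
print(z.sum(), z.min() >= 0.0)                  # sums to 1, all components non-negative
```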
4 Experimental Results

This section reports the experimental results under the following experiment settings: 1) a pure VAE model; 2) a semi-supervised classification task with VAEs; 3) a supervised classification task with VAEs; and 4) topic models with DirVAE augmentations.

4.1 Experiments for Representation Learning of VAEs

Baseline models.
We select the following models as baseline alternatives to the DirVAE: 1) the standard GVAE; 2) the GVAE with softmax (GVAE-Softmax), approximating the Dirichlet distribution with the softmax Gaussian distribution; 3) the SBVAE with the Kumaraswamy distribution (SBVAE-Kuma) and with the Gamma composition (SBVAE-Gamma), as described in Nalisnick & Smyth (2017); and 4) the DirVAE with the Weibull distribution (DirVAE-Weibull), approximating the Gamma distribution with the Weibull distribution as described in Zhang et al. (2018). We use the following benchmark datasets for the experiments: 1) MNIST; 2) MNIST with rotations (MNIST+rot); 3) OMNIGLOT; and 4) SVHN with PCA transformation. We provide the details of the datasets in Appendix D.1.

Experimental setting.
As a pure VAE model, we compare the DirVAE with the following models: GVAE, GVAE-Softmax, SBVAE-Kuma, SBVAE-Gamma, and DirVAE-Weibull. We use 50-dimension and 100-dimension latent variables for MNIST and OMNIGLOT, respectively. We provide the details of the network structure and optimization in Appendix D.2. We set α = 0.98·1 for MNIST and α = 0.99·1 for OMNIGLOT for the fair comparison to GVAEs by using Equation (5). All experiments use the Adam optimizer (Kingma & Ba, 2014a) for the parameter learning. Finally, we acknowledge that the hyper-parameter could be updated as in Appendix C, and the experiment results with this update are separately reported in Appendix D.2.

Quantitative result.
For the quantitative comparison among the VAEs, we calculate the Monte Carlo estimate of the marginal negative log-likelihood, the negative ELBO, and the reconstruction loss. The marginal likelihood is approximated as p(x) ≈ (1/S) Σ_{i=1}^{S} p(x|z_i) p(z_i) / q(z_i) with z_i ∼ q(z) for a single instance x, where q(z) is the approximate posterior of the prior p(z); the derivation is given in Appendix A. Table 1 shows the overall performance of the alternative VAEs. The DirVAE outperforms all baselines on both datasets from the log-likelihood perspective. The value of DirVAE comes from the better encoding of the latent variables, which can be used for the classification tasks that we examine in the next experiments. While the DirVAE-Weibull follows the prior modeling with the Dirichlet distribution, the Weibull-based approximation can be improved by adopting the proposed approach with the inverse Gamma CDF.

Table 1: Negative log-likelihood, negative ELBO, and reconstruction loss of the VAEs for the MNIST (K = 50) and OMNIGLOT (K = 100) datasets, comparing GVAE, GVAE-Softmax, SBVAE-Kuma, SBVAE-Gamma, DirVAE-Weibull, and DirVAE along with the results reported in Nalisnick & Smyth (2017). Lower is better for all measures.

Qualitative result.
As a qualitative result, we report the latent dimension-wise reconstructions, which are decoder outputs obtained by feeding each one-hot vector in the latent dimension. Figure 3a shows reconstructed images corresponding to each latent dimension from GVAE-Softmax, SBVAE, and DirVAE. We manually ordered the digit-like figures in ascending order for GVAE-Softmax and DirVAE. We can see that the GVAE-Softmax and the SBVAE have components without significant semantic information, which we discuss further in Section 4.2, whereas the DirVAE has interpretable latent dimensions in most of the latent dimensions. Figure 3b also supports the quality of the latent values from DirVAE by visualizing the learned latent values through t-SNE (Maaten & Hinton, 2008).

Figure 3: Latent dimension visualization with reconstruction images and t-SNE latent embeddings. (a) Latent dimension-wise reconstructions of GVAE-Softmax, SBVAE, and DirVAE; the DirVAE shows more meaningful latent dimensions than the other VAEs. (b) t-SNE latent embeddings of (Left) GVAE, (Middle) SBVAE, (Right) DirVAE.

4.2 Discussion on Component Collapsing
Decoder weight collapsing, a.k.a. component collapsing.
One main issue of the GVAE is component collapsing: a significant number of decoder weights from the latent neurons to the next decoder neurons become near-zero. If these weights become near-zero, the values of the latent dimensions lose their influence on the next decoder layer, which means inefficient learning given a neural network structure. The same issue occurs when we use the GVAE-Softmax. We rename this component collapsing phenomenon as decoder weight collapsing to specifically address the collapsing source.
Figure 4: Sub-figure 4a shows the latent dimension-wise norm of the decoder weights of GVAE, GVAE-Softmax, SBVAE, and DirVAE: GVAE and GVAE-Softmax have the component collapsing issue, while SBVAE and DirVAE do not. Sub-figure 4b shows the latent values of SBVAE and DirVAE: SBVAE has many near-zero output values in the latent dimensions.
Latent value collapsing.
The SBVAE work claims to have solved the decoder weight collapsing by learning meaningful weights, as shown in Figure 4a. However, we notice that SBVAE produces near-zero output values, not weight parameters, in many latent dimensions after averaging over many samples obtained from the test dataset. Figure 4b shows the properties of DirVAE and SBVAE from the perspective of latent value collapsing: SBVAE shows many near-zero average means and near-zero average variances, while DirVAE does not. The average Fisher kurtosis and average skewness of DirVAE are 5.76 and 2.03, respectively, over the dataset, while SBVAE has 20.85 and 4.35, which states that the latent output distribution of SBVAE is more skewed than that of DirVAE. We found that these near-zero latent values prevent learning of the decoder weights, which we introduce as another type of collapsing problem, latent value collapsing, different from the decoder weight collapsing. These results mean that SBVAE distributes the non-near-zero latent values sparsely over a few dimensions, while DirVAE samples relatively dense latent values. In other words, DirVAE utilizes the full spectrum of latent dimensions compared to SBVAE, and DirVAE has a better learning capability in the decoder network. Figure 3a supports the argument on the latent value collapsing by activating each single latent dimension with a one-hot vector through the decoder. The non-changing latent dimension-wise images of SBVAE show that there were no generation differences between differently activated one-hot latent values.

4.3 Application Experiments of (Semi-)Supervised Classification with VAEs

Semi-supervised classification task with VAEs.
There is a previous work demonstrating that the SBVAE outperforms the GVAE in the semi-supervised classification task (Nalisnick & Smyth, 2017). The overall model structure for this semi-supervised classification task uses a VAE with separate random variables z and y, which is introduced as the M2 model in the original VAE work (Kingma et al., 2014b). The detailed settings of the semi-supervised classification tasks are enumerated in Appendix D.3. Fundamentally, we applied the same experimental settings to GVAE, SBVAE, and DirVAE in this experiment, as specified by the authors in Nalisnick & Smyth (2017).

Table 2 enumerates the performances of the GVAE, the SBVAE, and the DirVAE, showing the classification error rates when 10%, 5%, and 1% of the data are labeled for each dataset. In general, the experiment shows that the DirVAE has the best performance out of the three alternative VAEs. Also, it should be noted that the improvement of the DirVAE is largest in the most complex task, on the SVHN dataset.

Table 2: The error rate of the semi-supervised classification task using VAEs on MNIST (K = 50), MNIST+rot (K = 50), and SVHN (K = 50) with 10%, 5%, and 1% labeled data, comparing GVAE, SBVAE, and DirVAE (GVAE and SBVAE results from Nalisnick & Smyth, 2017).

Supervised classification task with latent values of VAEs.
We also test the performance of a supervised classification task on the learned latent representations from the VAEs. We apply the vanilla version of each VAE to the datasets, and we classify the latent representations of instances with k-Nearest Neighbors (kNN), one of the simplest classification algorithms. Hence, this experiment can better distinguish the performance of the representation learning in the classification task. Further experimental details can be found in Appendix D.4.

Table 3 enumerates the performances of the experimented VAEs on the MNIST and OMNIGLOT datasets. Both datasets indicate that the DirVAE shows the best performance in reducing the classification error, which we conjecture comes from the better representation learning. It should be noted that, to our knowledge, this is the first reported comparison of latent representation learning of VAEs with kNN in the supervised classification using the OMNIGLOT dataset. We identified that the classification with OMNIGLOT is difficult, given that the kNN error rates with the raw original data are all near 70% (Table 3). This high error rate mainly originates from the large number of classification categories in our test setting of OMNIGLOT, compared to the 10 categories in MNIST.
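The evaluation protocol of this experiment can be sketched as follows, assuming a hypothetical encode function that maps inputs to latent representations of a trained VAE; the scikit-learn usage is illustrative and not taken from the paper's code.

```python
# A hedged sklearn sketch of the supervised evaluation protocol described above:
# fit kNN on latent codes produced by a trained encoder and report the test
# error rate. `encode` is a stand-in for any trained VAE encoder returning,
# e.g., the posterior mean (GVAE) or normalized expectation (DirVAE).
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_error_rate(encode, x_train, y_train, x_test, y_test, k=3):
    """Error rate of a k-NN classifier trained on latent representations."""
    z_train = encode(x_train)            # shape (n_train, K)
    z_test = encode(x_test)              # shape (n_test, K)
    clf = KNeighborsClassifier(n_neighbors=k).fit(z_train, y_train)
    return 1.0 - clf.score(z_test, y_test)

# Example with random stand-in data and an identity "encoder".
rng = np.random.default_rng(0)
x_tr, y_tr = rng.normal(size=(200, 50)), rng.integers(0, 10, 200)
x_te, y_te = rng.normal(size=(50, 50)), rng.integers(0, 10, 50)
print(knn_error_rate(lambda x: x, x_tr, y_tr, x_te, y_te, k=3))
```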
Table 3: The error rate of kNN (k = 3, 5, 10) with the latent representations of VAEs on MNIST (K = 50) and OMNIGLOT (K = 100), comparing GVAE, GVAE-Softmax, SBVAE, and DirVAE against raw data and the GVAE, SBVAE, and DLGMM results reported in Nalisnick et al. (2016).

4.4 Application Experiments of Topic Model Augmentation with DirVAE

One usefulness of the Dirichlet distribution is being a conjugate prior to the multinomial distribution, so it has been widely used in the field of topic modeling, such as
Latent Dirichlet Allocation (LDA) (Blei et al., 2003). Recently, several neural variational topic (or document) models have been suggested, for example, ProdLDA (Srivastava & Sutton, 2017), NVDM (Miao et al., 2016), and GSM (Miao et al., 2017). NVDM uses the GVAE, and GSM uses the GVAE-Softmax to produce sum-to-one positive topic vectors. Meanwhile, ProdLDA assumes the prior distribution to be the Dirichlet distribution with the softmax Laplace approximation. To verify the usefulness of the DirVAE, we replace the probabilistic encoder of each model with that of the DirVAE. Two popular performance measures in the topic model field, perplexity and topic coherence via normalized pointwise mutual information (NPMI) (Lau et al., 2014), have been used with the 20Newsgroups and RCV1-v2 datasets. Further details of the experiments can be found in Appendix D.5. Table 4 indicates that the augmentation of DirVAE improves the performance in general. Additionally, the best performers on the two measures are always the experiment cells with DirVAE augmentation, except for the perplexity of RCV1-v2, which still remains competitive.

Table 4: Topic modeling performances (perplexity and NPMI) on 20Newsgroups (K = 50) and RCV1-v2 (K = 100), comparing reported and reproduced ProdLDA, NVDM, GSM, and LDA (Gibbs) baselines with their SBVAE- and DirVAE-augmented counterparts.
5 Conclusion
Recent advances in VAEs have become one of the cornerstones in the field of DGMs. The VAEs infer the parameters of explicitly described latent variables, so the VAEs are easily included in the conventional PGMs. While this merit has motivated diverse cases of merging the VAEs into graphical models, we question the fundamental suitability of the GVAE in the many models whose latent values are categorical probabilities. The softmax function cannot reproduce the multi-modal distribution that the Dirichlet distribution can. Recognizing this problem, some previous works approximated the Dirichlet distribution in the VAE setting by utilizing the Weibull distribution or the softmax Gaussian distribution, but the DirVAE with the inverse Gamma CDF shows better learning performance in our representation experiments: the semi-supervised and the supervised classifications, and the topic models. Moreover, DirVAE shows no component collapsing, which leads to better latent representations and performance gains. The proposed DirVAE can be widely used, given the popularity of the conjugate relation between the multinomial and the Dirichlet distributions, because the proposed DirVAE can be a building block in the construction of complex probabilistic models with neural networks.

References
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 2003.

X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. International Conference on Artificial Intelligence and Statistics, 2010.

M. Hoffman and M. Johnson. ELBO surgery: yet another way to carve up the variational evidence lower bound. Neural Information Processing Systems Workshop on Advances in Approximate Bayesian Inference, 2016.

H. Ishwaran and L. F. James. Gibbs sampling methods for stick-breaking priors. Journal of the American Statistical Association, 2001.

E. Jang, S. Gu, and B. Poole. Categorical reparameterization with Gumbel-softmax. International Conference on Learning Representations, 2017.

Z. Jiang, Y. Zheng, H. Tan, B. Tang, and H. Zhou. Variational deep embedding: An unsupervised and generative approach to clustering. International Joint Conference on Artificial Intelligence, 2017.

D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014a.

D. P. Kingma and M. Welling. Auto-encoding variational Bayes. International Conference on Learning Representations, 2014c.

D. P. Kingma and M. Welling. Efficient gradient-based inference through transformations between Bayes nets and neural nets. International Conference on Machine Learning, 2014d.

D. P. Kingma, S. Mohamed, D. J. Rezende, and M. Welling. Semi-supervised learning with deep generative models. Neural Information Processing Systems, 2014b.

D. A. Knowles. Stochastic gradient variational Bayes for gamma approximating distributions. arXiv preprint arXiv:1509.01631, 2015.

B. M. Lake, R. R. Salakhutdinov, and J. Tenenbaum. One-shot learning by inverting a compositional causal process. Neural Information Processing Systems, 2013.

J. H. Lau, D. Newman, and T. Baldwin. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. European Chapter of the Association for Computational Linguistics, 2014.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the Institute of Electrical and Electronics Engineers, 1998.

L. V. D. Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 2008.

D. J. C. MacKay. Choice of basis for Laplace approximation. Machine Learning, 1998.

C. J. Maddison, A. Mnih, and Y. W. Teh. The concrete distribution: A continuous relaxation of discrete random variables. International Conference on Learning Representations, 2017.

Y. Miao, L. Yu, and P. Blunsom. Neural variational inference for text processing. International Conference on Machine Learning, 2016.

Y. Miao, E. Grefenstette, and P. Blunsom. Discovering discrete latent topics with neural variational inference. International Conference on Machine Learning, 2017.

T. Minka. Estimating a Dirichlet distribution. Technical report, M.I.T., 2000.

V. Nair and G. Hinton. Rectified linear units improve restricted Boltzmann machines. International Conference on Machine Learning, 2010.

E. Nalisnick and P. Smyth. Stick-breaking variational autoencoders. International Conference on Learning Representations, 2017.

E. Nalisnick, L. Hertel, and P. Smyth. Approximate inference for deep latent Gaussian mixtures. Neural Information Processing Systems Workshop on Bayesian Deep Learning, 2016.

J. Pitman. Combinatorial stochastic processes. Technical report, UC Berkeley, 2002.

D. J. Rezende and S. Mohamed. Variational inference with normalizing flows. International Conference on Machine Learning, 2015.

J. T. Rolfe. Discrete variational autoencoders. International Conference on Learning Representations, 2017.

A. Srivastava and C. Sutton. Autoencoding variational inference for topic models. International Conference on Learning Representations, 2017.

C. K. Sønderby, T. Raiko, L. Maaløe, S. K. Sønderby, and O. Winther. Ladder variational autoencoders. Neural Information Processing Systems, 2016.

H. Zhang, B. Chen, D. Guo, and M. Zhou. WHAI: Weibull hybrid autoencoding inference for deep topic modeling. International Conference on Learning Representations, 2018.

Appendix
This is the appendix for Dirichlet Variational Autoencoder. Here, we describe the derivations of key equations and the experimental setting details used in the body of the paper. Detailed information such as model names, parameter names, and experiment assumptions is based on the main paper.
A Monte Carlo Estimation of the Marginal Likelihood

Proposition A.1. The marginal likelihood is approximated as p(x) ≈ (1/S) Σ_{i=1}^{S} p(x|z_i) p(z_i) / q(z_i) with z_i ∼ q(z), where q(z) is the approximate posterior corresponding to the prior p(z).

Proof.

$$p(\mathbf{x}) = \int_{\mathbf{z}} p(\mathbf{x},\mathbf{z})\, d\mathbf{z} = \int_{\mathbf{z}} \frac{p(\mathbf{x},\mathbf{z})}{q(\mathbf{z})}\, q(\mathbf{z})\, d\mathbf{z} = \int_{\mathbf{z}} \frac{p(\mathbf{x}|\mathbf{z})\,p(\mathbf{z})}{q(\mathbf{z})}\, q(\mathbf{z})\, d\mathbf{z} \approx \frac{1}{S}\sum_{i=1}^{S} \frac{p(\mathbf{x}|\mathbf{z}_i)\,p(\mathbf{z}_i)}{q(\mathbf{z}_i)}, \quad \mathbf{z}_i \sim q(\mathbf{z}).$$
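A numerically stable version of this estimator is computed in log-space. The sketch below is an illustration with fabricated toy values, assuming the per-sample log-densities are available; it is not the authors' evaluation code.

```python
# A NumPy sketch of the importance-sampling estimate in Proposition A.1:
# log p(x) ~= logsumexp_i[ log p(x|z_i) + log p(z_i) - log q(z_i) ] - log S,
# with z_i drawn from the approximate posterior q(z|x).
import numpy as np
from scipy.special import logsumexp

def marginal_log_likelihood(log_px_given_z, log_pz, log_qz):
    """All inputs are length-S arrays of log-densities evaluated at samples z_i ~ q."""
    S = len(log_px_given_z)
    return logsumexp(log_px_given_z + log_pz - log_qz) - np.log(S)

# Toy example with S = 100 fabricated log-density values.
S = 100
rng = np.random.default_rng(0)
print(marginal_log_likelihood(rng.normal(-80, 5, S), rng.normal(-10, 1, S), rng.normal(-8, 1, S)))
```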
B KL Divergence of Two MultiGamma Distributions

Proposition B.1. Define X = (X_1, ..., X_K) ∼ MultiGamma(α, β·1_K) as a vector of K independent Gamma random variables X_k ∼ Gamma(α_k, β), where α_k, β > 0 for k = 1, ..., K. The KL divergence between two MultiGamma distributions P = MultiGamma(α, β·1_K) and Q = MultiGamma(α̂, β·1_K) is

$$\mathrm{KL}(Q \,\|\, P) = \sum_k \log\Gamma(\alpha_k) - \sum_k \log\Gamma(\hat{\alpha}_k) + \sum_k (\hat{\alpha}_k - \alpha_k)\,\psi(\hat{\alpha}_k), \quad (8)$$

where ψ is the digamma function.

Proof. Note that the derivative of the Gamma-like function Γ(α)β^{-α} can be written as

$$\frac{d}{d\alpha}\,\Gamma(\alpha)\beta^{-\alpha} = \beta^{-\alpha}\big(\Gamma'(\alpha) - \Gamma(\alpha)\log\beta\big) = \int_0^\infty x^{\alpha-1} e^{-\beta x} \log x \, dx,$$

so that E_{X ∼ Gamma(α, β)}[log X] = ψ(α) − log β. Then,

$$\begin{aligned}
\mathrm{KL}(Q \,\|\, P) &= \int q(\mathbf{x}) \log\frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \\
&= \mathbb{E}_Q\Big[\sum_k (\hat{\alpha}_k - \alpha_k)\log\beta + \sum_k \log\Gamma(\alpha_k) - \sum_k \log\Gamma(\hat{\alpha}_k) + \sum_k (\hat{\alpha}_k - \alpha_k)\log x_k\Big] \\
&= \sum_k (\hat{\alpha}_k - \alpha_k)\log\beta + \sum_k \log\Gamma(\alpha_k) - \sum_k \log\Gamma(\hat{\alpha}_k) + \sum_k (\hat{\alpha}_k - \alpha_k)\big(\psi(\hat{\alpha}_k) - \log\beta\big) \\
&= \sum_k \log\Gamma(\alpha_k) - \sum_k \log\Gamma(\hat{\alpha}_k) + \sum_k (\hat{\alpha}_k - \alpha_k)\,\psi(\hat{\alpha}_k).
\end{aligned}$$
C Hyper-parameter α Learning Strategy
In this section, we introduce the method of moments estimator (MME) to update the Dirichlet prior parameter α. Suppose we have a set of sum-to-one proportions D = {p_1, ..., p_N} sampled from Dirichlet(α); then the MME update rule is the following:

$$\alpha_k \leftarrow \frac{S}{N}\sum_n p_{n,k} \quad \text{where} \quad S = \frac{1}{K}\sum_k \frac{\tilde{\mu}_{1,k} - \tilde{\mu}_{2,k}}{\tilde{\mu}_{2,k} - \tilde{\mu}_{1,k}^2} \quad \text{for} \quad \tilde{\mu}_{j,k} = \frac{1}{N}\sum_n p_{n,k}^{\,j}. \quad (9)$$

After a burn-in period for stabilizing the neural network parameters, we use the MME for the hyper-parameter learning with the latent values sampled during training. We alternately update the neural network parameters and the hyper-parameter α. We choose this estimator because of its closed form and consistency (Minka, 2000). The usefulness of the hyper-parameter update can be found in Appendix D.2.

Proposition C.1. Given a proportion set D = {p_1, ..., p_N} sampled from Dirichlet(α), the MME of the hyper-parameter α is

$$\hat{\alpha}_k = \frac{S}{N}\sum_n p_{n,k} \quad \text{where} \quad S = \frac{1}{K}\sum_k \frac{\tilde{\mu}_{1,k} - \tilde{\mu}_{2,k}}{\tilde{\mu}_{2,k} - \tilde{\mu}_{1,k}^2} \quad \text{for} \quad \tilde{\mu}_{j,k} = \frac{1}{N}\sum_n p_{n,k}^{\,j}.$$

Proof. Define μ_{j,k} = E[p_k^j] as the j-th moment of the k-th dimension of the Dirichlet distribution with parameter α. By the law of large numbers, μ_{j,k} ≈ μ̃_{j,k}. It can easily be shown that

$$\mu_{1,k} = \frac{\alpha_k}{\sum_i \alpha_i}, \qquad \mu_{2,k} = \frac{\alpha_k}{\sum_i \alpha_i}\cdot\frac{\alpha_k + 1}{1 + \sum_i \alpha_i} = \mu_{1,k}\,\frac{\alpha_k + 1}{1 + \sum_i \alpha_i},$$

so that

$$\mu_{1,k} - \mu_{2,k} = \frac{\alpha_k \big(\sum_{i\neq k}\alpha_i\big)}{\big(\sum_i \alpha_i\big)\big(1 + \sum_i \alpha_i\big)}, \qquad \mu_{2,k} - \mu_{1,k}^2 = \frac{\alpha_k \big(\sum_{i\neq k}\alpha_i\big)}{\big(\sum_i \alpha_i\big)^2\big(1 + \sum_i \alpha_i\big)}$$

holds for each k = 1, ..., K. Therefore,

$$\sum_i \alpha_i = \frac{\mu_{1,k} - \mu_{2,k}}{\mu_{2,k} - \mu_{1,k}^2} \approx \frac{1}{K}\sum_k \frac{\mu_{1,k} - \mu_{2,k}}{\mu_{2,k} - \mu_{1,k}^2} \approx \frac{1}{K}\sum_k \frac{\tilde{\mu}_{1,k} - \tilde{\mu}_{2,k}}{\tilde{\mu}_{2,k} - \tilde{\mu}_{1,k}^2} = S,$$

and hence α̂_k = (Σ_i α_i) μ̃_{1,k} = (S/N) Σ_n p_{n,k}.
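The MME update of Equation (9) amounts to a few vectorized moment computations; the sketch below is an illustrative NumPy version with a synthetic sanity check, not the authors' training code.

```python
# A small NumPy sketch of the MME update of Eq. (9): estimate the Dirichlet
# hyper-parameter alpha from a batch of sum-to-one latent samples collected
# during training.
import numpy as np

def dirichlet_mme(P):
    """P: (N, K) array of simplex vectors sampled from Dirichlet(alpha). Returns alpha_hat."""
    mu1 = P.mean(axis=0)                           # first moments, per dimension
    mu2 = (P ** 2).mean(axis=0)                    # second moments, per dimension
    S = np.mean((mu1 - mu2) / (mu2 - mu1 ** 2))    # estimate of sum_i alpha_i
    return S * mu1                                 # alpha_k = (sum_i alpha_i) * mean_k

# Sanity check: recover a known alpha from many exact Dirichlet samples.
rng = np.random.default_rng(0)
alpha_true = np.array([0.5, 1.0, 2.0, 4.0])
print(dirichlet_mme(rng.dirichlet(alpha_true, size=20000)))  # close to alpha_true
```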
D Experimental Settings

In this section, we support Section 4 of the main paper with more detailed experimental settings. Our Tensorflow implementation is available at https://TO BE RELEASED.

D.1 Dataset Description
We use the following benchmark datasets for the experiments in the paper: 1) MNIST; 2) MNIST with rotations (MNIST+rot); 3) OMNIGLOT; and 4) SVHN with PCA transformation. MNIST (LeCun et al., 1998) is a hand-written digit image dataset of size 28 × 28 with 10 labels, consisting of 60,000 training and 10,000 testing instances. The MNIST+rot data, reproduced by the authors of Nalisnick & Smyth (2017), consists of MNIST and rotated MNIST. OMNIGLOT (Lake et al., 2013; Sønderby et al., 2016) is another hand-written character image dataset of size 28 × 28, consisting of 24,345 training and 8,070 testing instances. SVHN is a Street View House Numbers image dataset, preprocessed with dimensionality reduction by PCA (Nalisnick & Smyth, 2017).

D.2 Representation Learning of VAEs
We divided the datasets into {train : valid : test} splits, using 5,000 validation and 10,000 test instances for MNIST, and 2,250 validation and 8,070 test instances for OMNIGLOT, with the remaining instances used for training. (OMNIGLOT data: https://github.com/yburda/iwae/tree/master/datasets/OMNIGLOT; SVHN data: http://ufldl.stanford.edu/housenumbers/)

For MNIST, we use 50-dimension latent variables with two hidden layers in the encoder and one hidden layer in the decoder. We set α = 0.98·1 for the fair comparison to GVAEs by using Equation (5). For OMNIGLOT, we use 100-dimension latent variables with two hidden layers in the encoder and one hidden layer in the decoder. We assume α = 0.99·1 for the fair comparison to the GVAEs by using Equation (5).

For both datasets, gradient clipping is used; the ReLU function (Nair & Hinton, 2010) is used as the activation function in the hidden layers; Xavier initialization (Glorot & Bengio, 2010) is used for the neural network parameter initialization; and the Adam optimizer (Kingma & Ba, 2014a) is used, with the same learning rate for all VAEs except the SBVAEs. The prior assumptions for each VAE are the following: 1) N(0, I) for the GVAE and the GVAE-Softmax; 2) GEM(5) for the SBVAEs; and 3) Dirichlet(0.98·1) (MNIST) and Dirichlet(0.99·1) (OMNIGLOT) for the DirVAE-Weibull. Finally, to compute the marginal log-likelihood, we used Monte Carlo samples for randomly selected test instances.

We add the result of the VAE with 20 normalizing flows (GVAE-NF20) (Rezende & Mohamed, 2015) as a baseline in Table 5. Also, the latent dimension-wise decoder weight norms and the t-SNE visualization of the latent embeddings of MNIST are given in Figures 5a and 5b, which correspond to Figures 4a and 3, respectively.

Additionally, DirVAE-Learning uses the same α as its initial value, but optimizes the hyper-parameter α through the learning iterations using the MME method of Appendix C in the following stages: 1) a burn-in period for stabilizing the neural network parameters; 2) an alternating update period for the neural network parameters and α; and 3) an update period for the neural network parameters with the fixed learned hyper-parameter α. Table 5 shows that DirVAE-Learning improves the marginal log-likelihood, ELBO, and reconstruction loss on both datasets. We also show the learned hyper-parameter α in Figure 6.

Table 5: Negative log-likelihood, negative ELBO, and reconstruction loss of the VAEs for the MNIST (K = 50) and OMNIGLOT (K = 100) datasets, extending Table 1 with GVAE-NF20 and DirVAE-Learning.

Figure 5: Decoder weight collapsing and t-SNE latent embedding visualization of GVAE-NF20 on MNIST. (a) Latent dimension-wise norm of the decoder weights of GVAE-NF20. (b) GVAE-NF20 t-SNE visualization.

D.3 Semi-supervised Classification Task with VAEs
The overall model structure for this semi-supervised classification task uses a VAE with separate random variables z and y, which is introduced as the M2 model in the original VAE work (Kingma et al., 2014b). However, the same task with the SBVAE uses a modified model that ignores the relation between the class label variable y and the latent variable z, while they still share the same parent node: q_φ(z, y|x) = q_φ(z|x) q_φ(y|x), where q_φ(y|x) is a discriminative network for the unseen labels. We follow the structure of the SBVAE. The objective functions to optimize for the labeled and the unlabeled instances of the semi-supervised classification task are, respectively:

$$\log p(\mathbf{x}, y) \ge \mathcal{L}_{\mathrm{labeled}}(\mathbf{x}, y) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x}|\mathbf{z}, y)] - \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big) + \log q_\phi(y|\mathbf{x}), \quad (10)$$

$$\log p(\mathbf{x}) \ge \mathcal{L}_{\mathrm{unlabeled}}(\mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}, y|\mathbf{x})}\big[\log p_\theta(\mathbf{x}|\mathbf{z}, y) + \mathcal{H}(q_\phi(y|\mathbf{x}))\big] - \mathrm{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}) \,\|\, p_\theta(\mathbf{z})\big). \quad (11)$$

In the above, H is an entropy function. The actual training for the semi-supervised learning optimizes the weighted sum of Equations (10) and (11) with a ratio hyper-parameter 0 < λ < 1.

The datasets are divided into {train : valid : test} splits, using 5,000 validation and 10,000 test instances for MNIST, 10,000 validation and 20,000 test instances for MNIST+rot, and 8,257 validation and 26,032 test instances for SVHN, with the remaining instances used for training. For SVHN, dimension reduction by PCA is applied as preprocessing. Fundamentally, we applied the same experimental settings to GVAE, SBVAE, and DirVAE in this experiment, as specified by the authors in Nalisnick & Smyth (2017). Specifically, the three VAEs use the same network structures of 1) a single hidden layer for MNIST; and 2) four hidden layers for MNIST+rot and SVHN, with a residual network for the last three hidden layers. The latent variables have 50 dimensions for all settings. The ratio parameter λ is set separately for the MNIST variants and for SVHN. The ReLU function is used as the activation function in the hidden layers, the neural network parameters are initialized by sampling from a zero-mean Gaussian, and the Adam optimizer is used. Finally, the DirVAE sets α = 0.98·1 by using Equation (5). (See the SBVAE implementation at https://github.com/enalisnick/stick-breaking_dgms and its supplementary material, ∼enalisni/sb_dgm_supp_mat.pdf.)

Figure 6: Learned hyper-parameter α values from DirVAE-Learning with MNIST.

D.4 Supervised Classification Task with Latent Values of VAEs
For the supervised classification task on the latent representations of the VAEs, we used exactly the same experimental settings as in D.2. Since DLGMM is basically a Gaussian mixture model with the SBVAE, DLGMM is a more complex model than the VAE alternatives; we only report the authors' results from Nalisnick et al. (2016) for comparison purposes. Additionally, we omit the comparison with VaDE (Jiang et al., 2017) because VaDE is customized to be a clustering model rather than an ordinary VAE like the baselines we choose.

Table 6: The error rate of kNN (k = 3, 5, 10) with the latent representations of VAEs on MNIST (K = 50) and OMNIGLOT (K = 100), extending Table 3 with GVAE-NF20.
D.5 Topic Model Augmentation with DirVAE

For the topic model augmentation experiment, two popular performance measures in the topic model field, perplexity and topic coherence via normalized pointwise mutual information (NPMI) (Lau et al., 2014), have been used with the 20Newsgroups and RCV1-v2 datasets. 20Newsgroups comes with predefined train and test splits and a fixed vocabulary. For the RCV1-v2 dataset, due to the massive size of the whole data, we randomly sampled train and test subsets with a fixed vocabulary. Lower is better for the perplexity, and higher is better for the NPMI. (20Newsgroups preprocessing: https://github.com/akashgit/autoencoding_vi_for_topic_models; RCV1-v2: http://scikit-learn.org/stable/datasets/rcv1.html)

The specific model structures can be found in the original papers, Srivastava & Sutton (2017); Miao et al. (2016; 2017). We replace the prior of each model with that of the DirVAE and search the hyper-parameter as in Table 7 with randomly selected test data. We use 50 topics for 20Newsgroups and 100 topics for RCV1-v2.

Table 7: Searched Dirichlet hyper-parameter values of the DirVAE augmentation for ProdLDA, NVDM, and GSM on 20Newsgroups (K = 50) and RCV1-v2 (K = 100).

Table 8 shows the top-10 high-probability words per topic, obtained by activating single latent dimensions, in the case of 20Newsgroups. Also, we visualize the latent embeddings of documents by t-SNE in Figures 7, 8, and 9.

Table 8: Samples of learned per-topic top-10 high-probability words from 20Newsgroups with DirVAE augmentation, obtained by activating single latent dimensions.

(a) DirVAE augmentation to ProdLDA (ProdLDA + DirVAE):
Topic: turks turkish armenian genocide village armenia armenians muslims turkey greece
Topic: doctrine jesus god faith christ scripture belief eternal holy bible
Topic: season defensive puck playoff coach score flyers nhl team ice
Topic: pitcher braves hitter coach pen defensive injury roger pitch player
Topic: ide scsi scsus controller motherboard isa cache mb floppy ram
Topic: toolkit widget workstation xlib jpeg xt vendor colormap interface pixel
Topic: spacecraft satellite solar shuttle nasa mission professor lunar orbit rocket
Topic: knife handgun assault homicide batf criminal gun firearm police apartment
Topic: enforcement privacy encrypt encryption ripem wiretap rsa cipher cryptography escrow
Topic: min detroit tor det calgary rangers leafs montreal philadelphia cal

(b) DirVAE augmentation to NVDM (NVDM + DirVAE):
Topic: armenian azerbaijan armenia genocide armenians turkish militia massacre village turks
Topic: arab arabs israeli palestinian jews soldier turks nazi massacre jew
Topic: resurrection bible christianity doctrine scripture eternal belief christian faith jesus
Topic: hitter season braves pitcher baseball pitch game player defensive team
Topic: directory file compile variable update ftp version site copy host
Topic: performance speed faster mhz rate clock processor average twice fast
Topic: windows microsoft driver dos nt graphic vga card virtual upgrade
Topic: seat gear rear tire honda oil front mile wheel engine
Topic: patient disease doctor treatment symptom medical health hospital pain medicine
Topic: pt la det tor pit pp vs van cal nj

(c) DirVAE augmentation to GSM (GSM + DirVAE):
Topic: turkish armenian armenians people one turkey armenia turks greek history
Topic: israel israeli jews attack world jewish article arab peace land
Topic: god jesus christian religion truth believe bible church christ belief
Topic: team play game hockey nhl score first division go win
Topic: drive video mac card port pc system modem memory speed
Topic: image software file version server program system ftp package support
Topic: space launch orbit earth nasa moon satellite mission project center
Topic: law state gun government right rights case court police crime
Topic: price sell new sale offer pay buy good condition money
Topic: internet mail computer send list fax phone email address information

Figure 7: 20Newsgroups latent document embedding visualization with t-SNE after replacing the model prior with the Dirichlet. (Left) ProdLDA + DirVAE, (Middle) NVDM + DirVAE, (Right) GSM + DirVAE.

Figure 8: 20Newsgroups latent document embedding visualization with t-SNE after replacing the model prior with the Stick-Breaking prior. (Left) ProdLDA + SBVAE, (Middle) NVDM + SBVAE, (Right) GSM + SBVAE.

Figure 9: 20Newsgroups latent document embedding visualization with t-SNE.