MHVAE: a Human-Inspired Deep Hierarchical Generative Model for Multimodal Representation Learning
Miguel Vasco
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa
[email protected]

Francisco S. Melo
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa

Ana Paiva
INESC-ID & Instituto Superior Técnico, Universidade de Lisboa

June 5, 2020

Abstract
Humans are able to create rich representations of their external reality. Their internal representations allow for cross-modality inference, where available perceptions can induce the perceptual experience of missing input modalities. In this paper, we contribute the Multimodal Hierarchical Variational Auto-encoder (MHVAE), a hierarchical multimodal generative model for representation learning. Inspired by human cognitive models, the MHVAE is able to learn modality-specific distributions, of an arbitrary number of modalities, and a joint-modality distribution, responsible for cross-modality inference. We formally derive the model's evidence lower bound and propose a novel methodology to approximate the joint-modality posterior based on modality-specific representation dropout. We evaluate the MHVAE on standard multimodal datasets. Our model performs on par with other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.

Keywords: Representation Learning · Deep Learning
Humans are provided with a remarkable cognitive framework which allows them to create a rich representation of their external reality. This framework contains the tools to learn novel representations of their environment and to recognize previously learned representations, which are stored in memory [1, 2]. The information provided by the environment is of a multimodal nature, captured and processed by the different sensory input channels (senses) humans possess. Yet, information is often incomplete, be it due to some modality not being provided by the environment or due to human sensory malfunction. To overcome such events, the human cognitive framework also allows for cross-modality inference, a process in which an available input modality can induce perceptual experiences of the missing modalities [3, 4, 5]. Figure 1 illustrates how cross-modality inference is essential for humans to act upon their environment in scenarios of incomplete perceptual observations.

Artificial agents, on the other hand, struggle to obtain rich representations of their environment. For example, in spite of being endowed with multiple sensors, robots often disregard the multimodal nature of environmental information and learn internal representations from a single perceptual modality, often vision [6, 7]. However, such disregard leads to the agent's inability to understand and act upon its environment when that modality-specific information is unavailable or in the (frequent) case of sensory malfunction. If we aim at having artificial agents, such as service robots or autonomous vehicles, acting reliably in their environments, they must be provided with mechanisms to overcome potential perceptual issues. Rich joint-modality representations can play a fundamental role in robust policy transfer across different input modalities of artificial agents [8].

Inspired by the human cognitive framework, we contribute a novel model capable of learning rich multimodal representations and performing cross-modality inference.
Figure 1: An example of the importance of multimodal representation learning for human tasks: in the absence of light, humans can navigate their environment by employing perceptual information from other modalities (such as sound) to generate the absent visual perceptual experience. Following human cognitive models, in this work we contribute the MHVAE, a novel multimodal hierarchical variational autoencoder able to perform cross-modality inference.

Multimodal generative models have shown great promise in doing so by learning a single joint distribution of multiple modalities [9, 10, 11, 12]. This single representation space has to encode information to account for the complete generation process of all modalities, often of different complexities. As such, for each input modality, the representation capability of this single joint-representation space must pale in comparison with that of an individual modality-specific space. Indeed, according to the Convergence-Divergence Zone (CDZ) cognitive model [2], humans process perceptual information not in a single representation space but in a hierarchical structure: sensory data is processed at lower levels of the model, generating modality-specific representations; and divergent information from these representations is merged at higher levels of the model, generating multimodal representations [1, 13]. The architecture of the CDZ model is presented in Figure 2(a).

Inspired by the CDZ architecture, we propose the MHVAE, a novel generative model that learns multimodal representations in an unsupervised way. The MHVAE is a multimodal hierarchical Variational Autoencoder (VAE) that learns modality-specific distributions, of an arbitrary number of modalities, and a joint-modality distribution, allowing for cross-modality inference. Moreover, we formally derive the model's evidence lower bound (ELBO) and, based on modality-specific representation dropout, we propose a novel methodology to approximate the joint-modality posterior. This approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model's training, with minimal computational cost.

We evaluate the potential of the MHVAE as a multimodal generative model on standard multimodal datasets. We show that the MHVAE outperforms other state-of-the-art multimodal generative models on modality-specific reconstruction and cross-modality inference.

In summary, the main contributions of this paper are:

• We propose a novel multimodal hierarchical VAE, inspired by the CDZ-based human neural architecture [2]. The model learns modality-specific distributions and a joint distribution of all modalities, allowing for cross-modality inference in the presence of incomplete perceptual information. We formally derive the model's evidence lower bound.

• We propose a new methodology for approximating the joint-modality posterior, based on modality-specific representation dropout. This approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model's training, with minimal computational cost.

• We evaluate the model on standard multimodal datasets and show that the MHVAE performs on par with other state-of-the-art multimodal generative models on modality-specific reconstruction and cross-modality inference.
Figure 2: Schematic of the CDZ and MHVAE models, where grey rectangles indicate observed variables: (a) the inference and generative models of the CDZ model; (b) the MHVAE model, in which a high-level core latent variable z_c generates the modality-specific latent distributions Z_m = {z_{m_1}, ..., z_{m_N}}, responsible for sampling the modality data X = {x_1, ..., x_N}. This generative process is represented by the orange segments. The blue segments represent the inference model of the MHVAE, where modality observations X = {x_1, ..., x_N} are encoded both in the modality-specific latent distributions Z_m = {z_{m_1}, ..., z_{m_N}} and in the core latent distribution z_c, considering the hidden modality-specific representations h = {h_1, ..., h_N}.

Deep generative models have shown great promise in learning generalized representations of data. For single-modality data, the VAE is widely used. It learns a joint distribution p_θ(x, z) of data x, which is generated by a latent variable z. This latent variable is often of lower dimensionality than the modality itself and acts as the representation vector in which data is encoded. The joint distribution takes the form p_θ(x, z) = p_θ(x | z) p(z), where p(z) (the prior distribution) is often a unitary Gaussian (z ∼ N(0, I)). The generative distribution p_θ(x | z), parameterized by θ, is usually composed of a simple likelihood term (e.g., Bernoulli or Gaussian).

The training procedure of the VAE involves the maximization of the evidence likelihood p(x), obtained by marginalizing over the latent variable:

p(x) = ∫_z p_θ(x, z) = ∫_z p_θ(x | z) p(z).

However, the above likelihood is intractable. As such, we resort to an inference network q_φ(z | x) for its estimation:

p(x) = ∫_z p_θ(x | z) p(z) q_φ(z | x) / q_φ(z | x).

Applying the logarithm and Jensen's inequality, we obtain a lower bound on the log-likelihood of the evidence (ELBO), i.e., log p(x) ≥ L(x), where

L(x) = E_{q_φ(z | x)}[log p_θ(x | z)] − KL[q_φ(z | x) || p(z)],

where the Kullback-Leibler divergence term, KL[q_φ(z | x) || p(z)], promotes a balance between the latent variable's capacity and the encoding process of data. During training, this balance can be adjusted through the introduction of a hyper-parameter β,

L(x) = E_{q_φ(z | x)}[log p_θ(x | z)] − β KL[q_φ(z | x) || p(z)],

where we recover the original VAE formulation when taking β = 1. The optimization of the ELBO is done using gradient-based methods, applying a re-parametrization technique [14].

3.1 MHVAE

We now introduce the MHVAE model, which extends the single-modality nature of VAEs to the multimodal hierarchical setting. In the multimodal setting, we consider a set of N modalities X = {x_1, x_2, ..., x_N} = x_{1:N}, generated according to some environment-dependent process p_θ(x_{1:N}), parameterized by θ. We model the generation process of information in a hierarchical fashion: each modality is generated by a corresponding modality-specific latent variable in the set Z_m = {z_{m_1}, ..., z_{m_N}}, conditionally independent given a core latent variable z_c. The main goal of the MHVAE is to simultaneously learn single-modality latent spaces, for reconstructing modality-specific data, and a joint distribution of modalities, encoded in a core latent distribution, allowing cross-modality inference. The architecture of the proposed model is presented in Figure 2(b).
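To make the hierarchical generative factorization concrete, the following is a minimal PyTorch-style sketch of ancestral sampling from the model, p(z_c) ∏_i p_θ(z_{m_i} | z_c) p_θ(x_i | z_{m_i}). The network modules and dimensionalities here are illustrative placeholders, not the exact architectures used in the experiments.

```python
import torch

def sample_from_mhvae(core_to_modality_nets, decoders, z_c_dim, n_samples=1):
    """Ancestral sampling sketch: z_c -> z_{m_i} -> x_i for every modality i.

    core_to_modality_nets[i] maps z_c to the (mean, log-variance) of p(z_{m_i} | z_c);
    decoders[i] maps z_{m_i} to the parameters of p(x_i | z_{m_i}).
    Both lists are hypothetical stand-ins for the model's generative networks.
    """
    # Core prior: z_c ~ N(0, I).
    z_c = torch.randn(n_samples, z_c_dim)
    samples = []
    for prior_net, decoder in zip(core_to_modality_nets, decoders):
        # Modality-specific prior conditioned on the core variable: p(z_{m_i} | z_c).
        mu, logvar = prior_net(z_c).chunk(2, dim=-1)
        z_m = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        # Modality likelihood parameters, e.g. Bernoulli means for image pixels.
        samples.append(torch.sigmoid(decoder(z_m)))
    return samples
```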
In order to train the model, we aim at maximizing the likelihood of the generative process, p_θ(x_{1:N}), by marginalizing over the modality-specific and core latent variables,

p_θ(X) = ∫_{z_c} ∫_{z_{m_{1:N}}} p(x_{1:N}, z_{m_{1:N}}, z_c).   (1)

Given the hierarchical nature of the model and the conditional independence of the modality-specific latent variables given the core latent variable, we can decompose the joint-modality probability as

p_θ(X) = ∫_{z_c} ∫_{z_{m_{1:N}}} p(z_c) ∏_{i=1}^{N} p_θ(x_i | z_{m_i}) p_θ(z_{m_i} | z_c).   (2)

However, since the marginal likelihood of each modality is intractable, we estimate its posterior resorting to an inference model q_φ(z_{m_{1:N}}, z_c | x_{1:N}), parameterized by φ, as shown in Figure 2(b), in which modality information is encoded simultaneously into the modality-specific latent spaces and into the core latent space, yielding

q_φ(z_{m_{1:N}}, z_c | x_{1:N}) = q_φ(z_c | x_{1:N}) ∏_{i=1}^{N} q_φ(z_{m_i} | x_i).   (3)

Introducing the inference model in the decomposed joint probability and rewriting the likelihood of the evidence as an expectation over the latent variables, we obtain

p_θ(x_{1:N}) = E_{q_φ(z_{m_{1:N}} | x_{1:N}) q_φ(z_c | x_{1:N})} [ (p(z_c) / q_φ(z_c | x_{1:N})) ∏_{i=1}^{N} (p_θ(x_i | z_{m_i}) p_θ(z_{m_i} | z_c) / q_φ(z_{m_i} | x_i)) ].   (4)

Taking the logarithm and applying Jensen's inequality [15], we obtain a lower bound on the log-likelihood of the evidence, log p_θ(x_{1:N}) ≥ L(x_{1:N}), as

L(x_{1:N}) = E_{q_φ(z_{m_{1:N}} | x_{1:N}) q_φ(z_c | x_{1:N})} log [ (p(z_c) / q_φ(z_c | x_{1:N})) ∏_{i=1}^{N} (p_θ(x_i | z_{m_i}) p_θ(z_{m_i} | z_c) / q_φ(z_{m_i} | x_i)) ].   (5)

The lower bound L(x_{1:N}) can be seen as containing three distinct groups. The first group, similar to the original VAE formulation, corresponds to the reconstruction loss of the input x_{1:N}, generated by the modality-specific latent variables z_{m_{1:N}}. For the i-th modality, this is given by

∫_{z_{m_{1:N}}} ∫_{z_c} log p_θ(x_i | z_{m_i}) · q_φ(z_{m_i} | x_i) q_φ(z_c | x_{1:N}) ∏_{j ≠ i} q_φ(z_{m_j} | x_j) = E_{q_φ(z_{m_i} | x_i)}[log p_θ(x_i | z_{m_i})].   (6)

The second component parallels the encoding capacity constraint on the latent variable in the VAE formulation, now considering the multimodal core latent variable z_c. This constraint penalizes encoding distributions q_φ(z_c | x_{1:N}) that deviate from the prior p(z_c) and is given by

∫_{z_{m_{1:N}}} ∫_{z_c} log (p(z_c) / q_φ(z_c | x_{1:N})) · q_φ(z_c | x_{1:N}) ∏_{i=1}^{N} q_φ(z_{m_i} | x_i) = − KL[q_φ(z_c | x_{1:N}) || p(z_c)].   (7)

Finally, the third term associates the distribution generated by the single-modality encoders, q_φ(z_{m_i} | x_i), with the distribution generated from the multimodal core latent space, p_θ(z_{m_i} | z_c):

∫_{z_{m_{1:N}}} ∫_{z_c} log (p_θ(z_{m_i} | z_c) / q_φ(z_{m_i} | x_i)) · q_φ(z_{m_i} | x_i) q_φ(z_c | x_{1:N}) ∏_{j ≠ i} q_φ(z_{m_j} | x_j) = − E_{q_φ(z_c | x_{1:N})}[ KL[q_φ(z_{m_i} | x_i) || p_θ(z_{m_i} | z_c)] ].   (8)

Taking into consideration the previous components, we can write the evidence lower bound of the MHVAE model as

L(x_{1:N}) = ∑_{i=1}^{N} λ_i E_{q_φ(z_{m_i} | x_i)}[log p_θ(x_i | z_{m_i})] − ∑_{i=1}^{N} β_{m_i} E_{q_φ(z_c | x_{1:N})}[ KL[q_φ(z_{m_i} | x_i) || p_θ(z_{m_i} | z_c)] ] − β_c KL[q_φ(z_c | x_{1:N}) || p(z_c)],   (9)

where we introduce a weight factor λ_i for each modality-specific reconstruction loss and a weight β_{m_i} for each divergence term, in addition to a core capacity weight β_c.
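As an illustration of how the bound in Eq. (9) can be computed in practice, the following PyTorch-style sketch evaluates its three terms for diagonal-Gaussian modality-specific and core posteriors. The encoder/decoder modules, likelihood choices and weights are placeholders and not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    # KL[N(mu_q, var_q) || N(mu_p, var_p)], summed over the latent dimensions.
    return 0.5 * torch.sum(
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0, dim=-1)

def mhvae_elbo(x_list, modality_encoders, modality_decoders, core_encoder,
               core_priors, lambdas, betas_m, beta_c):
    """Monte Carlo estimate of Eq. (9) for one batch (sketch; all modules are hypothetical)."""
    recon, kl_m = 0.0, 0.0
    hidden, mus, logvars = [], [], []
    # Modality-specific posteriors q(z_{m_i} | x_i) and reconstruction terms.
    for x, enc, dec, lam in zip(x_list, modality_encoders, modality_decoders, lambdas):
        h, mu, logvar = enc(x)  # hidden representation and posterior parameters
        z_m = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = recon + lam * -F.binary_cross_entropy(dec(z_m), x, reduction='none').flatten(1).sum(-1)
        hidden.append(h)
        mus.append(mu)
        logvars.append(logvar)
    # Core posterior q(z_c | x_{1:N}) from the concatenated hidden representations.
    mu_c, logvar_c = core_encoder(torch.cat(hidden, dim=-1))
    z_c = mu_c + torch.exp(0.5 * logvar_c) * torch.randn_like(mu_c)
    # KL between each modality posterior and the core-conditioned prior p(z_{m_i} | z_c).
    for prior, mu, logvar, beta in zip(core_priors, mus, logvars, betas_m):
        mu_p, logvar_p = prior(z_c)
        kl_m = kl_m + beta * gaussian_kl(mu, logvar, mu_p, logvar_p)
    # KL between the core posterior and the standard Gaussian prior p(z_c) = N(0, I).
    kl_c = gaussian_kl(mu_c, logvar_c, torch.zeros_like(mu_c), torch.zeros_like(logvar_c))
    return (recon - kl_m - beta_c * kl_c).mean()
```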
We now turn to the methodology used to approximate the joint-modality posterior distribution. In the case of the MHVAE, we wish to encode information from the modality-specific data x_{1:N} into the multimodal core latent variable z_c. One approach to do so, the product-of-experts (POE), approximates the joint posterior with a product of Gaussian experts, including a prior expert [12]. However, this solution is computationally intensive (as it requires artificial sub-sampling of the observations during training) and suffers from overconfident expert predictions, resulting in sub-par cross-modality inference performance [9].

Figure 3: Diagram of the proposed Modality Representation Dropout (MRD) procedure for training the MHVAE: (a) after encoding the modality observations x_{1:N}, we compute the modality-specific hidden representations h = {h_1, ..., h_N}, sampling the dropout mask d ∼ Bern(w_{1:N}) in order to zero out the selected modality representations; (b) after the procedure, we concatenate the hidden representations and encode the multimodal latent variable z_c.

We propose a novel methodology to approximate the joint-modality posterior based on the dropout of modality-specific representations, as shown in Figure 3. We introduce a modality-data dropout mask d, with dimensionality |d| = N, such that

h_d = d ⊙ h,   (10)

where h = {h_1, ..., h_N} corresponds to the list of hidden-layer representations computed by the modality-data encoders, as seen in Figure 2(b). We effectively zero out the selected components by considering that

h_i = 0, if d_i = 1.   (11)

During training, for each datapoint, we sample d from a Bernoulli distribution, d ∼ Bern(w_1, ..., w_N), with

∑_{i=1}^{N} d_i ≤ N − 1,   (12)

where the hyper-parameters w_{1:N} control the dropout probability of each modality representation. Moreover, we condition the mask sampling procedure so that at least a single modality representation always remains non-zero. As such, for each sample, we concatenate the resulting representations to be used as input to the multimodal encoder. Accounting for latent modality dropout, the modified ELBO of the MHVAE becomes

L(X) = ∑_{i=1}^{N} λ_i E_{q_φ(z_{m_i} | x_i)}[log p_θ(x_i | z_{m_i})] − ∑_{i=1}^{N} β_{m_i} E_{q_φ(z_c | h_d)}[ KL[q_φ(z_{m_i} | x_i) || p_θ(z_{m_i} | z_c)] ] − β_c KL[q_φ(z_c | h_d) || p(z_c)].   (13)
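A minimal sketch of the modality representation dropout step follows, written with per-modality Bernoulli keep probabilities and resampling whenever every modality would be dropped; the exact masking convention and conditioning procedure used by the authors may differ.

```python
import torch

def modality_representation_dropout(hidden, keep_probs):
    """Zero out a random subset of modality representations, keeping at least one.

    hidden: list of N tensors h_i of shape (batch, hidden_dim_i).
    keep_probs: list of N Bernoulli parameters (the per-modality keep probability
    in this sketch).
    Returns the concatenated masked representations fed to the core encoder.
    """
    batch = hidden[0].shape[0]
    w = torch.tensor(keep_probs).expand(batch, -1)  # (batch, N)
    mask = torch.bernoulli(w)
    # Resample rows where all modalities were dropped, so at least one survives.
    empty = mask.sum(dim=1) == 0
    while empty.any():
        mask[empty] = torch.bernoulli(w[empty])
        empty = mask.sum(dim=1) == 0
    masked = [h * mask[:, i:i + 1] for i, h in enumerate(hidden)]
    return torch.cat(masked, dim=-1)
```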
In this section, we evaluate the MHVAE's performance as a multimodal generative model on standard multimodal datasets. Our model outperforms other state-of-the-art generative models regarding joint-modality reconstruction from arbitrary input modalities and cross-modality inference.

As in previous literature, we transform single-modality datasets into bimodal datasets by considering the label associated to each image as a modality in its own right. We also compare the MHVAE to existing multimodal generative models: JMVAE-kl [10] and MVAE [12]. For the JMVAE-kl model we consider a fixed value of the α hyper-parameter. For the MVAE model, trained using the publicly available official implementation (https://github.com/mhw32/multimodal-vae-public), we employ the authors' suggested training hyper-parameters.

We evaluate our model on literature-standard datasets: MNIST [16], FashionMNIST [17], and CelebA [18]. We report state-of-the-art performance on the first two datasets regarding generative modelling and cross-modality capabilities. We train the MHVAE with no hyper-parameter tuning, i.e., λ_i = β_{m_i} = β_c = 1, ∀ i ∈ [1, N]. Moreover, we fix the dropout hyper-parameters w_{1:N} to the same value for all modalities. For the MHVAE model, as shown in Figure 2(b), we consider two different types of networks: the modality network, responsible for encoding the input data into the modality-specific latent space z_m and the associated hidden representation h, and for the inverse generative process; and the core network, responsible for encoding the multimodal core latent variable from the representation h_d, from which we generate the modality-specific latent spaces z_m. For fairness, on each dataset, we keep the network architectures consistent across models: the generative and inference networks of the baseline models share their architecture with the modality-specific networks of the MHVAE.

Moreover, we also consider a warm-up period on the regularization terms of the ELBO [19]: we linearly increase the value of the prior regularization term on the modality-specific latent variables for U_m epochs, and we linearly increase the value of the Gaussian prior regularization on the core latent space for U_c epochs. For the baselines, we consider a single warm-up period on the prior regularization of the latent space, U_b.

We evaluate the reconstruction capabilities and cross-modality inference performance of the models. To do so, we estimate the image marginal log-likelihood, log p(x_1), the joint log-likelihood, log p(x_1, x_2), and the conditional log-likelihood, log p(x_1 | x_2), of the observations through importance sampling (for CelebA, we consider 500 importance samples). The evaluation metrics are derived in the appendix.
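As a reference for how such importance-sampling estimates can be computed, the following is a minimal sketch of the standard importance-weighted estimator log p(x) ≈ log (1/K) ∑_k p_θ(x | z_k) p(z_k) / q_φ(z_k | x) for a single Gaussian latent variable; the exact estimators for the joint and conditional likelihoods are derived in the paper's appendix and may differ in detail.

```python
import torch

def importance_sampled_log_likelihood(x, encoder, decoder, num_samples=500):
    """Estimate log p(x) with K importance samples from q(z | x) (sketch).

    encoder(x) is assumed to return the mean and log-variance of a Gaussian q(z | x);
    decoder(z) is assumed to return Bernoulli means for p(x | z).
    """
    mu, logvar = encoder(x)                                   # (batch, z_dim)
    std = torch.exp(0.5 * logvar)
    # Draw K samples per datapoint: (K, batch, z_dim).
    z = mu + std * torch.randn(num_samples, *mu.shape)
    # log p(x | z): Bernoulli log-likelihood, summed over pixels.
    probs = decoder(z).clamp(1e-6, 1 - 1e-6)
    log_px_z = (x * probs.log() + (1 - x) * (1 - probs).log()).flatten(2).sum(-1)
    # log p(z) and log q(z | x) for diagonal Gaussians.
    log_pz = torch.distributions.Normal(0.0, 1.0).log_prob(z).sum(-1)
    log_qz_x = torch.distributions.Normal(mu, std).log_prob(z).sum(-1)
    # log (1/K) sum_k exp(log p(x|z_k) + log p(z_k) - log q(z_k|x)).
    log_w = log_px_z + log_pz - log_qz_x                      # (K, batch)
    return torch.logsumexp(log_w, dim=0) - torch.log(torch.tensor(float(num_samples)))
```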
Table 1: Log-likelihood values of the proposed evaluation metrics on the MNIST dataset for the MHVAE and other multimodal generative models. We estimate the latent variables considering as input the image (I), label (L) or joint (I, L) modalities, resorting to importance sampling. Due to numerical instabilities, we were unable to train or evaluate the MVAE baseline model.

Metric             Input   JMVAE      MVAE   MHVAE
log p(x_1)         I       -90.189    -      -89.050
log p(x_1, x_2)    I       -90.241    -      -89.183
log p(x_1, x_2)    L       -125.381   -      -121.401
log p(x_1, x_2)    I,L     -90.335    -      -89.143
log p(x_1 | x_2)   L       -123.070   -      -118.856

Figure 4: Image samples generated by the MHVAE model trained on standard multimodal datasets: (a), (b), (c) show images conditionally generated from sampling z_c ∼ q_φ(· | x_2), with (a) x_2 = 3, (b) x_2 = {Sneaker}, and (c) x_2 = {Male, Smiling}; (d), (e), (f) show images generated by sampling from the prior z_c ∼ N(0, I).

For the MNIST dataset, we train all models on images x_1 ∈ R^{28×28} and labels x_2 ∈ {0, 1}^{10}. We consider a dataset division of 85% for training, of which a fraction is held out for validation purposes, and the remaining 15% for evaluation.

We compose the image modality network of the MHVAE model of three linear layers with 512 hidden units, with leaky rectifiers as the activation function and batch normalization applied between hidden layers. Furthermore, we consider a 16-dimensional image-specific latent space. The label modality network is similarly composed of three linear layers with 128 hidden units, considering a 16-dimensional label-specific latent space. The core network is composed of three linear layers with 64 hidden units, considering a 10-dimensional latent space. For the baselines, we consider a single 26-dimensional latent space.

We consider p(x_1 | z) as a Bernoulli-distributed likelihood and p(x_2 | z) as a multinomial likelihood. Moreover, for the MHVAE model, we consider warm-up periods of U_m = 100 and U_c = 200 epochs, and for the baselines U_b = 200 epochs.
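The MNIST image-modality inference network described above might look as follows in PyTorch; this is a sketch consistent with the description (512-unit linear layers, leaky rectifiers, batch normalization, 16-dimensional latent space), not the authors' released code.

```python
import torch.nn as nn

class ImageModalityEncoder(nn.Module):
    """Sketch of the MNIST image-modality inference network (hidden size 512, z_m dim 16)."""
    def __init__(self, input_dim=28 * 28, hidden_dim=512, latent_dim=16):
        super().__init__()
        self.features = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.BatchNorm1d(hidden_dim), nn.LeakyReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.LeakyReLU(),
        )
        self.mu = nn.Linear(hidden_dim, latent_dim)       # mean of q(z_m | x)
        self.logvar = nn.Linear(hidden_dim, latent_dim)   # log-variance of q(z_m | x)

    def forward(self, x):
        h = self.features(x.flatten(1))
        return h, self.mu(h), self.logvar(h)              # h also feeds the core encoder
```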
We train all models for 500 epochs, considering a fixed learning rate and a batch size of b = 64. The estimates of the test log-likelihoods for all models are presented in Table 1. We report that the MHVAE outperforms the other state-of-the-art multimodal models on both single-modality and joint-modality metrics, despite the fact that its separate representation spaces are of lower dimensionality than the joint representation space employed by the JMVAE and the MVAE. Moreover, the MHVAE model is able to provide better cross-modality inference than the other models, as observed by the significantly lower value of the conditional log-likelihood log p(x_1 | x_2), employing only the lower-dimensional label modality to estimate this quantity.

In Figure 4, we present images generated by the MHVAE, sampled from the prior z_c ∼ N(0, I) and conditioned on a given label, i.e., estimated using q_φ(z_c | x_2). The quality of the sampled images indicates a suitable performance of the generative networks of the MHVAE model.

For FashionMNIST, we train the generative models on greyscale images x_1 ∈ R^{1×28×28} and their class labels x_2 ∈ {0, 1}^{10}, with the same proportional division of the dataset as in the previous case. For the MHVAE model, we implement a miniature DCGAN [20] architecture as the image-modality encoder, with Swish [21] as the activation function due to its performance in deep convolutional models. The network is composed of two convolutional layers of 32 and 64 channels, followed by a linear layer of 128 hidden units. For the core and text-modality inference and generator networks, we maintain the same architecture as before. We consider modality-specific and core latent spaces with the same dimensionality as before and employ the same training hyper-parameters as in the previous evaluation case.

We train all models for 500 epochs, employing the Adam optimization algorithm [22] with a fixed learning rate and a batch size of b = 64. The estimates of the test log-likelihoods for all models are presented in Table 2. Once again, we report that the MHVAE outperforms the other state-of-the-art multimodal models on both single-modality and joint-modality metrics, as well as on label-to-image cross-modality inference.

Table 2: Log-likelihood values of the proposed evaluation metrics on the FashionMNIST dataset for the MHVAE and other multimodal generative models. We estimate the latent variables considering as input the image (I), label (L) or joint (I, L) modalities.

Metric             Input   JMVAE      MVAE       MHVAE
log p(x_1)         I       -232.427   -236.613   -231.753
log p(x_1, x_2)    I       -232.739   -242.628   -232.276
log p(x_1, x_2)    L       -244.378   -557.582   -243.932
log p(x_1, x_2)    I,L     -232.573   -241.534   -232.248
log p(x_1 | x_2)   L       -242.060   -552.679   -241.662

In Figure 4, we present the images generated by the MHVAE, sampled from the prior z_c ∼ N(0, I) and conditioned on a given label, which provide evidence that the generative networks of the model perform suitably.

For CelebA, we train the MHVAE on re-scaled colored images x_1 and a subset of 18 visually distinctive attributes x_2 ∈ {0, 1}^{18} [23]. We compose the image modality network of the MHVAE model as a miniature DCGAN [20]. This network is composed of four convolutional layers, followed by a linear layer of 512 hidden units. Furthermore, we consider a 48-dimensional image-specific latent space. The label modality network is composed of three linear layers with 512 hidden units, considering a 48-dimensional label-specific latent space. The core network is composed of three linear layers with 256 hidden units, considering a 16-dimensional latent space. The baselines consider a single 64-dimensional latent space.

We train all models for 50 epochs, employing a fixed learning rate and a batch size of b = 128. For the MHVAE model, we consider U_m = 5 and U_c = 10 epochs. For the baseline models, we consider a single warm-up period of U_b = 10 epochs.
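The warm-up periods U_m and U_c linearly increase the corresponding KL regularization weights from zero to their final values over the first epochs of training. A minimal sketch of such a linear schedule follows (the exact schedule used by the authors may differ):

```python
def warmup_weight(epoch, warmup_epochs, final_value=1.0):
    """Linearly increase a KL regularization weight from 0 to final_value."""
    if warmup_epochs <= 0:
        return final_value
    return final_value * min(1.0, epoch / warmup_epochs)

# Example with the CelebA configuration reported above (U_m = 5, U_c = 10 epochs):
beta_m = warmup_weight(epoch=3, warmup_epochs=5)    # 0.6
beta_c = warmup_weight(epoch=3, warmup_epochs=10)   # 0.3
```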
The estimates of the test log-likelihoods, computed using 500 importance samples, are presented in Table 3. In this scenario, the MHVAE performs on par with other state-of-the-art multimodal models on all metrics, albeit with slightly lower performance in comparison with the previous evaluations. In Figure 4, we present the images generated by the MHVAE, sampled from the prior z_c ∼ N(0, I) and conditioned on a given set of attributes.

Table 3: Log-likelihood values of the proposed evaluation metrics on the CelebA dataset for the MHVAE and other multimodal generative models. We estimate the latent variables considering as input the image (I), attributes (A) or joint (I, A) modalities, resorting to 500 importance samples.

Metric             Input   JMVAE      MVAE       MHVAE
log p(x_1)         I       -6260.35   -6256.65   -6271.35
log p(x_1, x_2)    I       -6264.59   -6270.86   -6278.19
log p(x_1, x_2)    A       -7204.36   -7316.12   -7303.64
log p(x_1, x_2)    I,A     -6262.67   -6266.14   -6276.57
log p(x_1 | x_2)   A       -7191.11   -7309.10   -7296.22

We have evaluated the MHVAE against state-of-the-art baselines on standard multimodal datasets. We have compared our model with the JMVAE and the MVAE, two widely used models for multimodal representation learning.

The results, on increasingly complex datasets, attest to the importance of considering hierarchical representation spaces to model multimodal data distributions. Even considering lower-dimensional spaces to learn the modality distributions, in comparison with the single multimodal space of the baselines, the MHVAE is able to achieve state-of-the-art results on the MNIST and FashionMNIST datasets, with minimal hyper-parameter tuning.

On the CelebA dataset, the MHVAE behaves on par with the other baseline models, which raises the question of the importance of the dimensionality of the representation spaces in complex scenarios. Indeed, for a fair comparison with the other baselines, we limited the MHVAE model to lower-dimensional representation spaces which, on a complex dataset such as CelebA, result in a lower log-likelihood of the modalities. However, the MHVAE model is still capable of outperforming the MVAE model in regard to joint-modality and cross-modality inference, estimated from the label. Regarding future work, we intend to address the question of the balance between representative capacity in the core and in the modality-specific distributions.
Deep generative models have shown great promise in learning generalized latent representations of data. The VAE model [14] estimates a deep generative model through variational inference methods, encoding single-modality data in a single latent space, regularized by a prior distribution. The regularization distribution is often a unitary Gaussian, or a more complex posterior distribution [24, 25]. Due to the intractability of the marginal likelihood of the data, the model resorts to an inference network in the computation of the model's evidence lower bound. This lower bound can be estimated, for example, through importance sampling techniques [26].

Hierarchical generative models have also been proposed in the literature to learn complex relationships between latent variables [19, 27, 28, 29]. However, these models consider representations created from a single modality and, as such, are not able to provide a framework for cross-modality inference nor to represent multimodal data. On the other hand, VAE models have also been extended in order to learn joint distributions of several modalities by forcing the estimated single-modality representations to be similar, thus allowing cross-modality inference [10, 30, 11]. However, the necessity of introducing specific divergence terms in the model's evidence lower bound for each combination of modalities hinders their application in scenarios with a large number of modalities. Another approach introduced the POE inference network, which reduces the number of encoding networks required for multimodal encoding [12], albeit with an increased computational training cost. In order to provide cross-modality inference capabilities, the existing models encode information from all modalities into a single, common, latent variable space. Thus, they relinquish the generative capabilities that single-modality latent representation spaces possess. In this work, we present a novel multimodal generative model, capable of learning hierarchical representation spaces.
In this work, by taking inspiration from the human cognitive framework, we presented the MHVAE, a novel multimodal hierarchical generative model. The MHVAE is able to learn separate modality-specific representations and a joint-modality representation, allowing for improved representation learning in comparison with the single-representation choice of other multimodal generative models. We have shown that, on standard multimodal datasets, the MHVAE is able to outperform other state-of-the-art multimodal generative models regarding modality-specific reconstruction and cross-modality inference.

We also proposed a novel methodology to approximate the joint-modality posterior, based on modality-specific representation dropout. With minimal computational cost, this approach allows the encoding of information from an arbitrary number of modalities and naturally promotes cross-modality inference during the model's training. We aim at exploring scenarios with a larger number of modalities in the future.

Moreover, we aim to employ the MHVAE as a perceptual representation model for artificial agents and to explore its application in deep multimodal reinforcement learning scenarios, where the agent has to perform cross-modality inference to perform the task. Further inspired by human cognition and perceptual learning, we also intend to explore reinforcement learning mechanisms for the construction of the multimodal representations themselves.
References

[1] Kaspar Meyer and Antonio Damasio. Convergence and divergence in a neural architecture for recognition and memory. Trends in Neurosciences, 32(7):376–382, 2009.
[2] Antonio R. Damasio. Time-locked multiregional retroactivation: A systems-level proposal for the neural substrates of recall and recognition. Cognition, 33(1-2):25–62, 1989.
[3] Peter Walker, J. Gavin Bremner, Uschi Mason, Jo Spring, Karen Mattock, Alan Slater, and Scott P. Johnson. Preverbal infants' sensitivity to synaesthetic cross-modality correspondences. Psychological Science, 21(1):21–25, 2010.
[4] Daphne Maurer, Thanujeni Pathman, and Catherine J. Mondloch. The shape of boubas: Sound–shape correspondences in toddlers and adults. Developmental Science, 9(3):316–322, 2006.
[5] Charles Spence. Crossmodal correspondences: A tutorial review. Attention, Perception, & Psychophysics, 73(4):971–995, 2011.
[6] Chelsea Finn and Sergey Levine. Deep visual foresight for planning robot motion. In IEEE International Conference on Robotics and Automation (ICRA), pages 2786–2793. IEEE, 2017.
[7] Lerrel Pinto, Dhiraj Gandhi, Yuanfeng Han, Yong-Lae Park, and Abhinav Gupta. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision, pages 3–18. Springer, 2016.
[8] Rui Silva, Miguel Vasco, Francisco S. Melo, Ana Paiva, and Manuela Veloso. Playing games in the dark: An approach for cross-modality transfer in reinforcement learning. arXiv preprint arXiv:1911.12851, 2019.
[9] Yuge Shi, N. Siddharth, Brooks Paige, and Philip Torr. Variational mixture-of-experts autoencoders for multi-modal deep generative models. In Advances in Neural Information Processing Systems, pages 15692–15703, 2019.
[10] Masahiro Suzuki, Kotaro Nakayama, and Yutaka Matsuo. Joint multimodal learning with deep generative models. arXiv preprint arXiv:1611.01891, 2016.
[11] Timo Korthals, Daniel Rudolph, Jürgen Leitner, Marc Hesse, and Ulrich Rückert. Multi-modal generative models for learning epistemic active sensing. 2019.
[12] Mike Wu and Noah Goodman. Multimodal generative models for scalable weakly-supervised learning. In Advances in Neural Information Processing Systems, pages 5575–5585, 2018.
[13] Stephane Lallee and Peter Ford Dominey. Multi-modal convergence maps: from body schema and self-representation to mental imagery. Adaptive Behavior, 21(4):274–285, 2013.
[14] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
[15] Johan Ludwig William Valdemar Jensen. Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica, 30:175–193, 1906.
[16] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
[17] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[18] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Large-scale CelebFaces Attributes (CelebA) dataset. Retrieved August 15, 2018.
[19] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Advances in Neural Information Processing Systems, pages 3738–3746, 2016.
[20] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[21] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
[22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[23] Guim Perarnau, Joost van de Weijer, Bogdan Raducanu, and Jose M. Álvarez. Invertible conditional GANs for image editing. arXiv preprint arXiv:1611.06355, 2016.
[24] Jianlin Su and Guang Wu. f-VAEs: Improve VAEs with conditional flows. arXiv preprint arXiv:1809.05861, 2018.
[25] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
[26] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
[27] Shengjia Zhao, Jiaming Song, and Stefano Ermon. Learning hierarchical features from deep generative models. In Proceedings of the 34th International Conference on Machine Learning, pages 4091–4099. JMLR.org, 2017.
[28] Philip Bachman. An architecture for deep, hierarchical generative models. In Advances in Neural Information Processing Systems, pages 4826–4834, 2016.
[29] Wei-Ning Hsu, Yu Zhang, Ron J. Weiss, Heiga Zen, Yonghui Wu, Yuxuan Wang, Yuan Cao, Ye Jia, Zhifeng Chen, Jonathan Shen, et al. Hierarchical generative modeling for controllable speech synthesis. arXiv preprint arXiv:1810.07217, 2018.
[30] Hang Yin, Francisco S. Melo, Aude Billard, and Ana Paiva. Associate latent encodings in learning from demonstrations. In