Semi-Supervised Translation with MMD Networks
Mark Hamilton

Abstract
This work aims to improve semi-supervised learning in a neural network architecture by introducing a hybrid supervised and unsupervised cost function. The unsupervised component is trained using a differentiable estimator of the Maximum Mean Discrepancy (MMD) distance between the network output and the target dataset. We introduce the notion of an n-channel network and several methods to improve the performance of these nets based on supervised pre-initialization and multi-scale kernels. This work investigates the effectiveness of these methods on language translation, where very few quality translations are known a priori. We also present a thorough investigation of the hyper-parameter space of this method on both synthetic and language data.
1. Introduction
Often in data analysis, one has a small set of quality labeled data and a large pool of unlabeled data. It is the task of semi-supervised learning to make as much use of this unlabeled data as possible. In the low-data regime, the aim is to create models that perform well after seeing only a handful of labeled examples. This is often the case with machine translation and dictionary completion, as it can be difficult to construct a large number of labeled instances or a sufficiently large parallel corpus. However, this domain offers a huge number of monolingual corpora with which to build high quality language embeddings (Tiedemann, 2012; Al-Rfou et al., 2013). The methods presented in this paper are designed to take into consideration both labeled and unlabeled information when training a neural network. The supervised component uses standard alignment-based loss functions, and the unsupervised component attempts to match the distribution of the network's output to the target data's distribution by minimizing the Maximum Mean Discrepancy (MMD) "distance" between the two distributions. This has the effect of placing a prior on translation methods that preserve the distributional structure of the two datasets, which limits the model space and increases the quality of the mapping, allowing one to use less labeled data.

Related methods, such as auto-encoder pre-initialization (Erhan et al., 2010), first learn the structure of the input and then learn a mapping; unsupervised knowledge enters through learning good features to describe the dataset. The MMD method of unsupervised training instead directly learns a mapping between the two spaces that aligns all of the moments of the mapped data and the target data. This method can be used to improve any semi-supervised mapping problem, such as mappings between languages (Dinu et al., 2014), image labeling, fMRI analysis (Mitchell et al., 2008), and any other domain where transformations need to be learned between datasets. This investigation aims to study these methods in the low-data regime, with the eventual goal of studying dying or lost languages, where very few supervised training examples exist.
2. Background
The Maximum Mean Discrepancy (MMD), put forth by Gretton et al. (2012a), is a measure of distance between two distributions $p, q$. More formally, let $x, y$ be variables defined on a topological space $\mathcal{X}$ with Borel measures $p, q$, and let $\mathcal{F}$ be a class of functions from $\mathcal{X} \to \mathbb{R}$. The MMD semi-metric is defined as:

$$\mathrm{MMD}_{\mathcal{F}}(p, q) = \sup_{f \in \mathcal{F}} \left( \mathbb{E}_{x \sim p} f(x) - \mathbb{E}_{y \sim q} f(y) \right) \tag{1}$$

where $\mathbb{E}$ is the first raw moment, defined as:

$$\mathbb{E}_{x \sim p} f(x) = \int_{\mathcal{X}} f(x)\, dp \tag{2}$$

Intuitively, the MMD is a measure of distance that uses a class of functions as a collection of "trials" to put the two distributions through. The distributions pass a trial if the function evaluated on both distributions has the same expectation, or mean. Two distributions fail a trial if they yield different means, and the size of the difference measures how badly the distributions fail that trial. Identical distributions should yield the same images when put through each function in $\mathcal{F}$, so the means (first moments) of the images should also be identical. Conversely, if the function class is "large enough", this method can distinguish between any two probability distributions that differ, making the MMD a semi-metric on the space of probability distributions. A unit ball in a Reproducing Kernel Hilbert Space (RKHS) is sufficient to discern any two distributions provided the kernel $k$ is universal (Cortes et al., 2008). If $\mathcal{F}$ is equal to a unit ball in kernel space, Gretton et al. (2012a) showed that the following is an unbiased estimator of the MMD:

$$\mathrm{MMD}_u^2(X, Y) = \frac{1}{m(m-1)} \sum_{i=1}^{m} \sum_{j \neq i}^{m} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i=1}^{n} \sum_{j \neq i}^{n} k(y_i, y_j) - \frac{2}{mn} \sum_{i=1}^{m} \sum_{j=1}^{n} k(x_i, y_j) \tag{3}$$

If the kernel function is differentiable, the estimator of the MMD is differentiable as well, allowing one to use it as a loss function that can be optimized with gradient descent and hence used in a feed-forward network. Li et al. (2015) showed that by using the MMD distance as a loss function in a neural net $\mathcal{N}$, one can learn a transformation that maps a distribution of points $X = (x_i)_{i=1}^{n}$ in $\mathbb{R}^d$ to another distribution $Y = (y_i)_{i=1}^{m}$ in $\mathbb{R}^e$ while approximately minimizing the MMD distance between the image of $X$, $\mathcal{N}(X)$, and $Y$:

$$l_{\mathrm{MMD}}(X, Y, \mathcal{N}) = \mathrm{MMD}_u(\mathcal{N}(X), Y) \tag{4}$$

This loss function allows the net to learn transformations of probability distributions in a completely unsupervised manner. Furthermore, the MMD-net can also be used to create generative models, or mappings from a simple distribution to a target distribution (Li et al., 2015), where simple usually means easy to sample from, such as a maximum entropy distribution; often a multivariate uniform or Gaussian source distribution is used in these generative models. This loss function can be optimized via mini-batch stochastic gradient descent, and the samples from $X$ and $Y$ need not be paired in any way. To avoid over-fitting, the minibatches for $X$ and $Y$ should be sampled independently, which this paper refers to as "unpaired" minibatching.
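For concreteness, the estimator in Eq. (3) can be written in a few lines of differentiable code. The paper's implementation used Theano; the sketch below substitutes PyTorch purely for illustration, and the Gaussian kernel with scale `sigma` is a placeholder choice discussed further in Section 3.

```python
import torch

def gaussian_kernel(a: torch.Tensor, b: torch.Tensor, sigma: float) -> torch.Tensor:
    """Pairwise Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / sigma^2)."""
    sq_dists = torch.cdist(a, b) ** 2
    return torch.exp(-sq_dists / sigma ** 2)

def mmd2_unbiased(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
    """Unbiased estimator of MMD^2 (Eq. 3) between samples x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    k_xx = gaussian_kernel(x, x, sigma)
    k_yy = gaussian_kernel(y, y, sigma)
    k_xy = gaussian_kernel(x, y, sigma)
    # Drop the diagonals so the within-sample sums run over i != j.
    sum_xx = (k_xx.sum() - k_xx.diagonal().sum()) / (m * (m - 1))
    sum_yy = (k_yy.sum() - k_yy.diagonal().sum()) / (n * (n - 1))
    return sum_xx + sum_yy - 2.0 * k_xy.mean()
```

Because every operation above is differentiable, gradients with respect to the mapped sample `x = N(s_batch)` flow back into the network parameters, which is what makes the estimator usable as a training loss.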
3. Methods

n-Channel Networks

This work introduces a generalization of a feed-forward net, called an n-channel net. This architecture allows an unsupervised loss term that requires unpaired mini-batching to be mixed with the paired mini-batching scheme of a standard feed-forward network.

An n-channel net is a collection of n networks with tied weights that operate on n separate datasets $(X_i, Y_i)_{i=1}^n$. More formally, an n-channel net is a mapping:

$$\mathcal{N}_n : \left(\mathbb{R}^d\right)^n \to \left(\mathbb{R}^e\right)^n \tag{5}$$

defined as:

$$\mathcal{N}_n\left((X_i)_{i=1}^n\right) \equiv \left(\mathcal{N}(X_i)\right)_{i=1}^n \tag{6}$$

where $\mathcal{N} : \mathbb{R}^d \to \mathbb{R}^e$ is a feed-forward network. Each channel of the network can have its own loss function and be fed with a separate data source. Most importantly, these separate data sources can be trained in a paired or unpaired manner.

In many applications where one is interested in estimating a transformation between data spaces, one has a small labeled dataset $(X, Y)$ and large, unlabeled datasets $(S, T)$. Throughout the literature, MMD networks have only been applied to the case of unpaired data (Li et al., 2015). We expand on this work by augmenting the completely unsupervised MMD distance with a semi-supervised alignment term. More formally, if one has a collection of k paired vectors $(x_i, y_i)_{i=1}^k$ with $x_i \in X$ and $y_i \in Y$ that should be aligned through the transformation $\mathcal{N}$, one can use the standard loss function:

$$l_{\mathrm{alignment}}(X, Y, \mathcal{N}) = \sum_{i=1}^{k} \|\mathcal{N}(x_i) - y_i\| \tag{7}$$

where $\|\cdot\|$ is any differentiable norm on $\mathbb{R}^d$; this work uses the standard $l_2$ vector norm. This is the standard loss used in regression, where the goal of the network is to minimize the distance between the network output $\mathcal{N}(x_i)$ and the observed responses $y_i$.

Using a hyperparameter $\alpha_{pair}$, we can blend the cost functions of the supervised alignment loss and the unsupervised MMD loss. The full cost function for the MMD network then becomes:

$$l(X, Y, S, T, \mathcal{N}) = \alpha_{pair}\, l_{\mathrm{alignment}}(X, Y, \mathcal{N}) + (1 - \alpha_{pair})\, l_{\mathrm{MMD}}(S, T, \mathcal{N}) \tag{8}$$

The MMD term of the cost function scales as $O(M^2)$, where M is the size of the mini-batch. This significantly increases training time for large batch sizes, slowing convergence in wall-time. To mitigate this effect, we first train the network until convergence with only the supervised term of the cost function. Once converged, we then switch to the semi-supervised cost function.

This also helps the network avoid local minima, as it already starts close to the optimal solution. Because the MMD cost function is inherently unpaired, it is susceptible to getting stuck in local minima when there are multiple ways to map the mass of one probability distribution onto another distribution. We say that a mapping of the supports, $f : X \to Y$, is an MMD-mode from distribution p to q if $f(p) \sim q$. Here $f(p)$ is the distribution formed by sampling from p and then applying f. These modes coincide with critical points of the $\mathrm{MMD}_u$ cost function and are therefore hard to escape with gradient descent methods. As the class of functions represented by the network grows, more distinct MMD-modes arise. This increases the number of critical points, though these probably tend to be saddle points rather than local minima as the dimensionality of the function space increases (Dauphin et al., 2014).
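A minimal sketch of one optimization step on the blended cost of Eq. (8) follows, assuming `net` is a feed-forward `torch.nn.Module` and reusing `mmd2_unbiased` from the sketch in Section 2. The paired batch `(x_pair, y_pair)` and the independently sampled unpaired batches `(s_batch, t_batch)` follow the two-channel scheme above; the default `alpha_pair` and `sigma` values are illustrative, not the paper's settings.

```python
import torch

def semi_supervised_step(net, optimizer, x_pair, y_pair, s_batch, t_batch,
                         alpha_pair: float = 0.5, sigma: float = 1.0):
    """One gradient step on Eq. (8): alpha_pair * l_align + (1 - alpha_pair) * l_mmd."""
    optimizer.zero_grad()
    # Paired channel: alignment loss of Eq. (7), an l2 norm per paired example.
    l_align = torch.norm(net(x_pair) - y_pair, dim=1).sum()
    # Unpaired channel: MMD between the mapped source batch and a target batch
    # that was sampled independently ("unpaired" minibatching).
    l_mmd = mmd2_unbiased(net(s_batch), t_batch, sigma)  # from the Section 2 sketch
    loss = alpha_pair * l_align + (1.0 - alpha_pair) * l_mmd
    loss.backward()
    optimizer.step()
    return float(loss)
```

Under this formulation, the supervised pre-initialization described above corresponds to running this step with `alpha_pair = 1.0` (skipping the expensive MMD term entirely) until convergence, then switching to the blended cost.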
One can escape these local minima by increasing $\alpha_{pair}$ to the point where the signal from the supervised term overcomes the signal from the unsupervised cost function. However, if the network is within the pull of the correct minimum, it is often better to rely on the robust unsupervised signal than the noisy supervised signal, which requires a small $\alpha_{pair}$. We found that supervised pre-training helped guide the network parameters to within the basin of attraction of the correct unsupervised minimum. From there, the unsupervised signal was much more reliable and led to better results on synthetic and language datasets. Furthermore, on all datasets the supervised warm-start greatly reduced fitting time, as convergence of the expensive MMD cost function needed fewer optimization steps. Future work could involve annealing the supervised term to a small number, though this would eliminate the aforementioned computational speedup.

To demonstrate the effect of pre-initialization, we show the unbiased MMD estimator on a simple synthetic experiment. We generate two datasets of two-dimensional points. The first, shown in Figure 1 (left), is sampled from a uniform distribution supported on the unit square centered at (0, 0). To generate a simple target, shown in Figure 1 (middle), we rotate the source cloud of points by an angle $\theta^* = 255°$ and add a small Gaussian noise term. Figure 1 (right) shows that the MMD loss, as a function of the angle of the rotation transformation, has several modes caused by the symmetries of the square. To simulate a very noisy MSE, we use the MSE of one randomly sampled point and its respective pair. The noisy MSE loss function has two local minima, and its global minimum $\hat{\theta}$ is within the correct basin of attraction of the unsupervised cost function. This basin of attraction of the unsupervised cost has a minimum that is indistinguishable from the correct value of $\theta$ and much more accurate than the supervised loss term.

Figure 1. Left: Initial dataset X sampled uniformly from the unit square. Colors indicate how points are mapped through the transform. Middle: $Y = X_{\theta^*} + $ Gaussian noise ($\mu = 0$, small $\sigma$), where $X_\theta$ denotes a rotation clockwise by $\theta$. Right: Unit-scaled $\mathrm{MMD}_u(X_\theta, Y)$ and unit-scaled $\mathrm{MSE}(X_{\theta,1}, Y)$ as a function of $\theta$, where $X_1$ denotes the first element of X.
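The rotation experiment is small enough to reproduce in a few lines. The NumPy sketch below approximates the Figure 1 setup; the sample size, noise level, kernel scale, and angular grid are our own illustrative choices, not the paper's. Because the square is symmetric under 90° rotations, the scan recovers one of four equally good MMD modes, which is exactly the ambiguity the supervised signal resolves.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-0.5, 0.5, size=(500, 2))      # source: uniform unit square at (0, 0)
theta_star = np.deg2rad(255.0)

def rotate(points, theta):
    """Rotate row-vector points clockwise by theta."""
    c, s = np.cos(theta), np.sin(theta)
    return points @ np.array([[c, -s], [s, c]])

y = rotate(x, theta_star) + 0.01 * rng.normal(size=x.shape)  # noisy target

def mmd2_u(a, b, sigma=0.3):
    """Unbiased MMD^2 estimator (Eq. 3) with a single Gaussian kernel."""
    def gram(p, q):
        d2 = ((p[:, None, :] - q[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / sigma ** 2)
    m, n = len(a), len(b)
    kxx, kyy, kxy = gram(a, a), gram(b, b), gram(a, b)
    return ((kxx.sum() - np.trace(kxx)) / (m * (m - 1))
            + (kyy.sum() - np.trace(kyy)) / (n * (n - 1))
            - 2.0 * kxy.mean())

# Scan the rotation angle: the square's four-fold symmetry produces four
# near-identical MMD modes (here near 75, 165, 255, and 345 degrees).
angles = np.deg2rad(np.arange(0.0, 360.0, 5.0))
losses = np.array([mmd2_u(rotate(x, t), y) for t in angles])
print(np.rad2deg(angles[losses.argmin()]))
```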
The $\mathrm{MMD}_{\mathcal{F}}$ is able to differentiate between any two distributions if the function class $\mathcal{F}$ is a unit ball in the reproducing kernel Hilbert space (RKHS) of a universal kernel (Cortes et al., 2008). One of the simplest and most commonly used universal kernels is the Gaussian or radial basis function kernel, which excels at representing smooth functions:

$$k_\sigma(x, y) = \exp\left(-\frac{\|x - y\|^2}{\sigma^2}\right) \tag{9}$$

The parameter $\sigma$ controls the width of the Gaussian and needs to be set properly for good performance. If $\sigma$ is too low, each point's local neighborhood will be effectively empty, and the gradients will vanish. If it is too high, every point will be in each point's local neighborhood and the kernel will not have enough resolution to see the details of the distribution; in this scenario, the gradients also vanish. We found that $\sigma$ was one of the most important hyper-parameters for the success of the method. In both our synthetic data and natural language examples, we found that the method performed well only in a small window of kernel scale settings.

To improve the robustness of this method, this investigation used the following multi-scale Gaussian kernel:

$$k(x, y) = \sum_{i=0}^{n} c_i\, k_{\sigma_i}(x, y)$$

where $c_i = 1$, $\sigma_i = s \cdot 10^{\,w(i/n) - w/2}$, $w = 4$, and $n = 10$. The scalar s is the average scale of the multi-scale kernel, the width w controls the width of the frequency range covered by the kernel, and n controls how many samples are taken from this range. Choosing a larger n improves performance, as there are more scales in the kernel, but increases computation time. By including multiple scales in the kernel, the gradients from the larger kernels will move the parameters to a region where the distributions are aligned at a large scale; these gradients will then begin to vanish, and the smaller-scale gradients will become more relevant. Setting w = 4 allows the kernel to be sensitive to functions with scales that are within ±2 orders of magnitude of the average scale s. We find that choosing this kernel significantly broadens the areas of parameter space where the method succeeds, without hurting performance.
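A sketch of this kernel follows, assuming the scales $\sigma_i$ are spaced log-uniformly (base 10) so that w = 4 spans ±2 orders of magnitude around the average scale s; this spacing is our reading of the definition above.

```python
import torch

def multiscale_gaussian_kernel(a, b, s: float, w: float = 4.0, n: int = 10):
    """Sum of n + 1 Gaussian kernels with scales sigma_i = s * 10**(w * (i / n) - w / 2),
    covering w orders of magnitude centered (in log space) on the average scale s."""
    sq_dists = torch.cdist(a, b) ** 2
    total = torch.zeros_like(sq_dists)
    for i in range(n + 1):
        sigma = s * 10.0 ** (w * (i / n) - w / 2.0)
        total = total + torch.exp(-sq_dists / sigma ** 2)  # c_i = 1 for every scale
    return total
```

Plugging this kernel into the MMD estimator keeps the large-scale gradients alive early in training, while the small scales take over to refine the fit.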
Many have investigated the kernel scale problem, and several heuristics are available for choosing the scale based on optimizing statistical power or median distances to nearest neighbors (Gretton et al., 2012b). For clarity, we explicitly investigated and set the kernel scale based on a grid search evaluated on a held-out validation set. Figure 2 demonstrates that the method was fairly robust to settings of the average kernel scale on both synthetic and language data.

In this analysis, the performance of translation methods is compared on their ability to infer the correct translation on a held-out test set. More specifically, we use the precision at N, which is the fraction of examples where the correct word was in the top N most likely translations of the model. This is a natural choice for translation, as it estimates the probability of translating a word correctly when N = 1. To generate the list of the N most likely translations for a given word, one can use nearest neighbor (NN) retrieval. In this method, one uses the N closest neighbors in the target space of the mapped word vector as the list of best guesses. We find that it is always better to use cosine distance for nearest neighbor calculations. Finding the first nearest neighbor of a point $\hat{y}$ can be more formally expressed as:

$$NN(\hat{y}) = \underset{y \in T}{\operatorname{argmin}}\ \mathrm{Rank}_T(\hat{y}, y) \tag{10}$$

where $\hat{y}$ is our mapped word vector, T is our target space, and $\mathrm{Rank}_T(\hat{y}, y)$ is a function that returns the rank of y in the sorted list of distances between $\hat{y}$ and the points in T.

If the space of word embeddings is not uniformly distributed, there will be areas where word embeddings bunch together in higher densities. The points towards the center of these bunches act as hub points and may be the nearest neighbors of many other points. To mitigate this hubness problem, Dinu et al. (2014) propose globally corrected (GC) retrieval: instead of NN, one uses:

$$GC(\hat{y}) = \underset{y \in T}{\operatorname{argmin}}\ \left(\mathrm{Rank}_P(y, \hat{y}) - \cos(\hat{y}, y)\right) \tag{11}$$

where P is a random sampling of points from T and $\cos(x, y)$ is the cosine similarity between x and y. Instead of returning the nearest neighbor of $\hat{y}$, GC returns the point in T that has $\hat{y}$ ranked the highest, with the cosine term breaking ties. GC retrieval has been shown to outperform nearest neighbor retrieval in all frequency bins when the transformation is a linear mapping (Dinu et al., 2014). Figure 4 shows that it also improves the performance of the semi-supervised translation task.
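A NumPy sketch of GC retrieval (Eq. 11) follows, under our reading that the correction term is the cosine similarity, so that ties among equal ranks break toward closer candidates. The array names and the size of the sampled pool P are illustrative.

```python
import numpy as np

def _unit(v):
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

def gc_retrieve(y_hat, targets, pool):
    """Eq. (11): argmin over y in T of Rank_P(y, y_hat) - cos(y_hat, y).

    y_hat:   mapped query vector, shape (d,)
    targets: candidate matrix T, shape (|T|, d)
    pool:    random sample P drawn from T, shape (|P|, d)
    """
    t, p, q = _unit(targets), _unit(pool), _unit(y_hat)
    sim_to_pool = t @ p.T           # cosine similarity of each candidate to the pool
    sim_to_query = t @ q            # cosine similarity of each candidate to y_hat
    # Rank_P(y, y_hat): how many pool points y prefers over the query.
    rank = (sim_to_pool > sim_to_query[:, None]).sum(axis=1)
    return int(np.argmin(rank - sim_to_query))
```

Precision@1 is then simply the fraction of held-out test words for which `gc_retrieve` returns the index of the gold translation.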
This work implemented the network in Theano (Bergstra et al., 2011), an automatic differentiation library written in Python. The net was trained with RMSProp (Tieleman & Hinton, 2012) on both the unpaired and paired batches, with a batch size of 200 for each set. The unregularized pre-initialization and the regularized network were each trained for a fixed number of epochs, which gave ample time for convergence. Hyperparameter optimization was performed through parallel grid searches on a TORQUE cluster, where each job ran for several hours. A validation set consisting of a random sample of the training set was used to choose the parameters for the final reported results.

Figure 2. Left: Performance comparison on word embeddings in the 0–5k frequency bin as a function of the average kernel scale s. Middle: Performance comparison on synthetically generated data in $\mathbb{R}^{30}$ as a function of $\alpha_{pair}$. Right: Performance comparison on synthetically generated data in $\mathbb{R}^{300}$ as a function of $\alpha_{pair}$.
4. Data
Several synthetic datasets were used to demonstrate the method's ability to accurately learn linear transformations using a very small paired dataset. Furthermore, we used this synthetic data to investigate the effects of the network's hyper-parameters. Two datasets were created, one with the dimension of the source and target equal to 30 and the other 300, the same dimensionality as the embeddings. The datasets contained a large pool of points, and various sized paired subsets were used to calculate the supervised alignment loss in the experiments. Source data was generated as a multivariate Gaussian with zero mean and unit variance. A ground truth mapping was generated by sampling the entries of a d × d matrix of independent Gaussians with zero mean and unit variance. The target data was generated by applying the ground truth transformation to the source data and adding Gaussian noise with zero mean and small variance.

This analysis used 300-dimensional English (EN) and Italian (IT) monolingual word embeddings from Dinu et al. (2014). These embeddings were trained with word2vec's CBOW model on 2.8 billion tokens (ukWaC + Wikipedia + BNC) for English and the 1.6 billion itWaC tokens for Italian (Dinu et al., 2014). The embeddings contained the top 200,000 words in each language. Supervised training and testing sets were constructed from a dictionary built from Europarl, available at http://opus.lingfil.uu.se/ (Tiedemann, 2012). Two training sets consisted of the 750 and 5,000 most frequent words from the source language (English) which had translations in the gold dictionary. Five disjoint test sets were created, each consisting of roughly 400 translation pairs randomly sampled from the frequency-ranked words in the intervals 0–5k, 5k–20k, 20k–50k, 50k–100k, and 100k–200k.
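The synthetic setup translates directly into code. In the sketch below, the pool size, noise scale, and paired-subset size are illustrative stand-ins where the text leaves the exact values unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_points, n_paired = 30, 10_000, 50        # illustrative sizes

source = rng.normal(size=(n_points, d))       # zero-mean, unit-variance Gaussian
ground_truth = rng.normal(size=(d, d))        # random d x d linear map
target = source @ ground_truth + 0.1 * rng.normal(size=(n_points, d))

# A small paired subset supplies the supervised alignment signal; the full
# (unpaired) clouds feed the MMD term of the cost function.
idx = rng.choice(n_points, size=n_paired, replace=False)
x_pair, y_pair = source[idx], target[idx]
```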
5. Results
Adding the MMD term to the loss function dramatically improved the ability to learn the transformation on all synthetic datasets. The synthetic data also provided a clean environment to see the effect of varying hyper-parameters. The experiment used a "linear network", which is equivalent to learning a linear transformation between the spaces. In general, if the hyper-parameters are set correctly, the MMD-assisted learner can approach the true transformation with significantly less paired data.

Our first investigation aimed to understand the effect and robustness of the kernel scale parameter. As one can see from Figure 2, the performance of the method is robust to settings of the average kernel scale within ±2 orders of magnitude of the optimal scale. This empirically confirms the intuition behind the width parameter of the multi-scale kernel. As the width parameter decreases, this valley of good performance becomes narrower by the expected amount. A similar pattern arose in the higher-dimensional dataset.

In order to simulate the environment of the embedding experiment, which required holding out a fraction of the data for validation, we also removed a comparable fraction of our data. The plots in Figure 2 demonstrate that even with the data removed for a validation set, the method still significantly beats linear regression trained on the training and validation sets combined, justifying the use of held-out data for parameter tuning. The models in d = 30 and d = 300 both reach error rates comparable to the ground truth regressor learned on the full dataset.

Figure 3. Performance of methods on synthetically generated data as a function of $\alpha_{pair}$, with s = 10.

Figure 3 investigates various settings of $\alpha_{pair}$ and shows that decreasing $\alpha_{pair}$ drives the performance down to the ground truth level. This trend appears in both the low and high dimensional data and suggests that the supervised pre-initialization yields a configuration that is within the basin of attraction of the true parameters in the vector field $\nabla l_{\mathrm{MMD}}$. Thus, only the unsupervised term is needed, as the supervised initialization has already eliminated the ambiguity among the MMD loss function modes.
Figure 4 shows that the semi-supervised MMD-net was able to significantly outperform standard linear regression with paired datasets of 750 and 5,000 word-translation pairs in every frequency bin. Furthermore, this dominance over linear regression follows a similar pattern in the precisions @5 and @10. The method also outperformed several other linear and nonlinear methods, as shown in Table 1.
6. Discussion and Future Work
The addition of the MMD cost function term significantly improves the results of regression in the low-data regime. Furthermore, to the best knowledge of the authors, this method achieves state of the art results on the embeddings of Dinu et al. (2014). The authors also experimented with deeper nets but did not observe significant performance improvements, an observation consistent with the observations of Mikolov et al. (2013).
One promising future direction involves replacing the MMD unsupervised term with a Generative Adversarial Network (GAN) (Goodfellow et al., 2014). Like the MMD, the GAN also involves a maximization of a measure of dissimilarity over a function class, and the GAN loss function can similarly be used for unsupervised learning of probability distributions. However, the GAN is usually optimized directly by stochastic gradient descent, trading the quadratic time dependence on minibatch size for a linear one. In practice, however, the maximization over the function class (the discriminator) is usually done in k gradient descent steps for every one step of training the distribution-matching net (the generator). Furthermore, the GAN cost function does not depend on a kernel scale.

Analogous to the discriminator in the GAN, we can also adversarially learn the MMD. In this setup, the function class takes the form of a parametrized network. Instead of estimating the supremum of the mean discrepancy over a ball in an RKHS, we would find the supremum through gradient ascent on the network. This would also eliminate the quadratic compute and the dependence on kernel scale. This formulation of the MMD would allow for a more direct comparison between the GAN and MMD loss functions, and warrants future investigation. These two loss functions are inequivalent, as the only intersection between f-divergences (like the Jensen-Shannon divergence, which is equivalent to the GAN) and integral probability metrics (like the MMD) is the total variation distance (Mohamed & Lakshminarayanan, 2016). Thus, one might be able to leverage more diverse information by combining the two.

In the case of translation between two spaces of equal dimension, the inverse of the translation transformation should also be a translation from the target to the source space. We can capitalize on this observation to further constrain our set of possible translations, allowing the transformation to also draw information from the structure of the source space. More specifically, one can minimize:

$$L = \alpha_{target} \|RT - S\|_{target} + (1 - \alpha_{target}) \|R - ST^{-1}\|_{source} \tag{12}$$

where $T \in GL_d$, $\alpha_{target} \in [0, 1]$, and $R, S \in \mathbb{R}^{d \times n_{pair}}$. This would result in twice as much supervisory signal while maintaining the same number of parameters. Furthermore, this can also be applied in conjunction with the GAN loss. It is also compatible with the pre-initialization scheme. In the case of a more complex nonlinear network where an inverse transformation cannot be easily calculated, the architecture could include an encoder network which maps from the source to the target and a decoding network which maps from the target to the source. These two mappings could then be constrained to be close to mutual inverses through a reconstruction loss penalty.
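A sketch of Eq. (12) follows, with the data stored row-wise so that `R @ T` maps one space into the other, `T` a learnable square matrix, and `alpha_target` a free interpolation weight. This is our reading of the equation, not a tested implementation.

```python
import torch

def bidirectional_loss(T, R, S, alpha_target: float = 0.5):
    """Eq. (12): penalize the map T in one direction and its inverse in the other.

    T:    learnable (d, d) matrix, assumed invertible (T in GL_d)
    R, S: paired data from the two spaces, stored row-wise as (n_pair, d)
    """
    forward = torch.linalg.norm(R @ T - S)                     # residual in one space
    backward = torch.linalg.norm(R - S @ torch.linalg.inv(T))  # residual in the other
    return alpha_target * forward + (1.0 - alpha_target) * backward
```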
Figure 4. Model performance as a function of English word frequency bins, using the top 5,000 (left) and 750 (right) EN-IT word pairs as training data. Precision@1 refers to the fraction of words correctly translated by the method on held-out testing sets.
Table 1.
Comparison of Precision@1 across different algorithms and dimensionality reduction schemes. PCA S and PCA T refer to projecting the source and target, respectively, onto their first 270 principal vectors. KR refers to kernel ridge regression, and RBF refers to the radial basis function kernel with a heuristically set scale.
References
Al-Rfou, Rami, Perozzi, Bryan, and Skiena, Steven. Polyglot: Distributed word representations for multilingual NLP. arXiv preprint arXiv:1307.1662, 2013.

Bergstra, James, Bastien, Frédéric, Breuleux, Olivier, Lamblin, Pascal, Pascanu, Razvan, Delalleau, Olivier, Desjardins, Guillaume, Warde-Farley, David, Goodfellow, Ian, Bergeron, Arnaud, et al. Theano: Deep learning on GPUs with Python. In NIPS 2011, BigLearning Workshop, Granada, Spain, volume 3. Citeseer, 2011.

Cortes, Corinna, Mohri, Mehryar, Riley, Michael, and Rostamizadeh, Afshin. Sample selection bias correction theory. In International Conference on Algorithmic Learning Theory, pp. 38–53. Springer, 2008.

Dauphin, Yann N, Pascanu, Razvan, Gulcehre, Caglar, Cho, Kyunghyun, Ganguli, Surya, and Bengio, Yoshua. Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pp. 2933–2941, 2014.

Dinu, Georgiana, Lazaridou, Angeliki, and Baroni, Marco. Improving zero-shot learning by mitigating the hubness problem. arXiv preprint arXiv:1412.6568, 2014.

Erhan, Dumitru, Bengio, Yoshua, Courville, Aaron, Manzagol, Pierre-Antoine, Vincent, Pascal, and Bengio, Samy. Why does unsupervised pre-training help deep learning? The Journal of Machine Learning Research, 11:625–660, 2010.

Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.

Gretton, Arthur, Borgwardt, Karsten M, Rasch, Malte J, Schölkopf, Bernhard, and Smola, Alexander. A kernel two-sample test. The Journal of Machine Learning Research, 13(1):723–773, 2012a.

Gretton, Arthur, Sejdinovic, Dino, Strathmann, Heiko, Balakrishnan, Sivaraman, Pontil, Massimiliano, Fukumizu, Kenji, and Sriperumbudur, Bharath K. Optimal kernel choice for large-scale two-sample tests. In Advances in Neural Information Processing Systems, pp. 1205–1213, 2012b.

Li, Yujia, Swersky, Kevin, and Zemel, Richard. Generative moment matching networks. arXiv preprint arXiv:1502.02761, 2015.

Mikolov, Tomas, Le, Quoc V, and Sutskever, Ilya. Exploiting similarities among languages for machine translation. arXiv preprint arXiv:1309.4168, 2013.

Mitchell, Tom M, Shinkareva, Svetlana V, Carlson, Andrew, Chang, Kai-Min, Malave, Vicente L, Mason, Robert A, and Just, Marcel Adam. Predicting human brain activity associated with the meanings of nouns. Science, 320(5880):1191–1195, 2008.

Mohamed, Shakir and Lakshminarayanan, Balaji. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

Tiedemann, Jörg. Parallel data, tools and interfaces in OPUS. In LREC, 2012.

Tieleman, Tijmen and Hinton, Geoffrey. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 2012.