Robust model training and generalisation with Studentising flows
Simon Alexanderson    Gustav Eje Henter

Division of Speech, Music and Hearing, KTH Royal Institute of Technology, Stockholm, Sweden. Correspondence to: Gustav Eje Henter.

Invertible Neural Networks, Normalizing Flows, and Explicit Likelihood Models (ICML 2020), Virtual Conference.

Abstract
Normalising flows are tractable probabilistic models that leverage the power of deep learning to describe a wide parametric family of distributions, all while remaining trainable using maximum likelihood. We discuss how these methods can be further improved based on insights from robust (in particular, resistant) statistics. Specifically, we propose to endow flow-based models with fat-tailed latent distributions such as multivariate Student's t, as a simple drop-in replacement for the Gaussian distribution used by conventional normalising flows. While robustness brings many advantages, this paper explores two of them: 1) We describe how using fatter-tailed base distributions can give benefits similar to gradient clipping, but without compromising the asymptotic consistency of the method. 2) We also discuss how robust ideas lead to models with reduced generalisation gap and improved held-out data likelihood. Experiments on several different datasets confirm the efficacy of the proposed approach in both regards.
1. Introduction
Normalising flows are tractable probabilistic models that leverage the power of deep learning and invertible neural networks to describe highly flexible parametric families of distributions. In a sense, flows combine the powerful implicit data-generation architectures (Mohamed & Lakshminarayanan, 2016) of generative adversarial networks (GANs) (Goodfellow et al., 2014) with the tractable inference seen in classical probabilistic models such as mixture densities (Bishop, 1994), essentially giving the best of both worlds. Much ongoing research into normalising flows strives to devise new invertible neural-network architectures that increase the expressive power of the flow; see Papamakarios et al. (2019) for a review.
However, the invertible transformation used is not the only factor that determines the success of a normalising flow in applications. In this paper, we instead turn our attention to the latent (a.k.a. base) distribution that flows use. In theory, an infinitely powerful invertible mapping can turn any continuous distribution into any other, suggesting that the base distribution does not matter. In practice, however, properties of the base distribution can have a decisive effect on the learned models, as this paper aims to show. Based on insights from the field of robust statistics, we propose to replace the conventional standard-normal base distribution with distributions that have fatter tails, such as the Laplace distribution or Student's t. We argue that this simple change brings several advantages, of which this paper focusses on two aspects:

1. It makes training more stable, providing a principled and asymptotically consistent solution to problems normally addressed by heuristics such as gradient clipping.

2. It improves the generalisation capabilities of learned models, especially in cases where the training data fails to capture the full diversity of the real-world distribution.

We present several experiments that support these claims. Notably, the gains from robustness evidenced in the experiments do not require that we introduce any additional learned parameters into the model.
2. Background
Normalising flows are nearly exclusively trained using maximum likelihood. We here (Sec. 2.1) review strengths and weaknesses of that training approach: how it may suffer from low statistical robustness and how that affects typical machine-learning pipelines. We then (Sec. 2.2) discuss prior work leveraging robust statistics for deep learning.
Maximum likelihood estimation (MLE) is the gold standard for parameter estimation in parametric models, both in discriminative deep learning and for many generative models such as normalising flows. The popularity of MLE is grounded in several appealing theoretical properties. Most importantly, MLE is consistent and asymptotically efficient under mild assumptions (Daniels, 1961). Consistency means that, if the true data-generating distribution is a member of the parametric family we are using, the MLE will converge on that distribution in probability. Asymptotic efficiency adds that, as the amount of data gets large, the statistical uncertainty in the parameter estimate will furthermore be as small as possible; no other consistent estimator can do better.

Unfortunately, MLE can easily get into trouble in the important case of misspecified models (when the true data distribution is not part of the parametric family we are fitting). In particular, MLE is not always robust to outliers: since ln 0 = −∞, outlying datapoints that are not explained well by a model (i.e., have near-zero probability) can have an unbounded effect on the log-likelihood and the parameter estimates found by maximising it. As a result, MLE is sensitive to training and testing data that does not fit the model assumptions, and may generalise poorly in these cases. (While many practitioners informally equate outliers with errors, the treatment in this paper is deliberately agnostic to the origin of these observations. After all, it does not matter whether outliers are simple errors, represent uncommon but genuine behaviours of the data-generating process, or comprise deliberate corruptions injected by an adversary – as long as the outlying point is in the data, its mathematical effect on our model will be the same.)

As misspecification is ubiquitous in practical applications, many steps in traditional machine-learning and data-science pipelines can be seen as workarounds that mitigate the impact of outliers before, during, and after training. For example, careful data gathering and cleaning to prevent and exclude idiosyncratic examples prior to training is considered best practice. Seeing that encountering poorly explained, low-probability datapoints can lead to large gradients that destabilise minibatch optimisation, various forms of gradient clipping are commonplace in practical machine learning. This caps the degree of influence any given example can have on the learned model. The downside is that clipped-gradient minimisation is not consistent: since the true optimum fit sits where the average of the loss-function gradients over the data is zero, changing these gradients means that we will converge on a different optimum in general. Finally, since misspecification tends to inflate the entropy of MLE-fitted probabilistic models (Lucas et al., 2019), it is common practice to artificially reduce the entropy of samples at synthesis time for more subjectively pleasing output; cf. Kingma & Dhariwal (2018); Brock et al. (2019); Henter & Kleijn (2016). The goal of this paper is to describe a more principled approach, rooted in robust statistics, to reducing the sensitivity to outliers in normalising flows.
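To make the gradient-clipping workaround concrete, the sketch below shows a minimal, hypothetical maximum-likelihood training step with gradient-norm clipping in PyTorch. The surrounding model, optimiser, and the clipping threshold are placeholder assumptions of ours for illustration only; the snippet simply demonstrates the standard mechanism that caps the influence of any single minibatch.

import torch

def clipped_training_step(flow, batch, optimiser, max_grad_norm=100.0):
    """One maximum-likelihood step with gradient-norm clipping.

    `flow` is assumed to expose a log_prob(batch) method, as most flow
    implementations do; everything else is standard PyTorch.
    """
    optimiser.zero_grad()
    nll = -flow.log_prob(batch).mean()   # negative log-likelihood loss
    nll.backward()
    # Rescale the gradient so its total norm never exceeds max_grad_norm.
    # This bounds the influence of any single minibatch, but it also biases
    # the stationary point of the optimisation, so the result is no longer
    # a maximum-likelihood estimate in general.
    torch.nn.utils.clip_grad_norm_(flow.parameters(), max_grad_norm)
    optimiser.step()
    return nll.item()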
Robust statistics, and in particular influence functions (Sec. 3), have seen a number of different uses in deep learning, such as explaining neural-network decisions (Koh & Liang, 2017) and subsampling large datasets (Ting & Brochu, 2018). In this work, however, we specifically consider statistical robustness in learning probabilistic models, following Hampel et al. (1986); Huber & Ronchetti (2009). This process can be made more robust in two ways: either by changing the parametric family or by changing the fitting principle. Both approaches have been used in deep learning before. Generative adversarial networks have been adapted to minimise a variety of divergence measures between the model and data distributions (Nowozin et al., 2016; Arjovsky et al., 2017), some of which amount to statistically robust fitting principles, but they are notoriously fickle to train in practice (Lucic et al., 2018). Henter et al. (2016) instead proposed using the β-divergence to fit models used in speech synthesis, demonstrating a large improvement when training on found data. This approach does not require the use of an adversary. However, the general idea of changing the fitting principle is unattractive with normalising flows, since maximum likelihood is the only strictly proper local scoring function (Huszár, 2013, p. 15). This essentially means that all consistent estimation methods not based on MLE take the form of integrals over the observation space. Such integrals are intractable to compute with the normalising flows commonly used today.

The contribution of this paper is instead to robustify flow-based models by changing the parametric family of the distributions we fit to have fatter tails than the conventional Gaussians. Since we still use maximum likelihood for estimation, consistency is assured. This approach has been used to solve inverse problems in stochastic optimisation (Aravkin et al., 2012) and to improve the quality of Google's production text-to-speech systems (Zen et al., 2016). Recently, Jaini et al. (2019) showed that nearly all conventional normalising flows with a Gaussian base are unsuitable for modelling inherently heavy-tailed distributions. However, they do not consider the greater advantages of changing the tail probabilities of the base distribution through the lens of robustness, which extend to data that (like much of the data in our experiments) need not have fat or heavy tails.

While there are flow-based models with non-Gaussian base distributions, such as uniform distributions (Müller et al., 2019) or GMMs (Izmailov et al., 2020; Atanov et al., 2019), these do not have fat tails. To the best of our knowledge, our work represents the first practical exploration of statistical robustness with fat-tailed distributions in normalising flows.
3. Method
This section provides a mathematical analysis of MLE robustness, leading into our proposed solution in Sec. 3.1. Our overarching goal is to mitigate the impact of outliers in training and test data using robust statistics. We specifically choose to focus on the notion of resistant statistics, which are estimators that do not break down under adversarial perturbation of a fraction of the data (arbitrary corruptions only have a bounded effect). For example, among methods for estimating location parameters of distributions, the sample mean is not resistant: by adversarially replacing just a single datapoint in the sample mean, we can make the estimator equal any value we want and make its norm go to infinity. The median, in contrast, is resistant to up to 50% of the data being corrupted.

Figure 1: Functions of the normal (dashed), Laplace (dotted), and Student's t distributions (solid) with mean 0 and variance 1. (a) Probability density functions p(x). (b) Penalty functions ρ(ε). (c) Influence functions ψ(ε). [The t curves shown are for ν = 4 and ν = 15.]

Informally, being resistant means that we allow the model to "give up" on explaining certain examples, in order to better fit the remainder of the data. This behaviour can be understood through influence functions (Hampel et al., 1986). In the special case of maximum-likelihood estimation of location parameters µ, we first define the penalty function ρ(ε) as the negative log-likelihood (NLL) loss as a function of ε = x − µ, offset vertically such that ρ(0) = 0. The influence function ψ(ε) is then just the gradient of ρ with respect to ε. Fig. 1 graphs a number of different distributions in 1D, along with the associated penalty and influence functions. For the Gaussian distribution with fixed scale, the penalty function is the squared error. The resulting ψ(ε) is a linear function of ε, as plotted in Fig. 1c, meaning that the extent of the influence of any single outlying datapoint can grow arbitrarily large – the estimator is not resistant. Consequently, using maximum likelihood to fit distributions with Gaussian tails is not statistically robust.

The impact of outliers can be reduced by fitting probability distributions with fatter tails. One example is the Laplace distribution, whose density decays exponentially with the distance from the midpoint µ; see Fig. 1 for plots. The associated penalty is the absolute error ρ(ε) = ‖ε‖₁. This is minimised by the median, which is resistant to adversarial corruptions. The Laplacian influence function in the figure is seen to be a step function and thus remains bounded everywhere, confirming that the median is resistant. This is similar to the effect of gradient clipping in that the influence of outliers can never exceed a certain maximal magnitude.
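To illustrate Fig. 1 numerically, the short sketch below evaluates the location-parameter penalty and influence functions of the three distributions. The closed-form expressions in the comments follow directly from the respective negative log-densities; for simplicity the sketch uses unit-scale (rather than unit-variance, as in Fig. 1) versions of the distributions, and the grid and ν value are our own illustrative choices.

import numpy as np

eps = np.linspace(-8.0, 8.0, 1001)   # deviation from the location parameter
nu = 4.0                             # degrees of freedom for Student's t

# Penalty functions rho(eps): NLL as a function of eps, shifted so rho(0) = 0.
rho_gauss = 0.5 * eps**2                              # Gaussian (unit scale)
rho_laplace = np.abs(eps)                             # Laplace (unit scale)
rho_t = 0.5 * (nu + 1.0) * np.log1p(eps**2 / nu)      # Student's t

# Influence functions psi(eps) = d rho / d eps.
psi_gauss = eps                                # unbounded: not resistant
psi_laplace = np.sign(eps)                     # bounded, like gradient clipping
psi_t = (nu + 1.0) * eps / (nu + eps**2)       # redescending: -> 0 for large |eps|

# Gross outliers have vanishing influence under the Student's t penalty:
print(psi_gauss[-1], psi_laplace[-1], psi_t[-1])   # approx. 8.0, 1.0, 0.59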
Define a flow as a parametric family of densities {X = f_θ(Z)}_θ, where f_θ is an invertible transformation that depends on the parameters θ ∈ Θ and Z is a fixed base distribution. Our general proposal is to gain statistical robustness in this model by replacing the traditional multivariate normal base distribution by a distribution with a bounded influence function. Our specific proposal (studied in detail in our experiments) is to replace Z by a multivariate t-distribution, t_ν(µ, Σ), building on Lange et al. (1989). The use of multivariate t-distributions in flows was studied theoretically but not empirically by Jaini et al. (2019) for the special case of triangular flows on inherently heavy-tailed data. (Recent follow-up work by Jaini et al. (2020), appearing concurrently with our paper, does contain empirical studies of the effect of t_ν base distributions on the tail properties of flows.)

The pdf of the t_ν-distribution in D dimensions is
\[
p_t(\mathbf{x};\, \boldsymbol{\mu}, \boldsymbol{\Sigma}, \nu)
= \Gamma\!\Big(\tfrac{\nu+D}{2}\Big)\Big(\Gamma\!\big(\tfrac{\nu}{2}\big)\Big)^{-1}
\lvert \nu \pi \boldsymbol{\Sigma} \rvert^{-\frac{1}{2}}
\Big(1 + \tfrac{1}{\nu} (\mathbf{x}-\boldsymbol{\mu})^{\top} \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\Big)^{-\frac{\nu+D}{2}},
\tag{1}
\]
where the scalar ν > 0 is called the degrees of freedom. We see in Fig. 1 that this leads to a nonconvex penalty function and, importantly, to an influence function that approaches zero for large deviations. This is known as a redescending influence function, and means that outliers not only have a bounded impact in general (like for the absolute error or gradient clipping), but that gross outliers furthermore will be effectively ignored by the model. Since the density asymptotically decays polynomially (i.e., slower than exponentially), we say that it has fat tails. Seeing that the (inverse) transformation f_θ⁻¹ now no longer turns the observation distribution X into a normal (Gaussian) distribution, we propose to call these models Studentising flows.

As our proposal is based on MLE, we retain both consistency and efficiency in the absence of misspecification. In the face of outlying observations, our approach degrades gracefully, in contrast to distributions having, e.g., Gaussian tails. As we only change the base distribution, our proposal can be combined with any invertible transformation, network architecture, and optimiser to model distributions on R^D. It can also be used with conditional invertible transformations in order to describe conditional probability distributions. Since the tails of t_ν(µ, Σ) get slimmer as ν increases, we can tune the degree of robustness of the approach by changing this parameter of the distribution. In fact, the distribution converges on the multivariate normal N(µ, Σ) in the limit ν → ∞. Sampling from the t_ν-distribution can be done by drawing a sample from a multivariate Gaussian and then scaling it on the basis of a sample from the scalar χ²_ν-distribution; see Kotz & Nadarajah (2004).
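The sampling recipe just described can be written in a few lines. The sketch below is our own illustration (zero mean, identity Σ, and an arbitrary ν = 10): it draws multivariate t_ν samples by dividing Gaussian draws by the square root of scaled χ²_ν variates, and checks that the per-dimension sample variance approaches ν/(ν − 2), as expected for the t-distribution.

import numpy as np

rng = np.random.default_rng(0)
nu, D, N = 10.0, 3, 200_000

# Multivariate t_nu(0, I) samples: Gaussian draws scaled by sqrt(nu / chi^2_nu).
z = rng.standard_normal((N, D))           # z ~ N(0, I)
u = rng.chisquare(nu, size=(N, 1))        # u ~ chi^2_nu, one variate per sample
x = z / np.sqrt(u / nu)                   # x ~ t_nu(0, I)

# For nu > 2, the per-dimension variance of t_nu(0, I) is nu / (nu - 2) = 1.25 here.
print(x.var(axis=0))                      # approx. [1.25, 1.25, 1.25]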
4. Experiments
In this section we demonstrate empirically the advantages of fat-tailed base distributions in normalising flows, both in terms of training stability and of improved generalisation.
Our initial experiments considered unconditional models of (uniformly dequantised) image data using Glow (Kingma & Dhariwal, 2018). Specifically, we used the benchmark code from Durkan et al. (2019) trained using Adam (Kingma & Ba, 2015). Implementing t_ν-distributions for the base required just 20 lines of Python code; see Appendix B.

Figure 2: Training loss (NLL) on CelebA data. [Curves: t(ν = 50) at lr = 1e-3; Gaussian without gradient clipping at lr = 1e-4; Gaussian with gradient clipping at lr = 1e-3; Gaussian without gradient clipping at lr = 5e-4.]

First we investigated training stability on the CelebA faces dataset (Liu et al., 2015). We used the benchmark distributed by Durkan et al. (2019), which considers downscaled images to reduce computational demands. Our model and training hyperparameters were closely based on those used in the Glow paper, setting K = 32 and L = 3 like for the smaller architectures in that article. We found that, without gradient clipping, training Glow on CelebA required low learning rates to remain stable. As seen in Fig. 2, training with learning rate lr = 1e-4 was stable, but training with higher learning rates lr ≥ 5e-4 did not converge. Clipping the gradient norm at 100, or our more principled approach of changing the base to a multivariate t_ν-distribution (with ν = 50), both enabled successful training at lr = 1e-3. (It is also possible to treat ν as a learnable model parameter rather than a fixed or hand-tuned hyperparameter, but this procedure is not theoretically robust to gross outliers (Lucas, 1997).) We also reached better log-likelihoods on held-out data than the model trained with the low learning rate (see Fig. 5 in Appendix A), even though the primary goal of this experiment was not necessarily to demonstrate better generalisation.
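As an aside on preprocessing, the uniform dequantisation mentioned above simply adds uniform noise to the integer pixel intensities so that a continuous density model applies. The minimal sketch below is our own illustration; the exact preprocessing in the benchmark code may differ in details such as scaling.

import torch

def uniform_dequantise(images_uint8: torch.Tensor) -> torch.Tensor:
    """Map 8-bit images in {0, ..., 255} to continuous values in [0, 1).

    Adding U[0, 1) noise to the integer intensities makes the data
    continuous, so that maximum likelihood under a density model
    (such as a normalising flow) is well defined.
    """
    x = images_uint8.to(torch.float32)
    return (x + torch.rand_like(x)) / 256.0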
Table 1: Test-set NLL losses on MNIST with and without outliers inserted from CIFAR-10. ∆-values are w.r.t. the corresponding Gaussian alternative (ν = ∞).

                   Test: Clean                      Test: 1% outliers
Train        ν = ∞     20      50     1000      ν = ∞     20      50     1000
Clean  NLL    1.16    1.13    1.13    1.17       1.63    1.27    1.26    1.31
       ∆       –     −0.03   −0.03   +0.01        –     −0.36   −0.37   −0.32

Next we performed experiments on the widely used MNIST dataset (LeCun et al., 1998) to investigate the effect of outliers on generalisation. Since pixel intensities are bounded, image data in general does not have asymptotically fat tails. But while MNIST is considered a quite clean dataset, we can deliberately corrupt training and/or test data by inserting greyscale-converted examples from CIFAR-10 (Krizhevsky, 2009), which contains natural images that are much more diverse than the handwritten digits of MNIST. We randomly partitioned MNIST into training, validation, and test sets (80/10/10 split), and considered four combinations of either clean or corrupted (1% CIFAR) test and/or train+val data. We trained (60k steps) and tested normalising as well as Studentising flows on the four combinations, using the same learning-rate schedule (cosine decay) and hyperparameters (K = 10, L = 3), and clipping the gradients for the normalising flows only. This produced the negative log-likelihood values listed in Table 1. We see that, for each configuration, the proposed method performed similarly to or better than the conventional setup using Gaussian base distributions. The generalisation behaviour of t_ν-distributions was not sensitive to the parameter ν, although very high values of ν behaved more like the conventional normalising flow, as expected. While in most cases the improvements were relatively minor, Studentising flows generalised much better to the case where the test data displayed behaviours not seen during training.
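A corrupted training set of this kind can be assembled with standard torchvision tools. The sketch below is a hypothetical illustration: the resizing of the 32×32 CIFAR-10 images to 28×28 and the exact sampling of the 1% outlier subset are our assumptions, not details taken from the experiments above.

import torch
from torch.utils.data import ConcatDataset, Subset
from torchvision import datasets, transforms

# MNIST digits as 28x28 greyscale tensors.
mnist_tf = transforms.ToTensor()
mnist = datasets.MNIST(root="data", train=True, download=True, transform=mnist_tf)

# CIFAR-10 natural images, converted to greyscale and resized to match MNIST
# (assumption: the paper does not state how the resolutions were reconciled).
cifar_tf = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),
    transforms.Resize((28, 28)),
    transforms.ToTensor(),
])
cifar = datasets.CIFAR10(root="data", train=True, download=True, transform=cifar_tf)

# Insert outliers amounting to 1% of the MNIST training-set size.
num_outliers = len(mnist) // 100
outlier_idx = torch.randperm(len(cifar))[:num_outliers].tolist()
corrupted_train = ConcatDataset([mnist, Subset(cifar, outlier_idx)])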
Last, we studied a domain where normalising flows constitute the current state of the art, namely conditional probabilistic motion modelling as in Henter et al. (2019); Alexanderson et al. (2020). These models resemble the VideoFlow model of Kumar et al. (2020), but also include recurrence and an external control signal. The models give compelling visual results, but have been found to overfit significantly in terms of the log-likelihood on held-out data. This reflects a well-known disagreement between likelihood and subjective impressions; see, e.g., Theis et al. (2016): humans are much more sensitive to the presence of unnatural output examples than they are to mode dropping, where models do not represent all possibilities the data can show. Non-robust approaches (which cannot "give up" on explaining even a single observation), on the other hand, suffer significant likelihood penalties upon encountering unexpected examples in held-out data; cf. Table 1. Having methods whose generalisation performance better reflects subjective output quality would be beneficial, e.g., when tuning generative models.

Figure 3: Training and validation losses on locomotion data. [Curves: training and validation loss for the Gaussian and the t(ν = 50) base.]

Figure 4: Training and validation losses on gesture data. [Curves: training and validation loss for the Gaussian and the t(ν = 50) base.]

We considered two tasks: locomotion generation with path-based control and speech-driven gesture generation.
For locomotion synthesis, the input is a sequence of delta translations and headings specifying a motion path along the ground, and the output is a sequence of body poses (3D joint positions) that animate human locomotion along that path. For gesture synthesis, the input is a sequence of speech-derived acoustic features and the output is a sequence of body poses (joint angles) of a character gesticulating and changing stance to the speech. In both cases, the aim is to use motion-capture data to learn to animate plausible motion that agrees with the input signal. See Appendix A for still images and additional information about the data.

For the gesture task we used the same model and parameters as system FB-U in Alexanderson et al. (2020). For the locomotion task, we found that additional tuning of the MG model from Henter et al. (2019) could maintain the same visual quality while reducing training time and improving performance on held-out data. Specifically, we applied a Noam learning-rate scheduler (Vaswani et al., 2017; sketched below), set data dropout to 0.75, and changed the recurrent network from an LSTM to a GRU.

Learning curves for the two tasks are illustrated in Figs. 3 and 4 and show similar trends. Under a Gaussian base distribution, the loss on training data decreases, while the NLL on held-out data begins to rise steeply early on during training. This is subjectively misleading, since the perceived quality of randomly sampled output motion generally keeps improving throughout training. We note that these normalising flows were trained with gradient clipping (both of the norm and of individual elements), and the smooth shape of the curves around the local optimum makes it clear that training instability is not a factor in the poor performance. (We have been able to replicate similarly shaped learning curves on CelebA by changing the balance to 20% training data and 80% validation data (see Fig. 6 in Appendix A), suggesting that the root cause of this divergent behaviour is an amount of training data that is too small to adequately sample the full diversity of natural behaviour, leading to a poor model of held-out material. This is despite the fact that the motion databases used for these experiments are among the largest currently available for public use. In classification, Recht et al. (2019) recently highlighted similar issues of poor generalisation on new data from the same source.)

Using the same models and training setups but with our proposed t_ν-distribution (ν = 50) for the base has essentially no effect on the training loss but brings the validation curves much closer to the training curves. The validation loss is also in significantly less disagreement with subjective impressions of the quality of random motion samples under held-out control inputs. While these plots only show the first 30k training steps, the same trends continue over the full 80k+ steps we trained, with the validation losses of normalising flows diverging linearly while those of Studentising flows quickly saturate; see Fig. 8 in Appendix A.
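The Noam learning-rate schedule referred to above follows the warm-up-then-decay rule of Vaswani et al. (2017). The sketch below implements it with a standard PyTorch LambdaLR wrapper; the warm-up length, peak scale, and placeholder model/optimiser are illustrative assumptions of ours, not the settings used in the experiments.

import torch

def noam_lambda(warmup_steps: int):
    """Noam schedule (Vaswani et al., 2017): linear warm-up followed by
    decay proportional to the inverse square root of the step number."""
    def schedule(step: int) -> float:
        step = max(step, 1)  # avoid division by zero at step 0
        return min(step ** -0.5, step * warmup_steps ** -1.5)
    return schedule

# Placeholder model and optimiser, purely for illustration.
model = torch.nn.Linear(10, 10)
optimiser = torch.optim.Adam(model.parameters(), lr=1.0)  # peak scale folded into lr
scheduler = torch.optim.lr_scheduler.LambdaLR(optimiser, lr_lambda=noam_lambda(4000))

for step in range(10):
    optimiser.step()    # normally preceded by forward/backward on a minibatch
    scheduler.step()    # updates the learning rate according to the schedule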
5. Conclusion
We have proposed fattening the tails of the base (latent) distributions in flow-based models. This leads to a modelling approach that is statistically robust: it remains consistent and efficient in the absence of model misspecification while degrading gracefully when data and model do not match. We have argued that many heuristic steps in standard machine-learning pipelines, including the practice of gradient clipping during optimisation, can be seen as workarounds for core modelling approaches that lack robustness. Our experimental results demonstrate that changing to a fat-tailed base distribution 1) provides a principled way to stabilise training, similar to what gradient clipping does, and 2) improves generalisation, both by reducing the mismatch between training and validation loss and by improving the log-likelihood of held-out data in absolute terms. These improvements are observed for well-tuned models on datasets both with and without obviously extreme observations. We expect the improvements due to increased robustness to be of interest to practitioners in a wide range of applications.
Acknowledgement
This work was supported by the Swedish Research Council proj. 2018-05409 (StyleBot) and by the Wallenberg AI, Autonomous Systems and Software Program (WASP) of the Knut and Alice Wallenberg Foundation, Sweden.
References
Alexanderson, S., Henter, G. E., Kucherenko, T., and Beskow, J. Style-controllable speech-driven gesture synthesis using normalising flows. Comput. Graph. Forum, 39(2):487–496, 2020.

Aravkin, A., Friedlander, M. P., Herrmann, F. J., and van Leeuwen, T. Robust inversion, dimensionality reduction, and randomized sampling. Math. Program., Ser. B, 134:101–125, 2012.

Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In Proc. ICML, pp. 214–223, 2017.

Atanov, A., Volokhova, A., Ashukha, A., Sosnovik, I., and Vetrov, D. Semi-conditional normalizing flows for semi-supervised learning. In Proc. INNF, 2019.

Bishop, C. M. Mixture density networks. Technical Report NCRG/94/004, Aston University, Birmingham, UK, 1994.

Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In Proc. ICLR, 2019.

CMU Graphics Lab. Carnegie Mellon University motion capture database. http://mocap.cs.cmu.edu/, 2003.

Daniels, H. E. The asymptotic efficiency of a maximum likelihood estimator. In Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pp. 151–163. University of California Press, Berkeley, 1961.

Durkan, C., Bekasov, A., Murray, I., and Papamakarios, G. Neural spline flows. In Proc. NeurIPS, pp. 7509–7520, 2019. https://github.com/bayesiains/nsf.

Ferstl, Y. and McDonnell, R. Investigating the use of recurrent motion modelling for speech gesture generation. In Proc. IVA, pp. 93–98, 2018. http://trinityspeechgesture.scss.tcd.ie.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Proc. NIPS, pp. 2672–2680, 2014.

Habibie, I., Holden, D., Schwarz, J., Yearsley, J., and Komura, T. A recurrent variational autoencoder for human motion synthesis. In Proc. BMVC, pp. 119.1–119.12, 2017.

Hampel, F. R., Ronchetti, E. M., Rousseeuw, P. J., and Stahel, W. A. Robust Statistics: The Approach Based on Influence Functions. John Wiley & Sons, Inc., 1986.

Henter, G. E. and Kleijn, W. B. Minimum entropy rate simplification of stochastic processes. IEEE T. Pattern Anal., 38(12):2487–2500, 2016.

Henter, G. E., Ronanki, S., Watts, O., Wester, M., Wu, Z., and King, S. Robust TTS duration modelling using DNNs. In Proc. ICASSP, pp. 5130–5134, 2016.

Henter, G. E., Alexanderson, S., and Beskow, J. MoGlow: Probabilistic and controllable motion synthesis using normalising flows. arXiv preprint arXiv:1905.06598, 2019.

Huber, P. J. and Ronchetti, E. M. Robust Statistics. John Wiley & Sons, Inc., 2nd edition, 2009.

Huszár, F. Scoring rules, Divergences and Information in Bayesian Machine Learning. PhD thesis, University of Cambridge, Cambridge, UK, 2013. https://github.com/fhuszar/thesis.

Izmailov, P., Kirichenko, P., Finzi, M., and Wilson, A. G. Semi-supervised learning with normalizing flows. In Proc. ICML, 2020.

Jaini, P., Kobyzev, I., Brubaker, M., and Yu, Y. Tails of triangular flows. arXiv preprint arXiv:1907.04481v1, 2019.

Jaini, P., Kobyzev, I., Yu, Y., and Brubaker, M. Tails of Lipschitz triangular flows. In Proc. ICML, 2020.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Proc. ICLR, 2015.

Kingma, D. P. and Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. In Proc. NeurIPS, pp. 10236–10245, 2018.

Koh, P. W. and Liang, P. Understanding black-box predictions via influence functions. In Proc. ICML, pp. 1885–1894, 2017.

Kotz, S. and Nadarajah, S. Multivariate t Distributions and Their Applications. Cambridge University Press, 2004.

Krizhevsky, A. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. https://www.cs.toronto.edu/~kriz/cifar.html.

Kumar, M., Babaeizadeh, M., Erhan, D., Finn, C., Levine, S., Dinh, L., and Kingma, D. VideoFlow: A conditional flow-based model for stochastic video generation. In Proc. ICLR, 2020.

Lange, K. L., Little, R. J. A., and Taylor, J. M. G. Robust statistical modeling using the t distribution. J. Am. Stat. Assoc., 84(408):881–896, 1989.

LeCun, Y., Cortes, C., and Burges, C. J. C. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Liu, Z., Luo, P., Wang, X., and Tang, X. Deep learning face attributes in the wild. In Proc. ICCV, 2015. http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html.

Lucas, A. Robustness of the Student t based M-estimator. Commun. Statist.–Theory Meth., 26(5):1165–1182, 1997.

Lucas, T., Shmelkov, K., Alahari, K., Schmid, C., and Verbeek, J. Adaptive density estimation for generative models. In Proc. NeurIPS, pp. 11993–12003, 2019.

Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are GANs created equal? A large-scale study. In Proc. NeurIPS, pp. 700–709, 2018.

Mohamed, S. and Lakshminarayanan, B. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.

Müller, M., Röder, T., Clausen, M., Eberhardt, B., Krüger, B., and Weber, A. Documentation mocap database HDM05. Technical Report CG-2007-2, Universität Bonn, Bonn, Germany, 2007. http://resources.mpi-inf.mpg.de/HDM05/.

Müller, T., Mcwilliams, B., Rousselle, F., Gross, M., and Novák, J. Neural importance sampling. ACM Trans. Graph., 38(5):145:1–145:19, 2019.

Nowozin, S., Cseke, B., and Tomioka, R. f-GAN: Training generative neural samplers using variational divergence minimization. In Proc. NIPS, pp. 271–279, 2016.

Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed, S., and Lakshminarayanan, B. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do ImageNet classifiers generalize to ImageNet? In Proc. ICML, pp. 5389–5400, 2019.

Theis, L., van den Oord, A., and Bethge, M. A note on the evaluation of generative models. In Proc. ICLR, 2016.

Ting, D. and Brochu, E. Optimal subsampling with influence functions. In Proc. NeurIPS, pp. 3650–3659, 2018.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Proc. NIPS, pp. 5998–6008, 2017.

Zen, H., Agiomyrgiannakis, Y., Egberts, N., Henderson, F., and Szczepaniak, P. Fast, compact, and high quality LSTM-RNN based statistical parametric speech synthesizers for mobile devices. In Proc. Interspeech, pp. 2273–2277, 2016.

Figure 5: Validation loss on CelebA with different setups. [Curves: t(ν = 50) at lr = 1e-3; Gaussian without gradient clipping at lr = 1e-4; Gaussian with gradient clipping at lr = 1e-3.]
Figure 6: Validation loss on CelebA with different splits. [Curves: 99%–1% split; 20%–80% split.]
A. Additional information on data and results
Fig. 5 reports the validation-set performance over 100k steps of training for the three stable systems from Fig. 2. We see that the systems trained with the higher learning rate gave noticeably better generalisation performance.

We also performed an experiment on CelebA to see the effect of reduced training-data size on generalisation. In particular, we tried making the training set significantly smaller than before (going from 99% to 20% of the database), while making the validation set much larger (from 1% to 80% of the database) in order to sample the full diversity of the material well. Fig. 6 shows learning curves on the CelebA data with Gaussian base distributions before and after shifting the balance between training and held-out data. We see that, while the validation loss originally decreased monotonically, the loss after changing dataset sizes instead reaches an optimum early on in training and then begins to rise significantly again, reminiscent of the validation curves seen in Sec. 4.2. We conclude that the unusually large generalisation gap on the motion data can at least in part be attributed to the size of the database relative to the complexity of the task.

The two motion-data modelling tasks we considered in Sec. 4.2, namely path-based locomotion control and speech-driven gesture generation, have applications in areas such as animation, computer games, embodied agents, and social robots. For the locomotion data, we used the Edinburgh locomotion MOCAP database (Habibie et al., 2017) pooled with the locomotion trials from the CMU (CMU Graphics Lab, 2003) and HDM05 (Müller et al., 2007) motion-capture databases. Each frame in the data had an output dimensionality of D = 63. Gesture-generation models, meanwhile, were trained on the Trinity Gesture Dataset collected by Ferstl & McDonnell (2018), which is a large database of joint speech and gestures. Each output frame had D = 65 dimensions. Fig. 7 shows still images from representative visualisations of the two tasks. Like for image data, the numerical range of these motion datasets is bounded in practice (e.g., by the finite length of human bones coupled with the body-centric coordinate systems used in Henter et al. (2019)), and the data is not known to contain any numerically extreme observations.

Figure 7: Snapshots visualising the motion data used. (a) Locomotion with control path. (b) Gesticulating avatar.

Figure 8: Validation-loss curves of extended training. (a) Locomotion modelling task. (b) Gesture modelling task.

Fig. 8 illustrates the point from the end of Sec. 4.2 regarding the growing gap between normalising and Studentising flows over the course of the entire training. We see that the held-out loss of the former diverges essentially linearly, while the proposed method shows saturating behaviour.