Arbitrary Conditional Distributions with Energy
Ryan R. Strauss, Junier B. Oliva
Department of Computer Science, The University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA. Correspondence to: Ryan R. Strauss <[email protected]>. Code available at https://github.com/lupalab/ace.

Abstract
Modeling distributions of covariates, or density estimation, is a core challenge in unsupervised learning. However, the majority of work only considers the joint distribution, which has limited relevance to practical situations. A more general and useful problem is arbitrary conditional density estimation, which aims to model any possible conditional distribution over a set of covariates, reflecting the more realistic setting of inference based on prior knowledge. We propose a novel method, Arbitrary Conditioning with Energy (ACE), that can simultaneously estimate the distribution $p(x_u \mid x_o)$ for all possible subsets of features $x_u$ and $x_o$. ACE uses an energy function to specify densities, bypassing the architectural restrictions imposed by alternative methods and the biases imposed by tractable parametric distributions. We also simplify the learning problem by only learning one-dimensional conditionals, from which more complex distributions can be recovered during inference. Empirically, we show that ACE achieves state-of-the-art performance for arbitrary conditional and marginal likelihood estimation and for tabular data imputation.
1. Introduction
The ability to model the world and make predictions plays a primary role in human intelligence (Hawkins & Blakeslee, 2004). To this end, the human brain both represents probability distributions and performs probabilistic inference (Pouget et al., 2013; Ma & Jazayeri, 2014; Pouget et al., 2016), allowing us to reason about the causal relationships between factors in our environment and anticipate unknown events.

Density estimation refers to methods that attempt to construct such probabilistic models from data. For example, given some random variables in the world, the goal is to learn their distribution. That distribution can then be used to make predictions about the state of the world. For example, an agent that has modeled the distribution of meteorological conditions (e.g., atmospheric pressure, temperature, humidity) can forecast the likelihood of rain.

The vast majority of work on density estimation focuses on the joint distribution $p(x)$ (Goodfellow et al., 2014; Dinh et al., 2016; Papamakarios et al., 2017; Grathwohl et al., 2018; Oliva et al., 2018; Nash & Durkan, 2019; Fakoor et al., 2020), i.e., the distribution of all variables taken together. While the joint distribution can be useful (e.g., what is the distribution of pixel configurations that represent human faces), it is limited in the types of predictions it can make. We are often more interested in conditional probabilities, which communicate the likelihood of an event given that some prior information is known. For example, given a patient's medical history and symptoms, a doctor determines the likelihoods of different illnesses and other patient attributes. Conditional distributions are often more practical since real-world decisions are nearly always informed by prior information.

However, we often do not know ahead of time which features will be known and which will be inferred. For example, not every patient will have had the same tests performed, i.e., have the same known features. A naïve approach therefore requires building an exponential number of models to cover all possible cases, which quickly becomes intractable. Thus, an intelligent system needs to understand the intricate conditional dependencies between all arbitrary subsets of covariates, and it must do so with a single model to be practical.

In this work, we consider the problem of learning the conditional distribution $p(x_u \mid x_o)$ for any arbitrary subsets of unobserved variables $x_u \in \mathbb{R}^{|u|}$ and observed variables $x_o \in \mathbb{R}^{|o|}$, where $u, o \subseteq \{1, \dots, d\}$ and $o \cap u = \emptyset$. We propose a method, Arbitrary Conditioning with Energy (ACE), that grapples with these exponentially many conditionals by using an unrestricted neural network and by modeling the simplest distributions possible. With a single trained model, ACE can assess any conditional distribution over any subset of random variables.

We develop ACE by decomposing the arbitrary conditioning problem into the estimation of one-dimensional conditional densities (with arbitrary observations). While this concept is a simple application of the chain rule of probability, it has yet to be thoroughly exploited for arbitrary conditioning.
During training, ACE estimates distributions of the form $p(x_{u'_i} \mid x_{o'})$, where $x_{u'_i}$ is a scalar. During inference, more complex distributions can then be recovered with an autoregressive decomposition: $p(x_u \mid x_o) = \prod_{i=1}^{|u|} p(x_{u'_i} \mid x_{o \cup u'_{<i}})$.
Figure 1. Proposal and normalized energy distributions produced by ACE for UCI datasets. Proposal distributions are mixtures of Gaussians. Each row shows the densities for a one-dimensional marginal, enabling us to show the data distribution as well. Boxes highlight regions where the energy distribution is notably better matched with the data than the proposal. Best viewed zoomed in.

We find that state-of-the-art arbitrary conditioning is obtainable with a simple scheme that uses mixtures of Gaussians and fully-connected networks.
2. Previous Work
Several methods have been previously proposed for arbitrary conditioning. Sum-Product Networks are specially designed to only contain sum and product operations and can produce arbitrary conditional or marginal likelihoods (Poon & Domingos, 2011; Butz et al., 2019). The Universal Marginalizer trains a neural network with a cross-entropy loss to approximate the marginal posterior distributions of all unobserved features conditioned on the observed ones (Douglas et al., 2017). VAEAC is an approach that extends a conditional variational autoencoder by only considering the latent codes of unobserved dimensions (Ivanov et al., 2018), and NeuralConditioner uses adversarial training to learn each conditional distribution (Belghazi et al., 2019). The current SOTA is ACFlow, which extends normalizing flow models to handle any subset of observed features (Li et al., 2020).

Unlike VAEAC and NeuralConditioner, ACE is able to estimate normalized likelihoods and is simpler to implement and train. While ACFlow can analytically produce normalized likelihoods and samples, it is restricted by a requirement that its network consist of bijective transformations with tractable Jacobians. Similarly, Sum-Product Networks have limited expressivity due to their constraints. ACE, on the other hand, exemplifies the appeal of energy-based methods, as it has no constraints on the parameterization of the energy function.
Energy-based methods have a wide range of applications within machine learning (LeCun et al., 2006), and recent work has studied their utility for density estimation. Deep energy estimator networks (Saremi et al., 2018) and Autoregressive Energy Machines (Nash & Durkan, 2019) are both energy-based models that perform density estimation, where the latter is able to produce normalized likelihoods. However, these methods are only able to estimate the joint distribution. To the best of our knowledge, ACE is the first energy-based method for arbitrary conditional density estimation.
The problem of imputing missing data has been well studied, and there are several approaches based on classic machine learning techniques such as k-nearest neighbors (Troyanskaya et al., 2001), random forests (Stekhoven & Bühlmann, 2012), and autoencoders (Gondara & Wang, 2018). More recent work has turned to deep generative models for imputation. GAIN is a generative adversarial network (GAN) that produces imputations with the generator and uses the discriminator to discern the imputed features (Yoon et al., 2018). Another GAN-based approach is MisGAN, which learns two generators to model the data and masks separately (Li et al., 2019). MIWAE adapts variational autoencoders by modifying the lower bound for missing data and produces imputations with importance sampling (Mattei & Frellsen, 2019). ACFlow, the current best arbitrary conditioning method, can also perform imputation and is SOTA for imputing data that are missing completely at random (MCAR) (Li et al., 2020).

While it is not always the case that data are missing at random, the opposite case (i.e., missingness that depends on unobserved features' values) can be much more challenging to deal with (Fielding et al., 2008). Like many data imputation methods, we focus on the scenario where data are missing completely at random, that is, where the likelihood of being missing is independent of the covariates' values.
3. Background
A probability density function (PDF) $p(x)$ outputs a nonnegative scalar for a given vector input $x \in \mathbb{R}^d$ and satisfies $\int p(x)\,dx = 1$. Such a function defines the distribution of a random variable $X$ and indicates the relative likelihood that $X$ has a certain value. Given a dataset $\mathcal{D} = \{x^{(i)}\}_{i=1}^N$ of i.i.d. samples drawn from an unknown distribution $p^*(x)$, the object of density estimation is to find a model that best approximates the function $p^*$. Modern approaches generally rely on neural networks to directly parameterize the approximated PDF.

Arbitrary conditional density estimation is a more general setting where we have a subset of features $o \subset \{1, \dots, d\}$ which are observed (i.e., features whose values are known) and a corresponding subset of unobserved features $u \subset \{1, \dots, d\}$ such that $o$ and $u$ do not intersect. We are then interested in modeling the density $p(x_u \mid x_o)$ for all possible subsets $o$ and $u$, where $x_o \in \mathbb{R}^{|o|}$ and $x_u \in \mathbb{R}^{|u|}$. The estimation of joint or marginal likelihoods is recovered when $o$ is the empty set.

Energy-based models capture dependencies between variables by assigning a nonnegative scalar energy to a given arrangement of those variables, where energies closer to zero indicate more desirable configurations (LeCun et al., 2006). Learning consists of finding an energy function that outputs low energies for correct values. We can frame density estimation as an energy-based problem by writing likelihoods as a Boltzmann distribution:

$$p(x) = \frac{e^{-E(x)}}{Z}, \tag{1}$$

where $E$ is the energy function, $e^{-E(x)}$ is the unnormalized likelihood, and $Z = \int e^{-E(x)}\,dx$ is the normalizing constant.

Energy-based models are appealing due to their relative simplicity and high flexibility in the choice of representation for the energy function. This is in contrast to other common approaches to density estimation such as normalizing flows (Dinh et al., 2016; Li et al., 2020), which require invertible transformations with Jacobians that can be computed efficiently. Energy functions are also naturally capable of representing non-smooth distributions with low-density regions or discontinuities.
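As a concrete illustration of Equation 1, the sketch below normalizes a hand-picked one-dimensional energy by brute-force quadrature; the quadratic energy is a toy stand-in for a learned network, and quadrature is only feasible here because $x$ is one-dimensional.

```python
import numpy as np

# Toy energy function standing in for a learned network: low energy near x = 1.
def energy(x):
    return 0.5 * (x - 1.0) ** 2

# Unnormalized likelihood e^{-E(x)} and a brute-force estimate of Z by
# quadrature (tractable only because x is one-dimensional).
grid = np.linspace(-10.0, 10.0, 10_001)
unnormalized = np.exp(-energy(grid))
Z = np.trapz(unnormalized, grid)

def pdf(x):
    return np.exp(-energy(x)) / Z  # Equation 1

print(pdf(1.0))                   # highest density at the energy minimum
print(np.trapz(pdf(grid), grid))  # ~1.0, i.e., properly normalized
```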
4. Arbitrary Conditioning with Energy
We are interested in approximating the probability density $p(x_u \mid x_o)$ for any arbitrary sets of unobserved features $x_u$ and observed features $x_o$. We approach this by decomposing likelihoods into products of one-dimensional conditionals, which makes the learned distributions much simpler. We adopt an energy-based approach, which is appealing as it affords a large degree of flexibility in modeling the exponentially many conditional distributions at hand: we are free to represent the energy function with an arbitrary, and highly expressive, neural network that directly outputs unnormalized likelihoods. Our main contribution is a method, Arbitrary Conditioning with Energy (ACE), for computing arbitrary conditional likelihoods with energies, one dimension at a time.

Estimating the normalizer $Z$ for general energy functions (see Equation 1) is intractable. To bypass this issue, we decompose the arbitrary conditioning task into one-dimensional arbitrary conditional estimation problems. This follows from the chain rule of probability, which allows us to write

$$p(x_u \mid x_o) = \prod_{i=1}^{|u|} p\left(x_{u'_i} \mid x_{o \cup u'_{<i}}\right), \tag{2}$$

where $u'$ is an arbitrary permutation of $u$ and $u'_{<i}$ denotes the first $i - 1$ elements of $u'$. Each one-dimensional conditional is itself represented as a Boltzmann distribution:

$$p(x_{u_i} \mid x_o) = \frac{e^{-E(x_{u_i};\, x_o)}}{Z_{u_i; x_o}}, \tag{3}$$

where $Z_{u_i; x_o} = \int e^{-E(x_{u_i};\, x_o)}\,dx_{u_i}$.
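Since the decomposition in Equation 2 is exact for any ordering, it can be checked numerically. The following sketch, a sanity check rather than part of ACE, verifies the chain rule on a two-dimensional Gaussian whose conditionals are known in closed form; the mean, covariance, and query point are arbitrary illustrative values.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

# Illustrative bivariate Gaussian with correlation 0.8.
mean = np.array([0.0, 0.0])
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
x = np.array([0.3, -0.5])

joint = multivariate_normal(mean, cov).logpdf(x)

# Chain rule: log p(x1, x2) = log p(x1) + log p(x2 | x1), where the
# conditional of a bivariate Gaussian is itself Gaussian.
log_p_x1 = norm(0.0, 1.0).logpdf(x[0])
cond_mean = 0.8 * x[0]
cond_std = np.sqrt(1.0 - 0.8 ** 2)
log_p_x2_given_x1 = norm(cond_mean, cond_std).logpdf(x[1])

print(np.isclose(joint, log_p_x1 + log_p_x2_given_x1))  # True
```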
We can use Equation 3 to compute normalized likelihoods, but only if the normalizing constant $Z_{u_i; x_o}$ is known. Directly computing the normalizer is intractable in general. However, we can get a sufficient estimate via importance sampling. Assuming access to a proposal distribution $q(x_{u_i} \mid x_o)$ which is reasonably well-matched with the target distribution, we approximate $Z_{u_i; x_o}$ as

$$Z_{u_i; x_o} = \int e^{-E(x_{u_i};\, x_o)}\,dx_{u_i} \tag{4}$$
$$= \int \frac{e^{-E(x_{u_i};\, x_o)}}{q(x_{u_i} \mid x_o)}\, q(x_{u_i} \mid x_o)\,dx_{u_i} \tag{5}$$
$$\approx \frac{1}{S} \sum_{s=1}^{S} \frac{e^{-E(x^{(s)}_{u_i};\, x_o)}}{q(x^{(s)}_{u_i} \mid x_o)}, \quad x^{(s)}_{u_i} \sim q(x_{u_i} \mid x_o). \tag{6}$$

In high dimensions, it is difficult to make accurate approximations. However, we have limited ourselves to only considering one-dimensional distributions, in which case sufficiently accurate estimates of $Z_{u_i; x_o}$ are within reach (Nash & Durkan, 2019). For some problems, we may have access to a good proposal distribution ahead of time. Otherwise, we can learn one in parallel with the energy network.

We learn the proposal distribution alongside the energy function by having a neural network output the parameters of a tractable parametric distribution. The proposal network accepts a concatenation of $b$ and $\phi(x_o; b)$ as input, and it outputs the parameters $\omega(u_i; x_o)$ of a mixture of Gaussians for each unobserved dimension $u_i$. The proposal network can optionally output a latent vector, $\gamma(u_i; x_o)$, for each unobserved dimension, which is used as input to the energy network in order to enable weight sharing.

We can then estimate the normalizing constants in Equation 3 using importance sampling:

$$\hat{Z}_{u_i; x_o} = \frac{1}{S} \sum_{s=1}^{S} \frac{e^{-E(x^{(s)}_{u_i};\, x_o;\, \gamma(u_i; x_o))}}{q(x^{(s)}_{u_i} \mid \omega(u_i; x_o))}, \tag{7}$$

where $x^{(s)}_{u_i}$ is sampled from $q(x_{u_i} \mid \omega(u_i; x_o))$. This in turn leads to the following approximation of the log-likelihood of $x_{u_i}$ given $x_o$:

$$\log p(x_{u_i} \mid x_o) \approx -E(x_{u_i};\, x_o;\, \gamma(u_i; x_o)) - \log \hat{Z}_{u_i; x_o}, \tag{8}$$

where we use abbreviated notation in the previous two equations and omit the bitmask $b$ for greater readability. Refer to Figure 2 for the precise inputs of each network.

Figure 2. Overview of the networks used in ACE. The plus symbol refers to concatenation. The symbol $\phi$ refers to the zero-imputing function depicted in Figure 3.

Figure 3. We use a bitmask $b$ and zero-imputing function $\phi(\cdot; b)$ to ensure network inputs always have the same shape, regardless of how many features are observed or unobserved. In the figure, shaded cells correspond to observed features.

Since Equation 2 gives us a way to autoregressively compute $p(x_u \mid x_o)$ as a chain of one-dimensional conditionals, we are only concerned with learning $p(x_{u_i} \mid x_o)$ for arbitrary $u_i$ and $x_o$. Thus, for a given data point $x$, we partition it into $x_o$ and $x_u$ and jointly optimize the proposal and energy networks with the maximum-likelihood objective

$$\mathcal{L}(x_o, x_u; \theta) = \sum_{i=1}^{|u|} \log p(x_{u_i} \mid x_o) + \sum_{i=1}^{|u|} \log q(x_{u_i} \mid x_o), \tag{9}$$

where $\theta$ holds the parameters of both the energy and proposal networks. We want to optimize the proposal distribution and energy function independently, i.e., the parameters of the proposal network should only be updated with respect to the $\log q$ terms in the loss. This is implemented by stopping gradients on proposal samples and proposal likelihoods before they are used in Equation 7.

Equation 9 is maximized with stochastic gradient ascent over a set of training data. We use a warm-up period at the beginning of training during which only the proposal network is optimized, so that importance sampling does not occur until the proposal is sufficiently similar to the target distribution. During training, we approximate normalizing constants with 20 importance samples from the proposal distribution.

In some cases, we found it useful to include a regularization term in the loss that penalizes the energy distribution for large deviations from the proposal distribution. We use the mean-square error (MSE) between the proposal likelihoods and energy likelihoods as a penalty, with gradients stopped on the proposal likelihoods in the error calculation. The coefficient of this term in the loss is a hyperparameter.
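The following is a minimal sketch of Equations 7 and 8, with a toy quadratic energy and a single Gaussian standing in for ACE's learned energy network and mixture-of-Gaussians proposal; for this energy the true normalizer is $\sqrt{2\pi}$, so the estimate can be checked directly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy scalar energy; its true normalizer is sqrt(2*pi).
def energy(x_ui):
    return 0.5 * x_ui ** 2

proposal_mean, proposal_std = 0.5, 1.5  # assumed (not learned) proposal
S = 20  # importance samples per dimension, as used during training

samples = rng.normal(proposal_mean, proposal_std, size=S)
log_q = (
    -0.5 * ((samples - proposal_mean) / proposal_std) ** 2
    - np.log(proposal_std * np.sqrt(2 * np.pi))
)
# Equation 7: Z_hat = (1/S) * sum_s exp(-E(x_s)) / q(x_s)
Z_hat = np.mean(np.exp(-energy(samples) - log_q))

# Equation 8: approximate normalized log-likelihood at a query point.
x_query = 0.2
log_p = -energy(x_query) - np.log(Z_hat)

print(Z_hat, np.sqrt(2 * np.pi))  # estimate vs. exact normalizer
print(log_p)
```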
4.4.1. Likelihoods

Recall that our model learns one-dimensional conditionals, and we are unable to directly compute $p(x_u \mid x_o)$. Rather, we can only evaluate individual dimensions: $p(x_{u_i} \mid x_o)$. Thus, to obtain a complete likelihood for $x_u$, we employ an autoregressive application of the chain rule (see Equation 2). The pseudocode for this procedure is presented in Algorithm 1. Importantly, the order in which each unobserved dimension is evaluated does not matter. Also, since the values of $x_u$ are known ahead of time, each one-dimensional conditional can be evaluated in parallel as a batch (i.e., the loop in Algorithm 1 can be expressed as a map-reduce operation), allowing the likelihood $p(x_u \mid x_o)$ to be computed efficiently.

Algorithm 1 ACE Likelihood Evaluation
  Input: $x_o$, $x_u$, $b$
  Set $x_{cur} = \phi(x_o; b)$ and $b_{cur} = b$
  Initialize $r = 0$
  Choose an arbitrary permutation $u'$ of $u$
  for $u'_i$ in $u'$ do
    Compute $\log p(x_{u'_i} \mid x_{cur})$ using Equation 8
    Set $r = r + \log p(x_{u'_i} \mid x_{cur})$
    Set $x_{cur}[u'_i] = x_{u'_i}$
    Set $b_{cur}[u'_i] = 1$
  end for
  Output: $r$, which contains $\log p(x_u \mid x_o)$

4.4.2. Samples

Sampling the proposal distribution can be performed in an autoregressive fashion where $x_{u_i}$ is sampled from $q(x_{u_i} \mid x_o)$ and then added to the observed set, at which point $x_{u_{i+1}}$ can be sampled. We do this until all unobserved features have been sampled. The pseudocode for this procedure is presented in Algorithm 2; a Python sketch follows below.

Algorithm 2 ACE Proposal Sampling
  Input: $x_o$, $b$, $u$
  Set $x_{cur} = \phi(x_o; b)$ and $b_{cur} = b$
  Choose an arbitrary permutation $u'$ of $u$
  for $u'_i$ in $u'$ do
    Sample $x_{u'_i} \sim q(x_{u'_i} \mid x_{cur}; b_{cur})$
    Set $x_{cur}[u'_i] = x_{u'_i}$
    Set $b_{cur}[u'_i] = 1$
  end for
  Output: $x_{cur}$, which contains the observed and imputed values

We also want to produce samples that come from the energy function. One drawback of energy-based models is that we are unable to analytically sample the learned distribution. However, there are several methods for obtaining approximate samples. We employ a modification of Algorithm 2 such that many proposal samples are drawn at each step, and a single sample is then chosen from that collection based on importance weights. As the number of samples goes to infinity, this is consistent with drawing samples from the energy distribution. The pseudocode for this procedure is presented in Algorithm 3.
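A minimal Python sketch of Algorithm 2, assuming a hypothetical `sample_proposal` stand-in for the learned proposal network; only the autoregressive control flow mirrors the pseudocode.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in: a real proposal network would condition on the
# current vector x_cur AND the bitmask b_cur, returning mixture parameters.
def sample_proposal(x_cur, b_cur, i):
    mean = 0.1 * x_cur.sum()  # toy dependence on currently filled values
    return rng.normal(mean, 1.0)

def proposal_sample(x_o, b):
    x_cur = x_o * b                   # zero-imputing function phi(x_o; b)
    b_cur = b.copy()
    u = np.flatnonzero(b == 0)        # unobserved indices
    for i in rng.permutation(u):      # arbitrary permutation of u
        x_cur[i] = sample_proposal(x_cur, b_cur, i)
        b_cur[i] = 1                  # treat the new value as observed
    return x_cur

x = np.array([1.2, 0.0, -0.7, 0.0])   # zeros mark unobserved slots
b = np.array([1.0, 0.0, 1.0, 0.0])
print(proposal_sample(x, b))
```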
4.4.3. Means

Sampling allows us to obtain multiple possible values for the unobserved features that are diverse and realistic. However, these are not always the primary goals. For example, in the case of data imputation, we may only want a single imputation that aims to minimize some measure of error (see Section 5.2). Thus, rather than imputing true samples, we might prefer to impute the mean of the learned distribution. In this case, we forego autoregression and directly obtain the mean of each distribution $p(x_{u_i} \mid x_o)$ with a single forward pass. Analytically computing the mean of the proposal distribution is straightforward since we are working with a mixture of Gaussians. We estimate the mean of the energy distribution via importance sampling:

$$\mathbb{E}[x_{u_i}] \approx \sum_{s=1}^{S} \frac{r_s}{\sum_j r_j}\, x^{(s)}_{u_i}, \tag{10}$$

where $x^{(s)}_{u_i}$ is sampled from $q(x_{u_i} \mid x_o)$ and

$$r_s = \frac{p(x^{(s)}_{u_i} \mid x_o)}{q(x^{(s)}_{u_i} \mid x_o)} \tag{11}$$

is the importance weight between the energy and proposal distributions.

Algorithm 3 ACE Energy Sampling
  Input: $x_o$, $b$, $u$, $N$
  Set $x_{cur} = \phi(x_o; b)$ and $b_{cur} = b$
  Choose an arbitrary permutation $u'$ of $u$
  for $u'_i$ in $u'$ do
    Draw samples $\{x^{(s)}_{u'_i}\}_{s=1}^{N}$ from $q(x_{u'_i} \mid x_{cur}; b_{cur})$
    Compute importance weights for the $N$ samples, as in Equation 6
    Draw $x_{u'_i}$ from the $N$ samples according to the importance weights
    Set $x_{cur}[u'_i] = x_{u'_i}$
    Set $b_{cur}[u'_i] = 1$
  end for
  Output: $x_{cur}$, which contains the observed and imputed values
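Equations 10 and 11 amount to self-normalized importance sampling, sketched below with a toy energy and Gaussian proposal in place of the learned networks; normalizing the weights lets the unknown constant $Z_{u_i; x_o}$ cancel, so the unnormalized likelihood $e^{-E}$ suffices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy energy whose distribution is N(2, 0.5^2), so the true mean is 2.
def energy(x):
    return 0.5 * ((x - 2.0) / 0.5) ** 2

mu_q, std_q = 1.0, 2.0  # assumed proposal parameters (broad Gaussian)
S = 20_000

xs = rng.normal(mu_q, std_q, size=S)
log_q = -0.5 * ((xs - mu_q) / std_q) ** 2 - np.log(std_q * np.sqrt(2 * np.pi))
log_w = -energy(xs) - log_q            # unnormalized log importance weights
w = np.exp(log_w - log_w.max())        # stabilize before normalizing
w /= w.sum()                           # self-normalization: Z cancels

print(np.sum(w * xs))                  # ~2.0, the energy-distribution mean
```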
5. Experiments
We evaluate ACE on real-valued tabular data. Specifically, we consider the benchmark UCI repository datasets listed in Table 1. The data is preprocessed as described by Papamakarios et al. (2017).

Unlike other approaches to density estimation that require particular network architectures (Germain et al., 2015; Dinh et al., 2016; Nash & Durkan, 2019; Li et al., 2020), ACE has no such restrictions. Thus, we use a simple fully-connected network with residual connections (He et al., 2016a;b) for both the energy network and the proposal network. This architecture is highly expressive, yet simple, and helps avoid adding unnecessary complexity to ACE.

A small amount of Gaussian noise is added to each batch of data during training, as we found it improved stability. The bitmask $b$, which indicates observed features, is drawn from a Bernoulli distribution with $p = 0.5$ for each training batch. The objective given in Equation 9 is maximized with the Adam optimizer (Kingma & Ba, 2014), and the learning rate is linearly decayed throughout training. Evaluation is performed using the weights that produced the highest likelihoods on a set of validation data during training. Full experimental details and hyperparameters can be found in the supplementary materials.
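A sketch of this batch construction, assuming a Bernoulli rate of 0.5 and a synthetic batch in place of UCI data; per-dataset noise scales are listed in Table A.4 of the supplementary materials.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hedged sketch of per-batch preprocessing: additive Gaussian noise,
# a Bernoulli bitmask, and the zero-imputing function phi(x_o; b).
def make_training_batch(x, noise_scale=0.005, p_observed=0.5):
    x_noisy = x + rng.normal(0.0, noise_scale, size=x.shape)
    b = rng.binomial(1, p_observed, size=x.shape).astype(x.dtype)
    x_in = x_noisy * b  # phi: zero out unobserved entries
    return x_in, b, x_noisy

batch = rng.normal(size=(512, 6))  # synthetic stand-in for a data batch
x_in, b, target = make_training_batch(batch)
print(x_in.shape, b.mean())        # roughly half the entries observed
```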
Table 1. UCI datasets used in our experiments.

Dataset    Instances  Dimensions
POWER      –          6
GAS        –          8
HEPMASS    –          21
MINIBOONE  –          43
BSDS       –          63
Table 2. Test arbitrary conditional log-likelihoods (in nats) for UCI datasets. Higher is better. We present results for models trained with three different levels of missing data. Likelihood estimates are computed with 20,000 importance samples for POWER, GAS, and HEPMASS, 10,000 importance samples for MINIBOONE, and 3,000 importance samples for BSDS. Results for ACFlow and VAEAC are taken from Li et al. (2020). The best performing model for each dataset and missing rate is shown in bold.

              POWER                 GAS                   HEPMASS                    MINIBOONE                   BSDS
Missing Rate  0.0    0.1    0.5    0.0    0.1    0.5    0.0      0.1      0.5      0.0     0.1     0.5       0.0     0.1     0.5
ACE           –      –      –      –      –      –      –        –        –        –       –       –         –       –       –
ACE Proposal  0.600  0.584  0.553  9.328  9.139  7.905  -5.116   -5.623   -8.657   -1.262  -2.012  -12.949   80.080  73.033  44.298
ACFlow        0.528  0.510  0.417  7.593  7.212  4.818  -6.833   -9.670   -10.975  -1.098  -3.577  -10.849   81.399  79.745  73.061
VAEAC         -0.042 -0.103 -0.343 2.418  2.823  1.952  -10.082  -10.389  -11.415  -3.452  -4.242  -9.051    74.850  74.313  66.628
Table 3. Test marginal log-likelihoods (in nats) for UCI datasets. Higher is better. We evaluate the marginal distributions of the first 3, 5, and 10 dimensions of each dataset (POWER and GAS don't have 10 features, so the joint likelihood over all features is reported instead). The same number of importance samples are used as in Table 2. Results for ACFlow and TAN are taken from Li et al. (2020). Note that a separate TAN model has to be trained for each marginal distribution, whereas a single ACE model can estimate all three marginals. A single ACFlow model can estimate all the marginal distributions; however, Li et al. (2020) retrained models specifically for arbitrary marginal estimation, whereas we use the same ACE models for all tasks. Bold indicates instances where ACE was better than ACFlow.

            POWER              GAS                HEPMASS                MINIBOONE             BSDS
Dimensions  3      5     6     3     5     8      3      5      10      3      5      10      3   5   10
ACE         -0.56  1.42  0.56  1.13  4.14  11.88  -4.00  -5.91  -10.75  -3.64  -5.42  -9.63   –   –   –
ACFlow      –      –     –     –     –     –      –      –      –       –      –      –       –   –   –
TAN         –      –     –     –     –     –      –      –      –       –      –      –       –   –   –
We also consider the scenario in which data features are completely missing, i.e., some features are deemed unavailable during training and are never part of the observed or unobserved set. (Features are missing at the per-instance level; this does not mean, for example, that the $i$-th feature is never observed across all training instances.) This allows us to examine the effectiveness of ACE on incomplete datasets, which are common when working with real-world data. When training models with missing data, we simply modify the sets of observed and unobserved indices to remove any indices which have been declared missing. This is a trivial modification and requires no other change to the design or training procedure of ACE. We consider two scenarios where data are missing completely at random at a 10% and 50% rate.

Table 2 presents the average arbitrary conditional log-likelihoods on held-out test data for the UCI datasets from models trained with different levels of missing data. During inference, no data is missing and $b$ is drawn from a Bernoulli distribution with $p = 0.5$. Likelihoods are calculated with the autoregressive procedure presented in Algorithm 1. The order in which the unobserved one-dimensional conditionals are computed is randomly selected for each instance. We compare to ACFlow (Li et al., 2020), which is the current SOTA for arbitrary conditional likelihood estimation, as well as VAEAC (Ivanov et al., 2018).

We can draw two key findings from Table 2. First, we see that our proposal distribution outperforms ACFlow in nearly all cases. Even this exceptionally simple approach (just a fully-connected network that produces a mixture of Gaussians) can give rise to SOTA performance, and we see there are advantages to using decomposed densities and unrestricted network architectures. Second, we find that in every case, the likelihood estimates produced by the energy function are higher than those from the proposal, illustrating the benefits of an energy-based approach which imposes no biases on the shape of the learned distributions. Figure 1 visualizes some of the energy distributions learned by ACE and compares them to their proposal counterparts. We see that the energy function is often able to avoid unwanted artifacts that arise in mixtures of Gaussians, such as over-smoothing and oscillation. Similar visualizations for all datasets are provided in the supplementary materials.

We also examine the arbitrary marginal distributions learned by ACE, i.e., the unconditional distribution over a subset of features. We again test our model against ACFlow, and we additionally compare to Transformation Autoregressive Networks (Oliva et al., 2018), which are designed only for joint likelihood estimation. A separate TAN model has to be trained for each marginal distribution.
Figure 4. Normalized root-mean-square error (NRMSE) of imputations generated by ACE for a set of test data. Lower is better. NRMSE is computed as the root-mean-square error normalized by the standard deviation of each feature and then averaged across all features. Imputed values are the means of the unobserved features (see Section 4.4.3). Estimates of energy distribution means are computed with 20,000 importance samples for POWER and GAS, 10,000 importance samples for HEPMASS and MINIBOONE, and 3,000 importance samples for BSDS. Results for ACFlow and VAEAC are taken from Li et al. (2020).

While a single ACFlow model can estimate all marginal distributions, Li et al. (2020) retrained models specifically for arbitrary marginal estimation. Contrarily, we used the same ACE models when evaluating arbitrary marginals as were used for arbitrary conditionals. Results are provided in Table 3. We find that ACE outperforms ACFlow the majority of the time and even surpasses TAN in some cases, even though ACFlow and TAN both received special training for marginal likelihood estimation and ACE did not. Samples generated by ACE for a 3-dimensional marginal distribution of each dataset can be found in the supplementary materials.
We also evaluate ACE for data imputation, where some elements are missing from a dataset completely at random and we seek to infer their values. ACE is naturally applied to this task: we consider $p(x_u \mid x_o)$, where $x_u$ contains the missing features.

Figure 4 shows the normalized root-mean-square error (NRMSE) on held-out test data for the UCI datasets. Again, we consider models trained with three different levels of missing data. During inference, $b$ is drawn from a Bernoulli distribution, with $p = 0.5$ for the 0% and 50% missing rates and an adjusted rate for the 10% missing rate. We impute the unobserved features, and the means of the unobserved distributions are used as the imputed values (see Section 4.4.3).

As seen in Figure 4, ACE achieves a lower NRMSE score than ACFlow in all cases (exact numbers are available in the supplementary materials), setting a new SOTA for MCAR data imputation. These results further validate ACE's ability to accurately model arbitrary conditionals, leading us to again advocate for simple models that decompose complex likelihoods. It is also worth noting that ACE and ACE Proposal do comparably in this imputation task, which estimates the first-order moment of conditional distributions. However, as evidenced in Table 2, the energy-based likelihood better captures higher-order moments.
6. Conclusion
In this work, we present a novel approach to modeling all arbitrary conditional distributions $p(x_u \mid x_o)$ over a set of covariates. Our method, Arbitrary Conditioning with Energy (ACE), is the first to wholly reduce arbitrary conditioning to one-dimensional conditionals with arbitrary observations and to estimate these with energy functions. By using an energy function to specify densities, ACE can more easily model highly complex distributions, and it can freely use high-capacity networks to model the exponentially many distributions at hand.

Empirically, we find that ACE achieves state-of-the-art (SOTA) performance for arbitrary conditional and marginal density estimation. ACE is also SOTA for data imputation, able to accurately infer missing values in a dataset based on the context of the values that are present. For a given dataset, all of these results are produced with the same trained model.

ACE's high performance does not come at the cost of complexity. ACE is much simpler than other common approaches to density estimation, which often require restrictive complexities such as normalizing flow models or networks with specially masked connections (Germain et al., 2015). ACE is compatible with any network architecture, and our results are obtained with simple fully-connected networks.

Furthermore, ACE's proposal distribution, a basic mixture of Gaussians, attains SOTA, demonstrating that the principle of learning one-dimensional distributions is still powerful when decoupled from energy-based learning. These results emphasize that seemingly complex problems do not necessitate highly complex solutions, and we believe future work on arbitrary density estimation will benefit from similar ideas.

References
Belghazi, M., Oquab, M., and Lopez-Paz, D. Learning about an exponential amount of conditional distributions. In Advances in Neural Information Processing Systems, pp. 13703–13714, 2019.

Butz, C. J., Oliveira, J. S., dos Santos, A. E., and Teixeira, A. L. Deep convolutional sum-product networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pp. 3248–3255, 2019.

Dinh, L., Sohl-Dickstein, J., and Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.

Douglas, L., Zarov, I., Gourgoulias, K., Lucas, C., Hart, C., Baker, A., Sahani, M., Perov, Y., and Johri, S. A universal marginalizer for amortized inference in generative models. arXiv preprint arXiv:1711.00695, 2017.

Fakoor, R., Chaudhari, P., Mueller, J., and Smola, A. J. TraDE: Transformers for density estimation. arXiv preprint arXiv:2004.02441, 2020.

Fielding, S., Fayers, P. M., McDonald, A., McPherson, G., and Campbell, M. K. Simple imputation methods were inadequate for missing not at random (MNAR) quality of life data. Health and Quality of Life Outcomes, 6(1):1–9, 2008.

Germain, M., Gregor, K., Murray, I., and Larochelle, H. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pp. 881–889. PMLR, 2015.

Gondara, L. and Wang, K. MIDA: Multiple imputation using denoising autoencoders. In Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 260–272. Springer, 2018.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. Advances in Neural Information Processing Systems, 27:2672–2680, 2014.

Grathwohl, W., Chen, R. T., Bettencourt, J., Sutskever, I., and Duvenaud, D. FFJORD: Free-form continuous dynamics for scalable reversible generative models. arXiv preprint arXiv:1810.01367, 2018.

Hawkins, J. and Blakeslee, S. On Intelligence. Macmillan, 2004.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016a.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645. Springer, 2016b.

Ivanov, O., Figurnov, M., and Vetrov, D. Variational autoencoder with arbitrary conditioning. arXiv preprint arXiv:1806.02382, 2018.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., and Huang, F. A tutorial on energy-based learning. Predicting Structured Data, 1(0), 2006.

Li, S. C.-X., Jiang, B., and Marlin, B. MisGAN: Learning from incomplete data with generative adversarial networks. arXiv preprint arXiv:1902.09599, 2019.

Li, Y., Akbar, S., and Oliva, J. ACFlow: Flow models for arbitrary conditional likelihoods. In Proceedings of the 37th International Conference on Machine Learning, volume 119 of Proceedings of Machine Learning Research, pp. 5831–5841. PMLR, 2020. URL http://proceedings.mlr.press/v119/li20a.html.

Ma, W. J. and Jazayeri, M. Neural coding of uncertainty and probability. Annual Review of Neuroscience, 37:205–220, 2014.

Mattei, P.-A. and Frellsen, J. MIWAE: Deep generative modelling and imputation of incomplete data sets. In Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pp. 4413–4423. PMLR, 2019. URL http://proceedings.mlr.press/v97/mattei19a.html.

Nash, C. and Durkan, C. Autoregressive energy machines. arXiv preprint arXiv:1904.05626, 2019.

Oliva, J. B., Dubey, A., Zaheer, M., Poczos, B., Salakhutdinov, R., Xing, E. P., and Schneider, J. Transformation autoregressive networks. arXiv preprint arXiv:1801.09819, 2018.

Papamakarios, G., Pavlakou, T., and Murray, I. Masked autoregressive flow for density estimation. In Advances in Neural Information Processing Systems, pp. 2338–2347, 2017.

Poon, H. and Domingos, P. Sum-product networks: A new deep architecture. pp. 689–690. IEEE, 2011.

Pouget, A., Beck, J. M., Ma, W. J., and Latham, P. E. Probabilistic brains: knowns and unknowns. Nature Neuroscience, 16(9):1170–1178, 2013.

Pouget, A., Drugowitsch, J., and Kepecs, A. Confidence and certainty: distinct probabilistic quantities for different goals. Nature Neuroscience, 19(3):366, 2016.

Saremi, S., Mehrjou, A., Schölkopf, B., and Hyvärinen, A. Deep energy estimator networks. arXiv preprint arXiv:1805.08306, 2018.

Stekhoven, D. J. and Bühlmann, P. MissForest: non-parametric missing value imputation for mixed-type data. Bioinformatics, 28(1):112–118, 2012.

Troyanskaya, O., Cantor, M., Sherlock, G., Brown, P., Hastie, T., Tibshirani, R., Botstein, D., and Altman, R. B. Missing value estimation methods for DNA microarrays. Bioinformatics, 17(6):520–525, 2001.

Yoon, J., Jordon, J., and van der Schaar, M. GAIN: Missing data imputation using generative adversarial nets. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 5689–5698. PMLR, 2018. URL http://proceedings.mlr.press/v80/yoon18a.html.

Supplementary Materials
A. Experimental Details
We used a fully-connected residual architecture for both the proposal and energy networks. Each network uses 4 pre-activation residual blocks and ReLU activations. In the proposal network, the bitmask $b$ is concatenated with the input to the network's last layer. In the energy network, $b$ and $\phi(x_o; b)$ are concatenated with the input to the last layer. In all experiments, the latent vectors output by the proposal network were 64-dimensional and the hidden layers of the energy network had 128 units. The proposal distributions used 20 components, and the minimum allowed scale of each component was 0.001. A batch size of 512 was used during training, along with an initial learning rate of 0.0005 (with the exception of the POWER model with 50% missing data, which used a learning rate of 0.0001). We found that the POWER dataset benefited from the MSE penalty described in Section 4.3 of the main text, with a coefficient of 1.0. Table A.4 gives the hyperparameters that varied between datasets.

Models were trained on a single NVIDIA GeForce GTX 1080 Ti GPU, and training time varied from roughly a few hours to 2 days depending on the dataset.
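A hedged PyTorch sketch of the residual block described above; the block count, width, and ReLU activations follow the text, but details such as dropout placement and the trunk's input width are assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class PreActResidualBlock(nn.Module):
    """Fully-connected pre-activation residual block (activation before linear)."""

    def __init__(self, dim, dropout=0.2):
        super().__init__()
        self.net = nn.Sequential(
            nn.ReLU(), nn.Linear(dim, dim),
            nn.ReLU(), nn.Dropout(dropout), nn.Linear(dim, dim),
        )

    def forward(self, x):
        return x + self.net(x)  # identity skip connection

# e.g., a proposal-network trunk: input projection then 4 blocks of width 512.
# The input width 12 is a placeholder (e.g., features concatenated with b).
trunk = nn.Sequential(
    nn.Linear(12, 512),
    *[PreActResidualBlock(512) for _ in range(4)],
)
print(trunk(torch.randn(8, 12)).shape)  # torch.Size([8, 512])
```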
B. Imputation Results
In the main text, the imputation results are presented as a graph. We give the values that generated the graph in Table B.5.
C. Learned Distributions
In Figure C.5, Figure C.6, Figure C.7, Figure C.8, and Figure C.9, we provide examples of distributions that were learned by ACE on the five UCI datasets. Each figure shows the estimated distributions of the unobserved features for four different test examples with randomly generated masks. Each plot in a column corresponds to one of the unobserved dimensions. The proposal and energy estimates are overlaid for comparison.

Figure C.10 shows the learned distributions for each dataset for the case where $x_o$ is the empty set. In this case, we are also able to compare to the data distribution.

D. Marginal Samples
In the main text, we report marginal likelihoods learned by ACE over the first three dimensions of each dataset. Here, we qualitatively evaluate samples drawn from those learned distributions. Figure D.11 presents scatter plots of samples drawn from the proposal and energy distributions alongside test data points (1000 of each are plotted). Notably, we see for BSDS that samples from the proposal distribution are much more spread out than the data, whereas the energy samples closely match the data distribution.

Table A.4. Dataset-specific hyperparameters.

Hyperparameter        POWER      GAS      HEPMASS  MINIBOONE  BSDS
Proposal Hidden Dim.  512        512      512      512        1024
Dropout               0.2        0.0      0.2      0.5        0.2
MSE Penalty Coef.     1.0        0.0      0.0      0.0        0.0
Training Steps        1,600,000  800,000  800,000  800,000    800,000
Warm-up Steps         5,000      5,000    5,000    2,500      5,000
Training Noise Scale  0.005      0.001    0.001    0.005      0.001
Table B.5. Test NRMSE scores for UCI datasets (ACE and ACE Proposal, for POWER, GAS, HEPMASS, MINIBOONE, and BSDS at missing rates 0.0, 0.1, and 0.5). Lower is better. The best performing model for each dataset and missing rate is shown in bold.
Figure C.5. POWER learned distributions. Blue is the proposal distribution and orange is the energy distribution. Best viewed zoomed in.
Figure C.6. GAS learned distributions. Blue is the proposal distribution and orange is the energy distribution. Best viewed zoomed in.

Figure C.7. HEPMASS learned distributions. Blue is the proposal distribution and orange is the energy distribution. Best viewed zoomed in.
Figure C.8. MINIBOONE learned distributions. Blue is the proposal distribution and orange is the energy distribution. Best viewed zoomed in.
Figure C.9. BSDS learned distributions. Blue is the proposal distribution and orange is the energy distribution. Best viewed zoomed in.

Figure C.10. Learned distributions for each dataset for $x_o = \emptyset$: (a) POWER, (b) GAS, (c) HEPMASS, (d) MINIBOONE, (e) BSDS. Blue is the proposal distribution, orange is the energy distribution, and grey is the data distribution. Best viewed zoomed in.
Figure D.11. Samples generated by ACE for the marginal distribution over the first three dimensions of each dataset: (a) POWER, (b) GAS, (c) HEPMASS, (d) MINIBOONE, (e) BSDS. Each panel shows test data alongside proposal samples and energy samples.