Convex Nonparanormal Regression
Yonatan Woodbridge, Gal Elidan and Ami Wiesel
Abstract — In this letter we propose a convex approach to learning expressive scalar conditional distributions. The model, denoted Normalizing Flow Regression (NFR), is inspired by deep normalizing flow networks but is convex due to the use of a dictionary of pre-defined transformations. By defining a rich enough dictionary, NFR generalizes the Gaussian posterior associated with linear regression to an arbitrary conditional distribution. In the special case of a piecewise linear dictionary, we also provide a closed form solution for the conditional mean. We demonstrate the advantages of NFR over competitors using synthetic data as well as real world data.
Index Terms — Linear Regression, Normalizing Flow, Convex Optimization.
I. INTRODUCTION
A fundamental task in data analysis is fitting a conditional distribution to an unknown label given observed features. In its simplest case, linear regression (LR) can be interpreted as the maximum likelihood (ML) estimate of a Gaussian distribution with a mean that is a linear function of the features, and a constant variance. LR is well understood and involves a simple quadratic optimization with a closed form solution. On the other extreme, recent developments known as normalizing flows [1] involve deep architectures associated with complex high dimensional posteriors. Unfortunately, their ML estimates require the solutions to non-convex and intractable optimization problems.

The problem is even more acute when our goal is to quantify the uncertainty of the estimate, or to characterize the full posterior of unknown labels. While the classical LR framework provides a closed-form solution to these tasks, its assumptions are clearly too simplistic for most real world settings. Thus, many extensions of LR use a variance that depends on the features [2], [3], [4], [5]. These extensions work well in practice, but involve a poorly understood non-convex optimization.

A simpler extension of LR, which we denote by Gaussian Regression (GR), is defined as the likelihood function of the canonical Gaussian parameters. While both the mean and variance of GR are functions of the features, estimation involves a simple convex minimization. In fact, it is well known that the Gaussian distribution belongs to the exponential family and has a convex formulation by re-parameterization of its mean and variance parameters [6]. However, similar to LR and its extensions, GR is limited to the Gaussian distribution. These models are inappropriate when the posterior is not symmetric around its mean, has heavy tails, or has multiple modes.

In this letter, we introduce the NFR model, which extends LR and GR beyond the Gaussian setting while still preserving convexity. NFR allows a general class of non-Gaussian yet convex posteriors, by formulating the transformation as a linear combination of pre-defined basis functions. With a rich enough dictionary, this approach can model arbitrary conditional distributions. The price is an increased number of unknown parameters and higher sample complexity. Practically, we suggest a simple dictionary construction, demonstrating by experimental evaluations its efficacy in capturing complex distributions.

The main tool in deriving NFR is the formula for the density of a transformed random variable. Instead of fitting the density directly, NFR learns a transformation that will result in a Gaussian distribution. In this sense, NFR builds on a large body of literature. The nonparanormal distribution (NPN) [7] relies on nonparametric marginal distributions and uses them for estimation of high dimensional graphical models. Similarly, Gaussian copulas allow arbitrary marginals with a Gaussian dependence structure, and belong to a richer family of copula models [8], [9], [10], [11], [12]. More recently, normalizing flow networks exploit the transformation formula in the context of deep generative learning. These networks build upon a composition of multiple high dimensional invertible maps and compute their most likely parameters in order to generate realistic samples [1], [13], [14], [15], [16]. Compared to these works, NFR is less ambitious and considers simple scalar distributions. The goal is not synthetic sample generation but non-Gaussian regression. On the other hand, we are not aware of previous works that consider normalizing flows with pre-defined basis transformations that ensure convexity.

On the practical side, we focus on a specific NFR implementation based on a dictionary of piecewise linear transformation functions. The dictionary partitions the label's domain into predefined bins, and uses a different transformation in each bin. By choosing the number of bins, this implementation allows for a flexible tradeoff between complexity and expressive power. In addition, we provide a closed form solution to the conditional expectation of the piecewise linear model, which is useful in prediction tasks. We then demonstrate the performance advantages of our NFR implementation over LR and GR using numerical experiments on both synthetic and real world data. Given enough samples, NFR results in better likelihood values than its competitors. In terms of prediction error, NFR performs similarly to both LR and GR.

The letter is organized as follows. We begin in Section II by introducing the general NFR framework. We derive the NFR model, detail its underlying assumptions, prove its convexity, and show that LR and GR are both special cases. In Section III we present our specific NFR implementation based on a dictionary of piecewise linear functions. Finally, in Section IV we present the results of our numerical experiments.

The authors are with the Hebrew University of Jerusalem, Israel, and Google Research. This research was partially supported by the Israel Science Foundation (ISF) under Grant 1339/15.
II. NORMALIZING FLOW REGRESSION (NFR)

In this section, we introduce Normalizing Flow Regression (NFR), a non-Gaussian yet convex generalization of linear regression. NFR learns the posterior $p_\theta(y|x)$ of $y \in \mathbb{R}$ given $x \in \mathbb{R}^k$, parameterized by $\theta \in \mathbb{R}^d$; NFR tries to estimate the unknown $\theta$ given pairs of $(y, x)$. It parameterizes the posterior using an expressive class of transformation functions to the Gaussian distribution. For this purpose, we define

$$g_\theta(y; x) \sim \mathcal{N}(0, 1).$$

The transformation function $g$ operates on $y$ and is determined by the arguments $\theta \in \Theta_x$ and $x \in \mathcal{X}$. We require the following assumptions:

Assumption 1.
The transformation $g$ is differentiable and monotonically increasing with respect to $y$ for any $\theta \in \Theta_x$ and $x \in \mathcal{X}$.

Assumption 2.
The set $\Theta_x$ is convex in $\theta$ for all $x \in \mathcal{X}$. Within this set, the transformation $g$ is affine in $\theta$.

While at first sight these assumptions might seem restrictive, they not only give rise to concrete constructions of $g(\cdot)$, but also allow this function to be highly non-linear and non-convex in $x$ and $y$. In fact, by choosing an expressive enough transformation class, $g$ can approximate an arbitrary continuous posterior. The assumptions are simply designed to ensure a tractable likelihood function.
Lemma 1. Under Assumption 1 and ignoring constants, the negative log likelihood of $y$ parameterized by $\theta$ is

$$-\log p_\theta(y|x) = \frac{1}{2}\, g_\theta^2(y; x) - \log g'_\theta(y; x),$$

where the derivative is defined as

$$g'_\theta(y; x) = \frac{\partial g_\theta(y; x)}{\partial y}.$$

Together with Assumption 2, this objective is convex in $\theta \in \Theta_x$ for all $x \in \mathcal{X}$.

Proof. Assumption 1 allows the use of the classical formula of transformed random variables [17]. Since $g'_\theta(y; x)$ must be positive, we get

$$p_\theta(y|x) = \frac{g'_\theta(y; x)}{\sqrt{2\pi}}\, e^{-g_\theta^2(y; x)/2}, \tag{1}$$

which yields the likelihood. Convexity is guaranteed by noting that the quadratic and the negative logarithm are convex functions. Next, $g$ is affine in $\theta$, and consequently $g'$ is affine in $\theta$ too. Finally, convexity is preserved under affine transformations.

Given a training dataset $\{(x_i, y_i)\}_{i=1,\ldots,n}$, NFR is defined as the maximum likelihood estimate of $\theta$:

$$\min_\theta \; \sum_i \frac{1}{2}\, g_\theta^2(y_i; x_i) - \log g'_\theta(y_i; x_i) \quad \text{s.t.} \quad \theta \in \Theta_{x_i}, \;\; i = 1, \ldots, n. \tag{2}$$

Convexity ensures that this minimization can be efficiently solved using existing toolboxes. In fact, this property allows us to add convex penalties and/or constraints. We can use these to regularize the objective when the number of samples is insufficiently large compared to the dimension of $\theta$.
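As an illustration of the toolbox route, the following is a minimal sketch (not the authors' code) of solving (2) with CVXPY. It assumes the affine parameterization has been pre-evaluated into matrices G and Gp (hypothetical names) whose $i$-th rows give $g_\theta(y_i; x_i)$ and $g'_\theta(y_i; x_i)$ as linear functions of $\theta$:

    import cvxpy as cp

    def fit_nfr(G, Gp, eps=1e-6):
        """Solve the convex ML problem (2): min 0.5*sum(g^2) - sum(log g')."""
        n, d = G.shape
        theta = cp.Variable(d)
        g = G @ theta      # g_theta(y_i; x_i), affine in theta
        gp = Gp @ theta    # derivative w.r.t. y, also affine in theta
        # Negative log likelihood from Lemma 1, constants dropped.
        nll = 0.5 * cp.sum_squares(g) - cp.sum(cp.log(gp))
        # Theta_x constraint: the derivative must stay strictly positive.
        prob = cp.Problem(cp.Minimize(nll), [gp >= eps])
        prob.solve()
        return theta.value

Both terms are convex under the disciplined convex programming rules (a squared affine expression and a negated log of an affine expression), and the strict positivity of $\Theta_x$ is approximated with a small margin eps.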
As can be easily seen, NFR generalizes two well known special cases:

Lemma 2. If $g$ is jointly linear in $x$ and $y$, then NFR reduces to standard linear regression of $y$ given $x$:

$$y \,|\, x \sim \mathcal{N}\big(w_\theta^T x,\, \sigma_\theta^2\big).$$

Proof. This follows immediately by defining $g_\theta(y; x) = u_\theta^T x + v_\theta y$ with $v_\theta > 0$, $w_\theta = -u_\theta / v_\theta$ and $\sigma_\theta^2 = 1 / v_\theta^2$.
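For concreteness, the change of variables behind this proof reads

$$z = u_\theta^T x + v_\theta y \sim \mathcal{N}(0,1) \;\;\Longrightarrow\;\; y = \frac{z - u_\theta^T x}{v_\theta} \;\sim\; \mathcal{N}\!\left(-\frac{u_\theta^T x}{v_\theta},\, \frac{1}{v_\theta^2}\right) = \mathcal{N}\big(w_\theta^T x,\, \sigma_\theta^2\big).$$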
Lemma 3. If $g$ is affine in $y$, then NFR reduces to maximum likelihood estimation of a Gaussian distribution in its canonical form:

$$y \sim \mathcal{N}\big(w_\theta(x),\, \sigma_\theta^2(x)\big).$$

Proof. This follows immediately by defining $g_\theta(y; x) = u_\theta(x) + v_\theta(x)\, y$ with $v_\theta(x) > 0$, $w_\theta(x) = -u_\theta(x) / v_\theta(x)$ and $\sigma_\theta^2(x) = 1 / v_\theta^2(x)$.

Thus, with proper constraints, LR and GR can both be implemented as special cases of NFR.
III. PIECEWISE LINEAR NFR

Generally, NFR only requires Assumptions 1-2 to ensure a well defined convex negative log likelihood. To make the construction concrete, we now provide a specific transformation class that satisfies these assumptions and allows a flexible tradeoff between complexity and expressivity. The class is based on a dictionary of piecewise linear, monotonically non-decreasing functions of $y$ with weights that are affine functions of the features:

$$g_{A,b}(y; x) = h^T(y)\,[A \psi(x) + b],$$

where $\theta = \{A \in \mathbb{R}^{(L+3) \times k},\, b \in \mathbb{R}^{L+3}\}$ are the unknown parameters, $\psi(x) \in \mathbb{R}^k$ are pre-defined feature mappings (possibly just $x$ itself), and $h(y)$ is the dictionary defined as

$$h^T(y) = [1,\, h_0(y),\, \ldots,\, h_{L+1}(y)], \tag{3}$$

where each of the basis functions $h_i$ is monotonically non-decreasing in $y$.

We choose to construct our dictionary using simple step functions. More specifically, let $I = [p_0, \ldots, p_L]$ be evenly spaced points on the real line, such that $\Delta = p_{j+1} - p_j$ is a fixed distance between two adjacent points. We then define

$$h_j(y) = \begin{cases} 0 & y \le p_{j-1} \\ y - p_{j-1} & p_{j-1} < y \le p_j \\ \Delta & p_j < y \end{cases} \qquad \forall j \in \{1, \ldots, L\}.$$

Below and above the points $p_0$ and $p_L$ we define, respectively,

$$h_0(y) = \begin{cases} y - p_0 & y \le p_0 \\ 0 & p_0 < y \end{cases}, \qquad h_{L+1}(y) = \begin{cases} 0 & y \le p_L \\ y - p_L & p_L < y \end{cases}.$$
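For concreteness, here is a small sketch (hypothetical helpers, not from the paper) that evaluates the dictionary $h(y)$ of (3) and its derivative under this step-function construction, for knots $p = [p_0, \ldots, p_L]$:

    import numpy as np

    def dictionary(y, p):
        """Evaluate h(y) = [1, h_0(y), ..., h_{L+1}(y)] for knots p."""
        L = len(p) - 1
        delta = p[1] - p[0]                  # fixed spacing Delta
        h = np.empty(L + 3)
        h[0] = 1.0                           # constant basis element
        h[1] = min(y - p[0], 0.0)            # h_0: ramp below p_0, zero above
        for j in range(1, L + 1):            # h_j, j = 1, ..., L
            h[j + 1] = np.clip(y - p[j - 1], 0.0, delta)
        h[L + 2] = max(y - p[L], 0.0)        # h_{L+1}: ramp above p_L
        return h

    def dictionary_deriv(y, p):
        """Evaluate h'(y), the derivative of each basis element w.r.t. y."""
        L = len(p) - 1
        d = np.zeros(L + 3)                  # the constant element has zero slope
        d[1] = 1.0 if y <= p[0] else 0.0
        for j in range(1, L + 1):
            d[j + 1] = 1.0 if p[j - 1] < y <= p[j] else 0.0
        d[L + 2] = 1.0 if y > p[L] else 0.0
        return d

Each entry is a clipped ramp, so every basis function is monotonically non-decreasing and $g$ inherits monotonicity whenever all weights beyond the constant are positive.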
Fig. 1: Example of basis functions, with $h_0(y)$ (left), $h_j(y)$ for $1 \le j \le L$ (center) and $h_{L+1}(y)$ (right).

An illustration of these functions is given in Fig. 1. To ensure a monotonically increasing $g$, we define the domain $\Theta_x$:

$$\Theta_x = \big\{ (A, b) : [A \psi(x) + b]_j > 0 \;\; \forall j \ge 1 \big\},$$

where the entries are indexed from zero, so that only the constant weight is unconstrained. We can now easily see that this construction satisfies Assumptions 1-2 and provides a convex NFR. Reminiscent of kernel density estimation [18], the dictionary divides the domain of $y$ into predefined bins and allows a different slope at each bin. With enough bins, any monotonically increasing transformation can be modeled.

Many applications require the point estimation of unknown labels, or the minimum mean squared error (MMSE) estimate of $y$ given $x$. Here lies another advantage of the piecewise linear NFR, whose construction allows a closed form computation of these quantities. The following lemma provides a closed form for the conditional expectation of $y$ given $x$, which can then be used to obtain point estimates or MMSEs.
Lemma 4. The conditional expectation of $y$ given $x$ in the piecewise linear model with parameters $A$ and $b$ is

$$\mathbb{E}(y|x) = -\frac{e^{-\mu^2/2}}{\sqrt{2\pi}\,\alpha_0} + \left(p_0 - \frac{\mu}{\alpha_0}\right)\Phi(\mu) + \sum_{i=0}^{L-1}\left( \frac{e^{-(\mu+\bar\Delta_i)^2/2} - e^{-(\mu+\bar\Delta_{i+1})^2/2}}{\sqrt{2\pi}\,\alpha_{i+1}} + \left(p_i - \frac{\mu+\bar\Delta_i}{\alpha_{i+1}}\right)\big(\Phi(\mu+\bar\Delta_{i+1}) - \Phi(\mu+\bar\Delta_i)\big) \right) + \frac{e^{-(\mu+\bar\Delta_L)^2/2}}{\sqrt{2\pi}\,\alpha_{L+1}} + \left(p_L - \frac{\mu+\bar\Delta_L}{\alpha_{L+1}}\right)\big(1 - \Phi(\mu+\bar\Delta_L)\big), \tag{4}$$

where $\Phi$ is the standard normal cumulative distribution function (CDF) and we define

$$[\mu,\, \alpha_0,\, \ldots,\, \alpha_{L+1}]^T = A\psi(x) + b, \qquad \bar\Delta_0 = 0, \qquad \bar\Delta_i = \Delta(\alpha_1 + \cdots + \alpha_i) \;\; \forall i = 1, \ldots, L.$$
Proof. We obtain this result by computing the expectation on each segment of the real line, as partitioned by $I$. The NFR density function on each segment is given by

$$p_\theta(y|x) = \begin{cases} \dfrac{\alpha_0}{\sqrt{2\pi}}\, e^{-(\mu + \alpha_0(y - p_0))^2/2} & y \le p_0 \\[6pt] \dfrac{\alpha_{i+1}}{\sqrt{2\pi}}\, e^{-(\mu + \bar\Delta_i + \alpha_{i+1}(y - p_i))^2/2} & y \in (p_i, p_{i+1}], \;\; 0 \le i \le L-1 \\[6pt] \dfrac{\alpha_{L+1}}{\sqrt{2\pi}}\, e^{-(\mu + \bar\Delta_L + \alpha_{L+1}(y - p_L))^2/2} & y > p_L. \end{cases}$$

By a standard calculation, the integral of $y\, p_\theta(y|x)$ over the leftmost segment satisfies

$$\int_{-\infty}^{p_0} y\, p_\theta(y|x)\, dy = -\frac{e^{-\mu^2/2}}{\sqrt{2\pi}\,\alpha_0} + \left(p_0 - \frac{\mu}{\alpha_0}\right)\Phi(\mu),$$

while the integral over the rightmost segment satisfies

$$\int_{p_L}^{\infty} y\, p_\theta(y|x)\, dy = \frac{e^{-(\mu+\bar\Delta_L)^2/2}}{\sqrt{2\pi}\,\alpha_{L+1}} + \left(p_L - \frac{\mu+\bar\Delta_L}{\alpha_{L+1}}\right)\big(1 - \Phi(\mu+\bar\Delta_L)\big).$$

For the inner segments, $i = 0, \ldots, L-1$, it holds that

$$\int_{p_i}^{p_{i+1}} y\, p_\theta(y|x)\, dy = \frac{e^{-(\mu+\bar\Delta_i)^2/2} - e^{-(\mu+\bar\Delta_{i+1})^2/2}}{\sqrt{2\pi}\,\alpha_{i+1}} + \left(p_i - \frac{\mu+\bar\Delta_i}{\alpha_{i+1}}\right)\big(\Phi(\mu+\bar\Delta_{i+1}) - \Phi(\mu+\bar\Delta_i)\big).$$

The required result is obtained by summing up all these terms.
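A direct transcription of (4) can serve as a reference implementation. The sketch below (variable names are ours, not the authors') takes the per-sample parameter vector $w = [\mu, \alpha_0, \ldots, \alpha_{L+1}]^T = A\psi(x) + b$ and the knots $p = [p_0, \ldots, p_L]$:

    import numpy as np
    from scipy.stats import norm

    def conditional_mean(w, p):
        """Closed-form E[y|x] of Lemma 4; w = [mu, alpha_0, ..., alpha_{L+1}]."""
        mu, alpha = w[0], w[1:]
        L = len(p) - 1
        delta = p[1] - p[0]
        # Cumulative offsets: D[0] = 0, D[i] = Delta*(alpha_1 + ... + alpha_i).
        D = np.concatenate(([0.0], delta * np.cumsum(alpha[1:L + 1])))
        # Leftmost segment (-inf, p_0]; norm.pdf(t) = exp(-t^2/2)/sqrt(2*pi).
        e = -norm.pdf(mu) / alpha[0] + (p[0] - mu / alpha[0]) * norm.cdf(mu)
        # Inner segments (p_i, p_{i+1}] for i = 0, ..., L-1.
        for i in range(L):
            a = alpha[i + 1]
            e += (norm.pdf(mu + D[i]) - norm.pdf(mu + D[i + 1])) / a
            e += (p[i] - (mu + D[i]) / a) * (norm.cdf(mu + D[i + 1]) - norm.cdf(mu + D[i]))
        # Rightmost segment (p_L, inf).
        aL = alpha[L + 1]
        e += norm.pdf(mu + D[L]) / aL
        e += (p[L] - (mu + D[L]) / aL) * (1 - norm.cdf(mu + D[L]))
        return e

A quick sanity check is to compare its output against numerical integration of $y\, p_\theta(y|x)$ on a fine grid.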
Next we provide an Alternating Direction Method of Multipliers (ADMM) algorithm for solving the piecewise linear NFR efficiently [19]. Due to space limitations, we omit the derivation and many of the details. In brief, the piecewise linear version of (2) boils down to the following convex optimization:

$$\min_w \; w^T P w - 2 \sum_\ell \log\big([Qw]_\ell\big), \tag{5}$$

where $P \succeq 0$ and $Q$ are fixed matrices that depend on the data and setting, and $w$ is a vector of all the unknown parameters. ADMM solves this minimization by defining $z_\ell = [Qw]_\ell \ge 0$ for each $\ell$ and alternatingly solving for $w$, $z_\ell$ and the multipliers $y_\ell$. The overall algorithm is provided in Algorithm 1, where $\rho$ is a step size parameter.

Algorithm 1: ADMM
Repeat until convergence:
  • $w \leftarrow -\big(2P + \rho Q^T Q\big)^{-1} Q^T (y - \rho z)$
  • $z_\ell \leftarrow \dfrac{\rho [Qw]_\ell + y_\ell + \sqrt{(\rho [Qw]_\ell + y_\ell)^2 + 8\rho}}{2\rho} \quad \forall \ell$
  • $y \leftarrow y + \rho (Qw - z)$
Return $w$.

Given $n$ data points $(x_i, y_i)$, the matrices $P$ and $Q$ are defined as follows:

$$p_i^T = \big[\, h^T(y_i) \otimes \psi^T(x_i), \;\; h^T(y_i) \,\big] \;\Longrightarrow\; P = \sum_{i=1}^{n} p_i p_i^T,$$

$$Q_{i:} = \big[\, h'^T(y_i) \otimes \psi^T(x_i), \;\; h'^T(y_i) \,\big], \quad \forall i = 1, \ldots, n,$$

where $\otimes$ is the Kronecker product, $h^T(y_i)$ is the row vector defined in (3), $h'^T(y_i)$ is its derivative with respect to $y_i$, and $Q_{i:}$ is the $i$-th row of $Q$. See the supplementary material for more details.
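In code, one ADMM pass is a linear solve, an elementwise positive root, and a dual update. The sketch below follows Algorithm 1; the step size $\rho$, the fixed iteration count, and the explicit inverse are illustrative choices (the paper defers the full derivation to the supplementary material):

    import numpy as np

    def admm_nfr(P, Q, rho=1.0, iters=500):
        """ADMM for (5): min_w w'Pw - 2*sum(log([Qw]_l)) with slack z = Qw."""
        n, d = Q.shape
        z = np.ones(n)                            # slack variables z_l = [Qw]_l
        u = np.zeros(n)                           # multipliers (y in Algorithm 1)
        M = np.linalg.inv(2 * P + rho * Q.T @ Q)  # formed once, reused every pass
        w = np.zeros(d)
        for _ in range(iters):
            w = -M @ (Q.T @ (u - rho * z))        # w-update: quadratic solve
            v = rho * (Q @ w) + u
            z = (v + np.sqrt(v ** 2 + 8 * rho)) / (2 * rho)  # root of rho*z^2 - v*z - 2 = 0
            u = u + rho * (Q @ w - z)             # dual ascent on the multipliers
        return w

In practice one would replace the fixed iteration count with a stopping rule on the primal and dual residuals, and use a cached Cholesky factorization instead of an explicit inverse.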
Fig. 2: NLL and MSE results of the different models on synthetic LR data, for an increasing number of samples.
IV. NUMERICAL EXPERIMENTS
We now demonstrate the efficacy of our approach in capturing expressive conditional distributions while still retaining model convexity. In all the simulations, we compare our NFR approach to the LR and GR baselines. NFR is computed using the ADMM in Algorithm 1. In the synthetic simulations, we also report a clairvoyant lower bound based on the true unknown parameters. Performance is measured using the negative log likelihood (NLL) of an independent test set. We also provide normalized root mean squared error (MSE) results with respect to the prediction error $(\hat{y} - y)^2$, where $\hat{y}$ is the conditional expectation in Lemma 4. The reported metrics are empirical medians over independent trials per point.

Synthetic data. We begin with experiments on synthetic data generated from the LR and NFR regression models. The true parameters are fixed constants and the features are standard normal variables. Figs. 2 and 3 provide the NLL and MSE results as a function of the number of samples generated from the LR and NFR settings, respectively. In terms of NLL, each estimator performs best under its true model. NFR is more expressive than its competitors and requires more samples to reach their performance on their specialized models. On the other hand, NFR is significantly better when the true data is NFR distributed. The MSE experiments validate Lemma 4. In terms of minimizing prediction error, LR is typically sufficient and there is no need for GR or NFR, yet the penalty for their usage is negligible when the number of samples is large. Asymptotically, the non-linearities in NFR even give a slight advantage and NFR outperforms its competitors. We obtain similar results on GR samples.
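Since $g$ is piecewise linear and increasing, NFR-distributed labels such as those used above can be drawn by inverse-transform sampling: draw $z \sim \mathcal{N}(0,1)$ and invert $g$ segment by segment. A sketch (hypothetical helper) under the Section III parameterization, with $w = [\mu, \alpha_0, \ldots, \alpha_{L+1}]$ and knots $p$:

    import numpy as np

    def sample_nfr(w, p, rng=None):
        """Draw y ~ p_theta(.|x) by inverting the piecewise linear g at z ~ N(0,1)."""
        rng = rng or np.random.default_rng()
        mu, alpha = w[0], w[1:]
        L = len(p) - 1
        delta = p[1] - p[0]
        # Values of g at the knots: g(p_i) = mu + Delta*(alpha_1 + ... + alpha_i).
        knots = mu + np.concatenate(([0.0], delta * np.cumsum(alpha[1:L + 1])))
        z = rng.standard_normal()
        if z <= knots[0]:                        # land below p_0, slope alpha_0
            return p[0] + (z - knots[0]) / alpha[0]
        if z > knots[L]:                         # land above p_L, slope alpha_{L+1}
            return p[L] + (z - knots[L]) / alpha[L + 1]
        i = np.searchsorted(knots, z) - 1        # segment (p_i, p_{i+1}]
        return p[i] + (z - knots[i]) / alpha[i + 1]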
Fig. 3: NLL and MSE results of the different models on synthetic NFR data, for an increasing number of samples.

Fig. 4: NLL and MSE results of the different models on real solar power data, for an increasing number of samples.

Real world data. We now evaluate our NFR approach in a real world setting. We follow [20] and consider a solar power time series. Denoting the time series by $\{t_i\}_{i=1}^n$, our goal is to predict its future outcome. Fixing a set of indices $J = (1, \ldots)$, we extract $n$ samples through $x_i = \big(t_{i+J(1)}, \ldots, t_{i+J(9)}\big)$ and $y_i = t_{i+90}$, where $i = 1, \ldots, n$. For each trial we set aside samples to be used as a test set, while the remaining samples serve as a training set. As previously, we estimate the LR, GR and NFR parameters using the training set, and compute the MSE and likelihood on the test set. We perform independent trials for each value of $n$. As shown in Fig. 4, NFR results in significantly better NLL values. In terms of MSE, NFR requires many more samples to reach the performance of LR and GR, as it has more unknown parameters.

REFERENCES

[1] E. G. Tabak and C. V. Turner, "A family of nonparametric density estimation algorithms," Communications on Pure and Applied Mathematics, vol. 66, no. 2, pp. 145–164, 2013.
[2] D. A. Nix and A. S. Weigend, "Learning local error bars for nonlinear regression," in Advances in Neural Information Processing Systems, pp. 489–496, 1995.
[3] T. Heskes, "Practical confidence and prediction intervals," in Advances in Neural Information Processing Systems, pp. 176–182, 1997.
[4] G. Papadopoulos, P. J. Edwards, and A. F. Murray, "Confidence estimation methods for neural networks: A practical comparison," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1278–1287, 2001.
[5] N. Meinshausen, "Quantile regression forests," Journal of Machine Learning Research, vol. 7, pp. 983–999, 2006.
[6] M. J. Wainwright, M. I. Jordan, et al., "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[7] H. Liu, J. Lafferty, and L. Wasserman, "The nonparanormal: Semiparametric estimation of high dimensional undirected graphs," Journal of Machine Learning Research, vol. 10, pp. 2295–2328, 2009.
[8] R. B. Nelsen, An Introduction to Copulas. Springer Science & Business Media, 2007.
[9] C. A. Klaassen, J. A. Wellner, et al., "Efficient estimation in the bivariate normal copula model: normal margins are least favourable," Bernoulli, vol. 3, no. 1, pp. 55–77, 1997.
[10] M. Davy and A. Doucet, "Copulas: a new insight into positive time-frequency distributions," IEEE Signal Processing Letters, vol. 10, no. 7, pp. 215–218, 2003.
[11] S. G. Iyengar, P. K. Varshney, and T. Damarla, "A parametric copula-based framework for hypothesis testing using heterogeneous data," IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2308–2319, 2011.
[12] Y. Woodbridge, G. Elidan, and A. Wiesel, "Signal detection in complex structured para normal noise," IEEE Transactions on Signal Processing, vol. 65, no. 9, pp. 2306–2316, 2017.
[13] I. Kobyzev, S. Prince, and M. A. Brubaker, "Normalizing flows: Introduction and ideas," arXiv preprint arXiv:1908.09257, 2019.
[14] D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," arXiv preprint arXiv:1505.05770, 2015.
[15] E. G. Tabak, E. Vanden-Eijnden, et al., "Density estimation by dual ascent of the log-likelihood," Communications in Mathematical Sciences, vol. 8, no. 1, pp. 217–233, 2010.
[16] L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density estimation using real NVP," arXiv preprint arXiv:1605.08803, 2016.
[17] S. M. Ross, Introduction to Probability Models. Academic Press, 2014.
[18] L. Wasserman, All of Nonparametric Statistics. Springer Science & Business Media, 2006.
[19] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[20] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, "Modeling long- and short-term temporal patterns with deep neural networks," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104, 2018.