Convex Nonparanormal Regression
Yonatan Woodbridge, Gal Elidan and Ami Wiesel
Abstract — In this letter we propose a convex approach to learning expressive scalar conditional distributions. The model, denoted Normalizing Flow Regression (NFR), is inspired by deep normalizing flow networks but is convex due to the use of a dictionary of pre-defined transformations. By defining a rich enough dictionary, NFR generalizes the Gaussian posterior associated with linear regression to an arbitrary conditional distribution. In the special case of a piecewise linear dictionary, we also provide a closed form solution for the conditional mean. We demonstrate the advantages of NFR over competitors using synthetic data as well as real world data.
Index Terms — Linear Regression, Normalizing Flow, Convex Optimization.
I. INTRODUCTION
A fundamental task in data analysis is fitting a conditional distribution to an unknown label given observed features. In its simplest case, linear regression (LR) can be interpreted as the maximum likelihood (ML) estimate of a Gaussian distribution with a mean that is a linear function of the features, and a constant variance. LR is well understood and involves a simple quadratic optimization with a closed form solution. On the other extreme, recent developments known as normalizing flows [1] involve deep architectures associated with complex high dimensional posteriors. Unfortunately, their ML estimates require the solutions to non-convex and intractable optimization problems.

The problem is even more acute when our goal is to quantify the uncertainty of the estimate, or to characterize the full posterior of unknown labels. While the classical LR framework provides a closed-form solution to these tasks, its assumptions are clearly too simplistic for most real world settings. Thus, many extensions of LR use a variance that depends on the features [2], [3], [4], [5]. These extensions work well in practice, but involve a poorly understood non-convex optimization.

A simpler extension of LR, which we denote by Gaussian Regression (GR), is defined as the likelihood function of the canonical Gaussian parameters. While both the mean and variance of GR are functions of the features, estimation involves a simple convex minimization. In fact, it is well known that the Gaussian distribution belongs to the exponential family and has a convex formulation by re-parameterization of its mean and variance parameters [6]. However, similar to LR and its extensions, GR is limited to the Gaussian distribution. These models are inappropriate when the posterior is not symmetric around its mean, has heavy tails, or has multiple modes.

In this letter, we introduce the NFR model, which extends LR and GR beyond the Gaussian setting while still preserving convexity. NFR allows a general class of non-Gaussian yet convex posteriors, by formulating the transformation as a linear combination of pre-defined basis functions. With a rich enough dictionary, this approach can model arbitrary conditional distributions. The price is an increased number of unknown parameters and higher sample complexity. Practically, we suggest a simple dictionary construction, demonstrating by experimental evaluations its efficacy in capturing complex distributions.

The main tool in deriving NFR is the formula for the density of a transformed random variable. Instead of fitting the density directly, NFR learns a transformation that will result in a Gaussian distribution. In this sense, NFR builds on a large body of literature. The nonparanormal distribution (NPN) [7] relies on nonparametric marginal distributions and uses them for estimation of high dimensional graphical models. Similarly, Gaussian copulas allow arbitrary marginals with a Gaussian dependence structure, and belong to a richer family of copula models [8], [9], [10], [11], [12]. More recently, normalizing flow networks exploit the transformation formula in the context of deep generative learning. These networks build upon a composition of multiple high dimensional invertible maps and compute their most likely parameters in order to generate realistic samples [1], [13], [14], [15], [16]. Compared to these works, NFR is less ambitious and considers simple scalar distributions. The goal is not synthetic sample generation but non-Gaussian regression. On the other hand, we are not aware of previous works that consider normalizing flows with pre-defined basis transformations that ensure convexity.

On the practical side, we focus on a specific NFR implementation based on a dictionary of piecewise linear transformation functions. The dictionary partitions the label's domain into predefined bins, and uses a different transformation in each bin. By choosing the number of bins, this implementation allows for a flexible tradeoff between complexity and expressive power. In addition, we provide a closed form solution to the conditional expectation of the piecewise linear model, which is useful in prediction tasks. We then demonstrate the performance advantages of our NFR implementation over LR and GR using numerical experiments on both synthetic and real world data. Given enough samples, NFR results in better likelihood values than its competitors. In terms of prediction error, NFR performs similarly to both LR and GR.

The letter is organized as follows. We begin in Section II by introducing the general NFR framework. We derive the NFR model, detail its underlying assumptions, prove its convexity, and show that LR and GR are both special cases. In Section III we present our specific NFR implementation based on a dictionary of piecewise linear functions. Finally, in Section IV we present the results of our numerical experiments.

The authors are with the Hebrew University of Jerusalem, Israel, and Google Research. This research was partially supported by the Israel Science Foundation (ISF) under Grant 1339/15.
II. NORMALIZING FLOW REGRESSION (NFR)

In this section, we introduce Normalizing Flow Regression (NFR), a non-Gaussian yet convex generalization of linear regression. NFR learns the posterior $p_\theta(y|x)$ of $y \in \mathbb{R}$ given $x \in \mathbb{R}^k$, parameterized by $\theta \in \mathbb{R}^d$; NFR tries to estimate the unknown $\theta$ given pairs of $(y, x)$. It parameterizes the posterior using an expressive class of transformation functions to the Gaussian distribution. For this purpose, we define

$$g_\theta(y; x) \sim \mathcal{N}(0, 1).$$

The transformation function $g$ operates on $y$ and is determined by the arguments $\theta \in \Theta_x$ and $x \in \mathcal{X}$. We require the following assumptions:

Assumption 1.
The transformation $g$ is differentiable and monotonically increasing with respect to $y$ for any $\theta \in \Theta_x$ and $x \in \mathcal{X}$.

Assumption 2.
The set $\Theta_x$ is convex in $\theta$ for all $x \in \mathcal{X}$. Within this set, the transformation $g$ is affine in $\theta$.

While at first sight these assumptions might seem restrictive, they not only give rise to concrete constructions of $g(\cdot)$, but also allow this function to be highly non-linear and non-convex in $x$ and $y$. In fact, by choosing an expressive enough transformation class, $g$ can approximate an arbitrary continuous posterior. The assumptions are simply designed to ensure a tractable likelihood function.
Lemma 1. Under Assumption 1 and ignoring constants, the negative log likelihood of $y$ parameterized by $\theta$ is

$$-\log p_\theta(y|x) = \frac{1}{2}\, g_\theta^2(y; x) - \log g'_\theta(y; x),$$

where the derivative is defined as

$$g'_\theta(y; x) = \frac{\partial g_\theta(y; x)}{\partial y}.$$

Together with Assumption 2, this objective is convex in $\theta \in \Theta_x$ for all $x \in \mathcal{X}$.

Proof. Assumption 1 allows the use of the classical formula of transformed random variables [17]. Since $g'_\theta(y; x)$ must be positive, we get

$$p_\theta(y|x) = \frac{g'_\theta(y; x)}{\sqrt{2\pi}}\, e^{-g_\theta^2(y; x)/2}, \tag{1}$$

which yields the likelihood. Convexity is guaranteed by noting that the quadratic and the negative logarithm are convex functions. Next, $g$ is affine in $\theta$, and consequently $g'$ is affine in $\theta$ too. Finally, convexity is preserved under affine transformations.

Given a training dataset $\{(x_i, y_i)\}_{i=1,\ldots,n}$, NFR is defined as the maximum likelihood estimate of $\theta$:

$$\min_\theta \; \sum_i \frac{1}{2}\, g_\theta^2(y_i; x_i) - \log g'_\theta(y_i; x_i) \quad \text{s.t.} \quad \theta \in \Theta_{x_i}, \;\; i = 1, \ldots, n. \tag{2}$$

Convexity ensures that this minimization can be efficiently solved using existing toolboxes. In fact, this property allows us to add convex penalties and/or constraints. We can use these to regularize the objective when the number of samples is insufficiently large compared to the dimension of $\theta$.
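As an illustration of the toolbox route, the following is a minimal sketch (not the authors' code) of solving (2) with CVXPY. It assumes the affine parameterization has been pre-evaluated into matrices G and Gp (hypothetical names) whose $i$-th rows give $g_\theta(y_i; x_i)$ and $g'_\theta(y_i; x_i)$ as linear functions of $\theta$:

    import cvxpy as cp

    def fit_nfr(G, Gp, eps=1e-6):
        """Solve the convex ML problem (2): min 0.5*sum(g^2) - sum(log g')."""
        n, d = G.shape
        theta = cp.Variable(d)
        g = G @ theta      # g_theta(y_i; x_i), affine in theta
        gp = Gp @ theta    # derivative w.r.t. y, also affine in theta
        # Negative log likelihood from Lemma 1, constants dropped.
        nll = 0.5 * cp.sum_squares(g) - cp.sum(cp.log(gp))
        # Theta_x constraint: the derivative must stay strictly positive.
        prob = cp.Problem(cp.Minimize(nll), [gp >= eps])
        prob.solve()
        return theta.value

Both terms are convex under the disciplined convex programming rules (a squared affine expression and a negated log of an affine expression), and the strict positivity of $\Theta_x$ is approximated with a small margin eps.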
As can be easily seen, NFR generalizes two well known special cases:

Lemma 2. If $g$ is jointly linear in $x$ and $y$, then NFR reduces to standard linear regression of $y$ given $x$:

$$y \,|\, x \sim \mathcal{N}\big(w_\theta^T x,\, \sigma_\theta^2\big).$$

Proof. This follows immediately by defining $g_\theta(y; x) = u_\theta^T x + v_\theta y$ with $v_\theta > 0$, $w_\theta = -u_\theta / v_\theta$ and $\sigma_\theta^2 = 1 / v_\theta^2$.
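For concreteness, the change of variables behind this proof reads

$$z = u_\theta^T x + v_\theta y \sim \mathcal{N}(0,1) \;\;\Longrightarrow\;\; y = \frac{z - u_\theta^T x}{v_\theta} \;\sim\; \mathcal{N}\!\left(-\frac{u_\theta^T x}{v_\theta},\, \frac{1}{v_\theta^2}\right) = \mathcal{N}\big(w_\theta^T x,\, \sigma_\theta^2\big).$$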
Lemma 3. If $g$ is affine in $y$, then NFR reduces to maximum likelihood estimation of a Gaussian distribution in its canonical form:

$$y \sim \mathcal{N}\big(w_\theta(x),\, \sigma_\theta^2(x)\big).$$

Proof. This follows immediately by defining $g_\theta(y; x) = u_\theta(x) + v_\theta(x)\, y$ with $v_\theta(x) > 0$, $w_\theta(x) = -u_\theta(x) / v_\theta(x)$ and $\sigma_\theta^2(x) = 1 / v_\theta^2(x)$.

Thus, with proper constraints, LR and GR can both be implemented as special cases of NFR.
III. PIECEWISE LINEAR NFR

Generally, NFR only requires Assumptions 1-2 to ensure a well defined convex negative log likelihood. To make the construction concrete, we now provide a specific transformation class that satisfies these assumptions and allows a flexible tradeoff between complexity and expressivity. The class is based on a dictionary of piecewise linear, monotonically non-decreasing functions of $y$ with weights that are affine functions of the features:

$$g_{A,b}(y; x) = h^T(y)\,[A \psi(x) + b],$$

where $\theta = \{A \in \mathbb{R}^{(L+3) \times k},\, b \in \mathbb{R}^{L+3}\}$ are the unknown parameters, $\psi(x) \in \mathbb{R}^k$ are pre-defined feature mappings (possibly just $x$ itself), and $h(y)$ is the dictionary defined as

$$h^T(y) = [1,\, h_0(y),\, \ldots,\, h_{L+1}(y)], \tag{3}$$

where each of the basis functions $h_i$ is monotonically non-decreasing in $y$.

We choose to construct our dictionary using simple step functions. More specifically, let $I = [p_0, \ldots, p_L]$ be evenly spaced points on the real line, such that $\Delta = p_{j+1} - p_j$ is a fixed distance between two adjacent points. We then define

$$h_j(y) = \begin{cases} 0 & y \le p_{j-1} \\ y - p_{j-1} & p_{j-1} < y \le p_j \\ \Delta & p_j < y \end{cases} \qquad \forall j \in \{1, \ldots, L\}.$$

Below and above the points $p_0$ and $p_L$ we define, respectively,

$$h_0(y) = \begin{cases} y - p_0 & y \le p_0 \\ 0 & p_0 < y \end{cases}, \qquad h_{L+1}(y) = \begin{cases} 0 & y \le p_L \\ y - p_L & p_L < y \end{cases}.$$
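For concreteness, here is a small sketch (hypothetical helpers, not from the paper) that evaluates the dictionary $h(y)$ of (3) and its derivative under this step-function construction, for knots $p = [p_0, \ldots, p_L]$:

    import numpy as np

    def dictionary(y, p):
        """Evaluate h(y) = [1, h_0(y), ..., h_{L+1}(y)] for knots p."""
        L = len(p) - 1
        delta = p[1] - p[0]                  # fixed spacing Delta
        h = np.empty(L + 3)
        h[0] = 1.0                           # constant basis element
        h[1] = min(y - p[0], 0.0)            # h_0: ramp below p_0, zero above
        for j in range(1, L + 1):            # h_j, j = 1, ..., L
            h[j + 1] = np.clip(y - p[j - 1], 0.0, delta)
        h[L + 2] = max(y - p[L], 0.0)        # h_{L+1}: ramp above p_L
        return h

    def dictionary_deriv(y, p):
        """Evaluate h'(y), the derivative of each basis element w.r.t. y."""
        L = len(p) - 1
        d = np.zeros(L + 3)                  # the constant element has zero slope
        d[1] = 1.0 if y <= p[0] else 0.0
        for j in range(1, L + 1):
            d[j + 1] = 1.0 if p[j - 1] < y <= p[j] else 0.0
        d[L + 2] = 1.0 if y > p[L] else 0.0
        return d

Each entry is a clipped ramp, so every basis function is monotonically non-decreasing and $g$ inherits monotonicity whenever all weights beyond the constant are positive.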
Fig. 1: Example of basis functions, with $h_0(y)$ (left), $h_j(y)$ for $1 \le j \le L$ (center) and $h_{L+1}(y)$ (right).

An illustration of these functions is given in Fig. 1. To ensure a monotonically increasing $g$, we define the domain $\Theta_x$:

$$\Theta_x = \big\{ (A, b) : [A \psi(x) + b]_j > 0 \;\; \forall j \ge 1 \big\},$$

where the entries are indexed from zero, so that only the constant weight is unconstrained. We can now easily see that this construction satisfies Assumptions 1-2 and provides a convex NFR. Reminiscent of kernel density estimation [18], the dictionary divides the domain of $y$ into predefined bins and allows a different slope at each bin. With enough bins, any monotonically increasing transformation can be modeled.

Many applications require the point estimation of unknown labels, or the minimum mean squared error (MMSE) estimate of $y$ given $x$. Here lies another advantage of the piecewise linear NFR, whose construction allows a closed form computation of these quantities. The following lemma provides a closed form for the conditional expectation of $y$ given $x$, which can then be used to obtain point estimates or MMSEs.
Lemma 4. The conditional expectation of $y$ given $x$ in the piecewise linear model with parameters $A$ and $b$ is

$$\mathbb{E}(y|x) = -\frac{e^{-\mu^2/2}}{\sqrt{2\pi}\,\alpha_0} + \left(p_0 - \frac{\mu}{\alpha_0}\right)\Phi(\mu) + \sum_{i=0}^{L-1}\left( \frac{e^{-(\mu+\bar\Delta_i)^2/2} - e^{-(\mu+\bar\Delta_{i+1})^2/2}}{\sqrt{2\pi}\,\alpha_{i+1}} + \left(p_i - \frac{\mu+\bar\Delta_i}{\alpha_{i+1}}\right)\big(\Phi(\mu+\bar\Delta_{i+1}) - \Phi(\mu+\bar\Delta_i)\big) \right) + \frac{e^{-(\mu+\bar\Delta_L)^2/2}}{\sqrt{2\pi}\,\alpha_{L+1}} + \left(p_L - \frac{\mu+\bar\Delta_L}{\alpha_{L+1}}\right)\big(1 - \Phi(\mu+\bar\Delta_L)\big), \tag{4}$$

where $\Phi$ is the standard normal cumulative distribution function (CDF) and we define

$$[\mu,\, \alpha_0,\, \ldots,\, \alpha_{L+1}]^T = A\psi(x) + b, \qquad \bar\Delta_0 = 0, \qquad \bar\Delta_i = \Delta(\alpha_1 + \cdots + \alpha_i) \;\; \forall i = 1, \ldots, L.$$
Proof. We obtain this result by computing the expectation on each segment of the real line, as partitioned by $I$. The NFR density function on each segment is given by

$$p_\theta(y|x) = \begin{cases} \dfrac{\alpha_0}{\sqrt{2\pi}}\, e^{-(\mu + \alpha_0(y - p_0))^2/2} & y \le p_0 \\[6pt] \dfrac{\alpha_{i+1}}{\sqrt{2\pi}}\, e^{-(\mu + \bar\Delta_i + \alpha_{i+1}(y - p_i))^2/2} & y \in (p_i, p_{i+1}], \;\; 0 \le i \le L-1 \\[6pt] \dfrac{\alpha_{L+1}}{\sqrt{2\pi}}\, e^{-(\mu + \bar\Delta_L + \alpha_{L+1}(y - p_L))^2/2} & y > p_L. \end{cases}$$

By a standard calculation, the integral of $y\, p_\theta(y|x)$ over the leftmost segment satisfies

$$\int_{-\infty}^{p_0} y\, p_\theta(y|x)\, dy = -\frac{e^{-\mu^2/2}}{\sqrt{2\pi}\,\alpha_0} + \left(p_0 - \frac{\mu}{\alpha_0}\right)\Phi(\mu),$$

while the integral over the rightmost segment satisfies

$$\int_{p_L}^{\infty} y\, p_\theta(y|x)\, dy = \frac{e^{-(\mu+\bar\Delta_L)^2/2}}{\sqrt{2\pi}\,\alpha_{L+1}} + \left(p_L - \frac{\mu+\bar\Delta_L}{\alpha_{L+1}}\right)\big(1 - \Phi(\mu+\bar\Delta_L)\big).$$

For the inner segments, $i = 0, \ldots, L-1$, it holds that

$$\int_{p_i}^{p_{i+1}} y\, p_\theta(y|x)\, dy = \frac{e^{-(\mu+\bar\Delta_i)^2/2} - e^{-(\mu+\bar\Delta_{i+1})^2/2}}{\sqrt{2\pi}\,\alpha_{i+1}} + \left(p_i - \frac{\mu+\bar\Delta_i}{\alpha_{i+1}}\right)\big(\Phi(\mu+\bar\Delta_{i+1}) - \Phi(\mu+\bar\Delta_i)\big).$$

The required result is obtained by summing up all these terms.
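A direct transcription of (4) can serve as a reference implementation. The sketch below (variable names are ours, not the authors') takes the per-sample parameter vector $w = [\mu, \alpha_0, \ldots, \alpha_{L+1}]^T = A\psi(x) + b$ and the knots $p = [p_0, \ldots, p_L]$:

    import numpy as np
    from scipy.stats import norm

    def conditional_mean(w, p):
        """Closed-form E[y|x] of Lemma 4; w = [mu, alpha_0, ..., alpha_{L+1}]."""
        mu, alpha = w[0], w[1:]
        L = len(p) - 1
        delta = p[1] - p[0]
        # Cumulative offsets: D[0] = 0, D[i] = Delta*(alpha_1 + ... + alpha_i).
        D = np.concatenate(([0.0], delta * np.cumsum(alpha[1:L + 1])))
        # Leftmost segment (-inf, p_0]; norm.pdf(t) = exp(-t^2/2)/sqrt(2*pi).
        e = -norm.pdf(mu) / alpha[0] + (p[0] - mu / alpha[0]) * norm.cdf(mu)
        # Inner segments (p_i, p_{i+1}] for i = 0, ..., L-1.
        for i in range(L):
            a = alpha[i + 1]
            e += (norm.pdf(mu + D[i]) - norm.pdf(mu + D[i + 1])) / a
            e += (p[i] - (mu + D[i]) / a) * (norm.cdf(mu + D[i + 1]) - norm.cdf(mu + D[i]))
        # Rightmost segment (p_L, inf).
        aL = alpha[L + 1]
        e += norm.pdf(mu + D[L]) / aL
        e += (p[L] - (mu + D[L]) / aL) * (1 - norm.cdf(mu + D[L]))
        return e

A quick sanity check is to compare its output against numerical integration of $y\, p_\theta(y|x)$ on a fine grid.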
Next we provide an Alternating Direction Method of Multipliers (ADMM) algorithm for solving the piecewise linear NFR efficiently [19]. Due to space limitations, we omit the derivation and many of the details. In brief, the piecewise linear version of (2) boils down to the following convex optimization:

$$\min_w \; w^T P w - 2 \sum_\ell \log\big([Qw]_\ell\big), \tag{5}$$

where $P \succeq 0$ and $Q$ are fixed matrices that depend on the data and setting, and $w$ is a vector of all the unknown parameters. ADMM solves this minimization by defining $z_\ell = [Qw]_\ell \ge 0$ for each $\ell$ and alternatingly solving for $w$, $z_\ell$ and the multipliers $y_\ell$. The overall algorithm is provided in Algorithm 1, where $\rho$ is a step size parameter.

Algorithm 1: ADMM
Repeat until convergence:
  • $w \leftarrow -\big(2P + \rho Q^T Q\big)^{-1} Q^T (y - \rho z)$
  • $z_\ell \leftarrow \dfrac{\rho [Qw]_\ell + y_\ell + \sqrt{(\rho [Qw]_\ell + y_\ell)^2 + 8\rho}}{2\rho} \quad \forall \ell$
  • $y \leftarrow y + \rho (Qw - z)$
Return $w$.

Given $n$ data points $(x_i, y_i)$, the matrices $P$ and $Q$ are defined as follows:

$$p_i^T = \big[\, h^T(y_i) \otimes \psi^T(x_i), \;\; h^T(y_i) \,\big] \;\Longrightarrow\; P = \sum_{i=1}^{n} p_i p_i^T,$$

$$Q_{i:} = \big[\, h'^T(y_i) \otimes \psi^T(x_i), \;\; h'^T(y_i) \,\big], \quad \forall i = 1, \ldots, n,$$

where $\otimes$ is the Kronecker product, $h^T(y_i)$ is the row vector defined in (3), $h'^T(y_i)$ is its derivative with respect to $y_i$, and $Q_{i:}$ is the $i$-th row of $Q$. See the supplementary material for more details.
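In code, one ADMM pass is a linear solve, an elementwise positive root, and a dual update. The sketch below follows Algorithm 1; the step size $\rho$, the fixed iteration count, and the explicit inverse are illustrative choices (the paper defers the full derivation to the supplementary material):

    import numpy as np

    def admm_nfr(P, Q, rho=1.0, iters=500):
        """ADMM for (5): min_w w'Pw - 2*sum(log([Qw]_l)) with slack z = Qw."""
        n, d = Q.shape
        z = np.ones(n)                            # slack variables z_l = [Qw]_l
        u = np.zeros(n)                           # multipliers (y in Algorithm 1)
        M = np.linalg.inv(2 * P + rho * Q.T @ Q)  # formed once, reused every pass
        w = np.zeros(d)
        for _ in range(iters):
            w = -M @ (Q.T @ (u - rho * z))        # w-update: quadratic solve
            v = rho * (Q @ w) + u
            z = (v + np.sqrt(v ** 2 + 8 * rho)) / (2 * rho)  # root of rho*z^2 - v*z - 2 = 0
            u = u + rho * (Q @ w - z)             # dual ascent on the multipliers
        return w

In practice one would replace the fixed iteration count with a stopping rule on the primal and dual residuals, and use a cached Cholesky factorization instead of an explicit inverse.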
Fig. 2: NLL and MSE results of the different models on synthetic LR data, for an increasing number of samples.
IV. NUMERICAL EXPERIMENTS
We now demonstrate the efficacy of our approach in capturing expressive conditional distributions while still retaining model convexity. In all the simulations, we compare our NFR approach to the LR and GR baselines. NFR is computed using the ADMM in Algorithm 1. In the synthetic simulations, we also report a clairvoyant lower bound based on the true unknown parameters. Performance is measured using the negative log likelihood (NLL) of an independent test set. We also provide normalized root mean squared error (MSE) results with respect to the prediction error $(\hat{y} - y)^2$, where $\hat{y}$ is the conditional expectation in Lemma 4. The reported metrics are empirical medians over independent trials per point.

Synthetic data. We begin with experiments on synthetic data generated from the LR and NFR regression models. The true parameters are fixed constants and the features are standard normal variables. Figs. 2 and 3 provide the NLL and MSE results as a function of the number of samples generated from the LR and NFR settings, respectively. In terms of NLL, each estimator performs best under its true model. NFR is more expressive than its competitors and requires more samples to reach their performance on their specialized models. On the other hand, NFR is significantly better when the true data is NFR distributed. The MSE experiments validate Lemma 4. In terms of minimizing prediction error, LR is typically sufficient and there is no need for GR or NFR, yet the penalty for their usage is negligible when the number of samples is large. Asymptotically, the non-linearities in NFR even give a slight advantage and NFR outperforms its competitors. We obtain similar results on GR samples.
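Since $g$ is piecewise linear and increasing, NFR-distributed labels such as those used above can be drawn by inverse-transform sampling: draw $z \sim \mathcal{N}(0,1)$ and invert $g$ segment by segment. A sketch (hypothetical helper) under the Section III parameterization, with $w = [\mu, \alpha_0, \ldots, \alpha_{L+1}]$ and knots $p$:

    import numpy as np

    def sample_nfr(w, p, rng=None):
        """Draw y ~ p_theta(.|x) by inverting the piecewise linear g at z ~ N(0,1)."""
        rng = rng or np.random.default_rng()
        mu, alpha = w[0], w[1:]
        L = len(p) - 1
        delta = p[1] - p[0]
        # Values of g at the knots: g(p_i) = mu + Delta*(alpha_1 + ... + alpha_i).
        knots = mu + np.concatenate(([0.0], delta * np.cumsum(alpha[1:L + 1])))
        z = rng.standard_normal()
        if z <= knots[0]:                        # land below p_0, slope alpha_0
            return p[0] + (z - knots[0]) / alpha[0]
        if z > knots[L]:                         # land above p_L, slope alpha_{L+1}
            return p[L] + (z - knots[L]) / alpha[L + 1]
        i = np.searchsorted(knots, z) - 1        # segment (p_i, p_{i+1}]
        return p[i] + (z - knots[i]) / alpha[i + 1]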
Fig. 3: NLL and MSE results of the different models on synthetic NFR data, for an increasing number of samples.

Fig. 4: NLL and MSE results of the different models on real solar power data, for an increasing number of samples.

Real world data. We now evaluate our NFR approach in a real world setting. We follow [20] and consider a solar power time series. Denoting the time series by $\{t_i\}_{i=1}^n$, our goal is to predict its future outcome. Fixing a set of indices $J = (1, \ldots)$, we extract $n$ samples through $x_i = \big(t_{i+J(1)}, \ldots, t_{i+J(9)}\big)$ and $y_i = t_{i+90}$, where $i = 1, \ldots, n$. For each trial we set aside samples to be used as a test set, while the remaining samples serve as a training set. As previously, we estimate the LR, GR and NFR parameters using the training set, and compute the MSE and likelihood on the test set. We perform independent trials for each value of $n$. As shown in Fig. 4, NFR results in significantly better NLL values. In terms of MSE, NFR requires many more samples to reach the performance of LR and GR, as it has more unknown parameters.

REFERENCES

[1] E. G. Tabak and C. V. Turner, "A family of nonparametric density estimation algorithms," Communications on Pure and Applied Mathematics, vol. 66, no. 2, pp. 145–164, 2013.
[2] D. A. Nix and A. S. Weigend, "Learning local error bars for nonlinear regression," in Advances in Neural Information Processing Systems, pp. 489–496, 1995.
[3] T. Heskes, "Practical confidence and prediction intervals," in Advances in Neural Information Processing Systems, pp. 176–182, 1997.
[4] G. Papadopoulos, P. J. Edwards, and A. F. Murray, "Confidence estimation methods for neural networks: A practical comparison," IEEE Transactions on Neural Networks, vol. 12, no. 6, pp. 1278–1287, 2001.
[5] N. Meinshausen, "Quantile regression forests," Journal of Machine Learning Research, vol. 7, pp. 983–999, 2006.
[6] M. J. Wainwright, M. I. Jordan, et al., "Graphical models, exponential families, and variational inference," Foundations and Trends in Machine Learning, vol. 1, no. 1–2, pp. 1–305, 2008.
[7] H. Liu, J. Lafferty, and L. Wasserman, "The nonparanormal: Semiparametric estimation of high dimensional undirected graphs," Journal of Machine Learning Research, vol. 10, pp. 2295–2328, 2009.
[8] R. B. Nelsen, An Introduction to Copulas. Springer Science & Business Media, 2007.
[9] C. A. Klaassen, J. A. Wellner, et al., "Efficient estimation in the bivariate normal copula model: normal margins are least favourable," Bernoulli, vol. 3, no. 1, pp. 55–77, 1997.
[10] M. Davy and A. Doucet, "Copulas: a new insight into positive time-frequency distributions," IEEE Signal Processing Letters, vol. 10, no. 7, pp. 215–218, 2003.
[11] S. G. Iyengar, P. K. Varshney, and T. Damarla, "A parametric copula-based framework for hypothesis testing using heterogeneous data," IEEE Transactions on Signal Processing, vol. 59, no. 5, pp. 2308–2319, 2011.
[12] Y. Woodbridge, G. Elidan, and A. Wiesel, "Signal detection in complex structured para normal noise," IEEE Transactions on Signal Processing, vol. 65, no. 9, pp. 2306–2316, 2017.
[13] I. Kobyzev, S. Prince, and M. A. Brubaker, "Normalizing flows: Introduction and ideas," arXiv preprint arXiv:1908.09257, 2019.
[14] D. J. Rezende and S. Mohamed, "Variational inference with normalizing flows," arXiv preprint arXiv:1505.05770, 2015.
[15] E. G. Tabak, E. Vanden-Eijnden, et al., "Density estimation by dual ascent of the log-likelihood," Communications in Mathematical Sciences, vol. 8, no. 1, pp. 217–233, 2010.
[16] L. Dinh, J. Sohl-Dickstein, and S. Bengio, "Density estimation using real NVP," arXiv preprint arXiv:1605.08803, 2016.
[17] S. M. Ross, Introduction to Probability Models. Academic Press, 2014.
[18] L. Wasserman, All of Nonparametric Statistics. Springer Science & Business Media, 2006.
[19] S. Boyd, N. Parikh, E. Chu, B. Peleato, J. Eckstein, et al., "Distributed optimization and statistical learning via the alternating direction method of multipliers," Foundations and Trends in Machine Learning, vol. 3, no. 1, pp. 1–122, 2011.
[20] G. Lai, W.-C. Chang, Y. Yang, and H. Liu, "Modeling long- and short-term temporal patterns with deep neural networks," in The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 95–104, 2018.