CONVEX TECHNIQUES FOR MODEL SELECTION
DUSTIN TRAN
Abstract.
We develop a robust convex algorithm to select the regularization parameter in model selection. In practice this would be automated in order to save practitioners time from having to tune it manually. In particular, we implement and test the convex method for $K$-fold cross-validation on ridge regression, although the same concept extends to more complex models. We then compare its performance with standard methods.

1. Introduction
Regularization has been studied extensively since the theory of ill-posed problems [6]. By adding a penalty associated to a choice of regularization parameter, one penalizes overfitted models and can achieve good generalization performance with the training data. This has been a crucial feature in current modeling frameworks: for example, Tikhonov regularization [10], Lasso regression [5], smoothing splines [8], regularization networks [1], SVMs [11], and LS-SVMs [9]. SVMs in particular are characterized by dual optimization reformulations, and their solutions follow from convex programs. Standard SVMs reduce to solving quadratic problems, and LS-SVMs reduce to solving a set of linear equations.

Many general purpose methods to measure the appropriateness of a regularization parameter for given data exist: cross-validation (CV), generalized CV, Mallows's $C_p$, Minimum Description Length (MDL), the Akaike Information Criterion (AIC), and the Bayesian Information Criterion (BIC). There has also been recent interest in discovering closed-form expressions of the solution path [3] and in developing homotopy methods [7].

Like SVMs, this paper takes the perspective of convex optimization, following [2], in order to tune the regularization parameter for optimal model selection. Classical Tikhonov regularization schemes require two steps: 1. (training) choosing a grid of fixed parameter values, find the solution for each constant regularization parameter; 2. (validating) optimize over the regularization constants and choose the model according to a model selection criterion. The approach in this paper reformulates the problem as one of constrained optimization. This allows one to compute both steps simultaneously: minimize the validation measure subject to the training equations as constraints.
This approach offers noticeable advantages over the ones above, of which we outline a few here:
• Automation: Practical users of machine learning tools may not be interested in tuning the parameter manually. This brings us closer to fully automated algorithms.
• Convexity: It is (usually) much easier to examine worst case behavior for convex sets, rather than attempting to characterize all possible local minima.
• Performance: The algorithmic approach of training and validating simultaneously is occasionally more efficient than general purpose optimization routines.
For this write-up, we focus only on ridge regression, although the same approach can be applied to more complex model selection problems (which may benefit more, as they suffer more often from local minima).
2. Background
Ridge solution set.
We introduce the standard approach to ridge regression and prove key properties for a convex reformulation. For a more rigorous introduction, see [4]. Let $y \in \mathbb{R}^n$ and let $M$ be an $n \times n$ positive semi-definite matrix. For a fixed $\gamma \in (0, \infty)$, recall that Tikhonov regularization schemes of linear operators lead to the solution $\hat{\beta} \in \mathbb{R}^n$, where the estimator $\hat{\beta}$ solves
(2.1) $(M + \gamma I_n)\beta = y.$
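For concreteness, the training step in (2.1) can be carried out directly; the following minimal sketch (our own illustration, not part of the paper) solves the system for one fixed $\gamma$ on an arbitrary toy problem.

```python
import numpy as np

def ridge_solution(M, y, gamma):
    """Solve (M + gamma * I_n) beta = y for a single fixed gamma (eq. 2.1)."""
    n = M.shape[0]
    return np.linalg.solve(M + gamma * np.eye(n), y)

# Toy problem: M = X^T X for a random design matrix X (placeholder data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
M, y = X.T @ X, X.T @ rng.normal(size=50)
beta_hat = ridge_solution(M, y, gamma=0.1)
```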
Definition 2.1.
For a fixed value $\gamma_0 > 0$, we define the ridge solution set $S$ as the set of all solutions $\hat\beta$ corresponding to a value $\gamma \in (\gamma_0, \infty)$. That is,
(2.2) $S(\gamma, \beta \mid \gamma_0, M, y) := \{\gamma \in (\gamma_0, \infty),\ \beta \in \mathbb{R}^n \mid (M + \gamma I_n)\beta = y\}.$
The value $\gamma_0$ can be thought of as the minimal regularization parameter allowed in the solution set; this would speed up computation should the user already know a lower bound on their optimal choice of $\gamma$.
Let $M = U \Sigma U^T$ denote the reduced singular value decomposition (SVD) of $M$, i.e., $U$ is orthogonal and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$ contains all the ordered positive eigenvalues with $\sigma_1 \geq \cdots \geq \sigma_n$.

Lemma 2.2.
The solution function $h_{M,y}(\gamma) = (M + \gamma I_n)^{-1} y$ is Lipschitz continuous.

Proof. For two values $\gamma_1, \gamma_2$ such that $\gamma_0 \leq \gamma_1 \leq \gamma_2 < \infty$,
(2.3) $\|(M + \gamma_1 I_n)^{-1} y - (M + \gamma_2 I_n)^{-1} y\| = \left\| U \big( (\Sigma + \gamma_1 I_n)^{-1} - (\Sigma + \gamma_2 I_n)^{-1} \big) U^T y \right\|$
(2.4) $\leq \max_i \left| \frac{1}{\sigma_i + \gamma_1} - \frac{1}{\sigma_i + \gamma_2} \right| \|y\|$
(2.5) $\leq \left| \frac{1}{\sigma_n + \gamma_1} - \frac{1}{\sigma_n + \gamma_2} \right| \|y\|.$
Consider the function
(2.6) $g(x) = \frac{1}{\sigma_n + \gamma_0 + x}.$
Then by the mean value theorem, there exists a value $c \in [\gamma_1 - \gamma_0,\ \gamma_2 - \gamma_0]$ such that
(2.7) $\left| \frac{1}{\sigma_n + \gamma_1} - \frac{1}{\sigma_n + \gamma_2} \right| \|y\| \leq |g'(c)| \, |\gamma_1 - \gamma_2| \, \|y\| \leq \frac{\|y\|}{(\sigma_n + \gamma_0)^2} \, |\gamma_1 - \gamma_2|. \qquad \square$

Convex relaxation.
We now examine the convex hull of the solution set $S(\gamma, \beta \mid \gamma_0, M, y)$. This will allow us to search efficiently through a convex set of hypotheses as in Section 3. We decompose the solution function into a sum of low rank matrices:
(2.8) $h_{M,y}(\gamma) = (M + \gamma I_n)^{-1} y = U(\Sigma + \gamma I_n)^{-1} U^T y = \sum_{i=1}^n \frac{1}{\sigma_i + \gamma}\, U_i U_i^T y.$
Then define $\lambda_i := 1/(\sigma_i + \gamma)$ for all $i = 1, \ldots, n$. The following proposition provides linear constraints on the set of $\lambda_i$'s following from this reparameterization.
Proposition 2.3.
Let $\sigma'_i := \sigma_i + \gamma_0$ for $i = 1, \ldots, n$. Then the polytope $R$ parametrized by $\Lambda = \{\lambda_1, \ldots, \lambda_n\}$ as follows,
(2.9) $R(\Lambda, \beta \mid \gamma_0, M, y) = \begin{cases} U_i^T \beta = \lambda_i U_i^T y, & \text{for all } i = 1, \ldots, n, \\ 0 < \lambda_i \leq 1/\sigma'_i, & \text{for all } i = 1, \ldots, n, \\ \frac{\sigma'_{i+1}}{\sigma'_i} \lambda_{i+1} \leq \lambda_i \leq \lambda_{i+1}, & \text{for all } \sigma'_i \geq \sigma'_{i+1},\ i = 1, \ldots, n-1, \end{cases}$
is convex, and moreover, forms a convex hull to $S$.

Proof. It is easy to verify the first two constraints by looking at the function $g$ defined previously, which is strictly increasing. Set $\gamma' := \gamma - \gamma_0 > 0$, and $\lambda_i = 1/(\sigma'_i + \gamma')$ as before. Then for all $\sigma'_i \geq \sigma'_{i+1}$,
(2.10) $\lambda_i = \frac{1/(\sigma'_{i+1} + \gamma')}{1 + \dfrac{\sigma'_i - \sigma'_{i+1}}{\sigma'_{i+1} + \gamma'}}$
(2.11) $= \frac{\lambda_{i+1}}{1 + \lambda_{i+1}(\sigma'_i - \sigma'_{i+1})}$
(2.12) $\geq \frac{\lambda_{i+1}}{1 + (\sigma'_i - \sigma'_{i+1})/\sigma'_{i+1}} = \frac{\sigma'_{i+1}}{\sigma'_i}\, \lambda_{i+1}.$
Hence the third inequality is also true. Moreover, the set $R(\Lambda, \beta \mid \gamma_0, M, y)$ is characterized entirely by equalities and inequalities which are linear in the unknown $\lambda_i$'s. So it forms a polytope. Then by the above result together with the convex property of polytopes, it follows that $R(\Lambda, \beta \mid \gamma_0, M, y)$ is a convex relaxation to the set $S(\gamma, \beta \mid \gamma_0, M, y)$. $\square$

We now find the relationship between solutions in $S(\gamma, \beta \mid \gamma_0, M, y)$ and solutions in its convex relaxation $R(\Lambda, \beta \mid \gamma_0, M, y)$.

Proposition 2.4.
The maximal distance between a solution in $R(\Lambda, \beta \mid \gamma_0, M, y)$ and its closest counterpart in $S(\gamma, \beta \mid \gamma_0, M, y)$ is bounded by the maximum range of the inverse eigenvalue spectrum:
(2.13) $\min_{\gamma} \left\| U \Lambda U^T y - (M + \gamma I_n)^{-1} y \right\| \leq \|y\| \max_{i > j} \left| \frac{1}{\sigma'_i} - \frac{1}{\sigma'_j} \right|.$

Proof.
Following similar steps to prove Lemma 2.2, one can show that the maximal difference between a solution $\beta_{\Gamma}$ for a given $\Gamma$ and its corresponding closest $\beta_{\hat\gamma}$ is
(2.14) $\min_{\hat\gamma} \left\| U \Lambda U^T y - (M + \hat\gamma I_n)^{-1} y \right\| \leq \min_{\gamma'} \|y\| \max_{i = 1, \ldots, n} \left| \lambda_i - \frac{1}{\sigma'_i + \gamma'} \right|.$
For any $\lambda_i \neq \lambda_j$, the value
(2.15) $\min_{\gamma'} \max\left\{ \left| \lambda_i - \frac{1}{\sigma'_i + \gamma'} \right|,\ \left| \lambda_j - \frac{1}{\sigma'_j + \gamma'} \right| \right\}$
is bounded by the worst case scenario that the solution $\gamma$ passes through $\lambda_i$ or through $\lambda_j$. Then
(2.16) $\min_{\gamma'} \max_i \left| \lambda_i - \frac{1}{\sigma'_i + \gamma'} \right| \leq \max_{i \neq j} \left| \lambda_i - \frac{1}{\sigma'_i + \gamma'_j} \right|$
(2.17) $< \max_{i \neq j} \lambda_i \left| 1 - \frac{\sigma'_i}{\sigma'_j} \right|$
(2.18) $< \max_{i > j} \left| \frac{1}{\sigma'_i} - \frac{1}{\sigma'_j} \right|,$
where $\gamma'_j$ satisfies $\lambda_j = 1/(\sigma'_j + \gamma'_j)$. That is, $\gamma'_j = 1/\lambda_j - \sigma'_j$, which is greater than zero by construction. Hence for all $\Gamma \in \mathbb{R}^n$, there exists $\hat\gamma$ such that
(2.19) $\|\beta_{\Gamma} - \beta_{\hat\gamma}\| \leq \|y\| \max_{i > j} \left| \frac{1}{\sigma'_i} - \frac{1}{\sigma'_j} \right|. \qquad \square$
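To make the bound in (2.19) concrete, the following small numerical check is our own illustration (not from the paper): we pick a feasible $\Lambda$ in the polytope by mixing two admissible ridge spectra, scan a grid of $\gamma > \gamma_0$, and confirm that the closest point of the solution set lies within the stated bound. The matrices, mixing weights, and grid are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, gamma0 = 4, 0.1
A = rng.normal(size=(n, n))
M = A @ A.T                          # positive semi-definite test matrix
y = rng.normal(size=n)

sigma, U = np.linalg.eigh(M)         # eigh returns ascending eigenvalues
sigma, U = sigma[::-1], U[:, ::-1]   # reorder so sigma_1 >= ... >= sigma_n
sigma_p = sigma + gamma0             # sigma'_i

# A feasible Lambda: mix two admissible ridge spectra (stays in R by convexity).
lam_a, lam_b = 1.0 / (sigma + 0.5), 1.0 / (sigma + 2.0)
Lam = 0.5 * (lam_a + lam_b)

beta_Lam = U @ (Lam * (U.T @ y))
gammas = np.linspace(gamma0 + 1e-3, 50.0, 20000)
dists = [np.linalg.norm(beta_Lam - U @ ((1.0 / (sigma + g)) * (U.T @ y)))
         for g in gammas]
bound = np.linalg.norm(y) * (1.0 / sigma_p.min() - 1.0 / sigma_p.max())
print(min(dists) <= bound)           # the bound of (2.19) should hold: True
```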
Reformulation of ridge regression.

Let $\mathcal{D} = \{(x_{i1}, \ldots, x_{ip}, y_i)\}_{i=1}^n \subset \mathbb{R}^p \times \mathbb{R}$ be a given data set. The ridge regression estimator $f(x) = x^T \beta$ with $\beta \in \mathbb{R}^p$ minimizes the regularized loss function
(2.20) $\hat\beta = \arg\min_{\beta} \sum_{i=1}^n \ell(y_i - x_i^T \beta) + \gamma \|\beta\|^2,$
where $\ell : \mathbb{R} \to \mathbb{R}_{\geq 0}$ is some loss function.

Proposition 2.5 (KKT conditions). Set $\ell(z) := z^2$ and fix $\gamma > 0$. For $\beta \in \mathbb{R}^p$ to be the unique global minimizer of (2.20), it is necessary and sufficient that $\beta$ satisfies
(2.21) $\mathrm{KKT}(\beta \mid \gamma, \mathcal{D}) : \quad (X^T X + \gamma I_p)\beta = X^T y,$
where $X$ is the $n \times p$ design matrix and $y \in \mathbb{R}^n$ is the response vector formulated from $\mathcal{D}$.

Note that we use the KKT notation in order to hint at the extension to other learning machines, which reduce to solving a similar convex optimization problem but with inequality constraints.

Let $\mathcal{D}^v = \{(x^v_{i1}, \ldots, x^v_{ip}, y^v_i)\}_{i=1}^{n_v} \subset \mathbb{R}^p \times \mathbb{R}$ be a validation data set. The optimization problem of finding the optimal estimator $\hat\beta$ and regularization parameter $\hat\gamma$ with respect to a validation performance criterion can then be written as
(2.22) $(\hat\beta, \hat\gamma) = \arg\min_{\beta,\, \gamma > 0} \sum_{j=1}^{n_v} \ell\big(y^v_j - (x^v_j)^T \beta\big) \quad \text{s.t.} \quad \mathrm{KKT}(\beta \mid \gamma, \mathcal{D}) = S(\gamma, \beta \mid \mathcal{D}).$
That is, we find the least squared error among all $\beta$'s in the solution set $S$, or equivalently all $\beta$'s satisfying the KKT conditions.
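Before turning to the convex program, it may help to see the classical two-step reading of (2.22) spelled out. The sketch below is our own, with an arbitrary grid and toy data: it solves the KKT system (2.21) on the training set for each candidate $\gamma$ and then picks the value with the smallest validation loss. Section 3 replaces this grid search with a single convex program.

```python
import numpy as np

def ridge_fit(X, y, gamma):
    """Training step: solve (X^T X + gamma * I_p) beta = X^T y  (eq. 2.21)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + gamma * np.eye(p), X.T @ y)

def two_step_selection(X, y, Xv, yv, grid):
    """Validation step: evaluate every candidate gamma on the validation set."""
    losses = [np.sum((yv - Xv @ ridge_fit(X, y, g)) ** 2) for g in grid]
    best = int(np.argmin(losses))
    return grid[best], ridge_fit(X, y, grid[best])

# Toy data; the grid of candidate gammas is an arbitrary choice.
rng = np.random.default_rng(2)
X, Xv = rng.normal(size=(40, 5)), rng.normal(size=(20, 5))
beta_true = rng.normal(size=5)
y, yv = X @ beta_true + rng.normal(size=40), Xv @ beta_true + rng.normal(size=20)
gamma_hat, beta_hat = two_step_selection(X, y, Xv, yv, np.logspace(-3, 2, 50))
```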
3. Convex Program

For the convex approach, we simply replace the non-convex solution set $S(\gamma, \beta \mid \mathcal{D})$ with its convex relaxation $R(\Lambda, \beta \mid \mathcal{D})$. Then one obtains the convex optimization problem
(3.1) $(\hat\beta, \hat\Lambda) = \arg\min_{\beta, \Lambda} \sum_{j=1}^{n_v} \ell\big(y^v_j - (x^v_j)^T \beta\big) \quad \text{s.t.} \quad R(\Lambda, \beta \mid \mathcal{D}).$
This has the immediate advantage of simultaneously training and validating (one step); in comparison, the original method requires finding a grid of points $(\hat\beta, \hat\gamma)$ in $S$ and then minimizing among those (two steps). Furthermore, the convex hull is defined by $O(n)$ equality/inequality constraints, so its complexity is not any higher than that of the original problem.

For example, (3.1) can be solved with a QP solver when $\ell(z) = z^2$ as before, or with an LP solver when $\ell(z) = |z|$ (the latter of which may be preferred for sparseness or feature selection).
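The following is a minimal sketch of how (3.1) could be set up in an off-the-shelf solver; it is our own illustration rather than the paper's implementation (which uses an interior point method), and cvxpy, the variable names, the value of $\gamma_0$, and the small positive lower bound standing in for the strict constraint $\lambda_i > 0$ are all our choices. The constraints are those of the polytope (2.9) built from the training data with $M = X^T X$, and the objective is the squared validation loss.

```python
import cvxpy as cp
import numpy as np

def convex_ridge_selection(X, y, Xv, yv, gamma0=1e-3):
    """One-shot training + validation: minimize the validation loss over the
    relaxed solution polytope R(Lambda, beta) of (2.9), cf. (3.1)."""
    p = X.shape[1]
    M, b = X.T @ X, X.T @ y
    sig, U = np.linalg.eigh(M)
    sig, U = sig[::-1], U[:, ::-1]          # descending eigenvalues
    sig_p = sig + gamma0                    # sigma'_i

    beta = cp.Variable(p)
    lam = cp.Variable(p)
    cons = [U.T @ beta == cp.multiply(lam, U.T @ b),  # U_i^T beta = lam_i U_i^T y
            lam >= 1e-8,                              # lam_i > 0 (numerical stand-in)
            lam <= 1.0 / sig_p]                       # lam_i <= 1 / sigma'_i
    for i in range(p - 1):                            # ordering constraints of (2.9)
        cons += [lam[i] <= lam[i + 1],
                 lam[i] >= (sig_p[i + 1] / sig_p[i]) * lam[i + 1]]

    obj = cp.Minimize(cp.sum_squares(yv - Xv @ beta))
    cp.Problem(obj, cons).solve()
    return beta.value, lam.value
```

Since $\lambda_i = 1/(\sigma_i + \gamma)$ on the original solution set, a recovered $\hat\Lambda$ can also be mapped back to an approximate $\hat\gamma$ if a single regularization value is wanted.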
Corollary 3.1. The convex relaxation constitutes the solution path for the modified ridge regression problem
(3.2) $\hat\beta = \arg\min_{\beta} \sum_{i=1}^n \ell(y_i - x_i^T \beta) + \frac{1}{2} \beta^T (U \Gamma U^T) \beta,$
where $\Gamma = \mathrm{diag}(\gamma_1, \ldots, \gamma_p)$ and $\gamma_i = 1/\lambda_i - \sigma_i$ for all $i = 1, \ldots, p$, and the following inequalities hold by translating (2.9):
(3.3) $\begin{cases} \gamma_i > 0, & \text{for all } i = 1, \ldots, p, \\ \frac{\sigma_{i+1}}{\sigma_i}(\sigma_i + \gamma_i) \geq \sigma_{i+1} + \gamma_{i+1} > \sigma_i + \gamma_i, & \text{for all } \sigma_{i+1} \geq \sigma_i,\ i = 1, \ldots, p-1. \end{cases}$

Generalization to $K$-fold cross-validation. The above applies to a single training and validation set, and we now extend it to $K$-fold CV in general. Let $\mathcal{D}(k)$ and $\mathcal{D}^v(k)$ denote the training and validation data, respectively, corresponding to the $k$th fold for $k = 1, \ldots, K$: that is, they satisfy
(3.4) $\bigcup_k \mathcal{D}^v(k) = \mathcal{D}, \qquad \bigcap_k \mathcal{D}(k) = \bigcap_k \mathcal{D}^v(k) = \emptyset.$
Let $n(k) = |\mathcal{D}(k)|$. Then in order to tune the parameter $\gamma$ according to $K$-fold CV, we have the optimization problems
(3.5) $(\hat\beta(k), \hat\gamma) = \arg\min_{\beta(k),\, \gamma > 0} \sum_{k=1}^K \frac{1}{n - n(k)} \sum_{(x_j, y_j) \in \mathcal{D}^v(k)} \ell\big(y_j - x_j^T \beta(k)\big) \quad \text{s.t.} \quad \mathrm{KKT}(\beta(k) \mid \gamma, \mathcal{D}(k)) = S(\gamma, \beta(k) \mid \mathcal{D}(k))$
for all $k = 1, \ldots, K$.
Then we need only relax the KKT conditions independently for each $k$. The convex optimization problem according to $K$-fold CV is
(3.6) $(\hat\beta(k), \hat\Lambda) = \arg\min_{\beta(k),\, \Lambda_k} \sum_{k=1}^K \frac{1}{n - n(k)} \sum_{(x_j, y_j) \in \mathcal{D}^v(k)} \ell\big(y_j - x_j^T \beta(k)\big) \quad \text{s.t.} \quad R(\Lambda, \beta(k) \mid \mathcal{D}(k))$
for all $k = 1, \ldots, K$, each of which is solved as before, and so with $O(Kn)$ constraints. Then just as in typical $K$-fold CV, we take the average of the folds, $\hat\beta_{\mathrm{avg}} = \frac{1}{K} \sum_{k=1}^K \hat\beta(k)$, as the final model.
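As a rough illustration of this workflow (ours, not the paper's code), the sketch below reuses the convex_ridge_selection routine from the previous sketch: it splits the data into folds as in (3.4), runs the single-fold program on each fold, and averages the fold estimates. For simplicity each fold is solved independently here, whereas (3.6) ties the folds together through the shared relaxation.

```python
import numpy as np

def kfold_convex_ridge(X, y, K=10, gamma0=1e-3):
    """Sketch of the K-fold extension: split, solve per fold, average (Section 3)."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(3).permutation(n), K)
    betas = []
    for k in range(K):
        val = folds[k]
        train = np.setdiff1d(np.arange(n), val)
        beta_k, _ = convex_ridge_selection(X[train], y[train], X[val], y[val], gamma0)
        betas.append(beta_k)
    return np.mean(betas, axis=0)        # beta_avg = (1/K) * sum_k beta(k)
```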
4. Performance

We show two simulation studies as a benchmark in order to compare the method to current approaches. The first figure below provides intuition behind the solution paths: the curve $S(\lambda, \beta \mid \gamma_0, M, y)$ and its convex relaxation $R(\Lambda, \beta \mid \gamma_0, M, y)$. In the second figure, we simulate data with the function $y = \mathrm{sinc}(x) + \epsilon$, where the $\epsilon \sim N(0, 1)$ are sampled i.i.d., with $n = 50$ observations and $p = 5$.

The figure compares the performance of the method with one which uses basis functions, another with CV, and another with generalized CV (GCV). CV and GCV are implemented with standard gradient descent, and the convex algorithm outlined here uses the interior point method. This toy example demonstrates that the relaxation does not result in an increase in true error.

We also conduct a Monte Carlo simulation: every iteration constructs a simulated model for a given value of $n$, defined as
(4.1) $f_p(x) = w_1 x + \cdots + w_p x^p$
for random values of $w = (w_1, \ldots, w_p) \in \mathbb{R}^p$, where $p = 10$, $w \sim N(0, C)$, and $C$ is a $p \times p$ covariance matrix with $\|C\| = 1$. A data set of size $n$ is constructed such that $\mathcal{D} = \{(x_{i1}, \ldots, x_{ip}, y_i)\}_{i=1}^n$ and
(4.2) $y_i = f_p(x_i) + \epsilon_i, \quad \epsilon_i \sim N(0, 1), \quad i = 1, \ldots, n.$
We compare three methods for tuning the parameter with respect to the ordinary least squares (OLS) estimate: 10-fold CV with gradient descent (RR+CV), generalized CV with gradient descent (RR+GCV), and the 10-fold CV criterion which applies the convex method as in (3.6) (fRR+CV). We run it for 20,000 iterations. The figure on the left compares the true error as the condition number grows; the figure on the right compares the true error as the number of observations $n$ increases.

As we can see, the method is comparable to both CV and GCV under variable changes in the data set. We expect that OLS worsens drastically over ill-conditioned matrices, and our method compensates for that via the optimal tuning parameter, which is roughly the same as CV's.
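The data-generating process (4.1)-(4.2) used above can be sketched as follows (our own rendering; the construction of $C$, the sampling distribution of $x$, and the use of the spectral norm for $\|C\| = 1$ are placeholder assumptions the paper does not pin down).

```python
import numpy as np

def simulate_dataset(n, p=10, rng=None):
    """Draw one simulated model (4.1) and one data set (4.2)."""
    if rng is None:
        rng = np.random.default_rng()
    A = rng.normal(size=(p, p))
    C = A @ A.T
    C /= np.linalg.norm(C, 2)            # enforce ||C|| = 1 (spectral norm assumed)
    w = rng.multivariate_normal(np.zeros(p), C)
    x = rng.uniform(-1.0, 1.0, size=n)   # sampling distribution of x: placeholder
    X = np.vander(x, N=p + 1, increasing=True)[:, 1:]   # columns x, x^2, ..., x^p
    y = X @ w + rng.normal(size=n)       # y_i = f_p(x_i) + eps_i, eps_i ~ N(0, 1)
    return X, y, w
```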
5. Conclusion

Viewing the model selection problem as one of constrained optimization gives rise to a natural approach to tuning the regularization parameter. According to the simulation results, the convex program provides performance comparable to popular methods. Moreover, global optimality is guaranteed, and efficient convex algorithms can be employed as suited to the modeling problem at hand. This approach would especially outperform general purpose techniques when there is a high number of local optima, or when finding each individual solution for a fixed regularization parameter is more costly than optimizing it simultaneously with validation as we do here.
Further extensions to this framework can be used as a generic convex hull method [12], and it can also be applied to constructing stable kernel machines, feature selection, and other possibilities related to simultaneous training and validation.
References

[1] M. Bertero, T. Poggio, and V. Torre. Ill-posed problems in early vision. Proceedings of the IEEE, 76(8):869-889, 1988.
[2] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[3] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani. Least angle regression. Annals of Statistics, 32(2):407-499, 2004.
[4] G. Golub, M. Heath, and G. Wahba. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21:215-223, 1979.
[5] A. Hoerl and R. Kennard. Ridge regression: biased estimation for nonorthogonal problems. Technometrics, 12(1):55-82, 1970.
[6] V. Ivanov. The Theory of Approximate Methods and Their Application to the Numerical Solution of Singular Integral Equations. Nordhoff International Publishing, 1976.
[7] M. R. Osborne, B. Presnell, and B. A. Turlach. On the LASSO and its dual. Journal of Computational & Graphical Statistics, 2000.
[8] L. Schumaker. Spline Functions: Basic Theory. New York: Wiley, 1981.
[9] J. Suykens and J. Vandewalle. Least squares support vector machine classifiers. Neural Processing Letters, 9:293-300, 1999.
[10] A. Tikhonov and V. Arsenin. Solutions of Ill-posed Problems. Halsted Press, 1977.
[11] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[12] K. Meers, E. Ceulemans, and T. Wilderjans. CHull: A generic convex-hull-based model selection method.