Dynamic Sasvi: Strong Safe Screening for Norm-Regularized Least Squares
Hiroaki Yamada Makoto Yamada
Abstract
A recently introduced technique for sparse optimization problems, called "safe screening," allows us to identify irrelevant variables in the early stages of optimization. In this paper, we first propose a flexible framework for safe screening based on Fenchel-Rockafellar duality and then use it to derive a strong safe screening rule for norm-regularized least squares. We call the proposed screening rule "dynamic Sasvi" because it can be interpreted as a generalization of Sasvi. Unlike the original Sasvi, it does not require the exact solution of a more strongly regularized problem; hence, it works safely in practice. We show, both theoretically and experimentally, that our screening rule can eliminate more features and increase the speed of the solver in comparison with other screening rules.
1. Introduction
Sparse models such as Lasso (Tibshirani, 1996) and group Lasso (Yuan & Lin, 2006) have been widely studied in statistics and machine learning, and are used for various applications such as compressed sensing (Donoho, 2006) and biomarker discovery (Climente-González et al., 2019), to name a few. Although sparse models can be formulated as simple convex optimization problems, the computational cost can be large when the numbers of samples and dimensions are extremely large.

To tackle this problem, a technique called safe screening has been introduced (Ghaoui et al., 2010) for Lasso problems. Specifically, it eliminates variables that are guaranteed to be zero in the Lasso solution before solving the original Lasso optimization problem. Many safe screening methods have been proposed for various problems (Ghaoui et al., 2010; Ogawa et al., 2013; Wang et al., 2015; Liu et al., 2014; Xiang et al., 2017). These are called sequential screening rules because they require the solution to a more strongly regularized problem. A more recent technique, called dynamic screening, eliminates variables through an estimated solution inside an iterative solver (Bonnefoy et al., 2015). In particular, Gap Safe (Fercoq et al., 2015; Ndiaye et al., 2015), a dynamic screening framework, is widely used owing to its generality and efficiency (Ndiaye et al., 2017; Shibagaki et al., 2016; Bao et al., 2020; Raj et al., 2016; Ndiaye et al., 2020). More specifically, Gap Safe efficiently screens variables by using the dual form of the original problem, where the screening is characterized by properly designing a safe region in the dual space. For Lasso, two simple region-based approaches exist: Gap Safe Sphere and Gap Safe Dome (Fercoq et al., 2015).

(Affiliations: Kyoto University; RIKEN AIP. Correspondence to: Hiroaki Yamada <[email protected]>, Makoto Yamada <[email protected]>.)

In this paper, we propose a dynamic safe screening algorithm that is stronger than either Gap Safe Sphere or Gap Safe Dome for the Lasso-like problem, which includes norm-regularized least squares. To this end, we first propose a general screening framework based on Fenchel-Rockafellar duality and then derive dynamic Sasvi, a strong safe screening rule for Lasso-like problems. Our framework can be regarded as a generalization of the Gap Safe framework, and thus Gap Safe Sphere and Gap Safe Dome follow directly from our results. Moreover, thanks to this generalization, we can use a strong problem-adaptive inequality. Interestingly, the derived screening rule for Lasso-like problems can be seen as a dynamic variant of safe screening with variational inequalities (Sasvi) (Liu et al., 2014), a sequential screening rule for Lasso. We therefore call it dynamic Sasvi. Unlike the original Sasvi, dynamic Sasvi does not require an exact solution to the problem with another hyperparameter and hence operates safely in practice. Moreover, we propose dynamic enhanced dual polytope projections (EDPP) (Wang et al., 2015), a relaxation of dynamic Sasvi obtained by introducing a minimum-radius sphere. We show both theoretically and experimentally that the screening power and computational costs of dynamic Sasvi and dynamic EDPP compare favorably with those of other state-of-the-art Gap Safe methods.
Contribution:
The contributions of our paper are summarized as follows.

• We propose a flexible screening framework based on Fenchel-Rockafellar duality, which is a generalization of the Gap Safe framework (Ndiaye et al., 2017).

• We propose two novel dynamic screening rules for norm-regularized least squares, which are a dynamic variant of Sasvi (Liu et al., 2014) and a dynamic variant of EDPP.

• We show that dynamic Sasvi eliminates more features and increases the speed of the solver in comparison to Gap Safe (Fercoq et al., 2015; Ndiaye et al., 2017), both theoretically and experimentally.
2. Preliminary
In this section, we first formulate the problem and then introduce the key techniques used in this study.
Given $h : \mathbb{R}^m \to [-\infty, \infty]$, the domain of $h$ is defined by $\mathrm{dom}(h) := \{z \in \mathbb{R}^m \mid |h(z)| < \infty\}$, and $h^\star : \mathbb{R}^m \to [-\infty, \infty]$, the Fenchel conjugate of $h$, is defined by
$$h^\star(v) := \sup_{z \in \mathbb{R}^m} v^\top z - h(z).$$
If $h$ is proper, the Fenchel-Young inequality
$$h(z) + h^\star(v) \geq v^\top z \quad (1)$$
can be proven directly from the definition of the Fenchel conjugate. The subdifferential of a proper function $h : \mathbb{R}^m \to (-\infty, \infty]$ at $z$ is given as
$$\partial h(z) := \{v \in \mathbb{R}^m \mid \forall w \in \mathbb{R}^m\ v^\top(w - z) + h(z) \leq h(w)\}.$$
The next proposition is important for deriving safe screening algorithms.
Proposition 1
Assume that $h : \mathbb{R}^m \to (-\infty, \infty]$ is a proper lower semicontinuous convex function and $z, v \in \mathbb{R}^m$. We then have
$$v \in \partial h(z) \iff h(z) + h^\star(v) = v^\top z \iff z \in \partial h^\star(v).$$
See (Bauschke et al., 2011), Section 16, for the proof.

For a convex set $C \subset \mathbb{R}^m$, the relative interior of $C$ is defined by
$$\mathrm{relint}(C) := \{v \in C \mid \forall w \in C\ \exists \epsilon > 0\ \mathrm{s.t.}\ v + \epsilon(v - w) \in C\}.$$

In this study, we consider an optimization problem formulated as
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ f(X\beta) + g(\beta), \quad (2)$$
where $\beta \in \mathbb{R}^d$ is the optimization variable, $X \in \mathbb{R}^{n \times d}$ is a constant matrix, and $f : \mathbb{R}^n \to (-\infty, \infty]$ and $g : \mathbb{R}^d \to (-\infty, \infty]$ are proper lower semicontinuous convex functions. We assume $\exists \beta \in \mathrm{relint}(\mathrm{dom}(g))\ \mathrm{s.t.}\ X\beta \in \mathrm{relint}(\mathrm{dom}(f))$ and the existence of an optimal point, i.e., $\exists \hat\beta \in \mathrm{dom}(P)\ \mathrm{s.t.}\ P(\hat\beta) = \inf_{\beta \in \mathbb{R}^d} P(\beta)$, where $P : \mathbb{R}^d \to \mathbb{R}$ is defined as $P(\beta) = f(X\beta) + g(\beta)$. Note that we have not assumed uniqueness of the solution. Moreover, we focus on cases where $g$ induces sparsity; although all theorems in this paper hold regardless, no variables can be eliminated without sparsity.

This class of optimization problems is popular; the most prominent example is Lasso (Tibshirani, 1996):
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ \frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1.$$
Many extensions of Lasso, including group Lasso (Yuan & Lin, 2006), Elastic Net (Zou & Hastie, 2005), and sparse logistic regression (Meier et al., 2008), are in this class. Note that non-convex extensions such as SCAD (Fan & Li, 2001), Bridge (Frank & Friedman, 1993), and MCP (Zhang et al., 2010) do not satisfy this assumption.

Another example of the problem in Eq. (2) is the dual problem of a support vector machine (SVM) (Cortes & Vapnik, 1995), which can be formulated as follows:
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d : 0 \leq \beta \leq 1}\ \frac{1}{2}\|X\beta\|^2 - \mathbf{1}^\top \beta.$$
The dual problem of support vector regression (SVR) (Smola & Schölkopf, 2004) is also a target problem.
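As a small concrete instance of these definitions (a standard computation, not from the paper): for the $\ell_1$ norm, the Fenchel conjugate is the indicator of the $\ell_\infty$ unit ball,
$$g(\beta) = \|\beta\|_1 \implies g^\star(v) = \sup_{\beta} v^\top\beta - \|\beta\|_1 = \begin{cases} 0 & (\|v\|_\infty \leq 1) \\ \infty & (\text{otherwise}), \end{cases}$$
so the dual constraint $g^\star(X^\top\theta) = 0$ amounts to $\|X^\top\theta\|_\infty \leq 1$, which is exactly the dual feasible set that appears for Lasso later in the paper.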
Note that we cannot eliminate any variables of the primal problem of the standard SVM and SVR owing to a lack of sparsity. However, screening methods are available for the primal problems of the feature-sparse variants of SVM and SVR (Ghaoui et al., 2010; Shibagaki et al., 2016).
To derive a safe screening rule for the optimization problem in Eq. (2), the Fenchel-Rockafellar dual formulation plays an important role.
Theorem 2 (Fenchel-Rockafellar Duality) If all assumptions for the optimization problem (2) are satisfied, we have the following:
$$\min_{\beta \in \mathbb{R}^d} f(X\beta) + g(\beta) = \max_{\theta \in \mathbb{R}^n} -f^\star(-\theta) - g^\star(X^\top\theta). \quad (3)$$
The proof of Theorem 2 is given in the Appendix. Let us denote $-f^\star(-\theta) - g^\star(X^\top\theta)$ by $D(\theta)$. For primal/dual solutions, many conditions are known to be equivalent to optimality. Herein, we provide a list of such conditions for convenience.

Proposition 3 (Optimality Conditions) If all assumptions for the optimization problem (2) are satisfied, the following are equivalent:
(a) $\hat\beta \in \mathrm{argmin}_{\beta \in \mathbb{R}^d} P(\beta) \ \wedge\ \hat\theta \in \mathrm{argmax}_{\theta \in \mathbb{R}^n} D(\theta)$
(b) $P(\hat\beta) = D(\hat\theta)$
(c) $f(X\hat\beta) + f^\star(-\hat\theta) = -\hat\theta^\top X\hat\beta = -g(\hat\beta) - g^\star(X^\top\hat\theta)$
(d) $-\hat\theta \in \partial f(X\hat\beta) \ \wedge\ X^\top\hat\theta \in \partial g(\hat\beta)$
(e) $X\hat\beta \in \partial f^\star(-\hat\theta) \ \wedge\ \hat\beta \in \partial g^\star(X^\top\hat\theta)$
(Proof) (a) $\iff$ (b) is directly derived from strong duality. (b) $\iff$ (c) is derived from the Fenchel-Young inequality (1). (c) $\iff$ (d) $\iff$ (e) are derived from Proposition 1. □

In this section, we show that we can eliminate some features by constructing a simple region that contains $\hat\theta$.

Theorem 4
Assume that all assumptions for the optimization problem (2) are satisfied. Let $\hat\beta$ be the primal optimal point, and assume that the dual optimal point $\hat\theta$ is within the region $\mathcal{R}$. Then,
$$\hat\beta \in \bigcup_{\theta \in \mathcal{R}} \partial g^\star(X^\top\theta).$$
(Proof) According to Proposition 3, $\hat\beta \in \partial g^\star(X^\top\hat\theta) \subset \bigcup_{\theta \in \mathcal{R}} \partial g^\star(X^\top\theta)$. □

Theorem 4 provides a general method for feature screening. A simple example is the following corollary.
Corollary 5
Consider the optimization problem in Eq. (2) with $g(\beta) = \|\beta\|_1$. Assume that $\hat\theta \in \mathcal{R}$. We then have
$$\max_{\theta \in \mathcal{R}} |x_i^\top\theta| < 1 \implies \hat\beta_i = 0.$$
(Proof) By the definition of $g$, we have $\partial g^\star(X^\top\theta) \subset \{\beta \mid \beta_i = 0\} \iff |x_i^\top\theta| < 1$. When $\max_{\theta \in \mathcal{R}} |x_i^\top\theta| < 1$, we have $\hat\beta \in \bigcup_{\theta \in \mathcal{R}} \partial g^\star(X^\top\theta) \subset \{\beta \mid \beta_i = 0\}$ by Theorem 4. □

Note that the computational cost of $\bigcup_{\theta \in \mathcal{R}} \partial g^\star(X^\top\theta)$ depends on the simplicity of $g$ and $\mathcal{R}$. The key challenge of screening is to determine a simple, narrow region $\mathcal{R}$. Many regions have been proposed for various problems. In the next section, we provide a general framework for constructing a safe region.
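Corollary 5 becomes directly implementable once $\mathcal{R}$ is (or is enclosed in) a sphere, because the maximum of a linear function over a ball has a closed form. A minimal sketch (illustrative code with our own function name and synthetic data, not from the paper):

```python
import numpy as np

def screen_with_sphere(X, center, radius):
    """Apply Corollary 5 when R is the sphere {theta : ||theta - center|| <= radius}.

    Over such a ball, max_theta |x_i^T theta| = |x_i^T center| + radius * ||x_i||,
    so any feature with |x_i^T center| + radius * ||x_i|| < 1 must have
    beta_i = 0 at every optimum and can be safely eliminated.
    """
    scores = np.abs(X.T @ center) + radius * np.linalg.norm(X, axis=0)
    return scores < 1.0  # True -> feature i is safely eliminated

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 5))
# A tiny ball around the origin: every |x_i^T theta| stays far below 1,
# so all five features are screened out in this toy example.
eliminated = screen_with_sphere(X, center=np.zeros(20), radius=0.01)
```

The same bound reappears later for Gap Safe Sphere and for the spherical part of the dynamic Sasvi region.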
3. General Framework for Constructing a Safe Region
Herein, we propose a general framework for constructing a dual region that contains the solution to the optimization problem in Eq. (3). Our framework consists of a general lower bound and a problem-adaptive upper bound of the optimal value. Hence, we can derive a narrower region than a framework with a general upper bound in certain situations. The general lower bound is given in the next theorem.
Theorem 6
Consider the optimization problem in Eq. (3) and assume that $f^\star$ is $L$-strongly convex ($L \geq 0$). Let $\hat\theta$ be the solution to (3). Then, for $\forall \tilde\theta \in \mathbb{R}^n$, we have
$$l(\hat\theta; \tilde\theta) \leq D(\hat\theta), \quad (4)$$
where $l(\theta; \tilde\theta) = \frac{L}{2}\|\theta - \tilde\theta\|^2 + D(\tilde\theta)$.
(Proof) According to Proposition 3, $X\hat\beta \in \partial f^\star(-\hat\theta)$ and $\hat\beta \in \partial g^\star(X^\top\hat\theta)$ hold. Because $f^\star$ is $L$-strongly convex and $g^\star$ is convex, for $\forall \tilde\theta \in \mathbb{R}^n$, we have
$$f^\star(-\hat\theta) + (X\hat\beta)^\top(-\tilde\theta + \hat\theta) + \frac{L}{2}\|\tilde\theta - \hat\theta\|^2 \leq f^\star(-\tilde\theta),$$
$$g^\star(X^\top\hat\theta) + \hat\beta^\top(X^\top\tilde\theta - X^\top\hat\theta) \leq g^\star(X^\top\tilde\theta).$$
Adding these two inequalities (the cross terms cancel) and negating, we obtain inequality (4). □
This means that $\hat\theta$ is within the region $\{\theta \mid l(\theta; \tilde\theta) \leq D(\theta)\}$. Because this region is too complicated for screening, we use a simple upper bound of $D(\theta)$ to construct a simple safe region. The next theorem can be directly derived from Theorem 6.

Theorem 7
Consider the optimization problem in Eq. (3) and assume that $f^\star$ is $L$-strongly convex ($L \geq 0$). Let $\hat\theta$ be the solution to Eq. (3). Assume $D(\theta)$ is upper bounded by $u(\theta)$, i.e., $\forall \theta \in \mathbb{R}^n\ D(\theta) \leq u(\theta)$. Then, for $\forall \tilde\theta \in \mathbb{R}^n$, we have
$$\hat\theta \in \mathcal{R}(\tilde\theta, u) = \{\theta \mid l(\theta; \tilde\theta) \leq u(\theta)\}.$$
The complexity of $\mathcal{R}(\tilde\theta, u)$ depends on the complexity of $u$. For example, if $u$ is linear, then $\mathcal{R}(\tilde\theta, u)$ is a sphere. We can construct a narrow, simple, and safe region with a tight, simple upper bound $u$. In fact, the Gap Safe Sphere region (Fercoq et al., 2015; Ndiaye et al., 2017) can be derived easily from this theorem and weak duality.

Corollary 8 (Gap Safe Sphere) Consider the optimization problem in Eq. (3) and assume that $f^\star$ is $L$-strongly convex ($L \geq 0$). Let $\hat\theta$ be the solution to Eq. (3). For $\forall \tilde\beta \in \mathbb{R}^d$ and $\forall \tilde\theta \in \mathbb{R}^n$, the region of the Gap Safe Sphere is given as
$$\mathcal{R}_{GS}(\tilde\beta, \tilde\theta) = \{\theta \mid l(\theta; \tilde\theta) \leq P(\tilde\beta)\}. \quad (5)$$
Then, $\hat\theta \in \mathcal{R}_{GS}(\tilde\beta, \tilde\theta)$.
(Proof) By weak duality, we have $\forall \theta\ D(\theta) \leq P(\tilde\beta)$. Using this constant function as an upper bound in Theorem 7, the corollary is derived directly. □

Hence, our framework can be seen as a generalization of Gap Safe. Owing to this generalization, we can use a stronger problem-adaptive upper bound than weak duality. In the next section, we derive specific regions for the
Lasso-Like problem. Some regions for other problems aregiven in the Appendix.
4. Safe region for Lasso-like problem
In this section, we introduce a strong upper bound for the dual problems of Lasso and similar problems. The dome region derived from it can be seen as a generalization of Sasvi (Liu et al., 2014) and is narrower than Gap Safe Sphere and Gap Safe Dome.
Norm-regularized least squares is the optimization problem
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ \frac{1}{2}\|y - X\beta\|^2 + g(\beta),$$
where $g$ is a norm. Clearly, this is a subclass of problem (2). Although this formulation includes Lasso (Tibshirani, 1996), (overlapping) group Lasso (Yuan & Lin, 2006; Jacob et al., 2009), and ordered weighted L1 regression (Figueiredo & Nowak, 2016), the non-negative Lasso is not included. To unify them, we define the Lasso-like problem as follows:
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ \frac{1}{2}\|y - X\beta\|^2 + g(\beta), \quad (6)$$
where the problem satisfies all assumptions for Eq. (2) and $g$ satisfies
$$\forall k \geq 0,\ \beta \in \mathbb{R}^d\quad g(k\beta) = k\,g(\beta). \quad (7)$$
For the Lasso-like problem, the Fenchel conjugates of $f$ and $g$ are given as
$$f^\star(-\theta) = \frac{1}{2}\|\theta\|^2 - y^\top\theta, \quad (8)$$
$$g^\star(X^\top\theta) = \begin{cases} 0 & (\forall \beta\ \theta^\top X\beta - g(\beta) \leq 0) \\ \infty & (\exists \beta\ \theta^\top X\beta - g(\beta) > 0). \end{cases} \quad (9)$$
Note that $\{\theta \mid g^\star(X^\top\theta) = 0\}$ is a closed convex set. Hence, the Lasso-like problem is a class of problems whose Fenchel-Rockafellar dual can be seen as a convex projection.

Thanks to Theorem 6, we can construct a safe region by proposing an upper bound $u(\theta)$. In this section, we propose a tight upper bound for Lasso-like problems. The direct expression of $f^\star$ in Eq. (8) is sufficiently simple; we only need an upper bound of $-g^\star$ to construct a simple region. The upper bound is given as follows:

Lemma 9
For Lasso-like problems (6), for $\forall \tilde\beta \in \mathbb{R}^d$ and $\forall \theta \in \mathbb{R}^n$, we have
$$-g^\star(X^\top\theta) \leq \inf_{k \geq 0}\ g(k\tilde\beta) - \theta^\top X(k\tilde\beta) = \begin{cases} 0 & (g(\tilde\beta) - \theta^\top X\tilde\beta \geq 0) \\ -\infty & (g(\tilde\beta) - \theta^\top X\tilde\beta < 0). \end{cases} \quad (10)$$
(Proof) By the Fenchel-Young inequality (1), we have $-g^\star(X^\top\theta) \leq \inf_{k \geq 0}\ g(k\tilde\beta) - \theta^\top X(k\tilde\beta)$. Under the condition of Eq. (7), we have $g(k\tilde\beta) = k\,g(\tilde\beta)$. Therefore, the optimal value of the upper bound is zero if $g(\tilde\beta) - \theta^\top X\tilde\beta \geq 0$ and $-\infty$ otherwise. □

The next theorem can be directly derived from Lemma 9.
Theorem 10
Consider Lasso-like problems in Eq. (6). Let
$$u_{DS}(\theta; \tilde\beta) := \begin{cases} -f^\star(-\theta) & (g(\tilde\beta) - \theta^\top X\tilde\beta \geq 0) \\ -\infty & (g(\tilde\beta) - \theta^\top X\tilde\beta < 0). \end{cases} \quad (11)$$
Then, for $\forall \tilde\beta \in \mathbb{R}^d$ and $\forall \theta \in \mathbb{R}^n$, $D(\theta) \leq u_{DS}(\theta; \tilde\beta)$. Theorem 7 then provides a simple and safe region.
Theorem 11
Consider the Lasso-like problem in Eq. (6) and its Fenchel-Rockafellar dual problem in Eq. (3). Let $\hat\theta$ be the dual optimal point. We assume that $\tilde\beta \in \mathbb{R}^d$ and $\tilde\theta \in \mathrm{dom}(D)$. Then, $\hat\theta$ is within the Dynamic Sasvi region, which is given as the intersection of a sphere and a half-space:
$$\mathcal{R}_{DS}(\tilde\beta, \tilde\theta) := \{\theta \mid l(\theta; \tilde\theta) \leq u_{DS}(\theta; \tilde\beta)\} = \left\{\theta \,\middle|\, \left\|\theta - \frac{1}{2}(\tilde\theta + y)\right\| \leq \frac{1}{2}\|\tilde\theta - y\| \ \wedge\ 0 \leq g(\tilde\beta) - \theta^\top X\tilde\beta \right\}.$$
The proof of Theorem 11 is given in the Appendix. By continuity, $\mathcal{R}_{DS}(\tilde\beta^{(t)}, \tilde\theta^{(t)})$ converges to $\mathcal{R}_{DS}(\hat\beta, \hat\theta) = \{\hat\theta\}$ if $\lim_{t\to\infty} \tilde\beta^{(t)} = \hat\beta$ and $\lim_{t\to\infty} \tilde\theta^{(t)} = \hat\theta$ hold.

In this section, we show that safe screening with variational inequalities (Sasvi) (Liu et al., 2014) is a special case of our screening rule. First, we review Sasvi. The target task of Sasvi is to minimize $\frac{1}{2}\|y - X\beta\|^2 + \lambda\|\beta\|_1$ for many values of $\lambda$. Dividing by $\lambda^2$ and changing the optimization variable, we obtain the following:
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ \frac{1}{2}\left\|\frac{1}{\lambda}y - X\beta\right\|^2 + \|\beta\|_1.$$
Let $\hat\beta^{(\lambda)}$ and $\hat\theta^{(\lambda)}$ be the optimal points of the primal problem and the Fenchel-Rockafellar dual problem, respectively. Sasvi uses the solution at one value of $\lambda$ to construct a safe region for the solution at another. Although Sasvi was originally proposed for Lasso, it can easily be generalized to the Lasso-like problem as follows.

Theorem 12
Let $\hat\theta^{(\lambda_0)}$ be the optimal point of the Fenchel-Rockafellar dual problem of the Lasso-like problem (that is, $g$ satisfies Eq. (7)):
$$\mathrm{maximize}_{\theta : g^\star(X^\top\theta) = 0}\ -\frac{1}{2}\left\|\theta - \frac{1}{\lambda_0}y\right\|^2 + \frac{1}{2}\left\|\frac{1}{\lambda_0}y\right\|^2.$$
Assume we have the exact $\hat\theta^{(\lambda_0)}$. We then have
$$\hat\theta^{(\lambda_1)} \in \mathcal{R}_{Sasvi}(\lambda_1, \lambda_0) := \left\{\theta \,\middle|\, 0 \geq \left(\frac{1}{\lambda_1}y - \theta\right)^\top\left(\hat\theta^{(\lambda_0)} - \theta\right) \ \wedge\ 0 \geq \left(\frac{1}{\lambda_0}y - \hat\theta^{(\lambda_0)}\right)^\top\left(\theta - \hat\theta^{(\lambda_0)}\right)\right\}.$$
(Proof) Because the dual of the Lasso-like problem can be interpreted as a projection from $\frac{1}{\lambda}y$ onto the closed convex set $\{\theta \mid g^\star(X^\top\theta) = 0\}$, the two variational inequalities hold. See (Liu et al., 2014) for more details. □

We can then prove that $\mathcal{R}_{Sasvi}(\lambda_0) := \mathcal{R}_{Sasvi}(1, \lambda_0)$ equals $\mathcal{R}_{DS}(\hat\beta^{(\lambda_0)}, \hat\theta^{(\lambda_0)})$. Note that we can set $\lambda_1 = 1$ without loss of generality because multiplying $y$, $\lambda_0$, and $\lambda_1$ by the same scalar changes neither the problem nor the region.

Theorem 13
Consider the Lasso-like problem
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ \frac{1}{2}\left\|\frac{1}{\lambda_0}y - X\beta\right\|^2 + g(\beta).$$
Let $\hat\beta^{(\lambda_0)}$ and $\hat\theta^{(\lambda_0)}$ be the primal/dual optimal points, respectively. We then have
$$\mathcal{R}_{Sasvi}(\lambda_0) = \mathcal{R}_{DS}(\hat\beta^{(\lambda_0)}, \hat\theta^{(\lambda_0)}),$$
where $\mathcal{R}_{Sasvi}(\lambda_0)$ and $\mathcal{R}_{DS}(\hat\beta^{(\lambda_0)}, \hat\theta^{(\lambda_0)})$ are safe regions for $\hat\theta^{(1)}$.

The proof of Theorem 13 is given in the Appendix. For this reason, we have named our rule "dynamic Sasvi." This generalization can increase the speed of the solver significantly because the region of our method may become extremely narrow in the late stage of optimization. As pointed out in (Fercoq et al., 2015), some sequential safe screening rules, including Sasvi, are not safe in practice because we do not have the exact solution for $\lambda_0$. Dynamic Sasvi overcomes this problem because its region remains safe even when the reference point is not an exact solution.

Here, we show that the proposed method is stronger than Gap Safe Dome (Fercoq et al., 2015) and Gap Safe Sphere (Fercoq et al., 2015; Ndiaye et al., 2017) for Lasso-like problems. As shown in (Fercoq et al., 2015), for Lasso, the regions of Gap Safe Dome and Gap Safe Sphere are relaxations of the intersection of a sphere and the complement of another sphere. We call this unrelaxed region Gap Safe Moon. Although Gap Safe Moon is defined only for Lasso in (Fercoq et al., 2015), it can be naturally generalized to Lasso-like problems. Gap Safe Moon can be derived from Theorem 7.
Theorem 14 (Gap Safe Moon) Consider the Lasso-like problem in Eq. (6) and its Fenchel-Rockafellar dual in Eq. (3). Let $\hat\theta$ be the dual optimal point. For $\tilde\beta \in \mathbb{R}^d$, the Gap Safe Moon upper bound is given as
$$u_{GM}(\theta; \tilde\beta) = \begin{cases} -f^\star(-\theta) & (-f^\star(-\theta) \leq P(\tilde\beta)) \\ -\infty & (-f^\star(-\theta) > P(\tilde\beta)). \end{cases} \quad (12)$$
Then, for $\forall \tilde\beta \in \mathbb{R}^d$, $\forall \tilde\theta \in \mathrm{dom}(D)$, and $\forall \theta \in \mathbb{R}^n$, we have $D(\theta) \leq u_{GM}(\theta; \tilde\beta)$, and hence
$$\hat\theta \in \{\theta \mid l(\theta; \tilde\theta) \leq u_{GM}(\theta; \tilde\beta)\} = \{\theta \mid -f^\star(-\theta) \leq P(\tilde\beta) \ \wedge\ (\tilde\theta - \theta)^\top(y - \theta) \leq 0\}.$$

Figure 1. Comparisons of various safe regions on a two-dimensional toy Lasso problem (the numerical entries of $X$ and $y$ were lost in extraction). (a) Regions of dynamic Sasvi (dark green) and dynamic EDPP (light green). (b) Regions of dynamic Sasvi (green), Gap Safe Sphere (light red), and Gap Safe Dome (dark red). (c) Regions of dynamic EDPP (green), Gap Safe Sphere (light red), and Gap Safe Dome (dark red). The blue region is the feasible region. $\tilde\beta$ was obtained by one cycle of coordinate descent; $\tilde\theta = \phi(\tilde\beta)$.

The proof of Theorem 14 is given in the Appendix. We can then derive the next theorem.
Theorem 15 (Gap Safe Moon and Dynamic Sasvi) For $\forall \tilde\beta \in \mathbb{R}^d$ and $\forall \theta \in \mathbb{R}^n$, we have
$$u_{DS}(\theta; \tilde\beta) \leq u_{GM}(\theta; \tilde\beta).$$
(Proof) If $g(\tilde\beta) - \theta^\top X\tilde\beta$ is negative, $u_{DS}(\theta; \tilde\beta) = -\infty$, and thus the inequality holds. If $0 \leq g(\tilde\beta) - \theta^\top X\tilde\beta$, by adding the Fenchel-Young inequality (1), we have $-f^\star(-\theta) \leq P(\tilde\beta)$ and $u_{DS}(\theta; \tilde\beta) = u_{GM}(\theta; \tilde\beta) = -f^\star(-\theta)$. □

This theorem means that the region of dynamic Sasvi is a subset of the region of Gap Safe Moon. Because Gap Safe Dome and Gap Safe Sphere are based on relaxations of the Gap Safe Moon region, our screening is always stronger than both. Figure 1b shows the regions of dynamic Sasvi, Gap Safe Dome, and Gap Safe Sphere.
In some situations, even a dome region is too complicated for calculating $\bigcup_{\theta \in \mathcal{R}} \partial g^\star(X^\top\theta)$. In such cases, we propose using the minimum-radius sphere that includes the dynamic Sasvi region. This method can be seen as a dynamic variant of enhanced dual polytope projections (EDPP) (Wang et al., 2015) because EDPP is the minimum-radius sphere relaxation of Sasvi.

Theorem 16
Consider the Lasso-like problem in Eq. (6) and its Fenchel-Rockafellar dual problem in Eq. (3). We assume that $\tilde\beta \in \mathbb{R}^d$ and $\tilde\theta \in \mathrm{dom}(D)$. If $n \geq 2$, the minimum-radius sphere including $\mathcal{R}_{DS}(\tilde\beta, \tilde\theta)$ is
$$\mathcal{R}_{DE}(\tilde\beta, \tilde\theta) = \{\theta \mid \|\theta - \theta_c\| \leq r\}, \quad (13)$$
where
$$\theta_c = \frac{1}{2}(\tilde\theta + y) - \alpha X\tilde\beta, \qquad r = \sqrt{\frac{1}{4}\|\tilde\theta - y\|^2 - \alpha^2\|X\tilde\beta\|^2},$$
$$\alpha = \max\left(0,\ \frac{1}{\|X\tilde\beta\|^2}\left(\frac{1}{2}(\tilde\theta + y)^\top X\tilde\beta - g(\tilde\beta)\right)\right).$$
The proof of Theorem 16 is given in the Appendix. Figures 1a and 1c show the dynamic EDPP region together with the other regions. Note that the dynamic EDPP region is not guaranteed to be within the Gap Safe Sphere region; however, its radius is always smaller than that of Gap Safe Sphere.
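Theorem 16's quantities are cheap to compute. The following sketch (our naming, assuming dense NumPy arrays) mirrors the formulas: the Sasvi sphere center is shifted along $X\tilde\beta$ only when the half-space actually cuts it, and the radius shrinks accordingly.

```python
import numpy as np

def dynamic_edpp_sphere(X, y, beta_t, theta_t, g_beta):
    """Minimum-radius sphere enclosing the dynamic Sasvi region (Theorem 16).

    The Sasvi region is the ball centered at (theta_t + y)/2 with radius
    ||theta_t - y||/2, cut by the half-space theta^T (X beta_t) <= g_beta.
    If the ball center violates the half-space (alpha > 0), the enclosing
    sphere's center moves along X beta_t and its radius shrinks.
    """
    Xb = X @ beta_t
    c0 = 0.5 * (theta_t + y)                 # Sasvi sphere center
    sq = float(np.dot(Xb, Xb))
    alpha = max(0.0, (np.dot(c0, Xb) - g_beta) / sq)
    center = c0 - alpha * Xb
    r2 = 0.25 * np.dot(theta_t - y, theta_t - y) - alpha**2 * sq
    return center, np.sqrt(max(r2, 0.0))
```

With a loose half-space ($\alpha = 0$) the output is exactly the Sasvi sphere; with a cutting half-space the sphere shrinks, which is why dynamic EDPP's radius never exceeds that of the unshifted sphere.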
5. Implementation for Lasso
In this section, we provide a specific solver based on Theorem 11. Because the algorithm used to calculate $\bigcup_{\theta \in \mathcal{R}} \partial g^\star(X^\top\theta)$ depends on $g$, we introduce a Lasso solver as an example. Screening methods cannot estimate the solution on their own, so we must combine them with an iterative solver. Although our methods can work with any iterative method, we use coordinate descent, which is recommended in (Friedman et al., 2007).

As shown in the previous section, $\mathcal{R}_{DS}(\beta_t, \theta_t)$ converges to $\{\hat\theta\}$ when $\lim_{t\to\infty}\beta_t = \hat\beta$ and $\lim_{t\to\infty}\theta_t = \hat\theta$ hold. Because the iterative solver provides such a sequence of primal points and screening does not harm its convergence, we only need a converging sequence of dual points to obtain a converging safe region. The next theorem provides such a sequence.

Theorem 17 (Converging $\theta_t$) Consider the optimization problem Eq. (6) with $g(\beta) = \|\beta\|_1$. Let $\hat\beta \in \mathbb{R}^d$ and $\hat\theta \in \mathbb{R}^n$ be the primal/dual solutions. Assume $\lim_{t\to\infty}\beta_t = \hat\beta$. Let us define $\phi : \mathbb{R}^d \to \mathbb{R}^n$ as
$$\phi(\beta) := \frac{1}{\max(1, \|X^\top(y - X\beta)\|_\infty)}(y - X\beta).$$
Then, $\forall \beta\ \phi(\beta) \in \mathrm{dom}(D)$ and $\lim_{t\to\infty}\phi(\beta_t) = \hat\theta$ hold.
(Proof) $\phi(\beta) \in \mathrm{dom}(D)$ is directly derived from $\|X^\top\phi(\beta)\|_\infty = \min(\|X^\top(y - X\beta)\|_\infty, 1) \leq 1$. Because $\phi$ is continuous and $\phi(\hat\beta) = \hat\theta$, $\lim_{t\to\infty}\phi(\beta_t) = \hat\theta$ also holds. □

Actually, if $\mathcal{A}$ is the set of features not yet eliminated, we can use
$$\phi_{\mathcal{A}}(\beta) := \frac{1}{\max(1, \max_{j \in \mathcal{A}} |x_j^\top(y - X\beta)|)}(y - X\beta)$$
instead of $\phi(\beta)$. Although $\phi_{\mathcal{A}}(\beta) \in \mathrm{dom}(D)$ is not guaranteed, $\phi_{\mathcal{A}}(\beta)$ is guaranteed to satisfy all constraints that are active at the dual solution. In other words, $\phi_{\mathcal{A}}(\beta)$ is in the domain of the dual of the reduced primal problem without the eliminated features.

Now, we can optimize the problem with the proposed screening; the pseudocode is described in Algorithm 1. A direct expression of $\max_{\theta \in \mathcal{R}_{DS}(\tilde\beta, \tilde\theta)} |x_j^\top\theta|$ is given in the Appendix.

Algorithm 1: Coordinate descent with dynamic Sasvi for Lasso

Input: $X$, $y$, $\beta_0$, $T$, $c$, $\epsilon$
Initialize $\tilde\beta \leftarrow \beta_0$, $\mathcal{A} \leftarrow [[d]]$
for $t \in [[T]]$ do
  if $t \bmod c = 1$ then
    Compute $\tilde\theta = \phi_{\mathcal{A}}(\tilde\beta)$
    if $P(\tilde\beta) - D(\tilde\theta) \leq \epsilon$ then break end if
    $\mathcal{R} \leftarrow \mathcal{R}_{DS}(\tilde\beta, \tilde\theta)$
    $\mathcal{A} \leftarrow \{j \in \mathcal{A} : \max_{\theta \in \mathcal{R}} |x_j^\top\theta| \geq 1\}$
    for $j \in [[d]] \setminus \mathcal{A}$ do $\tilde\beta_j \leftarrow 0$ end for
  end if
  for $j \in \mathcal{A}$ do
    $u \leftarrow \tilde\beta_j \|x_j\|^2 - x_j^\top(X\tilde\beta - y)$
    $\tilde\beta_j \leftarrow \frac{1}{\|x_j\|^2}\,\mathrm{sign}(u)\max(0, |u| - 1)$
  end for
end for
Output: $\tilde\beta$
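Algorithm 1 can be sketched in a few lines of NumPy. This is an illustrative reimplementation, not the authors' C++ code; for simplicity it bounds $\max_{\theta \in \mathcal{R}} |x_j^\top\theta|$ over the spherical part of the dynamic Sasvi region only, which is a relaxation of the dome and therefore still safe (the exact dome maximization is deferred to the paper's appendix).

```python
import numpy as np

def lasso_cd_dynamic_sasvi(X, y, n_iter=200, screen_every=10, eps=1e-8):
    """Sketch of Algorithm 1 for min 0.5*||y - X b||^2 + ||b||_1.

    Screening uses only the spherical part of the dynamic Sasvi region
    (center (theta + y)/2, radius ||theta - y||/2), a safe relaxation.
    """
    n, d = X.shape
    beta = np.zeros(d)
    active = np.arange(d)
    col_sq = (X ** 2).sum(axis=0)
    for t in range(n_iter):
        if t % screen_every == 0:
            resid = y - X @ beta
            # dual point theta = phi_A(beta) from Theorem 17
            scale = max(1.0, np.max(np.abs(X[:, active].T @ resid))) if active.size else 1.0
            theta = resid / scale
            gap = (0.5 * resid @ resid + np.abs(beta).sum()) \
                  - (y @ theta - 0.5 * theta @ theta)
            if gap <= eps:
                break
            # spherical part of the dynamic Sasvi region (Theorem 11)
            c = 0.5 * (theta + y)
            r = 0.5 * np.linalg.norm(theta - y)
            keep = np.abs(X[:, active].T @ c) + r * np.sqrt(col_sq[active]) >= 1.0
            beta[active[~keep]] = 0.0       # eliminated features are fixed at zero
            active = active[keep]
        for j in active:                     # cyclic coordinate descent on survivors
            u = beta[j] * col_sq[j] + X[:, j] @ (y - X @ beta)
            beta[j] = np.sign(u) * max(0.0, abs(u) - 1.0) / col_sq[j]
    return beta
```

As in the paper, screening shrinks the active set as the duality gap closes, so late coordinate-descent passes touch only the surviving features.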
In dynamic Sasvi screening, the computational cost is dominated by the calculation of $\phi_{\mathcal{A}}(\tilde\beta)$ and $\max_{\theta \in \mathcal{R}} |x_j^\top\theta|$. If we have $X\tilde\beta$, $X^\top X\tilde\beta$, and $X^\top y$, we can obtain $\phi_{\mathcal{A}}(\tilde\beta)$ with $O(n + d)$ calculations. If we have $X\tilde\beta$, $X^\top X\tilde\beta$, $\tilde\theta$, $X^\top\tilde\theta$, and $X^\top y$, we can obtain $\max_{\theta \in \mathcal{R}} |x_j^\top\theta|$ for all $j$ with $O(n + d)$ calculations. Because $X^\top y$ is constant and $\tilde\theta = \phi_{\mathcal{A}}(\tilde\beta)$ is a linear combination of $X\tilde\beta$ and $y$, only the calculations of $X\tilde\beta$ and $X^\top X\tilde\beta$ cost $O(nd)$. Hence, the screening cost is almost the same for all methods that require $X^\top X\tilde\beta$, including Gap Safe.

In practice, we formulate the Lasso problem as
$$\mathrm{minimize}_{\beta \in \mathbb{R}^d}\ \frac{1}{2}\left\|\frac{1}{\lambda}y - X\beta\right\|^2 + \|\beta\|_1$$
and solve it for many values of $\lambda$ to choose the best solution. Consider the situation in which we must estimate the solutions $\hat\beta^{(\lambda_1)}, \hat\beta^{(\lambda_2)}, \cdots, \hat\beta^{(\lambda_M)}$ corresponding to $\lambda_1 > \lambda_2 > \cdots > \lambda_M$. Many studies (e.g., (Fercoq et al., 2015)) recommend using the estimated solution for $\lambda_{m-1}$ as the initial vector in the estimation of $\hat\beta^{(\lambda_m)}$ because $\hat\beta^{(\lambda_{m-1})}$ and $\hat\beta^{(\lambda_m)}$ may be close. In our implementation, we set the initial vector to $k\tilde\beta^{(\lambda_{m-1})}$, where $\tilde\beta^{(\lambda_{m-1})}$ is the estimate of $\hat\beta^{(\lambda_{m-1})}$ and
$$k := \mathrm{argmin}_{k \geq 0}\ \frac{1}{2}\left\|\frac{1}{\lambda_m}y - kX\tilde\beta^{(\lambda_{m-1})}\right\|^2 + k\|\tilde\beta^{(\lambda_{m-1})}\|_1 = \frac{1}{\|X\tilde\beta^{(\lambda_{m-1})}\|^2}\left(\frac{1}{\lambda_m}y^\top X\tilde\beta^{(\lambda_{m-1})} - \|\tilde\beta^{(\lambda_{m-1})}\|_1\right).$$
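The warm-start scale $k$ above has a one-line closed form (obtained by setting the derivative in $k$ to zero and clipping at the constraint $k \geq 0$). A small sketch with our own function name:

```python
import numpy as np

def warm_start_scale(X, y, beta_prev, lam):
    """Warm-start rescaling along the Lasso path (a sketch of the paper's
    initialization): choose k >= 0 minimizing
    0.5*||y/lam - k*X b||^2 + k*||b||_1.
    Setting the derivative to zero gives
    k = (y^T X b / lam - ||b||_1) / ||X b||^2, clipped at zero.
    """
    Xb = X @ beta_prev
    denom = float(np.dot(Xb, Xb))
    if denom == 0.0:
        return 0.0  # beta_prev = 0: keep the zero initialization
    k = (np.dot(y, Xb) / lam - np.abs(beta_prev).sum()) / denom
    return max(0.0, k)
```

The rescaled point $k\tilde\beta^{(\lambda_{m-1})}$ is at least as good an initializer (in objective value) as $\tilde\beta^{(\lambda_{m-1})}$ itself, since $k = 1$ is in the feasible set of the line search.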
6. Experiments
In this section, we show the efficacy of the proposed methods using real-world data.
We compared the proposed methods with Gap Safe Sphere and Gap Safe Dome (Fercoq et al., 2015; Ndiaye et al., 2017), which are state-of-the-art dynamic safe screening methods. All methods were run on a MacBook Air with a 1.1 GHz quad-core Intel Core i5 CPU and 16 GB of RAM. We implemented all methods in C++ using the Accelerate framework, which is the native framework for basic calculations.
Figure 2. (a) Feature remaining rate at each iteration for Lasso on Leukemia (dense, $n = 72$, $d = 7128$). (b) Average computational time of the Lasso path on subsampled Leukemia (dense, $n = 50$, $d = 7128$). (c) Average computational time of the Lasso path on subsampled 20newsgroup (sparse, $n = 800$, $d = 18571$).

Table 1.
Logarithm of the acceleration ratio for Leukemia and 20newsgroup; smaller values indicate a greater speed-up.
Dataset | $-\log\epsilon$ | Dynamic Sasvi | Dynamic EDPP | Gap Safe Dome | Gap Safe Sphere
[Rows for Leukemia and 20newsgroup at several tolerances; the numeric entries (mean ± standard deviation) were lost in extraction.]

First, we compared the number of screened variables among the four dynamic safe screening methods. We solved the Lasso problem using the Leukemia dataset (dense data with 72 samples and 7128 features) and $\lambda = \|X^\top y\|_\infty$. We used cyclic coordinate descent as the iterative algorithm and performed screening every 10 iterations. Figure 2a shows the ratio of uneliminated features at each iteration. As guaranteed theoretically, dynamic Sasvi eliminates more variables in earlier steps than Gap Safe Dome and Gap Safe Sphere. The figure also shows that dynamic EDPP, the relaxed version of dynamic Sasvi, eliminated almost the same number of features as dynamic Sasvi.

Next, we compared the computation time of the path of Lasso solutions for various values of $\lambda$. Because $\lambda$ is typically chosen by cross-validation in practice, computing the path of solutions is an important task. We used $\lambda_j = 100^{-j}\|X^\top y\|_\infty$ ($j = 0, \ldots$). The iterative solver stops when the duality gap is smaller than $\epsilon(P(0) - D(0))$. Note that $P(0) - D(0)$ makes the stopping criterion independent of the data scale. We used the Leukemia and tf-idf vectorized 20newsgroup (baseball versus hockey) datasets (sparse data with 1197 samples and 18571 features). We subsampled the data 50 times and ran all methods on the same 50 subsamples. The subsampled data size is 50 for Leukemia and 800 for 20newsgroup. Figures 2b and 2c show the average computation time of the Lasso path for the Leukemia and 20newsgroup datasets, respectively.
For all settings, dynamic Sasvi and dynamic EDPP outperform Gap Safe Dome and Gap Safe Sphere. Table 1 shows the averages and standard deviations of the logarithm of the acceleration ratio, relative to the computational time for the same subsample without screening. The proposed methods are significantly faster than the Gap Safe methods. In addition, dynamic EDPP is slightly faster than dynamic Sasvi because the computational cost of dynamic EDPP screening is smaller than that of dynamic Sasvi.
7. Conclusion
In this paper, we proposed a framework for safe screening based on Fenchel-Rockafellar duality and derived dynamic Sasvi and dynamic EDPP, which are specific safe screening methods for Lasso-like problems. Dynamic Sasvi and dynamic EDPP can be regarded as dynamic feature-elimination variants of Sasvi and EDPP, respectively. We proved that dynamic Sasvi always eliminates more features than Gap Safe Sphere and Gap Safe Dome. Dynamic EDPP is based on the sphere relaxation of the dynamic Sasvi region and eliminates almost the same number of features as dynamic Sasvi. We also showed experimentally that the computational costs of the proposed methods are smaller than those of Gap Safe Sphere and Gap Safe Dome.
References
Bao, R., Gu, B., and Huang, H. Fast OSCAR and OWL regression via safe screening rules. In ICML, 2020.

Bauschke, H. H., Combettes, P. L., et al. Convex Analysis and Monotone Operator Theory in Hilbert Spaces, volume 408. Springer, 2011.

Bonnefoy, A., Emiya, V., Ralaivola, L., and Gribonval, R. Dynamic screening: Accelerating first-order algorithms for the lasso and group-lasso. IEEE Transactions on Signal Processing, 63(19):5121–5132, 2015.

Climente-González, H., Azencott, C.-A., Kaski, S., and Yamada, M. Block HSIC Lasso: model-free biomarker detection for ultra-high dimensional data. Bioinformatics, 35(14):i427–i435, 2019.

Cortes, C. and Vapnik, V. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Donoho, D. L. Compressed sensing. IEEE Transactions on Information Theory, 52(4):1289–1306, 2006.

Fan, J. and Li, R. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, 96(456):1348–1360, 2001.

Fercoq, O., Gramfort, A., and Salmon, J. Mind the duality gap: safer rules for the lasso. In ICML, 2015.

Figueiredo, M. and Nowak, R. Ordered weighted l1 regularized regression with strongly correlated covariates: Theoretical aspects. In AISTATS, 2016.

Frank, L. E. and Friedman, J. H. A statistical view of some chemometrics regression tools. Technometrics, 35(2):109–135, 1993.

Friedman, J., Hastie, T., Höfling, H., and Tibshirani, R. Pathwise coordinate optimization. Annals of Applied Statistics, 1(2):302–332, 2007.

Ghaoui, L. E., Viallon, V., and Rabbani, T. Safe feature elimination for the lasso and sparse supervised learning problems. arXiv preprint arXiv:1009.4219, 2010.

Jacob, L., Obozinski, G., and Vert, J.-P. Group lasso with overlap and graph lasso. In ICML, 2009.

Liu, J., Zhao, Z., Wang, J., and Ye, J. Safe screening with variational inequalities and its application to lasso. In ICML, 2014.

Meier, L., Van De Geer, S., and Bühlmann, P. The group lasso for logistic regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):53–71, 2008.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. Gap safe screening rules for sparse multi-task and multi-class models. In NIPS, 2015.

Ndiaye, E., Fercoq, O., Gramfort, A., and Salmon, J. Gap safe screening rules for sparsity enforcing penalties. Journal of Machine Learning Research, 18(1):4671–4703, 2017.

Ndiaye, E., Fercoq, O., and Salmon, J. Screening rules and its complexity for active set identification, 2020.

Ogawa, K., Suzuki, Y., and Takeuchi, I. Safe screening of non-support vectors in pathwise SVM computation. In ICML, 2013.

Raj, A., Olbrich, J., Gärtner, B., Schölkopf, B., and Jaggi, M. Screening rules for convex problems. arXiv preprint arXiv:1609.07478, 2016.

Shibagaki, A., Karasuyama, M., Hatano, K., and Takeuchi, I. Simultaneous safe screening of features and samples in doubly sparse modeling. In ICML, 2016.

Smola, A. J. and Schölkopf, B. A tutorial on support vector regression. Statistics and Computing, 14(3):199–222, 2004.

Tibshirani, R. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Wang, J., Wonka, P., and Ye, J. Lasso screening rules via dual polytope projection. Journal of Machine Learning Research, 16(1):1063–1101, 2015.

Xiang, Z. J., Wang, Y., and Ramadge, P. J. Screening tests for lasso problems. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(5):1008–1027, May 2017.

Yuan, M. and Lin, Y. Model selection and estimation in regression with grouped variables. Journal of the Royal Statistical Society: Series B, 68:49–67, 2006.

Zhang, C.-H. et al. Nearly unbiased variable selection under minimax concave penalty. The Annals of Statistics, 38(2):894–942, 2010.

Zou, H. and Hastie, T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 67(2):301–320, 2005.
A. Proof of Theorems
A.1. Proof of Theorem 2

(Proof) According to Theorem 15.23 of Bauschke et al. (2011), since we have assumed that there exists $\beta \in \operatorname{relint}(\operatorname{dom}(g))$ such that $X\beta \in \operatorname{relint}(\operatorname{dom}(f))$, i.e., $\operatorname{relint}(\operatorname{dom}(f)) \cap X\operatorname{relint}(\operatorname{dom}(g))$ is not empty, we have
$$\inf_{\beta\in\mathbb{R}^d} f(X\beta) + g(\beta) = \max_{\theta\in\mathbb{R}^n} -f^\star(-\theta) - g^\star(X^\top\theta).$$
In addition, we have assumed the existence of the optimal point. Hence, we have
$$\min_{\beta\in\mathbb{R}^d} f(X\beta) + g(\beta) = \max_{\theta\in\mathbb{R}^n} -f^\star(-\theta) - g^\star(X^\top\theta). \qquad\square$$

A.2. Proof of Theorem 11

(Proof)
According to Theorem 7, we can easily obtain $\hat\theta \in \mathcal{R}_{\mathrm{DS}}(\tilde\beta, \tilde\theta) := \{\theta \mid l(\theta;\tilde\theta) \le u_{\mathrm{DS}}(\theta;\tilde\beta)\}$. By the definition of $u_{\mathrm{DS}}$, we have the following:
$$l(\theta;\tilde\theta) \le u_{\mathrm{DS}}(\theta;\tilde\beta) \iff l(\theta;\tilde\theta) \le -f^\star(-\theta) \ \wedge\ 0 \le g(\tilde\beta) - \theta^\top X\tilde\beta.$$
In addition, by Eq. (8), $g^\star(X^\top\tilde\theta) = 0$ and $L = 1$, we have
\begin{align*}
l(\theta;\tilde\theta) \le -f^\star(-\theta)
&\iff \tfrac12\|\theta-\tilde\theta\|^2 - \tfrac12\|\tilde\theta\|^2 + y^\top\tilde\theta \le -\tfrac12\|\theta\|^2 + y^\top\theta \\
&\iff \|\theta\|^2 - \theta^\top(\tilde\theta + y) \le -y^\top\tilde\theta \\
&\iff \|\theta - \tfrac12(\tilde\theta+y)\|^2 \le \tfrac14\|\tilde\theta - y\|^2.
\end{align*}
Hence, we have
$$\hat\theta \in \mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta) = \{\theta \mid \|\theta - \tfrac12(\tilde\theta+y)\|^2 \le \tfrac14\|\tilde\theta-y\|^2 \ \wedge\ 0 \le g(\tilde\beta) - \theta^\top X\tilde\beta\}. \qquad\square$$

A.3. Proof of Theorem 13

(Proof)
First, we prove $g(\hat\beta^{(\lambda)}) = \hat\theta^{(\lambda)\top} X\hat\beta^{(\lambda)}$. Because $X^\top\hat\theta^{(\lambda)} \in \partial g(\hat\beta^{(\lambda)})$ (Proposition 3), for all $\beta$ the inequality $g(\hat\beta^{(\lambda)}) + \hat\theta^{(\lambda)\top} X(\beta - \hat\beta^{(\lambda)}) \le g(\beta)$ holds. We can set $\beta = 0$ and $\beta = 2\hat\beta^{(\lambda)}$ and obtain
$$g(\hat\beta^{(\lambda)}) - \hat\theta^{(\lambda)\top} X\hat\beta^{(\lambda)} \le g(0) = 0, \qquad g(\hat\beta^{(\lambda)}) + \hat\theta^{(\lambda)\top} X\hat\beta^{(\lambda)} \le g(2\hat\beta^{(\lambda)}) = 2g(\hat\beta^{(\lambda)}).$$
Hence, we have $g(\hat\beta^{(\lambda)}) = \hat\theta^{(\lambda)\top} X\hat\beta^{(\lambda)}$. In addition, $\hat\theta^{(\lambda)} = \frac{1}{\lambda}(y - X\hat\beta^{(\lambda)})$ holds (Proposition 3). We then have
\begin{align*}
\mathcal{R}_{\mathrm{DS}}(\hat\beta^{(\lambda)}, \hat\theta^{(\lambda)})
&= \{\theta \mid \|\theta - \tfrac12(\hat\theta^{(\lambda)}+y)\|^2 \le \tfrac14\|\hat\theta^{(\lambda)}-y\|^2 \ \wedge\ 0 \le g(\hat\beta^{(\lambda)}) - \theta^\top X\hat\beta^{(\lambda)}\} \\
&= \{\theta \mid \|\theta\|^2 - \theta^\top(\hat\theta^{(\lambda)}+y) + \tfrac14\|\hat\theta^{(\lambda)}+y\|^2 \le \tfrac14\|\hat\theta^{(\lambda)}-y\|^2 \ \wedge\ 0 \le (\hat\theta^{(\lambda)} - \theta)^\top X\hat\beta^{(\lambda)}\} \\
&= \{\theta \mid \|\theta\|^2 - \theta^\top(\hat\theta^{(\lambda)}+y) + y^\top\hat\theta^{(\lambda)} \le 0 \ \wedge\ 0 \le (\hat\theta^{(\lambda)} - \theta)^\top(\tfrac{1}{\lambda}y - \hat\theta^{(\lambda)})\} \\
&= \{\theta \mid (y-\theta)^\top(\hat\theta^{(\lambda)}-\theta) \le 0 \ \wedge\ (\tfrac{1}{\lambda}y - \hat\theta^{(\lambda)})^\top(\theta - \hat\theta^{(\lambda)}) \le 0\} \\
&= \mathcal{R}_{\mathrm{Sasvi}}(1, \lambda) = \mathcal{R}_{\mathrm{Sasvi}}(\lambda). \qquad\square
\end{align*}

A.4. Proof of Theorem 14

(Proof)
By Eq. (9), we have $D(\theta) \in \{-\infty, -f^\star(-\theta)\}$. Clearly, the inequality $D(\theta) \le -f^\star(-\theta)$ always holds. If $-f^\star(-\theta) > P(\tilde\beta)$, then $D(\theta)$ must be $-\infty$ because $D(\theta) \le P(\tilde\beta)$. Hence, we have $D(\theta) \le u_{\mathrm{GM}}(\theta;\tilde\beta)$. According to Theorem 7, we have
\begin{align*}
\hat\theta &\in \{\theta \mid l(\theta;\tilde\theta) \le u_{\mathrm{GM}}(\theta;\tilde\beta)\} \\
&= \{\theta \mid -f^\star(-\theta) \le P(\tilde\beta) \ \wedge\ \tfrac12\|\theta-\tilde\theta\|^2 - f^\star(-\tilde\theta) - g^\star(X^\top\tilde\theta) \le -f^\star(-\theta)\} \\
&= \{\theta \mid -f^\star(-\theta) \le P(\tilde\beta) \ \wedge\ \tfrac12\|\theta-\tilde\theta\|^2 - \tfrac12\|\tilde\theta\|^2 + y^\top\tilde\theta \le -\tfrac12\|\theta\|^2 + y^\top\theta\} \\
&= \{\theta \mid -f^\star(-\theta) \le P(\tilde\beta) \ \wedge\ \|\theta\|^2 - \theta^\top\tilde\theta + y^\top\tilde\theta \le y^\top\theta\} \\
&= \{\theta \mid -f^\star(-\theta) \le P(\tilde\beta) \ \wedge\ (\tilde\theta - \theta)^\top(y - \theta) \le 0\}. \qquad\square
\end{align*}

A.5. Proof of Theorem 16

Proof of $\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta) \subset \mathcal{R}_{\mathrm{DE}}(\tilde\beta,\tilde\theta)$:

(Proof) Since $\alpha \ge 0$, we have
\begin{align*}
\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)
&= \{\theta \mid \|\theta - \tfrac12(\tilde\theta+y)\|^2 \le \tfrac14\|\tilde\theta-y\|^2 \ \wedge\ 0 \le g(\tilde\beta) - \theta^\top X\tilde\beta\} \\
&\subset \{\theta \mid \|\theta - \tfrac12(\tilde\theta+y)\|^2 \le \tfrac14\|\tilde\theta-y\|^2 + 2\alpha(g(\tilde\beta) - \theta^\top X\tilde\beta)\} \\
&= \{\theta \mid \|\theta\|^2 - \theta^\top(\tilde\theta+y) + 2\alpha\,\theta^\top X\tilde\beta \le \tfrac14\|\tilde\theta-y\|^2 - \tfrac14\|\tilde\theta+y\|^2 + 2\alpha g(\tilde\beta)\} \\
&= \{\theta \mid \|\theta - \tfrac12(\tilde\theta+y) + \alpha X\tilde\beta\|^2 \le \tfrac14\|\tilde\theta-y\|^2 - \tfrac14\|\tilde\theta+y\|^2 + \|\tfrac12(\tilde\theta+y) - \alpha X\tilde\beta\|^2 + 2\alpha g(\tilde\beta)\} \\
&= \{\theta \mid \|\theta - \tfrac12(\tilde\theta+y) + \alpha X\tilde\beta\|^2 \le \tfrac14\|\tilde\theta-y\|^2 + \alpha^2\|X\tilde\beta\|^2 - \alpha(\tilde\theta+y)^\top X\tilde\beta + 2\alpha g(\tilde\beta)\}.
\end{align*}
And by $\alpha \in \{0, \frac{1}{\|X\tilde\beta\|^2}(\tfrac12(\tilde\theta+y)^\top X\tilde\beta - g(\tilde\beta))\}$, we have
$$\alpha^2\|X\tilde\beta\|^2 - \alpha(\tilde\theta+y)^\top X\tilde\beta + 2\alpha g(\tilde\beta) = -\alpha^2\|X\tilde\beta\|^2.$$
Hence,
$$\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta) \subset \{\theta \mid \|\theta - \tfrac12(\tilde\theta+y) + \alpha X\tilde\beta\|^2 \le \tfrac14\|\tilde\theta-y\|^2 - \alpha^2\|X\tilde\beta\|^2\} = \{\theta \mid \|\theta - \theta_c\|^2 \le r^2\} = \mathcal{R}_{\mathrm{DE}}(\tilde\beta,\tilde\theta)$$
holds. $\square$

Proof of minimality of the radius:

(Proof)
Let $v \in \mathbb{R}^n$ be a vector which satisfies $v^\top X\tilde\beta = 0$ and $v^\top v = 1$. Note that such a vector exists if $n \ge 2$. Then, we have $\theta_c \pm rv \in \mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)$ because
$$(\theta_c \pm rv)^\top X\tilde\beta = \tfrac12(\tilde\theta+y)^\top X\tilde\beta - \max\left(0, \tfrac12(\tilde\theta+y)^\top X\tilde\beta - g(\tilde\beta)\right) \le g(\tilde\beta)$$
and
$$\|\theta_c \pm rv - \tfrac12(\tilde\theta+y)\|^2 = \|-\alpha X\tilde\beta \pm rv\|^2 = \alpha^2\|X\tilde\beta\|^2 + r^2 = \tfrac14\|\tilde\theta-y\|^2$$
hold. Since the distance between these two points is $2r$, the radius of a sphere which includes $\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)$ cannot be smaller than $r$. $\square$

B. Direct Expression of $\max_{\theta\in\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)} x_j^\top\theta$

Let $r = \tfrac12\|\tilde\theta - y\|$ and $\theta_o = \tfrac12(\tilde\theta + y)$. If $(\theta_o + \frac{r}{\|x_j\|}x_j)^\top X\tilde\beta \le g(\tilde\beta)$, then
$$\operatorname*{argmax}_{\theta\in\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)} x_j^\top\theta = \theta_o + \frac{r}{\|x_j\|}x_j \quad\text{and}\quad \max_{\theta\in\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)} x_j^\top\theta = x_j^\top\theta_o + r\|x_j\|.$$
If $(\theta_o + \frac{r}{\|x_j\|}x_j)^\top X\tilde\beta > g(\tilde\beta)$, the constraint $\theta^\top X\tilde\beta \le g(\tilde\beta)$ is guaranteed to be active at the solution. Hence, we have
\begin{align*}
\max_{\theta\in\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)} x_j^\top\theta
&= \max_{\|\theta-\theta_o\|\le r \,\wedge\, \theta^\top X\tilde\beta = g(\tilde\beta)} x_j^\top\theta \\
&= x_j^\top\theta_o + \frac{x_j^\top X\tilde\beta}{\|X\tilde\beta\|^2}\left(g(\tilde\beta) - \theta_o^\top X\tilde\beta\right) + \max_{\|\theta'\|\le r \,\wedge\, \theta'^\top X\tilde\beta = g(\tilde\beta)-\theta_o^\top X\tilde\beta} \left(x_j - \frac{x_j^\top X\tilde\beta}{\|X\tilde\beta\|^2}X\tilde\beta\right)^\top\theta' \\
&= x_j^\top\theta_o + \frac{x_j^\top X\tilde\beta}{\|X\tilde\beta\|^2}\left(g(\tilde\beta) - \theta_o^\top X\tilde\beta\right) + \left\|x_j - \frac{x_j^\top X\tilde\beta}{\|X\tilde\beta\|^2}X\tilde\beta\right\| \sqrt{r^2 - \frac{1}{\|X\tilde\beta\|^2}\left(g(\tilde\beta) - \theta_o^\top X\tilde\beta\right)^2}.
\end{align*}
Let $\delta = g(\tilde\beta) - \theta_o^\top X\tilde\beta$.
We then have
$$\max_{\theta\in\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)} x_j^\top\theta = \begin{cases} x_j^\top\theta_o + r\|x_j\| & \left(\frac{r}{\|x_j\|}x_j^\top X\tilde\beta \le \delta\right) \\[4pt] x_j^\top\theta_o + \dfrac{x_j^\top X\tilde\beta}{\|X\tilde\beta\|^2}\delta + \left\|x_j - \dfrac{x_j^\top X\tilde\beta}{\|X\tilde\beta\|^2}X\tilde\beta\right\|\sqrt{r^2 - \dfrac{\delta^2}{\|X\tilde\beta\|^2}} & \left(\frac{r}{\|x_j\|}x_j^\top X\tilde\beta > \delta\right). \end{cases}$$

C. Regions for other problems
According to Theorem 7, we can construct a simple safe region by constructing a simple upper bound of $D(\theta)$. Herein, we introduce some regions for non-Lasso-like problems.

Elastic-Net: Consider the following problem:
$$\underset{\beta\in\mathbb{R}^d}{\text{minimize}}\ \tfrac12\|y - X\beta\|^2 + g(\beta),$$
where $g(\beta) = \|\beta\|_1 + \tfrac{\gamma}{2}\|\beta\|_2^2$ and $\gamma > 0$. Then, for all $\tilde\beta$, we have
\begin{align*}
-g^\star(X^\top\theta) &\le \inf_{k\ge 0}\ g(k\tilde\beta) - \theta^\top X(k\tilde\beta) = \inf_{k\ge 0}\ k\left(\|\tilde\beta\|_1 - \theta^\top X\tilde\beta\right) + \tfrac{\gamma}{2}k^2\|\tilde\beta\|_2^2 \\
&= \begin{cases} 0 & \left(\|\tilde\beta\|_1 - \theta^\top X\tilde\beta \ge 0\right) \\[4pt] -\dfrac{\left(\|\tilde\beta\|_1 - \theta^\top X\tilde\beta\right)^2}{2\gamma\|\tilde\beta\|_2^2} & \left(\|\tilde\beta\|_1 - \theta^\top X\tilde\beta < 0\right). \end{cases}
\end{align*}
Because this is stronger than the Fenchel–Young inequality in Eq. (1), the region derived from it and Eq. (8),
$$\left\{\theta \,\middle|\, l(\theta;\tilde\theta) \le -\tfrac12\|\theta - y\|^2 + \tfrac12\|y\|^2 - \frac{\min\left(0, \|\tilde\beta\|_1 - \theta^\top X\tilde\beta\right)^2}{2\gamma\|\tilde\beta\|_2^2}\right\},$$
is narrower than the region of Gap Safe Sphere. Since this region is a little complex, we propose to use the sphere relaxation.

General regularized least squares: Beyond Elastic-Net, there are many regularizers that do not satisfy Eq. (7), e.g., squared L1 regularization. In addition, the dual problem of the SVM can be seen as a regularized least squares. In those cases, we propose using the upper bound
$$D(\theta) \le -\tfrac12\|\theta - y\|^2 + \tfrac12\|y\|^2 + g(\tilde\beta) - \theta^\top X\tilde\beta.$$
This is based on the Fenchel–Young inequality for $g$ and Eq. (8). Note that the region $\{\theta \mid l(\theta;\tilde\theta) \le -\tfrac12\|\theta - y\|^2 + \tfrac12\|y\|^2 + g(\tilde\beta) - \theta^\top X\tilde\beta\}$ is a sphere, as is the Gap Safe Sphere region.

General norm regularized problems: Here, we extend $f$ to a more general setup, e.g., the logistic loss. Assume that $g$ satisfies Eq. (7). In those cases, we propose using
$$D(\theta) \le f(X\tilde\beta) + \theta^\top X\tilde\beta + \inf_{k\ge 0}\ g(k\tilde\beta) - \theta^\top X(k\tilde\beta) = f(X\tilde\beta) + \theta^\top X\tilde\beta + \begin{cases} 0 & \left(g(\tilde\beta) - \theta^\top X\tilde\beta \ge 0\right) \\ -\infty & \left(g(\tilde\beta) - \theta^\top X\tilde\beta < 0\right). \end{cases}$$
This is based on the Fenchel–Young inequality for $f$.
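As a supplementary sanity check for the direct expression of Appendix B, the sketch below is our own illustrative code (function and variable names are hypothetical). It evaluates $\max_{\theta\in\mathcal{R}_{\mathrm{DS}}(\tilde\beta,\tilde\theta)} x_j^\top\theta$ via the two-case closed form, assuming the region is nonempty.

```python
import numpy as np

def max_over_dynamic_sasvi(x_j, theta_t, y, Xb, g_val):
    """Closed-form max of x_j^T theta over the Dynamic Sasvi region
    R_DS = {||theta - theta_o||^2 <= r^2} ∩ {theta^T Xb <= g_val},
    following Appendix B (illustrative sketch)."""
    theta_o = 0.5 * (theta_t + y)             # sphere center (theta~ + y)/2
    r = 0.5 * np.linalg.norm(theta_t - y)     # sphere radius ||theta~ - y||/2
    delta = g_val - theta_o @ Xb
    nx = np.linalg.norm(x_j)
    if (r / nx) * (x_j @ Xb) <= delta:
        # the sphere maximizer theta_o + (r/||x_j||) x_j already satisfies
        # the half-space constraint
        return x_j @ theta_o + r * nx
    # otherwise the half-space constraint is active at the maximizer
    nb2 = Xb @ Xb
    c = (x_j @ Xb) / nb2                      # component of x_j along Xb
    x_perp = x_j - c * Xb
    return x_j @ theta_o + c * delta + np.linalg.norm(x_perp) * np.sqrt(
        r**2 - delta**2 / nb2)
```

In two dimensions this can be checked against a brute-force maximization over the sphere boundary, which agrees with the closed form in both branches.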