Greedy metrics in orthogonal greedy learning
Lin Xu, Shaobo Lin*, Jinshan Zeng, Zongben Xu
1. Institute for Information and System Sciences, School of Mathematics and Statistics, Xi'an Jiaotong University, Xi'an, 710049, China
2. College of Mathematics and Information Science, Wenzhou University, Wenzhou 325035, China
Abstract
Orthogonal greedy learning (OGL) is a stepwise learning scheme that, in each greedy step, adds a new atom from a dictionary via the steepest gradient descent and builds the estimator by orthogonally projecting the target function onto the space spanned by the selected atoms. Here, "greed" means choosing a new atom according to the steepest gradient descent principle. OGL then avoids overfitting/underfitting by selecting an appropriate number of iterations. In this paper, we point out that overfitting/underfitting can also be avoided by redefining "greed" in OGL. To this end, we introduce a new greedy metric, called δ-greedy thresholds, to refine "greed", and we theoretically verify its feasibility. Furthermore, we reveal that such a greedy metric yields an adaptive termination rule while maintaining the prominent learning performance of OGL. Our results show that the steepest gradient descent is not the unique greedy metric for OGL, and that other, more suitable metrics may lessen the hassle of model selection in OGL.

Keywords: supervised learning, orthogonal greedy learning, greedy metric, thresholding, generalization capability.
1. Introduction
Supervised learning focuses on synthesizing a function (or mapping) that approximates (or represents) an underlying relationship between inputs and outputs based on finitely many input-output samples. A system tackling supervised learning problems is

✩ The research was supported by the National 973 Program (2013CB329404), the Key Program of the National Natural Science Foundation of China (Grant No. 11131006) and the National Natural Science Foundation of China (Grant No. 11401462).
* Corresponding author: [email protected]
Preprint submitted to Elsevier, July 2, 2018

commonly called a learning system (or learning machine). A standard learning system usually comprises a hypothesis space, an optimization strategy, and a learning algorithm. Specifically, the hypothesis space is a family of parameterized functions that encodes the prior knowledge of the data; the optimization strategy is an optimization problem that defines the estimator by utilizing the given samples; and the learning algorithm is an inference procedure that numerically solves the optimization problem.

Dictionary learning is a family of learning systems whose hypothesis spaces are linear combinations of atoms (or elements) of some given dictionaries. Here, a dictionary denotes a family of base learners [32]. For such hypothesis spaces, regularization schemes such as the bridge estimator [1], ridge estimator [18] and Lasso estimator [35] are often employed as the optimization strategies. When the scale of the samples is not too large, these optimization strategies can be realized by various learning algorithms such as regularized least squares algorithms [39], iterative thresholding algorithms [12] and iterative reweighted algorithms [13]. However, a large portion of the aforementioned learning algorithms are time-consuming and may therefore make the corresponding learning systems sluggish [38], particularly when applied to large-scale data sets.

Greedy learning or, more specifically, learning through greedy search or by applying greedy-type algorithms, provides a possibility to circumvent the drawbacks of regularization methods [2]. Greedy-type algorithms are stepwise inference processes that start from a null model and follow the problem-solving heuristic of making the locally optimal choice at each step, with the hope of finding a global optimum. If the number of steps is moderate, then greedy-type algorithms possess a charming computational advantage compared with regularization schemes [32].
This property has triggered avid research activities on greedy-type algorithms in signal processing [11, 20, 36], inverse problems [16, 37], sparse approximation [15, 34] and, particularly, machine learning [2, 7, 21].
The four most important elements of greedy learning are "dictionary-selection", "greedy-metric", "iterative-format" and "stopping-criterion". This is essentially different from greedy approximation, which usually focuses only on the "dictionary-selection" and "iterative-format" issues [32], as greedy learning concerns not only the approximation capability but also the cost, such as the model complexity, that one should pay to achieve a specified approximation accuracy. Therefore, greedy learning can be regarded as a four-issue learning scheme.

• "Dictionary-selection" issue: this issue is devoted to selecting a suitable dictionary for a given learning task. As a classical topic of greedy approximation, a great variety of dictionaries are available to greedy learning. Typical examples include the greedy basis [32], quasi-greedy basis [31], redundant dictionary [14], orthogonal basis [28], kernel-based sample-dependent dictionary [6, 21] and stump dictionary [17].

• "Greedy-metric" issue: this issue regulates the criterion for choosing a new atom (or element) from the dictionary in each greedy step. Besides the widely used steepest gradient descent (SGD) method [14], there are many existing methods, such as weak greed [29], thresholding greed [32] and super greed [23], that quantify the greedy metric for approximation purposes. However, to the best of our knowledge, only the SGD metric has been employed in greedy learning, as all the results in [23, 29, 32] imply that this metric is superior to other metrics in greedy approximation.

• "Iterative-format" issue: this issue focuses on how to define a new estimator based on the selected atoms. Similar to the "dictionary-selection" issue, the "iterative-format" issue is also a classical topic of greedy approximation. There are several existing types of greedy iteration schemes [32]. Among these, the three most commonly used iteration schemes are the pure greedy, orthogonal greedy and relaxed greedy formats.
Each format possesses its own pros and cons [31, 32] and has been widely used in greedy approximation and learning [2, 6, 17, 22, 33]. For instance, compared with the orthogonal greedy strategy, the pure and relaxed greedy strategies have computational benefits but suffer from either a low convergence rate or a small applicable scope.

• "Stopping-criterion" issue: this issue depicts how to terminate the learning process. The "stopping-criterion" is regarded as the main distinction between greedy approximation and greedy learning and has been frequently studied recently [2, 6, 21]. For example, Barron et al. [2] proposed an $\ell^0$-based complexity regularization strategy, and Chen et al. [6] provided an $\ell^2$-based adaptive stopping criterion.

Orthogonal greedy learning (OGL) is a stepwise learning scheme that, at each greedy step, adds a new atom from a dictionary via SGD and then generates an estimator by orthogonally projecting the objective function onto the space spanned by the selected atoms. A common consensus in orthogonal greedy approximation is that better approximation results can be achieved with a larger number of iterations [32]. However, this claim is not applicable to greedy learning, since the estimator is based on samples with observational noise. Therefore, researchers usually adopt a suitable number of iterations in OGL to avoid overfitting/underfitting [2, 6].
Figure 1: Comparisons among four OGLs with different greedy metrics. The levels of greed satisfy OGL1 ≥ OGL2 ≥ OGL3 ≥ OGLR.

Since OGL always searches for the most correlated atom and realizes the optimal approximation capability of the space spanned by the selected atoms in each greedy step, its generalization capability becomes sensitive to the number of iterations. Thus, a slight perturbation of the number of atoms may lead to a great change in the generalization capability, which can be witnessed in Fig.1. Furthermore, the $\ell^0$-based complexity regularization strategy [2] serves mainly the benefit of theoretical analysis, and the applicable range of the $\ell^2$-based adaptive stopping criterion [6] is quite restricted, which makes it difficult to persuade practitioners to utilize OGL. Recalling that a possible reason for this problem is that OGL searches for the new atom according to SGD, an advisable idea is to weaken the level of greed by taking the "greedy-metric" issue into account. For this purpose, we ran a simple simulation (whose experimental setting can be found in Sec.5.2) to judge the feasibility of this idea. The result (Fig.1) shows that the generalization of OGL does not degrade when the level of greed is weakened, provided the greedy metric is specified appropriately.

Unlike the other three issues of greedy learning, the "greedy-metric" issue has, to the best of our knowledge, been studied only a little in both theory and practice. The purpose of the present paper is to reveal the importance and necessity of studying the "greedy-metric" issue in OGL. The main contributions can be summarized as follows.

• We propose a new greedy metric, called "δ-greedy thresholds", to measure the level of greed in OGL. Although this metric has already been used in greedy approximation [32], the novelty of translating it to OGL is that it provides a possibility to further improve the generalization capability of OGL.
We prove that, if the iteration number is appropriately specified, then OGL with the "δ-greedy thresholds" metric can reach the existing almost optimal learning rate of OGL [2].

• Based on the "δ-greedy thresholds", an adaptive termination rule is developed for OGL. Different from the classical stopping criteria, which reach the bias-variance balance by choosing an appropriate number of iterations, our study implies that the balance can also be attained by setting a suitable greedy metric. This phenomenon reveals the essential importance of the "greedy-metric" issue, which often seems to be overlooked in greedy learning. We also present the theoretical justification of such an adaptive termination rule. Our result (Theorem 4.1) shows that the greedy-metric based termination rule performs as well as the iteration-number based termination rule [2], in the sense that the generalization capabilities of the corresponding OGLs are almost identical.

The rest of the paper is organized as follows. In the next section, we give a brief introduction to statistical learning theory and greedy learning. In Section 3, we introduce the "δ-greedy thresholds" metric in OGL and justify its feasibility. In Section 4, based on the "δ-greedy thresholds" metric, we propose an adaptive termination rule and the corresponding δ-TOGL system. The theoretical feasibility of the δ-TOGL system is also given in this section. In Section 5, we present numerical simulations to verify our arguments. In Section 6, we provide the proofs of the main results. In the last section, we draw a brief conclusion.
2. Preliminaries
In this section, we present some preliminaries. A fast review of statistical learning theory and of greedy learning is given in Sec.2.1 and Sec.2.2, respectively.
Suppose that the samples $z = \{(x_i, y_i)\}_{i=1}^m$ are drawn independently and identically from $Z := X \times Y$ according to an unknown probability distribution $\rho$, which admits the decomposition

$$\rho(x, y) = \rho_X(x)\,\rho(y|x).$$

Assume that $f : X \to Y$ characterizes the correspondence between the input and output induced by $\rho$. A natural measure of the error incurred by using $f$ for this purpose is the generalization error, defined by

$$\mathcal{E}(f) := \int_Z (f(x) - y)^2\, d\rho,$$

which is minimized by the regression function [8]

$$f_\rho(x) := \int_Y y\, d\rho(y|x).$$

In general, since $\rho$ is unknown, $f_\rho$ is also unknown. However, we have access to random examples $z$ from $X \times Y$ sampled according to $\rho$. Let $L^2_{\rho_X}$ be the Hilbert space of $\rho_X$ square-integrable functions on $X$, with norm $\|\cdot\|_\rho$. It is known that, for every $f \in L^2_{\rho_X}$, there holds

$$\mathcal{E}(f) - \mathcal{E}(f_\rho) = \|f - f_\rho\|_\rho^2. \quad (2.1)$$

So, the goal of learning is to find a best approximation of the regression function $f_\rho$. Let $\mathcal{H}$ be a hypothesis space and let $f_\mathcal{H} \in \mathcal{H}$ be a best approximation of $f_\rho$, i.e., $f_\mathcal{H} = \arg\min_{g \in \mathcal{H}} \|g - f_\rho\|_\rho$. Whenever there is an estimator $f_z \in \mathcal{H}$ based on the samples $z$ in hand, we have

$$\mathcal{E}(f_z) - \mathcal{E}(f_\rho) = \|f_\rho - f_\mathcal{H}\|_\rho^2 + \mathcal{E}(f_z) - \mathcal{E}(f_\mathcal{H}). \quad (2.2)$$

It is known [10] that a small $\mathcal{H}$ yields a large bias $\|f_\rho - f_\mathcal{H}\|_\rho^2$, while a large $\mathcal{H}$ yields a large variance $\mathcal{E}(f_z) - \mathcal{E}(f_\mathcal{H})$. Thus the bias and variance are conflicting, and an ideal or best hypothesis space $\mathcal{H}^*$ should be the one that best compromises between the bias and the variance. This is the well-known "bias-variance" dilemma in statistical learning theory.

Without loss of generality, we always assume $y \in [-M, M]$ and that the number of samples is finite. Thus, it is reasonable to truncate the estimator to $[-M, M]$. That is, if we define the truncation operator

$$\pi_M u = \begin{cases} u, & \text{if } |u| \le M, \\ M\,\mathrm{sign}(u), & \text{otherwise}, \end{cases}$$

then it is easy to deduce [42] that

$$\|\pi_M f_z - f_\rho\|_\rho \le \|f_z - f_\rho\|_\rho.$$

Let $\mathcal{H}$ be a Hilbert space endowed with norm $\|\cdot\|_\mathcal{H}$ and inner product $\langle\cdot,\cdot\rangle_\mathcal{H}$.
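As a small numerical illustration of why truncation is harmless, the operator $\pi_M$ is simply a clip to $[-M, M]$; the following NumPy sketch checks the inequality above on made-up values (the function name and sample numbers are ours, not from the paper):

```python
import numpy as np

def truncate(u, M):
    """Truncation operator pi_M: clip values to [-M, M]."""
    return np.clip(u, -M, M)

# truncation never increases the distance to a target bounded by M
f_z = np.array([3.0, -0.5, 1.2])      # hypothetical estimator values
f_rho = np.array([1.0, -0.4, 0.9])    # hypothetical target values, |.| <= M
M = 1.5
err_before = np.linalg.norm(f_z - f_rho)
err_after = np.linalg.norm(truncate(f_z, M) - f_rho)
```

Since the target lies in $[-M, M]$, clipping the estimator can only move each value closer to it, so `err_after <= err_before`.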
Let $\mathcal{D} = \{g\}_{g\in\mathcal{D}}$ be a given dictionary satisfying $\|g\|_\mathcal{H} \le 1$. Define $\mathcal{L}^1 = \{f : f = \sum_{g\in\mathcal{D}} a_g g\}$ as a Banach space endowed with the norm

$$\|f\|_{\mathcal{L}^1} := \inf_{\{a_g\}_{g\in\mathcal{D}}} \Big\{ \sum_{g\in\mathcal{D}} |a_g| : f = \sum_{g\in\mathcal{D}} a_g g \Big\}.$$

There exist several types of greedy algorithms [31]. The three most commonly used are the pure greedy (PGA), orthogonal greedy (OGA) and relaxed greedy (RGA) algorithms. In all of these greedy algorithms, we begin by setting $f_0 := 0$. The new approximation $f_k$ ($k \ge 1$) is defined based on the residual $r_{k-1} := f - f_{k-1}$. In OGA, $f_k$ is defined as $f_k = P_{V_k} f$, where $P_{V_k}$ is the orthogonal projection onto $V_k = \mathrm{span}\{g_1, \dots, g_k\}$ and $g_k$ is defined as

$$g_k = \arg\max_{g\in\mathcal{D}} |\langle r_{k-1}, g\rangle_\mathcal{H}|.$$

Given a set of training samples $z = \{(x_i, y_i)\}_{i=1}^m$, the empirical inner product and norm are defined by

$$\langle f, g\rangle_m := \frac{1}{m}\sum_{i=1}^m f(x_i)g(x_i), \qquad \|f\|_m^2 := \frac{1}{m}\sum_{i=1}^m |f(x_i)|^2.$$

The initial setting of OGL is the same as that of OGA. However, OGL must take the following four issues into account:

(I) Dictionary-selection: select a dictionary $\mathcal{D}_n := \{g_1, \dots, g_n\}$ with $\|g_i\|_m \le 1$.

(II) Greedy-definition: $g_k = \arg\max_{g\in\mathcal{D}_n} |\langle r_{k-1}, g\rangle_m|$.

(III) Iteration-format: $f_k^z = P_{V_{z,k}} f$, where $P_{V_{z,k}}$ is the orthogonal projection onto $V_{z,k} = \mathrm{span}\{g_1, \dots, g_k\}$ in the metric of $\|\cdot\|_m$.

(IV) Stopping criterion: terminate the learning process when $k$ satisfies a certain assumption.
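The four-step scheme (I)-(IV) can be sketched in a few lines. The following NumPy fragment is an illustration under our own conventions (the dictionary is supplied as a design matrix of atom values at the sample points, and the orthogonal projection is computed by least squares); it is not the authors' code:

```python
import numpy as np

def ogl(X_design, y, max_iter):
    """Illustrative OGL loop.

    X_design: (m, n) matrix whose columns are the dictionary atoms
              g_1, ..., g_n evaluated at the m sample points.
    Greedy metric (II): pick the atom most empirically correlated with
    the current residual; iteration format (III): orthogonal projection
    onto the selected atoms, computed by ordinary least squares.
    """
    m, _ = X_design.shape
    residual = y.astype(float).copy()
    selected, coef = [], np.array([])
    for _ in range(max_iter):
        corr = np.abs(X_design.T @ residual) / m       # |<r_{k-1}, g>_m|
        corr[selected] = -np.inf                       # never reselect an atom
        selected.append(int(np.argmax(corr)))          # steepest-descent metric
        A = X_design[:, selected]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)   # orthogonal projection
        residual = y - A @ coef
    return selected, coef

# toy check: when y is a multiple of a single atom of an orthonormal
# dictionary, one greedy step recovers it exactly
selected, coef = ogl(np.eye(5), np.array([0., 0., 2., 0., 0.]), max_iter=1)
```

In this toy run the greedy metric picks the third atom immediately and the projection reproduces $y$ exactly, leaving a zero residual.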
3. Greedy-metric in OGL
Given a real functional $V : \mathcal{H} \to \mathbb{R}$, the Fréchet derivative of $V$ at $f$, $V'_f : \mathcal{H} \to \mathbb{R}$, is the linear functional such that, for $g \in \mathcal{H}$,

$$\lim_{\|g\|_\mathcal{H} \to 0} \frac{|V(f+g) - V(f) - V'_f(g)|}{\|g\|_\mathcal{H}} = 0,$$

and the gradient of $V$, as a map $\mathrm{grad}\, V : \mathcal{H} \to \mathcal{H}$, is defined by

$$\langle \mathrm{grad}\, V(f), g\rangle_\mathcal{H} = V'_f(g), \quad \text{for all } g \in \mathcal{H}.$$

The greedy metric adopted in (II) is to find $g_k \in \mathcal{D}_n$ such that

$$\langle -\mathrm{grad}(\mathcal{A}_m)(f_z^{k-1}), g_k\rangle = \sup_{g\in\mathcal{D}_n} \langle -\mathrm{grad}(\mathcal{A}_m)(f_z^{k-1}), g\rangle,$$

where $\mathcal{A}_m(f) = \frac{1}{m}\sum_{i=1}^m (f(x_i) - y_i)^2$. Therefore, the classical greedy metric is based on the steepest gradient descent of $r_{k-1}$ with respect to the dictionary $\mathcal{D}_n$. By normalizing the residuals $r_k$, $k = 0, 1, 2, \dots, n$, (II) is equivalent to searching for $g_k$ satisfying

$$g_k = \arg\max_{g\in\mathcal{D}_n} \frac{|\langle r_{k-1}, g\rangle_m|}{\|r_{k-1}\|_m}.$$

Geometrically, this means searching for a $g_k$ minimizing the angle $\theta_k$ between $r_{k-1}/\|r_{k-1}\|_m$ and $g_k$, as depicted in Fig.2.

Figure 2: The classical greedy metric.

Recalling the definition of OGL, it is not difficult to see that the angles satisfy

$$|\cos\theta_1| \ge |\cos\theta_2| \ge \cdots \ge |\cos\theta_k| \ge |\cos\theta_{k+1}| \ge \cdots \ge |\cos\theta_n|,$$

or

$$\frac{|\langle r_0, g_1\rangle_m|}{\|r_0\|_m} \ge \cdots \ge \frac{|\langle r_{k-1}, g_k\rangle_m|}{\|r_{k-1}\|_m} \ge \cdots \ge \frac{|\langle r_{n-1}, g_n\rangle_m|}{\|r_{n-1}\|_m},$$

since $\frac{|\langle r_{k-1}, g_k\rangle_m|}{\|r_{k-1}\|_m} = |\cos\theta_k|$. If the algorithm stops at the $k$-th iteration, then there is a $\delta \in [|\cos\theta_{k+1}|, |\cos\theta_k|]$ which quantifies whether an atom should be utilized to construct the final estimator. In detail, if $|\cos\theta_k| \ge \delta$, then $g_k$ is regarded as an "active atom" and can be employed to build the estimator; otherwise, $g_k$ is a "dead atom" which should be discarded.

Based on the above observations, we are interested in selecting an arbitrary "active atom" $g_k$ in $\mathcal{D}_n$, that is,

$$\frac{|\langle r_{k-1}, g_k\rangle_m|}{\|r_{k-1}\|_m} \ge \delta. \quad (3.1)$$

If there is no $g_k$ satisfying (3.1), then the algorithm terminates. We call the greedy metric (3.1) the "δ-greedy thresholds" metric.
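Condition (3.1) translates directly into code: compute each atom's normalized empirical correlation with the current residual and keep those above δ. A sketch in NumPy, under the same design-matrix convention as before (the function name is ours):

```python
import numpy as np

def active_atoms(X_design, residual, delta):
    """Indices of 'active atoms' in the sense of (3.1): atoms whose
    normalized empirical correlation with the residual is at least delta.
    Any of them may be used for the next greedy step; an empty result
    means the algorithm terminates."""
    m = X_design.shape[0]
    corr = np.abs(X_design.T @ residual) / m      # |<r_{k-1}, g>_m|
    r_norm = np.sqrt(np.mean(residual ** 2))      # ||r_{k-1}||_m
    return np.flatnonzero(corr >= delta * r_norm)

# on an orthonormal toy dictionary, a moderate delta keeps only the most
# correlated atom, while a larger one deactivates everything
idx_loose = active_atoms(np.eye(4), np.array([2., 0., 0., 0.]), 0.4)
idx_none = active_atoms(np.eye(4), np.array([2., 0., 0., 0.]), 0.6)
```

With δ close to 1 only the most correlated atoms survive (OGL-like greed); with small δ the active set grows and the choice of atom becomes loose.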
In practice, the "active atom" is usually not unique. In this circumstance, we can choose an arbitrary (just one) "active atom" at each greedy iteration. Once the "active atom" is selected, the algorithm moves to the next greedy iteration and the set of "active atoms" is redefined. Through such a greedy metric, we can develop a new orthogonal greedy learning scheme, called thresholding orthogonal greedy learning (TOGL). Instead of (II) and (IV) in OGL, the corresponding parts of TOGL are described as follows.

(II.1) Greedy-definition: let $g_k$ be an arbitrary atom from $\mathcal{D}_n$ satisfying

$$\frac{|\langle r_{k-1}, g_k\rangle_m|}{\|r_{k-1}\|_m} \ge \delta.$$

(IV.1) Stopping criterion: terminate the learning process when either there is no atom satisfying (3.1) or $k$ satisfies a certain assumption.

Before giving the theoretical analysis of TOGL, we highlight the differences between (II), (IV) and (II.1), (IV.1), respectively. Without considering the termination rule, the classical greedy metric (II) satisfies (II.1), since (II) always selects the greediest atom in each greedy iteration. (II.1) slows down the speed of gradient descent and may therefore yield a more flexible model-selection strategy. According to the bias-variance balance principle [10], the bias decreases while the variance increases as a new atom is selected to build the estimator. If a lower-correlation atom is added, then the bias decreases more slowly and the variance also increases more slowly. Hence, the balance can be achieved in TOGL in a more gradual fashion than in OGL. Compared with (IV), (IV.1) provides another termination condition: if all the atoms $g$ in $\mathcal{D}_n$ satisfy

$$\frac{|\langle r_{k-1}, g\rangle_m|}{\|r_{k-1}\|_m} < \delta, \quad (3.2)$$

then the algorithm terminates. Programmers have frequently asked us why the termination requirement concerning $k$ is needed besides (3.2), since their practical experience suggests that the termination condition (3.2) is sufficient.
We emphasize that the terminal condition concerning $k$ is necessary in TOGL, even though numerical simulations usually do not face the worst case. Indeed, using only the stopping condition (3.2) may drive the algorithm to select all atoms from $\mathcal{D}_n$. For example, if the target function $f$ is almost orthogonal to the space spanned by the dictionary and the atoms of the dictionary are almost linearly dependent (see Fig.3), then the selected $\delta$ must be very small, and such a small $\delta$ cannot distinguish the "active atoms". Consequently, the corresponding learning scheme selects all the atoms of the dictionary and therefore degrades the generalization capability of OGL.

Figure 3: Flaw of the single stopping condition.

Now we present a theoretical assessment of TOGL. First, we give a few notations and concepts which will be used throughout the paper. Let $\mathcal{L}^1(\mathcal{D}_n) := \{f : f = \sum_{g\in\mathcal{D}_n} a_g g\}$ be endowed with the norm $\|f\|_{\mathcal{L}^1(\mathcal{D}_n)} := \inf\big\{\sum_{g\in\mathcal{D}_n} |a_g| : f = \sum_{g\in\mathcal{D}_n} a_g g\big\}$. For $r > 0$, the space $\mathcal{L}_r$ is defined to be the set of all functions $f$ such that there exists $h \in \mathrm{span}\{\mathcal{D}_n\}$ with

$$\|h\|_{\mathcal{L}^1(\mathcal{D}_n)} \le B \quad \text{and} \quad \|f - h\| \le B n^{-r}, \quad (3.3)$$

where $\|\cdot\|$ denotes the uniform norm on the continuous function space $C(X)$. The infimum of all such $B$ defines a norm (for $f$) on $\mathcal{L}_r$. It follows from [2] that (3.3) defines an interpolation space and is a natural assumption on the regression function in greedy learning. Indeed, this assumption has already been adopted in [2, 21] to analyze the learning capability of greedy learning. The following Theorem 3.1 illustrates the performance of TOGL and, consequently, reveals the feasibility of the greedy metric (II.1).

Theorem 3.1. Let $0 < t < 1$, $0 < \delta \le 1/2$, and let $f_z^{k,\delta}$ be the estimator deduced by TOGL. If $f_\rho \in \mathcal{L}_r$, then there exists a $k^* \in \mathbb{N}$ such that

$$\mathcal{E}(\pi_M f_z^{k^*,\delta}) - \mathcal{E}(f_\rho) \le C B^2 \left\{ (m\delta^2)^{-1} \log m \log\frac{1}{\delta} \log\frac{2}{t} + \delta^2 + n^{-2r} \right\}$$

holds with probability at least $1 - t$, where $C$ is a positive constant depending only on $d$ and $M$.

If $\delta = O(m^{-1/4})$ and the size of the dictionary, $n$, is selected to be large enough, i.e., $n \ge O(m^{1/(4r)})$, then our result shows that the generalization error bound of $\pi_M f_z^{k^*,\delta}$ is asymptotically $O(m^{-1/2}(\log m)^2)$. Up to a logarithmic factor, this bound is the same as that in [2] and matches the "record" for OGL. This implies that weakening the level of greed of OGL to a certain extent is a feasible way to circumvent the model-selection problem of OGL. It should also be pointed out that, different from OGL [2], there are two parameters, $k$ and $\delta$, in TOGL. Therefore, Theorem 3.1 only presents a theoretical verification that introducing the "δ-greedy thresholds" to measure the level of greed does not essentially degrade the generalization capability of OGL. Taking practical applications into account, eliminating the condition concerning $k$ in (IV.1) is desirable. This is the scope of the following section, where an adaptive stopping criterion with respect to $\delta$ is presented.
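As a quick sanity check on the stated parameter choices (reading the exponents as $\delta \asymp m^{-1/4}$ and $n \asymp m^{1/(4r)}$, which is our reading of the bound), each of the three terms is of order $m^{-1/2}$ up to logarithmic factors:

```latex
\delta \asymp m^{-1/4},\quad n \asymp m^{1/(4r)}
\quad\Longrightarrow\quad
\begin{aligned}
(m\delta^2)^{-1}\log m\,\log\tfrac{1}{\delta}\,\log\tfrac{2}{t}
  &= O\!\left(m^{-1/2}(\log m)^2\right),\\
\delta^2 &= m^{-1/2},\qquad n^{-2r} = m^{-1/2},
\end{aligned}
```

which together give the overall rate $O(m^{-1/2}(\log m)^2)$.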
4. δ-thresholding orthogonal greedy learning

In TOGL, besides the greedy threshold parameter $\delta$, the stopping criterion must also be adjusted appropriately, which may dampen users' enthusiasm for employing it. To circumvent this, in this section we develop an adaptive stopping criterion based on the "δ-greedy thresholds" metric. With this, we obtain a practically user-friendly orthogonal greedy learning system.

It was pointed out in the previous section that the reason for employing the terminal condition concerning $k$ in (IV.1) is to circumvent the extreme case of a full run of TOGL. When the high-impact atoms have all been selected, the relative value of the residual, $\|r_{k-1}\|_m / \|y(\cdot)\|_m$, becomes small, where $y(\cdot)$ is a function satisfying $y(x_i) = y_i$, $i = 1, \dots, m$. Therefore, a preferable terminal condition is to quantify this relative value. Noting that $\delta$ has already been utilized to terminate the algorithm, we append another terminal condition,

$$\|r_{k-1}\|_m \le \delta \|y(\cdot)\|_m, \quad (4.1)$$

to replace the condition concerning $k$ in (IV.1). Based on this, we obtain a novel applicable learning system by using the following (IV.2) to substitute for (IV.1) in TOGL.

(IV.2) Stopping criterion: terminate the learning process if either (4.1) holds or there is no atom satisfying (3.1).
Algorithm 1: δ-TOGL

Step 1 (Initialization): given data $z = \{(x_i, y_i)\}_{i=1}^m$, a dictionary $\mathcal{D}_n$, the greedy threshold $\delta$, and $f_0 = 0$. Let $k := 0$.

Step 2 (δ-greedy thresholds): let $g_k$ be an arbitrary atom from $\mathcal{D}_n$ satisfying $\frac{|\langle r_{k-1}, g_k\rangle_m|}{\|r_{k-1}\|_m} \ge \delta$.

Step 3 (Orthogonal projection iteration): let $V_{z,k} = \mathrm{span}\{g_1, \dots, g_k\}$. Compute the approximation $f_z^{k,\delta} = P_{z,V_{z,k}}(y)$ and the residual $r_k := y - f_z^{k,\delta}$, where $P_{z,V_{z,k}}$ is the orthogonal projection onto the space $V_{z,k}$ in the metric of $\langle\cdot,\cdot\rangle_m$.

Step 4 (Stopping): if $\max_{g\in\mathcal{D}_n} |\langle r_k, g\rangle_m| \le \delta\|r_k\|_m$ or $\|r_k\|_m \le \delta\|y\|_m$, the algorithm terminates; otherwise let $k := k+1$ and return to Step 2.

Output: since the stopping criterion depends only on $\delta$, we can write the final estimator as $f_z^\delta$.

With this setting, we succeed in avoiding the cumbersome parameter $k$ and derive a stopping criterion based only on $\delta$. That is, the main parameter $k$ of OGL [2] is replaced by the greedy threshold $\delta$. Eventually, by utilizing the "δ-greedy thresholds" metric and its corresponding adaptive terminal rule (IV.2), we design a new learning system, called δ-thresholding orthogonal greedy learning (δ-TOGL), as in Algorithm 1. The following Theorem 4.1 shows that, if $\delta$ is appropriately tuned, then the δ-TOGL estimator $f_z^\delta$ can realize the almost optimal generalization capability of OGL and TOGL.

Theorem 4.1.
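Algorithm 1 admits a compact sketch. The following NumPy fragment is an illustration under our assumptions (atoms supplied as columns of a design matrix, least squares used for the empirical orthogonal projection, and the first active atom taken at each step), not the authors' implementation:

```python
import numpy as np

def delta_togl(X_design, y, delta):
    """Illustrative delta-TOGL (Algorithm 1). The single parameter delta
    both thresholds the greedy metric (Step 2) and drives the stopping
    rule (Step 4); no iteration count k has to be tuned."""
    m, _ = X_design.shape
    residual = y.astype(float).copy()
    selected, coef = [], np.array([])
    y_norm = np.sqrt(np.mean(y ** 2))                   # ||y||_m
    while True:
        r_norm = np.sqrt(np.mean(residual ** 2))        # ||r_k||_m
        if r_norm <= delta * y_norm:                    # stopping rule (4.1)
            break
        corr = np.abs(X_design.T @ residual) / m
        corr[selected] = -np.inf                        # never reselect an atom
        active = np.flatnonzero(corr >= delta * r_norm) # delta-greedy thresholds
        if active.size == 0:                            # no active atom: stop
            break
        selected.append(int(active[0]))                 # any active atom works
        A = X_design[:, selected]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)    # orthogonal projection
        residual = y - A @ coef
    return selected, coef

# toy run on a trivial orthonormal dictionary
selected, coef = delta_togl(np.eye(4), np.array([2., 1., 0., 0.]), delta=0.1)
```

Note the dual role of δ: it thresholds the correlation in Step 2 and bounds the relative residual in Step 4, so one parameter governs both greed and termination.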
Let $0 < t < 1$, $0 < \delta \le 1/2$, and let $f_z^\delta$ be defined by Algorithm 1. If $f_\rho \in \mathcal{L}_r$, then the inequality

$$\mathcal{E}(\pi_M f_z^\delta) - \mathcal{E}(f_\rho) \le C B^2 \left\{ (m\delta^2)^{-1} \log m \log\frac{1}{\delta} \log\frac{2}{t} + \delta^2 + n^{-2r} \right\} \quad (4.2)$$

holds with probability at least $1 - t$, where $C$ is a positive constant depending only on $d$ and $M$.

If we choose $n \ge O(m^{1/(4r)})$ and $\delta = O(m^{-1/4})$, then the learning rate of (4.2) is asymptotically $O(m^{-1/2}(\log m)^2)$, the same as that of Theorem 3.1. Therefore, Theorem 4.1 implies that using (4.1) to replace the terminal condition concerning $k$ in (IV.1) is theoretically feasible. From the viewpoint of implementation, the stopping criterion (IV.2) is far more user-friendly than (IV.1), since (IV.2) omits the parameter $k$ of (IV.1) without sacrificing the generalization capability of TOGL.

The main highlight of Theorem 4.1 is that it provides a totally different way to circumvent the overfitting phenomenon of OGL. It is known that the stopping criterion is crucial for OGL, but designing an effective stopping criterion is an awkward problem. Barron et al. [2] suggested selecting the $k$ that minimizes an $\ell^0$-based complexity regularization criterion, which often requires a full run before the best parameter is selected. The stopping criterion proposed by Chen et al. [6] also leads to a long iterative procedure in practice and sometimes does not work. In short, all the aforementioned studies of stopping criteria attempted to design a terminal rule by controlling the number of iterations directly. Since the generalization capability of OGL is sensitive to the number of iterations, these schemes sometimes fail to achieve satisfactory effects. The terminal rule employed in the present paper is instead based on the study of the "greedy-metric" issue of greedy learning. Theorem 4.1 shows that, besides controlling the number of iterations directly, setting a greedy threshold to redefine the greed can also yield an effective stopping criterion that theoretically works as well as the others. Furthermore, compared with $k$ in OGL, the generalization capability of δ-TOGL is stable with respect to $\delta$, since the new metric slows down the changes of the bias and the variance.
5. Numerical Studies
In this section, we present several numerical simulations to reveal the pros and cons of δ-TOGL. We divide the description into seven subsections; except for the first, each subsection addresses one topic concerning δ-TOGL.

Data and dictionary: the samples $z = \{(x_i, y_i)\}_{i=1}^m$ are generated as follows. The $\{x_i\}_{i=1}^m$ are drawn independently and identically according to the uniform distribution on $[-\pi, \pi]$, and the $\{y_i\}_{i=1}^m$ satisfy $y_i = f_\rho(x_i) + \mathcal{N}(0, \sigma^2)$, with $\mathcal{N}(0, \sigma^2)$ being white Gaussian noise and

$$f_\rho(x) = \frac{\sin x}{x}, \quad x \in [-\pi, \pi].$$

To comprehensively reveal the performances of OGL, TOGL and δ-TOGL, we adopt four increasing levels of noise $\sigma$ (the two smallest below 1, together with $\sigma = 1$ and $\sigma = 2$). The learning performances of the different algorithms were then tested by applying the resultant estimators to the test set $z_{test} = \{(x_i^{(t)}, y_i^{(t)})\}_{i=1}^m$, which was generated similarly to $z$ but with the promise that the outputs were always noise-free, i.e., $y_i^{(t)} = f_\rho(x_i^{(t)})$.

In each simulation, we use Gaussian radial basis functions to build up the dictionary

$$\left\{ e^{-\|x - t_i\|^2/\eta} : i = 1, \dots, n \right\},$$

where $\{t_i\}_{i=1}^n$ are chosen as the best packing points in $[-\pi, \pi]$. Since the aim of the simulations is not to pursue the best width of the Gaussian radial basis function but to compare δ-TOGL with other learning schemes on the same dictionary, we always set $\eta = 1$ throughout this section.

Methods: for OGL and δ-TOGL, we apply the QR decomposition to solve the corresponding least squares problems and then obtain the estimators [25]. We use four metrics in (II) and (II.1), respectively, to illustrate different levels of greed. We use the abbreviations OGL1, OGL2, OGL3, TOGL1, TOGL2, TOGL3, and δ-TOGL1, δ-TOGL2, δ-TOGL3 to denote OGL, TOGL and δ-TOGL with (II) and (II.1) replaced by

$$g_k := \arg\max_{g\in\mathcal{D}_n} |\langle r_{k-1}, g\rangle_m|, \quad g_k := \arg\operatorname{second\,max}_{g\in\mathcal{D}_n} |\langle r_{k-1}, g\rangle_m|, \quad \text{and} \quad g_k := \arg\operatorname{third\,max}_{g\in\mathcal{D}_n} |\langle r_{k-1}, g\rangle_m|.$$
Here, $\arg\operatorname{second\,max}_{g\in\mathcal{D}_n}$ and $\arg\operatorname{third\,max}_{g\in\mathcal{D}_n}$ mean selecting the $g_k$ at which the second and third largest values of $|\langle r_{k-1}, g\rangle_m|$ are attained, respectively. Furthermore, we use OGLR, TOGLR and δ-TOGLR to denote OGL, TOGL and δ-TOGL with (II) and (II.1) replaced by $g_k$ randomly selected from $\mathcal{D}_n$, and $g_k$ randomly selected from $\mathcal{D}_\delta$ with $\mathcal{D}_\delta = \{g_j : |\langle g_j, r_{k-1}\rangle_m| \ge \delta\|r_{k-1}\|_m\}$. We also compare our methods with two widely used learning schemes, ridge regression [18] and Lasso [35]. We use the analytic solution of ridge regression [18] and implement the fast iterative soft thresholding algorithm (FISTA) [3] for Lasso to deduce the corresponding estimators.

Aims of the simulations: the aims of the simulations can be summarized in six aspects. In Sec.5.2, we demonstrate that SGD is not the unique metric to define greed in OGL; indeed, our simulations show that OGL2 and OGL3 possess almost the same generalization capabilities as OGL1. In Sec.5.3, we illustrate that "δ-greedy thresholds" is a feasible greedy metric. In Sec.5.4, we provide numerical verification of the good performance of δ-TOGL. In Sec.5.5, we analyze how the parameter δ affects the training time and the sparsity of the estimator. In Sec.5.6, we conduct a phase-transition diagram to illustrate the usability and limitations of δ-TOGL. In Sec.5.7, we compare δ-TOGL with other widely used dictionary-based learning schemes and again show the feasibility of δ-TOGL.

Environment: all numerical studies are implemented in MATLAB R2013a on a Windows personal computer with a Core(TM) i7-3770 3.40GHz CPU and 4.00GB of RAM, and the statistics are averaged over 50 independent trials.

5.2. Greedy metric of OGL

In this part, we illustrate that SGD is not the unique metric for OGL. To this end, we conduct simulations for $f_\rho$ with the aforementioned four levels of noise. We draw $m = 1000$ training samples and $m = 1000$ test samples. The number of centers is set to $n = 300$.
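The data and dictionary described above can be generated as follows. This is a NumPy sketch in which the "best packing points" in $[-\pi, \pi]$ are approximated by equally spaced centers, the seed and the $\sigma = 0.1$ value are our own placeholder choices, and $\eta = 1$ as in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)   # fixed seed for reproducibility (our choice)

def make_data(m, sigma):
    """Samples: x ~ U[-pi, pi], y = sin(x)/x + Gaussian noise N(0, sigma^2)."""
    x = rng.uniform(-np.pi, np.pi, m)
    f_rho = np.sinc(x / np.pi)          # numpy's sinc(t) = sin(pi t)/(pi t)
    y = f_rho + sigma * rng.normal(size=m)
    return x, y

def make_dictionary(x, n, eta=1.0):
    """Gaussian RBF dictionary evaluated at the sample points. The centers
    t_i are equally spaced on [-pi, pi], standing in for the 'best packing
    points' used in the paper."""
    t = np.linspace(-np.pi, np.pi, n)
    return np.exp(-(x[:, None] - t[None, :]) ** 2 / eta)

x, y = make_data(1000, 0.1)   # sigma value here is only a placeholder
D = make_dictionary(x, 300)   # n = 300 centers, as in Sec. 5.2
```

The resulting design matrix `D` has one column per atom and can be fed directly to any of the OGL/TOGL loops sketched earlier.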
Under this setting, we run 5 simulations and plot the average test error, measured by the root mean square error (RMSE), as a function of the number of iterations $k$ for OGL1, OGL2, OGL3 and OGLR. Since the optimal $k$ is small and the test RMSE becomes very large when $k$ is large, we only record the results for small $k$.
Figure 4: The generalization capabilities of OGL with different greedy metricsFig.4 (a)-(d) shows the learning capabilities of OGL for f ρ with different levels ofnoise from δ to δ . It can be found that OGL1, OGL2 and OGL3 possess almost thesame generalization capabilities, since both the smallest test RMSE and the optimal k
17f them are almost the same. This implies that, at least for a certain learning task,SGD is not the unique metric for OGL. Furthermore, it can also be found in Fig.4 thatOGLR performs worse than that of other learning schemes. This phenomenon showsthat introducing a greedy metric is necessary. We also give a quantitive comparison ofthe learning performances of OGL1, OGL2, OGL3, and OGLR in the following Tab.1.Here
T estRM SE
OGL and k ∗ OGL denote the theoretically optimal test RMSEs and k ofOGL with different greedy metrics. Indeed, k ∗ OGL ’s are selected according to the test datadirectly. Table 1: OGL numerical average results for 5 simulations.
M ethods T estRM SE
OGL k ∗ OGL σ = 0 . σ = 0 . M ethods T estRM SE
OGL k ∗ OGL σ = 1OGL1 0.0780 7OGL2 0.0762 7OGL3 0.0757 7OGLR 0.0995 7 σ = 2OGL1 0.1371 5OGL2 0.1374 7OGL3 0.1377 7OGLR 0.1545 6All the above simulations show that greed is necessary but not unique in OGL. Thisstimulates us to launch a study of the “greedy-metric” issue of OGL. δ -greedy thresholds” metric In this part, we verify the feasibility of the “ δ -greedy thresholds” metric proposed inSec.3. The simulation setting of this subsection is the same as that of Sec.5.2. We alsorun 5 times of simulations and describe its test RMSE as functions of the threshold, δ , ofTOGL1, TOGL2, TOGL3 and TOGLR, where we choose the optimal number of iterationsbased on the test set. There are 100 candidates of δ which are equally logarithmicallydrawn from [10 − , / δ lies in [10 − , . δ in [10 − , . −6 −5 −4 −3 T e s t e rr o r( R M SE ) TOGL1TOGL2TOGL3TOGLR (a) −6 −5 −4 −3 T e s t e rr o r( R M SE ) TOGL1TOGL2TOGL3TOGLR (b) −6 −5 −4 −3 T e s t e rr o r( R M SE ) TOGL1TOGL2TOGL3TOGLR (c) −6 −5 −4 −3 T e s t e rr o r( R M SE ) TOGL1TOGL2TOGL3TOGLR (d)
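The "δ-greedy thresholds" rule of Sec.3 can be sketched in the same illustrative style (again with our own Python naming, not the paper's code): any atom whose sample inner product with the residual reaches δ‖r_{k-1}‖_m is admissible, and the iteration terminates when no atom passes the threshold or the residual is already small, in the spirit of the stopping criterion used later in Lemma 6.2.

```python
import numpy as np

def togl_select(X, r, delta):
    """Indices of admissible atoms: |<r, g_j>_m| >= delta * ||r||_m."""
    m = X.shape[0]
    corr = np.abs(X.T @ r) / m                  # sample inner products
    thresh = delta * np.sqrt(np.mean(r ** 2))   # delta * ||r||_m
    return np.flatnonzero(corr >= thresh)

def togl_fit(X, y, delta, k_max=100):
    """delta-thresholding OGL sketch with an adaptive stopping rule."""
    selected, r = [], y.copy()
    coef = np.zeros(0)
    y_norm = np.sqrt(np.mean(y ** 2))
    for _ in range(k_max):
        if np.sqrt(np.mean(r ** 2)) <= delta * y_norm:
            break                               # residual already small
        admissible = [j for j in togl_select(X, r, delta) if j not in selected]
        if not admissible:
            break                               # no atom passes the threshold
        selected.append(int(admissible[0]))     # any admissible atom will do
        A = X[:, selected]
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        r = y - A @ coef
    return selected, coef
```

Selecting the first admissible atom instead of scanning for the maximizer is exactly the acceleration opportunity mentioned in Remark 7.2.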
Figure 5: The feasibility of the "δ-greedy thresholds" metric. (Panels (a)-(d) plot the test RMSE against δ ∈ [10^{-6}, 10^{-3}] for TOGL1, TOGL2, TOGL3 and TOGLR under the four noise levels.)

Fig.5 shows that, different from Fig.4, the learning capability of TOGLR is similar to that of TOGL1, TOGL2 and TOGL3. The main reason is that the new atom (even a randomly selected one) is chosen in a greedy fashion once the "δ-greedy thresholds" metric is imposed in TOGL. This phenomenon implies that once an appropriate δ is preset, how the atom is chosen according to (II.1) is not crucial. Therefore, it numerically verifies Theorem 3.1 and demonstrates that the introduced "δ-greedy thresholds" metric is feasible and appropriate to quantify greed. To facilitate the comparison, we also record the optimal generalization errors in Tab.2.

In Tab.2, the second column (i.e., "δ and k") records the optimal δ values and their corresponding k values (in brackets) derived from TOGL. We highlight that these k are obtained by using the terminal condition (3.2) only. We also use k*_TOGL to denote the theoretically optimal k of TOGL, which is selected based on the test set. It can be found in Tab.2 that for the small noise levels the corresponding k is almost the same as k*_TOGL, which means that the terminal condition (3.2) is sufficient to select the optimal iteration number. However, if the noise is enlarged, that is, σ = 1 or 2, then the terminal condition (3.2) usually fails to find the optimal k and another stopping condition needs to be employed. This explains why we introduce a terminal condition concerning k in (IV.1) and an adaptive terminal condition (4.1) in (IV.2). Compared with Tab.1, we can find from Tab.2 that the optimal test RMSEs (TestRMSE_TOGL and TestRMSE_OGL) are comparable, which illustrates that the "δ-greedy thresholds" metric is feasible. The new greedy metric then provides an alternative way to enrich the model-selection strategy without sacrificing the generalization capability of OGL.

Table 2: TOGL numerical average results for 5 simulations.

  Methods    δ and k                      TestRMSE_TOGL   k*_TOGL
  σ = 1
    TOGL1    [1.00e-6,5.60e-6]([11,13])   0.0877          8
    TOGL2    [1.00e-6,4.30e-6]([11,13])   0.0862          8
    TOGL3    [1.00e-6,6.40e-6]([11,13])   0.0840          8
    TOGLR    7.30e-6(12)                  0.0842          8
  σ = 2
    TOGL1    [1.00e-6,1.18e-4]([8,13])    0.1402          6
    TOGL2    [1.00e-6,1.18e-4]([8,13])    0.1394          6
    TOGL3    [1.00e-6,1.03e-4]([8,13])    0.1398          6
    TOGLR    6.09e-5(10)                  0.1282          5

5.4. The generalization capability of δ-TOGL

In this part, we justify the good performance of δ-TOGL proposed in Sec.4. The detailed experimental setting is the same as that in Sec.5.3. Different from TOGL, δ-TOGL provides an adaptive terminal rule and therefore eliminates the parameter k of TOGL. Similarly to Sec.5.3, we only plot the range of δ in [10^{-6}, 10^{-3}].

Figure 6: The feasibility of δ-TOGL. (Panels (a)-(d) plot the test RMSE against δ for δ-TOGL1, δ-TOGL2, δ-TOGL3 and δ-TOGLR under the four noise levels.)

Fig.6 shows that δ-TOGL maintains the feasibility of the "δ-greedy thresholds" metric after the adaptive termination rule (IV.2) is introduced. Therefore, it numerically verifies Theorem 4.1 and demonstrates that δ-TOGL is feasible. We also show the generalization capability of δ-TOGL in Tab.3.

Table 3: δ-TOGL numerical average results for 5 simulations.

  Methods      δ and k                     TestRMSE_{δ-TOGL}   k*_{δ-TOGL}
  σ = 0.
    δ-TOGL1    [4.30e-6,4.91e-6](11)       0.0255              10.6
    δ-TOGL2    [5.60e-6,6.40e-6](10.4)     0.0254              10.2
    δ-TOGL3    3.76e-6(11)                 0.0255              10.6
    δ-TOGLR    2.75e-5(11)                 0.0268              10.8
  σ = 0.
    δ-TOGL1    [1.18e-4,1.35e-4](7.4)      0.0521              7.4
    δ-TOGL2    [2.01e-4,4.45e-4](7)        0.0511              7
    δ-TOGL3    [1.54e-4,2.29e-4](7.2)      0.0520              7.2
    δ-TOGLR    1.35e-4(8.6)                0.0536              8.6
  σ = 1
    δ-TOGL1    [1.03e-4,1.76e-4](7.2)      0.0747              6.8
    δ-TOGL2    [1.03e-4,1.54e-4](7.2)      0.0752              6.8
    δ-TOGL3    [1.35e-4,1.54e-4](7.2)      0.0733              7
    δ-TOGLR    3.89e-4(7.2)                0.0759              6.4
  σ = 2
    δ-TOGL1    [2.01e-4,2.99e-4](6.2)      0.1529              5.4
    δ-TOGL2    [2.29e-4,3.41e-4](6.2)      0.1516              5.6
    δ-TOGL3    2.29e-4(6.2)                0.1519              4.8
    δ-TOGLR    2.99e-4(7.2)                0.1537              6.2

In Tab.3, the second column (i.e., "δ and k") records the optimal δ and the corresponding k (in brackets) derived from δ-TOGL, and k*_{δ-TOGL} denotes the theoretically optimal k of δ-TOGL. It can be found that for all types of noise, k is almost the same as k*_{δ-TOGL}. This shows that the stopping condition concerning k in (IV.1) can be substituted with the terminal condition (4.1). Therefore, these experimental results demonstrate to some extent that we can avoid overfitting by taking only the "greedy-metric" issue into account. This can be regarded as the main novelty of our paper. Furthermore, noting that the optimal test RMSEs (TestRMSE_{δ-TOGL}) are comparable with TestRMSE_TOGL, we can declare that δ-TOGL performs as well as TOGL, while δ-TOGL successfully omits the parameter k of TOGL.

5.5. The cost of the parameter change in δ-TOGL

From OGL to δ-TOGL, the main parameter changes from k to δ. In the previous subsections, we pointed out that the generalization capability is not degraded by this change. Furthermore, δ-TOGL provides a more user-friendly parametric selection strategy. The purpose of this part is to discuss how the training time and testing time of δ-TOGL vary with δ. Since the testing time depends only on the sparsity of the final estimator, we use the number of iterations in place of the testing time in this simulation. In this simulation, we only take one small level of noise.

Figure 7: The parameter's influence on training and testing costs in OGL and δ-TOGL. (Panels (a)-(b) plot the sparsity and the training time against the iteration number k for OGL1, OGL2, OGL3 and OGLR; panels (c)-(d) plot them against δ ∈ [10^{-6}, 10^{-1}] for the δ-TOGL variants.)

Fig.7 shows that the training and testing costs are not expensive when the parameter δ is tuned over this range, and the training time remains at the sub-second level. All this shows that when the parameter k of OGL is replaced by δ in δ-TOGL, neither the training nor the testing burden increases.

5.6. Usability and limitations of δ-TOGL

In this simulation, we use δ-TOGL1 to learn the sinc function with sampling noise N(0, 0.1). The horizontal axis represents the number of training samples, and the vertical axis represents the associated target accuracies (defined as follows). Every point in the coordinate system therefore denotes a given learning task. If the test RMSE of δ-TOGL, with δ selected by 5-fold cross-validation, is less than the target accuracy, the learning task is counted as successful and labeled 1; otherwise, the task fails and is labeled 0.
We run 100 trials at each point. The color from blue to red encodes the number of successes from 0 to 100. The result is shown in Fig.8.

Figure 8: Usability and limitations of δ-TOGL. (Phase-transition diagram: the horizontal axis is the number of training samples, from 500 to 1000; the vertical axis is the target accuracy, from 0.01 to 0.1; the color encodes the number of successful trials out of 100.)

In Fig.8, the red area indicates that δ-TOGL meets the demand of the learning task and the blue area indicates failure. An intuitive lesson can be drawn immediately from this phase-transition diagram: given a data set and a target accuracy for a specific learning task, the diagram tells approximately how many samples are needed to accomplish the task with a certain probability. The experimental results also show that the generalization error of δ-TOGL behaves stably and decreases monotonically as the sample size grows, which fits our theoretical results in Theorem 4.2.

Table 4: Comparison of δ-TOGL with other classical algorithms.

  Methods       Parameter    TestRMSE (standard error)   Sparsity
  n = 300
    OGL         k = 9        0.0218(0.0034)              9
    δ-TOGL1     –            –                           –
    δ-TOGL2     –            –                           –
    δ-TOGL3     –            –                           –
    δ-TOGLR     –            –                           –
    L2 (RLS)    λ = 5e-5     0.0263(0.0098)              300
    L1 (FISTA)  λ = 5e-6     0.0298(0.0092)              290.4
  n = 1000
    OGL         k = 9        0.0255(0.0045)              9
    δ-TOGL1     –            –                           –
    δ-TOGL2     –            –                           –
    δ-TOGL3     –            –                           –
    δ-TOGLR     –            –                           –
    L2 (RLS)    –            –                           –
    L1 (FISTA)  λ = 7e-6     0.0277(0.0094)              931.8
  n = 2000
    OGL         k = 9        0.0250(0.0054)              9
    δ-TOGL1     –            –                           –
    δ-TOGL2     –            –                           –
    δ-TOGL3     –            –                           –
    δ-TOGLR     –            –                           –
    L2 (RLS)    –            –                           –
    L1 (FISTA)  λ = 7e-6     0.0235(0.0079)              1772

5.7. δ-TOGL is competitive

In this part, we compare δ-TOGL with some classical dictionary-based learning schemes, namely the classical OGL and the ridge and Lasso estimators. The regularization parameters of the ridge and Lasso estimators, the iteration number of OGL, and the threshold δ of δ-TOGL are all selected by 5-fold cross-validation. The regression function is the sinc function with sampling noise given by Gaussian noise with variance 0.1, i.e., N(0, 0.1). The simulation results are reported in Tab.4.

From Tab.4, we can see that, at the same order of magnitude of generalization performance, the number of atoms selected by the greedy-type strategies is far smaller than that of the regularization algorithms. This explains why greedy-type algorithms are more suitable for redundant dictionary learning [2]. Furthermore, it can also be found in Tab.4 that the generalization capabilities of all the aforementioned learning schemes are similar. Finally, our simulation results show that the size of the dictionary does not affect the learning performance of the δ-TOGL schemes very much, provided it meets the minimal requirement for finishing the learning task. All this reveals that δ-TOGL is a competitive learning scheme.
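For reference, the FISTA solver [3] used as the Lasso baseline admits a compact sketch. The following is a minimal textbook-style implementation in Python with our own names (the actual experiments used MATLAB); the step size 1/L is obtained from the spectral norm of the design matrix.

```python
import numpy as np

def soft_threshold(x, tau):
    """Componentwise soft-thresholding operator."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)

def fista_lasso(A, b, lam, n_iter=500):
    """FISTA [3] for min_x 0.5 * ||Ax - b||^2 + lam * ||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    z, t = x.copy(), 1.0
    for _ in range(n_iter):
        grad = A.T @ (A @ z - b)
        x_new = soft_threshold(z - grad / L, lam / L)   # proximal gradient step
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t ** 2)) / 2.0
        z = x_new + ((t - 1.0) / t_new) * (x_new - x)   # momentum extrapolation
        x, t = x_new, t_new
    return x
```

Its solution is typically far denser than the greedy estimators, which is the sparsity gap reported in Tab.4.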
6. Proofs
Since Theorem 3.1 can be regarded as a special case of Theorem 4.1, we only prove Theorem 4.1 in this section. The methodology of the proof is the same as that of [21] and the main tool is borrowed from [33].

In order to give an error decomposition strategy for $\mathcal{E}(f_z^k)-\mathcal{E}(f_\rho)$, we need to construct a function $f_k^*\in\mathrm{span}(\mathcal{D}_n)$ as follows. Since $f_\rho\in\mathcal{L}^r$, there exists $h_\rho:=\sum_{i=1}^n a_i g_i\in\mathrm{Span}(\mathcal{D}_n)$ such that
$$\|h_\rho\|_{\mathcal{L}^1}\le B \quad\text{and}\quad \|f_\rho-h_\rho\|\le B n^{-r}. \eqno(6.1)$$
Define
$$f_0^*=0,\qquad f_k^*=\Big(1-\frac{1}{k}\Big)f_{k-1}^*+\frac{\sum_{i=1}^n|a_i|\,\|g_i\|_\rho}{k}\,g_k^*, \eqno(6.2)$$
where
$$g_k^*:=\arg\max_{g\in\mathcal{D}_n'}\Big\langle h_\rho-\Big(1-\frac{1}{k}\Big)f_{k-1}^*,\,g\Big\rangle_\rho,\qquad \mathcal{D}_n':=\{g_i(x)/\|g_i\|_\rho\}_{i=1}^n\cup\{-g_i(x)/\|g_i\|_\rho\}_{i=1}^n$$
with $g_i\in\mathcal{D}_n$.

Let $f_z^\delta$ and $f_k^*$ be defined as in Algorithm 1 and (6.2), respectively. Then we have
$$\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}(f_\rho)\le \mathcal{E}(f_k^*)-\mathcal{E}(f_\rho)+\mathcal{E}_{\mathbf z}(\pi_M f_z^\delta)-\mathcal{E}_{\mathbf z}(f_k^*)+\mathcal{E}_{\mathbf z}(f_k^*)-\mathcal{E}(f_k^*)+\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}_{\mathbf z}(\pi_M f_z^\delta),$$
where $\mathcal{E}_{\mathbf z}(f)=\frac{1}{m}\sum_{i=1}^m(y_i-f(x_i))^2$.

Upon introducing the shorthand notations
$$\mathcal{D}(k):=\mathcal{E}(f_k^*)-\mathcal{E}(f_\rho),\qquad \mathcal{S}(\mathbf z,k,\delta):=\mathcal{E}_{\mathbf z}(f_k^*)-\mathcal{E}(f_k^*)+\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}_{\mathbf z}(\pi_M f_z^\delta),$$
$$\mathcal{P}(\mathbf z,k,\delta):=\mathcal{E}_{\mathbf z}(\pi_M f_z^\delta)-\mathcal{E}_{\mathbf z}(f_k^*)$$
for the approximation error, the sample error and the hypothesis error, respectively, we have
$$\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}(f_\rho)=\mathcal{D}(k)+\mathcal{S}(\mathbf z,k,\delta)+\mathcal{P}(\mathbf z,k,\delta). \eqno(6.3)$$
At first, we give an upper bound estimate for $\mathcal{D}(k)$, which can be found in Proposition 1 of [21].

Lemma 6.1. Let $f_k^*$ be defined by (6.2). If $f_\rho\in\mathcal{L}^r$, then
$$\mathcal{D}(k)\le B^2\big(k^{-1/2}+n^{-r}\big). \eqno(6.4)$$

To bound the sample and hypothesis errors, we need the following Lemma 6.2.

Lemma 6.2. Let $y(x)$ satisfy $y(x_i)=y_i$, and let $f_z^\delta$ be defined by Algorithm 1. Then there are at most
$$C\delta^{-2}\log\frac{1}{\delta} \eqno(6.5)$$
bases selected to build up the estimator $f_z^\delta$. Furthermore, for any $h\in\mathrm{Span}\{\mathcal{D}_n\}$, we have
$$\|y-f_z^\delta\|_m\le\|y-h\|_m+2\delta\|h\|_{\mathcal{L}^1(\mathcal{D}_n)}. \eqno(6.6)$$

Proof. (6.5) can be found in [33, Theorem 4.1]. Now we turn to prove (6.6). Our stopping criterion guarantees that either
$$\max_{g\in\mathcal{D}_n}|\langle r_k,g\rangle_m|\le\delta\|r_k\|_m \quad\text{or}\quad \|r_k\|_m\le\delta\|y\|_m.$$
In the latter case the required bound follows from
$$\|r_k\|_m\le\delta\|y\|_m\le\delta\big(\|y-h\|_m+\|h\|_m\big)\le\delta\big(\|y-h\|_m+\|h\|_{\mathcal{L}^1(\mathcal{D}_n)}\big).$$
Thus, we assume that $\max_{g\in\mathcal{D}_n}|\langle r_k,g\rangle_m|\le\delta\|r_k\|_m$ holds. By using $\langle y-f_z^k,f_z^k\rangle_m=0$, we have
$$\|r_k\|_m^2=\langle r_k,r_k\rangle_m=\langle r_k,y-h\rangle_m+\langle r_k,h\rangle_m\le\|y-h\|_m\|r_k\|_m+\langle r_k,h\rangle_m$$
$$\le\|y-h\|_m\|r_k\|_m+\|h\|_{\mathcal{L}^1(\mathcal{D}_n)}\max_{g\in\mathcal{D}_n}|\langle r_k,g\rangle_m|\le\|y-h\|_m\|r_k\|_m+\|h\|_{\mathcal{L}^1(\mathcal{D}_n)}\,\delta\|r_k\|_m.$$
Dividing both sides by $\|r_k\|_m$ finishes the proof.

Based on Lemma 6.2 and the fact $\|f_k^*\|_{\mathcal{L}^1(\mathcal{D}_n)}\le B$ [21, Lemma 1], we obtain
$$\mathcal{P}(\mathbf z,k,\delta)=\mathcal{E}_{\mathbf z}(\pi_M f_z^\delta)-\mathcal{E}_{\mathbf z}(f_k^*)\le B^2\delta. \eqno(6.7)$$
Now, we turn to bound the sample error $\mathcal{S}(\mathbf z,k,\delta)$. Upon using the shorthand notations
$$\mathcal{S}_1(\mathbf z,k):=\{\mathcal{E}_{\mathbf z}(f_k^*)-\mathcal{E}_{\mathbf z}(f_\rho)\}-\{\mathcal{E}(f_k^*)-\mathcal{E}(f_\rho)\}$$
and
$$\mathcal{S}_2(\mathbf z,\delta):=\{\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}(f_\rho)\}-\{\mathcal{E}_{\mathbf z}(\pi_M f_z^\delta)-\mathcal{E}_{\mathbf z}(f_\rho)\},$$
we write
$$\mathcal{S}(\mathbf z,k,\delta)=\mathcal{S}_1(\mathbf z,k)+\mathcal{S}_2(\mathbf z,\delta). \eqno(6.8)$$
It can be found in Proposition 2 of [21] that for any $0<t<1$, with confidence at least $1-t/2$,
$$\mathcal{S}_1(\mathbf z,k)\le\frac{(M+B)^2\log\frac{2}{t}}{3m}+\frac{1}{2}\mathcal{D}(k). \eqno(6.9)$$
Using [41, Eqs. (A.10)] with $k$ replaced by $C\delta^{-2}\log\frac{1}{\delta}$, we find that
$$\mathcal{S}_2(\mathbf z,\delta)\le\frac{1}{2}\big(\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}(f_\rho)\big)+\frac{C\delta^{-2}\log\frac{1}{\delta}\,\log m\,\log\frac{2}{t}}{m} \eqno(6.10)$$
holds with confidence at least $1-t/2$. Therefore, (6.3), (6.4), (6.7), (6.9), (6.10) and (6.8) yield that
$$\mathcal{E}(\pi_M f_z^\delta)-\mathcal{E}(f_\rho)\le C B^2\Big(\frac{\log m\,\log\frac{1}{\delta}\,\log\frac{2}{t}}{m\delta^2}+\delta+n^{-r}\Big)$$
holds with confidence at least $1-t$. This finishes the proof of Theorem 4.1.

7. Concluding Remarks

The main contributions of the present paper can be summarized in four aspects. Firstly, we point out that the steepest gradient descent (SGD) is not the unique choice for selecting a new atom from the dictionary in orthogonal greedy learning (OGL), which disrupts habitual thinking and makes way for the search for new greedy metrics for OGL. To the best of our knowledge, this is the first work on the "greedy-metric" issue for greedy learning. Secondly, we succeed in finding an appropriate greedy metric for OGL and verify its rationality and feasibility both theoretically and numerically. Motivated by a series of works by Temlyakov and his co-authors [23], [29, 31, 32, 33], we propose δ-greedy thresholds to measure the level of greed in orthogonal greedy learning. Our theoretical result shows that orthogonal greedy learning with such a greedy metric yields a learning rate that matches, up to logarithmic factors, that of the classical SGD-based OGL [2]. Thirdly, based on the selected greedy metric, we derive an adaptive terminal rule for the corresponding OGL and thus provide a complete learning system called δ-thresholding orthogonal greedy learning (δ-TOGL). Lastly, we study the learning performance of δ-TOGL in terms of both theoretical analysis and numerical verification. Our study implies that δ-TOGL is as competitive a learning scheme as widely used strategies such as classical orthogonal greedy learning, the ridge estimator and the Lasso estimator. The main results show that, when applied to supervised learning problems, δ-TOGL outperforms dictionary-based regularization learning schemes such as Lasso and ridge regression in the sense that it produces extremely high sparseness of the final estimator.
It also outperforms the classical orthogonal greedy learning in the sense that it provides a more user-friendly parametric selection strategy.

To stimulate more opinions from others on the "greedy-metric" issue of greedy learning, we present the following two remarks.

Remark 7.1.
In this paper, we give one type of greedy metric for OGL. In greedy approximation, Temlyakov [32] has proposed various greedy metrics, such as the super greedy algorithm and the weak greedy algorithm. Since greedy learning concerns not only the approximation capability but also the capacity of the space spanned by the selected atoms, we conjecture that all these metrics can be adopted in greedy learning and may possess performances similar to the classical steepest gradient descent metric. We will keep working on this issue and report our progress in a future publication.

Remark 7.2. Practitioners frequently ask us what the essential advantage of δ-TOGL is. This is a good question and we find it somewhat difficult to answer. Admittedly, in this paper we do not establish any essential advantage of δ-TOGL. The purpose of this paper is only to propose the concept of a "greedy metric" and to show that we can use the greedy metric to reach the "bias" and "variance" trade-off. However, in our opinion, there are at least two advantages of δ-TOGL. The first is that, compared with OGL, its generalization capability is not so sensitive to the parameter. This advantage has already been shown in Fig.4 and Fig.5. The second is that δ-TOGL can be viewed as an accelerated version of OGL. As shown in Step 2 of Algorithm 1, we can select the first atom that satisfies the greedy metric. Under this circumstance, there is no need to compute ⟨r_{k-1}, g⟩_m for all g ∈ D_n. Once the size of the dictionary is large, such an operation can save a large number of computations. As the main purpose of this paper is not to emphasize computational speed, we do not illustrate this advantage in the present paper. If necessary, we will study this advantage within practical applications and report our progress in a future publication.

References

[1] A. Armagan, Variational bridge regression, J. Mach. Learn. Res., 5 (2009), 17-24.
[2] A. R. Barron, A. Cohen, W. Dahmen, R. A. DeVore, Approximation and learning by greedy algorithms, Ann. Statist., 36 (2008), 64-94.
[3] A. Beck, M. Teboulle, A fast iterative shrinkage-thresholding algorithm for linear inverse problems, SIAM J. Imag. Sci., 2 (2009), 183-202.
[4] C. Bennett, R. Sharpley, Interpolation of Operators, Academic Press, Boston, 1988.
[5] P. Bühlmann, B. Yu, Boosting with the L2 loss: regression and classification, J. Amer. Statist. Assoc., 98 (2003), 324-339.
[6] H. Chen, L. Li, Z. Pan, Learning rates of multi-kernel regression by orthogonal greedy algorithm, J. Statist. Plan. & Infer., 143 (2013), 276-282.
[7] H. Chen, Y. Zhou, Y. Tang, L. Li, Z. Pan, Convergence rate of the semi-supervised greedy algorithm, Neural Networks, 44 (2013), 44-50.
[8] F. Cucker, S. Smale, On the mathematical foundations of learning, Bull. Amer. Math. Soc., 39 (2001), 1-49.
[9] F. Cucker, S. Smale, Best choices for regularization parameters in learning theory: on the bias-variance problem, Found. Comput. Math., 2 (2002), 413-428.
[10] F. Cucker, D. X. Zhou, Learning Theory: An Approximation Theory Viewpoint, Cambridge University Press, Cambridge, 2007.
[11] W. Dai, O. Milenkovic, Subspace pursuit for compressive sensing signal reconstruction, IEEE Trans. Inf. Theory, 55 (2009), 2230-2249.
[12] I. Daubechies, M. Defrise, C. De Mol, An iterative thresholding algorithm for linear inverse problems with a sparsity constraint, Commun. Pure Appl. Math., 57 (2004), 1413-1457.
[13] I. Daubechies, R. A. DeVore, M. Fornasier, C. Güntürk, Iteratively re-weighted least squares minimization for sparse recovery, Commun. Pure Appl. Math., 63 (2010), 1-38.
[14] R. DeVore, V. Temlyakov, Some remarks on greedy algorithms, Adv. Comput. Math., 5 (1996), 173-187.
[15] D. Donoho, M. Elad, V. Temlyakov, On Lebesgue-type inequalities for greedy approximation, J. Approx. Theory, 147 (2007), 185-195.
[16] D. L. Donoho, Y. Tsaig, O. Drori, J. L. Starck, Sparse solution of underdetermined systems of linear equations by stagewise orthogonal matching pursuit, IEEE Trans. Inf. Theory, 58 (2012), 1094-1121.
[17] J. Friedman, Greedy function approximation: a gradient boosting machine, Ann. Statist., 29 (2001), 1189-1232.
[18] G. H. Golub, M. T. Heath, G. Wahba, Generalized cross-validation as a method for choosing a good ridge parameter, Technometrics, 21 (1979), 215-223.
[19] L. Györfi, M. Kohler, A. Krzyzak, H. Walk, A Distribution-Free Theory of Nonparametric Regression, Springer, Berlin, 2002.
[20] S. Kunis, H. Rauhut, Random sampling of sparse trigonometric polynomials II: Orthogonal matching pursuit versus basis pursuit, Found. Comput. Math., 8 (2008), 737-763.
[21] S. B. Lin, Y. H. Rong, X. P. Sun, Z. B. Xu, Learning capability of relaxed greedy algorithms, IEEE Trans. Neural Netw. & Learn. Syst., 24 (2013), 1598-1608.
[22] S. B. Lin, J. S. Zeng, J. Fang, Z. B. Xu, Learning rates of l^q coefficient regularization learning with Gaussian kernel, Neural Comput., 26 (2014), 2350-2378.
[23] E. Liu, V. Temlyakov, The orthogonal super greedy algorithm and applications in compressed sensing, IEEE Trans. Inf. Theory, 58 (2012), 2040-2047.
[24] E. Liu, V. Temlyakov, Super greedy type algorithms, Adv. Comput. Math., 37 (2012), 493-504.
[25] T. Sauer, Numerical Analysis, Addison-Wesley Longman, London, 2006.
[26] B. Schölkopf, R. Herbrich, A. J. Smola, A generalized representer theorem, in: D. Helmbold, B. Williamson (Eds.), Proceedings of the 14th Annual Conference on Computational Learning Theory, pp. 416-426, Springer, New York, 2001.
[27] L. Shi, Y. L. Feng, D. X. Zhou, Concentration estimates for learning with l^1-regularizer and data dependent hypothesis spaces, Appl. Comput. Harmon. Anal., 31 (2011), 286-302.
[28] V. Temlyakov, Greedy algorithm and m-term trigonometric approximation, Constr. Approx., 14 (1998), 569-587.
[29] V. Temlyakov, Weak greedy algorithms, Adv. Comput. Math., 12 (2000), 213-227.
[30] V. Temlyakov, Greedy algorithms in Banach spaces, Adv. Comput. Math., 14 (2001), 277-292.
[31] V. Temlyakov, Nonlinear methods of approximation, Found. Comput. Math., 3 (2003), 33-107.
[32] V. Temlyakov, Greedy approximation, Acta Numer., 17 (2008), 235-409.
[33] V. Temlyakov, Relaxation in greedy approximation, Constr. Approx., 28 (2008), 1-25.
[34] V. Temlyakov, P. Zheltov, On performance of greedy algorithms, J. Approx. Theory, 163 (2011), 1134-1145.
[35] R. Tibshirani, Regression shrinkage and selection via the LASSO, J. Roy. Statist. Soc. Ser. B, 58 (1996), 267-288.
[36] J. A. Tropp, Greed is good: algorithmic results for sparse approximation, IEEE Trans. Inf. Theory, 50 (2004), 2231-2242.
[37] J. A. Tropp, S. Wright, Computational methods for sparse solution of linear inverse problems, Proceedings of the IEEE, 98 (2010), 948-958.
[38] Y. Zhang, J. Duchi, M. Wainwright, Divide and conquer kernel ridge regression: a distributed algorithm with minimax optimal rates, arXiv:1305.5029, 2013.
[39] Q. Wu, Y. M. Ying, D. X. Zhou, Learning rates of least square regularized regression, Found. Comput. Math., 6 (2006), 171-192.
[40] Z. B. Xu, X. Y. Chang, F. M. Xu, H. Zhang, L_{1/2}