The Use of Mutual Coherence to Prove ℓ1/ℓ0-Equivalence in Classification Problems
Chelsea Weaver*, Naoki Saito
Department of Mathematics, University of California, Davis, One Shields Avenue, Davis, California 95616, United States
* Corresponding author
Abstract
We consider the decomposition of a signal over an overcomplete set of vectors. Minimization of the ℓ1-norm of the coefficient vector can often retrieve the sparsest solution (so-called "ℓ1/ℓ0-equivalence"), a generally NP-hard task, and this fact has powered the field of compressed sensing. Wright et al.'s sparse representation-based classification (SRC) applies this relationship to machine learning, wherein the signal to be decomposed represents the test sample and columns of the dictionary are training samples. We investigate the relationships between ℓ1-minimization, sparsity, and classification accuracy in SRC. After proving that the tractable, deterministic approach to verifying ℓ1/ℓ0-equivalence fundamentally conflicts with the high coherence between same-class training samples, we demonstrate that ℓ1-minimization can still recover the sparsest solution when the classes are well-separated. Further, using a nonlinear transform so that sparse recovery conditions may be satisfied, we demonstrate that approximate (not strict) equivalence is key to the success of SRC.

Keywords: sparse representation, representation-based classification, mutual coherence, compressed sensing
1. Introduction
The decomposition of a given signal or sample over a pre-determined set of vectors is a technique often used in signal processing and pattern recognition. We can store a signal by decomposing it over a fixed basis and keeping only the largest coefficients; in linear regression, predictions are made by estimating parameters via least-squares error using the training data. In the case that the system is underdetermined, so that an infinite number of representations of the signal or sample exist, regularization is often used to make the problem well-posed.
Email addresses: [email protected] (Chelsea Weaver), [email protected] (Naoki Saito). Current address: Amazon Web Services, Seattle, WA.
The question, naturally, is how to choose the type of regularization used, so that the representation is well-suited to the task at hand and can be found efficiently.

In compressed sensing, a fairly recent advancement in signal processing, it is assumed that a vector of signal measurements is represented using an overcomplete set of vectors (often called a dictionary) and that the (unknown) coefficient vector is sparse. Obtaining this sparse solution vector is the key to recovering the complete signal in a way that requires fewer measurements than traditional methods [1]. Thus, to determine the unknown coefficients, an appropriate regularization term should enforce sparsity, i.e., seek the solution requiring the fewest nonzero coefficients. Determining tractable methods for solving such optimization problems is at the core of compressed sensing, as minimizing the ℓ0-"norm" (which counts the number of nonzero coefficients) is NP-hard in general. However, in addition to successful greedy methods such as orthogonal matching pursuit [2], it was found that sparse regularization can, in many circumstances, be replaced with minimization of the ℓ1-norm (which sums the coefficient magnitudes) to the same effect. That is, under certain conditions, minimization of the ℓ1-norm is equivalent to sparse regularization, hence the term "ℓ1/ℓ0-equivalence". Though requiring an iterative algorithm to solve, this relaxation to ℓ1-minimization reduces the optimization problem to a linear program and can be solved efficiently. Much work (see, for example, the seminal papers by Candès and Tao [3] and Donoho [4]) has shown that, under certain conditions, ℓ1-minimization exactly recovers the sparsest solution, and analogous results hold in the case of noisy data. We review some of these results in Section 2.2.

A technique similar to that used in compressed sensing has been successfully applied to tasks in pattern recognition. The popular classification method sparse representation-based classification (SRC) [5], proposed by Wright et al. in 2009, classifies a given test sample by decomposing it over an overcomplete set of training samples so that the ℓ1-norm of the coefficient vector is minimized. The test sample is assigned to the class with the most contributing coefficients (in terms of reconstruction). By minimizing the ℓ1-norm, the goal is that the sparsest such representation will be found (as in compressed sensing), and that this will automatically produce nontrivial nonzero coefficients at training samples in the same class as the test sample, rendering correct classification. Similar approaches have been used in dimensionality reduction [6], semi-supervised learning [7], and clustering [7].

In this paper, we investigate the role of sparsity in SRC, specifically, the two-fold question of: (i) whether or not ℓ1/ℓ0-equivalence can be achieved in practice, i.e., whether ℓ1-minimization reliably produces the sparsest solution in the classification context; and (ii) whether this equivalence is necessary for good classification performance.
The inherent problem with (i) is that practically-implementable recovery conditions under which ℓ1-minimization is guaranteed to find the sparsest solution require that the vectors in the dictionary be incoherent, or in some way "spread out" in space. These guarantees hold with high probability, for example, on dictionaries of vectors that are randomly generated from certain probability distributions and dictionaries consisting of randomly-selected rows of the discrete Fourier transform matrix [8, 9, 3]. Obviously, unlike these examples, data samples in the same class are often highly correlated. In fact, strong inner-class similarity generally makes the data easier to classify.

Our contributions in this paper are the following:

1. We show that the fundamental assumptions of SRC are in direct contradiction with applicable and tractable sparse recovery guarantees. It follows that the experimental success of SRC should not automatically imply the usefulness of sparsity in this framework.

2. Using a randomly-generated database designed to model facial images, we show that ℓ1-minimization can still recover the sparsest solution on highly-correlated data, provided that the classes are sufficiently well-separated. Thus the lack of an implementable equivalence guarantee does not automatically imply lack of equivalence in SRC, at least on certain databases.

3. We investigate the feasibility and implementation of a nonlinear transform that maximally spreads out the training samples in each class while maintaining the dataset's class structure. Though there are strict limitations on the design of such a transform, which we describe in detail in Section 7, we demonstrate that the higher-dimensional space can allow for the application of equivalence guarantees while still allowing us to classify the dataset. This renders a method for examining the relationship between classification accuracy and the sparsity of the coefficient vector in SRC, and how close this is to the (provably) sparsest solution. We demonstrate that approximate (and not strict) equivalence between the ℓ1-minimized solution and the sparsest solution is the key to the success of SRC.

The paper is organized as follows: We begin by motivating and reviewing the basics of compressed sensing and sparse recovery guarantees in Section 2, and we give an overview of SRC in Section 3. In Section 4, we formally describe the conflict between ℓ1/ℓ0-recovery guarantees and classification data, and in Section 5, we rigorously assess the applicability of these recovery guarantees in the classification context. Section 6 presents empirical findings relating sparse recovery and highly-correlated data. In Section 7, we investigate the feasibility of a nonlinear data transform to force the aforementioned recovery guarantees to hold and insights that can be gained from this procedure. We conclude this paper in Section 8.
2. Compressed Sensing and Recovery Guarantees
In this section, we detail the motivation behind ℓ1/ℓ0-equivalence and state practically-implementable equivalence theorems.

Suppose that we wish to collect information about (i.e., sample or take measurements of) a continuous signal f(t) and then send or store this information in an efficient manner. For example, f(t) could be a sound wave or an image. Also suppose that a good approximation of the original signal must later be recovered. According to the Nyquist/Shannon sampling theorem, we must sample f(t) at a rate of at least twice its maximum frequency in order to be able to reconstruct f(t) exactly [10]. But in some applications, doing so may be expensive or even impossible.

In the circumstances that we are able to take many measurements of f(t) to obtain its discrete analog f ∈ R^N, one efficient method of compressing it is the following procedure: Let the columns of Ψ := [ψ_1, . . . , ψ_N] form an orthonormal basis for R^N, and suppose that f has a sparse representation in this basis, i.e., that we can write f = Σ_{j=1}^N α_j ψ_j, where α_j := ⟨f, ψ_j⟩, 1 ≤ j ≤ N, and α := [α_1, . . . , α_N]^T is sparse. Setting all but the k largest (in absolute value) entries of α to 0 in order to obtain α_k, it can be shown that Ψα_k gives the best k-term least-squares approximation of f in this basis. Clearly, the sparser α is, the better approximation we will obtain of f, and in the case that α has no more than k nonzero coefficients, we recover the exact solution. This is the basic idea behind so-called transform coding; the most popular example is the JPEG image compression standard [11], which uses the discrete cosine transform as the sparsifying basis Ψ.

The problem with this procedure is that it is inefficient to collect all N samples if we are only going to throw most (all but k) of them away when the signal is compressed. This is the motivation behind compressed sensing, originally proposed by Candès and Tao [3] and Donoho [4] (see also Candès and Tao's work [12] and the paper by Candès et al. [13]). Let Φ ∈ R^{m×N} be a sensing or measurement matrix with m < N, and consider the underdetermined system

    y := Φf = ΦΨα = Xα

for sparse α, where we have set X := ΦΨ. Using ‖α‖_0 to denote the number of nonzero coordinates of α (hence the terminology "ℓ0-'norm'"—observe that ‖·‖_0 is only a pseudonorm because it does not satisfy homogeneity), we would ideally recover f by solving the optimization problem

    α_0 := argmin_{α ∈ R^N} ‖α‖_0   subject to   Xα = y                                  (1)

and setting f̂ := Ψα_0 with f̂ ≈ f. Unfortunately, solving Eq. (1) is NP-hard. When X satisfies certain conditions and when α is sufficiently sparse, however, the solution to Eq. (1) can be found by solving the ℓ1-minimization problem

    α_1 := argmin_{α ∈ R^N} ‖α‖_1   subject to   Xα = y.                                 (2)

This was a riveting finding, as the optimization problem in Eq. (2) is convex and can be solved efficiently. It has been shown that, under certain conditions (e.g., when the columns of Φ are uniformly random on the sphere S^{m−1}), this procedure produces an approximation of f that is as good as its best k-term approximation [4]. Further, theoretical and experimental results demonstrate that in many situations, the number of measurements m needed to recover f is significantly less than N and can be much lower than the number required by the Nyquist/Shannon theorem. For example, when the measurement matrix Φ ∈ R^{m×N} contains i.i.d. Gaussian entries, exact recovery of α via ℓ1-minimization can be achieved (with high probability) with only m = O(k log(N/k)) measurements, where ‖α‖_0 = k [3].
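As a concrete aside (our illustration, not code from the paper), the following sketch computes a best k-term approximation in the orthonormal DCT basis, the sparsifying transform used by JPEG; NumPy and SciPy are assumed, and the test signal is hypothetical.

```python
import numpy as np
from scipy.fft import dct, idct

def best_k_term(f, k):
    """Best k-term approximation of f in the orthonormal DCT basis."""
    alpha = dct(f, norm="ortho")            # coefficients <f, psi_j>
    idx = np.argsort(np.abs(alpha))[-k:]    # indices of the k largest |alpha_j|
    alpha_k = np.zeros_like(alpha)
    alpha_k[idx] = alpha[idx]               # zero out everything else
    return idct(alpha_k, norm="ortho")      # reconstruct Psi @ alpha_k

t = np.linspace(0, 1, 256, endpoint=False)
f = np.cos(2 * np.pi * 3 * t) + 0.5 * np.cos(2 * np.pi * 8 * t)  # nearly DCT-sparse
f_hat = best_k_term(f, k=8)
print(np.linalg.norm(f - f_hat) / np.linalg.norm(f))  # small relative error
```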
Even more astoundingly, similar results hold in the presence of noise. Suppose that the noiseless vector y_0 is replaced with y = y_0 + z, for z ∈ R^m a vector of errors satisfying ‖z‖_2 ≤ ζ. It follows that under certain conditions (see Section 2.2.1), the ℓ1-minimization problem

    α_{1,ε} := argmin_{α ∈ R^N} ‖α‖_1   subject to   ‖Xα − y‖_2 ≤ ε                      (3)

is guaranteed to recover a coefficient vector approximating the ground truth sparse vector α_0 (the solution to Eq. (1)) with ‖α_{1,ε} − α_0‖_2 ≤ C_k (ε + ζ) [14]. The constant C_k depends on properties of the matrix X and the sparsity level ‖α_0‖_0 = k.

A popular application of compressed sensing is magnetic resonance imaging (MRI), in which the measurement matrix Φ consists of m randomly-selected rows of the discrete Fourier transform matrix in R^{N×N} [15]. Other applications abound in the areas of data acquisition and compression, including sensor networks [16], seismology [17], and single-pixel cameras [18].

2.2. Recovery Guarantees

The conditions under which ℓ1-minimization can guarantee exact or approximate recovery of the sparsest solution (e.g., conditions under which the solutions to Eq. (1) and Eq. (2) are equal, i.e., ℓ1/ℓ0-equivalence holds) are called recovery guarantees. These conditions concern the incoherence (or spread) of the vectors in the dictionary. Essentially, recovery guarantees cannot be applied when the vectors are too correlated. A prototypical example is that if the dataset contains two copies of the same vector (i.e., a pair of maximally-correlated vectors), then the minimum ℓ1-norm solution may contain a nonzero coefficient at either one of the copies or at a combination of the two. Contrast this with the sparsest solution, which would never contain nonzero coefficients at both copies.

There are various ways of measuring the incoherence in a dictionary, each leading to its own theory relating the solutions of Eq. (1) and Eq. (2) (or its noisy version, Eq. (3)). In this paper, we focus primarily on recovery guarantees stated in terms of mutual coherence, and we review mutual coherence-based recovery guarantees below. Unlike other approaches, the mutual coherence method is both tractable and deterministic, as we subsequently discuss.

To make the problem more general, we no longer explicitly assume the use of a sparsifying transform matrix Ψ and consider the general system Xα = y, for X ∈ R^{m×N} with m < N.

2.2.1. Recovery Guarantees in Terms of Mutual Coherence
Definition 2.1. Given a matrix X = [x_1, . . . , x_N] ∈ R^{m×N} with normalized columns (so that ‖x_i‖_2 = 1 for 1 ≤ i ≤ N), the mutual coherence of X, denoted μ(X), is given by

    μ(X) := max_{1 ≤ i ≠ j ≤ N} |⟨x_i, x_j⟩|.                                            (4)

Note that mutual coherence costs O(N²m) to compute.

Theorem 2.1 (Donoho and Elad [19]; Gribonval and Nielsen [20]). Let X ∈ R^{m×N}, m < N, have normalized columns and mutual coherence μ(X). If α satisfies Xα = y with

    ‖α‖_0 < (1/2) (1 + 1/μ(X)),                                                          (5)

then α is the unique solution to the ℓ1-minimization problem in Eq. (2).

This means that if ℓ1-minimization finds a solution with fewer than (1/2)(1 + 1/μ(X)) nonzeros, then it is necessarily the sparsest solution, and so ℓ1/ℓ0-equivalence holds.

Given noise tolerance ζ and approximation error bound ε, the following theorem by Donoho et al. gives conditions for ℓ1/ℓ0-equivalence in the noisy setting:

Theorem 2.2 (Donoho, Elad, and Temlyakov [14]). Let X ∈ R^{m×N}, m < N, have normalized columns and mutual coherence μ(X). Suppose there exists an ideal noiseless signal y_0 such that y_0 = Xα_0 and

    ‖α_0‖_0 = k ≤ (1/4) (1 + 1/μ(X)).                                                    (6)

Then α_0 is the unique sparsest representation of y_0 over X. Further, suppose that we only observe y = y_0 + z with ‖z‖_2 ≤ ζ. Then we have

    ‖α_{1,ε} − α_0‖_2² ≤ (ε + ζ)² / (1 − μ(X)(4k − 1)),                                  (7)

where α_{1,ε} is the solution to Eq. (3).

That is, if the ideal sparse vector α_0 is sparse enough and the mutual coherence of X is small enough, ℓ1-minimization will give us a solution close to α_0, with "how close" depending on the sparsity level k, mutual coherence μ(X), noise tolerance ζ, and approximation error bound ε.

Something can also be said regarding the support of α_{1,ε} in the noisy setting:

Theorem 2.3 (Donoho, Elad, and Temlyakov [14]). Suppose that y = y_0 + z, where y_0 = Xα_0, ‖α_0‖_0 ≤ k, and ‖z‖_2 ≤ ζ. Suppose that β := μ(X)k < 1/2 (so k < 1/(2μ(X))). Set

    γ := sqrt(1 − β) / (1 − 2β).                                                         (8)

Then given α_{1,ε} the solution to Eq. (3) with exaggerated error tolerance ε := Cζ, where C = C(μ(X), k) := γ√k, we have that supp(α_{1,ε}) ⊂ supp(α_0).

That is, the support of the solution α_{1,ε} to Eq. (3) is contained in that of the sparsest solution α_0. (Observe that α_0 is indeed the sparsest solution by Theorem 2.1, since ‖α_0‖_0 < 1/(2μ(X)) < (1/2)(1 + 1/μ(X)).) Since ε = γ√k ζ and γ ≥ 1, ε ≥ ζ is required in Theorem 2.3.
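Since Definition 2.1 and Theorem 2.1 are used throughout this paper, a minimal sketch may help fix ideas. This is our own illustration (NumPy assumed, with a hypothetical random dictionary), not code from the paper:

```python
import numpy as np

def mutual_coherence(X):
    """mu(X): the largest absolute inner product between distinct columns."""
    Xn = X / np.linalg.norm(X, axis=0)   # normalize columns, as in Definition 2.1
    G = np.abs(Xn.T @ Xn)                # Gram matrix: O(N^2 m) work
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(0)
m, N = 50, 100
X = rng.standard_normal((m, N))
mu = mutual_coherence(X)
print("mu(X):", mu)
# Theorem 2.1 certifies l1/l0-equivalence only for solutions with
# fewer than (1/2)(1 + 1/mu(X)) nonzeros:
print("certified sparsity level:", 0.5 * (1 + 1 / mu))
```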
There are methods of proving ℓ1/ℓ0-equivalence that do not involve mutual coherence. For example, those using the restricted isometry constant involve a quantification of how close any set of k columns of X is to being an orthonormal basis [21, 22], and other guarantees use the smallest number of linearly dependent columns of X, defined as the spark of X [19]. However, these approaches are generally not tractable in deterministic settings; their usefulness is largely limited to applications in which X is a random matrix with known (with high probability) restricted isometry constant or spark.

Alternatively, if we desire stochastic results, there are other recovery guarantees involving versions of mutual incoherence. When applied to random matrices, these guarantees are generally stronger than those in Theorems 2.1 and 2.2 (in terms of requiring fewer measurements and/or less sparsity of the solution vector). For example, Candès and Plan [23] provide conditions that guarantee recovery (with high probability) of sparse and approximately sparse solutions in the case that the rows of the dictionary are sampled independently from certain probability distributions. These conditions are in terms of incoherence defined as an upper bound on the squared norms of the rows of X (either deterministically or stochastically), and require an isotropy property [23]. In the case that the probability distribution has mean 0, this property states that the covariance matrix of the probability distribution is equal to the identity matrix. In another paper [24], Candès and Plan guarantee probabilistic recovery in terms of a condition on mutual coherence (as defined in Definition 2.1) that is satisfied with high probability on certain random matrices. These recovery guarantees allow the sparsity level k in the case of these random matrices to be notably larger than in Eq. (5) in Theorem 2.1. We also mention the results by Tropp [25] concerning recovery in terms of mutual coherence and the extreme singular values of randomly-chosen subsets of dictionary columns.

If we do not assume that classification data are drawn from a particular probability distribution, then these stochastic results either do not apply or are intractable to compute. Thus Donoho et al.'s theorems discussed in Section 2.2.1 are the best tool we have to prove ℓ1/ℓ0-equivalence given an arbitrary (possibly large) matrix of training data. That said, it is important to note that these mutual coherence theorems produce what are generally considered to be fairly loose bounds on the sparsity level ‖α‖_0, given experimental results and cases for which restricted isometry constants are known [26, Chap. 10].

3. Sparse Representation-Based Classification

We next review Wright et al.'s application of the ℓ1-norm/sparsity relationship to classification. In reviewing the compressed sensing framework, we referred to our underdetermined system using the notation Xα = y (or Xα_0 = y_0, if the represented signal was expected to be noisy), for X ∈ R^{m×N}. To differentiate the classification context, let X_tr ∈ R^{m×N_tr} be the matrix of training samples, and let y ∈ R^m be an arbitrary test sample.

SRC solves

    α* := argmin_{α ∈ R^{N_tr}} ‖α‖_1   subject to   y = X_tr α.                         (9)

Alternatively, in the case of noise, in which an exact representation may not be desirable (see the discussion at the beginning of Section 5), one can solve the regularized optimization problem

    α* := argmin_{α ∈ R^{N_tr}} { ‖y − X_tr α‖_2² + λ‖α‖_1 }.                            (10)

Here, λ controls the trade-off between error in the approximation and the sparsity of the coefficient vector.

For a classification problem with L classes, define the indicator function δ_l : R^{N_tr} → R^{N_tr}, l = 1, . . . , L, to set all coordinates corresponding to training samples not in class l to 0 (and to act as the identity on all remaining coordinates). After obtaining α* from Eq. (9) or (10), the class label of y is predicted using

    class label(y) = argmin_{1 ≤ l ≤ L} ‖y − X_tr δ_l(α*)‖_2.                            (11)

As mentioned in the introduction, it is assumed that by constraining the number of nonzero representation coefficients, nonzeros will occur at training samples most similar to the test sample, and thus Eq. (11) will reveal the correct class.
This works as follows: It is assumed that each class manifold is a linear subspace spanned by its set of training samples, so that if the number of classes L is large with regard to N_tr, there exists a sparse (in terms of the entire training set) representation of y using training samples in its ground truth class. The coefficient vector α* is an attempt at finding this class representation, and Eq. (11) is used to allow for a certain amount of error. In essence, SRC classifies y to the class that contributes the most to its sparse (via ℓ1-minimization) representation (or approximation, if Eq. (10) is used). SRC is summarized in Algorithm 1.
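To make Algorithm 1 concrete, here is a minimal sketch of SRC in Python (our own hedged illustration: the helper solves Eq. (9) as a linear program via SciPy, whereas the paper's experiments use dedicated ℓ1 solvers such as HOMOTOPY and SPGL1; all names are ours):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(X, y):
    """Solve Eq. (9), min ||a||_1 s.t. X a = y, as an LP in the variable [a; t]."""
    m, N = X.shape
    c = np.concatenate([np.zeros(N), np.ones(N)])      # minimize sum(t)
    A_eq = np.hstack([X, np.zeros((m, N))])            # X a = y
    I = np.eye(N)
    A_ub = np.vstack([np.hstack([I, -I]),              #  a - t <= 0
                      np.hstack([-I, -I])])            # -a - t <= 0
    b_ub = np.zeros(2 * N)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=y,
                  bounds=[(None, None)] * N + [(0, None)] * N)
    return res.x[:N]

def src_classify(X_tr, labels, y):
    """Assign y to the class minimizing the residual in Eq. (11)."""
    a = basis_pursuit(X_tr, y)
    residuals = {}
    for l in np.unique(labels):
        a_l = np.where(labels == l, a, 0.0)            # delta_l(a*)
        residuals[l] = np.linalg.norm(y - X_tr @ a_l)
    return min(residuals, key=residuals.get)
```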
4. The Conflict
In classification problems, samples from the same class may be highly correlated. As demonstrated in Table 1, the mutual coherence (as defined in Eq. (4)) of a training matrix X = X_tr is often quite large. When μ(X_tr) ≈ 1, the mutual coherence bound in Theorem 2.1 becomes

    ‖α‖_0 < (1/2) (1 + 1/μ(X_tr)) ≈ 1.

Algorithm 1 Sparse Representation-Based Classification (SRC) [5]
Input:
Matrix of normalized training samples X_tr ∈ R^{m×N_tr}, test sample y ∈ R^m, number of classes L, and error/sparsity trade-off λ (optional)
Output:
The computed class label of y: class label(y)
Solve either the constrained problem in Eq. (9) or the regularized problem in Eq. (10).
for each class l = 1, . . . , L do
  Compute the norm of the class l residual: err_l(y) := ‖y − X_tr δ_l(α*)‖_2.
end for
Set class label(y) = argmin_{1 ≤ l ≤ L} {err_l(y)}.

Database                                   N_tr    m        m_PCA = 30    m_PCA = 56    m_PCA = 120
AR-1 [27]                                  700     19800    0.9991        0.9987        0.9985
AR-2 [27]                                  1000    19800    0.9993        0.9988        0.9984
Extended Yale Face Database B [28]         1216    32256    0.9951        0.9954        0.9941
Database of Faces (formerly "ORL") [29]    200     10304    0.9971        0.9970        0.9966
Table 1: Average mutual coherence (over 10 trials) computed from training sets X_tr of some popular face databases after PCA pre-processing to dimension m_PCA. The original sample dimension is given by m. The training sets were chosen by randomly selecting half of the samples from each database, for a total of N_tr training samples. AR-1 contains all the unoccluded images (no sunglasses or scarf) from both sessions of the AR Face Database [27]; AR-2 contains all the unoccluded images from both sessions, as well as the occluded images from Session 1.

Since ‖α‖_0 denotes the number of nonzero coefficients in the representation of y over X_tr, it will never satisfy ‖α‖_0 < 1. Thus we cannot use Theorem 2.1 to prove ℓ1/ℓ0-equivalence in SRC, for example, on the databases used in Table 1.

It follows that the "theory" behind sparse representation-based methods for learning (like SRC) is missing a significant piece. In the next three sections, we aim to provide insight into the following three questions:

1. Can Theorem 2.1 ever be used to prove ℓ1/ℓ0-equivalence in SRC?
2. Regardless of theoretical guarantees, is ℓ1-minimization finding the sparsest solution in practice in SRC?
3. What is the role of sparsity in SRC's classification performance?
5. Mutual Coherence Equivalence and Classification
In this section, we identify cases in which the condition given in Eq. (5) from Theorem 2.1 provably does not hold, and thus we cannot use Theorem 2.1 to prove ℓ1/ℓ0-equivalence. We also discuss analogous results in the noisy case, i.e., Eq. (6) in Theorem 2.2. In particular, we are concerned with the applicability of these theorems for classification problems. Before we begin, we take a moment to clarify notation:

• In discussing compressed sensing in Section 2, we used y_0 to refer to a clean measurement vector and y := y_0 + z to refer to its noisy version. In contrast, in this section and in Section 7, y may represent either a clean or noisy measurement vector, or an arbitrary test sample (as it does in Algorithm 1). We do this because, in the context of representation-based classification, there are reasons other than noise in the test sample for allowing the equality y = X_tr α to hold only approximately: the training data could also be corrupted, or we may want to relax the assumption that class manifolds are linear subspaces (perhaps this is only approximately, or locally, the case). Additionally, it is difficult to determine the amount of noise in test samples in real-world problems. To keep the situation general and to avoid confusion, we will only differentiate between y_0 and y when we explicitly consider y = y_0 + z with ‖z‖_2 ≤ ζ the noise vector, as in Donoho et al.'s Theorems 2.2 and 2.3.

  When we explicitly consider data from a classification problem, we will use the subscript "tr." That is, in the general compressed sensing representation y = Xα, we set X = X_tr when we want to denote a matrix of training samples, and when this is done, it is assumed that y specifically designates a test sample.

• For the underdetermined system y = Xα, we have already seen several instantiations of the coefficient vector α. We denoted the sparsest coefficient vector, i.e., the solution to the ℓ0-minimization problem given in Eq. (1), by α = α_0, and we used α = α_1 and α = α_{1,ε} to denote the coefficient vectors found using ℓ1-minimization (in particular, the solutions to Eq. (2) and Eq. (3), respectively). In contrast, α = α* denotes the solution to the SRC optimization problem (the solution to Eq. (9) or (10)). It is possible to have α* = α_1 or α* = α_{1,ε}, depending on the optimization problem used in SRC and the amount of noise in the test sample. In particular, α* = α_1 if Eq. (9) is used in SRC, and α* = α_{1,ε} if Eq. (10) is used and the test sample satisfies y = y_0 + z with ‖z‖_2 ≤ ζ.

We will use the following lemma, which gives a lower bound on mutual coherence in the underdetermined setting:
Lemma 5.1 (Welch [30], Rosenfeld [31]). For X ∈ R^{m×N} with normalized columns and m < N, we have that

    μ(X) ≥ sqrt( (N − m) / (m(N − 1)) ).                                                 (12)

It is straightforward to show that Lemma 5.1 implies that μ(X) ≥ 1/m, since sqrt((N − m)/(m(N − 1))) monotonically increases in N ∈ N for N > m, with a minimum value of 1/m attained at N = m + 1. Thus to have even a chance of Theorem 2.1 or 2.2 holding, we must have

    ‖α‖_0 < (1/c) (1 + 1/μ(X)) ≤ (1/c) (1 + m),                                          (13)

where c = 2 in the noiseless case and c = 4 in the noisy case.

We next consider the smallest possible value of the number of nonzeros ‖α‖_0 in any classification problem representation X_tr α = y. Let us assume that the test sample is not a scalar multiple of any training sample. It follows that ‖α‖_0 ≥ 2. Thus in order for Theorem 2.1 or 2.2 to hold, we must have

    2 ≤ ‖α‖_0 < (1/c) (1 + 1/μ(X_tr))  ⇒  μ(X_tr) < 1/(2c − 1)
      ⇒  μ(X_tr) < 1/3 in the noiseless case and μ(X_tr) < 1/7 in the noisy setting.

Note that these upper bounds for μ(X_tr) are very small compared to the values of μ(X_tr) in Table 1. These findings produce the following small-scale result:
Proposition 5.1. Suppose that X_tr α = y. If m ≤ 3 and y is not a scalar multiple of any training sample, then the inequality in Eq. (5) with X = X_tr does not hold. That is, we cannot use Theorem 2.1 to prove ℓ1/ℓ0-equivalence in SRC.

Proof. By Lemma 5.1, we must have that μ(X_tr) ≥ 1/m ≥ 1/3. □

An analogous statement holds in the noisy setting (Theorem 2.2) for m ≤ 7.

5.2. Main Result
Proposition 5.2 (Main Result). Suppose that the sparsest representation of y ∈ R^m over the dictionary X = [x_1, . . . , x_N] ∈ R^{m×N} is given by y = α_{j_1} x_{j_1} + · · · + α_{j_k} x_{j_k} for {j_1, . . . , j_k} ⊂ {1, . . . , N}. Set Ñ to be the number of columns of X contained in X̃ := span{x_{j_1}, . . . , x_{j_k}}, where clearly Ñ ≥ k. If Ñ > k, then the inequality in Eq. (5) does not hold. That is, we cannot use Theorem 2.1 to prove ℓ1/ℓ0-equivalence.

Proof. Suppose that Ñ > k. Then there are more than k dictionary elements in the subspace X̃. Since the vectors x_{j_1}, . . . , x_{j_k} are linearly independent (because otherwise, y could be expressed more sparsely), the dimension of X̃ is exactly k.

Define the matrix X̃ ∈ R^{m×Ñ} whose columns are the Ñ dictionary elements contained in the subspace X̃. Let the singular value decomposition of X̃ be given by X̃ = UΣV^T, and set U_k to contain the first k columns of U, V_k to contain the first k columns of V, and Σ_k to contain the first k columns and rows of Σ. Because X̃ has rank k, we can alternatively write X̃ = U_k Σ_k V_k^T.

The k × Ñ matrix U_k^T X̃ has the same mutual coherence as X̃, since they have the same Gram matrices:

    (U_k^T X̃)^T (U_k^T X̃) = X̃^T U_k U_k^T X̃
                            = (U_k Σ_k V_k^T)^T U_k U_k^T (U_k Σ_k V_k^T)
                            = V_k Σ_k^T U_k^T U_k U_k^T U_k Σ_k V_k^T
                            = V_k Σ_k^T U_k^T U_k Σ_k V_k^T
                            = (U_k Σ_k V_k^T)^T (U_k Σ_k V_k^T)
                            = X̃^T X̃.

By Lemma 5.1, we have that

    μ(X) ≥ μ(X̃) = μ(U_k^T X̃) ≥ sqrt( (Ñ − k) / (k(Ñ − 1)) )
         ≥ sqrt( ((k + 1) − k) / (k((k + 1) − 1)) ) = 1/k.
Thus the bound on k in Theorem 2.1 requires that

    k < (1/2) (1 + 1/μ(X)) ≤ (1/2) (1 + k)  ⇒  k < 1,                                    (14)

which contradicts k being a natural number. □

We present several corollaries to Proposition 5.2. The first is a consequence applicable to any ℓ1-minimization problem, regardless of whether or not the dictionary elements have class structure:

Corollary 5.1 (Consequence for general ℓ1-minimization). If a measurement vector y ∈ R^m is not at all sparse over the dictionary X ∈ R^{m×N}, i.e., if every representation of y requires no fewer than m dictionary elements, then the condition in Eq. (5) from Theorem 2.1 does not hold.

Proof. Because the dimension k of X̃ (as defined in Proposition 5.2) is actually m, every dictionary element is contained in X̃. □

Corollary 5.1 illustrates the importance of choosing a dictionary that awards a sparse representation of y in any application of ℓ1-minimization, including compressed sensing.
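A quick numerical illustration of Proposition 5.2's mechanism (ours, with hypothetical dimensions): when more than k columns lie in a k-dimensional span, the mutual coherence of the whole dictionary is forced to at least 1/k, so the Theorem 2.1 certificate can never reach the k nonzeros needed.

```python
import numpy as np

rng = np.random.default_rng(0)
m, k = 50, 4
B = rng.standard_normal((m, k))                 # k linearly independent columns
inside = B @ rng.standard_normal((k, 6))        # 6 more columns inside span(B)
X = np.hstack([B, inside, rng.standard_normal((m, 30))])
X /= np.linalg.norm(X, axis=0)                  # normalize columns

G = np.abs(X.T @ X)
np.fill_diagonal(G, 0.0)
mu = G.max()
# Proposition 5.2 forces mu(X) >= 1/k, so the Theorem 2.1 bound
# (1/2)(1 + 1/mu(X)) <= (1/2)(1 + k) cannot certify a k-sparse solution.
print(mu >= 1 / k, 0.5 * (1 + 1 / mu) <= 0.5 * (1 + k))   # True True
```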
The following corollary follows from the proof of Proposition 5.2:

Corollary 5.2. Let X ∈ R^{m×N} with m < N, and let k be any positive integer such that k < N. If any set of k linearly independent columns of X spans an additional, distinct column of X, then the bound

    k < (1/2) (1 + 1/μ(X))

does not hold.

Of course, this bound will not hold for any larger values of k, either. This means that if we can find an integer k satisfying the conditions of Corollary 5.2, then any attempt to prove ℓ1/ℓ0-equivalence using Theorem 2.1 will require Xα = y with ‖α‖_0 < k. The following corollary is an explicit consequence for dictionaries consisting of training samples:
Corollary 5.3 (Consequence for Class-Structured Dictionaries). Suppose that y is a test sample with ‖y‖_2 = 1, and define μ := μ(X_tr). If adding y to the set of training samples does not increase its mutual coherence, that is, if |⟨y, x_i⟩| ≤ μ for all 1 ≤ i ≤ N_tr, i.e., μ([y, X_tr]) = μ, then we cannot have both that (i) X_tr α = y and (ii) ‖α‖_0 < (1/2)(1 + 1/μ(X_tr)).

Proof. If we can write X_tr α = y for ‖α‖_0 =: k, then the k (linearly independent) training samples with nonzero coefficients in the representation span a k-dimensional subspace containing y. Setting X = [y, X_tr] in Corollary 5.2, we have that

    k ≮ (1/2) (1 + 1/μ(X)) = (1/2) (1 + 1/μ).

(Corollary 5.2 can alternatively be proven using the equivalence theorem involving spark; see the work of Donoho and Elad [19].)
On the other hand, if

    k < (1/2) (1 + 1/μ(X)) = (1/2) (1 + 1/μ)

for some positive integer k < N_tr, then also by Corollary 5.2, it must be the case that y is not contained in the subspace spanned by any k linearly independent distinct columns of X, i.e., columns of X_tr. Thus we cannot write X_tr α = y for any α satisfying ‖α‖_0 = k. □

It might initially seem that the hypothesis of Corollary 5.3 is unlikely to hold. However, if one assumes that the data is sampled randomly, with test samples having the same distribution as the training samples in their ground truth classes, then the hypothesis that μ([y, X_tr]) = μ(X_tr) becomes much more probable. We discuss this further in Section 7.

Our final corollary determines conditions under which the bound in Eq. (5) from Theorem 2.1 is theoretically incompatible with the explicit assumptions made in SRC [5]. We review these assumptions briefly:

Assumption 1 (Linear Subspaces). The ground truth class manifolds of the given dataset are linear subspaces.
Assumption 2 (Spanning Training Set). The training matrix X_tr contains sufficient samples in each class to span the corresponding linear subspace.

Corollary 5.4 (Consequence for SRC). Suppose that the SRC Assumptions 1 and 2 hold. Let y have ground truth class l, and suppose that the number of class l training samples, N_l, is large, i.e., N_l > d_l, for d_l the dimension of the linear subspace representing the class l manifold. Then there exists a test sample y which requires the maximum number d_l of class l training samples to represent it. If this representation of y is its sparsest representation over the dictionary X_tr, then the condition in Eq. (5) from Theorem 2.1 cannot hold. Thus we cannot use Theorem 2.1 to prove ℓ1/ℓ0-equivalence in SRC.

Corollary 5.4 says that if we have a surplus of class l training samples (i.e., more than enough to span the class l subspace), then, provided that the "class representations" (representations of the test samples in terms of their ground truth classes) truly are the sparsest representations of the test samples over the training set (as argued by the SRC authors [5]), there will be some test samples for which Theorem 2.1 cannot hold. These test samples are exactly those requiring k = d_l class l training samples in their representations. In general, such test samples must exist; otherwise, the dimension of the class l subspace would be less than d_l. To reiterate, if everything we want to happen in SRC actually happens (large class sizes, sparse class representations), then we cannot consistently use Theorem 2.1 to prove ℓ1/ℓ0-equivalence.

On a more positive note, the assumptions in SRC make it possible to estimate whether or not the conditions of Proposition 5.2 hold. Though these conditions are difficult to check in general (if we knew the sparsest solution of y over the dictionary, then we would not need to use ℓ1-minimization to find it), the linear subspace assumption in SRC gives us a heuristic for doing so. We could potentially estimate the dimension of each class (using a method such as multiscale SVD [32] or DANCo [33], for example) and compare this with the number of training samples in that class. If the latter is larger than the former, then we expect that Theorem 2.1 cannot be applied for some test samples.

In typical applications, we must deal with noisy data. Thus we should consider the application of Theorem 2.2 instead of Theorem 2.1. But this is immediate: Since the mutual coherence condition is stricter in the case of noise, the consequences of Proposition 5.2 and the above corollaries hold whenever the conditions are assumed to hold on the clean version of the data. In particular, Theorem 2.2 requires the existence of a clean test sample y_0 (even if it is unknown to us) that satisfies Xα_0 = y_0 with ‖α_0‖_0 ≤ (1/4)(1 + 1/μ(X)). Under the hypothesis of Corollary 5.3 (setting y = y_0), such a y_0 cannot exist.

In concluding this section, we stress that the mutual coherence conditions in Theorems 2.1 and 2.2 are sufficient, but not necessary, for ℓ1/ℓ0-equivalence. Thus it is possible for ℓ1-minimization to find (or closely approximate) the sparsest solution even when the conditions of these theorems do not hold. Whether or not this happens in the context of SRC is the topic of the next section.
6. Equivalence on Highly-Coherent Data
In this section, we investigate whether sparsity is reliably achieved via ℓ1-minimization on highly-correlated data, such as class-structured databases.

We are inspired by the data model and subsequent work of Wright and Ma [34] (see also the work of Wright et al. [35]), which produces an ℓ1/ℓ0-equivalence guarantee for dictionaries containing vectors assumed to model facial images. We summarize their result briefly.

Previous work has shown that the set of facial images of a fixed subject (person) under varying illumination conditions forms a convex cone, called an illumination cone, in pixel space [28, 36]. Wright and Ma demonstrate that in fact the set of facial images under varying illuminations over all subjects combined exhibits this cone structure. For example, they show that this is the case for the entire set of (raw) samples from the Extended Yale B Face Database [28]. Further, this cone becomes extremely narrow, i.e., a "bouquet," as the number of pixels grows large [34]. These findings reiterate that class-structured data, particularly face databases, are highly coherent.

Lee et al. [37] showed that any image from the illumination cone can be expressed as a linear combination of just a few images of the same subject under varying lighting conditions. In other words, illumination cones are well-approximated by linear subspaces. Thus the SRC condition that class manifolds are (approximately) linear subspaces presumably holds for databases made up of facial images under varying lighting conditions.

Given a facial image y ∈ R^m that may be occluded or corrupted by noise, y can thus be expressed as

    y = X_tr α + z,                                                                      (15)

given that certain requirements are satisfied in the sampling of the training data. By the above model, α is assumed to be non-negative (a result of the illumination cone model [35, 28]) and sparse, containing nonzeros at training samples that represent the same subject as y (i.e., are in the same class). Additionally, z is an (unknown) error vector with nonzeros in only a fraction of its coordinates; i.e., the model assumes that only a portion of the pixels are occluded or corrupted [35]. Note that this is not quite the same situation as in the condition for ℓ1/ℓ0-equivalence in the noisy setting given in Theorem 2.2. One difference is that in Eq. (15) above, z is bounded in terms of ℓ0-norm (sparsity) with no limit on ℓ2-norm (magnitude), whereas in Theorem 2.2, z is bounded in terms of magnitude but not sparsity.

The goal, as one might expect, is to recover α from Eq. (15). In the SRC paper [5], Wright et al. use ℓ1-minimization to do this. In particular, they solve

    (α̂_1, ẑ_1) := argmin_{α, z} ‖α‖_1 + ‖z‖_1   subject to   y = X_tr α + z,            (16)

and they show that this version of SRC produces very good classification results on occluded or corrupted facial images. (Again, note that α̂_1 is different from both α_1 and α_{1,ε} discussed earlier, as there is a sparsity constraint instead of an ℓ2-norm bound on the noise component z.)

In a later paper, Wright et al. [35] correctly note that the usual ℓ1/ℓ0-equivalence theorems do not hold on the highly-correlated data in X_tr, and so it cannot be determined whether or not the ℓ1-minimized solution α̂_1 in Eq. (16) is equal to (what is assumed to be) the true sparsest solution α.
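Note that Eq. (16) reduces to a standard basis pursuit problem, since ‖α‖_1 + ‖z‖_1 is the ℓ1-norm of the stacked variable [α; z] over the augmented dictionary [X_tr, I_m]. A minimal sketch (ours, reusing the hypothetical basis_pursuit helper from the SRC sketch in Section 3):

```python
import numpy as np

def extended_l1(X_tr, y):
    """Solve Eq. (16): min ||a||_1 + ||z||_1  s.t.  y = X_tr a + z,
    via basis pursuit on the augmented dictionary B = [X_tr, I]."""
    m, N = X_tr.shape
    B = np.hstack([X_tr, np.eye(m)])
    w = basis_pursuit(B, y)          # helper defined in the earlier SRC sketch
    return w[:N], w[N:]              # (a_hat, z_hat)
```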
Fortunately, Wright and Ma [34] proved a theorem that gives sufficient conditions for this equivalence under an assumed model (called the bouquet model) of facial images; see also Wright et al.'s version [35]. To state the theorem, we will need the following definition:

Definition 6.1 (Proportional Growth [34]). A sequence of signal-error problems y = Xα + z, for X ∈ R^{m×N}, exhibits proportional growth with parameters δ > 0, ρ ∈ (0, 1), and β > 0, if N = ⌊δm⌋, ‖z‖_0 = ⌊ρm⌋, and ‖α‖_0 = ⌊βm⌋.

It follows that δ is the redundancy factor in the dictionary X, and ρ and β control the sparsity of z and α, respectively. Here, β is assumed to be small and may depend on δ and ρ.

We are now in a position to state Wright and Ma's main theorem:

Theorem 6.1 (Wright and Ma [34]). Fix any δ > 0 and ρ < 1. Suppose that X is distributed according to the bouquet model given by

    X = [x_1, . . . , x_N] ∈ R^{m×N},   x_i ~ i.i.d. N(μ, (ν²/m) I_m),
    ‖μ‖_2 = 1,   ‖μ‖_∞ ≤ C_μ m^{−1/2},

for a constant C_μ and ν sufficiently small. Also suppose that the sequence of signal-error problems y = Xα_0 + z_0 for X ∈ R^{m×N} exhibits proportional growth with parameters δ, ρ, and β. Suppose further that J ⊂ {1, . . . , m} is a uniform random subset of size ρm, and that σ ∈ R^m with the entries of σ_J i.i.d. ±1 (independent of J) and σ_{J^C} = 0. Lastly, assume that m is sufficiently large. Then with probability at least 1 − C exp(−γ*m) in X, J, and σ, for all α_0 with ‖α_0‖_0 ≤ β*m and any z_0 with sign vector σ and support J, we have

    (α_0, z_0) = argmin_{α, z} ‖α‖_1 + ‖z‖_1   subject to   Xα + z = Xα_0 + z_0.

Here, C is a numerical constant and β* and γ* are positive constants (independent of m) which depend on δ, ρ, and ν. By "ν sufficiently small" and "m sufficiently large," Wright and Ma mean that there exist constants ν*(δ, ρ) > 0 and m*(δ, ρ, ν) > 0 such that the theorem holds for 0 < ν < ν* and m > m*.

This theorem illustrates that ℓ1/ℓ0-equivalence can provably hold in the classification of highly-coherent data via a random database model.
Remark 6.1. Despite its applicability to highly-coherent data, Theorem 6.1 does not prove that ℓ1/ℓ0-equivalence holds in SRC. First of all, the theorem requires that m be sufficiently large, which may not be the case, especially when feature extraction is used. Second, the model in Theorem 6.1 does not explicitly deal with class-structured data. A true face recognition model should account for the individual subjects, with samples in the same class being (on average) more correlated than those from different classes. Thus our model should contain "sub-bouquets" (i.e., the classes) inside the larger bouquet.

6.2. Experiments

With these changes in mind, we design a random database model that will allow us to study the relationship between sparsity and ℓ1-minimization on highly-coherent and class-structured data, such as the images used in face recognition. First, we specify the dimension m, the number of classes L, and the number of samples N_l ≡ N_1, 1 ≤ l ≤ L, in each training class. We require that N_tr = N_1 L > m so that the resulting dictionary of training samples leads to an underdetermined system. We then randomly generate training data with an increasing amount of cone/bouquet structure as well as class structure, along with a test sample—with known sparse coefficient vector α_0—generated as a linear combination of training samples from a single class. We run a fixed number of trials of the experiment at each of 11 increasing values of coherence (we call these stages) and determine at which stages ℓ1-minimization can closely (or exactly) recover α_0.

(The relationship between β* and β in Theorem 6.1 is not explicitly stated, but it makes sense that β* ≤ β by the proportional growth assumption. Further, if β = β(δ, ρ), then since β* = β*(δ, ρ, ν), we can likely alternatively write β* = β*(β, ν).)

6.2.1. Experimental Setup

For each generated training set X_tr = [X^(1), . . . , X^(L)] ∈ R^{m×N_tr}, we set the (clean) test sample y_0 to be a random vector in the positive span of the class 1 data. That is, we set

    y_0 := α_1^(1) x_1^(1) + · · · + α_{N_1}^(1) x_{N_1}^(1),

where X^(1) := [x_1^(1), . . . , x_{N_1}^(1)] and α_j^(1) ~ unif(0, 1), 1 ≤ j ≤ N_1. We then define

    α_0 := [α_1^(1), . . . , α_{N_1}^(1), 0, . . . , 0]^T ∈ R^{N_tr}.

Given this setup, we want to see if ℓ1-minimization will recover α_0, i.e., if the solution

    α_1 := argmin_{α ∈ R^{N_tr}} ‖α‖_1   subject to   X_tr α = y_0

is equal to α_0. Note that for large L, α_0 can be viewed as a sparse vector.

In Stage 1 of our model, the training data has no class or cone structure and is randomly generated on the unit sphere S^{m−1}. It has been shown experimentally that, for N_tr = 2m and m sufficiently large, an ℓ1-minimization solution with no more than (3/10)m nonzeros is enough to ensure it is the sparsest solution with high probability [9]. Thus we expect to see exact recovery in Stage 1 for values of N_1, m, and L satisfying these requirements.

To add both bouquet and class (or sub-bouquet) structure to the training set in subsequent stages, we define the cone mean x_0 and the class means {x_1, . . . , x_L}.
At Stage i, 1 ≤ i ≤ 11, we set x_0 ~ N(0, I_m) and then modify x_0 ← μ_i x_0/‖x_0‖_2, where μ_i := (i − 1)/10 effectively increases the length of the cone mean from 0 as i increases. Next, each class mean is randomly generated depending on x_0 as follows: For each class 1, . . . , L, we sample x_l from N(x_0, η_i m^{−1/2} I_m) for η_i := 2/i (so that each class mean becomes increasingly close to the cone mean) and then modify x_l ← μ_i x_l/‖x_l‖_2, 1 ≤ l ≤ L. Lastly, to generate the training samples in class 1 ≤ l ≤ L, we sample x_j^(l) from N(x_l, (η_i m^{−1/2}/L) I_m) and then modify x_j^(l) ← x_j^(l)/‖x_j^(l)‖_2, 1 ≤ j ≤ N_1.
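For concreteness, here is a sketch of one stage of this generation procedure (ours; function and variable names are hypothetical, and we read η_i m^{−1/2} and η_i m^{−1/2}/L as the variances of the isotropic Gaussians above):

```python
import numpy as np

def generate_stage(m, N1, L, stage, rng):
    """One stage of the random database model: returns X_tr of shape (m, N1*L)."""
    mu_i = (stage - 1) / 10.0            # cone-mean length: 0 (Stage 1) to 1 (Stage 11)
    eta_i = 2.0 / stage                  # class spread, shrinking as the stage grows
    x0 = rng.standard_normal(m)
    x0 = mu_i * x0 / np.linalg.norm(x0)  # cone mean (the zero vector at Stage 1)
    blocks = []
    for _ in range(L):
        xl = x0 + np.sqrt(eta_i / np.sqrt(m)) * rng.standard_normal(m)  # class mean
        xl = mu_i * xl / np.linalg.norm(xl)
        S = xl[:, None] + np.sqrt(eta_i / (np.sqrt(m) * L)) * rng.standard_normal((m, N1))
        blocks.append(S / np.linalg.norm(S, axis=0))                    # unit columns
    return np.hstack(blocks)

rng = np.random.default_rng(0)
X_tr = generate_stage(m=50, N1=5, L=20, stage=5, rng=rng)   # DB-1 at Stage 5
a_true = np.concatenate([rng.uniform(0, 1, 5), np.zeros(95)])
y0 = X_tr @ a_true                       # clean test sample in the span of class 1
```

At Stage 1, mu_i = 0, so the cone and class means vanish and the normalized samples are uniform on the sphere, matching the description above.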
Figure 1 shows an example of Stage i ∈ {1, 3, . . . , 11} with m = 3, N_1 = 5, and L = 4.

[Figure 1: An example of the generated training data from the random database model across odd-numbered stages (as mutual coherence increases) with m = 3, N_1 = 5, and L = 4. The colors denote the classes. Plots have been manually rotated to aid in visualization. (a) At Stage 1, data is uniformly spread out on the sphere; (b)-(f) At increasingly higher stages, the dataset as a whole becomes more bouquet-shaped, as does the data in each class.]

We perform experiments using four different specifications for the triples (N_1, m, L), as shown in Table 2. By design, we have that ‖α_0‖_0 = N_1 in our experiments (though we will also briefly look at the case that ‖α_0‖_0 < N_1). Note that: (i) the inequality ‖α_0‖_0 < (3/10)m is satisfied for each of the specifications in Table 2; and (ii) these numbers are similar to what we might expect to see in classification of a face database (after some method of feature extraction is applied, as is generally required by SRC for face classification).

ID    (N_1, m, L)    ‖α_0‖_0/m    ‖α_0‖_0/N_tr    Redundancy (N_1 L/m)    Comments
DB-1  (5, 50, 20)    1/10         1/20            2:1                     Baseline redundancy; N_1 small with respect to m, N_tr
DB-2  (10, 50, 10)   1/5          1/10            2:1                     Baseline redundancy; N_1 less small with respect to m, N_tr
DB-3  (10, 50, 50)   1/5          1/50            10:1                    High redundancy; large L
DB-4  (5, 200, 50)   1/40         1/50            5:4                     Low redundancy; large L

Table 2: Specification of parameters in the random database model.

We consider the following quantities for evaluating the success of ℓ1/ℓ0-recovery:

• The average normalized ℓ2-error

    err_ℓ2 := ‖α_0 − α_1‖_2 / ‖α_0‖_2                                                    (18)

between the ℓ1-minimized solution α_1 and α_0,

• The average number of nonzeros of α_1 occurring at training samples not in class 1 (we call these "off-support" nonzeros, because they are nonzeros not in the support of α_0), divided by the total number of nonzeros. That is, let α_1^{off-supp1} be the result of setting all entries in α_1 that are in class 1 to zero. Then this error is defined as

    err_supp := ‖α_1^{off-supp1}‖_0 / ‖α_1‖_0,

• Since err_supp does not provide information regarding the size of the off-support nonzero coefficients, we also consider

    err_supp(ℓ1) := ‖α_1^{off-supp1}‖_1 / ‖α_1‖_1   and   err_supp(ℓ2) := ‖α_1^{off-supp1}‖_2 / ‖α_1‖_2,

• The average mutual coherence of the training set, μ(X_tr) =: μ_1.

It is informative to consider the effect that the support error quantities would (hypothetically) have on the classification performance of SRC. Recall that, in the case that the clean test sample y_0 is known, SRC computes the class residuals err_l(y_0) := ‖y_0 − X_tr δ_l(α_1)‖_2, 1 ≤ l ≤ L, and assigns y_0 to the class with the smallest residual. Thus if err_supp, err_supp(ℓ1), and err_supp(ℓ2) are small, we expect that SRC will have an easier time classifying the test sample correctly (recall that these quantities measure the residual from the correct class l = 1). For example, if all the support error quantities are 0, then δ_1(α_1) = α_1 and it follows that the class 1 residual err_1(y_0) = 0 and err_l(y_0) = ‖y_0‖_2 for 2 ≤ l ≤ L. This corresponds to the ideal classification scenario.

We compute the average quantities err_ℓ2, err_supp, err_supp(ℓ1), err_supp(ℓ2), and μ_1 over 1000 trials at each stage, using the ℓ1-minimization algorithm HOMOTOPY [38, 39] with a very small error/sparsity trade-off parameter λ (to force near-exactness in the approximation). The results are shown in Figure 2.

Considering that err_supp records any off-support nonzeros, regardless of how small, the results are quite good. In many cases, ℓ1-minimization was able to recover the exact solution α_0 on highly-correlated data, and when errors in the support occurred, they were generally small.

We see two different things happening at either end of the Stage axis. At Stage 1, we see support errors in every database except DB-4 (the low-redundancy case). Further, there are nonzero values of err_ℓ2, err_supp(ℓ1), and err_supp(ℓ2) for DB-3 (the high-redundancy case) at this stage. At high stages, we see similar small support errors as the data became very correlated; these support errors were numerous (accounting for around half the nonzero coefficients) for both DB-3 and DB-4.

We start by explaining the results at Stage 1. Given the plots in Figure 2, our instinct may be to suspect that something went wrong here, especially considering the exact recovery on all databases at Stage 2. For the cases in which we had a ratio of 2-to-1 redundancy, does this contradict the experimental result [9] that having N_1 = ‖α_0‖_0 < (3/10)m nonzeros guarantees ℓ1/ℓ0-equivalence with high probability?
20) (b) DB-2: ( N , m, L ) = (10 , , N , m, L ) = (10 , ,
50) (d) DB-4: ( N , m, L ) = (5 , , Figure 2: Recovery results on random database model (average of 1000 trials) in the case of no noise. a) DB-1: r = 1 / r = 1 / r = 1 / r = 1(c) DB-3: r = 1 / r = 1 / Figure 3: Asymptotic recovery at Stage 1 of the random database model (average of 1000 trials). Note the different scales. for the fact that this result holds asymptotically. To test this, we repeated the experiments for increasingvalues of m , scaling N and L accordingly so that the redundancy remained constant. More precisely, wedefined r := m/N tr and r := N /L and then set ˜ L := [ (cid:112) ˜ m/ ( r r )] and ˜ N := r ˜ L . Here, [ · ] denotes thenearest integer function and ˜ m , ˜ L , and ˜ N denote the increased values of m , L , and N , respectively. Aswe illustrate in Figure 3, the value of err supp decreased to 0 as ˜ m increased. As is to be expected, both theamount of redundancy and the relationship N /m affected the speed of convergence. We exclude results forDB-4, as we already see perfect recovery at Stage 1 in Figure 2d.In comparing the Stage 1 results to those from data with bouquet/cone structure (i.e., Stages 2-11),it is initially surprising that small to moderate levels of correlation in the data samples appear to improve sparse recovery. As mentioned, we see near-perfect recovery of α at Stage 2 for every tried ( N , m, L ) triple;this is in stark contrast to the recovery accuracy at Stage 1, especially for DB-3 (Figure 2c). This sharp22hange coincides with a significant increase in the within-class correlation between Stages 1 and 2 in ourmodel, whereas the correlation between classes essentially remains unchanged. Though the exact specificswill depend on the (cid:96) -minimization algorithm used, we strongly suspect that the relative clustering of thesamples in the support of α at Stage 2 (as compared to their random distribution at Stage 1) make it mucheasier for the algorithm to recover the desired solution.Conversely, at high stages, it appears that the loss of class structure negatively affected the recovery of α .As the standard deviation of the class mean distributions grew small, the class cones began to significantlyoverlap, and (cid:96) -minimization could not exactly recover the support of α . Notice that we see an especiallylarge number of support errors err supp for databases with large values of L , namely, DB-3 (Figure 2c) andDB-4 (Figure 2d). For DB-3, the nonzero values of err supp at Stages 5 and 6 (compared to err supp ≈ Effect on classification:
Effect on classification: We earlier discussed the relationship between the support error quantities err_supp, err_supp(ℓ1), and err_supp(ℓ2) and the classification performance of SRC, in particular, their effect on the class residuals err_l(y_0) := ‖y_0 − X_tr δ_l(α_1)‖_2. Here, we consider these residuals explicitly. For each of the four databases, we computed the average residual err_l(y_0) (over 1000 trials) for each class 1 ≤ l ≤ L at each of the 11 values of coherence.

Not surprisingly, given the small support error quantities determined in the previous section, there is a stark difference between the residual of class 1 and those of the other classes at all stages. More precisely, the ideal classification scenario occurs in all cases, with err_1(y_0) ≈ 0 and err_l(y_0) ≈ ‖y_0‖_2 for all 2 ≤ l ≤ L. The approximations are of the order 10^{−…} (or better), except for the highly-redundant database DB-3 at Stage 1. In this case, the average quantities were err_1(y_0) = 0.230 and ‖y_0‖_2 − mean_{2≤l≤L} err_l(y_0) = 0.…. These findings are consistent with the results in Figure 2c. Even though these quantities at Stage 1 are nonzero, it is important to note that good classification would still be achieved, as min_{2≤l≤L} err_l(y_0) = 1.… remained far larger than err_1(y_0) = 0.230.
Varying the sparsity level: We next consider what happens when the sparsity level ‖α_0‖_0 is strictly less than the number of class 1 training samples N_1. This is important to investigate: can ℓ1-minimization identify the correct training samples from among the rest of the (highly-correlated) training data in that class? For DB-2 and DB-3, we generated α_0 (and subsequently y_0) using the first five samples in class 1. Figures 4a and 4c show the recovery results, and Figures 4b and 4d repeat the plots in Figures 2b and 2c (in which ‖α_0‖_0 = N_1) for convenient comparison.

At Stage 1, we see that the support of α_1 was more concentrated on the correct training samples when ‖α_0‖_0 was smaller, evidenced by smaller values of err_supp. This is to be expected, as the ground truth solution became sparser.
Eliminating errors by thresholding: Before we turn to the noisy setting, we demonstrate that the small support errors in α_1 depicted in Figure 2 can be completely remedied using thresholding in all but the high-redundancy case DB-3. After determining α_1 as before, we set its small coefficients (those with absolute value less than some threshold τ) to zero, obtaining the vector α_τ. We then re-solved the equation X_tr α = y with the constraint that the solution, denoted α̂, had the same support as the thresholded α_τ. For simplicity, we did this by setting the columns of X_tr corresponding to zero-coordinates in α_τ to 0, thus obtaining the matrix X̂_tr. We then used MATLAB's "\" operator to define α̂ := X̂_tr \ y. In our case, since X̂_tr was not square, the desired least squares solution was found by (MATLAB's implementation of) QR-factorization.

For all but the highly-redundant database DB-3, α̂ was equal to the sparsest solution α_0 (up to nearly machine precision) for the smallest thresholding value, τ = 10^{−k}. For τ ∈ {0.001, 0.01} on the other three databases (DB-1, DB-2, and DB-4), we saw small nonzero values of err_ℓ2, but these errors were indiscernible in plots on the same scale as those in Figure 2, and so we do not show them here. For τ = 0.1, there was a consistent, small but nontrivial ℓ2-error across all stages, as small coefficients corresponding to class 1 training samples were incorrectly set to 0. For all four values of τ, there were no support errors.

For the high-redundancy case DB-3, we continued to see errors at Stage 1, similar to those in Figure 2c. For the thresholding values τ ∈ {0.001, 0.01, 0.1} (i.e., for τ large enough), there were no support errors at other stages. However, similarly to the other databases, we saw nontrivial ℓ2-error when τ = 0.1. We plot the results for DB-3 in Figure 5, stressing that the results for the other databases contained errors too small to produce nontrivial plots.
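The thresholding repair amounts to only a few operations; a minimal MATLAB sketch (with tau, X_tr, y, and the ℓ1 solution alpha1 assumed given) follows. MATLAB warns about the rank deficiency of the zeroed matrix but still returns the desired least squares solution, with zero coefficients at the discarded columns.

    alpha_tau = alpha1;
    alpha_tau(abs(alpha_tau) < tau) = 0;   % threshold the small coefficients
    X_hat = X_tr;
    X_hat(:, alpha_tau == 0) = 0;          % remove columns outside the support
    alpha_hat = X_hat \ y;                 % least squares re-fit via QR-factorization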
Figure 5: The results of thresholding (average of 1000 trials) on the highly-redundant database DB-3 (N_1 = 10): (a) the smallest threshold, τ = 10^{−k}; (b) τ = 0.001; (c) τ = 0.01; (d) τ = 0.1.

In these experiments, we examine ℓ1/ℓ0-recovery when noise is added to the test sample y. Recall the theorems of Donoho et al. regarding ℓ1/ℓ0-equivalence in the noisy setting, stated in Theorems 2.2 and 2.3.
Accuracy of recovery: Unfortunately, Eq. (7) and Eq. (8) in the referenced theorems do not make sense for large mutual coherence µ(X). However, we can still look for a correlation between ‖α_0 − α_{1,ε}‖_2 (where α_{1,ε} is the solution to Eq. (19) below) and the values of the noise tolerance ζ (see the statement of Theorem 2.2), the error tolerance ε, the sparsity level N_1 = ‖α_0‖_0 := k, and the mutual coherence µ(X_tr), with ε =: Cζ for some constant C > 0.

We modify the experiments in Section 6.2.2 as follows: First, we specify the noise tolerance ζ and the constant C. After generating the training data and the (noise-free) test sample y, we set y := y + z, where the entries of z are drawn from N(0, ζ/(2√m)). Then ‖z‖_2 ≤ ζ with probability at least 95%. From here, we set ε := Cζ and find

α_{1,ε} := argmin_α ‖α‖_1 subject to ‖y − X_tr α‖_2 ≤ ε. (19)
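A sketch of one trial, with the parameter values specified below (this assumes the SPGL1 package [40, 41] is on the MATLAB path; spg_bpdn is its basis pursuit denoising interface, and m, X_tr, and the noise-free y are as before):

    zeta = 0.01; C = 5;
    epsilon = C * zeta;                       % error tolerance in Eq. (19)
    z = (zeta / (2*sqrt(m))) * randn(m, 1);   % ||z||_2 <= zeta with prob. >= 95%
    y_noisy = y + z;
    alpha_1eps = spg_bpdn(X_tr, y_noisy, epsilon);   % solves Eq. (19)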
We set ζ = 0.01, and we used two values of C: C = 5 and C = 10, producing the (ζ, ε)-pairs (0.01, 0.05) and (0.01, 0.1). To solve Eq. (19), i.e., to ensure that ‖X_tr α_{1,ε} − y‖_2 was less than ε, we used the basis pursuit denoising version of the ℓ1-minimization algorithm SPGL1 [40, 41].

In Figure 6, we plot the normalized ℓ2-error, the fraction of off-support nonzeros, the normalized ℓ1 and ℓ2-norms of the off-class support vectors, and the mutual coherence µ(X_tr) =: µ. Note that we modify the corresponding definitions given in Section 6 (for err_ℓ2, err_supp, err_supp(ℓ1), and err_supp(ℓ2)) to use α_{1,ε} instead of α_1 and do not change the notation. We report the averages over 1000 trials at each stage.

As we can see, there is clearly a relationship between err_ℓ2 and the amount of correlation in the data. As the data became increasingly bouquet-shaped, both within each class and as a dataset as a whole, the normalized ℓ2-distance between α_{1,ε} and α_0 increased. The rate of increase of this error appears to be related to the redundancy of the database. It is evident that mutual coherence was not a good indicator of err_ℓ2, as the plots show that err_ℓ2 could be relatively low even after µ(X_tr) had reached its maximum value. Perhaps more importantly, the supports of the solution vectors α_{1,ε} and α_0 were nearly identical at stages greater than 1. This means that the vast majority of nonzeros in α_{1,ε} occurred at positions corresponding to class 1 training samples. To fix the small support errors, we could use the thresholding technique discussed in the previous section, choosing τ by trial-and-error. This method could also be used to ameliorate the numerous support errors for the databases DB-2 and DB-3 at Stage 1; in this case, we found that the same choice of τ was effective for both C = 5 and C = 10.

Comparing the results for C = 5 and C = 10, the plots are for the most part quite similar. We see that setting C = 5 produced slightly better recovery than C = 10 at Stage 1, but in general, the normalized ℓ2-error err_ℓ2 was the same for the two settings at higher stages. This is informative, as it tells us that ℓ1/ℓ0-recovery on this kind of highly-correlated data is potentially quite robust to the setting of C in the approximation error tolerance ε = Cζ. Once again, we attribute this to the class structure of the data making it easier for the ℓ1-minimization algorithm to find the class solution α_0.
Effect on classification: As in the noise-free scenario, we compute the class residuals err_l(y) := ‖y − X_tr δ_l(α_{1,ε})‖_2 for each of the four databases at each of the 11 values of coherence. Specifically, we are interested in how close the class 1 residual is to 0 (signifying perfect reconstruction of y using class 1) and how close the next smallest class residual min_{2≤l≤L} err_l(y) is to this value. If it is close, then we should have less confidence in the SRC classification assignment than if these quantities were far apart, i.e., SRC distinguishes the correct class less clearly.

Figure 6: Recovery results on the random database model in the case of noise. (a) DB-1, C = 5; (b) DB-1, C = 10; (c) DB-2, C = 5; (d) DB-2, C = 10; (e) DB-3, C = 5; (f) DB-3, C = 10; (g) DB-4, C = 5; (h) DB-4, C = 10.
The average relevant class residuals (over 1000 trials) are displayed in Table 3. Since the results for C = 5 and C = 10 were very similar, we only include the results for C = 5.

Table 3: Average SRC class residuals err_1(y) := ‖y − X_tr δ_1(α_{1,ε})‖_2 and min_{2≤l≤L} err_l(y) (over 1000 trials) on the random database model in the case of noise.

                DB-1                DB-2                DB-3                DB-4
Stage   err_1   min err_l   err_1   min err_l   err_1   min err_l   err_1   min err_l
  1     0.05    1.27        0.06    1.81        0.28    1.80        0.05    1.27
  2     0.05    2.30        0.05    3.77        0.05    4.92        0.05    2.47
  3     0.05    2.47        0.05    4.76        0.04    4.96        0.04    2.46
  4     0.05    2.52        0.05    4.92        0.04    5.00        0.05    2.53
  5     0.05    2.51        0.05    5.04        0.05    5.02        0.05    2.49
  6     0.05    2.50        0.05    4.99        0.05    5.03        0.05    2.51
  7     0.05    2.51        0.05    5.01        0.05    4.99        0.05    2.50
  8     0.05    2.53        0.05    4.99        0.05    4.99        0.05    2.48
  9     0.05    2.51        0.05    5.00        0.05    5.05        0.05    2.50
 10     0.05    2.56        0.05    4.96        0.05    4.97        0.05    2.52
 11     0.05    2.50        0.05    5.01        0.05    5.00        0.05    2.50
Noting that ε := Cζ = 0.05, we see that the ideal classification scenario occurred in nearly all cases. That is, since err_1(y) ≈ ε almost always, class 1 training samples made up essentially the entire approximation of the test sample. The exception, again, was DB-3 at Stage 1, for which err_1(y) and min_{2≤l≤L} err_l(y) were the least separated (i.e., relatively close in value). However, correct classification would still be achieved. The reader might notice that the quantities min_{2≤l≤L} err_l(y) at Stage 1 are lower than at higher stages; this is because

min_{2≤l≤L} err_l(y) = min_{2≤l≤L} ‖y − X_tr δ_l(α_{1,ε})‖_2 ≈ ‖y‖_2

is smaller in this case, due to the class 1 training samples being uniformly distributed on S^{m−1}: when those samples are nearly orthogonal, the norm of their non-negative linear combination y stays comparatively small, whereas at higher stages the samples are nearly parallel and their contributions add coherently.

In this section, we designed a model, inspired by the work of Wright and Ma [34], for facial recognition and other similar classification databases. To model the mechanisms of SRC [5], we randomly generated a test sample as a non-negative linear combination of a single class's training samples. We computed the corresponding (sparse) coefficient vector and then ran experiments to test whether or not ℓ1-minimization, as it is used in the SRC setting, could recover this vector under increasing values of correlation, both within-class and in the database as a whole.

The results demonstrate that the within-class correlation in this model consistently improves ℓ1/ℓ0-recovery when compared to randomly-generated uniform data on the sphere. This is an important empirical result, as this latter type of data is one of the "golden children" of ℓ1/ℓ0-equivalence; i.e., these types of dictionaries produce, in some sense, ideal recovery (see, e.g., the work of Donoho [9]). However, those results are strongly asymptotic, and our experiments dealt only with small databases. More work is needed to determine if our findings hold up on larger datasets.

It is not too surprising, given the mutual coherence recovery condition studied in the last section, that very large correlation in the database as a whole can degrade recovery. When the global correlation in our model was very high, so that the classes, or sub-bouquets, began to overlap, we saw that ℓ1-minimization did not find the correct support of the sparse solution. However, we showed that the support could be completely fixed by a simple thresholding technique.

We also demonstrated that ℓ1-minimization achieved a good approximation of the sparsest solution in the case of noise in our model. Though the accuracy of the approximation generally decreased as the data became more correlated, this deterioration was slow compared to the increase in mutual coherence of the database. Further, the amount of ℓ2-error appeared to be less dependent on the relationship between the noise tolerance ζ and the error tolerance ε than it was on the amount of redundancy in the database.

Assuming that test samples truly are linear combinations of their ground truth class training samples, as is done in SRC, these experiments suggest that ℓ1-minimization will recover this class representation, leading to good classification in SRC and similar classification algorithms.
This of course assumes that our model is appropriate for the given dataset, and that its values of N_1, m, and L are comparable to those used in our experiments, so that the class representation is sparse.

Our results are purely empirical; however, they strongly suggest that theoretical recovery results are possible. We conjecture that exact recovery can be provably obtained whenever the classes are sufficiently non-overlapping, and that a similar result can be obtained in the case of noise. The amount of redundancy in the database and the number of classes will play a crucial role in this analysis.

Finally, though we explicitly modeled the cone structure of facial images, our results are likely applicable to other areas of classification as well. In particular, as long as it is assumed that the training samples within each class are highly correlated, we could amend our model so that the sign of each training sample was chosen randomly and so that the test sample was generated in the linear (not necessarily positive) span of its same-class training samples. However, since ℓ1-minimization is invariant to multiplication of the dictionary elements by ±1, we suspect that our results would be the same.
7. Proving Equivalence via Nonlinear Embedding
As we have seen, class structure often results in the training set having high mutual coherence, making it impossible to apply the mutual coherence recovery guarantees given in Theorems 2.1 and 2.2 in the context of SRC. We consider a resolution to this conflict through the use of more space. That is, if we had many "extra" dimensions, the data in each class could conceivably be spread out and we would still have enough "room" to keep the classes well-separated from each other, allowing for both low mutual coherence and class-structured data.

Let us illustrate this in low dimension. Consider the toy example in which we have L = 2 classes, each containing 2 samples in R^m for m = 2. First, let the goal be to arrange the samples in a way that minimizes their mutual coherence while at the same time provides some indication of class. Assuming that the samples must be normalized (as in SRC), this class-structure criterion can reasonably be interpreted as the requirement that

|⟨x_i^(1), x_j^(1)⟩| > |⟨x_i^(1), x_j^(2)⟩| and |⟨x_i^(2), x_j^(2)⟩| > |⟨x_i^(2), x_j^(1)⟩|, for i, j ∈ {1, 2}.

In other words, the samples in the same class must be more correlated than samples in different classes.

One solution is given by the class matrices

X^(1) = [x_1^(1), x_2^(1)] = [ 1   cos(π/4 − ε) ]
                             [ 0   sin(π/4 − ε) ],

X^(2) = [x_1^(2), x_2^(2)] = [ 0   cos(3π/4 − ε) ]
                             [ 1   sin(3π/4 − ε) ],

where ε > 0 is small. The absolute inner product of samples in the same class is cos(π/4 − ε), and that of samples in different classes is at most cos(π/4 + ε). Clearly, the former quantity is the mutual coherence of the dataset. This arrangement is illustrated in Figure 7a with ε = 0.2.

Now embed the samples in R^3 via the class matrices

X^(1) = [ 1   cos(θ_1) sin(φ_1) ]        X^(2) = [ 0   cos(θ_2) sin(φ_2) ]
        [ 0   sin(θ_1) sin(φ_1) ]                [ 1   sin(θ_2) sin(φ_2) ]
        [ 0   cos(φ_1)          ],               [ 0   cos(φ_2)          ],

for θ_1 = π/4 − ε, θ_2 = π/4 + ε, φ_1 = 3π/4, and φ_2 = π/4. The mutual coherence of the dataset is now cos(π/4 − ε) sin(π/4) = sin(π/4 + ε) cos(π/4). This arrangement is illustrated in Figure 7b with ε = 0.2. For ε = 0.2, adding an additional dimension thus allows us to decrease the mutual coherence of the dataset from cos(π/4 − ε) ≈ 0.83 to cos(π/4 − ε) sin(π/4) ≈ 0.59.
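These coherence values are easy to check numerically; the following MATLAB snippet (our illustration, not part of the original construction) reproduces them:

    ep = 0.2;
    X2 = [1  cos(pi/4-ep)  0  cos(3*pi/4-ep); ...
          0  sin(pi/4-ep)  1  sin(3*pi/4-ep)];           % m = 2 arrangement
    t1 = pi/4-ep; t2 = pi/4+ep; p1 = 3*pi/4; p2 = pi/4;
    X3 = [1  cos(t1)*sin(p1)  0  cos(t2)*sin(p2); ...
          0  sin(t1)*sin(p1)  1  sin(t2)*sin(p2); ...
          0  cos(p1)          0  cos(p2)];               % m = 3 arrangement
    mu = @(X) max(max(abs(X'*X) - eye(size(X, 2))));     % mutual coherence (unit columns)
    [mu(X2), mu(X3)]                                     % approx. 0.83 and 0.59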
Figure 7: Illustration of decreasing the mutual coherence of a dataset by embedding it into higher dimension: (a) samples in the original space R^2 (µ ≈ 0.83); (b) samples in the transform space R^3 (µ ≈ 0.59). Colors denote classes.

As discussed above, we consider forcing the mutual coherence criterion of ℓ1/ℓ0-equivalence to hold via data transformation. Is it possible to learn a class-preserving transform from the training data and then classify test samples in a space in which ℓ1-minimization provably produces the sparsest solution? Such a transform would allow us to investigate the extent (if any) to which obtaining the sparsest solution affects SRC's classification accuracy.

For a transform φ, set Φ(X_tr) := [φ(x_1), . . . , φ(x_{N_tr})] for notational ease. Formally, we desire a transform φ* : R^m → R^m̃ with m < m̃ that satisfies

φ* = argmax_{φ ∈ C} f_cs(Φ(X_tr)) subject to µ(Φ(X_tr)) ≤ µ̃, (20)

where C is some compact set (so that f_cs obtains a maximum). Here, f_cs evaluates the amount of class structure in the transformed training set ("cs" stands for "class structure"). For example, f_cs might denote the inverse of the sum of within-class distances or the inverse of the Frobenius norm of the within-class scatter matrix used in linear discriminant analysis [42, 43]. Clearly, µ̃ is an upper bound on the mutual coherence of the transformed training set. Ideally, we want to choose µ̃ small enough so that the ℓ1/ℓ0-equivalence condition in Theorem 2.1 or Theorem 2.2 can be applied.

We note that the desired transform φ* must be a nonlinear transform; otherwise, the dimension of the subspace containing the embedded samples will be no greater than that of the original space (m). Thus we would have failed to utilize the extra space (needed to achieve our objective) awarded by the increased ambient dimension m̃.

Though this setup seems promising, we have a problem when we consider how the transform φ* should treat (new) test samples. In order for us to classify the test sample in the transform space, φ* must treat y similarly to a training sample in its own class. However, this leads to the following conflict:

Proposition 7.1.
Let φ : R^m → R^m̃ be a data transform and y a test sample such that µ([Φ(X_tr), φ(y)]) ≈ µ(Φ(X_tr)) ≤ µ̃ for some µ̃, i.e., the transform φ treats test samples in the same way as their same-class training samples. For any vector α ∈ R^{N_tr}, if α satisfies ‖α‖_0 < (1/2)(1 + 1/µ̃), then Φ(X_tr) α ≠ φ(y) with high probability.

Proof. This is a direct consequence of Corollary 5.3.

This demonstrates the extent to which the assumptions in SRC conflict with the mutual coherence recovery guarantees. We cannot construct a transform which can be applied to the entire dataset and allows for both sufficiently-low mutual coherence and adequate grouping of the classes, so that (transformed) test samples can be expressed as linear combinations of their same-class training samples.

However, we can still use the nonlinear transformation approach to study the relationship between the classification accuracy of SRC and the sparsity of its solution vector. We do this by artificially generating transformed test samples φ(y) as linear combinations of the columns of the transformed training data, in particular, with nonzero coefficients occurring at training samples in the ground truth class of y. Thus we can ensure that Φ(X_tr) α = φ(y) always has a solution. However, this will mean that we never actually compute or handle the test sample y in the original space and only assume that it exists implicitly.

The reader may object that we cannot just make up test samples in this manner, and in general, this is absolutely true. Nevertheless, we stress that our goal in this experiment is not to classify an arbitrary database but to determine the effect of sparsity in SRC on classification accuracy, and so the implied existence of y is acceptable in this context.

In the next two subsections, we reveal our approach to determining the desired transform φ* and further discuss the consequences of Proposition 7.1 (and our approach to handling them) in this particular context.

Rather than constructing an explicit transform, we consider the reduction of mutual coherence via the so-called kernel trick.
We will use the Gaussian kernel as a method of controlling the mutual coherence of the transformed training data.

To review, the kernel trick allows us to perform operations in a space of dimension m̃ > m (possibly infinite-dimensional) without having to actually compute the transformed samples. The "trick" is to work only with the inner products between transformed samples, which are given to us by some kernel function κ : R^m × R^m → R. More formally, denote the transform by φ_κ. We define the inner product in the kernel space as

⟨φ_κ(x_i), φ_κ(x_j)⟩ := κ(x_i, x_j), for 1 ≤ i, j ≤ N_tr.

The kernel function κ should satisfy Mercer's condition so that κ defines a proper inner product [44]. (The kernel κ satisfies Mercer's condition if ∫∫ κ(x, y) g(x) g(y) dx dy ≥ 0 for all square-integrable functions g.)

Kernel methods can be particularly effective when used to "non-linearize" linear classifiers. In kernel support vector machines, for example, classes that are not linearly-separable in the original space may be separated linearly in kernel space (see the work of Boser et al. [45]). Though SRC is not linear, it does assume a linear relationship between the test sample and the training samples in its ground truth class. When such a relationship does not hold in the original space, it may hold in kernel space given that an appropriate kernel is selected [46].

Consider the Gaussian kernel, which is given by

κ(x_i, x_j) := e^{−‖x_i − x_j‖_2^2 / (2σ^2)}.

Essentially, the Gaussian kernel adds inverse exponential scaling to the Euclidean distance function. Points close together obtain values of κ that are close to 1, whereas points that are far away from each other have kernel values approaching 0. The window or width parameter σ controls the drop-off (or steepness) of this trade-off.

The Gaussian kernel is a natural choice for our transform, since the mutual coherence of the (transformed) training set will be given by

µ(Φ_κ(X_tr)) = max_{1≤i≠j≤N_tr} |⟨φ_κ(x_i), φ_κ(x_j)⟩| = max_{1≤i≠j≤N_tr} |κ(x_i, x_j)| = e^{−min_{1≤i≠j≤N_tr} ‖x_i − x_j‖_2^2 / (2σ^2)}.

Since the pair of vectors x̂_i and x̂_j satisfying ‖x̂_i − x̂_j‖_2 = min_{i≠j} ‖x_i − x_j‖_2 is fixed for a given training set, the mutual coherence µ(Φ_κ(X_tr)) depends completely on σ. Thus we can write µ(Φ_κ(X_tr)) =: µ = µ(σ). To reiterate, we can completely control the mutual coherence of the data in the kernel space by adjusting σ.

Our plan is to choose σ so that ℓ1/ℓ0-equivalence is achieved in kernel space. We will do this as follows: In order to ensure ℓ1/ℓ0-equivalence, the Gaussian width parameter σ must be chosen so that the mutual coherence is small enough that Theorem 2.1 holds. Let us set

k_sup := (1/2)(1 + 1/µ).

Clearly, k_sup completely depends on µ, or equivalently, on σ. As σ approaches 0, k_sup = k_sup(σ) blows up. Suppose we choose σ to be the largest value such that, with high probability (whp), the sparsity level ‖α‖_0 is less than k_sup, where α = α* ∈ R^{N_tr} is the solution to the exact ℓ1-minimization problem in SRC given by Eq. (9) (replacing X_tr with Φ_κ(X_tr) and y with φ_κ(y)). This will ensure that α is the sparsest solution by Theorem 2.1. Using "mc" to denote "mutual coherence," we define

σ_mc := max{σ : ‖α‖_0 < k_sup}. (21)
It follows that φ_κ = φ*, our desired transform, when σ = σ_mc and the class-structure evaluation f_cs in Eq. (20) is defined as the minimum spread of vectors in transform space. (We assume that the database already has class structure in the original space—so that the mutual coherence is high—and by the continuity of the Gaussian kernel, φ_κ with σ = σ_mc separates the data in each class only as much as necessary to achieve the mutual coherence bound.)

To relate σ and classification accuracy, we consider the set of values of σ such that maximum classification accuracy is achieved for all values in this set (whp). (We can think of this as the range of σ values that produce the maximum amount—without a mutual coherence constraint—of class structure.) Defining the maximum value in this set by σ_acc, we want to investigate the relationship between σ_mc and σ_acc. We are also interested in the sparsity level ‖α‖_0 of the ℓ1-minimized coefficient vector at both σ = σ_mc and σ = σ_acc. Since some coefficients may be small, we also consider the size of the coefficients of training samples corresponding to the ground truth class of y. In analyzing these quantities and relationships, we aim to provide insight into the role of sparsity in classification.

We elaborate on the effect of Proposition 7.1 in the kernel setup: For a fixed training set and test sample (in the original space), we lose the ability to write Φ_κ(X_tr) α = φ_κ(y) for any coefficient vector α as σ → 0. In decreasing σ, we cause not only the training samples to become more orthogonal to each other, but also the test sample to become more orthogonal to each training sample, to the point that when σ = σ_mc, φ_κ(y) is likely not contained in the span of the columns of Φ_κ(X_tr). In other words, the resulting system is overdetermined, with no solution to Φ_κ(X_tr) α = φ_κ(y), when σ ≤ σ_mc. By Theorem 2.1, the minimal ℓ1-norm solution satisfying Φ_κ(X_tr) α = φ_κ(y) with ‖α‖_0 < (1/2)(1 + 1/µ) is necessarily the sparsest such solution. However, if there is no solution satisfying Φ_κ(X_tr) α = φ_κ(y), then there can be no sparsest solution.

Even when the equality in SRC is relaxed and the constrained ℓ1-minimization problem in Eq. (3) is used, relating the found solution α_{1,ε} and the true sparsest solution α_0 using Theorem 2.2 requires the existence of some α = α_0 satisfying the equality Φ_κ(X_tr) α = φ_κ(y). (Note that the formulation in Eq. (3) is equivalent to the regularized ℓ1-minimization problem in Eq. (10) in the formal SRC algorithm statement.) Since the bound in Eq. (6) in the noisy case is more restrictive than Eq. (5) in the noiseless case, to satisfy Theorem 2.2 we must have σ < σ_mc. By Proposition 7.1, no such α exists, and it follows that Theorem 2.2 cannot be applied in this setup, either.

As discussed earlier, we will side-step this conflict by artificially generating test samples in transform (kernel) space. This approach affects the accuracy of SRC as follows: Since φ_κ(y) ∈ span{φ_κ(x_1^(l)), . . . , φ_κ(x_{N_l}^(l))} by construction, we will not see the classification performance deteriorate at all as σ → 0. (We stress that this is certainly not the case in general: consider the increasing difficulty of identifying class structure in a dataset whose samples become more and more uncorrelated as σ → 0. Thus, generating φ_κ(y) in this manner adds an undesirable—but necessary—degree of artificiality into our experiment.) In other words, decreasing µ so that we can provably obtain ℓ1/ℓ0-equivalence in this setup can only help classification accuracy, as doing so isolates the linear relationship between the test sample and the training samples in its ground truth class (in kernel space).
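To make the dependence µ = µ(σ) concrete, the following MATLAB sketch computes the kernel-space coherence and the resulting sparsity bound k_sup over a grid of σ values. This is our illustration; it assumes unit-norm columns in X_tr and the Gaussian kernel convention above.

    G = X_tr' * X_tr;                        % Gram matrix of the training set
    d2 = max(2 - 2*G, 0);                    % squared pairwise distances (unit columns)
    d2(1:size(d2,1)+1:end) = Inf;            % exclude i = j
    d2min = min(d2(:));                      % the closest pair governs mu(sigma)
    sigmas = linspace(0.05, 1.35, 100);
    mu_sigma = exp(-d2min ./ (2*sigmas.^2)); % mu(Phi_kappa(X_tr)) at each sigma
    k_sup = 0.5 * (1 + 1 ./ mu_sigma);       % sparsity bound from Theorem 2.1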
Thus our investigation of the relationship between σ_acc and σ_mc can be more precisely stated in terms of how much larger σ_acc is than σ_mc, i.e., how quickly does classification accuracy deteriorate after we no longer have ℓ1/ℓ0-equivalence?

For a fixed training set (that will be described in detail in Section 7.5.3) and fixed σ, we generate N_l test samples in kernel space for each class 1 ≤ l ≤ L as linear combinations of the training samples in that class (in kernel space), with coefficients randomly drawn from the unif(0, 1) distribution.
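In kernel space, such a test sample is represented entirely by its vector of inner products with the training set. A minimal sketch, with K the N_tr × N_tr Gaussian kernel matrix of the training data and idx_l the column indices of class l (both assumed given):

    c = rand(numel(idx_l), 1);    % unif(0,1) coefficients on class l
    k_y = K(:, idx_l) * c;        % k_y(j) = <phi_kappa(y), phi_kappa(x_j)> for all j
    % k_y is all that Kernel SRC needs; y itself is never formed explicitly.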
Non-negative coefficients are used so that ⟨φ_κ(y), φ_κ(x_j)⟩ ≥ 0 for all 1 ≤ j ≤ N_tr, as is consistent with the Gaussian kernel. We then apply SRC in kernel space to classify the resulting test samples, using the Kernel SRC algorithm of Kang et al., in particular, their kernel coordinate descent (KCD) algorithm [47]. Note that in their paper, the authors apply this algorithm to the local binary patterns of the original samples instead of the original samples themselves, and since other types of kernels are more appropriate for these types of features, they do not use the Gaussian kernel, as we do.

In our experiments, we determine σ_mc and σ_acc by trial-and-error. Given the randomness inherent in the database construction (again, see Section 7.5.3 for a description of the database used), determining these values can only be done approximately. In all experiments, we thresholded the ℓ1-minimized coefficient vector α at a small tolerance of the form 10^{−k} to help avoid rounding errors.

We saw in Section 5 that we cannot apply Theorem 2.1 unless µ < 1/3. Since we are using the kernel approach, this means that we must have

µ(Φ_κ(X_tr)) = max_{1≤i≠j≤N_tr} κ(x_i, x_j) < 1/3.

In particular, since we are using the Gaussian kernel, it must be the case that

max_{1≤i≠j≤N_tr} e^{−‖x_i − x_j‖_2^2 / (2σ^2)} < 1/3 ⟺ σ < min_{1≤i≠j≤N_tr} ‖x_i − x_j‖_2 / √(2 ln 3). (22)

Since the training samples (in the original space) are normalized, ‖x_i − x_j‖_2 ≤ 2 for all i ≠ j, and hence Eq. (22) can only hold if σ < 2/√(2 ln 3) = √(2/ln 3) ≈ 1.35. Thus, in searching for σ_mc, we only need to consider values of σ less than 1.35.

We constructed a very simple toy database in the original space as follows: Samples in the l-th class were initially N_l copies of the canonical basis vector e_l ∈ R^L, where L was the number of classes. The feature dimension m was user-specified, and then m − L coordinates were added to each canonical basis vector and set to zero. Lastly, random noise from N(0, η) was added to all (training) samples in all coordinates.

We set N_l = 5, m = 50, and L = 20, so that each class would consist of a relatively small portion of the dictionary X_tr, as is ideal in SRC. Recall our method of generating test samples as linear combinations of their same-class training samples (in kernel space) from Section 7.5.1; we set the number of test samples in each class equal to the number of training samples, N_l = 5. We used three different values of the noise level, η ∈ {0.001, 0.01, 0.1}. As in the ℓ1-minimization algorithm HOMOTOPY, KCD requires an error/sparsity tradeoff parameter λ. To force near-exactness in the representations, we set λ to a very small value (of the form 10^{−k}).
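A sketch of this construction (our illustration; we interpret the noise level η as the standard deviation of the Gaussian perturbation, and we normalize the columns as in SRC):

    Nl = 5; m = 50; L = 20; eta = 0.01;
    X_tr = kron(eye(L), ones(1, Nl));       % first L coordinates: N_l copies of each e_l
    X_tr = [X_tr; zeros(m - L, Nl*L)];      % pad with m - L zero coordinates
    X_tr = X_tr + eta * randn(m, Nl*L);     % add N(0, eta) noise in all coordinates
    X_tr = X_tr ./ sqrt(sum(X_tr.^2, 1));   % normalize the columns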
Remark 7.1. The reader may question why we used a different synthetic database than the one in the last section: Would that not be better, so that we might obtain a fair comparison? We are making our best effort to stress that this line of thinking misconstrues the point of this experiment. Here, we only care about the classification results of Kernel SRC as they relate to the sparsity level ‖α‖_0 and the mutual coherence bound in Eq. (5). We are not at all interested in whether the kernel approach improves the classification accuracy of SRC (for a positive answer to this question, see, for example, Kang et al.'s paper [47]). Further, the previous synthetic database was designed for a specific purpose: so that ℓ1/ℓ0-equivalence—and not classification performance in SRC—could be studied at increasing levels of data correlation. So that the aim of this previous experiment did not bleed into our goals here, we used a completely new (and very simple) database.

7.5.4. Results

In Figure 8, we plot the synthetic database results for each value of η over various values of σ, annotating the values of σ_mc and σ_acc. We report the averages over 100 instantiations of the training and test sets ("trials"). In particular, we report the average sparsity level, Kernel SRC classification accuracy, and the (relative) ℓ1 and ℓ2-norms of the correct class support. These quantities are defined rigorously as

Sparsity := mean_{all trials} { median_{all test samples} ‖α‖_0 / N_tr }

for α thresholded at the small tolerance 10^{−k} mentioned above (we compute the median sparsity over all test samples so that the result is more robust to atypically very sparse or very dense coefficient vectors),

Accuracy := mean_{all trials} { mean_{all test samples} 1{class label(y) = ground truth class(y)} },

where 1{x = y} is the indicator function that returns 1 if x = y and 0 otherwise, and

supp(ℓ1) := mean_{all trials} { mean_{all test samples} ‖δ_GT(α)‖_1 / ‖α‖_1 },
supp(ℓ2) := mean_{all trials} { mean_{all test samples} ‖δ_GT(α)‖_2 / ‖α‖_2 },

where the nonzero entries of δ_GT(α) are exactly those from α that correspond to the ground truth class of the given test sample.

From Figure 8, we see that σ_acc was generally much larger than σ_mc, and that the Kernel SRC method could tolerate substantial ℓ1 and ℓ2-support error before classification deteriorated. Further, perfect classification was achieved even for maximally dense α. This shows that a strictly-sparse solution vector is not always necessary to the success of SRC.

As the level of noise η increased, we see in Figure 8 that σ_acc decreased towards σ_mc. However, lim_{η→∞} σ_acc ≠ σ_mc: once the class structure was lost due to noise in the original space, increasing the noise level further had no effect on the quantities displayed in Figure 8. In other words, Figure 8c is representative of the results for larger values of η.

We also observe that for η = 0.001, the sparsest solution was still recovered by ℓ1-minimization for values of σ slightly larger than σ_mc (note the position of the σ_mc arrow tip in Figure 8a).
In fact, the mutual coherence of the dataset with η = 0.001 reached a value well above the bound in Eq. (5) before ℓ1-minimization failed to retrieve the sparsest solution. This indicates that when the classes are well-separated (for small η and sufficiently small σ, separability in the original space carries over to kernel space in this experiment), ℓ1/ℓ0-equivalence can still be achieved even when the mutual coherence is much larger than that allowed by Eq. (5). This reinforces the findings from Section 6, namely, that ℓ1/ℓ0-equivalence holds on highly-correlated data as long as the vectors corresponding to the support of the sparsest solution are sufficiently separated from the other dictionary elements. On the other hand, for larger values of η, i.e., when the classes were less well-separated, the bound in Eq. (5) appears to be approximately tight.

It is notable that σ_acc is substantially larger than σ_mc for all η, and that the accuracy in Kernel SRC has a steep drop-off as soon as σ > σ_acc. The value σ_acc appears to be a threshold past which the linear relationship between φ_κ(y) and the training samples in its ground truth class cannot be identified by the classification mechanism in (Kernel) SRC. We want to know what triggers this threshold.

We first look for an "elbow" or sharp change in the correlation between φ_κ(y) and training samples in its ground truth class, and that between φ_κ(y) and samples in other classes. In particular, we computed

corr_GT := mean_{all trials} { median_{x_j^(l) : y ∈ class l} ⟨φ_κ(y), φ_κ(x_j^(l))⟩ }

and

corr_other := mean_{all trials} { median_{l : y ∉ class l} { median_{1≤j≤N_l} ⟨φ_κ(y), φ_κ(x_j^(l))⟩ } }.

Again, we compute the median quantities within each trial to make the correlation values more robust to sample outliers.

The results for η = 0.1 are displayed in Figure 9; the results for the other values of η are similar. As we can see, the accuracy threshold σ_acc occurred after the sharp increase in the correlation quantities. In fact, we see that SRC was able to retrieve the correct classification assignment when corr_GT was only moderately larger than corr_other. On the other hand, the sharp increase in the correlation quantities appears to correspond to the steep increase in sparsity level, which makes sense in the context of the mutual coherence recovery guarantee in Theorem 2.1.
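These diagnostics are computed directly from the kernel representation of each test sample; a sketch for a single test sample (using k_y from the earlier sketch, with l0 the ground truth class of y and labels the vector of training labels):

    corr_GT = median(k_y(labels == l0));     % same-class correlation
    other = setdiff(1:L, l0);
    corr_other = median(arrayfun(@(l) median(k_y(labels == l)), other));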
1, wecomputed the mean vector | α | over the N = 5 class l = 20 test samples, and then averaged the result over100 trials: mean all trials (cid:110) mean y ∈ class l =20 (cid:8) | α | (cid:9)(cid:111) . σ . The x -axis in the left-hand-side plots (Figures 10a, 10c, 10e, and 10g) corresponds to the individual coordinates of the averaged vector | α | ∈ R N tr . The coordinates corresponding to training samples in each class are simply summed to producethe right-hand-side plots (Figures 10b, 10d, 10f, and 10h), so that the contribution from each class in therepresentation of φ κ ( y ) can be viewed easily. We also include the corresponding Kernel SRC classificationaccuracies for reference.Given the dominance of coefficients corresponding to class l = 20 in Figures 10a-10d, it is not surprisingthat Kernel SRC obtains perfect accuracy in these cases. It is also quite clear from these figures that smallcoefficients in the wrong class do not negatively affect classification accuracy. Thus there is no reason torequire a solution sparser than that with σ = 3.For σ ∈ { , } , the closeness in the coefficient magnitudes between those corresponding to class l = 20and those corresponding to other classes illustrates the decreased accuracy in Kernel SRC; recall that theseplots contain averages. Additionally, we note that the distribution of the coefficients in class l = 20 becamefairly unbalanced among that class’s training samples for these large values of σ . This is because as mutualcoherence increased, the class l = 20 samples became more and more parallel to each other. Thus most of φ κ ( y ) could be represented using only the first training sample in that class.Figure 10 helps to explain the sharp drop-off in accuracy at σ acc . Though the quantities corr GT andcorr other are only slightly increasing at σ acc (and the general behavior of the coefficients varying smoothly),the threshold occurs right at the point where the coefficients of other classes become competitive with thosefrom the correct class (as we would expect). The sharp drop-off can be attributed to the nonlinearity ofthe min function in determining min ≤ l ≤ L {(cid:107) φ κ ( y ) − Φ κ ( X tr ) δ l ( α ) (cid:107) } in the classification stage of (Kernel)SRC. We summarize some important conclusions from this section: • Any procedure that spreads out the data in each class in a way that decreases mutual coherence yet aimsto maintain class structure will necessarily come into conflict with maintaining a linear relationshipbetween y and any subset of training samples. More precisely, it is generally impossible to write y asa linear combination of the training samples in class l while satisfying the bound (cid:107) α (cid:107) < (cid:16) µ ( X tr ) (cid:17) ≈ (cid:16) µ ([ X ( l ) , y ]) (cid:17) , i.e., when y is spread out in the same manner as the other samples in the database. Besides artificiallygenerating y as a linear combination of the training samples after they have been spread out, it is notclear to us how to overcome this conflict. 40 Though generating y as a linear combination of its ground truth class training samples in kernel spaceprevented us, in some sense, from isolating the relationship between σ mc and classification accuracy,we were still able to study the correspondence between σ mc and sparsity level (cid:107) α (cid:107) . 
In particular, we confirmed our previous findings that perfect recovery can be achieved on highly-correlated data as long as the classes are sufficiently well-separated (in this experiment, this meant small η).

• We saw that there was a sharp drop-off in classification accuracy as soon as σ > σ_acc, which was not directly correlated with a sharp change in sparsity, in the relationship between within-class and between-class correlation, or in the normalized ℓ1 and ℓ2-norms of δ_GT(α). Though ℓ1/ℓ0-equivalence (whether provable by Theorem 2.1 or not) was a way to ensure perfect classification accuracy in this experiment, it was not necessary. The classification mechanism in SRC can clearly tolerate even the maximal number of nonzero coefficients in the representation, as long as the magnitudes of the coefficients corresponding to the wrong classes are small with respect to those from the correct class. In this sense, relative—or approximate—sparsity is the key to SRC. It might be possible to make this idea precise in terms of a coefficient thresholding procedure similar to the one used in Section 6.

In future research, it would be interesting to consider the modification of the above experiment in which noise is added to the test sample φ(y) after it is generated as a linear combination of its ground truth class training samples in kernel space. Of course, this will not have the same effect as adding noise to the original (and implicitly-defined) test sample y, but it would allow us to investigate the relationship between classification accuracy in SRC and the mutual coherence bound in the case of noise as stated in Theorem 2.2.
8. Conclusion
In this paper, we investigated the applicability of ℓ1/ℓ0-equivalence guarantees to dictionaries containing training samples. We detailed the inherent conflict between tightly-clustered classes—desirable for good classification—and the sufficient incoherence required by recovery guarantees such as those based on mutual coherence. In particular, we proved that under the assumptions of SRC, i.e., that class manifolds are linear subspaces spanned by their respective training data, Donoho et al.'s mutual coherence guarantees can only hold in the case that we have exactly enough training samples to span each lower-dimensional subspace. Considering that the performance of SRC should generally improve as the training class size increases, it is likely counter-productive for classification purposes to restrict the training set in this way. Further, despite existing methods to estimate the class manifold dimension, it is impractical to assume that such approaches will always work perfectly.

Despite not being able to prove ℓ1/ℓ0-equivalence on most class-structured data, we saw that it can indeed be achieved in some specific cases. Inspired by the random model of Wright and Ma for generating face image-like databases, we designed an experiment to test the ability of ℓ1-minimization to recover the sparsest solution on highly-correlated data. The results were mostly positive. We observed that in all cases, ℓ1-minimization recovered a solution closely approximating the sparsest solution (defined by generating the test sample as a linear combination of training samples in its ground truth class). Further, within-class correlation actually improved recovery relative to uniformly-random data, provided that the between-class correlation was sufficiently low, i.e., that the classes were sufficiently separated. In many cases, ℓ1-minimization exactly recovered the sparsest solution. Additionally, in the case that noise was added to the test sample, the correct support was found in nearly every case in which correlation was introduced.

We also considered the role of sparsity in the context of SRC and similar classification algorithms. One obstacle in determining this relationship is obtaining access to the sparsest solution for comparison without the aid of ℓ1/ℓ0-equivalence guarantees. Towards resolving this problem, we designed a nonlinear transform, based on kernel methods using the Gaussian kernel, to decrease the within-class mutual coherence while still maintaining class structure, so that (hypothetically) provable equivalence and good classification could be simultaneously achieved. However, we found that the degree to which we had to decrease coherence in this setup meant that the test sample was no longer in the span of the training data, and so we were forced to limit our analysis to test samples artificially generated as linear combinations of their ground truth class training samples, as in Section 6. Though this to some extent limited the applicability of our experiment, the results clearly indicate that strict sparsity is not necessary for good classification in SRC. Instead, its success lies in its ability to correctly differentiate the coefficient magnitudes of training samples in different classes, i.e., to find approximately or relatively sparse solutions, in the case that the linear subspace assumption is observed and the classes themselves are not too correlated, i.e., not too close together.

There is certainly much work to be done to quantify these findings.
We mention two potential next steps: Eldar and Kuppinger's notion of block-coherence [48], with blocks corresponding to classes of the training database, might serve to make precise the meaning of between-class correlation; note that this was observed to play a role in both ℓ1/ℓ0-equivalence on highly-correlated data and SRC's classification performance. Additionally, the accuracy threshold detected in Section 7 might be better understood in the context of Wang et al.'s interpretation of SRC as a maximum margin-based classifier [49]. As an alternative to the thresholding route suggested in Section 7, their work could be very helpful in rigorously defining the concept of approximate sparsity as it relates to the classification performance of SRC.

Acknowledgments
C. Weaver's research on this project was conducted with government support under contract FA9550-11-C-0028, awarded by the DoD, Air Force Office of Scientific Research, National Defense Science and Engineering Graduate (NDSEG) Fellowship, 32 CFR 168a. She was also supported by National Science Foundation VIGRE DMS-0636297 and NSF DMS-1418779. N. Saito was partially supported by ONR grants N00014-12-1-0177 and N00014-16-1-2255, as well as NSF DMS-1418779.
References

[1] E. J. Candès, M. B. Wakin, An introduction to compressive sampling, IEEE Signal Processing Magazine 25 (2) (2008) 21-30. doi:10.1109/MSP.2007.914731.
[2] J. A. Tropp, A. C. Gilbert, Signal recovery from random measurements via orthogonal matching pursuit, IEEE Trans. Inform. Theory 53 (12) (2007) 4655-4666. doi:10.1109/TIT.2007.909108.
[3] E. J. Candès, T. Tao, Decoding by linear programming, IEEE Trans. Inform. Theory 51 (12) (2005) 4203-4215. doi:10.1109/TIT.2005.858979.
[4] D. L. Donoho, Compressed sensing, IEEE Trans. Inform. Theory 52 (4) (2006) 1289-1306. doi:10.1109/TIT.2006.871582.
[5] J. Wright, A. Y. Yang, A. Ganesh, S. S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Trans. Pattern Anal. Mach. Intell. 31 (2) (2009) 210-227. doi:10.1109/TPAMI.2008.79.
[6] L. Qiao, S. Chen, X. Tan, Sparsity preserving projections with applications to face recognition, Pattern Recogn. 43 (1) (2010) 331-341. doi:10.1016/j.patcog.2009.05.005.
[7] B. Cheng, J. Yang, S. Yan, Y. Fu, T. S. Huang, Learning with ℓ1-graph for image analysis, IEEE Trans. Image Process. 19 (4) (2010) 858-866. doi:10.1109/TIP.2009.2038764.
[8] E. J. Candès, J. Romberg, T. Tao, Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Trans. Inform. Theory 52 (2) (2006) 489-509. doi:10.1109/TIT.2005.862083.
[9] D. L. Donoho, For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution, Comm. Pure Appl. Math. 59 (6) (2006) 797-829. doi:10.1002/cpa.20132.
[10] C. E. Shannon, Communication in the presence of noise, Proc. I.R.E. 37 (1949) 10-21.
[11] W. B. Pennebaker, J. L. Mitchell, JPEG: Still Image Data Compression Standard, 1st Edition, Kluwer Academic Publishers, Norwell, MA, USA, 1992.
[12] E. J. Candès, T. Tao, Near-optimal signal recovery from random projections: universal encoding strategies?, IEEE Trans. Inform. Theory 52 (12) (2006) 5406-5425. doi:10.1109/TIT.2006.885507.
[13] E. J. Candès, J. K. Romberg, T. Tao, Stable signal recovery from incomplete and inaccurate measurements, Comm. Pure Appl. Math. 59 (8) (2006) 1207-1223. doi:10.1002/cpa.20124.
[14] D. L. Donoho, M. Elad, V. N. Temlyakov, Stable recovery of sparse overcomplete representations in the presence of noise, IEEE Trans. Inform. Theory 52 (1) (2006) 6-18. doi:10.1109/TIT.2005.860430.
[15] M. Lustig, D. L. Donoho, J. M. Santos, J. M. Pauly, Compressed sensing MRI, IEEE Signal Processing Magazine 25 (2) (2008) 72-82. doi:10.1109/MSP.2007.914728.
[16] Z. Xiaoyan, W. Houjun, D. Zhijian, Wireless sensor networks based on compressed sensing, in: 3rd IEEE International Conference on Computer Science and Information Technology (ICCSIT), Vol. 9, 2010, pp. 90-92. doi:10.1109/ICCSIT.2010.5564960.
[17] F. J. Herrmann, M. P. Friedlander, O. Yilmaz, Fighting the curse of dimensionality: Compressive sensing in exploration seismology, IEEE Signal Processing Magazine 29 (3) (2012) 88-100. doi:10.1109/MSP.2012.2185859.
[18] M. F. Duarte, M. A. Davenport, D. Takbar, J. N. Laska, T. Sun, K. F. Kelly, R. G. Baraniuk, Single-pixel imaging via compressive sampling, IEEE Signal Processing Magazine 25 (2) (2008) 83-91. doi:10.1109/MSP.2007.914730.
[19] D. L. Donoho, M. Elad, Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization, Proc. Natl. Acad. Sci. USA 100 (5) (2003) 2197-2202. doi:10.1073/pnas.0437847100.
[20] R. Gribonval, M. Nielsen, Sparse representations in unions of bases, IEEE Trans. Inform. Theory 49 (12) (2003) 3320-3325. doi:10.1109/TIT.2003.820031.
[21] E. J. Candès, The restricted isometry property and its implications for compressed sensing, C. R. Math. Acad. Sci. Paris 346 (9-10) (2008) 589-592. doi:10.1016/j.crma.2008.03.014.
[22] T. T. Cai, L. Wang, G. Xu, New bounds for restricted isometry constants, IEEE Trans. Inform. Theory 56 (9) (2010) 4388-4394. doi:10.1109/TIT.2010.2054730.
[23] E. J. Candès, Y. Plan, A probabilistic and RIPless theory of compressed sensing, IEEE Trans. Inform. Theory 57 (11) (2011) 7235-7254. doi:10.1109/TIT.2011.2161794.
[24] E. J. Candès, Y. Plan, Near-ideal model selection by ℓ1 minimization, Ann. Statist. 37 (5A) (2009) 2145-2177. doi:10.1214/08-AOS653.
[25] J. A. Tropp, On the conditioning of random subdictionaries, Appl. Comput. Harmon. Anal. 25 (1) (2008) 1-24. doi:10.1016/j.acha.2007.09.001.
[26] T. Hastie, R. Tibshirani, M. Wainwright, Statistical Learning with Sparsity: The Lasso and Generalizations, CRC Press, Taylor & Francis, 2015.
[27] A. Martinez, R. Benavente, The AR face database, Tech. Rep. 24, Computer Vision Center (June 1998).
[28] A. S. Georghiades, P. N. Belhumeur, D. J. Kriegman, From few to many: illumination cone models for face recognition under variable lighting and pose, IEEE Trans. Pattern Anal. Mach. Intell. 23 (6) (2001) 643-660. doi:10.1109/34.927464.
[29] AT&T Laboratories Cambridge, The database of faces, 1992-1994 (accessed 26.3.2016).
[30] L. R. Welch, Lower bounds on the maximum cross correlation of signals, IEEE Trans. Inform. Theory IT-20 (3) (1974) 397-399.
[31] M. Rosenfeld, In praise of the Gram matrix, in: The Mathematics of Paul Erdős, II, Vol. 14 of Algorithms Combin., Springer, Berlin, 1997, pp. 318-323. doi:10.1007/978-3-642-60406-5_29.
[32] A. V. Little, M. Maggioni, L. Rosasco, Multiscale geometric methods for data sets I: Multiscale SVD, noise and curvature, Appl. Comput. Harmon. Anal. 43 (3) (2017) 504-567. doi:10.1016/j.acha.2015.09.009.
[33] C. Ceruti, S. Bassis, A. Rozza, G. Lombardi, E. Casiraghi, P. Campadelli, DANCo: An intrinsic dimensionality estimator exploiting angle and norm concentration, Pattern Recogn. 47 (8) (2014) 2569-2581. doi:10.1016/j.patcog.2014.02.013.
[34] J. Wright, Y. Ma, Dense error correction via ℓ1-minimization, IEEE Trans. Inform. Theory 56 (7) (2010) 3540-3560. doi:10.1109/TIT.2010.2048473.
[35] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. S. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE 98 (6) (2010) 1031-1044. doi:10.1109/JPROC.2010.2044470.
[36] P. N. Belhumeur, D. J. Kriegman, What is the set of images of an object under all possible lighting conditions?, in: 1996 IEEE Conference on Computer Vision and Pattern Recognition, 1996, pp. 270-277. doi:10.1109/CVPR.1996.517085.
[37] K.-C. Lee, J. Ho, D. Kriegman, Acquiring linear subspaces for face recognition under variable lighting, IEEE Trans. Pattern Anal. Mach. Intell. 27 (5) (2005) 684-698. doi:10.1109/TPAMI.2005.92.
[38] D. L. Donoho, Y. Tsaig, Fast solution of ℓ1-norm minimization problems when the solution may be sparse, IEEE Trans. Inform. Theory 54 (11) (2008) 4789-4812. doi:10.1109/TIT.2008.929958.
[39] M. Asif, J. Romberg, ℓ1 homotopy: A MATLAB toolbox for homotopy algorithms in ℓ1-norm minimization problems, http://users.ece.gatech.edu/~sasif/homotopy/, 2009-2013 (accessed 31.3.2015).
[40] E. van den Berg, M. P. Friedlander, Probing the Pareto frontier for basis pursuit solutions, SIAM Journal on Scientific Computing 31 (2) (2008) 890-912. doi:10.1137/080714488.
[41] E. van den Berg, M. P. Friedlander, SPGL1: A solver for large-scale sparse reconstruction, Version 1.9, April 2015 (accessed 12.4.2016) (June 2007).
[42] R. A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 7 (2) (1936) 179-188. doi:10.1111/j.1469-1809.1936.tb02137.x.
[43] C. R. Rao, The utilization of multiple measurements in problems of biological classification, J. Roy. Statist. Soc. Ser. B. 10 (1948) 159-193.
[44] C. Cortes, V. Vapnik, Support-vector networks, Machine Learning 20 (3) (1995) 273-297. doi:10.1007/BF00994018.
[45] B. E. Boser, I. M. Guyon, V. N. Vapnik, A training algorithm for optimal margin classifiers, in: Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT '92, ACM, 1992, pp. 144-152. doi:10.1145/130385.130401.
[46] J. Yin, Z. Liu, Z. Jin, W. Yang, Kernel sparse representation based classification, Neurocomputing 77 (1) (2012) 120-128. doi:10.1016/j.neucom.2011.08.018.
[47] C. Kang, S. Liao, S. Xiang, C. Pan, Kernel sparse representation with pixel-level and region-level local feature kernels for face recognition, Neurocomputing 133 (2014) 141-152. doi:10.1016/j.neucom.2013.11.022.
[48] Y. C. Eldar, P. Kuppinger, H. Bölcskei, Block-sparse signals: uncertainty relations and efficient recovery, IEEE Trans. Signal Process. 58 (6) (2010) 3042-3054. doi:10.1109/TSP.2010.2044837.
[49] Z. Wang, J. Yang, N. Nasrabadi, T. Huang, A max-margin perspective on sparse representation-based classification, in: 2013 IEEE International Conference on Computer Vision, 2013, pp. 1217-1224. doi:10.1109/ICCV.2013.154.
Figure 8: Average sparsity, accuracy, supp(ℓ1), and supp(ℓ2) (over 100 trials) as σ increased in the kernel setup, for (a) η = 0.001, (b) η = 0.01, and (c) η = 0.1. The annotations "σ_mc" and "σ_acc" denote the maximum σ for which Eq. (5) holds and for which maximum accuracy is obtained in Kernel SRC, respectively.

Figure 9: Median correlation (averaged over 100 trials) between the test sample φ_κ(y) and training samples in the same class (corr_GT) and training samples in different classes (corr_other) for the synthetic database with η = 0.1. Sparsity and accuracy are also displayed for comparison. Notice that the drop in accuracy occurs well after the jump in the correlation terms and sparsity.

Figure 10: Average class contributions (over 100 trials) of coefficient vectors corresponding to class l = 20 test samples; colors denote the classes. Panels: (a)-(b) σ = σ_mc, Accuracy = 1; (c)-(d) σ = 3, Accuracy = 1; (e)-(f) σ = 5, Accuracy ≈ 0.51; (g)-(h) σ = 9, Accuracy ≈ 0.08.