A Mathematical Framework for Feature Selection from Real-World Data with Non-Linear Observations
Martin Genzel and Gitta Kutyniok
Technische Universität Berlin, Department of Mathematics, Straße des 17. Juni 136, 10623 Berlin, Germany. E-mails: [genzel,kutyniok]@math.tu-berlin.de
Abstract.
In this paper, we study the challenge of feature selection based on a relatively small collection of sample pairs $\{(x_i, y_i)\}_{1 \le i \le m}$. The observations $y_i \in \mathbb{R}$ are thereby supposed to follow a noisy single-index model, depending on a certain set of signal variables. A major difficulty is that these variables usually cannot be observed directly, but rather arise as hidden factors in the actual data vectors $x_i \in \mathbb{R}^d$ (feature variables). We will prove that a successful variable selection is still possible in this setup, even when the applied estimator does not have any knowledge of the underlying model parameters and only takes the "raw" samples $\{(x_i, y_i)\}_{1 \le i \le m}$ as input. The model assumptions of our results will be fairly general, allowing for non-linear observations, arbitrary convex signal structures as well as strictly convex loss functions. This is particularly appealing for practical purposes, since in many applications, already standard methods, e.g., the Lasso or logistic regression, yield surprisingly good outcomes. Apart from a general discussion of the practical scope of our theoretical findings, we will also derive a rigorous guarantee for a specific real-world problem, namely sparse feature extraction from (proteomics-based) mass spectrometry data.

Key words. Convex optimization, dictionary representations, feature selection, Gaussian mean width, high-dimensional data, mass spectrometry data, model uncertainty, non-linear observations, sparsity, structured loss minimization, variable selection.
1. Introduction
Let us start with a classical problem situation from learning theory. Suppose we are given a collection of samples $(s_1, y_1), \dots, (s_m, y_m) \in \mathbb{R}^p \times \{-1, +1\}$ which are independently drawn from a random pair $(S, Y)$ with unknown joint probability distribution on $\mathbb{R}^p \times \{-1, +1\}$. Here, the random vector $S$ typically models a set of signal variables (or data variables), whereas the binary label $Y$ assigns this data to a certain class-of-interest, which is either $-1$ or $+1$. The goal of supervised machine learning is then to find an accurate predictor $\hat F \colon \mathbb{R}^p \to \{-1, +1\}$ of this classification procedure, such that $\hat Y := \hat F(S)$ coincides with the true variable $Y$, at least "with high probability." In a very simple example scenario, we may assume that the observed labels can be described by a linear classification model of the form
\[ y_i = \operatorname{sign}(\langle s_i, z \rangle), \quad i = 1, \dots, m, \tag{1.1} \]
where $z \in \mathbb{R}^p$ is an unknown signal vector (or parameter vector). The ultimate goal would be now to learn an estimator $\hat z \in \mathbb{R}^p$ of $z$, by merely using a small set of training pairs $\{(s_i, y_i)\}_{1 \le i \le m}$. A good approximation of the true signal vector $z$ would not only provide a reliable predictor $\hat F(S) = \operatorname{sign}(\langle S, \hat z \rangle)$, but in fact, its non-zero entries $\operatorname{supp}(\hat z) = \{ j \mid \hat z_j \ne 0 \}$ would even indicate which variables of the data $S$ are (strongly) correlated with the associated class $Y$. Such a statement is of course much stronger than just correctly predicting the class label because in that way, we are able to understand the underlying observation process. The main focus of this work will be precisely on this type of problem, which is usually referred to as the task of feature selection, variable selection, or feature extraction in statistical learning theory. (This is the same as assuming that $Y = \operatorname{sign}(\langle S, z \rangle)$ because each sample $(s_i, y_i)$ can be seen as an independent realization of $(S, Y)$. But in this paper, we shall prefer the "sample notation" of (1.1), which seems to be more natural and convenient for our purposes.)
Before continuing with the general problem issue, let us illustrate the above setup by a specific real-world example: The medical research of the last decades has shown that the early diagnosis of tumor diseases, such as cancer, can be significantly improved by extracting new biomarkers from proteomics data. In this context, each sample pair $(s_i, y_i) \in \mathbb{R}^p \times \{-1, +1\}$ corresponds to an individual proband of a clinical study. The class label $y_i$ simply specifies whether the $i$-th test person suffers from a certain disease ($y_i = -1$) or not ($y_i = +1$), whereas each entry of $s_i = (s_{i,1}, \dots, s_{i,p})$ contains the molecular concentration of a particular protein structure in the human body. In this sense, $s_i$ represents (a part of) the so-called proteome, which is the entire collection of an individual's proteins at a fixed point of time. Assuming a linear classification model as in (1.1) would mean that the patient's health status can be essentially determined by the presence or absence of some of those protein structures. The signal vector $z$ now plays the role of a disease fingerprint because its non-zero entries precisely indicate those proteins which seem to be relevant to the examined disease. Interestingly, various empirical studies have shown that, oftentimes, already a very small set of highly discriminative molecules is characteristic for a certain disease (see [7] and the references therein), which implies that $z$ might be relatively sparse (a vector is said to be sparse if most of its entries are equal to zero). The molecular concentrations of these few proteins finally form candidates for biomarkers, that is, reliable indicators for a potential affection of the body. This prototype application will recurrently serve as an illustration of our framework. In order to keep the examples as simple as possible, we shall omit several nonessential biological details in this work. The interested reader is referred to [10, Chap. 2] and [7] for a more extensive discussion of the biological and clinical background.

Returning to our initial challenge of variable selection, we may ask under which conditions an accurate reconstruction of $z$ from $\{(s_i, y_i)\}_{1 \le i \le m}$ is possible. For instance, when assuming a Gaussian distribution of the data, $s_i \sim \mathcal{N}(0, \Sigma)$ with $\Sigma \in \mathbb{R}^{p \times p}$ positive definite, this task becomes practically feasible (even in a much more general setting). Efficient algorithms as well as rigorous recovery guarantees are indeed available in this case; see [1, 11, 14, 19, 20] for example.

Unfortunately, the situation is more complicated in most practical applications. The signal variables $s_i = (s_{i,1}, \dots, s_{i,p})$ usually cannot be observed directly but merely in terms of a certain data representation. A typical model for a real-world data set may look as follows:
\[ x_i = \sum_{k=1}^{p} s_{i,k}\, a_k + \bar n_i \in \mathbb{R}^d, \quad i = 1, \dots, m, \tag{1.2} \]
where $a_1, \dots, a_p \in \mathbb{R}^d$ are fixed feature atoms (or patterns) and $\bar n_i \in \mathbb{R}^d$ generates additive random noise in each entry. The (hidden) variables $s_{i,1}, \dots, s_{i,p}$ are rather encoded as (random) scalar factors of a linear combination now, building up the data vector $x_i$; this is why the $s_{i,1}, \dots, s_{i,p}$ are sometimes also called signal factors.

In the setup of proteomics, the model of (1.2) could, for example, describe mass spectrometry data (MS data), which is a widely-used acquisition method to detect the concentration of protein structures within clinical samples, e.g., blood or urine. A typical mass spectrum is shown in Figure 1.1. We may assume that each single feature atom $a_k \in \mathbb{R}^d$ corresponds to a particular peak, determining its position and shape, whereas its height is specified by the scalar factor $s_{i,k}$, which varies from sample to sample. Thus, the information-of-interest is not directly available anymore, but only represented in terms of the (relative) peak heights. The technical and physical details of MS data will be further discussed in Subsection 3.2, including a precise definition of the data model.

As a general conclusion, we are often obliged to deal with data pairs $\{(x_i, y_i)\}_{1 \le i \le m}$—for example, generated by (1.2) and (1.1), respectively—but our primal goal is still to estimate the signal vector $z$ from them. The main difficulty is that in many practical situations neither the feature atoms $a_1, \dots, a_p$ nor the signal factors $s_1, \dots, s_m$ are explicitly known. In particular, there is no straightforward way to evaluate the binary observation scheme (1.1).
Figure 1.1: A typical example of a (finite-resolution) mass spectrum. The horizontal axis represents the indices $j = 1, \dots, d$ of the associated data vector $x_i = (x_{i,1}, \dots, x_{i,d})$, plotted against its entries on the vertical axis (in the above spectrum, we have $d = 42\,390$). Each of the characteristic Gaussian-shaped peaks can be identified with a specific type of protein, where its maximal intensity is proportional to the molecular concentration within the examined sample.

One common approach is to perform a factor analysis for the data model (1.2), i.e., using the available sample collection $x_1, \dots, x_m$ to approximate both $a_1, \dots, a_p$ and $s_1, \dots, s_m$. But as long as the sample count $m$ is small and noise is present, such a procedure could become very unstable, and moreover, the desired factorization might be highly non-unique.

To circumvent these drawbacks, let us follow a more basic strategy that extracts features directly from the data. For that matter, one could simply try to mimic the behavior of the classification model (1.1) by learning a feature vector $\hat\beta \in \mathbb{R}^d$ such that
\[ y_i = \operatorname{sign}(\langle x_i, \hat\beta \rangle) \tag{1.3} \]
holds at least for "many" $i = 1, \dots, m$. Similarly to the initial problem of variable selection, the non-zero entries of $\hat\beta$ may determine several features of the data that are relevant to an accurate prediction of the label. For the case of (proteomics-based) MS data, we would therefore try to choose the support of $\hat\beta$ as small as possible, but in such a way that its entries are closely located to those peaks which are strongly correlated with the examined disease. Finally, a physician could use his/her expertise to identify these (few) peaks with the associated proteins "by hand." From a practical perspective, this would essentially solve the original problem, although the molecular concentrations $s_i$ are still unknown. Such a simple work-flow for (sparse) feature selection from proteomics data can in fact lead to state-of-the-art results for both simulated and real-world data sets; see [7, 10] for some numerical experiments. These works promote that many classical algorithms from machine learning, combined with some appropriate preprocessing, already provide a reliable feature vector $\hat\beta$. (In the literature, feature vectors are often referred to as the outputs of a feature map. But in this work, a feature vector $\beta \in \mathbb{R}^d$ is rather understood to be a vector that extracts features from the data by a linear projection $\langle x_i, \beta \rangle$. The word "many" essentially means that the equality of (1.3) holds for a large fraction of samples. This will be made more precise later on in Section 2.)
Perhaps the most prominent example is the Lasso, originally introduced by Tibshirani [24]:
\[ \hat\beta \in \operatorname*{argmin}_{\beta \in \mathbb{R}^d} \frac{1}{2m} \sum_{i=1}^{m} (\langle x_i, \beta \rangle - y_i)^2 \quad \text{subject to } \|\beta\|_1 \le R, \tag{$P_R$} \]
where the regularization parameter $R > 0$ controls the strength of the sparsity constraint. Such observations immediately raise the following questions, which this paper is mainly concerned with:

(Q1) Which information about $z$ does an estimated feature vector $\hat\beta$ already carry? Can we use $\hat\beta$ to define an accurate estimator $\hat z = \hat z(\hat\beta)$ of $z$? If so, how do we measure the approximation error?

(Q2) Regarding the issues of (Q1), to which extent can we omit explicit knowledge of the data model (1.2)? More precisely, in how far is the definition of $\hat z$ affected by the unknown feature atoms $a_1, \dots, a_p$?

(Q3) In how far can we extend the above model setup? In particular: Can we replace the $\operatorname{sign}$-function in (1.1) by a general (unknown and noisy) non-linearity? What other types of convex loss functions and structural constraints could be considered for the Lasso-estimator ($P_R$)?

It will turn out that, although the first two questions are obviously closely connected, their answers might be quite different. Indeed, while (Q1) and (Q3) are rather theoretically motivated and we will be able to give general solutions, a positive statement on (Q2) will heavily depend on the intrinsic structure of the data model (1.2).
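To make the role of ($P_R$) concrete, the following sketch shows one way such a constrained Lasso can be solved, namely by projected gradient descent; this is only one of many possible solvers and is not taken from the paper itself. The function names (`project_l1_ball`, `solve_lasso`) are our own, and the projection routine follows the well-known sorting-based algorithm for the $\ell_1$-ball.

```python
import numpy as np

def project_l1_ball(beta, R):
    """Euclidean projection of beta onto {b : ||b||_1 <= R}
    (sorting-based algorithm, cf. Duchi et al., 2008)."""
    if np.sum(np.abs(beta)) <= R:
        return beta
    u = np.sort(np.abs(beta))[::-1]          # sorted magnitudes, descending
    css = np.cumsum(u)
    idx = np.arange(1, u.size + 1)
    rho = np.max(np.nonzero(u - (css - R) / idx > 0)[0])
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(beta) * np.maximum(np.abs(beta) - theta, 0.0)

def solve_lasso(X, y, R, n_iter=1000):
    """Constrained Lasso (P_R): minimize (1/2m) * ||X beta - y||_2^2
    subject to ||beta||_1 <= R, via projected gradient descent."""
    m, d = X.shape
    step = m / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the gradient
    beta = np.zeros(d)
    for _ in range(n_iter):
        grad = X.T @ (X @ beta - y) / m
        beta = project_l1_ball(beta - step * grad, R)
    return beta
```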
Let us first focus on the problem of (Q1). As already mentioned above, our scope is initially restricted to the data domain $\mathbb{R}^d$ where the samples $x_1, \dots, x_m \in \mathbb{R}^d$ belong. But our actual task is associated with the signal domain $\mathbb{R}^p$, which contains the hidden signal variables $s_1, \dots, s_m$ and the desired parameter vector $z$. Hence, the following question emerges: How can we express the underlying observation procedure (1.1) in terms of the data domain?
Inspired by the previous subsection, we shall simply assume that there exists some feature vector $\beta_0 \in \mathbb{R}^d$ such that
\[ y_i = \operatorname{sign}(\langle s_i, z \rangle) = \operatorname{sign}(\langle x_i, \beta_0 \rangle) \quad \text{for "many" } i = 1, \dots, m. \tag{1.4} \]
However, this choice of $\beta_0$ is typically non-unique. The Gaussian-shaped peaks of MS data, for example, have a certain spatial extent (see Figure 1.1) so that the data vectors $x_1, \dots, x_m$ consist of several almost collinear entries. Due to this redundancy, we may find another feature vector $\hat\beta \in \mathbb{R}^d$ whose support is disjoint from the one of $\beta_0$ but satisfies (1.4) as well. It is therefore not very meaningful to compare feature vectors only in terms of their supports or their Euclidean distance.

In order to further investigate the properties of $\beta_0$ and to derive an appropriate measure of distance for feature vectors, we need to incorporate the factor model of (1.2). Collecting the atoms in a feature matrix $A := [a_1 \mid \dots \mid a_p] \in \mathbb{R}^{d \times p}$, we obtain a compact expression $x_i = A s_i + \bar n_i$, $i = 1, \dots, m$, that leads to the following identity:
\[ \operatorname{sign}(\langle x_i, \beta_0 \rangle) = \operatorname{sign}(\langle A s_i, \beta_0 \rangle + \langle \bar n_i, \beta_0 \rangle) = \operatorname{sign}(\langle s_i, A^{\mathrm T} \beta_0 \rangle + \langle \bar n_i, \beta_0 \rangle). \tag{1.5} \]
Provided that the additive noise term $\langle \bar n_i, \beta_0 \rangle$ is relatively small compared to the signal term $\langle s_i, A^{\mathrm T} \beta_0 \rangle$—which essentially means that the signal-to-noise ratio (SNR) of the data is high—we may expect that
\[ y_i = \operatorname{sign}(\langle s_i, z \rangle) = \operatorname{sign}(\langle s_i, A^{\mathrm T} \beta_0 \rangle) \quad \text{for "many" } i = 1, \dots, m. \tag{1.6} \]
These simple calculations show us how the hidden classification rule can be (approximately) represented in terms of the feature vector $\beta_0 \in \mathbb{R}^d$. In particular, the matrix $D := A^{\mathrm T} \in \mathbb{R}^{p \times d}$ thereby arises as a natural transform between the signal domain and data domain. Note that $D$ just contains all feature atoms $a_1, \dots, a_p$ as rows, which should be primarily seen as fixed and intrinsic parameters of the data model (1.2).

If, in addition, $D$ is surjective, we may even choose $\beta_0 \in \mathbb{R}^d$ such that $z = D\beta_0$ and (1.6) holds for all samples $i = 1, \dots, m$.
(This automatically implies that $p \le d$. In fact, if the assumption of surjectivity is not satisfied, there is a certain evidence that the available data set might not be "rich enough" to describe the original problem of variable selection, associated with the signal domain.) In this context, it is quite common to regard $D$ as an overcomplete feature dictionary that allows us to represent the signal-of-interest $z$ by a coefficient vector $\beta$ from the data domain. But nevertheless, the choice of $\beta$ could be still highly non-unique. So one might ask for the following: What is the "best-possible" representation of $z$ by the dictionary $D$? On the one hand, we should make use of the fact that $z$ often carries some additional (low-complexity) structure. As an example, one could try to mimic the sparsity constraint of the Lasso ($P_R$) and assume that there exists $\beta_0 \in R B_1^d$ with $R > 0$ such that $z = D\beta_0$; this would actually mean that $z$ can be sparsely represented by $D$. (Here, $B_1^d = \{x \in \mathbb{R}^d \mid \|x\|_1 \le 1\}$ denotes the $\ell_1$-unit ball of $\mathbb{R}^d$.) On the other hand, it is also essential to control the deviation of the noise term $\langle \bar n_i, \beta \rangle$ in (1.5), so that the mismatch of our approximation in (1.4) does not become too large. Hence, we may simply minimize its variance:
\[ \sigma^2 := \min_{\substack{\beta \in R B_1^d \\ z = D\beta}} \mathbb{E}[\langle \bar n_i, \beta \rangle^2]. \tag{1.7} \]
This definition forms a key ingredient of our general framework in Section 2. In fact, any minimizer $\beta_0 \in R B_1^d$ of (1.7) will be referred to as an optimal representation of $z$ by $D$ (see Definition 2.2). We will particularly observe that the corresponding noise variance $\sigma^2$ is immediately related to the SNR, which in turn plays a crucial role in the quality of our error estimates.

Let us finally return to the initial question of (Q1). Given an accurate classifier $\hat\beta \in \mathbb{R}^d$, the above discussion suggests to define an estimator of $z$ via the dictionary transform $D$, that is, $\hat z := D\hat\beta$. It will turn out that this idea, although surprisingly simple, is the key step to achieve a rigorous error bound on the Euclidean distance
\[ \|\hat z - \lambda z\|_2 = \|D\hat\beta - \lambda D\beta_0\|_2, \tag{1.8} \]
where $\lambda$ is an adaptive scaling parameter that depends on the SNR as well as on the (non-)linear observation model. In particular, (1.8) should be regarded as an intrinsic and natural (semi-)metric to compare two feature vectors within the data domain.

Before developing the ideas of the previous subsection towards a rigorous mathematical framework in Section 2, we would like to briefly illustrate our approach by a first theoretical guarantee for feature selection, which is a simplified and easy-to-read version of our main results, Theorem 2.9 and
Theorem 2.11. In order to stress the key concepts and to avoid technicalities, let us consider the noiseless case here: This means, we assume that the data pairs $\{(x_i, y_i)\}_{1 \le i \le m} \subset \mathbb{R}^d \times \{-1, +1\}$ are i.i.d. samples of the models (1.1) and (1.2), where $s_i \sim \mathcal{N}(0, I_p)$ and $\bar n_i \equiv 0$. Moreover, let $D_1, \dots, D_d \in \mathbb{R}^p$ denote the columns of the dictionary $D \in \mathbb{R}^{p \times d}$, which are usually referred to as the dictionary atoms. (This should not be mixed up with the feature atoms $a_1, \dots, a_p \in \mathbb{R}^d$, which form the rows of $D$ and are rather seen as the "building blocks" of the data.) The "maximal energy" of the feature variables is then given by the largest column norm:
\[ D_{\max} := \max_{1 \le j \le d} \|D_j\|_2. \]
Using this model setup, we can state the following theorem with its proof being postponed to Subsection 5.1:

Theorem 1.1
Let the above model assumptions hold true. Moreover, suppose that $\|z\|_2 = 1$ and $z \in R \cdot D B_1^d$ for some $R > 0$, implying that we can find a representing feature vector $\beta_0 \in R B_1^d$ with $z = D\beta_0$. Then there exist constants $C, C' > 0$ such that the following holds with high probability: If the number of samples obeys
\[ m \ge C \cdot D_{\max}^2 \cdot R^2 \cdot \log(2d), \tag{1.9} \]
then, setting $\hat z := D\hat\beta$ for any minimizer $\hat\beta$ of the Lasso ($P_R$), we have
\[ \Big\|\hat z - \sqrt{\tfrac{2}{\pi}}\, z\Big\|_2 = \Big\|D\hat\beta - \sqrt{\tfrac{2}{\pi}}\, D\beta_0\Big\|_2 \le C' \left( \frac{D_{\max}^2 \cdot R^2 \cdot \log(2d)}{m} \right)^{1/4}. \tag{1.10} \]

In simple terms, the error bound (1.10) (together with the condition of (1.9)) yields a sufficient criterion for successful feature selection: An accurate recovery of $z$ is guaranteed as long as the sample count $m$ (greatly) exceeds $D_{\max}^2 \cdot R^2 \cdot \log(2d)$. Thus, at least in our simplified model setting, Theorem 1.1 gives a satisfactory answer to (Q1). It is particularly important that the right-hand side of (1.10) only logarithmically depends on the dimension $d$ of the data, which can be extremely large in practice. As a consequence, the above result indicates that feature selection is already possible if only relatively few samples are available, which is also a typical constraint in realistic scenarios.

In contrast, Theorem 1.1 does not give a full solution to (Q2). The dictionary $D$ is unfortunately unknown in many applications so that we are neither able to explicitly construct the estimator $\hat z$ nor to assess the quality of $\hat\beta$ in the sense of (1.10). Therefore, it is a remarkable feature of Theorem 1.1 that variable selection by a standard algorithm is still (theoretically) feasible, although a factorization of the data $x_i = D^{\mathrm T} s_i$ is completely missing. (Some people strictly refer to variable selection as the task of recovering the support of $z$. In this work however, selecting variables and approximating $z$ (in the Euclidean sense) are understood to be the same challenges. This is also due to the fact that we shall go beyond classical sparsity patterns later on.)

Apart from that, the approximation quality of (1.10) also depends on an appropriate bound for the product $D_{\max} \cdot R$. In fact, both factors heavily rely on the transformation rule of $D$, which may change substantially from application to application. We shall return to this issue in Section 3 where we further discuss the practical relevance of our theoretical findings and investigate some desirable properties (and modifications) of the feature dictionary $D$.
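The setting of Theorem 1.1 is easy to probe numerically. The following small simulation (ours, not part of the paper) reuses the hypothetical `solve_lasso` sketch from above; the random dictionary and all dimension parameters are arbitrary illustrative choices, not prescribed by the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
p, d, m = 20, 500, 2000

D = rng.standard_normal((p, d)) / np.sqrt(p)       # dictionary with columns of norm ~1
beta0 = np.zeros(d); beta0[:5] = 1.0               # sparse representing vector
z = D @ beta0
beta0 /= np.linalg.norm(z); z /= np.linalg.norm(z) # enforce ||z||_2 = 1, keeping z = D beta0

S = rng.standard_normal((m, p))                    # hidden signal variables s_i ~ N(0, I_p)
X = S @ D                                          # noiseless data x_i = D^T s_i, model (1.2)
y = np.sign(S @ z)                                 # binary observations, model (1.1)

beta_hat = solve_lasso(X, y, R=np.linalg.norm(beta0, 1))
err = np.linalg.norm(D @ beta_hat - np.sqrt(2 / np.pi) * z)
print(f"error in the sense of (1.10): {err:.3f}")  # should shrink as m grows
```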
One major goal of this work is to develop a further understanding of why even standard approaches, such as the Lasso or logistic regression, perform surprisingly well in many real-world applications. Indeed, our main results show that successful feature selection can be already achieved with estimators which only take the raw sample pairs $\{(x_i, y_i)\}_{1 \le i \le m}$ as input and do not require any specific information about the data. (For simplicity, we will still speak of "raw" data if it has undergone some common preprocessing, like smoothing or standardization, which is not part of the actual selection procedure.) Regarding the illustrative setup of Theorem 1.1, we shall also see in Section 2 that the underlying model assumptions can be further generalized, including noisy non-linear outputs, strictly convex loss functions, and arbitrary (convex) signal structures. In this sense, our framework gives fairly general solutions to the initial challenges of (Q1) and (Q3).

As already sketched in Subsection 1.2, our key idea is to mimic the true observation model within the data domain. Using the novel concept of optimal dictionary representations (cf. Definition 2.2), it is in fact possible to obtain an approximation of $z$ while the actual estimator works in the data domain. The proofs of our results are based on the recent work of [11], where the first author has provided an abstract toolbox for signal recovery in high dimensions. To a certain extent, this paper also continues the philosophy of [11], in the sense that the unknown data decomposition can be seen as an additional source of model uncertainty. But we will try to keep this exposition as self-contained as possible and restate the used main principles of [11].

Most approaches in classical learning theory aim to model the posterior distribution $\mathbb{E}[Y \mid X]$ (conditional expectation) in a direct way. In our context, this would basically mean that a model of the form $y_i \approx \langle x_i, \beta \rangle$ is assumed to be the ground-truth output rule. There has been a remarkable effort during the last decades to prove recovery guarantees (for $\beta$) if the data variables $x_i$ are not "too degenerate." Prominent examples are the restricted isometry property, restricted eigenvalue condition, or irrepresentability condition, which require that the $x_i$ are sufficiently well-behaved; see [5, 9, 13] for overviews. But unfortunately, real-world data is typically highly redundant, so that these assumptions are not even met in relatively simple cases, such as mass spectra (see Figure 1.1). For that reason, various strategies have been recently suggested in the literature to deal with redundant (or almost perfectly correlated) features, for instance, hierarchical clustering ([4]) or OSCAR/OWL ([3, 8]). These methods are however mostly adapted to very specific settings and therefore do not allow for a general treatment of our problem setup. In contrast, we shall not focus on designing sophisticated algorithms, but rather propose a novel perspective on the challenge of variable selection: As pointed out above, one may easily circumvent the issue of redundancy by considering a hidden measurement process $y_i \approx \langle s_i, z \rangle$ and relating it to the data prior $x_i$ in a separate step. Such a combination of models has been much less studied theoretically, and to the best of our knowledge, this work provides the first rigorous mathematical result in this direction. Hence, we hope that our key techniques, primarily the idea of optimal representations, could also have a certain impact on future developments in the field of feature extraction.

An alternative branch of research aims at extracting the signal variables $s_1, \dots, s_m$ directly from the raw data $x_1, \dots, x_m$. This could be done by applying an appropriate feature map (to the data), which is typically obtained by dictionary learning or a factor analysis.
Afterwards, the questions of (Q1)–(Q3) would be essentially superfluous, since any estimator could now explicitly invoke the hidden output rule. But on the other hand, such a strategy is only feasible as long as sufficiently many samples are available and the data are not too noisy—and these are two assumptions which are hardly satisfied in practice. For example, a peak detection for MS data might be extremely unstable if some of the peaks are "buried" in the baseline noise. It was already succinctly emphasized by Vapnik in [26, p. 12] that a direct approach should be always preferred:

"If you possess a restricted amount of information for solving some problem, try to solve the problem directly and never solve the more general problem as an intermediate step. It is possible that the available information is sufficient for a direct solution but is insufficient for solving a more general intermediate problem."
This fundamental principle is precisely reflected by our findings, showing that selecting features and learning a feature map can (and should) be considered as separate tasks.

The generality of our framework, in particular that $D$ can be arbitrary, comes with the drawback that we cannot always make a significant statement on (Q2). If no further information on the dictionary $D$ is given, it is virtually impossible to assess the practical relevance of an estimated feature vector $\hat\beta$. As a consequence, when establishing a novel result for a specific application, one always needs to carefully analyze the intrinsic structure of $D$. This will be illustrated for MS data in Subsection 3.2, but we believe that our theorems may be applicable to other types of data as well, such as microarrays or hyperspectral images. For that matter, some general rules of thumb are provided in Subsection 3.3, indicating whether or not a successful feature extraction (by standard estimators) can be expected for a certain problem-of-interest.

There is also an interesting relationship of our results to compressed sensing (CS) with dictionaries (see [6, 22] for example). In order to apply the guarantees from [11], the convex optimization program (e.g., ($P_R$)) is actually transformed into a synthesis formulation using the feature dictionary $D$ (see Subsection 5.2, particularly (5.5)). However, it is not entirely clear if one could make use of this connection, since the CS-related theory strongly relies on an explicit knowledge of the dictionary.

The main focus of Section 2 is on the questions of (Q1) and (Q3). In this section, we will develop the setup of the introduction further towards an abstract framework for feature selection, including more general data and observation models (Subsections 2.1 and 2.2), strictly convex loss functions, and arbitrary convex signal structures (Subsection 2.3). The Subsections 2.4 and 2.5 then contain our main recovery guarantees (Theorem 2.9 and Theorem 2.11), while their proofs are postponed to Subsection 5.2. The practical scope of our results is studied afterwards in Section 3. In the course of this, we will particularly investigate the benefit of standardizing the data (Subsection 3.1). Moreover, the issue of (Q2) is addressed again, first by returning to our prototype example of MS data (Subsection 3.2), followed by a more general discussion (Subsection 3.3). Some concluding remarks and possible extensions are finally presented in Section 4.

Throughout this paper, we will recurrently make use of several (standard) notations and conventions which are summarized in the following list:

• Vectors and matrices are usually written in boldface, whereas their entries are referenced by subscripts. Let $x = (x_1, \dots, x_n) \in \mathbb{R}^n$ be a vector (unless stated otherwise, it is always understood to be a column vector). The support of $x$ is defined by
\[ \operatorname{supp}(x) := \{ 1 \le j \le n \mid x_j \ne 0 \}, \]
and its cardinality is $\|x\|_0 := \#\operatorname{supp}(x)$. For $1 \le p \le \infty$, the $\ell_p$-norm of $x$ is given by
\[ \|x\|_p := \begin{cases} \big( \sum_{j=1}^n |x_j|^p \big)^{1/p}, & p < \infty, \\ \max_{1 \le j \le n} |x_j|, & p = \infty. \end{cases} \]
The associated unit ball is denoted by $B_p^n := \{x \in \mathbb{R}^n \mid \|x\|_p \le 1\}$ and the (Euclidean) unit sphere is $S^{n-1} := \{x \in \mathbb{R}^n \mid \|x\|_2 = 1\}$. The operator norm of a matrix $M \in \mathbb{R}^{n' \times n}$ is defined as $\|M\| := \sup_{x \in S^{n-1}} \|Mx\|_2$.

• Let $Z$ and $Z'$ be two real-valued random variables (or random vectors).
The expected value of $Z$ is denoted by $\mathbb{E}[Z]$ and the variance by $\mathbb{V}[Z]$. Similarly, we write $\mathbb{E}[Z \mid Z']$ for the conditional expectation of $Z$ with respect to $Z'$. The probability of an event $A$ is denoted by $\mathbb{P}[A]$.

• The letter $C$ is always reserved for a constant, and if necessary, an explicit dependence of $C$ on a certain parameter is indicated by a subscript. More specifically, $C$ is said to be a numerical constant if its value is independent from all involved parameters in the current setup. In this case, we sometimes simply write $A \lesssim B$ instead of $A \le C \cdot B$; and if $C_1 \cdot A \le B \le C_2 \cdot A$ for numerical constants $C_1, C_2 > 0$, we use the abbreviation $A \asymp B$.

• The phrase "with high probability" means that an event arises at least with a fixed (and high) probability of success, for instance, 99%. Alternatively, one could regard this probability as an additional parameter which would then appear as a factor somewhere in the statement. But for the sake of convenience, we will usually omit this explicit quantification.

2. General Framework for Feature Selection and Theoretical Guarantees
In this section, we shall further extend the exemplary setup of the introduction, ultimately leading toan abstract framework for feature selection from real-world data. Our main results, Theorem 2.9 andTheorem 2.11, will show that rigorous guarantees even hold in this more advanced situation. Thesefindings particularly provide relatively general solutions to the issues of (Q1) and (Q3).
For an overview of the notations introduced in this and the subsequent subsection, the reader may also consider Table 2.1 below.

Let us start by generalizing the simple binary model of (1.1). The following observation scheme was already considered in [11, 20, 21]:

(M1) We assume that the observation variables $y_1, \dots, y_m \in \mathbb{R}$ obey a semiparametric single-index model
\[ y_i = f(\langle s_i, z \rangle), \quad i = 1, \dots, m, \tag{2.1} \]
where $z \in \mathbb{R}^p$ is the ground-truth signal vector. The signal variables $s_1, \dots, s_m \sim \mathcal{N}(0, I_p)$ are independent samples of a standard Gaussian vector and $f \colon \mathbb{R} \to Y$ is a (possibly random) function which is independent of $s_i$, with $Y$ denoting a closed subset of $\mathbb{R}$.

The range $Y$ of $f$ defines the observation domain, restricting the possible values of $y_1, \dots, y_m$. For example, we would have $Y = \{-1, 0, +1\}$ for the binary classification rule in (1.1). The function $f$ plays the role of a non-linearity in (2.1), modifying the linear output $\langle s_i, z \rangle$ in a certain (random) way. Remarkably, this perturbation might be even unknown so that, in particular, we do not need to assume any knowledge of the noise structure of the observations. (The randomness of $f$ is understood observation-wise, i.e., for every sample $i \in \{1, \dots, m\}$, we take an independent sample of $f$. But this explicit dependence of $f$ on $i$ is omitted here. One should also be aware of the fact that $f$ could be a (noisy) linear function, even though we will continue to speak of "non-linearities.")

Our extension of the factor model in (1.2) looks as follows:

(M2) The sample data are generated from a linear model of the form
\[ x_i = \sum_{k=1}^{p} s_{i,k}\, a_k + \sum_{l=1}^{q} n_{i,l}\, b_l \in \mathbb{R}^d, \quad i = 1, \dots, m, \tag{2.2} \]
where the signal variables $s_i = (s_{i,1}, \dots, s_{i,p})$ are already determined by (M1). The noise variables $n_i = (n_{i,1}, \dots, n_{i,q}) \sim \mathcal{N}(0, I_q)$, $i = 1, \dots, m$, are independently drawn from a standard Gaussian vector, which is also independent from the $s_1, \dots, s_m$. As before, $a_1, \dots, a_p \in \mathbb{R}^d$ are fixed (deterministic) feature atoms, whereas the fixed vectors $b_1, \dots, b_q \in \mathbb{R}^d$ are called noise atoms.

Note that the assumption of $s_i$ and $n_i$ having mean zero is no severe restriction, since the input data is typically centered in advance (see Remark 3.1(2)). Similarly, unit variances can be also taken for granted as the actual energy of the signals is determined by their feature and noise atoms, respectively. The main novelty of (2.2), compared to (1.2), is obviously the generalized noise term. Indeed, if $q = d$ and $b_1, \dots, b_d$ are (scalar multiples of) the Euclidean unit vectors of $\mathbb{R}^d$, we would precisely end up with the entry-wise noise structure of (1.2). Put simply, the data $x_1, \dots, x_m$ are built up of a linear combination of atoms where their respective contributions (scalar factors) are random. But once the observation process (M1) is taken into account, a distinction between feature and noise atoms becomes meaningful: The signal factors $s_{i,1}, \dots, s_{i,p}$ are precisely those contributing to (2.1) whereas the noise factors $n_{i,1}, \dots, n_{i,q}$ are irrelevant. It would be therefore also quite natural to refer to $s_i$ and $n_i$ as active and inactive variables, respectively. This issue will become important again in the following subsection when we introduce the so-called extended signal domain, unifying both types of signals.

Finally, let us briefly restate the major challenge of feature selection: Find a robust and accurate estimator $\hat z$ of the signal vector $z$, using only a (small) collection of sample pairs $\{(x_i, y_i)\}_{1 \le i \le m} \subset \mathbb{R}^d \times Y$. At this point, we would like to call attention to the nomenclature used for our framework: All phrases containing the word "feature" are associated with the data domain $\mathbb{R}^d$; in particular, each single component of $x_i = (x_{i,1}, \dots, x_{i,d})$ is called a feature variable. In contrast, the word "signal" is naturally related to the signal domain $\mathbb{R}^p$. Hence, it is a bit inconsistent to speak of "feature selection" in this context because our task is rather to select signal variables. But using a phrase like "signal selection" would be quite uncommon, and moreover, the output of the estimator (e.g., the Lasso) is actually a feature vector.
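For concreteness, here is one way to sample from the models (M1) and (M2). The Gaussian-bump shape of the feature atoms (loosely mimicking the MS peaks of Figure 1.1), the noisy sign-function chosen for $f$, and all size parameters are purely illustrative choices of ours.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q, d, m = 10, 50, 2000, 200

# Feature atoms a_k: Gaussian-shaped peaks at random positions (columns of A).
grid = np.arange(d)
centers = rng.choice(d, size=p, replace=False)
A = np.exp(-0.5 * ((grid[:, None] - centers[None, :]) / 15.0) ** 2)
B = 0.05 * rng.standard_normal((d, q))     # generic (small) noise atoms b_l

z = np.zeros(p); z[:3] = 1.0 / np.sqrt(3)  # sparse signal vector with ||z||_2 = 1

S = rng.standard_normal((m, p))            # signal factors s_i ~ N(0, I_p)
Nf = rng.standard_normal((m, q))           # noise factors n_i ~ N(0, I_q)
X = S @ A.T + Nf @ B.T                     # data model (M2): x_i = A s_i + B n_i

def f(t):
    """A (random) non-linearity: sign corrupted by pre-activation noise."""
    return np.sign(t + 0.1 * rng.standard_normal(t.shape))

y = f(S @ z)                               # observation model (M1): y_i = f(<s_i, z>)
```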
Remark 2.1 In the language of statistical learning, (2.1) is a typical example of a discriminative model that describes the posterior $\mathbb{E}[y_i \mid s_i]$, and the above goal would translate into learning the model parameters $z$. Together with the prior distribution of the data $x_i$ defined in (2.2), the joint probability distribution of a sample pair $(x_i, y_i)$ is completely determined. In fact, any Gaussian vector can be written in terms of (2.2) with appropriately chosen feature atoms. The actual limitations of our model are the assumptions in (M1); it is not always true that an observation $y_i$ only depends on a linear projection of $s_i$. For example, one could think of an additive model where the entries of $s_i$ are modified in a non-linear fashion before projecting. Fortunately, we will see in Subsection 2.5 that (M1) can be further relaxed. The true observations might be even affected by some arbitrary, possibly deterministic noise. And furthermore, it will turn out that low correlations between the signal variables do not cause any difficulties, i.e., we can allow for $s_i \sim \mathcal{N}(0, \Sigma)$ with a positive definite covariance matrix $\Sigma \in \mathbb{R}^{p \times p}$. ♦

Now, we would like to adapt the key concept of optimal representations that was already outlined in Subsection 1.2. In order to carry out a precise analysis, one clearly needs to incorporate the influence of the noise variables, which was essentially disregarded in (1.4). As foreshadowed in the previous subsection, it is quite natural here to view the noise term as an additional (physical) source of signals. Mathematically, this corresponds to artificially extending the signal domain by all noise variables so that we actually consider joint vectors $(s_i, n_i) \in \mathbb{R}^{p+q}$. Thus, collecting the feature and noise atoms as matrices $A := [a_1 \mid \dots \mid a_p] \in \mathbb{R}^{d \times p}$ and $B := [b_1 \mid \dots \mid b_q] \in \mathbb{R}^{d \times q}$, we obtain a simple factorization
of our data model (2.2):
\[ x_i = \sum_{k=1}^{p} s_{i,k}\, a_k + \sum_{l=1}^{q} n_{i,l}\, b_l = A s_i + B n_i = [A \mid B] \begin{bmatrix} s_i \\ n_i \end{bmatrix}. \tag{2.3} \]
A major difficulty is that this factorization is usually unknown, forcing us to work in the data domain. For this reason, we shall consider a representing feature vector $\beta \in \mathbb{R}^d$ that tries to "mimic the properties" of the ground-truth signal vector $z$. As a first step, let us use (2.3) to redo the computation of (1.5) for our generalized models (M1) and (M2):
\[ y_i' := f(\langle x_i, \beta \rangle) = f\Big( \Big\langle [A \mid B] \begin{bmatrix} s_i \\ n_i \end{bmatrix}, \beta \Big\rangle \Big) = f\big( \underbrace{\langle s_i, A^{\mathrm T} \beta \rangle}_{=:\, s} + \underbrace{\langle n_i, B^{\mathrm T} \beta \rangle}_{=:\, n} \big) = f(s + n), \quad i = 1, \dots, m. \tag{2.4} \]
Note that the dependence of the signal term $s$ and the noise term $n$ on the index $i$ was omitted for the sake of convenience, since all samples are identically and independently distributed.

The additional summand $n$ in (2.4) might generate a certain model mismatch with respect to (2.1), which means that we have $y_i \ne y_i'$ for (some of) the samples $i = 1, \dots, m$. Our primal goal is therefore to ensure that the approximate observation rule of (2.4) matches as closely as possible with the true model of (2.1). In other words, we wish to choose $\beta$ in such a way that the deviation between $y_i$ and $y_i'$ becomes very small. Following the argumentation of Subsection 1.2, it is promising to pick some $\beta$ with $z = A^{\mathrm T} \beta$, which implies that $\langle s_i, z \rangle = \langle s_i, A^{\mathrm T} \beta \rangle = s$. The matrix $D = A^{\mathrm T} \in \mathbb{R}^{p \times d}$ should be again regarded as a feature dictionary, containing the fixed intrinsic parameters of our data model. To impose some additional structure on $z$, we may assume that the representing feature vector just belongs to a known convex coefficient set (or feature set) $K \subset \mathbb{R}^d$. More precisely, we choose
\[ \beta \in K_z := \{ \beta \in \mathbb{R}^d \mid \mu\beta \in K \text{ and } z = D\beta \} \subset \mathbb{R}^d, \tag{2.5} \]
where $\mu$ is a constant scaling factor that depends on $f$ and is specified later on in (2.11a). This particularly means that $\mu z \in D K$. The purpose of the assumption (2.5) is twofold: On the one hand, one would like to reduce the complexity of a potential estimator in order to avoid overfitting, which is especially important when $m \ll d$. Our main result in Subsection 2.4 will show that such a restriction of the solution space is immediately related to the number of required samples. On the other hand, we may intend to incorporate some heuristic constraints by an appropriate choice of $K$. For example, this could be sparsity, which we have already applied in Theorem 1.1 (with $K = R B_1^d$).

Taking (2.5) into account, we have $y_i = f(\langle s_i, z \rangle) = f(s)$ and $y_i' = f(s + n)$. Our goal is now to match $y_i'$ with $y_i$ by decreasing the impact of the additive noise term $n = \langle n_i, B^{\mathrm T} \beta \rangle$. Since $n_i \sim \mathcal{N}(0, I_q)$, we observe that $n \sim \mathcal{N}(0, \sigma^2)$ with $\sigma^2 := \|B^{\mathrm T} \beta\|_2^2$. Hence, it is quite natural to consider precisely those $\beta \in K_z$ that minimize the variance of $n$:

Definition 2.2
A feature vector $\beta_0 \in \mathbb{R}^d$ satisfying
\[ \beta_0 \in \operatorname*{argmin}_{\beta \in K_z} \|B^{\mathrm T} \beta\|_2 \tag{2.6} \]
is called an optimal representation of $z$ by the feature dictionary $D$. The associated noise variance is denoted by $\sigma^2 = \|B^{\mathrm T} \beta_0\|_2^2$. (One does not have to be concerned about the (non-)uniqueness of the minimizer in (2.6), since all results of this work hold for any optimal representation.)
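For a concrete feature set such as $K = R B_1^d$, the program (2.6) is convex and can be handed to a generic solver. The following minimal sketch uses the cvxpy library (our choice, not the paper's) and assumes that $A$, $B$, $z$, $\mu$, and $R$ are given and that $K_z$ is non-empty.

```python
import cvxpy as cp

def optimal_representation(A, B, z, mu, R):
    """Optimal representation beta0 of z by D = A^T in the sense of (2.6),
    for the feature set K = R * B_1^d (i.e., ||mu * beta||_1 <= R)."""
    d = A.shape[0]                            # A has shape (d, p)
    beta = cp.Variable(d)
    constraints = [A.T @ beta == z,           # z = D beta
                   cp.norm(mu * beta, 1) <= R]  # mu * beta in K
    problem = cp.Problem(cp.Minimize(cp.norm(B.T @ beta, 2)), constraints)
    problem.solve()
    return beta.value, problem.value          # beta0 and the noise level sigma
```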
We would like to emphasize that our concept of optimal representations is also underpinned by the fact that it provides the best possible signal-to-noise ratio (SNR) for the corrupted observation model (2.4). Indeed, the SNR of (2.4) can be defined as
\[ \mathrm{SNR} := \frac{\mathbb{V}[s]}{\mathbb{V}[n]} = \frac{\mathbb{V}[\langle s_i, A^{\mathrm T} \beta_0 \rangle]}{\mathbb{V}[\langle n_i, B^{\mathrm T} \beta_0 \rangle]} = \frac{\|A^{\mathrm T} \beta_0\|_2^2}{\|B^{\mathrm T} \beta_0\|_2^2} = \frac{\|z\|_2^2}{\sigma^2}, \tag{2.7} \]
where the signal vector $z$ is always assumed to be fixed. Therefore, maximizing the SNR is actually equivalent to minimizing $\sigma^2$.

Remark 2.3 (1) The previous paragraph has shown that our choice of $\beta_0$ is optimal with respect to the SNR. But it is not entirely clear whether such a notion always leads to the "best possible" representation. The above argumentation is somewhat heuristic in the sense that we did not specify how the distance between the outputs $y_i$ and $y_i'$ is measured. This will be made precise in the course of our main results, namely in (2.16). A more general approach towards optimal representations is briefly discussed in the concluding part of Section 4.

(2) The noise term $n$ in (2.4) generates a certain perturbation of the true observation model from (M1). From the perspective of signal recovery, this means that we aim to reconstruct a signal $z$ from samples $y_1, \dots, y_m$ while the available measurement process is inaccurate. Indeed, we are only allowed to invoke $y_i' = f(\langle x_i, \beta_0 \rangle) = f(\langle s_i, z \rangle + n)$, but not $y_i = f(\langle s_i, z \rangle)$. Consequently, one can view the noise term $n$ as another source of uncertainty that affects the underlying measurement rule—and by our specific choice of $\beta_0$ in Definition 2.2, we try to minimize its impact.

(3) When stating our main recovery result in Subsection 2.4, it will be very helpful to be aware of the following alternative strategy, which however leads exactly to Definition 2.2. For this, let us take the perspective that was already suggested in the previous subsection: Instead of regarding the noise factors $n_i = (n_{i,1}, \dots, n_{i,q})$ as separate variables, one can rather interpret them as additional signal variables that remain inactive in the observation process (2.1). This motivates us to consider the so-called extended signal domain $\mathbb{R}^{p+q}$ which serves as the ambient space of the joint factors $\tilde s_i := (s_i, n_i) \sim \mathcal{N}(0, I_{p+q})$.

In order to adapt our initial problem formulation to this setup, we first extend the signal vector $z$ in a trivial way by $\tilde z := (z, 0) \in \mathbb{R}^{p+q}$, leading to an "extended" observation scheme
\[ y_i = f(\langle s_i, z \rangle) = f\Big( \Big\langle \begin{bmatrix} s_i \\ n_i \end{bmatrix}, \begin{bmatrix} z \\ 0 \end{bmatrix} \Big\rangle \Big) = f(\langle \tilde s_i, \tilde z \rangle), \quad i = 1, \dots, m. \tag{2.8} \]
Similarly, let us introduce the extended dictionary
\[ \tilde D := \begin{bmatrix} D \\ N \end{bmatrix} = [A \mid B]^{\mathrm T} = [a_1 \mid \dots \mid a_p \mid b_1 \mid \dots \mid b_q]^{\mathrm T} \in \mathbb{R}^{(p+q) \times d}, \]
where $N := B^{\mathrm T}$ is referred to as the noise dictionary. Since $x_i = \tilde D^{\mathrm T} \tilde s_i$, the computation of (2.4) can be rewritten as follows:
\[ y_i' = f(\langle x_i, \beta \rangle) = f(\langle \tilde D^{\mathrm T} \tilde s_i, \beta \rangle) = f(\langle \tilde s_i, \tilde D \beta \rangle), \quad i = 1, \dots, m. \tag{2.9} \]
Compared to (2.4), we are not concerned with an additive noise term here anymore, but this comes along with the drawback that there might not exist an exact representation of the extended signal vector $\tilde z$ by $\tilde D$. Indeed, if $p + q > d$, the matrix $\tilde D$ is not surjective, so that it could be impossible to find $\beta \in \mathbb{R}^d$ with $\tilde z = \tilde D \beta$. (Hereafter, we agree with the convention that objects which are associated with the extended signal space are usually endowed with a tilde.)
Table 2.1: A summary of the notions introduced in Subsections 2.1 and 2.2.

Signal domain (SD) $\mathbb{R}^p$:
• Signal factors/variables, latent factors/variables (random): $s_i = (s_{i,1}, \dots, s_{i,p}) \sim \mathcal{N}(0, I_p)$, $i = 1, \dots, m$
• Signal vector, parameter vector, classifier: $z, \hat z \in \mathbb{R}^p$

Data domain (DD) $\mathbb{R}^d$:
• Sample data, input data, feature variables (random): $x_1, \dots, x_m \in \mathbb{R}^d$
• Noise factors/variables (random): $n_i = (n_{i,1}, \dots, n_{i,q}) \sim \mathcal{N}(0, I_q)$, $i = 1, \dots, m$
• Feature atoms: $a_1, \dots, a_p \in \mathbb{R}^d$, $A = [a_1 \mid \dots \mid a_p] \in \mathbb{R}^{d \times p}$
• Noise atoms: $b_1, \dots, b_q \in \mathbb{R}^d$, $B = [b_1 \mid \dots \mid b_q] \in \mathbb{R}^{d \times q}$

Optimal representations and extended signal domain (ESD) $\mathbb{R}^{p+q}$:
• Feature dictionary: $D = A^{\mathrm T} \in \mathbb{R}^{p \times d}$
• Noise dictionary: $N = B^{\mathrm T} \in \mathbb{R}^{q \times d}$
• Feature vector, coefficient vector: $\beta, \hat\beta \in \mathbb{R}^d$
• Coefficient set, feature set: $K \subset \mathbb{R}^d$ convex and bounded
• Signal term (random): $s = \langle s_i, A^{\mathrm T} \beta_0 \rangle \sim \mathcal{N}(0, \|z\|_2^2)$
• Noise term (random): $n = \langle n_i, B^{\mathrm T} \beta_0 \rangle \sim \mathcal{N}(0, \sigma^2)$
• Noise variance: $\sigma^2 = \|B^{\mathrm T} \beta_0\|_2^2$
• Extended signal factors/variables (random): $\tilde s_i = (s_i, n_i) \sim \mathcal{N}(0, I_{p+q})$, $i = 1, \dots, m$
• Extended signal vector: $\tilde z = (z, 0) \in \mathbb{R}^{p+q}$
• Extended dictionary: $\tilde D = \begin{bmatrix} D \\ N \end{bmatrix} \in \mathbb{R}^{(p+q) \times d}$

Observation domain (OD) $Y \subset \mathbb{R}$:
• Observation variables, observations, measurements, outputs (random): $y_1, \dots, y_m \in Y$
• Non-linearity (i.i.d. random function): $f \colon \mathbb{R} \to Y$

Restricting again to $\beta \in K_z$, we can therefore immediately conclude that the vector
\[ \tilde z' := \tilde D \beta = \begin{bmatrix} D\beta \\ N\beta \end{bmatrix} = \begin{bmatrix} z \\ N\beta \end{bmatrix} \]
has a non-vanishing noise component, i.e., $N\beta = B^{\mathrm T} \beta \ne 0$. In order to match (2.9) with (2.8) closely, we can try to minimize the distance between $\tilde z$ and $\tilde z'$:
\[ \beta_0 \in \operatorname*{argmin}_{\beta \in K_z} \|\tilde z - \tilde D \beta\|_2 = \operatorname*{argmin}_{\beta \in K_z} \Big\| \begin{bmatrix} z \\ 0 \end{bmatrix} - \begin{bmatrix} z \\ B^{\mathrm T} \beta \end{bmatrix} \Big\|_2 = \operatorname*{argmin}_{\beta \in K_z} \|B^{\mathrm T} \beta\|_2. \]
Not very surprisingly, this choice precisely coincides with the finding of (2.6).

It will turn out that the proofs of our main results in Subsection 5.2 do strongly rely on the idea of working in the extended signal domain. This particularly explains why the extended dictionary $\tilde D$ explicitly appears in the statements of Theorem 2.9 and Theorem 2.11. ♦
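The equivalence derived in Remark 2.3(3) is also easy to confirm numerically; the following few lines (ours) check that $\|\tilde z - \tilde D \beta\|_2 = \|B^{\mathrm T}\beta\|_2$ whenever $z = D\beta$.

```python
import numpy as np

rng = np.random.default_rng(2)
p, q, d = 5, 7, 30
A = rng.standard_normal((d, p))
B = rng.standard_normal((d, q))
Dtilde = np.vstack([A.T, B.T])                # extended dictionary, (p+q) x d

beta = rng.standard_normal(d)
z = A.T @ beta                                # z = D beta holds by construction
z_tilde = np.concatenate([z, np.zeros(q)])    # extended signal vector (z, 0)

lhs = np.linalg.norm(z_tilde - Dtilde @ beta) # distance in the extended domain
rhs = np.linalg.norm(B.T @ beta)              # norm of the noise component
assert np.isclose(lhs, rhs)
```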
Now, we aim to generalize the Lasso-estimator ( P R ) which was considered in the introduction. Thefirst question one may ask is whether there is a compelling reason why the square loss L sq ( (cid:104) x i , β (cid:105) , y i ) : = ( (cid:104) x i , β (cid:105) − y i ) is applied to fit a (non-)linear observation model, such as (M1). At least when some properties of themodel are known, for example that f : R → Y produces binary outputs, it might be more beneficial toreplace L sq by a specifically adapted loss function. There would be (empirical) evidence in the binarycase that a least-square fit is outperformed by logistic regression using L log ( (cid:104) x i , β (cid:105) , y i ) : = − y i · (cid:104) x i , β (cid:105) + log ( + exp ( − y i · (cid:104) x i , β (cid:105) )) .Hence, we shall allow for a general loss function from now on, L : R × Y → R , ( v , y ) (cid:55)→ L ( v , y ) ,which measures the residual between v = (cid:104) x i , β (cid:105) and y = y i in a very specific way. A second extensionof ( P R ) concerns the sparsity constraint (cid:107) β (cid:107) ≤ R . Here, we simply follow our general structuralassumption of (2.5) and ask for β ∈ K . This leads us to the following generalized estimator :min β ∈ R d m m ∑ i = L ( (cid:104) x i , β (cid:105) , y i ) subject to β ∈ K . ( P L , K )The objective functional ¯ L y ( β ) : = m ∑ mi = L ( (cid:104) x i , β (cid:105) , y i ) is often called the empirical loss function be-cause it actually tries to approximate the expected loss E [ L ( (cid:104) x i , β (cid:105) , y i )] . For this reason, ( P L , K ) istypically referred to as an empirical structured loss minimization in the literature.Next, let us specify some general properties of L that make the optimization program ( P L , K ) capableof signal estimation. The following conditions, originating from [11], are relatively mild and thereforepermit a fairly large class of loss functions:(L1) Regularity:
Let $L$ be twice continuously differentiable in the first variable. The first and second partial derivatives are then denoted by $L'(v, y) := \frac{\partial L}{\partial v}(v, y)$ and $L''(v, y) := \frac{\partial^2 L}{\partial v^2}(v, y)$. Furthermore, assume that $L'$ is Lipschitz continuous in the second variable, i.e., there exists a constant $C_{L'} > 0$ such that
\[ |L'(v, y) - L'(v, y')| \le C_{L'} \cdot |y - y'| \quad \text{for all } v \in \mathbb{R},\ y, y' \in Y. \]

(L2) Strict convexity: Let $L$ be strictly convex in the first variable, i.e., there exists some continuous function $F \colon \mathbb{R} \to (0, \infty)$ such that $L''(v, y) \ge F(v) > 0$ for all $(v, y) \in \mathbb{R} \times Y$. (The purpose of $F$ is to guarantee that, once the value of $v$ is fixed, $L''(v, y)$ can be bounded from below independently of $y$. If $Y$ is bounded and $L''$ is continuous in the second variable, then (L2) is already satisfied if $L''(v, y) > 0$ for all $(v, y) \in \mathbb{R} \times Y$.)

Remark 2.4 (1) By condition (L2), the estimator ($P_{L,K}$) becomes a convex program and efficient solvers are often available in practice. However, we shall not discuss computational issues and the uniqueness of solutions here. Fortunately, our results hold for any minimizer $\hat\beta$ of ($P_{L,K}$), even though this might lead to different outcomes $\hat z = D\hat\beta$.

(2) For the sake of simplicity, this work restricts to strictly convex loss functions, but we would like to emphasize that the framework of [11] contains an even more general condition on $L$, based on the concept of restricted strong convexity (RSC). Instead of requiring a strictly positive second derivative
on the entire domain, one can show that it is actually enough to have strong convexity locally around the origin:
\[ L''(v, y) \ge C_M \quad \text{for all } (v, y) \in [-M, M] \times Y, \tag{2.10} \]
where $C_M > 0$ is a constant that may depend on $M > 0$. ♦

Probably the most remarkable feature of the generalized estimator ($P_{L,K}$) is that it only takes the data pairs $\{(x_i, y_i)\}_{1 \le i \le m}$ as input and does neither require any explicit knowledge of (M1) nor of (M2). As already highlighted in the introduction, this might become crucial in practical applications where the exact model parameters are hardly known. But at some point of course, one needs to pay a certain price for using non-adaptive estimators. Indeed, our lack of information will impose a rescaling of the ground-truth signal $z$—we already saw this problem in (1.10) and (2.5)—and the constants of the error bounds are affected as well.

In a first step, let us investigate how to quantify these model uncertainties. The key idea is to regard the non-linearity $f$ as a specific type of noise that makes the observations deviate from a noiseless linear measurement process. Following [11], this can be captured by three model parameters $\mu$, $\rho$, and $\eta$, where $\mu$ is given by the solution of (2.11a):
\begin{align}
0 &= \mathbb{E}[L'(\mu g, f(g)) \cdot g], \tag{2.11a} \\
\rho^2 &:= \mathbb{E}[L'(\mu g, f(g))^2], \tag{2.11b} \\
\eta^2 &:= \mathbb{E}[L'(\mu g, f(g))^2 \cdot g^2], \tag{2.11c}
\end{align}
with $g \sim \mathcal{N}(0, 1)$.
Since $\mu$ is only implicitly given, its existence is not always guaranteed. There are in fact "incompatible" pairs of $L$ and $f$ for which (2.11a) cannot be satisfied, see [11, Ex. 3.4]. Therefore, we shall assume for the rest of this work that the loss function $L$ was chosen such that $\mu$ exists. Fortunately, such a choice is always possible; for instance, one can isolate $\mu$ in (2.11a) when using the square loss $L_{\mathrm{sq}}$. A verification of this claim is part of the following example, which illustrates the statistical meaning of the model parameters defined in (2.11). For a more extensive discussion of the interplay between different loss functions $L$ and non-linearities $f$, the reader is referred to [11].

Example 2.5 (1) Let us again consider the standard example of the square loss
\[ L_{\mathrm{sq}} \colon \mathbb{R} \times Y = \mathbb{R} \times \mathbb{R} \to \mathbb{R}, \quad (v, y) \mapsto \tfrac{1}{2}(v - y)^2. \]
Then the conditions (L1), (L2) are easily verified ($(L_{\mathrm{sq}})'' \equiv 1$) and the definitions of the model parameters in (2.11) simplify significantly:
\begin{align}
\mu &= \mathbb{E}[f(g) \cdot g], \tag{2.12a} \\
\rho^2 &= \mathbb{E}[(f(g) - \mu g)^2], \tag{2.12b} \\
\eta^2 &= \mathbb{E}[(f(g) - \mu g)^2 \cdot g^2]. \tag{2.12c}
\end{align}
Supposing that $\|z\|_2 = 1$, it is very helpful to regard $g := \langle s_i, z \rangle \sim \mathcal{N}(0, 1)$ as a noiseless linear output and $f(g) = f(\langle s_i, z \rangle)$ as a non-linear and noisy modification of it. In this context, we may interpret $\mu$ as the correlation between these two variables, whereas $\rho$ and $\eta$ essentially measure their deviation. (We would like to point out that this approach was originally suggested by Plan, Vershynin, and Yudovina in [20, 21], where the authors introduce the same parameters for the special case of the square loss $L_{\mathrm{sq}}$. Note that the parameter $\rho$ was denoted by $\sigma$ in [11], but this notation could be easily mixed up with the noise variance $\sigma^2$ defined previously.)
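Since (2.12) only involves a one-dimensional standard Gaussian, the parameters are easy to approximate by Monte Carlo. The following sketch (ours) does this for the noiseless binary model $f = \operatorname{sign}$ from (1.1), where the closed-form values are $\mu = \sqrt{2/\pi} \approx 0.798$ and $\rho^2 = \eta^2 = 1 - 2/\pi \approx 0.363$.

```python
import numpy as np

rng = np.random.default_rng(3)
g = rng.standard_normal(1_000_000)
f = np.sign                                      # binary model (1.1), no bit-flips

mu = np.mean(f(g) * g)                           # (2.12a)
rho2 = np.mean((f(g) - mu * g) ** 2)             # (2.12b)
eta2 = np.mean((f(g) - mu * g) ** 2 * g ** 2)    # (2.12c)

print(mu, np.sqrt(2 / np.pi))                    # both ~ 0.798
print(rho2, eta2, 1 - 2 / np.pi)                 # all  ~ 0.363
```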
This intuition is particularly underpinned by the standard case of noisy linear observations $y_i = f(\langle s_i, z \rangle) := \bar\mu \langle s_i, z \rangle + \xi_i$ where $\bar\mu > 0$ and $\xi_i$ is independent, mean-zero noise. We easily compute $\mu = \bar\mu$ and $\rho^2 = \eta^2 = \mathbb{E}[\xi_i^2]$, meaning that $\mu$ captures the rescaling caused by $f$, and $\rho^2 = \eta^2$ equals the variance of the additive noise term. Hence, it is reasonable to view the quotient $\mu / \max\{\rho, \eta\}$ as the SNR of the observation model. This stands in contrast to (2.7), where we were concerned with defining an SNR for the data model (M2). However, we will see in the discussion of Theorem 2.9 that both types of SNR affect the quality of our error estimates.

(2) Inspired by the prototype application of feature selection from MS data, one might wonder how the model parameters behave in the setup of binary outputs. As a basic example, we may assume a random bit-flip model: Let $y_i = f(\langle s_i, z \rangle) := \xi_i \cdot \operatorname{sign}(\langle s_i, z \rangle)$ where $\xi_i$ is an independent $\pm 1$ random variable with $\mathbb{P}[\xi_i = 1] =: p \in [0, 1]$. Using the square loss $L = L_{\mathrm{sq}}$ again, one has
\[ \mu = \mathbb{E}[\xi_i \cdot \operatorname{sign}(g) \cdot g] = \mathbb{E}[\xi_i] \cdot \mathbb{E}[|g|] = (2p - 1) \cdot \sqrt{\tfrac{2}{\pi}} \quad \text{and} \quad \rho^2 = \eta^2 = 1 - \tfrac{2}{\pi}(2p - 1)^2. \]
An interesting special case is $p = \tfrac{1}{2}$, implying that $\mu = 0$. Then, $f(g) = \xi_i \cdot \operatorname{sign}(\langle s_i, z \rangle)$ and $g = \langle s_i, z \rangle$ are perfectly uncorrelated and there is clearly no hope for a recovery of $z$ (cf. [19, Sec. III.A]). Interestingly, it will turn out that accurate estimates are still possible when the chance of a bit-flip is very close to $\tfrac{1}{2}$. ♦

We have already pointed out that the major objective of the structural constraint $\beta \in K$ in ($P_{L,K}$) is to incorporate some prior knowledge, so that the size of the solution space is eventually reduced. This is highly relevant in situations with small sample counts, where overfitting is often a serious issue. In order to establish a measure for the complexity of signal classes, let us introduce the powerful concept of (Gaussian) mean width. This is also well-known as Gaussian complexity in statistical learning theory (cf. [2]).
Definition 2.6
The (global Gaussian) mean width of a bounded subset $L \subset \mathbb{R}^n$ is given by
\[ w(L) := \mathbb{E}\Big[ \sup_{x \in L} \langle g, x \rangle \Big], \]
where $g \sim \mathcal{N}(0, I_n)$ is a standard Gaussian random vector. Moreover, we call the square of the mean width $d(L) := w(L)^2$ the effective dimension of $L$.

Fixing a random vector $g$, the supremum $\sup_{x \in L} \langle g, x \rangle$ essentially measures the spatial extent (width) of $L$ in the direction of $g$, and by taking the expected value, we obtain an average measure of size (cf. [11, Fig. 1]). In general, the mean width enjoys a certain robustness against small perturbations, i.e., slightly increasing $L$ will only slightly change $w(L)$. A further discussion of (geometric) properties of the mean width can be found in [19, Sec. II]. The following examples indicate that it is more convenient to consider the effective dimension when analyzing the complexity of signal sets.
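Effective dimensions are also straightforward to estimate by sampling $g$ and solving the inner supremum. A short sketch (ours) for the unit $\ell_1$-ball, where the supremum is just the dual norm, $\sup_{x \in B_1^n} \langle g, x \rangle = \|g\|_\infty$, so that $d(B_1^n)$ should be of the order $2 \log n$ for large $n$:

```python
import numpy as np

def effective_dimension_l1_ball(n, trials=20_000, seed=4):
    """Monte Carlo estimate of d(B_1^n) = w(B_1^n)^2; the supremum of
    <g, x> over the l1-ball equals the dual norm ||g||_inf."""
    G = np.random.default_rng(seed).standard_normal((trials, n))
    w = np.mean(np.max(np.abs(G), axis=1))
    return w ** 2

n = 1000
print(effective_dimension_l1_ball(n), 2 * np.log(n))  # same order: ~11 vs ~13.8
```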
Example 2.7 (1) Linear subspaces. Let $L \subset \mathbb{R}^n$ be a linear subspace of dimension $n'$. Then we have (cf. [20])
$$d(L \cap B_2^n) = w(L \cap B_2^n)^2 \asymp n'.$$
Note that $L$ is restricted to the Euclidean unit ball in order to obtain a bounded set. In fact, $d(\cdot)$ measures the algebraic dimension in this case, which particularly justifies why we speak of the effective dimension of a set.

(2) Polytopes and finite sets. Let $L' = \{z_1, \dots, z_k\} \subset \mathbb{R}^n$ be a finite collection of points. Thus, $L := \operatorname{conv}(L')$ is a polytope and we have (cf. [27, Ex. 1.3.8])
$$d(L) = d(L') \lesssim \big(\max_{1\le j\le k}\|z_j\|_2\big)^2 \cdot \log(k), \qquad (2.13)$$
where we have used that the mean width is invariant under taking the convex hull of a set ([19, Prop. 2.1]). It is remarkable that the bound of (2.13) depends only logarithmically on the number of vertices, although $L$ might have full algebraic dimension. Therefore, polytopes with few vertices are of relatively low complexity, which makes them good candidates for signal sets.

(3) (Approximately) sparse vectors. Sparsity has emerged as one of the key model assumptions in many modern applications. In its simplest form, one may consider the set of s-sparse vectors
$$S_s := \{z \in \mathbb{R}^n \mid \|z\|_0 \le s\}.$$
The main difficulty when dealing with a sparse signal $z \in S_s$ is that its support is unknown, though small. In fact, $S_s$ is a union of $\binom{n}{s}$-many $s$-dimensional subspaces and we need to identify the one that contains $z$. Fortunately, this lack of knowledge affects the effective dimension only by a logarithmic factor (cf. [19, Lem. 2.3]):
$$d(S_s \cap B_2^n) \asymp s\log(2n/s), \qquad (2.14)$$
where we have again restricted to the unit ball. (This asymptotic bound is well established in the theory of compressed sensing and basically corresponds to the minimal number of measurements that is required to recover $s$-sparse vectors; see [9] for example.) The set $S_s$ (or $S_s \cap B_2^n$) is usually not applied in practice because it would impose a non-convex constraint, and moreover, real-world signals are often not exactly sparse. Instead, one rather tries to design a certain convex relaxation. For this purpose, let us apply the Cauchy-Schwarz inequality for some $z \in S_s \cap B_2^n$:
$$\|z\|_1 \le \sqrt{\|z\|_0}\cdot\|z\|_2 \le \sqrt{s},$$
that means, $S_s \cap B_2^n \subset \sqrt{s}B_1^n$. Since $L := \sqrt{s}B_1^n$ is a polytope with $2n$ vertices, we obtain by part (2) that
$$d(L) \lesssim s\log(2n),$$
which is almost as good as (2.14). This observation verifies that a scaled $\ell_1$-ball, as used for the standard Lasso $(P_R)$, can serve as an appropriate convex and robust surrogate for sparse vectors.
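As a numerical companion to part (3), one may check that the effective dimension of the scaled $\ell_1$-ball indeed grows like $s\log(2n)$ up to moderate constants. For $L = \sqrt{s}B_1^n$, the supremum in the mean width reduces to $\sqrt{s}\cdot\max_j |g_j|$, which makes the estimate cheap to compute (again a sketch under our own naming conventions):

```python
import numpy as np

rng = np.random.default_rng(1)
n, s, trials = 500, 10, 5000

# For L = sqrt(s) * B_1^n, the supremum sup_{x in L} <g, x> equals
# sqrt(s) * max_j |g_j|, so the mean width is cheap to estimate.
g = rng.standard_normal((trials, n))
w_est = np.mean(np.sqrt(s) * np.max(np.abs(g), axis=1))

print(w_est**2)           # Monte-Carlo estimate of d(sqrt(s) * B_1^n)
print(s * np.log(2 * n))  # reference value s * log(2n), up to constants
```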
(4) Representations in a dictionary. In this work, the most important classes of signals are those arising from dictionary representations. Motivated by the previous subsection, let $D \in \mathbb{R}^{n\times d}$ be a dictionary and assume that the class-of-interest is given by $L = DK$ for a certain coefficient set $K \subset \mathbb{R}^d$. A basic application of Slepian's inequality [9, Lem. 8.25] provides a general, yet non-optimal, bound on the effective dimension:
$$d(L) = d(DK) \lesssim \|D\|^2 \cdot d(K).$$
This estimate becomes particularly bad when the operator norm of $D$ is large, which is unfortunately the case for the highly overcomplete dictionaries we consider. However, for many choices of coefficient sets there are sharper bounds available. For example, if $K = \operatorname{conv}\{z_1,\dots,z_k\} \subset B_2^d$ is a polytope, so is $L = DK = \operatorname{conv}\{Dz_1,\dots,Dz_k\}$. Then we obtain
$$d(L) = d(DK) \lesssim D_{\max}^2 \cdot \log(k) \qquad (2.15)$$
by part (2), where $D_{\max} := \max_{1\le j\le k}\|Dz_j\|_2 \le \|D\|$. ♦
In what follows, let us adopt the notation introduced in Subsections 2.1-2.3; in particular, recall Table 2.1 and the definitions of the model parameters $\mu$, $\rho$, $\eta$ in (2.11). Moreover, the feature vector $\beta \in \mathbb{R}^d$ is always understood to be an optimal representation of $z \in \mathbb{R}^p$ (according to Definition 2.2). Since the resulting representation in the data domain (2.4) is usually not exact, we capture this model mismatch by means of the following noise parameter:
$$\varepsilon := \Big(\frac{1}{m}\sum_{i=1}^m |y_i - y''_i|^2\Big)^{1/2} \ge 0, \qquad (2.16)$$
where
$$y''_i := f\bigg(\frac{\langle x_i, \beta\rangle}{\sqrt{\|z\|_2^2 + \|N\beta\|_2^2}}\bigg) = f\bigg(\frac{s+n}{\sqrt{\|z\|_2^2 + \sigma^2}}\bigg), \quad i = 1,\dots,m. \qquad (2.17)$$
Note that $\varepsilon$ might be a random variable, depending on both the non-linearity $f$ as well as on the SNR between $s$ and $n$ (cf. (2.7)).
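In concrete experiments, the noise parameter (2.16) is directly computable once the modified outputs $y''_i$ are available. A minimal Python helper (hypothetical; the binary example in the comment anticipates Example 3.2(2)) could look as follows:

```python
import numpy as np

def noise_param(y: np.ndarray, y_pp: np.ndarray) -> float:
    """Empirical noise parameter eps = ((1/m) * sum |y_i - y''_i|^2)^(1/2), cf. (2.16)."""
    return float(np.sqrt(np.mean((y - y_pp) ** 2)))

# For binary outputs in {-1, +1}, |y_i - y''_i|^2 is either 0 or 4, so eps^2
# equals 4 times the fraction of mismatched signs.
y    = np.array([1, -1,  1,  1, -1, 1])
y_pp = np.array([1,  1,  1, -1, -1, 1])
print(noise_param(y, y_pp) ** 2)  # 1.3333...
print(4 * np.mean(y != y_pp))     # 1.3333...
```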
The modified outputs $y''_i$ arise from rescaling $y'_i = f(s+n)$ by a factor of $\tau := \sqrt{\|z\|_2^2 + \sigma^2}$, which guarantees that $(s+n)/\tau$ has unit variance. Unfortunately, such a technicality is necessary to make the framework of [11] applicable (see also Remark 5.2(2)). Since the additional factor of $\tau$ is nonessential for the derivation of Subsection 2.2, it was omitted there for the sake of clarity. But it could have been easily incorporated by replacing the feature vector $\beta$ with a dilated version $\tau^{-1}\beta$. ♦

We are now ready to state the main result of this work. Its proof is postponed to Subsection 5.2.
Theorem 2.9
Suppose that the input data pairs $\{(x_i, y_i)\}_{1\le i\le m}$ obey (M1) and (M2). Furthermore, let (L1) and (L2) hold true. We assume that $\|z\|_2 = 1$ and $\mu z \in DK$ for a bounded, convex coefficient set $K \subset \mathbb{R}^d$ containing the origin. Then there exists a constant of the form $C_{\rho,\eta} = C\cdot\max\{\rho,\eta\} > 0$ with $C > 0$ such that the following holds with high probability: If the number of samples satisfies
$$m \ge C\cdot d(\tilde DK), \qquad (2.18)$$
then, setting $\hat z := D\hat\beta$ for any minimizer $\hat\beta$ of $(P_{L,K})$, we have
$$\|\hat z - \lambda z\|_2 = \|D\hat\beta - \lambda D\beta\|_2 \le C_{\rho,\eta}\Big(\Big(\frac{d(\tilde DK)}{m}\Big)^{1/4} + \varepsilon\Big), \qquad (2.19)$$
where $\lambda := \frac{\mu}{\sqrt{1+\sigma^2}} = \frac{\mu}{\sqrt{1+\|N\beta\|_2^2}}$. (More precisely, the constant $C$ may depend on the "probability of success" as well as on the so-called RSC-constant of $L$; cf. Remark 2.4(2).)

The statement of Theorem 2.9 gives a quite general answer to our initial challenge (Q1). Indeed, the error bound (2.19) shows how one can use an estimated feature vector $\hat\beta$ to approximate the ground-truth signal $z$. A very remarkable observation is that the applied estimator $(P_{L,K})$ takes only the raw data $\{(x_i, y_i)\}_{1\le i\le m}$ as input. This indicates that, at least from a theoretical perspective, effective variable selection can still be achieved without any knowledge of the model specifications in (M1) and (M2). However, this disregard is reflected by the fact that $z$ needs to be appropriately rescaled (by $\lambda$) and that the error (semi-)metric explicitly depends on the (unknown) feature dictionary $D$. The above result therefore does not automatically resolve (Q2), unless the behavior of $D$ is well understood or a good approximation of it is available. We shall come back to these issues in Section 3, where some applications and consequences of Theorem 2.9 are discussed.

Let us now analyze under which conditions (2.19) establishes a meaningful error estimate:
• Sample count. If $m$ (greatly) exceeds $d(\tilde DK)$, which is already required by (2.18), the fraction $d(\tilde DK)/m$ becomes small. Thus, the effective dimension of $\tilde DK$ can be regarded as a threshold for the number of required observations. This explains why it is very important to impose a certain low-complexity constraint on $z$ when only few samples are available. A successful application of Theorem 2.9 therefore always asks for an appropriate upper bound on $d(\tilde DK)$. Later on, we will illustrate how this can be done in the case of sparse representations (see Subsection 3.1 and the proof of Theorem 1.1).
• Signal-to-noise ratio.
The accuracy of (2.19) is highly sensitive to the choice of $\beta$, since it affects both the scaling factor $\lambda$ and the noise parameter $\varepsilon$. By (2.6), we have defined $\beta$ precisely such that the noise variance $\sigma^2$ is minimized, i.e., $|\lambda|$ is guaranteed to be as large as possible (among all representations). Consequently, an undesirable situation occurs if $|\lambda|$ is still very small. Then the rescaled vector $\lambda z$ is close to $0$ and (2.19) provides a rather poor bound. Such an implication is actually not very surprising: if even an optimal representation of $z$ suffers from a low SNR, the underlying problem of variable selection might be intractable.
The impact of $\varepsilon$, on the other hand, also depends on the (unknown) output function $f$. Thus, it is quite difficult to make a general statement on its behavior. But at least for the popular cases of noisy linear and binary responses, we will provide simple high-probability bounds in Example 3.2, which again depend on the SNR.
• Model parameters.
We have illustrated in Example 2.5(1) that $\mu$ should be seen as a measure for the expected rescaling (of $\langle s_i, z\rangle$) caused by $f$. Since this "scaling effect" is completely unknown to the estimator $(P_{L,K})$, we need to replace the actual signal-of-interest $z$ by a dilated version $\mu z$ in Theorem 2.9. The necessity of such a technical step already becomes clear in the noiseless linear case when the outputs are multiplied by a fixed but unknown scalar; see again Example 2.5(1) with $\xi_i \equiv 0$. The other two model parameters $\rho$ and $\eta$, capturing the "variance" of $f$, are part of $C_{\rho,\eta} = C\cdot\max\{\rho,\eta\}$. Dividing (2.19) by $|\mu|$, we conclude that the resulting constant essentially scales like $\max\{\rho,\eta\}/|\mu|$. Hence, the quality of the error estimate improves if $|\mu|$ is relatively large compared to $\rho$ and $\eta$. Interestingly, this observation is closely related to our findings in Example 2.5(1), where we have identified the quotient $\mu/\max\{\rho,\eta\}$ as the SNR of (M1).

Finally, we would like to emphasize that Theorem 2.9 also satisfies our desire for a general framework for feature selection, as formulated in (Q3). The above setting in fact allows for a unified treatment of non-linear single-index models, arbitrary convex coefficient sets, and strictly convex loss functions. Concerning the asymptotic behavior of $(P_{L,K})$, it has even turned out that the actual capability of approximating $z$ depends neither on the (noisy) modifier $f$ nor on the choice of the loss function $L$; see also [11, Sec. 2] for a more extensive discussion.

Remark 2.10 (1) A careful analysis of the proof of Theorem 2.9 shows that the optimality property of $\beta$ from (2.6) is never used. Indeed, the statement of Theorem 2.9 actually holds for any representation of $z$, i.e., any $\beta \in K_z$. This is particularly relevant to applications where the feature vector $\beta$ needs to be explicitly constructed. But in such a situation, the resulting error bound (2.19) might of course be less significant, meaning that $\lambda$ decreases and $\varepsilon$ increases. (This can be seen more easily when the inequality (2.19) is first divided by $|\lambda|$.)

(2) A certain normalization of the signal vector, e.g., $\|z\|_2 = 1$, is inevitable in general. For example, it is obviously impossible to recover the magnitude of $z$ if the outputs obey a binary rule $y_i = \operatorname{sign}(\langle s_i, z\rangle)$. The same problem occurs in the noisy linear case of Example 2.5(1) when there is an unknown scaling parameter $\bar\mu > 0$. For a structured signal $\mu z \in DK$, the unit-norm assumption can always be achieved by considering $z/\|z\|_2$ instead of $z$. But this would imply that the coefficient set $K$ needs to be rescaled by $\|z\|_2^{-1}$ in order to satisfy the hypotheses of Theorem 2.9. Since $\|z\|_2$ is usually unknown in real-world applications, it is very common to introduce an additional scaling parameter $R > 0$ and to invoke $RK$ for Theorem 2.9; note that this approach was already applied in Theorem 1.1. If $R$ is sufficiently large, we can always ensure $\mu z/\|z\|_2 \in R\,DK$, but at the same time, the effective dimension $d(R\tilde DK) = R^2\cdot d(\tilde DK)$ starts to grow. Consequently, the significance of Theorem 2.9 is highly sensitive to the choice of $R$. Finding an appropriate scaling parameter is a deep issue in general, which arises (in equivalent forms) almost everywhere in structured and constrained optimization. Especially in the machine learning literature, numerous approaches have been developed to cope with this challenge; for an overview see [12]. ♦

Before proceeding with practical aspects of our framework in Section 3, let us discuss some extensions of Theorem 2.9 which could be interesting for various real-world applications. A first refinement concerns the limited scope of the output procedure in (M1). Indeed, the true observations $y_1, \dots, y_m$ are supposed to precisely follow a simple single-index model, thereby implying that one is always able to determine an appropriate parameter vector $z$ for (2.1). In practice, however, it could happen that such a choice of $z$ is actually impossible. One would rather assume instead that there exists a vector $z$, e.g., a structured Bayes estimator, which approximates the truth such that $y_i \approx f(\langle s_i, z\rangle)$ holds for (many) $i = 1, \dots, m$.
This motivates us to permit a certain degree of adversarial model mismatch, which could even be systematic and deterministic. Moreover, we shall drop the assumption of independent signal factors $s_{i,1}, \dots, s_{i,p}$, a hypothesis that is rarely satisfied in applications, and allow for (low) correlations between them. These desiderata give rise to the following extension of (M1):

(M1') We consider a single-index model
$$\bar y_i := f(\langle s_i, z\rangle), \quad i = 1,\dots,m, \qquad (2.20)$$
where $z \in \mathbb{R}^p$ and $s_1, \dots, s_m \sim N(0, \Sigma)$ i.i.d. with positive definite covariance matrix $\Sigma \in \mathbb{R}^{p\times p}$. As before, $f\colon \mathbb{R}\to\mathcal{Y}$ is a (possibly random) function which is independent of $s_i$, and $\mathcal{Y}\subset\mathbb{R}$ is closed. The true observations $y_1, \dots, y_m \in \mathcal{Y}$ are allowed to differ from (2.20), and we assume that their deviation is bounded by means of
$$\Big(\frac{1}{m}\sum_{i=1}^m |y_i - \bar y_i|^2\Big)^{1/2} \le \varepsilon_0$$
for some adversarial noise parameter $\varepsilon_0 \ge 0$. The noise parameter $\varepsilon$ from (2.16) now takes the form
$$\varepsilon = \Big(\frac{1}{m}\sum_{i=1}^m |\bar y_i - y''_i|^2\Big)^{1/2}.$$

Theorem 2.11
Suppose that the input data pairs $\{(x_i, y_i)\}_{1\le i\le m}$ obey (M1') and (M2). Furthermore, let (L1) and (L2) hold true. We assume that $\|\sqrt{\Sigma}z\|_2 = 1$ and $\mu z \in DK$ for a bounded, convex coefficient set $K \subset \mathbb{R}^d$ containing the origin. (More precisely, we also need to assume that $D$ is surjective and $K$ is full-dimensional. Note that, since $\Sigma\in\mathbb{R}^{p\times p}$ is positive definite, there exists a unique, positive definite matrix root $\sqrt\Sigma\in\mathbb{R}^{p\times p}$ with $\sqrt\Sigma\cdot\sqrt\Sigma = \Sigma$.) Then, with the same constant $C_{\rho,\eta} = C\cdot\max\{\rho,\eta\} > 0$ as in Theorem 2.9, the following holds with high probability: Let the number of samples satisfy
$$m \ge C\cdot d(\tilde D_\Sigma K), \qquad\text{where}\quad \tilde D_\Sigma := \begin{bmatrix}\sqrt{\Sigma}D\\ N\end{bmatrix}.$$
Then, setting $\hat z := D\hat\beta$ for any minimizer $\hat\beta$ of $(P_{L,K})$, we have
$$\|\sqrt{\Sigma}(\hat z - \lambda z)\|_2 = \|\sqrt{\Sigma}(D\hat\beta - \lambda D\beta)\|_2 \le C_{\rho,\eta}\Big(\Big(\frac{d(\tilde D_\Sigma K)}{m}\Big)^{1/4} + \varepsilon + \varepsilon_0\Big). \qquad (2.21)$$

The proof is again postponed to Subsection 5.2; note that the statements of Theorem 2.9 and Theorem 2.11 coincide if $\Sigma = I_p$ and $\varepsilon_0 = 0$.
The impact of the adversarial noise is reflected by an additive error term $\varepsilon_0$ in (2.21). If the mismatch between the proper model (2.20) and the true observations $y_1, \dots, y_m$ is not too large, one should be able to control the value of $\varepsilon_0$. But similarly to handling $\varepsilon$, this also relies on the specific behavior of the non-linear modifier $f$. Another difference to Theorem 2.9 is the additional deformation of $z$ and $\tilde D$ by $\sqrt\Sigma$, which is due to the non-trivial covariance structure of the latent factors $s_i$. Here, we may run into problems if some of the signal variables are almost perfectly correlated with each other, so that $\sqrt\Sigma$ is close to being singular. In such a situation, a good bound on $\|\sqrt\Sigma(\hat z - \lambda z)\|_2$ does not automatically imply that we also have $\hat z \approx \lambda z$. This observation underpins our intuition that, without prior information, (almost) collinear features are considered to be equally important, and it is therefore unclear which one should be selected. Finally, it is again worth mentioning that Theorem 2.11 continues the fundamental philosophy of this work in the sense that the applied estimator $(P_{L,K})$ does not require any knowledge of $\Sigma$ and $\varepsilon_0$.

Remark 2.12 Exploiting the abstract framework of [11], the above results could be generalized even further. This particularly concerns the asymptotic error rate of $O(m^{-1/4})$ in (2.21), which is known to be non-optimal for some choices of $K$ (see [21, Sec. 4]). The key step towards optimal error bounds is to introduce a localized version of the mean width, capturing the geometric structure of $DK$ in a (small) neighborhood around the desired signal $\mu z$. This approach could be easily incorporated in Theorem 2.11 by using [11, Thm. 2.8]. But since the focus of this work is rather on a mathematical theory of feature selection, we omit this extension here and leave the details to the interested reader. ♦
3. Consequences and Applications
In this part, we focus on the application of our theoretical framework to real-world problems. Subsection 3.1 illustrates that standardizing the data can be helpful (and sometimes even necessary) to obtain significant error bounds from Theorem 2.9. In Subsection 3.2, we shall return to our prototype example of proteomics analysis and prove a rigorous guarantee for feature extraction from MS data. Finally, the general scope of our results is discussed in Subsection 3.3, including some rules of thumb for when Theorem 2.9 implies a practice-oriented statement.
The discussion of Theorem 2.9 has shown that the quality of the error estimate (2.19) relies heavily on controlling the effective dimension $d(\tilde DK)$ as well as the noise parameter $\varepsilon$. For illustration purposes, let us focus on the important case of sparse representations here and assume that $K = RB_1^d$ with some "appropriately chosen" scaling factor $R > 0$. (In the following, we usually refer to Theorem 2.9 as our main result. But unless stated otherwise, similar conclusions hold for its generalization, Theorem 2.11, as well.)
Adapting the notation from Subsection 1.3, we denote the columns of the dictionaries $D$, $N$, $\tilde D$ by $D_j$, $N_j$, $\tilde D_j$, respectively, $j = 1, \dots, d$. Hence,
$$\tilde D = [\tilde D_1 | \cdots | \tilde D_d] = \begin{bmatrix} D_1 & \dots & D_d \\ N_1 & \dots & N_d\end{bmatrix} = \begin{bmatrix} D\\ N\end{bmatrix}.$$
At this point, it is particularly useful to think of $\tilde D_j$ as a dictionary atom that determines the contribution of each single feature/noise atom (i.e., the rows of $\tilde D$) to the $j$-th feature variable of the data.

Since $B_1^d = \operatorname{conv}\{\pm e_1, \dots, \pm e_d\}$, where $e_1, \dots, e_d \in \mathbb{R}^d$ are the canonical unit vectors of $\mathbb{R}^d$, the estimate (2.15) of Example 2.7(4) yields
$$d(\tilde DK) = R^2\cdot d(\tilde DB_1^d) \lesssim R^2\cdot\tilde D_{\max}^2\cdot\log(2d) \qquad (3.1)$$
with $\tilde D_{\max} := \max_{1\le j\le d}\|\tilde D_j\|_2$. This implies that the asymptotic behavior of the effective dimension can be controlled by means of the maximal column norm of $\tilde D$. In order to establish a bound on $\tilde D_{\max}$, let us first compute the variance of each feature variable of the data $x_i = (x_{i,1}, \dots, x_{i,d})$. Using the factorization (2.3) and $A = D^T$, $B = N^T$, we obtain
$$\mathbb{V}[x_{i,j}] = \mathbb{V}[D_j^T s_i + N_j^T n_i] = \mathbb{V}[\langle D_j, s_i\rangle] + \mathbb{V}[\langle N_j, n_i\rangle] = \|D_j\|_2^2 + \|N_j\|_2^2 = \|\tilde D_j\|_2^2 \qquad (3.2)$$
for $j = 1, \dots, d$. Regarding (3.1), it is therefore beneficial to standardize the data, meaning that $x_{i,j}$ gets replaced by $x_{i,j}/\sqrt{\mathbb{V}[x_{i,j}]}$. Every feature variable thereby achieves unit variance and the resulting dictionary atoms $\tilde D_j/\|\tilde D_j\|_2$ are normalized. And once again, it is crucial that such a simple step requires only the very raw data and can be realized without any knowledge of $\tilde D$.
In the following, let us assume without loss of generality that $\mathbb{V}[x_{i,j}] = 1$, i.e., $\tilde D_{\max} = \max_{1\le j\le d}\|\tilde D_j\|_2 = \max_{1\le j\le d}\mathbb{V}[x_{i,j}]^{1/2} = 1$, so that
$$d(\tilde DK) \lesssim R^2\cdot\log(2d). \qquad (3.3)$$
Except for the log-factor, this estimate does not depend on the underlying data model (M2). Hence, we can draw the remarkable conclusion that the estimator $(P_{L,K})$ of Theorem 2.9 works like an oracle, in the sense that it performs almost as well as if the signal factors $s_i$ were explicitly known. But note that the parameters $\lambda$ and $\varepsilon$ may still have a certain impact on the approximation quality, as we will analyze below.

Remark 3.1 (1) The above argumentation can be easily adapted to Theorem 2.11, which allows for anisotropic signal variables $s_i \sim N(0, \Sigma)$. To see this, one simply needs to repeat the computations with $D$ and $\tilde D$ replaced by $\sqrt{\Sigma}D$ and $\tilde D_\Sigma$, respectively.

(2) In most practical applications, we do not have access to the random distribution of $x_{i,j}$. Thus, $\mathbb{V}[x_{i,j}]$ cannot be exactly computed and an exact standardization becomes infeasible. A typical way out is to perform an empirical standardization instead, which makes use only of the available sample data: For a fixed index $j \in \{1, \dots, d\}$, calculate the empirical mean $\bar x_j := \frac{1}{m}\sum_{i=1}^m x_{i,j}$ as well as the empirical variance $\bar\sigma_j^2 := \frac{1}{m}\sum_{i=1}^m (x_{i,j} - \bar x_j)^2$. Then, the initial component $x_{i,j}$ is replaced by
$$\bar x_{i,j} := \frac{x_{i,j} - \bar x_j}{\bar\sigma_j}$$
for all $i = 1, \dots, m$ and $j = 1, \dots, d$. (Since the data is i.i.d., all results will be independent of the actual sample index $i = 1, \dots, m$. We also implicitly assume here that $\mathbb{V}[x_{i,j}] \ne 0$; but this is no severe restriction, since a zero-variance variable would not be relevant to feature selection anyway.)
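The empirical standardization just described is a one-line operation on the sample matrix. A small Python sketch (our own, with invented test data; it assumes no column has zero empirical variance, in line with the assumption $\mathbb{V}[x_{i,j}]\ne 0$ above) reads:

```python
import numpy as np

def standardize(X: np.ndarray) -> np.ndarray:
    """Empirical standardization of a sample matrix X (rows = samples x_i):
    subtract the empirical mean and divide by the empirical standard
    deviation, separately for every feature variable (column)."""
    x_bar = X.mean(axis=0)          # empirical means
    sigma_bar = X.std(axis=0)       # empirical standard deviations
    return (X - x_bar) / sigma_bar  # assumes no zero-variance column

X = np.random.default_rng(4).normal(loc=2.0, scale=5.0, size=(100, 8))
X_std = standardize(X)
print(X_std.mean(axis=0).round(8))  # approx. 0 in every column
print(X_std.var(axis=0).round(8))   # 1 in every column (up to rounding)
```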
Since the law of large numbers yields $\bar x_j \to \mathbb{E}[x_{i,j}]$ and $\bar\sigma_j^2 \to \mathbb{V}[x_{i,j}]$ as $m\to\infty$, we can conclude that $\bar\sigma_j^2 \approx \mathbb{V}[x_{i,j}] = \|\tilde D_j\|_2^2$. Consequently, an empirical standardization implies $\tilde D_{\max} \approx 1$ for $m$ sufficiently large, so that the bound of (3.3) is at least approximately true.

(3) Due to the computational inaccuracies described in the previous remark, one could ask whether simply rescaling the data could be a better alternative to standardization. Unfortunately, such an approach can sometimes cause serious problems: Since $\|z\|_2 = 1$ and $z = D\beta$, it might happen that several entries of $\beta$ must be very large in magnitude (when the corresponding atom norm $\|D_j\|_2$ is small), whereas others must be small (when $\|D_j\|_2$ is large). The first case might involve an enlargement of the parameter $R > 0$ in order to ensure $\mu\beta \in K = RB_1^d$, but at the same time, the estimate of (3.1) becomes worse. This issue could be easily fixed by rescaling the data $x_i \mapsto \lambda\cdot x_i$ with a sufficiently large $\lambda > 0$. However, the value of $\tilde D_{\max}$ would then blow up (due to the large atom norms of the second case), which makes (3.1) meaningless again.
Instead, a standardization aims to circumvent this drawback by adjusting each atom separately. But even then, we could run into difficulties if the SNR of an individual feature variable is too low. Indeed, if
$$\frac{\|D_j\|_2^2}{\|N_j\|_2^2} \ll 1, \qquad\text{where}\quad \|D_j\|_2^2 + \|N_j\|_2^2 = \|\tilde D_j\|_2^2 = \mathbb{V}[x_{i,j}] = 1, \qquad (3.4)$$
the value of $\|D_j\|_2$ must be very small. Thus, we are actually in the same situation as before (the first case), leading to a larger value of $R$. Problems may therefore occur if some of the non-zero components of the representing feature vector $\beta$ suffer from (3.4). This observation gives a further justification of Definition 2.2, since it tries to avoid a low SNR by minimizing the noise part $\|N\beta\|_2$.

(4) For general coefficient sets $K$, the situation becomes more difficult. Even if $K$ is a polytope, it is no longer guaranteed that the relevant factor in (2.15) is bounded by the maximal column norm of $\tilde D$. For that reason, different strategies may be required to gain control of the asymptotic behavior of $d(\tilde DK)$. Which type of preprocessing is appropriate clearly depends on the specific application, but as illustrated above, a careful analysis of the effective dimension could serve as a promising guideline here. ♦

The previous considerations have merely focused on the first summand of the error bound (2.19). However, in order to achieve a meaningful interpretation of Theorem 2.9, it is equally important to bound the noise parameter $\varepsilon$, given by (2.16). In this situation, the SNR plays a more significant role. Since one has $\varepsilon = 0$ in the noiseless situation, we can expect that $\varepsilon$ is still small if the SNR is sufficiently high. The following example verifies this intuition for the standard cases of linear and binary outputs. In the sequel, the noise variance $\sigma^2$ ($= \mathrm{SNR}^{-1}$) should be regarded as a fixed (small) number, which is determined by the choice of $\beta$.
Example 3.2 (1) Noisy linear observations. Similarly to Example 2.5(1), let us assume that $f(g) := g + \xi$ with additive mean-zero noise $\xi$. Recalling $s = \langle s_i, z\rangle$, $n = \langle n_i, N\beta\rangle$ and (2.17), the observations take the following form:
$$y_i = s + \xi_i \qquad\text{and}\qquad y''_i = \frac{s+n}{\sqrt{1+\sigma^2}} + \xi_i.$$
Since $s \sim N(0,1)$ and $n \sim N(0,\sigma^2)$ are independent, we have
$$Y_i := y_i - y''_i = s - \frac{s+n}{\sqrt{1+\sigma^2}} = \Big(1 - \frac{1}{\sqrt{1+\sigma^2}}\Big)s - \frac{1}{\sqrt{1+\sigma^2}}n \sim N(0, \bar\sigma^2)$$
with
$$\bar\sigma^2 := \Big(1 - \frac{1}{\sqrt{1+\sigma^2}}\Big)^2 + \frac{\sigma^2}{1+\sigma^2} = 2 - \frac{2}{\sqrt{1+\sigma^2}}.$$
(The SNR is defined according to (2.7), that is, $\mathrm{SNR} = 1/\sigma^2$.) Thus $\mathbb{E}[Y_i^2] = \bar\sigma^2$, and by [9, Prop. 7.5], we conclude that $Y_i^2$ is subexponential:
$$\mathbb{P}[|Y_i^2 - \bar\sigma^2| \ge u] \le \sqrt{e}\cdot\exp\Big(-\frac{u}{2\bar\sigma^2}\Big) \qquad\text{for all } u \ge \bar\sigma^2.$$
Finally, we apply Bernstein's inequality ([9, Cor. 7.32] with $\beta = \sqrt{e}$, $\kappa = (2\bar\sigma^2)^{-1}$, and $t = \bar\sigma^2 m$) to obtain a high-probability bound on the noise parameter:
$$\mathbb{P}[\varepsilon \le \sqrt{2}\,\bar\sigma] = \mathbb{P}\Big[\frac{1}{m}\sum_{i=1}^m Y_i^2 \le 2\bar\sigma^2\Big] \ge \mathbb{P}\Big[\Big|\frac{1}{m}\sum_{i=1}^m (Y_i^2 - \bar\sigma^2)\Big| \le \bar\sigma^2\Big] \ge 1 - 2\exp(-c\,m)$$
for an absolute constant $c > 0$. Observing that $\bar\sigma = O(\sigma) = O(\mathrm{SNR}^{-1/2})$ as $\sigma \to 0$, such a bound could now be easily incorporated into the statement of Theorem 2.9.
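The closed-form variance $\bar\sigma^2 = 2 - 2/\sqrt{1+\sigma^2}$ is again easy to validate by simulation. A short sketch (ours; NumPy assumed):

```python
import numpy as np

rng = np.random.default_rng(5)
sigma, m = 0.4, 1_000_000

s = rng.standard_normal(m)          # s ~ N(0, 1)
n = sigma * rng.standard_normal(m)  # n ~ N(0, sigma^2)

# Y_i = y_i - y''_i for f(g) = g + xi; the additive noise xi cancels out.
Y = s - (s + n) / np.sqrt(1 + sigma**2)

print(np.mean(Y**2))                  # empirical eps^2
print(2 - 2 / np.sqrt(1 + sigma**2))  # theoretical variance, approx. 0.1430
```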
(2) Binary observations. We now consider $f(g) := \operatorname{sign}(g)$, which implies
$$y_i = \operatorname{sign}(s) \qquad\text{and}\qquad y''_i = \operatorname{sign}\Big(\frac{s+n}{\sqrt{1+\sigma^2}}\Big) = \operatorname{sign}(s+n).$$
These are Bernoulli-type variables, so that $\varepsilon^2$ actually measures the fraction of bit-flips (up to a factor of 4) caused by the noise term $n$. Hence, let us first compute the probability of a single bit-flip:
$$\mathbb{P}[y_i \ne y''_i] = \mathbb{P}[s\cdot(s+n) < 0] = 2\int_0^\infty\int_{-\infty}^{-s}\frac{1}{\sqrt{2\pi}}\exp\Big(-\frac{s^2}{2}\Big)\cdot\frac{1}{\sqrt{2\pi}\sigma}\exp\Big(-\frac{n^2}{2\sigma^2}\Big)\,dn\,ds = \frac{1}{\pi}\arctan(\sigma) =: p.$$
Setting again $Y_i := y_i - y''_i$, we conclude that $\mathbb{P}[Y_i^2 = 4] = 1 - \mathbb{P}[Y_i^2 = 0] = p$. In particular, $\mathbb{E}[Y_i^2] = 4p$ and $|Y_i^2 - \mathbb{E}[Y_i^2]| \le 4$. An application of Hoeffding's inequality ([9, Thm. 7.20] with $B_i = 4$ and $t = 4pm$) finally yields
$$\mathbb{P}[\varepsilon \le \sqrt{8p}] = \mathbb{P}\Big[\frac{1}{m}\sum_{i=1}^m Y_i^2 \le 8p\Big] \ge \mathbb{P}\Big[\frac{1}{m}\sum_{i=1}^m (Y_i^2 - \mathbb{E}[Y_i^2]) \le 4p\Big] \ge 1 - \exp\Big(-\frac{p^2 m}{2}\Big).$$
This bound is also relevant to Theorem 2.9, at least for small (but fixed) values of $\sigma$, since we have $\sqrt{8p} = O(\sqrt{\sigma}) = O(\mathrm{SNR}^{-1/4})$ as $\sigma \to 0$. ♦
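The bit-flip probability $p = \frac{1}{\pi}\arctan(\sigma)$ derived above can be confirmed numerically as well (a short sketch with invented parameter values):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma, m = 0.3, 1_000_000

s = rng.standard_normal(m)          # s ~ N(0, 1)
n = sigma * rng.standard_normal(m)  # n ~ N(0, sigma^2)

flip_emp = np.mean(np.sign(s) != np.sign(s + n))
flip_theo = np.arctan(sigma) / np.pi

print(flip_emp, flip_theo)          # both approx. 0.0928
```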
Throughout this paper, the challenge of (sparse) feature extraction from MS data has been a driving motivation for our modeling. With the full abstract framework of Section 2 available, we are now ready to formulate a rigorous recovery guarantee for this specific application. In order to make the specifications of (M1') and (M2) precise, let us first recall the vague model description from Subsection 1.1. Each sample $i \in \{1, \dots, m\}$ corresponds to a patient who suffers from a certain disease ($y_i = +1$) or not ($y_i = -1$), and the hidden signal factors $s_i = (s_{i,1}, \dots, s_{i,p}) \sim N(0, \Sigma)$ represent the (centered) concentrations of the examined protein collection. In other words, each index $k \in \{1, \dots, p\}$ of $s_i$ stands for a particular type of protein and the corresponding entry contains its molecular concentration for the $i$-th patient. Our hope is that the disease labels can be accurately predicted by a sparse 1-bit observation scheme ($f = \operatorname{sign}$)
$$\bar y_i := \operatorname{sign}(\langle s_i, z\rangle), \quad i = 1,\dots,m, \qquad (3.5)$$
where the signal vector $z \in \mathbb{R}^p$ is assumed to be s-sparse, that is, $\|z\|_0 \le s$. However, the true disease labels $y_1, \dots, y_m \in \{-1, +1\}$ may "slightly" differ from this very simple model. As proposed by (M1'), we shall measure this mismatch by
$$\varepsilon_0 := \Big(\frac{1}{m}\sum_{i=1}^m |y_i - \bar y_i|^2\Big)^{1/2},$$
which is (up to a factor of 2) equal to the square root of the fraction of wrong observations.

The actual input data $x_1, \dots, x_m \in \mathbb{R}^d$ is generated via mass spectrometry (MS); see also Subsection 1.1, especially Figure 1.1. To give a formal definition of a mass spectrum according to (M2), we follow the basic approach of [10, Sec. 4.1], which describes the raw data as a linear combination of Gaussian-shaped peaks plus low-amplitude baseline noise. In this situation, it is also helpful to be aware of the physical meaning of the data variables $x_i = (x_{i,1}, \dots, x_{i,d})$: Each index $j \in \{1, \dots, d\}$ corresponds to a specific mass value (which is proportional to $j$), while the associated mass channel of the MS machine detects (counts) all those molecules having this specific mass value. Due to the underlying physics, this process is not exact, so that the abundance of each protein, i.e., the total count of detected molecules, typically spreads over several channels. Consequently, one eventually observes wide peaks instead of sharp spikes (cf. Figure 1.1).
To keep our illustration as simple as possible, we assume that the (raw) feature atoms are discrete samplings of Gaussian-shaped functions
$$a'_k = (a'_{k,1}, \dots, a'_{k,d}) := (G_{I_k,c_k,d_k}(1), \dots, G_{I_k,c_k,d_k}(d)) \in \mathbb{R}^d, \quad k = 1,\dots,p,$$
with
$$G_{I_k,c_k,d_k}(t) := I_k\cdot\exp\Big(-\frac{(t-c_k)^2}{d_k^2}\Big), \quad t \in \mathbb{R},$$
where $I_k \ge 0$, $c_k \in \mathbb{R}$, and $d_k > 0$. Here, $I_k$ determines the height (intensity) of the $k$-th peak, $d_k$ its width, and $c_k$ its mass center. (Note that $I_k$ is still a fixed number: by (M2), the maximal height of the $k$-th peak is actually given by $s_{i,k}\cdot I_k$, which also depends on the considered sample $i$.) The (raw) noise atoms, on the other hand, simply capture the independent baseline noise that is present in each mass channel:
$$b'_l := \sigma_l e_l \in \mathbb{R}^d, \quad l = 1,\dots,q := d,$$
where $\sigma_l \ge 0$ and $e_l$ is the $l$-th unit vector of $\mathbb{R}^d$. (The word "raw" indicates that we do not care about standardizing the data for now; a standardization is considered as a separate step below.)

If we would just use $a'_1, \dots, a'_p, b'_1, \dots, b'_d$ as input parameters for (M2), the issue of Remark 3.1(3) would become relevant, since the maximal intensities $I_1, \dots, I_p$ may be of different orders of magnitude in practice. In fact, there are usually peaks of very low height, whereas some others are huge in magnitude (see again Figure 1.1). However, this by far does not imply that the latter ones are more important than the others. Raw MS data is therefore often standardized before feature selection. For this purpose, let
$$D' = [D'_1|\cdots|D'_d] := [a'_1|\cdots|a'_p]^T \in \mathbb{R}^{p\times d}, \qquad N' = [N'_1|\cdots|N'_d] := [b'_1|\cdots|b'_d]^T = \operatorname{diag}(\sigma_1, \dots, \sigma_d) \in \mathbb{R}^{d\times d}$$
denote the raw dictionaries.
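To make the data model concrete, the following Python sketch generates synthetic raw spectra according to (M2) with the Gaussian peak atoms $a'_k$ and baseline noise atoms $b'_l$ from above. All numerical values (peak intensities, centers, widths, noise levels) are invented for illustration and are not calibrated to real MS data:

```python
import numpy as np

rng = np.random.default_rng(3)
d, p, m = 500, 5, 20  # mass channels, peaks/proteins, patients

# Invented peak parameters: intensity I_k, center c_k, width d_k.
I = rng.uniform(0.5, 5.0, p)
c = rng.uniform(50.0, 450.0, p)
width = rng.uniform(3.0, 8.0, p)

t = np.arange(1, d + 1)
A_raw = I[:, None] * np.exp(-((t - c[:, None]) / width[:, None]) ** 2)  # rows = a'_k

sigma_noise = 0.05 * np.ones(d)   # baseline noise levels sigma_l
S = rng.standard_normal((m, p))   # signal factors s_i ~ N(0, I_p)
E = rng.standard_normal((m, d))   # noise factors n_i

# Data model (M2): x_i = A^T s_i + B^T n_i with A = D'^T and B = N'^T,
# i.e., each spectrum is a random combination of peaks plus baseline noise.
X_raw = S @ A_raw + E * sigma_noise
print(X_raw.shape)  # (m, d): one raw spectrum per patient
```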
According to (3.2), the variance of the raw feature variables is given by
$$(\sigma'_j)^2 := \mathbb{V}[x'_{i,j}] = \|D'_j\|_2^2 + \|N'_j\|_2^2 = \|D'_j\|_2^2 + \sigma_j^2, \quad j = 1,\dots,d,$$
and we standardize the dictionary atoms by
$$D_j := D'_j/\sigma'_j \qquad\text{and}\qquad N_j := N'_j/\sigma'_j, \quad j = 1,\dots,d. \qquad (3.6)$$
The resulting data model (M2) is then based on the extended dictionary
$$\tilde D = [\tilde D_1|\cdots|\tilde D_d] := \begin{bmatrix} D_1 & \dots & D_d\\ N_1 & \dots & N_d\end{bmatrix} = \begin{bmatrix} D\\ N\end{bmatrix} = \begin{bmatrix} A^T\\ B^T\end{bmatrix} \in \mathbb{R}^{(p+d)\times d}. \qquad (3.7)$$
Using these model assumptions, we can now state a corollary of Theorem 2.11 for variable selection from MS data. This can also be regarded as an extension of Theorem 1.1 to the noisy case.
Corollary 3.3 Suppose that the standardized input data $\{(x_i, y_i)\}_{1\le i\le m}$ follow (M1') and (M2) with the above specifications. Furthermore, let (L1) and (L2) hold true. We assume that $\|\sqrt{\Sigma}z\|_2 = 1$ and $z \in R\,DB_1^d$ for some $R > 0$. Then there exists a constant $C > 0$, depending on the used loss function $L$, such that the following holds with high probability: If the number of samples obeys
$$m \ge C\cdot R^2\cdot\log(2d),$$
then, defining $\hat z := D\hat\beta$ for any minimizer $\hat\beta$ of $(P_{L,K})$, we have
$$\|\sqrt{\Sigma}(\hat z - \lambda z)\|_2 = \|\sqrt{\Sigma}(D\hat\beta - \lambda D\beta)\|_2 \le C\Big(\Big(\frac{R^2\cdot\log(2d)}{m}\Big)^{1/4} + \sqrt{\sigma} + \varepsilon_0\Big),$$
where $\lambda := \sqrt{\frac{2}{\pi(1+\sigma^2)}}$.

Proof. First, we observe that $\mu = \sqrt{2/\pi} < 1$, so that $\mu z \in \mu R\,DB_1^d \subset R\,DB_1^d$. Applying the results from Subsection 3.1 (in particular, the bound (3.3), Remark 3.1(1), and Example 3.2(2)), the claim now immediately follows from Theorem 2.11. □

The significance of Corollary 3.3 primarily depends on the scaling parameter $R > 0$ and the noise variance $\sigma^2$. To investigate this issue further, we need to better understand the structure of the (standardized) dictionary $\tilde D$. A key observation of [10, Subsec. 4.3.3] was that, mostly, the peak centers are "well-separated," i.e.,
$$|c_k - c_{k'}| \gg \max\{d_k, d_{k'}\} \qquad\text{for all } k, k' = 1,\dots,p \text{ with } k \ne k'.$$
In other words, the overlap between two different peaks is usually very small (cf. Figure 1.1). Mathematically seen, this assumption implies that the feature atoms are almost orthogonal, that is, $\langle a_k, a_{k'}\rangle \approx 0$ for $k \ne k'$. Figure 3.1 shows a typical example of $\tilde D$ in such a situation. This visualization suggests distinguishing between two extreme cases:

(1) Features contained in the essential support of a peak.
Let us assume that the $j$-th variable is very close to the center $c_k$ of the $k$-th peak ($k \in \{1, \dots, p\}$). Then, we have $\|D'_j\|_2 \gg \sigma_j$, implying $(\sigma'_j)^2 = \|D'_j\|_2^2 + \sigma_j^2 \approx \|D'_j\|_2^2$. Due to (3.6) and (3.7), the associated dictionary atom now approximately takes the form
$$\tilde D_j = \begin{bmatrix} D_j\\ N_j\end{bmatrix} \approx \begin{bmatrix} \bar e_k\\ 0\end{bmatrix} \in \mathbb{R}^{p+d}, \qquad (3.8)$$
where $\bar e_k \in \mathbb{R}^p$ is the $k$-th unit vector of $\mathbb{R}^p$. Thus, the contribution of the noise atoms (baseline noise) is negligible here. (In realistic scenarios, the dimension $d$ is usually much larger than the number of peaks $p$. Therefore, one of these two "extreme" cases applies to most of the feature variables.)

[Figure 3.1: Visualization of $\tilde D \in \mathbb{R}^{(p+d)\times d}$ in the case of almost isolated peaks; the grayscale values correspond to the magnitudes of the entries (white = 0).]

(2) Features far away from any of the peak centers.
Now, $\|D'_j\|_2 \ll \sigma_j$, which allows for the approximation $(\sigma'_j)^2 = \|D'_j\|_2^2 + \sigma_j^2 \approx \sigma_j^2$. For that reason, the dictionary atom is dominated by the noise part:
$$\tilde D_j = \begin{bmatrix} D_j\\ N_j\end{bmatrix} \approx \begin{bmatrix} 0\\ e_j\end{bmatrix} \in \mathbb{R}^{p+d},$$
where $e_j \in \mathbb{R}^d$ is the $j$-th unit vector of $\mathbb{R}^d$. According to (3.4), these variables suffer from a very low SNR.

With regard to our main goal of achieving a small variance $\sigma^2 = \|N\beta\|_2^2$, we can conclude that the support of an optimal representation $\beta$ is (most likely) contained within those variables corresponding to Case (1). This particularly shows that the issue of Remark 3.1(3) can be easily resolved by standardizing the raw MS data.

Restricting to the features of Case (1), which are actually the only ones of interest, the dictionary $D$ enjoys a very simple structure (while $N$ is negligible anyway). In fact, it basically consists of the unit vectors $\bar e_1, \dots, \bar e_p \in \mathbb{R}^p$, even though some of them may occur repeatedly, due to the spatial extent of the single peaks. The error estimate of Corollary 3.3 therefore implies that the non-zero entries of the estimated feature vector $\hat\beta$ also correspond to Case (1). (This conclusion is a bit vaguely formulated; for a more precise argument, one needs to observe that the error bound of Corollary 3.3 holds for $\|N\hat\beta - \lambda N\beta\|_2$ as well (see Remark 5.2(1)), which implies that $N\hat\beta \approx \lambda N\beta \approx 0$.) In particular, it is possible to assign each non-zero component of $\hat\beta$ to a specific peak. A physician could now use his/her medical expertise to identify these peaks (feature atoms) with the underlying proteins (signal variables). In that sense, the unknown transformation $\tilde D$ can be invoked "by hand," and consequently, we can even give a positive answer to (Q2) in this situation. For a numerical verification of the above argument, the interested reader is referred to the works of [10, Chap. 5] and [7].

The natural (one-to-one) correspondence between the variables of the signal and data domain also allows us to specify a good choice of $R$: Since $D$ behaves similarly to the identity matrix, we may assume that there exists an (optimal) representation $\beta$ of $z$ which is $s$-sparse itself and approximately preserves the norm, i.e., $\|z\|_2 = \|D\beta\|_2 \approx \|\beta\|_2$. By the Cauchy-Schwarz inequality, we then obtain
$$\|\beta\|_1 \le \sqrt{\|\beta\|_0}\cdot\|\beta\|_2 \approx \sqrt{s}\cdot\|z\|_2 \le \sqrt{s}\cdot\|\sqrt{\Sigma}^{-1}\|\cdot\|\sqrt{\Sigma}z\|_2 = \sqrt{s}\cdot\|\sqrt{\Sigma}^{-1}\|.$$
Hence, the hypothesis of Corollary 3.3 can be satisfied with $R \approx \sqrt{s}\cdot\|\sqrt{\Sigma}^{-1}\|$, and the number of required observations is essentially dominated by the degree of sparsity:
$$m \gtrsim C\cdot\|\sqrt{\Sigma}^{-1}\|^2\cdot s\cdot\log(2d).$$
This indicates that feature extraction from MS data is already possible with relatively few samples, provided that the disease fingerprint $z$ is sufficiently sparse and the covariance matrix $\Sigma$ is not too ill-conditioned. Such a conclusion is especially appealing for practical scenarios, since many clinical studies only encompass a very small sample set. Moreover, it is worth mentioning that, regarding the resulting error bound of Corollary 3.3, the estimator $(P_{L,K})$ enjoys a certain type of oracle property in the sense that it behaves almost as well as if we would work directly in the signal domain.
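The entire argument of this subsection can be traced in a small simulation. The sketch below (with entirely hypothetical parameter choices) generates sparse 1-bit labels and Gaussian-peak data, standardizes empirically, and solves $(P_{L,K})$ with the square loss and $K = RB_1^d$ by projected gradient descent; the $\ell_1$-ball projection follows the standard algorithm of Duchi et al. (2008). This is one possible way to compute a minimizer and is not prescribed by Corollary 3.3. The recovered $\hat\beta$ should place its largest entries near the peaks associated with $\operatorname{supp}(z)$, in accordance with Case (1):

```python
import numpy as np

def project_l1(v: np.ndarray, R: float) -> np.ndarray:
    """Euclidean projection onto the l1-ball of radius R (Duchi et al., 2008)."""
    if np.abs(v).sum() <= R:
        return v
    u = np.sort(np.abs(v))[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(u) + 1) > css - R)[0][-1]
    theta = (css[rho] - R) / (rho + 1.0)
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)

rng = np.random.default_rng(6)
p, d, m, s = 20, 400, 150, 3

# Signal domain: s-sparse unit-norm z and (noiseless) 1-bit labels.
z = np.zeros(p)
support = rng.choice(p, size=s, replace=False)
z[support] = rng.standard_normal(s)
z /= np.linalg.norm(z)
S = rng.standard_normal((m, p))
y = np.sign(S @ z)

# Data domain: well-separated Gaussian peaks plus baseline noise, cf. (M2).
t = np.arange(1, d + 1)
centers = np.linspace(30, d - 30, p)
widths = rng.uniform(3.0, 6.0, p)
A = rng.uniform(0.5, 5.0, p)[:, None] * np.exp(-((t - centers[:, None]) / widths[:, None]) ** 2)
X = S @ A + 0.05 * rng.standard_normal((m, d))
X = (X - X.mean(axis=0)) / X.std(axis=0)  # empirical standardization

# (P_{L,K}) with square loss and K = R * B_1^d via projected gradient descent.
R = 2.0 * np.sqrt(s)                      # generous scaling, cf. R ~ sqrt(s)
beta = np.zeros(d)
step = m / (2.0 * np.linalg.norm(X, 2) ** 2)
for _ in range(500):
    grad = (2.0 / m) * X.T @ (X @ beta - y)
    beta = project_l1(beta - step * grad, R)

# The largest entries of beta should sit near the peak centers of supp(z).
print(np.round(centers[support]).astype(int))    # true peak locations
print(np.sort(np.argsort(-np.abs(beta))[:10]))   # channels selected by the estimator
```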
Remark 3.4 (1) The discussion of Corollary 3.3 is based on some heuristic assumptions which do not always strictly hold in practice. For example, we may run into trouble if a peak is "buried" in the baseline noise, i.e., $I_k \ll \sigma_l$ when $l$ is close to $c_k$. In this case, all associated feature variables would suffer from a low SNR (in the sense of (3.4)) and the approximation of Case (1) becomes very rough. Another undesirable situation arises if the sparse observation rule (3.5) fails to hold even approximately; then the mismatch parameter $\varepsilon_0$ becomes large and the error estimate of Corollary 3.3 becomes meaningless.

(2) The bound of Corollary 3.3 could be substantially improved if some prior information about the data is available. For instance, by using statistical tests or medical expertise, one might already have a rough idea of which peaks (proteins) are more important than others. Such additional knowledge could be easily incorporated via a weighted standardization as proposed in [10]. Alternatively, the recent concept of weighted sparsity could be beneficial (see [23] for example), which would basically involve an anisotropic rescaling of the coefficient set $K = RB_1^d$. ♦

In the previous subsection, we have seen how our framework can be used to establish theoretical guarantees for feature selection from MS data. But what about the general practical scope of our results? First, one might wonder about the feasibility of the model assumptions made in Section 2. The linear model of (M2) is in fact quite generic, since every (Gaussian) data can be factorized as in (2.2) if the feature and noise atoms are appropriately chosen. The same is true for the observation scheme of (M1'), where we even permit adversarial (deterministic) noise in the output variables. A stronger limitation is of course the hypothesis of Gaussian signal factors. It is indeed very unlikely that $s_i$ and $n_i$ precisely follow a Gaussian distribution in realistic situations. Hence, an emerging goal for future work should be to allow for more complicated measurement designs, such as sub-Gaussians (see also Section 4).
Next, let us return to our initial challenge of (Q2). When applying Theorem 2.9 to a specific problem, we are always faced with the drawback that the estimator $(P_{L,K})$ yields a vector $\hat\beta$ in the data domain, while the actual approximation statement holds in the signal domain for $\hat z = D\hat\beta$. The practical significance of our results therefore heavily depends on the intrinsic dictionary $D$, which could be very different for each application. Unfortunately, one cannot expect a general conclusion here, since $D$ can be arbitrary and might be (entirely) unknown. However, we would like to point out some "good" scenarios for which a meaningful answer to (Q2) can be given:
• Well-separated feature atoms. If the supports of $a_1, \dots, a_p$ overlap only slightly, we are essentially in the desirable setting of Subsection 3.2, where $D$ consisted of almost isolated peak atoms (see Figure 3.1). Thus, we may assume again that the support of the representing vector $\beta$ is associated with feature variables of the form (3.8). Although the dictionary $D$ might still be highly redundant, its structure is rather simple in this case, and a domain expert should be able to identify the non-zero components of $\hat\beta$ with the underlying signal variables. But note that this strategy can quickly fail if the feature atoms superpose. In the setup of MS data, for example, one could think of several narrow peaks sitting on top of a wider peak. The behavior of $D$ becomes much more complicated in such a situation, so that some prior knowledge is probably required to cope with variable selection (in the signal domain).
• $D$ is approximately known. Suppose that an approximate dictionary $D' \in \mathbb{R}^{p\times d}$ is available, e.g., as an outcome of dictionary learning or a factor analysis. Then we can define an estimator by $\hat z' := D'\hat\beta$. If the error bound of Theorem 2.9 holds true, the triangle inequality now implies
$$\|\hat z' - \lambda z\|_2 \le \|\hat z' - \hat z\|_2 + \|\hat z - \lambda z\|_2 \le \|D' - D\|\cdot\|\hat\beta\|_2 + C_{\rho,\eta}\Big(\Big(\frac{d(\tilde DK)}{m}\Big)^{1/4} + \varepsilon\Big).$$
Hence, if $D'$ and $D$ are sufficiently close, we can conclude that $\hat z'$ serves as a reliable substitute for the (unknown) estimator $\hat z$.
• Output prediction. The primary focus of many applications is rather on accurately predicting the output variable, e.g., the health status of a patient. In this case, it is not even necessary to work within the signal domain: Let $(x, y) \in \mathbb{R}^d\times\mathcal{Y}$ be a data pair with $y$ unknown. Then one can simply define $\hat y := f(\langle x, \hat\beta\rangle)$ as an estimator of $y$. Here, the function $f$, or at least an approximate version of it, is of course assumed to be given.
These guidelines might be helpful to identify further potential applications of our framework. We strongly believe that a statement similar to Corollary 3.3 can also be shown for other types of real-world data, for example, microarrays, neuroimages, or hyperspectral images. A major difficulty when establishing a novel result for feature extraction is to find an appropriate parameterization of the data and observation model. Initially, it might be unclear how to split the underlying collection of signal and noise factors, i.e., to figure out which variables are relevant to (M1) and which are probably not. But even then, there is usually still a certain degree of freedom in specifying the atoms of (M2), and we have seen that the quality of our error bounds is quite sensitive to this choice. For that reason, each specific application will require a very careful individual treatment, as carried out in Subsection 3.2 for MS data.

4. Conclusion and Outlook
Regarding the initial challenges of (Q1) and (Q3), we can conclude that our main results, Theorem 2.9 and Theorem 2.11, provide fairly general solutions. The key idea was to construct an (optimal) representation of the signal vector $z$ in order to mimic the original problem of variable selection within the data domain. In this way, we were able to prove recovery statements for standard estimators, which only take the raw samples $\{(x_i, y_i)\}_{1\le i\le m}$ as input. Interestingly, it has turned out that this approach works almost as well as if one would explicitly know the hidden signal factors $s_i$ of the data ("oracle property"). Another remarkable observation was that the used loss function $L$ as well as the (unknown) non-linearity $f$ have only a minor impact on the qualitative behavior of the error bounds.

In Section 3, we have also discussed the practical scope of our findings, in particular the issue of (Q2). As the setting of Theorem 2.9 and Theorem 2.11 is quite generic, a practice-oriented statement can usually only be drawn if some further assumptions on the data model (M2) are made. This was illustrated for the example of MS data, but there should be many other applications for which similar guarantees can be proven. For that matter, we hope that our results could at least serve as a promising indicator of successful feature extraction from real-world data.

Finally, let us sketch several improvements and extensions of our framework which could be interesting to study in future research:
• Non-Gaussian observations. In the course of Subsection 3.3, we have already pointed out that the assumption of Gaussian signal variables is often too restrictive in realistic scenarios. Therefore, it would be essential to allow for more general distributions, e.g., sub-Gaussian or even heavy-tailed random variables. The most difficult step towards such an extension is to handle the non-linearity $f$ in a unified way. In the noisy linear case of Example 2.5(1), our results may be easily extended to sub-Gaussian factors, using techniques from [16, 17, 25]. If $f$ produces binary outputs, on the other hand, the situation becomes much more complicated. There exists in fact a simple counterexample based on Bernoulli variables for which signal recovery is impossible (see [18, Rem. 1.5]). An interesting approach has been given in [1], excluding such extreme cases, but the authors consider only a linear loss for their estimator, which does not fit into our setup. However, we strongly believe that similar arguments can be used to adapt the statement of Theorem 2.9 to (a sub-class of) sub-Gaussian distributions.
• More advanced sample models. In order to achieve a smaller adversarial noise parameter $\varepsilon_0$ in (M1'), it might be useful to go beyond a single-index model. For example, the output variables could be described by a multi-index observation rule of the form
$$y_i = f(\langle s_i, z^{(1)}\rangle, \langle s_i, z^{(2)}\rangle, \dots, \langle s_i, z^{(D)}\rangle), \quad i = 1,\dots,m,$$
where $z^{(1)}, \dots, z^{(D)} \in \mathbb{R}^p$ are unknown signal vectors and $f\colon\mathbb{R}^D\to\mathcal{Y}$ is again a scalar function. Apart from that, one could also think of a certain non-linearity in the linear factor model (M2), especially in the noise term.
• Incorporation of prior knowledge.
A crucial feature of our main results is the fact that they impose only relatively few assumptions on the data model (the atoms $a_1, \dots, a_p$ and $b_1, \dots, b_q$ can be arbitrary). In most applications, however, the input data obey additional structural constraints, such as the characteristic peak patterns of mass spectra. On this account, we have sketched various rules of thumb in Section 3, but a rigorous treatment of these heuristics remains largely unexplored. For example, one could ask the following: Can we simultaneously learn the signal vector $z$ and the feature dictionary $D$? How to optimally preprocess (reweight) the data if some feature variables are known to be more important than others? And to what extent does this improve our error bounds?
• Optimality of the representation. In Definition 2.2, we have introduced a representative feature vector $\beta$ that is optimal with respect to the SNR (cf. (2.7)). This is indeed a very natural way to match the true observation $y_i$ with its perturbed counterpart $y''_i$ from (2.17), but the impact of the non-linearity $f$ is actually disregarded. The error bound of Theorem 2.9, in contrast, involves the noise parameter $\varepsilon$, which also depends on $f$. For this reason, it is not clear whether our notion of optimality always leads to the best possible estimates. A more elaborate approach would be to choose $\beta$ such that (the expected value of) $\varepsilon$ is minimized, which would in turn require a very careful study of $f$. But fortunately, Example 3.2 indicates that we would end up with the same result in most situations anyway.
• General convex loss functions.
It can be easily seen that some popular convex loss functions do not fit into our setup. A prominent example is the hinge loss (used for support vector machines), which is neither strictly convex nor differentiable. However, the recent work of [15] has shown that signal recovery is still possible for this choice, even though the authors consider a more restrictive setting than we do. Another interesting issue would concern the design of adaptive loss functions: Supposing the non-linearity $f$ of the observation model is (approximately) known, can we construct a good (or even optimal) loss $L$?

5. Proofs
The claim of Theorem 1.1 follows from a straightforward application of our main result, Theorem 2.9,which is proven in the next subsection.
Proof of Theorem 1.1.
We first observe that the data $x_i = As_i$ does not involve any noise variables, so that one can simply assume $q = 0$ and $\tilde D = D$. Since $f = \operatorname{sign}$ and $L = L_{\mathrm{sq}}$, Example 2.5(2) yields $\mu = \sqrt{2/\pi} < 1$. Thus, choosing $K := RB_1^d$, we have
$$\mu z \in \mu R\,DB_1^d \subset R\,DB_1^d = DK.$$
The noise term $n$ is equal to zero by assumption, which means that the feature vector $\beta$ is already an optimal representation of $z$. In particular, we conclude that $\sigma^2 = 0$, implying $y''_i = y_i$ for all $i = 1, \dots, m$ and also $\varepsilon = 0$. An application of (2.15) finally gives the bound
$$d(\tilde DK) = R^2\cdot d(DB_1^d) \lesssim R^2\cdot D_{\max}^2\cdot\log(2d),$$
and the claim is now established by the statement of Theorem 2.9. □

We have already pointed out that we would like to apply the abstract framework of [11] to prove Theorem 2.9 and Theorem 2.11. In order to keep our exposition self-contained, let us first restate a deep result from [11], where the notation is adapted to the setup of this work:
Theorem 5.1 ([11, Thm. 1.3])
Let $v \in S^{n-1}$ be a unit-norm vector. Set
$$\tilde y_i := f(\langle u_i, v\rangle), \quad i = 1,\dots,m, \qquad (5.1)$$
for i.i.d. Gaussian samples $u_1, \dots, u_m \sim N(0, I_n)$ and a (random) function $f\colon\mathbb{R}\to\mathcal{Y}$. We assume that (L1), (L2) are fulfilled and that $\mu$, $\rho$, $\eta$ are defined according to (2.11). Moreover, suppose that $\mu v \in L$ for a bounded, convex subset $L \subset \mathbb{R}^n$. Then there exists a constant of the form $C'_{\rho,\eta} = C'\cdot\max\{\rho,\eta\} > 0$ with $C' > 0$ such that the following holds with high probability: (As in Theorem 2.9, the constant $C'$ may depend on the "probability of success" as well as on the RSC-constant of $L$.) If the number of observations obeys
$$m \ge C'\cdot d(L - \mu v),$$
then any minimizer $\hat v$ of
$$\min_{v\in\mathbb{R}^n}\frac{1}{m}\sum_{i=1}^m L(\langle u_i, v\rangle, y_i) \quad\text{subject to } v \in L, \qquad (5.2)$$
with arbitrary inputs $y_1, \dots, y_m \in \mathcal{Y}$, satisfies
$$\|\hat v - \mu v\|_2 \le C'_{\rho,\eta}\Big(\Big(\frac{d(L - \mu v)}{m}\Big)^{1/4} + \Big(\frac{1}{m}\sum_{i=1}^m |\tilde y_i - y_i|^2\Big)^{1/2}\Big). \qquad (5.3)$$

The proof of our main result now follows from a sophisticated application of Theorem 5.1:

Proof of Theorem 2.9.
Let us recall the approach of Remark 2.3(3) and work in the extended signal domain $\mathbb{R}^{p+q}$. That means, we would like to apply Theorem 5.1 in a setup with $n := p+q$ and
$$u_i := \tilde s_i = (s_i, n_i) \sim N(0, I_{p+q}), \quad i = 1,\dots,m.$$
Next, we need to specify the vector $v$. For this purpose, consider again
$$\tilde z' = \tilde D\beta = \begin{bmatrix} D\beta\\ N\beta\end{bmatrix} = \begin{bmatrix} z\\ N\beta\end{bmatrix},$$
where the assumption $\mu z \in DK$ ensures that an optimal representation $\beta$ actually exists (cf. (2.5)). Due to $\|z\|_2 = 1$, we have
$$\tau := \|\tilde z'\|_2 = \sqrt{1 + \|N\beta\|_2^2} = \sqrt{1+\sigma^2},$$
so that
$$v := \tau^{-1}\tilde z' = \tau^{-1}\begin{bmatrix} z\\ N\beta\end{bmatrix}$$
satisfies $\|v\|_2 = 1$. Putting this into (5.1) leads to
$$\tilde y_i = f(\langle u_i, v\rangle) = f\Big(\tau^{-1}\Big\langle\begin{bmatrix} s_i\\ n_i\end{bmatrix}, \begin{bmatrix} D\beta\\ N\beta\end{bmatrix}\Big\rangle\Big) = f(\tau^{-1}\langle x_i, \beta\rangle), \quad i = 1,\dots,m, \qquad (5.4)$$
whereas the true observations are (cf. (2.8))
$$y_i = f(\langle s_i, z\rangle) = f\Big(\Big\langle\begin{bmatrix} s_i\\ n_i\end{bmatrix}, \begin{bmatrix} z\\ 0\end{bmatrix}\Big\rangle\Big) = f(\langle\tilde s_i, \tilde z\rangle), \quad i = 1,\dots,m.$$
Recalling (2.16) and (2.17), we observe that in fact $\tilde y_i = y''_i$, which implies
$$\varepsilon = \Big(\frac{1}{m}\sum_{i=1}^m |\tilde y_i - y_i|^2\Big)^{1/2}.$$
As signal set we simply choose $L := \tilde DK$. By $0 \in K$, $\tau \ge 1$, and $\mu\beta \in K$, one concludes that
$$\mu v = \mu\tau^{-1}\begin{bmatrix} z\\ N\beta\end{bmatrix} = \tau^{-1}\begin{bmatrix}\mu D\beta\\ \mu N\beta\end{bmatrix} \in \tau^{-1}\tilde DK \subset \tilde DK = L.$$
Since (see also [19, Prop. 2.1])
$$C'\cdot d(L - \mu v) \le \underbrace{4C'}_{=:C}\cdot d(L) = C\cdot d(\tilde DK) \le m,$$
all conditions of Theorem 5.1 are indeed satisfied.

For the remainder of this proof, let us assume that the "high-probability event" of Theorem 5.1 has occurred. Then, the minimizer of (5.2) can be expressed in terms of the actual estimator $(P_{L,K})$:
$$\hat v = \operatorname*{argmin}_{v\in L}\frac{1}{m}\sum_{i=1}^m L(\langle u_i, v\rangle, y_i) = \operatorname*{argmin}_{v\in\tilde DK}\frac{1}{m}\sum_{i=1}^m L(\langle\tilde s_i, v\rangle, y_i) = \tilde D\cdot\operatorname*{argmin}_{\beta\in K}\frac{1}{m}\sum_{i=1}^m L(\underbrace{\langle\tilde s_i, \tilde D\beta\rangle}_{=\langle x_i,\beta\rangle}, y_i) = \tilde D\hat\beta = \begin{bmatrix} D\hat\beta\\ N\hat\beta\end{bmatrix} = \begin{bmatrix}\hat z\\ N\hat\beta\end{bmatrix}. \qquad (5.5)$$
This finally yields the desired error bound (with $C_{\rho,\eta}$ chosen as an appropriate multiple of $C'_{\rho,\eta}$):
$$\|\hat z - \lambda z\|_2 = \Big\|\begin{bmatrix}\hat z\\ 0\end{bmatrix} - \lambda\begin{bmatrix} z\\ 0\end{bmatrix}\Big\|_2 \le \Big\|\begin{bmatrix}\hat z\\ N\hat\beta\end{bmatrix} - \mu\tau^{-1}\begin{bmatrix} z\\ N\beta\end{bmatrix}\Big\|_2 = \|\hat v - \mu v\|_2 \le C'_{\rho,\eta}\Big(\Big(\frac{d(\tilde DK)}{m}\Big)^{1/4} + \varepsilon\Big). \qquad (5.6)$$
□

Remark 5.2 (1) The above proof even shows a stronger statement than given in Theorem 2.9. Indeed, (5.6) also implies that
$$\|N\hat\beta - \mu\tau^{-1}N\beta\|_2 \le C'_{\rho,\eta}\Big(\Big(\frac{d(\tilde DK)}{m}\Big)^{1/4} + \varepsilon\Big).$$
Hence, if $N\beta \approx 0$ (high SNR) and the bound of the right-hand side is sufficiently small, we can conclude that $N\hat\beta \approx 0$. This basically means that the SNR is still high when using the estimated feature vector $\hat\beta$ instead of $\beta$.

(2) As already mentioned in Remark 2.8, the proof of Theorem 2.9 reveals why the outputs of (2.9),
$$y'_i = f(\langle x_i, \beta\rangle) = f(\langle\tilde s_i, \tilde z'\rangle), \quad i = 1,\dots,m,$$
cannot match the observation model (5.1) in general. In fact, this would require $\|\tilde z'\|_2 = 1$, which is impossible as soon as $D\beta = z \in S^{p-1}$ and $N\beta \ne 0$. That is precisely the reason why we need to consider the rescaled outputs $y''_i = f(\tau^{-1}\langle x_i, \beta\rangle)$ instead.

(3) The proof strategy of Theorem 2.9 might appear a bit counter-intuitive at first sight: The true observations $y_1, \dots, y_m$ from (M1) are rather treated as noisy (and dependent) perturbations of the "artificial" output rule defined in (5.4). This "reverse" perspective is due to the fact that there might not exist a feature vector $\beta \in \mathbb{R}^d$ with $\tilde z = (z, 0) = \tilde D\beta$, i.e., there is no exact representation of $z$ in the extended signal domain (see also Remark 2.3(3)). More precisely, if we would have simply chosen $u_i := s_i$, $v := z$, and $L = DK$, the crucial equivalence of the estimators in (5.5) would only hold if $\ker N \cap K = K$. But this is not the case in general. (However, there are at least some special cases, e.g., $f = \operatorname{sign}$, where this simpler approach would still work.) ♦

Theorem 2.9 could have also been deduced as a corollary of Theorem 2.11. However, we have decided to give a separate proof in order to avoid technicalities that may detract from the key techniques of this work. Theorem 2.11 can be derived in a very similar way, using Theorem 5.1 again:
Proof of Theorem 2.11.
First, observe that √Σ⁻¹ s_i ∼ N(0, I_p). So, let us set up Theorem 5.1 with

u_i := (√Σ⁻¹ s_i, n_i) ∼ N(0, I_{p+q}), i = 1, . . . , m, v := τ⁻¹ [√Σ z; Nβ], and L := ˜D_Σ K.

Using the assumptions of Theorem 2.11, one easily shows that ‖v‖₂ = 1, µv ∈ L, ỹ_i = y″_i, and

v̂ = argmin_{v ∈ L} (1/m) ∑_{i=1}^m L(⟨u_i, v⟩, y_i) = ˜D_Σ · argmin_{β ∈ K} (1/m) ∑_{i=1}^m L(⟨x_i, β⟩, y_i) = ˜D_Σ β̂ = [√Σ ẑ; Nβ̂].

Recall that y_1, . . . , y_m are the true observations, whereas y⁰_1, . . . , y⁰_m obey (2.20). This leads to

( (1/m) ∑_{i=1}^m |ỹ_i − y_i|² )^{1/2} ≤ ( (1/m) ∑_{i=1}^m |y″_i − y⁰_i|² )^{1/2} + ( (1/m) ∑_{i=1}^m |y⁰_i − y_i|² )^{1/2} ≤ ε₁ + ε₂,

and the statement of Theorem 5.1 finally yields the desired bound

‖√Σ(ẑ − λz)‖₂ ≤ ‖v̂ − µv‖₂ ≤ C′_{ρ,η} ( (d(˜D_Σ K)/√m)^{1/2} + ε₁ + ε₂ ). □
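It may be worth making the whitening step at the beginning of this proof concrete: if s_i ∼ N(0, Σ) with Σ positive definite, then √Σ⁻¹ s_i ∼ N(0, I_p), which is what turns the correlated inputs into the standard Gaussian vectors u_i required by Theorem 5.1. A minimal numpy check of this fact (the dimensions and the matrix Σ below are arbitrary illustrative choices):

import numpy as np

rng = np.random.default_rng(1)
p, m = 5, 200_000                       # dimension and sample size (arbitrary)

# A hypothetical positive definite covariance matrix Sigma.
A = rng.standard_normal((p, p))
Sigma = A @ A.T + p * np.eye(p)

# Inverse matrix square root of Sigma via its eigendecomposition.
w, V = np.linalg.eigh(Sigma)
Sigma_inv_sqrt = V @ np.diag(w ** -0.5) @ V.T

# Draw s_i ~ N(0, Sigma) and whiten: u_i = sqrt(Sigma)^{-1} s_i.
S = rng.multivariate_normal(np.zeros(p), Sigma, size=m)
U = S @ Sigma_inv_sqrt                  # valid since Sigma_inv_sqrt is symmetric

# Empirical covariance of the whitened samples: approximately I_p.
print(np.round(U.T @ U / m, 2))

Up to sampling error, the printed matrix is the identity, confirming that the whitened samples are uncorrelated with unit variance.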
ACKNOWLEDGEMENTS

This research is supported by the Einstein Center for Mathematics Berlin (ECMath), project grant CH2. G. Kutyniok acknowledges partial support by the Einstein Foundation Berlin, the Einstein Center for Mathematics Berlin (ECMath), the European Commission-Project DEDALE (contract no. 665044) within the H2020 Framework Program, DFG Grant KU 1446/18, DFG-SPP 1798 Grants KU 1446/21 and KU 1446/23, the DFG Collaborative Research Center TRR 109 "Discretization in Geometry and Dynamics", and the DFG Research Center Matheon "Mathematics for Key Technologies" in Berlin.
REFERENCES

[1] A. Ai, A. Lapanowski, Y. Plan, and R. Vershynin. One-bit compressed sensing with non-Gaussian measurements. Linear Algebra Appl., 441:222–239, 2014.
[2] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. J. Mach. Learn. Res., 3:463–482, 2002.
[3] H. D. Bondell and B. J. Reich. Simultaneous regression shrinkage, variable selection, and supervised clustering of predictors with OSCAR. Biometrics, 64(1):115–123, 2008.
[4] P. Bühlmann, P. Rütimann, S. van de Geer, and C.-H. Zhang. Correlated variables in regression: Clustering and sparse estimation. J. Statist. Plann. Inference, 143(11):1835–1858, 2013.
[5] P. Bühlmann and S. van de Geer. Statistics for High-Dimensional Data: Methods, Theory and Applications. Springer, 2011.
[6] E. J. Candès, Y. C. Eldar, D. Needell, and P. Randall. Compressed sensing with coherent and redundant dictionaries. Appl. Comput. Harmon. Anal., 31(1):59–73, 2011.
[7] T. Conrad, M. Genzel, N. Cvetkovic, N. Wulkow, A. Leichtle, J. Vybiral, G. Kutyniok, and C. Schütte. Sparse proteomics analysis – A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data. arXiv:1506.03620, 2015.
[8] M. A. T. Figueiredo and R. D. Nowak. Sparse estimation with strongly correlated variables using ordered weighted ℓ1 regularization. arXiv:1409.4005, 2014.
[9] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Birkhäuser, 2013.
[10] M. Genzel. Sparse Proteomics Analysis. Master's thesis, TU Berlin, 2015.
[11] M. Genzel. High-dimensional estimation of structured signals from non-linear observations with general convex loss functions. arXiv:1602.03436, 2016.
[12] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2009.
[13] T. Hastie, R. Tibshirani, and M. J. Wainwright. Statistical Learning with Sparsity. CRC Press, 2015.
[14] L. Jacques, J. N. Laska, P. T. Boufounos, and R. G. Baraniuk. Robust 1-bit compressive sensing via binary stable embeddings of sparse vectors. IEEE Trans. Inform. Theory, 59(4):2082–2102, 2013.
[15] A. Kolleck and J. Vybiral. Non-asymptotic analysis of ℓ1-norm support vector machines. arXiv:1509.08083, 2015.
[16] C. Liaw, A. Mehrabian, Y. Plan, and R. Vershynin. A simple tool for bounding the deviation of random matrices on geometric sets. In Geometric Aspects of Functional Analysis, Lecture Notes in Math. Springer, to appear, arXiv:1603.00897.
[17] S. Mendelson, A. Pajor, and N. Tomczak-Jaegermann. Reconstruction and subgaussian operators in asymptotic geometric analysis. Geom. Funct. Anal., 17(4):1248–1282, 2007.
[18] Y. Plan and R. Vershynin. One-bit compressed sensing by linear programming. Comm. Pure Appl. Math., 66(8):1275–1297, 2013.
[19] Y. Plan and R. Vershynin. Robust 1-bit compressed sensing and sparse logistic regression: A convex programming approach. IEEE Trans. Inform. Theory, 59(1):482–494, 2013.
[20] Y. Plan and R. Vershynin. The generalized Lasso with non-linear observations. IEEE Trans. Inform. Theory, 2016, to appear, arXiv:1502.04071.
[21] Y. Plan, R. Vershynin, and E. Yudovina. High-dimensional estimation with geometric constraints. arXiv:1404.3749, 2014.
[22] H. Rauhut, K. Schnass, and P. Vandergheynst. Compressed sensing and redundant dictionaries. IEEE Trans. Inform. Theory, 54(5):2210–2219, 2008.
[23] H. Rauhut and R. Ward. Interpolation via weighted ℓ1-minimization. Appl. Comput. Harmon. Anal., 40(2):321–351, 2016.
[24] R. Tibshirani. Regression shrinkage and selection via the lasso. J. Roy. Statist. Soc. Ser. B, 58(1):267–288, 1996.
[25] J. A. Tropp. Convex recovery of a structured signal from independent random linear measurements. In G. E. Pfander, editor, Sampling Theory, a Renaissance, ANHA Series, pages 67–101. Birkhäuser, 2015.
[26] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, 1998.
[27] R. Vershynin. Estimation in high dimensions: A geometric perspective. In G. E. Pfander, editor,