Consistent Semi-Supervised Graph Regularization for High Dimensional Data
Xiaoyi Mai [email protected]
Romain Couillet [email protected]
CentraleSupélec, Laboratoire des Signaux et Systèmes, Université Paris-Saclay, 3 rue Joliot Curie, 91192 Gif-sur-Yvette
GIPSA-lab, GSTATS DataScience Chair, Université Grenoble-Alpes, 11 rue des Mathématiques, 38400 St Martin d'Hères
Abstract
Semi-supervised Laplacian regularization, a standard graph-based approach for learning from both labelled and unlabelled data, was recently demonstrated to have an insignificant high dimensional learning efficiency with respect to unlabelled data (Mai and Couillet, 2018), causing it to be outperformed by its unsupervised counterpart, spectral clustering, given sufficient unlabelled data. Following a detailed discussion on the origin of this inconsistency problem, a novel regularization approach involving a centering operation is proposed as a solution, supported by both theoretical analysis and empirical results.
Keywords: semi-supervised learning, graph-based methods, high dimensional statistics, distance concentration, random matrix theory
1. Introduction
Machine learning methods aim to form a mapping from an input data space to an output characterization space (classification labels, regression vectors) by optimally exploiting the information contained in the collected data. Depending on whether the data fed into the learning model are labelled or unlabelled, machine learning algorithms are broadly categorized as supervised or unsupervised. Although the supervised approach now occupies a dominant place in real-world applications thanks to its high accuracy, the cost of the labelling process, overly high in comparison to the collection of data, continually compels researchers to develop techniques exploiting unlabelled data, all the more so as many popular learning tasks of these days, such as image classification, speech recognition and language translation, require enormous training datasets to achieve satisfying results.

The idea of semi-supervised learning (SSL) comes from the expectation of maximizing the learning performance by combining labelled and unlabelled data (Chapelle et al., 2009). It is of significant practical value when the cost of supervised learning is too high and the performance of unsupervised approaches is too weak. Despite this natural idea, semi-supervised learning has not reached broad recognition. As a matter of fact, many standard semi-supervised learning techniques were found to be unable to learn effectively from unlabelled data (Shahshahani and Landgrebe, 1994; Cozman et al., 2002; Ben-David et al., 2008), thereby hindering the interest for these methods.

A first key reason for the underperformance of semi-supervised learning methods lies in the lack of understanding of such approaches, caused by the technical difficulty of a theoretical analysis. Indeed, even the simplest problem formulations, the solutions of which assume an explicit form, involve complicated-to-analyze mathematical objects (such as the resolvent of kernel matrices).

A second important aspect has to do with dimensionality. As most semi-supervised learning techniques are built upon low-dimensional reasonings, they suffer from the transition to large dimensional datasets. Indeed, it has long been noticed that learning from data of intrinsically high dimensionality presents some unique problems, for which the term curse of dimensionality was coined. One important phenomenon of the curse of dimensionality is known as distance concentration, which is the tendency for the distances between high dimensional data vectors to become indistinguishable. This problem has been studied in many works (Beyer et al., 1999; Aggarwal et al., 2001; Hinneburg et al., 2000; Francois et al., 2007; Angiulli, 2018), providing mathematical characterizations of the distance concentration under the condition of intrinsically high dimensional data. Since the strong agreement between geometric proximity and data affinity in low dimensional spaces is the foundation of similarity-based learning techniques, it is then questionable whether these traditional techniques perform effectively on high dimensional data sets, and many counterintuitive phenomena may occur.

The aforementioned tractability and dimensionality difficulties can be tackled at once by exploiting recent advances in random matrix theory to analyze the performance of semi-supervised algorithms. With their weaknesses understood, it is then possible to propose fundamental corrections for these algorithms.
The present article specifically focuses on semi-supervised graph regularization approaches (Belkin and Niyogi, 2003; Zhu et al., 2003; Zhou et al., 2004), a major subset of semi-supervised learning methods (Chapelle et al., 2009), often referred to as Laplacian regularizations since their loss functions involve differently normalized Laplacian matrices (Avrachenkov et al., 2012). These Laplacian regularization algorithms are presented in Subsection 2.1. It was made clear in a recent work of Mai and Couillet (2018) that among existing Laplacian regularization algorithms, only one (related to the PageRank algorithm) yields reasonable classification results, yet with asymptotically negligible contribution from the unlabelled dataset. This inefficiency of Laplacian regularization methods at learning from unlabelled data may cause them to be outperformed by a mere (unsupervised) spectral clustering approach (Von Luxburg, 2007) in the same high dimensional settings (Couillet and Benaych-Georges, 2016). We refer to Subsection 2.2 for a summary of the key mathematical results in the previous analysis of Mai and Couillet (2018), which motivate the present work.

The contributions of the present work start from Section 3: with the cause of the unlabelled data learning inefficiency of Laplacian regularization identified in Subsection 3.1, a new regularization approach with centered similarities is proposed in Subsection 3.2 as a cure, followed by a subsection justifying the proposed method from the alternative viewpoint of label propagation. This new regularization method is simple to implement and its effectiveness is supported by a rigorous analysis, in addition to heuristic arguments and empirical results which justify its usage in more general data settings. Specifically, the statistical analysis of Section 4, placed under a high dimensional Gaussian mixture model (as employed in the previous analysis of Mai and Couillet (2018), as well as that of Couillet and Benaych-Georges (2016) in the context of spectral clustering), proves the consistency of our proposed high dimensional semi-supervised learning method, with guaranteed performance gains over Laplacian regularization. The theoretical results of Section 4 are validated by simulations in Subsection 5.1. Broadening the perspective, the discussion in Subsection 3.1 suggests that the unlabelled data learning inefficiency of Laplacian regularization is due to the universal distance concentration phenomenon of high dimensional data. The advantage of our centered regularization, proposed as a countermeasure to the problem of distance concentration, should therefore extend beyond the analyzed Gaussian mixture model. This claim is verified in Subsection 5.2 through experimentation on real-world datasets, where we observe that the proposed method tends to produce more marked performance gains over the Laplacian approach when the distance concentration phenomenon is more severe. The discussion is extended in Section 6 to include some related graph-based SSL methods. Although not suffering from the unlabelled data learning inefficiency problem like Laplacian regularization, these methods may still have a suboptimal semi-supervised learning performance on high dimensional data as they do not possess the same performance guarantees as our proposed method.
This claim is verified in Subsection 6.2 thanks to a recent work of Lelarge and Miolane (2019) characterizing the optimal performance on isotropic Gaussian data, upon which the graph-based SSL methods, except the proposed centered regularization approach, are found to yield unsatisfying results. We approach the subject of learning on sparse graphs in Subsection 6.3, where the benefit of using the proposed method is justified in terms of computational efficiency and learning performance.

Notations: 1_n is the column vector of ones of size n, I_n the n × n identity matrix. The norm ‖·‖ is the Euclidean norm for vectors and the operator norm for matrices. We follow the convention of using o_P(1) for a sequence of random variables that converges to zero in probability. For a random variable x ≡ x_n and u_n ≥ 0, we write x = O(u_n) if for any η > 0 and D > 0, we have n^D P(x ≥ n^η u_n) → 0.
2. Background
We begin this section by recalling the basics of graph learning methods, before briefly reviewing the main results of Mai and Couillet (2018), which motivate the proposition of our centered regularization method presented in the subsequent section.
Consider a set {x_1, . . . , x_n} ⊂ R^p of p-dimensional input vectors belonging to either one of two affinity classes C_1 or C_2. In graph-based methods, data samples x_1, . . . , x_n are represented by vertices in a graph, upon which a weight matrix W is computed by

W = {w_ij}_{i,j=1}^n = { h( ‖x_i − x_j‖²/p ) }_{i,j=1}^n    (1)

for some decreasing non-negative function h, so that nearby data vectors x_i, x_j are connected with a large weight w_ij, which can also be seen as a similarity measure between data samples. A typical kernel function for defining w_ij is the radial basis function kernel w_ij = e^{−‖x_i − x_j‖²/t}. The connectivity of data point x_i is measured by its degree d_i = Σ_{j=1}^n w_ij; the diagonal matrix D ∈ R^{n×n} having d_i as its diagonal elements is called the degree matrix.

The graph learning approach assumes that data points belonging to the same affinity group are "close" in a graph-proximity sense. In other words, if f ∈ R^n is a class signal of the data samples x_1, . . . , x_n, it varies little from x_i to x_j when w_ij has a large value. The graph smoothness assumption is usually characterized as minimizing a smoothness penalty term of the form

(1/2) Σ_{i,j=1}^n w_ij (f_i − f_j)² = f^T L f

where L = D − W is referred to as the Laplacian matrix. Notice that the above loss function is minimized to zero for f = 1_n; obviously such a constant vector contains no information about data classes. According to this remark, the popular unsupervised graph learning method, spectral clustering, simply consists in finding a unit vector orthogonal to 1_n that minimizes the smoothness penalty term, as formalized below:

min_{f ∈ R^n} f^T L f
s.t. ‖f‖ = 1, f^T 1_n = 0.    (2)

It is easily shown by the spectral properties of Hermitian matrices that the solution to the above optimization is the eigenvector of L associated with the second smallest eigenvalue. There exist also other variations of the smoothness penalty term involving differently normalized Laplacian matrices, such as the symmetric normalized Laplacian matrix L_s = I_n − D^{−1/2} W D^{−1/2}, and the random walk normalized Laplacian matrix L_r = I_n − W D^{−1}, which is related to the PageRank algorithm.

In the semi-supervised setting, we dispose of n_[l] pairs of labelled points and labels {(x_1, y_1), . . . , (x_{n_[l]}, y_{n_[l]})} with y_i ∈ {−1, 1} the class label of x_i, and n_[u] unlabelled data {x_{n_[l]+1}, . . . , x_n}. To incorporate the prior knowledge of the class of labelled data into the class signal f, the semi-supervised graph regularization approach imposes deterministic scores at the labelled points of f, e.g., by letting f_i = y_i for all x_i labelled. The mathematical formulation of the problem then becomes

min_{f ∈ R^n} f^T L f
s.t. f_i = y_i, 1 ≤ i ≤ n_[l].    (3)

Denoting

f = [ f_[l] ; f_[u] ],   L = [ L_[ll]  L_[lu] ; L_[ul]  L_[uu] ],

the above convex optimization problem with equality constraints on f_[l] is solved by setting the derivative of the loss function with respect to f_[u] to zero, which gives the following explicit solution:

f_[u] = −L_[uu]^{−1} L_[ul] f_[l].    (4)

Finally, the decision step consists in assigning an unlabelled sample x_i to C_1 (resp., C_2) if f_i < 0 (resp., f_i > 0). In summary, Laplacian regularization computes the unlabelled scores f_[u] by regularizing them over the Laplacian matrix along with predefined class signals of labelled data f_[l].
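For concreteness, the following minimal NumPy sketch implements the hard-constraint Laplacian regularization of (3)-(4) with the Gaussian kernel mentioned above; the function names and the bandwidth t are our own illustrative choices rather than part of the original formulation.

```python
import numpy as np

def rbf_weights(X, t=1.0):
    """Similarity matrix w_ij = exp(-||x_i - x_j||^2 / t) for the rows of X (n x p)."""
    sq = np.sum(X ** 2, axis=1)
    dist2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # squared pairwise distances
    return np.exp(-dist2 / t)

def laplacian_regularization(X, y_l, n_l, t=1.0):
    """Unlabelled scores f_[u] = -L_[uu]^{-1} L_[ul] f_[l] of problem (3)."""
    W = rbf_weights(X, t)
    L = np.diag(W.sum(axis=1)) - W                      # unnormalized Laplacian L = D - W
    f_l = y_l.astype(float)                             # labelled scores fixed to the labels +-1
    f_u = -np.linalg.solve(L[n_l:, n_l:], L[n_l:, :n_l] @ f_l)  # explicit solution (4)
    return f_u                                          # assign x_i to C_1 if f_i < 0, to C_2 otherwise
```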
It is often observed in practice that using other normalized Laplacian regularizers such as f^T L_s f or f^T L_r f can lead to better classification results. Similarly to the work of Avrachenkov et al. (2012), we define

L^(a) = I_n − D^{−1−a} W D^{a}

as the a-normalized Laplacian matrix in order to integrate all these different Laplacian regularization algorithms into a common framework. Replacing L with L^(a) in (4) to get

f_[u] = −( L^(a)_[uu] )^{−1} L^(a)_[ul] f_[l],    (5)

we retrieve the solutions of the standard Laplacian L, the symmetric Laplacian L_s and the random walk Laplacian L_r respectively at a = 0, a = −1/2 and a = −1.

Note that L^(a)_[uu] is invertible under the trivial condition that the graph represented by W is fully connected (i.e., with no isolated subgraphs). To show this, note first that, under this condition, we have

u_[u]^T D_[u]^{1+2a} L^(a)_[uu] u_[u] = (1/2) Σ_{i,j=n_[l]+1}^{n} w_ij ( d_i^{a} u_i − d_j^{a} u_j )² + Σ_{i=n_[l]+1}^{n} d_i^{2a} u_i² Σ_{m=1}^{n_[l]} w_im > 0

for any u_[u] ≠ 0_{n_[u]} ∈ R^{n_[u]}, as the first term on the right-hand side is strictly positive unless all d_i^{a} u_i have the same nonzero value, in which case the second term is strictly positive since there is at least one w_im > 0. The matrix L^(a)_[uu] is therefore positive definite. As shown in the following though, the fully connected condition is not required for the new algorithm proposed in this article to be well defined and to perform as expected.

Despite being a popular semi-supervised learning approach, Laplacian regularization algorithms were shown by Mai and Couillet (2018) to have an inefficient learning capacity for high dimensional unlabelled data, as a direct consequence of the distance concentration phenomenon hinted at in the introduction. A deeper examination of the results in the work of Mai and Couillet (2018) allows us to discover that this problem of unlabelled data learning efficiency may in fact be settled through the usage of a centered similarity measure, as opposed to the current convention of non-negative similarities w_ij. In the following subsections, we recall the findings in the analysis of Mai and Couillet (2018), then move on to the proposition of the novel corrective algorithm, along with some general remarks explaining the effectiveness of the proposed algorithm, leaving the thorough performance analysis to the next section.

Conforming to the settings employed by Mai and Couillet (2018), we adopt the following high dimensional data model for the theoretical discussions in this paper.
Assumption 1
Data samples x_1, . . . , x_n are i.i.d. observations from a generative model such that, for k ∈ {1, 2}, P(x_i ∈ C_k) = ρ_k, and

x_i ∈ C_k ⇔ x_i ∼ N(µ_k, C_k)

with ‖C_k‖ = O(1), ‖C_k^{−1}‖ = O(1), ‖µ_1 − µ_2‖ = O(1), tr C_1 − tr C_2 = O(√p) and tr[(C_1 − C_2)²] = O(√p). The ratios c_0 = n/p, c_[l] = n_[l]/p and c_[u] = n_[u]/p are bounded away from zero for arbitrarily large p.

Here are some remarks to interpret the conditions imposed on the data means µ_k and covariance matrices C_k in Assumption 1. Firstly, as the discussion is placed under a large dimensional context, we need to ensure that the data vectors do not lie in a low dimensional manifold; the fact that ‖C_k‖ = O(1) along with ‖C_k^{−1}‖ = O(1) guarantees non-negligible variations in p linearly independent directions. The other conditions controlling the differences between the class statistics, ‖µ_1 − µ_2‖ = O(1), tr C_1 − tr C_2 = O(√p), and tr[(C_1 − C_2)²] = O(√p), are made for the consideration of establishing non-trivial classification scenarios where the classification of unlabelled data does not become impossible or overly easy at extremely large values of p.

The first result concerns the distance concentration of high dimensional data. This result is at the core of the reasons why Laplacian-based semi-supervised learning is bound to fail with large dimensional data.

Proposition 1
Define τ = tr(C_1 + C_2)/p. Under Assumption 1, we have that, for all i ≠ j ∈ {1, . . . , n},

(1/p) ‖x_i − x_j‖² = τ + O(p^{−1/2}).

The above proposition indicates that in large dimensional spaces, all pairwise distances of data samples converge to the same value, thereby indicating that the presumed connection between proximity and data affinity is completely disrupted. In such situations, the performance of the Laplacian regularization approach (along with most distance-based classification methods), which normally works well in small dimensions, may be severely affected. Indeed, under some mild smoothness conditions on the weight function h, the analysis of Mai and Couillet (2018) reveals several surprising and critical aspects of the high dimensional behavior of this approach. The first conclusion is that all unlabelled data scores f_i for n_[l]+1 ≤ i ≤ n tend to have the same sign in the case of unequal class priors (i.e., ρ_1 ≠ ρ_2), causing all unlabelled data to be classified in the same class (unless one normalizes the deterministic scores at labelled points so that they are balanced for each class). In accordance with this message, we shall use in the remainder of the article a class-balanced f_[l] defined as below:

f_[l] = ( I_{n_[l]} − (1/n_[l]) 1_{n_[l]} 1_{n_[l]}^T ) y_[l]    (6)

where y_[l] ∈ R^{n_[l]} is the label vector composed of y_i for 1 ≤ i ≤ n_[l].

Nevertheless, even with balanced f_[l] as per (6), Mai and Couillet (2018) show that the aforementioned "all data affected to the same class" problem still persists for all Laplacian regularization algorithms under the framework of a-normalized Laplacians (i.e., for L^(a) = I_n − D^{−1−a} W D^{a}) except for a ≃ −1. This indicates that among all existing Laplacian regularization algorithms proposed in the literature, only the random walk normalized Laplacian regularization yields non-trivial classification results for large dimensional data. We recall in the following theorem the exact statistical characterization of f_[u] produced by the random walk normalized Laplacian regularization, which was first presented by Mai and Couillet (2018).

Theorem 2
Let Assumption 1 hold, the function h of (1) be three-times continuously differentiable in a neighborhood of τ, and the solution f_[u] be given by (5) for a = −1. Then, for n_[l]+1 ≤ i ≤ n (i.e., x_i unlabelled) and x_i ∈ C_k,

p ( c_0 / (ρ_1 ρ_2 c_[l]) ) f_i = ˜f_i + o_P(1),  where ˜f_i ∼ N(m_k, σ_k²),

with

m_k = (−1)^k (1 − ρ_k) [ −( h'(τ)/h(τ) ) ‖µ_1 − µ_2‖² + ( h''(τ)/h(τ) − h'(τ)²/h(τ)² ) (tr C_1 − tr C_2)²/p ]    (7)

σ_k² = 4 ( h'(τ)²/h(τ)² ) [ (µ_1 − µ_2)^T C_k (µ_1 − µ_2) + (1/c_[l]) Σ_{a=1}^{2} ρ_a^{−1} tr(C_a C_k)/p ] + ( h''(τ)/h(τ) − h'(τ)²/h(τ)² )² tr(C_k²) (tr C_1 − tr C_2)²/p².    (8)

Theorem 2 states that the classification score f_i of an unlabelled x_i follows approximately a Gaussian distribution at large values of p, with the mean and variance explicitly dependent on the data statistics µ_k, C_k, the class proportions ρ_k, and the ratio of labelled data over dimensionality c_[l]. The asymptotic probability of correct classification for unlabelled data is then a direct result of Theorem 2, and reads

P( x_i → C_k | x_i ∈ C_k, i > n_[l] ) = Φ( √( m_k²/σ_k² ) ) + o(1)    (9)

where Φ(u) = (1/√(2π)) ∫_{−∞}^{u} e^{−t²/2} dt is the cumulative distribution function of the standard Gaussian distribution.

Of utmost importance here is the observation that, while m_k²/σ_k² is an increasing function of c_[l], suggesting an effective learning from the labelled set, it is independent of the unlabelled data ratio c_[u]. This tells us that in the case of high dimensional data, the addition of unlabelled data, even in significant numbers with respect to the dimensionality p, produces negligible performance gain. Motivated by this crucial remark, we propose in this paper a simple and fundamental update to the classical Laplacian regularization approach, for the purpose of boosting high dimensional learning performance through an enhanced utilization of unlabelled data. The proposed algorithm is presented and intuitively justified in the next section.
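Before moving on to the proposed fix, the short simulation below gives a concrete feel for the distance concentration of Proposition 1 that drives the above conclusions; the dimensions and class statistics are assumed purely for illustration.

```python
import numpy as np

# Illustrative check of Proposition 1 (all parameters below are assumptions chosen
# for the demonstration): two Gaussian classes with opposite means in dimension p.
rng = np.random.default_rng(0)
p, n = 1000, 200
mu = np.zeros(p)
mu[0] = 1.0                                          # ||mu_1 - mu_2|| = 2 = O(1)
y = rng.choice([-1.0, 1.0], size=n)                  # balanced class memberships
X = rng.standard_normal((n, p)) + y[:, None] * mu    # x_i ~ N(y_i * mu, I_p), so C_1 = C_2 = I_p

sq = np.sum(X ** 2, axis=1)
dist2 = (sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / p   # (1/p) ||x_i - x_j||^2
off_diag = dist2[~np.eye(n, dtype=bool)]
print(off_diag.min(), off_diag.max())                # both close to tau = tr(C_1 + C_2)/p = 2
```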
3. Proposed Regularization with Centered Similarities
As will be put forward in Subsection 3.1, we find that the unlabelled data learning efficiency problem of the Laplacian regularization method, revealed by Mai and Couillet (2018), is rooted in the concentration of pairwise distances between data vectors of high dimensionality. To counter the disastrous effect of the distance concentration problem, a new regularization approach with centered similarities is proposed in Subsection 3.2. An alternative interpretation of the proposed method from the perspective of label propagation is given in Subsection 3.3, justifying its usage in general scenarios beyond the discussed high dimensional regime.
To gain perspective on the cause of inefficient learning from unlabelled data, we start with a discussion linking the issue to the high dimensionality of the data.

Developing (5), we get

f_[u] = ( L^(a)_[uu] )^{−1} D^{−1−a}_[u] W_[ul] D^{a}_[l] f_[l]

where

W = [ W_[ll]  W_[lu] ; W_[ul]  W_[uu] ]  and  D = [ D_[l]  0 ; 0  D_[u] ].

From a graph-signal processing perspective (Shuman et al., 2013), since L^(a)_[uu] is the Laplacian matrix on the subgraph of unlabelled data, and a smooth signal s_[u] on the unlabelled data subgraph typically induces large values of the inverse smoothness penalty s_[u]^T ( L^(a)_[uu] )^{−1} s_[u], we may consider the operator P_u(s_[u]) = ( L^(a)_[uu] )^{−1} s_[u] as a "smoothness filter" strengthening smooth signals on the unlabelled data subgraph. The unlabelled scores f_[u] can therefore be seen as obtained by a two-step procedure:

1. propagating the predetermined labelled scores f_[l] through the graph with the a-normalized weight matrix D^{−1−a}_[u] W_[ul] D^{a}_[l], via the label propagation operator P_l(f_[l]) = D^{−1−a}_[u] W_[ul] D^{a}_[l] f_[l];

2. passing the received scores at unlabelled points through the smoothness filter P_u(s_[u]) = ( L^(a)_[uu] )^{−1} s_[u] to finally get f_[u] = P_u( P_l(f_[l]) ).

It is easy to see that the first step is essentially a supervised learning process, whereas the second one allows to capitalize on the global information contained in unlabelled data. However, as a consequence of the distance concentration "curse" stated in Proposition 1, the similarities (weights) w_ij between high dimensional data vectors are dominated by the constant value h(τ) plus some small fluctuations, which results in the collapse of the smoothness filter:

P_u(s_[u]) = ( L^(a)_[uu] )^{−1} s_[u] ≃ ( I_{n_[u]} − (1/n) 1_{n_[u]} 1_{n_[u]}^T )^{−1} s_[u] = s_[u] + (1/n_[l]) ( 1_{n_[u]}^T s_[u] ) 1_{n_[u]},

meaning that at large values of p, only the constant signal direction 1_{n_[u]} is amplified by the smoothness filter P_u.

To understand this behavior of the smoothness filter P_u, we recall that, as mentioned in Subsection 2.1, constant signals with the same value at all points are always considered to be the most smooth on the graph. This comes from the fact that all weights w_ij have non-negative values, so the smoothness penalty term Q(s) = (1/2) Σ_{i,j} w_ij (s_i − s_j)² is minimized at the value of zero if all elements of the signal s have the same value. Notice also that in perfect situations where the data points in different class subgraphs are connected with zero weights w_ij, class indicators (i.e., signals with constant values within class subgraphs which are different for each class) are just as smooth as constant signals, for they also minimize the smoothness penalty term to zero. Even though such scenarios almost never happen in real life, it is hoped that the inter-class similarities are sufficiently weak so that the smoothness filter P_u is still effective. What is problematic for high dimensional learning is that, when the similarities w_ij tend to be indistinguishable due to the distance concentration issue of high dimensional data vectors, constant signals have overwhelming advantages, to the point that they become the only direction privileged by the smoothness filter P_u, with almost no discrimination between all other directions.
In consequence, there is nearly no utilization of the global information in high dimensional unlabelled data through Laplacian regularization.

In view of the above discussion, we shall try to eliminate the dominant advantage of constant signals, in an attempt to render detectable the discrimination between class-structured signals and other noisy directions. As constant signals always have a smoothness penalty of zero, a very easy way to break their optimal smoothness is to introduce negative weights in the graph so that the values of the smoothness regularizer can go below zero. More specifically, in the cases where the intra-class similarities are on average positive and the inter-class similarities are on average negative, class-structured signals are bound to have a lower smoothness penalty than constant signals. However, the implementation of such an idea using both positive and negative similarities is hindered by the fact that the positivity of the data point degrees d_i = Σ_{j=1}^n w_ij is no longer ensured, and having negative degrees can lead to severely unstable results. Take for instance the label propagation step P_l(f_[l]) = D^{−1−a}_[u] W_[ul] D^{a}_[l] f_[l]: at an unlabelled point x_i, the sum of the received scores after that step equals d_i^{−1−a} Σ_{j=1}^{n_[l]} (w_ij d_j^{a}) f_j, the sign of which obviously alters if the signs of the degree of that point and those of labelled data change, leading thus to extremely unstable classification results.

To cope with the problem identified above, we propose here the usage of centered similarities ŵ_ij, for which the positive and negative weights are balanced out at any data point, i.e., for all i ∈ {1, . . . , n}, d̂_i = Σ_{j=1}^n ŵ_ij = 0. Given any similarity matrix W, its centered version Ŵ is easily obtained by applying the projection matrix P = I_n − (1/n) 1_n 1_n^T on both sides:

Ŵ = P W P.

As a first advantage, the centering approach allows to remove the degree matrix altogether (for the degrees are exactly zero now) from the updated smoothness penalty

Q̂(s) = (1/2) Σ_{i,j=1}^n ŵ_ij (s_i − s_j)² = −s^T Ŵ s,    (10)

securing thus a stable behavior of graph regularization with both positive and negative weights.

This being said, a problematic consequence of regularization procedures employing positive and negative weights is that the optimization problem is no longer convex and may have an unbounded solution. To deal with this issue, we add a constraint on the norm of the solution. Letting f_[l] be given by (6), the new optimization problem may now be posed as follows:

min_{f_[u] ∈ R^{n_[u]}} −f^T Ŵ f
s.t. ‖f_[u]‖² = n_[u] e²    (11)

for some e > 0. This problem can be solved by associating a Lagrange multiplier α = α(e) with the norm constraint ‖f_[u]‖² = n_[u] e², and the solution reads

f_[u] = ( α I_{n_[u]} − Ŵ_[uu] )^{−1} Ŵ_[ul] f_[l]    (12)

with α > ‖Ŵ_[uu]‖ uniquely given by

‖ ( α I_{n_[u]} − Ŵ_[uu] )^{−1} Ŵ_[ul] f_[l] ‖² = n_[u] e².    (13)

To see that (12) is the unique solution to the optimization problem (11), it is useful to remark that, by the properties of convex optimization, (12) is the unique solution to the unconstrained convex optimization problem min_{f_[u]} α‖f_[u]‖² − f^T Ŵ f for some α > ‖Ŵ_[uu]‖. When Equation (13) is satisfied, we get (through a proof by contradiction) that (12) is the only solution that minimizes −f^T Ŵ f in the subspace defined by ‖f_[u]‖² = n_[u] e².

In practice, α can be used directly as a hyperparameter for a more convenient implementation. We summarize the method in Algorithm 1. The proposed algorithm induces almost no extra cost compared to the classical Laplacian approach, except the addition of the hyperparameter α controlling the norm of f_[u]. The performance analysis in Section 4 will help demonstrate that the existence of this hyperparameter, aside from making the regularization with centered similarities a well-posed problem, allows one to adjust the combination of labelled and unlabelled information in search for an optimal semi-supervised learning performance. As a justification of its usage in a general context (beyond the discussed high dimensional regime), the following subsection provides an alternative interpretation of the proposed method from the perspective of label propagation.

Algorithm 1
Semi-Supervised Graph Regularization with Centered Similarities

Input: n_[l] pairs of labelled points and labels {(x_1, y_1), . . . , (x_{n_[l]}, y_{n_[l]})} with y_i ∈ {−1, 1} the class label of x_i, and n_[u] unlabelled data {x_{n_[l]+1}, . . . , x_n}.
Output: classification of the unlabelled data {x_{n_[l]+1}, . . . , x_n}.

1. Compute the similarity matrix W.
2. Compute the centered similarity matrix Ŵ = P W P with P = I_n − (1/n) 1_n 1_n^T, and define Ŵ = [ Ŵ_[ll]  Ŵ_[lu] ; Ŵ_[ul]  Ŵ_[uu] ].
3. Set f_[l] = ( I_{n_[l]} − (1/n_[l]) 1_{n_[l]} 1_{n_[l]}^T ) y_[l] with y_[l] the vector containing the labels y_i.
4. Compute the class scores of unlabelled data f_[u] = ( α I_{n_[u]} − Ŵ_[uu] )^{−1} Ŵ_[ul] f_[l] for some α > ‖Ŵ_[uu]‖.
5. Classify the unlabelled data {x_{n_[l]+1}, . . . , x_n} according to the signs of f_[u].

Similarly to Laplacian regularization, the proposed method can be interpreted from the perspective of label propagation (Zhu and Ghahramani, 2002). Setting f^(0)_[u] ← α^{−1} Ŵ_[ul] f_[l], we retrieve the solution (12) of centered regularization at the stationary point f^(∞)_[u] of the following iteration:

f^(t+1)_[u] ← α^{−1} [ 0_{n_[u]×n_[l]}  I_{n_[u]} ] P W P [ f_[l] ; f^(t)_[u] ].

Denoting f^(t) = [ f_[l] ; f^(t)_[u] ], the above process can be seen as propagating the centered score vector f̂^(t) = P f^(t) through the weight matrix W and recentering the received scores η^(t) = W f̂^(t) before outputting f^(t+1)_[u] as the subset of η̂^(t) = P η^(t) corresponding to the unlabelled points.

Recall from the discussion in Subsection 3.1 that the extreme amplification of the constant signal 1_{n_[u]} in the outcome f_[u] of the Laplacian method is closely related to the ineffective unlabelled data learning problem. In the proposed approach, the constant signal is cancelled thanks to the recentering operations before and after the label propagation over W. The existence of the multiplier α^{−1} allows us to magnify the score vector, after its norm was significantly reduced by the recentering operations.

Since the regularization method with centered similarities can be viewed as a label propagation of the recentered score vector over the original weight matrix W, the proposed method, despite being motivated under the scenario of high dimensional learning, is expected to yield competitive (if not superior) performance even when the original Laplacian approach works well thanks to an informative weight matrix W. This claim is notably supported by simulations displayed in Subsection 6.3, where the proposed method is observed to perform better than the Laplacian regularization (and other graph-based SSL algorithms) on sparse graphs with connections within the same class significantly more frequent than those between different classes.
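As a practical illustration, here is a minimal NumPy sketch of Algorithm 1; the Gaussian kernel and the choice α = 1.1‖Ŵ_[uu]‖ are illustrative assumptions (α remains the hyperparameter discussed above, and any similarity matrix W can be substituted for the kernel used here).

```python
import numpy as np

def centered_regularization(X, y_l, n_l, alpha_factor=1.1, t=1.0):
    """Minimal sketch of Algorithm 1 with a Gaussian kernel and alpha = alpha_factor * ||W_hat_uu||."""
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    W = np.exp(-(sq[:, None] + sq[None, :] - 2.0 * X @ X.T) / t)  # any similarity matrix W
    P = np.eye(n) - np.ones((n, n)) / n                           # centering projector
    W_hat = P @ W @ P                                             # centered similarities
    W_uu, W_ul = W_hat[n_l:, n_l:], W_hat[n_l:, :n_l]
    f_l = y_l.astype(float) - y_l.mean()                          # class-balanced labelled scores (6)
    alpha = alpha_factor * np.linalg.norm(W_uu, 2)                # any alpha > ||W_hat_[uu]|| is admissible
    f_u = np.linalg.solve(alpha * np.eye(n - n_l) - W_uu, W_ul @ f_l)  # closed form (12)
    return np.sign(f_u)                                           # class decisions for the unlabelled points
```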
4. Performance Analysis
The main purpose of this section is to provide mathematical support for the effective high dimensional learning capabilities of the proposed method, drawing not only on labelled data but also on unlabelled data, and allowing for a theoretically guaranteed performance gain over the classical Laplacian approach (through an enhanced utilization of unlabelled data). The theoretical results also point out that the proposed method has an unlabelled data learning efficiency at least as good as that of spectral clustering, as opposed to Laplacian regularization.
We provide here the statistical characterization of the unlabelled data scores f_[u] obtained by the proposed algorithm. As the new algorithm will be shown to draw on both labelled and unlabelled data, the complex interactions between these two types of data generate more intricate outcomes than in (Mai and Couillet, 2018). To facilitate the interpretation of the theoretical results without cumbersome notations, we present the theorem here under the homoscedasticity of data vectors, i.e., C_1 = C_2 = C, without affecting the generality of the conclusions given subsequently. We refer the interested reader to the appendix for an extended version of the theorem along with its proof.

We introduce first two positive functions m(ξ) and σ²(ξ) which are crucial for describing the statistical distribution of unlabelled scores:

m(ξ) = 2 c_[l] θ(ξ) / [ c_[u] ( 1 − θ(ξ) ) ]    (14)

σ²(ξ) = [ ρ_1 ρ_2 ( 2 c_[l] + m(ξ) c_[u] )² s(ξ) + ρ_1 ρ_2 ( 4 c_[l]² + m(ξ)² c_[u]² ) q(ξ) ] / [ c_[u]² ( c_[u] − q(ξ) ) ]    (15)

where

θ(ξ) = ρ_1 ρ_2 ξ (µ_1 − µ_2)^T ( I_p − ξ C )^{−1} (µ_1 − µ_2),
q(ξ) = ξ² p^{−1} tr[ ( I_p − ξ C )^{−2} C² ],
s(ξ) = ρ_1 ρ_2 ξ² (µ_1 − µ_2)^T ( I_p − ξ C )^{−1} C ( I_p − ξ C )^{−1} (µ_1 − µ_2).

Here the positive functions m(ξ) and σ²(ξ) are defined respectively on the domains (0, ξ_m) and (0, ξ_σ), with ξ_m, ξ_σ > 0 given by θ(ξ_m) = 1 and q(ξ_σ) = c_[u]. Additionally, we define

ξ_sup = min{ ξ_m, ξ_σ }.    (16)

These definitions may at first glance seem complicated, but it suffices to keep in mind a few key messages to understand the theoretical results and their implications:

• θ(ξ), q(ξ) and s(ξ) are all positive and strictly increasing functions for ξ ∈ (0, ξ_sup); consequently so are m(ξ) and σ²(ξ).

• ξ_m does not depend on c_[l] or c_[u]; as for ξ_σ, it is constant with c_[l] but increases as c_[u] increases.

• ρ_1 ρ_2 m(ξ)² + σ²(ξ) monotonically increases from zero to infinity as ξ increases from zero to ξ_sup.

The above remarks can be derived directly from the definitions of the involved mathematical objects.

Theorem 3
Let Assumption 1 hold with C_1 = C_2 = C, the function h of (1) be three-times continuously differentiable in a neighborhood of τ, and f_[u] be the solution of (11) with fixed norm ‖f_[u]‖² = n_[u] e², with the notations m(ξ), σ²(ξ), ξ_sup given in (14), (15), (16). Then, for n_[l]+1 ≤ i ≤ n (i.e., x_i unlabelled) and x_i ∈ C_k,

f_i = ˜f_i + o_P(1),  where ˜f_i ∼ N( (−1)^k (1 − ρ_k) m̂, σ̂² )

with m̂ = m(ξ_e), σ̂² = σ²(ξ_e), for ξ_e ∈ (0, ξ_sup) uniquely given by

ρ_1 ρ_2 m(ξ_e)² + σ²(ξ_e) = e².

Theorem 3 implies that the performance of the proposed method is controlled by both c_[l] and c_[u] (the numbers of labelled and unlabelled samples per dimension), as m(ξ) and σ²(ξ) (given by (14), (15)) depend on c_[l] and c_[u]. It is however hard to see directly from these results a consistently increasing performance with both c_[l] and c_[u]. As a first objective of this subsection, we translate the theorem into more interpretable results.

First, it should be pointed out that with the approach of centered similarities, the norm of the unlabelled data score vector f_[u] can be controlled through the adjustment of the hyperparameter e, as opposed to the Laplacian regularization methods. As will be demonstrated later in this section, the norm of f_[u], or more precisely the norm of its deterministic part E{f_[u]}, directly affects how much the learning process relies on the unlabelled (versus labelled) data. With E{f_[u]} given by Theorem 3 for high dimensional data, we indeed note that

‖E{f_[u]}‖ / ( ‖f_[l]‖ + ‖E{f_[u]}‖ ) = c_[u] m̂ / ( 2 c_[l] + c_[u] m̂ ) + o_P(1) = θ(ξ_e) + o_P(1)

as it can be obtained from (14) that

θ(ξ) = c_[u] m(ξ) / ( 2 c_[l] + c_[u] m(ξ) ).

In the following discussion, we shall use the variance over squared mean ratio

r_ctr ≡ σ̂² / m̂²    (17)

as the inverse performance measure for the method of centered regularization (i.e., smaller values of r_ctr translate into better classification results for high dimensional data). A reorganization of the results in Theorem 3 leads to the corollary below.

Corollary 4
Under the conditions and notations of Theorem 3, and with r_ctr defined in (17), we have

r_ctr / (ρ_1 ρ_2) = s(ξ_e)/θ(ξ_e)² + ( q(ξ_e)/θ(ξ_e)² ) [ ( θ(ξ_e)²/c_[u] ) ( 1 + r_ctr/(ρ_1 ρ_2) ) + ( 1 − θ(ξ_e) )²/c_[l] ]    (18)

where we recall that θ(ξ) = c_[u] m(ξ) / ( 2 c_[l] + c_[u] m(ξ) ) ∈ (0, 1).

Equation (18) suggests a growing performance with more labelled or unlabelled data, as the last two terms on the right-hand side have respectively c_[u] and c_[l] in their denominators. These two terms are actually quite similar, except for the pair θ(ξ_e)² and (1 − θ(ξ_e))² each associated to one of them, and a factor 1 + r_ctr/(ρ_1 ρ_2) ≥ 1 associated with the term in c_[u].

As said earlier, the quantity θ(ξ_e) = c_[u] m̂ / (2 c_[l] + c_[u] m̂) ∈ (0, 1) reflects how much the learning relies on unlabelled data. Indeed, it can be observed from (18) that r_ctr tends to be only dependent on c_[l] (resp., c_[u]) in the limit θ(ξ_e) → 0 (resp., θ(ξ_e) → 1). As for the factor 1 + r_ctr/(ρ_1 ρ_2) ≥ 1 associated with the unlabelled data term, this factor goes to 1 when the scores of unlabelled data tend to deterministic values, indicating an equivalence between labelled and unlabelled data in this extreme scenario. In a way, the factor 1 + r_ctr/(ρ_1 ρ_2) quantifies how much labelled samples are more helpful than unlabelled data to the learning process.

To demonstrate an effective learning from labelled and unlabelled data, we now show that, for a well-chosen e, r_ctr decreases with c_[u] and c_[l]. Recall that the expressions of θ(ξ), q(ξ) and s(ξ) do not involve c_[u] or c_[l]. It is then easy to see that, at some fixed ξ_e, r_ctr decreases when either c_[u] or c_[l] increases. Adding to this argument the fact that the attainable range (0, ξ_sup) of ξ_e over e > 0 does not depend on c_[l] and only enlarges with greater c_[u] (as can be derived from the definition (16) of ξ_sup), we conclude that the performance of the proposed method consistently benefits from the addition of input data, whether labelled or unlabelled, as formally stated in Proposition 5. These remarks are illustrated in Figure 1, where we plot the probability of correct classification as θ(ξ_e) varies from 0 to 1.

Proposition 5
Under the conditions and notations of Corollary 4, we have that, for any e > 0, there exists e′ > 0 such that

r_ctr( c_[l], c_[u], e ) > r_ctr( c′_[l], c′_[u], e′ )

if c′_[l] ≥ c_[l], c′_[u] ≥ c_[u] and c′_[l] + c′_[u] > c_[l] + c_[u].

Not only is the proposed method of centered regularization able to achieve an effective semi-supervised learning on high dimensional data, it does so with a labelled data learning efficiency lower bounded by that of Laplacian regularization (which is reduced to supervised learning in high dimensions), and an unlabelled data learning efficiency lower bounded by that of spectral clustering, a standard unsupervised learning algorithm on graphs. The focus of the following discussion is to establish this second remark, which implies the superiority of centered regularization over the methods of Laplacian regularization and spectral clustering.

[Figure 1: asymptotic probability of correct classification as θ(ξ_e) varies, for ρ_1 = ρ_2, p = 100, µ_1 = −µ_2 and a covariance profile {C}_{i,j} decaying in |i − j|. Left: various c_[u] with c_[l] = 1. Right: various c_[l] with c_[u] = 8. Optimal values marked with circles.]

Recall from Theorem 2 that, similarly to the centered regularization, the random walk normalized Laplacian algorithm (the only one ensuring non-trivial high dimensional classification among existing Laplacian algorithms) also gives ˜f_i ∼ N( (−1)^k (1 − ρ_k) m′, σ′² ) under the homoscedasticity assumption, for m′ = ( 2 ρ_1 ρ_2 c_[l] / (p c_0) ) ( m_2 − m_1 ) and σ′² = ( 2 ρ_1 ρ_2 c_[l] / (p c_0) )² σ_1² = ( 2 ρ_1 ρ_2 c_[l] / (p c_0) )² σ_2², with m_k, σ_k², k ∈ {1, 2}, given in Theorem 2. Similarly to the definition of r_ctr, we denote

r_Lap ≡ σ′² / m′².    (19)

Since θ(ξ_e) → 0 as ξ_e → 0, and ξ_e → 0 as e → 0, we obtain the following proposition from the results of Theorem 2 and Corollary 4.
Proposition 6
Under the conditions and notations of Theorem 2 and Corollary 4, letting r_Lap be defined by (19), we have that

lim_{e → 0} r_ctr = r_Lap = (µ_1 − µ_2)^T C (µ_1 − µ_2) / ‖µ_1 − µ_2‖⁴ + tr C² / ( p ‖µ_1 − µ_2‖⁴ ρ_1 ρ_2 c_[l] ).

We thus remark that the performance of Laplacian regularization is retrieved by the method proposed in the present article in the limit e → 0.

Let us now turn to the comparison with spectral clustering. Recall that the smoothness penalty Q(s) of a signal s can be written as Q(s) = s^T L s. In an unsupervised learning manner, we shall seek the unit-norm vector that minimizes the smoothness penalty, which is the eigenvector of L associated with the smallest eigenvalue. However, as Q(s) reaches its minimum at the clearly non-informative flat vector s = 1_n, the sought-for solution is provided instead by the eigenvector associated with the second smallest eigenvalue. In contrast, the updated smoothness penalty term Q̂(s) = −s^T Ŵ s with centered similarities does not achieve its minimum at "flat" signals, and thus the eigenvector associated with the smallest eigenvalue is here a valid solution. Another important aspect is that spectral clustering based on the unnormalized Laplacian matrix L = D − W has long been known to behave unstably (Von Luxburg et al., 2008), as opposed to the symmetric normalized Laplacian L_s = I_n − D^{−1/2} W D^{−1/2}, so a fair comparison should be made versus L_s rather than L.

Let us define d_inter(v) as the inter-cluster distance operator that takes as input a real-valued vector v of dimension n, then returns the distance between the centroids of the clusters formed by the sets of points {v_i | 1 ≤ i ≤ n, x_i ∈ C_k}, for k ∈ {1, 2}; and d_intra(v) as the intra-cluster distance operator that returns the standard deviation within clusters. Namely,

d_inter(v) = | j_1^T v / n_1 − j_2^T v / n_2 |
d_intra(v) = ‖ v − ( j_1^T v / n_1 ) j_1 − ( j_2^T v / n_2 ) j_2 ‖ / √n

where j_k ∈ R^n, k ∈ {1, 2}, is the indicator vector of class k with [j_k]_i = 1 if x_i ∈ C_k and [j_k]_i = 0 otherwise, and n_k is the number of ones in the vector j_k. As the purpose of clustering analysis is to produce clusters conforming to the intrinsic classes of data points, with low variance within a cluster and large distance between clusters, the following proposition (see the proof in the appendix) shows that the performance of the classical normalized spectral clustering, which has been studied by Couillet et al. (2016) under the high dimensional setting, is practically the same as that of spectral clustering with centered similarities on high dimensional data.

Proposition 7
Under the conditions of Theorem 3, let v_Lap be the eigenvector of L_s associated with the second smallest eigenvalue, and v_ctr the eigenvector of Ŵ associated with the largest eigenvalue. Then,

d_inter(v_Lap) / d_intra(v_Lap) = d_inter(v_ctr) / d_intra(v_ctr) + o_P(1)

for non-trivial clustering with d_inter(v_Lap)/d_intra(v_Lap), d_inter(v_ctr)/d_intra(v_ctr) = O(1).

As explained before, the solution f_[u] of the centered similarities regularization can be expressed as f_[u] = ( α I_{n_[u]} − Ŵ_[uu] )^{−1} Ŵ_[ul] f_[l] for some α > ‖Ŵ_[uu]‖ (dependent on e as indicated in (13)). Clearly, as α ↓ ‖Ŵ_[uu]‖, f_[u] tends to align with the eigenvector of Ŵ_[uu] associated with the largest eigenvalue. Therefore, the performance of spectral clustering on the unlabelled data subgraph is retrieved as e → +∞.

In summary of the discussion in this section, we conclude that the proposed regularization method with centered similarities

• recovers the high dimensional performance of Laplacian regularization as e → 0;

• recovers the high dimensional performance of spectral clustering as e → +∞;

• accomplishes a consistent high dimensional semi-supervised learning for e appropriately set between the two extremes, thus leading to an increasing performance gain over Laplacian regularization with greater amounts of unlabelled data.

[Figure 2: Empirical and theoretical accuracy as a function of c_[u] with c_[l] = 2, ρ_1 = ρ_2, p = 100, µ_2 = −µ_1, and C = I_p (left) or {C}_{i,j} decaying in |i − j| (right). Graph constructed with w_ij = e^{−‖x_i − x_j‖²/p}. Averaged over 50000/n_[u] iterations.]
5. Experimentation
The objective of this section is to provide empirical evidence to support the proposed regularization method with centered similarities, by comparing it with Laplacian regularization through simulations under and beyond the settings of the theoretical analysis.
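To fix ideas, the sketch below generates a synthetic two-class Gaussian mixture of the kind used in the first set of simulations; the dimension, means and class priors are assumptions loosely inspired by the caption of Figure 2 rather than the exact experimental values.

```python
import numpy as np

def two_class_gaussian(n_l, n_u, p=100, rng=None):
    """Synthetic setting in the spirit of Assumption 1 (parameters assumed for illustration)."""
    rng = rng or np.random.default_rng()
    mu = np.zeros(p)
    mu[0] = 1.0                                        # mu_2 = -mu_1, so ||mu_1 - mu_2|| = O(1)
    n = n_l + n_u
    y = rng.choice([-1.0, 1.0], size=n)                # equal class priors rho_1 = rho_2
    X = rng.standard_normal((n, p)) + y[:, None] * mu  # x_i ~ N(y_i * mu, I_p)
    return X, y[:n_l], y[n_l:]                         # unlabelled labels kept only for scoring

X, y_l, y_u = two_class_gaussian(n_l=200, n_u=800)
# Feeding (X, y_l) to the Laplacian and centered regularization sketches of the previous
# sections and tracking np.mean(np.sign(f_u) == y_u) as n_u grows reproduces the kind of
# comparison reported in Figure 2 (up to the assumed parameters above).
```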
We first validate the asymptotic results of the above section on finite data sets of relatively small sizes (n, p ∼ 100). Recall that the high dimensional accuracies of Laplacian regularization and spectral clustering are obtained from Theorem 3 respectively in the limits e → 0 and e → +∞ (when spectral clustering yields non-trivial solutions); this is how the theoretical values of both methods are computed in Figure 2. The finite-sample results are given for the best (oracle) choice of the hyperparameter a in the generalized Laplacian matrix L^(a) = I_n − D^{−1−a} W D^{a} for Laplacian regularization and spectral clustering, and for the optimal (oracle) choice of the hyperparameter α for centered regularization. Under a non-trivial Gaussian mixture model setting (see caption) with p = 100, Figure 2 demonstrates a sharp prediction of the average empirical performance by the asymptotic analysis. As revealed by the theoretical results, the Laplacian regularization fails to learn effectively from unlabelled data, causing it to be outperformed by the purely unsupervised spectral clustering approach (for which the labelled data are treated as unlabelled ones) for sufficiently numerous unlabelled data. The performance curve of the proposed centered approach, on the other hand, is consistently above that of spectral clustering, with a growing advantage over Laplacian regularization as the number of unlabelled data increases.

Figure 2 also interestingly shows that the unsupervised performance of spectral clustering is noticeably reduced when the covariance matrix of the data distribution changes from the identity matrix to a slightly disrupted model (here with {C}_{i,j} decaying in |i − j|). On the contrary, the Laplacian regularization, the high dimensional performance of which relies essentially on labelled data, is barely affected. This is explained by the different impacts labelled and unlabelled data have on the learning process, which can be understood from the theoretical results of the above section.

While the performance analysis of this article is placed under the Gaussianity of data vectors, we expect the proposed method to exhibit its advantage of non-negligible unlabelled data learning over the Laplacian approach in a broader context of high dimensional learning. Indeed, as discussed in Subsection 3.1, the key element causing the unlabelled data learning inefficiency of Laplacian regularization is the negligible distinction between inter-class and intra-class similarities, induced by the distance concentration of high dimensional data. It is important to understand that this concentration phenomenon is essentially irrespective of the Gaussianity of the data. Proposition 1 can indeed be generalized to a wider statistical model by a mere law of large numbers; this is the case for instance of all high dimensional data vectors x_i of the form x_i = µ_k + C_k^{1/2} z_i, for k ∈ {1, 2}, where µ_k ∈ R^p, C_k ∈ R^{p×p} are means and covariance matrices as specified in Assumption 1 and z_i ∈ R^p is any random vector of independent elements with zero mean, unit variance and bounded fourth order moment. Beyond this model of z_i with independent entries, the recent work of Louart and Couillet (2018) strongly suggests that Proposition 1 remains valid for the wider class of concentrated vectors x_i (i.e., satisfying a concentration of measure phenomenon (Ledoux, 2005)), including in particular generative models of the type x_i = F(z_i) for z_i ∼ N(0, I_p) and F : R^p → R^p any 1-Lipschitz mapping (for instance, artificial images produced by generative adversarial networks (Goodfellow et al., 2014)).

The main objective of this subsection is to provide an actual sense of how the Laplacian regularization approach and the proposed method behave under different levels of distance concentration. We first give here, as a real-life example, simulations on datasets from the standard MNIST database of handwritten digits (LeCun, 1998), which are depicted in Figures 3-4.

[Figure 3: panels "Digits (3, 5)" and "Digits (7, 8, 9)". Top: distribution of normalized pairwise distances ‖x_i − x_j‖/δ̄ (i ≠ j), with δ̄ the average of ‖x_i − x_j‖, for MNIST data, separated into intra-class and inter-class pairs. Bottom: average accuracy of Laplacian and centered regularization as a function of n_[u] with n_[l] = 15 (left) or n_[l] = 10 (right), computed over 1000 random realizations with 99% confidence intervals represented by shaded regions.]

For a fair comparison of Laplacian and centered regularizations, the results displayed here are obtained on their respective best performing graphs, selected among the k-nearest neighbors graphs (which were observed to yield very competitive performance on MNIST data) with various numbers of neighbors k ∈ {2, . . . , 2^q}, for q the largest integer such that 2^q < n. The hyperparameters of the Laplacian and centered regularization approaches are set optimally within the admissible range. It is worth pointing out that the popular KNN graphs, constructed by letting w_ij = 1 if data point x_i or x_j is among the k nearest (k being the parameter to be set beforehand) to the other data point, and w_ij = 0 if not, are not covered by the present analytic framework. Our study only deals with graphs where w_ij is exclusively determined by the distance between x_i and x_j, while in KNN graphs w_ij depends on all pairwise distances of the whole data set. Nonetheless, KNN graphs evidently suffer the same problem of distance concentration, for they are still based on the distances between data points. It is thus natural to expect that the proposed centering procedure may also be advantageous on KNN graphs.

Figure 3 shows that high classification accuracy is easily obtained on MNIST data, even with the classical Laplacian approach. However, the latter exhibits a lower learning efficiency compared to the proposed method. We also find that the benefit of the proposed algorithm is more perceptible on the binary classification task displayed on the left side of Figure 3 than on the multiclass task on the right side, for which the difference between inter-class and intra-class distances is more apparent.

[Figure 4: Top: distribution of normalized pairwise distances ‖x_i − x_j‖/δ̄ (i ≠ j), with δ̄ the average of ‖x_i − x_j‖, for noisy MNIST data (7, 8, 9); the panel titles give the SNR of the added noise. Bottom: average accuracy as a function of n_[u] with n_[l] = 15, computed over 1000 random realizations with 99% confidence intervals represented by shaded regions.]
This suggests that the advantage of the proposed method is more related to a subtle distinction between inter-class and intra-class distances than to the number of classes.

As further evidence, Figure 4 presents situations where the learning problem becomes more challenging in the presence of additive noise. Understandably, the distance concentration phenomenon is more acute in this noise-corrupted setting, causing a more subtle distinction between inter-class and intra-class distances. As a result, the performance gain generated by the proposed method should be more significant, according to our discussion at the beginning of this subsection. This is corroborated by Figure 4, where larger performance gains are observed than for the multiclass task on the right side of Figure 3. Moreover, on the right display of Figure 4, where the similarity information is seriously disrupted by the additive noise, we observe the anticipated saturation effect when increasing n_[u] for Laplacian regularization, in contrast to the growing performance of the proposed approach. This suggests, in conclusion, that regularization with centered similarities has a competitive, if not superior, performance in various situations, and yields particularly significant performance gains when the distinction between intra-class and inter-class similarities is quite subtle.

To further test the proposed method on challenging real-world datasets, we also compare the Laplacian and centered similarities methods on the popular Cifar10 database (Krizhevsky et al., 2014). To obtain meaningful results, the data went through a feature extraction step using the standard pre-trained ResNet-50 network (He et al., 2016). Other experimental settings are the same as for the above MNIST data; in particular, the hyperparameter a of Laplacian regularization is searched over a grid of values with step 0.02, and the hyperparameter α of centered regularization within the grid α = (1 + 10^t)‖Ŵ_[uu]‖ where t varies over a grid with step 0.1, the results outside these ranges being observed to be non-competitive. The simulations are reported in Figure 5, where the findings support again the use of the proposed method.

[Figure 5: Average accuracy on two-class Cifar10 data ("automobile" versus "airplane", left; "ship" versus "truck", right) as a function of n_[u] with n_[l] = 10, computed over 1000 random realizations with 99% confidence intervals represented by shaded regions.]
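For completeness, the following sketch computes the diagnostic behind the top panels of Figures 3 and 4 for any dataset; the feature matrix X and label vector y are assumed inputs (e.g., vectorized MNIST digits or ResNet-50 features for Cifar10).

```python
import numpy as np

def distance_concentration_diagnostic(X, y):
    """Normalized pairwise distances split into intra-class and inter-class pairs,
    as in the top panels of Figures 3 and 4; X (n x p) and labels y are assumed inputs."""
    sq = np.sum(X ** 2, axis=1)
    d = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0))
    iu = np.triu_indices(len(y), k=1)                # each pair (i < j) counted once
    d, same = d[iu], (y[:, None] == y[None, :])[iu]
    d = d / d.mean()                                 # normalize by the average pairwise distance
    return d[same], d[~same]                         # intra-class and inter-class distances

# The more the two returned samples overlap around 1, the more severe the distance
# concentration, and the larger the expected gain of centered over Laplacian regularization.
```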
6. Further Discussion
We start this section by presenting other graph-based semi-supervised learning methods and explaining how they relate to the regularization approaches investigated in this paper. To evaluate the ability of these SSL methods to optimally exploit the information in partially labelled data sets, we use the recent results of Lelarge and Miolane (2019) as a reference point, where the best achievable semi-supervised learning performance on high dimensional Gaussian mixture data with identity covariance matrices was characterized. The proposed centered regularization method is found to have a remarkable advantage over other graph-based SSL methods for reaching the optimal performance. We also test on sparse graphs generated from the stochastic block model. These simulations, where we can control the informativeness of the local geometry of the graph, provide additional empirical support for the proposed method from another perspective.
As emphasized multiple times, the focus of this article is to promote the usage of centered similarities in graph regularization for semi-supervised learning. This fundamental idea can also be embedded into more involved graph regularization methods, such as iterated approaches. In parallel to the graph regularization methods, there also exists an alternative approach which uses the spectral information of the graph matrix instead of optimizing the graph smoothness. We briefly discuss these related methods here.

The method of semi-supervised Laplacian regularization can also be problematic outside the high dimensional regime discussed in this article. The earlier work of Nadler et al. (2009) showed that the unlabelled data scores f_i concentrate around the same value (i.e., f_i = c + o(1) for some constant c) in the limit where the number of unlabelled samples is exceedingly large compared to that of labelled ones (i.e., n_[u]/n_[l] → ∞). The follow-up works of Alamgir and Luxburg (2011); Zhou and Belkin (2011); Bridle and Zhu (2013); Kyng et al. (2015); El Alaoui et al. (2016) advocated the usage of higher order regularization techniques to address the problem of flat scores under the same setting as Nadler et al. (2009). Among these techniques, the method of iterated Laplacian regularization, which consists in using powers of Laplacian matrices to construct high order graph smoothness regularizers f^T L^m f, tends to yield highly competitive classification results (Zhou and Belkin, 2011).

While bringing to light this important phenomenon of "flat" unlabelled data scores, the analysis of Nadler et al. (2009), unlike that of Mai and Couillet (2018), did not clarify why non-trivial classification is still empirically observed to be achievable and how the classification performance is affected. Remarkably, the analysis of Mai and Couillet (2018) also pointed out that, in high dimensions, the phenomenon of flat unlabelled data scores occurs even when the number of unlabelled samples is comparable to that of labelled ones. As can be easily deduced from our study, the problem of flat unlabelled scores is addressed by the centered regularization method in the more challenging setting of high dimensional learning. In terms of performance guarantees, as high order regularization techniques include the basic Laplacian regularization as a special case, they are guaranteed to perform no worse than Laplacian algorithms. However, it is not clear how they compare to the unsupervised performance of spectral clustering. Finally, it should be emphasized that the use of centered similarities is not in conflict with the approach of high order regularization. Future studies can be envisioned to further improve the performance by combining these two ideas.

Aside from graph regularization methods, another popular graph-based semi-supervised approach exists which takes advantage of the spectral information of Laplacian matrices (Belkin and Niyogi, 2003, 2004). Rather than regularizing f over the graph, this method first computes the eigenmap of Laplacian matrices, then uses a certain number s of eigenvectors E = [e_1, . . . , e_s] associated with the smallest eigenvalues to build a linear subspace, and searches within this space for an f which minimizes ‖f_[l] − y_[l]‖².
By the method of least squares, f = Ea with a = (E_[l]^T E_[l])^{-1} E_[l]^T y_[l].

As an advantage of using the spectral information, this eigenvector-based method is guaranteed to achieve at least the performance of spectral clustering, as opposed to the Laplacian regularization approach. On the other hand, the regularization approach does not have a performance which depends crucially on how well the class signal is captured by a small number of eigenvectors, as it uses the graph matrix as a whole. Another benefit of the graph regularization approach is that it can be easily incorporated into other algorithms involving optimization, as an additional term in the loss function (e.g., Laplacian SVMs). With our proposed algorithm using centered similarities, a consistent learning of unlabelled data, related to the performance of spectral clustering, can also be achieved by the graph regularization approach. Moreover, the proposed method has a theoretically proven efficient learning of labelled data which is absent in the eigenvector-based method.

A very recent work of Lelarge and Miolane (2019) established the optimal performance of semi-supervised learning on a high dimensional Gaussian mixture data model N(±µ, I_p), with identity covariance matrices. In this work, a method of Bayesian estimation is identified as the one achieving the optimal performance. However, as pointed out by the authors, this method is computationally expensive except on fully labelled datasets, and approximations are needed for practical usage. By comparing the results of this work with our performance analysis in Section 4, we find that the method of centered regularization achieves an optimal performance on fully labelled datasets and a nearly optimal one on partially labelled sets. Numerical results are given in Figure 6, where the classification accuracy of the centered regularization method, computed from Theorem 3 and maximized over the hyperparameter e, is observed to be extremely close to the optimal performance provided by Lelarge and Miolane (2019). Hence, the centered regularization method can be used as a computationally efficient alternative to the Bayesian approach which yields the best achievable performance. In contrast, other graph-based semi-supervised learning algorithms are much less effective in reaching the optimal performance, as can be observed from Figure 7.

We also remark that the iterated Laplacian regularization appears to be comparably less efficient in exploiting unlabelled data, and so is the eigenvector-based method in learning from labelled data. As can be observed in Figure 7, the iterated Laplacian regularization falls notably short of approaching the optimal performance when the value of m yielding the highest accuracy is further away from 1 (scenarios corresponding to the blue curves in the figure). Since we retrieve the standard Laplacian regularization at m = 1, which gives the optimal performance in the absence of unlabelled data, the performance gain yielded by the iterated Laplacian regularization over the Laplacian method is mainly brought by the utilization of unlabelled data at higher m. However, as demonstrated in Figure 7, this utilization of unlabelled data at higher m does not suffice for the method to reach the optimal semi-supervised learning performance.
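For concreteness, the two baselines discussed above can be sketched as follows (Python/NumPy). This is only an illustrative sketch under our own assumptions: the function names and interfaces are ours, we use the unnormalized Laplacian L = D − W, hard label constraints f_[l] = y_[l] for the iterated regularizer f^T L^m f, and dense linear algebra; the exact variants evaluated in Figure 7 (e.g., the Laplacian normalization or the handling of the labelled scores) may differ.

```python
import numpy as np

def iterated_laplacian_ssl(W, y_l, labelled, unlabelled, m=2):
    """Sketch of iterated Laplacian regularization: minimize f^T L^m f
    subject to f_[l] = y_[l] (hard label constraints, our assumption)."""
    L = np.diag(W.sum(axis=1)) - W            # unnormalized Laplacian L = D - W
    S = np.linalg.matrix_power(L, m)          # higher order regularizer f^T L^m f
    S_uu = S[np.ix_(unlabelled, unlabelled)]
    S_ul = S[np.ix_(unlabelled, labelled)]
    # stationarity of the quadratic objective in the unlabelled scores
    f_u = -np.linalg.solve(S_uu, S_ul @ y_l)
    return f_u

def eigenvector_ssl(W, y_l, labelled, s=2):
    """Sketch of the eigenvector-based method: least-squares fit of y_[l]
    in the span E of the s eigenvectors of L with smallest eigenvalues,
    i.e., f = E a with a = (E_[l]^T E_[l])^{-1} E_[l]^T y_[l]."""
    L = np.diag(W.sum(axis=1)) - W
    eigvals, eigvecs = np.linalg.eigh(L)      # eigenvalues in ascending order
    E = eigvecs[:, :s]
    a, *_ = np.linalg.lstsq(E[labelled], y_l, rcond=None)
    return E @ a                              # scores for all points
```

Classification is then obtained by thresholding the returned scores (e.g., by their sign for ±1 labels).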
Since the eigenvector-based approach reduces to the purely unsupervised method of spectral clustering at s = 1, the same remark can be made with respect to its labelled data learning efficiency.

Constructing sparse graphs with good local geometry has been the focus of many research works in graph-based learning. In these graphs, a data point is connected with non-zero weight to only a small portion of the other points, and connections within the same affinity group should be more frequent than between different groups. Sparse graphs are also natural objects in problems of community detection (Fortunato, 2010).
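The sparse graphs used in the experiments reported further below (Table 1) are drawn from a (degree-corrected) stochastic block model, whose connection probabilities q_in, q_out and degree factors r_i are described after Figure 7. A minimal generator sketch is given here; the function name, its interface, and the balanced two-class split in the usage example are our own illustrative assumptions.

```python
import numpy as np

def dcsbm_adjacency(classes, r, q_in, q_out, rng=None):
    """Sketch: sample a symmetric adjacency matrix where nodes i, j are
    connected with probability r_i * r_j * q_in if they share a class and
    r_i * r_j * q_out otherwise (r_i = 1 for all i gives the plain SBM)."""
    rng = np.random.default_rng(rng)
    n = len(classes)
    same = np.equal.outer(classes, classes)
    prob = np.where(same, q_in, q_out) * np.outer(r, r)
    upper = np.triu(rng.random((n, n)) < prob, k=1)   # sample upper triangle only
    W = (upper | upper.T).astype(float)               # symmetrize, no self-loops
    return W

# e.g., the homogeneous setting of Table 1 (Case 1): n = 1000, q_in = 14/n,
# q_out = 7/n, r_i = 1; two balanced classes assumed here for illustration.
n = 1000
classes = np.repeat([0, 1], n // 2)
W = dcsbm_adjacency(classes, np.ones(n), 14 / n, 7 / n, rng=0)
```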
2. To the authors' knowledge, more general results (e.g., with arbitrary covariance matrices) are currently out of reach.
3. We refer to Appendix D for some theoretical details.
Figure 6: Asymptotic accuracy on isotropic Gaussian mixture data: Bayesian (optimal) versus centered regularization. Performance curves as a function of c_[u] with c_[l] = 1/2, for (from top to bottom) ‖µ‖ = 2, …, or 1.
Figure 7: Empirical accuracy of graph-based SSL algorithms (centered, iterated Laplacian, and eigenvector-based) at different values of their respective hyperparameters (log(α/‖Ŵ_[uu]‖ − 1), m, and s) for isotropic Gaussian mixture data with p = 60, n = 360 and ‖µ‖ = 1 (bottom) or ‖µ‖ = 2 (top). Averaged over 1000 realizations. Best empirical value marked with a circle and the asymptotic optimum with a cross.

As our proposed algorithm involves a centering operation on the weight matrix W, it disrupts the sparsity of the weight matrix, as well as the traditional notion of local geometry in sparse graphs. One may thus wonder about the computational efficiency (which benefits from the structure of sparse matrices) and the performance of the proposed method on sparse graphs, in comparison to the original Laplacian approach.

In terms of computational efficiency, note that, even though the centered weight matrix Ŵ is not sparse, it can be written as the sum of W and a matrix of rank two:

Ŵ = W + [1_n  v] A [1_n  v]^T,  where v = W1_n and A = [ (1_n^T W 1_n)/n²   −1/n ; −1/n   0 ].

Using Woodbury's inversion formula, the inverse of αI_{n_[u]} − Ŵ_[uu] can then be decomposed as the inverse of αI_{n_[u]} − W_[uu] plus a matrix of rank two:

(αI_{n_[u]} − Ŵ_[uu])^{−1} = Q − Q [1_{n_[u]}  v_[u]] ( [1_{n_[u]}  v_[u]]^T Q [1_{n_[u]}  v_[u]] − A^{−1} )^{−1} [1_{n_[u]}  v_[u]]^T Q,

where Q = (αI_{n_[u]} − W_[uu])^{−1}. Therefore, the complexity of computing the solution of centered regularization essentially reduces to that of computing QW_[ul]f_[l], which benefits from the sparsity of W (a minimal implementation sketch is given at the end of this discussion).

There remains the question of the learning performance on sparse graphs. Recall first from Subsection 3.3 that the solution of centered regularization can be viewed as a stationary point of a label propagation through W with a centering operation on the input and output score vectors at each iteration. Naturally, the label propagation should be able to exploit the local geometry of W, and we thus expect the centered regularization method to perform well on sparse graphs. To verify this claim, we test the centered regularization method, along with other graph-based SSL algorithms, on sparse graphs generated from stochastic block models (SBMs). SBMs are standard models for simply characterizing an underlying local geometry, where a pair of points x_i, x_j are connected (i.e., w_ij = 1) with probability q_in if they belong to the same class and q_out otherwise. To account for heterogeneous degrees, the degree-corrected SBMs investigated in (Coja-Oghlan and Lanka, 2010; Gulikers et al., 2017) modify the probability of x_i, x_j being connected to r_i r_j q_in for x_i, x_j in the same class and r_i r_j q_out for x_i, x_j in different classes, with r_i reflecting the intrinsic connectivity of node i. The results reported in Table 1 show again a significant advantage of centered regularization over the other methods across various ratios of labelled points, suggesting a highly competitive performance of the proposed method even on sparse graphs. We also observe that the centered regularization method tends to be more robust to heterogeneous degrees than the other methods.
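To illustrate how this rank-two structure can be exploited in practice, here is a minimal sketch in Python (NumPy/SciPy). The interface is entirely our own assumption: the function name, passing W as a symmetric SciPy CSR sparse matrix, integer index arrays for labelled and unlabelled points, a labelled-score vector f_l, and a value of α satisfying the admissibility condition of the paper; the sketch simply evaluates the closed-form solution recalled in the text using sparse LU solves with αI − W_[uu] plus the rank-two Woodbury correction.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import splu

def centered_regularization_scores(W, f_l, labelled, unlabelled, alpha):
    """Sketch: evaluate f_[u] = (alpha*I - W_hat[uu])^{-1} W_hat[ul] f_[l],
    where W_hat = P_n W P_n is the centered similarity matrix, using only
    sparse solves with (alpha*I - W[uu]) and a rank-two correction.
    alpha is assumed to satisfy the paper's condition alpha > ||W_hat[uu]||."""
    n = W.shape[0]
    v = np.asarray(W @ np.ones(n)).ravel()            # v = W 1_n
    s = float(np.ones(n) @ v)                         # 1_n^T W 1_n
    A = np.array([[s / n**2, -1.0 / n],
                  [-1.0 / n, 0.0]])                   # W_hat = W + U A U^T, U = [1_n, v]
    U = np.column_stack([np.ones(n), v])
    U_u, U_l = U[unlabelled], U[labelled]

    # right-hand side b = W_hat[ul] f_[l], computed without forming W_hat
    b = np.asarray(W[unlabelled][:, labelled] @ f_l).ravel() + U_u @ (A @ (U_l.T @ f_l))

    # sparse factorization of alpha*I - W[uu]
    W_uu = W[unlabelled][:, unlabelled]
    M = (alpha * sparse.identity(len(unlabelled), format="csc") - W_uu).tocsc()
    lu = splu(M)

    Qb = lu.solve(b)                                  # Q b
    QU = lu.solve(U_u)                                # Q [1_{n_[u]}, v_[u]]
    # Woodbury: (alpha*I - W_hat[uu])^{-1} = Q - Q U_u (U_u^T Q U_u - A^{-1})^{-1} U_u^T Q
    small = U_u.T @ QU - np.linalg.inv(A)
    f_u = Qb - QU @ np.linalg.solve(small, U_u.T @ Qb)
    return f_u
```

Only two extra sparse solves (for the columns of [1_{n_[u]}, v_[u]]) are needed on top of the solve against W_[ul]f_[l], so the cost remains dominated by operations on the sparse matrix αI − W_[uu].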
7. Concluding Remarks
Table 1: Accuracy of graph-based SSL algorithms on sparse graphs of SBMs, at labelling ratios n_[l]/n = 1/20, 1/10, . . . . Averaged over 1000 realizations. Case 1: n = 1000, q_in = 14/n, q_out = 7/n, homogeneous degrees with r_i = 1 for all i ∈ {1, . . . , n}. Case 2: n = 1000, q_in = 35/n, q_out = 15/n, heterogeneous degrees with r_i taking values in {0.3, 0.5, 1}.

The key to the proposed semi-supervised learning method lies in the replacement of conventional Laplacian regularization by a centering operation on similarities. The motivation behind this operation is rooted in the large dimensional concentration of pairwise-data distances and is thus likely to extend beyond the present graph-based semi-supervised learning schemes. It would in particular be interesting to know whether other advanced learning models involving Laplacian regularization benefit from the same update. A specific example is Laplacian support vector machines (Laplacian SVMs) (Belkin et al., 2006), another widespread semi-supervised learning algorithm. Answering this question for Laplacian SVMs is however not a straightforward extension of the present analysis. Unlike the outcomes of Laplacian regularization, Laplacian SVMs are learned through an optimization problem without an explicit solution; additional technical tools to deal with such implicit objects, such as those recently devised in the work of El Karoui et al. (2013), are required for analyzing their performance.

As already anticipated by the theoretical results, it is not surprising that the proposed centered similarities regularization empirically produces large performance gains over the standard Laplacian regularization method when the aforementioned distance concentration problem is severe on the data under consideration. However, it is quite illuminating to observe that, even on datasets with weak distance concentration, for which the standard Laplacian approach exhibits a clear performance growth with respect to unlabelled data, the advantage of the proposed algorithm is still preserved. This attests to the general potential of such high dimensional studies for improving machine learning algorithms by identifying and settling underlying issues that compromise their learning performance, and which would be difficult to spot if not through high dimensional analyses.
8. Acknowledgements
Couillet's work is supported by the IDEX GSTATS and the MIAI “GSTATS” chairs at University Grenoble Alpes, as well as by the HUAWEI LarDist project.

Appendix A. Generalization of Theorem 3 and Proof
A.1 Generalized Theorem
We first present an extended version of Theorem 3 for the general setting where C may differfrom C . The functions m ( ξ ), σ ( ξ ) defined in (14) and (15) for describing the statisticaldistribution of unlabelled scores in the case of C = C need be adapted as follows: m ( ξ ) = 2 c [ l ] θ ( ξ ) c [ u ] (cid:0) − θ ( ξ ) (cid:1) (20) σ ( ξ ) = ρ ρ (2 c [ l ] + m ( ξ ) c [ u ] ) s ( ξ ) + ρ ρ (4 c l + m ( ξ ) c [ u ] ) q ( ξ ) c [ u ] (cid:0) c [ u ] − q ( ξ ) (cid:1) , k ∈ { , } (21)where θ ( ξ ) = ρ ρ ξ ( ν − ν ) T (cid:0) I p − ξ ¯Σ (cid:1) − ( ν − ν ) q ( ξ ) = ξ p − tr h(cid:0) I p − ξ ¯Σ (cid:1) − ¯Σ i s ( ξ ) = ρ ρ ξ ( ν − ν ) T (cid:0) I p − ξ ¯Σ (cid:1) − ¯Σ (cid:0) I p − ξ ¯Σ (cid:1) − ( ν − ν ) , (22)with ν k = (cid:2)p − h ′ ( τ ) µ T k p h ′′ ( τ ) tr C k / √ p (cid:3) T Σ k = (cid:20) − h ′ ( τ ) C k p × × p h ′′ ( τ ) tr C k /p (cid:21) and ¯Σ = ρ Σ + ρ Σ .Notice that the adaptation is made here through the redefinitions of θ ( ξ ), q ( ξ and s ( ξ );the expressions of m ( ξ ) and σ ( ξ ) are kept identical. As in the case of C = C , the positivefunctions m ( ξ ) and σ ( ξ ) are defined respectively on the domains (0 , ξ m ) and (0 , ξ σ ) with ξ m , ξ σ > θ ( ξ m ) = 1 and q ( ξ σ ) = c [ u ] . We define ξ sup as ξ sup = min { ξ m , ξ σ } (23)With these adapted notations, we present the generalized results in the theorem below. Theorem 8
Let Assumption 1 hold, the function h of (1) be three-times continuously dif-ferentiable in a neighborhood of τ , f [ u ] be the solution of (11) with fixed norm n [ u ] e , andwith the notations of m ( ξ ) , σ ( ξ ) , ξ sup given in (20) , (21) , (23) . Then, for n [ l ] + 1 ≤ i ≤ n (i.e., x i unlabelled) and x i ∈ C k , f i L → N (cid:16) ( − k (1 − ρ k ) ˆ m, ˆ σ k (cid:17) where ˆ m = m ( ξ e )ˆ σ k = c − u ] ρ ρ (cid:2) (2 c [ l ] + m ( ξ e ) c [ u ] ) s k ( ξ e ) + (4 c l + m ( ξ e ) c [ u ] + σ ( ξ e ) c [ u ] ) q ( ξ e ) (cid:3) ai, Couillet with s a ( ξ e ) = ρ ρ ξ e ( ν − ν ) T (cid:0) I p − ξ e ¯Σ (cid:1) − Σ k (cid:0) I p − ξ e ¯Σ (cid:1) − ( ν − ν ) , a ∈ { , } , and ξ e ∈ (0 , ξ sup ) uniquely given by ρ ρ m ( ξ e ) + σ ( ξ e ) = e . A.2 Proof of Generalized Theorem
The proof of Theorem 8 relies on a leave-one-out approach, in the spirit of El Karoui et al. (2013), along with arguments from previous related analyses (Couillet and Benaych-Georges, 2016; Mai and Couillet, 2018) based on random matrix theory.
A.2.1 Main Idea
The main idea of the proof is to first demonstrate that, for the unlabelled data scores f_i (i.e., with i > n_[l]),

f_i = γ β^{(i)T} φ_c(x_i) + o_P(1)    (24)

where γ is a finite constant, φ_c a certain mapping from the data space that we shall define, and β^{(i)} a random vector independent of φ_c(x_i). Additionally, we shall show that

β^{(i)} = (1/p) Σ_{j=1}^{n} f_j φ_c(x_j) + ε    (25)

with ‖ε‖/‖β^{(i)}‖ = o_P(1).

As a consequence of (24), the statistical behavior of the unlabelled data scores can be understood through that of β^{(i)}, which itself depends on the unlabelled data scores as described by (25). By combining (24) and (25), we thus establish the equations ruling the asymptotic statistical behavior (i.e., mean and variance) of the unlabelled data scores f_i.
In addition to the notations given in the end of the introduction (Section 1), we specifythat when multidimensional objects are concerned, O ( u n ) is understood entry-wise. Thenotation O k·k is understood as follows: for a vector v , v = O k·k ( u n ) means its Euclideannorm is O ( u n ) and for a square matrix M , M = O k·k ( u n ) means that the operator norm of M is O ( u n ).First note that, as w ij = h ( k x i − x j k /p ) = h ( τ ) + O ( p − ), Taylor-expanding w ij around h ( τ ) gives (see Appendix C for a detailed proof) ˆ W = O k·k (1) andˆ W = 1 p ˆΦ T ˆΦ + [ h (0) − h ( τ ) + τ h ′ ( τ )] P n + O k·k ( p − ) (26) igh Dimensional Semi-Supervised Graph Regularization where P n = I n − n n T n , and ˆΦ = [ ˆ φ ( x ) , . . . , ˆ φ ( x n )] = [ φ ( x ) , . . . , φ ( x n )] P n with φ ( x i ) = (cid:2)p − h ′ ( τ ) x T i p h ′′ ( τ ) k x i k / √ p (cid:3) T . Define ν k = E { φ ( x i ) } , Σ k = cov { φ ( x i ) } for x i ∈ C k , k ∈ { , } , and let Z = [ z , . . . , z n ]with z i = φ ( x i ) − ν k (i.e., E { z i } = 0). We also write the labelled versus unlabelled divisionsΦ = (cid:2) Φ [ l ] Φ [ u ] (cid:3) , Z = (cid:2) Z [ l ] Z [ u ] (cid:3) and ˆΦ = h ˆΦ [ l ] ˆΦ [ u ] i .Recall that f [ u ] = (cid:16) αI n [ u ] − ˆ W [ uu ] (cid:17) − ˆ W [ ul ] f [ l ] . To proceed, we need to show that n T n [ u ] f [ u ] = O ( p − ). Applying (26), we can express f [ u ] as f [ u ] = (cid:18) ˜ αI n [ u ] − p ˆΦ T [ u ] ˆΦ [ u ] + rn n [ u ] T n [ u ] (cid:19) − (cid:18) p ˆΦ T [ u ] ˆΦ [ l ] − rn n [ u ] T n [ l ] (cid:19) f [ l ] + O ( p − )where ˜ α = α − h (0) + h ( τ ) − τ h ′ ( τ ), r = h (0) − h ( τ ) + τ h ′ ( τ ). Since 1 T [ l ] f [ l ] = 0 from itsdefinition given in (6), f [ u ] = (cid:18) ˜ αI n [ u ] − p ˆΦ T [ u ] ˆΦ [ u ] + rn n [ u ] T n [ u ] (cid:19) − p ˆΦ T [ u ] Φ [ l ] f [ l ] + O ( p − ) . (27)Write ˆΦ [ u ] = E { ˆΦ [ u ] } + Z [ u ] − ( Z n /n )1 T n [ u ] . Evidently, E { ˆΦ [ u ] } = ( ν − ν ) s T where s ∈ R n [ u ] with s i = ( − k ( n − n k ) /n for x i ∈ C k , k ∈ { , } . By the large number law, s = ζ + O ( p − )where ζ ∈ R n [ u ] with ζ i = ( − k (1 − ρ k ) for x i ∈ C k , therefore1 p ˆΦ T [ u ] ˆΦ [ u ] = 1 p (cid:26) k ν − ν k ζζ T + Z T [ u ] Z [ u ] + (1 T n Z T Z n /n )1 n [ u ] T n [ u ] + [ Z T [ u ] ( ν − ν )] ζ T + ζ [ Z T [ u ] ( ν − ν )] T − ( Z T [ u ] Z n /n )1 T n [ u ] − n [ u ] ( Z T [ u ] Z n /n ) T (cid:27) + O k·k ( p − ) . Invoking Woodbury’s identity (Woodbury, 1950) expressed as (cid:16) R − U N U T (cid:17) − = R + RU ( N − − U T RU ) − U T R, we get (cid:18) ˜ αI n [ u ] − p ˆΦ T [ u ] ˆΦ [ u ] + rn n [ u ] T n [ u ] (cid:19) − = (cid:18) ˜ αI n [ u ] − p Z T [ u ] Z [ u ] − U N U T (cid:19) − + O k·k ( p − )= R + RU ( N − − U T RU ) − U T R + O k·k ( p − ) (28)by letting R = (cid:16) ˜ αI n [ u ] − p Z T [ u ] Z [ u ] (cid:17) − and U = 1 √ p h ζ Z T [ u ] ( ν − ν ) 1 n [ u ] Z T [ u ] Z n /n i N = k ν − ν k T n Z T Z n /n ) − rc −
10 0 − . (29) ai, Couillet Note also that1 p ˆΦ T [ u ] Φ [ l ] f [ l ] = √ pU ( ν − ν ) T p Φ [ l ] f [ l ] c [ l ] ρ ρ + 1 p Z T [ u ] Z [ l ] f [ l ] + O ( p − ) . (30)Now we want to prove that U T RU is of the form U T RU = (cid:20) A × × B (cid:21) + O ( p − ) , (31)for some matrices A, B ∈ R × with elements of O (1). First it should be pointed out that z i is a Gaussian vector if the last element is ignored. Since ignoring the last element of z i will not change the concentration results given subsequently to prove the form of U T RU ,we shall treat z i as Gaussian vectors for simplicity. As there exists a deterministic matrix¯ R of the form cI n [ u ] such that a T Rb − a T ¯ Rb = O ( p − )for any a, b = O k·k (1) independent of R (Benaych-Georges and Couillet, 2016, Proposition5), we get immediately that U T · RU · = 1 p ζ T R n [ u ] = 1 p ζ T ¯ R n [ u ] + O ( p − ) = O ( p − ) . In order to prove the rest, we begin by showing that1 √ p a T Z [ u ] Rb = O ( p − ) (32)for any a, b = O k·k (1) independent of Z [ u ] . First let us set a ′ = Cov { z i } a and denote by P a ′ the projection matrix orthogonal to a ′ . We write then z i = Cov { z i } P a ′ Cov { z i } − z i + Cov { z i } a ′ a ′ T k a ′ k Cov { z i } − z i = ˜ z i + a T z i k a ′ k Cov { z i } a where ˜ z i = Cov { z i } P a ′ Cov { z i } − z i . Note that in this decomposition of z i , the two termsare independent. Indeed, sinceCov { ˜ z i , a T z i } = E { Cov { z i } P a ′ Cov { z i } − z i z T i a } = Cov { z i } P a ′ Cov { z i } a = 0 p ,a T z i and ˜ z i are uncorrelated, and thus independent by the property that uncorrelated jointlyGaussian variables are independent. Applying this decomposition of z i , we have, by letting˜ Z = [˜ z , . . . , ˜ z n ] and q = [ a T z k Cov { z } a k / k a ′ k , . . . , a T z n k Cov { z n } a k / k a ′ k ] , that Z T [ u ] Z [ u ] = ˜ Z T [ u ] ˜ Z [ u ] + qq T . igh Dimensional Semi-Supervised Graph Regularization Then with the help of Sherman-Morrison’s formula (Sherman and Morrison, 1950), we get R = ˜ R − ˜ Rqq T ˜ R/p q T ˜ Rq/p .
Similarly to R , we have also for ˜ R a deterministic equivalent ¯˜ R = ˜ cI n [ u ] with some constant˜ c such that u T ˜ Rv − u T ¯˜ Rv = O ( p − )for any u, v = O k·k (1) independent of ˜ R (Benaych-Georges and Couillet, 2016, Proposition5). Since Z T [ u ] a and q are independent of ˜ R , we prove √ p a T Z [ u ] Rb = O ( p − ) with1 √ p a T Z [ u ] Rb = 1 √ p a T Z [ u ] ˜ Rb − √ p a T Z [ u ] ˜ Rqq T ˜ Rb q T ˜ Rq = 1 √ p ˜ ca T Z [ u ] b − √ p ˜ c a T Z [ u ] qq T b c k q k + O ( p − )= O ( p − ) . This leads directly to U T · RU · = 1 √ p ( ν − ν ) Z [ u ] R n [ u ] / √ p = O ( p − ) . With the same argument, we have also U T · RU · = 1 p ζ T R (cid:16) Z T [ u ] Z [ u ] n [ u ] /n + Z T [ u ] Z [ l ] n [ l ] /n (cid:17) = ζ T ( ˜ αR − I n [ u ] )1 n [ u ] /n + 1 p ζ T RZ T [ u ] (cid:16) Z [ l ] n [ l ] /n (cid:17) = O ( p − );and U T · RU · = 1 p ( ν − ν ) T Z [ u ] R (cid:16) Z T [ u ] Z [ u ] n [ u ] /n + Z T [ u ] Z [ l ] n [ l ] /n (cid:17) = ˜ α ( ν − ν ) T Z [ u ] R n [ u ] /n − ( ν − ν ) T Z [ u ] n [ u ] /n + ( ν − ν ) T (cid:18) p Z [ u ] RZ T [ u ] (cid:19) (cid:16) Z [ l ] n [ l ] /n (cid:17) = O ( p − ) . We conclude thus that U T RU is of the form (31).Substituting (28) and (30) into (27) and using the fact that p − k U T RZ T [ u ] Z [ l ] f [ l ] k = O ( p − ) derived by similar reasoning to the above, we obtain1 n T n [ u ] f [ u ] = c − (cid:2) (cid:3) K ( ν − ν ) T p Φ [ l ] f [ l ] c [ l ] ρ ρ + O ( p − ) (33) ai, Couillet with K = U T RU + U T RU ( N − − U T RU ) − U T RU.
Since U T RU is of the form (31), we find from classical algebraic arguments that K is alsoof the same diagonal block matrix form. We thus finally get from (33) that1 n T n [ u ] f [ u ] = O ( p − ) . Now that we have shown that n T n [ u ] f [ u ] = O ( p − ), multiplying both sides of (27) with˜ αI n [ u ] − p ˆΦ T [ u ] ˆΦ [ u ] + rn n [ u ] T n [ u ] from the left gives˜ αf [ u ] = 1 p ˆΦ T [ u ] ˆΦ [ u ] f [ u ] + 1 p ˆΦ T [ u ] ˆΦ [ l ] f [ l ] + O ( p − ) . Decomposing this equation for any i > n [ l ] (i.e., x i unlabelled) leads to˜ αf i = 1 p ˆ φ ( x i ) T ˆΦ f + O ( p − ) (34)˜ αf { i } [ u ] = 1 p ˆΦ { i } T [ u ] ˆ φ ( x i ) f i + 1 p ˆΦ { i } T [ u ] ˆΦ { i } [ u ] f { i } [ u ] + 1 p ˆΦ { i } T [ u ] ˆΦ [ l ] f [ l ] + O ( p − ) (35)with f { i } [ u ] standing for the vector obtained by removing f i from f [ u ] , ˆΦ { i } [ u ] for the matrixobtained by removing ˆ φ ( x i ) from ˆΦ [ u ] .Our objective is to compare the behavior of the vector f [ u ] decomposed as { f i , f { i } [ u ] } tothe “leave- x i -out” version f ( i )[ u ] to be introduced next. To this end, define the leave-one-outdataset X ( i ) = { x , . . . , x i − , x i +1 , . . . , x n } ∈ R ( n − × p for any i > n [ l ] (i.e., x i unlabelled),and ˆ W ( i ) ∈ R ( n − × ( n − the corresponding centered similarity matrix, for which we have,similarly to ˆ W , ˆ W ( i ) = 1 p ˆΦ ( i ) T ˆΦ ( i ) + [ h (0) − h ( τ ) + τ h ′ ( τ )] P n − + O k·k ( p − ) (36)where ˆΦ ( i ) = [ ˆ φ ( i ) ( x ) , . . . , ˆ φ ( i ) ( x i − ) , ˆ φ ( i ) ( x i +1 ) , . . . , ˆ φ ( i ) ( x n )] = [ φ ( x ) , . . . , φ ( x i − ) , φ ( x i +1 ) ,. . . , φ ( x n )] P n − . Denote by f ( i )[ u ] the solution of the centered similarities regularization onthe “leave-one-out” dataset X ( i ) , i.e., f ( i )[ u ] = (cid:16) αI n [ u ] − − ˆ W ( i )[ uu ] (cid:17) − ˆ W ( i )[ ul ] f [ l ] . (37)Substituting (36) into (37) leads to˜ αf ( i )[ u ] = 1 p ˆΦ ( i ) T [ u ] ˆΦ ( i )[ u ] f ( i )[ u ] + 1 p ˆΦ ( i ) T [ u ] ˆΦ [ l ] f [ l ] + O ( p − ) (38) igh Dimensional Semi-Supervised Graph Regularization where ˆΦ ( i ) = h ˆΦ ( i )[ l ] ˆΦ ( i )[ u ] i . From the definitions of ˆΦ ( i )[ u ] and ˆΦ { i } [ u ] , which essentially differ bythe addition of the O (1 / √ p )-norm term φ ( x i ) /n to every column, we easily have1 √ p ˆΦ ( i )[ u ] − √ p ˆΦ { i } [ u ] = O k·k ( p − ) , which entails 1 p ˆΦ ( i ) T [ u ] ˆΦ ( i )[ u ] − p ˆΦ { i } T [ u ] ˆΦ { i } [ u ] = O k·k ( p − ) , (39)Thus, subtracting (38) from (35) gives M ( i ) (cid:16) f { i } [ u ] − f ( i )[ u ] (cid:17) = 1 p ˆΦ ( i ) T [ u ] ˆ φ ( x i ) f i + O ( p − ) (40)with M ( i ) = ˜ αI ( n [ u ] − − p ˆΦ ( i ) T [ u ] ˆΦ ( i )[ u ] . Set β = p ˆΦ f = O k·k (1), the unlabelled data “regression vector” which gives unlabelleddata scores by f i = ˜ α − β T ˆ φ ( x i ), and its “leave-one-out” version β ( i ) = p ˆΦ ( i ) f ( i ) with f ( i ) = h f [ l ] f ( i )[ u ] i . Applying (39) and (40), we get that β − β ( i ) = (cid:18) I p + 1 p ˆΦ ( i )[ u ] (cid:16) M ( i ) (cid:17) − ˆΦ ( i ) T [ u ] (cid:19) p f i ˆ φ ( x i ) + O k·k ( p − ) = O k·k ( p − ) . (41)By the above result, Equation (34) can be expanded as˜ αf i = β ( i ) T ˆ φ ( x i ) + 1 p ˆ φ ( x i ) T (cid:18) I p + 1 p ˆΦ ( i )[ u ] (cid:16) M ( i ) (cid:17) − ˆΦ ( i ) T [ u ] (cid:19) ˆ φ ( x i ) f i + O ( p − ) . 
(42)To go further in the development of (42), we first need to evaluate the quadratic form κ i ≡ p ˆ φ ( x i ) T T ( i ) ˆ φ ( x i )where T ( i ) = I p + 1 p ˆΦ ( i )[ u ] (cid:16) M ( i ) (cid:17) − ˆΦ ( i ) T [ u ] . Since p ˆΦ ( i ) T [ u ] ˆΦ ( i )[ u ] = O k·k (1), it is easy to see that T ( i ) = O k·k (1). As ˆ φ ( x i ) is independent of T ( i ) , it unfolds from the “trace lemma” (Couillet and Debbah, 2011, Theorem 3.4) that κ i − p tr Σ k T ( i ) a . s . −→ . ai, Couillet Notice that T ( i ) = ˜ α (cid:18) ˜ αI p − p ˆΦ ( i )[ u ] ˆΦ ( i ) T [ u ] (cid:19) − = ˜ α (cid:18) ˜ αI p − p ˆΦ { i } [ u ] ˆΦ { i } T [ u ] (cid:19) − + O k·k ( p − )= T − ˜ αp T ( i ) ˆ φ ( x i ) ˆ φ ( x i ) T T ( i ) − p κ i ˜ α + O k·k ( p − )where T = ˜ α (cid:18) ˜ αI p − p ˆΦ [ u ] ˆΦ T [ u ] (cid:19) − = T ( i ) + ˜ αp T ( i ) ˆ φ ( x i ) ˆ φ ( x i ) T T ( i ) − p κ i ˜ α by Sherman-Morrison’s formula (Sherman and Morrison, 1950). We get consequently1 p tr Σ k T ( i ) = 1 p tr Σ k T + O ( p − ) ,κ i converges thus to a deterministic limit κ independent of i at large n, p .Equation (42) then becomes f i = γβ ( i ) T ˆ φ ( x i ) + O ( p − ) . (43)where γ = ( ˜ α − κ ) − .We focus now on the term β ( i ) T ˆ φ ( x i ) in (43). To discard the “weak” dependence between β ( i ) T and ˆ φ ( x i ), let us define φ c ( x i ) = ( − k (1 − ρ k )( ν − ν ) + z i . As n k /n = ρ k + O ( n − ), by the law of large numbers, E { ˆ φ ( x i ) } = ( − k [( n − n k ) /n ]( ν − ν ) = E { φ c ( x i ) } + O k·k ( n − ). Remark that, unlike ˆ φ ( x i ), φ c ( x i ) is independent of all x j with j = i , and therefore independent of β ( i ) . We thus now have β ( i ) T ˆ φ ( x i ) = β ( i ) T (cid:18) E { ˆ φ ( x i ) } + z i − n n X m =1 z m (cid:19) = β ( i ) T φ c ( x i ) + 1 n β T Z n + O ( p − ) . We get from (41) that n β ( i ) T Z n = n β T Z n + O ( p − ), leading to f i = γβ ( i ) T φ c ( x i ) + 1 n β T Z n + O ( p − ) . (44)Since φ c ( x i ) is independent of β ( i ) , according to the central limit theorem, β ( i ) T φ c ( x i )asymptotically follows a Gaussian distribution.To demonstrate that n β T Z n is negligibly small, notice fist that, by summing (44) forall i > n [ u ] , we have1 n T n [ u ] f [ u ] = 1 n n X i = n [ l ] +1 β ( i ) T φ c ( x i ) + c [ u ] ( β ( i ) T Z n /n ) + O ( p − ) . igh Dimensional Semi-Supervised Graph Regularization Since n T n [ u ] f [ u ] = O ( p − ), it suffices to prove n P ni = n [ l ] +1 β ( i ) T φ c ( x i ) = O ( p − ) to con-sequently show that n β T Z n = O ( p − ) from the above equation. To this end, we shallexamine the correlation between β ( i ) T φ c ( x i ) and β ( j ) T φ c ( x j ) for i = j > n [ l ] . Consider β ( ij ) , ˆΦ ( ij )[ u ] , M ( ij ) obtained in the same way as β ( i ) , ˆΦ ( i )[ u ] , M ( i ) , but this time by leaving outthe two unlabelled samples x i , x j . Similarly to (41), we have β ( i ) − β ( ij ) = (cid:18) I p + 1 p ˆΦ ( ij )[ u ] (cid:16) M ( ij ) (cid:17) − ˆΦ ( ij ) T [ u ] (cid:19) p f j ˆ φ ( x j ) + O k·k ( p − ) = O k·k ( p − ) . 
(45)It follows from the above equation that, for i = j > n [ l ] ,Cov { β ( i ) T φ c ( x i ) , β ( i ) T φ c ( x j ) } = E { β ( i ) T φ c ( x i ) β ( i ) T φ c ( x j ) } − E { β ( i ) T φ c ( x i ) } E { β ( j ) T φ c ( x j ) } = E { β ( ij ) T φ c ( x i ) β ( ij ) T φ c ( x j ) } − E { β ( i ) T φ c ( x i ) } E { β ( j ) T φ c ( x j ) } + O ( p − )= E { β ( ij ) T φ c ( x i ) } E { β ( ij ) T φ c ( x j ) } − E { β ( i ) T φ c ( x i ) } E { β ( j ) T φ c ( x j ) } + O ( p − )= O ( p − ) , (46)leading to the conclusion that n [ u ] P ni = n [ l ] +1 β ( i ) T φ c ( x i ) = n [ u ] P ni = n [ l ] +1 E { β ( i ) T φ c ( x i ) } + O ( p − ) = O ( p − ). Hence, n β T Z n = O ( p − ). Finally, we have that, for i > n [ l ] , f i = γβ ( i ) T φ c ( x i ) + O ( p − ) , (47)indicating that, up to the constant γ , f i asymptotically follows the same Gaussian distri-bution as β ( i ) T φ c ( x i ).Moreover, taking the expectation and the variance of the both sides of (47) for x i ∈ C k yields E { f i | i > n [ l ] , x ∈ C k } = γ E { β ( i ) T } ( − k (1 − ρ k )( ν − ν ) + O ( p − )var { f i | i > n [ l ] , x ∈ C k } = γ tr (cid:2) cov { β ( i ) } Σ k (cid:3) + γ E { β ( i ) } T Σ k E { β ( i ) } + O ( p − ) . Since β − β ( i ) = O k·k ( p − ) as per (41), we obtain E { f i | i > n [ l ] , x ∈ C k } = γ E { β T } ( − k (1 − ρ k )( ν − ν ) + O ( p − ) (48)var { f i | i > n [ l ] , x ∈ C k } = γ tr (cid:2) cov { β } Σ k (cid:3) + γ E { β } T Σ k E { β } + O ( p − ) . (49)After linking the distribution parameters of unlabelled scores to those of β with Equa-tion (48) and Equation (49), we now turn our attention to the statistical behaviour of β .Substituting (47) into β = p ˆΦ f yields β = 1 p n [ l ] X i =1 f i ˆ φ ( x i ) + 1 p n X i = n [ l ] +1 γβ ( i ) T φ c ( x i ) ˆ φ ( x i ) + O k·k ( p − )= 1 p n [ l ] X i =1 f i φ c ( x i ) + 1 p n X i = n [ l ] +1 γβ ( i ) T φ c ( x i ) φ c ( x i ) + O k·k ( p − ) . (50) ai, Couillet For i > n [ l ] and x i ∈ C k , we decompose φ c ( x i ) as φ c ( x i ) = E { φ c ( x i ) } + Σ k β ( i ) β ( i ) T z i + ˜ z i (51)where ˜ z i = z i − Σ k β ( i ) β ( i ) T z i . By substituting the expression (51) of φ c ( x i ) into (50) and using the fact that β − β ( i ) = O k·k ( p − ), we obtain (cid:18) I p − γc [ u ] 2 X a =1 ρ a Σ a (cid:19) β = 1 p n [ l ] X i =1 f i E { φ c ( x i ) } + 1 p n X i = n [ l ] +1 γβ ( i ) T φ c ( x i ) E { φ c ( x i ) } + 1 p n [ l ] X i =1 f i z i + 1 p n X i = n [ l ] +1 γβ ( i ) T φ c ( x i )˜ z i + O k·k ( p − ) . (52)Recall that f [ l ] is a deterministic vector (given in (6)) and note that E { β ( i ) T φ c ( x i )˜ z i } = E { β ( i ) T z i [ z i − Σ k β ( i ) / ( β ( i ) T z i )] } = E { β ( i ) T z i z i } − Σ k E { β ( i ) } = 0 . Taking the expectation of both sides of (52) thus gives (cid:18) I p − γc [ u ] 2 X a =1 ρ a Σ a (cid:19) E { β } = 1 p n [ l ] X i =1 f i E { φ c ( x i ) } + 1 p n X i = n [ l ] +1 γ E { β ( i ) } T E { φ c ( x i ) } E { φ c ( x i ) } + O k·k ( p − )= 1 p n [ l ] X i =1 f i E { φ c ( x i ) } + 1 p n X i = n [ l ] +1 γ E { β } T E { φ c ( x i ) } E { φ c ( x i ) } + O k·k ( p − ) . (53)Let Q = I p − γc [ u ] ¯Σ with ¯Σ = ρ Σ + ρ Σ and denote ˆ m ≡ γ ( ν − ν ) T E { β } . With thesenotations, we get directly from the above equation thatˆ m = γρ ρ (2 c [ l ] + mc [ u ] )( ν − ν ) T Q − ( ν − ν ) + o P (1) . (54)With the notation m , (48) notably becomes E { f i | i > n [ l ] , x ∈ C k } = ( − k (1 − ρ k ) ˆ m + O ( p − ) . 
In addition, we get from (53) that γ E { β } T Σ k E { β } = (cid:2) γρ ρ (2 c [ l ] + ˆ mc [ u ] ) (cid:3) ( ν − ν ) T Q − Σ k Q − ( ν − ν ) . (55) igh Dimensional Semi-Supervised Graph Regularization Furthermore, we have from (52) and (53)tr[cov { β } Σ k ] = E n ( β − E { β } ) T Σ k ( β − E { β } ) o = 1 p n [ l ] X i =1 f i E { z T i Q − Σ k Q − z i } + 1 p n X i = n [ l ] +1 γ E { ( β ( i ) T φ c ( x i )) ˜ z T i Q − Σ k Q − ˜ z i } + O ( p − ) . Since p z T i Q − Σ k Q − z i = p tr( Q − ¯Σ) + O ( p − ) and p ˜ z T i Q − Σ k Q − ˜ z i = p tr( Q − ¯Σ) + O ( p − ), by the trace lemma (Couillet and Debbah, 2011, Theorem 3.4) and Assumption 1, γ tr[cov { β } Σ k ] = γ (cid:2) ρ ρ (4 c [ l ] + ˆ m c [ u ] ) + c [ u ] 2 X a =1 ρ a var { f i | i > n [ l ] , x ∈ C a } (cid:3) p tr( Q − ¯Σ) + O ( p − ) . (56)Using the shortcut notation ˆ σ k ≡ var { f i | i > n [ l ] , x ∈ C k } for k ∈ { , } , we get by substi-tuting (55) and (56) into (49) thatˆ σ k = (cid:2) γρ ρ (2 c [ l ] + ˆ mc [ u ] ) (cid:3) ( ν − ν ) T Q − Σ k Q − ( ν − ν )+ γ (cid:2) ρ ρ (4 c [ l ] + ˆ m c [ u ] ) + c [ u ] 2 X a =1 ρ a ˆ σ a (cid:3) p tr( Q − ¯Σ) + o P (1) . (57)Letting ξ ≡ c [ u ] γ , we get by multiplying the both sides of (54) with c [ u ] that c [ u ] ˆ m = ξρ ρ (2 c [ l ] + ˆ mc [ u ] )( ν − ν ) T (cid:0) I p − γc [ u ] ¯Σ (cid:1) − ( ν − ν ) + o P (1) . And multiplying the both sides of (57) with c u ] leads to c u ] ˆ σ k = (cid:2) ρ ρ (2 c [ l ] + ˆ mc [ u ] ) (cid:3) ξ ( ν − ν ) T Q − Σ k Q − ( ν − ν )+ (cid:2) ρ ρ (4 c [ l ] + ˆ m c [ u ] ) + c [ u ] 2 X a =1 ρ a ˆ σ a (cid:3) ξ p − tr( Q − ¯Σ) + o P (1) . (58)Set ˆ σ = P a =1 ρ a ˆ σ a , we obtain c u ] ˆ σ = (cid:2) ρ ρ (2 c [ l ] + ˆ mc [ u ] ) (cid:3) ξ ( ν − ν ) T Q − ¯Σ Q − ( ν − ν )+ (cid:2) ρ ρ (4 c [ l ] + ˆ m c [ u ] ) + c [ u ] ˆ σ (cid:3) ξ p − tr( Q − ¯Σ) + o P (1) . It is derived from the above equations that there exists a ξ ∈ R such that ( ˆ m, ˆ σ ) =( m ( ξ ) , σ ( ξ )) with m ( ξ ) , σ ( ξ ) as given in (20) and (21). Let us denote by ξ e the value of ξ that allows us to access ˆ m, ˆ σ at some given value of the hyperparameter e > { f i , f j } = O ( p − ) ai, Couillet for i, j > n [ l ] . With the same arguments, we get easilyCov { f i , f j } = O ( p − ) , which entails1 n [ u ] k f [ u ] k = 1 n [ u ] n X i = n [ l ] +1 f i = 1 n [ u ] n X i = n [ l ] +1 E { f i } + O ( p − ) = ρ ρ m + σ + O ( p − ) . Therefore, the value ξ e should satisfy, up to some asymptotically negligible terms, theequation ρ ρ m ( ξ e ) + σ ( ξ e ) = e . Note that the above equation does not give an unique ξ e if ξ e is allowed to take any valuein R . We need thus to further specify the admissible range of ξ e as e goes from zero toinfinity. We start by showing that m has always a positive value. With small adjustmentto (33), we have1 n ζ T f [ u ] = c − (cid:2) (cid:3) K ( ν − ν ) T p Φ [ l ] f [ l ] c [ l ] ρ ρ + O ( p − )with K = U T RU + U T RU ( N − − U T RU ) − U T RU.
We recall U T RU is of the form (31), and further remark that the matrix A in (31) is ofthe form A = (cid:20) a a (cid:21) as we have U T · RU · by applying (32). As indicated in Section 3.2,for any e > α has a value greater than which is determined by (13). The matrix (cid:16) αI n [ u ] − ˆ W [ uu ] (cid:17) − is thus definite positive. Since (cid:16) αI n [ u ] − ˆ W [ uu ] (cid:17) − = (cid:18) ˜ αI n [ u ] − p ˆΦ T [ u ] ˆΦ [ u ] + rn n [ u ] T n [ u ] (cid:19) − = R + RU ( N − − U T RU ) − U T R + O k·k ( p − ) ,K is definite positive with high probability. Notice also that K = U T RU + U T RU ( N − − U T RU ) − U T RU = (cid:20)(cid:16) U T RU (cid:17) − − N (cid:21) − , meaning that K = N det n ( U T RU ) − − N o = 1det n ( U T RU ) − − N o . igh Dimensional Semi-Supervised Graph Regularization We get thus K > n(cid:0) U T RU (cid:1) − − N o = det (cid:8) K − (cid:9) > K , which implies that all the eigenvalues of K are positive. The fact that K is definite positive implies also K >
0, otherwise we would have (cid:2) (cid:3) K (cid:20) (cid:21) = K ≤ n ζ T f [ u ] = c − (cid:18) K ( ν − ν ) T p Φ [ l ] f [ l ] + K c [ l ] ρ ρ (cid:19) + O ( p − )= 2 ρ ρ ( K k ν − ν k + K ln [ l ] /n ) + O ( p − ) , we get n ζ T f [ u ] > p . As1 n ζ T f [ u ] = 1 n ζ T E { f [ u ] } + O ( p − ) = ρ ρ ˆˆ m + O ( p − )as a result of Cov { f i , f j } = O ( p − ). We remark thus that ˆ m > e >
0. Since σ > ξ e ∈ (0 , ξ sup ) for any e , as atleast one of m ( ξ e ) , σ ( ξ e ) is negative (or not well defined) outside this range. It can alsobe observed from the expressions (20)–(21) of m ( ξ e ) and σ ( ξ e ) that ρ ρ m ( ξ ) + σ ( ξ )monotonously increases from zero to infinity as ξ increases from zero to ξ sup . Therefore, ξ e ∈ (0 , ξ sup ) is uniquely given by ρ ρ m ( ξ e ) + σ ( ξ e ) = e . In summary, for any e ∈ (0 , + ∞ ), we have that ˆ m = m ( ξ e ), ˆ σ = σ ( ξ e ) with functions m ( ξ ) , σ ( ξ ) as defined in (20)–(21) and ξ e ∈ (0 , ξ sup ) the unique solution of ρ ρ m ( ξ e ) + σ ( ξ e ) = e ; we get also from (58) the value of ˆ σ k asˆ σ k = c − u ] (cid:2) ρ ρ (2 c [ l ] + m ( ξ e ) c [ u ] ) (cid:3) ξ e ( ν − ν ) T Q − Σ k Q − ( ν − ν )+ c − u ] (cid:2) ρ ρ (4 c [ l ] + m ( ξ e ) c [ u ] ) + c [ u ] 2 X a =1 ρ a σ ( ξ e ) a (cid:3) ξ e p − tr( Q − ¯Σ) The proof of theorem 8 is thus concluded.
Appendix B. Proof of Proposition 7
As the eigenvector of L s associated with the smallest eigenvalue is D n , we consider L ′ s = nD − W D − − n D n T n D T n D n . Note that k L ′ s k = O (1) according to (Couillet and Benaych-Georges, 2016, Theorem 1), andif v is an eigenvector of L s associated with the eigenvalue u , then it is also an eigenvectorof L ′ s associated with the eigenvalue − u + 1, except for the eigenvalue-eigenvector pair( n, D n ) of L s turned into (0 , D n ) for L ′ s . The second smallest eigenvector v Lap of L s is the same as the largest eigenvector of L ′ s . ai, Couillet From the random matrix equivalent of L ′ s given by Couillet and Benaych-Georges (2016,Theorem 1) and that of ˆ W expressed in (26), we haveˆ W = h ( τ ) L ′ s + 5 h ′ ( τ ) ψψ T + O ( p − )where ψ = [ ψ , . . . , ψ n ] T with ψ i = k x i k − E [ k x i k ].Recall that d inter ( v ) = | j T v/n − j T v/n | d intra ( v ) = k v − ( j T v/n ) j − ( j T v/n ) j k / √ n for some v ∈ R n , and j k ∈ R n with k ∈ { , } the indicator vector of class k with [ j k ] i = 1if x i ∈ C k , otherwise [ j k ] i = 0.Denote by λ Lap the eigenvalue of h ( τ ) L ′ s associated with v Lap , and λ ctr the eigenvalueof ˆ W associated with v ctr . Under the condition of non-trivial clustering upon v Lap with d inter ( v Lap ) /d intra ( v Lap ) = O (1), we have j T k v Lap / √ n k = O (1) from the above expressionsof d inter ( v ) and d intra ( v ). The fact that j T k v Lap / √ n k = O (1) implies that the eigenvalue λ Lap of h ( τ ) L ′ s remains at a non vanishing distance from other eigenvalues of h ( τ ) L ′ s (Couillet and Benaych-Georges, 2016, Theorem 4). The same can be said about ˆ W andits eigenvalue λ ctr .Let γ be a positively oriented complex closed path circling only around λ Lap and λ ctr .Since there can be only one eigenvector of L ′ s ( ˆ W , resp.) whose limiting scalar product with j k for k ∈ { , } is bounded away from zero (Couillet and Benaych-Georges, 2016, Theorem4), which is v Lap (resp., v ctr ), we have, by Cauchy’s formula (Walter, 1987, Theorem 10.15),1 n k ( j T k v Lap ) = − πi I γ n k j T k ( h ( τ ) L ′ s − zI n ) − j k dz + o P (1)1 n k ( j T k v ctr ) = − πi I γ n k j T k ( ˆ W − zI n ) − j k dz + o P (1)for k ∈ { , } . Since ˆ W is a low-rank perturbation of ˆ L , invoking Sherman-Morrison’sformula (Sherman and Morrison, 1950), we further have j T k ( ˆ W − zI n ) − j k = j T k ( h ( τ ) L ′ s − zI n ) − j k − (5 h ′ ( τ ) / (cid:0) j T k ( h ( τ ) L ′ s − zI n ) − ψ (cid:1) h ′ ( τ ) / ψ T ( h ( τ ) L ′ s − zI n ) − ψ + o P ( n k ) . As √ n k j T k ( h ( τ ) L ′ s − zI n ) − ψ = o P (1) (Couillet and Benaych-Georges, 2016, Equation 7.6),we get 1 n k j T k ( ˆ W − zI n ) − j k = 1 n k j T k ( h ( τ ) L ′ s − zI n ) − j k + o P (1) , and thus 1 n k ( j T k v Lap ) = 1 n k ( j T k v ctr ) + o P (1) , which concludes the proof of Proposition 7. igh Dimensional Semi-Supervised Graph Regularization Appendix C. Asymptotic Matrix Equivalent for ˆ W The objective of this section is to prove the asymptotic matrix equivalent for ˆ W expressedin (26). 
Some additional notations that will be useful in the proof: • for x i ∈ C k , k ∈ { , } , θ i ≡ x i − µ k , and θ ≡ [ θ , · · · , θ n ] T ; • µ ◦ k = µ k − n P k ′ =1 n k ′ µ k ′ , t k = (cid:16) tr C k − n P k ′ =1 n k ′ tr C k ′ (cid:17) / √ p ; • j k ∈ R n is the canonical vector of C k , i.e., [ j k ] i = 1 if x i ∈ C k and [ j k ] i = 0 otherwise; • ψ i ≡ (cid:0) k θ i k − E[ k θ i k (cid:1) / √ p , ψ ≡ [ ψ , · · · , ψ n ] T and ( ψ ) ≡ [( ψ ) , · · · , ( ψ n ) ] T .As w ij = h ( k x i − x j k /p = h ( τ ) + O ( p − ) for all i = j , we can Taylor-expand w ij = h ( k x i − x j k /p around h ( τ ) to obtain the following expansion for W , which can be foundin (Couillet and Benaych-Georges, 2016): W = h ( τ )1 n T n + h ′ ( τ ) √ p " ψ T n + 1 n ψ T + X b =1 t b j b T n + 1 n X a =1 t a j T a + h ′ ( τ ) p " X a,b =1 k µ ◦ a − µ ◦ b k j b j T a − θ X a =1 µ ◦ a j T a + 2 X b =1 diag( j b ) θµ ◦ b T n − X b =1 j b µ ◦ T b θ T + 21 n X a =1 µ ◦ a T θ T diag( j a ) − θθ T + h ′′ ( τ )2 p (cid:20) ( ψ ) T n + 1 n [( ψ ) ] T + X b =1 t b j b T n + 1 n X a =1 t a j T a + 2 X a,b =1 t a t b j b j T a + 2 X b =1 diag( j b ) t b ψ T n + 2 X b =1 t b j b ψ T + 2 X a =1 n ψ T diag( j a ) t a + 2 ψ X a =1 t a j T a + 2 ψψ T (cid:21) + ( h (0) − h ( τ ) + τ h ′ ( τ )) I n + O k·k ( p − ) . Applying P n = (cid:0) I n − n n T n (cid:1) on both sides of the above equation, we getˆ W = P n W P n = − h ′ ( τ ) p " X a,b =1 ( µ ◦ T a µ ◦ b ) j b j T a + P n θ X a =1 µ ◦ a j T a + X b =1 j b µ ◦ T b θ T P n + P n θθ T P n + h ′′ ( τ ) p " X a,b =1 t a t b j b j T a + X b =1 t b j b ψ T P n + P n ψ X a =1 t a j T a + P n ψψ T P n + ( h (0) − h ′ ( τ ) + τ h ′′ ( τ )) P n + O ( p − )= 1 p ˆΦ T ˆΦ + ( h (0) − h ( τ ) + τ h ′ ( τ )) P n + O k·k ( p − ) ai, Couillet where the last equality is justified by1 p ˆΦ T ˆΦ = − h ′ ( τ ) p " X a,b =1 ( µ ◦ T a µ ◦ b ) j b j T a + P n θ X a =1 µ ◦ a j T a + X b =1 j b µ ◦ T b θ T P n + P n θθ T P n + h ′′ ( τ ) p " X a,b =1 t a t b j b j T a + X b =1 t b j b ψ T P n + P n ψ X a =1 t a j T a + P n ψψ T P n . Equation (26) is thus proved.
Appendix D. Guarantee for Approaching the Optimal Performance on Isotropic Gaussian Data
The purpose of this section is to provide some general guarantee for the proposed centeredregularization method to approach the best achievable performance on isotropic high dimen-sional Gaussian data, which was characterized in the recent work of Lelarge and Miolane(2019). In this work, the considered isotropic data model is a special case of our analyticalframework, in which − µ = µ = µ , C = C = I p and ρ = ρ . Reorganizing the results of(Lelarge and Miolane, 2019), the optimally achievable classification accuracy in the limit oflarge p is equal to 1 − Q ( √ q ∗ )with q ∗ > q ∗ = k µ k − p k µ k p + k µ k (cid:0) n [ l ] + E z ∼N ( q ∗ ,q ∗ ) { tanh( z ) } n [ u ] (cid:1) . (59)It is easy to see that the optimal accuracy is higher with greater q ∗ . In parallel, reformulatingthe results of Corollary 4 for some value of the hyperparameter e > m ( ξ e ) = m ( ξ e ) m ( ξ e ) + σ ( ξ e ) , (60)the high dimensional classification accuracy achieved by the centered regularization methodis asymptotically equal to 1 − Q ( √ q c )with q c > q c = k µ k − p k µ k p + k µ k (cid:16) n [ l ] + q c q c +1 n [ u ] (cid:17) (61)Obviously, the fixed-point equations (59)–(61) are identical at n [ u ] = 0, meaning thatthe centered regularization method achieves the optimal performance on fully labelled sets.For partially labelled sets, the difference between (59) and (61) resides in the multiplying igh Dimensional Semi-Supervised Graph Regularization . . . q g ( q ) Figure 8: Values of g ( q ) at various q .factors before n [ u ] . This means that, for a best achievable accuracy of 1 − Q ( √ q ∗ ) at some n [ l ] and n [ u ] , the centered regularization method achieves, with the hyperparameter e setto satisfy (60), the same level of accuracy with the same amount of labelled samples and g ( q ∗ ) n [ u ] unlabelled ones where g ( q ∗ ) = E z ∼N ( q ∗ ,q ∗ ) { tanh( z ) } ( q ∗ + 1) /q ∗ . The ratio function g ( q ) = E z ∼N ( q,q ) { tanh( z ) } ( q + 1) q is plotted in Figure 8. We remark also that lim q → + g ( q ) = 1 and lim q → + ∞ g ( q ) = 0.Although the value of g ( q ) can get up to 1 . e (which generallydoes not satisfy (60)). In fact, even with the same numbers of labelled and unlabelled data,the performance of centered regularization method at an optimally set e is often very closeto the best achievable one, as shown in Figures 6–7. References
Charu C Aggarwal, Alexander Hinneburg, and Daniel A Keim. On the surprising behavior of distance metrics in high dimensional space. In International Conference on Database Theory, pages 420–434. Springer, 2001.

Morteza Alamgir and Ulrike V Luxburg. Phase transition in the family of p-resistances. In Advances in Neural Information Processing Systems, pages 379–387, 2011.

Fabrizio Angiulli. On the behavior of intrinsically high-dimensional spaces: Distances, direct and reverse nearest neighbors, and hubness. Journal of Machine Learning Research, 18(170):1–60, 2018. URL http://jmlr.org/papers/v18/17-151.html.

Konstantin Avrachenkov, Alexey Mishenin, Paulo Gonçalves, and Marina Sokol. Generalized optimization framework for graph-based semi-supervised learning. In Proceedings of the 2012 SIAM International Conference on Data Mining, pages 966–974. SIAM, 2012.

Mikhail Belkin and Partha Niyogi. Using manifold structure for partially labeled classification. In Advances in Neural Information Processing Systems, pages 953–960, 2003.

Mikhail Belkin and Partha Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004.

Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.

Shai Ben-David, Tyler Lu, and Dávid Pál. Does unlabeled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In COLT, pages 33–44, 2008.

F. Benaych-Georges and R. Couillet. Spectral analysis of the Gram matrix of mixture models. ESAIM: Probability and Statistics, 20:217–237, 2016. URL http://dx.doi.org/10.1051/ps/2016007.

Kevin Beyer, Jonathan Goldstein, Raghu Ramakrishnan, and Uri Shaft. When is "nearest neighbor" meaningful? In International Conference on Database Theory, pages 217–235. Springer, 1999.

Nick Bridle and Xiaojin Zhu. p-voltages: Laplacian regularization for semi-supervised learning on high-dimensional data. In Eleventh Workshop on Mining and Learning with Graphs (MLG2013), 2013.

Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [book reviews]. IEEE Transactions on Neural Networks, 20(3):542–542, 2009.

Amin Coja-Oghlan and André Lanka. Finding planted partitions in random graphs with general degree distributions. SIAM Journal on Discrete Mathematics, 23(4):1682–1714, 2010.

R. Couillet and F. Benaych-Georges. Kernel spectral clustering of large dimensional data. Electronic Journal of Statistics, 10(1):1393–1454, 2016.

Romain Couillet and Merouane Debbah. Random Matrix Methods for Wireless Communications. Cambridge University Press, 2011.

Fabio Gagliardi Cozman, Ira Cohen, and M Cirelo. Unlabeled data can degrade classification performance of generative classifiers. In FLAIRS Conference, pages 327–331, 2002.

Ahmed El Alaoui, Xiang Cheng, Aaditya Ramdas, Martin J Wainwright, and Michael I Jordan. Asymptotic behavior of ℓp-based Laplacian regularization in semi-supervised learning. In Conference on Learning Theory, pages 879–906, 2016.

Noureddine El Karoui, Derek Bean, Peter J Bickel, Chinghway Lim, and Bin Yu. On robust regression with high-dimensional predictors. Proceedings of the National Academy of Sciences, page 201307842, 2013.

Santo Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.

Damien Francois, Vincent Wertz, and Michel Verleysen. The concentration of fractional distances. IEEE Transactions on Knowledge and Data Engineering, 19(7):873–886, 2007.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Lennart Gulikers, Marc Lelarge, and Laurent Massoulié. A spectral method for community detection in moderately sparse degree-corrected stochastic block models. Advances in Applied Probability, 49(3):686–721, 2017.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

Alexander Hinneburg, Charu C Aggarwal, and Daniel A Keim. What is the nearest neighbor in high dimensional spaces? In , pages 506–515, 2000.

Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10 dataset. 2014.

Rasmus Kyng, Anup Rao, Sushant Sachdeva, and Daniel A Spielman. Algorithms for Lipschitz learning on graphs. In Conference on Learning Theory, pages 1190–1223, 2015.

Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.

Michel Ledoux. The Concentration of Measure Phenomenon. Number 89. American Mathematical Society, 2005.

Marc Lelarge and Leo Miolane. Asymptotic Bayes risk for Gaussian mixture in a semi-supervised setting. arXiv preprint arXiv:1907.03792, 2019.

Cosme Louart and Romain Couillet. Concentration of measure and large random matrices with an application to sample covariance matrices. arXiv preprint arXiv:1805.08295, 2018.

Xiaoyi Mai and Romain Couillet. A random matrix analysis and improvement of semi-supervised learning for large dimensional data. The Journal of Machine Learning Research, 19(1):3074–3100, 2018.

Boaz Nadler, Nathan Srebro, and Xueyuan Zhou. Semi-supervised learning with the graph Laplacian: The limit of infinite unlabelled data. In Proceedings of the 22nd International Conference on Neural Information Processing Systems, pages 1330–1338. Curran Associates Inc., 2009.

Behzad M Shahshahani and David A Landgrebe. The effect of unlabeled samples in reducing the small sample size problem and mitigating the Hughes phenomenon. IEEE Transactions on Geoscience and Remote Sensing, 32(5):1087–1095, 1994.

Jack Sherman and Winifred J Morrison. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.

David I Shuman, Sunil K Narang, Pascal Frossard, Antonio Ortega, and Pierre Vandergheynst. The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains. IEEE Signal Processing Magazine, 30(3):83–98, 2013.

Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

Ulrike Von Luxburg, Mikhail Belkin, and Olivier Bousquet. Consistency of spectral clustering. The Annals of Statistics, pages 555–586, 2008.

Rudin Walter. Real and complex analysis, 1987.

Max A Woodbury. Inverting modified matrices. Memorandum Report, 42(106):336, 1950.

Denny Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. Learning with local and global consistency. In Advances in Neural Information Processing Systems, pages 321–328, 2004.

Xueyuan Zhou and Mikhail Belkin. Semi-supervised learning by higher order regularization. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 892–900, 2011.

Xiaojin Zhu and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. 2002.

Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03), pages 912–919, 2003.