Guaranteed Classification via Regularized Similarity Learning
Zheng-Chu Guo and Yiming Ying
College of Engineering, Mathematics and Physical Sciences
University of Exeter, EX4 4QF, UK
{gzhengchu, mathying}@gmail.com

Abstract
Learning an appropriate (dis)similarity function from the available data is a central problem in machine learning, since the success of many machine learning algorithms critically depends on the choice of a similarity function to compare examples. Although many approaches to similarity metric learning have been proposed, there is little theoretical study of the link between similarity metric learning and the classification performance of the resulting classifier. In this paper, we propose a regularized similarity learning formulation associated with general matrix norms and establish its generalization bounds. We show that the generalization error of the resulting linear separator can be bounded by the derived generalization bound for similarity learning. In other words, good generalization of the learnt similarity function guarantees good classification by the resulting linear classifier. Our results extend and improve those obtained by Bellet et al. [3]: because the techniques used there depend on the notion of uniform stability [6], the bound obtained there holds true only for Frobenius matrix-norm regularization. Our techniques, based on the Rademacher complexity [5] and a related Khinchin-type inequality, enable us to establish bounds for regularized similarity learning formulations associated with general matrix norms, including the sparse $L^1$-norm and the mixed $(2,1)$-norm.

1 Introduction

The success of many machine learning algorithms heavily depends on how the similarity or distance metric between examples is specified. For instance, the k-nearest neighbor (k-NN) classifier depends on a distance (dissimilarity) function to identify the nearest neighbors for classification. Most information retrieval methods rely on a similarity function to identify the data points most similar to a given query. Kernel methods rely on the kernel function to represent the similarity between examples. Hence, how to learn an appropriate (dis)similarity function from the available data is a central problem in machine learning, which we refer to as similarity metric learning throughout the paper.

Recently, a considerable amount of research effort has been devoted to similarity metric learning and many methods have been proposed. They can be broadly divided into two main categories. The first category is the one-stage approach to similarity metric learning, in which the similarity (kernel) function and the classifier are learnt together. Multiple kernel learning [23, 32] is a notable one-stage approach, which aims to learn an optimal kernel combination from a prescribed set of positive semi-definite (PSD) kernels. Another exemplary one-stage approach is indefinite kernel learning, motivated by the fact that, in many applications, potential kernel matrices may not be positive semi-definite. Such cases include hyperbolic tangent kernels [30] and the protein sequence similarity measures derived from Smith-Waterman and BLAST scores [29]. Indefinite kernel learning [8, 38] aims to learn a PSD kernel matrix from a prescribed indefinite kernel matrix, and such methods are mostly restricted to transductive settings. Recent methods [35, 36] analyzed regularization networks such as ridge regression and the SVM with a prescribed indefinite kernel, instead of aiming to learn an indefinite kernel function from data. The generalization analysis of such one-stage methods is well studied; see e.g. [8, 40, 11].
The second category consists of two-stage methods, in which learning the similarity function and training the classifier are separate processes. One exemplary two-stage approach is metric learning [4, 13, 15, 17, 34, 37, 39], which often focuses on learning a Mahalanobis distance metric defined, for any $x, x' \in \mathbb{R}^d$, by $d_M(x, x') = \sqrt{(x - x')^T M (x - x')}$, where $M$ is a positive semi-definite (PSD) matrix. Another example of such methods [9, 25] is bilinear similarity learning, which focuses on learning a similarity function defined, for any $x, x' \in \mathbb{R}^d$, by $s_M(x, x') = x^T M x'$ with $M$ a PSD matrix. The above methods are mainly motivated by the natural intuition that the similarity score between examples in the same class should be larger than that between examples from distinct classes. The k-NN classifier using a similarity metric learnt by these methods was empirically shown to achieve better accuracy than one using the standard Euclidean distance.

Although many two-stage approaches to similarity metric learning have been proposed, in contrast to the one-stage methods there is relatively little theoretical study of whether similarity-based learning guarantees good generalization of the resulting classification. Generalization bounds were recently established for metric and similarity learning [7, 17, 25] under different statistical assumptions on the data, but these results do not explain the empirical success described above: it is not clear whether good generalization bounds for metric and similarity learning [17, 7] lead to good classification performance of the resulting k-NN classifiers. Recently, Bellet et al. [3] proposed a regularized similarity learning approach, mainly motivated by the $(\epsilon, \gamma, \tau)$-good similarity functions introduced in [1, 2]. In particular, they showed that the proposed similarity learning can theoretically guarantee good generalization for classification. However, because the techniques used there depend on the notion of uniform stability [6], the generalization bounds only hold true for strongly convex matrix-norm regularization (e.g. the Frobenius norm).

In this paper, we consider a new similarity learning formulation associated with general matrix-norm regularization terms. Its generalization bounds are established for various matrix regularizers, including the Frobenius norm, the sparse $L^1$-norm, and the mixed $(2,1)$-norm, and we show that these bounds in turn control the generalization error of the linear classifier built from the learnt similarity.

2 Main Results

In this section, we introduce the regularized formulation of similarity learning and state our main results. Before we do that, let us introduce some notation and present some background material.

Denote, for any $n \in \mathbb{N}$, $\mathbb{N}_n = \{1, 2, \ldots, n\}$. Let $\mathbf{z} = \{z_i = (x_i, y_i) : i \in \mathbb{N}_m\}$ be a set of training samples drawn identically and independently from a distribution $\rho$ on $Z = X \times Y$. Here, the input space $X$ is a domain in $\mathbb{R}^d$ and $Y = \{-1, +1\}$ is called the output space. For any $x, x' \in X$, we consider $K_A(x, x') = x^T A x'$ as a bilinear similarity score parameterized by a symmetric matrix $A \in S^{d \times d}$. The symmetry of the matrix $A$ guarantees the symmetry of the similarity score $K_A$, i.e. $K_A(x, x') = K_A(x', x)$. The aim of similarity learning is to learn a matrix $A$ from the given training samples $\mathbf{z}$ such that the similarity score $K_A$ between examples with the same label is larger than that between examples with different labels.
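For readers who prefer code, the bilinear similarity score and its symmetry can be illustrated with the following minimal NumPy sketch; the dimension and data here are made-up toy choices, not taken from the paper.

```python
import numpy as np

def bilinear_similarity(A, x, x_prime):
    """Bilinear similarity score K_A(x, x') = x^T A x' for a symmetric matrix A."""
    return x @ A @ x_prime

# Toy illustration with d = 3.
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A = (A + A.T) / 2                       # symmetrize A, so K_A(x, x') = K_A(x', x)
x, x_prime = rng.standard_normal(3), rng.standard_normal(3)
assert np.isclose(bilinear_similarity(A, x, x_prime),
                  bilinear_similarity(A, x_prime, x))
```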
A natural approach to achieving the above aim is to minimize the following empirical error
$$\mathcal{E}_{\mathbf{z}}(A) = \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big)_+, \qquad (1)$$
where $r > 0$ is a margin parameter. Note that $\sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j) = \sum_{\{j : y_j = y_i\}} K_A(x_i, x_j) - \sum_{\{j : y_j \neq y_i\}} K_A(x_i, x_j)$. Minimizing the above empirical error therefore encourages, for any $i$, that, with margin $r$, the average similarity score between $x_i$ and examples with the same label as $y_i$ is larger than that between $x_i$ and examples with labels distinct from $y_i$. To avoid overfitting, we add a matrix-norm regularization term to the above empirical error and arrive at the following regularization formulation
$$A_{\mathbf{z}} = \arg\min_{A \in S^{d \times d}} \Big[\mathcal{E}_{\mathbf{z}}(A) + \lambda \|A\|\Big], \qquad (2)$$
where $\lambda > 0$ is a regularization parameter and $\|A\|$ denotes a general matrix norm. For instance, it can be the sparse $L^1$-norm $\|A\|_1 = \sum_{k \in \mathbb{N}_d} \sum_{\ell \in \mathbb{N}_d} |A_{k\ell}|$, the $(2,1)$-norm $\|A\|_{(2,1)} := \sum_{k \in \mathbb{N}_d} \big(\sum_{\ell \in \mathbb{N}_d} A_{k\ell}^2\big)^{1/2}$, the Frobenius norm $\|A\|_F = \big(\sum_{k,\ell \in \mathbb{N}_d} A_{k\ell}^2\big)^{1/2}$, or the trace norm $\|A\|_{\mathrm{tr}} := \sum_{\ell \in \mathbb{N}_d} \sigma_\ell(A)$, where $\{\sigma_\ell(A) : \ell \in \mathbb{N}_d\}$ denote the singular values of the matrix $A$.

The first contribution of this paper is to establish generalization bounds for the regularized similarity learning formulation (2) with general matrix norms. Specifically, define
$$\mathcal{E}(A) = \int_Z \Big(1 - \frac{1}{r} \int_Z y y' K_A(x, x')\, d\rho(x', y')\Big)_+ d\rho(x, y). \qquad (3)$$
The target of generalization analysis for similarity learning is to bound $\mathcal{E}(A_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}})$. Its special case with the Frobenius matrix norm was established in [3] using uniform stability techniques [6], which, however, cannot deal with non-strongly convex matrix norms such as the $L^1$-norm and the $(2,1)$-norm.
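To make the formulation concrete, the following NumPy sketch evaluates the empirical error (1) and runs a plain subgradient descent on objective (2) with the Frobenius norm. The paper does not prescribe an optimization algorithm; the optimizer, step size, and all names below are our own illustrative choices.

```python
import numpy as np

def empirical_error(A, X, y, r):
    """Empirical error (1): (1/m) sum_i (1 - (1/(m r)) sum_j y_i y_j K_A(x_i, x_j))_+,
    with the bilinear similarity K_A(x, x') = x^T A x'."""
    m = len(y)
    S = X @ A @ X.T                                     # S[i, j] = K_A(x_i, x_j)
    margins = (y[:, None] * y[None, :] * S).sum(axis=1) / (m * r)
    return np.maximum(0.0, 1.0 - margins).mean()

def learn_similarity(X, y, r=1.0, lam=0.1, steps=500, eta=0.05):
    """Crude subgradient descent on objective (2) with Frobenius-norm regularization."""
    m, d = X.shape
    A = np.zeros((d, d))
    for _ in range(steps):
        S = X @ A @ X.T
        margins = (y[:, None] * y[None, :] * S).sum(axis=1) / (m * r)
        active = (margins < 1.0).astype(float)          # hinge-loss subgradient indicator
        u = (active * y) @ X                             # sum_i active_i * y_i * x_i
        v = y @ X                                        # sum_j y_j * x_j
        G = -np.outer(u, v) / (m * m * r)                # subgradient of the empirical error at A
        if np.linalg.norm(A, 'fro') > 0:
            G = G + lam * A / np.linalg.norm(A, 'fro')   # subgradient of lam * ||A||_F
        A = A - eta * (G + G.T) / 2                      # stay in the symmetric matrices S^{d x d}
    return A
```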
Let $\mathcal{F}$ be a class of uniformly bounded functions. For every integer $n$, we call
$$R_n(\mathcal{F}) := \mathbb{E}_{\mathbf{z}} \mathbb{E}_{\sigma} \Big[\sup_{f \in \mathcal{F}} \frac{1}{n} \sum_{i \in \mathbb{N}_n} \sigma_i f(z_i)\Big]$$
the Rademacher average over $\mathcal{F}$, where $\{z_i : i \in \mathbb{N}_n\}$ are independent random variables distributed according to some probability measure and $\{\sigma_i : i \in \mathbb{N}_n\}$ are independent Rademacher random variables, that is, $\mathbb{P}(\sigma_i = 1) = \mathbb{P}(\sigma_i = -1) = \frac{1}{2}$.
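As a quick illustration of Definition 1, the sketch below approximates the empirical (fixed-sample) Rademacher average of a finite function class by Monte Carlo over random sign vectors; the class and sample are made-up toy choices, not part of the paper.

```python
import numpy as np

def rademacher_average(values, n_draws=2000, seed=0):
    """Monte Carlo estimate of E_sigma[ sup_f (1/n) sum_i sigma_i f(z_i) ].
    values[k, i] holds f_k(z_i) for a finite class {f_1, ..., f_K} on a fixed sample."""
    rng = np.random.default_rng(seed)
    K, n = values.shape
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)        # Rademacher signs
        total += np.max(values @ sigma) / n            # sup over the finite class
    return total / n_draws

# Toy example: K = 50 random bounded functions evaluated on n = 100 points.
rng = np.random.default_rng(1)
vals = rng.uniform(-1.0, 1.0, size=(50, 100))
print(rademacher_average(vals))
```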
Before stating our generalization bounds for similarity learning, we first introduce some notation. For any $B, A \in \mathbb{R}^{n \times d}$, let $\langle B, A \rangle = \mathrm{trace}(B^T A)$, where $\mathrm{trace}(\cdot)$ denotes the trace of a matrix. For any matrix norm $\|\cdot\|$, its dual norm $\|\cdot\|_*$ is defined, for any $B$, by $\|B\|_* = \sup_{\|A\| \le 1} \mathrm{trace}(B^T A)$. Denote $X_* = \sup_{x, x' \in X} \|x' x^T\|_*$. Let the Rademacher average with respect to the dual matrix norm be defined by
$$R_m := \mathbb{E}_{\mathbf{z}, \sigma} \Big[\sup_{\tilde{x} \in X} \Big\|\frac{1}{m} \sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \tilde{x}^T\Big\|_*\Big]. \qquad (4)$$
Now we can state the generalization bounds for similarity learning, which are closely related to the Rademacher average with respect to the dual matrix norm $\|\cdot\|_*$.

Theorem 1. Let $A_{\mathbf{z}}$ be the solution to algorithm (2). Then, for any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) - \mathcal{E}(A_{\mathbf{z}}) \le \frac{6 R_m}{r\lambda} + \frac{2 X_*}{r\lambda}\sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (5)$$

The proof of Theorem 1 will be given in Section 4. Following the exact same argument, a similar result also holds if we switch the positions of $\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}})$ and $\mathcal{E}(A_{\mathbf{z}})$, i.e. for any $0 < \delta < 1$, with probability at least $1 - \delta$ we have
$$\mathcal{E}(A_{\mathbf{z}}) \le \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) + \frac{6 R_m}{r\lambda} + \frac{2 X_*}{r\lambda}\sqrt{\frac{2\log(1/\delta)}{m}}.$$

The second contribution of this paper is to investigate the theoretical relationship between similarity learning (2) and the generalization error of the linear classifier built from the learnt metric $A_{\mathbf{z}}$. We show that the generalization bound for similarity learning gives an upper bound on the generalization error of the linear classifier produced by the linear Support Vector Machine (SVM) [31], defined as follows:
$$f_{\mathbf{z}} = \arg\min\Big\{\frac{1}{m} \sum_{i \in \mathbb{N}_m} \big(1 - y_i f(x_i)\big)_+ : f \in F_{\mathbf{z}},\ \Omega(f) := \sum_{j \in \mathbb{N}_m} |\alpha_j| \le 1/r\Big\}, \qquad (6)$$
where $F_{\mathbf{z}} = \big\{f : f = \sum_{j \in \mathbb{N}_m} \alpha_j K_{A_{\mathbf{z}}}(x_j, \cdot),\ \alpha_j \in \mathbb{R}\big\}$ is the sample-dependent hypothesis space. The empirical error of $f \in F_{\mathbf{z}}$ associated with $\mathbf{z}$ is defined by
$$\mathcal{E}_{\mathbf{z}}(f) = \frac{1}{m} \sum_{i \in \mathbb{N}_m} \big(1 - y_i f(x_i)\big)_+,$$
and the true generalization error is defined as
$$\mathcal{E}(f) = \int_Z \big(1 - y f(x)\big)_+ d\rho(x, y).$$
Now we are in a position to state the relationship between the generalization error of similarity learning and that of the linear classifier.
Theorem 2.
Let $A_{\mathbf{z}}$ and $f_{\mathbf{z}}$ be defined by (2) and (6), respectively. Then, for any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}(f_{\mathbf{z}}) \le \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) + \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (7)$$

The proof of Theorem 2 will be given in Section 5.

Theorems 1 and 2 depend critically on two terms: the constant $X_*$ and the Rademacher average $R_m$. Below, we list estimates of these two terms associated with different matrix norms. For any vector $x = (x^1, x^2, \ldots, x^d) \in \mathbb{R}^d$, denote $\|x\|_\infty = \max_{\ell \in \mathbb{N}_d} |x^\ell|$.

Example 1. Consider the matrix norm to be the sparse $L^1$-norm defined, for any $A \in S^{d \times d}$, by $\|A\|_1 = \sum_{k, \ell \in \mathbb{N}_d} |A_{k\ell}|$. Let $A_{\mathbf{z}}$ and $f_{\mathbf{z}}$ be defined respectively by (2) and (6). Then, we have the following results.

(a) $X_* \le \sup_{x \in X} \|x\|_\infty^2$ and $R_m \le 2 \sup_{x \in X} \|x\|_\infty^2 \sqrt{\frac{e \log(d+1)}{m}}$.

(b) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) - \mathcal{E}(A_{\mathbf{z}}) \le \frac{12 \sup_{x \in X} \|x\|_\infty^2}{r\lambda} \sqrt{\frac{e \log(d+1)}{m}} + \frac{2 \sup_{x \in X} \|x\|_\infty^2}{r\lambda} \sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (8)$$

(c) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}(f_{\mathbf{z}}) \le \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) + \frac{8 \sup_{x \in X} \|x\|_\infty^2}{\lambda r} \sqrt{\frac{e \log(d+1)}{m}} + \frac{2 \sup_{x \in X} \|x\|_\infty^2}{\lambda r} \sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (9)$$
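As a numerical illustration of how the quantities in Example 1 are evaluated, the sketch below computes the right-hand side of bound (8) from a data matrix, approximating $\sup_{x \in X}\|x\|_\infty$ by the sample maximum; the data and parameter values are made up.

```python
import numpy as np

def l1_bound_example1b(X, r, lam, delta):
    """Evaluate the right-hand side of bound (8) (sparse L1-norm case) for a given sample.
    The supremum over the input domain is approximated by the sample maximum."""
    m, d = X.shape
    sup_inf = np.max(np.abs(X))                   # approximates sup_x ||x||_inf
    term1 = 12.0 * sup_inf**2 / (r * lam) * np.sqrt(np.e * np.log(d + 1) / m)
    term2 = 2.0 * sup_inf**2 / (r * lam) * np.sqrt(2.0 * np.log(1.0 / delta) / m)
    return term1 + term2

# e.g. with made-up data:
rng = np.random.default_rng(0)
print(l1_bound_example1b(rng.uniform(-1, 1, size=(500, 30)), r=1.0, lam=0.1, delta=0.05))
```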
For any vector $x \in \mathbb{R}^d$, let $\|x\|_F$ denote the standard Euclidean norm. For the regularized similarity learning formulation with the Frobenius matrix norm, we have the following result.

Example 2. Consider the Frobenius matrix norm defined, for any $A \in S^{d \times d}$, by $\|A\|_F = \sqrt{\sum_{k,\ell \in \mathbb{N}_d} |A_{k\ell}|^2}$. Let $A_{\mathbf{z}}$ and $f_{\mathbf{z}}$ be defined by (2) and (6), respectively. Then, we have the following estimates.

(a) $X_* \le \sup_{x \in X} \|x\|_F^2$ and $R_m \le \sup_{x \in X} \|x\|_F^2 / \sqrt{m}$.

(b) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) - \mathcal{E}(A_{\mathbf{z}}) \le \frac{6 \sup_{x \in X} \|x\|_F^2}{r\lambda \sqrt{m}} + \frac{2 \sup_{x \in X} \|x\|_F^2}{r\lambda} \sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (10)$$

(c) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}(f_{\mathbf{z}}) \le \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) + \frac{4 \sup_{x \in X} \|x\|_F^2}{\lambda r \sqrt{m}} + \frac{2 \sup_{x \in X} \|x\|_F^2}{\lambda r} \sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (11)$$

We end this section with two remarks. Firstly, the above theorems and examples show that a good similarity (i.e. a small empirical error $\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}})$ for similarity learning) guarantees good classification (i.e. a small classification error $\mathcal{E}(f_{\mathbf{z}})$). Secondly, the bounds in Example 2 are consistent with those in [3].
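Putting Theorems 1 and 2 together suggests a two-stage procedure: learn $A_{\mathbf{z}}$ from (2), then build a linear separator of the form (6) on top of $K_{A_{\mathbf{z}}}$. The sketch below uses the simple feasible weights $\alpha_j = y_j/(mr)$ that appear later in the proof of Theorem 2 rather than solving (6) exactly; names and data are illustrative only.

```python
import numpy as np

def two_stage_classifier(X_train, y_train, A, r=1.0):
    """Linear separator f(x) = sum_j alpha_j K_A(x_j, x) with the feasible
    weights alpha_j = y_j / (m r); predictions are sign(f(x))."""
    m = len(y_train)
    alpha = y_train / (m * r)
    def predict(X_test):
        scores = (X_test @ A @ X_train.T) @ alpha    # f(x) for each test point
        return np.sign(scores)
    return predict

# Usage (toy data; `learn_similarity` is the sketch given after formulation (2)):
# A = learn_similarity(X_train, y_train, r=1.0, lam=0.1)
# predict = two_stage_classifier(X_train, y_train, A, r=1.0)
# y_hat = predict(X_test)
```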
3 Related Work

In this section, we discuss studies on similarity metric learning that are related to our work.

Many similarity metric learning methods have been motivated by the intuition that the similarity score between examples in the same class should be larger than that between examples from distinct classes; see e.g. [4, 7, 9, 15, 17, 25, 34, 37]. Jin et al. [17] established generalization bounds for regularized metric learning algorithms via the concept of uniform stability [6], which, however, only works for strongly convex matrix regularization terms. A very recent work [7] established generalization bounds for metric and similarity learning associated with general matrix-norm regularization using techniques based on Rademacher averages and U-statistics. However, neither work provides a theoretical link between similarity metric learning and the generalization performance of classifiers built from the learnt similarity matrix. Here, we focus on the problem of how to learn a good linear similarity function $K_A$ such that it guarantees a small classification error for the classifier derived from the learnt similarity function. In addition, our formulation (2) is quite distinct from the similarity metric learning methods of [7, 9], which are based on pairwise or triplet-wise constraints and consider the following pairwise empirical objective function:
$$\frac{1}{m(m-1)} \sum_{i, j = 1,\, i \neq j}^{m} \big(1 - y_i y_j (K_A(x_i, x_j) - r)\big)_+. \qquad (12)$$
Our formulation (2) is less restrictive, since its empirical objective function is defined over an average of similarity scores and it does not require positive semi-definiteness of the similarity function $K_A$.

Balcan et al. [2] developed a theory of $(\epsilon, \gamma, \tau)$-good similarity functions, defined as follows, which investigates the theoretical relationship between the properties of a similarity function and its performance in linear classification.
Definition 2. ([1])
A similarity function $K$ is an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem $P$ if there exists a random indicator function $R(x)$ defining a probabilistic set of "reasonable points" such that the following conditions hold:

1. $\mathbb{E}_{(x,y) \sim P}\big[1 - y g(x)/\gamma\big]_+ \le \epsilon$, where $g(x) = \mathbb{E}_{(x',y') \sim P}\big[y' K(x, x') \mid R(x')\big]$;

2. $\Pr_{x'}\big[R(x')\big] \ge \tau$.

The first condition can be interpreted as "most points $x$ are on average $2\gamma$ more similar to random reasonable points of the same class than to random reasonable points of distinct classes", and the second condition as "at least a $\tau$ proportion of the points should be reasonable."
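The first condition of Definition 2 can be checked empirically on a labeled sample once a candidate set of reasonable points is fixed; the following sketch uses a plug-in average for $g(x)$. The estimator and all names are our own illustrative choices, not from [1, 2].

```python
import numpy as np

def empirical_goodness(K, X, y, reasonable_mask, gamma):
    """Plug-in estimate of E[(1 - y g(x)/gamma)_+] from Definition 2, where g(x) is
    approximated by the average of y' K(x, x') over the sampled reasonable points."""
    R = np.flatnonzero(reasonable_mask)
    losses = []
    for x, label in zip(X, y):
        g = np.mean([y[j] * K(x, X[j]) for j in R])       # empirical g(x)
        losses.append(max(0.0, 1.0 - label * g / gamma))
    return float(np.mean(losses)), reasonable_mask.mean()  # (hinge condition, tau estimate)
```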
The following theorem implies that, given an $(\epsilon, \gamma, \tau)$-good similarity function and enough landmarks, there exists a separator $\alpha$ with error arbitrarily close to $\epsilon$.

Theorem 3. ([1]) Let $K$ be an $(\epsilon, \gamma, \tau)$-good similarity function in hinge loss for a learning problem $P$. For any $\epsilon_1 > 0$ and $0 < \delta \le \gamma \epsilon_1 / 4$, let $S = \{x'_1, \cdots, x'_{d_{land}}\}$ be a (potentially unlabeled) sample of $d_{land} = \frac{2}{\tau}\big(\log(2/\delta) + 16 \log(2/\delta)/(\epsilon_1 \gamma)^2\big)$ landmarks drawn from $P$. Consider the mapping $\phi^S_i(x) = K(x, x'_i)$, $i \in \{1, \cdots, d_{land}\}$. Then, with probability at least $1 - \delta$ over the random sample $S$, the induced distribution $\phi^S(P)$ in $\mathbb{R}^{d_{land}}$ has a linear separator $\alpha$ of error at most $\epsilon + \epsilon_1$ at margin $\gamma$.

It was mentioned in [2] that the linear separator can be estimated by solving the following linear program, given $d_u$ potentially unlabeled samples and $d_l$ labeled samples:
$$\min_{\alpha} \Big\{\sum_{i=1}^{d_l} \Big[1 - \sum_{j=1}^{d_u} \alpha_j y_i K(x_i, x'_j)\Big]_+ : \sum_{j=1}^{d_u} |\alpha_j| \le 1/\gamma\Big\}. \qquad (13)$$
The above algorithm (13) is quite similar to the linear SVM (6) used in our paper. Our work is distinct from Balcan et al. [2] in the following two respects. Firstly, the similarity function $K$ is predefined in algorithm (13), while we aim to learn a similarity function $K_{A_{\mathbf{z}}}$ from the regularized similarity learning formulation (2). Secondly, although the separators are both trained by a linear SVM, the classification algorithm (13) in [2] was designed using two different sets of examples: a set of labeled samples of size $d_l$ to train the classification algorithm, and another set of unlabeled samples of size $d_u$ to define the mapping $\phi^S$. In this paper, we use the same set of training samples for both similarity learning (2) and the classification algorithm (6).

Recent work by Bellet et al. [3] is closest to ours. Specifically, they considered the similarity learning formulation (2) with Frobenius-norm regularization. Generalization bounds for similarity learning were derived there via uniform stability arguments [6], which cannot deal with, for instance, the $L^1$-norm and $(2,1)$-norm regularizers.

4 Proof of Theorem 1

Proof of Theorem 1: Our proof is divided into two steps.

Step 1: Let $\mathbb{E}_{\mathbf{z}}$ denote the expectation with respect to the samples $\mathbf{z}$, and let $\mathcal{A} = \{A \in S^{d \times d} : \|A\| \le 1/\lambda\}$. Comparing $A_{\mathbf{z}}$ with $A = 0$ in (2) shows $\lambda\|A_{\mathbf{z}}\| \le \mathcal{E}_{\mathbf{z}}(0) = 1$, so $A_{\mathbf{z}} \in \mathcal{A}$ and hence $\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) - \mathcal{E}(A_{\mathbf{z}}) \le \sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}(A)\big]$. Also, for any $\mathbf{z} = (z_1, \ldots, z_k, \ldots, z_m)$ and $\tilde{\mathbf{z}} = (z_1, \ldots, \tilde{z}_k, \ldots, z_m)$, $1 \le k \le m$, there holds
$$\Big|\sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}(A)\big] - \sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\tilde{\mathbf{z}}}(A) - \mathcal{E}(A)\big]\Big| \le \sup_{A \in \mathcal{A}} \big|\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}_{\tilde{\mathbf{z}}}(A)\big|$$
$$\le \frac{1}{m^2 r} \sup_{A \in \mathcal{A}} \Big\{\sum_{i=1,\, i \neq k}^{m} \big|y_i y_k K_A(x_k, x_i) - y_i \tilde{y}_k K_A(\tilde{x}_k, x_i)\big| + \Big|\sum_{j \in \mathbb{N}_m} \big(y_k y_j K_A(x_k, x_j) - \tilde{y}_k y_j K_A(\tilde{x}_k, x_j)\big)\Big|\Big\}$$
$$\le \frac{2}{m^2 r} \sup_{A \in \mathcal{A}} \sum_{i \in \mathbb{N}_m} \big(|y_i y_k K_A(x_k, x_i)| + |y_i \tilde{y}_k K_A(\tilde{x}_k, x_i)|\big) \le \frac{4 X_*}{m r \lambda}.$$
Applying McDiarmid's inequality [26] (see Lemma 1 in the Appendix) to the term $\sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}(A)\big]$, with probability at least $1 - \delta$ there holds
$$\sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}(A)\big] \le \mathbb{E}_{\mathbf{z}} \sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}(A)\big] + \frac{2 X_*}{r\lambda} \sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (15)$$
We now estimate the expectation on the right-hand side of the above inequality by standard symmetrization techniques.
Step 2: Adding and subtracting $\frac{1}{m}\sum_{i \in \mathbb{N}_m}\big[1 - \frac{1}{r}\mathbb{E}_{(x',y')} y_i y' K_A(x_i, x')\big]_+$, we decompose the expectation into two parts,
$$\mathbb{E}_{\mathbf{z}} \sup_{A \in \mathcal{A}} \big[\mathcal{E}_{\mathbf{z}}(A) - \mathcal{E}(A)\big] \le I_1 + I_2,$$
where
$$I_1 := \mathbb{E}_{\mathbf{z}} \sup_{A \in \mathcal{A}} \Big\{\frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big[1 - \frac{1}{r}\mathbb{E}_{(x',y')}\, y_i y' K_A(x_i, x')\Big]_+ - \mathcal{E}(A)\Big\}$$
and
$$I_2 := \mathbb{E}_{\mathbf{z}} \sup_{A \in \mathcal{A}} \Big\{\frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big(1 - \frac{1}{mr} \sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big)_+ - \frac{1}{m} \sum_{i \in \mathbb{N}_m} \Big[1 - \frac{1}{r}\mathbb{E}_{(x',y')}\, y_i y' K_A(x_i, x')\Big]_+\Big\}.$$
Now let $\bar{\mathbf{z}} = \{\bar{z}_1, \bar{z}_2, \ldots, \bar{z}_m\}$ be an i.i.d. sample independent of $\mathbf{z}$. We first estimate $I_1$ using standard symmetrization. To this end, we rewrite $\mathcal{E}(A)$ as $\mathbb{E}_{\bar{\mathbf{z}}}\big(\frac{1}{m}\sum_{i \in \mathbb{N}_m}\big[1 - \frac{1}{r}\mathbb{E}_{(x',y')}\, \bar{y}_i y' K_A(\bar{x}_i, x')\big]_+\big)$. Then we have
$$I_1 \le \mathbb{E}_{\mathbf{z}, \bar{\mathbf{z}}} \sup_{A \in \mathcal{A}} \Big\{\frac{1}{m}\sum_{i \in \mathbb{N}_m}\Big[1 - \frac{1}{r}\mathbb{E}_{(x',y')}\, y_i y' K_A(x_i, x')\Big]_+ - \frac{1}{m}\sum_{i \in \mathbb{N}_m}\Big[1 - \frac{1}{r}\mathbb{E}_{(x',y')}\, \bar{y}_i y' K_A(\bar{x}_i, x')\Big]_+\Big\}.$$
By the standard Rademacher symmetrization technique and the contraction property of the Rademacher average (see Lemma 2 in the Appendix), we further have
$$I_1 \le 2\,\mathbb{E}_{\mathbf{z},\sigma} \sup_{A \in \mathcal{A}} \Big\{\frac{1}{m}\sum_{i \in \mathbb{N}_m} \sigma_i \Big[1 - \frac{1}{r}\mathbb{E}_{(x',y')}\, y_i y' K_A(x_i, x')\Big]_+\Big\} \le 4\,\mathbb{E}_{\mathbf{z},\sigma} \sup_{A \in \mathcal{A}} \Big|\Big\langle \frac{1}{mr}\sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \int y' x'^T d\rho(x',y'),\ A\Big\rangle\Big|$$
$$\le \frac{4}{r\lambda}\,\mathbb{E}_{\mathbf{z},\sigma} \Big\|\frac{1}{m}\sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \int y' x'^T d\rho(x',y')\Big\|_* \le \frac{4}{r\lambda}\,\mathbb{E}_{\mathbf{z},\sigma} \sup_{\tilde{x}\in X} \Big\|\frac{1}{m}\sum_{i \in \mathbb{N}_m} \sigma_i y_i x_i \tilde{x}^T\Big\|_* = \frac{4 R_m}{r\lambda},$$
where we used the fact that $\langle A, B\rangle \le \|A\|\,\|B\|_* \le \frac{1}{\lambda}\|B\|_*$ for any $A \in \mathcal{A}$ and $B \in \mathbb{R}^{d \times d}$, together with Jensen's inequality. Similarly, we estimate $I_2$ as follows. By the Lipschitz continuity of the hinge loss,
$$I_2 \le \mathbb{E}_{\mathbf{z}} \sup_{A \in \mathcal{A}} \frac{1}{mr} \sum_{i \in \mathbb{N}_m} \Big|\mathbb{E}_{(x',y')}\, y_i y' K_A(x_i, x') - \frac{1}{m}\sum_{j \in \mathbb{N}_m} y_i y_j K_A(x_i, x_j)\Big|$$
$$= \mathbb{E}_{\mathbf{z}} \sup_{A \in \mathcal{A}} \frac{1}{mr} \sum_{i \in \mathbb{N}_m} \Big|\Big\langle \mathbb{E}_{(x',y')}\, y' x' x_i^T - \frac{1}{m}\sum_{j \in \mathbb{N}_m} y_j x_j x_i^T,\ A\Big\rangle\Big| \le \frac{1}{r\lambda}\,\mathbb{E}_{\mathbf{z}} \sup_{x \in X} \Big\|\mathbb{E}_{(x',y')}\, y' x' x^T - \frac{1}{m}\sum_{j \in \mathbb{N}_m} y_j x_j x^T\Big\|_*.$$
Introducing an independent copy $\mathbf{z}'$ of $\mathbf{z}$ and applying the standard Rademacher symmetrization technique (see e.g. [5]), we can further bound $I_2$ by
$$I_2 \le \frac{1}{r\lambda}\,\mathbb{E}_{\mathbf{z},\mathbf{z}'} \sup_{x \in X} \Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m} y'_j x'_j x^T - \frac{1}{m}\sum_{j \in \mathbb{N}_m} y_j x_j x^T\Big\|_* \le \frac{1}{r\lambda}\,\mathbb{E}_{\mathbf{z},\mathbf{z}',\sigma} \sup_{x \in X} \Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m} \sigma_j \big(y'_j x'_j x^T - y_j x_j x^T\big)\Big\|_*$$
$$\le \frac{2}{r\lambda}\,\mathbb{E}_{\mathbf{z},\sigma} \sup_{x \in X} \Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m} \sigma_j y_j x_j x^T\Big\|_* = \frac{2 R_m}{r\lambda}.$$
The desired result follows by combining (15) with the above estimates for $I_1$ and $I_2$. This completes the proof of the theorem. □

5 Proof of Theorem 2

In this section, we investigate the theoretical relationship between the generalization error of similarity learning and that of the linear classifier built from the learnt similarity metric $K_{A_{\mathbf{z}}}$. In particular, we show that the generalization error of similarity learning gives an upper bound on the generalization error of the linear classifier, as stated in Theorem 2 in Section 2.

Before giving the proof of Theorem 2, we first establish generalization bounds for the linear SVM algorithm (6). Recall that the linear SVM algorithm (6) is defined by
$$f_{\mathbf{z}} = \arg\min\Big\{\frac{1}{m}\sum_{i \in \mathbb{N}_m}\big(1 - y_i f(x_i)\big)_+ : f \in F_{\mathbf{z}},\ \Omega(f) := \sum_{j \in \mathbb{N}_m}|\alpha_j| \le 1/r\Big\}, \qquad F_{\mathbf{z}} = \Big\{f : f = \sum_{j \in \mathbb{N}_m}\alpha_j K_{A_{\mathbf{z}}}(x_j, \cdot),\ \alpha_j \in \mathbb{R}\Big\}.$$
The generalization analysis of the linear SVM algorithm (6) aims to estimate the term $\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}})$. For any $\mathbf{z}$, one can easily see that the solution to algorithm (6) belongs to the set $F_{\mathbf{z},r}$, where
$$F_{\mathbf{z},r} = \Big\{f = \sum_{j \in \mathbb{N}_m}\alpha_j K_{A_{\mathbf{z}}}(x_j, \cdot) : \Omega(f) = \sum_{j \in \mathbb{N}_m}|\alpha_j| \le 1/r,\ \alpha_j \in \mathbb{R}\Big\}.$$
To perform the generalization analysis, we seek a sample-independent set which contains, for any $\mathbf{z}$, the sample-dependent hypothesis space $F_{\mathbf{z}}$. Specifically, we define a sample-independent hypothesis space by
$$F_m = \Big\{f = \sum_{i \in \mathbb{N}_m}\alpha_i K_A(u_i, \cdot) : \|A\| \le 1/\lambda,\ u_i \in X,\ \alpha_i \in \mathbb{R}\Big\}.$$
Recalling that, for any $\mathbf{z}$, $\|A_{\mathbf{z}}\| \le \lambda^{-1}$, one can easily see that $F_{\mathbf{z}}$ is a subset of $F_m$. It follows that, for any $\mathbf{z}$, the solution to the linear SVM algorithm (6) lies in the set $F_{m,r} = \{f \in F_m : \Omega(f) \le 1/r\}$.

The following theorem states the generalization bounds of the linear SVM for classification.

Theorem 4. Let $f_{\mathbf{z}}$ be the solution to algorithm (6). For any $0 < \delta < 1$, with probability at least $1 - \delta$, we have
$$\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) \le \frac{4 R_m}{\lambda r} + \frac{2 X_*}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (16)$$

Proof. By McDiarmid's inequality, for any $0 < \delta < 1$, with confidence $1 - \delta$, there holds
$$\mathcal{E}(f_{\mathbf{z}}) - \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) \le \sup_{f \in F_{\mathbf{z},r}}\big(\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)\big) \le \sup_{f \in F_{m,r}}\big(\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)\big) \le \mathbb{E}_{\mathbf{z}}\sup_{f \in F_{m,r}}\big(\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)\big) + \frac{2 X_*}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}}.$$
Next, all we need is to estimate the first term on the right-hand side of the above inequality. Let $\bar{\mathbf{z}}$ be a sample, independent of $\mathbf{z}$ and with mutually independent entries, having the same distribution as $\mathbf{z}$. Then
$$\mathbb{E}_{\mathbf{z}}\sup_{f \in F_{m,r}}\big(\mathcal{E}(f) - \mathcal{E}_{\mathbf{z}}(f)\big) = \mathbb{E}_{\mathbf{z}}\sup_{f \in F_{m,r}}\big(\mathbb{E}_{\bar{\mathbf{z}}}\mathcal{E}_{\bar{\mathbf{z}}}(f) - \mathcal{E}_{\mathbf{z}}(f)\big) \le \mathbb{E}_{\mathbf{z},\bar{\mathbf{z}}}\sup_{f \in F_{m,r}}\big(\mathcal{E}_{\bar{\mathbf{z}}}(f) - \mathcal{E}_{\mathbf{z}}(f)\big)$$
$$\le 2\,\mathbb{E}_{\mathbf{z},\sigma}\Big[\sup_{\|A\| \le 1/\lambda}\ \sup_{\sum_{i}|\alpha_i| \le 1/r}\Big(\frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i\Big[1 - \sum_{j \in \mathbb{N}_m}\alpha_j y_i K_A(x_i, u_j)\Big]_+\Big)\Big] \le 4\,\mathbb{E}_{\mathbf{z},\sigma}\Big[\sup_{\|A\| \le 1/\lambda}\ \sup_{\sum_{i}|\alpha_i| \le 1/r}\Big|\frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i \sum_{j \in \mathbb{N}_m}\alpha_j y_i K_A(x_i, u_j)\Big|\Big]$$
$$\le \frac{4}{r}\,\mathbb{E}_{\mathbf{z},\sigma}\sup_{\|A\| \le 1/\lambda}\ \sup_{u \in X}\Big|\frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i y_i \big\langle x_i u^T,\ A\big\rangle\Big| \le \frac{4}{\lambda r}\,\mathbb{E}_{\mathbf{z},\sigma}\sup_{u \in X}\Big\|\frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i y_i x_i u^T\Big\|_* = \frac{4 R_m}{\lambda r}.$$
Here we again used the standard Rademacher symmetrization technique and the contraction property of the Rademacher average. The proof is complete. □

Now we are in a position to give the detailed proof of Theorem 2.

Proof of Theorem 2: If we take $\alpha = \big(\frac{y_1}{mr}, \cdots, \frac{y_m}{mr}\big)^T$, the corresponding function is $\tilde{f} = \frac{1}{mr}\sum_{j \in \mathbb{N}_m} y_j K_{A_{\mathbf{z}}}(x_j, \cdot)$.
One can easily see that $\Omega(\tilde{f}) = \sum_{j \in \mathbb{N}_m}|\alpha_j| = 1/r$, which means $\tilde{f} \in F_{\mathbf{z},r}$, and moreover $\mathcal{E}_{\mathbf{z}}(\tilde{f}) = \frac{1}{m}\sum_{i \in \mathbb{N}_m}\big(1 - \frac{1}{mr}\sum_{j \in \mathbb{N}_m}y_i y_j K_{A_{\mathbf{z}}}(x_i, x_j)\big)_+ = \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}})$. From Theorem 4 and the definition of $f_{\mathbf{z}}$ as the minimizer of the empirical error over $F_{\mathbf{z},r}$, we get
$$\mathcal{E}(f_{\mathbf{z}}) \le \mathcal{E}_{\mathbf{z}}(f_{\mathbf{z}}) + \frac{4R_m}{\lambda r} + \frac{2X_*}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}} \le \mathcal{E}_{\mathbf{z}}(\tilde{f}) + \frac{4R_m}{\lambda r} + \frac{2X_*}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}} = \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) + \frac{4R_m}{\lambda r} + \frac{2X_*}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}}.$$
This completes the proof of the theorem. □

6 Estimation of the Rademacher Average

The main theorems above depend critically on the estimation of the Rademacher average $R_m$ defined by equation (4). In this section, we establish a self-contained proof of this estimation and prove the examples listed in Section 2. For notational simplicity, denote by $x_i^\ell$ the $\ell$-th coordinate of the $i$-th sample $x_i \in \mathbb{R}^d$.

Proof of Example 1: The dual norm of the $L^1$-norm is the $L^\infty$-norm. Hence,
$$X_* = \sup_{x, x' \in X}\ \sup_{\ell, k \in \mathbb{N}_d} \big|x^\ell (x')^k\big| = \sup_{x \in X}\|x\|_\infty^2. \qquad (17)$$
Also, the Rademacher average can be rewritten as
$$R_m = \mathbb{E}_{\mathbf{z},\sigma}\sup_{x \in X}\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j x^T\Big\|_\infty \le \sup_{x \in X}\|x\|_\infty\ \mathbb{E}_{\mathbf{z},\sigma}\max_{\ell \in \mathbb{N}_d}\Big|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j^\ell\Big|. \qquad (18)$$
Now let $U_\ell(\sigma) = \frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j^\ell$ for any $\ell \in \mathbb{N}_d$. By Jensen's inequality, for any $\eta > 0$,
$$e^{\eta^2(\mathbb{E}_\sigma\max_{\ell \in \mathbb{N}_d}|U_\ell(\sigma)|)^2} - 1 \le \mathbb{E}_\sigma\big[e^{\eta^2(\max_{\ell \in \mathbb{N}_d}|U_\ell(\sigma)|)^2} - 1\big] = \mathbb{E}_\sigma\big[\max_{\ell \in \mathbb{N}_d} e^{\eta^2|U_\ell(\sigma)|^2} - 1\big] \le \sum_{\ell \in \mathbb{N}_d}\mathbb{E}_\sigma\big[e^{\eta^2|U_\ell(\sigma)|^2} - 1\big]. \qquad (19)$$
Furthermore, for any $\ell \in \mathbb{N}_d$, there holds
$$\mathbb{E}_\sigma\big[e^{\eta^2|U_\ell(\sigma)|^2} - 1\big] = \sum_{k \ge 1}\frac{\eta^{2k}}{k!}\mathbb{E}_\sigma|U_\ell|^{2k} \le \sum_{k \ge 1}\frac{\eta^{2k}}{k!}(2k-1)^k\big(\mathbb{E}_\sigma|U_\ell|^2\big)^k \le \sum_{k \ge 1}\big(2e\eta^2\,\mathbb{E}_\sigma|U_\ell|^2\big)^k,$$
where the first inequality follows from the Khinchin-type inequality (see Lemma 3 in the Appendix), and the second inequality holds due to Stirling's inequality $e^{-k}k^k \le k!$. Now set $\eta = \big[2\sqrt{e}\max_{\ell \in \mathbb{N}_d}(\mathbb{E}_\sigma|U_\ell|^2)^{1/2}\big]^{-1}$. Then the above quantity can be bounded by
$$\mathbb{E}_\sigma\big[e^{\eta^2|U_\ell(\sigma)|^2} - 1\big] \le \sum_{k \ge 1}2^{-k} = 1, \qquad \forall\, \ell \in \mathbb{N}_d.$$
Putting this estimate back into (19) implies that $e^{\eta^2(\mathbb{E}_\sigma\max_{\ell \in \mathbb{N}_d}|U_\ell(\sigma)|)^2} - 1 \le d$, which means
$$\mathbb{E}_\sigma\max_{\ell \in \mathbb{N}_d}|U_\ell(\sigma)| = \mathbb{E}_\sigma\max_{\ell \in \mathbb{N}_d}\Big|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j^\ell\Big| \le \sqrt{\log(d+1)}\,\eta^{-1} = 2\sqrt{e\log(d+1)}\max_{\ell \in \mathbb{N}_d}\big(\mathbb{E}_\sigma|U_\ell|^2\big)^{1/2}$$
$$= 2\sqrt{e\log(d+1)}\max_{\ell \in \mathbb{N}_d}\Big(\frac{1}{m^2}\sum_{j \in \mathbb{N}_m}(x_j^\ell)^2\Big)^{1/2} \le 2\sup_{x \in X}\|x\|_\infty\sqrt{\frac{e\log(d+1)}{m}}. \qquad (20)$$
Putting this estimate back into (18) implies that
$$R_m \le 2\sup_{x \in X}\|x\|_\infty^2\sqrt{\frac{e\log(d+1)}{m}}.$$
The other desired results in the example follow directly from combining this estimate with Theorems 1 and 2. □

We now turn our attention to the similarity learning formulation (2) with Frobenius-norm regularization.

Proof of Example 2: The dual norm of the Frobenius norm is the Frobenius norm itself. Consequently, $X_* = \sup_{x,x' \in X}\|x'x^T\|_F = \sup_{x \in X}\|x\|_F^2$. The Rademacher average can be rewritten as $R_m = \mathbb{E}_{\mathbf{z},\sigma}\sup_{x \in X}\big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j x^T\big\|_F$. By the Cauchy-Schwarz inequality, there holds
$$R_m = \mathbb{E}_{\mathbf{z},\sigma}\sup_{x \in X}\|x\|_F\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F \le \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z}}\Big(\mathbb{E}_\sigma\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F^2\Big)^{1/2} = \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z}}\Big(\sum_{j \in \mathbb{N}_m}\|x_j\|_F^2\Big)^{1/2}\Big/m \le \frac{\sup_{x \in X}\|x\|_F^2}{\sqrt{m}}. \qquad (21)$$
Then, the desired results can be derived by combining the above estimate with Theorems 1 and 2. □

The above generalization bound for the similarity learning formulation (2) with Frobenius-norm regularization is consistent with that given in [3], where the result holds true under the assumption that $\sup_{x \in X}\|x\|_F \le 1$. Below, we provide the estimation of $R_m$ for the mixed $(2,1)$-norm and the trace norm, respectively.

Example 3. Consider the similarity learning formulation (2) with the mixed $(2,1)$-norm regularizer $\|A\|_{(2,1)} = \sum_{k \in \mathbb{N}_d}\big(\sum_{\ell \in \mathbb{N}_d}|A_{k\ell}|^2\big)^{1/2}$. Then, we have the following estimates.

(a) $X_* \le \big[\sup_{x \in X}\|x\|_F\big]\big[\sup_{x \in X}\|x\|_\infty\big]$ and $R_m \le 2\big[\sup_{x \in X}\|x\|_F\big]\big[\sup_{x \in X}\|x\|_\infty\big]\sqrt{\frac{e\log(d+1)}{m}}$.

(b) For any $0 < \delta < 1$, with confidence at least $1 - \delta$, there holds
$$\mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) - \mathcal{E}(A_{\mathbf{z}}) \le \frac{12\big[\sup_{x \in X}\|x\|_F\big]\big[\sup_{x \in X}\|x\|_\infty\big]}{r\lambda}\sqrt{\frac{e\log(d+1)}{m}} + \frac{2\big[\sup_{x \in X}\|x\|_F\big]\big[\sup_{x \in X}\|x\|_\infty\big]}{r\lambda}\sqrt{\frac{2\log(1/\delta)}{m}}. \qquad (22)$$

(c) For any $0 < \delta < 1$, with probability at least $1 - \delta$, there holds
$$\mathcal{E}(f_{\mathbf{z}}) \le \mathcal{E}_{\mathbf{z}}(A_{\mathbf{z}}) + \frac{8\big[\sup_{x \in X}\|x\|_F\big]\big[\sup_{x \in X}\|x\|_\infty\big]}{\lambda r}\sqrt{\frac{e\log(d+1)}{m}} + \frac{2\big[\sup_{x \in X}\|x\|_F\big]\big[\sup_{x \in X}\|x\|_\infty\big]}{\lambda r}\sqrt{\frac{2\log(1/\delta)}{m}}.$$

Proof. The dual norm of the $(2,1)$-norm is the $(2,\infty)$-norm, which implies that $X_* = \sup_{x,x' \in X}\|x'x^T\|_{(2,\infty)} = \sup_{x \in X}\|x\|_F\ \sup_{x' \in X}\|x'\|_\infty$ and
$$\mathbb{E}_{\mathbf{z},\sigma}\sup_{x \in X}\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j x^T\Big\|_* \le \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z},\sigma}\max_{\ell \in \mathbb{N}_d}\Big|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j^\ell\Big| \le 2\sup_{x \in X}\|x\|_F\ \sup_{x \in X}\|x\|_\infty\sqrt{\frac{e\log(d+1)}{m}},$$
where the last inequality follows from estimate (20). We complete the proof by combining these estimates with Theorems 1 and 2. □

We briefly discuss the case of trace-norm regularization, i.e. $\|A\| = \|A\|_{\mathrm{tr}}$. In this case, the dual norm of the trace norm is the spectral norm defined, for any $B \in S^{d \times d}$, by $\|B\|_* = \max_{\ell \in \mathbb{N}_d}\sigma_\ell(B)$, where $\{\sigma_\ell(B) : \ell \in \mathbb{N}_d\}$ are the singular values of the matrix $B$. Observe, for any $u, v \in \mathbb{R}^d$, that $\|uv^T\|_* = \|u\|_F\|v\|_F$. Hence, the constant $X_* = \sup_{x,x' \in X}\|x'x^T\|_* = \sup_{x \in X}\|x\|_F^2$. In addition,
$$R_m = \mathbb{E}_{\mathbf{z},\sigma}\sup_{x \in X}\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j x^T\Big\|_* = \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z},\sigma}\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F \le \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z}}\Big(\mathbb{E}_\sigma\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F^2\Big)^{1/2} = \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z}}\Big(\sum_{j \in \mathbb{N}_m}\|x_j\|_F^2\Big)^{1/2}\Big/m. \qquad (23)$$
Indeed, the above estimate for $R_m$ is optimal. To see this, we observe from [28, Theorem 1.3.2] that
$$\Big(\mathbb{E}_\sigma\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F^2\Big)^{1/2} \le \sqrt{2}\ \mathbb{E}_\sigma\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F.$$
Combining this fact with (23), we obtain
$$R_m = \sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z},\sigma}\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F \ge \frac{1}{\sqrt{2}}\sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z}}\Big(\mathbb{E}_\sigma\Big\|\frac{1}{m}\sum_{j \in \mathbb{N}_m}\sigma_j y_j x_j\Big\|_F^2\Big)^{1/2} = \frac{1}{\sqrt{2}\,m}\sup_{x \in X}\|x\|_F\ \mathbb{E}_{\mathbf{z}}\Big(\sum_{j \in \mathbb{N}_m}\|x_j\|_F^2\Big)^{1/2}.$$
Hence, the estimate (23) for $R_m$ is optimal up to the constant $\sqrt{2}$.
Furthermore, after the further estimate $\frac{1}{m}\mathbb{E}_{\mathbf{z}}\big(\sum_{j \in \mathbb{N}_m}\|x_j\|_F^2\big)^{1/2} \le \frac{1}{\sqrt{m}}\sup_{x \in X}\|x\|_F$, the above estimates show that the bound on $R_m$ for trace-norm regularization is the same as the estimate (21) for Frobenius-norm regularization. Consequently, the generalization bounds for similarity learning and the relationship between similarity learning and the linear SVM are the same as those stated in Example 2. It is a bit disappointing that there is no improvement when using the trace norm. A possible reason is that the spectral norm and the Frobenius norm of $B$ coincide whenever $B$ takes the rank-one form $B = xy^T$ for $x, y \in \mathbb{R}^d$.

We end this section with a comment on an alternative way to estimate the Rademacher average $R_m$. Kakade et al. [18, 19] developed elegant techniques for estimating Rademacher averages for linear predictors. In particular, the following theorem was established there.

Theorem 5. ([18, 19]) Let $\mathcal{W}$ be a closed convex set and let $f : \mathcal{W} \to \mathbb{R}$ be $\beta$-strongly convex with respect to $\|\cdot\|$, and assume that $f^*(0) = 0$. Assume $\mathcal{W} \subseteq \{w : f(w) \le f_{\max}\}$. Furthermore, let $\mathcal{X} = \{x : \|x\|_* \le X\}$ and $\mathcal{F} = \{w \mapsto \langle w, x\rangle : w \in \mathcal{W},\ x \in \mathcal{X}\}$. Then, we have
$$R_n(\mathcal{F}) \le X\sqrt{\frac{2 f_{\max}}{\beta n}}.$$

To apply Theorem 5, we rewrite the Rademacher average $R_m$ as
$$R_m = \mathbb{E}_{\mathbf{z},\sigma}\Big[\sup_{\tilde{x} \in X}\Big\|\frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i y_i x_i \tilde{x}^T\Big\|_*\Big] = \mathbb{E}_{\mathbf{z},\sigma}\Big[\sup_{\tilde{x} \in X}\ \sup_{\|A\| \le 1,\, A \in S^{d \times d}}\Big\langle \frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i y_i x_i \tilde{x}^T,\ A\Big\rangle\Big] = \mathbb{E}_{\mathbf{z},\sigma}\Big[\sup_{\tilde{x} \in X}\ \sup_{\|A\| \le 1,\, A \in S^{d \times d}}\Big\langle \frac{1}{m}\sum_{i \in \mathbb{N}_m}\sigma_i y_i x_i,\ A\tilde{x}\Big\rangle\Big]. \qquad (24)$$
Now let $\mathcal{W} := \{w = A\tilde{x} : \|A\| \le 1,\ A \in S^{d \times d},\ \tilde{x} \in X\}$. Let us first consider the sparse $L^1$-norm defined, for any $A \in S^{d \times d}$, by $\|A\|_1 = \sum_{k,\ell \in \mathbb{N}_d}|A_{k\ell}|$. In this case, we observe that $\|w\|_1 \le \|A\|_1\|\tilde{x}\|_\infty \le \sup_{x \in X}\|x\|_\infty$. Let $f(w) = \|w\|_q^2$ with $q = \frac{\log d}{\log d - 1}$, which is strongly convex with respect to the norm $\|\cdot\|_1$ with modulus of order $1/\log d$. Then, for any $w \in \mathcal{W}$, we have $\|w\|_q \le \|w\|_1 \le \sup_{x \in X}\|x\|_\infty$. Combining these observations with (24) and Theorem 5 yields an estimate of the form $R_m \le c\,\sup_{x \in X}\|x\|_\infty^2\sqrt{\log d / m}$ for some absolute constant $c$. Similarly, for the $(2,1)$-norm $\|A\|_{(2,1)} = \sum_{k \in \mathbb{N}_d}\big(\sum_{\ell \in \mathbb{N}_d}|A_{k\ell}|^2\big)^{1/2}$, observe that $\|w\|_1 \le \|A\|_{(2,1)}\|\tilde{x}\|_F \le \sup_{x \in X}\|x\|_F$. Applying Theorem 5 again with $f(w) = \|w\|_q^2$ ($q = \frac{\log d}{\log d - 1}$) gives an estimate of the form $R_m \le c\,\sup_{x \in X}\|x\|_F\,\sup_{x \in X}\|x\|_\infty\sqrt{\log d / m}$. Hence, the estimates for the above two cases are similar to ours in the above examples. Our derivation is more direct, using the Khinchin-type inequality instead of the advanced convex-analysis techniques of Kakade et al. [18, 19].

However, for the case of trace-norm regularization (i.e. $\|A\| = \|A\|_{\mathrm{tr}}$), one might expect, using the techniques in [18, 19], that the estimate of $R_m$ would be of the same form as in the case of the sparse $L^1$-norm. The main hurdle for such a result is bounding $\|w\| = \|A\tilde{x}\|$ in terms of the trace norm of $A$. Indeed, by the discussion following our estimate (23), which uses the Khinchin-type inequality directly, we know that our estimate (23) is optimal. Hence, one cannot expect the estimate of $R_m$ for trace-norm regularization to be of the same form as that for sparse $L^1$-norm regularization in our particular similarity learning formulation (2).

7 Conclusion

In this paper, we considered a regularized similarity learning formulation (2).
Its generalization bounds were established for various matrix-norm regularization terms such as the Frobenius norm, the sparse $L^1$-norm, and the mixed $(2,1)$-norm. Moreover, we showed that the generalization error of the linear classifier built from the learnt similarity can be bounded by the generalization bound for similarity learning, so that good similarity learning provably guarantees good classification.

Acknowledgement

We are grateful to the referees for their invaluable comments and suggestions on this paper. This work was supported by the EPSRC under grant EP/J001384/1. The corresponding author is Yiming Ying.

References

[1] M.-F. Balcan and A. Blum. On a theory of learning with similarity functions. ICML, 2006.
[2] M.-F. Balcan, A. Blum and N. Srebro. Improved guarantees for learning via similarity functions. COLT, 2008.
[3] A. Bellet, A. Habrard and M. Sebban. Similarity learning for provably accurate sparse linear classification. ICML, 2012.
[4] A. Bar-Hillel, T. Hertz, N. Shental, and D. Weinshall. Learning a Mahalanobis metric from equivalence constraints. J. of Machine Learning Research, 6: 937–965, 2005.
[5] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: risk bounds and structural results. J. of Machine Learning Research, 3: 463–482, 2002.
[6] O. Bousquet and A. Elisseeff. Stability and generalization. J. of Machine Learning Research, 2: 499–526, 2002.
[7] Q. Cao, Z.-C. Guo and Y. Ying. Generalization bounds for metric and similarity learning. Preprint, 2012.
[8] Y. Chen, E. K. Garcia, M. R. Gupta, A. Rahimi, and L. Cazzanti. Similarity-based classification: concepts and algorithms. J. of Machine Learning Research, 10: 747–776, 2009.
[9] G. Chechik, V. Sharma, U. Shalit, and S. Bengio. Large scale online learning of image similarity through ranking. J. of Machine Learning Research, 11: 1109–1135, 2010.
[10] S. Clémençon, G. Lugosi, and N. Vayatis. Ranking and empirical minimization of U-statistics. Annals of Statistics, 36: 844–874, 2008.
[11] C. Cortes, M. Mohri, and A. Rostamizadeh. Generalization bounds for learning kernels. ICML, 2010.
[13] J. Davis, B. Kulis, P. Jain, S. Sra, and I. Dhillon. Information-theoretic metric learning. ICML, 2007.
[14] M. Guillaumin, J. Verbeek and C. Schmid. Is that you? Metric learning approaches for face identification. ICCV, 2009.
[15] IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), pages 2072–2078, 2006.
[16] P. Jain and P. Kar. Supervised learning with similarity functions. NIPS, 2012.
[17] R. Jin, S. Wang and Y. Zhou. Regularized distance metric learning: theory and algorithm. NIPS, 2009.
[18] S. M. Kakade, K. Sridharan, and A. Tewari. On the complexity of linear prediction: risk bounds, margin bounds, and regularization. NIPS, 2008.
[19] S. M. Kakade, S. Shalev-Shwartz, and A. Tewari. Regularization techniques for learning with matrices. J. of Machine Learning Research, 13: 1865–1890, 2012.
[20] P. Kar. Generalization guarantees for a binary classification framework for two-stage multiple kernel learning. CoRR abs/1302.0406, 2013.
[21] P. Kar and P. Jain. Similarity-based learning via data-driven embeddings. NIPS, 2011.
[22] V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. The Annals of Statistics, 30: 1–50, 2002.
[23] G. R. G. Lanckriet, N. Cristianini, P. Bartlett, L. E. Ghaoui, and M. I. Jordan. Learning the kernel matrix with semidefinite programming. J. of Machine Learning Research, 5: 27–72, 2004.
[24] M. Ledoux and M. Talagrand. Probability in Banach Spaces: Isoperimetry and Processes. Springer, New York, 1991.
[25] A. Maurer. Learning similarity with operator-valued large-margin classifiers. J. of Machine Learning Research, 9: 1049–1082, 2008.
[26] C. McDiarmid.
On the method of bounded differences. In Surveys in Combinatorics, pages 148–188. Cambridge University Press, Cambridge (UK), 1989.
[27] C. S. Ong, X. Mary, S. Canu, and A. J. Smola. Learning with non-positive kernels. ICML, 2004.
[28] V. H. De La Peña and E. Giné. Decoupling: from Dependence to Independence. Springer, New York, 1999.
[29] H. Saigo, J.-P. Vert, N. Ueda, and T. Akutsu. Protein homology detection using string alignment kernels. Bioinformatics, 20: 1682–1689, 2004.
[30] A. J. Smola, Z. L. Óvári, and R. C. Williamson. Regularization with dot-product kernels. NIPS, 2000.
[31] V. N. Vapnik. Statistical Learning Theory. Wiley, New York, 1998.
[32] M. Varma and B. R. Babu. More generality in efficient multiple kernel learning. ICML, 2009.
[33] L. Wang, C. Yang, and J. Feng. On learning with dissimilarity functions. ICML, 2007.
[34] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbour classification. NIPS, 2006.
[35] Q. Wu and D. X. Zhou. SVM soft margin classifiers: linear programming versus quadratic programming. Neural Computation, 17: 1160–1187, 2005.
[36] Q. Wu. Regularization networks with indefinite kernels. Journal of Approximation Theory, 1–18, 2013.
[37] E. Xing, A. Ng, M. Jordan, and S. Russell. Distance metric learning with application to clustering with side information. NIPS, 2002.
[38] Y. Ying, M. Girolami and C. Campbell. Analysis of SVM with indefinite kernels. NIPS, 2009.
[39] Y. Ying, K. Huang and C. Campbell. Sparse metric learning via smooth optimization. NIPS, 2009.
[40] Y. Ying and C. Campbell. Generalization bounds for learning the kernel problem. COLT, 2009.

Appendix

In this appendix, we collect the facts used to establish the generalization bounds in Sections 4 and 5.

Definition 3. We say that a function $f : \prod_{k=1}^{m}\Omega_k \to \mathbb{R}$ has bounded differences $\{c_k\}_{k=1}^{m}$ if, for all $1 \le k \le m$,
$$\max_{z_1,\cdots,z_k,z'_k,\cdots,z_m}\big|f(z_1,\cdots,z_{k-1},z_k,z_{k+1},\cdots,z_m) - f(z_1,\cdots,z_{k-1},z'_k,z_{k+1},\cdots,z_m)\big| \le c_k.$$

Lemma 1. (McDiarmid's inequality [26]) Suppose $f : \prod_{k=1}^{m}\Omega_k \to \mathbb{R}$ has bounded differences $\{c_k\}_{k=1}^{m}$. Then, for all $\epsilon > 0$, there holds
$$\Pr_{\mathbf{z}}\Big\{f(\mathbf{z}) - \mathbb{E}_{\mathbf{z}}f(\mathbf{z}) \ge \epsilon\Big\} \le e^{-2\epsilon^2/\sum_{k=1}^{m}c_k^2}.$$

We need the following contraction property of Rademacher averages, which is essentially implied by Theorem 4.12 in Ledoux and Talagrand [24]; see also [5, 22].

Lemma 2. Let $\mathcal{F}$ be a class of uniformly bounded real-valued functions on $(\Omega, \mu)$ and $m \in \mathbb{N}$. If, for each $i \in \{1, \ldots, m\}$, $\phi_i : \mathbb{R} \to \mathbb{R}$ is a function with Lipschitz constant $c_i$, then for any $\{x_i\}_{i \in \mathbb{N}_m}$,
$$\mathbb{E}_{\epsilon}\Big(\sup_{f \in \mathcal{F}}\sum_{i \in \mathbb{N}_m}\epsilon_i\phi_i(f(x_i))\Big) \le \mathbb{E}_{\epsilon}\Big(\sup_{f \in \mathcal{F}}\sum_{i \in \mathbb{N}_m}c_i\epsilon_i f(x_i)\Big). \qquad (25)$$

Another important property of the Rademacher average, used in the proof of the generalization bounds for similarity learning, is the following Khinchin-type inequality; see e.g. [28, Theorem 3.2.2].

Lemma 3. For $n \in \mathbb{N}$, let $\{f_i \in \mathbb{R} : i \in \mathbb{N}_n\}$ and let $\{\sigma_i : i \in \mathbb{N}_n\}$ be a family of i.i.d. Rademacher random variables. Then, for any $1 < p < q < \infty$ we have
$$\Big(\mathbb{E}_\sigma\Big|\sum_{i \in \mathbb{N}_n}\sigma_i f_i\Big|^q\Big)^{1/q} \le \Big(\frac{q-1}{p-1}\Big)^{1/2}\Big(\mathbb{E}_\sigma\Big|\sum_{i \in \mathbb{N}_n}\sigma_i f_i\Big|^p\Big)^{1/p}.$$
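As a quick numerical sanity check of Lemma 3 (not part of the paper), the following sketch verifies the moment comparison for $p = 2$ and $q = 4$ by Monte Carlo; the coefficients $f_i$ are made-up toy values.

```python
import numpy as np

# Check: (E|sum_i sigma_i f_i|^4)^(1/4) <= sqrt((4-1)/(2-1)) * (E|sum_i sigma_i f_i|^2)^(1/2).
rng = np.random.default_rng(0)
f = rng.standard_normal(20)                        # fixed real coefficients f_i
sigma = rng.choice([-1.0, 1.0], size=(100000, 20)) # Monte Carlo draws of Rademacher signs
s = sigma @ f                                      # samples of sum_i sigma_i f_i
lhs = np.mean(np.abs(s) ** 4) ** 0.25
rhs = np.sqrt(3.0) * np.mean(np.abs(s) ** 2) ** 0.5
print(lhs, rhs, lhs <= rhs)
```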