Local Group Invariant Representations via Orbit Embeddings
Anant Raj, Abhishek Kumar, Youssef Mroueh, P. Thomas Fletcher, Bernhard Schölkopf
Anant Raj (MPI), Abhishek Kumar (AIF, IBM Research), Youssef Mroueh (AIF, IBM Research), P. Thomas Fletcher (University of Utah), Bernhard Schölkopf (MPI)
Abstract
Invariance to nuisance transformations is one of the desirable properties of effective representations. We consider transformations that form a group and propose an approach based on kernel methods to derive local group invariant representations. Locality is achieved by defining a suitable probability distribution over the group which in turn induces distributions in the input feature space. We learn a decision function over these distributions by appealing to the powerful framework of kernel methods and generate local invariant random feature maps via kernel approximations. We show uniform convergence bounds for kernel approximation and provide generalization bounds for learning with these features. We evaluate our method on three real datasets, including Rotated MNIST and CIFAR-10, and observe that it outperforms competing kernel based approaches. The proposed method also outperforms deep CNN on Rotated MNIST and performs comparably to the recently proposed group-equivariant CNN.
Effective representation of data plays a key role in the success of learning algorithms. One of the most desirable properties of effective representations is being invariant to nuisance transformations. For instance, convolutional neural networks (CNNs) owe much of their empirical success to their ability to capture local translation invariance through convolutional weight sharing and pooling, which turns out to be a useful model prior for images.
Capturing class-sensitive invariance can also result in a reduction in sample complexity [1], which is particularly useful in label-scarce applications. We approach the problem of learning with invariant representations from a group theoretical perspective and propose a scalable framework for incorporating invariance to nuisance group actions via kernel methods.

At an abstract level, a group is defined as a set $G$ endowed with a notion of product on its elements that satisfies certain axioms: (i) closure: $a, b \in G \Rightarrow ab \in G$; (ii) associativity: $(ab)c = a(bc)$; and (iii) inverse element: for each $g \in G$ there exists $g^{-1} \in G$ such that $gg^{-1} = g^{-1}g = e \in G$, where $e$ is the identity element satisfying $ge = eg = g$ for all $g \in G$. A group is abelian if the group product is commutative ($gh = hg$ for all $g, h \in G$). For most practical applications, each element $g \in G$ can be seen as a transformation acting on an input space $X$, $T_g : X \mapsto X$. The orbit of an element $x \in X$ under the action of the group $G$ is defined as the set $O_x = \{T_g(x) \mid g \in G\}$. The set of all rotations in a fixed 2-D plane is an example of an infinite group where the product is defined as the consecutive application of two rotations. The orbit of an image under this rotation group is the infinite set consisting of all rotated versions of the image. The closure property of the group implies that the orbit of a point $x$ is invariant under a group action on $x$, i.e., $O_{x|G} = O_{T_g(x)}$ for all $g \in G$. The reader is referred to [38] for a more detailed introduction to group theory.

For unimodular groups, which include compact groups and abelian groups, there exists a so-called unique (up to scaling) Haar measure $\nu$ that is invariant to both left and right group products, i.e., $\nu(S) = \nu(gS) = \nu(Sg)$ for all measurable subsets $S \subset G$ and all $g \in G$, essentially generalizing the notion of Lebesgue measure to groups. For a compact group $G$, the Haar measure can be normalized by $\nu(G)$ (since $\nu(G) < \infty$) to obtain the normalized Haar measure, which assigns a probability mass to all measurable subsets of $G$. The normalized Haar measure can be seen as inducing a uniform probability distribution on the group. Recently, Anselmi et al. [1] used the normalized Haar measure $\tilde{\nu}$ on the group to map each orbit ($O_{x|G}$ for every $x$) to a probability distribution $P_x$ on the input space, i.e., $P_{x|G}(A) = \tilde{\nu}(\{g \mid T_g(x) \in A\})$ for all $A \subset X$. The distribution $P_{x|G}$ induced by each point $x$ can be taken as its invariant representation. However, estimating this distribution directly can be challenging due to its potentially high dimensional support. Anselmi et al. [1] propose to capture histogram statistics of 1-dimensional projections of $P_{x|G}$ to generate an invariant representation that can be used for learning, i.e., $\phi_{kn}(x) = \frac{1}{|G|}\sum_{g \in G} \eta_n(\langle T_g(x), t_k\rangle)$ for a finite group $G$, where the $t_k$ are the projection directions (termed templates) and the $\eta_n(\cdot)$ are nonlinear functions that are expected to capture the histogram statistics.
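To make the orbit and the histogram features concrete, the sketch below (an illustration only, not the construction used later in this paper) builds the orbit of a toy image under the finite group of 0/90/180/270-degree rotations and computes features in the spirit of the $\phi_{kn}$ above; the image size, the five random templates $t_k$ and the thresholds $h_n$ are arbitrary illustrative choices.

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)

# Toy "image" and the finite group of 0/90/180/270-degree rotations.
x = rng.random((8, 8))
orbit = [rotate(x, angle, reshape=False, order=1) for angle in (0, 90, 180, 270)]

# Histogram-style invariant features in the spirit of [1]:
# phi_{k,n}(x) = (1/|G|) * sum_g  1( <T_g(x), t_k> < h_n )
templates = rng.standard_normal((5, x.size))      # projection directions t_k (illustrative)
thresholds = np.linspace(-2.0, 2.0, 4)            # preselected thresholds h_n (illustrative)

proj = np.array([[gx.ravel() @ t for t in templates] for gx in orbit])    # (|G|, #templates)
phi = (proj[:, :, None] < thresholds).mean(axis=0)                        # (#templates, #bins)
print(phi.shape)
```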
More recently, Mroueh et al. [31] analyzed the concentration properties of the linear kernel defined over these features and provided generalization bounds for learning with this linear kernel.

Our point of departure from [1, 31] is the observation that histogram based features may not be the optimal way to characterize the probability distributions $P_x$ induced by the group on the input space, and that their approach has its limitations. First, there is no principled guidance provided regarding the choice of nonlinearities $\eta_n$. Second, the inner-product of histogram based features ($\{\phi_{kn}(x)\}$) approximately induces a (group-averaged) Euclidean distance in the input space [31], which may render them unsuitable for learning complex nonlinear decision boundaries in the input space. Further, locality is achieved by restricting the uniform distribution to a chosen subset of the group (i.e., elements within the subset are allowed to transform the input with equal probability and elements outside the subset are prohibited), which can be limiting.

Contributions:
In this paper, we address the aforementioned points and propose a framework to generate invariant representations by embedding the orbit distributions $P_{x|G}$ into a reproducing kernel Hilbert space (RKHS) [33, 42]. We propose to use characteristic kernels [44] so that the resulting map from the distributions to the RKHS is injective (one-to-one), preserving all the moments of the distribution. Our use of kernel methods to embed orbit distributions also renders a large body of work on kernel approximation methods at our disposal, which enables us to scale our proposed method. In particular, we derive invariant features by approximating the kernel using the Nyström method [17, 48] and random Fourier features (for shift invariant kernels) [34]. The nonlinearities in the features ($\eta_n(\cdot)$) emerge in a principled manner as a by-product of the kernel approximation. The RKHS embedding framework also naturally allows us to use more general probability distributions on the group, apart from the uniform distribution. This allows us to have better control over selectivity of the derived features and also becomes a technical necessity when the group is non-compact. We experiment with three real datasets and observe consistent accuracy improvements over baseline random Fourier [34] and Nyström features [17], as well as over [31]. Further, on the Rotated MNIST dataset [24] we outperform recent invariant deep CNN and RBM based architectures [40, 43], and perform comparably to the more recently proposed group equivariant deep convolutional nets [11].

Let the input features belong to a set $X \subset \mathbb{R}^d$. A group element $g \in G$ acts on points from $\mathbb{R}^d$ through a map $T_g : \mathbb{R}^d \mapsto \mathbb{R}^d$, and we use the shorthand notation $gx$ to denote $T_g(x)$. We use $gS$ to denote the action of a group element $g$ on the set $S$, i.e., $gS = \{T_g(x) \mid x \in S \subseteq X\}$. We take the liberty of using the same notation to denote the product of a group element with a subset of the group, i.e., $gS = \{gh \mid h \in S \subset G\}$ and $Sg = \{hg \mid h \in S \subset G\}$. As introduced in the previous section, the orbit of an element $x \in X$ under the action of the group $G$ is defined as the set $O_{x|G} = \{gx \mid g \in G\}$. For all unimodular groups there exists a Haar measure $\nu : S \mapsto \mathbb{R}_+$ which is invariant under left and right group products, i.e., $\nu(S) = \nu(gS) = \nu(Sg)$ for all measurable subsets $S \subset G$ and all $g \in G$. Let $q(\cdot)$ be the probability density function of a distribution defined over $G$. This probability distribution over the group can be used to map each orbit $O_{x|G}$ to a probability distribution $P_{x|G}$ on the input space, i.e., $P_{x|G}(A) = \int_{g : gx \in A} q(g)\, d\nu(g)$ for all $A \subset X$. Note that $P_{x|G}(O_{x|G}) = 1$ (for an appropriately normalized measure $\nu$), and $P_{x|G}(A) = 0$ for all $A$ for which $A \cap O_{x|G} = \emptyset$.

Let $\mathcal{H}$ be a reproducing kernel Hilbert space (RKHS) of functions $f : X \mapsto \mathbb{R}$ induced by a kernel $k : X \times X \mapsto \mathbb{R}$, with the inner-product satisfying the reproducing property, i.e., $\langle f, k(x, \cdot)\rangle = f(x)$ for all $f \in \mathcal{H}$ and $\langle k(x, \cdot), k(x', \cdot)\rangle = k(x, x')$. The RKHS embedding of the distribution $P_{x|G}$ is given as [42]
$$\mu[P_{x|G}] := \mathbb{E}_{z \sim P_{x|G}}\, k(z, \cdot). \qquad (1)$$
This expectation is well-defined under the probability measure $P_{x|G}$, which is in turn induced by the measure $\nu$ over the group. The support of $P_{x|G}$ is $O_{x|G}$, and sampling a point $z \sim P_{x|G}$ is equivalent to sampling the corresponding group element $g$ and setting $z = gx$.
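The embedding of Eq. (1) can be approximated by averaging kernel features over samples from the orbit. The sketch below is a minimal illustration assuming a Gaussian base kernel and the planar rotation group with a uniform $q$; it estimates the RKHS distance between two orbit embeddings from such samples, and two points on the same orbit give a distance close to zero, which reflects the injectivity-on-orbits property discussed above.

```python
import numpy as np

def gauss_gram(A, B, sigma=1.0):
    """Gaussian base kernel k(a, b) = exp(-||a-b||^2 / (2 sigma^2)) for all row pairs."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def orbit(point, thetas):
    """Samples from P_{x|G} for the planar rotation group: rows are g x with g ~ q."""
    R = lambda t: np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return np.stack([R(t) @ point for t in thetas])

def embedding_distance_sq(Ox, Oy, sigma=1.0):
    """Plug-in estimate of ||mu[P_x|G] - mu[P_y|G]||_H^2 from orbit samples."""
    return (gauss_gram(Ox, Ox, sigma).mean() + gauss_gram(Oy, Oy, sigma).mean()
            - 2 * gauss_gram(Ox, Oy, sigma).mean())

rng = np.random.default_rng(0)
thetas = rng.uniform(0, 2 * np.pi, 200)       # uniform q: fully rotation-invariant embedding
x = np.array([1.0, 0.0])
print(embedding_distance_sq(orbit(x, thetas), orbit(np.array([0.0, 1.0]), thetas)))  # ~ 0
print(embedding_distance_sq(orbit(x, thetas), orbit(np.array([2.0, 0.0]), thetas)))  # > 0
```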
Thus we can rewrite the RKHS embedding of Eq. (1) as
$$\mu[P_{x|G}] = \int_G k(gx, \cdot)\, q(g)\, d\nu(g). \qquad (2)$$
If the kernel is characteristic, this map from distributions to the RKHS is injective, preserving all the information about the distribution [44]. All universal kernels [46] are characteristic when the support set of the distribution is compact [42]. In addition, many shift invariant kernels (e.g., Gaussian and Laplacian kernels) are characteristic on all of $\mathbb{R}^d$ [18]. For a precise characterization of characteristic shift invariant kernels, please refer to [45].

For a characteristic kernel the embedding $\mu[P_{x|G}]$ can be used as a proxy for $P_{x|G}$ in learning problems. To this end, we introduce a hyperkernel $h : \mathcal{H} \times \mathcal{H} \mapsto \mathbb{R}$ that defines the similarity between the RKHS embeddings corresponding to two points $x$ and $x'$ as $k_{q,G}(x, x') := h(\mu[P_{x|G}], \mu[P_{x'|G}])$. If we take $h$ to be the linear kernel, which is the regular inner-product in $\mathcal{H}$, we obtain
$$k_{q,G}(x, x') := \langle \mu[P_{x|G}], \mu[P_{x'|G}]\rangle_{\mathcal{H}} = \int_G \int_G k(gx, g'x')\, q(g)\, q(g')\, d\nu(g)\, d\nu(g'). \qquad (3)$$
The kernel $k_{q,G} : X \times X \mapsto \mathbb{R}$ turns out to be the expectation of the base kernel $k(\cdot, \cdot)$ under the predefined probability distribution on the group $G$. It trades off locality and group invariance through the appropriate selection of the probability density $q(\cdot)$. Taking $q$ to be a delta function over the identity group element gives back the original base kernel $k(\cdot, \cdot)$, which does not capture any invariance. On the other hand, if we take $q$ to be the uniform probability density, we get the global group invariant kernel (also termed the Haar integration kernel [21, 31])
$$k_G(x, x') = \int_G \int_G k(gx, g'x')\, d\nu(g)\, d\nu(g'), \qquad (4)$$
satisfying the property $k_G(gx, g'x') = k_G(x, x')$ for any $g, g' \in G$ and any $x, x' \in X$. The Haar integration kernel does not preserve any locality information (e.g., images of digits 6 and 9 will be placed in the same equivalence class). Strictly speaking, we only need $\nu$ to be the normalized right Haar measure satisfying $\nu(S) = \nu(Sg)$ for all $S \subset G$ and all $g \in G$ for the global group invariance property to hold. A unique (up to scaling) right Haar measure exists for all locally compact groups and for all unimodular groups (for which left and right Haar measures coincide) [38]. All Lie groups (e.g., rotation, translation, scaling, affine) are locally compact. Additionally, all compact groups (e.g., rotation), abelian groups (e.g., translation, scaling), and discrete groups (e.g., permutation) are unimodular. However, the Haar integration kernel $k_G(x, x')$ of Eq. (4) can only be defined for compact groups since we need $\nu(G) < \infty$ to keep the integral finite. Indeed, earlier work has used the Haar integration kernel for compact groups [21, 31] (however, without the RKHS embedding perspective provided in our work, which motivates the use of a characteristic base kernel $k(\cdot, \cdot)$).

A framework allowing a more general (non-uniform) probability distribution on the group serves two purposes: (i) it enables us to operate with non-compact groups in a principled manner, since we only need $\int_G q(g)\, d\nu(g) < \infty$ to enable the construction of kernels such that Eq. (3) is finite; (ii) it allows for better control over the locality of the kernel $k_{q,G}(\cdot, \cdot)$.
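As a concrete illustration of Eq. (3), the sketch below estimates $k_{q,G}(x, x')$ by Monte Carlo for a Gaussian base kernel, the planar rotation group, and a von Mises $q$ centered at the identity; the concentration $\kappa$, the bandwidth $\sigma$ and the number of group samples are illustrative values, not ones prescribed by the paper.

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def k_gauss(a, b, sigma=1.0):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

def k_qG(x, xp, r=100, kappa=4.0, sigma=1.0, seed=0):
    """Monte Carlo estimate of Eq. (3): E_{g, g' ~ q} [ k(g x, g' x') ]."""
    rng = np.random.default_rng(seed)
    g = rng.vonmises(0.0, kappa, size=r)       # group samples acting on x
    gp = rng.vonmises(0.0, kappa, size=r)      # independent group samples acting on x'
    return np.mean([k_gauss(rot(a) @ x, rot(b) @ xp, sigma) for a in g for b in gp])

x = np.array([1.0, 0.0])
print(k_qG(x, x))                  # similarity of a point with itself, smoothed over q
print(k_qG(x, rot(0.3) @ x))       # mildly transformed copy: similar value (locality)
print(k_qG(x, rot(np.pi) @ x))     # extreme transformation: lower value for a peaked q
```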
Earlier work [1, 31] achieves locality by taking a subset $G_0 \subset G$ and restricting the domain of the Haar integration kernel to be $G_0$, which amounts to having a uniform distribution over $G_0$. A more general non-uniform distribution (e.g., a unimodal distribution with mode at the identity element of the group) allows us to smoothly decrease the probability of sampling more extreme group transformations rather than abruptly prohibiting group transforms falling outside a preselected subset.

The kernel $k_{q,G}$ of Eq. (3) can be used for learning with kernel machines [41], probabilistically trading off locality and group invariance through the appropriate selection of $q(\cdot)$. However, kernel based learning algorithms suffer from scalability issues due to the need to compute kernel values for all pairs of data points. In this section, we describe our approach to obtain local invariant features via approximating $k_{q,G}$.

We first consider the case of a shift-invariant base kernel satisfying $k(x, x') = \tilde{k}(x - x')$, which is a commonly used class of kernels that includes the Gaussian and Laplacian kernels. Many shift-invariant kernels are characteristic on $\mathbb{R}^d$, as mentioned in the previous section. We use the random Fourier features proposed in [34], which are based on the characterization of positive definite functions by Bochner [6, 39]. Bochner's theorem establishes the Fourier transform as a bijective map from finite non-negative Borel measures on $\mathbb{R}^d$ to positive definite functions on $\mathbb{R}^d$. Applying it to shift-invariant positive definite kernels one gets
$$k(x, x') = \tilde{k}(x - x') = \int_{\mathbb{R}^d} e^{-i(x - x')^\top \omega}\, p(\omega)\, d\omega \quad \forall x, x', \qquad (5)$$
where $p(\cdot)$ is the unique probability distribution corresponding to the kernel $k(\cdot, \cdot)$, assuming the kernel is properly scaled. We use this characterization to obtain local group invariant features as follows:
$$\begin{aligned}
k_{q,G}(x, x') &= \int_G \int_G \mathbb{E}_{\omega \sim p}\left[e^{-i(gx - g'x')^\top \omega}\right] q(g)\, q(g')\, d\nu(g)\, d\nu(g')\\
&= \mathbb{E}_{\omega \sim p} \int_G \int_G e^{-i(gx - g'x')^\top \omega}\, q(g)\, q(g')\, d\nu(g)\, d\nu(g')\\
&= \mathbb{E}_{\omega \sim p} \int_G e^{-i\langle \omega, gx\rangle}\, q(g)\, d\nu(g) \int_G e^{i\langle \omega, g'x'\rangle}\, q(g')\, d\nu(g')\\
&\approx \mathbb{E}_{\omega \sim p}\, \frac{1}{r^2} \sum_{k=1}^{r} e^{-i\langle \omega, g_k x\rangle} \sum_{k=1}^{r} e^{i\langle \omega, g_k x'\rangle}, \quad (g_k \sim q)\\
&\approx \frac{1}{s r^2} \sum_{j=1}^{s} \sum_{k=1}^{r} e^{-i\langle \omega_j, g_k x\rangle} \sum_{k=1}^{r} e^{i\langle \omega_j, g_k x'\rangle}, \quad (g_k \sim q,\ \omega_j \sim p)\\
&:= \langle \psi_{RF}(x), \psi_{RF}(x')\rangle_{\mathbb{C}^s}, \qquad (6)
\end{aligned}$$
where
$$\psi_{RF}(x) = \frac{1}{r\sqrt{s}}\left[\sum_{k=1}^{r} e^{-i\langle \omega_1, g_k x\rangle}, \ \ldots, \ \sum_{k=1}^{r} e^{-i\langle \omega_s, g_k x\rangle}\right] \in \mathbb{C}^s. \qquad (7)$$
We use standard Monte Carlo to approximate both the inner integral over the group and the outer expectation over $\omega$. It is also possible to use a quasi-Monte Carlo approximation for the expectation over $\omega$, which has been carefully studied for random Fourier features [49]. We provide uniform convergence bounds and excess risk bounds for these features in Section 3.

The feature map $\psi_{RF}(\cdot)$ requires us to apply $r$ group actions to every data point, which can be expensive in the large data regime.
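The sketch below computes the feature map of Eq. (7) for a Gaussian base kernel and planar rotations with a von Mises $q$, using the real cosine form of random Fourier features (the form also used in the analysis of Section 3); it transforms the data points directly, while the next paragraph describes how the group action can instead be moved onto the templates. All sizes ($r$, $s$, $\sigma$, $\kappa$) are illustrative.

```python
import numpy as np

def psi_rf(X, omegas, bs, thetas):
    """Local group invariant random Fourier features (Eq. 7), real cosine form.

    X:      (n, 2) data points
    omegas: (s, 2) frequencies omega_j ~ N(0, I / sigma^2)
    bs:     (s,)   phases b_j ~ Unif[0, 2*pi)
    thetas: (r,)   rotation angles g_k ~ q
    """
    feats = np.zeros((X.shape[0], len(omegas)))
    for t in thetas:
        R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
        feats += np.sqrt(2.0 / len(omegas)) * np.cos((X @ R.T) @ omegas.T + bs)
    return feats / len(thetas)              # average over the r sampled group elements

rng = np.random.default_rng(0)
sigma, s, r = 1.0, 512, 32
omegas = rng.standard_normal((s, 2)) / sigma
bs = rng.uniform(0, 2 * np.pi, s)
thetas = rng.vonmises(0.0, 4.0, r)

X = rng.standard_normal((5, 2))
Psi = psi_rf(X, omegas, bs, thetas)
print(Psi @ Psi.T)                          # approximates the kernel matrix k_{q,G}(x_i, x_j)
```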
If the group action is a unitary transformation preserving norms and distances between points (i.e., $\|gx\| = \|x\|$), the inner product satisfies $\langle x, x'\rangle = \langle gx, gx'\rangle$. This can be used to transfer the group action from the data to the sampled template as $\langle \omega, gx\rangle = \langle g^{-1}\omega, g^{-1}gx\rangle = \langle g^{-1}\omega, x\rangle$ [1] without affecting the approximation of the kernel $k_{q,G}$, as long as the pdf $q$ is symmetric around the identity element ($q(g) = q(g^{-1})$ for all $g \in G$). For instance, in the case of images, which can be viewed as functions $I : \mathbb{R}^2 \mapsto \mathbb{R}$, one can show the following result regarding group actions (e.g., rotation, translation, scaling, affine transformation).

Lemma 2.1.
Let $g$ be a group element acting on an image $I : \mathbb{R}^2 \mapsto \mathbb{R}$. The group action defined as $T_g[I](x) = |J_g|^{-1/2}\, I(g^{-1}x)$ for all $x$, where $J_g$ is the Jacobian of the transformation, is a unitary transformation and satisfies $\langle T_g(I), T_g(I')\rangle = \langle I, I'\rangle$.

Proof.
See appendix.

The lemma suggests scaling the pixel intensities of the image by a factor $|J_g|^{-1/2}$ to make the group action unitary. The Jacobian for rotating or translating an image has determinant 1, obviating the need for scaling. For a general affine transformation $T(x) = Ax + b$ the Jacobian is its linear component $A$, and we need to scale the pixel intensities accordingly to keep the action unitary. (This is mentioned in [1] as a remark without a formal proof; we provide a proof in the appendix for completeness.)

Here we consider the case of a general base kernel and derive local group invariant features using the Nyström approximation [17, 48]. The Nyström method starts by identifying a set of landmark points (also referred to as templates) $Z = \{z_1, \ldots, z_s\}$ and approximates each function $f \in \mathcal{H}$ by its orthogonal projection onto the subspace spanned by $\{k(\cdot, z_i)\}_{i=1}^{s}$. Several schemes for identifying the landmark points have been studied in the literature, including random sampling, sampling based on leverage scores, and clustering based landmark selection [20, 23]. We can choose landmarks from the original set $X$ or from the orbit $gX$. The Nyström method approximates the kernel as $k(x, x') \approx K_{Z,x}^\top K_{Z,Z}^{+} K_{Z,x'}$, where $K_{Z,x} = [k(x, z_1), \ldots, k(x, z_s)]^\top$ and $K_{Z,Z}$ is the square kernel matrix for the landmark points, with $K_{Z,Z}^{+}$ denoting the pseudo-inverse. Since $K_{Z,Z}$ is a positive semi-definite matrix, let $K_{Z,Z}^{+} = L^\top L$, where $L \in \mathbb{R}^{\mathrm{rank}(K_{Z,Z}) \times s}$. We have
$$\begin{aligned}
k_{q,G}(x, x') &\approx \int_G \int_G K_{gx,Z}\, K_{Z,Z}^{+}\, K_{Z,g'x'}\, q(g)\, q(g')\, d\nu(g)\, d\nu(g')\\
&= \int_G \int_G K_{gx,Z}\, L^\top L\, K_{Z,g'x'}\, q(g)\, q(g')\, d\nu(g)\, d\nu(g')\\
&= \left\langle \int_G L K_{Z,gx}\, q(g)\, d\nu(g), \ \int_G L K_{Z,gx'}\, q(g)\, d\nu(g) \right\rangle\\
&\approx \left\langle \frac{L}{r} \sum_{k=1}^{r} K_{Z,g_k x}, \ \frac{L}{r} \sum_{k=1}^{r} K_{Z,g_k x'} \right\rangle, \quad (g_k \sim q),
\end{aligned}$$
where the features are given by
$$\psi_{Nys}(x) = \frac{1}{r} L \sum_{k=1}^{r} K_{Z,g_k x} \in \mathbb{R}^{\mathrm{rank}(K_{Z,Z})}. \qquad (8)$$
If the base kernel satisfies $k(gx, gx') = k(x, x')$ for all $g, x, x'$, we can transfer the group action from the data points to the landmark points as $k(gx, z) = k(g^{-1}gx, g^{-1}z) = k(x, g^{-1}z)$ without affecting the Nyström approximation of $k_{q,G}$, as long as the pdf $q$ is symmetric around the identity element ($q(g) = q(g^{-1})$ for all $g \in G$). This becomes essential in the large data regime where the number of data points is much larger than the number of landmarks. For the group action defined in Lemma 2.1, all dot product kernels ($\tilde{k}(\langle x, x'\rangle)$) and shift invariant kernels ($\tilde{k}(\|x - x'\|)$) satisfy this property.
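A corresponding sketch of the Nyström features of Eq. (8), assuming a Gaussian base kernel, planar rotations with a von Mises $q$, and randomly drawn landmark points; the factor $L$ is obtained here from an eigendecomposition of the landmark Gram matrix, one of several valid ways to realize $K_{Z,Z}^{+} = L^\top L$.

```python
import numpy as np

def gaussian_gram(A, B, sigma=1.0):
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def psi_nystrom(X, Z, thetas, sigma=1.0):
    """Local group invariant Nystrom features (Eq. 8): psi(x) = (1/r) L sum_k K_{Z, g_k x}."""
    Kzz = gaussian_gram(Z, Z, sigma)
    # K_{Z,Z}^+ = L^T L via eigendecomposition of the PSD landmark Gram matrix.
    w, V = np.linalg.eigh(Kzz)
    keep = w > 1e-10 * w.max()
    L = (V[:, keep] / np.sqrt(w[keep])).T          # shape (rank, num_landmarks)
    feats = np.zeros((X.shape[0], L.shape[0]))
    for t in thetas:
        R = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
        feats += gaussian_gram(X @ R.T, Z, sigma) @ L.T
    return feats / len(thetas)

rng = np.random.default_rng(0)
X = rng.standard_normal((5, 2))
Z = rng.standard_normal((20, 2))                   # landmark points ("templates")
thetas = rng.vonmises(0.0, 4.0, 32)                # g_k ~ q
Psi = psi_nystrom(X, Z, thetas)
print(Psi @ Psi.T)                                 # approximates k_{q,G}(x_i, x_j)
```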
Remarks: (1) Earlier work [1, 31] has proposed features of the form $\phi_{kn}(x) = \frac{1}{r}\sum_{j=1}^{r} \eta_n(\langle g_j x, \omega_k\rangle)$, where the $\eta_n(\cdot)$ were taken to be step functions $\eta_n(a) = \mathbb{1}(a < h_n)$ with preselected thresholds $h_n$. Nonlinearities in our proposed local invariant features emerge naturally as a result of kernel approximation, with $\eta(x, \omega) = e^{-i\langle x, \omega\rangle}$ for $\psi_{RF}$ and $\eta(x, \omega) = k(x, \omega)$ for $\psi_{Nys}$.

(2) Our work can also be viewed as incorporating local group invariance into the widely used random Fourier and Nyström approximation methods; however, this viewpoint overlooks the Hilbert space embedding perspective motivated in this work.
(3) The kernel $k_{q,G}$ defined in Eq. (3) assumes a linear hyperkernel $h : \mathcal{H} \times \mathcal{H} \mapsto \mathbb{R}$ over RKHS embeddings of orbit distributions. It is also possible to use a nonlinear hyperkernel along the lines of [10] and [32], and approximate it using a second layer of random Fourier (RF) or Nyström features. We show empirical results for both the linear and the Gaussian hyperkernel (approximated using RF features) in Sec. 4.

(4) Computational aspects. The complexity of feature computation is $r C_f + r s C_g$, where $C_f$ is the cost of computing the vanilla random Fourier or vanilla Nyström features and $C_g$ is the cost of computing a group action on a template $\omega$. However, the same set of templates is used for all data points, so the group actions on the templates can be computed in advance. Structured random Gaussian templates can also be used in our framework to speed up the computation of the random Fourier features $\psi_{RF}$ [7, 9, 25]. Recent approaches for scaling randomized kernel machines to massive data sizes and very large numbers of random features can also be used [3].

In this section we focus on local invariance learning using the random feature map $\psi_{RF}$ defined in Section 2.2.1 for the Gaussian base kernel $k(\cdot, \cdot)$. We first address the uniform convergence of the random feature map $\psi_{RF}$ to the local invariant kernel $k_{q,G}$ on a set of points $\mathcal{M}$. In other words, we show in Theorem 3.1 that for a sufficiently large number of random templates $s$ and group element samples $r$, we have $\langle \psi_{RF}(x), \psi_{RF}(y)\rangle \approx k_{q,G}(x, y)$ for all points $x, y \in \mathcal{M}$. Second, we consider a supervised binary classification setting and study generalization bounds for learning a linear classifier in the local invariant random feature space $\psi_{RF}$. In a nutshell, Theorem 3.2 shows that linear functions in the random feature space, $\langle w, \psi_{RF}(x)\rangle$, approximate functions in the RKHS induced by our local invariant kernel $k_{q,G}$.

Theorem 3.1 provides a uniform convergence bound for our invariant random feature map $\psi_{RF}$ with the Gaussian base kernel $k(\cdot, \cdot)$.

Theorem 3.1 (Uniform convergence of Fourier approximation). Let $X$ be a compact space in $\mathbb{R}^d$ with diameter $\mathrm{diam}(X)$. For $\varepsilon > 0$ and $\delta_1, \delta_2 \in (0, 1)$, the following uniform convergence bound holds with probability $1 - \left(\frac{(d+1)}{\varepsilon^2\sigma^2}\right)^{\frac{d}{d+1}}(\delta_1 + \delta_2)^{\frac{d}{d+1}}$:
$$\sup_{x, y \in X}\left|\left\langle \psi_{RF}(x), \psi_{RF}(y)\right\rangle - k_{q,G}(x, y)\right| \le \varepsilon + \frac{1}{r}$$
for a number of group samples $r \ge C_1 \frac{d}{\varepsilon^2}\log(\mathrm{diam}(X)/\delta_1)$ and a number of random templates $s \ge C_2 \frac{d}{\varepsilon^2}\log(\mathrm{diam}(X)/\delta_2)$, where $\sigma_p^2 \equiv \mathbb{E}_p[\omega^\top\omega] = d/\sigma^2$ is the second moment of the Fourier transform of the Gaussian base kernel $k$, and $C_1$ and $C_2$ are numeric universal constants.

Proof. See Appendix.
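As a small, self-contained numerical illustration of the convergence behavior described by Theorem 3.1 (and of the kernel-approximation experiment reported in Section 4), the sketch below compares the feature inner products with the Monte Carlo estimate of $k_{q,G}$ computed from the same group samples, for an increasing number of templates $s$; all dimensions and distributions are illustrative.

```python
import numpy as np

def rot(t):
    return np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])

def psi_rf(X, omegas, bs, thetas):
    """Real-cosine local invariant random Fourier features, averaged over group samples."""
    feats = sum(np.sqrt(2.0 / len(omegas)) * np.cos((X @ rot(t).T) @ omegas.T + bs)
                for t in thetas)
    return feats / len(thetas)

def kqg_empirical(X, thetas, sigma):
    """Monte Carlo estimate of k_{q,G} using the same group samples (the limit as s grows)."""
    GX = np.concatenate([X @ rot(t).T for t in thetas])           # (r*n, 2)
    d2 = ((GX[:, None] - GX[None, :]) ** 2).sum(-1)
    K = np.exp(-d2 / (2 * sigma ** 2)).reshape(len(thetas), len(X), len(thetas), len(X))
    return K.mean(axis=(0, 2))                                    # average over g, g'

rng = np.random.default_rng(0)
sigma, r, X = 1.0, 64, rng.standard_normal((20, 2))
thetas = rng.vonmises(0.0, 4.0, r)
K_ref = kqg_empirical(X, thetas, sigma)
for s in [64, 256, 1024, 4096]:
    omegas = rng.standard_normal((s, 2)) / sigma
    bs = rng.uniform(0, 2 * np.pi, s)
    Psi = psi_rf(X, omegas, bs, thetas)
    print(s, np.abs(Psi @ Psi.T - K_ref).max())   # sup error shrinks roughly like 1/sqrt(s)
```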
Given a labeled training set $S = \{(x_i, y_i) \mid x_i \in X,\ y_i \in Y = \{+1, -1\}\}$, our goal is to learn a decision function $f : X \to Y$ via empirical risk minimization (ERM):
$$\min_{f \in \mathcal{H}_K} \hat{E}_V(f) = \frac{1}{N}\sum_{i=1}^{N} V(y_i f(x_i)),$$
where $V$ is a convex and $L$-Lipschitz loss function. Let $E_V(f) = \mathbb{E}_{x, y \sim P}\, V(y f(x))$ be the expected risk for $f \in \mathcal{H}_K$. According to the representer theorem, the solution of ERM is given by $f^\star(\cdot) = \sum_{i=1}^{N} \alpha_i^\star\, k_{q,G}(x_i, \cdot)$.

We consider the linear hyperkernel $h$ in Eq. (3) and consider $\mathcal{H}_K$, the RKHS induced by the kernel $k_{q,G}(x, y) = \int_G \int_G k(gx, g'y)\, q(g)\, q(g')\, d\nu(g)\, d\nu(g')$, as introduced in Sec. 2.1. Similar to [31], for $C > 0$ we define $\mathcal{F}_p$, an infinite dimensional space that approximates $\mathcal{H}_K$ (see [35] for a motivation for this approximation):
$$\mathcal{F}_p \equiv \left\{ f(x) = \int_\Omega \alpha(\omega) \int_G \phi(gx, \omega)\, q(g)\, d\nu(g)\, d\omega \ \Big|\ |\alpha(\omega)| \le C p(\omega) \right\},$$
where $\phi(gx, \omega) = e^{-i\langle gx, \omega\rangle}$. Similarly, define the linear space in the span of $\psi_{RF}(\cdot)$,
$$\hat{\mathcal{F}}_p \equiv \left\{ \hat{f}(x) = \langle \alpha, \psi_{RF}(x)\rangle = \sum_{k=1}^{s} \frac{\alpha_k}{r}\sum_{i=1}^{r} \phi(g_i x, \omega_k) \ \Big|\ |\alpha_k| \le \frac{C}{s} \right\}.$$
Theorem 3.2. Let $\delta > 0$. Consider the training set $S = \{(x_i, y_i) \mid x_i \in X, y_i \in Y, i = 1, \ldots, N\}$ sampled from the input space, and let $f_N^\star$ be the empirical risk minimizer $f_N^\star = \arg\min_{f \in \hat{\mathcal{F}}_p}\hat{E}_V(f) = \frac{1}{N}\sum_{i=1}^{N} V(y_i f(x_i))$. Then, with probability $1 - \delta$ (over the training set, random templates and group elements),
$$E_V(f_N^\star) - \min_{f \in \mathcal{F}_p} E_V(f) \le O\!\left(\left(\frac{1}{\sqrt{N}} + \frac{1}{\sqrt{s}} + \frac{1}{\sqrt{r}}\right) L C \sqrt{\log\frac{1}{\delta}}\right).$$

Proof. See Appendix.

Table 1: RMSE on the Quantum Machine data.

    Method                     RMSE     RMSE w/ 2nd layer RF
    Original (RF)              14.01    13.78
    Original (Nys)             13.97    13.81
    Original (GP)              13.48    N/A
    Sort-Coulomb (RF)          12.89    12.49
    Sort-Coulomb (Nys)         12.83    12.51
    Sort-Coulomb (GP) [30]     12.59    N/A
    Rand-Coulomb [30]          11.40    N/A
    GICDF [31]                 12.25    N/A
    LGIKA (RF)
    LGIKA (Nys)                10.87    10.45
We evaluate the proposed method (referred to as LGIKA here) on three real datasets. We use the Gaussian kernel as the base kernel in all our experiments. For methods that produce random (unsupervised) features, which include the proposed approach as well as regular random Fourier (abbrv. RF) [34] and Nyström [48] features, we report performance with: (i) a linear decision boundary on these features (linear SVM or linear regularized least squares (RLS)), and (ii) a nonlinear decision boundary, which is realized by having a Gaussian kernel on top of the features and approximating it through random Fourier features [34], followed by a linear SVM or RLS. The latter can also be viewed as using a nonlinear hyperkernel over RKHS embeddings of orbit distributions (also see Remark (3) at the end of Sec. 2). Parameters for all the methods are selected using grid search on a hold-out validation set unless otherwise stated.
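The evaluation protocol above can be sketched as follows: invariant first-layer features ($\psi_{RF}$ or $\psi_{Nys}$), an optional second-layer Gaussian random Fourier map (the nonlinear hyperkernel of Remark (3)), and a linear classifier, here regularized least squares solved in closed form. The feature matrices below are random placeholders standing in for the invariant features, so the snippet only illustrates the plumbing, not the reported results.

```python
import numpy as np

def rf_layer(F, omegas, bs):
    """Plain random Fourier map approximating a Gaussian kernel on top of features F."""
    return np.sqrt(2.0 / len(omegas)) * np.cos(F @ omegas.T + bs)

def rls_fit(F, y, lam=1e-3):
    """Regularized least squares (RLS): w = (F^T F + lam I)^{-1} F^T y."""
    d = F.shape[1]
    return np.linalg.solve(F.T @ F + lam * np.eye(d), F.T @ y)

# F_train / F_test would be psi_RF or psi_Nys features from Section 2.2;
# here they are random placeholders just to make the sketch runnable.
rng = np.random.default_rng(0)
F_train, y_train = rng.standard_normal((100, 64)), rng.choice([-1.0, 1.0], 100)
F_test = rng.standard_normal((30, 64))

# (i) linear decision boundary on the invariant features
w = rls_fit(F_train, y_train)
pred_linear = np.sign(F_test @ w)

# (ii) nonlinear boundary: second-layer Gaussian RF map, then a linear RLS classifier
s2, sigma2 = 256, 1.0
omegas2 = rng.standard_normal((s2, F_train.shape[1])) / sigma2
bs2 = rng.uniform(0, 2 * np.pi, s2)
w2 = rls_fit(rf_layer(F_train, omegas2, bs2), y_train)
pred_nonlinear = np.sign(rf_layer(F_test, omegas2, bs2) @ w2)
```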
Quantum Machine Data. This data consists of 7165 Coulomb matrices of size 23 × 23 (each matrix corresponding to a molecule) and their associated atomization energies in kcal/mol. It is a small subset of a large dataset collected by Blum and Reymond (2009) [4], and was recently used by Montavon et al. (2012) [30] for evaluation. The goal is to predict the atomization energies of molecules, which is modeled as a regression task. The atomization energy is known to be invariant to permutations of the rows/columns of the Coulomb matrix, which motivates the use of representations invariant to the permutation group. We follow the experimental methodology of [30] and report mean cross-validation accuracy on the five folds provided in the dataset. An inner cross-validation is used for tuning the parameters for each fold, as in [30]. We compare the performance of our method with several baselines in Table 1: (i) Original (GP/RF/Nys): Gaussian Process regression on the original Coulomb matrices and its approximation via random Fourier (RF) [34] and Nyström features [48]; (ii) Sort-Coulomb (GP/RF/Nys): GP regression on sorted Coulomb matrices (sorted according to row norms) [30] and its approximation; (iii) Rand-Coulomb: the permutation invariant kernel proposed in [30]; and (iv) GICDF: the group invariant CDF (histogram) based features proposed in [31]. The results for Sort-Coulomb (GP) and Rand-Coulomb are taken directly from [30]. For all RF and Nyström based features we use 10k random templates ($\omega$). For GICDF and our method, we sample 70 random permutations ($r = 70$ in Eq. 7) using the same scheme as in [30]. The proposed LGIKA outperforms all these directly competing methods, including Rand-Coulomb and GICDF. Neural network based features used in [30] could also be used within our framework, but we stick to raw Coulomb matrices for simplicity's sake.

Figure 1: Kernel approximation error (normalized) in spectral and Frobenius norms vs. the number of random features, for 20 (left) and 40 (right) group transformations.
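For the permutation group used in this experiment, the group action is a simultaneous permutation of the rows and columns of the Coulomb matrix. The sketch below samples such permuted copies to form the orbit samples $g_k x$ fed to $\psi_{RF}$ or $\psi_{Nys}$; it draws permutations uniformly at random as a stand-in, whereas the experiments above follow the specific sampling scheme of [30], and the matrix entries are random placeholders.

```python
import numpy as np

def permute_coulomb(C, perm):
    """Apply a simultaneous row/column permutation, the group action on a Coulomb matrix."""
    return C[np.ix_(perm, perm)]

rng = np.random.default_rng(0)
C = rng.random((23, 23))
C = (C + C.T) / 2                          # Coulomb matrices are symmetric

# Sample r group elements; the orbit distribution P_{x|G} is supported on these permuted copies.
r = 70
orbit = np.stack([permute_coulomb(C, rng.permutation(23)).ravel() for _ in range(r)])
print(orbit.shape)                         # (70, 529); rows serve as g_k x in Eq. (7) or (8)
```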
Kernel Approximation. We also report empirical results on the approximation error of the kernel matrix (in terms of spectral norm and Frobenius norm) in Fig. 1. The plots show the approximation error for different numbers of group actions as the number of random Fourier features is increased. The kernel used is the Gaussian kernel. The true kernel has been computed using 70 group elements randomly sampled from the permutation group. The normalized error in all cases goes down with the number of random Fourier features, which is in line with our theoretical convergence results.

Table 2: Rotated MNIST results.

    Method              Accuracy    Accuracy w/ 2nd layer RF
    Original (RF)       87.75       88.01
    Original (Nys)      88.93       88.98
    Original (RBF)      90          N/A
    TI-RBM [43]         95.8        N/A
    RC-RBM [40]         97.02       N/A
    GICDF [31]          93.81       N/A
    Z2-CNN [11]         94.97       N/A
    P4-CNN [11]                     N/A
    LGIKA (RF)
The Rotated MNIST dataset [24] consists of a total of 62k images of digits (12k for training and 50k for testing), obtained by rotating original MNIST images by an angle sampled uniformly between 0 and $2\pi$. We compare the proposed method with several other approaches in Table 2. We use a von Mises distribution ($p(\theta) \propto \exp(\kappa \cos(\theta))$ with $\kappa = 0.2$, selected using cross-validation) to sample $r = 100$ rotations. We use $s = 7$k random templates for both the RF and Nyström approximations, and use 17k random templates for the layer-2 RF approximation. The results for the cited methods in Table 2 are taken directly from the respective papers, except for GICDF [31], which we implemented ourselves. The proposed LGIKA outperforms most of the competing methods, including deep architectures like the rotation-invariant convolutional RBM (RC-RBM) [40], the transformation invariant RBM (TI-RBM) [43], and a regular deep CNN (Z2-CNN) [11]. Our method also performs close to the recently proposed group-equivariant CNN (P4-CNN) [11].
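The rotation sampling used above can be sketched as follows: draw $r$ angles from a von Mises distribution concentrated around the identity (zero rotation) and apply them to the image; for pure rotations the Jacobian determinant is 1, so no intensity rescaling is needed (Lemma 2.1). The image here is a random placeholder rather than an MNIST digit.

```python
import numpy as np
from scipy.ndimage import rotate

rng = np.random.default_rng(0)
image = rng.random((28, 28))                          # placeholder for an MNIST digit

kappa, r = 0.2, 100
thetas = rng.vonmises(mu=0.0, kappa=kappa, size=r)    # angles concentrated around 0 (identity)
rotated = np.stack([rotate(image, np.degrees(t), reshape=False, order=1) for t in thetas])

# Each row g_k x is then projected onto the random templates to form psi_RF (Eq. 7).
print(rotated.reshape(r, -1).shape)                   # (100, 784)
```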
32, divided into 10 classes. We consider a subgroup of the affine group Aff(2) consisting of rotations, translations and isotropic scaling. Instead of operating with a single distribution (e.g., Gaussian) over this subgroup, we use three individual distributions to have better control over the three variations: a log-normal distribution over the scaling group ($\mu = 0$), together with a zero-mean distribution over the translation group and a von Mises distribution ($\kappa = 9$) over the rotation group. We observe that working with wider distributions over these groups actually hurts the performance, highlighting the importance of locality for CIFAR-10. We use the normalized pixel intensities as our input features and use the group action defined in Lemma 2.1 to keep it unitary. We use $s = 10$k random templates and $r = 50$ group transforms for the first layer RF features (Eq. 7), and use 30k random templates for the second layer RF features. The proposed LGIKA outperforms vanilla RF features, as shown in Table 3. Nyström based features gave similar results as random Fourier features in our early explorations. We were not able to scale GICDF [31] to a suitable number of random templates due to memory issues (for every random template, GICDF generates $n$ features, with the number of bins $n$ set to 25 following [31], blowing up the overall feature dimension to $n$ times the number of templates). Note that the performance of LGIKA on this data is still significantly worse than deep CNNs [11], since LGIKA treats the image as a vector, ignoring the spatial neighborhood structure taken into account by CNNs through translation invariance over small image patches. Incorporating orbit statistics of image patches in our framework is left for future work.

Table 3: CIFAR-10 results.

    Method                Accuracy
    Original (RF) [34]
    LGIKA

Invariant Kernel Methods. [2] introduced Tomographic Probabilistic Representations (TPR) that embed orbits into probability distributions. Unlike TPR, our representation maps orbits, or local portions of an orbit, via kernel mean embedding to an RKHS and allows us to define similarity between orbits in this space. Indeed, our representation is infinite dimensional and is related to the Haar invariant kernel [21]. As discussed earlier, it can be approximated via random features or Nyström sampling. Other approaches for building invariant kernels were defined in [47], which focuses on dilation invariances. A kernel view of histograms of gradients was introduced in [5], where finite dimensional features were defined through kernel PCA. Kernel convolutional networks, introduced in [29] and [28], consider the composition of multilayer kernels, where local image patches are represented as points in a reproducing kernel Hilbert space. However, they do not consider general group invariances. The work of [13] considers the general problem of learning from conditional distributions. When applied to invariant learning, their optimization approach needs to sample a group-transformed example in every SGD iteration, whereas our approach allows working with group actions on the random templates.
Invariance in Neural Networks.
Inducing invariances in neural networks has attracted many recent research streams. It is now well established that convolutional neural networks (CNNs) [26] ensure translation invariance. [16] showed that mapping orbits of rotated and flipped images through a shared fully connected network builds some invariance into the network. Scattering networks [8] have built-in invariances for the roto-translation group. [19] generalizes CNNs to general group transformations. [15] exploits cyclic symmetry to obtain invariant predictions in the network. More recently, [11] designed a convolutional neural network that is equivariant to group transforms by introducing convolution over the group.
The proposed approach can be suitable for large-scale problems, benefiting from recent advances in the scalability of randomized kernel methods [3, 14, 27]. As a future direction, we would like to extend our framework to operate at the level of image patches, enabling us to capture local spatial structure. Further, the proposed approach requires computation of all $r$ group transformations for all the sampled random templates. Reducing the required number of group transformations is an important direction for future work. Our work also assumes that the appropriate group actions are given. Extension to the case when the group transformations are learned from the data (e.g., using the local tangent space [37]) is also an important direction for future work.

Acknowledgments:
We thank Dmitry Malioutov for several insightful discussions. This work was done while Anant Raj was a summer intern at IBM Research.

References

[1] Fabio Anselmi, Joel Z Leibo, Lorenzo Rosasco, Jim Mutch, Andrea Tacchetti, and Tomaso Poggio. Unsupervised learning of invariant representations in hierarchical architectures. arXiv preprint arXiv:1311.4158, 2014.
[2] Fabio Anselmi, Lorenzo Rosasco, and Tomaso Poggio. On invariance and selectivity in representation learning. Information and Inference, 2016.
[3] Haim Avron and Vikas Sindhwani. High-performance kernel machines with implicit distributed optimization and randomization. Technometrics, 2015.
[4] Lorenz C Blum and Jean-Louis Reymond. 970 million druglike small molecules for virtual screening in the chemical universe database GDB-13. Journal of the American Chemical Society, 131(25), 2009.
[5] Liefeng Bo, Xiaofeng Ren, and Dieter Fox. Kernel descriptors for visual recognition. In NIPS, 2010.
[6] S. Bochner. Monotone Funktionen, Stieltjes Integrale und harmonische Analyse. Math. Ann., 108, 1933.
[7] Mariusz Bojarski, Anna Choromanska, Krzysztof Choromanski, Francois Fagan, Cedric Gouy-Pailler, Anne Morvan, Nouri Sakr, Tamas Sarlos, and Jamal Atif. Structured adaptive and random spinners for fast machine learning computations. arXiv preprint arXiv:1610.06209, 2016.
[8] Joan Bruna and Stephane Mallat. Invariant scattering convolution networks. IEEE Trans. Pattern Anal. Mach. Intell., 2013.
[9] Krzysztof Choromanski and Vikas Sindhwani. Recycling randomness with structure for sublinear time kernel expansions. arXiv preprint arXiv:1605.09049, 2016.
[10] Andreas Christmann and Ingo Steinwart. Universal kernels on non-standard input spaces. In Advances in Neural Information Processing Systems, pages 406-414, 2010.
[11] Taco Cohen and Max Welling. Group equivariant convolutional networks. In Proceedings of the 33rd International Conference on Machine Learning, 2016.
[12] Felipe Cucker and Steve Smale. On the mathematical foundations of learning. Bulletin of the American Mathematical Society, 39:1-49, 2002.
[13] Bo Dai, Niao He, Yunpeng Pan, Byron Boots, and Le Song. Learning from conditional distributions via dual kernel embeddings. arXiv preprint arXiv:1607.04579, 2016.
[14] Bo Dai, Bo Xie, Niao He, Yingyu Liang, Anant Raj, Maria-Florina F Balcan, and Le Song. Scalable kernel methods via doubly stochastic gradients. In Advances in Neural Information Processing Systems, pages 3041-3049, 2014.
[15] Sander Dieleman, Jeffrey De Fauw, and Koray Kavukcuoglu. Exploiting cyclic symmetry in convolutional neural networks. ICML, 2016.
[16] Sander Dieleman, Kyle Willett, and Joni Dambre. Rotation-invariant convolutional neural networks for galaxy morphology prediction. Monthly Notices of the Royal Astronomical Society, 2015.
[17] Petros Drineas and Michael W Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6(Dec):2153-2175, 2005.
[18] Kenji Fukumizu, Arthur Gretton, Xiaohai Sun, and Bernhard Schölkopf. Kernel measures of conditional dependence. In NIPS, volume 20, pages 489-496, 2007.
[19] Robert Gens and Pedro M Domingos. Deep symmetry networks. In NIPS, 2014.
[20] Alex Gittens and Michael W Mahoney. Revisiting the Nyström method for improved large-scale machine learning. In Proceedings of the 30th International Conference on Machine Learning, volume 28, 2013.
[21] Bernard Haasdonk, A Vossen, and Hans Burkhardt. Invariance in kernel methods by Haar-integration kernels. In Scandinavian Conference on Image Analysis, pages 841-851. Springer, 2005.
[22] Wassily Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301):13-30, 1963.
[23] Sanjiv Kumar, Mehryar Mohri, and Ameet Talwalkar. Sampling methods for the Nyström method. Journal of Machine Learning Research, 13, 2012.
[24] Hugo Larochelle, Dumitru Erhan, Aaron Courville, James Bergstra, and Yoshua Bengio. An empirical evaluation of deep architectures on problems with many factors of variation. In Proceedings of the 24th International Conference on Machine Learning, 2007.
[25] Quoc Le, Tamás Sarlós, and Alex Smola. Fastfood: approximating kernel expansions in loglinear time. In Proceedings of the 30th International Conference on Machine Learning, 2013.
[26] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278-2324, 1998.
[27] Zhiyun Lu, Avner May, Kuan Liu, Alireza Bagheri Garakani, Dong Guo, Aurélien Bellet, Linxi Fan, Michael Collins, Brian Kingsbury, Michael Picheny, et al. How to scale up kernel methods to be as good as deep neural nets. arXiv preprint arXiv:1411.4000, 2014.
[28] Julien Mairal. End-to-end kernel learning with supervised convolutional kernel networks. In NIPS, 2016.
[29] Julien Mairal, Piotr Koniusz, Zaid Harchaoui, and Cordelia Schmid. Convolutional kernel networks. In NIPS, 2014.
[30] Grégoire Montavon, Katja Hansen, Siamac Fazli, Matthias Rupp, Franziska Biegler, Andreas Ziehe, Alexandre Tkatchenko, Anatole V Lilienfeld, and Klaus-Robert Müller. Learning invariant representations of molecules for atomization energy prediction. In Advances in Neural Information Processing Systems, pages 440-448, 2012.
[31] Youssef Mroueh, Stephen Voinea, and Tomaso A Poggio. Learning with group invariant features: A kernel perspective. In Advances in Neural Information Processing Systems, pages 1558-1566, 2015.
[32] Krikamol Muandet, Kenji Fukumizu, Francesco Dinuzzo, and Bernhard Schölkopf. Learning from distributions via support measure machines. In Advances in Neural Information Processing Systems, pages 10-18, 2012.
[33] Krikamol Muandet, Kenji Fukumizu, Bharath Sriperumbudur, and Bernhard Schölkopf. Kernel mean embedding of distributions: A review and beyond. arXiv preprint arXiv:1605.09522, 2016.
[34] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In Advances in Neural Information Processing Systems, pages 1177-1184, 2007.
[35] Ali Rahimi and Benjamin Recht. Uniform approximation of functions with random bases. In Communication, Control, and Computing, 2008 46th Annual Allerton Conference on. IEEE, 2008.
[36] Ali Rahimi and Benjamin Recht. Weighted sums of random kitchen sinks: Replacing minimization with randomization in learning. In Advances in Neural Information Processing Systems, pages 1313-1320, 2009.
[37] Salah Rifai, Yann Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. The manifold tangent classifier. In NIPS, volume 271, page 523, 2011.
[38] Joseph Rotman. An Introduction to the Theory of Groups, volume 148. Springer Science & Business Media, 2012.
[39] Walter Rudin. Fourier Analysis on Groups. John Wiley & Sons, 2011.
[40] Uwe Schmidt and Stefan Roth. Learning rotation-aware features: From invariant priors to equivariant descriptors. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on, 2012.
[41] Bernhard Schölkopf and Alexander J Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.
[42] Alex Smola, Arthur Gretton, Le Song, and Bernhard Schölkopf. A Hilbert space embedding for distributions. In International Conference on Algorithmic Learning Theory, pages 13-31. Springer, 2007.
[43] Kihyuk Sohn and Honglak Lee. Learning invariant representations with local transformations. In Proceedings of the 29th International Conference on Machine Learning, 2012.
[44] Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Gert RG Lanckriet, and Bernhard Schölkopf. Injective Hilbert space embeddings of probability measures. In COLT, volume 21, pages 111-122, 2008.
[45] Bharath K Sriperumbudur, Arthur Gretton, Kenji Fukumizu, Bernhard Schölkopf, and Gert RG Lanckriet. Hilbert space embeddings and metrics on probability measures. Journal of Machine Learning Research, 11(Apr):1517-1561, 2010.
[46] Ingo Steinwart. On the influence of the kernel on the consistency of support vector machines. Journal of Machine Learning Research, 2(Nov):67-93, 2001.
[47] C. Walder and O. Chapelle. Learning with transformation invariant kernels. In NIPS, 2007.
[48] Christopher Williams and Matthias Seeger. Using the Nyström method to speed up kernel machines. In Proceedings of the 14th Annual Conference on Neural Information Processing Systems, pages 682-688, 2001.
[49] Jiyan Yang, Vikas Sindhwani, Haim Avron, and Michael W. Mahoney. Quasi-Monte Carlo feature maps for shift-invariant kernels. In Proceedings of the 31st International Conference on Machine Learning, 2014.

Appendix
Proof of Lemma 2.1.
Since unitary transformations preserve dot-products, i.e., $\langle T(x), T(y)\rangle = \langle x, y\rangle$, we need to show that a group element acting on the image $I : \mathbb{R}^2 \mapsto \mathbb{R}$ as $T_g[I](x) = |J_g|^{-1/2}\, I(T_g^{-1}(x))$ for all $x$ is a unitary transformation. Let $J_g$ be the Jacobian of the transformation $T_g$, with determinant $|J_g|$. We have
$$\left\|I(T_g^{-1}(\cdot))\right\|^2 = \int I^2(T_g^{-1}(x))\, dx = \int I^2(z)\, |J_g|\, dz = |J_g|\, \|I(\cdot)\|^2,$$
substituting $z = T_g^{-1}(x)$ so that $dx = |J_g|\, dz$. Hence the transformation given as $T_g[I](\cdot) = |J_g|^{-1/2}\, I(T_g^{-1}(\cdot))$ is unitary, and thus $\langle T_g(I), T_g(I')\rangle = \langle I, I'\rangle$ for two images $I$ and $I'$.
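A small numerical check of the lemma, assuming an isotropic scaling $T_g(x) = a x$ of a continuous image and approximating the $L_2$ inner product by a Riemann sum on a grid; the Gaussian test images, the scale factor $a$ and the grid are arbitrary choices.

```python
import numpy as np

def inner(I1, I2, dx):
    """L2 inner product of two sampled images, Riemann-sum approximation."""
    return np.sum(I1 * I2) * dx * dx

# Two smooth test "images" I, I' : R^2 -> R (Gaussians, so they decay inside the grid).
xs = np.linspace(-6, 6, 401)
dx = xs[1] - xs[0]
X, Y = np.meshgrid(xs, xs)
I1 = np.exp(-(X ** 2 + Y ** 2))
I2 = np.exp(-((X - 1) ** 2 + Y ** 2) / 2)

# Group element: isotropic scaling T_g(x) = a*x, so |J_g| = a^2 and
# T_g[I](x) = |J_g|^{-1/2} I(g^{-1} x) = (1/a) * I(x / a).
a = 1.5
Tg_I1 = (1 / a) * np.exp(-((X / a) ** 2 + (Y / a) ** 2))
Tg_I2 = (1 / a) * np.exp(-((X / a - 1) ** 2 + (Y / a) ** 2) / 2)

print(inner(I1, I2, dx), inner(Tg_I1, Tg_I2, dx))    # the two values agree up to grid error
```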
Proof of Theorem 3.1.

We first define the notion of a U-statistic [22].

U-statistics. Let $g : \mathbb{R}^2 \to \mathbb{R}$ be a symmetric function of its arguments. Given an i.i.d. sequence $X_1, X_2, \ldots, X_n$ of $n\,(\ge 2)$ random variables, the quantity $U := \frac{1}{n(n-1)}\sum_{i \ne j} g(X_i, X_j)$ is known as a pairwise U-statistic. If $\theta(P) = \mathbb{E}_{X_1, X_2 \sim P}\, g(X_1, X_2)$, then $U$ is an unbiased estimate of $\theta(P)$.

Our goal is to bound
$$\sup_{x, y \in X}\left|\langle \psi_{RF}(x), \psi_{RF}(y)\rangle - k_{q,G}(x, y)\right|, \quad \text{where } \psi_{RF}(x) = \frac{1}{r}\sum_{i=1}^{r} z(g_i x), \ x \in X \subset \mathbb{R}^d.$$
We work with $z(\cdot) = \sqrt{2/s}\,\left[\cos(\langle \omega_1, \cdot\rangle + b_1), \ldots, \cos(\langle \omega_s, \cdot\rangle + b_s)\right] \in \mathbb{R}^s$ with $b_i \sim \mathrm{Unif}(0, 2\pi)$, as in [34]. Let $\hat{k}_{q,G}(x, y) := \frac{1}{r^2}\sum_{i,j=1}^{r} k(g_i x, g_j y)$ and $\tilde{k}_{q,G}(x, y) := \frac{1}{r(r-1)}\sum_{i \ne j} k(g_i x, g_j y)$. Using the triangle inequality we have
$$\sup_{x,y \in X}\left|\langle \psi_{RF}(x), \psi_{RF}(y)\rangle - k_{q,G}(x, y)\right| \le \underbrace{\sup_{x,y}\left|\langle \psi_{RF}(x), \psi_{RF}(y)\rangle - \hat{k}_{q,G}(x, y)\right|}_{A} + \underbrace{\sup_{x,y}\left|\tilde{k}_{q,G}(x, y) - k_{q,G}(x, y)\right|}_{B} + \underbrace{\sup_{x,y}\left|\hat{k}_{q,G}(x, y) - \tilde{k}_{q,G}(x, y)\right|}_{C}.$$

Bounding A. We have
$$A = \sup_{x,y \in X}\left|\frac{1}{r^2}\sum_{i,j}\left(\langle z(g_i x), z(g_j y)\rangle - k(g_i x, g_j y)\right)\right|.$$
Define $f_{ij}(x, y) := \langle z(g_i x), z(g_j y)\rangle - k(g_i x, g_j y)$ and $f(x, y) = \frac{1}{r^2}\sum_{i,j} f_{ij}(x, y)$. Each of the $s$ independent random variables in the summand of
$$\frac{1}{r^2}\sum_{i,j}\langle z(g_i x), z(g_j y)\rangle = \frac{2}{s}\sum_{k=1}^{s}\left(\frac{1}{r^2}\sum_{i,j}\cos(\langle \omega_k, g_i x\rangle + b_k)\cos(\langle \omega_k, g_j y\rangle + b_k)\right)$$
is bounded in $[-1, 1]$ for all $x, y$, so Hoeffding's inequality [22] gives $\Pr[|f(x, y)| \ge \varepsilon/2] \le 2\exp(-c_1 s\varepsilon^2)$ for a numeric constant $c_1$. To obtain a uniform convergence guarantee over $X$, we follow the arguments in [34], relying on covering the space with an $\eta$-net and on the Lipschitz continuity of the function $f(x, y)$. Since $X$ is compact, we can find an $\eta$-net that covers $X$ with $N_X = \left(\frac{\mathrm{diam}(X)}{\eta}\right)^d$ balls of radius $\eta$ [12]. Let $\{c_k\}_{k=1}^{N_X}$ be the centers of these balls, and let $L_f$ denote the Lipschitz constant of $f(\cdot, \cdot)$, i.e., $|f(x, y) - f(c_k, c_l)| \le L_f\, \|(x, y) - (c_k, c_l)\|$ for all $x, y, c_k, c_l \in X$. For any $x, y \in X$ there exists a pair of centers $c_k, c_l$ such that $\|(x, y) - (c_k, c_l)\| < \sqrt{2}\,\eta$. We will have $|f(x, y)| < \varepsilon/2$ for all $x, y$ if (i) $|f(c_k, c_l)| < \varepsilon/4$ for all $c_k, c_l$, and (ii) $L_f < \frac{\varepsilon}{4\sqrt{2}\,\eta}$. Applying a union bound over all center pairs $(c_k, c_l)$ immediately gives
$$\Pr\left[\cup_{k,l}\, \{|f(c_k, c_l)| \ge \varepsilon/4\}\right] \le 2 N_X^2 \exp(-c_1 s\varepsilon^2/4). \qquad (9)$$
We use the Markov inequality to bound the Lipschitz constant of $f$. By definition, $L_f = \sup_{x,y}\|\nabla_{x,y} f(x, y)\| = \|\nabla_{x,y} f(x^*, y^*)\|$, where $\nabla_{x,y} f(x, y) = \big(\nabla_x f(x, y), \nabla_y f(x, y)\big)$. We also have $\mathbb{E}_{\omega \sim p}\, \nabla_{x,y}\langle z(g_i x), z(g_j y)\rangle = \nabla_{x,y}\, k(g_i x, g_j y)$. It then follows, using the triangle inequality, Jensen's inequality and the fact that the $\omega_k$ are i.i.d., that
$$\mathbb{E}_{\omega \sim p}\,\|\nabla_{x,y} f(x^*, y^*)\| \le 2\,\mathbb{E}_{\omega \sim p} \sup_{x, y, g_i, g_j}\left\|\nabla_x \langle z(g_i x), z(g_j y)\rangle\right\| \le 2\, \sigma_p^2 \sup_{x \in X,\, g \in G}\|\nabla_x T_g(x)\|^2,$$
where $\sigma_p^2 = \mathbb{E}_p(\omega^\top \omega)$ and $T_g(x) = gx$ denotes the transformation corresponding to the group action. If we assume the group action to be linear, i.e., $T_g(x + y) = T_g(x) + T_g(y)$ and $T_g(\alpha x) = \alpha T_g(x)$, which holds for all group transformations considered in this work (e.g., rotation, translation, scaling or general affine transformations of an image $x$; permutations of $x$), we can bound $\|\nabla_x T_g(x)\|$ as
$$\|\nabla_x T_g(x)\| = \sup_{u : \|u\|=1}\|\nabla_x T_g(x)\, u\| = \sup_{u : \|u\|=1}\left\|\lim_{h \to 0}\frac{T_g(x + hu) - T_g(x)}{h}\right\| = \sup_{u : \|u\|=1}\|T_g(u)\| = 1,$$
where the middle expression is the directional derivative of the vector valued function $T_g(\cdot)$, and the last equality holds since $T_g(\cdot)$ is either unitary or is converted to a unitary map by construction (see Lemma 2.1). Using the Markov inequality to bound the Lipschitz constant then yields
$$\Pr\left[L_f \ge \frac{\varepsilon}{4\sqrt{2}\,\eta}\right] \le \frac{c_2\,\sigma_p^2\,\eta^2}{\varepsilon^2}$$
for a numeric constant $c_2$. Combining Eq. (9) with the above result on Lipschitz continuity, we get
$$\Pr\left[\sup_{x,y}|f(x, y)| \le \varepsilon/2\right] \ge 1 - 2 N_X^2 \exp(-c_1 s\varepsilon^2/4) - \frac{c_2\,\sigma_p^2\,\eta^2}{\varepsilon^2}. \qquad (10)$$
Bounding B. As defined earlier, $\tilde{k}_{q,G}(x, y) := \frac{1}{r(r-1)}\sum_{i \ne j} k(g_i x, g_j y)$. From the U-statistics literature [22] it follows that $\mathbb{E}(\tilde{k}_{q,G}(x, y)) = k_{q,G}(x, y)$. Since $g_1, g_2, \ldots, g_r$ are i.i.d. samples, we can consider $\tilde{k}_{q,G}(x, y)$ as a function of the $r$ random variables $(g_1, g_2, \ldots, g_r)$ and denote it by $f(g_1, g_2, \ldots, g_r)$. If a single variable $g_p$ is changed to $g'_p$, we can bound the absolute difference between the changed and the original function: for the Gaussian (RBF) kernel, $|k(g_p x, g_j y) - k(g'_p x, g_j y)| \le 1$, and hence
$$\left|f(g_1, \ldots, g_p, \ldots, g_r) - f(g_1, \ldots, g_{p-1}, g'_p, g_{p+1}, \ldots, g_r)\right| = \frac{1}{r(r-1)}\left|\sum_{j \ne p}\left(k(g_p x, g_j y) - k(g'_p x, g_j y)\right)\right| \le \frac{r-1}{r(r-1)} = \frac{1}{r}.$$
Using the bounded difference inequality,
$$\Pr\left[\left|f(g_1, \ldots, g_r) - \mathbb{E}[f(g_1, \ldots, g_r)]\right| \ge \varepsilon\right] \le 2\exp\left(-c_3\, r\varepsilon^2\right)$$
for a numeric constant $c_3$. The above bound holds for a given pair $x, y$. As in the segment bounding the term $A$, we use the $\eta$-net covering of $X$ and Lipschitz continuity arguments to obtain a uniform convergence guarantee. Using a union bound over all pairs of centers,
$$\Pr\left[\cup_{k,\ell=1}^{N_X}\left\{\left|\mathbb{E}[k(g c_k, g' c_\ell)] - \frac{1}{r(r-1)}\sum_{i \ne j} k(g_i c_k, g_j c_\ell)\right| > \varepsilon\right\}\right] \le 2 N_X^2 \exp\left(-c_3\, r\varepsilon^2\right). \qquad (11)$$
To extend the bound from the centers $c_i$ to all $x \in X$, we again use a Lipschitz continuity argument. Let $h(x, y) = k_{q,G}(x, y) - \tilde{k}_{q,G}(x, y)$ and let $L_h$ denote the Lipschitz constant of $h(\cdot, \cdot)$. By the definition of the $\eta$-net, for any $x, y \in X$ there exists a pair of centers $c_k, c_l$ such that $\|(x, y) - (c_k, c_l)\| < \sqrt{2}\,\eta$, and we will have $|h(x, y)| < \varepsilon/2$ for all $x, y$ if (i) $|h(c_k, c_l)| < \varepsilon/4$ for all $c_k, c_l$, and (ii) $L_h < \frac{\varepsilon}{4\sqrt{2}\,\eta}$. We again use the Markov inequality to bound the Lipschitz constant of $h$. We have $L_h = \sup_{x,y}\|\nabla_{x,y} h(x, y)\| = \|\nabla_{x,y} h(x^*, y^*)\|$ and $\mathbb{E}\,\nabla_{x,y}\tilde{k}_{q,G}(x, y) = \nabla_{x,y} k_{q,G}(x, y)$, so that
$$\mathbb{E}_{g_1, \ldots, g_r}\|\nabla_{x,y} h(x^*, y^*)\| \le \mathbb{E}_{g_1, \ldots, g_r}\left\|\frac{1}{r(r-1)}\sum_{i \ne j}\nabla_{x,y}\, k(g_i x^*, g_j y^*)\right\|.$$
Noting that $T_{g_i}(x) = g_i x$ and $k(x, y) = \exp\!\left(-\frac{\|x - y\|^2}{2\sigma^2}\right)$, we have
$$\nabla_x k(g_i x, g_j y) = -\frac{1}{\sigma^2}\,\nabla_x T_{g_i}(x)\,(g_i x - g_j y)\,\exp\!\left(-\frac{\|g_i x - g_j y\|^2}{2\sigma^2}\right),$$
and using $\sup_{z \ge 0} z\, e^{-z^2/(2\sigma^2)} = \sigma e^{-1/2}$ together with the linearity and unitarity of $T_g(\cdot)$ as before,
$$\left\|\frac{1}{r(r-1)}\sum_{i \ne j}\nabla_{x,y}\, k(g_i x, g_j y)\right\| \le \frac{\sqrt{2}\, e^{-1/2}}{\sigma}.$$
Applying the Markov inequality to $L_h$ then gives $\Pr\!\left[L_h > \frac{\varepsilon}{4\sqrt{2}\,\eta}\right] \le \frac{c_4\,\eta^2}{e\,\sigma^2\varepsilon^2}$ for a numeric constant $c_4$. Hence
$$\Pr[B \le \varepsilon/2] \ge 1 - 2 N_X^2 \exp\left(-c_3\, r\varepsilon^2\right) - \frac{c_4\,\eta^2}{e\,\sigma^2\varepsilon^2}.$$

Bounding C. We have
$$\left|\tilde{k}_{q,G}(x, y) - \hat{k}_{q,G}(x, y)\right| = \left|\left(\frac{1}{r(r-1)} - \frac{1}{r^2}\right)\sum_{i \ne j} k(g_i x, g_j y) - \frac{1}{r^2}\sum_{i = j} k(g_i x, g_j y)\right| \le \max\left(\frac{1}{r^2(r-1)}\sum_{i \ne j} k(g_i x, g_j y),\ \frac{1}{r^2}\sum_{i = j} k(g_i x, g_j y)\right) \le \frac{1}{r},$$
since the Gaussian kernel satisfies $0 \le k(\cdot, \cdot) \le 1$.

Combining the three bounds,
$$\sup_{x, y \in X}\left|\langle \psi_{RF}(x), \psi_{RF}(y)\rangle - k_{q,G}(x, y)\right| \le A + B + C \le \varepsilon + \frac{1}{r}$$
with probability at least
$$p = 1 - 2 N_X^2 \exp\left(-c_1 s\varepsilon^2/4\right) - 2 N_X^2 \exp\left(-c_3 r\varepsilon^2\right) - \frac{c_2\,\eta^2 d}{\varepsilon^2\sigma^2} - \frac{c_4\,\eta^2}{e\,\varepsilon^2\sigma^2},$$
noting that $\sigma_p^2 = d/\sigma^2$ for the Gaussian kernel $k(x, y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}$. With $N_X = \left(\frac{\mathrm{diam}(X)}{\eta}\right)^d$, the above probability is of the form $1 - (\kappa_1 + \kappa_2)\,\eta^{-2d} - \kappa_3\,\eta^2$, where $\kappa_1 = 2\,(\mathrm{diam}(X))^{2d}\exp(-c_3 r\varepsilon^2)$, $\kappa_2 = 2\,(\mathrm{diam}(X))^{2d}\exp(-c_1 s\varepsilon^2/4)$, and $\kappa_3$ collects the two Lipschitz terms, $\kappa_3 \le \frac{c_5\,(d+1)}{\varepsilon^2\sigma^2}$. Choosing $\eta = \left(\frac{\kappa_1 + \kappa_2}{\kappa_3}\right)^{\frac{1}{2d+2}}$ gives $p \ge 1 - 2(\kappa_1 + \kappa_2)^{\frac{1}{d+1}}\,\kappa_3^{\frac{d}{d+1}}$. For given $\delta_1, \delta_2 \in (0, 1)$, there exist universal constants $C_1, C_2$ such that for $r \ge C_1\frac{d}{\varepsilon^2}\log(\mathrm{diam}(X)/\delta_1)$ and $s \ge C_2\frac{d}{\varepsilon^2}\log(\mathrm{diam}(X)/\delta_2)$ we have
$$\sup_{x, y \in X}\left|\langle \psi_{RF}(x), \psi_{RF}(y)\rangle - k_{q,G}(x, y)\right| \le \varepsilon + \frac{1}{r}$$
with probability $1 - \left(\frac{(d+1)}{\varepsilon^2\sigma^2}\right)^{\frac{d}{d+1}}(\delta_1 + \delta_2)^{\frac{d}{d+1}}$.

Proof of Theorem 3.2.

We give here the proof of Theorem 3.2.
Lemma A.1 (Lemma 4 of [36]). Let $X = \{x_1, x_2, \ldots, x_K\}$ be i.i.d. random variables in a ball $\mathcal{H}$ of radius $M$ centered around the origin in a Hilbert space. Denote their average by $\bar{X} = \frac{1}{K}\sum_{i=1}^{K} x_i$. Then for any $\delta > 0$, with probability at least $1 - \delta$,
$$\|\bar{X} - \mathbb{E}\bar{X}\| \le \frac{M}{\sqrt{K}}\left(1 + \sqrt{2\log\tfrac{1}{\delta}}\right).$$

Proof. For the proof, see [36].

Now consider the space of functions
$$\mathcal{F}_p \equiv \left\{ f(x) = \int_\Omega \alpha(\omega)\int_G \phi(gx, \omega)\, q(g)\, d\nu(g)\, d\omega \ \Big|\ |\alpha(\omega)| \le C p(\omega)\right\},$$
and also the space
$$\hat{\mathcal{F}}_p \equiv \left\{ \hat{f}(x) = \sum_{k=1}^{s}\frac{\alpha_k}{r}\sum_{i=1}^{r}\phi(g_i x, \omega_k) \ \Big|\ |\alpha_k| \le \frac{C}{s}\right\},$$
where $\phi(gx, \omega) = e^{-i\langle gx, \omega\rangle}$.

Lemma A.2. Let $\mu$ be a measure defined on $X$, and $f^\star$ a function in $\mathcal{F}_p$. If $\omega_1, \omega_2, \ldots, \omega_s$ are i.i.d. samples from $p(\omega)$, then for $\delta_1, \delta_2 > 0$ there exists a function $\hat{f} \in \hat{\mathcal{F}}_p$ such that
$$\left\|f^\star - \hat{f}\right\|_{L_2(X, \mu)} \le \frac{C}{\sqrt{s}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_1}}\right) + \frac{C}{\sqrt{r}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_2}}\right)$$
with probability at least $1 - \delta_1 - \delta_2$.

Proof. Consider $\psi(x; \omega_k) = \int_G \phi(gx, \omega_k)\, q(g)\, d\nu(g)$. Let $\tilde{f}_k = \beta_k\, \psi(\cdot\,; \omega_k)$, $k = 1, \ldots, s$, with $\beta_k = \frac{\alpha(\omega_k)}{p(\omega_k)}$, so that $\mathbb{E}_{\omega_k \sim p}\,\tilde{f}_k = f^\star$. Define $\tilde{f}(x) = \frac{1}{s}\sum_{k=1}^{s}\tilde{f}_k$. Let $\hat{f}_k(x) = \beta_k\, \hat{\psi}(x; \omega_k)$, where $\hat{\psi}(x; \omega_k) = \frac{1}{r}\sum_{i=1}^{r}\phi(g_i x, \omega_k)$ is the empirical estimate of $\psi(x; \omega_k)$, and define $\hat{f}(x) = \frac{1}{s}\sum_{k=1}^{s}\hat{f}_k(x)$, so that $\mathbb{E}_{g_i \sim q}\,\hat{f}(x) = \tilde{f}(x)$. By the triangle inequality,
$$\left\|f^\star - \hat{f}\right\|_{L_2(X,\mu)} \le \left\|f^\star - \tilde{f}\right\|_{L_2(X,\mu)} + \left\|\tilde{f} - \hat{f}\right\|_{L_2(X,\mu)}.$$
From Lemma 1 of [36], with probability $1 - \delta_1$,
$$\left\|f^\star - \tilde{f}\right\|_{L_2(X,\mu)} \le \frac{C}{\sqrt{s}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_1}}\right).$$
Since $\hat{f}(x) = \frac{1}{r}\sum_{i=1}^{r}\sum_{k=1}^{s}\frac{\beta_k}{s}\phi(g_i x, \omega_k)$ and $\mathbb{E}_{g_i \sim q}\,\hat{f}(x) = \tilde{f}(x)$ with the $g_i$ i.i.d. (and $\{\omega_k\}_{k=1}^{s}$ fixed beforehand), we can apply Lemma A.1 with
$$M = \left\|\sum_{k=1}^{s}\frac{\beta_k}{s}\phi(g_i x, \omega_k)\right\| \le \sum_{k=1}^{s}\left|\frac{\beta_k}{s}\right|\,\|\phi(g_i x, \omega_k)\| \le \sum_{k=1}^{s}\left|\frac{\beta_k}{s}\right| \le C.$$
We conclude that with probability at least $1 - \delta_2$,
$$\left\|\tilde{f} - \hat{f}\right\|_{L_2(X,\mu)} \le \frac{C}{\sqrt{r}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_2}}\right).$$
Hence, with probability at least $1 - \delta_1 - \delta_2$,
$$\left\|f^\star - \hat{f}\right\|_{L_2(X,\mu)} \le \frac{C}{\sqrt{s}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_1}}\right) + \frac{C}{\sqrt{r}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_2}}\right).$$

Theorem A.3 (Estimation error [36]). Let $\mathcal{F}$ be a bounded class of functions, $\sup_{x \in X}|f(x)| \le C$ for all $f \in \mathcal{F}$, and let $V(y_i f(x_i))$ be an $L$-Lipschitz loss. Then, with probability $1 - \delta$ with respect to the training samples $\{x_i, y_i\}_{i=1,\ldots,N}$ (i.i.d. $\sim P$), every $f \in \mathcal{F}$ satisfies
$$E_V(f) \le \hat{E}_V(f) + 4L\,\mathcal{R}_N(\mathcal{F}) + \frac{2|V(0)|}{\sqrt{N}} + LC\sqrt{\frac{1}{2N}\log\frac{1}{\delta}},$$
where $\mathcal{R}_N(\mathcal{F})$ is the Rademacher complexity of the class $\mathcal{F}$,
$$\mathcal{R}_N(\mathcal{F}) = \mathbb{E}_{x,\sigma}\left[\sup_{f \in \mathcal{F}}\left|\frac{1}{N}\sum_{i=1}^{N}\sigma_i f(x_i)\right|\right],$$
and the $\sigma_i$ are i.i.d. symmetric Bernoulli random variables taking values in $\{-1, +1\}$ with equal probability, independent of the $x_i$.

Proof. See [36].

Let $f \in \mathcal{F}_p$ and $\hat{f} \in \hat{\mathcal{F}}_p$; then the approximation error is bounded as
$$E_V(\hat{f}) - E_V(f) \le \mathbb{E}_{(x,y)\sim P}\left|V(y\hat{f}(x)) - V(yf(x))\right| \le L\,\mathbb{E}\left|\hat{f}(x) - f(x)\right| \le L\sqrt{\mathbb{E}\left(\hat{f}(x) - f(x)\right)^2} \le LC\left(\frac{1}{\sqrt{s}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_1}}\right) + \frac{1}{\sqrt{r}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_2}}\right)\right),$$
using Jensen's inequality for the concave function $\sqrt{\cdot}$, with probability at least $1 - \delta_1 - \delta_2$. Now let $f^\star_N = \arg\min_{f \in \hat{\mathcal{F}}_p}\hat{E}_V(f)$ and $\tilde{f} = \arg\min_{f \in \hat{\mathcal{F}}_p} E_V(f)$. We have
$$E_V(f^\star_N) - \min_{f \in \mathcal{F}_p} E_V(f) \le 2\sup_{f \in \hat{\mathcal{F}}_p}\left|E_V(f) - \hat{E}_V(f)\right| + LC\left(\frac{1}{\sqrt{s}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_1}}\right) + \frac{1}{\sqrt{r}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_2}}\right)\right) \le 2\left(2L\,\mathcal{R}_N(\hat{\mathcal{F}}_p) + \frac{2|V(0)|}{\sqrt{N}} + LC\sqrt{\frac{1}{2N}\log\frac{1}{\delta}}\right) + LC\left(\frac{1}{\sqrt{s}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_1}}\right) + \frac{1}{\sqrt{r}}\left(1 + \sqrt{2\log\tfrac{1}{\delta_2}}\right)\right)$$
with probability at least $1 - \delta - \delta_1 - \delta_2$. It is easy to show that $\mathcal{R}_N(\hat{\mathcal{F}}_p) \le \frac{C}{\sqrt{N}}$. Taking $\delta_1 = \delta_2 = \delta/2$ yields the bound stated in Theorem 3.2.