Layer-wise Learning of Kernel Dependence Networks
Chieh Wu∗ Aria Masoomi∗ Arthur Gretton Jennifer Dy (∗ equal contribution)
Abstract
We propose a greedy strategy to train a deep network for multi-class classification, where each layer is defined as a composition of a linear projection and a nonlinear mapping. This nonlinear mapping is defined as the feature map of a Gaussian kernel, and the linear projection is learned by maximizing the dependence between the layer output and the labels, using the Hilbert-Schmidt Independence Criterion (HSIC) as the dependence measure. Since each layer is trained greedily in sequence, all learning is local, and neither backpropagation nor even gradient descent is needed. The depth and width of the network are determined via natural guidelines, and the procedure regularizes the weights in the linear layer. As the key theoretical result, the function class represented by the network is proved to be sufficiently rich to learn any dataset labeling using a finite number of layers, in the sense of reaching minimum mean-squared error or cross-entropy, as long as no two data points with different labels coincide. Experiments demonstrate good generalization performance of the greedy approach across multiple benchmarks while showing a significant computational advantage over a multilayer perceptron of the same complexity trained globally by backpropagation.
Since the seminal work by Rumelhart et al. [1], Multilayer Perceptrons (MLPs) have become a popular tool for classification. This success can in part be explained by the expressiveness of the resulting function class [2, 3, 4, 5]; in particular, a two-layer network can approximate any continuous function on a compact domain to any desired accuracy, albeit with a network size exponential in the input dimension. A helpful perspective on networks of large (or, in the limit, even infinite) width can be gained from kernel methods, for which the feature spaces can be infinite-dimensional by construction. For example, the Gaussian process (GP) has been used to understand the limiting behavior of wide networks [6, 7, 8]: in particular, deep GPs give a mechanism for constructing deep networks where each layer has infinitely many features [9, 10, 11, 12]. Following GPs, the Neural Tangent Kernel (NTK) was proposed to describe the dynamics of the network during training [13, 14], further elucidating the relationship between MLPs and kernels. Moreover, Belkin et al. [15] have shown how MLPs and kernels both yield good generalization results despite overfitting, leading them to link kernel discovery to network training.

In this paper, we introduce a new function class for supervised classification, which builds a kernel neural network in a greedy layer-wise fashion: we call this the Kernel Dependence Network (KNet). Each layer is a composition of a linear subspace projection and an infinite-dimensional feature map. A key advantage is that the network can be trained greedily, layer by layer. KNet solves each layer by maximizing the dependence between the layer output and the labels, using the Hilbert-Schmidt Independence Criterion (HSIC) as the dependence measure [16]. This can be achieved efficiently, even without Stochastic Gradient Descent (SGD), via the Iterative Spectral Method (ISM) [17, 18]. This local and greedy strategy is used in place of backpropagation (BP) to obviate the sharing of gradient information throughout the network, thereby avoiding exploding/vanishing gradients. As a consequence of our formulation, we also demonstrate how a natural choice of the network width and depth emerges.

A related work by Ma et al. [19] uses a chain of HSIC dependencies to simulate the Information Bottleneck. They employ a standard network structure that is solved by SGD. In contrast, our kernel perspective replaces the conventional activation function with the feature map of a Gaussian kernel (GK), resulting in an infinitely wide network that is solved via the kernel trick. Additionally, our key contribution is the theoretical characterization of the richness of the function class described by our architecture. We prove a property related to the finite sample expressivity [20, Section 4] of classical neural nets. Specifically, we show that the network construction can achieve any desired training accuracy given a minimum depth of 2 infinitely wide layers, with the implication of minimizing Mean Squared Error (MSE) and Cross-Entropy (CE) on the training data.

Similar to traditional MLPs, the richness of KNet promises to fit any training data with an overparameterized network, yet it also experimentally generalizes well on test data. To explain this observation for standard MLPs, the regularizing effects of architecture choice and optimization strategies have been used to explore how MLPs generalize [21, 22, 23]. As our second theoretical contribution, we demonstrate that our architecture and training procedure perform an implicit regularization, similar to the weight penalization arguments of Poggio et al. [24], and we describe the mechanisms by which this is achieved. We lastly verify every theoretical guarantee of our model experimentally, where KNet achieves generalization results comparable to MLPs of the same complexity trained by BP.
Let X ∈ ℝ^{n×d} be a dataset of n samples with d features and let Y ∈ ℝ^{n×τ} be the corresponding one-hot encoded labels with τ classes. Let ⊙ be the element-wise product. The i-th sample and label of the dataset are written as x_i and y_i. H is the centering matrix defined as H = I_n − (1/n) 1_n 1_n^T, where I_n is the identity matrix of size n × n and 1_n is a vector of 1s of length n. Given H, we let Γ = H Y Y^T H and let K_{(·)} be the Gaussian kernel (GK) matrix computed from a data matrix (·). We denote the MLP weights as W_1 ∈ ℝ^{d×q} for the 1st layer and W_l ∈ ℝ^{m×q} for the l-th layer, assuming l > 1. The input and output of the l-th layer are R_{l−1} ∈ ℝ^{n×m} and R_l ∈ ℝ^{n×m}; i.e., given ψ : ℝ^{n×q} → ℝ^{n×m} as the activation function, R_l = ψ(R_{l−1} W_l). For each layer, the i-th row of its input R_{l−1} is r_i ∈ ℝ^m, and it represents the i-th input sample. We denote W_l as a function where W_l(R_{l−1}) = R_{l−1} W_l; consequently, each layer is also a function φ_l = ψ ∘ W_l. By stacking L layers together, the entire network itself becomes a function φ = φ_L ∘ ... ∘ φ_1. Given an empirical risk (H) and a loss function (L), our network model assumes an objective of

  min_φ (1/n) Σ_{i=1}^n L(φ(x_i), y_i).  (1)

We propose to solve Eq. (1) greedily; this is equivalent to solving a sequence of single-layered networks where the previous network output becomes the current layer's input. At each layer, we find the W_l that maximizes the dependence between the layer output and the label via HSIC [16]:

  max_{W_l} Tr( Γ [ ψ(R_{l−1} W_l) ψ^T(R_{l−1} W_l) ] )  s.t.  W_l^T W_l = I.  (2)

Deviating from the traditional concept of activation functions, we use the GK feature map in their place, simulating an infinitely wide network. Yet, the kernel trick spares us the direct computation of ψ(R_{l−1} W_l) ψ^T(R_{l−1} W_l); we instead compute the GK matrix given K(W_l^T r_i, W_l^T r_j) = exp(−‖W_l^T r_i − W_l^T r_j‖² / (2σ²)). We also deviate from a standard MLP by restricting our solution space to a linear subspace. If the solution indeed lives on a linear subspace independent of its scale, as suggested by [24, 25, 26], then we can exploit this prior knowledge to narrow the search space during optimization to the Stiefel manifold, i.e., by adding the constraint W_l^T W_l = I. This prior enables us to solve Eq. (2) by leveraging the Iterative Spectral Method (ISM) proposed by Wu et al. [17, 18], which simultaneously avoids SGD and identifies the network width. Applying ISM to our model, each layer's weight is initialized using the most dominant eigenvectors of

  Q_l^0 = R_{l−1}^T ( Γ − Diag(Γ 1_n) ) R_{l−1},  (3)

where the Diag(·) function places the elements of a vector on the diagonal of a square matrix with zero off-diagonal elements. Once the initial weights W_l^0 are set, ISM iteratively updates W_l^i to W_l^{i+1} by setting W_l^{i+1} to the most dominant eigenvectors of

  Q_l^i = R_{l−1}^T ( Γ̂ − Diag(Γ̂ 1_n) ) R_{l−1},  (4)

where Γ̂ is a function of W_l^i computed as Γ̂ = Γ ⊙ K_{R_{l−1} W_l^i}. This iterative weight-updating process stops when Q_l^i ≈ Q_l^{i+1}, whereupon Q_l^{i+1} is set to Q_l^* and its most dominant eigenvectors W_l^* become the solution of Eq. (2).

ISM solves Eq. (2) directly on an infinitely wide network during training, obtaining W_l^*. However, during test, we approximate ψ with Random Fourier Features (RFF) [27], simulating a finite-width network by passing samples through W_l^* and ψ. Capitalizing on the spectral properties of ISM, the spectrum of Q_l^* completely determines the width of the network W_l^* ∈ ℝ^{m×q}, i.e., m is equal to the size of the RFF, and q is simply the rank of Q_l^*. Furthermore, since Eq. (2) after normalization is upper bounded by 1, we can stop adding new layers when the HSIC value of the current layer approaches this theoretical bound, thereby prescribing a natural depth for the network.
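As a concrete illustration of Eqs. (2)-(4), the following is a minimal NumPy sketch of one ISM layer solve. The function names, the spectrum-energy rule used to approximate the rank of Q_l^* (keeping 90% of the eigenvalue mass, echoing the experimental settings), and the convergence test are our own assumptions for illustration; this is not the authors' released implementation.

```python
import numpy as np

def gaussian_kernel(Z, sigma):
    """Gaussian kernel matrix K_ij = exp(-||z_i - z_j||^2 / (2 sigma^2))."""
    sq = np.sum(Z ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * Z @ Z.T
    return np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))

def ism_layer(R, Gamma, sigma, var_kept=0.9, max_iter=100, tol=1e-5):
    """One greedy layer: solve Eq. (2) for W_l via the ISM updates of Eqs. (3)-(4)."""
    n = len(R)
    # Eq. (3): initialize Q from Gamma itself.
    Q = R.T @ ((Gamma - np.diag(Gamma @ np.ones(n))) @ R)
    W = None
    for _ in range(max_iter):
        # Most dominant eigenvectors of (symmetrized) Q; keep enough to cover var_kept of the spectrum.
        vals, vecs = np.linalg.eigh((Q + Q.T) / 2)
        order = np.argsort(vals)[::-1]
        vals, vecs = vals[order], vecs[:, order]
        energy = np.cumsum(np.abs(vals)) / np.sum(np.abs(vals))
        q = int(np.searchsorted(energy, var_kept) + 1)   # width of this layer
        W_new = vecs[:, :q]
        # Eq. (4): recompute Q with Gamma_hat = Gamma * K(R W).
        K = gaussian_kernel(R @ W_new, sigma)
        Gamma_hat = Gamma * K
        Q_new = R.T @ ((Gamma_hat - np.diag(Gamma_hat @ np.ones(n))) @ R)
        if W is not None and np.linalg.norm(Q_new - Q) < tol * np.linalg.norm(Q):
            return W_new
        Q, W = Q_new, W_new
    return W
```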
The resulting network φ after training will map samples of the same class into their own cluster, allowing test samples to be classified by matching their network outputs to the nearest cluster center. We formally define the hyperparameter settings and provide Algorithm 1 in the Experimental section. The source code is also publicly available at https://github.com/anonymous .
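To make the full training-and-prediction pipeline concrete, here is a minimal sketch that stacks ISM layers greedily and classifies test points by their nearest cluster center. It reuses the hypothetical `ism_layer` and `gaussian_kernel` helpers sketched above; the RFF approximation of ψ, the normalized-HSIC stopping threshold of 0.99, and the helper names are illustrative assumptions rather than the authors' exact implementation.

```python
import numpy as np

def rff_map(Z, sigma, m=300, seed=0):
    """Random Fourier feature approximation of the Gaussian-kernel feature map psi."""
    rng = np.random.default_rng(seed)
    Omega = rng.normal(scale=1.0 / sigma, size=(Z.shape[1], m // 2))
    P = Z @ Omega
    return np.hstack([np.cos(P), np.sin(P)]) / np.sqrt(m // 2)

def normalized_hsic(R, Gamma, sigma):
    """HSIC of Eq. (2) via the kernel trick, normalized to [0, 1] (centered alignment)."""
    n = len(R)
    H = np.eye(n) - np.ones((n, n)) / n
    K = H @ gaussian_kernel(R, sigma) @ H
    return np.sum(K * Gamma) / (np.linalg.norm(K) * np.linalg.norm(Gamma))

def train_knet(X, Y, sigma, threshold=0.99, max_layers=20):
    """Greedy layer-wise training: add layers until normalized HSIC approaches its bound."""
    n = len(X)
    H = np.eye(n) - np.ones((n, n)) / n
    Gamma = H @ Y @ Y.T @ H
    weights, R = [], X
    for _ in range(max_layers):
        W = ism_layer(R, Gamma, sigma)
        weights.append((W, sigma))
        R = rff_map(R @ W, sigma)              # finite-width surrogate for psi
        if normalized_hsic(R, Gamma, sigma) > threshold:
            break
    centers = np.vstack([R[Y[:, c] == 1].mean(axis=0) for c in range(Y.shape[1])])
    return weights, centers

def predict(X_test, weights, centers):
    """Forward pass through the learned layers, then nearest-cluster-center assignment."""
    R = X_test
    for W, sigma in weights:
        R = rff_map(R @ W, sigma)
    d2 = ((R[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.argmin(d2, axis=1)
```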
Background and Notations. Let S be the set of (i, j) sample pairs that belong to the same class. Its complement, S^c, contains all sample pairs from different classes. We denote the composition of the first l layers as φ_{l∘} = φ_l ∘ ... ∘ φ_1, with l ≤ L. This notation lets us connect the data directly to the layer output, R_l = φ_{l∘}(X). Since KNet is greedy, it solves the MLP by replacing φ in Eq. (1) incrementally with a sequence of functions {φ_{l∘}}_{l=1}^L, where each layer relies on the weights of the previous layer. This implies that we are also solving a sequence of empirical risks {H_l}_{l=1}^L, i.e., different versions of Eq. (1) given the current φ_{l∘}. We refer to {φ_{l∘}}_{l=1}^L and {H_l}_{l=1}^L as the Kernel Sequence and the H-Sequence.
Classification Strategy. Classification tasks can be solved using objectives like MSE and CE to match the network output φ(X) to the label Y. While this approach achieves the desired outcome, it also constrains the space of potential solutions, since φ(X) must match Y. Yet, if φ maps X to the labels {0, 1} instead of the true labels {−1, 1}, φ(X) may not match Y, but the solution is the same. Therefore, enforcing φ(X) = Y ignores an entire space of solutions that are functionally equivalent. We posit that by relaxing this constraint and accepting a larger space of potential global optima, it becomes easier during optimization to land in this space. This intuition motivates us to depart from the tradition of label matching, and instead seek out alternative objectives that focus on the underlying prerequisite of classification, i.e., learning a mapping of X under which similar and different classes become distinguishable.

Since there are many notions of similarity, it is not always clear which is best for a particular situation. KNet overcomes this uncertainty by discovering the optimal similarity measure as a kernel function during training. To understand how, first note that the (i, j)-th element of Γ, denoted Γ_{i,j}, is positive for sample pairs in S and negative for pairs in S^c. By defining a kernel function K as a similarity measure between two samples, Eq. (2) becomes

  max_{W_l}  Σ_{i,j∈S} Γ_{i,j} K_{W_l}(r_i, r_j) − Σ_{i,j∈S^c} |Γ_{i,j}| K_{W_l}(r_i, r_j).  (5)

Notice that the objective uses the sign of Γ_{i,j} as a label to guide the choice of W_l, increasing K_{W_l}(r_i, r_j) when (r_i, r_j) belongs to S and decreasing K_{W_l}(r_i, r_j) otherwise. Therefore, by finding the W_l matrix that best parameterizes K, HSIC discovers the optimal pair-wise relationship function K_{W_l} that separates samples into similar and dissimilar partitions. Given this strategy, we formally demonstrate how learning K leads to classification in the following sections.
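The signed decomposition in Eq. (5) is easy to check numerically: the sketch below splits Γ into its same-class and different-class parts and evaluates the objective for a given projection. The helper name is ours, and the code assumes the `gaussian_kernel` helper from the earlier sketch.

```python
import numpy as np

def hsic_objective_split(R, Y, W, sigma):
    """Evaluate Eq. (5): sum of Gamma_ij * K_W over S minus |Gamma_ij| * K_W over S^c."""
    n = len(R)
    H = np.eye(n) - np.ones((n, n)) / n
    Gamma = H @ Y @ Y.T @ H
    K = gaussian_kernel(R @ W, sigma)
    same = Y @ Y.T > 0                              # boolean mask of pairs in S
    pos = np.sum(Gamma[same] * K[same])             # attracted pairs: same class
    neg = np.sum(np.abs(Gamma[~same]) * K[~same])   # repelled pairs: different classes
    return pos - neg

# Because sign(Gamma_ij) encodes membership in S vs. S^c, np.sum(Gamma * K)
# gives the same value as the split form above, matching the trace form of Eq. (2).
```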
Optimization Strategy. An MLP is traditionally solved with all layers jointly via BP. We instead focus on greedily discovering a Kernel Sequence that compels the H-Sequence to exhibit key behaviors that enable classification. Here, we discuss how these behaviors lead to an optimal solution.

First, the H-Sequence must be convergent. We accomplish this by leveraging the Monotone Convergence Theorem (MCT) [28]: a monotone sequence is guaranteed to have a limit if and only if the sequence is bounded. For KNet, since ψ is the feature map of a GK, K_{W_l}(r_i, r_j) is naturally constrained to [0, 1]. Therefore, given our greedy strategy, the H-Sequence converges if we can achieve H_l ≥ H_{l−1} at every layer.

Second, while the MCT guarantees convergence, it does not guarantee the quality of the limit. Namely, the improvement at each layer could be so small that the overall gain is trivial. To answer this question, we must investigate the potential contribution of each layer and identify the point of convergence. In the most ideal case, as L → ∞ the network should converge towards an optimal kernel that achieves the theoretical upper limit of HSIC, H* = Σ_{i,j∈S} Γ_{i,j}. Moreover, we should be able to achieve this within a finite number of layers.

We investigate these criteria using HSIC as the empirical risk of an MLP and prove that, by using the feature map of a GK as ψ, there exists a sequence of weights {W_l}_{l=1}^L for which these considerations are simultaneously satisfied. We state this theorem below and provide its proof in App. A.
Theorem 1. For any H_0, there exists a Kernel Sequence {φ_{l∘}}_{l=1}^L parameterized by a set of weights W_l and a set of bandwidths σ_l such that:
I. H_L can approach arbitrarily close to H*: for any L > 1 and δ > 0 we can achieve
  H* − H_L ≤ δ,  (6)
II. as L → ∞, the H-Sequence converges to the global optimum, that is,
  lim_{L→∞} H_L = H*,  (7)
III. the convergence is strictly monotonic, where
  H_l > H_{l−1}  ∀ l ≥ 1.  (8)

To simplify the proof, Thm. 1 uses the average direction of each class, W_s, at each layer, even though W_s is not guaranteed to be an optimal solution (see the small numerical sketch at the end of this subsection). In spite of this deficit, Thm. 1 proves that a global optimum is attainable given a minimum of two layers, i.e., for any L > 1. While this remarkable feat is theoretically possible, depending on the data, the σ required for the GK may be extremely small, leading to an undesirably sharp φ that overfits the noise. Fortunately, this issue is resolved by simply spreading the monotonic improvement across more layers. We have additionally observed experimentally that by replacing the suboptimal W_s with an optimal W_l^* from ISM (where ∂H/∂W(W_l^*) = 0), large σ values can be used while still achieving convergence with few layers.

Relating H* to Classification. As H_l → H*, the maximization of Eq. (5) shows how we learn the similarity function. However, it may be unclear why learning this function also induces an optimal classification. To see this, let us first clarify some key notations and concepts. We refer to the image of ψ and φ as the Reproducing Kernel Hilbert Space (RKHS) and the image of W as the Image of the RKHS Dual Space (IDS). This distinction is crucial because each space dictates the geometric orientation that leads to classification. Keeping this in mind, each layer's output can be measured by the within-class (S_w^l) and between-class (S_b^l) scatter matrices defined as

  S_w^l = Σ_{i,j∈S} W_l^T (r_i − r_j)(r_i − r_j)^T W_l  and  S_b^l = Σ_{i,j∈S^c} W_l^T (r_i − r_j)(r_i − r_j)^T W_l.  (9)

These matrices are historically important [29, 30] because their trace ratio T = Tr(S_w)/Tr(S_b) measures class separability, i.e., a small T implies a tight grouping of the same class under the Euclidean distance. Classification is consequently achieved by mapping different classes into these spatial separations. Crucially, by maximizing H, T is minimized as a byproduct. In fact, we prove that as H_l → H*, T approaches 0, and φ maps samples of different classes to separated points. Note that since T is computed with samples from the image of W, this particular relationship resides in IDS. Concurrently, since the inner product defines similarity in RKHS, samples are simultaneously being partitioned via the angular distance. Indeed, our proof indicates that as H_l → H*, RKHS samples within S achieve perfect alignment while samples in S^c become orthogonal to each other, separated by a maximum angle of π/2. This dual relationship between IDS and RKHS produces a vastly different network output right before and immediately after the final activation function, where different notions of distance (Euclidean and angular) are employed to partition samples. We formally state these results in the following theorem, with the proof in App. B.
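Returning to the constructive argument of Thm. 1, the sketch below forms the suboptimal projection W_s from the normalized average direction of each class (Eq. (21) in App. A) and stacks layers with it, as in the Fig. 2 simulation. The per-column normalization, the function names, and the reuse of the hypothetical `rff_map` helper from the earlier pipeline sketch are our own illustrative choices.

```python
import numpy as np

def class_mean_directions(R, Y):
    """Suboptimal W_s of Thm. 1: one column per class, the normalized sum of that class's samples."""
    cols = []
    for c in range(Y.shape[1]):
        v = R[Y[:, c] == 1].sum(axis=0)
        cols.append(v / np.linalg.norm(v))   # per-column normalization; the paper uses a single constant zeta
    return np.stack(cols, axis=1)            # shape (m, tau)

def simulate_theorem1(X, Y, sigmas):
    """Greedy stacking with the suboptimal W_s, one bandwidth per layer."""
    R = X
    for sigma in sigmas:
        W_s = class_mean_directions(R, Y)
        R = rff_map(R @ W_s, sigma)          # finite-width surrogate for the GK feature map
    return R
```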
Theorem 2. As l → ∞ and H_l → H*, the following properties are satisfied:
I. the scatter trace ratio T approaches 0:
  lim_{l→∞} Tr(S_w^l) / Tr(S_b^l) = 0,  (10)
II. the Kernel Sequence converges to the following kernel:
  lim_{l→∞} K_l(x_i, x_j) = K*(x_i, x_j) = 0 ∀ i, j ∈ S^c  and  1 ∀ i, j ∈ S.  (11)

As corollaries to Theorem 2, the resulting partition of samples under the Euclidean and angular distances implicitly satisfies a different classification objective in each space. In IDS, KNet maps a dataset of τ classes into τ distinct points. While these τ points may not match the original labels, this difference is inconsequential. In contrast, samples in RKHS at convergence reside along τ orthogonal axes on a unit sphere. By realigning these axes to the standard bases, solutions that simulate the softmax are generated to solve CE. Therefore, as H_l → H*, the maximization of Eq. (5) minimizes MSE and CE in different spaces without matching the actual labels themselves: instead, we match the underlying geometry of the network output (the quantities T and K* of Eqs. (10)-(11) are computed in the sketch after Corollary 2). We state the two corollaries below, with their proofs in App. E.
Corollary 1. Given H_l → H*, the network output in IDS solves MSE via a translation of the labels.
Corollary 2. Given H_l → H*, the network output in RKHS solves CE via a change of bases.
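The geometric claims of Thm. 2 can be checked numerically on a trained layer: the sketch below computes the scatter trace ratio T of Eq. (10) in IDS and the within-/between-class kernel averages, which should approach the block structure of K* in Eq. (11). The helper names are ours, and the kernel computation assumes the `gaussian_kernel` helper from the earlier sketch.

```python
import numpy as np

def scatter_trace_ratio(R, Y, W):
    """T = Tr(S_w) / Tr(S_b) of Eqs. (9)-(10), computed on the projected samples W^T r_i."""
    Z = R @ W
    same = Y @ Y.T > 0                              # pairs in S (same class)
    d2 = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(axis=2)
    return d2[same].sum() / d2[~same].sum()

def kernel_block_averages(R, Y, W, sigma):
    """Average kernel value within S and within S^c; Thm. 2 predicts these approach 1 and 0."""
    K = gaussian_kernel(R @ W, sigma)
    same = Y @ Y.T > 0
    off_diag = ~np.eye(len(R), dtype=bool)
    return K[same & off_diag].mean(), K[~same].mean()
```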
HSIC and Regularization. Overparameterized MLPs can generalize without any explicit regularizer [20]. This observation defies classical learning theory and has been a longstanding puzzle in the research community [31, 32, 33]. Although overparameterized with infinite width, KNet experimentally exhibits a similar generalization behavior. Moreover, Ma et al. [19] have experimentally observed that HSIC can generalize even without the W^T W = I constraint; we seek to better understand this phenomenon theoretically. Recently, Poggio et al. [24] proposed that traditional MLPs generalize because gradient methods implicitly regularize the normalized weights. We discovered a similar relationship for HSIC, i.e., the objective can be reformulated to isolate n functions [D_1(W_l), ..., D_n(W_l)] that act as a penalty term during optimization. Let S|_i be the set of samples that belong to the i-th sample's class and let S^c|_i be its complement; then each function D_i(W_l) is defined as

  D_i(W_l) = (1/σ²) Σ_{j∈S|_i} Γ_{i,j} K_{W_l}(r_i, r_j) − (1/σ²) Σ_{j∈S^c|_i} |Γ_{i,j}| K_{W_l}(r_i, r_j).  (12)

Notice that D_i(W_l) is simply Eq. (5) restricted to a single sample and scaled by 1/σ². Therefore, as we identify better solutions for W_l, K_{W_l}(r_i, r_j) increases on S|_i and decreases on S^c|_i in Eq. (12), thereby increasing the size of the penalty term D_i(W_l). To appreciate how D_i(W_l) penalizes H, we give an equivalent formulation in the theorem below, with its derivation in App. C.
Theorem 3. Eq. (5) is equivalent to

  max_{W_l}  Σ_{i,j} (Γ_{i,j}/σ²) e^{−(r_i−r_j)^T W_l W_l^T (r_i−r_j)/(2σ²)} (r_i^T W_l W_l^T r_j) − Σ_i D_i(W_l) ‖W_l^T r_i‖².  (13)

Based on Thm. 3, D_i(W_l) adds a negative cost to the sample norm in IDS, ‖W_l^T r_i‖², suggesting that ISM regularizes KNet regardless of the W_l^T W_l = I constraint. In fact, a better W_l imposes a heavier penalty on Eq. (13), where the overall H may actually decrease.
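To see the implicit penalty of Thm. 3 in action, the sketch below evaluates the per-sample terms D_i(W_l) of Eq. (12). The function name is ours, the 1/σ² scaling follows our reading of Eq. (12), and the kernel computation again assumes the `gaussian_kernel` helper sketched earlier.

```python
import numpy as np

def regularization_terms(R, Y, W, sigma):
    """Per-sample penalties D_i(W_l) of Eq. (12): signed, kernel-weighted sums over each row of Gamma."""
    n = len(R)
    H = np.eye(n) - np.ones((n, n)) / n
    Gamma = H @ Y @ Y.T @ H
    K = gaussian_kernel(R @ W, sigma)
    same = Y @ Y.T > 0
    signed = np.where(same, Gamma, -np.abs(Gamma))   # +Gamma_ij on S|_i, -|Gamma_ij| on S^c|_i
    return (signed * K).sum(axis=1) / sigma ** 2     # one penalty value D_i per sample

# A larger D_i means a stronger negative cost on ||W_l^T r_i||^2 in Eq. (13).
```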
Complexity Analysis. The complexity of a single ISM iteration, as reported by Wu et al. [17], is O(n²). Since ISM is repeated over L layers, the KNet complexity is simply O(Ln²). For memory, KNet suffers from the same O(n²) restriction that all kernel methods inherit.
Limitations of Layer-Wise Kernel Dependence Networks. Although our framework presents many theoretical advantages, we caution that much more research would be required for KNet to become practically viable. While ISM resolves many existing problems, it also limits the kernels to the ISM family [18]. Therefore, it currently cannot be extended to handle traditional activation functions such as ReLU and sigmoid. The computational and memory complexity is another practical obstacle. Therefore, KNet at its current maturity is intended for analysis and is not yet suitable for large datasets. Although there are already existing solutions [34, 35] to overcome these challenges, these engineering questions are topics we purposely isolate from the theory and leave for future research.
Experiments
Datasets.
This work focuses on verifying the theoretical guarantees of a greedily trained network against traditional MLPs of comparable complexity trained by BP. Specifically, we confirm the theoretical properties of KNet using three synthetic datasets (Random, Adversarial, and Spiral) and five popular UCI benchmark datasets: wine, cancer, car, divorce, and face [36]. They are included along with the source code in the supplementary material, and their download links and statistics are in App. F.
Evaluation Metrics.
To evaluate the central claim that MLPs can be solved greedily, we report H* at convergence along with the training/test accuracy for each dataset. Here, H* is normalized to the range between 0 and 1 using the method proposed by Cortes et al. [37]. To corroborate Corollaries 1 and 2, we also record MSE and CE. To evaluate the sample geometry predicted by Eq. (10), we record the scatter trace ratio T to measure the compactness of samples within and between classes. The angular distance between samples in S and S^c, as predicted by Eq. (11), is evaluated with the Cosine Similarity Ratio (C). The equations for H* and C are

  H* = H(φ(X), Y) / √( H(φ(X), φ(X)) H(Y, Y) )  and  C = Σ_{i,j∈S^c} ⟨φ(x_i), φ(x_j)⟩ / Σ_{i,j∈S} ⟨φ(x_i), φ(x_j)⟩.  (14)
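A minimal sketch of these two evaluation metrics follows. The normalization of H* is implemented as centered kernel alignment in the style of Cortes et al. [37]; the use of the Gaussian kernel of the network output, the reliance on the earlier `gaussian_kernel` helper if such a kernel is needed, and the function names are our own assumptions.

```python
import numpy as np

def centered_alignment(K, Y):
    """Normalized HSIC H* of Eq. (14) via centered kernel alignment, given a kernel K of phi(X)."""
    n = len(K)
    H = np.eye(n) - np.ones((n, n)) / n
    Kc, Lc = H @ K @ H, H @ (Y @ Y.T) @ H
    return np.sum(Kc * Lc) / (np.linalg.norm(Kc) * np.linalg.norm(Lc))

def cosine_similarity_ratio(Phi, Y):
    """C of Eq. (14): between-class inner products of phi(x) divided by within-class inner products."""
    G = Phi @ Phi.T
    same = Y @ Y.T > 0
    return G[~same].sum() / G[same].sum()
```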
Experiment Settings. The width of the network is set by ISM to keep 90% of the data variance. The RFF width is set to 300 for all datasets, and the σ_l that maximizes H* is chosen. The convergence threshold for the H-Sequence is set at H_l > 0.99. The network structures discovered by ISM for every dataset are recorded and provided in App. G. The MLPs that use MSE and CE have weights initialized via the Kaiming method [38]. All datasets are centered to 0 and scaled to a standard deviation of 1. All sources are written in Python using Numpy, Sklearn, and Pytorch [39, 40, 41]. All experiments were conducted on an Intel Xeon(R) CPU E5-2630 v3 @ 2.40GHz x 16 with 16 total cores.
Algorithm 1: KNet Algorithm
Input: Data X ∈ ℝ^{n×d}, Label Y ∈ ℝ^{n×τ}
Output: Network weights W_1, ..., W_L
while H_l < 0.99 do
  Use the output of the last layer as input
  Add a new layer
  Initialize the layer weight with Eq. (3)
  while Q_l^i ≉ Q_l^{i+1} do
    Update Q_l^i → Q_l^{i+1} with Eq. (4)
  end
  W_l = most dominant eigenvectors of Q_l^*
end

Figure 1: Key evaluation metrics at each layer.
Figure 2: Simulation of Thm. 1 on the Random and Adversarial datasets. The 2D representation is shown, and next to it, the 1D output of each layer is displayed over each line. Both datasets achieved the global optimum H* at the 12th layer. Refer to App. H for additional results.

Experimental Results. Since Thm. 1 guarantees an optimal convergence for any dataset with a suboptimal W_s, we designed an Adversarial dataset to trick the network, i.e., the sample pairs in S^c are significantly closer than the sample pairs in S. We next designed a Random dataset with completely random labels. We then simulated Thm. 1 in Python and plot the sample behavior in Fig. 2. The original 2-dimensional data is shown next to its 1-dimensional IDS results: each line represents the 1D output at that layer. As predicted by the theorem, our network converged at the 12th layer and perfectly separated the samples based on labels. We emphasize that these results are obtained purely from the simulation of Thm. 1, without resorting to σ ≈ 0, while using the suboptimal solution W_s. Namely, the smallest σ values used are 0.15 and 0.03 for Random and Adversarial, respectively.

Using the optimal W* from ISM, we next conduct 10-fold cross-validation across all 8 datasets and report the mean and standard deviation of all key metrics. The random and non-random datasets are visually separated. Once our model is trained and has learned its structure, we use the same depth and width to train 2 additional MLPs via SGD, where MSE and CE are used as the empirical risk instead of HSIC. The results are listed in Table 1 with the best outcome in bold.

Can the H-Sequence be optimized greedily? The H* column in Table 1 consistently reports results that converge near the theoretical maximum value of 1, thereby corroborating Thm. 1. As predicted by Thm. 2, we also report high training accuracies as H_l → H*. Given the overfitting results from Fig. 2, will our network generalize? Since smooth mappings are associated with better generalization, we also report the smallest σ value used for each network to highlight the smoothness of the φ learned by ISM. Correspondingly, with the exception of the two random datasets, our test accuracy consistently performed well across all datasets. KNet further differentiates itself on the high-dimensional Face dataset, where it was the only method that avoided overfitting. While we cannot definitively attribute the impressive test results to Thm. 3, the experimental evidence appears to be aligned with its implication.
[Table 1: results per dataset (random, adversarial, spiral, wine, cancer, car, face, divorce), three rows each (objectives H, CE, MSE), with columns obj, σ ↑, L ↓, Train Acc ↑, Test Acc ↑, Time(s) ↓, H* ↑, MSE ↓, CE ↓, C ↓, T ↓; the numeric entries were not recoverable from this extraction.]
Table 1: Each dataset contains 3 rows comparing the greedily trained KNet using H against traditional MLPs trained using MSE and CE via SGD, given the same network width and depth. The best results are in bold, with ↑/↓ indicating that larger/smaller values are preferred.

Since Thm. 1 also claims that we can achieve H* − H_l < δ in a finite number of layers, we include in Table 1 the average length of the H-Sequence (L). The table suggests that the H-Sequence converges quickly, with 9 layers as the deepest network. Additionally, notice in Fig. 2 that 12 layers are required for the suboptimal W_s to converge. In contrast, ISM used much smoother and larger σ values (0.38 and 0.5, far larger than the 0.15 and 0.03 needed by W_s). Since KNet can be solved via a single forward pass while SGD requires many iterations of BP, KNet should be faster. The Time column of Table 1 confirms this expectation by a wide margin. The biggest difference can be observed on the face dataset, where H finished in 0.78 seconds while MSE required 745 seconds, almost a 1000-fold difference. While the execution times reflect our expectation, techniques that vastly accelerate kernel computations [34, 35] would be required for larger datasets.

As predicted by Thm. 2, KNet induces low T and C, as shown in Table 1, implying that samples in S and S^c are being pulled together and pushed apart based on the Euclidean and angular distances in IDS and RKHS. Given this geometry, will its optimal arguments also induce a low MSE and CE? We evaluate these predictions from Corollaries 1 and 2 by keeping the same network weights while replacing the final objective with MSE and CE. Our corroborating results are highlighted in the MSE and CE columns of Table 1. Interestingly, while using HSIC induces a low MSE and CE, training via BP using MSE or CE does not necessarily translate to good results for each other.

Within the network, Fig. 1 plots all key metrics at each layer during training. Here, the H-Sequence is clearly monotonic and converges towards the global optimum of 1. Moreover, the trends of T and C indicate an incremental clustering of samples into separate partitions. We are unsure why C consistently forms a hunchback pattern; it implies an initial expansion in the dimension of the data followed by a compression. However, we note that this behavior is also observed in traditional networks by Ansuini et al. [42]. Corresponding to the low T and C values, the low MSE and CE errors at convergence further reinforce the claims of Corollaries 1 and 2. Note that these patterns are consistent and repeatable across all datasets, as shown in App. J.

We lastly highlight a visual pattern of the Kernel Sequence in Fig. 3. We rearrange the samples so that members of the same class are adjacent to each other. This allows us to quickly evaluate the kernel quality via its block-diagonal structure. Since the GK is restricted to values between 0 and 1, we let white and dark blue represent 0 and 1, respectively, with gradients reflecting values in between. Our proof predicts that the Kernel Sequence should converge to the optimal kernel K*, i.e., the Kernel Sequence should evolve from an uninformative kernel into a highly discriminating kernel with a perfect block-diagonal structure. Corresponding to the top row, the bottom row plots the samples in IDS at each layer. As predicted by Thm. 2, samples of the same class incrementally converge towards a single point in IDS. Again, this pattern is observable on all datasets, and the complete collection of kernel sequences for each dataset can be found in App. I.

Figure 3: A visual confirmation of Thm. 2. The kernel matrices per layer produced by the Kernel Sequence are displayed in the top row, with their corresponding outputs in IDS in the bottom row.
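A small sketch of this visual check follows: it sorts samples by class and images a layer's kernel matrix so that a block-diagonal pattern indicates an informative kernel. The plotting choices, the function name, and the reuse of the `gaussian_kernel` helper from the earlier sketch are our own assumptions.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_kernel_blocks(R, Y, W, sigma, ax=None):
    """Image the Gaussian kernel of the projected layer output, with samples grouped by class."""
    order = np.argsort(np.argmax(Y, axis=1))           # put same-class samples next to each other
    K = gaussian_kernel((R @ W)[order], sigma)
    ax = ax or plt.gca()
    ax.imshow(K, vmin=0.0, vmax=1.0, cmap="Blues")     # white ~ 0, dark blue ~ 1, as in Fig. 3
    ax.set_xticks([]); ax.set_yticks([])
    return K
```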
Conclusion.
We have presented a new model of MLPs for classification that bypasses BP and SGD. Our model, KNet, is guaranteed to reach the global optimum of HSIC in a finite number of steps. The resulting geometric orientation of the samples minimizes the scatter ratio while producing an optimal kernel, K*. This geometric orientation consequently solves MSE and CE in different spaces. Indeed, these patterns are predicted by our theorems and are experimentally reproducible. Therefore, KNet opens the door to a new perspective for analyzing MLPs.

References
[1] David E Rumelhart, Geoffrey E Hinton, Ronald J Williams, et al. Learning representations by back-propagating errors.
Cognitive modeling , 5(3):1, 1988.[2] Balázs Csanád Csáji. Approximation with artificial neural networks.
Faculty of Sciences, Eötvös Loránd University, Hungary, 24:48, 2001. [3] George Cybenko. Approximation by superpositions of a sigmoidal function.
Mathematics ofcontrol, signals and systems , 2(4):303–314, 1989.[4] Kurt Hornik. Approximation capabilities of multilayer feedforward networks.
Neural networks ,4(2):251–257, 1991.[5] Zhou Lu, Hongming Pu, Feicheng Wang, Zhiqiang Hu, and Liwei Wang. The expressive powerof neural networks: A view from the width. In
Advances in neural information processingsystems , pages 6231–6239, 2017.[6] Jaehoon Lee, Yasaman Bahri, Roman Novak, Samuel S. Schoenholz, Jeffrey Pennington, andJascha Sohl-Dickstein. Deep neural networks as gaussian processes.
ArXiv , abs/1711.00165,2018.[7] Radford M Neal.
Bayesian learning for neural networks , volume 118. Springer Science &Business Media, 2012.[8] Alexander G de G Matthews, Mark Rowland, Jiri Hron, Richard E Turner, and ZoubinGhahramani. Gaussian process behaviour in wide deep neural networks. arXiv preprintarXiv:1804.11271 , 2018.[9] Soufiane Hayou, Arnaud Doucet, and Judith Rousseau. On the impact of the activation functionon deep neural networks training. arXiv preprint arXiv:1902.06853 , 2019.[10] Jaehoon Lee, Lechao Xiao, Samuel S Schoenholz, Yasaman Bahri, Jascha Sohl-Dickstein, andJeffrey Pennington. Wide neural networks of any depth evolve as linear models under gradientdescent. arXiv preprint arXiv:1902.06720 , 2019.[11] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian pro-cess behavior, gradient independence, and neural tangent kernel derivation. arXiv preprintarXiv:1902.04760 , 2019.[12] David Duvenaud, Oren Rippel, Ryan Adams, and Zoubin Ghahramani. Avoiding pathologies invery deep networks. In
Artificial Intelligence and Statistics , pages 202–210, 2014.[13] Arthur Jacot, Franck Gabriel, and Clément Hongler. Neural tangent kernel: Convergence andgeneralization in neural networks. In
Advances in neural information processing systems , pages8571–8580, 2018.[14] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang.On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955 ,2019.[15] Mikhail Belkin, Siyuan Ma, and Soumik Mandal. To understand deep learning we need tounderstand kernel learning. In
ICML , 2018.[16] Arthur Gretton, Olivier Bousquet, Alex Smola, and Bernhard Schölkopf. Measuring statisticaldependence with hilbert-schmidt norms. In
International conference on algorithmic learningtheory , pages 63–77. Springer, 2005.[17] Chieh Wu, Stratis Ioannidis, Mario Sznaier, Xiangyu Li, David Kaeli, and Jennifer Dy. Iterativespectral method for alternative clustering. In
International Conference on Artificial Intelligenceand Statistics , pages 115–123, 2018.[18] Chieh Wu, Jared Miller, Yale Chang, Mario Sznaier, and Jennifer G. Dy. Solving interpretablekernel dimensionality reduction. In
NeurIPS, 2019. [19] Wan-Duo Kurt Ma, JP Lewis, and W Bastiaan Kleijn. The HSIC bottleneck: Deep learning without back-propagation. arXiv preprint arXiv:1908.01580, 2019. [20] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016. [21] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In
Proceedings of the 34th InternationalConference on Machine Learning-Volume 70 , pages 233–242. JMLR. org, 2017.[22] Giacomo De Palma, Bobak Kiani, and Seth Lloyd. Random deep neural networks are biasedtowards simple functions. In
Advances in Neural Information Processing Systems , pages1962–1974, 2019.[23] Guillermo Valle-Pérez, Chico Q Camargo, and Ard A Louis. Deep learning generalizesbecause the parameter-function map is biased towards simple functions. arXiv preprintarXiv:1805.08522 , 2018.[24] Tomaso Poggio, Qianli Liao, and Andrzej Banburski. Complexity control by gradient descentin deep networks.
Nature Communications , 11(1):1–5, 2020.[25] Chunyuan Li, Heerad Farkhoor, Rosanne Liu, and Jason Yosinski. Measuring the intrinsicdimension of objective landscapes. arXiv preprint arXiv:1804.08838 , 2018.[26] Stanislav Fort and Stanislaw Jastrzebski. Large scale structure of neural network loss landscapes. arXiv preprint arXiv:1906.04724 , 2019.[27] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. In
Advancesin neural information processing systems , pages 1177–1184, 2008.[28] John Bibby. Axiomatisations of the average and a further generalisation of monotonic sequences.
Glasgow Mathematical Journal , 15(1):63–65, 1974.[29] Ronald A Fisher. The use of multiple measurements in taxonomic problems.
Annals of eugenics ,7(2):179–188, 1936.[30] Geoffrey J McLachlan.
Discriminant analysis and statistical pattern recognition , volume 544.John Wiley & Sons, 2004.[31] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wideand deep neural networks. In
Advances in Neural Information Processing Systems , pages10835–10845, 2019.[32] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over-parameterized networks that provably generalize on linearly separable data. arXiv preprintarXiv:1710.10174 , 2017.[33] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparame-terized neural networks, going beyond two layers. In
Advances in neural information processingsystems , pages 6155–6166, 2019.[34] Ke Wang, Geoff Pleiss, Jacob Gardner, Stephen Tyree, Kilian Q Weinberger, and Andrew Gor-don Wilson. Exact gaussian processes on a million data points. In
Advances in NeuralInformation Processing Systems , pages 14622–14632, 2019.[35] Alessandro Rudi, Luigi Carratino, and Lorenzo Rosasco. Falkon: An optimal large scale kernelmethod. In
Advances in Neural Information Processing Systems , pages 3888–3898, 2017.[36] Dua Dheeru and Efi Karra Taniskidou. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml .[37] Corinna Cortes, Mehryar Mohri, and Afshin Rostamizadeh. Algorithms for learning kernelsbased on centered alignment.
Journal of Machine Learning Research, 13(Mar):795–828, 2012. [38] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In
Proceedings of the IEEE international conference on computer vision, pages 1026–1034, 2015. [39] Eric Jones, Travis Oliphant, Pearu Peterson, et al. SciPy: Open source scientific tools for Python, 2001–. URL . [Online; accessed ]. [40] Lars Buitinck, Gilles Louppe, Mathieu Blondel, Fabian Pedregosa, Andreas Mueller, Olivier Grisel, et al. API design for machine learning software: experiences from the scikit-learn project. In ECML PKDD Workshop: Languages for Data Mining and Machine Learning, pages 108–122, 2013. [41] Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. 2017. [42] Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. arXiv preprint arXiv:1905.12784, 2019. [43] Grégoire Montavon, Mikio L Braun, and Klaus-Robert Müller. Kernel analysis of deep networks. Journal of Machine Learning Research, 12(Sep):2563–2581, 2011.

Appendix A Proof for Theorem 1
Theorem 1:
For any H_0, there exists a Kernel Sequence {φ_{l∘}}_{l=1}^L parameterized by a set of weights W_l and a set of bandwidths σ_l such that:
I. H_L can approach arbitrarily close to H*: for any L > 1 and δ > 0 we can achieve
  H* − H_L ≤ δ,  (15)
II. as L → ∞, the H-Sequence converges to the global optimum, that is,
  lim_{L→∞} H_L = H*,  (16)
III. the convergence is monotonic, where
  H_l > H_{l−1}  ∀ l.  (17)
Lemma 1. Given σ₀ and σ₁ as the σ values of the last layer and the current layer, there exists a lower bound for H_l, denoted L(σ₀, σ₁), such that
  H_l ≥ L(σ₀, σ₁).  (18)
Basic Background, Assumptions, and Notations.
1. The simulation of this theorem for the Adversarial and Random data is also publicly available at https://github.com/anonymous .
2. Here we show that this bound can be established given the last 2 layers.
3. σ₀ is the σ value of the previous layer.
4. σ₁ is the σ value of the current layer.
5. τ is the number of classes.
6. n is the total number of samples.
7. n_i is the number of samples in the i-th class.
8. S is the set of all (i, j) sample pairs where r_i and r_j belong to the same class.
9. S^c is the set of all (i, j) sample pairs where r_i and r_j belong to different classes.
10. S_β is the set of all (i, j) sample pairs that belong to the same β-th class.
11. r_i^(α) is the i-th sample in the α-th class among τ classes.
12. We assume no two samples coincide, i.e., r_i ≠ r_j for all i ≠ j.
13. Among all pairs r_i ≠ r_j, there exists an optimal pair r_i*, r_j* where ⟨r_i*, r_j*⟩ ≥ ⟨r_i, r_j⟩ for all r_i ≠ r_i* and r_j ≠ r_j*. We denote this maximum inner product as
  u_σ = ⟨r_i*, r_j*⟩.  (19)
14. In KNet, each sample r_i is assumed to be a sample in the RKHS of the Gaussian kernel; therefore all inner products are bounded such that
  0 ≤ ⟨r_i, r_j⟩ ≤ u_σ.  (20)
15. We let W be
  W_s = (1/√ζ) [ Σ_ι r_ι^(1)  Σ_ι r_ι^(2)  ...  Σ_ι r_ι^(τ) ].  (21)
Instead of using an optimal W* defined as W* = argmax_W H_l(W), we use a suboptimal W_s where each column is simply the average direction of one class; √ζ is a normalizing constant ζ = ‖W_s‖ that ensures W_s^T W_s = I. By using W_s, the H we obtain is already a lower bound compared to the H obtained with W*; we will use this suboptimal W_s to identify an even lower bound. Note that, by the definition of W*, we have the property H(W*) ≥ H(W) for all W.
16. We note that the objective H is
  H = Σ_{i,j∈S} Γ_{i,j} e^{−(r_i−r_j)^T W W^T (r_i−r_j)/(2σ₁²)} − Σ_{i,j∈S^c} |Γ_{i,j}| e^{−(r_i−r_j)^T W W^T (r_i−r_j)/(2σ₁²)},  (22)
where we let 𝒲 denote the summation of the terms associated with the within-cluster pairs, and ℬ the summation of the terms associated with the between-cluster pairs.
Proof.
The equation is divided into smaller parts organized into multiple sections.
For sample pairs in S. The first portion of the objective can be split over the classes:
  𝒲 = Σ_{S_1} Γ_{i,j} e^{−(r_i^(1)−r_j^(1))^T W W^T (r_i^(1)−r_j^(1))/(2σ₁²)} + ... + Σ_{S_τ} Γ_{i,j} e^{−(r_i^(τ)−r_j^(τ))^T W W^T (r_i^(τ)−r_j^(τ))/(2σ₁²)} = 𝒲_1 + ... + 𝒲_τ.  (23)
To find the lower bound, we need the minimum possible value of each term, which translates to the maximum possible value of each exponent. Without loss of generality, we can find the lower bound for one term and generalize the result to the others by symmetry. Let us focus on the numerator of the exponent of 𝒲_1. Given W_s as W, our goal is to identify the maximum possible value of
  (r_i^(1) − r_j^(1))^T W W^T (r_i^(1) − r_j^(1)) = Π_1 Π_2.  (24)
Zooming in on Π_1, we have
  Π_1 = r_i^(1)T W − r_j^(1)T W = ξ_1 − ξ_2,  (25)
  ξ_1 = (1/√ζ) r_i^(1)T [ Σ_ι r_ι^(1)  Σ_ι r_ι^(2)  ...  Σ_ι r_ι^(τ) ],  ξ_2 = (1/√ζ) r_j^(1)T [ Σ_ι r_ι^(1)  Σ_ι r_ι^(2)  ...  Σ_ι r_ι^(τ) ].  (26)-(29)
Knowing that the inner product is constrained to [0, u_σ], the maximum possible value of ξ_1 and the minimum possible value of ξ_2 are
  ξ_1 = (1/√ζ) [ 1 + (n_1 − 1)u_σ,  n_2 u_σ,  n_3 u_σ,  ...,  n_τ u_σ ],  (30)
  ξ_2 = (1/√ζ) [ 1,  0,  0,  ...,  0 ],  (31)
which leads to
  Π_1 = ξ_1 − ξ_2 = (1/√ζ) [ (n_1 − 1)u_σ,  n_2 u_σ,  n_3 u_σ,  ...,  n_τ u_σ ].  (32)
Since Π_2^T = Π_1, we have
  Π_1 Π_2 = (1/ζ) [ ((n_1 − 1)u_σ)² + (n_2 u_σ)² + ... + (n_τ u_σ)² ] = (1/ζ) [ (n_1 − 1)² + n_2² + ... + n_τ² ] u_σ².  (33)-(34)
The lower bound for the 𝒲_1 term therefore emerges as
  𝒲_1 ≥ Σ_{S_1} Γ_{i,j} e^{−[(n_1−1)² + n_2² + ... + n_τ²] u_σ² / (2ζσ₁²)}.  (35)
To condense the notation, define the constant
  N_g = (1/(2ζ)) [ n_1² + n_2² + ... + (n_g − 1)² + ... + n_τ² ].  (36)
The lower bound for 𝒲_1 then simplifies to
  𝒲_1 ≥ Σ_{S_1} Γ_{i,j} e^{−N_1 u_σ²/σ₁²},  (37)
and the general pattern for any 𝒲_g becomes
  𝒲_g ≥ Σ_{S_g} Γ_{i,j} e^{−N_g u_σ²/σ₁²}.  (38)
The lower bound over the entire set S is then
  Σ_{i,j∈S} Γ_{i,j} e^{−(r_i−r_j)^T W W^T (r_i−r_j)/(2σ₁²)} = 𝒲_1 + ... + 𝒲_τ ≥ Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} e^{−N_g u_σ²/σ₁²}.  (39)

For sample pairs in S^c. To simplify the notation, note that
  −ℬ_{g₁,g₂} = −Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−(r_i^(g₁)−r_j^(g₂))^T W W^T (r_i^(g₁)−r_j^(g₂))/(2σ₁²)} = −Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−Tr(W^T A_{i,j}^{(g₁,g₂)} W)/(2σ₁²)},  (40)-(43)
where A_{i,j}^{(g₁,g₂)} = (r_i^(g₁)−r_j^(g₂))(r_i^(g₁)−r_j^(g₂))^T. We now derive the lower bound for the sample pairs in S^c. Writing out the entire summation sequence for ℬ over all pairs of distinct classes,
  ℬ = −ℬ_{1,2} − ... − ℬ_{1,τ} − ℬ_{2,1} − ... − ℬ_{2,τ} − ... − ℬ_{τ−1,τ}.  (44)
Using a similar approach as for the terms in 𝒲, note that ℬ is negative, so we need to maximize it to obtain a lower bound. Consequently, the key is to determine the minimal possible value of each exponent term. Since every term behaves similarly, we can look at the numerator of the exponent of ℬ_{1,2} and then generalize. Given W_s as W, our goal is to identify the minimal possible value of
  (r_i^(1) − r_j^(2))^T W W^T (r_i^(1) − r_j^(2)) = Π_1 Π_2.  (45)
Zooming in on Π_1 = ξ_1 − ξ_2 with ξ_1 = r_i^(1)T W and ξ_2 = r_j^(2)T W as in (46)-(50), and knowing that the inner product is constrained to [0, u_σ], the minimum possible value of ξ_1 and the maximum possible value of ξ_2 are
  ξ_1 = (1/√ζ) [ 1,  0,  0,  ...,  0 ],  (51)
  ξ_2 = (1/√ζ) [ n_1 u_σ,  1 + (n_2 − 1)u_σ,  n_3 u_σ,  ...,  n_τ u_σ ],  (52)
which leads to
  Π_1 = ξ_1 − ξ_2 = (1/√ζ) [ 1 − n_1 u_σ,  −(1 + (n_2 − 1)u_σ),  −n_3 u_σ,  ...,  −n_τ u_σ ].  (53)
Since Π_2^T = Π_1, we have
  Π_1 Π_2 = (1/ζ) [ (1 − n_1 u_σ)² + (1 + (n_2 − 1)u_σ)² + (n_3 u_σ)² + ... + (n_τ u_σ)² ].  (54)
The lower bound for the ℬ_{1,2} term emerges as
  −ℬ_{1,2} ≥ −Σ_{S_1} Σ_{S_2} |Γ_{i,j}| e^{−[(1−n_1 u_σ)² + (1+(n_2−1)u_σ)² + (n_3 u_σ)² + ... + (n_τ u_σ)²]/(2ζσ₁²)}.  (55)
To condense the notation, define the function
  N_{g₁,g₂}(u_σ) = (1/(2ζ)) [ (n_1 u_σ)² + ... + (1 − n_{g₁} u_σ)² + ... + (1 + (n_{g₂} − 1)u_σ)² + ... + (n_τ u_σ)² ].  (56)
Note that while for S the u_σ term could be factored out, here it cannot, and therefore N must be a function of u_σ. The lower bound for ℬ_{1,2} can then be simplified to
  −ℬ_{1,2} ≥ −Σ_{S_1} Σ_{S_2} |Γ_{i,j}| e^{−N_{1,2}(u_σ)/σ₁²},  (57)
and the general pattern for any ℬ_{g₁,g₂} becomes
  −ℬ_{g₁,g₂} ≥ −Σ_{S_{g₁}} Σ_{S_{g₂}} |Γ_{i,j}| e^{−N_{g₁,g₂}(u_σ)/σ₁²}.  (58)
The lower bound over the entire set S^c is then
  −Σ_{i,j∈S^c} |Γ_{i,j}| e^{−(r_i−r_j)^T W W^T (r_i−r_j)/(2σ₁²)} = −ℬ_{1,2} − ℬ_{1,3} − ... − ℬ_{τ−1,τ} ≥ −Σ_{g₁≠g₂}^τ Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−N_{g₁,g₂}(u_σ)/σ₁²}.  (59)-(60)

Putting S and S^c together.
  H_1 = 𝒲 + ℬ ≥ Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} e^{−N_g u_σ²/σ₁²} − Σ_{g₁≠g₂}^τ Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−N_{g₁,g₂}(u_σ)/σ₁²}.  (61)-(62)
Therefore, we have identified a lower bound that is a function of σ₀ and σ₁:
  L(σ₀, σ₁) = Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} e^{−N_g u_σ²/σ₁²} − Σ_{g₁≠g₂}^τ Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−N_{g₁,g₂}(u_σ)/σ₁²}.  (63)
From the lower bound, it is obvious why it is a function of σ₁. The lower bound is also a function of σ₀ because u_σ is in fact a function of σ₀. To clarify this point specifically, we have the next lemma.
Lemma 2. The u_σ used in Lemma 1 is a function of σ₀, and u_σ approaches zero as σ₀ approaches zero, i.e.,
  lim_{σ₀→0} u_σ = 0.  (64)
Assumptions and Notations.
1. We use Fig. 4 to help clarify the notations. Here we only look at the last 2 layers.
2. We let H₀ be the H of the last layer, and H₁ the H of the current layer.
3. The input data is X with samples x_i, and the outputs of the previous layer are denoted r_i. ψ_{σ₀} is the feature map of the previous layer using σ₀, and ψ_{σ₁} corresponds to the current layer.
4. As defined in Lemma 1, among all pairs r_i ≠ r_j there exists an optimal pair r_i*, r_j* where ⟨r_i*, r_j*⟩ ≥ ⟨r_i, r_j⟩ for all r_i ≠ r_i* and r_j ≠ r_j*. We denote this maximum inner product as
  u_σ = ⟨r_i*, r_j*⟩.  (65)
Figure 4: A 2-layer network.
Proof.
Given Fig. 4, the equation for H₀ is
  H₀ = Σ_{i,j∈S} Γ_{i,j} e^{−(x_i−x_j)^T W W^T (x_i−x_j)/(2σ₀²)} − Σ_{i,j∈S^c} |Γ_{i,j}| e^{−(x_i−x_j)^T W W^T (x_i−x_j)/(2σ₀²)}  (66)
  = Σ_{i,j∈S} Γ_{i,j} ⟨ψ_{σ₀}(x_i), ψ_{σ₀}(x_j)⟩ − Σ_{i,j∈S^c} |Γ_{i,j}| ⟨ψ_{σ₀}(x_i), ψ_{σ₀}(x_j)⟩.  (67)
Notice that as σ₀ → 0, we have
  lim_{σ₀→0} ⟨ψ_{σ₀}(x_i), ψ_{σ₀}(x_j)⟩ = 0 ∀ i ≠ j  and  1 ∀ i = j.  (68)
In other words, as σ₀ → 0, the samples r_i in the RKHS of a Gaussian kernel approach orthogonality to all other samples. This also implies that σ₀ controls the inner-product magnitude, in the RKHS, of the maximal sample pair r_i*, r_j*. We define this maximum inner product via
  ⟨ψ_{σ₀}(x_i*), ψ_{σ₀}(x_j*)⟩ ≥ ⟨ψ_{σ₀}(x_i), ψ_{σ₀}(x_j)⟩,  (69)
or equivalently
  ⟨r_i*, r_j*⟩ ≥ ⟨r_i, r_j⟩.  (70)
Therefore, a given σ₀ controls the upper bound of the inner product. As σ₀ → 0, every sample in the RKHS becomes orthogonal, so the upper bound of ⟨r_i, r_j⟩ also approaches 0 when r_i ≠ r_j. From this, we see the relationship
  lim_{σ₀→0} u_σ = lim_{σ₀→0} exp(−|·|²/(2σ₀²)) = 0,  (71)
where |·| is bounded and has a minimum and a maximum, because we have a finite number of samples.
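A tiny numerical illustration of this limit, under our own choice of random data and bandwidths: as σ shrinks, the Gaussian kernel matrix approaches the identity, so the largest off-diagonal inner product u_σ vanishes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                       # a few arbitrary samples
sq = np.sum(X ** 2, axis=1)
d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T      # pairwise squared distances

for sigma in [1.0, 0.3, 0.1, 0.03]:
    K = np.exp(-np.maximum(d2, 0) / (2 * sigma ** 2))
    u = np.max(K - np.eye(5))                      # largest off-diagonal kernel value, i.e. u_sigma
    print(f"sigma={sigma:4.2f}  u_sigma={u:.2e}")
```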
Lemma 3. Given any fixed σ₁ > 0, the lower bound L(σ₀, σ₁) is a function of σ₀, and as σ₀ → 0, L(σ₀, σ₁) approaches the function
  L*(σ₁) = Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} − Σ_{g₁≠g₂}^τ Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−1/(ζσ₁²)}.  (72)
At this point, if we let σ₁ → 0, we have
  lim_{σ₁→0} L*(σ₁) = Σ_{i,j∈S} Γ_{i,j}  (73)
  = H*.  (74)
Proof.
Given Lemma 2, we know that
  lim_{σ₀→0} u_σ = 0.  (75)
Therefore, letting σ₀ → 0 is equivalent to letting u_σ → 0. Since Lemma 1 provides the equation of a lower bound that is a function of u_σ, this lemma is proven by simply evaluating L(σ₀, σ₁) as u_σ → 0. Following these steps, we have
  L*(σ₁) = lim_{u_σ→0} [ Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} e^{−N_g u_σ²/σ₁²} − Σ_{g₁≠g₂}^τ Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−N_{g₁,g₂}(u_σ)/σ₁²} ]  (76)
  = Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} − Σ_{g₁≠g₂}^τ Σ_{i∈S_{g₁}} Σ_{j∈S_{g₂}} |Γ_{i,j}| e^{−1/(ζσ₁²)}.  (77)
At this point, as σ₁ → 0, our lower bound reaches the global maximum:
  lim_{σ₁→0} L*(σ₁) = Σ_{g=1}^τ Σ_{S_g} Γ_{i,j} = Σ_{i,j∈S} Γ_{i,j}  (78)
  = H*.  (79)
Lemma 4. Given any H_{l−1} and δ > 0, there exist σ₀ > 0 and σ₁ > 0 such that
  H* − H_l ≤ δ.  (80)
Proof.
Observation 1.
Note that the objective of H_l is
  H_l = max_W Σ_{i,j∈S} Γ_{i,j} e^{−(r_i−r_j)^T W W^T (r_i−r_j)/(2σ₁²)} − Σ_{i,j∈S^c} |Γ_{i,j}| e^{−(r_i−r_j)^T W W^T (r_i−r_j)/(2σ₁²)}.  (81)
Since the Gaussian kernel is bounded between 0 and 1, the theoretical maximum of H is attained when the kernel is 1 on S and 0 on S^c, giving the theoretical maximum H* = Σ_{i,j∈S} Γ_{i,j}. Therefore, the inequality of Eq. (80) is equivalent to
  Σ_{i,j∈S} Γ_{i,j} − H_l ≤ δ.  (82)
Observation 2. If we choose σ₀ and σ₁ such that
  L*(σ₁) − L(σ₀, σ₁) ≤ δ/2  and  H* − L*(σ₁) ≤ δ/2,  (83)
then we have identified the condition, with σ₀ > 0 and σ₁ > 0, such that
  Σ_{i,j∈S} Γ_{i,j} − L(σ₀, σ₁) ≤ δ.  (84)
Note that L*(σ₁) is a continuous function of σ₁. Therefore, a σ₁ exists such that L*(σ₁) can be set arbitrarily close to H*. Hence, we choose a σ₁ with the following property:
  H* − L*(σ₁) ≤ δ/2.  (85)
We next fix σ₁. We also know that L(σ₀, σ₁) is a continuous function of σ₀, and it has the limit L*(σ₁) as σ₀ approaches 0; hence there exists a σ₀ where
  L*(σ₁) − L(σ₀, σ₁) ≤ δ/2.  (86)
Then we have:
  L*(σ₁) − L(σ₀, σ₁) ≤ δ/2  and  H* − L*(σ₁) ≤ δ/2.  (87)
By adding the two halves of δ, and noting that H_l ≥ L(σ₀, σ₁) by Lemma 1, we conclude the proof.

Lemma 5. There exists a Kernel Sequence {φ_{l∘}}_{l=1}^L parameterized by a set of weights W_l and a set of bandwidths σ_l such that
  lim_{l→∞} H_l = H*,  H_{l+1} > H_l  ∀ l.  (88)
Before the proof, we use Fig. 5 to illustrate the relationship between the Kernel Sequence {φ_{l∘}}_{l=1}^L and the H-Sequence {H_l}_{l=1}^L it generates. By solving the network greedily, we separate the network into L separable problems. At each additional layer, we rely on the weights learned from the previous layer. For each network, we find σ_{l−1}, σ_l, and W_l for the next network. We also note that since we only need to prove the existence of a solution, this is a proof by construction, i.e., we only need to exhibit an example of its existence. Therefore, the proof consists of constructing an H-Sequence that satisfies the lemma.
Figure 5: Relating the Kernel Sequence to the H-Sequence.
Proof.
We first note that, from Lemma 4, we have previously proven that given any H_{l−1} and δ_l > 0, there exist σ₀ > 0 and σ₁ > 0 such that
  H* − H_l ≤ δ_l.  (89)
This implies that, based on Fig. 5, at any given layer we can get arbitrarily close to H*. Given this, we list the steps used to build the H-Sequence.
Step 1: Define {E_n}_{n=1}^∞ as the sequence of numbers E_n = H* − (H* − H_0)/2^n on the real line. This sequence has the following properties:
  lim_{n→∞} E_n = H*,  E_0 = H_0.  (90)
Using these two properties, for any H_{l−1} ∈ [H_0, H*) there exists a unique n where
  E_n ≤ H_{l−1} < E_{n+1}.  (91)
Step 2: For any given l, we choose δ_l to satisfy Eq. (89) by the following procedure. First, find an n that satisfies
  E_n ≤ H_{l−1} < E_{n+1},  (92)
and second, define δ_l to be
  δ_l = H* − E_{n+1}.  (93)
From the previous layer's choice, the following already holds:
  H* − H_{l−1} ≤ δ_{l−1},  (94)
and, furthermore, we found n such that
  E_n ≤ H_{l−1} < E_{n+1}  ⟹  H* − E_n ≥ H* − H_{l−1} > H* − E_{n+1}.  (95)
Thus, combining Eq. (93), Eq. (94), and Eq. (95), we have
  δ_{l−1} > δ_l.  (96)
Therefore, {δ_l} is a decreasing sequence.
Step 3: Note that {E_n} is a converging sequence where
  lim_{n→∞} E_n = lim_{n→∞} [ H* − (H* − H_0)/2^n ] = H*.  (97)
Therefore, {Δ_n} = H* − {E_n} is also a converging sequence, where
  lim_{n→∞} Δ_n = lim_{n→∞} (H* − H_0)/2^n = 0,  (98)
and {δ_l} is a subsequence of {Δ_n}. Since any subsequence of a converging sequence also converges to the same limit, we know that
  lim_{l→∞} δ_l = 0.  (99)
Following this construction, we always choose H_l such that
  H* − H_l ≤ δ_l.  (100)
As l → ∞, the inequality becomes
  H* − lim_{l→∞} H_l ≤ lim_{l→∞} δ_l  (101)
  ≤ 0.  (102)
Since we know that
  H* − H_l ≥ 0  ∀ l,  (103)
the condition
  0 ≤ H* − lim_{l→∞} H_l ≤ 0  (104)
is true only if
  H* − lim_{l→∞} H_l = 0.  (105)
This allows us to conclude that
  H* = lim_{l→∞} H_l.  (106)
Given Eq. (91) and Eq. (93), at each step we have
\[
\mathcal{H}_{l-1} < E_{n+1} \tag{107}
\]
\[
\le \mathcal{H}^* - \delta_l. \tag{108}
\]
Rearranging this inequality, we have
\[
\delta_l < \mathcal{H}^* - \mathcal{H}_{l-1}. \tag{109}
\]
Combining the inequalities from Eq. (109) and Eq. (100), we obtain the following chain:
\[
\mathcal{H}^* - \mathcal{H}_l \le \delta_l < \mathcal{H}^* - \mathcal{H}_{l-1} \tag{110}
\]
\[
\mathcal{H}^* - \mathcal{H}_l < \mathcal{H}^* - \mathcal{H}_{l-1} \tag{111}
\]
\[
-\mathcal{H}_l < -\mathcal{H}_{l-1} \tag{112}
\]
\[
\mathcal{H}_l > \mathcal{H}_{l-1}, \tag{113}
\]
which concludes the proof of the lemma.
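The construction above is straightforward to simulate: $E_n = \mathcal{H}^* - (\mathcal{H}^* - \mathcal{H}_0)/n$, $\delta_l = \mathcal{H}^* - E_{n+1}$ for the unique $n$ bracketing $\mathcal{H}_{l-1}$, and any $\mathcal{H}_l$ within $\delta_l$ of $\mathcal{H}^*$ is forced to improve on $\mathcal{H}_{l-1}$. A minimal sketch of this bookkeeping follows; the values of $\mathcal{H}_0$, $\mathcal{H}^*$, and the per-layer improvement rule are made-up toy choices.

import numpy as np

H_star, H0 = 10.0, 2.0                       # toy values for the optimum and the initial HSIC
E = lambda n: H_star - (H_star - H0) / n     # the auxiliary sequence of Eq. (90)

H_prev, deltas, Hs = H0, [], [H0]
for layer in range(1, 8):
    # Step 2: find the unique n with E(n) <= H_prev < E(n+1), then set delta_l.
    n = 1
    while not (E(n) <= H_prev < E(n + 1)):
        n += 1
    delta_l = H_star - E(n + 1)              # Eq. (93)
    # Any H_l within delta_l of H_star satisfies Eq. (100); pick one such value.
    H_l = H_star - 0.5 * delta_l
    deltas.append(delta_l)
    Hs.append(H_l)
    H_prev = H_l

print(np.round(deltas, 4))   # decreasing and tending to 0  (Eqs. 96, 99)
print(np.round(Hs, 4))       # strictly increasing toward H_star  (Eq. 113)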
Lemma 6. Given $\sqrt{\zeta}$ as a normalizing constant for $W_s = \sqrt{\zeta} \sum_{\alpha} r_{\alpha}$ such that $W_s^T W_s = I$, $W_s$ is not guaranteed to be the optimal solution of the HSIC objective.

Proof. We start with the Lagrangian
\[
\mathcal{L} = -\sum_{i,j} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} - \operatorname{Tr}\!\left( \Lambda (W^T W - I) \right). \tag{115}
\]
Taking the derivative with respect to $W$, we get
\[
\nabla \mathcal{L} = \frac{1}{\sigma^2} \sum_{i,j} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} (r_i - r_j)(r_i - r_j)^T W - W\Lambda. \tag{116}
\]
Setting the gradient to 0, we have
\[
\frac{1}{\sigma^2} \sum_{i,j} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} (r_i - r_j)(r_i - r_j)^T W = W\Lambda, \tag{117}
\]
\[
Q_l W = W\Lambda. \tag{118}
\]
From Eq. (118), we see that $W$ can only be an optimal solution when $W$ is an eigenvector of $Q_l$. Therefore, setting $W$ to $W_s = \sqrt{\zeta} \sum_{\alpha} r_{\alpha}$ is not guaranteed to yield an optimum.
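The first-order condition $Q_l W = W\Lambda$ in Eq. (118) is easy to probe numerically: build $Q_l$ at a candidate $W$ and check whether that $W$ is an eigenvector. The sketch below does this for the normalized sample sum $W_s$; the toy data, the uniform Γ weights, and the single-column projection are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(size=(30, 4))                 # toy inputs r_i
Gamma = np.ones((30, 30))                    # assumption: uniform pair weights
sigma = 1.0

def Q_matrix(R, Gamma, W, sigma):
    # Build Q_l from Eq. (117): a kernel-weighted sum of (r_i - r_j)(r_i - r_j)^T.
    diff = R[:, None, :] - R[None, :, :]                        # (n, n, d)
    K = np.exp(-np.sum((diff @ W) ** 2, axis=-1) / (2 * sigma ** 2))
    weights = Gamma * K / sigma ** 2
    return np.einsum('ij,ijk,ijl->kl', weights, diff, diff)     # sum_ij w_ij A_ij

# Candidate W_s: normalized sum of the samples (a single column).
w = R.sum(axis=0, keepdims=True).T
W_s = w / np.linalg.norm(w)

Q = Q_matrix(R, Gamma, W_s, sigma)
residual = Q @ W_s - W_s * (W_s.T @ Q @ W_s)   # zero iff W_s is an eigenvector of Q
print(np.linalg.norm(residual))                # typically far from zero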
Appendix B Proof for Theorem 2

Theorem 2: As $l \to \infty$ and $\mathcal{H}_l \to \mathcal{H}^*$, the following properties are satisfied:

I. the scatter ratio approaches 0, i.e.,
\[
\lim_{l \to \infty} \frac{\operatorname{Tr}(S_w^l)}{\operatorname{Tr}(S_b^l)} = 0; \tag{119}
\]
II. the Kernel Sequence converges to the following kernel:
\[
\lim_{l \to \infty} K(x_i, x_j)_l = K^* =
\begin{cases}
0 & \forall\, i, j \in \mathcal{S}^c \\
1 & \forall\, i, j \in \mathcal{S}.
\end{cases} \tag{120}
\]

Proof.
We start by proving property II, beginning from the $\mathcal{H}$ objective with a Gaussian kernel (GK):
\[
\max_W \sum_{i,j \in \mathcal{S}} \Gamma_{i,j}\, \mathcal{K}_W(r_i, r_j) - \sum_{i,j \in \mathcal{S}^c} |\Gamma_{i,j}|\, \mathcal{K}_W(r_i, r_j) \tag{121}
\]
\[
\max_W \sum_{i,j \in \mathcal{S}} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} - \sum_{i,j \in \mathcal{S}^c} |\Gamma_{i,j}|\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}}. \tag{122}
\]
Given that $\mathcal{H}_l \to \mathcal{H}^*$ and that $0 \le \mathcal{K}_W \le 1$, the following condition must hold:
\[
\mathcal{H}^* = \sum_{i,j \in \mathcal{S}} \Gamma_{i,j} = \sum_{i,j \in \mathcal{S}} \Gamma_{i,j}(1) - \sum_{i,j \in \mathcal{S}^c} |\Gamma_{i,j}|(0). \tag{123}
\]
Based on Eq. (89), our construction at each layer ensures that
\[
\mathcal{H}^* - \mathcal{H}_l \le \delta_l. \tag{124}
\]
Substituting the definitions of $\mathcal{H}^*$ and $\mathcal{H}_l$, we have
\[
\sum_{i,j \in \mathcal{S}} \Gamma_{i,j}(1) - \left[ \sum_{i,j \in \mathcal{S}} \Gamma_{i,j}\, \mathcal{K}_W(r_i, r_j) - \sum_{i,j \in \mathcal{S}^c} |\Gamma_{i,j}|\, \mathcal{K}_W(r_i, r_j) \right] \le \delta_l \tag{125}
\]
\[
\sum_{i,j \in \mathcal{S}} \Gamma_{i,j}\left( 1 - \mathcal{K}_W(r_i, r_j) \right) + \sum_{i,j \in \mathcal{S}^c} |\Gamma_{i,j}|\, \mathcal{K}_W(r_i, r_j) \le \delta_l. \tag{126}
\]
Since every term within the summations in Eq. (126) is non-negative, this implies
\[
1 - \mathcal{K}_W(r_i, r_j) \le \delta_l \quad i, j \in \mathcal{S} \tag{127}
\]
\[
\mathcal{K}_W(r_i, r_j) \le \delta_l \quad i, j \in \mathcal{S}^c. \tag{128}
\]
So as $l \to \infty$ and $\delta_l \to 0$, every kernel entry approaches the limit kernel; taking the limit on both sides and using the fact, proven in Theorem 1, that $\lim_{l \to \infty} \delta_l = 0$, we obtain
\[
\lim_{l \to \infty} \mathcal{K}_W(r_i, r_j) \ge 1 \quad i, j \in \mathcal{S} \tag{129}
\]
\[
\lim_{l \to \infty} \mathcal{K}_W(r_i, r_j) \le 0 \quad i, j \in \mathcal{S}^c. \tag{130}
\]
Since $0 \le \mathcal{K}_W \le 1$, both inequalities must hold with equality. Therefore, at the limit point, $\mathcal{K}_W$ has the form
\[
\mathcal{K}^* =
\begin{cases}
0 & \forall\, i, j \in \mathcal{S}^c \\
1 & \forall\, i, j \in \mathcal{S}.
\end{cases} \tag{131}
\]

First Property: Using Eq. (127) and Eq. (128), we have
\[
1 - \delta_l \le e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} \quad i, j \in \mathcal{S} \tag{132}
\]
\[
e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} \le \delta_l \quad i, j \in \mathcal{S}^c. \tag{133}
\]
As $\lim_{l \to \infty} \delta_l = 0$, taking the limit on both sides leads to
\[
e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} = 1 \;\; \forall\, i, j \in \mathcal{S}, \qquad
e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} = 0 \;\; \forall\, i, j \in \mathcal{S}^c. \tag{134}
\]
Taking the log of these conditions, we get
\[
\begin{cases}
\frac{1}{2\sigma^2}(r_i - r_j)^T W W^T (r_i - r_j) = 0 & \forall\, i, j \in \mathcal{S} \\
\frac{1}{2\sigma^2}(r_i - r_j)^T W W^T (r_i - r_j) = \infty & \forall\, i, j \in \mathcal{S}^c.
\end{cases} \tag{135}
\]
This implies that as $l \to \infty$ we have
\[
\lim_{l \to \infty} \sum_{i,j \in \mathcal{S}} \frac{1}{2\sigma^2}(r_i - r_j)^T W W^T (r_i - r_j) = \lim_{l \to \infty} \operatorname{Tr}(S_w) = 0, \tag{136}
\]
\[
\lim_{l \to \infty} \sum_{i,j \in \mathcal{S}^c} \frac{1}{2\sigma^2}(r_i - r_j)^T W W^T (r_i - r_j) = \lim_{l \to \infty} \operatorname{Tr}(S_b) = \infty. \tag{137}
\]
This yields the ratio
\[
\lim_{\mathcal{H}_l \to \mathcal{H}^*} \frac{\operatorname{Tr}(S_w)}{\operatorname{Tr}(S_b)} = \frac{0}{\infty} = 0. \tag{138}
\]
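Theorem 2 can be visualized numerically: as the projected classes collapse internally and separate from each other, the Gaussian kernel approaches the 0/1 block matrix of Eq. (131) and the scatter ratio of Eq. (138) approaches 0. Below is a minimal sketch with two toy configurations; the data, bandwidth, and helper name are illustrative assumptions.

import numpy as np

def block_kernel_and_scatter_ratio(R, labels, W, sigma):
    # Return the projected Gaussian kernel and Tr(S_w)/Tr(S_b) as used in Thm. 2.
    Z = R @ W
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma ** 2))
    same = labels[:, None] == labels[None, :]
    tr_sw = np.sum(sq[same]) / (2 * sigma ** 2)     # within-class scatter, Eq. (136)
    tr_sb = np.sum(sq[~same]) / (2 * sigma ** 2)    # between-class scatter, Eq. (137)
    return K, tr_sw / tr_sb

rng = np.random.default_rng(0)
labels = np.array([0] * 25 + [1] * 25)
# Two configurations: poorly separated classes vs. nearly collapsed, far-apart classes.
R_far = np.vstack([rng.normal(0, 1, (25, 2)), rng.normal(1, 1, (25, 2))])
R_tight = np.vstack([rng.normal(0, 0.01, (25, 2)), rng.normal(8, 0.01, (25, 2))])

for R in (R_far, R_tight):
    K, ratio = block_kernel_and_scatter_ratio(R, labels, np.eye(2), sigma=1.0)
    same = labels[:, None] == labels[None, :]
    print(f"scatter ratio={ratio:.4f}  mean K (same)={K[same].mean():.3f}  "
          f"mean K (diff)={K[~same].mean():.3f}")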
Appendix C Proof for Theorem 3

Theorem 3: The objective of Eq. (5) is equivalent to
\[
\sum_{i,j} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} \left( r_i^T W W^T r_j \right) - \sum_i D_i(W)\, \| W^T r_i \|^2. \tag{139}
\]

Proof. Let $A_{i,j} = (r_i - r_j)(r_i - r_j)^T$. Given the Lagrangian of the HSIC objective as
\[
\mathcal{L} = -\sum_{i,j} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} - \operatorname{Tr}\!\left[ \Lambda (W^T W - I) \right], \tag{140}
\]
our layer-wise HSIC objective becomes
\[
\min_W\; -\sum_{i,j} \Gamma_{i,j}\, e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}} - \operatorname{Tr}\!\left[ \Lambda (W^T W - I) \right]. \tag{141}
\]
Taking the derivative of the Lagrangian, the expression becomes
\[
\nabla_W \mathcal{L}(W, \Lambda) = \sum_{i,j} \frac{\Gamma_{i,j}}{\sigma^2}\, e^{-\frac{\operatorname{Tr}(W^T A_{i,j} W)}{2\sigma^2}} A_{i,j} W - W\Lambda. \tag{142}
\]
Setting the gradient to 0 and consolidating the scalar factors into $\hat{\Gamma}_{i,j}$, we get the expressions
\[
\sum_{i,j} \frac{\Gamma_{i,j}}{\sigma^2}\, e^{-\frac{\operatorname{Tr}(W^T A_{i,j} W)}{2\sigma^2}} A_{i,j} W = W\Lambda \tag{143}
\]
\[
\sum_{i,j} \hat{\Gamma}_{i,j} A_{i,j} W = W\Lambda \tag{144}
\]
\[
Q W = W\Lambda. \tag{145}
\]
From here, we see that the optimal solution is an eigenvector of $Q$. ISM further proves that the optimal solution is not just any eigenvector, but the eigenvectors associated with the smallest eigenvalues of $Q$. From this logic, ISM solves objective (141) with the surrogate objective
\[
\min_W \operatorname{Tr}\!\left( W^T \sum_{i,j} \hat{\Gamma}_{i,j} A_{i,j}\, W \right) \quad \text{s.t.} \quad W^T W = I. \tag{146}
\]
Given $D_{\hat{\Gamma}}$ as the degree matrix of $\hat{\Gamma}$ and $R = [r_1, r_2, \dots]^T$, ISM further shows that Eq. (146) can be rewritten as
\[
\min_W \operatorname{Tr}\!\left( W^T R^T \left[ D_{\hat{\Gamma}} - \hat{\Gamma} \right] R W \right) \quad \text{s.t.} \quad W^T W = I \tag{147}
\]
\[
\max_W \operatorname{Tr}\!\left( W^T R^T \left[ \hat{\Gamma} - D_{\hat{\Gamma}} \right] R W \right) \quad \text{s.t.} \quad W^T W = I \tag{148}
\]
\[
\max_W \operatorname{Tr}\!\left( W^T R^T \hat{\Gamma} R W \right) - \operatorname{Tr}\!\left( W^T R^T D_{\hat{\Gamma}} R W \right) \quad \text{s.t.} \quad W^T W = I \tag{149}
\]
\[
\max_W \operatorname{Tr}\!\left( \hat{\Gamma} R W W^T R^T \right) - \operatorname{Tr}\!\left( D_{\hat{\Gamma}} R W W^T R^T \right) \quad \text{s.t.} \quad W^T W = I \tag{150}
\]
\[
\max_W \sum_{i,j} \hat{\Gamma}_{i,j} \left[ R W W^T R^T \right]_{i,j} - \sum_{i,j} (D_{\hat{\Gamma}})_{i,j} \left[ R W W^T R^T \right]_{i,j} \quad \text{s.t.} \quad W^T W = I. \tag{151}
\]
Since the jump from Eq. (146) can be intimidating for readers not familiar with the literature, we include a more detailed derivation in App. D.

Note that the degree matrix $D_{\hat{\Gamma}}$ has non-zero entries only on its diagonal; all of its off-diagonal entries are 0. Given $[R W W^T R^T]_{i,j} = r_i^T W W^T r_j$, the objective becomes
\[
\max_W \sum_{i,j} \hat{\Gamma}_{i,j} \left( r_i^T W W^T r_j \right) - \sum_i D_i(W)\, \| W^T r_i \|^2 \quad \text{s.t.} \quad W^T W = I. \tag{152}
\]
Here, we treat $D_i$ as a penalty weight on the norm of $W^T r_i$ for every sample.

To better understand the behavior of $D_i(W)$, note that the $\hat{\Gamma}$ matrix has the block form
\[
\hat{\Gamma} = \frac{1}{\sigma^2}
\begin{bmatrix}
\left[ \Gamma_{\mathcal{S}}\, \mathcal{K}_W(r_i, r_j) \right] & \left[ -|\Gamma_{\mathcal{S}^c}|\, \mathcal{K}_W(r_i, r_j) \right] & \cdots \\
\left[ -|\Gamma_{\mathcal{S}^c}|\, \mathcal{K}_W(r_i, r_j) \right] & \left[ \Gamma_{\mathcal{S}}\, \mathcal{K}_W(r_i, r_j) \right] & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}, \tag{153}
\]
where $\mathcal{K}_W(r_i, r_j) = e^{-\frac{(r_i - r_j)^T W W^T (r_i - r_j)}{2\sigma^2}}$. The diagonal blocks contain the $\Gamma_{i,j}$ elements that belong to $\mathcal{S}$, and the off-diagonal blocks contain the elements that belong to $\mathcal{S}^c$. Each penalty term is the sum of its corresponding row. Hence, we can write the penalty term as
\[
D_i(W_l) = \frac{1}{\sigma^2} \sum_{j \in \mathcal{S}|i} \Gamma_{i,j}\, \mathcal{K}_{W_l}(r_i, r_j) - \frac{1}{\sigma^2} \sum_{j \in \mathcal{S}^c|i} |\Gamma_{i,j}|\, \mathcal{K}_{W_l}(r_i, r_j). \tag{154}
\]
This shows that as $W$ improves the objective, the penalty term also increases. In fact, at the extreme, as $\mathcal{H}_l \to \mathcal{H}^*$, all the negative terms vanish, all the positive terms are maximized, and the matrix approaches
\[
\hat{\Gamma}^* = \frac{1}{\sigma^2}
\begin{bmatrix}
\left[ \Gamma_{\mathcal{S}} \right] & [0] & \cdots \\
{[0]} & \left[ \Gamma_{\mathcal{S}} \right] & \cdots \\
\vdots & \vdots & \ddots
\end{bmatrix}. \tag{155}
\]
From the matrix $\hat{\Gamma}^*$ and the definition of $D_i(W_l)$, we see that as the kernel values $\mathcal{K}_W$ over $\mathcal{S}$ increase, the row sums increase as well. Since $D_i(W)$ is the degree of the $i$th row of $\hat{\Gamma}$, it follows that as $\mathcal{H}_l \to \mathcal{H}^*$,
\[
D_i^*(W) > D_i(W). \tag{156}
\]
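The eigen-decomposition view of Eqs. (146)–(147) can be sketched as a single ISM-style fixed-point step: rebuild $\hat{\Gamma}$ at the current $W$, form $\Phi = R^T (D_{\hat{\Gamma}} - \hat{\Gamma}) R$, and keep the eigenvectors associated with the smallest eigenvalues. The toy data, the $\pm 1$ pair weights, the fixed output width q, and the small number of iterations below are illustrative assumptions; the paper's ISM solver additionally ties the output width to the spectrum of $\Phi$ (see App. G).

import numpy as np

def ism_step(R, Gamma, W, sigma, q=2):
    # One ISM-style update for Eq. (147): rebuild Gamma_hat at the current W and
    # return the q eigenvectors of Phi = R^T (D - Gamma_hat) R with smallest eigenvalues.
    diff = R[:, None, :] - R[None, :, :]
    K = np.exp(-np.sum((diff @ W) ** 2, axis=-1) / (2 * sigma ** 2))
    Gamma_hat = Gamma * K / sigma ** 2
    D = np.diag(Gamma_hat.sum(axis=1))       # degree matrix of Gamma_hat
    Phi = R.T @ (D - Gamma_hat) @ R
    _, vecs = np.linalg.eigh(Phi)            # eigh returns ascending eigenvalues
    return vecs[:, :q]                       # smallest-eigenvalue directions

rng = np.random.default_rng(0)
R = rng.normal(size=(40, 5))
y = np.array([0] * 20 + [1] * 20)
Gamma = np.where(y[:, None] == y[None, :], 1.0, -1.0)   # toy +/- pair weights
W = np.eye(5)
for _ in range(3):                                       # a few fixed-point iterations
    W = ism_step(R, Gamma, W, sigma=1.0)
print(W.shape)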
Appendix D Derivation of $\sum_{i,j} \Psi_{i,j} (x_i - x_j)(x_i - x_j)^T = 2 X^T (D_\Psi - \Psi) X$

Since $\Psi$ is a symmetric matrix and $A_{i,j} = (x_i - x_j)(x_i - x_j)^T$, we can rewrite the expression as
\[
\sum_{i,j} \Psi_{i,j} A_{i,j}
= \sum_{i,j} \Psi_{i,j} (x_i - x_j)(x_i - x_j)^T
= \sum_{i,j} \Psi_{i,j} \left( x_i x_i^T - x_j x_i^T - x_i x_j^T + x_j x_j^T \right)
= 2 \sum_{i,j} \Psi_{i,j} \left( x_i x_i^T - x_i x_j^T \right)
= 2 \left[ \sum_{i,j} \Psi_{i,j}\, x_i x_i^T \right] - 2 \left[ \sum_{i,j} \Psi_{i,j}\, x_i x_j^T \right].
\]
Expanding the first term, we get
\[
2 \sum_i^n \sum_j^n \Psi_{i,j}\, x_i x_i^T = 2 \sum_i \left( \Psi_{i,1}\, x_i x_i^T + \dots + \Psi_{i,n}\, x_i x_i^T \right) \tag{157}
\]
\[
= 2 \sum_i^n \left( \Psi_{i,1} + \Psi_{i,2} + \dots \right) x_i x_i^T \tag{158}
\]
\[
= 2 \sum_i^n d_i\, x_i x_i^T \tag{159}
\]
\[
= 2 X^T D_\Psi X. \tag{160}
\]
Given $\Psi_i$ as the $i$th row of $\Psi$, we next look at the second term:
\[
2 \sum_i \sum_j \Psi_{i,j}\, x_i x_j^T = 2 \sum_i \left( \Psi_{i,1}\, x_i x_1^T + \Psi_{i,2}\, x_i x_2^T + \Psi_{i,3}\, x_i x_3^T + \dots \right) \tag{161}
\]
\[
= 2 \sum_i \left( x_i (\Psi_{i,1} x_1^T) + x_i (\Psi_{i,2} x_2^T) + x_i (\Psi_{i,3} x_3^T) + \dots \right) \tag{162}
\]
\[
= 2 \sum_i x_i \left[ (\Psi_{i,1} x_1^T) + (\Psi_{i,2} x_2^T) + (\Psi_{i,3} x_3^T) + \dots \right] \tag{163}
\]
\[
= 2 \sum_i x_i \left[ X^T \Psi_i^T \right]^T \tag{164}
\]
\[
= 2 \sum_i x_i \left[ \Psi_i X \right] \tag{165}
\]
\[
= 2 \left[ x_1 \Psi_1 X + x_2 \Psi_2 X + x_3 \Psi_3 X + \dots \right] \tag{166}
\]
\[
= 2 \left[ x_1 \Psi_1 + x_2 \Psi_2 + x_3 \Psi_3 + \dots \right] X \tag{167}
\]
\[
= 2 X^T \Psi X. \tag{168}
\]
Putting both terms together, we get
\[
\sum_{i,j} \Psi_{i,j} A_{i,j} = 2 X^T D_\Psi X - 2 X^T \Psi X \tag{170}
\]
\[
= 2 X^T \left[ D_\Psi - \Psi \right] X. \tag{171}
\]
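The identity derived above is easy to verify numerically on random data; the sizes and the random symmetric $\Psi$ below are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 4
X = rng.normal(size=(n, d))
Psi = rng.normal(size=(n, n))
Psi = (Psi + Psi.T) / 2                       # the identity requires a symmetric Psi

# Left-hand side: sum_ij Psi_ij (x_i - x_j)(x_i - x_j)^T
lhs = np.zeros((d, d))
for i in range(n):
    for j in range(n):
        v = (X[i] - X[j])[:, None]
        lhs += Psi[i, j] * (v @ v.T)

# Right-hand side: 2 X^T (D_Psi - Psi) X with D_Psi the diagonal matrix of row sums
D = np.diag(Psi.sum(axis=1))
rhs = 2 * X.T @ (D - Psi) @ X

print(np.allclose(lhs, rhs))   # True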
Appendix E Proof for Corollary 1 and 2

Corollary 1: Given $\mathcal{H}_l \to \mathcal{H}^*$, the network output in IDS solves MSE via a translation of the labels.

Proof. As $\mathcal{H}_l \to \mathcal{H}^*$, Thm. 2 shows that samples of the same class are mapped onto the same point. Assume that $\phi$ has mapped the samples onto $c$ points $\alpha = [\alpha_1, \dots, \alpha_c]$ that differ from the true labels $\xi = [\xi_1, \dots, \xi_c]$. Then the MSE objective is minimized by translating the $\phi$ output by
\[
\xi - \alpha. \tag{173}
\]

Corollary 2: Given $\mathcal{H}_l \to \mathcal{H}^*$, the network output in RKHS solves CE via a change of basis.

Assumptions and Notations.
1. $n$ is the number of samples.
2. $\tau$ is the number of classes.
3. $y_i \in \mathbb{R}^\tau$ is the ground-truth label for the $i$th sample. It is one-hot encoded: the $j$th element is 1 if $x_i$ belongs to the $j$th class, and all other elements are 0.
4. We denote the network as $\phi$ and the network output as $\hat{y}_i \in \mathbb{R}^\tau$, where $\hat{y}_i = \phi(x_i)$. We also assume that $\hat{y}_i$ is constrained to a probability simplex, i.e., its entries are non-negative and sum to 1.
5. We denote the $j$th element of $y_i$ and $\hat{y}_i$ as $y_{i,j}$ and $\hat{y}_{i,j}$ respectively.
6. We define the Orthogonality Condition:
A set of outputs $\{\hat{y}_1, \dots, \hat{y}_n\}$ satisfies the orthogonality condition if
\[
\begin{cases}
\langle \hat{y}_i, \hat{y}_j \rangle = 1 & \forall\, i, j \text{ in the same class} \\
\langle \hat{y}_i, \hat{y}_j \rangle = 0 & \forall\, i, j \text{ not in the same class.}
\end{cases} \tag{174}
\]
7. We define the Cross-Entropy objective as
\[
\arg\min_{\phi} -\sum_{i=1}^{n} \sum_{j=1}^{\tau} y_{i,j} \log\!\left( \phi(x_i)_j \right). \tag{175}
\]

Proof.
From Thm. 2, we know that the network outputs $\{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n\}$ satisfy the orthogonality condition at $\mathcal{H}^*$. Then there exists a set of orthogonal bases, represented by $\Xi = [\xi_1, \xi_2, \dots, \xi_c]$, that maps $\{\hat{y}_1, \hat{y}_2, \dots, \hat{y}_n\}$ to simulate the output of a softmax layer. Let $\xi_i = \hat{y}_j$, $j \in \mathcal{S}_i$, i.e., for the $i$th class we arbitrarily choose one of the samples from this class and assign $\xi_i$ to be that sample's output. Note that in our problem $\langle \hat{y}_i, \hat{y}_i \rangle = 1$, so if $\langle \hat{y}_i, \hat{y}_j \rangle = 1$, then subtracting the two gives $\langle \hat{y}_i, \hat{y}_i - \hat{y}_j \rangle = 0$, which is equivalent to $\hat{y}_i = \hat{y}_j$. The representation is therefore well defined and independent of the choice of sample from each group, provided the outputs satisfy the orthogonality condition. Now we define the transformed labels $Y$ as
\[
Y = \hat{Y} \Xi. \tag{176}
\]
Note that $Y = [y_1, y_2, \dots, y_n]^T$, where each $y_i$ is a one-hot vector representing the class membership of the $i$th sample among the $c$ classes. Since $\Xi$, as the change of basis, matches $\hat{Y}$ to $Y$ exactly, CE is minimized.
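The change-of-basis argument can be sketched in a few lines: once the outputs satisfy the orthogonality condition of Eq. (174), stacking one representative output per class as $\Xi$ and forming $\hat{Y}\Xi$ reproduces the one-hot labels. The perfectly orthogonal toy outputs below are an assumption made to keep the example short.

import numpy as np

# Toy outputs that already satisfy the orthogonality condition (Eq. 174):
# every sample of class k is mapped to the same unit vector u_k, and the u_k
# are mutually orthogonal (rotated standard basis vectors).
rng = np.random.default_rng(0)
U, _ = np.linalg.qr(rng.normal(size=(3, 3)))     # 3 orthonormal class directions
labels = np.array([0, 0, 1, 2, 1, 2, 0])
Y_hat = U[:, labels].T                           # (n, 3) network outputs

# Change of basis: one arbitrarily chosen representative per class (Eq. 176).
Xi = np.stack([Y_hat[np.argmax(labels == k)] for k in range(3)], axis=1)
Y = Y_hat @ Xi                                   # recovers the one-hot labels

print(np.allclose(Y, np.eye(3)[labels]))         # True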
Appendix F Dataset Details

No samples were excluded from any of the datasets.

Wine.
This dataset has 13 features, 178 samples, and 3 classes. The features are continuous and heavily unbalanced in magnitude. The dataset can be downloaded at https://archive.ics.uci.edu/ml/datasets/wine.
Divorce.
This dataset has 54 features, 170 samples, and 2 classes. The features are discrete and balanced in magnitude. The dataset can be downloaded at https://archive.ics.uci.edu/ml/datasets/Divorce+Predictors+data+set.
Car.
This dataset has 6 features, 1728 samples, and 2 classes. The features are discrete and balanced in magnitude. The dataset can be downloaded at https://archive.ics.uci.edu/ml/datasets/Car+Evaluation.
Cancer.
This dataset has 9 features, 683 samples, and 2 classes. The features are discrete and unbalanced in magnitude. The dataset can be downloaded at https://archive.ics.uci.edu/ml/datasets/Breast+Cancer+Wisconsin+(Diagnostic).

Face.
This dataset consists of images of 20 people in various poses. The 624 images are vectorized into 960 features. The dataset can be downloaded at https://archive.ics.uci.edu/ml/datasets/CMU+Face+Images.

Random.
This dataset has 2 features, 80 samples, and 2 classes. It is generated from a Gaussian distribution, where half of the samples are randomly labeled as 1 and the other half as 0.
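A minimal sketch of how such a randomly labeled Gaussian set can be generated; the seed and the exact label-assignment call are illustrative assumptions rather than the paper's script.

import numpy as np

rng = np.random.default_rng(0)
n = 80
X = rng.normal(size=(n, 2))                                    # a single Gaussian blob
Y = rng.permutation(np.repeat([0, 1], n // 2)).reshape(-1, 1)  # half 0s, half 1s, at random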
Adversarial.
This dataset has 2 features, 80 samples, and 2 classes. It is generated with the following code:

* np.random.randn(n, 2)
X = np.vstack((X1, X2))
Y = np.vstack((np.zeros((n, 1)), np.ones((n, 1))))

Appendix G W_l Dimensions for Each Fold of Each Dataset
We report the input and output dimensions of each $W_l$ for every layer of each dataset in the form $(\alpha, \beta)$; the corresponding dimension is $W_l \in \mathbb{R}^{\alpha \times \beta}$. Since each dataset is split into 10 folds, the network structure of each fold is reported. We note that the input of the first layer is the dimension of the original data; after the first layer, the input width of each $W_l$ is the RFF width of the previous layer's output, which we set to 300 here.

The $\beta$ value is chosen by the ISM algorithm. By keeping only the most dominant eigenvectors of the $\Phi$ matrix, the output dimension of each layer corresponds to the rank of $\Phi$. It can be seen for each dataset that the first layer significantly expands the rank. The expansion is generally followed by a compression to fewer and fewer eigenvalues. These results conform with the observations made by Montavon et al. [43] and Ansuini et al. [42].

Data  Layer 1  Layer 2  Layer 3  Layer 4
adversarial 1  (2, 2)  (300, 61)  (300, 35)
adversarial 2  (2, 2)  (300, 61)  (300, 35)
adversarial 3  (2, 2)  (300, 61)  (300, 8)  (300, 4)
adversarial 4  (2, 2)  (300, 61)  (300, 29)
adversarial 5  (2, 2)  (300, 61)  (300, 29)
adversarial 6  (2, 2)  (300, 61)  (300, 7)  (300, 4)
adversarial 7  (2, 2)  (300, 61)  (300, 34)
adversarial 8  (2, 2)  (300, 12)  (300, 61)  (300, 30)
adversarial 9  (2, 2)  (300, 61)  (300, 33)
adversarial 10  (2, 2)  (300, 61)  (300, 33)

Data  Layer 1  Layer 2  Layer 3
Random 1  (3, 3)  (300, 47)  (300, 25)
Random 2  (3, 3)  (300, 46)  (300, 25)
Random 3  (3, 3)  (300, 46)  (300, 25)
Random 4  (3, 3)  (300, 47)  (300, 4)
Random 5  (3, 3)  (300, 47)  (300, 25)
Random 6  (3, 3)  (300, 45)  (300, 23)
Random 7  (3, 3)  (300, 45)  (300, 25)
Random 8  (3, 3)  (300, 45)  (300, 21)
Random 9  (3, 3)  (300, 45)  (300, 26)
Random 10  (3, 3)  (300, 47)  (300, 25)

Data  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6
spiral 1  (2, 2)  (300, 15)  (300, 6)  (300, 7)  (300, 6)
spiral 2  (2, 2)  (300, 13)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
spiral 3  (2, 2)  (300, 12)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
spiral 4  (2, 2)  (300, 13)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
spiral 5  (2, 2)  (300, 13)  (300, 6)  (300, 7)  (300, 6)
spiral 6  (2, 2)  (300, 14)  (300, 6)  (300, 7)  (300, 6)
spiral 7  (2, 2)  (300, 14)  (300, 6)  (300, 7)  (300, 6)
spiral 8  (2, 2)  (300, 14)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
spiral 9  (2, 2)  (300, 13)  (300, 6)  (300, 7)  (300, 6)
spiral 10  (2, 2)  (300, 14)  (300, 6)  (300, 7)  (300, 6)

Data  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6
wine 1  (13, 11)  (300, 76)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
wine 2  (13, 11)  (300, 76)  (300, 6)  (300, 6)  (300, 6)  (300, 6)
wine 3  (13, 11)  (300, 75)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
wine 4  (13, 11)  (300, 76)  (300, 6)  (300, 6)  (300, 6)  (300, 6)
wine 5  (13, 11)  (300, 74)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
wine 6  (13, 11)  (300, 74)  (300, 6)  (300, 6)  (300, 6)  (300, 6)
wine 7  (13, 11)  (300, 74)  (300, 6)  (300, 6)  (300, 6)  (300, 6)
wine 8  (13, 11)  (300, 75)  (300, 6)  (300, 7)  (300, 6)  (300, 6)
wine 9  (13, 11)  (300, 75)  (300, 6)  (300, 8)  (300, 6)  (300, 6)
wine 10  (13, 11)  (300, 76)  (300, 6)  (300, 7)  (300, 6)  (300, 6)

Data  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6
car 1  (6, 6)  (300, 96)  (300, 6)  (300, 8)  (300, 6)
car 2  (6, 6)  (300, 96)  (300, 6)  (300, 8)  (300, 6)
car 3  (6, 6)  (300, 91)  (300, 6)  (300, 8)  (300, 6)
car 4  (6, 6)  (300, 88)  (300, 6)  (300, 8)  (300, 6)  (300, 6)
car 5  (6, 6)  (300, 94)  (300, 6)  (300, 8)  (300, 6)
car 6  (6, 6)  (300, 93)  (300, 6)  (300, 7)
car 7  (6, 6)  (300, 92)  (300, 6)  (300, 8)  (300, 6)
car 8  (6, 6)  (300, 95)  (300, 6)  (300, 7)  (300, 6)
car 9  (6, 6)  (300, 96)  (300, 6)  (300, 9)  (300, 6)
car 10  (6, 6)  (300, 99)  (300, 6)  (300, 8)  (300, 6)
Data  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5
divorce 1  (54, 35)  (300, 44)  (300, 5)  (300, 5)
divorce 2  (54, 35)  (300, 45)  (300, 4)  (300, 4)
divorce 3  (54, 36)  (300, 49)  (300, 6)  (300, 6)
divorce 4  (54, 36)  (300, 47)  (300, 7)  (300, 6)
divorce 5  (54, 35)  (300, 45)  (300, 6)  (300, 6)
divorce 6  (54, 36)  (300, 47)  (300, 6)  (300, 6)
divorce 7  (54, 35)  (300, 45)  (300, 6)  (300, 6)  (300, 4)
divorce 8  (54, 36)  (300, 47)  (300, 6)  (300, 7)  (300, 4)
divorce 9  (54, 36)  (300, 47)  (300, 5)  (300, 5)
divorce 10  (54, 36)  (300, 47)  (300, 6)  (300, 6)

Data  Layer 1  Layer 2  Layer 3  Layer 4  Layer 5  Layer 6  Layer 7  Layer 8  Layer 9  Layer 10
cancer 1  (9, 8)  (300, 90)  (300, 5)  (300, 6)  (300, 6)  (300, 5)  (300, 4)  (300, 5)  (300, 6)  (300, 6)
cancer 2  (9, 8)  (300, 90)  (300, 6)  (300, 7)  (300, 8)  (300, 11)  (300, 8)  (300, 4)
cancer 3  (9, 8)  (300, 88)  (300, 5)  (300, 6)  (300, 7)  (300, 7)  (300, 6)  (300, 4)
cancer 4  (9, 8)  (300, 93)  (300, 6)  (300, 7)  (300, 9)  (300, 11)  (300, 8)
cancer 5  (9, 8)  (300, 93)  (300, 9)  (300, 10)  (300, 10)  (300, 11)  (300, 9)  (300, 7)
cancer 6  (9, 8)  (300, 92)  (300, 7)  (300, 8)  (300, 8)  (300, 7)  (300, 7)
cancer 7  (9, 8)  (300, 90)  (300, 4)  (300, 4)  (300, 5)  (300, 6)  (300, 6)  (300, 6)  (300, 6)
cancer 8  (9, 8)  (300, 88)  (300, 5)  (300, 6)  (300, 7)  (300, 8)  (300, 7)  (300, 6)
cancer 9  (9, 8)  (300, 88)  (300, 5)  (300, 7)  (300, 7)  (300, 7)  (300, 7)
cancer 10  (9, 8)  (300, 97)  (300, 9)  (300, 11)  (300, 12)  (300, 13)  (300, 6)

Data  Layer 1  Layer 2  Layer 3  Layer 4
face 1  (960, 233)  (300, 74)  (300, 73)  (300, 46)
face 2  (960, 231)  (300, 75)  (300, 73)  (300, 43)
face 3  (960, 231)  (300, 76)  (300, 73)  (300, 44)
face 4  (960, 232)  (300, 76)  (300, 74)  (300, 44)
face 5  (960, 231)  (300, 77)  (300, 73)  (300, 43)
face 6  (960, 232)  (300, 74)  (300, 72)  (300, 47)
face 7  (960, 232)  (300, 76)  (300, 73)  (300, 45)
face 8  (960, 230)  (300, 74)  (300, 74)  (300, 44)
face 9  (960, 233)  (300, 76)  (300, 76)  (300, 45)
face 10  (960, 231)  (300, 76)  (300, 70)  (300, 43)

Appendix H Sigma Values Used for the Random and Adversarial Simulations

The simulation of Thm. 1 shown in Fig. 2 spreads the improvement across multiple layers. The σ_l and H_l values are recorded here (Figures 6 and 7). We note that the σ_l values are reasonably large and do not approach 0, and the improvement of H_l is monotonic.

Given a sufficiently small σ_0 and σ_1, Thm. 1 claims that the objective can come arbitrarily close to the global optimum using a minimum of 2 layers. We simulate 2 layers using a relatively small σ value (σ = 10⁻) on the Random (left) and Adversarial (right) data and display the results of the 2 layers below. Notice that given 2 layers, the network generates clearly separable clusters that are pushed far apart.

Figure 8: Random dataset with 2 layers and σ = 10⁻.
Figure 9: Adversarial dataset with 2 layers and σ = 10⁻.

Appendix I Graphs of Kernel Sequences
A representation of the Kernel Sequence is displayed in the figures below for each dataset. The rows and columns of each kernel matrix are organized into a block structure by placing samples of the same class adjacent to each other. Since the Gaussian kernel is restricted to values between 0 and 1, we let white and dark blue denote 0 and 1 respectively, with gradients reflecting the values in between. Our theorems predict that the
Kernel Sequence will evolve from an uninformative kernel into a highly discriminating kernel with a perfect block structure.

Figure 10: The kernel sequence for the wine dataset.
Figure 11: The kernel sequence for the cancer dataset.
Figure 12: The kernel sequence for the Adversarial dataset.
Figure 13: The kernel sequence for the car dataset.
Figure 14: The kernel sequence for the face dataset.
Figure 15: The kernel sequence for the divorce dataset.
Figure 16: The kernel sequence for the spiral dataset.
Figure 17: The kernel sequence for the Random dataset.

Appendix J Evaluation Metrics Graphs
Figure 18 and Figure 19: Key metrics for all datasets as samples progress through the network. It is important to notice the uniformly and monotonically increasing H-Sequence in each plot, since this guarantees a converging kernel/risk sequence. As T approaches 0, samples of the same/different classes in IDS are pulled into a single point or pushed maximally apart, respectively. As C approaches 0, samples of the same/different classes in RKHS are pulled toward an angle of 0 or π (in terms of cosine similarity), respectively.

Appendix K Optimal Gaussian σ for Maximum Kernel Separation

Although the Gaussian kernel is the most common kernel choice for kernel methods, its σ value is a hyperparameter that must be tuned for each dataset. This work proposes to set the σ value based on maximum kernel separation. The source code is publicly available at https://github.com/anonamous.

Let $X \in \mathbb{R}^{n \times d}$ be a dataset of $n$ samples with $d$ features and let $Y \in \mathbb{R}^{n \times \tau}$ be the corresponding one-hot encoded labels, where $\tau$ denotes the number of classes. Let $\kappa_X(\cdot, \cdot)$ and $\kappa_Y(\cdot, \cdot)$ be two kernel functions applied respectively to $X$ and $Y$ to construct the kernel matrices $K_X \in \mathbb{R}^{n \times n}$ and $K_Y \in \mathbb{R}^{n \times n}$. Given a set $\mathcal{S}$, we denote $|\mathcal{S}|$ as the number of elements within the set. Also let $\mathcal{S}$ and $\mathcal{S}^c$ be the sets of all pairs of samples $(x_i, x_j)$ from the dataset $X$ that belong to the same and to different classes, respectively. Then the average kernel value over all $(x_i, x_j)$ pairs within the same class is
\[
d_{\mathcal{S}} = \frac{1}{|\mathcal{S}|} \sum_{i,j \in \mathcal{S}} e^{-\frac{\| x_i - x_j \|^2}{2\sigma^2}} \tag{177}
\]
and the average kernel value over all $(x_i, x_j)$ pairs between different classes is
\[
d_{\mathcal{S}^c} = \frac{1}{|\mathcal{S}^c|} \sum_{i,j \in \mathcal{S}^c} e^{-\frac{\| x_i - x_j \|^2}{2\sigma^2}}. \tag{178}
\]
We propose to find the σ that maximizes the difference between $d_{\mathcal{S}}$ and $d_{\mathcal{S}^c}$, i.e.,
\[
\max_{\sigma} \frac{1}{|\mathcal{S}|} \sum_{i,j \in \mathcal{S}} e^{-\frac{\| x_i - x_j \|^2}{2\sigma^2}} - \frac{1}{|\mathcal{S}^c|} \sum_{i,j \in \mathcal{S}^c} e^{-\frac{\| x_i - x_j \|^2}{2\sigma^2}}. \tag{179}
\]
It turns out that this expression can be computed efficiently. Let $g = \frac{1}{|\mathcal{S}|}$ and $\bar{g} = \frac{1}{|\mathcal{S}^c|}$, and let $1_{n \times n} \in \mathbb{R}^{n \times n}$ be a matrix of 1s; then we can define $Q$ as
\[
Q = -g K_Y + \bar{g} \left( 1_{n \times n} - K_Y \right), \tag{180}
\]
or, more compactly,
\[
Q = \bar{g}\, 1_{n \times n} - (g + \bar{g}) K_Y. \tag{181}
\]
Given $Q$, Eq. (179) becomes
\[
\min_{\sigma} \operatorname{Tr}(K_X Q). \tag{182}
\]
This objective can be efficiently solved with BFGS. In Fig. 20, we plot the average within-class and between-class kernel values as we vary σ; the plot shows that the maximum separation is discovered via BFGS.
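A minimal sketch of this σ selection on toy data, optimizing Eq. (182) with SciPy's BFGS routine; the toy data, the log-parameterization of σ (to keep it positive), and the initial guess are illustrative assumptions.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 5)), rng.normal(2, 1, (30, 5))])
y = np.array([0] * 30 + [1] * 30)

sq_dist = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K_Y = (y[:, None] == y[None, :]).astype(float)        # label kernel: 1 same class, 0 otherwise
g, g_bar = 1.0 / K_Y.sum(), 1.0 / (1.0 - K_Y).sum()   # 1/|S| and 1/|S^c|
Q = g_bar * np.ones_like(K_Y) - (g + g_bar) * K_Y     # Eq. (181)

def objective(log_sigma):
    sigma = np.exp(log_sigma[0])                      # optimize log(sigma) so sigma > 0
    K_X = np.exp(-sq_dist / (2 * sigma ** 2))
    return np.trace(K_X @ Q)                          # Eq. (182)

res = minimize(objective, x0=[0.0], method='BFGS')
print("optimal sigma:", np.exp(res.x[0]))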
Relation to HSIC. From Eq. (182), we can see that the σ giving maximum kernel separation is directly related to HSIC. The HSIC objective is commonly written as
\[
\max_{\sigma} \operatorname{Tr}(K_X H K_Y H); \tag{183}
\]
comparing $Q$ with $-H K_Y H$ shows how the two formulations are related. While maximum kernel separation weights every sample pair equally, HSIC weights the pairs differently. We also notice that the $(i,j)$ element of $H K_Y H$ is positive/negative for $(x_i, x_j)$ pairs within/between classes, respectively. Therefore, the argument of the global optimum should be relatively close for both objectives. In Figure 21, we show the HSIC value as we vary σ. Notice how the optimal σ is almost equivalent to the solution from maximum kernel separation. For the purpose of KNet, we use σ