Learning Diverse and Discriminative Representations via the Principle of Maximal Coding Rate Reduction
Yaodong Yu†, Kwan Ho Ryan Chan†, Chong You†, Chaobing Song‡, Yi Ma†
† Department of EECS, University of California, Berkeley
‡ Tsinghua-Berkeley Shenzhen Institute, Tsinghua University
∗ The first two authors contributed equally to this work.
Abstract
To learn intrinsic low-dimensional structures from high-dimensional data that most discriminate between classes, we propose the principle of Maximal Coding Rate Reduction (MCR²), an information-theoretic measure that maximizes the coding rate difference between the whole dataset and the sum of each individual class. We clarify its relationships with most existing frameworks such as cross-entropy, information bottleneck, information gain, contractive and contrastive learning, and provide theoretical guarantees for learning diverse and discriminative features. The coding rate can be accurately computed from finite samples of degenerate subspace-like distributions and can learn intrinsic representations in supervised, self-supervised, and unsupervised settings in a unified manner. Empirically, the representations learned using this principle alone are significantly more robust to label corruptions in classification than those using cross-entropy, and can lead to state-of-the-art results in clustering mixed data from self-learned invariant features.

Given a random vector x ∈ R^D which is drawn from a mixture of, say, k distributions D = {D_j}_{j=1}^k, one of the most fundamental problems in machine learning is how to effectively and efficiently learn the distribution from a finite set of i.i.d. samples, say X = [x_1, x_2, ..., x_m] ∈ R^{D×m}. To this end, we seek a good representation through a continuous mapping, f(x, θ): R^D → R^d, that captures intrinsic structures of x and best facilitates subsequent tasks such as classification or clustering.

Supervised learning of discriminative representations.
To ease the task of learning D, in the popular supervised setting, a true class label, represented as a one-hot vector y_i ∈ R^k, is given for each sample x_i. Extensive studies have shown that for many practical datasets (images, audio, natural language, etc.), the mapping from the data x to its class label y can be effectively modeled by training a deep network [GBC16], here denoted as f(x, θ): x ↦ y with network parameters θ ∈ Θ. This is typically done by minimizing the cross-entropy loss over a training set {(x_i, y_i)}_{i=1}^m, through backpropagation over the network parameters θ:

min_{θ∈Θ} CE(θ, x, y) ≐ −E[⟨y, log[f(x, θ)]⟩] ≈ −(1/m) ∑_{i=1}^m ⟨y_i, log[f(x_i, θ)]⟩.  (1)

Despite its effectiveness and enormous popularity, there are two serious limitations with this approach: 1) It aims only to predict the labels y, even if they might be mislabeled. Empirical studies show that deep networks, used as a "black box," can even fit random labels [ZBH+17]. 2) The precise geometric and statistical properties of the learned features are also often obscured, despite plenty of empirical efforts in trying to illustrate or interpret the so-learned features [ZF14]; this leads to a lack of interpretability and of subsequent performance guarantees (e.g., generalizability, transferability, and robustness) in deep learning. Therefore, the goal of this paper is to address such limitations of current learning frameworks by reformulating the objective towards learning explicitly meaningful representations for the data x.

Minimal discriminative features via information bottleneck.
One popular approach to interpret the role of deep networks is to view the outputs of intermediate layers of the network as selecting certain latent features z = f(x, θ) ∈ R^d of the data that are discriminative among multiple classes. The learned representations z then facilitate the subsequent classification task of predicting the class label y by optimizing a classifier g(z):

x →[f(x,θ)] z(θ) →[g(z)] y.

The information bottleneck (IB) formulation [TZ15] further hypothesizes that the role of the network is to learn z as the minimal sufficient statistics for predicting y. Formally, it seeks to maximize the mutual information I(z, y) between z and y (mutual information is defined as I(z, y) ≐ H(z) − H(z | y), where H(z) is the entropy of z [CT06]) while minimizing I(x, z) between x and z:

max_{θ∈Θ} IB(x, y, z(θ)) ≐ I(z(θ), y) − β I(x, z(θ)),  β > 0.  (2)

This framework has been successful in describing certain behaviors of deep networks, given that one can overcome some caveats associated with it [KTVK18] and practical difficulties such as how to accurately evaluate mutual information with finite samples of degenerate distributions. But by being task-dependent (depending on the label y) and seeking a minimal set of most informative features for the task at hand (predicting the label y only), the network sacrifices generalizability, robustness, or transferability, in case the labels are corrupted or the learned features are attacked. To address this, our framework uses the label y only as side information to assist learning discriminative features, hence making the learned features more robust to mislabeled data.

Contractive learning of generative representations.
Complementary to the above supervised discriminative approach, auto-encoding [BH89, Kra91] is another popular unsupervised (label-free) framework used to learn good latent representations; it can be viewed as a nonlinear extension of classical PCA [Jol02]. The idea is to learn a compact latent representation z ∈ R^d that adequately regenerates the original data x to a certain extent, say through optimizing some decoder or generator g(z, η):

x →[f(x,θ)] z(θ) →[g(z,η)] x̂(θ, η).  (3)

Typically, such representations are learned in an end-to-end fashion by imposing certain heuristics on geometric or statistical "compactness" of z, such as its dimension, energy, or volume. For example, the contractive autoencoder [RVM+11] penalizes local volume expansion of the learned features, approximated by the Jacobian ‖∂z/∂x‖. Another key design factor of this approach is the choice of a proper, but often elusive, metric that can measure the desired similarity between x and the decoded x̂, either between sample pairs x_i and x̂_i or between the two distributions D_x and D_x̂. For tasks such as image denoising, the metric can be chosen to be the ℓ_p-norm between samples of x and x̂: min_{θ,η} E[‖x − x̂‖_p], with typically p = 1 or 2. The distance between the distributions of x and x̂, say the KL divergence KL(D_x ‖ D_x̂), is very difficult to evaluate when the data distributions are discrete and degenerate; in practice, it can only be approximated with the help of an additional discriminative network, as in GANs [GPAM+14, ACB17].

Representations learned through this framework can arguably be rich enough to regenerate the data to a certain extent. But depending on the choice of the regularizing heuristics on z and the similarity metrics on x (or D_x), the objective is typically task-dependent and often grossly approximated [RVM+11, GPAM+14]. Moreover, for data with mixed multi-modal structures, naive heuristics or inaccurate metrics may fail to capture all internal subclass structures or to explicitly discriminate among them for classification or clustering purposes; one consequence of this is the phenomenon of mode collapse in learning generative models for such data, see [LPZM20] and references therein. To address this, we propose a principled measure (on z) to learn representations that promote the multi-class discriminative property from data of mixed structures, which works in both supervised and unsupervised settings.

This work: Learning diverse and discriminative representations.
Whether the given data X of a mixed distribution D can be effectively classified depends on how separable (or discriminative) the component distributions D_j are (or can be made). One popular working assumption is that the distribution of each class has relatively low-dimensional intrinsic structures. There are many reasons why this assumption is plausible: 1) high-dimensional data are highly redundant; 2) data that belong to the same class should be similar and correlated to each other; 3) typically we only care about equivalent structures of x that are invariant to certain classes of deformations and augmentations. Hence we may assume the distribution D_j of each class has a support on a low-dimensional submanifold, say M_j with dimension d_j ≪ D, and the distribution D of x is supported on the mixture of those submanifolds, M = ∪_{j=1}^k M_j, in the high-dimensional ambient space R^D, as illustrated in Figure 1 (left).

Figure 1: Left and Middle: The distribution D of high-dimensional data x ∈ R^D is supported on a manifold M and its classes on low-dimensional submanifolds M_j; we learn a map f(x, θ) such that the features z_i = f(x_i, θ) lie on a union of maximally uncorrelated subspaces {S_j}. Right: Cosine similarity between features learned by our method on the CIFAR10 training dataset. Each class has 5,000 samples and their features span a subspace of over 10 dimensions (see Figure 3(c)).

With the manifold assumption in mind, we want to learn a mapping z = f(x, θ) that maps each of the submanifolds M_j ⊂ R^D to a linear subspace S_j ⊂ R^d (see Figure 1 middle). To do so, we require our learned representation to have the following properties:
1. Between-Class Discriminative: Features of samples from different classes/clusters should be highly uncorrelated and belong to different low-dimensional linear subspaces.
2. Within-Class Compressible: Features of samples from the same class/cluster should be relatively correlated, in the sense that they belong to a low-dimensional linear subspace.
3. Maximally Diverse Representation: The dimension (or variance) of the features for each class/cluster should be as large as possible, as long as they stay uncorrelated from the other classes.

Notice that, although the intrinsic structures of each class/cluster may be low-dimensional, they are by no means simply linear in their original representation x. Here the subspaces {S_j} can be viewed as nonlinear generalized principal components for x [VMS16]. Furthermore, for many clustering or classification tasks (such as object recognition), we consider two samples as equivalent if they differ by a certain class of domain deformations or augmentations T = {τ}; that is, x ∈ M if and only if τ(x) ∈ M for all τ ∈ T. Hence, we are only interested in low-dimensional structures that are invariant to such deformations, which are known to have sophisticated geometric and topological structures [WDCB05] and can be difficult to learn in a principled manner even with CNNs [CW16, CGW19]. There are previous attempts to directly enforce subspace structures on features learned by a deep network for supervised [LQMS18] or unsupervised learning [JZL+17, ZJH+18, PFX+17, ZHF18, ZJH+19, ZLY+19, LQMS18]. However, the self-expressive property of subspaces exploited by [JZL+17] does not enforce all the desired properties listed above; [LQMS18] uses a nuclear-norm-based geometric loss to enforce orthogonality between classes, but does not promote diversity in the learned representations, as we will soon see. Figure 1 (right) illustrates a representation learned by our method on the CIFAR10 dataset. More details can be found in the experimental Section 3.
Although the above properties are all highly desirable for the latent representation z, they are by no means easy to obtain: Are these properties compatible, so that we can expect to achieve them all at once? Is there a simple but principled objective that can measure the goodness of the resulting representations in terms of all these properties? The key to these questions is to find a principled "measure of compactness" for the distribution of a random variable z or for its finite samples Z. Such a measure should directly and accurately characterize intrinsic geometric or statistical properties of the distribution, in terms of its intrinsic dimension or volume. Unlike cross-entropy (1) or information bottleneck (2), such a measure should not depend explicitly on class labels, so that it can work in all supervised, self-supervised, semi-supervised, and unsupervised settings.

Low-dimensional degenerate distributions.
In information theory [CT06], the notion of entropy H(z) is designed to be such a measure: given the probability density p(z) of a random variable, H(z) ≐ −∫ p(z) log p(z) dz. However, entropy is not well-defined for continuous random variables with degenerate distributions, which is unfortunately the case here (the same difficulty arises when evaluating the mutual information I(x, z) for degenerate distributions). To alleviate this difficulty, another related concept in information theory, more specifically in lossy data compression, that measures the "compactness" of a random distribution is the so-called rate distortion [CT06]: given a random variable z and a prescribed precision ε > 0, the rate distortion R(z, ε) is the minimal number of binary bits needed to encode z such that the expected decoding error is less than ε, say in terms of the ℓ₂-norm, E[‖z − ẑ‖²₂] ≤ ε² for the decoded ẑ. Although this framework has been successful in explaining feature selection in deep networks [MWHK19], the rate distortion of a random variable is difficult, if not impossible, to compute, except for simple distributions such as discrete and Gaussian ones.
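For intuition, recall the classical closed-form example from [CT06] (included here purely for illustration): for a scalar Gaussian source z ~ N(0, σ²) under squared-error distortion E[(z − ẑ)²] ≤ D ≤ σ², the rate distortion function is R(D) = (1/2) log₂(σ²/D) bits per sample, so every halving of the allowed distortion costs exactly half a bit. Such closed forms are unavailable for general degenerate distributions, which motivates the finite-sample estimate introduced next.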
Nonasymptotic rate distortion for finite samples. When evaluating the lossy coding rate R, one practical difficulty is that we normally do not know the distribution of z. Instead, we have a finite number of samples as learned representations, z_i = f(x_i, θ) ∈ R^d, i = 1, ..., m, for the given data samples X = [x_1, ..., x_m]. Fortunately, [MDHW07] provides a precise estimate of the number of binary bits needed to encode finite samples from a subspace-like distribution. In order to encode the learned representation Z = [z_1, ..., z_m] up to a precision ε, the total number of bits needed is given by L(Z, ε) ≐ ((m + d)/2) log det(I + (d/(mε²)) ZZ^⊤). (This formula can be derived either by packing ε-balls into the space spanned by Z or by counting the bits needed to quantize the SVD of Z subject to the precision; see [MDHW07] for proofs.) Therefore, the compactness of the learned features as a whole can be measured by the average coding length per sample (as the sample size m is large), a.k.a. the coding rate subject to the distortion ε:

R(Z, ε) ≐ (1/2) log det(I + (d/(mε²)) ZZ^⊤).  (4)

Rate distortion of data with a mixed distribution.
In general, the features Z of multi-class data may belong to multiple low-dimensional subspaces. To evaluate the rate distortion of such mixed data more accurately, we may partition the data Z into multiple subsets, Z = Z_1 ∪ ··· ∪ Z_k, each in one low-dimensional subspace, so that the above coding rate (4) is accurate for each subset. For convenience, let Π = {Π_j ∈ R^{m×m}}_{j=1}^k be a set of diagonal matrices whose diagonal entries encode the membership of the m samples in the k classes: the diagonal entry Π_j(i, i) of Π_j indicates the probability of sample i belonging to subset j, so Π lies in a simplex Ω ≐ {Π | Π_j ≥ 0, Π_1 + ··· + Π_k = I}. Then, according to [MDHW07], with respect to this partition, the average number of bits per sample (the coding rate) is

R_c(Z, ε | Π) ≐ ∑_{j=1}^k (tr(Π_j)/(2m)) log det(I + (d/(tr(Π_j) ε²)) Z Π_j Z^⊤).  (5)

Notice that when Z is given, R_c(Z, ε | Π) is a concave function of Π. The function log det(·) in the above expressions has long been known as an effective heuristic for rank minimization problems, with guaranteed convergence to a local minimum [FHB03]. As it nicely characterizes the rate distortion of Gaussian or subspace-like distributions, log det(·) can be very effective in clustering or classification of mixed data [MDHW07, WTL+08, KPCC15]. We will soon reveal more desirable properties of this function.
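To make the two rate expressions concrete, the following is a minimal NumPy sketch of (4) and (5). The function names and the numerical safeguards are ours; this is an illustration rather than the authors' released implementation.

```python
import numpy as np

def coding_rate(Z, eps):
    """R(Z, eps) = 1/2 logdet(I + d/(m eps^2) Z Z^T), with Z of shape (d, m)."""
    d, m = Z.shape
    I = np.eye(d)
    return 0.5 * np.linalg.slogdet(I + (d / (m * eps**2)) * Z @ Z.T)[1]

def coding_rate_per_class(Z, Pi, eps):
    """R_c(Z, eps | Pi) from (5); Pi is a list of k diagonal membership matrices of shape (m, m)."""
    d, m = Z.shape
    I = np.eye(d)
    rate = 0.0
    for Pi_j in Pi:
        tr_j = np.trace(Pi_j)
        if tr_j < 1e-8:          # an empty class contributes nothing
            continue
        scale = d / (tr_j * eps**2)
        rate += (tr_j / (2.0 * m)) * np.linalg.slogdet(I + scale * Z @ Pi_j @ Z.T)[1]
    return rate
```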
On one hand, for the learned features to be discriminative, features of different classes/clusters are preferred to be maximally incoherent to each other. Hence, together they should span a space of the largest possible volume, and the coding rate of the whole set Z should be as large as possible. On the other hand, learned features of the same class/cluster should be highly correlated and coherent. Hence, each class/cluster should only span a space (or subspace) of a very small volume, and its coding rate should be as small as possible. Therefore, a good representation Z of X is one that, given a partition Π of Z, achieves a large difference between the coding rate for the whole and that for all the subsets:

∆R(Z, Π, ε) ≐ R(Z, ε) − R_c(Z, ε | Π).  (6)

If we choose our feature mapping z = f(x, θ) to be a deep neural network, the overall process of the feature representation and the resulting rate reduction w.r.t. a certain partition Π can be illustrated by the following diagram:

X →[f(x,θ)] Z(θ) →[Π, ε] ∆R(Z(θ), Π, ε).  (7)

Note that ∆R is monotonic in the scale of the features Z. So to make the amount of reduction comparable between different representations (which may be representations associated with different network parameters, or representations learned after different layers of the same deep network), we need to normalize the scale of the learned features, either by imposing that the Frobenius norm of each class Z_j scales with the number of features in Z_j ∈ R^{d×m_j}, i.e., ‖Z_j‖²_F = m_j, or by normalizing each feature to be on the unit sphere, z_i ∈ 𝕊^{d−1}. This formulation offers a natural justification for the need of "batch normalization" in the practice of training deep neural networks [IS15]. An alternative, arguably simpler, way to normalize the scale of the learned representations is to ensure that the mapping of each layer of the network is approximately isometric [QYW+20].

To summarize, we seek to learn features Z(θ) = f(X, θ) and their partition Π (if not given in advance) such that they maximize the reduction between the coding rate of all features and that of the sum of the features w.r.t. their classes:

max_{θ, Π} ∆R(Z(θ), Π, ε) = R(Z(θ), ε) − R_c(Z(θ), ε | Π),  s.t.  ‖Z_j(θ)‖²_F = m_j,  Π ∈ Ω.  (8)

We refer to this as the principle of maximal coding rate reduction (MCR²), an embodiment of Aristotle's famous quote: "the whole is greater than the sum of the parts." Note that for the clustering purpose alone, one may only care about the sign of ∆R for deciding whether to partition the data or not, which leads to the greedy algorithm in [MDHW07] (strictly speaking, in the context of clustering finite samples, one needs to use the more precise measure of the coding length mentioned earlier; see [MDHW07] for more details). Here, to seek or learn the best representation, we further desire that the whole is maximally greater than the sum of its parts.

Relationship to information gain. The maximal coding rate reduction can be viewed as a generalization of Information Gain (IG), which aims to maximize the reduction of entropy of a random variable, say z, with respect to an observed attribute, say π: max_π IG(z, π) ≐ H(z) − H(z | π), i.e., the mutual information between z and π [CT06]. Maximal information gain has been widely used in areas such as decision trees [Qui86]. However, MCR² is used differently in several ways: 1) One typical setting of MCR² is when the data class labels are given, i.e., Π is known; MCR² then focuses on learning representations z(θ) rather than fitting labels. 2) In traditional settings of IG, the number of attributes in z cannot be very large and their values are discrete (typically binary); here the "attributes" Π represent the probability of a multi-class partition for all samples, and their values can even be continuous. 3) As mentioned before, entropy H(z) or mutual information I(z, π) [HFLM+18] is not well-defined for degenerate continuous distributions, whereas the rate distortion R(z, ε) is, and it can be accurately and efficiently computed, at least for (mixed) subspaces.

In theory, the MCR² principle (8) benefits from great generality and can be applied to representations Z of any distributions with any attributes Π, as long as the rates R and R_c for the distributions can be accurately and efficiently evaluated. The optimal representation Z* and partition Π* should have some interesting geometric and statistical properties. We here reveal nice properties of the optimal representation in the special case of subspaces, which have many important use cases in machine learning.
Figure 2: Comparison of two learned representations Z and Z′ via reduced rates: R is the number of ε-balls packed into the joint distribution, and R_c is the sum of the numbers for all the subspaces (the green balls). ∆R is their difference (the number of blue balls). The MCR² principle prefers Z (the left one).

When the desired representation for Z is multiple subspaces, the rates R and R_c in (8) are given by (4) and (5), respectively. At the maximal rate reduction, MCR² achieves its optimal representations, denoted Z* = Z*_1 ∪ ··· ∪ Z*_k ⊂ R^d with rank(Z*_j) ≤ d_j. One can show that Z* has the following desired properties (see Appendix A for a formal statement and detailed proofs).

Theorem 2.1 (Informal Statement). Suppose Z* = Z*_1 ∪ ··· ∪ Z*_k is the optimal solution that maximizes the rate reduction (8). We have:
• Between-class Discriminative: As long as the ambient space is adequately large (d ≥ ∑_{j=1}^k d_j), the subspaces are all orthogonal to each other, i.e., (Z*_i)^⊤ Z*_j = 0 for i ≠ j.
• Maximally Diverse Representation: As long as the coding precision is adequately high, i.e., ε⁴ < min_j {(m_j/m)(d²/d_j²)}, each subspace achieves its maximal dimension, i.e., rank(Z*_j) = d_j. In addition, the largest d_j − 1 singular values of Z*_j are equal.

In other words, in the case of subspaces, the MCR² principle promotes embedding of the data into multiple independent subspaces, with features distributed isotropically in each subspace (except for possibly one dimension). In addition, among all such discriminative representations, it prefers the one with the highest dimensions in the ambient space. This is substantially different from the objective of the information bottleneck (2).
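The discriminative part of this statement is easy to check numerically. The following sketch reuses the `coding_rate` helpers from the earlier snippet and compares ∆R for two classes supported on orthogonal subspaces against two classes supported on the same subspace; the toy sizes and sampling scheme are our own assumptions, not the paper's experimental settings.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m_j, d_j, eps = 20, 500, 5, 0.5            # toy sizes

def subspace_features(basis, m_j):
    """Unit-norm features lying in the span of the given orthonormal basis (d x d_j)."""
    Z_j = basis @ rng.standard_normal((basis.shape[1], m_j))
    return Z_j / np.linalg.norm(Z_j, axis=0)

def delta_R(class_features, eps):
    """Delta R = R - R_c for a list of per-class feature matrices."""
    Z = np.concatenate(class_features, axis=1)
    m, Pi, start = Z.shape[1], [], 0
    for Z_j in class_features:
        diag = np.zeros(m)
        diag[start:start + Z_j.shape[1]] = 1.0
        Pi.append(np.diag(diag))
        start += Z_j.shape[1]
    return coding_rate(Z, eps) - coding_rate_per_class(Z, Pi, eps)

Q = np.linalg.qr(rng.standard_normal((d, 2 * d_j)))[0]   # 2*d_j orthonormal directions
orthogonal = [subspace_features(Q[:, :d_j], m_j), subspace_features(Q[:, d_j:], m_j)]
identical  = [subspace_features(Q[:, :d_j], m_j), subspace_features(Q[:, :d_j], m_j)]

print("Delta R (orthogonal subspaces):", delta_R(orthogonal, eps))   # clearly positive
print("Delta R (same subspace twice): ", delta_R(identical, eps))    # close to zero
```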
Comparison to the geometric OLE loss. To encourage the learned features to be uncorrelated between classes, the work of [LQMS18] proposed to maximize the difference between the nuclear norm of the whole Z and that of its subsets Z_j, called the orthogonal low-rank embedding (OLE) loss: max_θ OLE(Z(θ), Π) ≐ ‖Z(θ)‖_* − ∑_{j=1}^k ‖Z_j(θ)‖_*, added as a regularizer to the cross-entropy loss (1). The nuclear norm ‖·‖_* is a nonsmooth convex surrogate for low-rankness, whereas log det(·) is smooth and concave instead (nonsmoothness poses additional difficulties when using this loss to learn features via gradient descent). Unlike the rate reduction ∆R, the OLE loss is always negative and achieves its maximal value 0 when the subspaces are orthogonal, regardless of their dimensions. So, in contrast to ∆R, this loss serves as a geometric heuristic and does not promote diverse representations. In fact, OLE typically promotes learning one-dimensional representations per class, whereas MCR² encourages learning subspaces with maximal dimensions (Figure 7 of [LQMS18] versus our Figure 6).

Relation to contrastive learning.
If samples are evenly drawn from k classes, a randomly chosen pair (x_i, x_j) belongs with high probability to different classes if k is large (for example, when k ≥ 100, a random pair belongs to different classes with probability 99%). We may view the learned features of two samples together with their augmentations, Z_i and Z_j, as two classes. Then the rate reduction ∆R_{ij} = R(Z_i ∪ Z_j, ε) − (R(Z_i, ε) + R(Z_j, ε)) gives a "distance" measure for how far apart the two sample sets are, and we may try to further "expand" pairs that likely belong to different classes. From Theorem 2.1, the (averaged) rate reduction ∆R_{ij} is maximized when features from different samples are uncorrelated, Z_i^⊤ Z_j = 0 (see Figure 2), and features Z_i from the same sample are highly correlated. Hence, when applied to sample pairs, MCR² naturally conducts the so-called contrastive learning [HCL06, OLV18, HFW+19]. But MCR² is not limited to expanding (or compressing) pairs of samples: it can uniformly conduct "contrastive learning" for a subset with any number of samples, as long as we know they likely belong to different (or the same) classes, say by randomly sampling subsets from a large number of classes or with a good clustering method.
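Under the reading above, the pairwise reduction ∆R_{ij} can be computed directly from the coding-rate helper defined earlier; a minimal sketch (our own naming, assuming each augmentation set is stored as a d × n matrix):

```python
import numpy as np

def pairwise_rate_reduction(Z_i, Z_j, eps):
    """'Distance' Delta R_ij between two augmentation sets Z_i and Z_j:
    rate of the union minus the sum of the rates of the parts."""
    union = np.concatenate([Z_i, Z_j], axis=1)
    return coding_rate(union, eps) - (coding_rate(Z_i, eps) + coding_rate(Z_j, eps))
```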
Figure 3: Evolution of the rates of MCR² in the training process, and principal components of the learned features. (a) Evolution of R, R_c, ∆R during the training process. (b) Training loss versus testing loss. (c) PCA of the learned features: (red) overall data; (blue) individual classes.

Figure 4: Evolution of the rates R, R_c, ∆R of MCR² during training with corrupted labels (label-noise ratios 0.0, 0.1, 0.3, 0.5). (a) ∆R(Z(θ), Π, ε). (b) R(Z(θ), ε). (c) R_c(Z(θ), ε | Π).

Our theoretical analysis above shows how the maximal coding rate reduction (MCR²) is a principled measure for learning discriminative and diverse representations for mixed data. In this section, we demonstrate experimentally how this principle alone, without any other heuristics, is adequate for learning good representations in the supervised, self-supervised, and unsupervised learning settings in a unified fashion. Due to limited space and time, instead of trying to exhaust all its potential and practical implications with extensive engineering, our goal here is only to validate the effectiveness of this principle through its most basic usage and fair comparison with existing frameworks. More implementation details and experiments are given in Appendix B. The code can be found at https://github.com/ryanchankh/mcr2.

When class labels are provided during training, we assign the membership (diagonal) matrices Π = {Π_j}_{j=1}^k as follows: for each sample x_i with label j, set Π_j(i, i) = 1 and Π_l(i, i) = 0 for all l ≠ j. Then the mapping f(·, θ) can be learned by optimizing (8), where Π remains constant. We apply stochastic gradient descent to optimize MCR², and for each iteration we use a mini-batch of data {(x_i, y_i)}_{i=1}^m to approximate the MCR² loss.
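With one-hot labels, the setup above amounts to selecting the rows of Z belonging to each class (which is Z Π_j with zero columns removed) and maximizing ∆R with SGD. The following is a minimal PyTorch sketch of one such training step; it is our own illustration, not the authors' released code, and names such as `net`, `loader`, `optimizer` and `mcr2_loss` are assumptions.

```python
import torch
import torch.nn.functional as F

def mcr2_loss(Z, y, num_classes, eps_sq=0.5):
    """Negative rate reduction -(R - R_c) for a mini-batch of features Z (m x d)
    with integer labels y, following (4), (5) and (8)."""
    m, d = Z.shape
    I = torch.eye(d, device=Z.device)
    R = 0.5 * torch.logdet(I + (d / (m * eps_sq)) * Z.T @ Z)
    Rc = 0.0
    for j in range(num_classes):
        mask = (y == j)
        m_j = mask.sum()
        if m_j == 0:
            continue
        Z_j = Z[mask]                       # rows of class j, i.e. Z Pi_j without zero columns
        Rc = Rc + (m_j / (2.0 * m)) * torch.logdet(I + (d / (m_j * eps_sq)) * Z_j.T @ Z_j)
    return -(R - Rc)                        # minimize the negative to maximize Delta R

# One training step (net, loader, optimizer assumed to exist):
# for x, y in loader:
#     Z = F.normalize(net(x), dim=1)        # put each feature on the unit sphere
#     loss = mcr2_loss(Z, y, num_classes=10)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```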
Evaluation via classification. As we will see, in the supervised setting, the learned representations have very clear subspace structures. So, to evaluate the learned representations, we consider a natural nearest subspace classifier (see the sketch below). For each class of learned features Z_j, let μ_j ∈ R^d be its mean and U_j ∈ R^{d×r_j} the first r_j principal components of Z_j, where r_j is the estimated dimension of class j. The predicted label of a test sample x′ is given by

j′ = argmin_{j∈{1,...,k}} ‖(I − U_j U_j^⊤)(f(x′, θ) − μ_j)‖²₂.

Experiments on real data. We consider the CIFAR10 dataset [Kri09] and ResNet-18 [HZRS16] for f(·, θ). We replace the last linear layer of ResNet-18 by a two-layer fully connected network with ReLU activation such that the output dimension is 128. We set the mini-batch size to m = 1,000 and the precision parameter to ε² = 0.5. More results can be found in Appendix B.3.2. Figure 3(a) illustrates how the two rates and their difference (for both training and test data) evolve over the epochs of training: after an initial phase, R gradually increases while R_c decreases, indicating that the features Z are expanding as a whole while each class Z_j is being compressed. Figure 3(c) shows the distribution of singular values per Z_j, and Figure 1 (right) shows the angles between features sorted by class. Compared to the geometric loss of [LQMS18], our features are not only orthogonal but also of much higher dimension. We compare the singular values of the representations, for both the overall data and the individual classes, learned using cross-entropy and MCR² in Figures 6 and 7 in Appendix B.3.1, and find that the representations learned with the MCR² loss are much more diverse than those learned with the cross-entropy loss. In addition, we find that we are able to select diverse images from the same class according to the "principal" components of the learned features (see Figures 8 and 9 in Appendix B.3.1).
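A minimal NumPy sketch of such a nearest subspace classifier, under the natural reading that U_j collects the top principal directions of the centered class features (the function names are ours):

```python
import numpy as np

def fit_class_subspaces(Z_train, y_train, r):
    """Per-class mean and top-r principal directions (Z_train: m x d, integer labels)."""
    params = {}
    for j in np.unique(y_train):
        Zj = Z_train[y_train == j]
        mu = Zj.mean(axis=0)
        U, _, _ = np.linalg.svd((Zj - mu).T, full_matrices=False)   # principal directions
        params[j] = (mu, U[:, :r])
    return params

def nearest_subspace_predict(z, params):
    """Assign z to the class whose r-dimensional subspace gives the smallest residual."""
    best, best_res = None, np.inf
    for j, (mu, U) in params.items():
        v = z - mu
        res = np.linalg.norm(v - U @ (U.T @ v)) ** 2     # ||(I - U U^T)(z - mu)||^2
        if res < best_res:
            best, best_res = j, res
    return best
```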
Robustness to corrupted labels. Because MCR² by design encourages richer representations that preserve intrinsic structures of the data X, training relies less on class labels than traditional losses such as cross-entropy (CE). To verify this, we train the same network using both CE and MCR² with certain ratios of randomly corrupted training labels. Figure 4 illustrates the learning process: for different levels of corruption, while the rate for the whole set always converges to the same value, the rates for the classes are inversely proportional to the ratio of corruption, indicating that our method only compresses samples with valid labels. The classification results are summarized in Table 1. With exactly the same training parameters, MCR² is significantly more robust than CE, especially at higher ratios of corrupted labels (both CE and MCR² can achieve better performance by choosing larger models for the feature mapping). This can be an advantage in settings such as self-supervised or contrastive learning, where the grouping information can be very noisy.

Table 1: Classification results with features learned with labels corrupted at different levels.
(Columns: RATIO = 0.1, 0.2, 0.3, 0.4, 0.5; rows: CE TRAINING and MCR² TRAINING.)
Motivated by self-supervised learning algorithms [LHB04, KRFL09, OLV18, HFW+19, WXYL18], we use the MCR² principle to learn representations that are invariant to a certain class of transformations/augmentations, say T, with a distribution P_T. Given a mini-batch of data {x_j}_{j=1}^k, we augment each sample x_j with n transformations/augmentations {τ_i(·)}_{i=1}^n randomly drawn from P_T. We simply label all the augmented samples X_j = [τ_1(x_j), ..., τ_n(x_j)] of x_j as the j-th class, and let Z_j be the corresponding learned features. Using this self-labeled data, we train our feature mapping f(·, θ) in the same way as in the supervised setting above. For every mini-batch, the total number of samples used for training is m = kn.
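A minimal sketch of how such a self-labeled mini-batch can be assembled (our own illustration; `transform` stands for a random augmentation drawn from P_T, and `mcr2_loss` refers to the supervised sketch earlier):

```python
import torch

def augmented_batch(samples, transform, n_aug):
    """Build a self-labeled mini-batch: all n_aug augmentations of sample j get label j."""
    xs, ys = [], []
    for j, x in enumerate(samples):            # k original samples
        for _ in range(n_aug):                 # n augmentations each, so m = k * n_aug
            xs.append(transform(x))
            ys.append(j)
    return torch.stack(xs), torch.tensor(ys)

# x_batch, y_batch = augmented_batch(samples, transform, n_aug=50)
# Z = torch.nn.functional.normalize(net(x_batch), dim=1)
# loss = mcr2_loss(Z, y_batch, num_classes=len(samples))
```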
Evaluation via clustering. To learn invariant features, our formulation itself does not require the original samples x_j to come from a fixed number of classes. For evaluation, we may train on a few classes and observe how the learned features facilitate classification or clustering of the data. A common method to evaluate learned features is to train an additional linear classifier [OLV18, HFW+19]. Here, since our features are learned to lie on a union of subspaces from samples of k classes, we use an off-the-shelf subspace clustering algorithm, EnSC [YLRV16], which is computationally efficient and provably correct for data with well-structured subspaces. We also use K-means on the original data X as our baseline for comparison. We use normalized mutual information (NMI), clustering accuracy (ACC), and adjusted Rand index (ARI) as our evaluation metrics; see Appendix B.3.4 for their detailed definitions.
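For reference, the three metrics can be computed as follows; this is the standard formulation (ACC via Hungarian matching of predicted clusters to labels), while the authors' exact definitions are given in Appendix B.3.4.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    """ACC: best one-to-one matching between predicted clusters and true labels."""
    k = max(y_true.max(), y_pred.max()) + 1
    cost = np.zeros((k, k), dtype=int)
    for t, p in zip(y_true, y_pred):
        cost[t, p] += 1
    rows, cols = linear_sum_assignment(-cost)     # maximize matched counts
    return cost[rows, cols].sum() / len(y_true)

# nmi = normalized_mutual_info_score(y_true, y_pred)
# ari = adjusted_rand_score(y_true, y_pred)
# acc = clustering_accuracy(y_true, y_pred)
```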
Figure 5: Evolution of the rates of (left) MCR² and (right) MCR²-CTRL during training in the self-supervised setting on the CIFAR10 dataset.

Table 2: Clustering results on the CIFAR10, CIFAR100, and STL10 datasets.

Dataset    Metric   K-Means   JULE    RTM     DEC     DAC     DCCM    MCR²-CTRL
CIFAR10    NMI      0.087     0.192   0.197   0.257   0.395   0.496
           ACC      0.229     0.272   0.309   0.301   0.521   0.623
           ARI      0.049     0.138   0.115   0.161   0.305   0.408
CIFAR100   NMI      0.084     0.103   -       0.136   0.185   0.285
           ACC      0.130     0.137   -       0.185   0.237   0.327
           ARI      0.028     0.033   -       0.050   0.087
STL10      ACC      0.192     0.182   -       0.359   0.470   0.482
           ARI      0.061     0.164   -       0.186   0.256   0.262

Controlling dynamics of expansion and compression. By directly optimizing the rate reduction ∆R = R − R_c, we achieve a clustering accuracy on the CIFAR10 dataset that is the second best result compared with previous methods; more details can be found in Appendix B.3.3. Empirically, we observe that, without class labels, the overall coding rate R expands quickly and the MCR² loss saturates (at a local maximum); see Figure 5(a). Our experience suggests that learning a good representation from unlabeled data might be too ambitious when directly optimizing the original ∆R. Nonetheless, from the geometric meaning of R and R_c, one can design a different learning strategy by controlling the dynamics of expansion and compression differently during training. For instance, we may re-scale the rate by replacing R(Z, ε) with R̃(Z, ε) ≐ γ₁ log det(I + γ₂ (d/(mε²)) ZZ^⊤). With γ₁ = γ₂ = k, the learning dynamics change from Figure 5(a) to Figure 5(b): all features are first compressed and then gradually expand. We denote the controlled MCR² training by MCR²-CTRL.
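A sketch of the re-scaled expansion term used in MCR²-CTRL, following the expression for R̃ given above (our own code; the exact scaling in the released implementation may differ):

```python
import torch

def scaled_coding_rate(Z, eps_sq, gamma1, gamma2):
    """R~(Z, eps) = gamma1 * logdet(I + gamma2 * d/(m eps^2) Z^T Z), used in place of R
    to control the expansion/compression dynamics during training."""
    m, d = Z.shape
    I = torch.eye(d, device=Z.device)
    return gamma1 * torch.logdet(I + gamma2 * (d / (m * eps_sq)) * Z.T @ Z)

# In the controlled variant the loss becomes -(R~ - R_c), e.g. with gamma1 = gamma2 = k.
```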
Experiments on real data. Similar to the supervised learning setting, we train exactly the same ResNet-18 network on the CIFAR10, CIFAR100, and STL10 [CNL11] datasets. We set the mini-batch size to k = 20, the number of augmentations for each sample to n = 50, and the precision parameter to ε² = 0.5. Table 2 shows the results of the proposed MCR²-CTRL in comparison with the methods JULE [YPB16], RTM [NMM19], DEC [XGF16], DAC [CWM+17], and DCCM [WLW+19], which have achieved the best results on these datasets. Surprisingly, without utilizing any inter-class or inter-sample information or heuristics on the data, the invariant features learned by our method with augmentations alone achieve better performance than these highly engineered clustering methods. More ablation studies can be found in Appendix B.3.4.

Nevertheless, compared to the representations learned in the supervised setting, where the optimal partition Π in (8) is initialized with the correct class information, the representations learned here with self-supervised classes are far from optimal: they at best correspond to local maxima of the MCR² objective (8) when θ and Π are jointly optimized. (We find that the supervised representation learned on CIFAR10 in Section 3.1 can easily achieve a clustering accuracy over 99% on the entire training data.) It remains wide open how to design better optimization strategies and dynamics to learn, from unlabelled or partially labelled data, better representations (and the associated partitions) close to the global maxima of the MCR² objective (8).

Conclusion and Future Work
This work provides rigorous theoretical justifications and clear empirical evidence for why the maximal coding rate reduction (MCR²) is a fundamental principle for learning discriminative low-dimensional representations in almost all learning settings. It unifies and explains existing effective frameworks and heuristics widely practiced in the (deep) learning literature. It remains open why MCR² is robust to label noise in the supervised setting, why self-learned features with MCR² alone are effective for clustering, and how, in future practice, instantiations of this principle can be systematically harnessed to further improve clustering or classification tasks.

We believe that MCR² gives a principled and practical objective for (deep) learning and can potentially lead to better designs of the operators and architectures of a deep network. A potential direction is to monitor quantitatively the amount of rate reduction ∆R gained through every layer of the deep network. By optimizing the rate reduction through the network layers, the network is no longer engineered as a "black box." On the learning-theoretic side, although this work has demonstrated the principle only with mixed subspaces, it applies to any mixed distributions or structures, for which the configurations that achieve maximal rate reduction are of independent theoretical interest. Another interesting note is that the MCR² formulation goes beyond the supervised multi-class learning setting often studied through empirical risk minimization (ERM) [DSBDSS15]; it is more related to the expectation maximization (EMX) framework [BDHM+17].

Acknowledgements
Yi would like to thank Professor Yann LeCun of New York University for a stimulating discussion in his NYU office last November about the search for a proper "energy" function for features to be learned by a deep network [LCH+06], as well as for discussions related to the log det(·) function. Ryan would like to thank Yuexiang Zhai for helpful discussions on learning subspace structures. Last but not least, we are very grateful to Xili Dai and Professor Xiaojun Yuan of UESTC and Professor Hao Chen of UC Davis, who have generously provided us their GPU clusters to help us conduct the extensive experiments reported in this paper.

References

[ACB17] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In
International Conference on Machine Learning, pages 214–223, 2017.
[BDHM+17] Shai Ben-David, Pavel Hrubes, Shay Moran, Amir Shpilka, and Amir Yehudayoff. A learning problem that is independent of the set theory ZFC axioms, 2017.
[BH89] Pierre Baldi and Kurt Hornik. Neural networks and principal component analysis: Learning from examples without local minima.
Neural networks , 2(1):53–58, 1989.[BV04] Stephen P Boyd and Lieven Vandenberghe.
Convex optimization . Cambridge universitypress, 2004.[CGW19] Taco S Cohen, Mario Geiger, and Maurice Weiler. A general theory of equivariantcnns on homogeneous spaces. In
Advances in Neural Information Processing Systems ,pages 9142–9153, 2019.[CNL11] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks inunsupervised feature learning. In
International Conference on Artificial Intelligenceand Statistics , pages 215–223, 2011.[CT06] Thomas M. Cover and Joy A. Thomas.
Elements of Information Theory (Wiley Seriesin Telecommunications and Signal Processing) . Wiley-Interscience, USA, 2006.[CW16] Taco Cohen and Max Welling. Group equivariant convolutional networks. In
Interna-tional Conference on Machine Learning , pages 2990–2999, 2016.[CWM +
17] Jianlong Chang, Lingfeng Wang, Gaofeng Meng, Shiming Xiang, and Chunhong Pan.Deep adaptive image clustering. In
Proceedings of the IEEE International Conferenceon Computer Vision , pages 5879–5887, 2017.[DSBDSS15] Amit Daniely, Sivan Sabato, Shai Ben-David, and Shai Shalev-Shwartz. Multiclasslearnability and the ERM principle.
J. Mach. Learn. Res. , 16(1):2377–2404, January2015.[FHB03] Maryam Fazel, Haitham Hindi, and Stephen P Boyd. Log-det heuristic for matrixrank minimization with applications to hankel and euclidean distance matrices. In
Proceedings of the 2003 American Control Conference, 2003. , volume 3, pages 2156–2162. IEEE, 2003.[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville.
Deep Learning. MIT Press, 2016.
[GPAM+
14] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley,Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In
Advances in Neural Information Processing Systems , pages 2672–2680, 2014.[HA85] Lawrence Hubert and Phipps Arabie. Comparing partitions.
Journal of Classification, 2(1):193–218, 1985.
[HCL06] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1735–1742, 2006.
[HFLM+
18] R Devon Hjelm, Alex Fedorov, Samuel Lavoie-Marchildon, Karan Grewal, PhilBachman, Adam Trischler, and Yoshua Bengio. Learning deep representations bymutual information estimation and maximization. arXiv preprint arXiv:1808.06670 ,2018.[HFW +
19] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum con-trast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722 ,2019.[HZRS16] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning forimage recognition. In
Proceedings of the IEEE Conference on Computer Vision andPattern Recognition , pages 770–778, 2016.[IS15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep networktraining by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 , 2015.[Jol02] Ian T Jolliffe.
Principal Component Analysis . Springer-Verlag, 2nd edition, 2002.[JZL +
17] Pan Ji, Tong Zhang, Hongdong Li, Mathieu Salzmann, and Ian Reid. Deep subspaceclustering networks. In
Advances in Neural Information Processing Systems, pages 24–33, 2017.
[KPCC15] Zhao Kang, Chong Peng, Jie Cheng, and Qiang Cheng. Logdet rank minimization with application to subspace clustering.
Computational Intelligence and Neuroscience ,2015, 2015.[Kra91] Mark A Kramer. Nonlinear principal component analysis using autoassociative neuralnetworks.
AIChE Journal, 37(2):233–243, 1991.
[KRFL09] Koray Kavukcuoglu, Marc'Aurelio Ranzato, Rob Fergus, and Yann LeCun. Learning invariant features through topographic filter maps. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1605–1612. IEEE, 2009.
[Kri09] Alex Krizhevsky. Learning multiple layers of features from tiny images. 2009.
[KTVK18] Artemy Kolchinsky, Brendan D Tracey, and Steven Van Kuyk. Caveats for information bottleneck in deterministic scenarios. arXiv preprint arXiv:1808.07593, 2018.
[LCH+
06] Yann LeCun, Sumit Chopra, Raia Hadsell, M Ranzato, and F Huang. A tutorial onenergy-based learning.
Predicting structured data , 1(0), 2006.[LHB04] Yann LeCun, Fu Jie Huang, and Leon Bottou. Learning methods for generic objectrecognition with invariance to pose and lighting. In
Proceedings of the 2004 IEEEComputer Society Conference on Computer Vision and Pattern Recognition, 2004.CVPR 2004. , volume 2, pages II–104. IEEE, 2004.[LPZM20] Ke Li, Shichong Peng, Tianhao Zhang, and Jitendra Malik. Multimodal image synthe-sis with conditional implicit maximum likelihood estimation.
International Journal ofComputer Vision , 2020.[LQMS18] José Lezama, Qiang Qiu, Pablo Musé, and Guillermo Sapiro. OLE: Orthogonal low-rank embedding-a plug and play geometric loss for deep learning. In
Proceedings ofthe IEEE Conference on Computer Vision and Pattern Recognition , pages 8109–8118,2018.[MDHW07] Yi Ma, Harm Derksen, Wei Hong, and John Wright. Segmentation of multivariatemixed data via lossy data coding and compression.
IEEE Transactions on PatternAnalysis and Machine Intelligence , 29(9):1546–1562, 2007.[MWHK19] Jan MacDonald, Stephan Wäldchen, Sascha Hauch, and Gitta Kutyniok. A rate-distortion framework for explaining neural network decisions.
CoRR , abs/1905.11092,2019.[NMM19] Oliver Nina, Jamison Moody, and Clarissa Milligan. A decoder-free approach forunsupervised clustering and manifold learning with random triplet mining. In
Pro-ceedings of the IEEE International Conference on Computer Vision Workshops , pages0–0, 2019.[NW06] Jorge Nocedal and Stephen J. Wright.
Numerical Optimization . Springer, New York,NY, USA, second edition, 2006.[OLV18] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning withcontrastive predictive coding. arXiv preprint arXiv:1807.03748 , 2018.[PFX +
17] Xi Peng, Jiashi Feng, Shijie Xiao, Jiwen Lu, Zhang Yi, and Shuicheng Yan. Deepsparse subspace clustering. arXiv preprint arXiv:1709.08374 , 2017.[PGM +
19] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, GregoryChanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch:An imperative style, high-performance deep learning library. In
Advances in NeuralInformation Processing Systems , pages 8024–8035, 2019.[Qui86] J. R. Quinlan. Induction of decision trees.
Mach. Learn. , 1(1):81–106, March 1986.[QYW +
20] Haozhi Qi, Chong You, Xiaolong Wang, Yi Ma, and Jitendra Malik. Deep isometriclearning for visual recognition. In
Proceedings of the International Conference onInternational Conference on Machine Learning , 2020.[RVM +
11] Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. Con-tractive auto-encoders: Explicit invariance during feature extraction. In
In International Conference on Machine Learning, pages 833–840, 2011.
[SG02] Alexander Strehl and Joydeep Ghosh. Cluster ensembles—a knowledge reuse framework for combining multiple partitions.
Journal of Machine Learning Research ,3(Dec):583–617, 2002.[SZ15] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In
International Conference on Learning Representations, 2015.
[TZ15] Naftali Tishby and Noga Zaslavsky. Deep learning and the information bottleneck principle. In IEEE Information Theory Workshop (ITW), pages 1–5. IEEE, 2015.
[VMS16] Rene Vidal, Yi Ma, and S. S. Sastry.
Generalized Principal Component Analysis .Springer Publishing Company, Incorporated, 1st edition, 2016.[WDCB05] Michael B Wakin, David L Donoho, Hyeokho Choi, and Richard G Baraniuk. Themultiscale structure of non-differentiable image manifolds. In
Proceedings of SPIE,the International Society for Optical Engineering , pages 59141B–1, 2005.[WLW +
19] Jianlong Wu, Keyu Long, Fei Wang, Chen Qian, Cheng Li, Zhouchen Lin, and HongbinZha. Deep comprehensive correlation mining for image clustering. In
Proceedings ofthe IEEE International Conference on Computer Vision , pages 8150–8159, 2019.[WTL +
08] John Wright, Yangyu Tao, Zhouchen Lin, Yi Ma, and Heung-Yeung Shum. Clas-sification via minimum incremental coding length (micl). In
Advances in NeuralInformation Processing Systems , pages 1633–1640, 2008.[WXYL18] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised featurelearning via non-parametric instance discrimination. In
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , pages 3733–3742, 2018.[XGD +
17] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggre-gated residual transformations for deep neural networks. In
Proceedings of the IEEEconference on computer vision and pattern recognition , pages 1492–1500, 2017.[XGF16] Junyuan Xie, Ross Girshick, and Ali Farhadi. Unsupervised deep embedding forclustering analysis. In
International Conference on Machine Learning , pages 478–487,2016.[YLRV16] Chong You, Chun-Guang Li, Daniel P Robinson, and René Vidal. Oracle based activeset algorithm for scalable elastic net subspace clustering. In
Proceedings of the IEEEConference on Computer Vision and Pattern Recognition , pages 3928–3937, 2016.[YPB16] Jianwei Yang, Devi Parikh, and Dhruv Batra. Joint unsupervised learning of deep rep-resentations and image clusters. In
Proceedings of the IEEE Conference on ComputerVision and Pattern Recognition , pages 5147–5156, 2016.[ZBH +
17] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals.Understanding deep learning requires rethinking generalization. In
InternationalConference on Learning Representations , 2017.[ZF14] Matthew D Zeiler and Rob Fergus. Visualizing and understanding convolutionalnetworks. In
European Conference on Computer Vision , pages 818–833. Springer,2014.[ZHF18] Pan Zhou, Yunqing Hou, and Jiashi Feng. Deep adversarial subspace clustering. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition ,pages 1596–1604, 2018.[ZJH +
18] Tong Zhang, Pan Ji, Mehrtash Harandi, Richard Hartley, and Ian Reid. Scalable deepk-subspace clustering. In
Asian Conference on Computer Vision , pages 466–481.Springer, 2018.[ZJH +
19] Tong Zhang, Pan Ji, Mehrtash Harandi, Wenbing Huang, and Hongdong Li. Neuralcollaborative subspace clustering. arXiv preprint arXiv:1904.10596 , 2019.[ZLY +
19] Junjian Zhang, Chun-Guang Li, Chong You, Xianbiao Qi, Honggang Zhang, Jun Guo,and Zhouchen Lin. Self-supervised convolutional subspace clustering network. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5473–5482, 2019.

Appendices
A Properties of the Rate Reduction Function
This section is organized as follows. We present background and preliminary results for the log det(·) function and the coding rate function in Section A.1. Sections A.2 and A.3 then provide technical lemmas for bounding the coding rate and the coding rate reduction functions, respectively. These lemmas are the key results for proving our main theorem, which is stated informally in Theorem 2.1 and formally in Section A.4. Finally, the proof of our main theoretical results is provided in Section A.5.

Notations. Throughout this section, we use S^d_++, R_+ and Z_++ to denote the set of symmetric positive definite matrices of size d × d, the nonnegative real numbers, and the positive integers, respectively.

A.1 Preliminaries

Properties of the log det(·) function.

Lemma A.1. The function log det(·): S^d_++ → R is strictly concave. That is,

log det((1 − α) Z_1 + α Z_2) ≥ (1 − α) log det(Z_1) + α log det(Z_2)

for any α ∈ (0, 1) and {Z_1, Z_2} ⊆ S^d_++, with equality if and only if Z_1 = Z_2.

Proof. Consider an arbitrary line given by Z = Z_0 + t∆Z, where Z_0 and ∆Z ≠ 0 are symmetric matrices of size d × d. Let f(t) ≐ log det(Z_0 + t∆Z) be a function defined on an interval of values of t for which Z_0 + t∆Z ∈ S^d_++. Following the same argument as in [BV04], we may assume Z_0 ∈ S^d_++ and get

f(t) = log det Z_0 + ∑_{i=1}^d log(1 + t λ_i),

where {λ_i}_{i=1}^d are the eigenvalues of Z_0^{−1/2} ∆Z Z_0^{−1/2}. The second-order derivative of f(t) is given by

f″(t) = − ∑_{i=1}^d λ_i² / (1 + t λ_i)² < 0.

Therefore, f(t) is strictly concave along the line Z = Z_0 + t∆Z. By definition, we conclude that log det(·) is strictly concave.

Properties of the coding rate function. The following properties of the coding rate function, which follow from Sylvester's determinant theorem, are known from [MDHW07].

Lemma A.2 (Commutative property [MDHW07]). For any Z ∈ R^{d×m} we have

R(Z, ε) ≐ (1/2) log det(I + (d/(mε²)) ZZ^⊤) = (1/2) log det(I + (d/(mε²)) Z^⊤Z).

Lemma A.3 (Invariant property [MDHW07]). For any Z ∈ R^{d×m} and any orthogonal matrices U ∈ R^{d×d} and V ∈ R^{m×m} we have R(Z, ε) = R(UZV^⊤, ε).

A.2 Lower and Upper Bounds for Coding Rate

The following result provides an upper and a lower bound on the coding rate of Z as a function of the coding rates of its components {Z_j}_{j=1}^k. The lower bound is tight when all the components {Z_j}_{j=1}^k have the same covariance (assuming that they have zero mean). The upper bound is tight when the components {Z_j}_{j=1}^k are pairwise orthogonal.

Lemma A.4. For any {Z_j ∈ R^{d×m_j}}_{j=1}^k and any ε > 0, let Z = [Z_1, ..., Z_k] ∈ R^{d×m} with m = ∑_{j=1}^k m_j. We have

∑_{j=1}^k (m_j/(2m)) log det(I + (d/(m_j ε²)) Z_j Z_j^⊤) ≤ (1/2) log det(I + (d/(mε²)) ZZ^⊤) ≤ ∑_{j=1}^k (1/2) log det(I + (d/(mε²)) Z_j Z_j^⊤),  (9)

where the first equality holds if and only if (Z_1 Z_1^⊤)/m_1 = (Z_2 Z_2^⊤)/m_2 = ··· = (Z_k Z_k^⊤)/m_k, and the second equality holds if and only if Z_{j1}^⊤ Z_{j2} = 0 for all 1 ≤ j1 < j2 ≤ k.

Proof. By Lemma A.1, log det(·) is strictly concave. Therefore,

log det(∑_{j=1}^k α_j S_j) ≥ ∑_{j=1}^k α_j log det(S_j)

for all {α_j > 0}_{j=1}^k with ∑_{j=1}^k α_j = 1 and {S_j ∈ S^d_++}_{j=1}^k, where equality holds if and only if S_1 = S_2 = ··· = S_k. Taking α_j = m_j/m and S_j = I + (d/(m_j ε²)) Z_j Z_j^⊤, we get

(1/2) log det(I + (d/(mε²)) ZZ^⊤) ≥ ∑_{j=1}^k (m_j/(2m)) log det(I + (d/(m_j ε²)) Z_j Z_j^⊤),

with equality if and only if (Z_1 Z_1^⊤)/m_1 = ··· = (Z_k Z_k^⊤)/m_k. This proves the lower bound in (9).

We now prove the upper bound. By the strict concavity of log det(·), we have

log det(Q) ≤ log det(S) + ⟨∇ log det(S), Q − S⟩

for all {Q, S} ⊆ S^m_++, where equality holds if and only if Q = S. Plugging in ∇ log det(S) = S^{−1} (see, e.g., [BV04]) and S^{−1} = (S^{−1})^⊤ gives

log det(Q) ≤ log det(S) + tr(S^{−1}Q) − m.  (10)

We now take

Q = I + (d/(mε²)) Z^⊤Z, whose (j1, j2)-th block is δ_{j1 j2} I + (d/(mε²)) Z_{j1}^⊤ Z_{j2},  (11)

and S the block-diagonal matrix whose j-th diagonal block is I + (d/(mε²)) Z_j^⊤ Z_j. From the property of the determinant of a block-diagonal matrix, we have

log det(S) = ∑_{j=1}^k log det(I + (d/(mε²)) Z_j^⊤ Z_j).  (12)

Also, note that the diagonal blocks of S^{−1}Q are all identity matrices, while the off-diagonal blocks are irrelevant for the purpose of computing the trace, so that

tr(S^{−1}Q) = m.  (13)

Plugging (12) and (13) back into (10) gives

(1/2) log det(I + (d/(mε²)) Z^⊤Z) ≤ ∑_{j=1}^k (1/2) log det(I + (d/(mε²)) Z_j^⊤ Z_j),

where equality holds if and only if Q = S, which, by the formulation in (11), holds if and only if Z_{j1}^⊤ Z_{j2} = 0 for all 1 ≤ j1 < j2 ≤ k. Further using the result in Lemma A.2 gives

(1/2) log det(I + (d/(mε²)) ZZ^⊤) ≤ ∑_{j=1}^k (1/2) log det(I + (d/(mε²)) Z_j Z_j^⊤),

which produces the upper bound in (9).

A.3 An Upper Bound on Coding Rate Reduction

We may now provide an upper bound on the coding rate reduction ∆R(Z, Π, ε) (defined in (6)) in terms of its individual components {Z_j}_{j=1}^k.

Lemma A.5. For any Z ∈ R^{d×m}, Π ∈ Ω and ε > 0, let Z_j ∈ R^{d×m_j} be Z Π_j with zero columns removed. We have

∆R(Z, Π, ε) ≤ ∑_{j=1}^k (1/(2m)) log [ det^m(I + (d/(mε²)) Z_j Z_j^⊤) / det^{m_j}(I + (d/(m_j ε²)) Z_j Z_j^⊤) ],  (14)

with equality if and only if Z_{j1}^⊤ Z_{j2} = 0 for all 1 ≤ j1 < j2 ≤ k.

Proof. From (4), (5) and (6), we have

∆R(Z, Π, ε) = R(Z, ε) − R_c(Z, ε | Π)
 = (1/2) log det(I + (d/(mε²)) ZZ^⊤) − ∑_{j=1}^k (tr(Π_j)/(2m)) log det(I + (d/(tr(Π_j) ε²)) Z Π_j Z^⊤)
 = (1/2) log det(I + (d/(mε²)) ZZ^⊤) − ∑_{j=1}^k (m_j/(2m)) log det(I + (d/(m_j ε²)) Z_j Z_j^⊤)
 ≤ ∑_{j=1}^k (1/2) log det(I + (d/(mε²)) Z_j Z_j^⊤) − ∑_{j=1}^k (m_j/(2m)) log det(I + (d/(m_j ε²)) Z_j Z_j^⊤)
 = ∑_{j=1}^k (1/(2m)) [ m log det(I + (d/(mε²)) Z_j Z_j^⊤) − m_j log det(I + (d/(m_j ε²)) Z_j Z_j^⊤) ]
 = ∑_{j=1}^k (1/(2m)) log [ det^m(I + (d/(mε²)) Z_j Z_j^⊤) / det^{m_j}(I + (d/(m_j ε²)) Z_j Z_j^⊤) ],

where the inequality follows from the upper bound in Lemma A.4, and the equality holds if and only if Z_{j1}^⊤ Z_{j2} = 0 for all 1 ≤ j1 < j2 ≤ k.
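The inequalities in (9) are easy to sanity-check numerically; the following small script (our own, using random data) evaluates the three quantities in (9) for randomly drawn components.

```python
import numpy as np

rng = np.random.default_rng(1)
d, eps = 8, 0.5
Zs = [rng.standard_normal((d, m_j)) for m_j in (30, 50, 20)]
Z = np.concatenate(Zs, axis=1)
m = Z.shape[1]

def rate(M, n_scale):
    """(1/2) logdet(I + d/(n_scale * eps^2) M M^T)."""
    return 0.5 * np.linalg.slogdet(np.eye(d) + (d / (n_scale * eps**2)) * M @ M.T)[1]

lower = sum((Zj.shape[1] / m) * rate(Zj, Zj.shape[1]) for Zj in Zs)   # left side of (9)
whole = rate(Z, m)                                                    # middle of (9)
upper = sum(rate(Zj, m) for Zj in Zs)                                 # right side of (9)
assert lower <= whole + 1e-9 and whole <= upper + 1e-9
print(lower, whole, upper)
```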
Theorem A.6. Let $\Pi = \{\Pi_j \in \mathbb{R}^{m\times m}\}_{j=1}^k$ with $\{\Pi_j \ge 0\}_{j=1}^k$ and $\Pi_1 + \cdots + \Pi_k = I$ be a given set of diagonal matrices whose diagonal entries encode the membership of the $m$ samples in the $k$ classes. Given any $\epsilon > 0$, $d > 0$ and $\{d \ge d_j > 0\}_{j=1}^k$, consider the optimization problem
$$Z^\star \in \arg\max_{Z \in \mathbb{R}^{d\times m}} \Delta R(Z, \Pi, \epsilon) \quad \text{s.t.} \quad \|Z\Pi_j\|_F^2 = \mathrm{tr}(\Pi_j), \;\; \mathrm{rank}(Z\Pi_j) \le d_j, \;\; \forall j \in \{1, \ldots, k\}. \qquad (15)$$
Under the conditions
• (Large ambient dimension) $d \ge \sum_{j=1}^k d_j$, and
• (High coding precision) $\epsilon^4 < \min_{j \in \{1,\ldots,k\}} \big\{ \frac{\mathrm{tr}(\Pi_j)}{m}\, \frac{d^2}{d_j^2} \big\}$,
the optimal solution $Z^\star$ satisfies
• (Between-class discriminative) $(Z_{j_1}^\star)^\top Z_{j_2}^\star = \mathbf{0}$ for all $1 \le j_1 < j_2 \le k$, i.e., $Z_{j_1}^\star$ and $Z_{j_2}^\star$ lie in orthogonal subspaces, and
• (Within-class diverse) For each $j \in \{1, \ldots, k\}$, the rank of $Z_j^\star$ is equal to $d_j$ and either all singular values of $Z_j^\star$ are equal to $\sqrt{\mathrm{tr}(\Pi_j)/d_j}$, or the $d_j - 1$ largest singular values of $Z_j^\star$ are equal and have value larger than $\sqrt{\mathrm{tr}(\Pi_j)/d_j}$,
where $Z_j^\star \in \mathbb{R}^{d \times \mathrm{tr}(\Pi_j)}$ denotes $Z^\star\Pi_j$ with zero columns removed.
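For later reference, when the optimal solution takes the first form of the within-class-diverse property (all singular values of $Z_j^\star$ equal), the value of the objective at the optimum admits a closed form. This follows from the equality case of Lemma A.5 together with the fact that $Z_j^\star (Z_j^\star)^\top$ then has exactly $d_j$ nonzero eigenvalues, all equal to $\mathrm{tr}(\Pi_j)/d_j$; the derivation below is supplied here for illustration and is not part of the original statement. Writing $m_j = \mathrm{tr}(\Pi_j)$,
$$\Delta R(Z^\star, \Pi, \epsilon) \;=\; \sum_{j=1}^k \frac{d_j}{2m}\Big[\, m \log\Big(1 + \frac{d\, m_j}{m\, d_j\, \epsilon^2}\Big) \;-\; m_j \log\Big(1 + \frac{d}{d_j\, \epsilon^2}\Big) \Big].$$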
A.5 Proof of Main Results

We start by presenting a lemma that will be used in the proof of Theorem A.6.

Lemma A.7. Given any twice differentiable $f: \mathbb{R}_+ \to \mathbb{R}$, integer $r \in \mathbb{Z}_{++}$ and $c \in \mathbb{R}_{++}$, consider the optimization problem
$$\max_{x} \sum_{p=1}^r f(x_p) \quad \text{s.t.} \quad x = [x_1, \ldots, x_r] \in \mathbb{R}_+^r, \;\; x_1 \ge x_2 \ge \cdots \ge x_r, \;\; \text{and} \;\; \sum_{p=1}^r x_p = c. \qquad (16)$$
Let $x^\star$ be an arbitrary global solution to (16). If the conditions
• $f'(0) < f'(x)$ for all $x > 0$,
• there exists $x_T > 0$ such that $f'(x)$ is strictly increasing in $[0, x_T]$ and strictly decreasing in $[x_T, \infty)$,
• $f''(c/r) < 0$ (equivalently, $c/r > x_T$),
are satisfied, then we have either
• $x^\star = [c/r, \ldots, c/r]$, or
• $x^\star = [x_H, \ldots, x_H, x_L]$ for some $x_H \in \big(\frac{c}{r}, \frac{c}{r-1}\big)$ and $x_L > 0$.

Proof. The result holds trivially if $r = 1$. Throughout the proof we consider the case $r > 1$.

We consider the optimization problem with the ordering constraint $x_1 \ge \cdots \ge x_r$ in (16) removed:
$$\max_{x = [x_1, \ldots, x_r] \in \mathbb{R}_+^r} \sum_{p=1}^r f(x_p) \quad \text{s.t.} \quad \sum_{p=1}^r x_p = c. \qquad (17)$$
It suffices to show that any global solution $x^\star = [x_1^\star, \ldots, x_r^\star]$ to (17) is either $x^\star = [c/r, \ldots, c/r]$ or $x^\star = [x_H, \ldots, x_H, x_L]\cdot P$ for some $x_H > c/r$, $x_L > 0$, and permutation matrix $P \in \mathbb{R}^{r\times r}$. Let
$$L(x, \lambda) = \sum_{p=1}^r f(x_p) - \lambda_0 \Big(\sum_{p=1}^r x_p - c\Big) - \sum_{p=1}^r \lambda_p x_p$$
be the Lagrangian function for (17), where $\lambda = [\lambda_0, \lambda_1, \ldots, \lambda_r]$ collects the Lagrange multipliers. By the first-order optimality conditions (i.e., the Karush–Kuhn–Tucker (KKT) conditions, see, e.g., [NW06, Theorem 12.1]), there exists $\lambda^\star = [\lambda_0^\star, \lambda_1^\star, \ldots, \lambda_r^\star]$ such that
$$\sum_{q=1}^r x_q^\star = c, \qquad (18)$$
$$x_q^\star \ge 0, \quad \forall q \in \{1, \ldots, r\}, \qquad (19)$$
$$\lambda_q^\star \ge 0, \quad \forall q \in \{1, \ldots, r\}, \qquad (20)$$
$$\lambda_q^\star \cdot x_q^\star = 0, \quad \forall q \in \{1, \ldots, r\}, \quad \text{and} \qquad (21)$$
$$[f'(x_1^\star), \ldots, f'(x_r^\star)] = [\lambda_0^\star, \ldots, \lambda_0^\star] + [\lambda_1^\star, \ldots, \lambda_r^\star]. \qquad (22)$$
Using the KKT conditions, we first show that all entries of $x^\star$ are strictly positive. To prove this by contradiction, suppose that $x^\star$ has $r_0$ nonzero entries and $r - r_0$ zero entries for some $1 \le r_0 < r$. Note that $r_0 \ge 1$, since an all-zero vector $x^\star$ does not satisfy the equality constraint (18). Without loss of generality, we may assume that $x_p^\star > 0$ for $p \le r_0$ and $x_p^\star = 0$ otherwise. By (21), we have $\lambda_1^\star = \cdots = \lambda_{r_0}^\star = 0$. Plugging this into (22), we get
$$f'(x_1^\star) = \cdots = f'(x_{r_0}^\star) = \lambda_0^\star.$$
From (22), and noting that $x_{r_0+1}^\star = 0$, we get $f'(0) = f'(x_{r_0+1}^\star) = \lambda_0^\star + \lambda_{r_0+1}^\star$. Finally, from (20), we have $\lambda_{r_0+1}^\star \ge 0$. Combining the last three relations gives $f'(0) - f'(x_1^\star) \ge 0$, contradicting the assumption that $f'(0) < f'(x)$ for all $x > 0$. This shows that $r_0 = r$, i.e., all entries of $x^\star$ are strictly positive. Using this fact and (21) gives $\lambda_p^\star = 0$ for all $p \in \{1, \ldots, r\}$. Combining this with (22) gives
$$f'(x_1^\star) = \cdots = f'(x_r^\star) = \lambda_0^\star. \qquad (23)$$
It follows from the fact that $f'(x)$ is strictly unimodal that
$$\exists\, x_H \ge x_L > 0 \;\; \text{s.t.} \;\; \{x_p^\star\}_{p=1}^r \subseteq \{x_L, x_H\}. \qquad (24)$$
That is, the set $\{x_p^\star\}_{p=1}^r$ contains no more than two distinct values. To see why this is true, suppose that $\{x_p^\star\}_{p=1}^r$ contains three distinct values; without loss of generality, assume $0 < x_1^\star < x_2^\star < x_3^\star$. If $x_2^\star \le x_T$ (recall $x_T := \arg\max_{x \ge 0} f'(x)$), then since $f'(x)$ is strictly increasing in $[0, x_T]$ we must have $f'(x_1^\star) < f'(x_2^\star)$, which contradicts (23). A similar contradiction is obtained by considering $f'(x_2^\star)$ and $f'(x_3^\star)$ in the case $x_2^\star > x_T$.

There are two possible cases as a consequence of (24). First, if $x_L = x_H$, then $x_1^\star = \cdots = x_r^\star$, and by (18) we get $x_1^\star = \cdots = x_r^\star = c/r$.

It remains to consider the case $x_L < x_H$. First, by the unimodality of $f'(x)$, we must have $x_L < x_T < x_H$, and therefore
$$f''(x_L) > 0 \quad \text{and} \quad f''(x_H) < 0. \qquad (25)$$
Let $\ell := |\{p : x_p^\star = x_L\}|$ be the number of entries of $x^\star$ equal to $x_L$, and let $h := r - \ell$. We show that necessarily $\ell = 1$ and $h = r - 1$. To prove this by contradiction, assume $\ell > 1$ and $h < r - 1$. Without loss of generality, assume $\{x_p^\star = x_H\}_{p=1}^{h}$ and $\{x_p^\star = x_L\}_{p=h+1}^{r}$. By (25), we have $f''(x_p^\star) > 0$ for all $p > h$. In particular, using $h < r - 1$, we have
$$f''(x_{r-1}^\star) > 0 \quad \text{and} \quad f''(x_r^\star) > 0. \qquad (26)$$
On the other hand, by the second-order necessary conditions for constrained optimization (see, e.g., [NW06, Theorem 12.5]), the following holds:
$$v^\top \nabla_{xx} L(x^\star, \lambda^\star)\, v \le 0 \;\; \text{for all} \;\; \Big\{v : \Big\langle \nabla_x\Big(\sum_{p=1}^r x_p - c\Big),\, v\Big\rangle = 0\Big\} \iff \sum_{p=1}^r f''(x_p^\star)\, v_p^2 \le 0 \;\; \text{for all} \;\; \Big\{v = [v_1, \ldots, v_r] : \sum_{p=1}^r v_p = 0\Big\}. \qquad (27)$$
Taking $v$ such that $v_1 = \cdots = v_{r-2} = 0$ and $v_{r-1} = -v_r \ne 0$, and plugging it into (27), gives
$$f''(x_{r-1}^\star) + f''(x_r^\star) \le 0,$$
which contradicts (26). Therefore, we may conclude that $\ell = 1$. That is, $x^\star$ is given (up to permutation) by $x^\star = [x_H, \ldots, x_H, x_L]$ with $x_H > x_L > 0$. Using the condition (18), we may further show that
$$(r-1)x_H + x_L = c \;\Longrightarrow\; x_H = \frac{c - x_L}{r-1} < \frac{c}{r-1}, \qquad (r-1)x_H + x_L = c \;\text{and}\; x_L < x_H \;\Longrightarrow\; r\, x_H > c \;\Longrightarrow\; x_H > \frac{c}{r},$$
which completes our proof.
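The dichotomy in Lemma A.7 can be observed numerically for the particular $f$ that appears later in (35). The sketch below (the values of $m$, $m_j$, $d$, $\epsilon$, and the grid resolution are illustrative choices that satisfy the three conditions of the lemma; they are not tied to any experiment in the paper) brute-forces the problem for $r = 3$ and reports the best point found, which is either the uniform vector or a vector with a single smaller entry.

import numpy as np

m, mj, d, eps = 1000, 100, 128, 0.5    # illustrative values with m > m_j
c, r = mj, 3                           # budget c and number of variables r (r plays the role of d_j)

def f(x):
    # f(x; d, eps, m_j, m) as defined in the proof of Theorem A.6
    return m * np.log1p(d * x / (m * eps**2)) - mj * np.log1p(d * x / (mj * eps**2))

# Brute-force search over {x >= 0 : x_1 + x_2 + x_3 = c} on a grid.
ts = np.linspace(0.0, c, 401)
best_val, best_x = -np.inf, None
for x1 in ts:
    for x2 in ts:
        x3 = c - x1 - x2
        if x3 < 0:
            continue
        val = f(x1) + f(x2) + f(x3)
        if val > best_val:
            best_val, best_x = val, sorted([x1, x2, x3], reverse=True)

print(best_x)   # expected: either ~[c/3, c/3, c/3] or [x_H, x_H, x_L] with x_H > c/3 > x_L > 0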
Proof of Theorem A.6. Without loss of generality, let $Z^\star = [Z_1^\star, \ldots, Z_k^\star]$ be an optimal solution of problem (15).

To show that the $Z_j^\star$, $j \in \{1, \ldots, k\}$, are pairwise orthogonal, suppose, for the purpose of arriving at a contradiction, that $(Z_{j_1}^\star)^\top Z_{j_2}^\star \ne \mathbf{0}$ for some $1 \le j_1 < j_2 \le k$. By Lemma A.5, the inequality in (14) is then strict for the optimal solution $Z^\star$. That is,
$$\Delta R(Z^\star, \Pi, \epsilon) < \sum_{j=1}^k \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}. \qquad (28)$$
On the other hand, since $\sum_{j=1}^k d_j \le d$, there exist $\{U_j' \in \mathbb{R}^{d\times d_j}\}_{j=1}^k$ such that the columns of the matrix $[U_1', \ldots, U_k']$ are orthonormal. Denote by $Z_j^\star = U_j^\star \Sigma_j^\star (V_j^\star)^\top$ the compact SVD of $Z_j^\star$, and let $Z' = [Z_1', \ldots, Z_k']$, where $Z_j' = U_j' \Sigma_j^\star (V_j^\star)^\top$. It follows that
$$(Z_{j_1}')^\top Z_{j_2}' = V_{j_1}^\star \Sigma_{j_1}^\star (U_{j_1}')^\top U_{j_2}' \Sigma_{j_2}^\star (V_{j_2}^\star)^\top = V_{j_1}^\star \Sigma_{j_1}^\star\, \mathbf{0}\, \Sigma_{j_2}^\star (V_{j_2}^\star)^\top = \mathbf{0} \quad \text{for all } 1 \le j_1 < j_2 \le k.$$
That is, the matrices $Z_1', \ldots, Z_k'$ are pairwise orthogonal, and $Z'$ is feasible for (15) since each $Z_j'$ has the same singular values, hence the same Frobenius norm and rank, as $Z_j^\star$. Applying Lemma A.5 to $Z'$ gives
$$\Delta R(Z', \Pi, \epsilon) = \sum_{j=1}^k \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j'(Z_j')^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j'(Z_j')^\top\big)} = \sum_{j=1}^k \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}, \qquad (29)$$
where the second equality follows from Lemma A.3. Comparing (28) and (29) gives $\Delta R(Z', \Pi, \epsilon) > \Delta R(Z^\star, \Pi, \epsilon)$, which contradicts the optimality of $Z^\star$. Therefore, we must have
$$(Z_{j_1}^\star)^\top Z_{j_2}^\star = \mathbf{0} \quad \text{for all } 1 \le j_1 < j_2 \le k.$$
Moreover, from Lemma A.5 (the equality case) we have
$$\Delta R(Z^\star, \Pi, \epsilon) = \sum_{j=1}^k \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}. \qquad (30)$$
We now prove the result concerning the singular values of $Z_j^\star$. To start, we claim that the following holds:
$$Z_j^\star \in \arg\max_{Z_j} \log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_jZ_j^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_jZ_j^\top\big)} \quad \text{s.t.} \quad \|Z_j\|_F^2 = m_j, \;\; \mathrm{rank}(Z_j) \le d_j. \qquad (31)$$
To see why (31) holds, suppose that there exists $\widetilde{Z}_j$ such that $\|\widetilde{Z}_j\|_F^2 = m_j$, $\mathrm{rank}(\widetilde{Z}_j) \le d_j$ and
$$\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}\widetilde{Z}_j\widetilde{Z}_j^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}\widetilde{Z}_j\widetilde{Z}_j^\top\big)} > \log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}. \qquad (32)$$
Denote by $\widetilde{Z}_j = \widetilde{U}_j \widetilde{\Sigma}_j \widetilde{V}_j^\top$ the compact SVD of $\widetilde{Z}_j$ and let
$$Z' = [Z_1^\star, \ldots, Z_{j-1}^\star, Z_j', Z_{j+1}^\star, \ldots, Z_k^\star], \quad \text{where} \quad Z_j' := U_j^\star \widetilde{\Sigma}_j \widetilde{V}_j^\top.$$
Note that $\|Z_j'\|_F^2 = m_j$, $\mathrm{rank}(Z_j') \le d_j$ and $(Z_j')^\top Z_{j'}^\star = \mathbf{0}$ for all $j' \ne j$. It follows that $Z'$ is a feasible solution to (15) and that the components of $Z'$ are pairwise orthogonal. By Lemma A.5, Lemma A.3 and (32), we have
$$\Delta R(Z', \Pi, \epsilon) = \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j'(Z_j')^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j'(Z_j')^\top\big)} + \sum_{j' \ne j}\frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_{j'}^\star(Z_{j'}^\star)^\top\big)}{\det^{m_{j'}}\!\big(I + \frac{d}{m_{j'}\epsilon^2}Z_{j'}^\star(Z_{j'}^\star)^\top\big)}$$
$$= \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}\widetilde{Z}_j\widetilde{Z}_j^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}\widetilde{Z}_j\widetilde{Z}_j^\top\big)} + \sum_{j' \ne j}\frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_{j'}^\star(Z_{j'}^\star)^\top\big)}{\det^{m_{j'}}\!\big(I + \frac{d}{m_{j'}\epsilon^2}Z_{j'}^\star(Z_{j'}^\star)^\top\big)}$$
$$> \sum_{j=1}^k \frac{1}{2m}\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_j^\star(Z_j^\star)^\top\big)} = \Delta R(Z^\star, \Pi, \epsilon),$$
where the last equality is (30). This shows $\Delta R(Z', \Pi, \epsilon) > \Delta R(Z^\star, \Pi, \epsilon)$, contradicting the optimality of $Z^\star$. Therefore, the claim in (31) holds.

Observe that the optimization problem in (31) depends on $Z_j$ only through its singular values. That is, letting $\sigma_j := [\sigma_{1,j}, \ldots, \sigma_{\min\{m_j,d\},j}]$ be the singular values of $Z_j$, we have
$$\log\frac{\det^{m}\!\big(I + \frac{d}{m\epsilon^2}Z_jZ_j^\top\big)}{\det^{m_j}\!\big(I + \frac{d}{m_j\epsilon^2}Z_jZ_j^\top\big)} = \sum_{p=1}^{\min\{m_j,d\}} \log\frac{\big(1 + \frac{d}{m\epsilon^2}\sigma_{p,j}^2\big)^m}{\big(1 + \frac{d}{m_j\epsilon^2}\sigma_{p,j}^2\big)^{m_j}},$$
and also $\|Z_j\|_F^2 = \sum_{p=1}^{\min\{m_j,d\}} \sigma_{p,j}^2$ and $\mathrm{rank}(Z_j) = \|\sigma_j\|_0$. Therefore (31) is equivalent to
$$\max_{\sigma_j \in \mathbb{R}_+^{\min\{m_j,d\}}} \sum_{p=1}^{\min\{m_j,d\}} \log\frac{\big(1 + \frac{d}{m\epsilon^2}\sigma_{p,j}^2\big)^m}{\big(1 + \frac{d}{m_j\epsilon^2}\sigma_{p,j}^2\big)^{m_j}} \quad \text{s.t.} \quad \sum_{p=1}^{\min\{m_j,d\}} \sigma_{p,j}^2 = m_j, \;\; \|\sigma_j\|_0 \le d_j. \qquad (33)$$
Let $\sigma_j^\star = [\sigma_{1,j}^\star, \ldots, \sigma_{\min\{m_j,d\},j}^\star]$ be an optimal solution to (33). Without loss of generality, assume that the entries of $\sigma_j^\star$ are sorted in descending order. It follows that $\sigma_{p,j}^\star = 0$ for all $p > d_j$, and
$$[\sigma_{1,j}^\star, \ldots, \sigma_{d_j,j}^\star] = \arg\max_{\substack{[\sigma_{1,j}, \ldots, \sigma_{d_j,j}] \in \mathbb{R}_+^{d_j} \\ \sigma_{1,j} \ge \cdots \ge \sigma_{d_j,j}}} \sum_{p=1}^{d_j} \log\frac{\big(1 + \frac{d}{m\epsilon^2}\sigma_{p,j}^2\big)^m}{\big(1 + \frac{d}{m_j\epsilon^2}\sigma_{p,j}^2\big)^{m_j}} \quad \text{s.t.} \quad \sum_{p=1}^{d_j} \sigma_{p,j}^2 = m_j. \qquad (34)$$
Define
$$f(x;\, d, \epsilon, m_j, m) = \log\frac{\big(1 + \frac{d}{m\epsilon^2}x\big)^m}{\big(1 + \frac{d}{m_j\epsilon^2}x\big)^{m_j}},$$
and rewrite (34), with $x_p = \sigma_{p,j}^2$, as
$$\max_{\substack{[x_1, \ldots, x_{d_j}] \in \mathbb{R}_+^{d_j} \\ x_1 \ge \cdots \ge x_{d_j}}} \sum_{p=1}^{d_j} f(x_p;\, d, \epsilon, m_j, m) \quad \text{s.t.} \quad \sum_{p=1}^{d_j} x_p = m_j. \qquad (35)$$
The first and second derivatives of $f$ with respect to $x$ are
$$f'(x;\, d, \epsilon, m_j, m) = \frac{d^2 x\,(m - m_j)}{(dx + m\epsilon^2)(dx + m_j\epsilon^2)}, \qquad f''(x;\, d, \epsilon, m_j, m) = \frac{d^2 (m - m_j)\,(m\, m_j\, \epsilon^4 - d^2 x^2)}{(dx + m\epsilon^2)^2 (dx + m_j\epsilon^2)^2}.$$
Note that
• $f'(0) < f'(x)$ for all $x > 0$,
• $f'(x)$ is strictly increasing in $[0, x_T]$ and strictly decreasing in $[x_T, \infty)$, where $x_T = \frac{\epsilon^2\sqrt{m\, m_j}}{d}$, and
• by the condition $\epsilon^4 < \frac{m_j}{m}\frac{d^2}{d_j^2}$, we have $f''\big(\frac{m_j}{d_j}\big) < 0$.
Therefore, we may apply Lemma A.7 (with $c = m_j$ and $r = d_j$) and conclude that any optimal solution to (35) is either
• $x^\star = \big[\frac{m_j}{d_j}, \ldots, \frac{m_j}{d_j}\big]$, or
• $x^\star = [x_H, \ldots, x_H, x_L]$ for some $x_H \in \big(\frac{m_j}{d_j}, \frac{m_j}{d_j - 1}\big)$ and $x_L > 0$.
Equivalently, we have either
• $[\sigma_{1,j}^\star, \ldots, \sigma_{d_j,j}^\star] = \big[\sqrt{m_j/d_j}, \ldots, \sqrt{m_j/d_j}\big]$, or
• $[\sigma_{1,j}^\star, \ldots, \sigma_{d_j,j}^\star] = [\sigma_H, \ldots, \sigma_H, \sigma_L]$ for some $\sigma_H \in \big(\sqrt{m_j/d_j}, \sqrt{m_j/(d_j-1)}\big)$ and $\sigma_L > 0$,
as claimed.

B Additional Simulations and Experiments
B.1 Simulations - Verifying the Diversity-Promoting Properties of MCR²

As proved in Theorem A.6, the proposed MCR² objective promotes within-class diversity. In this section, we use simulated data to verify this diversity-promoting property. As shown in Table 3, we evaluate the proposed MCR² objective on simulated data. We observe that orthogonal subspaces of higher dimension achieve a higher MCR² value, which is consistent with our theoretical analysis in Theorem A.6 (a code sketch of this setup follows the table).

Table 3: MCR² objective on simulated data. We evaluate the proposed MCR² objective defined in (8), including R, R_c, and ΔR, on simulated data. The output dimension d is set to 512, 256, and 128. We set the batch size to m = 1000 and randomly assign each sample one of 10 class labels. We generate two types of data: 1) (RANDOM GAUSSIAN) For comparison with data without structure, for each class we generate random vectors sampled from a Gaussian distribution (with dimension equal to the output dimension d) and normalize each vector to lie on the unit sphere. 2) (SUBSPACE) For each class, we generate vectors sampled from its corresponding subspace of dimension d_j and normalize each vector to lie on the unit sphere. We consider subspaces from different classes that are orthogonal or non-orthogonal to each other.

OUTPUT DIMENSION | DATA | R | R_c | ΔR | ORTHOGONAL?
d = 512 | RANDOM GAUSSIAN | | | | ✓
d = 512 | SUBSPACE (d_j = 50) | 545.63 | 108.46 | | ✓
d = 512 | SUBSPACE (d_j = 40) | 487.07 | 92.71 | 394.36 | ✓
d = 512 | SUBSPACE (d_j = 30) | 413.08 | 74.84 | 338.24 | ✓
d = 512 | SUBSPACE (d_j = 20) | 318.52 | 54.48 | 264.04 | ✓
d = 512 | SUBSPACE (d_j = 10) | 195.46 | 30.97 | 164.49 | ✓
d = 512 | SUBSPACE (d_j = 1) | 31.18 | 4.27 | 26.91 | ✓
d = 256 | RANDOM GAUSSIAN | | | | ✓
d = 256 | SUBSPACE (d_j = 25) | 288.65 | 56.34 | | ✓
d = 256 | SUBSPACE (d_j = 20) | 253.51 | 47.58 | 205.92 | ✓
d = 256 | SUBSPACE (d_j = 15) | 211.97 | 38.04 | 173.93 | ✓
d = 256 | SUBSPACE (d_j = 10) | 161.87 | 27.52 | 134.35 | ✓
d = 256 | SUBSPACE (d_j = 5) | 98.35 | 15.55 | 82.79 | ✓
d = 256 | SUBSPACE (d_j = 1) | 27.73 | 3.92 | 23.80 | ✓
d = 128 | RANDOM GAUSSIAN | | | | ✓
d = 128 | SUBSPACE (d_j = 12) | 144.36 | 27.72 | | ✓
d = 128 | SUBSPACE (d_j = 10) | 129.12 | 24.06 | 105.05 | ✓
d = 128 | SUBSPACE (d_j = 8) | 112.01 | 20.18 | 91.83 | ✓
d = 128 | SUBSPACE (d_j = 6) | 92.55 | 16.04 | 76.51 | ✓
d = 128 | SUBSPACE (d_j = 4) | 69.57 | 11.51 | 58.06 | ✓
d = 128 | SUBSPACE (d_j = 2) | 41.68 | 6.45 | 35.23 | ✓
d = 128 | SUBSPACE (d_j = 1) | 24.28 | 3.57 | 20.70 | ✓
 | SUBSPACE (d_j = 50) | 145.60 | 75.31 | 70.29 | ✗
 | SUBSPACE (d_j = 40) | 142.69 | 65.68 | 77.01 | ✗
 | SUBSPACE (d_j = 30) | 135.42 | 54.27 | 81.15 | ✗
 | SUBSPACE (d_j = 20) | 120.98 | 40.71 | 80.27 | ✗
 | SUBSPACE (d_j = 15) | 111.10 | 32.89 | 78.21 | ✗
 | SUBSPACE (d_j = 12) | 101.94 | 27.73 | 74.21 | ✗
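A minimal version of the SUBSPACE and RANDOM GAUSSIAN rows of Table 3 can be reproduced with the sketch below (a sketch: balanced classes are used for simplicity instead of randomly assigned labels, and the reported numbers will not match the table exactly because they depend on the random draw; the configuration shown corresponds to the d = 128, d_j = 10 setting).

import numpy as np

def coding_rates(Z, labels, eps):
    # R, R_c and Delta R as defined in (8); Z has one (unit-norm) column per sample.
    logdet = lambda M: np.linalg.slogdet(M)[1]
    d, m = Z.shape
    R = 0.5 * logdet(np.eye(d) + d / (m * eps**2) * Z @ Z.T)
    Rc = 0.0
    for j in np.unique(labels):
        Zj = Z[:, labels == j]
        mj = Zj.shape[1]
        Rc += mj / (2 * m) * logdet(np.eye(d) + d / (mj * eps**2) * Zj @ Zj.T)
    return R, Rc, R - Rc

rng = np.random.default_rng(0)
d, k, n_per_class, dj, eps = 128, 10, 100, 10, 0.5     # m = 1000 samples in total
labels = np.repeat(np.arange(k), n_per_class)

# SUBSPACE data: split an orthonormal basis of R^d into k blocks of d_j columns,
# sample coefficients within each block, and normalize every column to the unit sphere.
U = np.linalg.qr(rng.standard_normal((d, d)))[0]
Z = np.zeros((d, k * n_per_class))
for j in range(k):
    X = U[:, j * dj:(j + 1) * dj] @ rng.standard_normal((dj, n_per_class))
    Z[:, labels == j] = X / np.linalg.norm(X, axis=0)

# RANDOM GAUSSIAN data: unit-sphere vectors with no subspace structure.
G = rng.standard_normal((d, k * n_per_class))
G /= np.linalg.norm(G, axis=0)

print(coding_rates(Z, labels, eps))    # subspace data: large Delta R
print(coding_rates(G, labels, eps))    # random Gaussian data: Delta R close to zero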
B.2 Implementation Details

Training Setting. We mainly use ResNet-18 [HZRS16] in our experiments, with 4 residual blocks of layer widths [64, 128, 256, 512]. The implementation of the network architectures used in this paper is mainly based on the github repository https://github.com/kuangliu/pytorch-cifar. For data augmentation in the supervised setting, we apply RandomCrop and RandomHorizontalFlip. For the supervised setting, we train the models for 500 epochs and use stage-wise learning rate decay every 200 epochs (decay by a factor of 10). For the self-supervised setting, we train the models for 100 epochs and use stage-wise learning rate decay at the 20th and 40th epochs (decay by a factor of 10).

Evaluation Details. For the supervised setting, we set the number of principal components for the nearest subspace classifier to r_j = 30. We also study the effect of r_j in Section B.3.2. For the CIFAR100 dataset, we consider the 20 superclasses and set the cluster number to 20, which is the same setting as in [CWM+17, WXYL18].
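Concretely, the nearest subspace classifier referred to above can be sketched as follows (a sketch under the stated setting, not the authors' released implementation): for each class j, keep the top r_j left singular vectors U_j of that class's training features, and assign a test feature z to the class whose principal subspace leaves the smallest residual ||z - U_j U_j^T z||_2.

import numpy as np

def fit_class_subspaces(features, labels, r_j=30):
    # features: d x m matrix of learned representations; one principal subspace per class.
    bases = {}
    for j in np.unique(labels):
        Zj = features[:, labels == j]
        U, _, _ = np.linalg.svd(Zj, full_matrices=False)
        bases[j] = U[:, :r_j]                      # top r_j principal components of class j
    return bases

def nearest_subspace_predict(bases, z):
    # Assign z to the class whose subspace reconstructs it best (smallest residual).
    residuals = {j: np.linalg.norm(z - U @ (U.T @ z)) for j, U in bases.items()}
    return min(residuals, key=residuals.get)

# Hypothetical usage with learned training features (d x m) and a single test feature (d,):
# bases = fit_class_subspaces(train_features, train_labels, r_j=30)
# prediction = nearest_subspace_predict(bases, test_feature)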
Datasets.
We use the default dataset interfaces provided in PyTorch (torchvision), including CIFAR10, CIFAR100, and STL10.
Augmentations T used for the self-supervised setting. We apply the same data augmentation for the CIFAR10 and CIFAR100 datasets; the code is as follows.

import torchvision.transforms as transforms

TRANSFORM = transforms.Compose([
    transforms.RandomResizedCrop(32),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    transforms.ToTensor()])
The augmentations used for the STL10 dataset are as follows.

import torchvision.transforms as transforms

TRANSFORM = transforms.Compose([
    transforms.RandomResizedCrop(96),
    transforms.RandomHorizontalFlip(),
    transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
    transforms.RandomGrayscale(p=0.2),
    GaussianBlur(kernel_size=9),
    transforms.ToTensor()])
Cross-entropy training details.
For the CE models presented in Table 1, Figures 6(d)-6(f), and Figure 7, we use the same network architecture, ResNet-18 [HZRS16], for cross-entropy training on CIFAR10, and set the output dimension of the last layer to 10. We use SGD with learning rate lr=0.1, momentum momentum=0.9, and weight decay wd=5e-4. We train for a total of 400 epochs and use stage-wise learning rate decay every 150 epochs (decay by a factor of 10).
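In PyTorch, the optimizer and schedule described above correspond roughly to the following sketch (torchvision's resnet18 is used here only as a stand-in for the CIFAR-style ResNet-18 from the repository cited above).

import torch
from torchvision.models import resnet18

model = resnet18(num_classes=10)          # 10-dimensional output for CIFAR10
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=5e-4)
# 400 epochs in total, dividing the learning rate by 10 every 150 epochs.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[150, 300], gamma=0.1)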
B.3 Additional Experimental Results

B.3.1 PCA Results of MCR² Training versus Cross-Entropy Training
For comparison, similar to Figure 3(c), we compute the principal components of the representations learned by MCR² training and by cross-entropy training. For cross-entropy training, we take the output of the second-to-last layer as the learned representation. The results are summarized in Figure 6. We also compare the cosine similarity between learned representations for both MCR² training and cross-entropy training; the results are presented in Figure 7.

As shown in Figure 6, we observe that the representations learned by MCR² are much more diverse: the dimension of the learned features for each class is around a dozen, and the dimension of the overall features is nearly 120, with the output dimension being 128. In contrast, the dimension of the overall features learned using cross-entropy is only slightly greater than 10, much smaller than that learned by MCR². From Figure 7, we find that for MCR² training the features of different classes are almost orthogonal.

Visualizing representative images selected from the CIFAR10 dataset by using MCR². As mentioned in Section 1, obtaining the desired properties of the representation under the proposed MCR² principle is equivalent to performing nonlinear generalized principal component analysis on the given dataset. As shown in Figures 6(a)-6(c), MCR² can indeed learn such diverse and discriminative representations. In order to better interpret the representations learned by MCR², we select images according to the "principal" components (singular vectors of the SVD) of the learned features. In Figure 8, we visualize images selected from class 'Bird' and class 'Ship'. For each class, we first compute the top-10 singular vectors of the SVD of the learned features, and then, for each of the top singular vectors, we display in each row the top-10 images whose corresponding features are closest to that singular vector. As shown in Figure 8, we observe that images in the same row share many common characteristics such as shapes, textures, patterns, and styles, whereas images in different rows are significantly different from each other, suggesting that our method captures all the different "modes" of the data even within the same class. Notice that the top rows are associated with components with larger singular values, hence they are images that show up more frequently in the dataset.

In Figure 9(a), we visualize the 10 "principal" images selected from CIFAR10 for each of the 10 classes. That is, for each class, we display the 10 images whose corresponding features are most coherent with the top-10 singular vectors. We observe that the selected images are much more diverse and representative than those selected randomly from the dataset (displayed on the official CIFAR website), indicating that such principal images can be used as a good "summary" of the dataset.

Figure 6 (panels (a)-(f); axes: components vs. singular values): Principal component analysis (PCA) of learned representations for the MCR²-trained model (first row: (a) overall data, first 30 components; (b) overall data; (c) every class) and the cross-entropy-trained model (second row: (d) overall data, first 30 components; (e) overall data; (f) every class).

Figure 7: Cosine similarity between learned features using the MCR² objective (left) and the CE loss (right).

Figure 8: Visualization of principal components learned for class 2 'Bird' (panel (a)) and class 8 'Ship' (panel (b)). For each class j, we first compute the top-10 singular vectors of the SVD of the learned features Z_j. Then, for the l-th singular vector of class j, u_j^l, and for the feature of the i-th image of class j, z_j^i, we calculate the absolute value of the inner product, |⟨z_j^i, u_j^l⟩|, and select the top-10 images according to |⟨z_j^i, u_j^l⟩| for each singular vector. In the two panels, each row corresponds to one singular vector (component C_l). The rows are sorted by the magnitude of the associated singular values, from large to small.

Figure 9: Visualization of the top-10 "principal" images for each class in the CIFAR10 dataset (classes: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck). (a) For each class j, we first compute the top-10 singular vectors of the SVD of the learned features Z_j. Then, for the l-th singular vector of class j, u_j^l, and for the feature of the i-th image of class j, z_j^i, we calculate the absolute value of the inner product, |⟨z_j^i, u_j^l⟩|, and select the image with the largest value for each singular vector within class j. Each row corresponds to one class, and each image corresponds to one singular vector, ordered by the value of the associated singular value. (b) For each class, 10 images are randomly selected from the dataset; these are the images displayed on the CIFAR dataset website [Kri09].
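Both the spectra in Figure 6 and the image selection in Figures 8 and 9 reduce to an SVD of the learned feature matrices. The following sketch outlines the computation (variable names such as features and labels are placeholders for the d x m representation matrix and its labels).

import numpy as np

def singular_spectra(features, labels):
    # Singular-value spectra of the overall feature matrix and of each class (cf. Figure 6).
    spectra = {'overall': np.linalg.svd(features, compute_uv=False)}
    for j in np.unique(labels):
        spectra[j] = np.linalg.svd(features[:, labels == j], compute_uv=False)
    return spectra

def principal_image_indices(features, labels, j, n_components=10, n_per_component=10):
    # For class j, rank images by |<z_i, u_l>| against the top singular vectors u_l (cf. Figures 8-9).
    Zj = features[:, labels == j]
    ids = np.where(labels == j)[0]
    U, _, _ = np.linalg.svd(Zj, full_matrices=False)
    rows = []
    for l in range(n_components):
        scores = np.abs(U[:, l] @ Zj)              # |<z_i, u_l>| for every image i of class j
        rows.append(ids[np.argsort(-scores)[:n_per_component]])
    return rows                                    # one row of image indices per singular vector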
B.3.2 Experimental Results of MCR² in the Supervised Learning Setting

Training details for the mainline experiment. For the model presented in Figure 1 (Right) and Figure 3, we use ResNet-18 to parameterize f(·, θ), and we set the output dimension d = 128, precision ε = 0.5, and mini-batch size m = 1,000. We use SGD in PyTorch [PGM+19] as the optimizer, and set the learning rate lr=0.01, weight decay wd=5e-4, and momentum=0.9.
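A PyTorch sketch of the training objective and a single update with these settings is given below (the loss is written out directly from (8); eps_sq denotes the ε² appearing there, and the model and data loader are placeholders). This is a sketch, not the authors' released code.

import torch
import torch.nn.functional as F

def mcr2_loss(Z, labels, eps_sq=0.5):
    # Negative coding rate reduction -Delta R(Z, Pi, eps); rows of Z are features on the unit sphere.
    Z = F.normalize(Z, dim=1)
    m, d = Z.shape
    I = torch.eye(d, device=Z.device)
    R = 0.5 * torch.logdet(I + d / (m * eps_sq) * Z.T @ Z)
    Rc = 0.0
    for j in labels.unique():
        Zj = Z[labels == j]
        mj = Zj.shape[0]
        Rc = Rc + mj / (2 * m) * torch.logdet(I + d / (mj * eps_sq) * Zj.T @ Zj)
    return -(R - Rc)

# One update with the mainline hyperparameters (model and loader are placeholders):
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=5e-4)
# for x, y in loader:                      # mini-batches of size 1,000
#     loss = mcr2_loss(model(x), y)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()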
Experiments studying the effect of hyperparameters and architectures. We present experimental results of MCR² training in the supervised setting using various training hyperparameters and different network architectures. The results are summarized in Table 4. Besides the ResNet architecture, we also consider the VGG architecture [SZ15] and the ResNeXt architecture [XGD+17]. Using a larger training batch size m can lead to better performance. Also, models with a higher output dimension d require a larger training batch size m.

Table 4: Experiments of MCR² in the supervised setting on the CIFAR10 dataset.

ARCH | DIM d | PRECISION ε | BATCH SIZE m | lr | ACC | COMMENT
ResNet-18 | 128 | 0.5 | 1,000 | 0.01 | 92.20% | MAINLINE (FIG. 1, FIG. 3)
ResNeXt-29 | 128 | 0.5 | 1,000 | 0.01 | 92.55% | DIFFERENT ARCHITECTURE
VGG-11 | 128 | 0.5 | 1,000 | 0.01 | 90.76% |
ResNet-18 | 512 | 0.5 | 1,000 | 0.01 | 88.60% | EFFECT OF OUTPUT DIMENSION
ResNet-18 | 256 | 0.5 | 1,000 | 0.01 | 92.10% |
ResNet-18 | 64 | 0.5 | 1,000 | 0.01 | 92.21% |
ResNet-18 | 128 | 1.0 | 1,000 | 0.01 | 93.06% | EFFECT OF PRECISION
ResNet-18 | 128 | 0.4 | 1,000 | 0.01 | 91.93% |
ResNet-18 | 128 | 0.2 | 1,000 | 0.01 | 90.06% |
ResNet-18 | 128 | 0.5 | 500 | 0.01 | 82.33% | EFFECT OF BATCH SIZE
ResNet-18 | 128 | 0.5 | 2,000 | 0.01 | 93.02% |
ResNet-18 | 128 | 0.5 | 4,000 | 0.01 | 92.59% |
ResNet-18 | 512 | 0.5 | 2,000 | 0.01 | 92.47% |
ResNet-18 | 512 | 0.5 | 4,000 | 0.01 | 92.17% |
ResNet-18 | 128 | 0.5 | 1,000 | 0.05 | 86.02% | EFFECT OF lr
ResNet-18 | 128 | 0.5 | 1,000 | 0.005 | 92.39% |
ResNet-18 | 128 | 0.5 | 1,000 | 0.001 | 92.23% |

Effect of r_j on classification. Unless otherwise stated, we set the number of components to r_j = 30 for nearest subspace classification. We study the effect of r_j when used for classification, and the results are summarized in Table 5. We observe that nearest subspace classification works for a wide range of r_j.

Table 5: Effect of the number of components r_j for nearest subspace classification in the supervised setting.

NUMBER OF COMPONENTS | r_j = 10 | r_j = 20 | r_j = 30 | r_j = 40 | r_j = 50
MAINLINE (LABEL NOISE RATIO = 0.0) | 92.68% | 92.53% | 92.20% | 92.32% | 92.17%
LABEL NOISE RATIO = 0.1 | 91.71% | 91.73% | 91.16% | 91.83% | 91.78%
LABEL NOISE RATIO = 0.2 | 90.68% | 90.61% | 89.70% | 90.62% | 90.54%
LABEL NOISE RATIO = 0.3 | 88.24% | 87.97% | 88.18% | 88.15% | 88.10%
LABEL NOISE RATIO = 0.4 | 86.49% | 86.67% | 86.66% | 86.71% | 86.44%
LABEL NOISE RATIO = 0.5 | 83.90% | 84.18% | 84.30% | 84.18% | 83.76%
Effect of ε on learning from corrupted labels. To further study the proposed MCR² objective when learning from corrupted labels, we use different precision parameters in addition to the one used in Table 1. Except for the precision parameter ε, all other parameters are the same as in the mainline experiment (the first row of Table 4). The first row (ε = 0.5) in Table 6 is identical to the MCR² TRAINING row in Table 2. Notice that with slightly different choices of ε, one may even see slightly improved performance over the results reported in the main body.

Table 6: Effect of the precision ε on classification results with features learned from labels corrupted at different levels using MCR² training.

PRECISION | RATIO = 0.1 | RATIO = 0.2 | RATIO = 0.3 | RATIO = 0.4 | RATIO = 0.5
ε = 0.5 | | | | |
ε = 0.75 | | | | |
ε = 1.0 | | | | |

B.3.3 Experimental Results of MCR² in the Self-supervised Learning Setting

Training details of MCR²-CTRL. For the three datasets (CIFAR10, CIFAR100, and STL10), we use ResNet-18 as in the supervised setting, and we set the output dimension d = 128, precision ε = 0.5, mini-batch size k = 20, number of augmentations n = 50, and γ1 = γ2 = 20. We observe that MCR²-CTRL can achieve better clustering performance by using a smaller γ, i.e., γ = 15, on the CIFAR10 and CIFAR100 datasets. We use SGD in PyTorch [PGM+19] as the optimizer, and set the learning rate lr=0.1, weight decay wd=5e-4, and momentum=0.9.

Training dynamics: comparison between MCR² and MCR²-CTRL. In the self-supervised setting, we compare the training process of MCR² and MCR²-CTRL in terms of R, R̃, R_c, and ΔR. For MCR² training, the features first expand (for both R and R_c) and then compress (for R_c). For MCR²-CTRL, both R̃ and R_c first compress, then R̃ expands quickly while R_c remains small, as we have seen in Figure 5 in the main body.
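A sketch of how one such self-supervised mini-batch can be assembled is given below (assuming the dataset returns raw PIL images, i.e., is constructed without a transform; mcr2_ctrl_loss is a hypothetical name for the controlled objective with its γ coefficients, which is not spelled out here).

import torch

def augmented_batch(dataset, transform, k=20, n=50):
    # Draw k samples; each contributes n augmented views that share one pseudo-label.
    idx = torch.randint(len(dataset), (k,))
    views, pseudo_labels = [], []
    for j, i in enumerate(idx.tolist()):
        img, _ = dataset[i]                        # the ground-truth label is not used
        views += [transform(img) for _ in range(n)]
        pseudo_labels += [j] * n
    return torch.stack(views), torch.tensor(pseudo_labels)

# Hypothetical usage with the TRANSFORM defined in Section B.2:
# x, y = augmented_batch(cifar10_trainset, TRANSFORM)
# loss = mcr2_ctrl_loss(model(x), y)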
Clustering results comparison. We compare the clustering performance of MCR² and MCR²-CTRL in terms of NMI, ACC, and ARI. The clustering results are summarized in Table 7. We find that MCR²-CTRL achieves better clustering performance.

Table 7: Clustering comparison between MCR² and MCR²-CTRL on the CIFAR10 dataset.

 | NMI | ACC | ARI
MCR² | | |
MCR²-CTRL | | |
B.3.4 Clustering Metrics and More Results
We first introduce the definitions of normalized mutual information (NMI) [SG02], clustering accuracy (ACC), and adjusted rand index (ARI) [HA85].
Normalized mutual information (NMI).
Suppose $Y$ is the ground-truth partition and $C$ is the predicted partition. The NMI metric is defined as
$$\mathrm{NMI}(Y, C) = \frac{\sum_{i=1}^{k}\sum_{j=1}^{s} |Y_i \cap C_j| \log\Big(\frac{m\,|Y_i \cap C_j|}{|Y_i|\,|C_j|}\Big)}{\sqrt{\Big(\sum_{i=1}^{k} |Y_i| \log\frac{|Y_i|}{m}\Big)\Big(\sum_{j=1}^{s} |C_j| \log\frac{|C_j|}{m}\Big)}},$$
where $Y_i$ is the $i$-th cluster in $Y$, $C_j$ is the $j$-th cluster in $C$, and $m$ is the total number of samples.

Clustering accuracy (ACC).
Given $m$ samples $\{(x_i, y_i)\}_{i=1}^m$, let $y_i$ be the ground-truth label of the $i$-th sample $x_i$ and let $c_i$ be its cluster label. The ACC metric is defined as
$$\mathrm{ACC}(Y, C) = \max_{\sigma \in S} \frac{\sum_{i=1}^m \mathbf{1}\{y_i = \sigma(c_i)\}}{m},$$
where $S$ is the set of all one-to-one mappings from clusters to labels, $Y = [y_1, \ldots, y_m]$, and $C = [c_1, \ldots, c_m]$.
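The maximization over one-to-one mappings σ in ACC is a linear assignment problem and can be solved with the Hungarian algorithm; NMI and the ARI defined next are available directly in scikit-learn. A sketch:

import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

def clustering_accuracy(y_true, y_pred):
    # Confusion matrix between clusters and labels, then the best one-to-one matching.
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    counts = np.zeros((len(clusters), len(classes)), dtype=int)
    for a, c in enumerate(clusters):
        for b, k in enumerate(classes):
            counts[a, b] = np.sum((y_pred == c) & (y_true == k))
    row, col = linear_sum_assignment(-counts)       # maximize the number of matched samples
    return counts[row, col].sum() / len(y_true)

# acc = clustering_accuracy(y_true, y_pred)
# nmi = normalized_mutual_info_score(y_true, y_pred, average_method='geometric')  # sqrt normalization as above
# ari = adjusted_rand_score(y_true, y_pred)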
Adjusted rand index (ARI). Suppose there are $m$ samples, and let $Y$ and $C$ be two clusterings of these samples, where $Y = \{Y_1, \ldots, Y_r\}$ and $C = \{C_1, \ldots, C_s\}$. Let $m_{ij}$ denote the size of the intersection between $Y_i$ and $C_j$, i.e., $m_{ij} = |Y_i \cap C_j|$. The ARI metric is defined as
$$\mathrm{ARI} = \frac{\sum_{ij}\binom{m_{ij}}{2} - \Big(\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big)\Big/\binom{m}{2}}{\frac{1}{2}\Big(\sum_i \binom{a_i}{2} + \sum_j \binom{b_j}{2}\Big) - \Big(\sum_i \binom{a_i}{2}\sum_j \binom{b_j}{2}\Big)\Big/\binom{m}{2}},$$
where $a_i = \sum_j m_{ij}$ and $b_j = \sum_i m_{ij}$.

More experiments on the effect of hyperparameters of MCR²-CTRL. We provide more experimental results of MCR²-CTRL training in the self-supervised setting by varying training hyperparameters on the STL10 dataset. The results are summarized in Table 8. Notice that the choice of hyperparameters has only a small effect on the performance of the MCR²-CTRL objective. We hypothesize that, in order to further improve the performance, one has to seek other, potentially better, ways of controlling the optimization dynamics or strategies. We leave these for future investigation.
Table 8: Experiments of MCR²-CTRL in the self-supervised setting on the STL10 dataset.

ARCH | PRECISION ε | LEARNING RATE lr | NMI | ACC | ARI
ResNet-18 | 0.5 | 0.1 | 0.446 | 0.491 | 0.290
ResNet-18 | 0.75 | 0.1 | 0.450 | 0.484 | 0.288
ResNet-18 | 0.25 | 0.1 | 0.447 | 0.489 | 0.293
ResNet-18 | 0.5 | 0.2 | 0.477 | 0.473 | 0.295
ResNet-18 | 0.5 | 0.05 | 0.444 | 0.496 | 0.293
ResNet-18 | 0.25 | 0.05 | 0.454 | 0.489 | 0.294