Parity Partition Coding for Sharp Multi-Label Classification

Christopher G. Blake*, Giuseppe Castiglione*, Christopher Srinivasa, Marcus Brubaker

*Shared first authorship

August 27, 2019
Abstract
The problem of efficiently training and evaluating image classifiers that can distinguish between a large number of object categories is considered. A novel metric, sharpness, is proposed, defined as the fraction of object categories that are classified above a threshold accuracy. To estimate sharpness (along with a confidence value), a technique called fraction-accurate estimation is introduced, which samples categories and then samples instances from these categories. In addition, a technique called parity partition coding, a special type of error correcting output code, is introduced, increasing sharpness while reducing the multi-class problem to a multi-label one with exponentially fewer outputs. We demonstrate that this approach outperforms the baseline model for both MultiMNIST and CelebA, while requiring fewer parameters and exceeding state of the art accuracy on individual labels.
Introduction

Multi-label classification is the problem of, given an image or instance, accurately labelling its attributes. For example, given an image of a face, a multi-label classification task may require us to identify whether the face is male, whether the eyes are blue, whether the face is smiling, and so on. Since each of the attributes can vary independently, the number of possible classes to which an instance can belong can grow exponentially in the number of attributes. A naive and unrealistic approach to this problem is to create a classifier for each of the 2^K possible categories (where K is the number of attribute labels). In practice this is not done; the more sensible approach is to train a binary classifier for each binary attribute and to estimate the instance's category from the outputs of these classifiers (this is the multi-class to binary technique introduced in [1]).

There are three engineering challenges with this approach which we address in this paper. First, we want to make sure that the classifier is accurate. Second, we want the accurate classifiers to be compact and have a small training time. Finally, we want to be able to test and benchmark such classifiers, and we want to do so in a way that ensures that the classifiers are not biased. In this paper we introduce a technique called parity partition coding, a special type of error correcting output code [2]. We demonstrate that using this technique results in more accurate classifiers when ensembling a comparable number of models. The technique involves learning both primitive attributes (those with labels) and derived attributes (in our case, attributes that can be viewed as parity functions of primitive attributes). As a sub-technique, for training the derived attributes, we introduce quadratic feature transformation, which we show decreases training time for learning parity attributes. Finally, we introduce a technique called fraction-accurate estimation to estimate the fraction of categories that are accurate above a threshold by sampling over categories and then sampling instances of each of these categories.

We demonstrate our techniques on two problems: a synthetic dataset we call multiMNIST, and a facial attribute recognition challenge based on the celebA dataset. For the celebA problem, we show that our technique compares to state of the art accuracy on individual attributes. When accuracy is defined over instances so that all the attributes must be correct for the classifier to be correct, our technique outperforms our baseline ensemble of primitive attribute models.

Prior Work
The authors of [3] argue that there is a relationship between error control coding and machine learning, which inspires our work. Our work herein is in the category of ensemble methods in machine learning, and [4] provides a good overview. Our paper also involves learning attributes as in [5, 6].

The technique we consider here can be viewed as a special case of an error correcting output code invented by Dietterich et al. [2]. This is a technique used for multi-class classification problems, like digit recognition. The Dietterich technique involves selecting binary attributes that characterize target categories. Each category is then associated with a unique binary attribute string encoding the attributes associated with this category. A sample instance is then fed into each binary categorizer, producing an estimate of the binary attributes of that instance. The category is then chosen to be the one whose attribute string is closest in Hamming distance to the estimated attribute string. In our work, we consider the problem where the attributes are all labelled, and the object categories under consideration are those defined by the unique configuration of these attributes. This requires us to train both primitive attribute classifiers and derived attribute classifiers.

Shannon [7] introduced the concept of error control coding by proving that there exist codes with asymptotic rates above zero and error probability that approaches zero for channels with independent noise. In the channel coding problem, the fundamental cost is in channel uses. In the attribute classification problem, the analogous costs are training energy and the energy used for classification of a new instance using the already trained network. In the channel coding case, high rate codes imply high channel use efficiency. For parity partition coding, high rates imply energy efficient training and deployment.

The first paper to recognize that the output of a classifier is equivalent to the output of a noisy channel is [8]. In [9], error control output codes are considered for the multi-label classification problem. The authors of [10] consider the problem of ensembling neural networks and point out that these ensembles work if the models are accurate (better than random) and diverse (different models make errors on different inputs). This notion is weaker than the pure independence assumption which informs the theory in Section 4.

An analysis of the effectiveness of error correcting output codes is given in [11], and in [12] the authors consider a similar model to ours. We build upon this literature by performing experiments to show what we consider the main advantage of error correcting output codes: an asymptotic savings in the number of models needed in an ensemble to reach high accuracy.

Problem Definition

We will define our problem generally and then use a synthetic multiMNIST dataset to give a concrete example. Example instances of multiMNIST are given in Figure 1. We also consider the attribute-labelled celebA dataset, which includes images of celebrity faces with attributes like 'has bangs' and 'glasses.' We let the instance space be the set of all possible inputs into the classifier [13].
Definition 1
A binary attribute is a bipartition of the instance space. A primitive attribute is an attribute that is labelled. A derived attribute is an attribute induced by a function of the primitive attributes of an instance.
For example, in multiMNIST, '1' and '2' are primitive attributes. In celebA, "has bangs" is a primitive attribute. The attribute '1 XOR 2' is a derived attribute for the multiMNIST dataset: it is the set of instances containing a '1' but not a '2', or a '2' but not a '1.'
Definition 2 An attribute template is an ordered tuple of attributes. The primitive attribute string of an instance is a string of symbols representing the state of the attributes for that instance.

The idea here is that an image encodes a "message" within the state of its attributes. The goal of classification is to figure out the primitive attribute string of the instance. For example, in the multiMNIST problem, the goal is to produce a length-10 binary string where each element of the string corresponds to the presence (1) or absence (0) of one of the MNIST digits that may be present in the image. See the labels at the top of each image in Figure 1 for examples of primitive attribute strings. We let the length of each element of the attribute space (or, equivalently, the length of the attribute template) be denoted by the symbol $K$. We denote a primitive attribute string of length $K$ as $X_1, \ldots, X_K$.

Figure 1: Samples from the multiMNIST dataset, with corresponding attribute strings labelled at the top of each image. This dataset is produced by randomly generating positions and numbers, and then drawing a random MNIST image for each number and placing it at a random position.

Consider a sequence of $N$ functions, each mapping a primitive attribute string to $\{0, 1\}$. Let us denote these functions $f_1, f_2, \ldots, f_N$.

Definition 3 A length-$N$ binary linear error correcting output code is a sequence of $N$ such functions where each function is a mod 2 sum of a subset of the elements of the primitive attribute string. Note that such a function may simply output a primitive attribute (i.e., it is the sum of a single attribute). We shall also consider derived attributes where $f_i(X_1, \ldots, X_K)$ is a mod 2 sum of a subset of the $K$ elements of the primitive attribute string. The functions $f_1, f_2, \ldots, f_N$ are called the encoding functions.

Without loss of generality, we denote the outputs of the $N$ functions as $Y_1, \ldots, Y_N$. Note that each of these functions induces a partition of the instance space, one partition for each function $f_i$, in the natural way. Precisely, a function $f_i$ mapping an attribute string to 0 or 1 bipartitions the instance space into two sets, $A_0$ and $A_1$, where $A_i$ is the set of instances whose attribute string maps to the symbol $i$. Thus, for each of these partitions we can train a binary classifier, producing $N$ binary classifiers. This is the key idea of parity partition coding.

After feeding a sample instance into the $N$ classifiers, a sample output string is produced, which is not necessarily the ground truth output string, i.e., the string produced when feeding the primitive attribute string into the encoding functions. The task is to find the attribute string whose ground truth output string is "closest" to the sample output string, where closeness may be measured in Hamming distance. Estimating the closest ground truth output string is called decoding.

The technique of training primitive and parity attribute classifiers and then decoding the outputs produced on an instance is what we call parity partition coding. Note that this is a special case of error correcting output codes [2], except that in this case we distinguish primitive and parity attributes.
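As an illustration, the following is a minimal Python sketch (not the authors' implementation) of encoding and minimum-Hamming-distance decoding for a parity partition code in which the derived attributes are all pairwise XORs of the primitive attributes; the choice of subsets, the small K, and the brute-force decoder are assumptions made for readability.

from itertools import combinations, product
import numpy as np

K = 4                                             # number of primitive attributes
subsets = [(i,) for i in range(K)] + list(combinations(range(K), 2))  # primitive bits + pairwise parities

def encode(x):
    """Map a primitive attribute string x (length K, entries 0/1) to its ground-truth output string."""
    return np.array([sum(x[i] for i in s) % 2 for s in subsets])

def decode(y_hat):
    """Return the attribute string whose ground-truth output string is closest in Hamming distance."""
    candidates = [np.array(x) for x in product([0, 1], repeat=K)]
    return min(candidates, key=lambda x: np.sum(encode(x) != y_hat))

x = np.array([1, 0, 1, 1])                        # ground-truth primitive attribute string
y = encode(x)                                     # what perfect classifiers would output
y_noisy = y.copy(); y_noisy[2] ^= 1               # one classifier makes an error
print(decode(y_noisy))                            # the single error is corrected: [1 0 1 1]

Because flipping any single attribute changes its identity bit and all K-1 pairwise parities involving it, the nearest-codeword rule in this small example corrects any single classifier error.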
We shall use the admittedly overly simple assumption that each binary classifier is a binary asymmetric channel that is independently distributed, where the input is a label for an attribute, and the output is a 0 or 1 indicating the classifier's estimate of the state of that attribute for a random instance of that attribute. An illustration of this concept is given in Figure 3. Moreover, we assume we will be testing our classifier on an independent attribute distribution. That is, we shall assume that each attribute occurs independently and randomly with probability 1/2.

We consider error control coding for the binary asymmetric channel. We assume $K$ bits encoded to length $N_s = K/R$, where $R$ is the rate of the code. We define error probability as the probability that our classifier does not produce the correct primitive attribute string for an instance. We consider using a Shannon code [7] that has error probability for the channel $e^{-c_1 N_s}$ for some $c_1 > 0$. (Such codes exist, see [15, 16]. Moreover, they have close to optimal encoding and decoding complexity [17].)

Figure 2: A diagram of a binary classifier and its equivalence to a binary asymmetric channel. The random image generator takes as input a binary variable indicating the desired state of a given attribute, and produces an image that has that attribute. The purpose of the classifier is to classify images created by this random image generator. Sometimes the random image generator will be some randomized function, but it can also be the natural images that arise in an application. We see at the bottom of the diagram that once the random distribution and the classifier are fixed, the labelled black box is exactly equivalent to a binary asymmetric channel, where $1-p$ is the probability that an image with attribute label 0 is classified as a 1, and $1-q$ is the probability that an image with attribute label 1 is classified as a 0. When there are multiple attributes (shown in Figure 3), this can be viewed as multiple parallel binary asymmetric channels.
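To make the channel abstraction of Figure 2 concrete, the following is a minimal simulation sketch (an illustration under the independence assumption above, not the authors' code); the values of p and q are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
p, q = 0.95, 0.90                          # hypothetical: P(output 0 | label 0) = p, P(output 1 | label 1) = q

def classifier_as_channel(labels):
    """Simulate a fixed classifier on images drawn with the given attribute labels (0/1)."""
    flip0 = rng.random(labels.shape) > p   # label-0 images misclassified with probability 1 - p
    flip1 = rng.random(labels.shape) > q   # label-1 images misclassified with probability 1 - q
    return np.where(labels == 0, flip0.astype(int), 1 - flip1.astype(int))

labels = rng.integers(0, 2, size=100000)   # independent attribute distribution, probability 1/2
outputs = classifier_as_channel(labels)
print((outputs[labels == 0] == 1).mean())  # empirical estimate of 1 - p
print((outputs[labels == 1] == 0).mean())  # empirical estimate of 1 - q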
Figure 3: Diagram of an attribute classifier for classifying the digits 0, 1, 2, and 3 in the multiMNIST dataset using a parity partition coding scheme based on a (7, 4)-Hamming code [14]. The random image generator takes as input a binary string and produces an image with attributes encoded by this binary string. The purpose of the binary classifier and decoding layers is to estimate what the original attribute string was. In the binary classifier layer, the top classifiers are binary classifiers for {0, 1, 2, 3}, and the three bottom classifiers are parity partition classifiers: that is, they are trained to output 1 when the mod 2 sum of their check bits is 1. The ground truth symbol for each of the top attributes is denoted $x_1, x_2, x_3, x_4$ as labelled in the diagram. So, for example, one of the bottom parity classifiers is trained to recognize whether, for a given image, the mod 2 sum of its three associated attribute bits (e.g., $x_1 + x_2 + x_3$) equals 1, where addition is performed modulo 2.

As a comparison, we consider a repetition code that has length $N_r = K n_r$, where $n_r$ is the number of repetitions. Such a code is equivalent to training $n_r$ separate $K$-output primitive attribute classifiers. A simple probability analysis shows that such a code has error probability that scales as $e^{-c_2 n_r}$ for some $c_2 > 0$. Thus, to get equivalent error probability for the repetition code as for the Shannon code, it must be that $e^{-c_2 n_r} = e^{-c_1 N_s}$, and thus $n_r = c' N_s$ for another $c' > 0$. This means that the length of the repetition code must be $N_r = c' K N_s$, that is, a factor of $K$ greater than the equivalent Shannon code. This is the primary advantage of the parity partition coding technique: an asymptotic savings in the number of binary classifiers needed.

We find that training binary classifiers for the derived attributes (those corresponding to parity functions) is more difficult than for primitive attributes. This has been observed in the literature [18] when learning a simple parity function (and not a parity of attributes). This motivates the following techniques, which we test on multiMNIST: quadratic feature transformation, targeted bagging, and pre-trained weight initialization. We describe each of these techniques below and explain how our theory informs their use.
Quadratic Feature Transformation

To employ the parity partition coding technique, we need to learn a parity function of primitive attributes. We also want models that learn different parity attributes to be accurate and diverse. Thus, inspired by the technique of quadratic transformation of the inputs to a parity function learner [19], in the output layer of the neural network model we apply a quadratic transformation of the features, via an outer product, before they are fed into the final linear layer. Due to symmetry, we retain only the off-diagonal, upper-triangular portion of the outer-product matrix. This transformation makes XOR linearly separable, and thus easier to learn. In our experiments we found that, all other things being equal, quadratic feature transformation substantially cut down the number of training epochs needed to reach the same accuracy. A minimal sketch of this transformation is given after the next subsection.

Targeted Bagging

Adapted from [20], targeted bagging trains different targets on different splits of the dataset. This leads to decorrelation of the outputs. Roughly speaking, this corresponds to making the outputs of each of the models more independent, which according to our theory should increase the effectiveness of the parity partition coding technique, since it makes the errors more independent. In Table 1 we can see the effectiveness of this technique.
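The following is a minimal PyTorch sketch of the quadratic feature transformation head described above (the feature dimension, the 45-output example, and the class name are illustrative assumptions rather than the paper's exact architecture).

import torch
import torch.nn as nn

class QuadraticFeatureHead(nn.Module):
    """Outer product of features, keep the off-diagonal upper-triangular part,
    then a linear layer with sigmoid, as in the quadratic feature transformation."""
    def __init__(self, feat_dim, n_outputs):
        super().__init__()
        idx = torch.triu_indices(feat_dim, feat_dim, offset=1)   # off-diagonal, upper-triangular pairs
        self.register_buffer("idx", idx)
        self.linear = nn.Linear(idx.shape[1], n_outputs)

    def forward(self, z):                                         # z: (batch, feat_dim) pooled features
        outer = z.unsqueeze(2) * z.unsqueeze(1)                   # (batch, d, d) outer products
        quad = outer[:, self.idx[0], self.idx[1]]                 # (batch, d*(d-1)/2) pairwise products
        return torch.sigmoid(self.linear(quad))

head = QuadraticFeatureHead(feat_dim=64, n_outputs=45)            # e.g., one output per pairwise parity
y = head(torch.randn(8, 64))                                      # y: (8, 45) parity probabilities

The retained pairwise products are exactly the feature interactions that make an XOR-like target linearly separable for the final layer.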
Pre-Trained Weight Initialization

We also study a technique called pre-trained weight initialization. This involves re-using feature extractors trained for a separate classification task and holding them fixed throughout training on a new task. This introduces some correlation amongst the outputs; however, it dramatically cuts down on training time (another engineering consideration).
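A minimal PyTorch sketch of this kind of re-use (the extractor and head here are stand-ins, not the paper's models): a previously trained feature extractor is frozen and only a new head is trained.

import torch.nn as nn

pretrained_extractor = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())   # stand-in for a trained extractor
for param in pretrained_extractor.parameters():
    param.requires_grad = False                      # hold the extractor fixed in the new task

new_head = nn.Linear(16, 1)                          # only this part is trained for the new attribute
model = nn.Sequential(pretrained_extractor, new_head)
trainable = [p for p in model.parameters() if p.requires_grad]   # pass only these to the optimizer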
Fraction-Accurate Estimation

As we get to a huge number of object categories (as in attribute classification), it is hard to estimate the number of categories that can be accurately categorized. Note that this is different from accuracy on a typical test set: in real datasets, especially multi-label datasets, the distribution of categories of the instances heavily favours typical categories. Thus, standard accuracy scores do not capture how "balanced" the classifier is.

However, it is expensive to produce novel instances, especially from rare categories. Nevertheless, we do not need to test every category to get an estimate of the number of categories that will be accurate. All we need is a distribution from which to draw instances; once we have this, we can sample instances from this distribution and use them to produce a bound on the fraction of categories that are accurate. We can also use statistical tools to produce a confidence: the probability that our estimating technique will produce a correct bound (that is, a bound that includes the actual fraction of accurate categories).

Let $X$ be an instance space. We consider a set of possible categories into which elements of the instance space can be classified, which we denote $C$. We consider a distribution of inputs $P(x \mid c)$, where $c \in C$ and $x$ is in the set of all instances in $X$ from category $c$. When an instance is put into a classifier, the output is a category $\hat{c} \in C$, which is the classifier's estimate of the instance's category $c$. We let $X_c$ (where $c \in C$) represent the set of instances that are in category $c$. Of all the $|C|$ possible classes (equal to $2^K$ in the case of $K$ binary attributes), we let the vector $(p_1, \ldots, p_{|C|})$ be the vector of true-positive accuracies for each of the $|C|$ categories. Precisely, the true-positive accuracy is defined as $P(\hat{c} = c \mid c)$, the probability that the classifier's estimate is equal to class $c$, conditioned on $c$ being the class.
Definition 4 We say that a particular category $c$ is $\alpha$-accurate if the classifier has a true-positive accuracy for category $c$ greater than or equal to $\alpha$. Note that this is not a random variable; it is a fixed property of the category and the classifier.
Definition 5 We call the fraction of categories that are $\alpha$-accurate the accurate fraction and denote it $\theta$.

Our goal is to estimate $\theta$. We do so by defining a fraction-accurate estimator, with parameters $\epsilon_1$, $\epsilon_2$, $\alpha$, $M$ and $N$. In such an estimator, we first randomly draw $N$ categories (each drawn from the set $C$ of categories, with replacement), and for each of these $N$ categories produce $M$ sample instances. We feed these instances into the classifier and compare the classifier's output to the true category, producing a vector of empirical accuracies, which we denote $(\hat{p}_1, \ldots, \hat{p}_N)$. We declare each category whose empirical accuracy is at least $\alpha + \epsilon_1$ to be classifiable. We compute the fraction of these $N$ categories that are classifiable, which we call $\hat{q}$, and then claim that at least a $(\hat{q} - \epsilon_2)$ fraction of the categories are classifiable. We call the value $\epsilon_1$ the accuracy deviation threshold and the value $\epsilon_2$ the fraction deviation threshold. We call such an estimator an $(\alpha, N, M, \epsilon_1, \epsilon_2)$-fraction accurate estimator.

Note that the estimator is also associated with a random experiment, which in this case corresponds to randomly drawing $N$ categories and then drawing $M$ instances of each category. Thus it assumes there is some distribution from which instances can be drawn.

We define

$$c_{p,n}(x) = \sum_{i=x}^{n} \binom{n}{i} p^i (1-p)^{n-i} \qquad (1)$$

which is the weight in the tail of a binomial distribution with $n$ coin flips and coin bias $p$. In other words, it is the probability that a coin with probability of heads $p$ comes up with $x$ or more heads after $n$ flips.
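A minimal sketch (assuming NumPy) of the fraction-accurate estimation procedure; sample_instance and classify are placeholders for the instance distribution and the trained classifier, not functions defined in the paper.

import numpy as np

def fraction_accurate_estimate(sample_instance, classify, categories, alpha, N, M, eps1, eps2, seed=0):
    """Return the claimed lower bound (q_hat - eps2) on the accurate fraction theta.
    sample_instance(c) -> an instance drawn from category c; classify(x) -> predicted category."""
    rng = np.random.default_rng(seed)
    drawn = rng.choice(categories, size=N, replace=True)         # sample N categories with replacement
    p_hat = np.array([np.mean([classify(sample_instance(c)) == c for _ in range(M)]) for c in drawn])
    q_hat = np.mean(p_hat >= alpha + eps1)                       # fraction declared classifiable
    return q_hat - eps2                                          # claimed lower bound on theta

The confidence attached to the returned bound is the subject of Theorem 1 below.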
Theorem 1 The confidence of an $(\alpha, N, M, \epsilon_1, \epsilon_2)$-fraction-accurate estimator is bounded by

$$\text{confidence} \;\geq\; \min_{0 < \theta < 1} \; 1 \;-\; c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\!\left(\lceil \epsilon_2 N / 2 \rceil\right) \;-\; c_{\theta,N}\!\left(\lceil(\theta+\epsilon_2/2)N\rceil\right)$$

where we recall the definition in (1), the weight of the tail of a binomial distribution.
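A small numerical sketch (assuming NumPy and SciPy) of evaluating this bound by sweeping over $\theta$, as described at the end of the proof; the parameter values passed at the bottom are illustrative only, not the paper's settings.

import numpy as np
from scipy.stats import binom

def binom_tail(p, n, x):
    """c_{p,n}(x): probability of x or more successes in n trials with success probability p."""
    return binom.sf(x - 1, n, p)

def confidence_bound(alpha, N, M, eps1, eps2, grid=2000):
    p_c = binom_tail(alpha, M, int(np.ceil((alpha + eps1) * M)))  # chance an exactly-alpha category looks accurate
    best = 1.0
    for theta in np.linspace(0.001, 0.999, grid):                 # sweep over theta, keep the minimum
        p_a = binom_tail(theta, N, int(np.ceil((theta + eps2 / 2) * N)))
        n_bad = max(N - int(np.floor((theta + eps2 / 2) * N)), 0)
        p_b = binom_tail(p_c, n_bad, int(np.ceil(eps2 * N / 2)))
        best = min(best, 1.0 - p_a - p_b)
    return best

print(confidence_bound(alpha=0.9, N=100, M=20, eps1=0.05, eps2=0.05))   # illustrative parameters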
Proof: Let the event $E(\theta, N, M, \epsilon_1, \epsilon_2)$ be the event that a random experiment followed by a fraction-accurate estimator produces an estimate that does not include $\theta$ (that is, the event of an error). We omit the arguments and denote this event $E$. Our goal shall be to upper bound this probability, and we do so using straightforward probability arguments.

To bound this probability, we define the following events:

$A$: the event that we draw greater than or equal to $(\theta + \epsilon_2/2)N$ classifiable categories. (2)

$B$: the event that we draw less than $(\theta + \epsilon_2/2)N$ classifiable categories but more than $(\theta + \epsilon_2)N$ categories are classified as accurate; that is, $B = \bar{A} \cap E$. (3)

In the definition of $B$ we observe that the event that "more than $(\theta + \epsilon_2)N$ categories are classified as accurate" will result in a lower bound estimate of $\theta$ at least slightly greater than $\theta$. In other words, this event is $\bar{A} \cap E$. Observe that by using elementary set theory operations:

$$E = (A \cap E) \cup (\bar{A} \cap E) \subseteq A \cup (\bar{A} \cap E) = A \cup B$$

Thus we can conclude that $P(E) \leq P(A \cup B) \leq P(A) + P(B)$. We now study these two events $A$ and $B$ separately.

We first bound event $A$, the event that we draw greater than or equal to $(\theta + \epsilon_2/2)N$ classifiable categories when only a $\theta$ fraction of categories are classifiable. The probability can be bounded by the tail of a binomial distribution:

$$P(A) \leq \sum_{i=\lceil(\theta+\epsilon_2/2)N\rceil}^{N} \binom{N}{i} \theta^i (1-\theta)^{N-i} = c_{\theta,N}\left(\lceil(\theta+\epsilon_2/2)N\rceil\right) \qquad (4)$$

Let $C$ be the event that a classifier with accuracy equal to $\alpha$ has empirical accuracy at least $\alpha + \epsilon_1$. As a function of $M$, the number of instances drawn of a particular category, we see that this is related to the binomial distribution:

$$P(C) = \sum_{i=\lceil(\alpha+\epsilon_1)M\rceil}^{M} \binom{M}{i} \alpha^i (1-\alpha)^{M-i} = c_{\alpha,M}\left(\lceil(\alpha+\epsilon_1)M\rceil\right) \qquad (5)$$

where we use the definition of the function in (1) to simplify the expression. We can now use this expression to write an expression for the probability of event $B$. Recall that event $B$ includes the event that at least $\lceil \epsilon_2 N/2 \rceil$ non-classifiable categories have empirical accuracy at least $\alpha + \epsilon_1$.

We note that the probability of the event that at least $X$ of a set of categories with accuracy less than $\alpha$ have empirical accuracy over $\alpha + \epsilon_1$ is maximized when all the categories have accuracy $\alpha$. We can prove this using a simple exchange argument. We let $Q$ be the event that the number of categories with empirical accuracy above $\alpha + \epsilon_1$ exceeds $X$, for some arbitrary $X$. Suppose the accuracies were written $p_1, \ldots, p_j$. Then consider a particular category with accuracy $p_i < \alpha$. We let the event $i$ denote the event that this category is classified as accurate. We have:

$$P(Q) = P(Q \mid i) p_i + P(Q \mid \bar{i})(1 - p_i)$$

We see this expression increases with increasing $p_i$ because $P(Q \mid \bar{i}) \leq P(Q \mid i)$: the event that we have at least $X$ categories classified as accurate conditioned on $i$ being inaccurate is a subset of the event that we have at least $X$ categories classified as accurate given $i$ being accurate. Thus this probability can be replaced with the value $\alpha$, and the probability of $Q$ only increases.
Thus:

$$P(B) < P\left(\text{at least } \lceil \epsilon_2 N/2 \rceil \text{ of the at least } N - \lfloor(\theta+\epsilon_2/2)N\rfloor \text{ non-classifiable categories are classified as accurate}\right)$$

The probability that a particular non-classifiable category gets classified as accurate is bounded as in (5). Thus:

$$P(B) < \sum_{i=\lceil \epsilon_2 N/2 \rceil}^{N - \lfloor(\theta+\epsilon_2/2)N\rfloor} \binom{N - \lfloor(\theta+\epsilon_2/2)N\rfloor}{i} \left(c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil)\right)^{i} \left(1 - c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil)\right)^{N - \lfloor(\theta+\epsilon_2/2)N\rfloor - i} = c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\left(\lceil \epsilon_2 N/2 \rceil\right) \qquad (6)$$

We can now conclude by combining (4) and (6) that

$$P(E) < c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\left(\lceil \epsilon_2 N/2 \rceil\right) + c_{\theta,N}\left(\lceil(\theta+\epsilon_2/2)N\rceil\right)$$

and thus

$$\text{confidence} \geq \min_{0<\theta<1} \; 1 - c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\left(\lceil \epsilon_2 N/2 \rceil\right) - c_{\theta,N}\left(\lceil(\theta+\epsilon_2/2)N\rceil\right)$$

We estimate this value over all valid $\theta$ by sweeping over the value of $\theta$ and choosing the minimum. Computationally this is not exactly the minimum, but we increase the fineness of our sweep and show that the estimated confidence changes only minimally.

Experiments

In this section we detail our experimental results. We test the naive repetition technique and compare it to the parity partition coding technique. Consistent with the theoretical predictions, the parity partition technique consistently outperforms the naive repetition technique when the same number and size of models are used.

Models are trained to optimize the bit-accuracy of their respective codes (though we evaluate them using the f1 score of the decoded predictions), using binary cross-entropy as a surrogate for continuous optimization. For the parity models, we train parity classifiers for all two-bit parity functions and use the code induced by appending these classifiers together. For all experiments, we used the Adam optimizer [21] with a learning rate of 0.001 and a batch size of 64.

For our multi-MNIST classification model, the architecture is composed of two feed-forward modules. First, we have six stacked ResNet blocks which act as feature extractors; we then apply a 1x1 convolution to reduce the number of channels, followed by a global average pooling layer. We then use BatchNorm before applying the quadratic transformation. The final layer is a linear layer with sigmoid activation.
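A rough PyTorch sketch of the multi-MNIST model just described, re-using the QuadraticFeatureHead sketch from the Quadratic Feature Transformation subsection; the stem, channel widths, and feature dimension are assumptions, and the residual block shown is a generic one rather than necessarily the authors' exact block.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class MultiMNISTParityModel(nn.Module):
    def __init__(self, in_ch=1, ch=64, feat_dim=32, n_outputs=45):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(6)])   # six stacked ResNet blocks
        self.reduce = nn.Conv2d(ch, feat_dim, 1)                         # 1x1 conv to reduce channels
        self.pool = nn.AdaptiveAvgPool2d(1)                              # global average pooling
        self.bn = nn.BatchNorm1d(feat_dim)                               # BatchNorm before the quadratic transform
        self.head = QuadraticFeatureHead(feat_dim, n_outputs)            # quadratic transform + linear + sigmoid

    def forward(self, x):
        z = self.pool(self.reduce(self.blocks(self.stem(x)))).flatten(1)
        return self.head(self.bn(z))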
In Table 1 we show an ablation study where we compared the baseline identity technique (which is just a K-output multi-target classifier) with the repetition technique (an ensemble of primitive models), as well as the parity technique. We test the weight-transfer technique as well as the targeted bagging technique.

Table 1: f1 scores (on the left) and average Hamming distance (on the right), along with standard deviations, for different codes on multi-MNIST, averaged over 10 trials. The parity code incorporates 45 parity checks into the message string, corresponding to the XOR between all pairs of attributes. The repetition code uses 5 repetitions (to match the number of models in the parity code). The highest f1 score and lowest Hamming distance are emphasized in bold.

To demonstrate the effect of the gain from longer code sizes, we perform a successive addition study, which we present in Figure 4. In this experiment, we train multiple primitive attribute models, as well as multiple parity check models using quadratic feature transformation. We then evaluate the f1 score of these models when we aggregate them. We observe that as the length of the associated codes increases, the parity technique outperforms the repetition technique.

Figure 4: Effect of increasing the number of classifiers for the multi-MNIST classifier. For the repetition code, we train 6 separate models, each outputting the 10 primitive attributes.

For our next set of experiments, we trained a multi-attribute facial feature recognizer on celebA with the following attribute template: {Wearing necktie, Male, Gray hair, Chubby, Wearing hat, Blond hair, Bald, Heavy makeup, No beard, Eyeglasses}. As this is a more computationally intensive task, we used the study presented in Table 1 to inform our choice of techniques for celebA. Thus, for celebA, we use quadratic feature transformation (which trained faster on multiMNIST), split the dataset using targeted bagging, and do not use weight transfer. For our celebA classification model, we use the same strategy as for multiMNIST, but with a larger ResNet encoder as found in [22], since it achieves state-of-the-art results. The primitive attributes in the celebA dataset are noticeably less balanced than in the multiMNIST case (and thus, so are the derived attributes), which we combat by adding frequency weights to the binary cross entropy. Unlike in the multiMNIST case, where targets were chunked to save on computational resources, for celebA we train separate models for each attribute (primitive or derived).

Correct classification occurs only when all of these attributes are classified correctly (which explains the relatively low classification accuracy). Nonetheless, the parity partition technique outperforms the baseline identity and repetition techniques. We present these results in Figure 5 and Table 2. We also compare the bit-level accuracies of the models for both multiMNIST and celebA. For celebA, we see that our models have bit accuracies that exceed the state of the art [22] for several of the attributes; these results are shown in the appendix.

Figure 5: Effect of increasing the number of classifiers for the celebA classifier. Note that as we ensemble more models, the accuracy of the parity technique outperforms the baseline repetition technique.

To utilize our fraction-accurate estimation technique, we perform two studies, one each for the multi-MNIST dataset and the celebA dataset. For the multi-MNIST dataset, we sample from all categories to estimate the fraction of categories that are accurate above a threshold $\alpha$. We choose an accuracy deviation threshold $\epsilon_1$ and a fraction deviation threshold $\epsilon_2$, a number of categories sampled $N = 100$, and a number of samples from each category $M = 20$. We use Theorem 1 to compute the corresponding confidence, and the study gives us a lower bound on the fraction of categories that are accurate for our baseline model.
When we use parity partition coding with the same set of parameters, we compute the corresponding lower bound on the fraction of accurate categories, and we do the same for our repetition baseline.

For celebA, we face a different issue compared to multiMNIST, because we cannot necessarily produce instances from all categories. However, we select a fraction of categories for which our dataset has multiple instances and sample from this set. We set an accuracy threshold $\alpha$, $N = 100$, $M = 10$, and $\epsilon_1 = \epsilon_2$, and use Theorem 1 to compute the corresponding confidence. For this set of parameters, we estimate lower bounds on the number of accurate categories for our baseline, repetition, and parity models; the estimate for the parity model is less than that of the repetition model, but its empirical estimate is within the fraction deviation threshold of the repetition model.

Code        f1 score    Hamming distance
Identity    0.704       0.346

Table 2: The f1 score and average Hamming distance from the ground truth label for the celebA dataset classification challenge. We note that the parity model has a higher f1 score and lower Hamming distance than the baseline repetition and identity models.
Conclusion

We introduce parity partition coding, and argue using results from coding theory that this technique results in an O(K) savings in the number of binary classifiers required to reach high accuracy, where K is the number of attributes. We test the technique on multiMNIST, a synthetic multi-label dataset, and on a label classification challenge based on celebA, where we reach accuracy comparable to the state of the art. We introduce the notion of sharpness, and show how to bound this quantity by sampling over categories while also producing a confidence estimate for this bound.

References
[1] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113-141, September 2001.
[2] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res., 2(1):263-286, January 1995.
[3] M. Abbe and E. Wainwright. Information theory and machine learning (tutorial). In ISIT, June 2015.
[4] Thomas G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1-15, London, UK, 2000. Springer-Verlag.
[5] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conf. on Comp. Vision and Pattern Recogn., pages 951-958, June 2009.
[6] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR 2011, pages 3337-3344, June 2011.
[7] C. E. Shannon. A mathematical theory of communication. Bell Sys. Techn. J., 27(3):379-423 & 623-656, 1948.
[8] S. Ferdowsi and S. Voloshynovskiy. Content identification: Machine learning meets coding. In Proc. 35th WIC Symp. Info. Theory, May 2014.
[9] Chao Li, Zhiyong Feng, and Chao Xu. Error-correcting output codes for multi-label emotion classification. Multimedia Tools and Applications, 75, 05 2016.
[10] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12:993-1001, 1990.
[11] Francesco Masulli and Giorgio Valentini. Effectiveness of error correcting output codes in multiclass learning problems. In Multiple Classifier Systems, 2000.
[12] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-correcting codes. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pages 145-155, New York, NY, USA, 1999. ACM.
[13] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, November 1984.
[14] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147-160, April 1950.
[15] G. David Forney, Jr. Concatenated codes, December 1965.
[16] S. Hassani, K. Alishahi, and R. Urbanke. Finite-length scaling for polar codes. IEEE Trans. Info. Theory, 60(10):5875-5898, 2014.
[17] C. G. Blake. Energy Consumption of Error Control Coding Circuits. PhD thesis, University of Toronto, Toronto, June 2017.
[18] Maxwell Nye and Andrew Saxe. Are efficient deep representations learnable? ICLR 2018 Workshop Submission.
[19] F. Piazza, A. Uncini, and M. Zenobi. Artificial neural networks with adaptive polynomial activation function, 1992.
[20] Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[22] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. CoRR, abs/1810.04650, 2018.
Appendix: Bit Accuracies
Attribute          Sener et al. [22]   Baseline accuracy   Repetition-corrected accuracy   Parity-corrected accuracy
Wearing Necktie    0.965               0.965               0.970
Male
Chubby             0.955               0.952               0.957
Wearing Hat        0.989               0.990