Parity Partition Coding for Sharp Multi-Label Classification

Christopher G. Blake*, Giuseppe Castiglione*, Christopher Srinivasa, Marcus Brubaker

*Shared first authorship

August 27, 2019
Abstract
The problem of efficiently training and evaluating image classifiers that can distinguish between a large number of object categories is considered. A novel metric, sharpness, is proposed, defined as the fraction of object categories that are classified above a threshold accuracy. To estimate sharpness (along with a confidence value), a technique called fraction-accurate estimation is introduced, which samples categories and then samples instances from these categories. In addition, a technique called parity partition coding, a special type of error correcting output code, is introduced, increasing sharpness while reducing the multi-class problem to a multi-label one with exponentially fewer outputs. We demonstrate that this approach outperforms the baseline model for both MultiMNIST and CelebA, while requiring fewer parameters and exceeding state of the art accuracy on individual labels.
Introduction

Multi-label classification is the problem of, given an image or instance, accurately labelling its attributes. For example, given an image of a face, a multi-label classification task may require us to identify whether the face is male, whether the eyes are blue, whether the face is smiling, and so on. Since each of the attributes can vary independently, the number of possible classes to which an instance can belong can grow exponentially in the number of attributes. A naive and unrealistic approach to this problem is to create a classifier for each of the 2^K possible categories (where K is the number of attribute labels). In practice this is not done; the more sensible approach is to train a binary classifier for each binary attribute and to estimate the instance's category from the outputs of these classifiers (this is the multi-class to binary technique introduced in [1]).

There are three engineering challenges with this approach which we address in this paper. First, we want to make sure that the classifier is accurate. Second, we want the accurate classifiers to be compact and have a small training time. Finally, we want to be able to test and benchmark such classifiers, and we want to do so in a way that ensures that the classifiers are not biased. In this paper we introduce a technique called parity partition coding, a special type of error correcting output code [2]. We demonstrate that using this technique results in more accurate classifiers when ensembling a comparable number of models. The technique involves learning both primitive attributes (those with labels) and derived attributes (in our case, attributes that can be viewed as parity functions of primitive attributes). As a sub-technique, for training the derived attributes, we introduce quadratic feature transformation, which we show decreases training time for learning parity attributes. Finally, we introduce a technique called fraction-accurate estimation to estimate the fraction of categories that are accurate above a threshold by sampling over categories and then sampling instances of each of these categories.

We demonstrate our techniques on two problems: a synthetic dataset we call multiMNIST, and a facial attribute recognition challenge based on the celebA dataset. For the celebA problem, we show that our technique compares to state of the art accuracy on individual attributes. When accuracy is defined over instances so that all the attributes must be correct for the classifier to be correct, our technique outperforms our baseline ensemble of primitive attribute models.

Prior Work
The authors of [3] argue that there is a relationship between error control coding and machine learning, which inspires our work. Our work herein is in the category of ensemble methods in machine learning, and [4] provides a good overview. Our paper also involves learning attributes as in [5, 6].

The technique we consider here can be viewed as a special case of an error correcting output code invented by Dietterich et al. [2]. This is a technique used for multi-class classification problems, like digit recognition. The Dietterich technique involves selecting binary attributes that characterize target categories. Each category is then associated with a unique binary attribute string encoding the attributes associated with this category. A sample instance is then fed into each binary categorizer, producing an estimate of the binary attributes of that instance. The category is then chosen to be the one whose attribute string is closest in Hamming distance to the estimated attribute string. In our work, we consider the problem where the attributes are all labelled, and the object categories under consideration are those defined by the unique configuration of these attributes. This requires us to train both primitive attribute classifiers and derived attribute classifiers.

Shannon [7] introduced the concept of error control coding by proving that there exist codes with asymptotic rates above zero and error probability that approaches zero for channels with independent noise. In the channel coding problem, the fundamental cost is in channel uses. In the attribute classification problem, the analogous costs are training energy and the energy used for classification of a new instance using the already trained network. In the channel coding case, high rate codes imply high channel use efficiency. For parity partition coding, high rates imply energy efficient training and deployment.

The first paper to recognize that the output of a classifier is equivalent to the output of a noisy channel is [8]. In [9], error control output codes are considered for the multi-label classification problem. The authors of [10] consider the problem of ensembling neural networks and point out that these ensembles work if the models are accurate (better than random) and diverse (different models make errors on different inputs). This notion is weaker than the pure independence assumption which informs the theory in Section 4.

An analysis of the effectiveness of error correcting output codes is given in [11], and in [12] the authors consider a similar model to ours. We build upon this literature by performing experiments to show what we consider the main advantage of error correcting output codes: an asymptotic savings in the number of models needed in an ensemble to reach high accuracy.

Problem Definition

We will define our problem generally and then use a synthetic multiMNIST dataset to give a concrete example. Example instances of multiMNIST are given in Figure 1. We also consider the attribute-labelled celebA dataset, which includes images of celebrity faces with attributes like 'has bangs' and 'glasses.' We let the instance space be the set of all possible inputs into the classifier [13].
Definition 1
A binary attribute is a bipartition of the instance space. A primitive attribute is an attribute that is labelled. A derived attribute is an attribute induced by a function of the primitive attributes of an instance.
For example, in multiMNIST, '1' and '2' are primitive attributes. In celebA, "has bangs" is a primitive attribute. The attribute '1 XOR 2' is a derived attribute for the multiMNIST dataset: it is the set of instances containing a '1' but not a '2', or a '2' but not a '1.'
Definition 2 An attribute template is an ordered tuple of attributes. The primitive attribute string of an instance is a string of symbols representing the state of the attributes for that instance.

The idea here is that an image encodes a "message" within the state of its attributes. The goal of classification is to figure out the primitive attribute string of the instance. For example, in the multiMNIST problem, the goal is to produce a length-10 binary string where each element of the string corresponds to the presence (1) or absence (0) of one of the MNIST digits that may be present in the image. See the labels at the top of each image in Figure 1 for examples of primitive attribute strings. We let the length of each element of the attribute space (or, equivalently, the length of the attribute template) be denoted by the symbol $K$. We denote a primitive attribute string of length $K$ as $X_1, \ldots, X_K$.

Figure 1: Samples from the multiMNIST dataset, with corresponding attribute strings labelled at the top of each image. This dataset is produced by randomly generating positions and numbers, and then drawing a random MNIST image for each number and placing it at a random position.

Consider a sequence of $N$ functions, each mapping a primitive attribute string to $\{0, 1\}$. Let us denote these functions $f_1, f_2, \ldots, f_N$.

Definition 3 A length-$N$ binary linear error correcting output code is a sequence of $N$ such functions where each function is a mod 2 sum of a subset of the elements of the primitive attribute string. Note that such a function may simply output a primitive attribute (i.e., it is the sum of a single attribute). We shall also consider derived attributes where $f_i(X_1, \ldots, X_K)$ is a mod 2 sum of a subset of the $K$ elements of the primitive attribute string. The functions $f_1, f_2, \ldots, f_N$ are called the encoding functions.

Without loss of generality, we denote the outputs of the $N$ functions as $Y_1, \ldots, Y_N$. Note that each of these functions induces a partition of the instance space, one partition for each function $f_i$, in the natural way. Precisely, a function $f_i$ mapping an attribute string to 0 or 1 bipartitions the instance space into two sets, $A_0$ and $A_1$, where $A_i$ is the set of instances whose attribute string maps to the symbol $i$. Thus, for each of these partitions we can train a binary classifier, producing $N$ binary classifiers. This is the key idea of parity partition coding.

After feeding a sample instance into the $N$ classifiers, a sample output string is produced, which is not necessarily the ground truth output string, i.e., the string produced when feeding the primitive attribute string into the encoding functions. The task is to find the attribute string whose ground truth output string is "closest" to the sample output string, where closeness may be measured in Hamming distance. Estimating the closest ground truth output string is called decoding.

The technique of training primitive and parity attribute classifiers and then decoding the outputs produced on an instance is what we call parity partition coding. Note that this is a special case of error correcting output codes [2], except that in this case we distinguish primitive and parity attributes.
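As an illustration, the following is a minimal Python sketch (not the authors' implementation) of encoding and minimum-Hamming-distance decoding for a parity partition code in which the derived attributes are all pairwise XORs of the primitive attributes; the choice of subsets, the small K, and the brute-force decoder are assumptions made for readability.

from itertools import combinations, product
import numpy as np

K = 4                                             # number of primitive attributes
subsets = [(i,) for i in range(K)] + list(combinations(range(K), 2))  # primitive bits + pairwise parities

def encode(x):
    """Map a primitive attribute string x (length K, entries 0/1) to its ground-truth output string."""
    return np.array([sum(x[i] for i in s) % 2 for s in subsets])

def decode(y_hat):
    """Return the attribute string whose ground-truth output string is closest in Hamming distance."""
    candidates = [np.array(x) for x in product([0, 1], repeat=K)]
    return min(candidates, key=lambda x: np.sum(encode(x) != y_hat))

x = np.array([1, 0, 1, 1])                        # ground-truth primitive attribute string
y = encode(x)                                     # what perfect classifiers would output
y_noisy = y.copy(); y_noisy[2] ^= 1               # one classifier makes an error
print(decode(y_noisy))                            # the single error is corrected: [1 0 1 1]

Because flipping any single attribute changes its identity bit and all K-1 pairwise parities involving it, the nearest-codeword rule in this small example corrects any single classifier error.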
We shall use the admittedly overly simple assumption that each binary classifier is a binary asymmetric channel that is independently distributed, where the input is a label for an attribute, and the output is a 0 or 1 indicating the classifier's estimate of the state of that attribute for a random instance of that attribute. An illustration of this concept is given in Figure 3. Moreover, we assume we will be testing our classifier on an independent attribute distribution. That is, we shall assume that each attribute occurs independently and randomly with probability 1/2.

We consider error control coding for the binary asymmetric channel. We assume $K$ bits encoded to length $N_s = K/R$, where $R$ is the rate of the code. We define error probability as the probability that our classifier does not produce the correct primitive attribute string for an instance. We consider using a Shannon code [7] that has error probability for the channel $e^{-c_1 N_s}$ for some $c_1 > 0$. (Such codes exist, see [15, 16]. Moreover, they have close to optimal encoding and decoding complexity [17].)

Figure 2: A diagram of a binary classifier and its equivalence to a binary asymmetric channel. The random image generator takes as input a binary variable indicating the desired state of a given attribute, and produces an image that has that attribute. The purpose of the classifier is to classify images created by this random image generator. Sometimes the random image generator will be some randomized function, but it can also be the natural images that arise in an application. We see at the bottom of the diagram that once the random distribution and the classifier are fixed, the labelled black box is exactly equivalent to a binary asymmetric channel, where $1-p$ is the probability that an image with attribute label 0 is classified as a 1, and $1-q$ is the probability that an image with attribute label 1 is classified as a 0. When there are multiple attributes (shown in Figure 3), this can be viewed as multiple parallel binary asymmetric channels.
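To make the channel abstraction of Figure 2 concrete, the following is a minimal simulation sketch (an illustration under the independence assumption above, not the authors' code); the values of p and q are hypothetical.

import numpy as np

rng = np.random.default_rng(0)
p, q = 0.95, 0.90                          # hypothetical: P(output 0 | label 0) = p, P(output 1 | label 1) = q

def classifier_as_channel(labels):
    """Simulate a fixed classifier on images drawn with the given attribute labels (0/1)."""
    flip0 = rng.random(labels.shape) > p   # label-0 images misclassified with probability 1 - p
    flip1 = rng.random(labels.shape) > q   # label-1 images misclassified with probability 1 - q
    return np.where(labels == 0, flip0.astype(int), 1 - flip1.astype(int))

labels = rng.integers(0, 2, size=100000)   # independent attribute distribution, probability 1/2
outputs = classifier_as_channel(labels)
print((outputs[labels == 0] == 1).mean())  # empirical estimate of 1 - p
print((outputs[labels == 1] == 0).mean())  # empirical estimate of 1 - q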
Figure 3: Diagram of an attribute classifier for classifying the digits 0, 1, 2, and 3 in the multiMNIST dataset using a parity partition coding scheme based on a (7, 4)-Hamming code [14]. The random image generator takes as input a binary string and produces an image with attributes encoded by this binary string. The purpose of the binary classifier and decoding layers is to estimate what the original attribute string was. In the binary classifier layer, the top classifiers are binary classifiers for {0, 1, 2, 3}, and the three bottom classifiers are parity partition classifiers: that is, they are trained to output 1 when the mod 2 sum of their check bits is 1. The ground truth symbol for each of the top attributes is denoted $x_1, x_2, x_3, x_4$ as labelled in the diagram. So, for example, one of the bottom parity classifiers is trained to recognize whether, for a given image, the mod 2 sum of its three associated attribute bits (e.g., $x_1 + x_2 + x_3$) equals 1, where addition is performed modulo 2.

As a comparison, we consider a repetition code that has length $N_r = K n_r$, where $n_r$ is the number of repetitions. Such a code is equivalent to training $n_r$ separate $K$-output primitive attribute classifiers. A simple probability analysis shows that such a code has error probability that scales as $e^{-c_2 n_r}$ for some $c_2 > 0$. Thus, to get equivalent error probability for the repetition code as for the Shannon code, it must be that $e^{-c_2 n_r} = e^{-c_1 N_s}$, and thus $n_r = c' N_s$ for another $c' > 0$. This means that the length of the repetition code must be $N_r = c' K N_s$, that is, a factor of $K$ greater than the equivalent Shannon code. This is the primary advantage of the parity partition coding technique: an asymptotic savings in the number of binary classifiers needed.

We find that training binary classifiers for the derived attributes (those corresponding to parity functions) is more difficult than for primitive attributes. This has been observed in the literature [18] when learning a simple parity function (and not a parity of attributes). This motivates the following techniques, which we test on multiMNIST: quadratic feature transformation, targeted bagging, and pre-trained weight initialization. We describe each of these techniques below and explain how our theory informs their use.
Quadratic Feature Transformation

To employ the parity partition coding technique, we need to learn a parity function of primitive attributes. We also want models that learn different parity attributes to be accurate and diverse. Thus, inspired by the technique of quadratic transformation of the inputs to a parity function learner [19], in the output layer of the neural network model we apply a quadratic transformation of the features, via an outer product, before they are fed into the final linear layer. Due to symmetry, we retain only the off-diagonal, upper-triangular portion of the outer-product matrix. This transformation makes XOR linearly separable, and thus easier to learn. In our experiments we found that, all other things being equal, quadratic feature transformation substantially cut down the number of training epochs needed to reach the same accuracy. A minimal sketch of this transformation is given after the next subsection.

Targeted Bagging

Adapted from [20], targeted bagging trains different targets on different splits of the dataset. This leads to decorrelation of the outputs. Roughly speaking, this corresponds to making the outputs of each of the models more independent, which according to our theory should increase the effectiveness of the parity partition coding technique, since it makes the errors more independent. In Table 1 we can see the effectiveness of this technique.
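The following is a minimal PyTorch sketch of the quadratic feature transformation head described above (the feature dimension, the 45-output example, and the class name are illustrative assumptions rather than the paper's exact architecture).

import torch
import torch.nn as nn

class QuadraticFeatureHead(nn.Module):
    """Outer product of features, keep the off-diagonal upper-triangular part,
    then a linear layer with sigmoid, as in the quadratic feature transformation."""
    def __init__(self, feat_dim, n_outputs):
        super().__init__()
        idx = torch.triu_indices(feat_dim, feat_dim, offset=1)   # off-diagonal, upper-triangular pairs
        self.register_buffer("idx", idx)
        self.linear = nn.Linear(idx.shape[1], n_outputs)

    def forward(self, z):                                         # z: (batch, feat_dim) pooled features
        outer = z.unsqueeze(2) * z.unsqueeze(1)                   # (batch, d, d) outer products
        quad = outer[:, self.idx[0], self.idx[1]]                 # (batch, d*(d-1)/2) pairwise products
        return torch.sigmoid(self.linear(quad))

head = QuadraticFeatureHead(feat_dim=64, n_outputs=45)            # e.g., one output per pairwise parity
y = head(torch.randn(8, 64))                                      # y: (8, 45) parity probabilities

The retained pairwise products are exactly the feature interactions that make an XOR-like target linearly separable for the final layer.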
Pre-Trained Weight Initialization

We also study a technique called pre-trained weight initialization. This involves re-using feature extractors trained for a separate classification task and holding them fixed throughout training on a new task. This introduces some correlation amongst the outputs; however, it dramatically cuts down on training time (another engineering consideration).
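A minimal PyTorch sketch of this kind of re-use (the extractor and head here are stand-ins, not the paper's models): a previously trained feature extractor is frozen and only a new head is trained.

import torch.nn as nn

pretrained_extractor = nn.Sequential(nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
                                     nn.AdaptiveAvgPool2d(1), nn.Flatten())   # stand-in for a trained extractor
for param in pretrained_extractor.parameters():
    param.requires_grad = False                      # hold the extractor fixed in the new task

new_head = nn.Linear(16, 1)                          # only this part is trained for the new attribute
model = nn.Sequential(pretrained_extractor, new_head)
trainable = [p for p in model.parameters() if p.requires_grad]   # pass only these to the optimizer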
Fraction-Accurate Estimation

As we get to a huge number of object categories (as in attribute classification), it is hard to estimate the number of categories that can be accurately categorized. Note that this is different from accuracy on a typical test set: in real datasets, especially multi-label datasets, the distribution of categories of the instances heavily favours typical categories. Thus, standard accuracy scores do not capture how "balanced" the classifier is.

However, it is expensive to produce novel instances, especially from rare categories. Nevertheless, we do not need to test every category to get an estimate of the number of categories that will be accurate. All we need is a distribution from which to draw instances; once we have this, we can sample instances from this distribution and use them to produce a bound on the fraction of categories that are accurate. We can also use statistical tools to produce a confidence: the probability that our estimating technique will produce a correct bound (that is, a bound that includes the actual fraction of accurate categories).

Let $X$ be an instance space. We consider a set of possible categories into which elements of the instance space can be classified, which we denote $C$. We consider a distribution of inputs $P(x \mid c)$, where $c \in C$ and $x$ is in the set of all instances in $X$ from category $c$. When an instance is put into a classifier, the output is a category $\hat{c} \in C$, which is the classifier's estimate of the instance's category $c$. We let $X_c$ (where $c \in C$) represent the set of instances that are in category $c$. Of all the $|C|$ possible classes (equal to $2^K$ in the case of $K$ binary attributes), we let the vector $(p_1, \ldots, p_{|C|})$ be the vector of true-positive accuracies for each of the $|C|$ categories. Precisely, the true-positive accuracy is defined as $P(\hat{c} = c \mid c)$, the probability that the classifier's estimate is equal to class $c$, conditioned on $c$ being the class.
Definition 4 We say that a particular category $c$ is $\alpha$-accurate if the classifier has a true-positive accuracy for category $c$ greater than or equal to $\alpha$. Note that this is not a random variable; it is a fixed property of the category and the classifier.
Definition 5 We call the fraction of categories that are $\alpha$-accurate the accurate fraction and denote it $\theta$.

Our goal is to estimate $\theta$. We do so by defining a fraction-accurate estimator, with parameters $\epsilon_1$, $\epsilon_2$, $\alpha$, $M$ and $N$. In such an estimator, we first randomly draw $N$ categories (each drawn from the set $C$ of categories, with replacement), and for each of these $N$ categories produce $M$ sample instances. We feed these instances into the classifier and compare the classifier's output to the true category, producing a vector of empirical accuracies, which we denote $(\hat{p}_1, \ldots, \hat{p}_N)$. We declare each category whose empirical accuracy is at least $\alpha + \epsilon_1$ to be classifiable. We compute the fraction of these $N$ categories that are classifiable, which we call $\hat{q}$, and then claim that at least a $(\hat{q} - \epsilon_2)$ fraction of the categories are classifiable. We call the value $\epsilon_1$ the accuracy deviation threshold and the value $\epsilon_2$ the fraction deviation threshold. We call such an estimator an $(\alpha, N, M, \epsilon_1, \epsilon_2)$-fraction accurate estimator.

Note that the estimator is also associated with a random experiment, which in this case corresponds to randomly drawing $N$ categories and then drawing $M$ instances of each category. Thus it assumes there is some distribution from which instances can be drawn.

We define

$$c_{p,n}(x) = \sum_{i=x}^{n} \binom{n}{i} p^i (1-p)^{n-i} \qquad (1)$$

which is the weight in the tail of a binomial distribution with $n$ coin flips and coin bias $p$. In other words, it is the probability that a coin with probability of heads $p$ comes up with $x$ or more heads after $n$ flips.
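A minimal sketch (assuming NumPy) of the fraction-accurate estimation procedure; sample_instance and classify are placeholders for the instance distribution and the trained classifier, not functions defined in the paper.

import numpy as np

def fraction_accurate_estimate(sample_instance, classify, categories, alpha, N, M, eps1, eps2, seed=0):
    """Return the claimed lower bound (q_hat - eps2) on the accurate fraction theta.
    sample_instance(c) -> an instance drawn from category c; classify(x) -> predicted category."""
    rng = np.random.default_rng(seed)
    drawn = rng.choice(categories, size=N, replace=True)         # sample N categories with replacement
    p_hat = np.array([np.mean([classify(sample_instance(c)) == c for _ in range(M)]) for c in drawn])
    q_hat = np.mean(p_hat >= alpha + eps1)                       # fraction declared classifiable
    return q_hat - eps2                                          # claimed lower bound on theta

The confidence attached to the returned bound is the subject of Theorem 1 below.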
Theorem 1 The confidence of an $(\alpha, N, M, \epsilon_1, \epsilon_2)$-fraction-accurate estimator is bounded by

$$\text{confidence} \;\geq\; \min_{0 < \theta < 1} \; 1 \;-\; c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\!\left(\lceil \epsilon_2 N / 2 \rceil\right) \;-\; c_{\theta,N}\!\left(\lceil(\theta+\epsilon_2/2)N\rceil\right)$$

where we recall the definition in (1), the weight of the tail of a binomial distribution.
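A small numerical sketch (assuming NumPy and SciPy) of evaluating this bound by sweeping over $\theta$, as described at the end of the proof; the parameter values passed at the bottom are illustrative only, not the paper's settings.

import numpy as np
from scipy.stats import binom

def binom_tail(p, n, x):
    """c_{p,n}(x): probability of x or more successes in n trials with success probability p."""
    return binom.sf(x - 1, n, p)

def confidence_bound(alpha, N, M, eps1, eps2, grid=2000):
    p_c = binom_tail(alpha, M, int(np.ceil((alpha + eps1) * M)))  # chance an exactly-alpha category looks accurate
    best = 1.0
    for theta in np.linspace(0.001, 0.999, grid):                 # sweep over theta, keep the minimum
        p_a = binom_tail(theta, N, int(np.ceil((theta + eps2 / 2) * N)))
        n_bad = max(N - int(np.floor((theta + eps2 / 2) * N)), 0)
        p_b = binom_tail(p_c, n_bad, int(np.ceil(eps2 * N / 2)))
        best = min(best, 1.0 - p_a - p_b)
    return best

print(confidence_bound(alpha=0.9, N=100, M=20, eps1=0.05, eps2=0.05))   # illustrative parameters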
Proof: Let the event $E(\theta, N, M, \epsilon_1, \epsilon_2)$ be the event that a random experiment followed by a fraction-accurate estimator produces an estimate that does not include $\theta$ (that is, the event of an error). We omit the arguments and denote this event $E$. Our goal shall be to upper bound this probability, and we do so using straightforward probability arguments.

To bound this probability, we define the following events:

$A$: the event that we draw greater than or equal to $(\theta + \epsilon_2/2)N$ classifiable categories. (2)

$B$: the event that we draw less than $(\theta + \epsilon_2/2)N$ classifiable categories but more than $(\theta + \epsilon_2)N$ categories are classified as accurate; that is, $B = \bar{A} \cap E$. (3)

In the definition of $B$ we observe that the event that "more than $(\theta + \epsilon_2)N$ categories are classified as accurate" will result in a lower bound estimate of $\theta$ at least slightly greater than $\theta$. In other words, this event is $\bar{A} \cap E$. Observe that by using elementary set theory operations:

$$E = (A \cap E) \cup (\bar{A} \cap E) \subseteq A \cup (\bar{A} \cap E) = A \cup B$$

Thus we can conclude that $P(E) \leq P(A \cup B) \leq P(A) + P(B)$. We now study these two events $A$ and $B$ separately.

We first bound event $A$, the event that we draw greater than or equal to $(\theta + \epsilon_2/2)N$ classifiable categories when only a $\theta$ fraction of categories are classifiable. The probability can be bounded by the tail of a binomial distribution:

$$P(A) \leq \sum_{i=\lceil(\theta+\epsilon_2/2)N\rceil}^{N} \binom{N}{i} \theta^i (1-\theta)^{N-i} = c_{\theta,N}\left(\lceil(\theta+\epsilon_2/2)N\rceil\right) \qquad (4)$$

Let $C$ be the event that a classifier with accuracy equal to $\alpha$ has empirical accuracy at least $\alpha + \epsilon_1$. As a function of $M$, the number of instances drawn of a particular category, we see that this is related to the binomial distribution:

$$P(C) = \sum_{i=\lceil(\alpha+\epsilon_1)M\rceil}^{M} \binom{M}{i} \alpha^i (1-\alpha)^{M-i} = c_{\alpha,M}\left(\lceil(\alpha+\epsilon_1)M\rceil\right) \qquad (5)$$

where we use the definition of the function in (1) to simplify the expression. We can now use this expression to write an expression for the probability of event $B$. Recall that event $B$ includes the event that at least $\lceil \epsilon_2 N/2 \rceil$ non-classifiable categories have empirical accuracy at least $\alpha + \epsilon_1$.

We note that the probability of the event that at least $X$ of a set of categories with accuracy less than $\alpha$ have empirical accuracy over $\alpha + \epsilon_1$ is maximized when all the categories have accuracy $\alpha$. We can prove this using a simple exchange argument. We let $Q$ be the event that the number of categories with empirical accuracy above $\alpha + \epsilon_1$ exceeds $X$, for some arbitrary $X$. Suppose the accuracies were written $p_1, \ldots, p_j$. Then consider a particular category with accuracy $p_i < \alpha$. We let the event $i$ denote the event that this category is classified as accurate. We have:

$$P(Q) = P(Q \mid i) p_i + P(Q \mid \bar{i})(1 - p_i)$$

We see this expression increases with increasing $p_i$ because $P(Q \mid \bar{i}) \leq P(Q \mid i)$: the event that we have at least $X$ categories classified as accurate conditioned on $i$ being inaccurate is a subset of the event that we have at least $X$ categories classified as accurate given $i$ being accurate. Thus this probability can be replaced with the value $\alpha$, and the probability of $Q$ only increases.
Thus:

$$P(B) < P\left(\text{at least } \lceil \epsilon_2 N/2 \rceil \text{ of the at least } N - \lfloor(\theta+\epsilon_2/2)N\rfloor \text{ non-classifiable categories are classified as accurate}\right)$$

The probability that a particular non-classifiable category gets classified as accurate is bounded as in (5). Thus:

$$P(B) < \sum_{i=\lceil \epsilon_2 N/2 \rceil}^{N - \lfloor(\theta+\epsilon_2/2)N\rfloor} \binom{N - \lfloor(\theta+\epsilon_2/2)N\rfloor}{i} \left(c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil)\right)^{i} \left(1 - c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil)\right)^{N - \lfloor(\theta+\epsilon_2/2)N\rfloor - i} = c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\left(\lceil \epsilon_2 N/2 \rceil\right) \qquad (6)$$

We can now conclude by combining (4) and (6) that

$$P(E) < c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\left(\lceil \epsilon_2 N/2 \rceil\right) + c_{\theta,N}\left(\lceil(\theta+\epsilon_2/2)N\rceil\right)$$

and thus

$$\text{confidence} \geq \min_{0<\theta<1} \; 1 - c_{\,c_{\alpha,M}(\lceil(\alpha+\epsilon_1)M\rceil),\; N-\lfloor(\theta+\epsilon_2/2)N\rfloor}\left(\lceil \epsilon_2 N/2 \rceil\right) - c_{\theta,N}\left(\lceil(\theta+\epsilon_2/2)N\rceil\right)$$

We estimate this value over all valid $\theta$ by sweeping over the value of $\theta$ and choosing the minimum. Computationally this is not exactly the minimum, but we increase the fineness of our sweep and show that the estimated confidence changes only minimally.

Experiments

In this section we detail our experimental results. We test the naive repetition technique and compare it to the parity partition coding technique. Consistent with the theoretical predictions, the parity partition technique consistently outperforms the naive repetition technique when the same number and size of models are used.

Models are trained to optimize the bit-accuracy of their respective codes (though we evaluate them using the f1 score of the decoded predictions), using binary cross-entropy as a surrogate for continuous optimization. For the parity models, we train parity classifiers for all two-bit parity functions and use the code induced by appending these classifiers together. For all experiments, we used the Adam optimizer [21] with a learning rate of 0.001 and a batch size of 64.

For our multi-MNIST classification model, the architecture is composed of two feed-forward modules. First, we have six stacked ResNet blocks which act as feature extractors; we then apply a 1x1 convolution to reduce the number of channels, followed by a global average pooling layer. We then use BatchNorm before applying the quadratic transformation. The final layer is a linear layer with sigmoid activation.
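A rough PyTorch sketch of the multi-MNIST model just described, re-using the QuadraticFeatureHead sketch from the Quadratic Feature Transformation subsection; the stem, channel widths, and feature dimension are assumptions, and the residual block shown is a generic one rather than necessarily the authors' exact block.

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch), nn.ReLU(),
                                  nn.Conv2d(ch, ch, 3, padding=1), nn.BatchNorm2d(ch))
    def forward(self, x):
        return torch.relu(x + self.body(x))

class MultiMNISTParityModel(nn.Module):
    def __init__(self, in_ch=1, ch=64, feat_dim=32, n_outputs=45):
        super().__init__()
        self.stem = nn.Conv2d(in_ch, ch, 3, padding=1)
        self.blocks = nn.Sequential(*[ResBlock(ch) for _ in range(6)])   # six stacked ResNet blocks
        self.reduce = nn.Conv2d(ch, feat_dim, 1)                         # 1x1 conv to reduce channels
        self.pool = nn.AdaptiveAvgPool2d(1)                              # global average pooling
        self.bn = nn.BatchNorm1d(feat_dim)                               # BatchNorm before the quadratic transform
        self.head = QuadraticFeatureHead(feat_dim, n_outputs)            # quadratic transform + linear + sigmoid

    def forward(self, x):
        z = self.pool(self.reduce(self.blocks(self.stem(x)))).flatten(1)
        return self.head(self.bn(z))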
In Table 1 we show an ablation study where we compared the baseline identity technique (which is just a K-output multi-target classifier) with the repetition technique (an ensemble of primitive models), as well as the parity technique. We test the weight-transfer technique as well as the targeted bagging technique.

Table 1: f1 scores (on the left) and average Hamming distance (on the right), along with standard deviations, for different codes on multi-MNIST, averaged over 10 trials. The parity code incorporates 45 parity checks into the message string, corresponding to the XOR between all pairs of attributes. The repetition code uses 5 repetitions (to match the number of models in the parity code). The highest f1 score and lowest Hamming distance are emphasized in bold.

To demonstrate the effect of the gain from longer code sizes, we perform a successive addition study, which we present in Figure 4. In this experiment, we train multiple primitive attribute models, as well as multiple parity check models using quadratic feature transformation. We then evaluate the f1 score of these models when we aggregate them. We observe that as the length of the associated codes increases, the parity technique outperforms the repetition technique.

Figure 4: Effect of increasing the number of classifiers for the multi-MNIST classifier. For the repetition code, we train 6 separate models, each outputting the 10 primitive attributes.

For our next set of experiments, we trained a multi-attribute facial feature recognizer on celebA with the following attribute template: {Wearing necktie, Male, Gray hair, Chubby, Wearing hat, Blond hair, Bald, Heavy makeup, No beard, Eyeglasses}. As this is a more computationally intensive task, we used the study presented in Table 1 to inform our choice of techniques for celebA. Thus, for celebA, we use quadratic feature transformation (which trained faster on multiMNIST), split the dataset using targeted bagging, and do not use weight transfer. For our celebA classification model, we use the same strategy as for multiMNIST, but with a larger ResNet encoder as found in [22], since it achieves state-of-the-art results. The primitive attributes in the celebA dataset are noticeably less balanced than in the multiMNIST case (and thus, so are the derived attributes), which we combat by adding frequency weights to the binary cross entropy. Unlike in the multiMNIST case, where targets were chunked to save on computational resources, for celebA we train separate models for each attribute (primitive or derived).

Correct classification occurs only when all of these attributes are classified correctly (which explains the relatively low classification accuracy). Nonetheless, the parity partition technique outperforms the baseline identity and repetition techniques. We present these results in Figure 5 and Table 2. We also compare the bit-level accuracies of the models for both multiMNIST and celebA. For celebA, we see that our models have bit accuracies that exceed the state of the art [22] for several of the attributes; these results are shown in the appendix.

Figure 5: Effect of increasing the number of classifiers for the celebA classifier. Note that as we ensemble more models, the accuracy of the parity technique outperforms the baseline repetition technique.

To utilize our fraction-accurate estimation technique, we perform two studies, one each for the multi-MNIST dataset and the celebA dataset. For the multi-MNIST dataset, we sample from all categories to estimate the fraction of categories that are accurate above a threshold $\alpha$. We choose an accuracy deviation threshold $\epsilon_1$ and a fraction deviation threshold $\epsilon_2$, a number of categories sampled $N = 100$, and a number of samples from each category $M = 20$. We use Theorem 1 to compute the corresponding confidence, and the study gives us a lower bound on the fraction of categories that are accurate for our baseline model.
When we use parity partition coding with the same set of parameters, we compute the corresponding lower bound on the fraction of accurate categories, and we do the same for our repetition baseline.

For celebA, we face a different issue compared to multiMNIST, because we cannot necessarily produce instances from all categories. However, we select a fraction of categories for which our dataset has multiple instances and sample from this set. We set an accuracy threshold $\alpha$, $N = 100$, $M = 10$, and $\epsilon_1 = \epsilon_2$, and use Theorem 1 to compute the corresponding confidence. For this set of parameters, we estimate lower bounds on the number of accurate categories for our baseline, repetition, and parity models; the estimate for the parity model is less than that of the repetition model, but its empirical estimate is within the fraction deviation threshold of the repetition model.

Code        f1 score    Hamming distance
Identity    0.704       0.346

Table 2: The f1 score and average Hamming distance from the ground truth label for the celebA dataset classification challenge. We note that the parity model has a higher f1 score and lower Hamming distance than the baseline repetition and identity models.
Conclusion

We introduce parity partition coding, and argue using results from coding theory that this technique results in an O(K) savings in the number of binary classifiers required to reach high accuracy, where K is the number of attributes. We test the technique on multiMNIST, a synthetic multi-label dataset, and on a label classification challenge based on celebA, where we reach accuracy comparable to the state of the art. We introduce the notion of sharpness, and show how to bound this quantity by sampling over categories while also producing a confidence estimate for this bound.

References
[1] Erin L. Allwein, Robert E. Schapire, and Yoram Singer. Reducing multiclass to binary: A unifying approach for margin classifiers. J. Mach. Learn. Res., 1:113-141, September 2001.
[2] Thomas G. Dietterich and Ghulum Bakiri. Solving multiclass learning problems via error-correcting output codes. J. Artif. Int. Res., 2(1):263-286, January 1995.
[3] M. Abbe and E. Wainwright. Information theory and machine learning (tutorial). In ISIT, June 2015.
[4] Thomas G. Dietterich. Ensemble methods in machine learning. In Proceedings of the First International Workshop on Multiple Classifier Systems, MCS '00, pages 1-15, London, UK, 2000. Springer-Verlag.
[5] C. H. Lampert, H. Nickisch, and S. Harmeling. Learning to detect unseen object classes by between-class attribute transfer. In IEEE Conf. on Comp. Vision and Pattern Recogn., pages 951-958, June 2009.
[6] J. Liu, B. Kuipers, and S. Savarese. Recognizing human actions by attributes. In CVPR 2011, pages 3337-3344, June 2011.
[7] C. E. Shannon. A mathematical theory of communication. Bell Sys. Techn. J., 27(3):379-423 & 623-656, 1948.
[8] S. Ferdowsi and S. Voloshynovskiy. Content identification: Machine learning meets coding. In Proc. 35th WIC Symp. Info. Theory, May 2014.
[9] Chao Li, Zhiyong Feng, and Chao Xu. Error-correcting output codes for multi-label emotion classification. Multimedia Tools and Applications, 75, 05 2016.
[10] Lars Kai Hansen and Peter Salamon. Neural network ensembles. IEEE Trans. Pattern Anal. Mach. Intell., 12:993-1001, 1990.
[11] Francesco Masulli and Giorgio Valentini. Effectiveness of error correcting output codes in multiclass learning problems. In Multiple Classifier Systems, 2000.
[12] Venkatesan Guruswami and Amit Sahai. Multiclass learning, boosting, and error-correcting codes. In Proceedings of the Twelfth Annual Conference on Computational Learning Theory, COLT '99, pages 145-155, New York, NY, USA, 1999. ACM.
[13] L. G. Valiant. A theory of the learnable. Commun. ACM, 27(11):1134-1142, November 1984.
[14] R. W. Hamming. Error detecting and error correcting codes. The Bell System Technical Journal, 29(2):147-160, April 1950.
[15] G. David Forney, Jr. Concatenated codes, December 1965.
[16] S. Hassani, K. Alishahi, and R. Urbanke. Finite-length scaling for polar codes. IEEE Trans. Info. Theory, 60(10):5875-5898, 2014.
[17] C. G. Blake. Energy Consumption of Error Control Coding Circuits. PhD thesis, University of Toronto, Toronto, June 2017.
[18] Maxwell Nye and Andrew Saxe. Are efficient deep representations learnable? ICLR 2018 Workshop Submission.
[19] F. Piazza, A. Uncini, and M. Zenobi. Artificial neural networks with adaptive polynomial activation function, 1992.
[20] Leo Breiman. Bagging predictors. Machine Learning, 24:123-140, 1996.
[21] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
[22] Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. CoRR, abs/1810.04650, 2018.
Appendix: Bit Accuracies
Attribute          Sener et al. [22]   Baseline accuracy   Repetition-corrected accuracy   Parity-corrected accuracy
Wearing Necktie    0.965               0.965               0.970
Male
Chubby             0.955               0.952               0.957
Wearing Hat        0.989               0.990