Adversarial Data Programming: Using GANs to Relax the Bottleneck of Curated Labeled Data
Arghya Pal, Vineeth N Balasubramanian
Department of Computer Science and Engineering
Indian Institute of Technology, Hyderabad, INDIA
{cs15resch11001, vineethnb}@iith.ac.in

Abstract
Paucity of large curated hand-labeled training data for every domain-of-interest forms a major bottleneck in the deployment of machine learning models in computer vision and other fields. Recent work (Data Programming) has shown how distant supervision signals in the form of labeling functions can be used to obtain labels for given data in near-constant time. In this work, we present Adversarial Data Programming (ADP), an adversarial methodology that generates data as well as a curated aggregated label, given a set of weak labeling functions. We validated our method on the MNIST, Fashion MNIST, CIFAR 10 and SVHN datasets, and it outperformed many state-of-the-art models. We conducted extensive experiments to study its usefulness, and showed how the proposed ADP framework can be used for transfer learning as well as multi-task learning, where data from two domains are generated simultaneously along with the label information. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.
1. Introduction
Curated labeled data is a key building block of modern machine learning algorithms, and a driving force for deep neural network models. The large parameter space of deep models requires very large labeled datasets to build effective models that work in practice. However, this inherited dependency on large curated labeled data has become the major bottleneck of progress in the use of machine learning and deep learning in computer vision and other domains [41]. Creation of large-scale hand-annotated datasets in every domain is a challenging task due to the requirement for extensive domain expertise and long hours of human labour and time, which collectively make the overall process expensive and time-consuming. Even when data annotation is carried out using crowdsourcing (e.g. Amazon Mechanical Turk), additional effort is required to measure the correctness (or goodness) of the obtained labels. We seek to address this problem in this work.

In particular, we focus on automatically learning the parameters of a given joint image-label probability distribution (as provided in training image-label pairs) with a view to automatically create labeled datasets. To achieve this objective, we exploit the use of distant supervision signals to generate labeled data. These distant supervision signals are provided to our framework as a set of weak labeling functions which represent domain knowledge or heuristics obtained from experts or crowd annotators. Writing a set of labeling functions (as we found in our experiments) is fairly easy and quick, and can then be used in our framework to generate data as well as associated labels. More interestingly, such labeling functions are often easily generalizable, thus allowing our framework to be extended to transfer learning and multi-task learning (discussed in Section 5). Figure 1 shows a few examples of our results to illustrate the overall idea.

In practice, labeling functions can be associated with two kinds of dependencies: (i) relative accuracies, which measure the correctness of the labeling functions w.r.t. the true class label; and (ii) inter-function dependencies that capture the relationships between the labeling functions with respect to the predicted class label. In this work, we propose a novel adversarial framework using Generative Adversarial Networks (GANs) that learns these dependencies along with the data distribution using a minmax game. Our GAN learns to generate a joint data-label distribution using a generator block, a discriminator block and a Labeling Functions Block (LFB), which contains another discriminator that helps in learning the two kinds of dependencies mentioned above.

This paper is accepted in CVPR 2018.

Figure 1: (a) Sample results of image-label pairs generated using the proposed ADP framework trained on CIFAR-10, MNIST and SVHN datasets (top to bottom respectively). Note that the label is generated by our model; (b) Demonstration of cross-domain multi-task learning using ADP, where the same model generates data from both Fashion MNIST and LookBook datasets (Section 5). Note that Fashion MNIST is grayscale while LookBook is color, and the model still generates both data effectively; (c) Demonstration of transfer learning of our ADP from the MNIST dataset (source domain) to generate image-label pairs on the SVHN dataset (target domain).
The overall architecture of the proposed ADP framework is presented in Figure 2a.

Our broad idea of learning relative accuracies and inter-function dependencies of labeling functions is inspired by the recently proposed Data Programming (DP) framework [36] (and hence, the name ADP), but our method is different in many ways: (i) DP is a strict conditional model (i.e. P(ỹ | x)) that requires additional unlabeled data points even at test time, while our model is a joint distribution model, i.e. P(x, y), and does not require any additional unlabeled data points at test/generation time. (ii) DP learns a generative model using Maximum Likelihood Estimation (MLE) and gradient descent to learn the relative accuracies of labeling functions. We instead replace this approach with a GAN-based adversarial estimation of parameters. [11] and [42] provide insights on the advantage of using a GAN-based estimator over MLE to achieve relatively quicker training and good robustness of generated samples. (iii) To learn the statistical dependencies of labeling functions, DP models the dependency structure of labeling functions as a factor graph, and uses computationally expensive Gibbs sampling techniques to update the gradient in each step. We replace the factor graph and Gibbs sampling-based estimation of inter-function dependencies with another discriminator in our GAN-based estimation, which again speeds up the learning process and provides robust generation at run-time.

As outcomes of this work, we show how a set of low-quality, weak labeling functions can be used within a framework that models a joint data-label distribution to generate robust samples. We also show that this idea can be generalized quite easily to transfer learning and multi-task learning settings. Our contributions can be summarized as follows:

• We propose a novel adversarial framework, ADP, to generate robust data-label pairs that can be used to obtain datasets in domains that have very little data, thus saving human labor and time.
• We show how an adversarial framework can be used to learn dependencies between weak labeling functions and thus provide high-fidelity aggregated labels along with generated data in a GAN setting.
• The proposed framework can also be used in a transfer learning setting, where ADP can be trained on a source domain and then finetuned on a target domain to generate data-label pairs in the target domain.
• We also show the potential of this ADP framework to generate cross-domain data in a multi-task setting, where images from two domains are generated simultaneously by the model along with the labels.
2. Related Work
Data augmentation seems a natural answer to the scarcity of curated hand-labeled training data. However, heuristic data augmentation techniques like [15] and [19] use a limited form of class-preserving image transformations such as rotation, mirroring, addition of small noise, random crops, etc. Interpolation-based methods proposed in [13] and class-conditional models of diffeomorphisms proposed in [20] interpolate between nearest-neighbor labeled data points. The popular SMOTE algorithm [7] performs oversampling to reduce class imbalance and augment the given data. All of these methods depend heavily on hand-tuned parameters: the order of geometric transformations, the optimal values of transformation parameters, etc. A small change in parameters can often negatively impact final performance, as studied in [37], [10] and [15]. In this work, we choose a more intuitive way of creating labeled data by learning a joint distribution model.

Learning a joint data-label distribution using generative models such as [14], [18], and [28] is non-trivial, since the label often requires domain knowledge and is not directly inferable from data. Our proposed model hence uses distant supervision signals (in the form of labeling functions) to generate novel labeled data points. Distant supervision signals such as labeling functions are cheaper than manual annotation of each data point, and have been successfully used in recent methods such as [36]. Ratner et al. proposed a generative model in [36] that uses a fixed number of user-defined labeling functions to programmatically generate synthetic labels for data in near-constant time. DP outperformed a number of approaches such as multiple-instance learning ([38]), co-training ([4]), crowdsourcing ([17]), and ensemble-based weak-learner methods like boosting ([40]), thus reinforcing our choice in this work. Alfonseca et al. [1] generated additional training data using hierarchical topic models for weak supervision. Heuristics for distant supervision are also proposed in [6], but this method does not model the inherent noise associated with such heuristics. Structure learning [43][37] also exploits the use of distant supervision signals for generating labels, but as described in Section 1, these methods, like [36], require unlabeled test data to generate a labeled dataset. Additionally, [36], [37] and [43] are computationally expensive due to their use of Gibbs sampling in MLE.

We instead use an adversarial approach to learn the joint distribution by weighting a set of domain-specific labeling functions using a Generative Adversarial Network (GAN). A GAN ([18]) approximates the real data distribution by optimizing a minmax objective function and thus generates novel out-of-sample data points. Broadly, GANs can be viewed in terms of three manifestations: (i) GANs can be trained to sample from a marginal distribution P(x) ([12], [35], [2]), where x refers to data. (ii) Recent efforts in the literature such as Conditional GAN [31], Auxiliary Classifier GAN [34] and InfoGAN [9] show training of GANs conditioned on class labels y, to thus sample from a conditional distribution, i.e. P(x | y). Other state-of-the-art models with similar objectives have exploited other modalities for the same purpose; for example, Zhang et al. [49] propose a GAN conditioned on images, while Hu et al. [21] propose a GAN conditioned on text. (iii) There have been a few very recent efforts ([46], [51] and [22]) which attempt to train GANs to sample from a joint distribution.
For example, CoGAN ([29]) introduces a parameter-sharing approach to learn an unpaired joint distribution between two domains, while TripleGAN [27] brings together a classifier along with the discriminator and generator, which helps in a semi-supervised setting. In this work, we propose a novel idea to instead use distant supervision signals to accomplish learning the joint distribution of labeled images. We now describe the proposed methodology.
3. Adversarial Data Programming (ADP): Methodology
Our central aim in this work is to learn the parameters of a probabilistic model:

P(x, y)    (1)

that captures the joint distribution over the data x and the corresponding labels y, thus allowing us to generate out-of-sample data points along with their corresponding labels (we focus on images in the rest of this paper). While recent efforts such as [29] and [16] have considered complementary objectives, they largely focused on learning joint probability distributions in cross-domain understanding settings. In this work, we focus on learning the joint image-label probability distribution with a view to automatically create labeled datasets, by exploiting the use of distant supervision signals to generate labeled data. To the best of our knowledge, this is the first such work that invokes distant supervision while learning the joint distribution P(x, y), so as to generate labeled data points at scale from P(x, y). Besides, automatic generation of labels for data based on training data-label pairs is non-trivial, and often does not work directly. Distant supervision provides us a mechanism to achieve this challenging goal. We encode distant supervision signals as a set of (weak) definitions by annotators, using which unlabeled data points can be labeled. These definitions can be harvested from knowledge bases, domain heuristics, ontologies, rules-of-thumb, educated guesses, decisions of weak classifiers, or obtained using crowdsourcing. Many application domains have such distant supervision available through domain knowledge or heuristics, which can be leveraged in the proposed framework. We provide examples in Section 4 when we describe our experiments.

We encapsulate all available distant supervision signals, henceforth called labeling functions, in a unified abstract container called Labeling Functions Block (LFB, see Figure 2a). Let the LFB comprise n labeling functions λ_1, λ_2, ..., λ_n, where each labeling function is a mapping:

λ_i : x_j → Λ_ij    (2)

that maps a data point x_j to an m-dimensional probabilistic label vector Λ_ij ∈ R^m, where m is the number of class labels, with Σ_{k=1}^{m} Λ_ij^k = 1 and 0 ≤ Λ_ij^k ≤ 1 for each k ∈ {1, ..., m}. For example, x_j could be an image from the MNIST dataset, and Λ_ij ∈ R^10 would be the corresponding label vector when the labeling function λ_i is applied to x_j. Λ_ij, for instance, could be the one-hot 10-dimensional class vector; see Figure 2b.

We characterize the set of labeling functions {λ_i, i = 1, ..., n} with two kinds of dependencies: (i) relative accuracies of the labeling functions with respect to the true class label of a given data point; and (ii) inter-function dependencies that capture the relationships between the labeling functions with respect to the predicted class label. To obtain a final label y for a given data point x using the LFB, we use two different sets of parameters, Θ and Φ, to capture each of these dependencies between the labeling functions. We hence denote the Labeling Functions Block (LFB) as:

LFB_{λ,Θ,Φ} : x_j → Λ_j    (3)

i.e. given a set of labeling functions λ, a set of parameters capturing the relative accuracy-based dependencies between the labeling functions, Θ, and a second set of parameters capturing inter-label dependencies, Φ, the LFB provides a probabilistic label vector Λ_j for a given data input x_j.

Figure 2: (a) Overall architecture of the Adversarial Data Programming (ADP) framework; (b) Example of a set of labeling functions.
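For concreteness, a labeling function in this abstraction is simply any map from an input to an m-dimensional probability vector. Below is a minimal Python sketch of one such heuristic function for digit images; the function name, the edge measure and the threshold are all illustrative assumptions, not taken from the paper's released code:

```python
import numpy as np

def lf_long_edges(image, m=10):
    """Toy heuristic labeling function for digit images: votes for classes
    that typically contain long vertical strokes (e.g., 1, 4, 7).
    Returns an m-dimensional probabilistic label vector (Eqn 2)."""
    # Crude vertical-edge response; any off-the-shelf edge detector would do.
    vertical_strength = np.abs(np.diff(image.astype(float), axis=0)).sum()
    scores = np.ones(m)
    if vertical_strength > 50.0:    # illustrative threshold, not from the paper
        scores[[1, 4, 7]] += 5.0
    return scores / scores.sum()    # entries in [0, 1], summing to 1
```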
The joint distribution we seek to model in this work (Equation 1) hence becomes:

P(x, LFB_{λ,Θ,Φ}(x))    (4)

In the rest of this section, we show how we can learn the parameters of the above distribution modeling image-label pairs using an adversarial framework with a high degree of label fidelity. We use Generative Adversarial Networks (GANs) to model the joint distribution in Equation 4. In particular, we provide a mechanism to integrate the LFB (Equation 3) into the GAN framework, and show how Θ and Φ can be learned through the framework itself. Our adversarial loss function is given by:

min_G max_D L(G, D) = E_{(x,y)∼P_real(x,y)}[log D(x, y)] + E_{(x̃,Λ)∼P_fake(z)}[log(1 − D(x̃, Λ))]    (5)

where G is the generator module and D is the discriminator module. The overall architecture of the proposed ADP framework is shown in Figure 2a.

This approach has a few advantages: (i) labeling functions (which can even be just loosely defined) are cheaper to obtain than collecting labels for a large dataset; (ii) labeling functions can help bring domain knowledge into such generative models; (iii) labeling functions act as an implicit regularizer in the label space, thus allowing good generalization; (iv) with a little fine-tuning, labeling functions can be easily re-purposed for new domains (transfer learning), as we describe later in this paper.

The ADP architecture is designed to learn the parameters required to model the joint distribution in Equation 4, and thus generate out-of-sample image-label pairs. The architecture is broadly divided into three modules: the generator, the discriminator and the LFB. We now describe each of these modules individually.

Generator: Given a noise input z and a set of labeling functions λ, the generator G outputs an image x̃ and the parameters Θ and Φ, the dependencies between the labeling functions described earlier. In particular, G consists of three blocks: G_common, G_image and G_parameter, as shown in Figure 2a. G_common captures the common high-level semantic relationships between the data and the label space, and is comprised only of fully connected (FC) layers. The output of G_common forks into two branches: G_image and G_parameter, where G_image generates the image x̃, and G_parameter generates the parameters (Θ, Φ). While G_parameter uses FC layers, G_image uses Fully Convolutional (FCONV) layers to generate the image (more details in Section 4). Thus, the generator G outputs (x̃, Θ, Φ) given input z ∼ N(0, I), the standard normal distribution.

Discriminator: The discriminator D of ADP estimates the likelihood of an image-label input pair being drawn from the real distribution obtained from training data. D takes a batch of image-label pairs as input and maps it to a probability score to estimate the aforementioned likelihood of the image-label pair. To accomplish this, D has two branches: D_image and D_label (shown in the Discriminator block in Figure 2a). These two branches are not coupled in the initial layers, so as to separately extract the required low-level features.
The branches share weights in later layers to extract joint semantic features that help D correctly classify whether an image-label pair is fake or real. We hence expand our objective function from Equation 5 to the following:

min_G max_{D_image, D_label} L(G, D_image, D_label) =
    E_{(x,y)∼P_real(x,y)}[log D_image(x)] + E_{z∼N(0,I)}[log(1 − D_image(G_image(z)))]
  + E_{(x,y)∼P_real(x,y)}[log D_label(y)] + E_{z∼N(0,I)}[log(1 − D_label(LFB(G(z))))]    (6)

Labeling Functions Block (LFB): This is a critical module of the proposed ADP framework. Our initial work revealed that a simple weighted (linear or non-linear) sum of the labeling functions does not perform well in generating out-of-sample image-label pairs. We hence used a separate adversarial methodology within this block to learn the dependencies, both relative accuracies and inter-function dependencies (discussed earlier in this section), between the labeling functions provided to the framework. We describe the components of the LFB below.
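A sketch of how Equation 6 decomposes over the two discriminator branches, in PyTorch-style code; the module interfaces (G.image, LFB as a callable) are assumptions that mirror Figure 2a, not the authors' implementation:

```python
import torch

def adversarial_losses(D_image, D_label, G, LFB, x_real, y_real, z):
    """Composite objective of Equation 6, split over the two branches."""
    x_fake = G.image(z)             # G_image(z)
    y_fake = LFB(G(z))              # aggregated probabilistic label from the LFB
    # Discriminators ascend on the four terms of Equation 6 (negated for a minimizer).
    d_loss = -(torch.log(D_image(x_real)).mean()
               + torch.log(1 - D_image(x_fake.detach())).mean()
               + torch.log(D_label(y_real)).mean()
               + torch.log(1 - D_label(y_fake.detach())).mean())
    # Generator descends on the fake terms.
    g_loss = (torch.log(1 - D_image(x_fake)).mean()
              + torch.log(1 - D_label(y_fake)).mean())
    return d_loss, g_loss
```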
Aggregation via relative accuracies: The output Θ of the G_parameter block in the ADP generator G provides the relative accuracies of the labeling functions. Given the image x̃ generated by G_image, the labeling functions {λ_1, ..., λ_n}, and the probabilistic label vectors {Λ_i, i = 1, ..., n} obtained using the labeling functions (as in Eqn 2), we define the aggregated final label as:

ỹ = Σ_{i=1}^{n} θ̃_i Λ_i = Θ̃ · Λ    (7)

where θ̃_i is the normalized version of θ_i, i.e. θ̃_i = θ_i / Σ_{k=1}^{n} θ_k. The aggregated label ỹ is provided as an output of the LFB.

Learning inter-function dependencies: Our preliminary empirical studies with considering only the relative accuracies of labeling functions as a weighting mechanism led to mode collapse in the joint distribution space, a well-understood problem in GANs: either images of the same class with different labels, or images of different classes with the same label, were generated. The rationale behind using two discriminators is to penalize the missing modes. Related literature [36] shows that inter-function dependencies act as an implicit regularizer in the label space. We also conducted experiments on synthetic data to demonstrate this issue (please see Fig ?? below). We hence introduced an adversarial mechanism inside the LFB to influence the final relative accuracies, θ̃, using the inter-function dependencies between the labeling functions. D_LFB, a discriminator inside the LFB, receives two inputs: Φ, which is output by G_parameter, and Φ_real, which is obtained from Θ using the procedure described in Algorithm 1.
Algorithm 1: Procedure to compute Φ_real
Input: Labeling functions {λ_1, ..., λ_n}; relative accuracies θ_1, ..., θ_n; output probability vectors of the labeling functions Λ_1, ..., Λ_n
Output: Φ_real
Set Φ_real = I(n, n)                     /* I = identity matrix */
for i = 1 to n do                        /* for each labeling function */
    for j = i + 1 to n do                /* for each other labeling function */
        /* If the one-hot encodings of the outputs of the two functions match,
           increment the (i, j)-th entry of Φ_real by 1 */
        Φ_real(i, j) = Φ_real(i, j) + OneHot(θ_i Λ_i) · OneHot(θ_j Λ_j);
    end
end
for p = 1 to n do
    Φ_real(p, ·) = Φ_real(p, ·) / Σ_{u=1}^{n} Φ_real(p, u);
end
Set Φ_real = Φ_real + Φ_real^T − diag(Φ_real)    /* complete matrix using symmetry */
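A minimal NumPy sketch of Algorithm 1 for a single sample, assuming each labeling function's output is stored as a row of a (n, m) matrix; over a mini-batch, the agreement counts would be accumulated before normalization:

```python
import numpy as np

def one_hot(v):
    """One-hot encoding of the arg-max entry of a probability vector."""
    e = np.zeros_like(v)
    e[np.argmax(v)] = 1.0
    return e

def compute_phi_real(thetas, lambdas):
    """Inter-function dependency matrix Phi_real (Algorithm 1).

    thetas:  (n,) relative accuracies of the n labeling functions
    lambdas: (n, m) probabilistic label vectors, one row per function
    """
    n = len(lambdas)
    phi = np.eye(n)
    for i in range(n):
        for j in range(i + 1, n):
            # Agreement of the one-hot encodings of the weighted outputs
            phi[i, j] += one_hot(thetas[i] * lambdas[i]) @ one_hot(thetas[j] * lambdas[j])
    phi /= phi.sum(axis=1, keepdims=True)        # row-normalize the counts
    return phi + phi.T - np.diag(np.diag(phi))   # complete by symmetry
```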
Algorithm 1 computes a matrix of interdependencies between the labeling functions, Φ_real, by looking at the one-hot encodings of their predicted label vectors. If the one-hot encodings match for a given data input, we increase the count of their correlation by one, and compute this matrix across a particular mini-batch of data points under consideration. The counts are then normalized row-wise to obtain Φ_real. The task of the discriminator is to recognize the computed interdependencies as real, and the Φ generated through the network in G_parameter as fake. The gradient backpropagated through this discriminator to the G block is critical as a regularizer in learning a better Θ, which is finally used to weight the labeling functions (as in Section 3.3.1). Combining the gradient information from D_LFB along with that from D penalizes missing modes and helps G generate more variety in the samples. The objective function of our second adversarial module is hence:

min_G max_{D_LFB} L(D_LFB, G) = E_{z∼N(0,I)}[log D_LFB(Φ_real(z))] + E_{z∼N(0,I)}[log(1 − D_LFB(Φ(z)))]    (8)

where Φ_real and Φ are obtained from G_parameter(z) as described above. More details of the LFB are provided in the implementation details in Section 4.

The overall architecture of ADP (Figure 2a) is trained using end-to-end backpropagation, with gradients from both discriminators, D and D_LFB, influencing the weights learned inside the generator G. Mini-batches of image-label pairs from a given training distribution are provided as input to ADP, and Stochastic Gradient Descent (SGD) is used to learn the parameters of the model. At the end of training, we define the aggregated final label as:

ỹ = Θ̃ · Φ^T · Λ    (9)

The samples (x̃, ỹ) generated using the G and LFB modules thus provide samples from the desired joint distribution (Eqn 1) modeled using the framework.
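At generation time, the final label of Equation 9 is a single matrix product; a one-line NumPy sketch follows. The final renormalization is an implementation choice we add here so the output remains a probability distribution; the paper does not spell this step out:

```python
import numpy as np

def aggregate_label(theta_tilde, phi, lambdas):
    """y_tilde = Theta_tilde . Phi^T . Lambda  (Equation 9).

    theta_tilde: (n,) normalized relative accuracies
    phi:         (n, n) inter-function dependency matrix
    lambdas:     (n, m) probabilistic label vectors
    """
    y = theta_tilde @ phi.T @ lambdas   # (m,) unnormalized label vector
    return y / y.sum()                  # renormalize to a distribution (assumption)
```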
4. Experiments and Results
We validated the ADP framework on standard datasets: MNIST ([26]), Fashion MNIST ([45]), SVHN ([33]), and CIFAR-10 ([23]). No additional pre-processing is performed on the MNIST, Fashion MNIST and CIFAR 10 datasets. For the SVHN dataset, we used the 'Format 2 Cropped version', and included an additional crop on each image to reduce the presence of more than one digit, though the 32 × 32 dimension is maintained. (Code available at https://github.com/ArghyaPal/Adversarial-Data-Programming)

Labeling functions form a critical element of ADP, and we used different cues from state-of-the-art algorithms to help obtain labeling functions for our experiments. Table 1 shows the labeling functions we used for our experiments on MNIST and SVHN (digit recognition problems), and Table 2 shows the functions used for CIFAR and Fashion MNIST. We categorized labeling functions as: (i) Heuristic; (ii) Image processing-based; and (iii) Deep learning-based labeling functions (as in Tables 1 and 2). Table 3 presents the statistics of the number of labeling functions used for each of the considered datasets (the empirical study that motivated these choices is presented in Section 5). In this work, for each labeling function, a simple threshold rule on the L2-norm of the aforementioned features is used, where the threshold is obtained empirically as the mean of the L2-norms of a randomly chosen subset, which is α-trimmed to remove outliers. More examples of labeling functions and ablation studies on their usefulness are presented in the Supplementary Section.
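A short sketch of the α-trimmed threshold rule described above, assuming a generic feature extractor; the helper name and the trimming fraction alpha are illustrative assumptions:

```python
import numpy as np
from scipy import stats

def make_threshold_rule(feature_fn, sample_images, alpha=0.1):
    """Threshold on the L2-norm of a feature, set to the alpha-trimmed
    mean of the norms over a randomly chosen subset of training images."""
    norms = np.array([np.linalg.norm(feature_fn(x)) for x in sample_images])
    threshold = stats.trim_mean(norms, alpha)    # alpha-trimmed mean
    return lambda x: np.linalg.norm(feature_fn(x)) > threshold
```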
Type                    Labeling Functions used
Heuristic               Presence of long edges (vertical or horizontal) [30]; Image histogram
Image Processing based  Bag-of-features [39]; Haar wavelet [8]; Discrete-continuous ADM [25]; Compressive sensing [50]
Deep Learning based     Convolution kernels from the last conv layer (before fully connected layers) of LeNet

Table 1: Labeling functions used for the MNIST and SVHN datasets, both of which represent the digit recognition task.
Type                    Labeling Functions used
Heuristic               PatchMatch [3]; Blob detection; Presence of edges [1]; Textons; Image histogram
Image Processing based  Global descriptor (GIST-based) [32]; Local descriptor (SIFT-based) [48]; Bag-of-visual-words; Histogram of Oriented Gradients (HOG)-based: HOGgles [44]
Deep Learning based     Convolution kernels from the last conv layer (before fully connected layers) of (ImageNet) pre-trained AlexNet

Table 2: Labeling functions used for the CIFAR 10 and Fashion MNIST datasets.

                Heuristic   Image Processing   Deep Learning
MNIST           43          10                 1
Fashion-MNIST   50          6                  1
SVHN            43          10                 2
CIFAR 10        46          18                 2

Table 3: Number of labeling functions used for different datasets.

Implementation details: G_common has 3 dense fully connected (FC) layers (128 nodes per layer) with batch-normalization. G_image continues with fractional-stride convolutional layers, similar to [29] (FCONV: 1024 nodes per layer, stride 1, followed by batch-normalization and Parameterized ReLU), and generates the image x̃. G_parameter uses FC layers and generates Θ and Φ. The discriminator network D follows the "in-plane rotation" network of [29]. D_label is a stack of FC layers. Both D_image and D_LFB have 2 FCONV layers followed by FC layers. We trained the complete model with mini-batch Stochastic Gradient Descent (SGD) with a batch size of 128, a learning rate of 0.0001, a momentum factor of 0.5 and Adam as the optimizer. A sketch of this generator architecture is given below.

Figure 3: (Best viewed in color) Image-label pairs generated by training on the CIFAR10 dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and our method, ADP. For a given model, the columns of images represent generations after 0.1k, 20k, 40k and 50k epochs, and the rows correspond to the associated class label. 'ap' stands for airplane, and 'am' stands for the automobile class of the CIFAR 10 dataset. Note the clarity of the generations of the proposed method.
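A PyTorch-style sketch of the generator's three blocks as described above. Layer widths follow the text where stated; the deconvolution shapes, latent dimension and output resolution are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ADPGenerator(nn.Module):
    """Sketch of the three-block ADP generator (Figure 2a)."""
    def __init__(self, z_dim=100, n_lfs=55):
        super().__init__()
        self.n_lfs = n_lfs
        # G_common: 3 fully connected layers with batch-norm (128 nodes each)
        self.common = nn.Sequential(
            nn.Linear(z_dim, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 128), nn.BatchNorm1d(128), nn.ReLU(),
            nn.Linear(128, 128), nn.BatchNorm1d(128), nn.ReLU())
        # G_image: fractional-stride convolutions with PReLU (shapes assumed)
        self.image = nn.Sequential(
            nn.Linear(128, 1024 * 4 * 4), nn.Unflatten(1, (1024, 4, 4)),
            nn.ConvTranspose2d(1024, 128, 3, 2, 1, output_padding=1),
            nn.BatchNorm2d(128), nn.PReLU(),
            nn.ConvTranspose2d(128, 3, 3, 2, 1, output_padding=1), nn.Tanh())
        # G_parameter: FC layer emitting Theta (n) and flattened Phi (n*n)
        self.parameter = nn.Linear(128, n_lfs + n_lfs * n_lfs)

    def forward(self, z):
        h = self.common(z)
        p = self.parameter(h)
        theta, phi = p[:, :self.n_lfs], p[:, self.n_lfs:]
        return self.image(h), theta, phi.view(-1, self.n_lfs, self.n_lfs)
```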
Qualitative Results:
We compared our method against other generative methods that allow generation of data along with a label: Conditional GAN or CGAN ([18]), ACGAN ([34]), InfoGAN ([9]), CoGAN ([29]) and TripleGAN ([27]). (We changed the use-case setup of these methods to generate data-label pairs as required. For example, for a conditional GAN, we specified a class label, generated a corresponding image and used this as an image-label pair.) We used the publicly available code for each of the above methods, and the results for CIFAR10 are shown in Figure 3 (results for other datasets are shown in the Supplementary Section). The figure shows that the proposed model generates images with very good clarity. Besides, while some of the aforementioned methods (such as CGAN and InfoGAN) generate images conditioned on a given label (and hence require a label to be provided as input), the label is produced by the model itself in our case.
Quantitative Results:
We considered three evaluation metrics for studying the performance of our method quantitatively: (i)
Human Turing test (HTT):
This metric studies how hard it is for a human annotator to tell the difference between real and generated samples. We asked 40 subjects to evaluate image quality and image-label correspondence (Table 5) on a scale of 10, given 50 random image-label samples from the generated pool for each method considered. Table 5 shows consistently good performance of ADP over other methods, especially in image-label correspondence, which is the focus of this work; (ii)
Inception Score:
The inception score, as used in [29] and [27], for the CIFAR 10 dataset is shown in Figure 5. The figure shows that ADP and TripleGAN perform significantly better than the rest of the methods (more results on other datasets are included in the Supplementary Section). We also used a
Parzen window based evaluation metric, and these results are included in the Supplementary Section.

We trained a ResNet-56 model on the CIFAR-10 dataset under different settings, and the results are shown in Table 4. The addition of labeled data generated using our method significantly lowers the test cross-entropy loss across the epochs.

(Training Data, Test Data)                   5k      10k    15k    20k    30k    40k    50k
(Real data-50K, Real data-10K)               9.83    7.3    7.12   6.3    6.1    4.3    4.19
(ADP data-50K, Real data-10K)                9.32    8.9    8.13   7.0    6.75   5.53   5.0
(Real data-50K, ADP data-10K)                9.67    9.4    7.92   7.3    6.81   6.18   5.6
(ADP-25K + Real data-25K, Real data-10K)     8.5     6.6    6.21   5.7    5.5    4.83   3.5
(ADP-50K + Real data-50K, Real data-10K)     -       -      -      -      -      -      -

Table 4: Test cross-entropy loss of ResNet-56 on the CIFAR-10 dataset across training epochs. (Real data-50K, Real data-10K) = standard dataset; ADP = our method; 10/25/50K = the number of data points used, in thousands. In ADP-25K + Real data-25K, class ratios were maintained as in the original dataset.
Classification Performance:
To study the usefulness of the generated image-label pairs, we studied the classification cross-entropy loss of a pretrained ResNet model on the image-label pairs generated by our ADP at test time. A lower cross-entropy loss in ResNet at test time indicates the efficacy of our model as a data augmentation method. We compared our method against TripleGAN, InfoGAN, CoGAN, as well as the popular oversampling technique SMOTE [7]. Figure 4 shows the result: the proposed model has significantly lower cross-entropy loss than the other methods, highlighting its usefulness.
Table 5: Human Turing Test for image quality and image-label correspondence on the MNIST, Fashion MNIST, SVHN and CIFAR10 datasets, comparing ACGAN, CGAN, InfoGAN, TripleGAN and ADP (Section 4.4; higher the better). Note that the proposed method, ADP, performs the best in most cases, and is a close second when TripleGAN wins.

Figure 4: Classification performance of a pretrained ResNet model on image-label pairs generated by various models trained on CIFAR 10.

Figure 5: Inception scores on CIFAR 10 (Section 4.4, Quantitative Analysis).
5. Discussion and Analysis
Optimal Number of Labeling Functions:
We studied the performance of ADP when the number of labeling functions is varied, to understand the impact of this parameter on performance. We studied the test cross-entropy error of a pretrained ResNet model with image-label pairs generated by ADP trained using different numbers of labeling functions. Table 6 shows our results, suggesting that 50-55 labeling functions provide the best performance, depending on the dataset. This justifies our choice of the number of labeling functions in Table 3.
Transfer Learning:
The use of distant supervision signals such as labeling functions (which can often be generic) allows us to extend the proposed ADP model to a transfer learning setting. In this setup, we trained ADP initially on a source dataset and then finetuned the model on a target dataset with very limited training. In particular, we first trained ADP on the MNIST dataset, and subsequently finetuned the G_image branch alone with the SVHN dataset; a code sketch of this setup follows below. We note that the weights of G_common, G_parameter and D_LFB are unaltered. The final finetuned model is then used to generate image-label pairs (which we hypothesize will look similar to SVHN). Figure 1 shows encouraging results of our experiments in this regard.

No. of Labeling Functions   MNIST     F-MNIST    CIFAR10    SVHN
3                           70.23%    81.02%     87.39%     83.82%
10                          47.92%    71.52%     60.11%     75.91%
25                          20.32%    30.53%     42.31%     38.30%
30                          4.56%     12.47%     21.19%     26.66%
40                          1.40%     6.81%      19.93%     16.62%
50                          -         -          -          -
65                          1.31%     4.73%      18.43%     12.82%
70                          1.25%     4.75%      18.40%     12.80%

Table 6: Performance of ADP when the number of labeling functions is varied (Section 5, Optimal Number of Labeling Functions).

Figure 6: Transfer learning from the MNIST to the SVHN dataset. Digits within parentheses indicate the true label, while the other is the label generated using our method (Section 5, Transfer Learning).
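A minimal PyTorch sketch of the finetuning setup just described, freezing everything except the G_image branch; `gen` is assumed to be an ADPGenerator-style module as sketched in Section 4:

```python
import torch

def prepare_for_transfer(gen, d_lfb):
    """Freeze G_common, G_parameter and D_LFB; finetune only G_image."""
    for module in (gen.common, gen.parameter, d_lfb):
        for p in module.parameters():
            p.requires_grad = False        # these weights remain unaltered
    return [p for p in gen.image.parameters() if p.requires_grad]

# Usage sketch: optimizer over the G_image branch alone
# optimizer = torch.optim.Adam(prepare_for_transfer(gen, d_lfb), lr=1e-4)
```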
Multi-task Joint Distribution Learning:
Learning a cross-domain joint distribution from heterogeneous domains is a challenging task. We show that the proposed ADP method can be used to achieve this by modifying its architecture as shown in Figure 7, to simultaneously generate data from two different domains. We study this architecture on the MNIST and SVHN datasets, and show the promising results of our experiments in Figure 8. The LFB acts as a regularizer and maintains the correlations between the domains in this case. More results on other datasets - in particular, LookBook and Fashion MNIST - are included in the Supplementary Section as well as Figure 1.
6. Conclusions
Paucity of large curated hand-labeled training data for every domain-of-interest forms a major bottleneck in deploying machine learning methods in practice. Standard data augmentation techniques and other heuristics are often limited in their scope and require carefully picked, hand-tuned parameters. We instead propose a new adversarial framework called Adversarial Data Programming (ADP), which can learn the joint data-label distribution effectively using a set of weakly defined labeling functions. The method shows promise on standard datasets, as well as in settings such as transfer learning and multi-task learning. Our future work will involve understanding the theoretical implications of this new framework from a game-theoretic perspective, as well as exploring the performance of the method on more complex datasets.

Figure 7: ADP for Multi-Task Learning: Proposed Architecture.

Figure 8: Results of ADP - Multi-Task Learning on MNIST (black and white) and SVHN (RGB) datasets.
References

[1] E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Short Papers - Volume 2, pages 54-59. Association for Computational Linguistics, 2012.
[2] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
[3] C. Barnes, D. B. Goldman, E. Shechtman, and A. Finkelstein. The PatchMatch randomized matching algorithm for image manipulation. Communications of the ACM, 54(11):103-110, 2011.
[4] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92-100. ACM, 1998.
[5] O. Breuleux, Y. Bengio, and P. Vincent. Unlearning for better mixing.
[6] R. Bunescu and R. Mooney. Learning to extract relations from the web using minimal supervision. In ACL, 2007.
[7] N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer. SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16:321-357, 2002.
[8] X. Chen, X. Cheng, and S. Mallat. Unsupervised deep Haar scattering on graphs. In Advances in Neural Information Processing Systems, pages 1709-1717, 2014.
[9] X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2172-2180, 2016.
[10] D. Ciresan, U. Meier, L. Gambardella, and J. Schmidhuber. Deep big simple neural nets excel on handwritten digit recognition. arXiv preprint arXiv:1003.0358, 2010.
[11] I. Danihelka, B. Lakshminarayanan, B. Uria, D. Wierstra, and P. Dayan. Comparison of maximum likelihood and GAN-based training of Real NVPs. arXiv preprint arXiv:1705.05263, 2017.
[12] E. L. Denton, S. Chintala, R. Fergus, et al. Deep generative image models using a Laplacian pyramid of adversarial networks. In Advances in Neural Information Processing Systems, pages 1486-1494, 2015.
[13] T. DeVries and G. W. Taylor. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.
[14] C. Doersch. Tutorial on variational autoencoders. arXiv preprint arXiv:1606.05908, 2016.
[15] A. Dosovitskiy, P. Fischer, J. T. Springenberg, M. Riedmiller, and T. Brox. Discriminative unsupervised feature learning with exemplar convolutional neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(9):1734-1747, 2016.
[16] Z. Gan, L. Chen, W. Wang, Y. Pu, Y. Zhang, H. Liu, C. Li, and L. Carin. Triangle generative adversarial networks. arXiv preprint arXiv:1709.06548, 2017.
[17] H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26(3):10-14, 2011.
[18] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672-2680, 2014.
[19] B. Graham. Fractional max-pooling. arXiv preprint arXiv:1412.6071, 2014.
[20] S. Hauberg, O. Freifeld, A. B. L. Larsen, J. Fisher, and L. Hansen. Dreaming more data: Class-dependent distributions over diffeomorphisms for learned data augmentation. In Artificial Intelligence and Statistics, pages 342-350, 2016.
[21] Z. Hu, Z. Yang, X. Liang, R. Salakhutdinov, and E. P. Xing. Controllable text generation. arXiv preprint arXiv:1703.00955, 2017.
[22] T. Kim, M. Cha, H. Kim, J. Lee, and J. Kim. Learning to discover cross-domain relations with generative adversarial networks. arXiv preprint arXiv:1703.05192, 2017.
[23] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. 2009.
[24] V. V. Kumar, A. Srikrishna, B. R. Babu, and M. R. Mani. Classification and recognition of handwritten digits by using mathematical morphology. Sadhana, 35(4):419-426, 2010.
[25] E. Laude, J.-H. Lange, F. Schmidt, B. Andres, and D. Cremers. Discrete-continuous splitting for weakly supervised learning. arXiv preprint arXiv:1705.05020, 2017.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, 1998.
[27] C. Li, K. Xu, J. Zhu, and B. Zhang. Triple generative adversarial nets. arXiv preprint arXiv:1703.02291, 2017.
[28] Y. Li, K. Swersky, and R. Zemel. Generative moment matching networks. In Proceedings of the 32nd International Conference on Machine Learning (ICML-15), pages 1718-1727, 2015.
[29] M. Liu, O. Tuzel, and A. Sullivan. Coupled generative adversarial nets. 2016.
[30] S. Mandal, S. Sur, A. Dan, and P. Bhowmick. Handwritten Bangla character recognition in machine-printed forms using gradient information and Haar wavelet. In Image Information Processing (ICIIP), 2011 International Conference on, pages 1-6. IEEE, 2011.
[31] M. Mirza and S. Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.
[32] H. Moudni, M. Er-rouidi, M. Oujaoura, and O. Bencharef. Recognition of Amazigh characters using SURF & GIST descriptors. In International Journal of Advanced Computer Science and Applications, Special Issue on Selected Papers from the Third International Symposium on Automatic Amazigh Processing, pages 41-44, 2013.
[33] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, volume 2011, page 5, 2011.
[34] A. Odena, C. Olah, and J. Shlens. Conditional image synthesis with auxiliary classifier GANs. arXiv preprint arXiv:1610.09585, 2016.
[35] A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
[36] A. J. Ratner, C. M. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems, pages 3567-3575, 2016.
[37] A. J. Ratner, H. R. Ehrenberg, Z. Hussain, J. Dunnmon, and C. Ré. Learning to compose domain-specific transformations for data augmentation. arXiv preprint arXiv:1709.01643, 2017.
[38] S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. Machine Learning and Knowledge Discovery in Databases, pages 148-163, 2010.
[39] L. Rothacker, S. Vajda, and G. A. Fink. Bag-of-features representations for offline handwriting recognition applied to Arabic script. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 149-154. IEEE, 2012.
[40] R. E. Schapire and Y. Freund. Boosting: Foundations and Algorithms. MIT Press, 2012.
[41] C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.
[42] L. Theis, A. v. d. Oord, and M. Bethge. A note on the evaluation of generative models. arXiv preprint arXiv:1511.01844, 2015.
[43] P. Varma, B. He, P. Bajaj, I. Banerjee, N. Khandwala, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. arXiv preprint arXiv:1709.02477, 2017.
[44] C. Vondrick, A. Khosla, T. Malisiewicz, and A. Torralba. HOGgles: Visualizing object detection features. In ICCV, 2013.
[45] H. Xiao, K. Rasul, and R. Vollgraf. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
[46] Z. Yi, H. Zhang, P. T. Gong, et al. DualGAN: Unsupervised dual learning for image-to-image translation. arXiv preprint arXiv:1704.02510, 2017.
[47] D. Yoo, N. Kim, S. Park, A. S. Paek, and I. S. Kweon. Pixel-level domain transfer. In European Conference on Computer Vision, pages 517-532. Springer, 2016.
[48] K. Yu, Y. Lin, and J. Lafferty. Learning image representations from the pixel level via hierarchical sparse coding. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 1713-1720. IEEE, 2011.
[49] H. Zhang, T. Xu, H. Li, S. Zhang, X. Huang, X. Wang, and D. Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1612.03242, 2016.
[50] M. Zhenjiang and Y. Baozong. Handwritten character recognition by extended loop neural networks. In Speech, Image Processing and Neural Networks, 1994. Proceedings, ISSIPNN'94., 1994 International Symposium on, pages 460-463. IEEE, 1994.
[51] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. arXiv preprint arXiv:1703.10593, 2017.

Supplementary Section
In this section, we include more details about our algorithm, labeling functions and datasets, as well as additional results and comparisons, which could not be included in the main paper due to space constraints.
A. Algorithm
Algorithm 2 presents the overall stepwise routine of the proposed method, ADP, as described in Section 3. During the training phase, the algorithm updates the weights of the model by estimating gradients for a batch of labeled data points. The hyperparameters that need to be provided include standard parameters used while training a GAN, such as: (i) the number of iterations of Algorithm 2; (ii) a parameter k (similar to [18]) that describes how many times D and D_LFB are updated with respect to G; and (iii) the mini-batch size m. Using empirical studies, we chose m = 128 and k = 2, along with the number of iterations.
In this section, we provide more information on the datasets used for validating ADP in this work: MNIST, Fashion MNIST, SVHN and CIFAR 10. The MNIST dataset comprises 28 × 28 grayscale images (with one handwritten digit in each image) along with the corresponding label, with 50,000 training samples (image-label pairs). In the case of SVHN, we used "format 2" of the dataset, which comprises 73,257 32 × 32 images (each containing a digit captured from street views of house numbers) with the corresponding labels. In the case of CIFAR 10, we merged the five training batches of the dataset and built a training set of 50,000 images. This dataset contains RGB images, each of size 32 × 32, spanning 10 classes: automobile, airplane, bird, cat, deer, dog, frog, horse, ship, truck. The total number of samples is almost equally distributed across all classes. Fashion MNIST, similar to MNIST, consists of a training set of 50,000 28 × 28 grayscale images, each with one of 10 classes: T-shirt, Trouser, Pullover, Dress, Coat, Sandal, Shirt, Sneaker, Bag, Ankle boot.

We also used the LookBook [47] dataset (Figure 9) to demonstrate cross-domain multi-task learning using ADP, as described in Section 5. This dataset contains 84,748 images across 17 classes:
Midi dress, mini dress, coat, jacket, fur jacket, padded jacket, hooded jacket, jumper, cardigan, knitwear, blouse, shirt, sleeveless tee, short sleeve tee, long sleeve tee, hoody, vest. In this work, we grouped these 17 classes into 4 classes - coat, pullover, t-shirt, dress - in order to match the Fashion MNIST dataset and thus help study cross-domain learning. We grouped coat, jacket, fur jacket, padded jacket, hooded jacket, jumper, cardigan into a single coat class; hoody into the pullover class; sleeveless tee, short sleeve tee into the t-shirt class; and cardigan, knitwear, blouse, Midi dress, mini dress into the dress class. The Fashion MNIST dataset also has the same classes - coat, pullover, t-shirt, dress - among its labels, thus facilitating our study. No additional pre-processing was performed on the MNIST, Fashion MNIST, CIFAR 10 or LookBook datasets. In the case of SVHN, an additional crop was performed on each image to ensure only one digit is present in the image. The cropped image was subsequently resampled to maintain the 32 × 32 size. Figure 9 shows illustrative examples of images from the chosen datasets.

Figure 9: Sample images from the datasets studied in this work: (a) CIFAR 10, (b) Fashion MNIST, (c) MNIST, (d) LookBook.

C. More on Labeling Functions
The Labeling Functions Block (LFB) in Figure 2a (Section 3) is implemented using the open-source framework Snorkel [36]. We modified the underlying architecture of the Snorkel framework to include an adversarial approach, where it otherwise estimates dependencies using MLE invoking Gibbs sampling. Labeling functions of three kinds - heuristic, image processing-based and deep learned feature-based - have been used in this work, as described in Section 4. Examples of labeling functions used in this work are shown as Labeling Functions 1, 2, 3 and 4. For each labeling function, a simple threshold rule on the L2-norm of the aforementioned features is used. For each class of a dataset, the threshold is obtained empirically as the average L2-norm of the feature vectors of 20 random samples of that class (with α-trimming to remove outliers). It is worth mentioning that, for an abstract understanding of the working of our labeling functions, the return values of the example Labeling Functions 3 and 4 are one-hot encodings, though in practice we fit a nonlinear function to get a probabilistic output.

Algorithm 2: Training ADP
Input: Number of iterations; number of steps to train D: k; mini-batch size: m
Output: Trained ADP model
for number of iterations do
    for k steps do
        Given the noise prior z ∼ N(0, I), draw a batch of m samples from G: {(x̃_1, Θ_1, Φ_1), ..., (x̃_m, Θ_m, Φ_m)};
        Use Equation 3 (from the LFB) to compute probabilistic label vectors {Λ_1, ..., Λ_m} given {(x̃_1, Θ_1, Φ_1), ..., (x̃_m, Θ_m, Φ_m)};
        Draw a batch of m image-label pairs ((x_1, y_1), ..., (x_m, y_m)) from the real distribution P_real(x, y);
        Update the weights of the discriminators D and D_LFB (ψ_d and ψ_l respectively), using mini-batch stochastic gradient ascent with gradients computed as:
            ∇_{ψ_d} (1/m) Σ_{i=1}^{m} [ log D(x_i, y_i) + log(1 − D(x̃_i, Λ_i)) ]
        and
            ∇_{ψ_l} (1/m) Σ_{i=1}^{m} [ log D_LFB(Φ_real,i) + log(1 − D_LFB(Φ_i)) ]
    end
    Given the noise prior z ∼ N(0, I), draw a batch of m samples from G: {(x̃_1, Θ_1, Φ_1), ..., (x̃_m, Θ_m, Φ_m)};
    Update the weights of the generator G, ψ_g, using mini-batch stochastic gradient descent with the gradients computed below (each step updated sequentially, one after another):
        ∇_{ψ_g} (1/m) Σ_{i=1}^{m} [ log(1 − D_LFB(Φ_i)) ]
    and
        ∇_{ψ_g} (1/m) Σ_{i=1}^{m} [ log(1 − D(x̃_i, Λ_i)) ]
end
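A condensed Python skeleton of Algorithm 2; the module interfaces (G, D, D_lfb, LFB) and the helper phi_real follow the sketches from Section 3, and data loading details are elided:

```python
import torch

def train_adp(G, D, D_lfb, LFB, phi_real, loader, opt_g, opt_d, opt_l,
              iterations, k=2, m=128, z_dim=100):
    """Training skeleton mirroring Algorithm 2 (interfaces are assumptions)."""
    data = iter(loader)
    for _ in range(iterations):
        for _ in range(k):                     # k discriminator steps per G step
            z = torch.randn(m, z_dim)          # noise prior z ~ N(0, I)
            x_fake, theta, phi = G(z)
            lam = LFB(x_fake, theta)           # probabilistic labels (Eqn 3)
            x_real, y_real = next(data)        # real mini-batch
            # Ascend on D (Eqn 5) and D_LFB (Eqn 8): minimize the negatives.
            d_loss = -(torch.log(D(x_real, y_real)).mean()
                       + torch.log(1 - D(x_fake.detach(), lam.detach())).mean())
            l_loss = -(torch.log(D_lfb(phi_real(theta.detach(), lam.detach()))).mean()
                       + torch.log(1 - D_lfb(phi.detach())).mean())
            opt_d.zero_grad(); d_loss.backward(); opt_d.step()
            opt_l.zero_grad(); l_loss.backward(); opt_l.step()
        z = torch.randn(m, z_dim)              # generator update
        x_fake, theta, phi = G(z)
        lam = LFB(x_fake, theta)
        g_loss = (torch.log(1 - D_lfb(phi)).mean()
                  + torch.log(1 - D(x_fake, lam)).mean())
        opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```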
C.1. Ablation Studies with Labeling Functions
In order to understand the effect of different kinds of labeling functions, we performed an ablation study on the CIFAR10 dataset (considering it is the most natural of the considered datasets, and that it allows us to compute an Inception score to quantitatively compare the performance of various methods). In this study, we did not alter any hyperparameters described in Section 4. Our ablation study considers the following models:

M1: ADP: Full model
M2: ADP with no dependencies: Same model as ADP, having 55 labeling functions, as in Table 3. Each labeling function is, however, considered independent of the others.
M3: ADP with only heuristic labeling functions: Same model as ADP with 36 heuristic labeling functions, but without any image processing or deep learned feature-based labeling functions
M4: ADP with only image processing labeling functions: Same model as ADP with only 17 image processing-based labeling functions, but without heuristic or deep learned feature-based labeling functions
M5: ADP with only deep learned feature-based labeling functions: Same model as ADP with only 2 deep learned feature-based labeling functions, but without heuristic or image processing labeling functions
M6: ADP with (heuristic labeling functions + deep learned feature-based labeling functions)
M7: ADP with (deep learned feature-based labeling functions + image processing labeling functions)
M8: ADP with (image processing labeling functions + heuristic labeling functions)

Labeling Function 1: Sample heuristic labeling function (used for blobs in digits like 0, 9, 6)
Input: Image
Output: Probabilistic label vector
/* Decision tree for English numerals recognition [24] */
if blob(Image) == TRUE then
    if blob_diameter(Image) ≤ threshold then
        number = count_blob(Image);
        if number == 2 then return [0.2, 0, 0, 0, 0, 0.1, 0, 0, 0.6, 0.1]; end
        if number == 1 then return [0.6, 0, 0.2, 0, 0.1, 0, 0, 0, 0.1]; end
    end
    if blob_diameter(Image) > threshold then
        return [0.4, 0, 0, 0, 0.1, 0, 0.3, 0.1, 0.1];
    end
end

Labeling Function 2: Sample heuristic labeling function (used for digits with blob and stem like 4, 6, 9)
Input: Image
Output: Probabilistic label vector
/* Decision tree for English numerals recognition [24] */
if blob(Image) == TRUE then
    number = count_stem(Image);
    if number == 0 then return [0.8, 0, 0, 0, 0.1, 0, 0, 0, 0.1]; end
    if number == 1 then return [0.1, 0, 0, 0, 0.4, 0, 0.4, 0, 0.1]; end
    if number == 2 then return [0, 0, 0, 0, 0.8, 0, 0, 0, 0.2]; end
end

The inception scores for the aforementioned 8 models are presented in Table 7. The base ADP model, comprising all labeling functions, outperforms all other models, highlighting the usefulness of a variety of labeling functions to model the non-trivial P(x, y) distribution.

D. More Qualitative Results
In addition to the results on CIFAR 10 presented in Section 4.4, we also studied the performance of our ADP method against other generative methods (CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN) on the MNIST, SVHN and Fashion MNIST datasets. Similar to the CIFAR 10 generation, we changed the use-case setup of the other methods to generate labeled images, using the publicly available code for each of the methods. Figures 10, 11 and 12 present these results.

Labeling Function 3: Sample image processing based labeling function (based on Bag-of-Words)
Input: Image, n: number of classes
Output: Probabilistic label vector
/* NOTE: Loop presented below for purposes of clarity - it is implemented only once for a dataset */
for i = 1 ... n do
    v_avg_i = average value of the L2-norm of Bag-of-features() on a subset of training samples from class i;
end
v = Bag-of-features(Image);
v_Image = ||v||;
return OneHot(argmin_i |v_avg_i − v_Image|)

Labeling Function 4: Sample deep learned feature-based labeling function
Input: Image, n: number of classes, kernels from the first layer of pre-trained AlexNet (trained on ImageNet)
Output: Probabilistic label vector
/* Deep learning based labeling function */
m = number of kernels from the first layer of pre-trained AlexNet;
for i = 1 ... n do
    for j = 1 ... m do
        v_avg_ij = average value of the Frobenius norm of the activation map of the j-th kernel on a subset of training samples from class i;
    end
end
for j = 1 ... m do
    v_Image_j = value of the Frobenius norm of the activation map of the j-th kernel on Image;
end
return OneHot(argmin_i [v_avg_i · v_Image])

                  M1   M2   M3   M4   M5   M6   M7   M8
Inception Score   -    -    -    -    -    -    -    -

Table 7: Ablation study w.r.t. labeling functions, as described in Section C.1.

Figure 10: Image-label pairs generated by training on the MNIST dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and our method, ADP. For a given model, the columns of images represent generations after 0.01k, 0.1k, 1k, 3k, 6k and 9k epochs, and the rows correspond to the associated class label. It is evident that from 6k epochs onward, the ADP model starts generating quality images across classes and with a good image-to-label correspondence.

Figure 11: Image-label pairs generated by training on the SVHN dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and ADP. For a given model, the columns of images represent generations after 0.1k, 1k, 10k, 30k, 40k and 60k epochs, and the rows correspond to the associated class label.
MNIST:
Figure 10 shows that both our method, ADP, and TripleGAN generate good quality images on the MNIST dataset. We observe that both ADP and TripleGAN give a high image-to-label correspondence. Surprisingly, state-of-the-art methods such as CGAN, ACGAN, InfoGAN and CoGAN fail to capture image-to-label correspondence despite generating good quality images.

SVHN:
As shown in Figure 11, we observe that our method generates human-recognizable images with a good image-to-label correspondence in just 1k epochs on the relatively harder SVHN dataset. At higher epochs, CoGAN (epoch = 30k) and TripleGAN (epoch = 40k) also generate images of good quality, but broadly fail to capture the different styles, backgrounds and illuminations of the generated digits.

Fashion MNIST:
Most of the considered methods do well on this dataset. ADP and TripleGAN provide the sharpest results on close visual observation.
E. More Quantitative Results
Parzen Window Based Evaluation:
In addition to the results with the Inception score and HTT presented in Section 4.4, we compared our method against the other generative models (described in Section 4) based on the Parzen window score at test time. The Parzen window [5] is a commonly used non-parametric density estimation method to evaluate generative models (especially GANs [18]) for which the exact likelihood is not tractable. Based on the samples generated by a model, we use a Parzen window with a Gaussian kernel as a density estimator. This provides a proxy for the true log-likelihood and thereby evaluates the test log-likelihood; a sketch of this estimator is given after the figure captions below. These results are shown in Table 8. The table shows that ADP performed significantly well on the MNIST (score of 344) and SVHN (score of 246) datasets and outperformed other state-of-the-art models including TripleGAN. For Fashion MNIST, our method is a close second with respect to TripleGAN. We chose the Parzen window size using cross-validation, as described in [18].

Figure 12: Image-label pairs generated by training on the Fashion MNIST dataset using CGAN, ACGAN, InfoGAN, CoGAN, TripleGAN and our method, ADP. For a given model, the columns of images represent generations after 0.1k, 1k, 10k, 15k, 20k and 25k epochs, and the rows correspond to the associated class label.

Figure 13: (a) Test-time classification cross-entropy loss of a pretrained ResNet model on image-label pairs generated by ADP, ADP (i.e. only its Image-GAN component) with majority voting, and ADP (i.e. only its Image-GAN component) with DP for labels; (b) Average running time of ADP against other methods to estimate the relative accuracies and inter-function dependencies in DP.
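A minimal sketch of the Parzen-window log-likelihood estimate described above, with a Gaussian kernel; the bandwidth sigma would be chosen by cross-validation as in [18]:

```python
import numpy as np

def parzen_log_likelihood(samples, test_points, sigma):
    """Average log-density of test points under a Gaussian Parzen window
    fit to generated samples (rows are flattened images)."""
    n, d = samples.shape
    lls = []
    for x in test_points:
        sq = ((x - samples) ** 2).sum(axis=1) / (2 * sigma ** 2)
        # log-mean-exp over the n kernels, for numerical stability
        mx = (-sq).max()
        log_kde = mx + np.log(np.exp(-sq - mx).mean())
        lls.append(log_kde - 0.5 * d * np.log(2 * np.pi * sigma ** 2))
    return float(np.mean(lls))
```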
F. More Results on Multi-task Joint Distribution Learning
In continuation of the results presented in Section 5 (and Figure 1), we present more results on the capability of ADP to perform multi-task joint distribution learning in Figure 15. The figure shows that ADP is able to generate samples from two different domains, including samples of different colors.
G. Comparison against Vote Aggregation Methods
Comparison against Majority Voting and DP:
To study the usefulness of learning relative accuracies and inter-function dependencies using ADP, we compared the performance of our method with both majority voting and Data Programming (DP, [36]). In majority voting, the LFB does not estimate the relative accuracies and inter-function dependencies of the labeling functions as described in Section 3. Instead, for a given image, each labeling function of the LFB makes a probabilistic prediction, and we take a maximum vote to obtain the final label. As in Section 4.4, we studied the test-time classification cross-entropy loss of a pretrained ResNet model on image-label pairs generated by ADP, ADP (i.e. only its Image-GAN component) with majority voting, and ADP (i.e. only its Image-GAN component) with DP. The results are presented in Figure 13a, which shows that ADP has significantly lower cross-entropy loss than the other two methods, thus corroborating its effectiveness.
Adversarial Data Programming vs MLE-based Data Programming:
To further quantify the benefits of our ADP, we also show how our method compares against Data Programming (DP) [36] using different variants of MLE: MLE, Maximum Pseudo-likelihood, and Hamiltonian Monte Carlo. We note that DP only aggregates labels; we hence combined a vanilla GAN with DP as separate components to conduct this study. We started with a small number of labeling functions (viz., 35 functions) and progressively added additional labeling functions, noting the

         CGAN   ACGAN   InfoGAN   CoGAN   TripleGAN   ADP
MNIST    198    201     204       225     278         344
SVHN     87     145     178       158     123         246

Table 8: Parzen window based evaluation on MNIST and SVHN (Section E; higher the better). The ADP column values are those quoted in the text above.