Inspector Gadget: A Data Programming-based Labeling System for Industrial Images
Geon Heo, Yuji Roh, Seonghyeon Hwang, Dayun Lee, Steven Euijong Whang
KAIST
{geon.heo, yuji.roh, sh.hwang, dayun.lee, swhang}@kaist.ac.kr

ABSTRACT
As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed where only a small portion is problematic (e.g., identifying defects on a surface). Since manually labeling these images is expensive, weak supervision is an attractive alternative where the idea is to generate weak labels that are not perfect, but can be produced at scale. Data programming is a recent paradigm in this category where it uses human knowledge in the form of labeling functions and combines them into a generative model. Data programming has been successful in applications based on text or structured data and can also be applied to images, usually if one can find a way to convert them into structured data. In this work, we expand the horizon of data programming by directly applying it to images without this conversion, which is a common scenario for industrial applications. We propose Inspector Gadget, an image labeling system that combines crowdsourcing, data augmentation, and data programming to produce weak labels at scale for image classification. We perform experiments on real industrial image datasets and show that Inspector Gadget obtains better accuracy than state-of-the-art techniques: Snuba, GOGGLES, and self-learning baselines using convolutional neural networks (CNNs) without pre-training.
1. INTRODUCTION
In the era of Software 2.0, machine learning techniques for images are becoming democratized where the applications range from manufacturing to medical. For example, smart factories regularly use computer vision techniques to classify defective and non-defective product images [22]. In medical applications, MRI scans are analyzed to identify diseases like cancer [20]. However, many companies are still reluctant to adopt machine learning due to the lack of labeled data where manual labeling is simply too expensive [31].

We focus on the problem of scalable image labeling for classification where large images are partially analyzed, and there are few or no labels to start with. Although many companies face this problem, it has not been studied enough.
Figure 1: Labeling industrial images with Inspector Gadget in a smart factory application.

Based on a collaboration with a large manufacturing company, we provide the following running example. Suppose there is a smart factory application where product images are analyzed for quality control (Figure 1). These images are usually taken from industrial cameras and have high resolution. The goal is to look at each image and tell if there are certain defects (e.g., identify scratches, bubbles, and stampings). For convenience, we hereafter use the term defect to mean a part of an image of interest.

A conventional solution is to collect enough labels manually and train, say, a convolutional neural network on the training data. However, fully relying on crowdsourcing for image labeling can be too expensive. In our application, we have heard of domain experts demanding six-figure salaries, which makes it infeasible to simply ask them to label images. In addition, relying on general crowdsourcing platforms like Amazon Mechanical Turk may not guarantee high-enough labeling quality.

Among the possible methods for data labeling (see an extensive survey [26]), weak supervision is an important branch of research where the idea is to semi-automatically collect labels that are not perfect like manual ones (thus called "weak"), but have reasonable quality where the quantity compensates for the quality. Data programming [25] is a representative technique of employing humans to develop labeling functions that individually perform labeling, perhaps not accurately. However, the combination of labeling functions into a generative model results in a reasonable labeling program. These weak labels can then be used to train an end discriminative model.

So far, data programming has been shown to be effective in finding various relationships in text and structured data [24]. Data programming has also been successfully applied to images where they are usually converted to structured data beforehand [34, 36]. However, this conversion limits the applicability of data programming. As an alternative to data programming, the recently proposed GOGGLES [9] demonstrates that, on images, automatic approaches using pre-trained models may be more effective. Here the idea is to extract semantic prototypes of images using the pre-trained model and then cluster and label the images using the prototypes. However, GOGGLES also has limitations (see Section 6.2), and it is not clear if it is the only solution for generating training data for image classification.

We thus propose Inspector Gadget, which opens up a new class of problems for data programming by enabling direct image labeling at scale without the need to convert to structured data, using a combination of crowdsourcing, data augmentation, and data programming techniques. Inspector Gadget provides a crowdsourcing workflow where workers identify patterns that indicate defects. Here we make the tasks easy enough for non-experts to contribute. These patterns are augmented using generative adversarial networks (GANs) [12] and policies [7]. Each pattern effectively becomes a labeling function by being matched with other images. The similarities are then used as features to train a multi-layer perceptron (MLP), which generates weak labels. In our experiments, Inspector Gadget performs better overall than state-of-the-art methods: Snuba, GOGGLES, and self-learning baselines that use CNNs (VGG-19 [29] and MobileNetV2 [27]) without pre-training.
We release our code as a community resource [1].

In the rest of the paper, we present the following:
• The architecture of Inspector Gadget (Section 2).
• The component details of Inspector Gadget:
  • Crowdsourcing workflow for helping workers identify patterns (Section 3).
  • Pattern augmenter for expanding the patterns using GANs and policies (Section 4).
  • Feature generator and labeler for generating similarity features and producing weak labels (Section 5).
• Experimental results where Inspector Gadget outperforms state-of-the-art image labeling techniques – Snuba, GOGGLES, and self-learning baselines using CNNs – where there are few or no labels to start with (Section 6).
2. OVERVIEW
The main technical contribution of Inspector Gadget is its effective combination of crowdsourcing, data augmentation, and data programming for scalable image labeling for classification. Figure 2 shows the overall process of Inspector Gadget. First, a crowdsourcing workflow helps workers identify patterns of interest from images that may indicate defects. While the patterns are informative, they may not be enough and are thus augmented using generative adversarial networks (GANs) [12] and policies [8]. Each pattern effectively becomes a labeling function where it is compared with other images to produce similarities that indicate whether the images contain defects. A separate development set is used to train a small model that uses the similarity outputs as features. This model is then used to generate weak labels of images in the test set. Figure 3 shows the architecture of the Inspector Gadget system. After training the Labeler, Inspector Gadget only utilizes the components highlighted in gray for generating weak labels. In the following sections, we describe each component in more detail.
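To make the data flow concrete, the following is a minimal Python sketch of this labeling path (patterns to similarity features to weak labels); the helper names and the match_fn argument are hypothetical illustrations, not the released implementation [1].

# A minimal sketch of the Inspector Gadget labeling path.
import numpy as np

def generate_features(image, patterns, match_fn):
    # One similarity score per pattern: each pattern acts as a labeling function.
    return np.array([match_fn(image, pattern) for pattern in patterns])

def generate_weak_labels(unlabeled_images, patterns, match_fn, labeler):
    # The labeler is a small model (an MLP) trained on development set features.
    features = np.stack([generate_features(image, patterns, match_fn)
                         for image in unlabeled_images])
    return labeler.predict(features)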
Figure 2: An overview of how Inspector Gadget constructs a model (labeler) that generates weak labels.
Figure 3: The architecture of Inspector Gadget.
3. CROWDSOURCING WORKFLOW
Since collecting enough labels to train a CNN on the entire images can be expensive, we would like to utilize human knowledge as much as possible and reduce the additional amount of training data needed. We propose the crowdsourcing workflow shown in Figure 4. First, the workers mark defects using bounding boxes through a UI. Since the workers are not necessarily experts, the UI educates them how to identify defects beforehand. The bounding boxes in turn become the patterns we use to find other defects. Figure 5 shows sample images of a real-world smart factory dataset (called Product; see Section 6.1 for a description) where defects are highlighted with bounding boxes. Notice that the defects are not easy to find as they tend to be small and mixed with other larger parts of the product. Inspector Gadget can be extended to ask for more fine-grained segmentation of the defects instead of bounding boxes.

As with any crowdsourcing application, we may run into quality control issues where the bounding boxes of the workers vary even for the same defect. Inspector Gadget addresses this problem by first combining overlapping bounding boxes together (see the code sketch below). While there are several ways to combine boxes, we find that taking the average of their coordinates works reasonably well. For the remaining outlier bounding boxes, Inspector Gadget goes through a peer review phase where workers discuss which ones really contain defects. In Section 6.3, we perform ablation tests to show how each of these steps helps improve the quality of patterns.

Another challenge is determining how many images must be annotated to generate enough patterns. In general, we may not have statistics on the portion of images that have defects. Hence, our solution is to randomly select images and annotate them until the number of defective images exceeds a given threshold. In our experiments, identifying tens of defective images is sufficient (see the N_DV values in Table 1). All the annotated images form a development set, which we use in later steps for model parameter tuning.
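To illustrate the box-combining step, here is a minimal sketch; boxes are (x1, y1, x2, y2) tuples, and the grouping criterion (any overlap) is a simplifying assumption rather than the exact rule used in the system.

# A sketch of combining overlapping worker bounding boxes by averaging
# their coordinates; singleton groups (outliers) would go to peer review.

def overlaps(a, b):
    # True if two axis-aligned boxes intersect.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def combine_boxes(boxes):
    groups = []
    for box in boxes:
        for group in groups:
            if any(overlaps(box, member) for member in group):
                group.append(box)
                break
        else:
            groups.append([box])
    # Average the coordinates of each group of overlapping boxes.
    return [tuple(sum(coord) / len(group) for coord in zip(*group))
            for group in groups]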
Figure 4: The crowdsourcing workflow of Inspector Gadget. Crowd workers can interact with a UI and provide bounding boxes that identify defects. The boxes are used as patterns, which are combined or go through a peer review phase.

Figure 5: Sample images in the Product dataset (see Section 6.1) containing scratch, bubble, and stamping defects where we highlight each defect with a bounding box.
4. PATTERN AUGMENTER
Pattern augmentation is a way to compensate for the possible lack of patterns even after using crowdsourcing. The patterns can be lacking if not enough human work is done to identify all possible patterns and especially if there are not enough images containing defects so one has to go through many images just to encounter a negative label. We would thus like to automatically generate more patterns without resorting to more expensive crowdsourcing.

We consider two types of augmentation – GAN-based [12] and policy-based [7]. The two methods complement each other and have been used together in medical applications for identifying lesions [11]. GAN-based augmentation is good at generating random variations of existing defects that do not deviate significantly. On the other hand, policy-based augmentation is better for generating specific variations of defects that can be quite different, exploiting domain knowledge. In Section 6.4, we show that neither augmentation subsumes the other, and the two can be used together to produce the best results.

The augmentation can be done efficiently because we are augmenting small patterns instead of the entire images. For high-resolution images, it is sometimes infeasible to train a GAN at all. In addition, if most of the image is not of interest for analysis, it is difficult to generate fake parts of interest while leaving the rest of the image as is. By only focusing on augmenting small patterns, it becomes practical to apply sophisticated augmentation techniques.
The first method is to use generative adversarial networks (GANs) to generate variations of patterns that are similar to the existing ones. Intuitively, these augmented patterns can fill in the gaps of the existing patterns. The original GAN [12] trains a generator to produce realistic fake data where the goal is to deceive a discriminator that tries to distinguish the real and fake data. More recently, many variations of the original GAN have been proposed (see a recent survey [37]).
Figure 6: GAN-based augmentation on a pattern containing a scratch defect from the Product dataset.

Since we need the generated patterns to be as realistic as possible, we use a Relativistic GAN (RGAN) [17], which is designed for this purpose. The formulation of RGAN is:

\[
\max_D \; \mathbb{E}_{(x_r, G(z)) \sim (\mathbb{P}, \mathbb{Q})} \left[ \log\left( \sigma\left( D(x_r) - D(G(z)) \right) \right) \right]
\]
\[
\max_G \; \mathbb{E}_{(x_r, G(z)) \sim (\mathbb{P}, \mathbb{Q})} \left[ \log\left( \sigma\left( D(G(z)) - D(x_r) \right) \right) \right]
\]

where G is the generator, x_r is real data, D(x) is the probability that x is real, z is a random noise vector that is used by G to generate fake data, and σ is the sigmoid function. While training, the discriminator of RGAN not only distinguishes data, but also tries to maximize the difference between the probability that a real image is real and a fake image is fake. This setup enforces fake images to be as realistic as possible, in addition to simply being distinguishable from real images as in original GANs. We also use Spectral Normalization [21], a commonly-used technique applied to the discriminator that restricts the gradient to adjust the training speed for better training stability. Finally, since we are dealing with images, we use CNN models for the generator and discriminator.

Another preprocessing issue is that most GANs assume that the input images are of the same size and have a square shape. While this assumption is reasonable for a homogeneous set of images, the patterns identified may have different shapes. We thus fit patterns to a fixed-sized square shape by stretching or shrinking them and then augmenting the patterns. Then we re-adjust the new patterns into one of the sizes of the original patterns selected randomly. Figure 6 shows the entire process applied to an image.

Figure 7: Augmentation using policies (e.g., Rotate [-10, 10], AutoContrast, Equalize, Brightness [0.1, 1.9], Cutout [0, 0.2], ResizeX/Y [0.8, 1.2]). We can evaluate and choose different combinations of operations and magnitudes.
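For reference, the two RGAN objectives above can be written as the following PyTorch losses to minimize (signs flipped from the max formulation); this is a sketch, and the CNN generator and discriminator definitions are omitted.

# A sketch of the RGAN objectives [17] as losses to minimize.
import torch.nn.functional as F

def rgan_d_loss(D, real_patterns, fake_patterns):
    # Push D(real) above D(fake): -log(sigmoid(D(x_r) - D(G(z)))).
    return -F.logsigmoid(D(real_patterns) - D(fake_patterns)).mean()

def rgan_g_loss(D, real_patterns, fake_patterns):
    # Push D(fake) above D(real): -log(sigmoid(D(G(z)) - D(x_r))).
    return -F.logsigmoid(D(fake_patterns) - D(real_patterns)).mean()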
Figure 8: Policy-based augmentation on a pattern with a crack defect from the KSDD dataset (see Section 6.1). For each augmentation, we show the operation and magnitude (e.g., Brightness 1.632, Invert 0.246, ResizeX 0.872, Rotate 7.000).
Policies [7] have been proposed as another way to augment images and complement the GAN approach. The idea is to use manual policies to decide how exactly an image is varied. We consider 17 types of policies where some are shown in Figure 7. Figure 8 shows the results of applying four of these policies on a surface defect from the KSDD dataset (see description in Section 6.1).

Policy-based augmentation is effective for patterns where applying operations based on human knowledge may result in quite different, but valid patterns. For example, if a defect is line-shaped, then it makes sense to stretch or rotate it. There are two parameters to configure: the operation to use and the magnitude of applying that operation. Recently, policy-based techniques have become more automated, e.g., AutoAugment [7] uses reinforcement learning to decide to what extent policies can be applied together.

We use an approach that is simpler than AutoAugment. Among certain combinations of policies, we choose the ones that work best on the development set. We first split the development set into train and test sets. For each policy, we specify a range for the magnitudes and choose 10 random values within that range. We then iterate over all combinations of three policies. For each combination, we augment the patterns in the train set using the 10 magnitudes and train a model (see details in Section 5) on the train set images until convergence. Then we evaluate the model on the separate test set. Finally, we use the policy combination that results in the best accuracy and apply it to the entire set of patterns.
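The search just described can be sketched as follows; apply_policy, train_model, and evaluate_f1 are hypothetical helpers standing in for the components of Sections 4 and 5.

# A sketch of the simplified policy search: for every combination of three
# policies, sample 10 magnitudes per policy, augment the train-split patterns,
# train a model, and keep the combination with the best test-split F1.
import itertools
import random

def search_policies(policies, magnitude_ranges, patterns, dev_train, dev_test):
    best_combo, best_f1 = None, -1.0
    for combo in itertools.combinations(policies, 3):
        augmented = list(patterns)
        for policy in combo:
            low, high = magnitude_ranges[policy]
            for _ in range(10):
                magnitude = random.uniform(low, high)
                augmented += [apply_policy(policy, magnitude, p) for p in patterns]
        model = train_model(dev_train, augmented)  # train until convergence
        f1 = evaluate_f1(model, dev_test)
        if f1 > best_f1:
            best_combo, best_f1 = combo, f1
    return best_combo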
5. WEAK LABEL GENERATION
Once Inspector Gadget gathers patterns and augments them, the next step is to generate features of images (also called primitives by Snuba [35]) and train a model that can produce weak labels. Note that the way we generate features of images is more direct than existing data programming systems that first convert images to structured data based on object detection (e.g., identify a vehicle) before applying labeling functions.
Inspector Gadget provides feature generation functions that match the patterns with new images to identify similar defects at any location and return a real number for the similarity. Each output value of a feature generation function is used as an input of the labeler. Depending on the type of defect, the pattern matching may differ. A naïve approach is to do an exact pixel-by-pixel comparison, but this is unrealistic when the defects have variations. Instead, a better way is to compare the distributions of pixel values. This comparison is more robust to slight variations of a pattern. On the other hand, there may be false positives where an obviously different defect matches just because its pixels have similar distributions. In our experiments, we found that comparing distributions on the x and y axes using normalized cross correlation [2] is effective in reducing such false positives. Given an image I with pixel dimensions W × H and a pattern P_i with dimensions w × h, the i-th feature generation function FG_i is defined as:

\[
FG_i(I) = \max_{x,y} \frac{\sum_{x',y'} P_i(x', y') \cdot I(x + x', y + y')}{\sqrt{\sum_{x',y'} P_i(x', y')^2 \cdot \sum_{x',y'} I(x + x', y + y')^2}}
\]

where 0 ≤ x < W − w, 0 ≤ y < H − h, 0 ≤ x' < w, and 0 ≤ y' < h. When matching a pattern against an image, a straightforward approach is to make a comparison with every possible region of the same size in the image. However, scanning the entire image may be too time consuming. Instead, we use a pyramid method [3] where we first search for candidate parts of an image by reducing the resolutions of the image and pattern and performing a comparison quickly. Then just for the candidates, we perform a comparison using the full resolution.

After the features are generated, Inspector Gadget trains a model on the output similarities of the feature generation functions, where the goal is to produce weak labels. The model can have any architecture and be small because there are not as many features as, say, the number of pixels in an image. We use a multilayer perceptron (MLP) because it is simple, but also has good performance compared to other models. An interesting observation is that, depending on the model architecture (e.g., the number of layers in an MLP), the model accuracy can vary significantly as we demonstrate in Section 6.5. Inspector Gadget thus performs model tuning where it chooses the model architecture that has the best accuracy results on the development set. This feature is powerful compared to existing CNN approaches where the architecture is complicated, and it is too expensive to consider other variations.

The final image labeling process consists of two steps. First, the patterns are matched to images for generating the features. Second, the trained MLP is applied on the features to make a prediction. We note that latency is not the most critical issue because we are generating weak labels, which are used to construct the training data for training the end discriminative model. Training data construction is usually done in batch mode instead of, say, real time.
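A sketch of one feature generation function is shown below, built on OpenCV's matchTemplate with the TM_CCORR_NORMED mode (the normalized cross correlation cited above [2]); the two-pass coarse-to-fine search is a simplified stand-in for the pyramid method [3], and the scale and margin values are illustrative.

# A sketch of FG_i(I): slide pattern P_i over image I with normalized
# cross correlation and return the best similarity.
import cv2

def feature_gen(image, pattern, scale=0.25, margin=20):
    # Coarse pass on downsized copies to locate a candidate region quickly.
    small_img = cv2.resize(image, None, fx=scale, fy=scale)
    small_pat = cv2.resize(pattern, None, fx=scale, fy=scale)
    result = cv2.matchTemplate(small_img, small_pat, cv2.TM_CCORR_NORMED)
    _, _, _, (cx, cy) = cv2.minMaxLoc(result)
    x, y = int(cx / scale), int(cy / scale)

    # Fine pass at full resolution, restricted to a window around the candidate.
    h, w = pattern.shape[:2]
    y0, y1 = max(0, y - margin), min(image.shape[0], y + h + margin)
    x0, x1 = max(0, x - margin), min(image.shape[1], x + w + margin)
    result = cv2.matchTemplate(image[y0:y1, x0:x1], pattern, cv2.TM_CCORR_NORMED)
    return float(result.max())

Stacking the outputs of feature_gen over all patterns yields the feature vector that the MLP labeler consumes.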
Dataset              Image size   N (N_D)          N_V (N_DV)       Defect Type                                                              Task Type
KSDD [32]            500 x 1257   399 (52)         78 (10)          Crack                                                                    Binary
Product (scratch)    162 x 2702   1673 (727)       170 (76)         Scratch                                                                  Binary
Product (bubble)     77 x 1389    1048 (102)       104 (10)         Bubble                                                                   Binary
Product (stamping)   161 x 5278   1094 (148)       109 (15)         Stamping                                                                 Binary
NEU [16]             200 x 200    300 per defect   100 per defect   Rolled-in scale, Patches, Crazing, Pitted surface, Inclusion, Scratch    Multi-class

Table 1: For each of the five datasets, we show the image size, the dataset size (N) and number of defective images (N_D), the development set size (N_V) and number of defective images within it (N_DV), the types of defects detected, and the task type.
6. EXPERIMENTS
We evaluate Inspector Gadget on real datasets and answer the following questions:
• How accurate are the weak labels of Inspector Gadget compared to other state-of-the-art methods, and are they useful when training the end discriminative model?
• How useful is each component of Inspector Gadget?
• What are the errors made by Inspector Gadget?

We implement Inspector Gadget in Python using three machine learning libraries: PyTorch, TensorFlow, and Scikit-learn. We use an Intel Xeon CPU to train our MLP models and an NVIDIA Titan RTX GPU to train larger CNN models. Other details can be found in our released code [1].
Datasets.
We use real datasets for classification tasks. For each dataset, we construct a development set as described in Section 3. Table 1 summarizes the datasets with other experimental details, and Figure 9 shows samples of them. We note that the results in Section 6.2 are obtained by varying the size of the development set, and the rest of the experiments utilize the same development set size as described in Table 1. The NEU dataset has the smallest images, and the Product dataset has the largest. For each dataset, we have a gold standard of labels. Hence, we are able to compute the accuracy of the labeling on separate test data.

• The Kolektor Surface-Defect Dataset (KSDD [32]) is constructed from images of defected electrical commutators that were provided and annotated by the Kolektor Group. There is only one type of defect – cracks – but each one varies significantly in shape.
• The Product dataset (Figure 5) is proprietary and obtained through a collaboration with a manufacturing company that has a smart factory application. Each product has a circular shape where different strips are spread into rectangular shapes. There are three types of defects: scratches, bubbles, and stampings, which occur in different strips. The scratches vary in length and direction. The bubbles are more uniform, but have small sizes. The stampings are small and appear in fixed positions. We divide the dataset into three, as if there is a separate dataset for each defect type.
• The Northeastern University Surface Defect Database (NEU [16]) contains images that are divided into 6 defect types of typical surface defects of hot-rolled steel strips: rolled-in scale, patches, crazing, pitted surface, inclusion, and scratches. Compared to the other datasets, these defects take larger portions of the images. Since there are no images without defects, we solve the different task of multi-class classification where the goal is to determine which defect is present.

Figure 9: Sample images in the KSDD [32] (a) and NEU [16] (b) datasets where we highlight the defects with bounding boxes. See Figure 5 for the Product dataset images.
Pattern Matching.
We use an OpenCV library function [2] that compares pixel distributions on the x and y axes using normalized cross correlation (explained in Section 5.1).
GAN-based Augmentation.
We provide more details for Section 4.1. For all datasets, the input random noise vector has a size of 100, the learning rates of the generator and discriminator are both 1e−4, and the number of epochs is about 1K. We fit patterns to a square shape where the width and height are set to 100 or the averaged value of all widths and heights of patterns, whichever is smaller.
MLP Model Tuning.
We use an L-BFGS optimizer [18], which provides stable training on small data. We use k-fold cross validation where each fold has at least 20 examples per class, with early stopping, in order to compare the accuracies of MLPs before they overfit.
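As an illustration, the tuning loop can be sketched with scikit-learn's MLPClassifier and its L-BFGS solver; the candidate architectures and the cv=5 setting here are illustrative, not the exact search space of the paper.

# A sketch of MLP tuning: compare candidate architectures by cross-validated
# F1 on the development set and keep the best one.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def tune_mlp(X_dev, y_dev, candidates=((32,), (64,), (64, 32), (64, 64, 32))):
    best_model, best_f1 = None, -1.0
    for hidden in candidates:
        clf = MLPClassifier(hidden_layer_sizes=hidden, solver="lbfgs",
                            max_iter=1000)
        f1 = cross_val_score(clf, X_dev, y_dev, cv=5, scoring="f1").mean()
        if f1 > best_f1:
            best_model, best_f1 = clf, f1
    return best_model.fit(X_dev, y_dev)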
Accuracy Measure.
We use the F1 score, which is the harmonic mean between precision and recall. Suppose that the set of true defects is D while the set of predictions is P. Then the precision Pr = |D ∩ P| / |P|, the recall Re = |D ∩ P| / |D|, and F1 = (2 × Pr × Re) / (Pr + Re). While there are other possible measures like AUC, F1 is known to be more suitable for data where the labels are imbalanced [13], as in most of our settings.

Systems Compared.
We compare Inspector Gadget with state-of-the-art image labeling systems and self-learning baselines that train a CNN model on available labeled data.

Snuba [35] automates the process of labeling function construction by starting from a set of primitives that are analogous to our feature generation functions and iteratively selecting subsets of them to train decision tree models, which become the labeling functions. Each iteration involves comparing models trained on all possible subsets of the primitives up to a certain size. Finally, the labeling functions are combined into a generative model. We faithfully implement Snuba and use our crowdsourced and augmented patterns for generating primitives, in order to be favorable to Snuba. However, adding more patterns quickly slows down Snuba as its runtime is exponential in the number of patterns.

We also compare with GOGGLES [9], which takes the opposite approach of not using crowdsourcing. However, it relies on the fact that there is a pre-trained model and extracts semantic prototypes of images where each prototype represents the part of an image where the pre-trained model is activated the most. Each image is assumed to have one object, and GOGGLES clusters similar images for unsupervised learning. In our experiments, we use the open-sourced code of GOGGLES and a pre-trained VGG-16 model [29].

Finally, we compare Inspector Gadget with self-learning [33] baselines that train CNN models on the development set using cross validation and use them to label the rest of the images. To make a fair comparison, we experiment with both heavy and light-weight CNN models. For the heavy model, we use VGG-19 [29], which is widely used in the literature. For the light-weight model, we use MobileNetV2 [27], which is designed to train efficiently in a mobile setting, but nearly has the performance of heavy CNNs. We also make a comparison with VGG-19 whose weights are pre-trained on ImageNet [10]. (A pre-trained MobileNetV2 does not perform as well and is thus not compared.) Transfer learning obviously gives a significant boost in performance, and we do not claim that Inspector Gadget, which is not pre-trained, is better than this version of VGG-19. In addition, we use preprocessing techniques on images that are favorable to the baselines. For example, the images from the Product dataset are long rectangles, so we split each image in half and stack the halves on top of each other to make the image more square-like, which is advantageous for CNNs.
We compare the weak label accuracy of Inspector Gadget with the other methods. Figure 10 compares Inspector Gadget with Snuba, GOGGLES, and the self-learning baselines by increasing the amount of training data and observing the F1 scores. To clearly show how Inspector Gadget compares with other methods, we use a solid line to draw its plot while using dotted lines for the rest. Among the models that are not pre-trained (i.e., ignore "SL (VGG19 + Pre-training)" for now), we observe that Inspector Gadget performs best overall because it is either the best or second-best method in all figures. This result is important because industrial images have various defect types that must all be identified correctly. For KSDD (Figure 10d), Inspector Gadget performs the best because the pattern augmentation helps Inspector Gadget find more variations of cracks (see Section 6.4). For Product (Figures 10a–10c), Inspector Gadget consistently performs the first or second best despite the different characteristics of the defects. For NEU (Figure 10e), Inspector Gadget ranks first for the multi-class classification.

We explain the performances of the other methods. Snuba consistently has a lower F1 than Inspector Gadget, possibly because the number of patterns is too large to handle. Instead of considering all combinations of patterns and training decision trees, Inspector Gadget's approach of training and auto-tuning an MLP works better in our experiments. GOGGLES does not need training data and thus has a constant accuracy. In Figure 10a, GOGGLES has a high F1 because the defect sizes are large, and the pre-trained VGG-16 is effective in identifying them as objects. For the other figures, however, GOGGLES does not perform as well because the defect sizes are small and difficult to identify as objects. VGG-19 without pre-training ("SL (VGG19)") only performs the best in Figure 10c where CNN models are very good at detecting stamping defects because they appear in a fixed location on the images. For the other figures, VGG-19 performs poorly because there is not enough labeled data. MobileNetV2 does not perform well in any of the figures. Finally, the pre-trained VGG-19 ("SL (VGG19 + Pre-training)") does outperform Inspector Gadget in Figures 10a, 10d, and 10e. While we do not claim that Inspector Gadget outperforms pre-trained models, it is the best option when pre-training is not possible.

We evaluate how effectively we can use the crowd to label and identify patterns using the Product datasets. Table 2 compares the full crowdsourcing workflow in Inspector Gadget with two variants: (1) a workflow that does not average the patterns at all and (2) a workflow that does average the patterns, but still does not perform peer reviews. For each scenario, we compare the F1 score of the MLP trained on the similarity features generated by matching the patterns on the development set, without using pattern augmentation. As a result, the full workflow clearly performs the best for the Product (scratch) and Product (stamping) datasets. For the Product (bubble) dataset, the workflow that does not combine patterns has a better average F1, but the accuracies vary among different workers. Instead, it is better to use the stable full workflow without the variance.

We evaluate how augmented patterns help improve the weak label F1 score of Inspector Gadget. Table 3 shows the impact of the GAN-based and policy-based augmentation on the five datasets.
Figure 10: Weak label accuracy comparison between Inspector Gadget, Snuba [35], GOGGLES [9], and the self-learning baselines (SL) using VGG-19 [29] (with or without pre-trained weights) and MobileNetV2 [27] on different sizes of training data: (a) Product (scratch); (b) Product (bubble); (c) Product (stamping); (d) KSDD [32]; (e) NEU [16]. Among the models that are not pre-trained, Inspector Gadget performs either the best or second-best in all figures.

Table 2: F1 scores (± std) of the no-averaging, no-peer-review, and full crowdsourcing workflows on the Product datasets. Each F1 score is the performance of the MLP trained on each workflow.
Dataset              No Aug.   Policy-Based   GAN-Based   Using Both
KSDD [32]            0.415     0.578          0.509
Product (scratch)
Product (bubble)
Product (stamping)
NEU [16]             0.936

Table 3: Pattern augmentation impact on Inspector Gadget. For each dataset, we highlight the highest F1 score.

For each augmentation, we add 100–500 patterns, which empirically results in the best F1 improvements (see below). When using both augmentations, we simply combine the patterns from each augmentation. As a result, while each augmentation helps improve F1, using both of them usually gives the best results.

Figure 11 shows how adding patterns impacts the F1 score for the Product (bubble) dataset. While adding more patterns helps to a certain extent, it has diminishing returns afterwards. The results for the other datasets are similar, although sometimes noisier. The best number of augmented patterns differs per dataset, but falls in the range of 100–500.
Figure 11: Policy-based and GAN-based augmentation results on the Product (stamping) dataset.
We evaluate the impact of model tuning on accuracy described in Section 5.2, as shown in Figure 12. We use an MLP with 1 to 3 hidden layers and vary the number of nodes per hidden layer to be one of {2^n | n = 1, ..., m and 2^{m−1} ≤ I ≤ 2^m} where I is the number of input nodes. For each dataset, we first obtain the maximum and minimum possible F1 scores by evaluating all the tuned models we considered directly on the test data. Then, we compare these results with the (test data) F1 score of the actual model that Inspector Gadget selected after comparing the models using the development set. We observe that the model tuning in Inspector Gadget can indeed improve the model accuracy, close to the maximum possible value.
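Under that search space as reconstructed here, the candidate architectures could be enumerated as follows; this is our illustrative reading of the rule, not code from the paper.

# A sketch of enumerating candidate hidden-layer widths: powers of two up to
# the smallest 2^m with I <= 2^m, for MLPs with 1 to 3 hidden layers.
import itertools

def candidate_architectures(num_inputs):
    m = max(1, (num_inputs - 1).bit_length())  # smallest m with num_inputs <= 2^m
    widths = [2 ** n for n in range(1, m + 1)]
    archs = []
    for depth in (1, 2, 3):
        archs += list(itertools.product(widths, repeat=depth))
    return archs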
Figure 12: The variations in F1 scores when tuning the MLP model hyper-parameters (maximum, minimum, and our tuning) on KSDD, Product (scratch, bubble, stamping), and NEU.

We now address the issue of whether the weak labels are actually helpful for training the end discriminative model. We compare the F1 score of this end model with the same model that is trained on the development set. For the discriminative model, we use VGG-19 [29] for the binary classification tasks on KSDD and Product, and ResNet50 [14] for the multi-class task on NEU. We can use other discriminative models that have higher absolute F1, but the point is to show the relative F1 improvements when using weak labels. Table 4 shows that the F1 scores improve by 0.02–0.36.

Dataset              End Model   Dev. Set   WL (IG)
KSDD [32]            VGG19       0.499      0.700
Product (scratch)    VGG19       0.925      0.978
Product (bubble)     VGG19       0.359      0.720
Product (stamping)   VGG19       0.782      0.876
NEU [16]             ResNet50    0.953      0.970

Table 4: F1 scores of end models trained on the development set (Dev. Set) only or the development set combined with the weak labels produced by Inspector Gadget (WL (IG)).

We perform an error analysis on the cases where Inspector Gadget fails to make correct predictions for the five datasets, based on manual investigation. We use the ground truth information for the analysis. Table 5 shows that the most common error is when certain defects do not match with the patterns, which can be improved by using better pattern augmentation and matching techniques. The next most common case is when the data is noisy, which can be improved by cleaning the data. The last case is the most challenging, where even humans have difficulty identifying the defects because they are not obvious (e.g., a near-invisible scratch).
7. RELATED WORK
Crowdsourcing for machine learning.
Using humans for advanced analytics is increasingly becoming mainstream [38]. While there is a push to make all steps of machine learning automatic, there are cases where humans are essential, and the key issue is using them with the minimum effort. There is also an extensive literature in HCI on how to integrate crowdsourcing in machine learning [6]. These methods focus on how to provide the right interfaces to help users construct features. Inspector Gadget builds on top of these approaches by relying on crowdsourcing for identifying patterns.
Data Programming.
Data programming [25] is a recent paradigm where workers program labeling functions, which are used to generate weak labels at scale. Snorkel [24, 4] is a seminal system that demonstrates the practicality of data programming, and Snuba [35] extends it by automatically constructing labeling functions using primitives. In comparison, Inspector Gadget does not assume any accuracy guarantees on the feature generation function and directly labels images without converting them to structured data. Several systems have studied the problem of automating labeling function construction. CrowdGame [19] proposes a method for constructing labeling functions for entity resolution on structured data. Adversarial data programming [23] proposes a GAN-based framework for labeling with labeling function results and claims to be better than Snorkel-based approaches. In comparison, Inspector Gadget solves the different problem of partially analyzing large images.

Dataset              Matching failure   Noisy data    Difficult to humans
KSDD [32]            10 (52.6 %)        5 (26.3 %)    4 (21.1 %)
Product (scratch)    11 (36.7 %)        11 (36.7 %)   8 (26.6 %)
Product (bubble)     19 (45.2 %)        15 (35.7 %)   8 (19.1 %)
Product (stamping)   15 (45.5 %)        13 (39.4 %)   5 (15.1 %)
NEU [16]             35 (63.6 %)        4 (7.3 %)     16 (29.1 %)

Table 5: Error analysis of Inspector Gadget.
Automatic Image Labeling.
There is a variety of general automatic image labeling techniques. Data augmentation [28] is a general method to generate new labeled images. Generative adversarial networks (GANs) [12] have been proposed to generate fake, but realistic images based on existing images. Policies [7] were proposed to apply custom transformations on images as long as they remain realistic. Most of the existing work operates on the entire images. In comparison, Inspector Gadget is efficient because it only needs to augment patterns, which are much smaller than the images. Label propagation techniques [5] organize images into a graph based on their similarities and then propagate existing labels of images to their most similar ones. In comparison, Inspector Gadget is designed for images where only a small part of them is of interest while the main part may be nearly identical to other images, so we cannot utilize similarities. There are also application-specific defect detection methods [30, 16, 15], some of which are designed for the datasets we used. In comparison, Inspector Gadget provides a general framework for image labeling. Recently, GOGGLES [9] is an image labeling system that relies on a pre-trained model to extract semantic prototypes of images and construct an affinity matrix that can be used to identify similar images. In comparison, Inspector Gadget does not rely on pre-trained models and is more suitable for partially analyzing large images using human knowledge.
8. CONCLUSION
We proposed Inspector Gadget, a scalable image labeling system for classification problems that effectively combines crowdsourcing, data augmentation, and data programming techniques. Inspector Gadget targets applications in manufacturing where large industrial images are partially analyzed, and there are few or no labels to start with. Unlike existing data programming approaches that convert images to structured data beforehand, Inspector Gadget directly labels images by providing a crowdsourcing workflow to leverage human knowledge for identifying patterns of interest. The patterns are then augmented and matched with other images to generate similarity features for MLP model training. Our experiments show that Inspector Gadget outperforms the state-of-the-art methods Snuba, GOGGLES, and self-learning baselines using CNNs without pre-training. We thus believe that Inspector Gadget opens up a new class of problems to apply data programming.
9. REFERENCES
[1] Inspector Gadget GitHub repository. https://github.com/geonheo/InspectorGadget. Accessed April 1st, 2020.
[2] OpenCV. https://docs.opencv.org/2.4/modules/imgproc/doc/object_detection.html. Accessed April 1st, 2020.
[3] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA Engineer, 29(6):33–41, 1984.
[4] S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, R. Kuchhal, C. Ré, and R. Malkin. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In SIGMOD, pages 362–375, 2019.
[5] T. D. Bui, S. Ravi, and V. Ramavajjala. Neural graph learning: Training neural networks using graphs. In WSDM, pages 64–71, 2018.
[6] J. Cheng and M. S. Bernstein. Flock: Hybrid crowd-machine learning classifiers. In CSCW, pages 600–611, 2015.
[7] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
[8] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation strategies from data. In CVPR, pages 113–123, 2019.
[9] N. Das, S. Chaba, S. Gandhi, D. H. Chau, and X. Chu. GOGGLES: Automatic training data generation with affinity coding. In SIGMOD, 2020.
[10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[11] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[13] Q. Gu, L. Zhu, and Z. Cai. Evaluation measures of the classification performance of imbalanced data sets. In Z. Cai, Z. Li, Z. Kang, and Y. Liu, editors, CIIS, pages 461–471, Berlin, Heidelberg, 2009.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
[15] Y. He, K. Song, H. Dong, and Y. Yan. Semi-supervised defect classification of steel surface based on multi-training and generative adversarial network. Optics and Lasers in Engineering, 122:294–302, 2019.
[16] Y. He, K.-C. Song, Q. Meng, and Y. Yan. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493–1504, 2020.
[17] A. Jolicoeur-Martineau. The relativistic discriminator: A key element missing from standard GAN. In ICLR, 2019.
[18] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989.
[19] T. Liu, J. Yang, J. Fan, Z. Wei, G. Li, and X. Du. CrowdGame: A game-based crowdsourcing system for cost-effective data labeling. In SIGMOD, pages 1957–1960, 2019.
[20] Y. Liu, T. Kohlberger, M. Norouzi, G. Dahl, J. Smith, A. Mohtashamian, N. Olson, L. Peng, J. Hipp, and M. Stumpe. Artificial intelligence based breast cancer nodal metastasis detection: Insights into the black box for pathologists. Archives of Pathology & Laboratory Medicine, 2018.
[21] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. CoRR, abs/1802.05957, 2018.
[22] E. Oztemel and S. Gursev. Literature review of industry 4.0 and related technologies. Journal of Intelligent Manufacturing, 2018.
[23] A. Pal and V. N. Balasubramanian. Adversarial data programming: Using GANs to relax the bottleneck of curated labeled data. In CVPR, pages 1556–1565, 2018.
[24] A. J. Ratner, S. H. Bach, H. R. Ehrenberg, and C. Ré. Snorkel: Fast training set generation for information extraction. In SIGMOD, pages 1683–1686, 2017.
[25] A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In NIPS, pages 3567–3575, 2016.
[26] Y. Roh, G. Heo, and S. E. Whang. A survey on data collection for machine learning: A big data - AI integration perspective. IEEE TKDE, 2019.
[27] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
[28] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. J. Big Data, 6:60, 2019.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] K.-C. Song and Y. Yan. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858–864, 2013.
[31] M. Stonebraker and E. K. Rezig. Machine learning and big data: What is important? IEEE Data Eng. Bull., 2019.
[32] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj. Segmentation-based deep-learning approach for surface-defect detection. Journal of Intelligent Manufacturing, May 2019.
[33] I. Triguero, S. García, and F. Herrera. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl. Inf. Syst., 42(2):245–284, 2015.
[34] P. Varma, B. D. He, P. Bajaj, N. Khandwala, I. Banerjee, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. In NeurIPS, pages 240–250, 2017.
[35] P. Varma and C. Ré. Snuba: Automating weak supervision to label training data. PVLDB, 12(3):223–236, 2018.
[36] P. Varma, F. Sala, A. He, A. Ratner, and C. Ré. Learning dependency structures for weak supervision models. In K. Chaudhuri and R. Salakhutdinov, editors, ICML, volume 97, pages 6418–6427, 2019.
[37] Z. Wang, Q. She, and T. E. Ward. Generative adversarial networks: A survey and taxonomy. CoRR, abs/1906.01529, 2019.
[38] D. Xin, L. Ma, J. Liu, S. Macke, S. Song, and A. G. Parameswaran. Accelerating human-in-the-loop machine learning: Challenges and opportunities. In DEEM, 2018.