Inspector Gadget: A Data Programming-based Labeling System for Industrial Images
Geon Heo, Yuji Roh, Seonghyeon Hwang, Dayun Lee, Steven Euijong Whang
KAIST
{geon.heo, yuji.roh, sh.hwang, dayun.lee, swhang}@kaist.ac.kr

ABSTRACT
As machine learning for images becomes democratized in the Software 2.0 era, one of the serious bottlenecks is securing enough labeled data for training. This problem is especially critical in a manufacturing setting where smart factories rely on machine learning for product quality control by analyzing industrial images. Such images are typically large and may only need to be partially analyzed where only a small portion is problematic (e.g., identifying defects on a surface). Since manually labeling these images is expensive, weak supervision is an attractive alternative where the idea is to generate weak labels that are not perfect, but can be produced at scale. Data programming is a recent paradigm in this category where it uses human knowledge in the form of labeling functions and combines them into a generative model. Data programming has been successful in applications based on text or structured data and can also be applied to images, usually if one can find a way to convert them into structured data. In this work, we expand the horizon of data programming by directly applying it to images without this conversion, which is a common scenario for industrial applications. We propose Inspector Gadget, an image labeling system that combines crowdsourcing, data augmentation, and data programming to produce weak labels at scale for image classification. We perform experiments on real industrial image datasets and show that Inspector Gadget obtains better accuracy than state-of-the-art techniques: Snuba, GOGGLES, and self-learning baselines using convolutional neural networks (CNNs) without pre-training.
1. INTRODUCTION
In the era of Software 2.0, machine learning techniques for images are becoming democratized where the applications range from manufacturing to medical. For example, smart factories regularly use computer vision techniques to classify defective and non-defective product images [22]. In medical applications, MRI scans are analyzed to identify diseases like cancer [20]. However, many companies are still reluctant to adopt machine learning due to the lack of labeled data where manual labeling is simply too expensive [31].

We focus on the problem of scalable image labeling for classification where large images are partially analyzed, and there are few or no labels to start with. Although many companies face this problem, it has not been studied enough.
Figure 1: Labeling industrial images with Inspector Gadget in a smart factory application.

Based on a collaboration with a large manufacturing company, we provide the following running example. Suppose there is a smart factory application where product images are analyzed for quality control (Figure 1). These images are usually taken from industrial cameras and have high resolution. The goal is to look at each image and tell if there are certain defects (e.g., identify scratches, bubbles, and stampings). For convenience, we hereafter use the term defect to mean a part of an image of interest.

A conventional solution is to collect enough labels manually and train, say, a convolutional neural network on the training data. However, fully relying on crowdsourcing for image labeling can be too expensive. In our application, we have heard of domain experts demanding six-figure salaries, which makes it infeasible to simply ask them to label images. In addition, relying on general crowdsourcing platforms like Amazon Mechanical Turk may not guarantee high-enough labeling quality.

Among the possible methods for data labeling (see an extensive survey [26]), weak supervision is an important branch of research where the idea is to semi-automatically collect labels that are not perfect like manual ones (thus called "weak"), but have reasonable quality where the quantity compensates for the quality. Data programming [25] is a representative technique of employing humans to develop labeling functions that individually perform labeling, perhaps not accurately. However, the combination of labeling functions into a generative model results in a reasonable labeling program. These weak labels can then be used to train an end discriminative model.

So far, data programming has been shown to be effective in finding various relationships in text and structured data [24]. Data programming has also been successfully applied to images where they are usually converted to structured data beforehand [34, 36]. However, this conversion limits the applicability of data programming. As an alternative to data programming, the recently proposed GOGGLES [9] demonstrates that, on images, automatic approaches using pre-trained models may be more effective. Here the idea is to extract semantic prototypes of images using the pre-trained model and then cluster and label the images using the prototypes. However, GOGGLES also has limitations (see Section 6.2), and it is not clear if it is the only solution for generating training data for image classification.

We thus propose Inspector Gadget, which opens up a new class of problems for data programming by enabling direct image labeling at scale without the need to convert to structured data, using a combination of crowdsourcing, data augmentation, and data programming techniques. Inspector Gadget provides a crowdsourcing workflow where workers identify patterns that indicate defects. Here we make the tasks easy enough for non-experts to contribute. These patterns are augmented using generative adversarial networks (GANs) [12] and policies [7]. Each pattern effectively becomes a labeling function by being matched with other images. The similarities are then used as features to train a multi-layer perceptron (MLP), which generates weak labels. In our experiments, Inspector Gadget performs better overall than state-of-the-art methods: Snuba, GOGGLES, and self-learning baselines that use CNNs (VGG-19 [29] and MobileNetV2 [27]) without pre-training.
We release our code as a community resource [1].

In the rest of the paper, we present the following:
• The architecture of Inspector Gadget (Section 2).
• The component details of Inspector Gadget:
  • Crowdsourcing workflow for helping workers identify patterns (Section 3).
  • Pattern augmenter for expanding the patterns using GANs and policies (Section 4).
  • Feature generator and labeler for generating similarity features and producing weak labels (Section 5).
• Experimental results where Inspector Gadget outperforms state-of-the-art image labeling techniques – Snuba, GOGGLES, and self-learning baselines using CNNs – where there are few or no labels to start with (Section 6).
2. OVERVIEW
The main technical contribution of Inspector Gadget is its effective combination of crowdsourcing, data augmentation, and data programming for scalable image labeling for classification. Figure 2 shows the overall process of Inspector Gadget. First, a crowdsourcing workflow helps workers identify patterns of interest from images that may indicate defects. While the patterns are informative, they may not be enough and are thus augmented using generative adversarial networks (GANs) [12] and policies [8]. Each pattern effectively becomes a labeling function where it is compared with other images to produce similarities that indicate whether the images contain defects. A separate development set is used to train a small model that uses the similarity outputs as features. This model is then used to generate weak labels of images in the test set. Figure 3 shows the architecture of the Inspector Gadget system. After training the Labeler, Inspector Gadget only utilizes the components highlighted in gray for generating weak labels. In the following sections, we describe each component in more detail.
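To make the data flow concrete, the following is a minimal Python sketch of this labeling path (patterns to similarity features to weak labels); the helper names and the match_fn argument are hypothetical illustrations, not the released implementation [1].

# A minimal sketch of the Inspector Gadget labeling path.
import numpy as np

def generate_features(image, patterns, match_fn):
    # One similarity score per pattern: each pattern acts as a labeling function.
    return np.array([match_fn(image, pattern) for pattern in patterns])

def generate_weak_labels(unlabeled_images, patterns, match_fn, labeler):
    # The labeler is a small model (an MLP) trained on development set features.
    features = np.stack([generate_features(image, patterns, match_fn)
                         for image in unlabeled_images])
    return labeler.predict(features)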
Figure 2: An overview of how Inspector Gadget constructs a model (labeler) that generates weak labels.
Figure 3: The architecture of Inspector Gadget.
3. CROWDSOURCING WORKFLOW
Since collecting enough labels to train a CNN on the entire images can be expensive, we would like to utilize human knowledge as much as possible and reduce the additional amount of training data needed. We propose the crowdsourcing workflow shown in Figure 4. First, the workers mark defects using bounding boxes through a UI. Since the workers are not necessarily experts, the UI educates them how to identify defects beforehand. The bounding boxes in turn become the patterns we use to find other defects. Figure 5 shows sample images of a real-world smart factory dataset (called Product; see Section 6.1 for a description) where defects are highlighted with bounding boxes. Notice that the defects are not easy to find as they tend to be small and mixed with other larger parts of the product. Inspector Gadget can be extended to ask for more fine-grained segmentation of the defects instead of bounding boxes.

As with any crowdsourcing application, we may run into quality control issues where the bounding boxes of the workers vary even for the same defect. Inspector Gadget addresses this problem by first combining overlapping bounding boxes together (see the code sketch below). While there are several ways to combine boxes, we find that taking the average of their coordinates works reasonably well. For the remaining outlier bounding boxes, Inspector Gadget goes through a peer review phase where workers discuss which ones really contain defects. In Section 6.3, we perform ablation tests to show how each of these steps helps improve the quality of patterns.

Another challenge is determining how many images must be annotated to generate enough patterns. In general, we may not have statistics on the portion of images that have defects. Hence, our solution is to randomly select images and annotate them until the number of defective images exceeds a given threshold. In our experiments, identifying tens of defective images is sufficient (see the N_DV values in Table 1). All the annotated images form a development set, which we use in later steps for model parameter tuning.
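To illustrate the box-combining step, here is a minimal sketch; boxes are (x1, y1, x2, y2) tuples, and the grouping criterion (any overlap) is a simplifying assumption rather than the exact rule used in the system.

# A sketch of combining overlapping worker bounding boxes by averaging
# their coordinates; singleton groups (outliers) would go to peer review.

def overlaps(a, b):
    # True if two axis-aligned boxes intersect.
    return not (a[2] <= b[0] or b[2] <= a[0] or a[3] <= b[1] or b[3] <= a[1])

def combine_boxes(boxes):
    groups = []
    for box in boxes:
        for group in groups:
            if any(overlaps(box, member) for member in group):
                group.append(box)
                break
        else:
            groups.append([box])
    # Average the coordinates of each group of overlapping boxes.
    return [tuple(sum(coord) / len(group) for coord in zip(*group))
            for group in groups]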
Figure 4: The crowdsourcing workflow of Inspector Gadget. Crowd workers can interact with a UI and provide bounding boxes that identify defects. The boxes are used as patterns, which are combined or go through a peer review phase.

Figure 5: Sample images in the Product dataset (see Section 6.1) containing scratch, bubble, and stamping defects where we highlight each defect with a bounding box.
4. PATTERN AUGMENTER
Pattern augmentation is a way to compensate for the possible lack of patterns even after using crowdsourcing. The patterns can be lacking if not enough human work is done to identify all possible patterns and especially if there are not enough images containing defects so one has to go through many images just to encounter a negative label. We would thus like to automatically generate more patterns without resorting to more expensive crowdsourcing.

We consider two types of augmentation – GAN-based [12] and policy-based [7]. The two methods complement each other and have been used together in medical applications for identifying lesions [11]. GAN-based augmentation is good at generating random variations of existing defects that do not deviate significantly. On the other hand, policy-based augmentation is better for generating specific variations of defects that can be quite different, exploiting domain knowledge. In Section 6.4, we show that neither augmentation subsumes the other, and the two can be used together to produce the best results.

The augmentation can be done efficiently because we are augmenting small patterns instead of the entire images. For high-resolution images, it is sometimes infeasible to train a GAN at all. In addition, if most of the image is not of interest for analysis, it is difficult to generate fake parts of interest while leaving the rest of the image as is. By only focusing on augmenting small patterns, it becomes practical to apply sophisticated augmentation techniques.
The first method is to use generative adversarial networks (GANs) to generate variations of patterns that are similar to the existing ones. Intuitively, these augmented patterns can fill in the gaps of the existing patterns. The original GAN [12] trains a generator to produce realistic fake data where the goal is to deceive a discriminator that tries to distinguish the real and fake data. More recently, many variations of the original GAN have been proposed (see a recent survey [37]).
Figure 6: GAN-based augmentation on a pattern containing a scratch defect from the Product dataset.

Since we need the generated patterns to be as realistic as possible, we use a Relativistic GAN (RGAN) [17], which is designed for this purpose. The formulation of RGAN is:

\[
\max_D \; \mathbb{E}_{(x_r, G(z)) \sim (\mathbb{P}, \mathbb{Q})} \left[ \log\left( \sigma\left( D(x_r) - D(G(z)) \right) \right) \right]
\]
\[
\max_G \; \mathbb{E}_{(x_r, G(z)) \sim (\mathbb{P}, \mathbb{Q})} \left[ \log\left( \sigma\left( D(G(z)) - D(x_r) \right) \right) \right]
\]

where G is the generator, x_r is real data, D(x) is the probability that x is real, z is a random noise vector that is used by G to generate fake data, and σ is the sigmoid function. While training, the discriminator of RGAN not only distinguishes data, but also tries to maximize the difference between the probability that a real image is real and a fake image is fake. This setup enforces fake images to be as realistic as possible, in addition to simply being distinguishable from real images as in original GANs. We also use Spectral Normalization [21], a commonly-used technique applied to the discriminator that restricts the gradient to adjust the training speed for better training stability. Finally, since we are dealing with images, we use CNN models for the generator and discriminator.

Another preprocessing issue is that most GANs assume that the input images are of the same size and have a square shape. While this assumption is reasonable for a homogeneous set of images, the patterns identified may have different shapes. We thus fit patterns to a fixed-sized square shape by stretching or shrinking them and then augmenting the patterns. Then we re-adjust the new patterns into one of the sizes of the original patterns selected randomly. Figure 6 shows the entire process applied to an image.

Figure 7: Augmentation using policies (e.g., Rotate [-10, 10], AutoContrast, Equalize, Brightness [0.1, 1.9], Cutout [0, 0.2], ResizeX/Y [0.8, 1.2]). We can evaluate and choose different combinations of operations and magnitudes.
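For reference, the two RGAN objectives above can be written as the following PyTorch losses to minimize (signs flipped from the max formulation); this is a sketch, and the CNN generator and discriminator definitions are omitted.

# A sketch of the RGAN objectives [17] as losses to minimize.
import torch.nn.functional as F

def rgan_d_loss(D, real_patterns, fake_patterns):
    # Push D(real) above D(fake): -log(sigmoid(D(x_r) - D(G(z)))).
    return -F.logsigmoid(D(real_patterns) - D(fake_patterns)).mean()

def rgan_g_loss(D, real_patterns, fake_patterns):
    # Push D(fake) above D(real): -log(sigmoid(D(G(z)) - D(x_r))).
    return -F.logsigmoid(D(fake_patterns) - D(real_patterns)).mean()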
Figure 8: Policy-based augmentation on a pattern with a crack defect from the KSDD dataset (see Section 6.1). For each augmentation, we show the operation and magnitude (e.g., Brightness 1.632, Invert 0.246, ResizeX 0.872, Rotate 7.000).
Policies [7] have been proposed as another way to augment images and complement the GAN approach. The idea is to use manual policies to decide how exactly an image is varied. We consider 17 types of policies where some are shown in Figure 7. Figure 8 shows the results of applying four of these policies on a surface defect from the KSDD dataset (see description in Section 6.1).

Policy-based augmentation is effective for patterns where applying operations based on human knowledge may result in quite different, but valid patterns. For example, if a defect is line-shaped, then it makes sense to stretch or rotate it. There are two parameters to configure: the operation to use and the magnitude of applying that operation. Recently, policy-based techniques have become more automated, e.g., AutoAugment [7] uses reinforcement learning to decide to what extent policies can be applied together.

We use an approach that is simpler than AutoAugment. Among certain combinations of policies, we choose the ones that work best on the development set. We first split the development set into train and test sets. For each policy, we specify a range for the magnitudes and choose 10 random values within that range. We then iterate over all combinations of three policies. For each combination, we augment the patterns in the train set using the 10 magnitudes and train a model (see details in Section 5) on the train set images until convergence. Then we evaluate the model on the separate test set. Finally, we use the policy combination that results in the best accuracy and apply it to the entire set of patterns.
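The search just described can be sketched as follows; apply_policy, train_model, and evaluate_f1 are hypothetical helpers standing in for the components of Sections 4 and 5.

# A sketch of the simplified policy search: for every combination of three
# policies, sample 10 magnitudes per policy, augment the train-split patterns,
# train a model, and keep the combination with the best test-split F1.
import itertools
import random

def search_policies(policies, magnitude_ranges, patterns, dev_train, dev_test):
    best_combo, best_f1 = None, -1.0
    for combo in itertools.combinations(policies, 3):
        augmented = list(patterns)
        for policy in combo:
            low, high = magnitude_ranges[policy]
            for _ in range(10):
                magnitude = random.uniform(low, high)
                augmented += [apply_policy(policy, magnitude, p) for p in patterns]
        model = train_model(dev_train, augmented)  # train until convergence
        f1 = evaluate_f1(model, dev_test)
        if f1 > best_f1:
            best_combo, best_f1 = combo, f1
    return best_combo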
5. WEAK LABEL GENERATION
Once Inspector Gadget gathers patterns and augments them, the next step is to generate features of images (also called primitives by Snuba [35]) and train a model that can produce weak labels. Note that the way we generate features of images is more direct than existing data programming systems that first convert images to structured data based on object detection (e.g., identify a vehicle) before applying labeling functions.
Inspector Gadget provides feature generation functions that match the patterns with new images to identify similar defects at any location and return a real number for the similarity. Each output value of a feature generation function is used as an input of the labeler. Depending on the type of defect, the pattern matching may differ. A naïve approach is to do an exact pixel-by-pixel comparison, but this is unrealistic when the defects have variations. Instead, a better way is to compare the distributions of pixel values. This comparison is more robust to slight variations of a pattern. On the other hand, there may be false positives where an obviously different defect matches just because its pixels have similar distributions. In our experiments, we found that comparing distributions on the x and y axes using normalized cross correlation [2] is effective in reducing such false positives. Given an image I with pixel dimensions W × H and a pattern P_i with dimensions w × h, the i-th feature generation function FG_i is defined as:

\[
FG_i(I) = \max_{x,y} \frac{\sum_{x',y'} P_i(x', y') \cdot I(x + x', y + y')}{\sqrt{\sum_{x',y'} P_i(x', y')^2 \cdot \sum_{x',y'} I(x + x', y + y')^2}}
\]

where 0 ≤ x < W − w, 0 ≤ y < H − h, 0 ≤ x' < w, and 0 ≤ y' < h. When matching a pattern against an image, a straightforward approach is to make a comparison with every possible region of the same size in the image. However, scanning the entire image may be too time consuming. Instead, we use a pyramid method [3] where we first search for candidate parts of an image by reducing the resolutions of the image and pattern and performing a comparison quickly. Then just for the candidates, we perform a comparison using the full resolution.

After the features are generated, Inspector Gadget trains a model on the output similarities of the feature generation functions, where the goal is to produce weak labels. The model can have any architecture and be small because there are not as many features as, say, the number of pixels in an image. We use a multilayer perceptron (MLP) because it is simple, but also has good performance compared to other models. An interesting observation is that, depending on the model architecture (e.g., the number of layers in an MLP), the model accuracy can vary significantly as we demonstrate in Section 6.5. Inspector Gadget thus performs model tuning where it chooses the model architecture that has the best accuracy results on the development set. This feature is powerful compared to existing CNN approaches where the architecture is complicated, and it is too expensive to consider other variations.

The final image labeling process consists of two steps. First, the patterns are matched to images for generating the features. Second, the trained MLP is applied on the features to make a prediction. We note that latency is not the most critical issue because we are generating weak labels, which are used to construct the training data for training the end discriminative model. Training data construction is usually done in batch mode instead of, say, real time.
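A sketch of one feature generation function is shown below, built on OpenCV's matchTemplate with the TM_CCORR_NORMED mode (the normalized cross correlation cited above [2]); the two-pass coarse-to-fine search is a simplified stand-in for the pyramid method [3], and the scale and margin values are illustrative.

# A sketch of FG_i(I): slide pattern P_i over image I with normalized
# cross correlation and return the best similarity.
import cv2

def feature_gen(image, pattern, scale=0.25, margin=20):
    # Coarse pass on downsized copies to locate a candidate region quickly.
    small_img = cv2.resize(image, None, fx=scale, fy=scale)
    small_pat = cv2.resize(pattern, None, fx=scale, fy=scale)
    result = cv2.matchTemplate(small_img, small_pat, cv2.TM_CCORR_NORMED)
    _, _, _, (cx, cy) = cv2.minMaxLoc(result)
    x, y = int(cx / scale), int(cy / scale)

    # Fine pass at full resolution, restricted to a window around the candidate.
    h, w = pattern.shape[:2]
    y0, y1 = max(0, y - margin), min(image.shape[0], y + h + margin)
    x0, x1 = max(0, x - margin), min(image.shape[1], x + w + margin)
    result = cv2.matchTemplate(image[y0:y1, x0:x1], pattern, cv2.TM_CCORR_NORMED)
    return float(result.max())

Stacking the outputs of feature_gen over all patterns yields the feature vector that the MLP labeler consumes.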
Dataset              Image size   N (N_D)          N_V (N_DV)       Defect Type                                                              Task Type
KSDD [32]            500 x 1257   399 (52)         78 (10)          Crack                                                                    Binary
Product (scratch)    162 x 2702   1673 (727)       170 (76)         Scratch                                                                  Binary
Product (bubble)     77 x 1389    1048 (102)       104 (10)         Bubble                                                                   Binary
Product (stamping)   161 x 5278   1094 (148)       109 (15)         Stamping                                                                 Binary
NEU [16]             200 x 200    300 per defect   100 per defect   Rolled-in scale, Patches, Crazing, Pitted surface, Inclusion, Scratch    Multi-class

Table 1: For each of the five datasets, we show the image size, the dataset size (N) and number of defective images (N_D), the development set size (N_V) and number of defective images within it (N_DV), the types of defects detected, and the task type.
6. EXPERIMENTS
We evaluate Inspector Gadget on real datasets and answer the following questions:
• How accurate are the weak labels of Inspector Gadget compared to other state-of-the-art methods, and are they useful when training the end discriminative model?
• How useful is each component of Inspector Gadget?
• What are the errors made by Inspector Gadget?

We implement Inspector Gadget in Python using three machine learning libraries: PyTorch, TensorFlow, and Scikit-learn. We use an Intel Xeon CPU to train our MLP models and an NVIDIA Titan RTX GPU to train larger CNN models. Other details can be found in our released code [1].
Datasets.
We use real datasets for classification tasks. For each dataset, we construct a development set as described in Section 3. Table 1 summarizes the datasets with other experimental details, and Figure 9 shows samples of them. We note that the results in Section 6.2 are obtained by varying the size of the development set, and the rest of the experiments utilize the same development set size as described in Table 1. The NEU dataset has the smallest images, and the Product dataset has the largest. For each dataset, we have a gold standard of labels. Hence, we are able to compute the accuracy of the labeling on separate test data.

• The Kolektor Surface-Defect Dataset (KSDD [32]) is constructed from images of defected electrical commutators that were provided and annotated by the Kolektor Group. There is only one type of defect – cracks – but each one varies significantly in shape.
• The Product dataset (Figure 5) is proprietary and obtained through a collaboration with a manufacturing company that has a smart factory application. Each product has a circular shape where different strips are spread into rectangular shapes. There are three types of defects: scratches, bubbles, and stampings, which occur in different strips. The scratches vary in length and direction. The bubbles are more uniform, but have small sizes. The stampings are small and appear in fixed positions. We divide the dataset into three, as if there is a separate dataset for each defect type.
• The Northeastern University Surface Defect Database (NEU [16]) contains images that are divided into 6 defect types of typical surface defects of hot-rolled steel strips: rolled-in scale, patches, crazing, pitted surface, inclusion, and scratches. Compared to the other datasets, these defects take larger portions of the images. Since there are no images without defects, we solve the different task of multi-class classification where the goal is to determine which defect is present.

Figure 9: Sample images in the KSDD [32] (a) and NEU [16] (b) datasets where we highlight the defects with bounding boxes. See Figure 5 for the Product dataset images.
Pattern Matching.
We use an OpenCV library function [2] that compares pixel distributions on the x and y axes using normalized cross correlation (explained in Section 5.1).
GAN-based Augmentation.
We provide more details for Section 4.1. For all datasets, the input random noise vector has a size of 100, the learning rates of the generator and discriminator are both 1e−4, and the number of epochs is about 1K. We fit patterns to a square shape where the width and height are set to 100 or the averaged value of all widths and heights of patterns, whichever is smaller.
MLP Model Tuning.
We use an L-BFGS optimizer [18], which provides stable training on small data. We use k-fold cross validation where each fold has at least 20 examples per class, with early stopping, in order to compare the accuracies of MLPs before they overfit.
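As an illustration, the tuning loop can be sketched with scikit-learn's MLPClassifier and its L-BFGS solver; the candidate architectures and the cv=5 setting here are illustrative, not the exact search space of the paper.

# A sketch of MLP tuning: compare candidate architectures by cross-validated
# F1 on the development set and keep the best one.
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

def tune_mlp(X_dev, y_dev, candidates=((32,), (64,), (64, 32), (64, 64, 32))):
    best_model, best_f1 = None, -1.0
    for hidden in candidates:
        clf = MLPClassifier(hidden_layer_sizes=hidden, solver="lbfgs",
                            max_iter=1000)
        f1 = cross_val_score(clf, X_dev, y_dev, cv=5, scoring="f1").mean()
        if f1 > best_f1:
            best_model, best_f1 = clf, f1
    return best_model.fit(X_dev, y_dev)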
Accuracy Measure.
We use the F1 score, which is the harmonic mean between precision and recall. Suppose that the set of true defects is D while the set of predictions is P. Then the precision Pr = |D ∩ P| / |P|, the recall Re = |D ∩ P| / |D|, and F1 = (2 × Pr × Re) / (Pr + Re). While there are other possible measures like AUC, F1 is known to be more suitable for data where the labels are imbalanced [13], as in most of our settings.

Systems Compared.
We compare Inspector Gadget with state-of-the-art image labeling systems and self-learning baselines that train a CNN model on available labeled data.

Snuba [35] automates the process of labeling function construction by starting from a set of primitives that are analogous to our feature generation functions and iteratively selecting subsets of them to train decision tree models, which become the labeling functions. Each iteration involves comparing models trained on all possible subsets of the primitives up to a certain size. Finally, the labeling functions are combined into a generative model. We faithfully implement Snuba and use our crowdsourced and augmented patterns for generating primitives, in order to be favorable to Snuba. However, adding more patterns quickly slows down Snuba as its runtime is exponential in the number of patterns.

We also compare with GOGGLES [9], which takes the opposite approach of not using crowdsourcing. However, it relies on the fact that there is a pre-trained model and extracts semantic prototypes of images where each prototype represents the part of an image where the pre-trained model is activated the most. Each image is assumed to have one object, and GOGGLES clusters similar images for unsupervised learning. In our experiments, we use the open-sourced code of GOGGLES and a pre-trained VGG-16 model [29].

Finally, we compare Inspector Gadget with self-learning [33] baselines that train CNN models on the development set using cross validation and use them to label the rest of the images. To make a fair comparison, we experiment with both heavy and light-weight CNN models. For the heavy model, we use VGG-19 [29], which is widely used in the literature. For the light-weight model, we use MobileNetV2 [27], which is designed to train efficiently in a mobile setting, but nearly has the performance of heavy CNNs. We also make a comparison with VGG-19 whose weights are pre-trained on ImageNet [10]. (A pre-trained MobileNetV2 does not perform as well and is thus not compared.) Transfer learning obviously gives a significant boost in performance, and we do not claim that Inspector Gadget, which is not pre-trained, is better than this version of VGG-19. In addition, we use preprocessing techniques on images that are favorable to the baselines. For example, the images from the Product dataset are long rectangles, so we split each image in half and stack the halves on top of each other to make the image more square-like, which is advantageous for CNNs.
We compare the weak label accuracy of Inspector Gadget with the other methods. Figure 10 compares Inspector Gadget with Snuba, GOGGLES, and the self-learning baselines by increasing the amount of training data and observing the F1 scores. To clearly show how Inspector Gadget compares with other methods, we use a solid line to draw its plot while using dotted lines for the rest. Among the models that are not pre-trained (i.e., ignore "SL (VGG19 + Pre-training)" for now), we observe that Inspector Gadget performs best overall because it is either the best or second-best method in all figures. This result is important because industrial images have various defect types that must all be identified correctly. For KSDD (Figure 10d), Inspector Gadget performs the best because the pattern augmentation helps Inspector Gadget find more variations of cracks (see Section 6.4). For Product (Figures 10a–10c), Inspector Gadget consistently performs the first or second best despite the different characteristics of the defects. For NEU (Figure 10e), Inspector Gadget ranks first for the multi-class classification.

We explain the performances of the other methods. Snuba consistently has a lower F1 than Inspector Gadget, possibly because the number of patterns is too large to handle. Instead of considering all combinations of patterns and training decision trees, Inspector Gadget's approach of training and auto-tuning an MLP works better in our experiments. GOGGLES does not need training data and thus has a constant accuracy. In Figure 10a, GOGGLES has a high F1 because the defect sizes are large, and the pre-trained VGG-16 is effective in identifying them as objects. For the other figures, however, GOGGLES does not perform as well because the defect sizes are small and difficult to identify as objects. VGG-19 without pre-training ("SL (VGG19)") only performs the best in Figure 10c where CNN models are very good at detecting stamping defects because they appear in a fixed location on the images. For the other figures, VGG-19 performs poorly because there is not enough labeled data. MobileNetV2 does not perform well in any of the figures. Finally, the pre-trained VGG-19 ("SL (VGG19 + Pre-training)") does outperform Inspector Gadget in Figures 10a, 10d, and 10e. While we do not claim that Inspector Gadget outperforms pre-trained models, it is the best option when pre-training is not possible.

We evaluate how effectively we can use the crowd to label and identify patterns using the Product datasets. Table 2 compares the full crowdsourcing workflow in Inspector Gadget with two variants: (1) a workflow that does not average the patterns at all and (2) a workflow that does average the patterns, but still does not perform peer reviews. For each scenario, we compare the F1 score of the MLP trained on the similarity features generated by matching the patterns on the development set, without using pattern augmentation. As a result, the full workflow clearly performs the best for the Product (scratch) and Product (stamping) datasets. For the Product (bubble) dataset, the workflow that does not combine patterns has a better average F1, but the accuracies vary among different workers. Instead, it is better to use the stable full workflow without the variance.

We evaluate how augmented patterns help improve the weak label F1 score of Inspector Gadget. Table 3 shows the impact of the GAN-based and policy-based augmentation on the five datasets.
Figure 10: Weak label accuracy comparison between Inspector Gadget, Snuba [35], GOGGLES [9], and the self-learning baselines (SL) using VGG-19 [29] (with or without pre-trained weights) and MobileNetV2 [27] on different sizes of training data: (a) Product (scratch); (b) Product (bubble); (c) Product (stamping); (d) KSDD [32]; (e) NEU [16]. Among the models that are not pre-trained, Inspector Gadget performs either the best or second-best in all figures.

Table 2: F1 scores (± std) of the no-averaging, no-peer-review, and full crowdsourcing workflows on the Product datasets. Each F1 score is the performance of the MLP trained on each workflow.
Dataset              No Aug.   Policy-Based   GAN-Based   Using Both
KSDD [32]            0.415     0.578          0.509
Product (scratch)
Product (bubble)
Product (stamping)
NEU [16]             0.936

Table 3: Pattern augmentation impact on Inspector Gadget. For each dataset, we highlight the highest F1 score.

For each augmentation, we add 100–500 patterns, which empirically results in the best F1 improvements (see below). When using both augmentations, we simply combine the patterns from each augmentation. As a result, while each augmentation helps improve F1, using both of them usually gives the best results.

Figure 11 shows how adding patterns impacts the F1 score for the Product (bubble) dataset. While adding more patterns helps to a certain extent, it has diminishing returns afterwards. The results for the other datasets are similar, although sometimes noisier. The best number of augmented patterns differs per dataset, but falls in the range of 100–500.
Figure 11: Policy-based and GAN-based augmentation results on the Product (stamping) dataset.
We evaluate the impact of model tuning on accuracy described in Section 5.2, as shown in Figure 12. We use an MLP with 1 to 3 hidden layers and vary the number of nodes per hidden layer to be one of {2^n | n = 1, ..., m and 2^{m−1} ≤ I ≤ 2^m} where I is the number of input nodes. For each dataset, we first obtain the maximum and minimum possible F1 scores by evaluating all the tuned models we considered directly on the test data. Then, we compare these results with the (test data) F1 score of the actual model that Inspector Gadget selected after comparing the models using the development set. We observe that the model tuning in Inspector Gadget can indeed improve the model accuracy, close to the maximum possible value.
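Under that search space as reconstructed here, the candidate architectures could be enumerated as follows; this is our illustrative reading of the rule, not code from the paper.

# A sketch of enumerating candidate hidden-layer widths: powers of two up to
# the smallest 2^m with I <= 2^m, for MLPs with 1 to 3 hidden layers.
import itertools

def candidate_architectures(num_inputs):
    m = max(1, (num_inputs - 1).bit_length())  # smallest m with num_inputs <= 2^m
    widths = [2 ** n for n in range(1, m + 1)]
    archs = []
    for depth in (1, 2, 3):
        archs += list(itertools.product(widths, repeat=depth))
    return archs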
Figure 12: The variations in F1 scores when tuning the MLP model hyper-parameters (maximum, minimum, and our tuning) on KSDD, Product (scratch, bubble, stamping), and NEU.

We now address the issue of whether the weak labels are actually helpful for training the end discriminative model. We compare the F1 score of this end model with the same model that is trained on the development set. For the discriminative model, we use VGG-19 [29] for the binary classification tasks on KSDD and Product, and ResNet50 [14] for the multi-class task on NEU. We can use other discriminative models that have higher absolute F1, but the point is to show the relative F1 improvements when using weak labels. Table 4 shows that the F1 scores improve by 0.02–0.36.

Dataset              End Model   Dev. Set   WL (IG)
KSDD [32]            VGG19       0.499      0.700
Product (scratch)    VGG19       0.925      0.978
Product (bubble)     VGG19       0.359      0.720
Product (stamping)   VGG19       0.782      0.876
NEU [16]             ResNet50    0.953      0.970

Table 4: F1 scores of end models trained on the development set (Dev. Set) only or the development set combined with the weak labels produced by Inspector Gadget (WL (IG)).

We perform an error analysis on the cases where Inspector Gadget fails to make correct predictions for the five datasets, based on manual investigation. We use the ground truth information for the analysis. Table 5 shows that the most common error is when certain defects do not match with the patterns, which can be improved by using better pattern augmentation and matching techniques. The next most common case is when the data is noisy, which can be improved by cleaning the data. The last case is the most challenging, where even humans have difficulty identifying the defects because they are not obvious (e.g., a near-invisible scratch).
7. RELATED WORK
Crowdsourcing for machine learning.
Using humans for advanced analytics is increasingly becoming mainstream [38]. While there is a push to make all steps of machine learning automatic, there are cases where humans are essential, and the key issue is using them with the minimum effort. There is also an extensive literature in HCI on how to integrate crowdsourcing in machine learning [6]. These methods focus on how to provide the right interfaces to help users construct features. Inspector Gadget builds on top of these approaches by relying on crowdsourcing for identifying patterns.
Data Programming.
Data programming [25] is a recent paradigm where workers program labeling functions, which are used to generate weak labels at scale. Snorkel [24, 4] is a seminal system that demonstrates the practicality of data programming, and Snuba [35] extends it by automatically constructing labeling functions using primitives. In comparison, Inspector Gadget does not assume any accuracy guarantees on the feature generation function and directly labels images without converting them to structured data. Several systems have studied the problem of automating labeling function construction. CrowdGame [19] proposes a method for constructing labeling functions for entity resolution on structured data. Adversarial data programming [23] proposes a GAN-based framework for labeling with labeling function results and claims to be better than Snorkel-based approaches. In comparison, Inspector Gadget solves the different problem of partially analyzing large images.

Dataset              Matching failure   Noisy data    Difficult to humans
KSDD [32]            10 (52.6 %)        5 (26.3 %)    4 (21.1 %)
Product (scratch)    11 (36.7 %)        11 (36.7 %)   8 (26.6 %)
Product (bubble)     19 (45.2 %)        15 (35.7 %)   8 (19.1 %)
Product (stamping)   15 (45.5 %)        13 (39.4 %)   5 (15.1 %)
NEU [16]             35 (63.6 %)        4 (7.3 %)     16 (29.1 %)

Table 5: Error analysis of Inspector Gadget.
Automatic Image Labeling.
There is a variety of general automatic image labeling techniques. Data augmentation [28] is a general method to generate new labeled images. Generative adversarial networks (GANs) [12] have been proposed to generate fake, but realistic images based on existing images. Policies [7] were proposed to apply custom transformations on images as long as they remain realistic. Most of the existing work operates on the entire images. In comparison, Inspector Gadget is efficient because it only needs to augment patterns, which are much smaller than the images. Label propagation techniques [5] organize images into a graph based on their similarities and then propagate existing labels of images to their most similar ones. In comparison, Inspector Gadget is designed for images where only a small part of them is of interest while the main part may be nearly identical to other images, so we cannot utilize similarities. There are also application-specific defect detection methods [30, 16, 15], some of which are designed for the datasets we used. In comparison, Inspector Gadget provides a general framework for image labeling. Recently, GOGGLES [9] is an image labeling system that relies on a pre-trained model to extract semantic prototypes of images and construct an affinity matrix that can be used to identify similar images. In comparison, Inspector Gadget does not rely on pre-trained models and is more suitable for partially analyzing large images using human knowledge.
8. CONCLUSION
We proposed Inspector Gadget, a scalable image labeling system for classification problems that effectively combines crowdsourcing, data augmentation, and data programming techniques. Inspector Gadget targets applications in manufacturing where large industrial images are partially analyzed, and there are few or no labels to start with. Unlike existing data programming approaches that convert images to structured data beforehand, Inspector Gadget directly labels images by providing a crowdsourcing workflow to leverage human knowledge for identifying patterns of interest. The patterns are then augmented and matched with other images to generate similarity features for MLP model training. Our experiments show that Inspector Gadget outperforms the state-of-the-art methods Snuba, GOGGLES, and self-learning baselines using CNNs without pre-training. We thus believe that Inspector Gadget opens up a new class of problems to apply data programming.
9. REFERENCES
[1] Inspector Gadget GitHub repository. https://github.com/geonheo/InspectorGadget. Accessed April 1st, 2020.
[2] OpenCV. https://docs.opencv.org/2.4/modules/imgproc/doc/object_detection.html. Accessed April 1st, 2020.
[3] E. H. Adelson, C. H. Anderson, J. R. Bergen, P. J. Burt, and J. M. Ogden. Pyramid methods in image processing. RCA Engineer, 29(6):33–41, 1984.
[4] S. H. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, R. Kuchhal, C. Ré, and R. Malkin. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. In SIGMOD, pages 362–375, 2019.
[5] T. D. Bui, S. Ravi, and V. Ramavajjala. Neural graph learning: Training neural networks using graphs. In WSDM, pages 64–71, 2018.
[6] J. Cheng and M. S. Bernstein. Flock: Hybrid crowd-machine learning classifiers. In CSCW, pages 600–611, 2015.
[7] E. D. Cubuk, B. Zoph, D. Mané, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation policies from data. CoRR, abs/1805.09501, 2018.
[8] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. AutoAugment: Learning augmentation strategies from data. In CVPR, pages 113–123, 2019.
[9] N. Das, S. Chaba, S. Gandhi, D. H. Chau, and X. Chu. GOGGLES: Automatic training data generation with affinity coding. In SIGMOD, 2020.
[10] J. Deng, W. Dong, R. Socher, L. Li, K. Li, and F. Li. ImageNet: A large-scale hierarchical image database. In CVPR, pages 248–255, 2009.
[11] M. Frid-Adar, I. Diamant, E. Klang, M. Amitai, J. Goldberger, and H. Greenspan. GAN-based synthetic medical image augmentation for increased CNN performance in liver lesion classification. Neurocomputing, 321:321–331, 2018.
[12] I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. C. Courville, and Y. Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
[13] Q. Gu, L. Zhu, and Z. Cai. Evaluation measures of the classification performance of imbalanced data sets. In Z. Cai, Z. Li, Z. Kang, and Y. Liu, editors, CIIS, pages 461–471, Berlin, Heidelberg, 2009.
[14] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, pages 770–778. IEEE Computer Society, 2016.
[15] Y. He, K. Song, H. Dong, and Y. Yan. Semi-supervised defect classification of steel surface based on multi-training and generative adversarial network. Optics and Lasers in Engineering, 122:294–302, 2019.
[16] Y. He, K.-C. Song, Q. Meng, and Y. Yan. An end-to-end steel surface defect detection approach via fusing multiple hierarchical features. IEEE Transactions on Instrumentation and Measurement, 69(4):1493–1504, 2020.
[17] A. Jolicoeur-Martineau. The relativistic discriminator: A key element missing from standard GAN. In ICLR, 2019.
[18] D. C. Liu and J. Nocedal. On the limited memory BFGS method for large scale optimization. Math. Program., 45(1-3):503–528, 1989.
[19] T. Liu, J. Yang, J. Fan, Z. Wei, G. Li, and X. Du. CrowdGame: A game-based crowdsourcing system for cost-effective data labeling. In SIGMOD, pages 1957–1960, 2019.
[20] Y. Liu, T. Kohlberger, M. Norouzi, G. Dahl, J. Smith, A. Mohtashamian, N. Olson, L. Peng, J. Hipp, and M. Stumpe. Artificial intelligence based breast cancer nodal metastasis detection: Insights into the black box for pathologists. Archives of Pathology & Laboratory Medicine, 2018.
[21] T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida. Spectral normalization for generative adversarial networks. CoRR, abs/1802.05957, 2018.
[22] E. Oztemel and S. Gursev. Literature review of industry 4.0 and related technologies. Journal of Intelligent Manufacturing, 2018.
[23] A. Pal and V. N. Balasubramanian. Adversarial data programming: Using GANs to relax the bottleneck of curated labeled data. In CVPR, pages 1556–1565, 2018.
[24] A. J. Ratner, S. H. Bach, H. R. Ehrenberg, and C. Ré. Snorkel: Fast training set generation for information extraction. In SIGMOD, pages 1683–1686, 2017.
[25] A. J. Ratner, C. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In NIPS, pages 3567–3575, 2016.
[26] Y. Roh, G. Heo, and S. E. Whang. A survey on data collection for machine learning: A big data - AI integration perspective. IEEE TKDE, 2019.
[27] M. Sandler, A. G. Howard, M. Zhu, A. Zhmoginov, and L. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In CVPR, pages 4510–4520, 2018.
[28] C. Shorten and T. M. Khoshgoftaar. A survey on image data augmentation for deep learning. J. Big Data, 6:60, 2019.
[29] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
[30] K.-C. Song and Y. Yan. A noise robust method based on completed local binary patterns for hot-rolled steel strip surface defects. Applied Surface Science, 285:858–864, 2013.
[31] M. Stonebraker and E. K. Rezig. Machine learning and big data: What is important? IEEE Data Eng. Bull., 2019.
[32] D. Tabernik, S. Šela, J. Skvarč, and D. Skočaj. Segmentation-based deep-learning approach for surface-defect detection. Journal of Intelligent Manufacturing, May 2019.
[33] I. Triguero, S. García, and F. Herrera. Self-labeled techniques for semi-supervised learning: Taxonomy, software and empirical study. Knowl. Inf. Syst., 42(2):245–284, 2015.
[34] P. Varma, B. D. He, P. Bajaj, N. Khandwala, I. Banerjee, D. L. Rubin, and C. Ré. Inferring generative model structure with static analysis. In NeurIPS, pages 240–250, 2017.
[35] P. Varma and C. Ré. Snuba: Automating weak supervision to label training data. PVLDB, 12(3):223–236, 2018.
[36] P. Varma, F. Sala, A. He, A. Ratner, and C. Ré. Learning dependency structures for weak supervision models. In K. Chaudhuri and R. Salakhutdinov, editors, ICML, volume 97, pages 6418–6427, 2019.
[37] Z. Wang, Q. She, and T. E. Ward. Generative adversarial networks: A survey and taxonomy. CoRR, abs/1906.01529, 2019.
[38] D. Xin, L. Ma, J. Liu, S. Macke, S. Song, and A. G. Parameswaran. Accelerating human-in-the-loop machine learning: Challenges and opportunities. In DEEM, 2018.