End-To-End Graph-based Deep Semi-Supervised Learning
Zihao Wang, Enmei Tu and Meng Zhou
Shanghai Jiao Tong University, Shanghai, China
ABSTRACT
The quality of a graph is determined jointly by three key factors: its nodes, its edges and its similarity measure (or edge weights), and is crucial to the success of graph-based semi-supervised learning (SSL) approaches. Recently, dynamic graphs, in which part or all of these factors are updated during training, have proven promising for graph-based semi-supervised learning. However, existing approaches update only some of the three factors and keep the rest manually specified throughout the learning stage. In this paper, we propose a novel graph-based semi-supervised learning approach that optimizes all three factors simultaneously in an end-to-end learning fashion. To this end, we concatenate two neural networks (a feature network and a similarity network) to learn the categorical label and the semantic similarity, respectively, and train the networks to minimize a unified SSL objective function. We also introduce an extended graph Laplacian regularization term to increase training efficiency. Extensive experiments on several benchmark datasets demonstrate the effectiveness of our approach.
KEYWORDS
semi-supervised learning, similarity learning, deep learning, image classification
Deep neural networks trained with a large number of labeled samples have attained tremendous success in many areas such as computer vision and natural language processing [11, 14, 16]. However, manually labeling large amounts of data is expensive for many tasks (e.g., medical image segmentation) because the labeling work is often resource- and/or time-consuming. Semi-supervised learning (SSL), which leverages a small set of high-quality labeled data in conjunction with a large number of easily available unlabeled data, is a primary solution to this problem. Comprehensive introductions and reviews of existing SSL approaches can be found in [40, 52].

In recent years, deep semi-supervised learning (DSSL) has become an active research topic and a surge of novel approaches has appeared in the literature. Broadly speaking, these DSSL approaches can be divided into two groups:

• Pairwise-similarity independent. This type of SSL approach introduces sample-wise regularization into the model to reduce network overfitting on the small labeled dataset. Typical algorithms include: forcing the perturbed version of the data or model to be close to the clean version [2, 24, 30, 36, 39]; adopting generative models (mainly generative adversarial networks (GANs) or autoencoders) to learn data distribution information [10, 22, 23, 31, 34, 37, 45]; and utilizing pseudo labels to expand the training set [5, 25, 41, 46].

• Pairwise-similarity dependent. This type of SSL approach makes use of pairwise relationships between all samples to reduce model overfitting. Typical algorithms include: combining with traditional graph-based SSL [17, 19, 38]; exploiting geometrical properties of data manifolds [33, 35, 43]; embedding association learning [13]; deep metric embedding [15]; etc.

We focus on pairwise-similarity-based SSL, in particular graph-based approaches. In these approaches, a graph G(V, E, W) (with vertex set V, edge set E and edge weight/similarity measure W) is constructed, and learning (i.e., label propagation or random walk) is performed on the graph. The quality of the graph (in terms of the connectivity of the components corresponding to the categorical classes in a classification problem) is controlled by its ingredients V, E, W and is a dominant factor in whether a graph-based SSL method achieves good performance.

Static-graph-based approaches construct a graph with constant V, E, W, and the graph remains fixed during model training. For these approaches, V can be the raw sample set or transformed feature vectors of the raw samples, including training and testing samples. E is usually generated by the k-nearest-neighbor method or the ϵ-distance method (if the distance between two samples is less than ϵ, there is an edge between them). If k equals the sample set size, or ϵ equals the sample set diameter, the graph is complete. The edge weight W is usually obtained by a predefined similarity measure function $W_{ij} = f(v_i, v_j)$ (such as a Gaussian kernel or dot product) that reflects the affinity/closeness of a pair of vertices. Most traditional graph-based SSL methods (e.g., [3, 49, 50]) and early DSSL methods (e.g., [43]) utilize static graphs.

Dynamic-graph-based approaches, in contrast, construct a graph whose vertices and/or edges are continually updated during model training. Compared to a static graph, a dynamic graph can absorb the newest categorical information extracted by the classifier (e.g., a neural network) and thus adjust its structure to adapt to the learning process immediately.
Recent progress in SSL has demonstrated that dynamic graphs are more advantageous and preferable for DSSL [17, 19, 29], because for complex learning tasks, such as natural image classification, any predefined static graph is either futile or arduous to design.

Existing dynamic-graph-based DSSL approaches usually adopt a specific weight/similarity function to construct the graph. In [17, 19] the authors use a dot-product function to measure the similarity of hidden-layer features. Luo et al. [29] assign 0 or 1 to the graph edges according to the pseudo labels of the corresponding network outputs. The similarity measures in these approaches are not trainable and thus may limit the adaptive capability of the dynamic graph, because edge similarity also has a direct influence on graph quality, and hence on the performance of the SSL method.

In this paper, we blend dynamic graph construction with a learnable similarity measure and train a system consisting of two networks in an end-to-end fashion. In particular, our model contains two networks, a feature network and a similarity network, as shown in Figure 1. The feature network maps input samples into a hidden space and, meanwhile, learns a classifier under the guidance of sample labels. The similarity network learns a similarity function in the hidden space, also under the guidance of sample labels, and its output is used to construct a dynamic graph to train the classifier. The two networks are optimized together to minimize a novel unified SSL objective function. Both networks' learning targets are the sample labels, so our model is an end-to-end learning approach.
Figure 1: The architecture of our model. The feature network maps input samples to a latent space and the similarity network learns the semantic similarity function in the latent space. The two networks are optimized jointly to minimize a semi-supervised objective function.
A static graph is one whose affinity matrix A, once computed (by a predefined similarity function or by locally adaptive methods), remains constant during the learning process. Traditional graph-based SSL algorithms usually contain two steps: a graph is constructed from both labeled and unlabeled samples; then categorical information is propagated from labeled samples to unlabeled ones on the graph. Representative algorithms include graph cut [6], label propagation (LP) [44], harmonic function (HF) [50], local and global consistency (LGC) [49] and many others. Since graphs and manifolds have a close mathematical relationship, there are also graph-based algorithms exploiting differential geometry, such as manifold regularization [3], manifold tangent [35], Hessian energy [20] and local coordinate coding [48]. Because of the importance of graph quality in graph-based SSL, researchers have also developed various techniques to optimize graph weights or transform sample features to obtain a better graph [8, 18, 26, 27].

However, SSL classification results can vary largely for different similarity matrices [7, 51]. These static-graph-based approaches may achieve great success on conventional classification tasks but rarely on complex tasks, such as natural image classification, because for such tasks it is almost impossible to generate a static graph that faithfully captures all classification-related information.

In recent years, combining traditional graph-based SSL with deep neural networks to reduce the demand for training data has been an active research topic. Perhaps the earliest attempt is by Weston et al. [42], who include a graph Laplacian regularization term in the objective function. In [19], a graph is constructed in the hidden feature space and the traditional graph-based SSL algorithm label propagation [44] is adopted to compute the CCLP (Compact Clustering via Label Propagation) regularizer. In [17], a graph is constructed on the hidden features of a training batch and LGC [49] is used to obtain pseudo labels, which are treated as ground truth to train the network in the next round. Taherkhani et al. [38] use matrix completion to predict the labeling matrix and construct a graph in the hidden feature space to minimize the triplet loss of their network. Luo et al. [29] utilize the marginal loss [12] to exert neighborhood smoothness on a 0-1 sparse dynamic graph for each mini-batch. Different from [42], the graph in these algorithms evolves dynamically during training. Due to the strong adaptation property of the dynamic graph, these semi-supervised learning approaches have shown appealing performance on complex classification tasks.

Perturbation-based methods force two related copies of an individual sample, e.g., an image and its augmented version, to have consistent network outputs. The so-called consistency regularization term is defined in equation (1):

$$L_c(x; \theta) = \sum_i l_c\big(f_\theta(x_i), f_{\tilde{\theta}}(\tilde{x}_i)\big) \qquad (1)$$

where $\tilde{x}_i$ is a transformation of sample $x_i$ and the parameter set $\tilde{\theta}$ is either equal to $\theta$ or a transformation of it. $f$ is the classification output distribution. The perturbation difference $l_c$ is commonly measured by the squared Euclidean distance, i.e., $\|f_\theta(x_i) - f_{\tilde{\theta}}(\tilde{x}_i)\|^2$. The perturbations in the Π model [24] include data augmentation, input Gaussian noise, network dropout, etc. The Temporal Ensembling model in [24] forces the outputs of the current network to learn temporal average values during training.
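To make equation (1) concrete, here is a minimal PyTorch sketch of the squared-Euclidean consistency term for the case $\tilde{\theta} = \theta$ (the Π-model setting). This is our own illustration, not the authors' code; `model` is assumed to output class probabilities.

```python
import torch

def consistency_loss(model, x, x_perturbed):
    # Squared Euclidean distance between the outputs for a clean sample
    # and its perturbed copy, as in equation (1). Here theta~ = theta,
    # i.e. the same network scores both views (the Pi-model case).
    p_clean = model(x)                 # f_theta(x_i), class probabilities
    p_perturbed = model(x_perturbed)   # f_theta(x~_i)
    return ((p_clean - p_perturbed) ** 2).sum(dim=1).mean()
```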
Mean teacher (MT) [39] averages network parameters to obtain an online and more stable target $f_{\tilde{\theta}}(x)$. In virtual adversarial training (VAT) [30], an adversarial perturbation that maximally changes the output class distribution serves as an effective perturbation in the consistency loss. We adopt the perturbation loss of equation (1) in our model as a regularization term to reduce overfitting.

For graph-based semi-supervised learning, feature extraction and similarity measurement are mutually beneficial: better features yield a better similarity measure and vice versa. As displayed in Figure 1, we propose a joint architecture that simultaneously optimizes feature extraction and similarity learning to minimize a novel SSL objective function. We also introduce an extended version of the traditional graph Laplacian regularization term [3] to prevent the trivial-solution problem (always outputting 0 regardless of input) of the traditional graph Laplacian [43]. The overall objective function contains supervised and unsupervised loss parts for each component network.

First, let us introduce some mathematical notation. Given a data set $X = \{x_1, x_2, ..., x_l, x_{l+1}, ..., x_n\}$ with $x_i \in \mathbb{R}^d$, SSL assumes the first $l$ samples $X_L = \{x_1, x_2, ..., x_l\}$ are labeled according to $Y_L = \{y_1, y_2, ..., y_l\}$ with $y_i \in C = \{1, 2, ..., c\}$, and the remaining $n - l$ samples $X_U = \{x_{l+1}, ..., x_n\}$ are unlabeled (usually $l \ll n$). The binary label matrix of $Y_L$ is denoted $Y$, whose element $y_{ij}$ equals 1 if $x_i$ is from class $j$ and 0 otherwise. The goal of SSL is to learn a classifier $f: X \to [0, 1]^c$ parameterized by $\theta$ using all samples in $X$ and the labels $Y_L$. It is usually solved by minimizing equation (2):

$$\min_\theta \sum_{i=1}^{l} L_s(f_\theta(x_i), y_i) + L_u(f_\theta(X_L, X_U)) \qquad (2)$$

where $L_s$ is a supervised loss term, e.g., mean squared error (MSE) or cross-entropy loss, $f_\theta(x)$ is the parameterized classifier, and $L_u$ is usually a regularization term that exploits the unlabeled samples' information. To encourage categorical information to be distributed smoothly, a graph Laplacian regularization term is adopted to penalize abrupt changes of the labeling function $f$ over nearby samples:

$$L_u = \sum_{i,j} A_{ij} \|f(x_i) - f(x_j)\|^2 = f^T \Delta f \qquad (3)$$

where $A_{ij}$ denotes the pairwise similarity between samples $x_i$ and $x_j$, and $A = \{A_{ij}\}_{i,j=1}^{n}$ is the affinity/similarity matrix that encodes node closeness on the graph. $\Delta = D - A$ is the graph Laplacian matrix and $D$ is a diagonal matrix with $D_{ii} = \sum_j A_{ij}$. (One could equally use the normalized version $\bar{\Delta} = I - D^{-1/2} A D^{-1/2}$, where $I$ is the identity matrix of appropriate size.)

We use a deep convolutional neural network as the feature classifier $f$. It can be decomposed as $f = h \circ g$, where $g: X \to \mathbb{R}^d$ is a feature extractor that maps the input samples to abstract features, and $h: g(x) \to [0, 1]^c$ is a linear classifier followed by a softmax function to output the probability distribution over classes. We denote the feature extracted from sample $x_i$ by $z_i = g(x_i)$.

For categorical label learning, we use the standard cross-entropy loss as the supervised loss term, as shown in equation (4):

$$L_{sup\_f} = -\frac{1}{l} \sum_{i=1}^{l} \sum_{j=1}^{c} y_{ij} \ln\big(f_\theta(x_i)_j\big) \qquad (4)$$

Note that $L_{sup\_f}$ applies only to labeled samples in $X_L$.
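As an illustration of equation (3), the quadratic penalty can be computed directly from a batch of classifier outputs and an affinity matrix. The sketch below is ours, not the paper's implementation, and assumes `outputs` holds the rows $f(x_i)$:

```python
import torch

def laplacian_regularizer(outputs, A):
    # L_u = sum_ij A_ij * ||f(x_i) - f(x_j)||^2, equation (3);
    # equivalently a constant multiple of f^T Delta f with Delta = D - A.
    # outputs: (n, c) class-probability rows; A: (n, n) affinity matrix.
    sq_dists = torch.cdist(outputs, outputs) ** 2
    return (A * sq_dists).sum()
```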
Since $z$ is a low-dimensional feature of the input $x$, the similarity $W_{ij}$ between two samples $x_i$ and $x_j$ can be formulated as a function of the latent variables $z_i, z_j$, as shown in equation (5):

$$W_{ij} = \Phi(z_i, z_j) = \Phi(g(x_i), g(x_j)) \qquad (5)$$

where $\Phi(\cdot)$ is a multilayer fully connected neural network (the similarity network). Equation (5) also shows that the pairwise similarity $W_{ij}$ is a composite function of the sample pair $x_i, x_j$.

To construct an end-to-end semantic similarity learning model, we treat the task of learning $W_{ij}$ as a binary supervised classification problem. There are two units in the output of $\Phi(\cdot)$, representing similarity and dissimilarity respectively. Since the network's output after softmax is a probability distribution, if the similarity between two samples is $W_{ij}$, then the dissimilarity between them is $(1 - W_{ij})$. We also use the cross-entropy loss as the supervised loss term to learn the semantic similarity, as shown in equation (6):

$$L_{sup\_W} = -\frac{1}{l^2} \sum_{i,j} \sum_{k=1}^{2} W^{l}_{ijk} \ln\big(\Phi(z_i, z_j)_k\big) \qquad (6)$$

where $W^{l}_{ij}$ is the similarity target vector, whose value is $[1, 0]$ if the sample pair $x_i, x_j \in X_L$ is from the same class and $[0, 1]$ if from different classes. However, dissimilar pairs far outnumber similar pairs, and if we randomly select sample pairs from the labeled set $X_L$, the model can hardly learn from similar pairs. To tackle this problem, we generate virtual pairs consisting of $x_i$ and its augmented version $Augment(x_i)$ as a similar pair in each mini-batch during training. Notice that although $x_i$ and $Augment(x_i)$ in a generated virtual pair are similar, $z_i, z_j$ are not equal, because of the random data augmentation, input Gaussian noise, network dropout, etc. We emphasize that in our model the task of learning the similarity $W$ is parallel to the task of learning the label $y$ and has equal importance.

The supervised losses in equations (4) and (6) are defined on the labeled set $X_L$. Now we introduce the unsupervised loss term in equation (2). The categorical probability distribution given by $f$ reflects how likely the classifier regards a sample as coming from each class. Therefore, we can define a classifier confidence over two samples as

$$A_{ij} = \exp\big(-\beta \|f(x_i) - f(x_j)\|^2\big) \qquad (7)$$

where $A_{ij}$ is the classifier's confidence that samples $x_i$ and $x_j$ are from the same class. The confidence is also a similarity measure for samples $x_i$ and $x_j$, and, conversely, $(1 - A_{ij})$ represents the dissimilarity between the samples. By doing so, we construct a classifier confidence graph $G_c(\{f_i\}, A_{ij})$, whose nodes are the categorical probability distributions of the samples and whose edge weights are the same-class confidences $A_{ij}$.

Trained by equation (6), the semantic similarity $\Phi(z_i, z_j)$ given by the similarity network can serve as the ideal similarity value. We construct a semantic graph $G_s(z_i, \Phi)$ on the hidden-layer features, with nodes $z_i, i = 1...n$ and edge weights $\Phi(z_i, z_j)$. The purpose of regularizing the unlabeled samples is to optimize the confidence graph $G_c$ towards the semantic graph $G_s$. We encourage the two graphs to match each other by minimizing the cross entropy between the confidence similarity $A_{ij}$ and the semantic similarity $\Phi(z_i, z_j)$:

$$L_{unsup} = -\sum_{i,j} \Phi(z_i, z_j)_1 \ln(A_{ij}) - \sum_{i,j} \Phi(z_i, z_j)_2 \ln(1 - A_{ij}) \qquad (8)$$

where $\Phi(z_i, z_j)_1 = W_{ij}$ is the output corresponding to same-class semantic similarity and $\Phi(z_i, z_j)_2 = 1 - W_{ij}$ is the output corresponding to different-class semantic dissimilarity.
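A minimal sketch of the similarity network of equation (5) and the supervised pair loss of equation (6) could look as follows. This is our illustration; the layer widths, dropout rate and function names are assumptions, not the paper's specification.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimilarityNet(nn.Module):
    # Phi(z_i, z_j) of equation (5): an MLP over concatenated features
    # with a two-way softmax head (similar vs. dissimilar). Hidden sizes
    # and dropout rate are illustrative assumptions.
    def __init__(self, feat_dim=128, hidden=256, p_drop=0.2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p_drop),
            nn.Linear(hidden, 2),
        )

    def forward(self, z_i, z_j):
        logits = self.mlp(torch.cat([z_i, z_j], dim=1))
        return F.softmax(logits, dim=1)  # [:, 0] = W_ij, [:, 1] = 1 - W_ij

def supervised_similarity_loss(phi_out, same_class):
    # Cross-entropy of equation (6): same_class is 1 for labeled pairs
    # from the same class, 0 otherwise (targets [1, 0] vs. [0, 1]).
    target = torch.stack([same_class.float(), 1.0 - same_class.float()], dim=1)
    return -(target * torch.log(phi_out + 1e-8)).sum(dim=1).mean()
```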
Substituting equation (7) into equation (8), we arrive at the following expression:

$$L_{unsup} = \beta \sum_{i,j} W_{ij} \|f(x_i) - f(x_j)\|^2 - \sum_{i,j} (1 - W_{ij}) \ln\Big(1 - \exp\big(-\beta \|f(x_i) - f(x_j)\|^2\big)\Big) \qquad (9)$$

The first term of equation (9) is the traditional graph Laplacian regularization of equation (3), with an additional parameter $\beta$; it enforces the smoothness of $f$ over the latent graph. The second term encourages dissimilar nodes on the graph by forcibly pulling $f(x_i)$ and $f(x_j)$ far apart. This penalty on the discrepancy between the semantic graph $G_s$ and the confidence graph $G_c$ yields an extension of the traditional graph Laplacian regularization. By including the second, dissimilarity term, the new regularizer naturally prevents model collapse (i.e., setting all outputs $W_{ij}$ of the similarity network to 0 to obtain a trivial solution that minimizes the traditional graph Laplacian regularizer). Note that $Augment(x_i)$ and $x_i$ can be treated as a pair of similar nodes in the semantic graph, so we set $W \equiv 1$ for such virtual pairs. If $W_{ij}$ is large (small), i.e., close or equal to 1 (0), then the Euclidean distance $\|f(x_i) - f(x_j)\|^2$ is encouraged to be smaller (larger), and vice versa. We argue that categorical label learning and semantic similarity learning with the extended graph Laplacian regularizer promote each other bidirectionally when minimizing equation (9). As demonstrated in the experiments, samples from each class are encouraged to form compact, well-separated clusters. Since our mini-batches contain both pairs of an individual sample with its perturbed version and pairs of different samples (see equation (12)), both local consistency and global consistency are guaranteed.

We also encourage consistency between a sample pair $\Phi(z_i, z_j)$ and its perturbed version $\Phi(z'_i, z'_j)$ to learn a more accurate pairwise similarity, where $(z_i, z_j) = (g(x_i), g(x_j))$ and $(z'_i, z'_j) = (g(x'_i), g(x'_j))$. The consistency loss of the similarity is formulated in equation (10):

$$L_{cons} = \sum_{i,j} \|\Phi(z_i, z_j) - \Phi(z'_i, z'_j)\|^2 = \sum_{i,j} \|\Phi_\alpha(g(x_i), g(x_j)) - \Phi_{\alpha'}(g(x'_i), g(x'_j))\|^2 \qquad (10)$$

Here $\alpha'$ is an exponential moving average of the network parameters, similar to [39], and we do not propagate gradients when computing $\Phi_{\alpha'}(\cdot)$. Since the similarity network learns a relationship over a combination of two samples, the learning space is much larger than for categorical learning. This perturbation regularization of similarity learning is important for narrowing the search space. We demonstrate the necessity of adding $L_{cons}$ in our ablation study.
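The extended regularizer of equation (9) and the EMA teacher $\Phi_{\alpha'}$ of equation (10) can be sketched as below (our illustration). Masking the diagonal and the decay value 0.99 are assumptions on our part, not values from the text.

```python
import torch

def extended_laplacian(outputs, W, beta, eps=1e-8):
    # Equation (9): a Laplacian-style attraction for similar pairs plus a
    # repulsion term that prevents the trivial all-zero-similarity solution.
    # Masking the diagonal (i == j) is an implementation assumption.
    d2 = torch.cdist(outputs, outputs) ** 2        # ||f(x_i) - f(x_j)||^2
    mask = 1.0 - torch.eye(len(outputs), device=outputs.device)
    attract = beta * (mask * W * d2).sum()
    repel = -(mask * (1.0 - W)
              * torch.log(1.0 - torch.exp(-beta * d2) + eps)).sum()
    return attract + repel

@torch.no_grad()
def ema_update(teacher, student, decay=0.99):
    # alpha' of equation (10): exponential moving average of the similarity
    # network's parameters, as in Mean Teacher [39]; gradients never flow
    # through the teacher. The decay value is an assumed choice.
    for p_t, p_s in zip(teacher.parameters(), student.parameters()):
        p_t.mul_(decay).add_(p_s, alpha=1.0 - decay)
```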
Algorithm 1: End-to-End Semi-Supervised Similarity Learning
Input: training batches batch₁, batch₂, batch₃ as in equation (12); $(Y_{l1}, Y_{l2})$ := one-hot labels of $(X_{l1}, X_{l2})$ in batch₂; $W \equiv 1$ for batch₁; $W^l$ := corresponding one-hot similarity labels of batch₂; $f_\theta(x), \Phi_\alpha(g(x))$ := neural networks with parameters $\theta, \alpha$.
Parameters: coefficients $\beta, \lambda_1, \lambda_2, \lambda_3$.
for epoch in $[1, ..., T]$ do
  for mini-batch in $[1, ..., B]$ do
    Calculate each child batch's corresponding loss in equation (11)
    Update $\theta, \alpha$ using the Adam optimizer [21]
  end for
end for
return trained parameters $\theta, \alpha$

Given the above parts, we describe a training strategy that integrates them into a unified semi-supervised learning framework. Since we train the model for end-to-end learning of semantic similarity, we organize training batches in the form of sample pairs $(x_i, x_j)$ (i.e., a batch of size $b$ contains $b$ pairs $(x_i, x_j)$). For one batch update, the overall objective function is given in equation (11):

$$L = \frac{1}{|B_1| + |B_2|}\big(L_{sup\_f} + L_{sup\_W}\big) + \frac{\lambda_1}{|B_1| + |B_2|} L_{unsup} + \frac{\lambda_2}{|B_3|} L_{unsup} + \frac{\lambda_3}{|B_1| + |B_2| + |B_3|} L_{cons} \qquad (11)$$

where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are regularization coefficients and $B_1$, $B_2$ and $B_3$ are the child batch sizes. To compute the losses, we divide each training batch into three small child batches of sizes $B_1$, $B_2$ and $B_3$, respectively, constructed as follows. We first randomly select $B_1$ samples $X_1$ from the whole dataset, together with their augmented versions $X'_1$, to make the first child batch of size $B_1$. Then we randomly select two subsets $X_{l1}$ and $X_{l2}$ from $X_L$ (together with their labels $Y_{l1}$ and $Y_{l2}$ from $Y_L$), each of size $B_2$, to make the second child batch. For the third one, we randomly divide $X_1$ into two equal subsets $X_{11}$ and $X_{12}$ to make the third child batch of size $B_1/2$. We define the structure of one training batch as

$$batch := \{\, batch_1: (X_1, X'_1);\; batch_2: (X_{l1}, X_{l2});\; batch_3: (X_{11}, X_{12}) \,\} \qquad (12)$$

Obviously, $W \equiv 1$ for batch₁, so it can be used to evaluate $L_{sup\_W}$, $L_{cons}$ and $L_{unsup}$. With the similarity labels $W^l$ of batch₂, we can evaluate all the loss terms, including $L_{sup\_f}$, $L_{sup\_W}$, $L_{unsup}$ and $L_{cons}$. For batch₃, we only evaluate the unsupervised terms $L_{unsup}$ and $L_{cons}$; note that the $W$ in $L_{unsup}$ for batch₃ is the network's own output. With this training strategy, the classification network and the similarity-learning network are optimized simultaneously using both labeled and unlabeled data, and the similarity is learned in an end-to-end semi-supervised way. In summary, the full algorithm is shown in Algorithm 1.
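Rendered as code, Algorithm 1 is a standard mini-batch loop. In the sketch below (ours), `sample_child_batches`, `compute_loss_terms` and `combine` are hypothetical helpers standing in for equations (12), (4)/(6)/(9)/(10) and (11), respectively; the learning rate is an assumed default.

```python
import torch

def train(f_net, sim_net, data, labeled_idx, lambdas, beta,
          epochs, batches_per_epoch, lr=3e-4):
    # Schematic rendering of Algorithm 1; the three helpers below are
    # hypothetical stand-ins for equations (12), (4)/(6)/(9)/(10), (11).
    params = list(f_net.parameters()) + list(sim_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)          # Adam [21]
    for epoch in range(epochs):
        for _ in range(batches_per_epoch):
            b1, b2, b3 = sample_child_batches(data, labeled_idx)  # eq. (12)
            terms = compute_loss_terms(f_net, sim_net, b1, b2, b3, beta)
            loss = combine(terms, lambdas)                        # eq. (11)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return f_net, sim_net
```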
In this section, we evaluate the effectiveness of our proposed method on several standard benchmarks and compare the results with recently reported ones in the literature to show its superior performance. (The source code to reproduce our experimental results is available at https://drive.google.com/open?id=1BU-w3pSeIyP4X2–wFM5xO8HpgojxCBN.)

As an illustrative example, we first evaluate our model on the "two moons" and "two circles" toy datasets. Each dataset contains 6000 samples with $x \in \mathbb{R}^2$ and $y \in \{0, 1\}$, corrupted by Gaussian noise (σ = 0.3 for "two circles"). There are 12 labeled samples in "two moons" and 8 labeled samples in "two circles".
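The toy data can be approximated with scikit-learn's generators. This is our own reconstruction: since the noise level for "two moons" is not stated above, the value 0.1 (and the `factor` of the circles) are assumptions.

```python
import numpy as np
from sklearn.datasets import make_moons, make_circles

# 6000 two-dimensional samples per dataset, as in the text; the "two
# moons" noise (0.1) and circle factor (0.5) are assumed values,
# while noise 0.3 for "two circles" follows the text.
X_m, y_m = make_moons(n_samples=6000, noise=0.1, random_state=0)
X_c, y_c = make_circles(n_samples=6000, noise=0.3, factor=0.5, random_state=0)

# Reveal only 12 / 8 labels; everything else is treated as unlabeled.
rng = np.random.default_rng(0)
labeled_m = rng.choice(len(X_m), size=12, replace=False)
labeled_c = rng.choice(len(X_c), size=8, replace=False)
```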
We use a three-layer fully-connected network with a hidden layer of 100 neurons followed by a leaky ReLU (α = 0.1) as the feature network. We then concatenate two 100-dimensional feature vectors to form a 200-dimensional vector as $\Phi(\cdot)$'s input, and define $\Phi(\cdot)$ as a three-layer fully-connected network with dropout after each hidden layer, mapping the 200-dimensional pair representation to a 2-dimensional output.

Figure 2: Classification results of our method and the baseline methods (panels: (a) MT, (b) Π, (c) SNTG, (d) Ours) on the "two moons" dataset. 12 labeled samples are marked with black crosses. Note the inside end of each moon.
We compare our method with MT [39], Π [24] and SNTG [29]. The results are depicted in Figures 2 and 3. From the figures we can see that, due to the irregular distribution of the labeled samples and the considerable class mixture, the baseline algorithms misclassify a considerable number of points at the inside end of each moon. In contrast, our method corrects the predictions of these samples and performs better.

Figure 3: Classification results of our method and the baseline methods (panels: (a) MT, (b) Π, (c) SNTG, (d) Ours) on the "two circles" dataset. 8 labeled samples are marked with black crosses.

We evaluate the classification performance of the proposed model and compare its results with several recently developed SSL models [1, 4, 5, 17, 24, 29, 30, 38, 39, 41]. In each experiment, with different numbers of labels on different datasets, we run our model 5 times across different random data splits and report the mean and standard deviation of the test error rate. Results of baseline algorithms are taken directly from the original papers when available, or obtained by us using the provided code and suggested parameters otherwise.
We conduct experiments on three datasets widely used in previous SSL studies: SVHN, CIFAR-10 and CIFAR-100. We randomly choose a small part of the training samples as labeled and use the remaining training data as unlabeled. Following common practice in the literature [24, 29, 39], we ensure that the number of labeled samples is balanced across classes and perform standard augmentation (2-pixel random translation on all datasets and random horizontal flips on CIFAR-10/100).
SVHN. The SVHN dataset includes 73257 training samples and 26032 test samples of size 32 × 32. The task is to recognize the centered digit (0-9) in each image. For SVHN, we use the same standard augmentation and pre-processing as in prior work [24, 39].
CIFAR-10. The CIFAR-10 dataset consists of 60000 RGB images of size 32 × 32 from 10 classes, with 50000 training samples and 10000 test samples. For CIFAR-10, we first normalize the images using per-channel standardization. We then augment the dataset with random horizontal flips (probability 0.5) and random 2-pixel translations. Unlike prior work [24], we found it unnecessary to use ZCA whitening or to add Gaussian noise to the input images.
CIFAR-100. The CIFAR-100 dataset is similar to CIFAR-10, except that it has 100 classes with 600 images per class; there are likewise 50000 training samples and 10000 test samples. We use the same data normalization as for CIFAR-10, but evaluate performance on CIFAR-100 only with RandAugment, since it is a more difficult classification task. Furthermore, we also perform experiments with RandAugment on SVHN and CIFAR-10 to achieve better results, shown in Section 4.2.3.
Settings. For categorical label learning (from the input images X to the classification output f(·)), we use the standard "CNN-13" architecture that has been employed as a common network structure in recent perturbation-based SSL approaches [24, 29, 39]. We treat the 128-dimensional vector before the linear classifier as the output of the feature extractor z = g(·). We then concatenate two features to form a 256-dimensional vector as the similarity network Φ(·)'s input, and define Φ(·) as a four-layer fully-connected network with dropout (rate a) after each hidden layer, mapping the 256-dimensional input to a 2-dimensional output. Experiments on all three datasets are performed and the results are recorded for comparison with baseline algorithms.
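The decomposition f = h ∘ g described above can be wired up as in the sketch below (ours). The CNN-13 backbone itself is not reproduced; returning the features alongside the class distribution is a convenience for feeding the similarity network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureClassifier(nn.Module):
    # f = h o g: g is the feature extractor (CNN-13 in the paper)
    # producing the 128-dimensional vector z; h is a linear classifier
    # followed by softmax, as described in the text.
    def __init__(self, g, feat_dim=128, num_classes=10):
        super().__init__()
        self.g = g
        self.h = nn.Linear(feat_dim, num_classes)

    def forward(self, x):
        z = self.g(x)                        # (batch, 128) features
        probs = F.softmax(self.h(z), dim=1)  # class distribution f(x)
        return probs, z                      # z feeds Phi's 256-d pair input
```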
Parameters. The hyperparameters of our method are β, λ₁, λ₂ and λ₃. Following SNTG [29], we use a ramp-up schedule for both the learning rate and the coefficients λ₁, λ₂, λ₃ at the beginning of training. Since neither the categorical labels nor the similarity is accurate early in training, we do not add the third term of equation (11) until 100 epochs. We define one epoch as one traversal of all samples in the dataset.

In each iteration, we sample a mini-batch according to equation (12), where we set |B₁| = 100, |B₂| = 10 and |B₃| = 50. Following [32], we select the best hyperparameters for our method using a validation set of 1000, 5000 and 5000 labeled samples for SVHN, CIFAR-10 and CIFAR-100, respectively. For the coefficients λ₁ and λ₂, we set them as k₁ × n_labeled/n_training and k₂ × n_labeled/n_training, so that only k₁ and k₂ need to be adjusted. For SVHN, we set λ₃ = 0.05 and run the experiments for 500 epochs. For CIFAR-10 with standard augmentation we use a dropout rate of 0.2; with RandAugment we set λ₃ = 0.05. For CIFAR-100 we set λ₃ = 0.15 and a dropout rate of 0. We run the experiments for 600 epochs on both CIFAR-10 and CIFAR-100; the remaining values of β, k₁ and k₂ are chosen on the validation set. The coefficients λ₁, λ₂ are ramped up from 0 to their maximum values over the first 80 epochs. λ₃ is 0 for the first 100 epochs and is then ramped up to its maximum value over the next 50 epochs. Except for the parameters stated above, all other hyperparameters remain unchanged from the MT implementation [39].
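The ramp-up schedules can be sketched as follows. The text does not specify the ramp shape, so the sigmoid-shaped ramp exp(-5(1-t)²) popularized by [24] is an assumed choice here.

```python
import math

def ramp_up(epoch, length, start=0):
    # Sigmoid-shaped ramp from ~0 to 1 over `length` epochs, beginning at
    # `start`; the exp(-5(1-t)^2) shape follows [24] and is assumed here.
    if epoch < start:
        return 0.0
    t = min((epoch - start) / length, 1.0)
    return math.exp(-5.0 * (1.0 - t) ** 2)

# lambda_1, lambda_2: ramped up over the first 80 epochs;
# lambda_3: zero for the first 100 epochs, then ramped over the next 50.
coeff_12 = lambda epoch: ramp_up(epoch, length=80)
coeff_3 = lambda epoch: ramp_up(epoch, length=50, start=100)
```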
Advanced Data Augmentation. To explore the performance bound of our method, we also combine our SSL with advanced data augmentation. We use an augmentation strategy similar to that reported for RandAugment [9], e.g., randomly applying two different strong augmentations with random magnitude to each image in X₁. Then f(X₁) in L_unsup is forced to learn the fixed target f(X'₁), which can be seen as a teacher model's output; only standard augmentation is applied to X'₁.

For SVHN, we evaluate the error rate with 250, 500 and 1000 labeled samples, and the experimental results with standard augmentation and RandAugment are presented in Table 1 and Table 2. Notably, while the error rates of MT+SNTG decrease by 0.06, 0.19 and 0.09 (for 250, 500 and 1000 labels, respectively) compared to its baseline MT, the error-rate drop of our method is 0.31, 0.31 and 0.38 compared to MT+SNTG. This suggests that a learned similarity is much better than simply assigning 0-1 similarity using pseudo labels. In both tables, it can be seen that our method outperforms the baseline algorithms by a considerable margin on the SVHN dataset.
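A sketch (ours) of this fixed-target consistency: the strongly augmented batch chases the detached prediction on the standard-augmented batch, which therefore behaves like a teacher output.

```python
import torch

def strong_aug_consistency(f_net, x_strong, x_standard):
    # The strongly augmented view (RandAugment-style) is trained towards
    # a fixed target computed on the standard-augmented view; detaching
    # the target stops gradients, so it acts as a teacher output.
    with torch.no_grad():
        target = f_net(x_standard)
    pred = f_net(x_strong)
    return ((pred - target) ** 2).sum(dim=1).mean()
```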
Table 1: Error rates (%) on SVHN with standard augmentation (bottom 6 rows are graph-based SSL; * no standard augmentation). Methods compared under 250, 500 and 1000 labels include the Π model [24] (9.93 with 250 labels), Π+SNTG [29] (5.07 with 250 labels) and ours.

Table 2: Error rates (%) on SVHN with RandAugment; † denotes a different architecture (WRN-28-2). Methods compared under 250, 500 and 1000 labels include ICT [41] (4.78 with 250 labels) and ours.

For CIFAR-10 with standard augmentation, we report results with 1000, 2000 and 4000 labels. We also conduct experiments with fewer labeled samples, from 250 to 4000, using stronger data augmentation [9], since models are more likely to overfit with fewer labeled samples on CIFAR-10. Results on CIFAR-10 are listed in Table 3 and Table 4. From these tables we can see that our method outperforms prior work in most cases. As an interesting trend in Table 4 for strong augmentation, the performance gain of our model grows larger as the number of labeled samples becomes smaller. Compared to prior work, the error rate of our model decreases by 2.59% for 250 labels.

For CIFAR-100, we perform experiments with 10000 labels and RandAugment. The results are in Table 5.
Table 3: Error rates (%) on CIFAR-10 with standard augmentation (bottom 6 rows are graph-based SSL; * no standard augmentation). Methods compared under 1000, 2000 and 4000 labels include the Π model [24] (31.65 with 1000 labels), Π+SNTG [29] (21.23 with 1000 labels) and ours.

All results use the same CNN-13 architecture. The results on CIFAR-100 again confirm the effectiveness of our method. We also find that the classification performance of our method is relatively stable with respect to the number of labels when strong data augmentation is applied, which indicates that our method with RandAugment makes better use of the unlabeled samples when labels are scarce.
One extra merit that differentiates our semi-supervised approach from existing ones (e.g., [24, 29, 39]) is the similarity-evaluation byproduct, which can directly evaluate the semantic-level similarity of any two input samples and thus can be used for image comparison/query. After training, the two component networks (the feature network and the similarity network in Figure 1) can be used as an encoding network and a similarity evaluation network, respectively. In this section, we conduct experiments to show that our method learns high-level semantic similarity information from raw images. We randomly select 5 samples from the test sets of SVHN and CIFAR-10, respectively, and show their k-nearest neighbors (k = 9) queried according to the learned similarity and to the Gaussian kernel function. Figure 4 displays the nearest neighbors in descending order.

The results of our method are obtained using 1000 labeled samples for SVHN and 4000 labeled samples for CIFAR-10 with standard data augmentation. As can be observed, our learned similarity focuses on the high-level semantic features of the sample contents, ignoring distracting details and transformations such as rotation, translation and brightness. In addition, our method measures the relative value of similarity rather than simply assigning 0 or 1 to a pair of samples, as is done by SNTG [29] and other pseudo-label-based algorithms.

To quantitatively evaluate the difference between the learned similarity and pseudo-label assignment, we randomly select 500 samples from the SVHN test set and plot the similarity matrices given by our method and by Π+SNTG [29] in Figure 5. Note that the elements of the similarity matrix of Π+SNTG are either 0 or 1, while ours lie in [0, 1] and carry a more useful closeness-ranking meaning. Visually comparing the two matrices, our model learns a more accurate similarity for classes {2, 3, 4, 7, 8, 9} than Π+SNTG, but a worse one for classes {0, 6}. Quantitatively, the mean squared errors (MSE) between the two matrices and the ideal (block-diagonal) similarity matrix are 0.149 (ours) and 0.201 (Π+SNTG), respectively.

Figure 4: Query results given by our method (left two panels) and the Gaussian kernel function (right two panels) on SVHN and CIFAR-10. 10 target samples are indicated by the blue rectangle on the left of each panel. Samples surrounded by red rectangles come from a different class than the corresponding query target.

Figure 5: Visualization of similarity matrices, obtained by running our model (left) and reproducing the Π+SNTG model [29] (right) on SVHN with 1000 labels.

Finally, we conduct experiments to investigate the effectiveness of the similarity network for semi-supervised learning. We employ the Π model [24] and MT [39] as base models to perform an ablation study on SVHN with 1000 labels. The results are shown in Table 6. From these results, one can see that similarity learning improves the classification performance of both models.
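As a sketch of the query procedure described above (our illustration; `g` and `sim_net` are the trained feature extractor and similarity network, with the pair-input convention of equation (5)):

```python
import torch

@torch.no_grad()
def query_nearest(sim_net, g, x_query, gallery, k=9):
    # Retrieve the k gallery images most similar to the query according
    # to the learned similarity W_qj = Phi(z_q, z_j)[0].
    z_q = g(x_query.unsqueeze(0))             # (1, d) query feature
    z_g = g(gallery)                          # (N, d) gallery features
    scores = sim_net(z_q.expand(len(z_g), -1), z_g)[:, 0]
    return torch.topk(scores, k).indices      # indices of top-k neighbors
```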
Table 4: Error rates (%) on CIFAR-10 with RandAugment [9]; † denotes a different architecture (WRN-28-2). Methods compared under 250, 500, 1000, 2000 and 4000 labels include ICT [41] (15.48 with 1000 labels; not reported for 250 and 500) and MT+SNTG [29] (10.56, 9.39, 8.57, 7.47 and 6.63), against ours.

Table 5: Error rates (%) on CIFAR-100.
Methods compared under 10000 labels include the Π model [24] (error rate 39.19) and several recent baselines, against ours.

Table 6: Ablation study on SVHN with 1000 labels (error rates, %).

Ablation | 1000 labels
without learning similarity (Π model) | 4.82
with learning similarity (Π model) | 3.84
without learning similarity (MT) | 3.93
with learning similarity (MT) | 3.50

In this paper, we proposed an end-to-end semi-supervised similarity learning approach that jointly optimizes a categorical labeling network and a similarity measure network to minimize an overall semi-supervised objective function. Experiments on three widely used image benchmark datasets show that our method outperforms or is comparable to other graph-based SSL methods and learns a more accurate similarity. With advanced data augmentation, our method fully exploits the data information to achieve state-of-the-art results. After training, an extra reward of our model is the similarity network, which can potentially be used for semantic-level image query. It is also worth mentioning that our method is extendable and easy to apply to other methods by adding a neural network to learn the similarity. In future work, we will further exploit the capability of our method on other learning tasks, such as image retrieval.

A potential limitation of our similarity learning approach is that the learned similarity matrix cannot be guaranteed to be positive semidefinite (PSD), which may restrict its application in some learning tasks, e.g., kernel-based learning algorithms. Nevertheless, indefinite kernel learning has been found to be interesting [28, 47], and we may combine our approach with indefinite kernel learning algorithms in the future.

REFERENCES
[1] Ben Athiwaratkun, Marc Finzi, Pavel Izmailov, and Andrew Gordon Wilson. 2019. There are many consistent explanations of unlabeled data: Why you should average. In International Conference on Learning Representations.
[2] Philip Bachman, Ouais Alsharif, and Doina Precup. 2014. Learning with pseudo-ensembles. In Advances in Neural Information Processing Systems. 3365–3373.
[3] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. 2006. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research 7 (2006), 2399–2434.
[4] David Berthelot, Nicholas Carlini, Ekin D Cubuk, Alex Kurakin, Kihyuk Sohn, Han Zhang, and Colin Raffel. 2019. ReMixMatch: Semi-supervised learning with distribution alignment and augmentation anchoring. arXiv preprint arXiv:1911.09785 (2019).
[5] David Berthelot, Nicholas Carlini, Ian Goodfellow, Nicolas Papernot, Avital Oliver, and Colin A Raffel. 2019. MixMatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems. 5050–5060.
[6] Avrim Blum and Shuchi Chawla. 2001. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning.
[7] Olivier Chapelle, Bernhard Scholkopf, and Alexander Zien. 2009. Semi-supervised learning (Chapelle, O. et al., eds.; 2006) [Book reviews]. IEEE Transactions on Neural Networks 20, 3 (2009), 542–542.
[8] Hong Cheng, Zicheng Liu, and Jie Yang. 2009. Sparsity induced similarity measure for label propagation. In IEEE 12th International Conference on Computer Vision. IEEE, 317–324.
[9] Ekin D Cubuk, Barret Zoph, Jonathon Shlens, and Quoc V Le. 2019. RandAugment: Practical data augmentation with no separate search. arXiv preprint arXiv:1909.13719 (2019).
[10] Zihang Dai, Zhilin Yang, Fan Yang, William W Cohen, and Ruslan R Salakhutdinov. 2017. Good semi-supervised learning that requires a bad GAN. In Advances in Neural Information Processing Systems. 6510–6520.
[11] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of deep bidirectional Transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[12] Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant mapping. In IEEE Conference on Computer Vision and Pattern Recognition, Vol. 2. IEEE, 1735–1742.
[13] Philip Haeusser, Alexander Mordvintsev, and Daniel Cremers. 2017. Learning by association: A versatile semi-supervised training method for neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 89–98.
[14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[15] Elad Hoffer and Nir Ailon. 2017. Semi-supervised deep learning by metric embedding.
[16] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. 2017. Densely connected convolutional networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 4700–4708.
[17] Ahmet Iscen, Giorgos Tolias, Yannis Avrithis, and Ondrej Chum. 2019. Label propagation for deep semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 5070–5079.
[18] Tony Jebara, Jun Wang, and Shih-Fu Chang. 2009. Graph construction and b-matching for semi-supervised learning. In Proceedings of the 26th Annual International Conference on Machine Learning. ACM, 441–448.
[19] Konstantinos Kamnitsas, Daniel Castro, Loic Le Folgoc, Ian Walker, Ryutaro Tanno, Daniel Rueckert, Ben Glocker, Antonio Criminisi, and Aditya Nori. 2018. Semi-supervised learning via compact latent space clustering. In International Conference on Machine Learning. 2459–2468.
[20] Kwang I Kim, Florian Steinke, and Matthias Hein. 2009. Semi-supervised regression using Hessian energy with an application to semi-supervised dimensionality reduction. In Advances in Neural Information Processing Systems. 979–987.
[21] Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations.
[22] Durk P Kingma, Shakir Mohamed, Danilo Jimenez Rezende, and Max Welling. 2014. Semi-supervised learning with deep generative models. In Advances in Neural Information Processing Systems. 3581–3589.
[23] Abhishek Kumar, Prasanna Sattigeri, and Tom Fletcher. 2017. Semi-supervised learning with GANs: Manifold invariance with improved inference. In Advances in Neural Information Processing Systems. 5534–5544.
[24] Samuli Laine and Timo Aila. 2017. Temporal ensembling for semi-supervised learning. In International Conference on Learning Representations.
[25] Dong-Hyun Lee. 2013. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In Workshop on Challenges in Representation Learning, ICML, Vol. 3. 2.
[26] Christian Leistner, Helmut Grabner, and Horst Bischof. 2008. Semi-supervised boosting using visual similarity learning. In IEEE Conference on Computer Vision and Pattern Recognition. IEEE, 1–8.
[27] Sheng Li and Yun Fu. 2013. Low-rank coding with b-matching constraint for semi-supervised classification. In Twenty-Third International Joint Conference on Artificial Intelligence.
[28] Gaëlle Loosli, Stéphane Canu, and Cheng Soon Ong. 2015. Learning SVM in Kreĭn spaces. IEEE Transactions on Pattern Analysis and Machine Intelligence 38, 6 (2015), 1204–1216.
[29] Yucen Luo, Jun Zhu, Mengxi Li, Yong Ren, and Bo Zhang. 2018. Smooth neighbors on teacher graphs for semi-supervised learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 8896–8905.
[30] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, and Shin Ishii. 2018. Virtual adversarial training: A regularization method for supervised and semi-supervised learning. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 8 (2018), 1979–1993.
[31] Augustus Odena. 2016. Semi-supervised learning with generative adversarial networks. arXiv preprint arXiv:1606.01583 (2016).
[32] Avital Oliver, Augustus Odena, Colin A Raffel, Ekin Dogus Cubuk, and Ian Goodfellow. 2018. Realistic evaluation of deep semi-supervised learning algorithms. In Advances in Neural Information Processing Systems. 3235–3246.
[33] Guo-Jun Qi, Liheng Zhang, Hao Hu, Marzieh Edraki, Jingdong Wang, and Xian-Sheng Hua. 2018. Global versus localized generative adversarial nets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 1517–1525.
[34] Antti Rasmus, Mathias Berglund, Mikko Honkala, Harri Valpola, and Tapani Raiko. 2015. Semi-supervised learning with ladder networks. In Advances in Neural Information Processing Systems. 3546–3554.
[35] Salah Rifai, Yann N Dauphin, Pascal Vincent, Yoshua Bengio, and Xavier Muller. 2011. The manifold tangent classifier. In Advances in Neural Information Processing Systems. 2294–2302.
[36] Mehdi Sajjadi, Mehran Javanmardi, and Tolga Tasdizen. 2016. Regularization with stochastic transformations and perturbations for deep semi-supervised learning. In Advances in Neural Information Processing Systems. 1163–1171.
[37] Jost Tobias Springenberg. 2016. Unsupervised and semi-supervised learning with categorical generative adversarial networks. In International Conference on Learning Representations.
[38] Fariborz Taherkhani, Hadi Kazemi, and Nasser M Nasrabadi. 2019. Matrix completion for graph-based deep semi-supervised learning. In Thirty-Third AAAI Conference on Artificial Intelligence.
[39] Antti Tarvainen and Harri Valpola. 2017. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems. 1195–1204.
[40] Jesper E van Engelen and Holger H Hoos. 2019. A survey on semi-supervised learning. Machine Learning (2019), 1–68.
[41] Vikas Verma, Alex Lamb, Juho Kannala, Yoshua Bengio, and David Lopez-Paz. 2019. Interpolation consistency training for semi-supervised learning. In Proceedings of the 28th International Joint Conference on Artificial Intelligence. AAAI Press, 3635–3641.
[42] Jason Weston, Frédéric Ratle, and Ronan Collobert. 2008. Deep learning via semi-supervised embedding. In Proceedings of the 25th International Conference on Machine Learning. 1168–1175.
[43] Jason Weston, Frédéric Ratle, Hossein Mobahi, and Ronan Collobert. 2012. Deep learning via semi-supervised embedding. In Neural Networks: Tricks of the Trade. Springer, 639–655.
[44] Xiaojin Zhu and Zoubin Ghahramani. 2002. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University (2002).
[45] Weidi Xu, Haoze Sun, Chao Deng, and Ying Tan. 2017. Variational autoencoder for semi-supervised text classification. In Thirty-First AAAI Conference on Artificial Intelligence.
[46] Yan Yan, Zhongwen Xu, Ivor W Tsang, Guodong Long, and Yi Yang. 2016. Robust semi-supervised learning through label aggregation. In Thirtieth AAAI Conference on Artificial Intelligence.
[47] Yiming Ying, Colin Campbell, and Mark Girolami. 2009. Analysis of SVM with indefinite kernels. In Advances in Neural Information Processing Systems. 2205–2213.
[48] Kai Yu, Tong Zhang, and Yihong Gong. 2009. Nonlinear learning using local coordinate coding. In Advances in Neural Information Processing Systems. 2223–2231.
[49] Dengyong Zhou, Olivier Bousquet, Thomas N Lal, Jason Weston, and Bernhard Schölkopf. 2004. Learning with local and global consistency. In Advances in Neural Information Processing Systems. 321–328.
[50] Xiaojin Zhu, Zoubin Ghahramani, and John D Lafferty. 2003. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the 20th International Conference on Machine Learning (ICML-03). 912–919.
[51] Xiaojin Zhu and Andrew B Goldberg. 2009. Introduction to semi-supervised learning. Synthesis Lectures on Artificial Intelligence and Machine Learning 3, 1 (2009), 1–130.
[52] Xiaojin Jerry Zhu. 2005. Semi-supervised learning literature survey. Technical Report 1530, University of Wisconsin-Madison.