Semi-supervised Learning with the EM Algorithm: A Comparative Study between Unstructured and Structured Prediction
Wenchong He, Zhe Jiang, Member, IEEE

• Corresponding author: Zhe Jiang, [email protected]
• W. He and Z. Jiang were with the Department of Computer Science, University of Alabama, Tuscaloosa, AL, 35487.
Abstract—Semi-supervised learning aims to learn prediction models from both labeled and unlabeled samples. There has been extensive research in this area. Among existing work, generative mixture models with Expectation-Maximization (EM) are a popular method due to their clear statistical properties. However, the existing literature on EM-based semi-supervised learning largely focuses on unstructured prediction, assuming that samples are independent and identically distributed. Studies on EM-based semi-supervised approaches in structured prediction are limited. This paper aims to fill the gap through a comparative study between unstructured and structured methods in EM-based semi-supervised learning. Specifically, we compare their theoretical properties and find that both methods can be considered as a generalization of self-training with soft class assignment of unlabeled samples, but the structured method additionally considers structural constraints in the soft class assignment. We conducted a case study on real-world flood mapping datasets to compare the two methods. Results show that structured EM is more robust to class confusion caused by noise and obstacles in features in the context of the flood mapping application.
Index Terms—Semi-supervised learning; Expectation-Maximization (EM); Structured prediction
INTRODUCTION

Semi-supervised learning aims to learn prediction models based on both labeled and unlabeled samples. It is important when training data contains limited labeled samples but abundant unlabeled samples. In real-world spatial prediction problems, input data often contains abundant explanatory features but very limited ground truth. For example, in earth image classification for land cover mapping, a large number of image pixels with spectral features are collected from remote sensing platforms, but only a limited number of pixels are labeled with ground truth land cover classes [1], [2]. The reason is that collecting ground truth by field crews or well-trained image interpreters is both expensive and time consuming.

The topic of semi-supervised learning has been extensively studied in the literature. According to a survey [3], techniques can be categorized into generative models with the EM method, graph-based methods, label propagation, self-training, and co-training, among others. In the EM-based method [4], unknown labels are considered as hidden variables, and both labeled and unlabeled samples are used to estimate the parameters of a generative model of the joint distribution. Graph-based methods assume that samples connected by heavy edges tend to have the same label and thus aim to obtain a smooth label assignment over the graph [5]. In label propagation, the main idea is to propagate the labels of nodes throughout the network and form class communities [6]. Semi-supervised learning in structured output space focuses on learning dependency or internal relationships between classes [7]. In self-training, a classifier is first trained with a small amount of labeled samples and then used to classify unlabeled samples [8]. The most confident predictions on unlabeled samples, together with their predicted class labels, are added to the training set. Co-training assumes that features can be split into two sets and that each set is sufficient to train a good classifier separately [9]. Each classifier then classifies the unlabeled data and teaches the other classifier with its most confident predictions on unlabeled samples [3].

Among all the methods, semi-supervised learning with EM provides clear statistical properties. There are two types of EM algorithms: unstructured EM and structured EM. Unstructured EM assumes that data samples follow an identical and independent distribution and that feature variables are independent [4]. It considers unknown class labels as hidden variables in generative mixture models (e.g., Gaussian mixture models). It then uses the EM algorithm to learn model parameters and infer hidden classes at the same time. Structured EM assumes that either input samples or feature variables within samples follow a structural dependency. Existing work on structured EM can be further categorized into two types: Bayesian structural EM and the Hidden Markov Model (HMM). Bayesian structural EM uses a directed acyclic graph (DAG) to represent the conditional dependency between feature variables within a sample. It uses EM to learn the DAG structure from samples whose feature variables are partially unobserved (treated as hidden variables) [10]. HMM uses a chain [11] or tree [12] to represent the conditional dependency between samples.
It uses EM to learn the model parameters and infer hidden classes at the same time. Structured EM is of particular interest because data in many real-world applications (e.g., earth science, biology, and materials science) often show structural constraints. However, there are limited studies that compare structured and unstructured methods in EM-based semi-supervised learning. It remains unclear what their respective strengths and weaknesses are.

To fill the gap, this paper provides a comparative study between structured methods and unstructured methods in EM-based semi-supervised learning for spatial classification problems [13]. For unstructured methods, we use Gaussian mixture models as an example. For structured methods, we use a recent model called the geographical hidden Markov tree (HMT) [12] as a representative example. We compare their theoretical properties and conduct detailed case studies on real-world datasets. In summary, this paper makes the following contributions:

• We compared the theoretical properties of a representative unstructured method (Gaussian mixture model) and a structured method (HMT) in EM-based semi-supervised learning.
• Through theoretical analysis, we found that both EM-based methods can be considered as a generalization of self-training with soft class assignment of unlabeled samples. The difference is that structured methods additionally consider structural constraints in the soft class assignment.
• We also empirically compared the performance of a representative unstructured EM (Gaussian mixture model) and structured EM (hidden Markov tree) in a case study on real-world flood mapping datasets. Results showed that, in this particular application, unstructured EM without the spatial feature could be impacted by feature noise and obstacles. Adding the spatial feature into unstructured EM could alleviate but not fully resolve the issue. In contrast, structured EM could resolve the issue due to its explicit structural constraint on this particular type of application.
PROBLEM FORMULATION
Suppose we have a dataset $D$ of $N$ data samples, which is composed of an unlabeled subset $D_u$ and a labeled subset $D_l$ ($D = D_l \cup D_u$). $D_u$ contains $N_u$ data samples of inputs without class labels, $D_u = \{\mathbf{x}_n\}_{n=1}^{N_u}$. The labeled subset $D_l$ contains $N_l$ data samples of input-output pairs, $D_l = \{(\mathbf{x}_n, y_n)\}_{n=N_u+1}^{N}$, where $\mathbf{x}_n \in \mathbb{R}^{m \times 1}$ is a vector of $m$ explanatory features (including both non-spatial features and a spatial contextual feature) with each element corresponding to one feature layer, and $y_n \in \{0, 1\}$ is a binary class label.

Explanatory features of unlabeled samples and labeled samples are denoted as $\mathbf{X}_u = [\mathbf{x}_1, ..., \mathbf{x}_{N_u}]^T$ and $\mathbf{X}_l = [\mathbf{x}_{N_u+1}, ..., \mathbf{x}_N]^T$ respectively. The class layers are denoted as $\mathbf{Y}_u = [y_1, ..., y_{N_u}]^T$ and $\mathbf{Y}_l = [y_{N_u+1}, ..., y_N]^T$. $\mathbf{Y}_u$ is unknown and is to be predicted. For example, in flood mapping from earth imagery, $\mathbf{x}_n$ can be the spectral band values (and elevation) of each pixel, and $y_n$ is the class label (flood, dry). We also assume $N_u \gg N_l$. The labeled subset $D_l$ and the unlabeled subset $D_u$ are the training and test samples respectively. Note that both $\mathbf{X}_l$ and $\mathbf{X}_u$ are used in training (semi-supervised learning).
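To make the notation concrete, the following minimal sketch (illustrative only; the variable names, sizes, and toy data are assumptions, not from the paper) shows how the labeled subset $D_l$ and the unlabeled subset $D_u$ might be organized as arrays for the flood mapping setting.

```python
# Minimal sketch of the data layout described above (illustrative assumptions only):
# N pixels with m explanatory features each; the first N_u pixels are unlabeled (D_u)
# and the remaining N_l are labeled (D_l).
import numpy as np

rng = np.random.default_rng(0)

N_u, N_l, m = 1000, 100, 4           # e.g., m = R, G, B bands + elevation
X_u = rng.normal(size=(N_u, m))       # unlabeled features (classes unknown)
X_l = rng.normal(size=(N_l, m))       # labeled features
y_l = rng.integers(0, 2, size=N_l)    # binary labels: 0 = dry, 1 = flood

# Semi-supervised learning uses X_l, y_l, and X_u during training;
# the goal is to predict the hidden labels Y_u of the N_u unlabeled pixels.
```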
ALGORITHMS FOR SEMI-SUPERVISED LEARNING

This section compares the theoretical properties of EM-based semi-supervised learning algorithms in unstructured and structured prediction respectively.
The EM-based semi-supervised learning algorithm in unstructured prediction does not consider structural dependency among sample classes and assumes that the sample classes are independent and identically distributed (i.i.d.). The joint distribution of all samples is:

$P(\mathbf{X}, \mathbf{Y}) = P(\mathbf{X}_u, \mathbf{Y}_u, \mathbf{X}_l, \mathbf{Y}_l) = \prod_{n=1}^{N_u} P(\mathbf{x}_n, y_n) \prod_{n=N_u+1}^{N} P(\mathbf{x}_n, y_n)$  (1)

We assume that the sample features follow an i.i.d. Gaussian distribution in each class. The prior class probability $P(y_n = c)$ follows a Bernoulli distribution:

$P(y_n = c) = \pi_c, \quad c \in \{0, 1\}$  (2)

The feature distribution $P(\mathbf{x}_n \mid y_n)$ is:

$P(\mathbf{x}_n \mid y_n = c) \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$  (3)

We denote the entire set of parameters as $\Theta = \{\pi_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c \mid c = 0, 1\}$. The labeled subset $D_l$ is used to initialize the parameters $\Theta$. The EM algorithm then alternates between computing the posterior expectation of the log-likelihood evaluated at the current parameter estimate $\Theta_0$ (E-step) and updating the parameters to maximize this expected log-likelihood (M-step). The unknown classes $\mathbf{Y}_u$ are hidden variables. The posterior expectation of the log-likelihood of all samples is:

$E_{\mathbf{Y}_u \mid \mathbf{X}, \mathbf{Y}_l, \Theta_0} \log P(\mathbf{X}_u, \mathbf{Y}_u, \mathbf{X}_l, \mathbf{Y}_l \mid \Theta) = \sum_{\mathbf{Y}_u} P(\mathbf{Y}_u \mid \mathbf{X}, \mathbf{Y}_l, \Theta_0) \Big[ \sum_{n=1}^{N_u} \log P(\mathbf{x}_n, y_n \mid \Theta) + \sum_{n=N_u+1}^{N} \log P(\mathbf{x}_n, y_n \mid \Theta) \Big] = \sum_{n=1}^{N_u} \sum_{y_n} P(y_n \mid \mathbf{x}_n, \Theta_0) \log P(\mathbf{x}_n, y_n \mid \Theta) + \sum_{n=N_u+1}^{N} \sum_{c} I(y_n = c) \log P(\mathbf{x}_n, y_n = c \mid \Theta)$  (4)

The posterior class distribution for unlabeled samples ($1 \le n \le N_u$) is:

$P(y_n = c \mid \mathbf{x}_n, \Theta_0) = \frac{P(\mathbf{x}_n \mid y_n = c, \Theta_0)\, P(y_n = c)}{\sum_{c'=0}^{1} P(\mathbf{x}_n \mid y_n = c', \Theta_0)\, P(y_n = c')}, \quad c = 0, 1$  (5)

Substituting the above into the posterior expectation of the log-likelihood, we obtain the following formulas to update the parameters that maximize the posterior expectation (M-step). Note that $I(y_n = c) = 1$ if $y_n = c$, and 0 otherwise.

$\pi_c = \frac{\sum_{n=1}^{N_u} P(y_n = c \mid \mathbf{x}_n, \Theta_0) + \sum_{n=N_u+1}^{N} I(y_n = c)}{\sum_{c'=0}^{1} \big[ \sum_{n=1}^{N_u} P(y_n = c' \mid \mathbf{x}_n, \Theta_0) + \sum_{n=N_u+1}^{N} I(y_n = c') \big]}$  (6)

$\boldsymbol{\mu}_c = \frac{\sum_{n=1}^{N_u} P(y_n = c \mid \mathbf{x}_n, \Theta_0)\, \mathbf{x}_n + \sum_{n=N_u+1}^{N} I(y_n = c)\, \mathbf{x}_n}{\sum_{n=1}^{N_u} P(y_n = c \mid \mathbf{x}_n, \Theta_0) + \sum_{n=N_u+1}^{N} I(y_n = c)}$  (7)

$\boldsymbol{\Sigma}_c = \frac{\sum_{n=1}^{N_u} P(y_n = c \mid \mathbf{x}_n, \Theta_0)(\mathbf{x}_n - \boldsymbol{\mu}_c)(\mathbf{x}_n - \boldsymbol{\mu}_c)^T + \sum_{n=N_u+1}^{N} I(y_n = c)(\mathbf{x}_n - \boldsymbol{\mu}_c)(\mathbf{x}_n - \boldsymbol{\mu}_c)^T}{\sum_{n=1}^{N_u} P(y_n = c \mid \mathbf{x}_n, \Theta_0) + \sum_{n=N_u+1}^{N} I(y_n = c)}$  (8)

Class inference: After learning the model parameters, we can infer the hidden class variables by maximizing the log joint probability of the unlabeled data:

$\log P(\mathbf{X}_u, \mathbf{Y}_u) = \log \prod_{n=1}^{N_u} P(\mathbf{x}_n \mid y_n)\, P(y_n) = \sum_{n=1}^{N_u} \log P(\mathbf{x}_n \mid y_n)\, P(y_n)$  (9)

To maximize the total log probability, we can maximize each term in Equation 9. For each sample $n$, we simply choose the class $c$ that gives the higher value of $P(\mathbf{x}_n \mid y_n = c)\, P(y_n = c)$.
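The following is a minimal NumPy sketch of the unstructured semi-supervised EM procedure described by Equations (5)–(9). It is an illustrative reimplementation under assumed data structures, not the authors' Matlab code; the small regularization term added to the covariance is an implementation convenience, not part of the formulas above.

```python
# Semi-supervised EM for a two-class Gaussian mixture (sketch of Eqs. (5)-(9)).
import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_gmm_em(X_u, X_l, y_l, n_iter=20):
    X_all = np.vstack([X_u, X_l])
    m = X_all.shape[1]
    # Initialize Theta = {pi_c, mu_c, Sigma_c} from the labeled subset D_l
    pi = np.array([np.mean(y_l == c) for c in (0, 1)])
    mu = np.array([X_l[y_l == c].mean(axis=0) for c in (0, 1)])
    Sigma = np.array([np.cov(X_l[y_l == c].T) + 1e-6 * np.eye(m) for c in (0, 1)])

    # Hard (indicator) responsibilities I(y_n = c) for labeled samples
    R_l = np.eye(2)[y_l]

    for _ in range(n_iter):
        # E-step (Eq. 5): soft class posteriors for unlabeled samples
        lik = np.column_stack([
            pi[c] * multivariate_normal.pdf(X_u, mean=mu[c], cov=Sigma[c])
            for c in (0, 1)])
        R_u = lik / lik.sum(axis=1, keepdims=True)

        # M-step (Eqs. 6-8): labeled samples contribute with 0/1 weights,
        # unlabeled samples with their posterior probabilities
        R = np.vstack([R_u, R_l])
        Nc = R.sum(axis=0)
        pi = Nc / Nc.sum()
        mu = (R.T @ X_all) / Nc[:, None]
        for c in (0, 1):
            diff = X_all - mu[c]
            Sigma[c] = (R[:, c, None] * diff).T @ diff / Nc[c] + 1e-6 * np.eye(m)

    # Class inference (Eq. 9): choose c maximizing P(x_n | y_n = c) P(y_n = c)
    lik = np.column_stack([
        pi[c] * multivariate_normal.pdf(X_u, mean=mu[c], cov=Sigma[c])
        for c in (0, 1)])
    return lik.argmax(axis=1)
```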
The EM-based semi-supervised learning algorithm in structured prediction assumes a dependency structure between sample classes. We use a spatial classification model called the geographical hidden Markov tree (HMT) [12]. It is a probabilistic graphical model that generalizes the common hidden Markov model (HMM) from a one-dimensional sequence to a partial order tree.

The joint distribution of unlabeled samples' features and classes can be formulated as Equation 10, where $\mathcal{P}_n$ is the set of parent samples of the $n$-th sample in the dependency tree ($\mathcal{P}_n = \emptyset$ for a leaf node), and $y_{k \in \mathcal{P}_n} \equiv \{y_k \mid k \in \mathcal{P}_n\}$ is the set of parent node classes of node $n$.

$P(\mathbf{X}_u, \mathbf{Y}_u) = \prod_{n=1}^{N_u} P(\mathbf{x}_n \mid y_n) \prod_{n=1}^{N_u} P(y_n \mid y_{k \in \mathcal{P}_n})$  (10)

Similar to EM for unstructured prediction, the model assumes that the features in each class follow an i.i.d. Gaussian distribution:

$P(\mathbf{x}_n \mid y_n = c) \sim \mathcal{N}(\boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c)$  (11)

The class transition probability follows the partial order flow dependency constraint [12], as shown in Table 1.

TABLE 1: Class transition probability and prior probability

P(y_n | y_{P_n})   | y_{P_n} = 0 | y_{P_n} = 1
y_n = 0            | 1           | 1 − ρ
y_n = 1            | 0           | ρ

P(y_n) (leaf nodes): y_n = 0: π_0; y_n = 1: π_1

We denote the entire set of parameters as $\Theta = \{\rho, \pi_c, \boldsymbol{\mu}_c, \boldsymbol{\Sigma}_c \mid c = 0, 1\}$. The posterior expectation of the log-likelihood of unlabeled samples (E-step) is:

$E_{\mathbf{Y}_u \mid \mathbf{X}_u, \Theta_0} \log P(\mathbf{X}_u, \mathbf{Y}_u \mid \Theta) = \sum_{n=1}^{N_u} \sum_{y_n} P(y_n \mid \mathbf{X}, \Theta_0) \log P(\mathbf{x}_n \mid y_n, \Theta) + \sum_{n=1}^{N_u} \sum_{y_n, y_{k \in \mathcal{P}_n}} P(y_n, y_{k \in \mathcal{P}_n} \mid \mathbf{X}, \Theta_0) \log P(y_n \mid y_{k \in \mathcal{P}_n}, \Theta)$  (12)

After computing the marginal class posterior distribution through forward and backward message propagation, we obtain the parameter update formulas by maximizing the posterior expectation of the log-likelihood (M-step):

$\rho = \frac{\sum_{n \mid \mathcal{P}_n \neq \emptyset} \sum_{y_n} \sum_{y_{\mathcal{P}_n}} y_{\mathcal{P}_n} (1 - y_n)\, P(y_n, y_{\mathcal{P}_n} \mid \mathbf{X}, \Theta_0)}{\sum_{n \mid \mathcal{P}_n \neq \emptyset} \sum_{y_n} \sum_{y_{\mathcal{P}_n}} y_{\mathcal{P}_n}\, P(y_n, y_{\mathcal{P}_n} \mid \mathbf{X}, \Theta_0)}$  (13)

$\pi_1 = \frac{\sum_{n \mid \mathcal{P}_n = \emptyset} \sum_{y_n} y_n\, P(y_n \mid \mathbf{X}, \Theta_0)}{\sum_{n \mid \mathcal{P}_n = \emptyset} \sum_{y_n} P(y_n \mid \mathbf{X}, \Theta_0)}, \quad \pi_0 = 1 - \pi_1$  (14)

$\boldsymbol{\mu}_c = \frac{\sum_n \mathbf{x}_n\, P(y_n = c \mid \mathbf{X}, \Theta_0)}{\sum_n P(y_n = c \mid \mathbf{X}, \Theta_0)}, \quad c = 0, 1$  (15)

$\boldsymbol{\Sigma}_c = \frac{\sum_n (\mathbf{x}_n - \boldsymbol{\mu}_c)(\mathbf{x}_n - \boldsymbol{\mu}_c)^T\, P(y_n = c \mid \mathbf{X}, \Theta_0)}{\sum_n P(y_n = c \mid \mathbf{X}, \Theta_0)}, \quad c = 0, 1$  (16)

Class inference: After learning the model parameters, we can infer the hidden class variables by maximizing the overall joint probability:

$\log P(\mathbf{X}, \mathbf{Y}) = \sum_{n=1}^{N_u} \log P(\mathbf{x}_n \mid y_n) + \sum_{n=1}^{N_u} \log P(y_n \mid y_{k \in \mathcal{P}_n})$  (17)

A naive approach that enumerates all combinations of class assignments is infeasible due to the exponential cost. We use a dynamic programming based method called max-sum [14].
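For illustration, the following sketch shows only the structured M-step of Equations (13)–(16). It assumes the E-step forward-backward message propagation on the dependency tree has already produced the marginal posteriors P(y_n = c | X) and the pairwise posteriors P(y_n, y_{P_n} | X); computing these posteriors (and the max-sum inference) is not shown. It is not the authors' C++ implementation, and the array layout is an assumption.

```python
# Structured M-step of Eqs. (13)-(16), given posteriors from message propagation.
import numpy as np

def hmt_m_step(X_u, marg, pair, is_leaf):
    """
    X_u     : (N_u, m) features of unlabeled samples
    marg    : (N_u, 2) marginal posteriors, marg[n, c] = P(y_n = c | X)
    pair    : (N_u, 2, 2) pairwise posteriors for non-leaf nodes,
              pair[n, a, b] = P(y_n = a, y_{P_n} = b | X) (ignored for leaves)
    is_leaf : (N_u,) boolean mask, True where P_n is empty (leaf node)
    """
    non_leaf = ~is_leaf
    # Eq. (13): transition parameter rho, restricted to nodes with a parent
    num = pair[non_leaf, 0, 1].sum()      # terms with y_{P_n} = 1 and y_n = 0, as written in Eq. (13)
    den = pair[non_leaf, :, 1].sum()      # terms with y_{P_n} = 1
    rho = num / den

    # Eq. (14): prior of class 1 on leaf nodes
    pi1 = marg[is_leaf, 1].sum() / marg[is_leaf].sum()

    # Eqs. (15)-(16): posterior-weighted mean and covariance per class
    mu, Sigma = [], []
    for c in (0, 1):
        w = marg[:, c]
        mu_c = (w[:, None] * X_u).sum(axis=0) / w.sum()
        diff = X_u - mu_c
        Sigma_c = (w[:, None] * diff).T @ diff / w.sum()
        mu.append(mu_c)
        Sigma.append(Sigma_c)
    return rho, pi1, np.array(mu), np.array(Sigma)
```

The Gaussian updates (15)–(16) mirror the unstructured case, except that every sample is unlabeled and is weighted by a posterior that already reflects the tree structure.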
TABLE 2: Comparison between unstructured EM (i.i.d.) and structured EM

                       | Gaussian mixture model                | Hidden Markov tree
Parameters             | π_c, µ_c, Σ_c                         | ρ, π_c, µ_c, Σ_c
Posterior probability  | p(y_n | x_n), from the Bayes theorem   | p(y_n | X), from message propagation
Feature probability    | p(x_n | y_n) ~ N(µ_c, Σ_c)             | p(x_n | y_n) ~ N(µ_c, Σ_c)
Class probability      | p(y_n = c) = π_c                       | p(y_n = c) = π_c for leaf nodes, p(y_n | y_{k∈P_n}) for non-leaf nodes

Table 2 summarizes the comparison of theoretical properties between the two EM methods. From the expressions of the joint probabilities, both methods assume that the features in each class follow an i.i.d. Gaussian distribution. The difference lies in the class prior probability. For the unstructured EM method, the class prior probability follows an i.i.d. Bernoulli distribution $p(y_n) \sim B(1, \pi)$, while for the structured EM method, sample classes follow a dependency structure, which can be expressed by the class transition probability $P(y_n \mid y_{k \in \mathcal{P}_n})$ [12]. Moreover, both methods have similar formulas for the parameter update, where the sample mean and covariance are reweighted by each sample's class posterior probability. The difference lies in the way they compute the class posterior probability: for the unstructured model it comes from the Bayes theorem, while for the structured model it comes from message propagation, considering the class dependency structure [12], [15], [16], [17].

When analyzing the theoretical properties of the two EM algorithms, we found that the EM algorithms can be considered as a generalization of self-training [6]. In self-training, a classifier is first trained with a small amount of labeled data. The classifier is then used to classify the unlabeled data. The most confident unlabeled points, together with their predicted labels, are added to the training set in the next iteration [3]. In contrast, the EM algorithm first uses labeled samples to initialize the model parameters, and then estimates the class posterior probability of each unlabeled sample. In the next iteration, it uses the class posterior probability as a weight to re-estimate the model parameters (Equations 13, 14, 15, and 16). In summary, both self-training and the EM algorithm use the labeled data to iteratively learn model parameters. The difference is that self-training makes a hard class assignment of unlabeled samples to retrain the model across iterations, while the EM algorithm uses the class posterior probability to make a soft class assignment for each unlabeled sample, as illustrated by the sketch below.
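For contrast with the EM sketches above, here is a minimal self-training loop with hard pseudo-labels. The names base_fit, base_predict_proba, and the confidence threshold are hypothetical placeholders for any probabilistic classifier; they are not methods defined in the paper.

```python
# Self-training with hard class assignment (illustrative sketch, not from the paper).
import numpy as np

def self_training(X_l, y_l, X_u, base_fit, base_predict_proba, conf=0.95, n_rounds=10):
    X_train, y_train = X_l.copy(), y_l.copy()
    remaining = X_u.copy()
    model = base_fit(X_train, y_train)
    for _ in range(n_rounds):
        if len(remaining) == 0:
            break
        model = base_fit(X_train, y_train)
        proba = base_predict_proba(model, remaining)      # (N_remaining, 2)
        conf_mask = proba.max(axis=1) >= conf             # most confident samples
        if not conf_mask.any():
            break
        # Hard assignment: pseudo-labels are added to the training set as if true labels
        X_train = np.vstack([X_train, remaining[conf_mask]])
        y_train = np.concatenate([y_train, proba[conf_mask].argmax(axis=1)])
        remaining = remaining[~conf_mask]
    return model
```

Unlike this loop, the EM updates never commit to a label: every unlabeled sample re-enters the M-step weighted by its current posterior probability.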
It is important to note that the local optimum problem exists for both approaches. If the class estimation of unlabeled data is misleading, it may further hurt learning in later iterations. This potential problem can be alleviated by a good initial parameter estimation from representative training samples. The labeled samples are used to initialize some model parameters, i.e., the mean vectors $\boldsymbol{\mu}_c$ and covariance matrices $\boldsymbol{\Sigma}_c$ of the features of samples in each class. These parameters could have been initialized randomly without labeled samples (the unsupervised setting of the Gaussian mixture model, also called EM clustering, as well as the unsupervised setting of hidden Markov models). However, randomly initialized mean and covariance parameters may not converge to good values after EM iterations, particularly when the feature clusters of samples in the two classes are not well separated from each other (as is the case in our flood mapping application), as shown by the example in Figure 1(b). In this situation, labeled samples in each class can be used to estimate its corresponding mean vector and covariance matrix more accurately, as shown by Figure 1(c).

Fig. 1: Illustration of the importance of labeled samples in Gaussian mixture models (red and green are two classes)
TABLE 3: Summary of unstructured EM and structured EM in unsupervised learning and semi-supervised learning

Unstructured EM
• Samples are identical and independent.
• Feature variables are independent.
• Unknown classes as hidden variables.

Structured EM
• Either samples or feature variables follow structural dependency.
• Unknown classes and missing features are hidden variables.

Unsupervised Learning
• Randomly initialize model parameters.
• Update model parameters without labels.
Examples: generative mixture models without labels (EM clustering) [11]; Bayesian structural model [8].

Semi-supervised Learning
• Initialize model parameters with labeled samples.
• Update model parameters with or without labeled samples.
Examples: generative mixture models with partial labels [1]; hidden Markov model (HMM) with extra labeled samples [10].
Table 3 compares unstructured EM versus structured EM in both unsupervised and semi-supervised settings from a broader perspective. Both unstructured and structured EM learn model parameters in the presence of missing variables (e.g., hidden class labels or missing feature observations). The difference is that structured EM incorporates structural dependency between samples or between feature variables within a sample. In the unsupervised setting, class labels are either fully unknown (e.g., EM clustering) or not of relevance (e.g., feature dependency learning in Bayesian structural EM). Thus, initialization and update of model parameters do not rely on class labels. In the semi-supervised setting, class labels are partially available. These labels can be used in parameter initialization and potentially in parameter update as well. Specifically, in Gaussian mixture models with partial labels, both labeled and unlabeled samples are used in parameter initialization as well as class inference. In hidden Markov models, if extra labeled samples are available, these labels can be used to provide a reasonable initialization of some model parameters (e.g., the mean and covariance matrix in each class). After this, the model parameters are iteratively updated based on the features and dependency structure of the test samples. The process belongs to transductive learning since a model is learned for a specific structure across test samples. It is worth noting that HMMs can also be unsupervised, with randomly initialized parameters and without class labels, but the converged parameters in this case may be ineffective in discriminating the two classes on test samples (see [12]).
EVALUATION
In this section, we compare the performance of unstructured EM prediction with structured EM prediction through case studies on real-world datasets. Our goal is to gain insights on how well different EM methods can handle class confusion due to noise and obstacles in features. Moreover, we also compare the EM algorithms with other baseline methods. We chose the gradient boosted model and random forest as baselines because these models have well-tested source code and have also shown superior performance over other models in the literature. The candidate classification methods are listed below.

• Unstructured EM (Gaussian mixture model) w/o elevation feature: We implemented our code in Matlab.
• Unstructured EM (Gaussian mixture model) w/ elevation feature: We implemented our code in Matlab.
• Structured EM (HMT): We implemented the HMT source code in C++.
• Gradient Boosted Model (GBM): We used the GBM in the R gbm package on raw features together with the elevation feature.
• Random Forest (RF): We used the random forest in the R randomForest package on raw features together with the elevation feature.
Hyperparameter:
For unstructured EM, the hyper-parameters include the parameter convergence threshold and the cutoff threshold to decide the positive and negative classes. For structured EM, the hyper-parameters include the parameter convergence threshold and the initial values of $\rho$ (class transition probability) and $\pi$ (class prior probability). The parameter convergence threshold was set to $10^{-5}$, and the cutoff threshold was set to 0.5. We set the initial values of $\rho$ and $\pi$ based on an earlier sensitivity study (see [12]). For random forest, the hyper-parameters include the number of trees, the number of variables randomly sampled as candidates at each split, and the minimum size of terminal nodes. For the number of trees, we tried values of 300, 350, 400, 450, and 500. For the number of variables per split, we tried values of 1, 2, and 3. For the minimum node size, we tried values of 5, 10, 20, and 40. When choosing the optimal value for one parameter, we kept the other parameters constant. The optimal values are 350 trees, 2 variables per split, and a minimum node size of 10. For the gradient boosted model, the hyper-parameters include the number of trees, the maximum depth of each tree, and the shrinkage parameter that is used to reduce the impact of each additional fitted tree. For the number of trees, we tried values of 1000, 1500, 2000, 2500, and 3000. For the maximum depth, we tried values of 1, 2, 3, and 4. For the shrinkage, we tried values of 0.1, 0.01, and 0.001. The optimal settings are 2000 trees and a maximum depth of 1, with the shrinkage chosen from the candidate values above. A sketch of this coordinate-wise search is given below.
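The sketch below illustrates the coordinate-wise search (varying one hyper-parameter while keeping the others fixed) using scikit-learn's RandomForestClassifier as a stand-in for the R randomForest package; the validation split and macro-F1 scoring are assumptions, not details given in the paper.

```python
# Coordinate-wise hyper-parameter search (illustrative sketch with assumed data splits).
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def coordinate_search(X_train, y_train, X_val, y_val):
    # Candidate values from the text: number of trees, variables per split, node size
    grids = {"n_estimators": [300, 350, 400, 450, 500],
             "max_features": [1, 2, 3],
             "min_samples_leaf": [5, 10, 20, 40]}
    best = {"n_estimators": 300, "max_features": 1, "min_samples_leaf": 5}
    for name, values in grids.items():
        scores = []
        for v in values:
            params = dict(best, **{name: v})   # vary one parameter, fix the others
            model = RandomForestClassifier(**params, random_state=0).fit(X_train, y_train)
            scores.append(f1_score(y_val, model.predict(X_val), average="macro"))
        best[name] = values[scores.index(max(scores))]
    return best
```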
Dataset description: We used two flood mapping datasets, from the Hurricane Harvey floods in Texas in 2017 and from Hurricane Matthew in North Carolina in 2016, respectively. Non-spatial explanatory features include the red, green, and blue bands in aerial imagery from PlanetLab Inc. and the NOAA National Geodetic Survey [18], respectively. The spatial contextual feature was a digital elevation map from the Texas natural resource management department and the University of North Carolina Libraries [19], respectively. All data were resampled to a 2 meter by 2 meter resolution. Figure 3 shows the input features of the Harvey dataset, including the non-spatial features (RGB bands) in Figure 3(a) and the spatial contextual feature (elevation) in Figure 3(b). From the images, we can see class confusion due to noise and obstacles in the non-spatial features (there are pixels with tree colors in both flood and dry areas). Due to the space limit, we put the results on the second dataset in the Appendix.

Fig. 2: Training and test polygons
Training and test dataset split:
We used simple validation. The separation of training and test sets is shown in Figure 2. We had a test region (highlighted by the black rectangle) with labeled polygons in both classes. The training region, with training polygons in both classes, was outside the test region. In the experiment, we randomly selected 10,000 pixels from the training polygons (5,000 flood and 5,000 dry) and 103,374 pixels from the test polygons (43,972 dry and 59,402 flood).
Parameter iteration and convergence:
Our convergence threshold was set to $10^{-5}$. Figure 4 shows the iterations of $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ ($c = 0$ for the dry class, $c = 1$ for the flood class) in unstructured EM without the spatial contextual feature (elevation). For $\boldsymbol{\Sigma}_c$, we only plotted the diagonal elements (i.e., the variance of each feature) and omitted the covariance values due to the space limit. From the results, we can see that the converged mean values of the two classes are well separated, with the flood-class mean $\boldsymbol{\mu}_1$ converging to higher values than the dry-class mean $\boldsymbol{\mu}_0$ (the converged ranges are shown in Figure 4(a)). This is consistent with Figure 3(a), since flood areas have a lighter color than dry areas. Note that the range of values in the red, blue, and green bands is bigger than 256 due to a different imagery data type. We can also observe that the variance of the flood class $\boldsymbol{\Sigma}_1$ converged to a lower range of values. The reason is that unstructured EM without the spatial contextual feature will re-group samples with class confusion in their feature values (e.g., tree pixels with the same color in both flood and dry areas) into the same class based on the class posterior (Figure 7), which in turn will influence the update of the parameters in each class.

Fig. 3: RGB feature and spatial elevation feature ((a) high-resolution satellite imagery in NC, (b) digital elevation)

Fig. 4: Parameter iterations and convergence for unstructured EM without the elevation feature ((a) iteration of parameter µ, (b) iteration of parameter Σ)

For unstructured EM with the spatial contextual feature (elevation), the parameter iteration and convergence are shown in Figure 5. Note that there is one more dimension for elevation in the plots. We can see that the converged mean values of the two classes are less separated compared with Figure 4. Another dramatic change is in the variance of the flood class $\boldsymbol{\Sigma}_1$, which increases to a much larger range of values. This can be explained by the marginal class posterior probabilities in Figure 8, where samples with high posterior probability in the flood class (yellow color) grow to include more tree pixels in the flood areas. Because of this, the variance of the flood class grows bigger and the mean of the flood class drops (tree pixels in water are darker than exposed flood water).

Fig. 5: Parameter iterations and convergence for unstructured EM with the elevation feature ((a) iteration of parameter µ, (b) iteration of parameter Σ)

For structured EM, the converged parameter values are in between those of the above two cases, but more similar to unstructured EM with the spatial contextual feature (elevation). The results can be explained by the posterior class probability in Figure 9, where samples with high posterior probability in the flood class (yellowish pixels) are moderately in between the previous two maps. The main difference is that $\boldsymbol{\mu}_c$ and $\boldsymbol{\Sigma}_c$ converge in fewer iterations (only 2) compared with unstructured EM with elevation.

Fig. 6: Parameter iterations and convergence for the structured EM algorithm ((a) iteration of parameter µ, (b) iteration of parameter Σ)

Fig. 7: Posterior probability of unstructured EM without elevation ((a) satellite imagery, (b)-(f) iterations 0, 5, 10, 15, 20)
Fig. 8: Posterior probability of unstructured EM with elevation ((a) satellite imagery, (b)-(f) iterations 0, 5, 10, 15, 20)

Fig. 9: Posterior probability of structured EM ((a) satellite imagery, (b)-(f) iterations 0, 5, 10, 15, 20)
Figures 7, 8, and 9 show the iterations of the posterior probabilities of samples in the flood class for unstructured EM without and with the elevation feature, as well as structured EM, respectively. The sample class posterior probability is important in understanding how the EM algorithm works because it shows how much a sample contributes to the parameter update of each class in the next iteration (e.g., Equations 6, 7, 8 and Equations 13, 14, 15, 16). For unstructured EM, the posterior class probability is estimated based on the Bayes theorem with an i.i.d. assumption. Thus, samples with class confusion in feature values will be estimated towards the same class. This explains why pixels corresponding to trees in the flood area have a low posterior probability in the flood class in Figure 7 (the same as the trees in dry areas). In contrast, unstructured EM with the elevation feature can separate confused tree pixels in flood and dry areas since their elevation values differ. This is shown in Figure 8(f). Finally, for structured EM, the posterior class probability is estimated based on both the local class likelihood from non-spatial features (RGB colors) and the class dependency structure. This explains why the class posterior in Figure 9 is moderate compared with Figure 7 and Figure 8. It is worth noting that the posterior class probabilities of individual pixels are not the same as the final class predictions in structured EM (the final class prediction could be smoother due to jointly predicting all classes with the dependency structure).
TABLE 4: Comparison on the Harvey, Texas flood data

Classifier                 | Class | Prec. | Recall | F    | Avg. F
GBM                        | Dry   | 0.88  | 0.98   | 0.93 | 0.87
                           | Flood | 0.96  | 0.70   | 0.81 |
RF                         | Dry   | 0.70  | 0.99   | 0.82 | 0.82
                           | Flood | 0.99  | 0.69   | 0.81 |
Unstructured EM w/o elev.  | Dry   | 0.68  | 0.99   | 0.81 | 0.75
                           | Flood | 0.99  | 0.53   | 0.70 |
Unstructured EM w/ elev.   | Dry   | 0.99  | 0.99   | 0.99 | 0.99
                           | Flood | 0.99  | 0.99   | 0.99 |
Structured EM              | Dry   | 0.99  | 0.99   | 0.99 | 0.99
                           | Flood | 0.99  | 0.99   | 0.99 |

Table 4 and Figure 11 show the final classification results of the three methods. We can see that unstructured EM without the spatial contextual feature (elevation) performed poorly, with class confusion. Unstructured EM with the spatial contextual feature performs significantly better with less class confusion, but it also produces some salt-and-pepper noise errors since the spatial feature is used under an i.i.d. assumption. In contrast, the structured EM method can both address the class confusion issue and produce a smooth class map due to explicitly considering the spatial dependency structure.

Moreover, we plot the ROC curves and calculate the AUC of the ROC curve for the unstructured EM, GBM, and RF classifiers. As shown in Figure 10 and Table 5, unstructured EM with elevation gives the best ROC curve with an AUC of 0.996, while unstructured EM without elevation shows a less significant result with an AUC of 0.786. This means the spatial feature plays an important role in the unstructured EM classifier. The two baseline methods, random forest and GBM, show better results than the unstructured EM without elevation classifier.
Fig. 10: ROC curves on the Harvey, Texas data (unstructured EM w/o elev., unstructured EM w/ elev., GBM, RF; x-axis: false positive rate, y-axis: true positive rate)

TABLE 5: Comparison on AUC of the ROC curve
Classifier                 | AUC
GBM                        | 0.975
RF                         | 0.946
Unstructured EM w/o elev.  | 0.786
Unstructured EM w/ elev.   | 0.996
TABLE 6: Comparison on the total number of salt-and-pepper noise pixels (the total number of pixels is 19,167,008)
Classifier                 | Number of salt-and-pepper noise pixels
GBM                        | 62,846
RF                         | 69,354
Unstructured EM w/o elev.  | 77,625
Unstructured EM w/ elev.   | 23,803
Structured EM              | 3,740
Fig. 11: Comparison of class prediction ((a) unstructured EM without elevation, (b) unstructured EM with elevation, (c) structured EM, (d) satellite imagery)

Fig. 12: Sensitivity to the ratio of labeled samples (x-axis: ratio of labeled data on a log scale, y-axis: average F-score)

We used a spatial autocorrelation statistic called the Gamma index [20] to quantify the salt-and-pepper noise level. The Gamma index measures the similarity between the attribute values of a location and those of its neighbors. It is defined as

$\Gamma_i = \frac{\sum_j W_{i,j} I_i I_j}{\sum_j W_{i,j}}$  (18)

where $i$ and $j$ are locations, $W_{i,j}$ is 1 if $j$ is a neighbor of $i$ and 0 otherwise, and $I_i$ is 1 if pixel $i$ is a flood pixel and -1 otherwise. We define a salt-and-pepper noise pixel as a pixel that has a negative local Gamma index ($\Gamma_i < 0$). We calculated the total number of salt-and-pepper noise pixels across all locations in the predicted class map. The total numbers of salt-and-pepper noise pixels of the five classifiers are summarized in Table 6. We can see that RF, GBM, and unstructured EM without elevation have the highest salt-and-pepper noise levels (above 60,000 pixels). Unstructured EM with elevation is much better, with only 23,803 salt-and-pepper noise pixels. The structured EM method has the lowest number of salt-and-pepper noise pixels, i.e., 3,740, about one order of magnitude lower than the other classifiers.

We also analyzed the sensitivity of the three candidate EM methods to the ratio of training labels. Specifically, we increased the ratio of labeled samples over several orders of magnitude; the results are summarized in Figure 12. We can see that as the ratio of labeled samples increases, the F-scores of all three methods first improve and then converge to an optimal value. Specifically, unstructured EM without elevation (represented by the Gaussian mixture model) achieved the lowest peak F-score (around 0.75), while unstructured EM with the elevation feature and the structured EM model (represented by HMT) have a much better peak F-score (around 0.99). We also observe that when the ratio of labeled samples is very small, the structured EM model has poor performance (with an F-score below 0.65). This is probably because more labeled samples are needed to initialize good parameters for our representative structured EM model (HMT).
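As an illustration of how the Gamma-index-based noise count in Equation (18) and Table 6 can be computed, here is a minimal NumPy sketch; the 4-neighborhood definition of W_{i,j} and the boundary handling are assumptions, since the paper does not specify them.

```python
# Gamma-index salt-and-pepper noise count (sketch of Eq. (18) under assumed neighborhood).
import numpy as np

def salt_and_pepper_count(pred):
    """pred: 2-D array of predicted classes (1 = flood, 0 = dry)."""
    I = np.where(pred == 1, 1, -1)        # I_i = 1 for flood, -1 otherwise
    num = np.zeros_like(I, dtype=float)   # sum_j W_ij * I_i * I_j
    den = np.zeros_like(I, dtype=float)   # sum_j W_ij
    # 4-neighborhood; np.roll wraps at the image boundary (a small approximation)
    for shift in ((0, 1), (0, -1), (1, 0), (-1, 0)):
        J = np.roll(I, shift, axis=(0, 1))
        num += I * J
        den += 1
    gamma = num / den
    # A salt-and-pepper noise pixel has a negative local Gamma index
    return int((gamma < 0).sum())
```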
CONCLUSIONS AND FUTURE WORK

This paper presents a comparative study between unstructured and structured EM in semi-supervised learning. We compare the two methods in their theoretical properties and find that EM-based semi-supervised learning can be considered as a generalization of the self-training method with soft class assignment on unlabeled samples. A case study on flood mapping datasets shows that the unstructured EM method can be significantly impacted by noise and obstacles in sample features. Adding a spatial contextual feature to the unstructured EM method can reduce the impact of noise and obstacles but still produces salt-and-pepper noise errors. Finally, structured EM can better address the issue compared with the other methods in this flood mapping application. In future work, we plan to conduct comparison studies on more types of datasets and applications to see if the conclusions hold in general.

ACKNOWLEDGEMENT
This material is based upon work supported by the NSF under Grant No. IIS-1850546, IIS-2008973, and CNS-1951974, and by the University Corporation for Atmospheric Research (UCAR).

REFERENCES

[1] Z. Jiang and S. Shekhar, "Spatial big data science," Schweiz: Springer International Publishing AG, 2017.
[2] S. Shekhar, Z. Jiang, R. Y. Ali, E. Eftelioglu, X. Tang, V. Gunturi, and X. Zhou, "Spatiotemporal data mining: a computational perspective," ISPRS International Journal of Geo-Information, vol. 4, no. 4, pp. 2306–2338, 2015.
[3] X. Zhu, "Semi-supervised learning literature survey," 2005.
[4] Z. Ghahramani and M. Jordan, "Learning from incomplete data (Tech. Rep. No. AIM-1509)," 1994.
[5] G. Camps-Valls, T. V. B. Marsheva, and D. Zhou, "Semi-supervised graph-based hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 45, no. 10, pp. 3044–3054, 2007.
[6] X. Zhu and Z. Ghahramani, "Learning from labeled and unlabeled data with label propagation," 2002.
[7] U. Brefeld and T. Scheffer, "Semi-supervised learning for structured output variables," in Proceedings of the 23rd International Conference on Machine Learning. ACM, 2006, pp. 145–152.
[8] I. Dópido, J. Li, P. R. Marpu, A. Plaza, J. M. B. Dias, and J. A. Benediktsson, "Semisupervised self-learning for hyperspectral image classification," IEEE Transactions on Geoscience and Remote Sensing, vol. 51, no. 7, pp. 4032–4044, 2013.
[9] Y. Hong and W. Zhu, "Spatial co-training for semi-supervised image classification," Pattern Recognition Letters, vol. 63, pp. 59–65, 2015.
[10] N. Friedman, "The Bayesian structural EM algorithm," arXiv preprint arXiv:1301.7373, 2013.
[11] C. Damian, Z. Eksi, and R. Frey, "EM algorithm for Markov chains observed via Gaussian noise and point process information: Theory and case studies," Statistics & Risk Modeling, vol. 35, no. 1-2, pp. 51–72, 2018.
[12] M. Xie, Z. Jiang, and A. M. Sainju, "Geographical hidden Markov tree for flood extent mapping," in Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, ser. KDD '18. ACM, 2018, pp. 2545–2554.
[13] Z. Jiang, "A survey on spatial prediction methods," IEEE Transactions on Knowledge and Data Engineering, 2018.
[14] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, no. 2, pp. 257–286, 1989.
[15] Z. Jiang, M. Xie, and A. M. Sainju, "Geographical hidden Markov tree," IEEE Transactions on Knowledge and Data Engineering, 2019.
[16] Z. Jiang and A. M. Sainju, "Hidden Markov contour tree: A spatial structured model for hydrological applications," in Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 804–813.
[17] A. M. Sainju, W. He, and Z. Jiang, "A hidden Markov contour tree model for spatial structured prediction," IEEE Transactions on Knowledge and Data Engineering.
[20] Geographical Analysis, vol. 27, no. 2, pp. 93–115, 1995.
Wenchong He is a Ph.D. student in the Department of Computer Science at the University of Alabama. He received his B.S. degree from the University of Science and Technology of China (USTC) in 2017 and his Master's degree from the College of William & Mary in 2019. His research interests include machine learning, data mining, and deep learning.