A Multiscale Graph Convolutional Network for Change Detection in Homogeneous and Heterogeneous Remote Sensing Images
Junzheng Wu, Biao Li, Yao Qin, Weiping Ni, Han Zhang, Yuli Sun
Abstract—Change detection (CD) in remote sensing images has been an ever-expanding area of research. To date, although many methods have been proposed using various techniques, accurately identifying changes remains a great challenge, especially in high-resolution or heterogeneous situations, owing to the difficulty of effectively modeling the features of ground objects with different patterns. In this paper, a novel CD method based on the graph convolutional network (GCN) and a multiscale object-based technique is proposed for both homogeneous and heterogeneous images. First, object-wise high-level features are obtained through a pre-trained U-net and multiscale segmentations. Treating each parcel as a node, graph representations are formed and then fed into the proposed multiscale graph convolutional network, with each channel corresponding to one scale. The multiscale GCN propagates label information from a small number of labeled nodes to the unlabeled ones. Further, to comprehensively incorporate the information from the output channels of the multiscale GCN, a fusion strategy is designed using the father-child relationships between scales. Extensive experiments on optical, SAR and heterogeneous optical/SAR data sets demonstrate that the proposed method outperforms several state-of-the-art methods in both qualitative and quantitative evaluations. In addition, the influences of several factors are discussed.
Index Terms—Change detection, graph convolutional network, multiscale segmentation, semisupervised, remote sensing images.

I. INTRODUCTION
Change detection (CD), which aims at identifying changes of regions or phenomena in the same geographical area at different times, has been an attractive research topic [1]. It has been extensively applied in various fields, such as resources investigation [2], urban growth
monitoring [3] and disaster assessment [4]. Up to now, optical and synthetic aperture radar (SAR) images have been two of the most common types of remote sensing (RS) data in CD tasks [5]. To be specific, the great majority of remote sensing images are acquired from optical sensors, and they can represent abundant information of land cover (such as texture, structure and color).

J. Wu, B. Li and Y. Sun are with the Key Laboratory of ATR, College of Electronic Science, National University of Defense Technology, Changsha, 410073, China (e-mail: [email protected]; [email protected]; [email protected]).
Therefore, CD with optical images has long been of interest [6] and is relatively mature. On the other hand, the property that SAR images can be acquired in all-weather, day-and-night conditions [7] makes them valuable in certain applications, and CD with SAR images has also been an active area of research. CD with images collected by the same kind of sensor, e.g., SAR or optical, is referred to as homogeneous CD, which is the mainstream of the current CD field. However, CD with heterogeneous images collected by different types of sensors has great practical significance in some emergency situations (e.g., earthquake or flood) where rapid mapping of damages is needed. Frequently, only the pre-event optical image can be obtained from archived data, whereas only the post-event SAR image may be available due to adverse atmospheric conditions [8]. Consequently, heterogeneous CD has drawn increasing attention. Generally, heterogeneous CD is particularly challenging because of the distinct feature representations in images acquired by different sensors.

Numerous works have been devoted to both homogeneous and heterogeneous CD tasks in the past several decades. In the case of homogeneous images, many methods are based on the difference between spectral bands or between intensities. For instance, differencing, change vector analysis (CVA) [9], multivariate alteration detection (MAD) [10] and iteratively reweighted MAD (IR-MAD) [11] are frequently used for optical images, while log-ratio (LR) [12] and mean-ratio (MR) [13] are popular in SAR CD tasks. Inspired by the above strategies, plenty of methods have been proposed to generate more reliable difference images (DI). Wu et al. [14] proposed a slow feature analysis (SFA) method which minimized the feature variation of unchanged pixels, so that changed pixels could be highlighted and separated for multispectral images.
Saha et al. [15] developed an extended CVA in an unsupervised scheme, namely deep change vector analysis (DCVA), to obtain robust change vectors that model the spatial
context information for very high resolution (VHR) images. Lv et al. [16] designed an object-oriented key point vector distance to measure the change magnitude between VHR images. Wang et al. [17] constructed the relationships among pixels and their coupled neighbors using hypergraphs, which can capture both the change level of pixels and local consistency, thus enabling the generation of DIs with good separability for SAR images.

As CD can be treated as a classification problem, many researchers have resorted to classification and statistical techniques. Some of them categorized the features extracted from the input image pairs into changed/unchanged classes by taking advantage of supervised classifiers, e.g., support vector machine (SVM) [18], extreme learning machine (ELM) [19] and random forest (RF) [20]. Others assumed that the image intensities or the DI obey a specific statistical distribution [21]-[23]. For example, Zanetti et al. [24] modeled the magnitude of the DI using a Rayleigh-Rice mixture density; via a novel parameter estimation, the mixture model outperformed several empirical models for CD with optical images.

Although fewer than those for homogeneous CD, quite a few methods have emerged for heterogeneous CD tasks. For instance, Mercier et al. [25] transformed one of the two images toward the other using copula theory, which yielded comparable characteristics; then, the Kullback-Leibler (KL) distance was employed to calculate the change indices. Ferraris et al. [26] used a coupled dictionary learning framework to model the two heterogeneous images, which can be represented by a sparse linear combination of atoms belonging to a pair of coupled over-complete dictionaries learnt from the two images.
Wan et al. [27] combined cooperative multi-temporal segmentation and a hierarchical compound classification strategy to overcome the error propagation in classification-based methods. Moreover, multivariate statistical models [28], homogeneous pixel transformation [29], image regression [30] and graph theory [31] were also developed to learn the latent relationships between heterogeneous images.

Recently, deep learning techniques have demonstrated remarkable performance in the image processing field due to their capability of automatically obtaining abstract high-level representations by gradually aggregating low-level features, by which complicated feature engineering can be avoided [32]. Unsurprisingly, various deep learning methods have been employed in CD tasks, including the convolutional neural network (CNN), auto-encoder (AE), recurrent neural network (RNN) and deep belief network (DBN). For example, Lyu et al. [33] adopted an RNN to learn transferable change rules between multi-temporal homogeneous images. Liu et al. [34] proposed a stacked Fisher auto-encoder to extract layer-wise features which were more discriminative in SAR CD. Chen et al. [35] proposed a Siamese convolutional multiple-layers recurrent neural network to extract spatial-spectral features from homogeneous or heterogeneous image patches. Among these deep learning methods, CNN has drawn intensive attention and has been the most popular backbone or feature extractor in CD tasks [36]. The CNN-based CD methods will be discussed in more detail in Section II.A.

Although the aforementioned methods have achieved promising performance in some cases, they generally have at least one of the following weaknesses: (1) Hand-crafted features need to be extracted in most of the difference-based and classification-based methods, which requires much domain-specific knowledge and may be affected by noise and atmospheric conditions.
(2) The deep learning based methods rely heavily on large amounts of annotated training samples, which are typically generated by manual labeling or pre-classification; the former is time-consuming and tedious, while the latter requires highly reliable classification results which can rarely be guaranteed in practice. (3) Many methods are designed for only one type of data; consequently, the performance is unsatisfactory when transferred to other types of data. (4) The current state-of-the-art CNN-based methods cannot fully capture the geometric variations of different object regions because the convolutions are only conducted on regular rectangular regions.

Aiming at addressing or alleviating these shortcomings, a novel CD method combining the multiscale object-based technique and the graph convolutional network (GCN) is proposed in this paper. First, a pre-trained U-net combined with the multiscale segmentation technique is employed to extract features of the input images. Instead of using high-level features of pixels/patches or hand-crafted features of ground objects with varied sizes and shapes, both high-level features and comprehensive contextual information can thereby be incorporated. By treating each object as a graph node with the features extracted previously, the multiscale GCN can be constructed, and a fusion strategy is designed for the outputs of the different GCN channels based on the relationships between the segmented scales. Finally, a binary change map is produced owing to the property that GCN can propagate information from a small number of labeled nodes to the unlabeled ones.

The main contributions of this paper are as follows. 1) We propose a novel multiscale graph convolutional network (MSGCN) for CD which uses multi-channel graph convolutional layers to process the extracted high-level features.
Moreover, a multiscale decision fusion strategy is designed to make full use of the comprehensive information of these channels, which improves the detection accuracy. 2) The designed feature extractor, which adopts a pre-trained U-net integrated with the multiscale segmentation, can automatically obtain object-wise high-level features to construct the nodes of the graphs. 3) The experimental results demonstrate that the proposed method can be effectively applied to both homogeneous and heterogeneous high-resolution RS images with small amounts of labeled samples, making it practical in various CD applications.

The rest of this article is organized as follows. We discuss the related works in Section II. Section III illustrates the proposed method in detail. Section IV provides the details about the data, experimental results and discussions. Finally, the conclusion is presented in Section V.

II. RELATED WORKS
In this section, we provide a brief review of CNN-based CD methods and GCN, as they are related to this article.

A. CNN-based Change Detection
Recently, various CNN-based CD methods have been proposed due to their capability of automatically extracting high-level semantic features, which avoids hand-crafted feature design. For instance, Nemoto et al. [37] intuitively utilized CNNs to classify the images acquired at two periods, respectively, and then the classifications were compared to obtain change information. Lim et al. [38] designed three encoder-decoder structured CNNs to yield change maps using Google Earth images. A fully convolutional Siamese auto-encoder method for CD in UAV images was presented in [39], which can reduce the number of labeled samples required to achieve competitive results. Liu et al. [40] used a pre-trained CNN-based U-net architecture and designed a new loss function to achieve transfer learning for CD tasks among different data sets. Similarly, Zhang and Shi [41] used a CNN to learn deep features and then used a CNN-based transfer learning framework to compose a two-channel network with shared weights to generate a multiscale and multi-depth feature difference map for CD. Several generative adversarial network (GAN) architectures based on CNN units have also been exploited for CD [42]-[44].

Although the existing CNN-based methods have proven effective in many cases, some limitations still exist. To be specific, ground objects in RS images commonly appear at multiple scales and with various shapes, whereas CNN models only conduct the convolution on regular rectangular regions. In other words, the shape information of objects cannot be completely captured by a CNN. Besides, the weights of each convolution kernel are identical when convolving all patches [45]. As a result, the boundaries between changed and unchanged classes may be lost.

B. Graph Convolutional Network
The notion of the graph neural network (GNN) was initially outlined in [46] and further elaborated in [47]-[49]. Wu et al. have made a comprehensive survey on GNNs in [50]. As mentioned above, CNN often fails to analyze spatial vector data because of its regularity requirements on data structures. To handle irregular data such as graphs, new generalizations and definitions of the convolution operation have been rapidly developed over the past few years, which fall under the umbrella of GCN. For instance, Bruna et al. [51] used a spectral convolution to define a multilayer neural network model, which is similar to the classical CNN. Sandryhaila and Moura [52] attempted to redefine the convolution as a polynomial function of the graph adjacency matrix. Kipf and Welling [53] proposed a fast approximate localized convolution and designed a simple layer-wise propagation rule for semi-supervised classification, which makes the GCN model able to encode both graph structure and node features. Since then, increasing extensions and improvements have emerged, such as FastGCN [54] and the graph convolutional auto-encoder using Laplacian smoothing and sharpening (GALA) [55].

With the capability of modeling irregular data structures, GCN has been widely applied to various vision tasks, such as semantic segmentation [56], specific object detection [57] and hyperspectral RS image classification [45]. To the best of our knowledge, GCN has been deployed for CD in only one prior work [58]. However, in [58], several hand-crafted features (mean, maximum, and minimum spectral values and area) of the objects under multiple scales are simply concatenated to form the features of the nodes, and thus the distinction among objects cannot be fully reflected. To be specific, if two objects under a fine scale are both within the area of one object under a coarser scale, these two objects would share a majority of the concatenated features, and thus the discrimination may be weakened.
Besides, although multiscale hand-crafted features are extracted, only the objects under the finest scale are treated as nodes to construct the graph model fed to the GCN, which means that the multiscale information has not been adequately exploited. To cope with these issues, we propose a multiscale GCN that uses object-wise features, extracted by a pre-trained U-net integrated with the multiscale segmentation, as the features of the nodes, and fuses the outputs of the multi-channel networks for CD. As a result, accurate node embeddings and rich multiscale information can cooperate, which facilitates improved detection performance.

III. PROPOSED METHOD
In this section, we elaborate on the details of the proposed method, as shown in Fig. 1. First, the two input images are stacked as one image and then fed into the pre-trained feature extracting network and the multiscale segmentation module simultaneously. After that, for each segmented scale, one graph is constructed using the object-wise features of nodes, obtained by combining the extracted high-level features with the objects under that scale. Then, the multiscale GCN layers are conducted on these graphs, which can cluster the objects potentially belonging to the same class (changed/unchanged) together in the embedding space. Finally, the CD result is produced by fusing the outputs of the multiscale GCN with the designed fusion module.

A. Multiscale Segmentation
Object-based image analysis uses a homogeneous region (namely, an object) as the basic processing unit, which contains sufficient spatial information. It always starts with object segmentation; thus, creating representative objects through segmentation approaches is crucial for feature extraction. In this paper, the fractal net evolution approach (FNEA) [59], which has been demonstrated to be effective [60] in object-based RS image analysis, is adopted for the segmentation task. The objects are obtained by merging regions according to an optimization function, which requires the heterogeneity of the merged object in terms of spectral and shape properties to be lower than a user-defined threshold. Compared with other segmentation approaches such as SLIC and mean-shift, FNEA is adopted for CD in this paper based on the following advantages. First, the hierarchical segmentation strategy allows objects of different sizes to be fully extracted by simply tuning the scale parameter. Second, not only spectral properties but also geometric information are taken into account during segmentation; thus, objects with various shapes can be extracted with relatively high accuracy [27]. In addition, an object under a coarse segmented scale can be obtained by merging several objects under a finer one while the other parameters are kept invariant. This means that father-child relationships exist between segmented scales, and these relationships can be exploited to fuse the multiscale information.

B. Feature Extracting Network
In many vision tasks, feature extraction and selection is a complex but rather pivotal step that requires professional knowledge and experience. Considering that the improved version of the fully convolutional neural network (FCN) called U-net has been effectively exploited in some CD works [61]-[62], and that several open RS datasets for CD tasks have become available to train such networks, such as the ONERA Satellite Change Detection (OSCD) dataset [63] and the SZTAKI AirChange Benchmark set [64], we adopted and modified the original U-net structure to extract pixel-wise high-level features, which avoids manual feature design and incomplete modeling of the features. As shown in Fig. 2, the network is trained in an end-to-end manner with free-size input images. The U-net structure consists of three parts: a contracting (or encoding) path, a bottleneck, and an expanding (or decoding) path. The encoding path, which is a typical convolutional neural network structure, comprises four sets of convolution operations and four down-sampling operations, among which each convolution set adopts two consecutive 3×3 convolution operations. Each set of convolution operations is followed by a down-sampling operation, and the ReLU function is used as the activation function. Between the contracting and expanding paths, the
Fig. 1. DIs generated by (a) IRMAD (LR for SAR), (b) ITPCA (MR for SAR), (c) PCVA (ITPCA for SAR), (d) OCVA (OMR for SAR), (e) MVKL (OCN for SAR) and (f) the proposed approach.
bottleneck is built from two simple convolution layers with a kernel size of 3×3. The decoding path is almost symmetric with the encoding path; it comprises four sets of transposed convolution operations and four up-sampling operations, among which each convolution set uses concatenation with the corresponding cropped feature maps from the contracting path to help the decoder better recover target details. The kernel size of the convolution layers in the decoding path is also 3×3. In addition, the final part of the network uses two 1×1 convolutional layers that gradually convert the feature maps into one map indicating the probability of change for each pixel. The cross-entropy function is adopted as the loss function:

$$\mathrm{loss} = -\sum_{n}\left[y_n \ln a_n + (1 - y_n)\ln(1 - a_n)\right] \tag{1}$$

where $y_n$ is the ground-truth value of the $n$th sample and $a_n$ is the actual output. In this paper, we pre-trained the U-net with the publicly available OSCD dataset [63]. The input images are clipped into patches of size 112×112 in the pre-training process. Once the U-net has been trained, it can be used as the feature extracting network shown in Fig. 1, and the input images can be of free size without any adjustment. It is worth noting that the 128 feature maps (marked by the red box in Fig. 2) of the last convolution set are treated as the output of the feature extracting network and fed to the following graph construction.

C. Graph Construction
Suppose that the stacked input image $I_S$ is segmented by the FNEA approach under $L$ scale parameters ($S_1, \dots, S_L$, ranging from fine to coarse) with the other segmentation parameters fixed. Then, $L$ object sets can be obtained: $\{P_{l,1}, P_{l,2}, \dots, P_{l,N_l}\}$, where $P_{l,k}$ denotes the $k$th object under the scale parameter $S_l$ ($l = 1, 2, \dots, L$), and $N_1, \dots, N_L$ denote the numbers of objects corresponding to the scale parameters $S_1, \dots, S_L$, respectively; naturally, $N_1 > N_2 > \dots > N_L$. Treating each object as one node, $L$ undirected affinity graphs $G_1 = (V_1, E_1), \dots, G_L = (V_L, E_L)$ can be constructed, where $v_{l,i} \in V_l$ denotes each object region, and $e_{l,ij} = (v_{l,i}, v_{l,j}) \in E_l$ represents the spatial relationship between two arbitrary objects under the scale parameter $S_l$.

In this paper, we generate the graph-based representation by building features for each node and representing the edges between nodes using an adjacency matrix under each scale parameter. Through the pre-trained feature extracting network, 128 high-level feature maps $M_1, \dots, M_{128}$ can be extracted, each of which has the same size as the input image ($H \times W$). To build the object-wise feature vector $F_{l,i} \in \mathbb{R}^{128}$ representing the $i$th node under the scale parameter $S_l$, we combine the above feature maps, which represent high-level semantic information, with the segmentation result $\{P_{l,1}, \dots, P_{l,N_l}\}$, which contains abundant spatial information. $F_{l,i}$ is formulated as:

$$F_{l,i} = \left\{ \frac{1}{|P_{l,i}|}\sum_{(j,k)\in P_{l,i}} M_n(j,k) \;\middle|\; n = 1, \dots, 128 \right\} \tag{2}$$

where $|P_{l,i}|$ represents the number of pixels in $P_{l,i}$. Given the $l$th segmentation set $\{P_{l,1}, \dots, P_{l,N_l}\}$, we aim to obtain an adjacency matrix $A_l \in \mathbb{R}^{N_l \times N_l}$ which indicates the interaction of each pair of nodes in $G_l$. The elements of $A_l$ can be calculated as:

$$A_{l,ij} = \begin{cases} e^{-\gamma\, d(i,j)\,\lVert F_{l,i} - F_{l,j} \rVert}, & P_{l,i} \in \mathrm{Nei}(P_{l,j}) \ \text{or} \ P_{l,j} \in \mathrm{Nei}(P_{l,i}) \\ 0, & \text{otherwise} \end{cases} \tag{3}$$

where $\mathrm{Nei}(P_{l,i})$ is the set of neighbors of the object $P_{l,i}$, the parameter $\gamma$ is empirically set to 0.2 in the experiments, and $d(i,j)$ represents the normalized Euclidean distance between the central point of $P_{l,i}$ and that of $P_{l,j}$. It can be seen that if two objects $P_{l,i}$ and $P_{l,j}$ are not adjacent, their value in the adjacency matrix is set to 0. Besides, the diagonal elements are also set to 0. Through these operations, the structure information of neighbors is embedded into the feature information of nodes.

By combining deep feature learning and object-based analysis techniques, the abovementioned graph construction procedure not only takes the spatial and temporal information into account, but also utilizes the high-level semantic features to represent the nodes. This strategy helps to efficiently and fully exploit various kinds of information to achieve better performance. In addition, the interactions among adjacent nodes are modeled with their distances and the similarities of their extracted features, which is instrumental in accurately clustering the nodes in consideration of structure information. The built node features and adjacency matrices represent a compact and effective way to provide relatively comprehensive information as input to the multiscale GCN.

D. Multiscale GCN
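Before turning to the graph convolutions, the graph construction of Section III.C, i.e., the per-object feature averaging of Eq. (2) and the adjacency weighting of Eq. (3), can be illustrated with a minimal pure-Python sketch. All function names and toy dimensions below are illustrative, not taken from the paper's implementation:

```python
import math
from collections import defaultdict

# Eq. (2): object-wise features as the per-object mean of each feature map.
def object_features(label_map, feature_maps):
    sums = defaultdict(lambda: [0.0] * len(feature_maps))
    counts = defaultdict(int)
    for r, row in enumerate(label_map):
        for c, obj in enumerate(row):
            counts[obj] += 1
            for n, fmap in enumerate(feature_maps):
                sums[obj][n] += fmap[r][c]
    return {obj: [s / counts[obj] for s in vec] for obj, vec in sums.items()}

# Eq. (3): adjacency weighted by center distance and feature similarity,
# zero for non-adjacent pairs and on the diagonal.
def adjacency(feats, centers, neighbors, gamma=0.2):
    ids = sorted(feats)
    A = [[0.0] * len(ids) for _ in ids]
    for a, i in enumerate(ids):
        for b, j in enumerate(ids):
            if i != j and j in neighbors[i]:
                d = math.dist(centers[i], centers[j])      # center distance
                diff = math.dist(feats[i], feats[j])       # feature distance
                A[a][b] = math.exp(-gamma * d * diff)
    return A
```

On a 2×2 label map with two objects, the sketch yields a symmetric adjacency with a zero diagonal, as the text requires.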
Our GCN model is inspired by [53] and uses the proposed node features and adjacency matrices as input. GCN extends the concept of convolution from regular grid data to graph-structured data by generating node embeddings that gradually fuse the features in the neighborhood. The intrinsic difference from grid-based convolution is that the number of neighbors of a node is not fixed in graph-based convolution. GCNs can be categorized into two types: spectral-based and spatial-based. In this paper, we employ a spectral-based approach in which spectral filtering on graphs is defined. The spectral convolution on graphs can be defined as the multiplication of a signal $x \in \mathbb{R}^{N}$ with a filter $g_\theta = \mathrm{diag}(\theta)$ parameterized by $\theta \in \mathbb{R}^{N}$ in the Fourier domain, namely:

$$g_\theta \star x = U g_\theta U^{T} x \tag{4}$$

where $U$ is the matrix of eigenvectors of the normalized graph Laplacian $L = I - D^{-1/2} A D^{-1/2} = U \Lambda U^{T}$, with $A$ being the adjacency matrix, $\Lambda$ being the diagonal matrix containing the eigenvalues of $L$, $I$ being the identity matrix of proper size, and $D_{ii} = \sum_j A_{ij}$. Then, the filter $g_\theta$ can be understood as a function of the eigenvalues of $L$, i.e., $g_\theta(\Lambda)$. However, evaluating formula (4) is computationally expensive. To avoid the cost of the eigenvector decomposition, Hammond et al. [65] approximated $g_\theta(\Lambda)$ using Chebyshev polynomials $T_k(x)$ up to the $K$th order:

$$g_{\theta'}(\Lambda) \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{\Lambda}) \tag{5}$$

where $\theta' \in \mathbb{R}^{K}$ is a vector of Chebyshev coefficients, and $\tilde{\Lambda} = 2\Lambda/\lambda_{\max} - I$ with $\lambda_{\max}$ being the largest eigenvalue of $L$. The Chebyshev polynomials are recursively defined as $T_k(x) = 2x T_{k-1}(x) - T_{k-2}(x)$, with $T_0(x) = 1$ and $T_1(x) = x$. Therefore, the convolution on the graph signal $x$ with a filter $g_{\theta'}$ can be defined as:

$$g_{\theta'} \star x \approx \sum_{k=0}^{K} \theta'_k T_k(\tilde{L}) x \tag{6}$$

with $\tilde{L} = 2L/\lambda_{\max} - I$ denoting the scaled Laplacian matrix. Equation (6) can be easily verified by noticing that $(U \Lambda U^{T})^{k} = U \Lambda^{k} U^{T}$.
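As a sanity check on the Chebyshev approximation, the scalar filter response of Eq. (5) can be evaluated eigenvalue by eigenvalue with the recursion above. The following pure-Python sketch (function name and toy values are illustrative) does exactly that:

```python
# Hypothetical sketch of Eq. (5): evaluate g(lambda) per eigenvalue via the
# recursion T_0 = 1, T_1 = x, T_k = 2x*T_{k-1} - T_{k-2}, on the rescaled
# eigenvalue x = 2*lambda/lambda_max - 1.
def cheb_filter(eigvals, coeffs, lam_max):
    out = []
    for lam in eigvals:
        x = 2.0 * lam / lam_max - 1.0          # rescale to [-1, 1]
        t_prev, t_curr = 1.0, x                # T_0(x), T_1(x)
        g = coeffs[0] * t_prev
        if len(coeffs) > 1:
            g += coeffs[1] * t_curr
        for k in range(2, len(coeffs)):
            t_prev, t_curr = t_curr, 2.0 * x * t_curr - t_prev
            g += coeffs[k] * t_curr
        out.append(g)
    return out
```

For instance, the coefficient vector (0, 1) reproduces $g(\lambda) = \tilde{\lambda}$, the rescaled eigenvalue itself.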
From (6), it can be inferred that the convolution on the graph signal $x$ depends only on nodes that are at most $K$ steps away from the central node (the $K$th-order neighborhood). In this paper, we consider the first-order neighborhood, i.e., $K = 1$; in other words, only the central node and those adjacent to it are involved. Thus, (6) becomes a linear function on the graph Laplacian spectrum with respect to $\tilde{L}$. With this linear formulation, Kipf and Welling [53] further approximated $\lambda_{\max} \approx 2$, as the neural network parameters can adapt to this change in scale during training. We build a network of $L$ channels corresponding to the segmentation results and the $L$ constructed graphs, as in the configuration shown in Fig. 1. To be specific, equation (6) for the $l$th channel can be simplified to:

$$g_{\theta'} \star x \approx \theta'_{l,0} x + \theta'_{l,1}(L_l - I)x = \theta'_{l,0} x - \theta'_{l,1} D_l^{-1/2} A_l D_l^{-1/2} x \tag{7}$$

where the subscript $l$ denotes the $l$th channel ($l = 1, \dots, L$), and $\theta'_{l,0}$ and $\theta'_{l,1}$ are two free parameters. Since it can be beneficial to constrain the number of parameters to address overfitting and to minimize the number of operations (such as matrix multiplications) per layer, (7) can be converted to:

$$g_{\theta} \star x \approx \theta_l \left( I + D_l^{-1/2} A_l D_l^{-1/2} \right) x \tag{8}$$

by setting $\theta_l = \theta'_{l,0} = -\theta'_{l,1}$. Note that $I + D_l^{-1/2} A_l D_l^{-1/2}$ has eigenvalues in the range [0, 2]; numerical instabilities and exploding/vanishing gradients may occur if this operator is repeatedly applied in a deep neural network. To alleviate this problem, we use the renormalization trick Kipf and Welling introduced in [53]: $I + D_l^{-1/2} A_l D_l^{-1/2} \rightarrow \tilde{D}_l^{-1/2} \tilde{A}_l \tilde{D}_l^{-1/2}$, with $\tilde{A}_l = A_l + I$ and $\tilde{D}_{l,ii} = \sum_j \tilde{A}_{l,ij}$.

After that, we can generalize this definition to the signal $F_l \in \mathbb{R}^{N_l \times C}$ introduced in Section III.C. Here, $N_l$ is the number of objects under the scale parameter $S_l$, and $C$ is the dimension of the feature vector for each node, which is 128 in this paper. The convolutional operation on the constructed graphs can therefore be written as:

$$Z_l = \tilde{D}_l^{-1/2} \tilde{A}_l \tilde{D}_l^{-1/2} F_l W_l \tag{9}$$

where $Z_l \in \mathbb{R}^{N_l \times M}$ is the convolved signal matrix and $W_l \in \mathbb{R}^{C \times M}$ is a trainable weight matrix. A multilayer model based on graph convolutions can therefore be built by stacking multiple convolutional layers of the form of (9) with an activation function. Considering an $N$-layer GCN in the $l$th channel, the proposed forward model takes the following form:

$$f(F_l, A_l) = \mathrm{softmax}\left( \hat{A}_l\, \sigma\left( \cdots \sigma\left( \hat{A}_l F_l W_l^{(0)} \right) \cdots \right) W_l^{(N-1)} \right) \tag{10}$$

where $\hat{A}_l = \tilde{D}_l^{-1/2} \tilde{A}_l \tilde{D}_l^{-1/2}$, $W_l^{(0)}$ denotes the weight matrix from the input to the first hidden layer and $W_l^{(N-1)}$ is that from the last hidden layer to the output, and $\sigma(\cdot)$ represents an activation function, such as the ReLU function adopted in this paper. The softmax activation function is defined as $\mathrm{softmax}(x_i) = \exp(x_i)/Z$ with $Z = \sum_i \exp(x_i)$.

E. Fusion Module
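Concretely, the renormalized propagation rule of Eq. (9) above amounts to a few lines of linear algebra. The following pure-Python sketch (toy sizes and illustrative names; not the authors' implementation) applies one graph convolution:

```python
def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# Eq. (9): Z = D~^{-1/2} A~ D~^{-1/2} F W, with A~ = A + I (renormalization trick).
def gcn_layer(A, F, W):
    n = len(A)
    A_t = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
           for i in range(n)]                       # add self-loops
    d = [sum(row) ** -0.5 for row in A_t]           # diagonal of D~^{-1/2}
    A_hat = [[d[i] * A_t[i][j] * d[j] for j in range(n)] for i in range(n)]
    return matmul(matmul(A_hat, F), W)
```

On a two-node graph with unit edge weight, each node's output becomes the degree-normalized average of the two input features, as expected from the symmetric normalization.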
According to equation (10), the outputs of the $L$ channels, $O_1 = f(F_1, A_1), \dots, O_L = f(F_L, A_L)$, have dimensions $N_1 \times F, \dots, N_L \times F$, respectively, with $F$ being the number of classes, which is 2 in this paper. To preserve as much detailed information as possible in the final results, we design fusing matrices for $O_2, \dots, O_L$ to fuse them into the finest output $O_1$. As mentioned in Section III.A, one segmented object under $S_2, \dots, S_L$ can be obtained by merging several objects under $S_1$ while the other segmentation parameters are kept invariant. We define the fusing matrices using the area ratio and spectral similarity between the child and father objects. The elements of the fusing matrices $T_l \in \mathbb{R}^{N_1 \times N_l}$ ($l = 2, \dots, L$) can be calculated as:

$$T_{l,ij} = \begin{cases} \beta \dfrac{|P_{1,i}|}{|P_{l,j}|} + (1-\beta)\,\mathrm{sim}(P_{1,i}, P_{l,j}), & P_{1,i} \subseteq P_{l,j} \\ 0, & \text{otherwise} \end{cases} \tag{11}$$

where $|\cdot|$ represents the number of pixels in an object, the parameter $\beta$ is empirically set to 0.5 in the experiments, and $\mathrm{sim}(P_{1,i}, P_{l,j})$ denotes the spectral similarity between $P_{1,i}$ and $P_{l,j}$, which is calculated using the distance between the spectral mean vectors of the two objects.

After that, the coarser outputs $O_2, \dots, O_L$ can be refined to the finest scale, with the same size and object set as $O_1$, by multiplying the fusing matrices with them. Thus, the fusion of the multi-channel output results can be obtained with the following formula:

$$E = O_1 + T_2 O_2 + \dots + T_L O_L \tag{12}$$

where the fused result $E$ has the same size and corresponding nodes as $O_1$. The aforementioned fusion module is designed based on the following considerations. First, the finest result preserves more details, whereas reasonably adding coarser information reduces detection errors, since the finest scale alone can hardly represent ground objects of various sizes. The definition of the fusing matrix is consistent with visual intuition: an element of the matrix is large when the spectral values of the child and father objects are close and the area of the child accounts for a large proportion of that of the father. Second, only a fraction of the objects under the finest scale need to be labeled (the red nodes shown in Fig. 1), which evidently reduces the workload and cost of labeling samples.

We then evaluate the cross-entropy error as the loss function:

$$\mathcal{L} = -\sum_{t \in Y_T} \sum_{f=1}^{F} Y_{tf} \ln E_{tf} \tag{13}$$

where $Y_T$ is the set of node indices that have labels. Given $Y_T$, the network weights $W_l^{(0)}, \dots, W_l^{(N-1)}$ ($l = 1, \dots, L$) are trained using a gradient descent approach, in which all the nodes are utilized to perform gradient descent. In this way, the gradient information from the labeled nodes is spread to the unlabeled adjacent nodes.

IV. EXPERIMENTS AND DISCUSSION
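To make the fusion of Eq. (12) and the semi-supervised loss of Eq. (13) above concrete, both can be sketched in a few lines of pure Python (toy dimensions, illustrative names; not the authors' released code):

```python
import math

def matmul(X, Y):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)]
            for row in X]

# Eq. (12): E = O_1 + sum_l T_l O_l, lifting coarse outputs to the finest scale.
def fuse(O1, fusing_matrices, coarse_outputs):
    E = [row[:] for row in O1]
    for T, O in zip(fusing_matrices, coarse_outputs):
        TO = matmul(T, O)
        for i in range(len(E)):
            for j in range(len(E[i])):
                E[i][j] += TO[i][j]
    return E

# Eq. (13): cross-entropy over the labeled nodes only (softmax applied row-wise).
def masked_loss(E, Y, labeled):
    loss = 0.0
    for t in labeled:
        exps = [math.exp(v) for v in E[t]]
        Z = sum(exps)
        for f, y in enumerate(Y[t]):
            if y:
                loss -= y * math.log(exps[f] / Z)
    return loss
```

Only the node indices in `labeled` contribute to the loss, which mirrors how gradient information flows from the few labeled nodes to the rest of the graph.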
In this section, the data sets are first described. Then, we provide a brief description of the implementation details and evaluation metrics. Following that, the experimental results on the optical, SAR and optical/SAR data sets are presented and analyzed, respectively. Finally, our method is discussed in detail.
A. Descriptions of Data Sets
For the optical cases, two openly available VHR RS image data sets are employed: the Beijing & Tianjin data set provided by B. Hou and Q. Liu [36] and the LEVIR-CD data set [66] released by H. Chen and Z. Shi. The first data set includes 29 pairs of images covering two big cities, Beijing and Tianjin, in China during the period from 2006 to 2017. These images were captured at different times of day and in different seasons under different imaging conditions, which increases the complexity and diversity of the data. The images have quite large sizes, e.g., 2000×2000 pixels. Most of them were collected from Google Earth with a spatial resolution of 0.46 m; some test image patches collected by GF-2 with a spatial resolution of 1 m are also provided. Several patch samples of the first data set are shown in Fig. 3. The second data set consists of 637 pairs of
Fig. 3. Image patch examples and corresponding reference images of the Beijing & Tianjin data set.
Fig. 4. Image patch examples and corresponding reference images of the LEVIR-CD data set.
Fig. 5. SAR data sets. (a) Image T1. (b) Image T2. (c) Reference change map.
Fig. 6. Heterogeneous data sets. (a) Optical image. (b) SAR image. (c) Reference change map.
images with sizes of 1024×1024 and a spatial resolution of 0.5 m collected from Google Earth, covering different cities in Texas, US, including Austin, Lakeway, Bee Cave, Buda, Kyle, Manor, Pflugerville, Dripping Springs, etc. Some examples of the second data set are shown in Fig. 4. The two data sets mainly represent the construction and decline of buildings. Meanwhile, both of them introduce variations derived from seasonal factors and illumination conditions, which could help develop effective methods that mitigate the impact of irrelevant changes on real changes. Two VHR data sets are used to conduct the CD experiments for SAR images. The first data set is a pair of images acquired by the TerraSAR-X sensor with HH polarization and 1 m/pixel resolution, covering a suburban area of Wuhan, China, where the remarkable changes are the construction and demolition of buildings, as shown in the first row of Fig. 5. The second data set corresponds to an area in Beijing, China, as shown in the second row of Fig. 5. The images were acquired by Gaofen-3 with a size of 550×900 and a spatial resolution of 1 m/pixel. Two heterogeneous data sets are used to evaluate the effectiveness of the proposed method in heterogeneous CD tasks. The first data set is a pair of optical/SAR images (Shuguang Village, Dongying, China), as shown in the first row of Fig. 6. The optical image, with a size of 593×921 and red, green and blue bands, was acquired from Google Earth in September 2012, whereas the SAR image was taken by Radarsat-2 (C-band) with a size of 593×921 in June 2008. The changes of land use in the farmland and water cover are the most significant changes during the period. The second pair of optical/SAR images, shown in the second row of Fig. 6, describes an urban area in Wuhan, China.
The optical image was captured from Google Earth with a size of 495×503 and red, green and blue bands in November 2011, whereas the SAR image was acquired by Radarsat-2 (C-band) with a size of 495×503 in June 2008.
B. Implementation and Evaluation Metrics
The proposed method is implemented with the PyTorch framework on a single GeForce GTX 1080Ti GPU. In the feature-extraction network training phase, labeled images in the OSCD dataset are clipped into 6400 training images of 112×112 pixels with data augmentation, including rotation and flipping. Stochastic gradient descent (SGD) with momentum is applied for training. The learning rate is fixed at 0.001, and the momentum and the weight decay are set to 0.9 and 0.0005, respectively. For each image pair introduced in Section IV-A, we randomly select 5% of the objects under the finest scale as labeled nodes. The number of GCN layers is set to 3, namely
N=3 in formula (10). We employ 3 scales to conduct the FNEA, namely
L=3 in formula (12) and Fig. 1. We train the proposed MSGCN for 400 epochs with a dropout rate of 0.5 and a weight decay of 0.0005. To evaluate the performance of the proposed method, four quantitative evaluation indices, the false alarm rate (FAR), missed alarm rate (MAR), overall accuracy (OA) and Kappa coefficient (Kappa), are adopted as metrics. FAR, MAR and OA are formulated as FAR = FP/(FP+TN), MAR = FN/(FN+TP) and OA = (TP+TN)/(TP+TN+FP+FN), respectively, where TP denotes the number of true positives, FP the number of false positives, TN the number of true negatives, and FN the number of false negatives. Kappa is a statistical measure of the consistency between the change map and the reference map. It is calculated by
Kappa = (OA − PRE)/(1 − PRE), PRE = [(TP+FN)·(TP+FP) + (TN+FP)·(TN+FN)] / (TP+TN+FP+FN)^2 (14)
C. Experiments on Optical Images
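As a reference for the accuracy tables that follow, the four indices defined in Section IV-B can be computed directly from the confusion counts; the following is a minimal sketch (the function name is illustrative, not from the paper's code):

```python
def cd_metrics(tp, fp, tn, fn):
    """FAR, MAR, OA and Kappa from confusion counts, following Eq. (14)."""
    total = tp + fp + tn + fn
    far = fp / (fp + tn)                  # false alarm rate
    mar = fn / (fn + tp)                  # missed alarm rate
    oa = (tp + tn) / total                # overall accuracy
    # expected chance agreement between change map and reference map
    pre = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / total**2
    kappa = (oa - pre) / (1 - pre)
    return far, mar, oa, kappa
```

For example, with 40 true positives, 10 false positives, 40 true negatives and 10 false negatives, the sketch gives FAR = MAR = 0.2, OA = 0.8 and Kappa = 0.6.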
To verify the effectiveness of the proposed method for optical images, we compare it with the following six benchmark methods on the aforementioned Beijing & Tianjin and LEVIR-CD data sets: 1) FC-Siam-con: The fully convolutional Siamese concatenation network (FC-Siam-con) [61] applied a Siamese encoding stream to extract deep features from bi-temporal images; the features were then concatenated in the decoding stream for CD. 2) FCN-PP: The fully convolutional network with pyramid pooling (FCN-PP) [67] was proposed for landslide detection. It consists of a U-shaped architecture to learn deep features of the input images and a pyramid pooling layer to enlarge the receptive field. 3) Unet_ASPP: Unet is a widely used architecture in semantic segmentation and CD tasks. Atrous spatial pyramid pooling (ASPP) can capture multi-rate contextual information by sampling the input feature map with dilated convolutions of different sampling rates in parallel [68]. We insert the ASPP module between the down-sampling and up-sampling operators in Unet, yielding Unet_ASPP. 4) DSIFN: A deeply supervised image fusion network (DSIFN) was proposed for CD in high resolution RS images [69]. The DSIFN consists of a shared deep feature network and a difference discrimination network that utilizes channel attention and spatial attention modules. The authors have released the code in [69]. 5) W-net: The W-shaped network proposed in [36] applied an end-to-end dual-branch architecture and performed the differencing operation in the feature domain rather than in the traditional image domain, which greatly alleviates the loss of information useful for determining the changes. 6) GCNCD: A network with two GCN layers was proposed for CD in [58]. The method uses several hand-crafted features of objects to build the nodes of the graph model.
The first five methods are supervised; to evaluate them, we use 1200 pairs of images for training and 40 pairs for testing on the Beijing & Tianjin data set, and 1780 pairs for training and 512 pairs for testing on the LEVIR-CD data set, respectively. The sizes of all the above images are 512×512. The GCNCD is semisupervised, like our method. Thus, to ensure fairness, we randomly select 5% of the objects under the finest scale as labeled nodes for both the GCNCD and our method. The three scale parameters of multiscale segmentation are set as 8, 15, and 20 for both the Beijing & Tianjin and LEVIR-CD data sets. Some of the results on the Beijing & Tianjin testing data are presented in Fig. 7. Similarly, some typical results on the LEVIR-CD testing data are displayed in Fig. 8. Intuitively, the change maps generated by the proposed method are more consistent with the reference change maps. To be specific, there are many discontinuous small noise strips (false alarms) in the results of FC-Siam-con, caused by its limited robustness to the inevitable misregistration errors and spectral variations in high resolution RS image pairs. For similar reasons, FCN-PP causes massive false alarms, as can be seen in the red boxes in Fig. 7(d), Fig. 7(e), Fig. 8(d) and Fig. 8(e). The Unet_ASPP causes fewer false alarms and obtains smoother results. However, the excessive smoothness hinders its performance in areas with complex structures. Thus, the accuracy of the boundaries between changed and unchanged regions is unsatisfactory. Besides, some small changed regions are missed in the results of Unet_ASPP. The results of DSIFN exhibit considerable false alarms and missed detections. The W-net achieves relatively good performances
Fig. 7. Some typical CD maps by different methods on the Beijing & Tianjin data set. (a) Image T1. (b) Image T2.
(c) Reference change map. (d) FC-Siam-con. (e) FCN-PP. (f) Unet_ASPP. (g) DSIFN. (h) W-net. (i) GCNCD. (j) Proposed MSGCN.
Fig. 8. Some typical CD maps by different methods on the LEVIR-CD data set. (a) Image T1. (b) Image T2. (c) Reference change map. (d) FC-Siam-con. (e) FCN-PP. (f) Unet_ASPP. (g) DSIFN. (h) W-net. (i) GCNCD. (j) Proposed MSGCN.
on some image pairs, such as the second row of Fig. 7(h) and the fourth row of Fig. 8(h), but obvious false alarms occur in some results, such as the fourth row of Fig. 7(h) and the third row of Fig. 8(h). The results of GCNCD have good homogeneity in changed regions and obtain relatively precise boundaries. Nevertheless, because hand-crafted features have limited representational ability, GCNCD fails to capture some complex structures of changed regions, such as in the third row of Fig. 7(i). The proposed MSGCN can simultaneously reduce the false alarms and missed detections to a low level by incorporating the highly robust deep features and the multiscale object analysis into the GCN.
The quantitative evaluation results of the different methods are listed in Table I. The quantitative results are consistent with the visual performance, and the proposed MSGCN outperforms all of the compared methods in terms of Kappa, OA and MAR on both data sets. Regarding the FAR, the MSGCN is the second best. Compared with the other methods, the MSGCN yields an improvement of at least 5.02%, 1.19%, and 6.48% in MAR, OA and Kappa, respectively, on the Beijing & Tianjin data set. The improvement on the LEVIR-CD data set is also evident, with gains of at least 2.90%, 1.79%, and 7.27% in MAR, OA and Kappa, respectively. The Unet_ASPP achieves slightly lower FARs than the MSGCN; however, it yields significantly higher MARs. On the whole, the MSGCN can suppress false alarms and reduce missed detections simultaneously. The reasons for this behavior are: 1) the combination of pixel-wise high-level features with object-based extraction improves the robustness against misregistration errors and spectral variations, which reduces both false alarms and missed detections; and 2) the multiscale features are fully exploited in the MSGCN, so information from the labeled nodes can be spread to the unlabeled ones effectively and accurately.
D. Experiments on SAR Images
We evaluate the performance of the MSGCN on two SAR data sets in comparison with the following four state-of-the-art CD methods: 1) PCA-Net. The PCA-Net has been applied to SAR image CD [70]. The main difference from a convolutional neural network is that the filter kernels are obtained without back-propagation; instead, they are taken as the eigenvectors corresponding to the largest eigenvalues after applying eigendecomposition to the covariance matrix. 2) S-PCA-Net. For SAR image CD, the S-PCA-Net introduced an imbalanced learning process into PCA-Net to address the sample imbalance issue [71]. 3) CWNN. This method used a convolutional-wavelet neural network (CWNN) instead of a CNN to extract robust features with better noise immunity for SAR image CD [72]. 4) CNN. This method proposed a novel CNN framework without any preprocessing operations, which can automatically extract the spatial characteristics [73]; the code has been released at https://github.com/xhwNobody/Change-Detection. Among these methods, PCA-Net and CWNN are unsupervised, while S-PCA-Net and CNN are supervised. The parameters of these methods are set as in their original articles. From fine to coarse, the three scale parameters of multiscale segmentation are set as 10, 15, 25 for the Wuhan SAR data set and 15, 25, 35 for the Beijing SAR data set, respectively. The visual results on the two SAR data sets are shown in Fig. 9 and Fig. 10. For the Wuhan data set, the proposed MSGCN obtains more complete changed regions. In particular, many more changed regions are missed in the left building construction area in Fig. 9(a)~Fig. 9(d) compared with the result of the MSGCN. In addition, many isolated spots appear
Fig. 9. Visual results by different methods on the Wuhan SAR data set. (a) PCA-Net. (b) S-PCA-Net. (c) CWNN. (d) CNN. (e) Proposed MSGCN. (f) Reference change map.
Fig. 10. Visual results by different methods on the Beijing SAR data set.
(a) PCA-Net. (b) S-PCA-Net. (c) CWNN. (d) CNN. (e) Proposed MSGCN. (f) Reference change map.

TABLE I
QUANTITATIVE ACCURACY RESULTS FOR DIFFERENT METHODS ON THE OPTICAL DATA SETS (%)

Data set            Method        FAR    MAR    OA     Kappa
Beijing & Tianjin   FC-Siam-con   7.08   59.54  86.52  47.39
                    FCN-PP        6.19   45.89  89.13  52.81
                    Unet_ASPP
LEVIR-CD            FC-Siam-con   6.23   41.91  88.57  56.36
                    FCN-PP        3.94   32.12  90.61  63.58
                    Unet_ASPP
in Fig. 9(a)~Fig. 9(d). By contrast, the MSGCN reduces the number of isolated spots, as shown in Fig. 9(e). For the Beijing data set, due to the influence of speckle noise, the PCA-Net, S-PCA-Net and CWNN cause many false alarms. To be specific, the horizontal road has not changed between the two imaging times, but many pixels of the road are detected as changes in Fig. 10(a)~Fig. 10(c). In comparison, both the CNN and the MSGCN can effectively suppress false alarms. However, quite a lot of changed regions are missed in Fig. 10(d), such as the upper-right part, where building construction occurred between the two imaging times, but the CNN method fails to accurately detect this change. On the whole, the MSGCN produces change maps with complete changed regions and relatively few false alarms. Table II shows the quantitative evaluation results on the two SAR data sets. The MSGCN outperforms the other methods significantly, with improvements of at least 15.78%, 2.78%, and 10.99% in MAR, OA and Kappa on the Wuhan data set, and at least 5.48%, 0.82%, and 6.46% in MAR, OA and Kappa on the Beijing data set, respectively. For the Wuhan data set, although the MSGCN achieves the second lowest FAR, slightly higher than that of the S-PCA-Net, the MAR is reduced from 32.91% to 13.09% by a large margin. For the Beijing data set, due to the complicated noise situation, the unsupervised PCA-Net and CWNN do not perform very well; the false alarms in the results of these two methods are obviously much more numerous than in the others. Although the CNN achieves a slightly lower FAR than the MSGCN, it yields a significantly higher MAR of 57.08%, which means that many changed pixels are missed in the result of the CNN. Comparing the indices in Table II comprehensively, the MSGCN produces the best results on both data sets.
E. Experiments on Heterogeneous Optical/SAR Images
To evaluate the effectiveness of the proposed MSGCN on heterogeneous optical/SAR image CD tasks, we validate it against the following benchmark methods: 1) HPT. The homogeneous pixel transformation (HPT) [29] estimated mapping pixels based on the known unchanged pixels. In our experiments, we use 40% of the unchanged pixels as training samples. 2) M3CD. The Markov model for multimodal change detection (M3CD) algorithm used an observation field built up from pixel pairwise modeling in an unsupervised way [74]. 3) SCCN. A symmetric convolutional coupling network (SCCN) [75] was designed to infer spatial information from the data and learn new representations for heterogeneous CD. 4) PSGM. The unsupervised patch similarity graph matrix-based (PSGM) method [76] assumed that the patch similarity graph structure of each homogeneous or heterogeneous image is consistent if no change occurs. From fine to coarse, the three scale parameters of multiscale segmentation are set as 10, 15, 20 for the Shuguang data set and 7, 12, 20 for the Wuhan data set, respectively. Fig. 11 and Fig. 12 show the change maps of all the compared methods on the heterogeneous Shuguang and Wuhan data sets, respectively. As can be seen, many unchanged pixels are misclassified as changes in Fig. 11(a) and Fig. 12(a), as the discrepancy between the optical and SAR feature spaces cannot be eliminated completely through the HPT. Therefore, the features of some unchanged ground objects are still dissimilar in the mapped feature space. The results of M3CD are intuitively much smoother than the others. However, the excessive smoothing of the Markov model may reduce the accuracy of the boundaries between changed and unchanged regions. Besides, some evidently changed areas are missed in the results of M3CD, such as the areas marked by the red boxes in Fig. 11(b) and Fig. 12(b), where the changes of water cover and the construction of some buildings are missed, respectively. The SCCN also misses
Fig. 11.
Visual results by different methods on the Shuguang heterogeneous data set. (a) HPT. (b) M3CD. (c) SCCN. (d) PSGM. (e) Proposed MSGCN. (f) Reference change map.
Fig. 12. Visual results by different methods on the Wuhan heterogeneous data set. (a) HPT. (b) M3CD. (c) SCCN. (d) PSGM. (e) Proposed MSGCN. (f) Reference change map.

TABLE II
QUANTITATIVE ACCURACY RESULTS FOR DIFFERENT METHODS ON THE SAR DATA SETS (%)

Data set   Method      FAR     MAR    OA     Kappa
Wuhan      PCA-Net     3.34    35.84  91.71  66.12
           S-PCA-Net
Beijing    PCA-Net     6.44    40.78  90.17  48.87
           S-PCA-Net   4.43    39.74  92.08  55.65
           CWNN        10.97   36.50  86.51  40.93
           CNN
many changed pixels in the area marked by the red box in Fig. 11(c). In addition, relatively many false alarms are generated on the Wuhan data set, as shown in the red ellipse in Fig. 12(c). The PSGM can obtain relatively complete changed information but with some false alarm areas, such as those marked by red ellipses in Fig. 11(d) and Fig. 12(d). In contrast, the MSGCN generates obviously fewer false alarms on both data sets. Meanwhile, the completeness of the changed regions is well preserved by the MSGCN. The quantitative evaluation results on the two heterogeneous data sets are listed in Table III. For the Shuguang data set, the MSGCN outperforms the other methods significantly in terms of all evaluation metrics. Consistent with the visual comparison of Fig. 11, the FAR and MAR of the MSGCN are both lower than those of the others, which means that the MSGCN can effectively suppress false alarms and avoid missed detections simultaneously. For the Wuhan data set, the MSGCN produces the best FAR, OA, and Kappa coefficient. Although SCCN and PSGM achieve lower MARs, they yield significantly higher FARs. In accordance with the visual analysis of Fig. 12, many unchanged pixels are erroneously detected as changes in the results of SCCN and PSGM. On the whole, the MSGCN still outperforms SCCN and PSGM, and its highest OA and Kappa coefficient support this conclusion.
F. Discussion
In the following, the effectiveness of the fusion module, the influence of the ratio of labeled samples, and the performance with different numbers of graph convolutional layers are discussed in detail.
1) Effectiveness of the fusion module: In order to capture more comprehensive information about ground objects of various sizes, we segment the input image pair under three scale parameters, which can be named the fine, medium, and coarse scales, respectively. A fusion module is designed to make full use of the outputs of the multiscale GCN. As the finest segments preserve more details and only a fraction of the objects under the finest scale need to be labeled, performances with four scale combinations (fine, fine-medium, fine-coarse and fine-medium-coarse) are compared to prove the effectiveness of the fusion module. Specifically, the fine combination corresponds to L=1, the fine-medium and fine-coarse combinations to L=2, and the fine-medium-coarse combination to L=3. Taking the Beijing SAR data set for instance, the three scale parameters are 15, 25 and 35; thus, the fine-medium case means the combination of 15 and 25. Fig. 13 shows two samples of optical CD results with different scale combinations, while Fig. 14 and Fig. 15 show the experimental results on the Wuhan SAR and heterogeneous data sets, respectively. The corresponding quantitative evaluation results are listed in Table IV. From Fig. 13~15, we have the following observations. First, the fusion module is capable of capturing more accurate structures of changed regions. For instance, comparing the areas in the red boxes in Fig. 13, the changed regions of the fine-medium-coarse combination are closer to the reference than those of the single-scale or two-scale combinations. Second, incorporating multiscale information can alleviate missed detections; for example, in the areas of the red boxes in Fig. 14, the holes in the detected changed regions are missed detections, and the fine-medium-coarse combination obtains fewer and smaller holes than the others. Since the ground objects in images can be of various sizes, a multiscale fusion strategy is highly beneficial. It can be observed from Table IV that the fine-medium-coarse combination achieves the best performance in terms of MAR, OA and Kappa. In addition, the fine-medium and fine-coarse combinations outperform the single fine scale on all data sets. The reason may be that the information is inherently multiscale and cannot be sufficiently exploited using only a single scale, especially when image scenes are complex. These observations demonstrate that the proposed fusion module can better exploit the multiscale information to improve performance.
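As a concrete illustration of the fusion module evaluated above (Eqs. (11)–(12)), the following sketch builds a fusing matrix from the area ratio and spectral similarity of child/father objects and refines a coarse output to the finest scale. All names are illustrative, and the Gaussian form of the spectral similarity is an assumption; the paper only specifies that it is computed from the distance between the spectral mean vectors.

```python
import numpy as np

def fusing_matrix(child_labels, father_labels, child_means, father_means,
                  beta=0.5, sigma=1.0):
    """Fusing matrix T^l of Eq. (11): area ratio times spectral similarity
    between each finest-scale (child) object and its father object."""
    n_child = child_labels.max() + 1
    n_father = father_labels.max() + 1
    T = np.zeros((n_child, n_father))
    for i in range(n_child):
        mask = child_labels == i
        # the father of child i: the coarse object covering its pixels
        j = np.bincount(father_labels[mask]).argmax()
        area_ratio = mask.sum() / (father_labels == j).sum()
        # Gaussian similarity of spectral mean vectors (one possible choice)
        dist = np.linalg.norm(child_means[i] - father_means[j])
        sim = np.exp(-dist**2 / (2 * sigma**2))
        T[i, j] = area_ratio * sim**beta
    return T

def fuse_outputs(O_fine, coarse_outputs, fusing_matrices):
    """Eq. (12): E = O^1 + sum_l T^l O^l, all refined to the finest scale."""
    E = O_fine.copy()
    for T, O in zip(fusing_matrices, coarse_outputs):
        E = E + T @ O
    return E
```

Each row of the fusing matrix has a single nonzero entry (the child's father), so multiplying a coarse output by T simply redistributes the father's class scores to its children, weighted by size and spectral closeness.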
2) Influence of the ratio of labeled samples: Unlike other supervised methods, which use some individual pairs of images as training data and other pairs as testing data, the semisupervised MSGCN performs training with a few labeled superpixels (objects) and the remaining unlabeled ones on each pair. Therefore, the ratio of labeled samples unavoidably influences the performance of the MSGCN and should be investigated. To this end, we vary the ratio of labeled samples from 5% to 30% in steps of 5% and report the OA and Kappa coefficient on all the aforementioned data sets, as shown in Fig. 16. We observe from Fig. 16 that the performance on all data sets can be improved by increasing the ratio of labeled samples. It is noteworthy that the proposed MSGCN yields relatively high accuracy even when the labeled ratio is low, as in the experiments in Sections IV-C~IV-E, where the MSGCN achieves promising performance with a labeled ratio of 5%. This advantage reduces the requirement for abundant labeled samples and thus makes the method quite feasible in practical CD tasks.
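The semisupervised scheme — train on a small labeled fraction of the finest-scale objects and let the graph convolutions spread the label information — can be sketched as follows. This is a simplified NumPy stand-in for the paper's PyTorch implementation; the adjacency, the node features, and the 5% labeling are synthetic, and the layer dimensions follow the 128-32-8-2 setting used in the paper.

```python
import numpy as np

def normalize_adjacency(A):
    """Symmetrically normalized adjacency with self-loops,
    D^{-1/2}(A+I)D^{-1/2}, the propagation operator of a standard GCN."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    return A_hat / np.sqrt(np.outer(d, d))

def gcn_forward(A_norm, X, weights):
    """N-layer GCN forward pass: ReLU on hidden layers, softmax output."""
    H = X
    for k, W in enumerate(weights):
        H = A_norm @ H @ W
        if k < len(weights) - 1:
            H = np.maximum(H, 0)
    e = np.exp(H - H.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)   # per-node class probabilities

def masked_cross_entropy(probs, labels, labeled_idx):
    """Cross-entropy of Eq. (13), evaluated only on the labeled set Y_T."""
    p = probs[labeled_idx, labels[labeled_idx]]
    return -np.log(np.clip(p, 1e-12, None)).mean()

rng = np.random.default_rng(0)
n = 50                                     # number of graph nodes (objects)
A = (rng.random((n, n)) < 0.1).astype(float)
A = np.triu(A, 1); A = A + A.T             # symmetric adjacency, no self-loops
X = rng.standard_normal((n, 128))          # 128-d node features (from the U-net)
dims = [128, 32, 8, 2]                     # the three-layer 128-32-8-2 setting
weights = [rng.standard_normal((a, b)) * 0.1 for a, b in zip(dims[:-1], dims[1:])]
labels = rng.integers(0, 2, n)
labeled_idx = rng.choice(n, size=max(1, int(0.05 * n)), replace=False)  # 5% labeled

probs = gcn_forward(normalize_adjacency(A), X, weights)
loss = masked_cross_entropy(probs, labels, labeled_idx)
```

Gradient descent on this masked loss updates the shared weights, and because every forward pass mixes features over the whole graph, the supervision on the 5% labeled nodes also shapes the predictions for their unlabeled neighbors.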
3) Performances with different numbers of graph convolutional layers: The number of graph convolutional layers is an important hyperparameter in network design. To evaluate the performance with different numbers of layers, we implement the proposed MSGCN with two, three, four, and five graph convolutional layers, respectively. As the input dimension is fixed at 128 and the output dimension at 2, the dimensions of the two-layer, three-layer, four-layer and five-layer MSGCNs are set as 128-32-2, 128-32-8-2, 128-32-16-4-2 and 128-32-16-8-4-2, respectively. Fig. 17 shows two samples of optical CD results with different layer
TABLE III
QUANTITATIVE ACCURACY RESULTS FOR DIFFERENT METHODS ON THE HETEROGENEOUS OPTICAL/SAR DATA SETS (%)

Data set   Method   FAR    MAR    OA     Kappa
Shuguang   HPT      2.28   35.91  96.17  58.62
           M3CD     2.39   28.47  96.20  60.17
           SCCN     0.93   36.21  97.18  65.20
           PSGM     1.50   22.21  97.66  74.38
           MSGCN
Wuhan      HPT      6.03   37.27  91.94  46.15
           M3CD     3.55   40.20  93.38  47.58
           SCCN     4.94   27.88  94.14  61.08
           PSGM     4.12
numbers, while Fig. 18 and Fig. 19 show the experimental results on the Wuhan SAR and Shuguang heterogeneous data sets, respectively. Table V presents the corresponding quantitative evaluation results. The labeled ratios in these experiments are all set to 5%. It can be observed from Fig. 17~Fig. 19 that increasing the layer number does not necessarily result in improved performance. For example, in the results of the Shuguang heterogeneous data set, the obvious changes of water cover located in the lower-right part are completely missed in
Fig. 13. Two samples of optical CD results with different scale combinations. (a) Image T1. (b) Image T2. (c) Scale combination: 8. (d) Scale combination: 8-15. (e) Scale combination: 8-20. (f) Scale combination: 8-15-20. (g) Reference change map.
Fig. 14. Change maps of the Wuhan SAR data set with different scale combinations. (a) Scale combination: 10. (b) Scale combination: 10-15. (c) Scale combination: 10-25. (d) Scale combination: 10-15-25. (e) Reference change map.
Fig. 15. Change maps of the Wuhan heterogeneous data set with different scale combinations. (a) Scale combination: 7. (b) Scale combination: 7-12. (c) Scale combination: 7-20. (d) Scale combination: 7-12-20. (e) Reference change map.

TABLE IV
QUANTITATIVE ACCURACY RESULTS FOR DIFFERENT SCALE COMBINATIONS (%)

                    Optical images             Wuhan SAR data set         Wuhan heterogeneous data set
Scale combination   FAR   MAR    OA     Kappa  FAR   MAR    OA     Kappa  FAR   MAR    OA     Kappa
fine                4.24  15.52  94.22  76.44  3.27  14.83  94.91  81.00  1.60  37.32  96.07  65.45
fine-medium         3.61  15.04  94.79  78.84

Fig. 16. The Kappa coefficient and OA of MSGCN with different labeled ratios. (a) Optical image data sets. (b) SAR data sets. (c) Heterogeneous data sets.
the result with four layers, whereas most of these changes can be captured in the results with two and three layers, as the regions marked by red boxes in Fig. 19(a)~(c). Interpreting in detail, the results of the three-layer MSGCN seem to be closer to the reference change maps on all the above data sets. From Table V, we can see that the three-layer MSGCN achieves the best performance in terms of MAR, OA and Kappa on every data set, which is consistent with the visual interpretation. This can possibly be explained as follows: when the depth of the network is 2, the network is relatively shallow, so the representational capability of the node features has not reached its peak; when the number is 4 or 5, excessive smoothing may happen, as the GCN is intrinsically a low-pass filter. In fact, over-smoothing is a common limitation of the multi-layer GCN framework.
V. CONCLUSION
Fig. 17. Two samples of optical CD results with different layer numbers. (a) Image T1. (b) Image T2. (c) Layer number: 2. (d) Layer number: 3. (e) Layer number: 4. (f) Layer number: 5. (g) Reference change map.
Fig. 18. Change maps of the Wuhan SAR data set with different layer numbers. (a) Layer number: 2. (b) Layer number: 3. (c) Layer number: 4. (d) Layer number: 5. (e) Reference change map.
Fig. 19. Change maps of the Wuhan heterogeneous data set with different layer numbers. (a) Layer number: 2. (b) Layer number: 3. (c) Layer number: 4. (d) Layer number: 5. (e) Reference change map.

TABLE V
QUANTITATIVE ACCURACY RESULTS FOR DIFFERENT NUMBERS OF GCN LAYERS (%)

                   Optical images             Wuhan SAR data set         Shuguang heterogeneous data set
Number of layers   FAR   MAR    OA     Kappa  FAR   MAR    OA     Kappa  FAR   MAR    OA     Kappa
2
4                  1.89  28.87  94.54  73.42  2.97  15.68  95.03  81.28

A semisupervised change detection method based on the graph convolutional network and multiscale object-oriented analysis has been proposed in this paper to better address CD tasks for both homogeneous and heterogeneous remote sensing images. To exploit the multiscale spatial information in high resolution images, the input image pair is first segmented through FNEA with different scale parameters to obtain multiscale parcels (namely, objects). Treating each parcel as one node, a graph construction strategy is used to form the parcels into a graph representation for each scale. Instead of concatenating hand-crafted features directly, we adopt a pre-trained U-net to extract pixel-wise high-level
features, which are combined with the segmentation results to obtain the features of the nodes in the graph representations. As several open RS data sets with labels for CD tasks are available, no extra labeling workload is needed. A novel multiscale graph convolutional network with each channel corresponding to one scale is proposed, based on the property that the iterative training process helps to propagate the label information from labeled nodes to unlabeled ones, which allows us to use a fraction of labeled nodes to infer CD information for the whole image pair. Further, a fusion strategy is designed to incorporate the multiscale information. Three types of data, including open optical data with thousands of high resolution image pairs, two pairs of high resolution SAR images, and two pairs of heterogeneous images, are used to verify the effectiveness and superiority of the proposed method. The experimental results have demonstrated its superiority over some popular methods. Our future work is to extend the semisupervised framework to an unsupervised scenario and to explore the possibility of distinguishing different kinds of change.

ACKNOWLEDGMENT
The authors would like to thank Dr. Qingjie Liu of Beihang University for providing the free Beijing & Tianjin data set.

REFERENCES

[1] A. Singh, "Review article: Digital change detection techniques using remotely-sensed data," Int. J. Remote Sens., vol. 10, no. 6, pp. 989-1003, 1989.
[2] B. Hou, Y. Wang, and Q. Liu, "Change detection based on deep features and low rank," IEEE Geosci. Remote Sens. Lett., vol. 14, no. 12, pp. 2418-2422, Dec. 2017.
[3] F. Baselice, G. Ferraioli, and V. Pascazio, "Markovian change detection of urban areas using very high resolution complex SAR images," IEEE Geosci. Remote Sens. Lett., vol. 11, no. 5, pp. 995-999, May 2014.
[4] Y. Kim and M. Lee, "Rapid change detection of flood affected area after collapse of the Laos Xe-Pian Xe-Namnoy dam using Sentinel-1 GRD data," Remote Sens., vol. 12, no. 6, p. 1978, 2020.
[5] Y. Ban and O. Yousif, "Change detection techniques: A review," in Multitemporal Remote Sensing. Springer, Nov. 2016, pp. 19-43, DOI: 10.1007/978-3-319-47037-5_2.
[6] T. Liu, L. Yang, and D. Lunga, "Change detection using deep learning approach with object-based image analysis," Remote Sens. Environ., vol. 256, p. 112308, 2021.
[7] L. Wan, T. Zhang, and H. You, "Object-based multiscale method for SAR image change detection," J. Appl. Remote Sens., vol. 12, no. 2, p. 025004, 2018.
[8] Y. Sun, L. Lei, X. Li, H. Sun, and G. Kuang, "Nonlocal patch similarity based heterogeneous remote sensing change detection," Pattern Recognit., vol. 109, p. 107598, 2021.
[9] E. F. Lambin and A. H. Strahler, "Change-vector analysis in multitemporal space: A tool to detect and categorize land-cover change processes using high temporal-resolution satellite data," Remote Sens. Environ., vol. 48, pp. 231-244, 1994.
[10] A. A. Nielsen, K. Conradsen, and J. J. Simpson, "Multivariate alteration detection (MAD) and MAF postprocessing in multispectral, bitemporal image data: New approaches to change detection studies," Remote Sens. Environ., vol. 64, pp. 1-19, 1998.
[11] A. A. Nielsen, "The regularized iteratively reweighted MAD method for change detection in multi- and hyperspectral data," IEEE Trans. Image Process., vol. 16, no. 2, pp. 463-478, Feb. 2007.
[12] R. J. Dekker, "Speckle filtering in satellite SAR change detection imagery," Int. J. Remote Sens., vol. 19, no. 6, pp. 1133-1146, Jun. 1998.
[13] M. Gong, Z. Zhou, and J. Ma, "Change detection in synthetic aperture radar images based on image fusion and fuzzy clustering," IEEE Trans. Image Process., vol. 21, no. 4, pp. 2141-2151, Apr. 2012.
[14] C. Wu, B. Du, and L. Zhang, "Slow feature analysis for change detection in multispectral imagery," IEEE Trans. Geosci. Remote Sens., vol. 52, no. 5, pp. 2858-2874, May 2014.
[15] S. Saha, F. Bovolo, and L. Bruzzone, "Unsupervised deep change vector analysis for multiple-change detection in VHR images," IEEE Trans. Geosci. Remote Sens., vol. 57, no. 6, pp. 3677-3693, Jun. 2019.
[16] Z. Lv, T. Liu, and J. A. Benediktsson, "Object-based key point vector distance for binary land cover change detection using VHR remote sensing images," IEEE Trans. Geosci. Remote Sens., vol. 58, no. 9, pp. 6524-6533, Sep. 2020.
[17] J. Wang, X. Yang, X. Yang, L. Jia, and S. Fang, "Unsupervised change detection between SAR images based on hypergraphs,"
ISPRS J. Photogramm. Remote Sens ., vol.164, pp.61-72, Apr. 2020. [18]
H. Nemmour and Y. Chibani, “Multiple support vector machines for land cover change detection: an application for mapping urban extension,”
ISPRS J. Photogramm. Remote Sens ., vol.61, pp.125-133, Feb. 2006. [19]
S. Liu, Q. Du, X. Tong, A. Samat, and L. Bruzzone, “Unsupervised change detection in multispectral remote sensing images via spectral-spatial band expansion,”
IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens ., vol.12, no.9, pp.3578-3587, Sep. 2019. [20]
N. Chang, M. Han, W. Yao, L. Chen, and S. Xu, “Change detection of land use and land cover in an urban region with SPOT-5 images and partial Lanczos extreme learning machine,”
J. Appl. Remote Sens ., vol.4, no.1, p.043551, 2010. [21]
F. Chatelain, J.Y. Tourneret, J. Inglada, and A. Ferrari, “Bivariate gamma distributions for image registration and change detection,”
IEEE Trans. Image Process. , vol. 16, no. 7, pp. 1796-1806, Jul. 2007. [22]
Y. Bazi, F. Melgani, L. Bruzzone, and G. Vernazza, “A genetic expectation-maximization method for unsupervised change detection in multitemporal SAR imagery, ”
Int. J. Remote Sens ., vol.30, no.34, pp.6591-6610, Dec,2009. [23]
G. Yang, H. Li, W. Yang, K. Fu, Y. Sun, and W.J. Emery, “Unsupervised change detection of SAR images based on variational multivariate Gaussian mixture model and Shannon entropy,”
IEEE Geosci. Remote Sens. Lett ., vol.16, no.5, pp.826-830, May. 2019. [24]
M. Zanetti, F.Bovolo, and L. Bruzzone, “Rayleigh-Rice mixture parameter estimation via EM algorithm for change detection in multispectral images,”
IEEE Trans. Image Process. , vol. 24, no. 12, pp. 5004-5016, Dec. 2015. [25]
G. Mercier, G. Moser, and S.B. Serpico, “Conditional copulas for change detection in heterogeneous remote sensing images,”
IEEE Trans. Geosci. Remote Sens., vol.46, no. 5, pp.1428-1441, May. 2008. [26]
V. Ferraris, N. Dobigeon, Y. Cavalcanti, T. Oberlin, and M. Chabert, “Coupled dictionary learning for unsupervised change detection between multimodal remote sensing images,”
Comput Vis. Image Understanding ., [Online] https://doi.org/10.1016/j.cviu.2019.102817. [27]
L. Wan, Y. Xiang, and H. You, “An object-based hierarchical compound classification method for change detection in heterogeneous optical and SAR images,”
IEEE Trans. Geosci. Remote Sens., vol.57, no. 12, pp.9941-9959, Dec. 2019. [28]
J. Prendes, M. Chabert, F. Pascal, A.Giros, and J.Y. Tourneret, “A new multivariate statistical model for change detection in images acquired by homogeneous and heterogeneous sensors,”
IEEE Trans. Image Process. , vol. 24, no. 3, pp. 799-812, Mar. 2015. [29]
Z. Liu, G. Li, G. Mercier, Y. He, and Q. Pan, “Change detection in heterogeneous remote sensing images via homogeneous pixel transformation,”
IEEE Trans. Image Process. , vol. 27, no. 4, pp. 1822-1834, Apr. 2018. [30]
L.T. Luppino, F.M. Bianchi, G. Moser, and S.N. Anfinsen, “Unsupervised image regression for heterogeneous change detection,”
IEEE Trans. Geosci. Remote Sens., vol.57, no. 12, pp.9960-9975, Dec. 2019. [31]
Y. Sun, L. Lei, D. Guan, X. Li, and G. Kuang, “SAR image change detection based on nonlocal low-rank model and two-level clustering,”
IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens ., vol.13, no.1, pp.293-306, Jan. 2019. [32]
J.E. Ball, D.T. Anderson, and C.S. Chan, “A comprehensive survey of deep learning in remote sensing: theories, tools and challenges for community,”
J. Appl. Remote Sens ., vol.11, no.4, p.042609, 2017. [33]
H. Lyu, H. Lu, and L. Mou, “Learning a transferable change rule from a recurrent neural network for land cover change detection,”
Remote Sens ., vol. 8, no. 6, p. 506, Jun. 2016. [34]
G. Liu, L. Li, L. Jiao, Y. Dong, and X. Li, “Stacked Fisher autoencoder for SAR change detection,”
Patten Recognit ., 96, p.106971, 2019.
REPLACE THIS LINE WITH YOUR PAPER IDENTIFICATION NUMBER (DOUBLE-CLICK HERE TO EDIT) < 16 [35]
H. Chen, C. Wu, B. Du, L. Zhang, L. Wang, “Change detection in multisource VHR images via deep Siamese convolutional multiple-layers recurrent neural network,”
IEEE Trans. Geosci. Remote Sens., vol.58, no. 4, pp.2848-2864, Apr. 2020. [36]
B. Hou, Q. Liu, H. Wang, and Y. Wang, “From W-net to CDGAN: bitemporal change detection via deep learning techniques,”
IEEE Trans. Geosci. Remote Sens., vol.58, no. 3, pp.1790-1802, Mar. 2020. [37]
K. Nemoto, R. Hamaguchi, M. Sato, A. Fujita, T. Imaizumi, and S. Hikosaka, “Building change detection via a combination of CNNs using only RGB aerial imageries,”
Proc. SPIE , vol. 10431, Oct. 2017, Art. no. 104310J. [38]
K. Lim, D. Jin, and C.S. Kim, “Change detection in high resolution satellite images using an ensemble of convolutional neural networks,”
In Proc. APSIPA Annual Summit and Conf ., Nov. 2018, pp.509-515. [39]
D.B Mesquita, R.F. Santos, D.G. Macharet, M.F.M. Campos, and E.R. Nascimento, “Fully convolutional Siamese autoencoder for change detection in UAV aerial images,”
IEEE Geosci. Remote Sens. Lett ., vol.17, no.8, pp.1455-1459, Aug. 2020. [40]
J. Liu, K. Chen, G. Xu, X. Sun, M. Yan W. Diao, and H. Han, “Convolutional neural network-based transfer learning for optical aerial images change detection,”
IEEE Geosci. Remote Sens. Lett ., vol.17, no.1, pp.127-131, Jan. 2020. [41]
M. Zhang and W. Shi, “A feature difference convolutional neural network-based change Detection method,”
IEEE Trans. Geosci. Remote Sens., vol.58, no. 10, pp.7232-7246, Oct. 2020. [42]
M. Gong, X. Niu, P. Zhang, and Z. Li, “Generative adversarial networks for change detection in multispectral imagery,”
IEEE Geosci. Remote Sens. Lett. , vol. 14, no. 12, pp. 2310–2314, Nov. 2017. [43]
M. Gong, Y. Yang, T. Zhan, X. Niu, and S. Li, “A generative discriminatory classified network for change detection in multispectral imagery,”
IEEE J. Sel. Topics Appl. Earth Observat. Remote Sens ., vol. 12, no. 1, pp. 321–333, Jan. 2019. [44]
D. Peng, L. Bruzzone, Y. Zhang, H.Guan, H. Ding, and X. Huang, “SemiCDNet-A semisupervised convolutional neural network for change detection in high resolution remote-sensing images,”
IEEE Trans. Geosci. Remote Sens.,
DOI: 10.1109/TGRS.2020.3011913. [45]
S. Wan, C. Gong, P. Zhong, B. Du, L. Zhang, and J. Yang, “Multiscale dynamic graph convolutional network for hyperspectral image classification,”
IEEE Trans. Geosci. Remote Sens., vol.58, no. 5, pp.3162-3177, May. 2020. [46]
M. Gori, G. Monfardini, and F. Scarselli, “A new model for learning in graph domains,” in Proc. IEEE Int. Joint Conf. Neural Netw ., vol. 2, Aug. 2005, pp. 729–734. [47]
F. Scarselli, M. Gori, A. C. Tsoi, M. Hagenbuchner, and G. Monfardini, “ The graph neural network model,”
IEEE Trans. Neural Netw. Learn. Syst ., vol. 20, no. 1, pp. 61–80, Jan. 2009. [48]
C. Gallicchio and A. Micheli, “Graph echo state networks,” in Proc. Int. Joint Conf. Neural Netw . (IJCNN) , Jul. 2010, pp. 1–8. [49] Y. Li, D. Tarlow, M. Brockschmidt, and R. Zemel, “Gated graph sequence neural networks,” in
Proc. Int. ICLR , 2015, pp. 1–20. [50]
Z. Wu, S. Pan, P. Chen, G. Long, C. Zhang, and P.S. Yu, “A comprehensive survey on graph neural networks,”
IEEE Trans. Neural Netw. Learn. Syst ., vol. 32, no. 1, pp. 4–24, Jan. 2021. [51]
J. Bruna, W. Zaremba, A. Szlam, and Y. Lecun, “Spectral networks and locally connected networks on graphs,” in
Proc. ICLR , 2014, pp. 1–14. [52]
A. Sandryhaila and J. Mouraj, “Discrete signal processing on graphs,”
IEEE Trans. Signal Process. , vol.61, no.7, pp. 1644-1656, Jul. 2013. [53]
T. N. Kipf and M. Welling, “Semi-supervised classification with graph convolutional networks,” in
Proc. ICLR , 2017, pp. 1–14. [54]
J. Chen, T. Ma, and C. Xiao, “FastGCN: Fast learning with graph convolutional networks via importance sampling,” in Proc. ICLR , 2018, pp. 1–15. [55]
J. Park, M. Lee, H.J. Chang, K. Lee, and J.Y. Choi, “Symmetric graph convolutional autoencoder for unsupervised graph representation Learning,” in Proc. ICCV , 2019, pp.6518-6527. [56]
X. J. Zhu, “Semi-supervised learning literature survey,” Dept. Comput. Sci., Univ. Wisconsin-Madison, Madison, WI, USA, Tech. Rep. 1530, 2005. [57]
X. Yan, T. Ai, M. Yang, and H. Yin, “A graph convolutional neural network for classification of building patterns using spatial vector data,”
ISPRS J. Photogramm. Remote Sens ., 150, pp.259-273, 2019. [58]
S. Saha, L. Mou, X. Zhu, F. Bovolo, and L. Bruzzone, “Semisupervised change detection using graph convolutional network,”
IEEE Geosci. Remote Sens. Lett. , DOI: 10.1109/LGRS.2020.2985340. [59]
M. Baatz and A. Schape, “Multiresolution segmentation: An optimization approach for high quality multiscale image segmentation,” in
Proc. AGIS XII , Heidelberg, Germany, 12-23. 2000. [60]
J. Wu, B. Li, W. Ni, W. Yan, and H. Zhang, “Optimal segmentation scale selection for object-based change detection in remote sensing images using Kullback-Leibler divergence, ”
IEEE Geosci. Remote Sens. Lett. , vol. 17 no. 7, pp. 1124-1128, Jul. 2020. [61]
R.C. Daudt, B.L. Saux, and A. Boulch, “Fully convolutional Siamese networks for change detection,” in Proc. ICIP , 2018, pp.4063-4067. [62]
R. Liu, D. Jiang, L. Zhang, and Z.Zhang, “Deep depthwise separable convolutional network for change detection in optical aerial images,”
IEEE J. Sel. Topics Appl. Earth Observ. Remote Sens ., vol.13, pp.1109-1118, Mar. 2020. [63]
R. C. Daudt, B. L. Saux, A. Boulch, and Y. Gousseau, “Urban change detection for multispectral earth observation using convolutional neural networks,” in Proc. IGARSS , 2018, pp.2115-2118. [64]
C. Benedek and T. Szirnyi, “Change detection in optical aerial images by a multilayer conditional mixed markov model,”
IEEE Trans. Geosci. Remote Sens., vol.47, no. 10, pp.3416-3430, May. 2009. [65]
D.K. Hammod, P. Vandergheynst, and R. Gribonval, “Wavelets on graph via spectral graph theory,”
Appl. Comput. Harmon Analys ., vol.30, no.2, pp.129-150, 2011. [66]
H. Chen and Z. Shi, “A spatial-temporal attention-based method and a new dataset for remote sensing images change detection,”
Remote, Sens ., 12, p.1662, 2020. [67]
T. Lei, Y. Zhang, Z. Lv, S. Li, S. Liu, and A.K. Nandi, “Landslide inventory mapping from bitemporal images using deep convolutional neural networks,”
IEEE Geosci. Remote Sens. Lett. , vol. 16, no. 6, pp. 982–986, Jun. 2019. [68]
Y. Wang, B. Liang, M. Ding, and J. Li, “Dense semantic labeling with atrous spatial pyramid pooling and decoder for high-resolution remote sensing imagery,”
Remote, Sens ., 11(1), p.0020, 2019. [69]
C. Zhang, P. Yue, D. Tapete, L. Jiang, B. Shangguan, L. Huang, and G. Liu, “A deeply supervised image fusion network for change detection in high resolution bi-temporal remote sensing images,”
ISPRS J. Photogramm. Remote Sens ., vol.166, pp.183-200, 2020. [70]
F. Gao, J. Dong, B. Li, and Q. Xu, “Automatic change detection in synthetic aperture radar images based on PCANet,”
IEEE Geosci. Remote Sens. Lett ., vol.13, no.12, pp.1792-1796, Dec. 2016. [71]
R. Wang, J. Zhang, J. Chen, L. Jiao, and M. Wang, “Imbalanced learning-based automatic SAR images change detection by morphologically supervised PCA-net,”
IEEE Geosci. Remote Sens. Lett ., vol.16, no.4, pp.554-558, Apr. 2019. [72]
F. Gao, X. Wang, Y. Gao, J. Dong, and S. Wang, “Sea ice change detection in SAR images based on convolutional-wavelet neural networks,”
IEEE Geosci. Remote Sens. Lett ., vol.16, no.8, pp.1240-1244, Aug. 2019. [73]
Y. Li, C. Peng, Y. Chen, L. Jiao, L. Zhou, and R. Shang, “A deep learning method for change detection in synthetic aperture radar images”
IEEE Trans. Geosci. Remote Sens., vol.57, no.8, pp.5751-5763, Aug. 2019. [74]
R. Touati, M. Mignotte, and M. Dahmane, “Multimodal change detection in remote sensing images using an unsupervised pixel pairwise based Markov random field model,”
IEEE Trans. Image Process. , vol. 29, pp. 757–767, Aug. 2019. [75]
J. Liu, M. Gong, K. Qin, and P. Zhang, “A deep convolutional coupling network for change detection based on heterogeneous optical and radar images,”
IEEE Trans. Neural Netw. Learn. Syst. , vol. 29, no. 3, pp. 545–559, Mar. 2018. [76]
Y. Sun, L. Lei, X. Li, X. Tan, G. Kuang, “Patch similarity graph matrix-based unsupervised remote sensing change detection with homogeneous and heterogeneous sensors,”,