3D Fully Convolutional Neural Networks with Intersection Over Union Loss for Crop Mapping from Multi-Temporal Satellite Images
Sina Mohammadi , Mariana Belgiu , and Alfred Stein
Dept. of Earth Observation Science, ITC Faculty, University of Twente, Enschede, The Netherlands {s.mohammadi, m.belgiu, a.stein}@utwente.nl
Abstract
Information on cultivated crops is relevant for a large number of food security studies. Many scientific efforts are dedicated to generating this information from remote sensing images by means of machine learning methods. Unfortunately, these methods do not account for the spatial-temporal relationships inherent in remote sensing images. In this paper, we explore the capability of a 3D Fully Convolutional Neural Network (FCN) to map crop types from multi-temporal images. In addition, we propose the Intersection Over Union (IOU) loss function for increasing the overlap between the predicted classes and the ground truth data. The proposed method was applied to identify soybean and corn in a study area situated in the US Corn Belt using multi-temporal Landsat images. The study shows that our method outperforms related methods, obtaining a Kappa coefficient of 90.8%. We conclude that the IOU loss function provides a superior choice for learning individual crop types.
Index Terms— Crop mapping, deep learning, fully convolutional neural networks, time series.
1. Introduction
Multi-temporal remote sensing images are being generated at an unprecedented scale and rate from sources such as Sentinel-2 (5-day revisit), Landsat (16-day revisit), and PlanetScope (daily). In light of this, many scientific efforts have been directed towards converting these huge quantities of multi-temporal remote sensing images into useful information. One of these efforts is automatic crop mapping [1, 2, 3, 4], an active research area in remote sensing.

A decisive factor in crop classification from multi-temporal images is developing methods that can learn the temporal relationships in image time series. Traditional approaches for temporal feature representation, such as the Multilayer Perceptron, Random Forest, Support Vector Machine, and Decision Tree [5, 6, 7, 8, 9], are generally suitable for single-date images and are not able to explicitly consider the sequential relationships of multi-temporal data.

Recently, with the success of deep neural networks in learning high-level task-specific features, CNN- and LSTM-based methods have drawn increasing attention and achieved promising results in crop classification from multi-temporal images [3, 2, 10, 11, 1, 12]. While most deep learning-based methods for crop mapping use a pixel-by-pixel approach, in this paper we design a Fully Convolutional Neural Network (FCN) and use it for crop mapping. FCNs have been widely used in semantic segmentation, salient object detection, and brain tumor segmentation [13, 14, 15, 16, 17]. They are capable of generating the segmentation map of the whole input image at once, and are thus computationally more efficient. In addition, in contrast to pixel-by-pixel approaches, which take individual pixels as input, FCNs take the spatial relationships of adjacent pixels into account. To fit the needs of crop mapping, i.e., learning the sequential relationships of multi-temporal remote sensing data, we use 3D convolution operators as the building blocks of this FCN, so that both spatial and temporal features are extracted simultaneously. This is beneficial because both the spatial and the temporal relationships in multi-temporal remote sensing data are important for accurate crop mapping.

To learn the crop types, most deep learning-based crop mapping methods use the cross-entropy loss [3, 11, 12, 1, 10, 18]. Although they achieved promising results, we believe there is still room for improvement by using a loss function better suited to crop mapping than the cross-entropy loss. To guide the network to generate more accurate predictions for crop types, we propose to learn the crop types by directly increasing the overlap between the prediction map and the ground truth mask, rather than using the cross-entropy loss, which only focuses on per-pixel predictions. To the best of our knowledge, this is the first attempt to use this loss function in crop mapping.

In summary, the main contribution of this paper is that it is the first attempt to learn crop types by increasing the overlap between the prediction map and the ground truth mask for each crop type, which invites a rethink of the loss functions used to train deep neural networks for crop mapping. In conjunction with this loss function, we design a 3D FCN to simultaneously take into account the spatial and the temporal relationships in multi-temporal remote sensing data.
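To make the idea of joint spatial-temporal feature extraction concrete, the following minimal sketch (an illustration under our own assumptions, not the paper's exact architecture; the filter count and layer configuration are invented) shows how a single 3D convolution in Keras operates over a multi-temporal image stack:

```python
import tensorflow as tf

# Hypothetical input: 23 time steps, 256x256 pixels, 6 optical bands,
# shaped (time, height, width, bands) per example.
inputs = tf.keras.Input(shape=(23, 256, 256, 6))

# A single 3D convolution slides over time, height, and width at once,
# so each filter responds to spatio-temporal patterns such as crop phenology.
features = tf.keras.layers.Conv3D(
    filters=16, kernel_size=(3, 3, 3), padding="same", activation="relu"
)(inputs)

model = tf.keras.Model(inputs, features)
model.summary()  # output shape: (None, 23, 256, 256, 16)
```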
2. The Proposed Method
In this section, we explain our designed 3D FCN and the Intersection Over Union (IOU) loss function used to train the network. The network, which is illustrated in Figure 1, is an encoder-decoder architecture that learns to generate the segmentation map of croplands from the input images.

One important component of our proposed FCN is the 3D convolutional operator, which is more beneficial than the 2D convolutional operator for multi-temporal crop mapping since it extracts temporal features in addition to spatial features. In the 3D FCN architecture, the encoder extracts features at four different levels, each carrying different recognition information. At lower levels, the encoder captures spatial and local information due to the small receptive field, whereas at higher levels it captures semantic and global information because of the large receptive field. To take advantage of both the high-level global contexts and the low-level details, features of different levels are merged in the decoder through concatenation, as shown in Figure 1.

In conjunction with the 3D FCN, we propose to use the Intersection Over Union (IOU) loss to guide the FCN to output accurate segmentation maps. In contrast to most deep learning-based crop mapping methods, which use the cross-entropy loss to learn the crop types, we propose a loss function that directly increases the overlap between the prediction map and the ground truth mask. This loss function is better suited to crop mapping than the cross-entropy loss, which only focuses on per-pixel predictions. Our goal is therefore to maximize the Intersection Over Union (IOU) for each crop type by adopting the following loss function:

\[ \mathcal{L}_{\mathrm{IOU}} = \sum_{k=1}^{C} \left( 1 - \mathrm{IOU}_k \right) \tag{1} \]

where C denotes the number of classes, i.e., the number of crop types, and \(\mathrm{IOU}_k\) is defined as:

\[ \mathrm{IOU}_k = \frac{1}{M} \sum_{m=1}^{M} \frac{\sum_{i=1}^{N} p^{k}_{i,m} \cdot g^{k}_{i,m}}{\sum_{i=1}^{N} \left[ p^{k}_{i,m} + g^{k}_{i,m} - U\!\left(p^{k}_{i,m}\right) \cdot g^{k}_{i,m} \right]} \tag{2} \]

where M, N, p, g, and U(·) denote the total number of examples, the total number of pixels in each example, the prediction map, the ground truth mask, and a function that maps values between zero and one to one, respectively.

In the Experimental Results section we show that using this loss function to learn the crop types boosts the performance compared to the cross-entropy loss.
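As a concrete illustration of Eqs. (1) and (2), below is a minimal sketch of the IOU loss in Keras/TensorFlow. This is our own sketch, not the authors' released code: the function name is an assumption, and we realize U(·) with tf.math.ceil, which maps values in (0, 1] to one, consistent with its description above.

```python
import tensorflow as tf

def iou_loss(y_true, y_pred, eps=1e-7):
    """Sketch of the IOU loss of Eqs. (1)-(2).

    y_true: one-hot ground truth masks, shape (batch, H, W, C)
    y_pred: softmax prediction maps,   shape (batch, H, W, C)
    """
    spatial_axes = [1, 2]
    # Per-example, per-class intersection: sum_i p * g
    intersection = tf.reduce_sum(y_pred * y_true, axis=spatial_axes)
    # Union: sum_i [p + g - U(p) * g]; tf.math.ceil maps values in (0, 1]
    # to 1, playing the role of U(.) in Eq. (2).
    union = tf.reduce_sum(
        y_pred + y_true - tf.math.ceil(y_pred) * y_true, axis=spatial_axes
    )
    # Mean over the M examples in the batch, per class.
    iou_k = tf.reduce_mean(intersection / (union + eps), axis=0)
    # Eq. (1): sum of (1 - IOU_k) over the C classes.
    return tf.reduce_sum(1.0 - iou_k)
```

A model could then be trained with, e.g., model.compile(optimizer="sgd", loss=iou_loss).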
3. Experiments

3.1. Study Area, Preprocessing, and Evaluation Metrics
Our experiments are conducted in the U.S. Corn Belt. We select a 1700×1700-pixel area located in the center of the footprint of tile h18v07 in the ARD grid system. We use Landsat Analysis Ready Data (ARD) as the input to our method, which are publicly available on the USGS EarthExplorer web portal. At each observation date, these data contain six optical bands, namely red, green, blue, shortwave infrared 1, shortwave infrared 2, and near-infrared. Furthermore, we used the CropScape web portal to download the Cropland Data Layer (CDL) as the labels for training and testing the network. The selected region is mostly composed of corn and soybean. In this project, corn, soybean, and "other" (i.e., a merged class of other land cover/land use types) are taken as the three classes of interest, and these three categories are assigned to the corresponding pixels of the input image. We use the Landsat multi-temporal data from April 22 to September 23, which covers the growing season of corn and soybean [3].

To preprocess the Landsat multi-temporal data and prepare them for training and testing the model, we follow the same procedure as [3]. We remove the invalid pixels from the dataset. An invalid pixel is a pixel with fewer than seven valid observations after May 15 [3], and a valid observation is one that is not filled, shadowed, cloudy, or unclear. The invalid pixels are excluded from the dataset and are not used in the training process because they do not contain enough multi-temporal information. Then, to fill the resulting gaps in the valid pixels, linear interpolation is used, which results in 23 time steps at seven-day intervals from 22 April to 23 September. Furthermore, we normalize the data using the mean and standard deviation values.

For the performance evaluation of the proposed methods, we employ Cohen's kappa coefficient, macro-averaged producer's accuracy, and macro-averaged user's accuracy. Please refer to [3] for more detail.
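As an illustration of the gap-filling step above, the following sketch interpolates one pixel's reflectance series onto the regular 23-step weekly grid with pandas. The observation dates and values are invented for illustration, and the per-band training-set statistics used for normalization are an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical single-pixel, single-band series of valid observations
# (invalid/cloudy dates already dropped, per the preprocessing above).
obs_dates = pd.to_datetime(["2018-04-22", "2018-05-20", "2018-07-10", "2018-09-23"])
obs_values = pd.Series([0.12, 0.18, 0.35, 0.21], index=obs_dates)

# Target grid: 23 time steps at 7-day intervals, 22 April to 23 September.
grid = pd.date_range("2018-04-22", "2018-09-23", freq="7D")  # 23 steps

# Linearly interpolate in time onto the regular grid.
filled = (
    obs_values.reindex(obs_values.index.union(grid))
    .interpolate(method="time")
    .reindex(grid)
)

# Standardize; the paper normalizes with mean and standard deviation
# (we assume per-band statistics computed on the training set).
normalized = (filled - filled.mean()) / filled.std()
```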
3.2. Implementation Details

We implement our method in Keras [19] using the Google Colaboratory environment. The designed 3D FCN takes as input a 256×256×23×6 image and outputs a 256×256×3 segmentation map. In the input size, 256×256, 23, and 6 correspond to the spatial size, the number of time steps in the time series, and the number of optical bands, respectively. In the output size, 256×256 and 3 correspond to the spatial size of the segmentation map and the number of classes, respectively. We use stochastic gradient descent with a momentum coefficient of 0.9 and a learning rate of 0.005. We split the training data into five sections and use each of them in turn as validation data and the rest for training with a batch size of 2, which results in five models whose softmax outputs are averaged during testing. The code is publicly available at: https://github.com/Sina-Mohammadi/3DFCNwithIOUlossforCropMapping

Figure 1. The architecture of the designed 3D FCN, which is composed of an encoder and a decoder, and is trained using the IOU loss function.

Figure 2. The predicted map of the 3D FCN trained with the IOU loss, ground truth, and difference map. In the figure, green, yellow, black, purple, and orange represent soybean, corn, other classes, zero value, and one value, respectively.

3.3. Experimental Results

Table 1. The experimental results. In this table, Kappa, MA-PA, MA-UA, and CE Loss stand for Cohen's kappa coefficient, macro-averaged producer's accuracy, macro-averaged user's accuracy, and cross-entropy loss.
Method                      | Kappa | MA-PA | MA-UA
Transformer                 | 88.6  | 90.4  | 92.1
Random Forest               | 88.7  | 91.4  | 91.4
Multilayer Perceptron       | 88.8  | 91.4  | 91.5
DeepCropMapping (DCM) [3]   | 89.3  | 91.7  | 91.9
Ours (3D FCN + CE Loss)     | 90.3  | 92.5  | 92.8
Ours (3D FCN + IOU Loss)    | 90.8  | 93.8  | 92.6
We use the data from the selected study area collected in 2015, 2016, and 2017 as our training set, and we test the trained 3D FCNs on the data collected in 2018. We compare our method with the baseline classification models, namely Random Forest (RF) and Multilayer Perceptron (MLP), and with the Transformer [20], using exactly the same settings introduced in [3] (please refer to [3] for more details). Moreover, we compare our method with the deep learning-based method introduced in [3]. The results are shown in Table 1. As seen from the table, our method outperforms the other methods in terms of the different evaluation metrics. Moreover, the adopted IOU loss function performs better overall than the cross-entropy loss. In addition, to visually investigate the performance of the method, we show the predicted map of the 3D FCN trained with the IOU loss, the ground truth, and the difference map in Figure 2.
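The five-model test-time ensembling described in the implementation details can be sketched as follows. This is a minimal illustration under our own assumptions: the model file names and loading code are hypothetical, not the authors' released code.

```python
import numpy as np
import tensorflow as tf

# Hypothetical paths to the five models trained on the five
# train/validation splits (names are assumptions, not from the paper).
model_paths = [f"fcn3d_fold{k}.h5" for k in range(5)]
models = [tf.keras.models.load_model(p, compile=False) for p in model_paths]

def ensemble_predict(x):
    """Average the softmax maps of the five fold models, then take argmax.

    x: batch of inputs shaped (batch, 23, 256, 256, 6)
    returns: class labels shaped (batch, 256, 256)
    """
    probs = np.mean([m.predict(x) for m in models], axis=0)  # (batch, 256, 256, 3)
    return np.argmax(probs, axis=-1)
```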
4. Conclusion
In this study, a 3D FCN with an IOU loss function has been successfully applied to map soybean and corn crops in the US Corn Belt. The experimental results show that the adopted IOU loss function, which maximizes the overlap between the prediction map and the ground truth mask for each crop type, increases the performance compared to the widely used cross-entropy loss. Therefore, the IOU loss function is a better choice for learning individual crop types. For future work, we plan to improve the macro-averaged user's accuracy by adding a loss term to our loss function that improves the precision of the predicted map.
References

[1] Shunping Ji, Chi Zhang, Anjian Xu, Yun Shi, and Yulin Duan, "3D convolutional neural networks for crop classification with multi-temporal remote sensing images," Remote Sensing, vol. 10, no. 1, pp. 75, 2018.
[2] Liheng Zhong, Lina Hu, and Hang Zhou, "Deep learning based multi-temporal crop classification," Remote Sensing of Environment, vol. 221, pp. 430–443, 2019.
[3] Jinfan Xu, Yue Zhu, Renhai Zhong, Zhixian Lin, Jialu Xu, Hao Jiang, Jingfeng Huang, Haifeng Li, and Tao Lin, "DeepCropMapping: A multi-temporal deep learning approach with improved spatial generalizability for dynamic corn and soybean mapping," Remote Sensing of Environment, vol. 247, pp. 111946, 2020.
[4] Mariana Belgiu, Wietske Bijker, Ovidiu Csillik, and Alfred Stein, "Phenology-based sample generation for supervised crop type classification," International Journal of Applied Earth Observation and Geoinformation, vol. 95, pp. 102264, 2021.
[5] Reza Khatami, Giorgos Mountrakis, and Stephen V Stehman, "A meta-analysis of remote sensing research on supervised pixel-based land-cover image classification processes: General guidelines for practitioners and future research," Remote Sensing of Environment, vol. 177, pp. 89–100, 2016.
[6] LeeAnn King, Bernard Adusei, Stephen V Stehman, Peter V Potapov, Xiao-Peng Song, Alexander Krylov, Carlos Di Bella, Thomas R Loveland, David M Johnson, and Matthew C Hansen, "A multi-resolution approach to national-scale cultivated area estimation of soybean," Remote Sensing of Environment, vol. 195, pp. 13–29, 2017.
[7] Fabian Löw, U Michel, Stefan Dech, and Christopher Conrad, "Impact of feature selection on the accuracy and spatial uncertainty of per-field crop classification using support vector machines," ISPRS Journal of Photogrammetry and Remote Sensing, vol. 85, pp. 102–119, 2013.
[8] Richard Massey, Temuulen T Sankey, Russell G Congalton, Kamini Yadav, Prasad S Thenkabail, Mutlu Ozdogan, and Andrew J Sánchez Meador, "MODIS phenology-derived, multi-year distribution of conterminous US crop types," Remote Sensing of Environment, vol. 198, pp. 490–503, 2017.
[9] Di Shi and Xiaojun Yang, "An assessment of algorithmic parameters affecting image classification accuracy by random forests," Photogrammetric Engineering & Remote Sensing, vol. 82, no. 6, pp. 407–417, 2016.
[10] Charlotte Pelletier, Geoffrey I Webb, and François Petitjean, "Temporal convolutional neural network for the classification of satellite image time series," Remote Sensing, vol. 11, no. 5, pp. 523, 2019.
[11] Carolyne Danilla, Claudio Persello, Valentyn Tolpekin, and John Ray Bergado, "Classification of multitemporal SAR images using convolutional neural networks and Markov random fields," IEEE, 2017, pp. 2231–2234.
[12] Nataliia Kussul, Mykola Lavreniuk, Sergii Skakun, and Andrii Shelestov, "Deep learning classification of land cover and crop types using remote sensing data," IEEE Geoscience and Remote Sensing Letters, vol. 14, no. 5, pp. 778–782, 2017.
[13] Changqian Yu, Jingbo Wang, Chao Peng, Changxin Gao, Gang Yu, and Nong Sang, "Learning a discriminative feature network for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1857–1866.
[14] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L Yuille, "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[15] Sina Mohammadi, Mehrdad Noori, Ali Bahri, Sina Ghofrani Majelan, and Mohammad Havaei, "CAGNet: Content-aware guidance for salient object detection," Pattern Recognition, p. 107303, 2020.
[16] Youwei Pang, Xiaoqi Zhao, Lihe Zhang, and Huchuan Lu, "Multi-scale interactive network for salient object detection," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9413–9422.
[17] Mehrdad Noori, Ali Bahri, and Karim Mohammadi, "Attention-guided version of 2D UNet for automatic brain tumor segmentation," IEEE, 2019, pp. 269–275.
[18] Marc Rußwurm and Marco Körner, "Multi-temporal land cover classification with sequential recurrent encoders," ISPRS International Journal of Geo-Information, vol. 7, no. 4, pp. 129, 2018.
[19] François Chollet et al., "Keras," 2015.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998–6008.