Semantic Segmentation with Labeling Uncertainty and Class Imbalance
Patrik Olã Bressan, José Marcato Junior, José Augusto Correa Martins, Diogo Nunes Gonçalves, Daniel Matte Freitas, Lucas Prado Osco, Jonathan de Andrade Silva, Zhipeng Luo, Jonathan Li, Raymundo Cordero Garcia, Wesley Nunes Gonçalves
A Preprint
Patrik Olã Bressan
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

José Marcato Junior
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

José Augusto Correa Martins
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Diogo Nunes Gonçalves
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Daniel Matte Freitas
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Lucas Prado Osco
University of Western São Paulo, Presidente Prudente, SP, Brazil — [email protected]

Jonathan de Andrade Silva
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Zhipeng Luo
Xiamen University, Xiamen, FJ, China — [email protected]

Jonathan Li
University of Waterloo, Waterloo, ON, Canada — [email protected]

Raymundo Cordero Garcia
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]

Wesley Nunes Gonçalves ∗
Federal University of Mato Grosso do Sul, Campo Grande, MS, Brazil — [email protected]
February 2021

Abstract
Recently, methods based on Convolutional Neural Networks (CNN) have achieved impressive success in semantic segmentation tasks. However, challenges such as class imbalance and uncertainty in the pixel-labeling process are not completely addressed. As such, we present a new approach that calculates a weight for each pixel considering its class and the uncertainty during the labeling process. The pixel-wise weights are used during training to increase or decrease the importance of the pixels. Experimental results show that the proposed approach leads to significant improvements in three challenging segmentation tasks in comparison to baseline methods. It also proved to be more invariant to noise. The approach presented here may be used within a wide range of semantic segmentation methods to improve their robustness.
Keywords: Semantic segmentation · Labeling uncertainty · Class weighting · Loss function

∗ Corresponding author: [email protected]
Semantic segmentation is the task of dividing an image into regions whose pixels share similar properties, whether in color, texture, or belonging to the same object. This task is crucial to infer knowledge of a scene in computer vision systems, as shown in tasks of tree species segmentation [Lobo Torres et al., 2020]. Recently, significant advances in semantic segmentation have been achieved through Convolutional Neural Networks (CNN), including methods such as SegNet [Badrinarayanan et al., 2017], Fully Convolutional Network (FCN) [Long et al., 2015], and DeepLabv3+ [Chen et al., 2018].

Despite recent advances, two factors have been little explored in the literature during the training of CNNs for semantic segmentation. The first factor is the imbalance of the class distribution, where the dominant portions of the data are assigned to a few classes while many classes have little representation. As a consequence, semantic segmentation methods become biased toward the dominant classes during inference [López et al., 2013]. One way to minimize imbalance is by uniformly sampling data when collecting images (as in well-known image datasets such as ImageNet [Deng et al., 2009, Chrabaszcz et al., 2017], MNIST (Modified National Institute of Standards and Technology) [Lecun et al., 1998], and CIFAR 10/100), by under-sampling the majority classes [Liu and Tsoumakas, 2019, Tsai et al., 2019, Sun et al., 2018, Ha and Lee, 2016], or by over-sampling the minority classes [Fernández et al., 2018, Li et al., 2017, Nekooeimehr and Lai-Yuen, 2016, Castellanos et al., 2018]. However, these approaches change the distribution of the data and can affect learning and inference [Dal Pozzolo et al., 2015].

The second factor, much less explored in the literature, is related to uncertainty in image labeling [Bulò et al., 2017, Bischke et al., 2018]. In low-resolution or noisy images, the edges of objects become inaccurate and even expert labeling may include annotation errors that impact training. Even in high-resolution images, some objects (e.g., trees [Lobo Torres et al., 2020]) have complex edges that make them difficult to annotate.

To overcome the aforementioned issues, we propose an approach to deal with class imbalance and uncertainty in the labeling process for image segmentation tasks. We introduce a loss function in which the contribution of each pixel is weighted. First, pixels belonging to minority classes have their importance increased. Second, pixels near the edges of objects, which generally carry greater labeling uncertainty, have their importance diminished during training. These two pixel-wise weights are combined and substantially impact the training and inference of the segmentation methods. We evaluate our approach on three datasets that contain the challenges mentioned above. Experimental results show that the proposed approach provides significant improvements when compared to baseline and state-of-the-art methods.
Imbalanced Data.
In semantic segmentation, several approaches have been proposed to deal with class imbalance. Traditional approaches use resampling (e.g., oversampling and undersampling) and rebalancing schemes via data statistics, such as inverse or median frequency [Chan et al., 2019, Xu et al., 2015, Caesar et al., 2015]. Despite correcting the imbalance, these approaches have several disadvantages. Oversampling methods increase computational cost and may be more prone to overfitting due to the inclusion of duplicate data. On the other hand, undersampling methods can discard data important for learning. Other approaches impose constraints during training, such as randomly restricting the number of pixels contributing to the loss function during backpropagation [Bansal et al., 2016], selecting the k highest pixel losses [Wu et al., 2016], or focusing on hard samples [Dong et al., 2019]. Huang et al. [Huang et al., 2016] reduced the effect of class imbalance by enforcing inter-cluster and inter-class margins in standard deep learning frameworks; these margins can be applied through quintuplet instance sampling and the associated triple-header hinge loss. Ren et al. [Ren et al., 2018] proposed a meta-learning framework that assigns weights to training examples based on their gradient directions to reduce class imbalance and corrupted-label problems. Recently, focal loss [Lin et al., 2020] was proposed to penalize hard samples under the assumption that they belong to the minority class. However, this assumption fails when minority classes are well defined, and those classes may then not contribute effectively to training. A survey on deep learning with class imbalance can be found in [Johnson and Khoshgoftaar, 2019].

Labeling Uncertainty.
Labeling uncertainty is related to image resolution and object-edge complexity. Similar to this work, Bischke et al. [Bischke et al., 2018] applied an adaptive uncertainty-weighted class loss to segment satellite imagery. However, only the uncertainty of the class is considered, not the uncertainty of every single pixel, as proposed in this work. Bulò et al. [Bulò et al., 2017] proposed a max-pooling loss that adaptively re-weights the contribution of each pixel based on its observed loss. However, this method does not consider objects whose edges are not well defined and therefore present uncertainty during labeling. Ding et al. [Ding et al., 2019] proposed learning boundary objects as an additional class to increase the feature similarity of the same object. Similarly, Shen et al. [Shen et al., 2015] addressed the contour detection problem by combining a loss function for contour versus non-contour samples. The labeling uncertainty problem is also related to the size of objects in the image, since small objects are harder to label. Islam et al. [Islam et al., 2017] proposed a new CNN architecture to predict segmentation labels at several resolutions, with a loss function at each stage (scale) providing supervision to improve detail in the segmentation labels. Although this improves the segmentation of object edges, labeling uncertainty remains a problem that degrades the result. Hamaguchi et al. [Hamaguchi et al., 2018] proposed a novel architecture called local feature extraction, which aggregates local features with a decreasing dilation factor to segment small objects in remote sensing imagery.
The purpose of semantic segmentation methods is to assign a label to each pixel x of an image I(x), providing a pixel-level mask M̂(x). The most common methods for this task are based on CNNs composed of convolution, pooling, and upsampling layers [Long et al., 2015, Badrinarayanan et al., 2017]. This way, the pixel-level mask M̂ is obtained through a CNN f_θ with layer parameters θ, M̂ = f_θ(I). The dominant loss function used to train such a CNN takes the following form:

min_{θ∈Θ} ∑_{(I,M)∈T} L(M̂, M) + λR(θ)    (1)

where (I, M) is an example consisting of an image I and a ground-truth mask M from the training set T, M̂ = f_θ(I) is the predicted mask, L is a loss function (e.g., cross-entropy) that penalizes wrong labels, and R is a regularizer.

In semantic segmentation tasks, the loss function L is usually decomposed into a sum of pixel losses according to Eq. 2, so that every pixel contributes uniformly during training:

L(M̂, M) = (1/n) ∑_{x=1}^{n} L(M̂(x), M(x))    (2)

where n is the number of pixels.

There are two main issues with this approach: i) class imbalance; and ii) uncertainty in the annotation. The consequence of class imbalance is a bias towards the dominant classes over the classes that occupy smaller parts of the image. This occurs in most real-world image segmentation problems, where a few classes dominate most images. Also, some classes do not have well-defined borders (e.g., trees), resulting in uncertainly labeled pixels. An incorrectly labeled pixel influences model learning, making filter convergence and learning even more difficult for small objects. Figs. 2 and 3 present examples that illustrate these challenges. The trees in Fig. 2 show that in most images the foreground covers fewer pixels than the background (class imbalance). Besides, trees have edges that are difficult to label, and some pixels may be incorrectly labeled. Fig. 3 also illustrates the labeling challenge, in which some parts of the object are not visible in the image due to noise introduced when capturing the images.

To address these issues, we propose to weight the contribution of each pixel based on the importance of its labeled class and the uncertainty of its labeling. A weight ω(x) for each pixel is used in the loss function according to Eq. 3:

L(M̂, M) = (1/n) ∑_{x=1}^{n} ω(x) · L(M̂(x), M(x))    (3)

Unlike other approaches (e.g., focal loss [Lin et al., 2020]), the weight ω(x) of pixel x is calculated by considering two important characteristics, as shown in Eq. 4. The first part, φ_{c(x)}, considers class imbalance, where c(x) is the class labeled for pixel x. The second part, δ(x), considers the labeling uncertainty of pixel x. Both parts are described in detail in the sections below.

ω(x) = φ_{c(x)} · δ(x)    (4)
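The weighted loss of Eq. 3 is straightforward to implement: the per-pixel cross-entropy is scaled by a precomputed weight map before averaging. Below is a minimal NumPy sketch under our own array-shape conventions (it is an illustration of the formula, not the authors' code):

```python
import numpy as np

def weighted_pixel_loss(probs, labels, weights):
    """Weighted pixel-wise cross-entropy in the spirit of Eq. 3.

    probs:   (H, W, C) predicted class probabilities per pixel.
    labels:  (H, W) integer ground-truth class per pixel.
    weights: (H, W) per-pixel weights omega(x).
    """
    h, w = labels.shape
    # Probability the model assigned to the true class of each pixel.
    p_true = probs[np.arange(h)[:, None], np.arange(w)[None, :], labels]
    # Standard cross-entropy per pixel, clipped for numerical safety.
    pixel_ce = -np.log(np.clip(p_true, 1e-12, 1.0))
    # Weighted average over all n pixels.
    return float(np.mean(weights * pixel_ce))

# Toy 1x2 image with two classes: uniform weights reduce to plain cross-entropy.
probs = np.array([[[0.9, 0.1], [0.2, 0.8]]])
labels = np.array([[0, 1]])
uniform = weighted_pixel_loss(probs, labels, np.ones((1, 2)))
```

With uniform weights the function reproduces Eq. 2; a non-uniform weight map shifts the gradient budget toward pixels deemed important.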
The first characteristic takes into account the imbalance of classes. To determine the weight of each class c, we use the training set according to Eq. 5. The lower the number of pixels in a given class, the higher the weight, so that the CNN layer filters fit evenly across classes. When φ_c equals 1 for all classes, training proceeds as in the traditional setting. It is important to note that this weight is the same for all pixels of the same class.

φ_c = m / (C · n_c)    (5)

where m is the number of pixels over all training images, C is the number of classes, and n_c is the number of pixels that belong to class c.

The second characteristic considers the labeling uncertainty and is calculated for each pixel in the image. This is especially relevant for objects with poorly defined edges or for low-resolution images. We consider that the closer a pixel is to an edge, the greater the uncertainty of its labeled class; conversely, pixels near the center of objects are labeled more accurately. This behavior is modeled by Eq. 6 using the distance of a pixel to the edges, where the main parameter σ determines the spread of uncertainty around the edge.

δ(x) = 1 − e^{−d(x)² / (2σ²)}    (6)

where d(x) is the distance from pixel x to the nearest edge pixel (which can be calculated efficiently using the Euclidean Distance Transform) and σ is the standard deviation.

Figure 1 illustrates the process of calculating δ(x) for each pixel x. The closer a pixel is to the object's edge, the lower the value of δ(x), and therefore the higher its labeling uncertainty. As a pixel moves away from the edge, its labeling uncertainty is reduced.

To evaluate the proposed approach, we used two well-known semantic segmentation methods: SegNet [Badrinarayanan et al., 2017] and FCN [Long et al., 2015]. SegNet is a CNN with encoder and decoder networks followed by a final pixel-wise classification layer. For each input, the encoder provides a low-resolution activation map representing the most important features. In this work, the encoder is composed of the convolutional and max-pooling layers of VGG16 [Simonyan and Zisserman, 2014]. The segmented image is then reconstructed by the decoder, which is composed of convolutional and upsampling layers that use the corresponding max-pooling indices from the encoder to upsample the low-resolution feature map. In the last layer, a softmax classifier receives the feature map from the decoder for pixel-wise classification. FCN [Long et al., 2015] extends a classification CNN (VGG16 [Simonyan and Zisserman, 2014]) by making it fully convolutional: the fully connected layers are replaced by convolutional layers. The first part thus produces a low-resolution feature map from the image, which is upsampled to produce pixel-wise predictions for segmentation.
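The weight map ω(x) = φ_{c(x)} · δ(x) described above can be precomputed once per ground-truth mask. The sketch below is our own minimal NumPy version: the class weights follow Eq. 5, and the edge distance is found by brute force for clarity (the Euclidean Distance Transform mentioned in the paper, e.g. `scipy.ndimage.distance_transform_edt`, would be the efficient route):

```python
import numpy as np

def class_weights(labels, num_classes):
    """phi_c = m / (C * n_c): rarer classes receive larger weights (Eq. 5)."""
    m = labels.size
    counts = np.bincount(labels.ravel(), minlength=num_classes)
    return m / (num_classes * np.maximum(counts, 1))

def uncertainty_weights(labels, sigma):
    """delta(x) = 1 - exp(-d(x)^2 / (2 sigma^2)) (Eq. 6).

    d(x) is the distance to the nearest edge pixel. Here a brute-force
    search over edge pixels stands in for the Euclidean Distance Transform.
    """
    h, w = labels.shape
    # An edge pixel is one with a 4-neighbour of a different class.
    edge = np.zeros((h, w), dtype=bool)
    edge[:-1, :] |= labels[:-1, :] != labels[1:, :]
    edge[1:, :]  |= labels[1:, :] != labels[:-1, :]
    edge[:, :-1] |= labels[:, :-1] != labels[:, 1:]
    edge[:, 1:]  |= labels[:, 1:] != labels[:, :-1]
    ys, xs = np.nonzero(edge)
    if len(ys) == 0:                      # single-class mask: full certainty
        return np.ones((h, w))
    gy, gx = np.mgrid[0:h, 0:w]
    # Squared distance from every pixel to its nearest edge pixel.
    d2 = ((gy[..., None] - ys) ** 2 + (gx[..., None] - xs) ** 2).min(axis=-1)
    return 1.0 - np.exp(-d2 / (2.0 * sigma ** 2))

def pixel_weights(labels, num_classes, sigma):
    """omega(x) = phi_{c(x)} * delta(x) (Eq. 4)."""
    phi = class_weights(labels, num_classes)
    return phi[labels] * uncertainty_weights(labels, sigma)
```

Pixels on a class boundary get δ(x) = 0 and are effectively discounted, while pixels deep inside an object keep nearly their full class weight φ_c.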
Three image datasets were used in this study to demonstrate the robustness of our method. These datasets, described below, present the challenges of class imbalance and labeling uncertainty.
Urban Tree (UT).
This dataset is composed of aerial RGB orthoimages generated with a GSD (Ground Sample Distance) of 10 cm over the municipality of Campo Grande, Brazil. The examples of the Urban Tree dataset in Fig. 2 show that the boundaries of the trees are difficult to label. The dataset is composed of 966 non-overlapping patches of × pixels. In the experiments, 580, 193, and 193 patches were used for training, validation, and testing, respectively.

Rib Eye Area (REA).
This image dataset consists of ultrasound images of the Longissimus dorsi muscle between the 11th and 13th ribs of cattle. The goal is to automatically calculate the rib eye area (REA), an important region for decision making during cattle breeding. The main challenge is the uncertainty in the REA annotation, since the image is noisy and even experts have difficulty delimiting the borders of this region. Fig. 3 presents examples of images and the annotation made by a specialist. We can observe that some borders are absent and depend on the subjectivity and knowledge of the annotator. To evaluate the segmentation methods, 76 images with × resolution were obtained and labeled by an expert. Due to the small number of images, the division into training and testing followed 5-fold cross-validation.

Figure 1: Example of calculating the uncertainty δ(x) of each pixel x. The closer a pixel is to the edge, the greater its uncertainty.

Soybean Disease (SD).
The images from this dataset were obtained from PlantVillage [Hughes and Salathé, 2015], which contains photographs taken with a cell phone in soybean plantations. To compose the dataset, 201 images with frog-eye disease were identified and manually annotated, as shown in Fig. 4. It is important to emphasize that the images were taken in the field and present several lighting challenges. The images were randomly divided into three sets: 121 for training, 40 for validation, and 40 for testing.
Figure 2: Sample images from the Urban Tree (UT) dataset.

For the REA and Urban Tree datasets, the images were resized to × pixels. For the Soybean Disease dataset, we resized the images to × pixels because the originals have high resolution. For all segmentation methods, we used the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.001, momentum of 0.9, and weight decay of 0.0005. The backbone weights of the segmentation methods were initialized with weights pre-trained on ImageNet.

To evaluate the proposed approach and the baselines, we use two popular segmentation metrics: pixel accuracy (PA) and intersection over union (IoU). PA is the percentage of pixels correctly classified for each class, while IoU is obtained by dividing the intersection area by the union area between prediction and ground truth. Since the background is dominant in most images, we report PA and IoU only for the class of interest (e.g., REA, tree, and disease).
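As a reference for how the two metrics behave on a binary foreground/background mask, here is a minimal sketch (variable names are our own; PA is implemented as the per-class accuracy described above):

```python
import numpy as np

def pixel_accuracy(pred, gt):
    """PA for the class of interest: fraction of its ground-truth pixels
    that the prediction labeled correctly."""
    fg = gt == 1
    return float(np.sum(pred[fg] == 1) / np.sum(fg))

def iou(pred, gt):
    """IoU: |prediction AND ground-truth| / |prediction OR ground-truth|."""
    inter = np.sum((pred == 1) & (gt == 1))
    union = np.sum((pred == 1) | (gt == 1))
    return float(inter / union)
```

Note that IoU also penalizes false positives, which is why a method can raise PA while lowering IoU (as observed later on the Soybean Disease dataset).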
In Tables 1 and 2, we compare the baseline methods and the proposed approach using SegNet and FCN, respectively. The main parameter of the proposed approach is σ, which controls the spread of uncertainty used in the loss function; results for different values of σ are therefore also reported.

For SegNet (Table 1), the proposed approach improved pixel accuracy (from 0.744 to 0.838 on the Urban Tree dataset, 0.888 to 0.927 on the REA dataset, and 0.350 to 0.777 on the Soybean Disease dataset). The proposed approach also showed superior IoU results, especially on the Urban Tree and SD datasets, where IoU improved from 0.676 to 0.705 and from 0.324 to 0.567, respectively. Further, σ = 2 provided the best result on the Urban Tree and SD datasets, while σ = 3 provided the best one on the REA dataset. A lower value of σ for the Urban Tree and SD datasets is expected due to the size of the foreground (tree and disease), which generally occupies a smaller area than the background compared to REA.
Figure 3: Sample images from the Rib Eye Area (REA) dataset.

Table 1: Comparative results between the proposed approach using SegNet and the baseline on the three image datasets.

                 | Urban Tree    | REA           | SD
Method           | PA     IoU    | PA     IoU    | PA     IoU
SegNet           | 0.744  0.676  | 0.888  0.841  | 0.350  0.324
SegNet + σ = 1   |               |               |
SegNet + σ = 2   | 0.838  0.705  |               | 0.777  0.567
SegNet + σ = 3   |               | 0.927         |

Table 2: Comparative results between the proposed approach using FCN and the baseline on the three image datasets.

                 | Urban Tree    | REA           | SD
Method           | PA     IoU    | PA     IoU    | PA     IoU
FCN              | 0.820  0.730  | 0.966  0.861  | 0.750
FCN + σ = 1      |               |               |
FCN + σ = 2      |               |               |
FCN + σ = 3      |               |               |
Figure 4: Sample images from Soybean Disease (SD) dataset.
As shown in the previous section, FCN achieved better results than SegNet on the three image datasets. Therefore, we discuss and present visual results only for the FCN baseline and FCN with the proposed approach.
Urban Tree dataset.
Fig. 5 presents three examples that show the advantage of the proposed approach. The first column shows the ground truth, while the second and third columns present the segmentation results of the baseline and the proposed approach. The first example (first row) shows that the baseline incorrectly segments grass as tree; the proposed approach correctly segments the grass as background, even though the colors are similar. The second example shows that the proposed approach can correctly segment small foreground regions, because the importance of these pixels is increased during training and the weights of the convolutional layers tend to adjust better to them. Finally, the third example also shows small regions correctly segmented by the proposed approach, and the tree edge is better defined than in the baseline. This is possible because labeling uncertainty is accounted for in tree-border regions, which are hardly ever labeled correctly: at object borders, the proposed method decreases the importance of pixels, and the CNN weights take this into account.
REA dataset.
This image dataset has high labeling uncertainty due to the noise in the ultrasound images. In some cases, the border of the REA is not completely visible and must be estimated by the specialist. Therefore, the proposed approach becomes essential to obtain accurate segmentation at the edges. The segmentation examples in Fig. 6 show that the baseline was not able to delineate the REA correctly due to labeling uncertainty, whereas the proposed approach presents results close to the specialist's in regions where the border must be estimated.
Soybean Disease dataset.
As shown in Fig. 7, the proposed approach was able to segment soybean disease with excellent pixel accuracy. It detects regions of disease that the baseline was not capable of, as illustrated in the second example. The proposed approach also segments the disease pixels more accurately than the baseline (see the third example). However, the proposed approach generally segments a region larger than the ground truth, which explains its lower IoU compared to the baseline. In this task, a low false-negative rate (as in the proposed approach) is important to detect diseases early and reduce losses.
Figure 5: Example of (a) ground-truth, (b) FCN, and (c) proposed approach from the Urban Tree dataset.
Figure 6: Example of (a) ground-truth, (b) FCN, and (c) proposed approach from the REA dataset.

The noise invariance of the semantic segmentation methods was assessed on the Urban Tree dataset. Gaussian noise with σ = 0. was added to the images, as illustrated in Fig. 8. We trained the proposed approach and the FCN baseline on noisy images and then evaluated them on the test set with and without noise.

The results of training both approaches on noisy images are shown in Table 3. The second column of the table presents the results on noisy test images. As expected, both approaches still provided good results, since they were trained and tested on noisy images. Our approach achieved superior pixel accuracy and IoU compared to the baseline (0.875 versus 0.776, and 0.697 versus 0.686).

Although these results are promising, they do not guarantee that the method discarded the noise during training, since the test images were also noisy. To effectively assess noise invariance, the third column of Table 3 shows the results of training on noisy images and testing on noise-free images. The baseline FCN presented weak results, showing that the noise greatly interfered with its training. The proposed approach, in contrast, showed consistent results, demonstrating its robustness to noise: it obtained pixel accuracies of 0.875 and 0.847 on test images with and without noise, a drop of only 0.028.

Fig. 9 shows the visual segmentation results of both methods on test images with and without noise. The results of the baseline FCN and the proposed approach on a noisy test image (Fig. 9(a)) are shown in Figs. 9(b) and 9(c), respectively. As the methods were trained on noisy images, they achieved satisfactory results despite the apparent noise. However, when a noise-free image is used to test methods trained on noisy images, the results of the proposed approach are superior to those of FCN, as shown in Figs. 9(d)-9(i).
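The noise-robustness protocol above amounts to corrupting the training images with additive Gaussian noise and then evaluating on both noisy and clean test copies. A minimal sketch, with a placeholder noise level of our own choosing rather than the paper's setting:

```python
import numpy as np

def add_gaussian_noise(image, sigma, seed=None):
    """Additive Gaussian noise for an image normalized to [0, 1]."""
    rng = np.random.default_rng(seed)
    noisy = image + rng.normal(0.0, sigma, size=image.shape)
    # Clip back to the valid intensity range.
    return np.clip(noisy, 0.0, 1.0)

# Corrupt a (toy) training image; the clean copy is kept for the
# noise-free evaluation pass.
clean = np.full((4, 4), 0.5)
noisy = add_gaussian_noise(clean, sigma=0.1, seed=0)
```

Training only ever sees `noisy`, so any gap between the two test conditions measures how much of the noise the model absorbed.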
Figure 7: Example of (a) ground-truth, (b) FCN, and (c) proposed approach from the Soybean Disease dataset.

Table 3: Comparative results between our method and the baseline FCN using noisy images to train.

         | Noisy Images  | Noise-free Images
Method   | PA     IoU    | PA     IoU
FCN      | 0.776  0.686  | 0.122  0.122
Ours     | 0.875  0.697  | 0.847  0.569
Figure 8: Original images and their respective noisy images.

A correctly weighted loss is important for semantic segmentation methods, mainly on datasets with imbalanced classes and labeling uncertainty. This paper shows how these challenges can be addressed with a new loss function. The proposed approach combines two weights: i) the importance of a class given its occurrence; and ii) the uncertainty in the labeling of pixels close to the edges. To the best of our knowledge, this is the first approach that tackles both challenges using pixel-wise weights during training.

The robustness of the proposed approach was ascertained on the three datasets considered, which present different characteristics and challenges. The results showed that the proposed approach obtains superior metrics regardless of the segmentation method (e.g., SegNet and FCN). Significant results, with an increase of up to 40% in accuracy, were achieved by the proposed approach, which clearly shows its relevance. Our approach also proved to be more invariant to noise, even when training was performed on noisy images and testing on noise-free images.

As future work, we intend to evaluate new semantic segmentation methods. Further research also includes applying the proposed approach to segmentation problems with several classes.
Acknowledgments.
This study was supported by FUNDECT (State of Mato Grosso do Sul Foundation to Support Education, Science and Technology), CAPES (Brazilian Federal Agency for Support and Evaluation of Graduate Education), and CNPq (National Council for Scientific and Technological Development). The Titan V and XP GPUs used for this research were donated by the NVIDIA Corporation.
Figure 9: Comparative results of the proposed approach and FCN trained on noisy images. The first row of images shows the segmentation of a noisy test image, while the second row shows the results on a noise-free test image.
References
Badrinarayanan, V., Kendall, A., Cipolla, R., 2017. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 39, 2481–2495. doi:10.1109/TPAMI.2016.2644615.
Bansal, A., Chen, X., Russell, B.C., Gupta, A., Ramanan, D., 2016. PixelNet: Towards a general pixel-level architecture. CoRR abs/1609.06694. arXiv:1609.06694.

Bischke, B., Helber, P., Borth, D., Dengel, A., 2018. Segmentation of imbalanced classes in satellite imagery using adaptive uncertainty weighted class loss, in: IGARSS, pp. 6191–6194. doi:10.1109/IGARSS.2018.8517836.

Bulò, S.R., Neuhold, G., Kontschieder, P., 2017. Loss max-pooling for semantic image segmentation, in: CVPR, pp. 7082–7091. doi:10.1109/CVPR.2017.749.

Caesar, H., Uijlings, J., Ferrari, V., 2015. Joint calibration for semantic segmentation, in: BMVC, BMVA Press. pp. 29.1–29.13. doi:10.5244/C.29.29.

Castellanos, F.J., Valero-Mas, J.J., Calvo-Zaragoza, J., Rico-Juan, J.R., 2018. Oversampling imbalanced data in the string space. Pattern Recognition Letters 103, 32–38. doi:10.1016/j.patrec.2018.01.003.

Chan, R., Rottmann, M., Hüger, F., Schlicht, P., Gottschalk, H., 2019. Application of decision rules for handling class imbalance in semantic segmentation. CoRR abs/1901.08394. arXiv:1901.08394.

Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H., 2018. Encoder-decoder with atrous separable convolution for semantic image segmentation, in: ECCV, Springer International Publishing. pp. 833–851.

Chrabaszcz, P., Loshchilov, I., Hutter, F., 2017. A downsampled variant of ImageNet as an alternative to the CIFAR datasets. CoRR abs/1707.08819. arXiv:1707.08819.

Dal Pozzolo, A., Caelen, O., Bontempi, G., 2015. When is undersampling effective in unbalanced classification tasks?, in: Appice, A., Rodrigues, P.P., Santos Costa, V., Soares, C., Gama, J., Jorge, A. (Eds.), Machine Learning and Knowledge Discovery in Databases, Cham. pp. 200–215.

Deng, J., Dong, W., Socher, R., Li, L., Kai Li, Li Fei-Fei, 2009. ImageNet: A large-scale hierarchical image database, in: CVPR, pp. 248–255. doi:10.1109/CVPR.2009.5206848.

Ding, H., Jiang, X., Liu, A.Q., Thalmann, N.M., Wang, G., 2019. Boundary-aware feature propagation for scene segmentation, in: ICCV, pp. 6819–6829.

Dong, Q., Gong, S., Zhu, X., 2019. Imbalanced deep learning by minority class incremental rectification. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 1367–1381. doi:10.1109/TPAMI.2018.2832629.

Fernández, A., García, S., Herrera, F., Chawla, N.V., 2018. SMOTE for learning from imbalanced data: Progress and challenges, marking the 15-year anniversary. J. Artif. Int. Res. 61, 863–905.

Ha, J., Lee, J.S., 2016. A new under-sampling method using genetic algorithm for imbalanced data classification, in: International Conference on Ubiquitous Information Management and Communication, ACM, New York, NY, USA. pp. 95:1–95:6. doi:10.1145/2857546.2857643.

Hamaguchi, R., Fujita, A., Nemoto, K., Imaizumi, T., Hikosaka, S., 2018. Effective use of dilated convolutions for segmenting small object instances in remote sensing imagery, in: WACV, pp. 1442–1450. doi:10.1109/WACV.2018.00162.

Huang, C., Li, Y., Loy, C.C., Tang, X., 2016. Learning deep representation for imbalanced classification, in: CVPR, pp. 5375–5384. doi:10.1109/CVPR.2016.580.

Hughes, D.P., Salathé, M., 2015. An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing. CoRR abs/1511.08060. arXiv:1511.08060.

Islam, M.A., Naha, S., Rochan, M., Bruce, N., Wang, Y., 2017. Label refinement network for coarse-to-fine semantic segmentation. arXiv:1703.00551.

Johnson, J.M., Khoshgoftaar, T.M., 2019. Survey on deep learning with class imbalance. Journal of Big Data 6, 27. doi:10.1186/s40537-019-0192-5.

Lecun, Y., Bottou, L., Bengio, Y., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 86, 2278–2324. doi:10.1109/5.726791.

Li, J., Liu, L.s., Fong, S., Wong, R.K., Mohammed, S., Fiaidhi, J., Sung, Y., Wong, K.K.L., 2017. Adaptive swarm balancing algorithms for rare-event prediction in imbalanced healthcare data. PLOS ONE 12, 1–25. doi:10.1371/journal.pone.0180830.

Lin, T., Goyal, P., Girshick, R., He, K., Dollár, P., 2020. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, 318–327.

Liu, B., Tsoumakas, G., 2019. Dealing with class imbalance in classifier chains via random undersampling. Knowledge-Based Systems, 105292. doi:10.1016/j.knosys.2019.105292.
Lobo Torres, D., Queiroz Feitosa, R., Nigri Happ, P., Elena Cué La Rosa, L., Marcato Junior, J., Martins, J., Olã Bressan, P., Gonçalves, W.N., Liesenberg, V., 2020. Applying fully convolutional architectures for semantic segmentation of a single tree species in urban environment on high resolution UAV optical imagery. Sensors 20. doi:10.3390/s20020563.

Long, J., Shelhamer, E., Darrell, T., 2015. Fully convolutional networks for semantic segmentation, in: CVPR, pp. 3431–3440.

López, V., Fernández, A., García, S., Palade, V., Herrera, F., 2013. An insight into classification with imbalanced data: Empirical results and current trends on using data intrinsic characteristics. Information Sciences 250, 113–141. doi:10.1016/j.ins.2013.07.007.

Nekooeimehr, I., Lai-Yuen, S.K., 2016. Adaptive semi-unsupervised weighted oversampling (A-SUWO) for imbalanced datasets. Expert Systems with Applications 46, 405–416. doi:10.1016/j.eswa.2015.10.031.

Ren, M., Zeng, W., Yang, B., Urtasun, R., 2018. Learning to reweight examples for robust deep learning, in: ICML, pp. 4331–4340.

Shen, W., Wang, X., Wang, Y., Bai, X., Zhang, Z., 2015. DeepContour: A deep convolutional feature learned by positive-sharing loss for contour detection, in: CVPR, pp. 3982–3991. doi:10.1109/CVPR.2015.7299024.

Simonyan, K., Zisserman, A., 2014. Very deep convolutional networks for large-scale image recognition. CoRR abs/1409.1556.

Sun, B., Chen, H., Wang, J., Xie, H., 2018. Evolutionary under-sampling based bagging ensemble method for imbalanced data classification. Frontiers of Computer Science 12, 331–350. doi:10.1007/s11704-016-5306-z.

Tsai, C.F., Lin, W.C., Hu, Y.H., Yao, G.T., 2019. Under-sampling class imbalanced datasets by combining clustering analysis and instance selection. Information Sciences 477, 47–54. doi:10.1016/j.ins.2018.10.029.

Wu, Z., Shen, C., van den Hengel, A., 2016. High-performance semantic segmentation using very deep fully convolutional networks. CoRR abs/1604.04339. arXiv:1604.04339.