Understanding the effects of artifacts on automated polyp detection and incorporating that knowledge via learning without forgetting
Maxime Kayser, Roger D. Soberanis-Mukul, Anna-Maria Zvereva, Peter Klare, Nassir Navab, Shadi Albarqouni
Computer Aided Medical Procedures, Technische Universität München, Munich, Germany
Big Data Institute, University of Oxford, Oxford, UK
Computer Aided Medical Procedures, Johns Hopkins University, Baltimore, USA
Computer Vision Laboratory, ETH Zurich, Zurich, Switzerland
Klinik für Innere Medizin II, Klinikum rechts der Isar der Technischen Universität München, Munich, Germany
Abteilung Innere Medizin Gastroenterologie, Krankenhaus Agatharied, Hausham, Germany

∗ M. K. was with Technische Universität München, Munich, Germany at the time of this research.
† M. K. and R. D. S. share first authorship.
‡ Corresponding author: [email protected]
§ P. K. was with Klinikum rechts der Isar der Technischen Universität München, Munich, Germany at the time of this research.
¶ N. N. and S. A. share senior authorship.
Abstract
Survival rates for colorectal cancer are higher when polyps are detected at an early stage and can be removed before they develop into malignant tumors. Automated polyp detection, which is dominated by deep learning based methods, seeks to improve early detection of polyps. However, current efforts rely heavily on the size and quality of the training datasets. The quality of these datasets often suffers from various image artifacts that affect the visibility and, hence, the detection rate. In this work, we conducted a systematic analysis to gain a better understanding of how artifacts affect automated polyp detection. We look at how six different artifact classes, and their location in an image, affect the performance of a RetinaNet based polyp detection model. We found that, depending on the artifact class, artifacts can either benefit or harm the polyp detector. For instance, bubbles are often misclassified as polyps, while specular reflections inside of a polyp region can improve detection capabilities. We then investigated different strategies, such as a learning without forgetting framework, to leverage artifact knowledge to improve automated polyp detection. Our results show that such models can mitigate some of the harmful effects of artifacts, but require more work to significantly improve polyp detection capabilities.
Keywords — Polyp detection, Artifact detection, Learning without forgetting, Multi-task learning
1 Introduction

Colorectal cancer (CRC) is the most deadly cancer in the United States [Siegel et al., 2018]. CRC often develops from polyps, which may turn into tumors (Fig. 1a). Early detection of these polyps can increase the CRC survival rate to up to 95% [Bernal et al., 2012]. The standard procedure for polyp screening is the endoscopic analysis of the colon, called colonoscopy. One of the ways to reduce the polyp miss rate, which is crucial for minimizing the risks related to CRC, is the use of a real-time automated polyp-detection system. However, automated polyp detection remains a challenging problem, given the high variation in appearance, size, and shape of polyps, their often similar texture to the surrounding tissue, as well as image artifacts obstructing and corrupting the endoscopic video frames.

Initial work [Bernal et al., 2017] has shown how some artifacts, such as specularity (i.e., strong reflection of light) or blur, can have a significant effect on the performance of automated polyp detection systems. However, because it required manual labelling of artifacts, that study is limited to a restricted set of artifact classes and labels them at an image level, without specifying their location. We extended their work on understanding how artifacts affect automated polyp detection performance by considering an inclusive set of six different artifacts and by specifying their location via bounding boxes. This allowed us to specifically analyse how our deep learning based polyp detection model is affected when looking at regions that overlap with artifacts or have artifacts inside of them. We annotated the artifacts with a model that was trained on an endoscopic artifact dataset released by the Endoscopic Artefact Detection (EAD) challenge [Ali et al., 2019b] and ranked third place in that challenge.

While previous literature has suggested that artifacts have a significant influence on automated polyp detection, no work has yet tried to incorporate knowledge about artifacts in deep learning based polyp detection models. Existing work either attempted to restore the frames, which is currently not feasible in real-time [Ali et al., 2019a], or approached polyp and artifact detection as a simple multi-class detection problem [Vázquez et al., 2016]. In this work, we addressed this knowledge gap by exploring ways to incorporate artifact information in polyp detection. The hypothesis is that, by teaching a model to represent artifacts, its ability to distinguish between artifacts and polyps will be improved. Amongst others, we built a RetinaNet based model that is modified according to the learning without forgetting (LwF) approach [Li and Hoiem, 2018]. LwF aims to make a model learn new capabilities while maintaining performance on the old capabilities, without relying on the training data from the old tasks.

Figure 1: a) Sample frames from the CVC-ClinicDB dataset. Polyp locations are indicated by green marks. b) Samples from the EAD challenge dataset. Shown artifact classes are specularity (pink box), instruments (red), bubbles (black), and blur (blue). Additional artifact classes include contrast, saturation, and misc. artifact, which aggregates all other types of artifacts.
It has been shown that LwF can thereby improve performance on the new task [Li and Hoiem, 2018]. Our contributions in this work are thus twofold: 1) we present the most extensive existing analysis of how artifacts affect polyp detection, and 2) we present the first work that explores multi-task learning (MTL) techniques for including artifact information in deep learning based polyp detection.

2 Related Work

In this section, we review related literature in polyp detection, artifact detection, and multi-task learning.

2.1 Polyp Detection
Automated polyp detection is a computer vision problem that was dominated by hand-crafted feature methods before computing power made deep learning feasible [Karkanis et al., 2003, Iakovidis et al., 2005, Alexandre et al., 2008, Ameling et al., 2009, Gross et al., 2009, Hwang et al., 2007, Ganz et al., 2012, Bernal et al., 2015]. However, given the great variance of polyp shapes, the different angles used in colonoscopy, and the different lighting modes applied, the appearance of polyps varies widely, and these methods have therefore always had limited performance [Yu et al., 2016]. With the ascent of deep learning, automated polyp detection performance has improved significantly in recent years, and the majority of high-performing approaches now rely on it [Bernal et al., 2017, Zhu et al., 2015, Yu et al., 2016, Brandao et al., 2017, Angermann et al., 2017, Wang et al., 2018, Mohammed et al., 2018, Shin et al., 2018, Brandao et al., 2018, Mo et al., 2018, Kang and Gwak, 2019, Jia et al., 2019]. Some methods adopt a hybrid approach, where hand-crafted and learned features are combined [Tajbakhsh et al., 2015, Bae and Yoon, 2015, Silva et al., 2014, Ševo et al., 2016]. Recent work includes [Mo et al., 2018], which shows that Faster R-CNN [Ren et al., 2015], a popular object detection framework that relies on a region proposal module, can achieve state-of-the-art performance. [Jia et al., 2019] proposed a two-step polyp detection framework, which uses Faster R-CNN with feature pyramid networks to detect polyp regions in the first step and segments them with a fully convolutional network in the second step. In another polyp segmentation approach, [Kang and Gwak, 2019] proposed an ensemble of two Mask R-CNN models [He et al., 2017] with different backbone architectures. However, so far none of these works has tried to leverage artifact information in the polyp detection framework.
2.2 Artifact Detection

A study of MICCAI's 2015 polyp detection competition has shown that automated polyp detection is still hampered by the artifacts contained in endoscopic frames [Bernal et al., 2017]. In that study, three experts were asked to label the ASU-Mayo Clinic Colonoscopy Video Database [Tajbakhsh et al., 2015] for massive specular highlight presence, low visibility, specular highlights within the polyp, and overexposed regions. They then looked at how the performance of the different models participating in the competition differs in the presence of these artifacts. We were able to extend the findings of this study by involving a higher number of artifacts (Fig. 1b) and, by having artifact annotations at a bounding box rather than an image label level, conducting a more fine-grained analysis of the effects that artifacts have on polyp detection algorithms.

Efforts have also been made to automatically detect these artifacts in order to remove them and restore the images. Most of these approaches focus only on specific artifacts, such as blur or specular reflections [Stehle, 2006, Tchoulack et al., 2008, Liu et al., 2011, Funke et al., 2018, Akbari et al., 2018]. A recent work has built a framework that can detect six different artifact classes and apply artifact-specific frame restoration procedures, which are often based on adversarial networks [Ali et al., 2019a]. While this method can restore the quality of endoscopic frames for post-processing procedures, it is not applicable in real-time. In contrast to these efforts, we seek to improve polyp detection in real-time by teaching a model to differentiate between artifacts and polyps.

To the best of our knowledge, the only work that has addressed artifacts in the context of polyp detection is [Vázquez et al., 2017]. This team extended an existing polyp database by adding segmentation annotations for lumen, specularity, borders, and background. Their aim was to build a model that is able to detect and segment polyps as well as the four above-mentioned classes. In contrast to our work, the researchers did not aim to use artifact data to improve polyp detection. In fact, training their model to detect all five classes has significant detrimental effects on its polyp detection performance. Compared to them, we included more artifact classes and investigated more sophisticated approaches to leverage artifacts for automated polyp detection.
2.3 Multi-Task Learning

In our context, we will refer to multi-task learning (MTL) as the set of approaches that use auxiliary (or related) tasks to improve the original primary task. Early work has shown that MTL can, for example, improve mortality rate prediction for pneumonia patients [Caruana, 1997]. The main inputs for this model are certain patient characteristics, such as age or whether the patient presents particular symptoms. By ensuring that the model not only predicts mortality rate but also related outputs, such as white blood cell count, the hidden layers of the model were biased to better capture certain characteristics of the patient, which led to better model performance. More recent work presented by [Zhou et al., 2019] used the correlation between the severity of diabetic retinopathy and lesions present on the eye. The authors proposed a model for disease grading and multi-lesion segmentation that allows both tasks to collaborate to improve each other. Their proposal also permits the segmentation model to train in a semi-supervised way. This is done by using the segmenter's predictions as input for an attentive model used by the grading network. At the same time, the attention maps generated by this attentive model are used as pseudo-masks for training the segmenter with unannotated images. In our work, the primary task is polyp detection and the auxiliary task is artifact detection.
3 Methodology

In this section, we describe the single-task polyp detection and artifact detection models, which were used to analyse the effect that artifacts have on polyp detection. We also describe our proposed multi-task model, which uses LwF to leverage artifact information to improve polyp detection.
3.1 Problem Formulation

Given an input frame $X \in \mathbb{R}^{h \times w \times 3}$ from a colonoscopy video sequence, we defined single-task models as $Y^t = f(X; \theta^t)$, where $\theta^t$ are the model parameters trained for a specific task $t$. Thus, our polyp detection model is given by $Y^p = f(X; \theta^p)$ and our artifact detection model is given by $Y^a = f(X; \theta^a)$. Both single-task models are based on RetinaNet, which is described in section 3.2.

The polyp detector outputs a set of bounding boxes $Y^p = \{y_i^p;\ i = 1, \ldots, n_p\}$, where $y_i^p$ represents a bounding box detection in $X$ and $n_p$ is the total number of detections. It trains on a polyp dataset $D^p = \{(X_i^p, Y_i^p);\ i = 1, \ldots, m_p\}$, where $m_p$ is the number of samples.

The artifact detector gives us an equivalent set of artifact locations $Y^a$, and an additional set $C = \{c_i;\ i = 1, \ldots, n_c\}$ of artifact classes $c_i \in \{1, \ldots, 6\}$, with $n_c$ the total number of artifacts found in $X$. Here $c_i$ indicates the type of artifact in the box $y_i^a$. Artifact types are blur, bubbles, contrast, specularity, saturation, and miscellaneous (misc.) artifacts. The dataset for training this detector is defined as $D^a = \{(X_i^a, Y_i^a, C_i);\ i = 1, \ldots, m_a\}$, where $m_a$ is the number of samples.

For the MTL model, our main objective was to improve polyp detection performance by including information about the artifacts present in $X$. We thus defined the MTL model as $Y^t = f(X; \theta^s, \theta^p)$, where $\theta^s$ is a set of shared parameters across the artifact and polyp detection tasks and $\theta^p$ represents polyp detection specific parameters. This model is inspired by LwF, which is explained in section 3.4.

3.2 RetinaNet

RetinaNet [Lin et al., 2017b] consists of a backbone network for extracting convolutional feature maps and two subnetworks that perform object classification (classification subnetwork) and bounding box regression (regression subnetwork) via convolutional layers. The backbone network can be a conventional image classification convolutional neural network (CNN), such as ResNet [He et al., 2016]. The classification and regression losses are given by the focal loss and the smooth L1 loss, respectively. It is a one-stage method, meaning that it does not require a region proposal module as is common in object detection [Girshick et al., 2014, Girshick, 2015, Ren et al., 2015]. Instead, anchors at different scales and aspect ratios are densely distributed across the image and classified by the network. To construct a multi-scale feature pyramid from a single resolution input image, the backbone network is augmented by a feature pyramid network (FPN) [Lin et al., 2017a].

The main innovation of RetinaNet is the focal loss, a modification of the cross-entropy loss that adds weighting parameters to prevent one-stage detectors from being swamped by the great number of easy background (i.e., non-object) anchors. To address this imbalance, the focal loss introduces a weighting factor of $(1 - q^*)^\gamma$, where

$$q^* = \begin{cases} q & \text{if } y = 1 \\ 1 - q & \text{otherwise,} \end{cases} \qquad (1)$$

and $q$ is the estimated probability $P(y = 1)$ of a detection. This factor reduces the importance of easy anchors. $\gamma$ is a hyperparameter that controls the extent to which the loss focuses on hard examples; for $\gamma = 0$, the focal loss is the same as the cross-entropy loss. The focal loss incorporates another weighting factor, $\alpha \in [0, 1]$, to address the class imbalance between background and foreground anchors. Foreground anchors are weighted by $\alpha$ and background anchors by $1 - \alpha$. We define $\alpha^*$ analogously to $q^*$ in equation (1). The focal loss is then given by:

$$\mathrm{FL}(q^*) = -\alpha^* (1 - q^*)^\gamma \log(q^*). \qquad (2)$$
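Since the focal loss is used by every model in this work, the following minimal NumPy sketch illustrates equations (1) and (2). The vectorized function and its names are our own illustration, not the code used in our experiments.

```python
import numpy as np

def focal_loss(q, y, alpha=0.25, gamma=2.0):
    """Per-anchor binary focal loss, equations (1) and (2).

    q: estimated probabilities P(y = 1), shape (n_anchors,)
    y: ground-truth labels in {0, 1}, shape (n_anchors,)
    """
    # Equation (1): q* (and alpha*) pick the value assigned to the
    # ground-truth class of each anchor.
    q_star = np.where(y == 1, q, 1.0 - q)
    alpha_star = np.where(y == 1, alpha, 1.0 - alpha)
    # Equation (2): (1 - q*)^gamma down-weights easy anchors, i.e. those
    # that are already classified with high confidence.
    return -alpha_star * (1.0 - q_star) ** gamma * np.log(q_star)

# An easy background anchor (q = 0.01) contributes orders of magnitude
# less loss than a hard one (q = 0.6):
print(focal_loss(np.array([0.01, 0.6]), np.array([0, 0])))
```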
3.3 Single-Task Base Models

Our single-task artifact and polyp detection models are based on RetinaNet. We use these models to evaluate the impact of artifacts present in the image (see section 4.4). The base polyp detector $Y^p = f^*(X; \theta^p)$ uses a standard RetinaNet architecture with a ResNet that was pre-trained on ImageNet as backbone network. The base artifact detector $[Y^a, C] = f^*(X; \theta^a)$ is taken from our submission to the EAD 2019 challenge [Kayser et al., ]. For the competition, we validated the model for optimal classification/regression loss weighting, focal loss parameters, data augmentation, and backbone model configurations (ResNet 50, 101, and 152). In the competition we used an ensemble of seven models and ranked third overall. In this work we did not use the ensemble model, but our best performing single model from the challenge.

3.4 Learning without Forgetting

Learning without Forgetting (LwF) is an MTL strategy that allows teaching an additional task $t_p$ to an existing model previously trained on a related task $t_a$ [Li and Hoiem, 2018]. One of the main advantages of LwF is that, to extend the capabilities of a model, only the training data of the new task is necessary. In our case, all we needed was thus an initial artifact detector $[Y^a, C] = f(X; \theta^s, \theta^a)$, with $\theta^a$ the initial task-specific parameters, and a dataset $D^p$ for our new polyp detection task.

Following the LwF method, we first used our base artifact detector on all the images $X \in D^p$ to generate a set of artifact boxes. Then, we incorporated these predictions into $D^p$ as ground truth for artifacts. The threshold at which we select these artifact annotations is an important hyperparameter. We thereby obtained a new dataset $D^{mt}$ that is suitable for training models on both tasks. The procedure is illustrated in Fig. 2a.

We were then able to use $D^{mt}$ to train an MTL model $[Y^t, C^t] = f(X; \theta^s, \theta^t)$ with $t \in \{p, a\}$ using the loss function below. For simplicity, we set $Z^t = [Y^t, C^t]$:

$$L = \ell_p(Z^p, f(X; \theta^s, \theta^p)) + \ell_a(Z^a, f(X; \theta^s, \theta^a)) + R \qquad (3)$$

where $R = r(\theta^s, \theta^p, \theta^a)$ is a weight regularizer and $\ell_p$, $\ell_a$ are task-specific losses. In [Li and Hoiem, 2018] the two tasks have different loss functions, where the loss function $\ell_a$ for the initial task is the Knowledge Distillation loss, whose purpose is to encourage two networks to have the same output [Hinton et al., 2015]. Given that our goal does not include maintaining good performance on the related task, we opted for using the same loss function, the focal loss, for both the initial and the new task.

Figure 2: a) Dataset $D^{mt}$ is generated by using our artifact detection model to annotate artifact bounding boxes in a polyp-only dataset. b) Our approach for modeling the problem as a multi-class object detection problem; the model does not differentiate between polyps and different artifact classes. c) Our learning without forgetting (LwF) inspired approach; the model has a set of shared parameters (backbone model, feature pyramid network, and regression subnetwork) and task-specific classification subnetworks for polyp and artifact detection.
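The dataset-generation step of Fig. 2a can be summarized in a few lines. The sketch below assumes an artifact detector that returns NumPy arrays of boxes, class labels, and confidence scores; all names and the data layout are illustrative, not our actual pipeline.

```python
def build_multitask_dataset(polyp_dataset, artifact_detector, threshold=0.5):
    """Create D_mt by pseudo-labeling artifacts on a polyp-only dataset.

    polyp_dataset: iterable of (image, polyp_boxes) pairs.
    artifact_detector: callable returning (boxes, labels, scores) as
                       NumPy arrays for a given image.
    threshold: confidence above which a predicted artifact box is kept
               as pseudo ground truth (an important hyperparameter,
               cf. section 4.5).
    """
    multitask_dataset = []
    for image, polyp_boxes in polyp_dataset:
        boxes, labels, scores = artifact_detector(image)
        keep = scores >= threshold  # discard low-confidence artifact boxes
        artifact_annotations = list(zip(boxes[keep], labels[keep]))
        multitask_dataset.append((image, polyp_boxes, artifact_annotations))
    return multitask_dataset
```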
3.5 Multi-Task Architecture

The LwF method requires task-specific parameters for the polyp and artifact detection tasks. The RetinaNet architecture can naturally be extended to have a set of shared and task-specific parameters. We selected the parameters of the backbone and FPN to be our shared parameters $\theta^s$. Then, for our tasks $a$ and $p$, we built individual classification subnetworks with parameters $\theta^a$ and $\theta^p$, respectively. For the regression subnetwork, parameters are shared across the tasks. Each of the three subnetworks has its own loss function. The framework is illustrated in Figure 2c, and a schematic sketch follows below.
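To make the parameter split concrete, here is a PyTorch-style sketch. The single convolutions stand in for the real ResNet backbone, FPN, and multi-layer subnetworks, so this illustrates only which parameters are shared, under our own naming, and is not the actual architecture.

```python
import torch
import torch.nn as nn

class DualHeadRetinaNet(nn.Module):
    """Schematic LwF-style detector: shared backbone/FPN and regression
    head (theta_s), separate classification heads for artifacts (theta_a)
    and polyps (theta_p)."""

    def __init__(self, feat_ch=256, n_anchors=9, n_artifact_classes=6):
        super().__init__()
        # Shared parameters theta_s (stand-ins for backbone + FPN and the
        # box-regression subnetwork).
        self.backbone_fpn = nn.Conv2d(3, feat_ch, 3, padding=1)
        self.regression = nn.Conv2d(feat_ch, n_anchors * 4, 3, padding=1)
        # Task-specific classification subnetworks.
        self.artifact_cls = nn.Conv2d(feat_ch, n_anchors * n_artifact_classes,
                                      3, padding=1)
        self.polyp_cls = nn.Conv2d(feat_ch, n_anchors, 3, padding=1)

    def forward(self, x):
        feats = self.backbone_fpn(x)               # shared features
        return {
            "boxes": self.regression(feats),       # shared smooth-L1 loss
            "artifacts": self.artifact_cls(feats), # focal loss, weight w_a
            "polyps": self.polyp_cls(feats),       # focal loss, weight w_p
        }

outputs = DualHeadRetinaNet()(torch.randn(1, 3, 224, 224))
```

The total training loss is then the weighted sum of the three subnetwork losses, corresponding to equation (3) and to the reg:art:pol weightings explored in section 4.5.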
4 Experiments and Results

Our experiments are divided into two sections. First, section 4.4 evaluates how artifacts affect polyp detection performance. For this analysis, we used an in-house dataset that consists of 55,411 frames. The dataset is thereby far bigger than existing publicly available datasets and enabled us to obtain more statistically meaningful insights. In addition, existing publicly available datasets are often designed for polyp detection only, and therefore might exclude frames that are strongly or entirely corrupted by artifacts. In contrast, our in-house dataset has not been processed at all and is thereby much closer to a real-life clinical setting. Section 4.5 then tests ways to make use of artifact knowledge to improve polyp detection. For this section, we conducted our experiments on publicly available datasets. We thus used different datasets for sections 4.4 and 4.5. Thereby, we avoid bias by not using insights gained from the validation sets to design our multi-task approaches. Further, this will enable future work to compare its performance to ours.

4.1 Datasets

We make use of the following datasets:

• EAD2019 [Ali et al., 2019b]: The training data of the EAD2019 dataset contains 2192 unique video frames with bounding box annotations and class labels for seven artifact classes. The data displays a lot of variation, as it was obtained from four different centres and includes still frames from multiple tissues, light modalities, and populations. Example artifact annotations can be found in Figures 1 and 3.

• CVC-ClinicDB [Bernal et al., 2015, Bernal et al., 2017]: The CVC-ClinicDB dataset is a publicly available colonoscopy dataset that consists of 612 colonoscopy frames taken from a total of 29 video sequences. The sequences are from routine colonoscopies and were selected to represent as much variation in polyp appearance as possible. The whole dataset contains 31 polyps, and every image contains at least one polyp. Of these 31 polyps, 22 are small polyps below the size of 10 mm and 9 are greater than 10 mm in diameter [Fernández-Esparrach et al., 2016]. The colonoscopies were conducted with standard resolution white-light video colonoscopes (Q160AL or Q165L). All frames have a pixel resolution of 388 × 284 pixels in standard definition.
• ETIS-Larib [Bernal et al., 2017]: ETIS-Larib is a polyp dataset that was used as the testing dataset in the still frame analysis task in [Bernal et al., 2017]. The dataset consists of 196 high definition frames that were selected from 34 sequences. The dimension of the images is 1225 × 966 pixels.

• in-house: Our in-house dataset was collected in the endoscopy department of Klinikum rechts der Isar, Technical University of Munich. It consists of 55,411 high-resolution frames (1920 pixels wide).

• Kvasir-SEG [Jha et al., 2020]: The Kvasir-SEG dataset contains 1,000 polyp images with bounding boxes and segmentation masks. Image resolution varies between 332 × 487 and 1920 × 1072 pixels.

Figure 3: a) Sample frames from the EAD challenge dataset, which show artifacts of the class miscellaneous (misc.).

4.2 Implementation Details

Due to memory constraints, we used a batch size of 2 during training on polyp detection. As focal loss parameters, we found γ = 2.0 and α = 0.25 to be optimal.
As a backbone network, we chose ResNet-50 pre-trained on ImageNet. Despite achieving higher performance with ResNet-101 and ResNet-152, we chose ResNet-50, as the goal was to have a simple baseline as benchmark against our multi-task approaches. Except where stated otherwise, all our experiments with RetinaNet were trained with the Adam optimizer [Kingma and Ba, 2014] with a learning rate of 10e−.

4.3 Validation Framework

In order to compare our polyp detection performance with existing methods, we follow the commonly used validation framework from [Bernal et al., 2017], where scores are reported by giving the number of true positives, false positives, false negatives, precision, recall, F1-score, and F2-score. This validation framework can be used both for segmentation and object detection approaches. A true positive is any detection where the centroid of the bounding box (or segmentation mask) lies within the polyp ground-truth mask. False positives are detections where the centroid is outside of the polyp mask. Each ground truth can only have one true positive; thus, each detection that correctly detects a polyp which has already been detected is counted as a false positive. False negatives are all ground-truth polyps that have not been detected.
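This counting scheme can be sketched as follows, assuming detections as (x1, y1, x2, y2) boxes whose centroids lie inside the image and one binary mask per ground-truth polyp; function and variable names are our own.

```python
import numpy as np

def match_detections(det_boxes, gt_masks):
    """Count TP/FP/FN under the centroid criterion of [Bernal et al., 2017].

    det_boxes: list of (x1, y1, x2, y2) detections.
    gt_masks: list of binary NumPy arrays, one per ground-truth polyp.
    """
    tp, fp = 0, 0
    credited = [False] * len(gt_masks)
    for x1, y1, x2, y2 in det_boxes:
        cx, cy = int((x1 + x2) / 2), int((y1 + y2) / 2)
        # TP iff the centroid falls inside a polyp mask that has not been
        # credited yet; duplicate detections of the same polyp are FPs.
        hit = next((i for i, mask in enumerate(gt_masks)
                    if mask[cy, cx] and not credited[i]), None)
        if hit is None:
            fp += 1
        else:
            credited[hit] = True
            tp += 1
    fn = credited.count(False)
    return tp, fp, fn
```

Precision, recall, and the F-scores then follow from these counts, with Fβ = (1 + β²)·P·R / (β²·P + R).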
4.4 Effects of Artifacts on Polyp Detection

The first contribution of this work is to investigate how deep learning based polyp detection performance is affected by the presence of image artifacts. We conducted these experiments on our in-house dataset. First, we used our artifact detection model to annotate artifact bounding boxes in that dataset. Artifact bounding boxes were taken at a confidence threshold of 0.25. We chose this value as we wanted to obtain as much information on artifacts as possible, and this threshold had proven optimal in our submissions to the EAD challenge. We then ran our polyp detection model to get a set of polyp bounding box predictions on the same images. Polyp detections are always considered at a threshold of 0.5, which proved optimal in preliminary experiments. We then performed a number of analyses, as described below, to better understand how artifacts affect our polyp detector. Illustrative example frames can be found in Figure 4.

Figure 4: Examples of artifact and polyp predictions in our in-house dataset. Green bounding boxes represent polyp ground truth and red bounding boxes are polyp predictions. The remaining bounding boxes are bubbles (black), specularity (pink), blur (blue), and saturation (brown). Best viewed in color.
4.4.1 Image-Level Artifact Presence

In our first analysis, we wanted to get a general idea of how polyp detection performance differs given the presence of artifacts. For each artifact class, we took the subset of images where the artifact is contained (at a given area threshold) and compared polyp detection performance to the subset of images where the artifact is not present. Table 1 shows our results.

To consider an artifact present in an image, we defined different area thresholds for the different artifacts (see Fig. 5). The area was computed based on the total number of pixels covered by a given artifact. For blur, the threshold was set at 50% of the entire image size, to only include images where the entire image is blurred. For specularity, which is effectively present in all (99%) of the images, we selected a threshold of 5%, to cover only images with a high amount of specularity. At this threshold, specularity is still present in 56% of images. For similar reasons, we selected 2% of the image area as a threshold for misc. artifacts and bubbles. No thresholds were set for contrast and saturation.

Figure 5: In experiment 4.4.1 we set area thresholds for some artifacts to consider them present in an image. The first row (a) shows artifacts that do not meet the area threshold, compared to the row below (b) where they do.

Table 1: For six different artifact classes, polyp detection performance is compared between images where the artifact is present and images where it is not. Frequency gives the share of images where the artifact is present. The score difference is the difference of the respective metrics between images where the artifact is present or not present. The experiment was conducted on the in-house dataset, consisting of 55,411 frames.

                                    Score Difference (%)
Artifact type   Frequency (%)   precision   recall   F1      F2
bubbles          9.51           -5.67         2.33   -1.61    0.76
blur            20.85            4.31       -10.05   -3.69   -7.67
misc.           48.99            5.62        -1.94    1.62   -0.56
specularity     49.28           -6.93         4.65   -0.80    2.54
saturation      69.28            0.20         2.54    1.44    2.12
contrast        79.70           -2.86         4.41    1.09    3.15
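The per-image coverage behind these thresholds can be computed from the predicted boxes; rasterizing the union of boxes avoids double-counting overlapping detections. A minimal sketch with illustrative names:

```python
import numpy as np

def artifact_coverage(boxes, height, width):
    """Share of image pixels covered by the union of artifact boxes."""
    covered = np.zeros((height, width), dtype=bool)
    for x1, y1, x2, y2 in boxes:
        covered[int(y1):int(y2), int(x1):int(x2)] = True
    return covered.mean()

# E.g. blur counts as "present" only if it covers at least half the frame:
# blur_present = artifact_coverage(blur_boxes, h, w) >= 0.5
```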
4.4.2 Artifact Location Relative to Polyps

We next evaluated the frequency of overlap between artifact detections and polyp ground truths, as well as true positive, false positive, and false negative polyp predictions. We counted how many times these polyp bounding boxes overlap with an artifact (results in Table 2) and how many times a given artifact is inside of them (results in Table 3). See Fig. 6 for an example of these relationships. For the sake of this analysis, we did not remove a prediction if it overlapped an existing prediction, i.e., if two polyp detections overlap the same polyp, they are both still considered correct predictions (this was naturally not done in experiments where we evaluated polyp detection performance). Therefore, the sum of true positives and false positives does not add up to the number of ground-truth polyps in our results tables. The goal of this analysis was to understand how artifacts affect polyp detection at a more in-depth level. For example, it allowed us to find out whether a certain artifact is sometimes misclassified as a polyp (i.e., when false positive detections often overlap with that artifact). We considered an artifact and a polyp to overlap if their intersection-over-union (IoU) is greater than 0.5.

Figure 6: The green box represents a polyp detection. Artifact bounding boxes either overlap with a polyp (yellow) or are contained within a polyp (blue).
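The two geometric relations in Fig. 6 can be expressed directly on (x1, y1, x2, y2) boxes; a minimal sketch, with function names of our own choosing:

```python
def box_area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0

def overlaps(polyp_box, artifact_box, thr=0.5):
    """'Overlap' relation used in Table 2: IoU above the threshold."""
    return iou(polyp_box, artifact_box) > thr

def contains(polyp_box, artifact_box):
    """'Inside' relation used in Table 3: artifact box fully within the polyp box."""
    return (polyp_box[0] <= artifact_box[0] and polyp_box[1] <= artifact_box[1]
            and artifact_box[2] <= polyp_box[2] and artifact_box[3] <= polyp_box[3])
```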
4.5 Incorporating Artifact Knowledge

Next, we investigated approaches to incorporate artifact knowledge in polyp detection. We compared several such approaches, including our MTL model described in section 3.5. The results shown in this section are the averaged scores of 3-fold cross-validation on the CVC-ClinicDB dataset and follow the validation framework laid out in section 4.3. Our MTL approaches are enumerated from simple to more complex: we tested using the artifact model to initialize the weights of the polyp detector, we then considered it a multi-class problem (with different class weights), and finally we employed a specialized architecture to have a model with shared and task-specific parameters.

Table 2: The share of ground-truth polyps as well as true positive, false positive, and false negative polyp detections that overlap with respective artifacts (i.e., their intersection-over-union is greater than 0.5). Frequency gives the count of the different polyp bounding box types in the dataset. Any artifact incorporates all six artifacts. These experiments were conducted on our in-house dataset, which consists of 55,411 frames.

                             Share of polyps overlapping artifacts (%)
Polyp type       Frequency   Any artifact   Bubbles   Blur   Misc. artifact   Specularity   Saturation   Contrast
ground-truth     55411       17.8           1.5       3.4    2.2              2.7           4.5          2.9
true positives   45532       17.4           1.3       3.5    1.6              2.2           3.8          3.0
false positives  10435       23.8           3.0       6.1    2.7              1.9           1.3          5.8
false negatives  13438       13.7           1.2       3.0    3.2              2.2           4.3          2.6

Table 3: The share of ground-truth polyps as well as true positive, false positive, and false negative polyp detections that contain respective artifacts inside of them (i.e., the bounding box of the artifact is fully contained inside the polyp bounding box). Frequency gives the count of the different polyp bounding box types in the dataset. Any artifact incorporates all six artifacts. These experiments were conducted on our in-house dataset, which consists of 55,411 frames.

                             Share of polyps containing artifacts (%)
Polyp type       Frequency   Any artifact   Bubbles   Blur   Misc. artifact   Specularity   Saturation   Contrast
ground-truth     55411       88.5           11.0      0.3    33.1             86.8          12.7         0.7
true positives   45532       94.0           15.4      0.5    37.3             91.9          16.2         2.5
false positives  10435       92.0           22.7      1.7    37.8             87.9          16.7         6.1
false negatives  13438       79.9            8.3      0.2    32.8             77.5           7.2         0.4
4.5.1 Transfer Learning

In our first approach, we looked at the most straightforward way to utilize the artifact data: using the weights of the artifact model to initialize the polyp detection model. This is thus a transfer learning approach, where we pre-trained our polyp detector on the EAD2019 artifact dataset and then fine-tuned it on the polyp dataset. We tried two different TL approaches: 1) fine-tuning the entire network and 2) freezing the backbone and only fine-tuning the classification subnetwork of RetinaNet. We also compared the performance with models pre-trained on ImageNet and COCO. Results are shown in Table 4.

Table 4: Polyp detection performance of different Transfer Learning (TL) approaches. For "initialization", the entire model is fine-tuned on the final task. For "freeze backbone", only the final layers of the model are trained on the final task. "Baseline" is our polyp-only model that was pre-trained on ImageNet. Results are from 3-fold cross-validation on CVC-ClinicDB.

Pre-training     TL Type           Precision   Recall   F1      F2
ImageNet         initialization    –           –        –       –
Artifact Model   initialization    0.832       0.859    0.845   0.853
Artifact Model   freeze backbone   0.836       0.766    0.794   0.776
4.5.2 Multi-Class Detection

A naive approach for MTL on both polyps and artifacts is simply formulating the problem as a multi-class classification problem, where we have one class for each artifact and one class for the polyps. We thus used our single-task polyp detector, but extended it by six classes to include artifacts. We obtained our artifact annotations as described in section 3.4 and experimented with different thresholds for selecting them. Results are shown in Table 5.

Table 5: Polyp detection performance of a multi-class RetinaNet model that detects polyps as well as artifacts. Artifact threshold gives the threshold at which we considered artifact annotations as ground truth in the training set; the lower the threshold, the more artifacts are contained in each image. "Baseline" is our polyp-only model that was pre-trained on ImageNet. Results are from 3-fold cross-validation on CVC-ClinicDB.

Artifact    Artifacts    Precision   Recall   F1      F2
Threshold   per Image
0.2         37.8         0.146       0.515    0.225   0.336
0.4          7.7         0.237       0.487    0.310   0.392
0.5          3.6         0.413       0.690    0.516   0.608
0.6          2.0         0.722       0.640    0.673   0.652
0.8          1.2         –           –        –       –
4.5.3 Weighted Multi-Class Detection

We repeated the previous experiments but increased the weighting of the polyp class to 25%, 50%, and 75% of the classification loss, while distributing the remaining share of the weight equally between the six artifact classes. We further compared pre-training on ImageNet and on the EAD2019 artifact dataset. In order to include rich artifact information, we selected artifact annotations at a threshold of 0.5 (for the EAD competition, optimal artifact thresholds were between 0.25 and 0.5). Results are displayed in Table 6.
4.5.4 LwF-Inspired Multi-Task Model

We then conducted experiments on our LwF-inspired MTL model, which was described in section 3.5. We tried four different weighting configurations for the loss function of the model: we weighted the regression and artifact classification subnetworks by 1 and assigned weights of 1, 3, 10, and 20 to the polyp classification loss. We tried training the models on artifacts taken at a 0.2 and at a 0.5 threshold. Results are shown in Table 7.

Table 6: Polyp detection performance of a multi-class RetinaNet model that detects polyps as well as artifacts and has specific class weights. Proportional class weights of the polyp class are given in the first column; the remaining weight is distributed equally among the artifact classes. Pre-training is either on ImageNet or from our artifact detector. Artifacts in the training data are taken at a 0.5 confidence threshold. "Baseline" is our polyp-only model that was pre-trained on ImageNet. Results are from 3-fold cross-validation on CVC-ClinicDB.

Polyp Weight   Pre-training     Precision   Recall   F1      F2
25%            ImageNet         0.774       0.726    0.749   0.735
50%            ImageNet         0.797       0.761    0.775   0.766
75%            ImageNet         0.752       0.765    0.744   0.752
25%            Artifact Model   0.808       0.766    0.785   0.773
50%            Artifact Model   0.777       0.789    0.782   0.786
75%            Artifact Model   0.770       –        –       –
4.5.5 Selecting Relevant Artifact Classes

Section 4.4 has shown that some artifact classes are more relevant to polyp detection performance than others. In addition, our models may be overwhelmed by the high number of artifact classes, and reducing that number may be beneficial. We therefore restricted ourselves to the artifact classes that appear most relevant for polyp detection. For instance, if we know that blurs are often misclassified as polyps, or that polyps containing specularity are easier to detect, we know that these classes can be useful for polyp detection. We created four different subsets of artifacts to include in our MTL approach: the first subset contains only the artifact that we deemed most important, the second contains the two most prioritized ones, and so on. The artifacts that we deemed most influential, in descending order, are: blur, specularity, misc. artifacts, and bubbles. We ran our MTL model with artifact thresholds at 0.5. The weights of the loss functions were chosen to be proportionate to the amount of artifacts in the image for the different artifact classes; for the four models in Table 8, they are given by 1:5:1, 1:1:3, 1:1:3, and 1:1:3, respectively, where the weighting corresponds to the regression, artifact, and polyp loss functions.

Table 7: Polyp detection performance of our learning without forgetting (LwF) inspired multi-task model. The model has three subnetworks on top of the model backbone: a regression (reg) subnetwork, an artifact (art) classification subnetwork, and a polyp (pol) classification subnetwork. These three subnetworks have their own losses, whose respective weights are given in the first column. The artifact thresholds for the training data are given in the second column. "Baseline" is our polyp-only model that was pre-trained on ImageNet. Results are from 3-fold cross-validation on CVC-ClinicDB.

Loss Weights    Artifact    Precision   Recall   F1      F2
(reg:art:pol)   Threshold
1:1:1           0.2         0.836       0.269    0.381   0.304
1:1:3           0.2         0.841       0.484    0.604   0.525
1:1:10          0.2         0.803       0.650    0.715   0.674
1:1:20          0.2         0.831       0.677    0.739   0.699
1:1:1           0.5         0.820       0.727    0.771   0.744
1:1:3           0.5         0.802       0.755    0.773   0.761
1:1:10          0.5         0.779       0.793    0.781   0.787
1:1:20          0.5         0.796       –        –       –
4.5.6 Artifact Robustness

Finally, we repeated some experiments from subsection 4.4.2 in order to understand how leveraging artifact information has changed the way that artifacts affect polyp detection performance. We selected the MTL RetinaNet trained on blur, bubbles, misc., and specularity. Artifact annotations for training were taken at a 0.2 threshold, since it is similar to the artifact threshold of 0.25 used in the experiments in 4.4.2. Results are shown in Tables 9 and 10.
4.5.7 Cross-Dataset Validation

So far, all the results in section 4.5 were from a 3-fold cross-validation on the CVC-ClinicDB dataset. To verify the validity of these results on other datasets, we re-ran the best-performing model of each experiment in section 4.5 on the ETIS-Larib dataset, our in-house dataset, and the Kvasir-SEG dataset. For this, all our models were trained on CVC-ClinicDB. While the aim of this work is not to surpass the state-of-the-art in polyp detection, this also enabled us to place the performance of our models in the context of existing methods. The results are given in Table 11.

Table 8: Polyp detection performance of our learning without forgetting (LwF) inspired multi-task model when only incorporating knowledge about some of the existing artifact classes. A star (*) in an artifact column means that the given artifact was included for this model. Artifacts in the training data are taken at a 0.5 confidence threshold. "Baseline" is our polyp-only model that was pre-trained on ImageNet. Results are from 3-fold cross-validation on CVC-ClinicDB.

blur   spec.   misc.   bubbles   Precision   Recall   F1      F2
*                                0.870       0.808    –       –
*      *       *                 0.843       0.820    0.829   0.823
*      *       *       *         0.859       0.790    0.821   0.802
baseline                         0.849       0.814    0.829   0.820
5 Discussion

5.1 Effects of Artifacts on Polyp Detection

Overall correlation
Table 1 gives an initial indication that the presence of some artifacts affects the polyp detection rate. Indeed, artifacts like blur and bubbles affect the F1-score negatively (-3.69% and -1.61%). For bubbles, this is due to lower precision, which could be explained by the fact that bubbles intuitively look similar to polyps and may thus be misclassified as such. For blur, the lower F1-score is due to much lower recall (-10.05%) on these images. At the same time, images containing misc. artifacts, saturation, and contrast tend to have higher F1-scores.

Table 9: The share of ground-truth polyps as well as true positive, false positive, and false negative polyp detections that overlap with respective artifacts (i.e., their intersection-over-union is greater than 0.5). For each of the four artifacts, we compare how the predictions of the baseline model ("polyps") are affected vs. how the predictions of our learning without forgetting (LwF) inspired multi-task model ("MTL") are affected. The multi-task model incorporates knowledge from the four artifacts in this table. Frequency gives the count of the different polyp bounding box types in the dataset. These experiments were conducted on the ETIS-Larib dataset.

                                Share of polyps overlapping artifacts (%)
                 frequency      bubbles        blur           misc. artifact   specularity
polyp type       polyps  MTL    polyps  MTL    polyps  MTL    polyps  MTL      polyps  MTL
ground-truth     208     208    3.4     3.4    1.0     1.0    0.5     0.5      4.8     4.8
true positives   137     152    2.2     3.3    1.5     2.0    0.0     0.0      3.6     3.9
false positives  39      64     5.1     6.2    30.8    6.2    0.0     3.1      5.1     4.7
false negatives  63      64     6.3     4.7    1.6     0.0    1.6     1.6      6.3     6.2

Table 10: The share of ground-truth polyps as well as true positive, false positive, and false negative polyp detections that contain respective artifacts inside of them (i.e., the bounding box of the artifact is fully contained inside the polyp bounding box). For each of the four artifacts, we compare how the predictions of the baseline model ("polyps") are affected vs. how the predictions of our learning without forgetting (LwF) inspired multi-task model ("MTL") are affected. The multi-task model incorporates knowledge from the four artifacts in this table. Frequency gives the count of the different polyp bounding box types in the dataset. These experiments were conducted on the ETIS-Larib dataset.

                                Share of polyps containing artifacts (%)
                 frequency      bubbles        blur           misc. artifact   specularity
polyp type       polyps  MTL    polyps  MTL    polyps  MTL    polyps  MTL      polyps  MTL
ground-truth     208     208    14.4    14.4   0.0     0.0    7.2     7.2      60.6    60.6
true positives   137     152    11.7    17.8   0.0     0.0    10.2    10.5     75.9    79.6
false positives  39      64     23.1    15.6   15.4    1.6    25.6    17.2     79.5    64.1
false negatives  63      64     19.0    7.8    0.0     0.0    6.3     3.1      41.3    28.1
Causality vs. Correlation
To get more insight into causality vs. correlation, we also looked at the correlation between the presence of different artifacts. We computed this by looking at how often pairs of artifact classes occur in the same image. This allowed us to better understand whether an artifact perhaps only affects performance because it correlates with the presence of another artifact, which actually causes the effect. Table 12 shows that there is no meaningful correlation between any of the artifacts. Only misc. artifacts are slightly more correlated to others, which may be explained by the fact that they are loosely defined and their distribution may overlap with other classes. Another way to better understand causal effects is to look at the effect on performance given the location of an artifact with respect to the polyp. For instance, if we are able to concretely show that bubbles are likely to be misclassified as polyps, we make a stronger case for bubbles actually causing a drop in performance (compared to, e.g., simply being correlated with pictures where the polyp is very hard to detect). Nonetheless, further work is required to fully understand the causal effects of artifacts on the polyp detector (see [Castro et al., 2019]).
Artifacts and their location
By assessing the correlation between artifacts and polyp detections and polyp ground truths, Table 2 gives a more in-depth view of how artifacts affect polyp detection. We observe that detections which are false positives more frequently overlap with artifacts (23.8%) compared to actual ground-truth polyps (17.8%), suggesting artifacts are frequently misclassified as polyps. Amongst others, these results seem to confirm our previous hypothesis that bubbles can be misclassified as polyps: indeed, bubbles are twice as likely to overlap false positive polyp detections than ground-truth polyps. Albeit less intuitive, similar observations can be made for blur and contrast. At the same time, polyps that were missed by our polyp detector are less likely to overlap with an artifact than polyps overall (13.7% vs. 17.8%). This suggests that polyps that overlap with an artifact (or look like an artifact according to our artifact detector) are less likely to be missed. Saturated regions overlap with 4.5% of ground-truth polyps and with 1.3% of false positives. This indicates that well-lit (i.e., saturated) regions are less prone to being misclassified as polyps. While Table 3 shows that false positive polyp detections contain artifacts inside of them more often than ground-truth polyps do (92.0% vs. 88.5%), this is even more the case for true positives (94.0%). In addition, missed polyps (false negatives) contain artifacts less frequently (79.9%) than the ground truth. This suggests that artifacts inside of a polyp region make polyps easier to detect and improve both precision and recall. This is especially true for specularity and saturation. For bubbles, blur, and contrast, we confirm previous results indicating that they lead to more false positives.

In conclusion, Table 1 gave us an initial idea of how our polyp detection performance is affected by artifact presence. It already indicated that artifacts, depending on their class, can either benefit or harm polyp detection. Tables 2 and 3 gave us a more in-depth understanding of how different artifacts, at different locations with respect to the polyps, affect the polyp detector. Bubbles, blur, and contrast were found to lead to poorer performance. In contrast, saturation and specularity improved performance. Either way, this suggests that incorporating artifact knowledge into a polyp detection model could be beneficial. The hypothesis is that, by being able to learn artifact representations, a model can learn to distinguish them from polyps and better leverage them to classify bounding boxes correctly.

Table 11: Polyp detection performance on three different datasets: ETIS-Larib (ETIS), our in-house dataset (in-house), and Kvasir-SEG (Kvasir). We test the best performing model of each of the experiments in section 4.5. Our models were all trained on CVC-ClinicDB. Where available, we compare our results to existing state-of-the-art methods that used the same evaluation framework. Note that the aim of this work is not to surpass the state-of-the-art. Only F1 scores are given.

Method                            ETIS    in-house   Kvasir
State-of-the-art
- [Bernal et al., 2017]           0.708   –          –
- [Bernal et al., 2017]           0.662   –          –
- [Shin et al., 2018]             –       –          –
Ours
- baseline                        –       –          –
- Transfer learning (4.5.1)       0.621   0.747      0.816
- Weighted multi-class (4.5.3)    0.637   0.604      0.742
- LwF (4.5.4)                     0.618   0.650      0.766
- LwF* (4.5.5)                    0.642   0.695      0.846

Table 12: The correlation between the presence of different artifacts on the in-house dataset.

              bubbles   blur    misc.   specularity   saturation   contrast
bubbles        1.00     -0.12   -0.14    0.00         -0.07        -0.05
blur          -0.12      1.00    0.15   -0.02          0.04        -0.11
misc.         -0.14      0.15    1.00   -0.01          0.09         0.05
specularity    0.00     -0.02   -0.01    1.00          0.00        -0.01
saturation    -0.07      0.04    0.09    0.00          1.00         0.06
contrast      -0.05     -0.11    0.05   -0.01          0.06         1.00
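Correlations of this kind can be obtained from binary per-image presence indicators (at the area thresholds of section 4.4.1); a minimal sketch, assuming a boolean presence matrix with names of our own choosing:

```python
import numpy as np

def artifact_correlations(presence):
    """Pearson correlations between binary artifact-presence indicators.

    presence: (n_images, n_classes) boolean array, True where the class is
    present in the image at its area threshold from section 4.4.1.
    """
    # np.corrcoef treats rows as variables, hence the transpose.
    return np.corrcoef(presence.astype(float).T)
```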
5.2 Incorporating Artifact Knowledge

Artifact annotator trustworthiness
For training our multi-task approaches, we added artifact annotations to the CVC-ClinicDB dataset using an artifact detection model that was trained on the EAD2019 dataset. To verify that these artifact annotations are not entirely corrupt and thereby detrimental to our efforts, we randomly selected 60 images (i.e., around 10% of the total) and manually inspected their artifact annotations. We inspected artifacts at a threshold of 0.25, which yielded 1,422 annotations. While this did not give us insights on recall, we found a precision of 82.4%. This let us conclude that the artifact detector performs sufficiently well on CVC-ClinicDB and can be used for our multi-task approaches.
Overall performance
We tested different multi-task learning approaches to incorporate artifact knowledge, ranging from very simple to more complex, and compared them to a baseline that was trained only on a polyp dataset. The first experiment (Table 4) shows that simply using the artifact dataset for pre-training yielded unsatisfactory results, with performance not surpassing pre-training on natural image datasets. Table 5 shows that modeling the problem as a simple multi-class problem, where the six artifacts and polyps constitute the seven classes, did not improve polyp detection performance either. Indeed, there seems to be an almost linear negative correlation between the confidence threshold of artifacts in the training data and the final polyp detection performance. In the original EAD artifact dataset, there are around 8.3 artifact annotations per image. If we suppose a similar distribution in the CVC-ClinicDB set, then an artifact threshold of 0.4 (which yields 7.7 artifacts per image) seems to be closest to reality. However, at this threshold, our model performed very poorly (F1-score of 0.31), suggesting this method is inadequate. Even when we tried to improve the method by weighting the polyp class disproportionately (Table 6), performance increased but remained underwhelming. A possible explanation is that having eight different classes for a single classification subnetwork complicates the optimization. To address this, we built our double classification subnetwork MTL model, for which the results are given in Table 7. On an NVIDIA TITAN Xp with 12 GB of RAM, this model takes around 0.06 seconds per image during inference. At an artifact threshold of 0.2, performance is poor regardless of the weighting configuration, suggesting that these artifacts are simply too noisy and too numerous. At a threshold of 0.5, we obtained decent performance even without weighting. However, despite improving over the multi-class approach, the model still underperformed the baseline. The discussion in section 5.1 has shown that not all artifacts have an equal effect on polyp detection performance. We thus tried incorporating knowledge of only some artifacts in the MTL model. Table 8 shows that this led to performance at par with, or even slightly better than, the polyp-only model. Nonetheless, performance is not significantly better, and we have failed to show that incorporating artifact knowledge can indeed improve polyp detection rates.
Artifact robustness
To better understand how the model behaviour changed after incorporating artifact knowledge, we re-ran some of the experiments from section 4.4.2 with our MTL model from Table 8 that included four artifact classes. Table 9 shows that, e.g., for blur we drastically reduced the number of blur artifacts that are misclassified as polyps. Table 10 also shows that for all four artifacts the MTL model has a lower share of false positives containing them, which suggests the model has been able to make use of artifact-related features to reduce the number of times it misclassifies background regions containing these artifacts as polyps. In addition, bubbles and specularity are contained less frequently in false negatives for the MTL model than for the polyp-only model (19.0% vs. 7.8% and 41.3% vs. 28.1%), which indicates that learning the features of these artifacts has helped the algorithm detect more polyps that are covered by them.

Our multi-task approaches have not yet made a convincing case for using artifact knowledge in polyp detection. However, the experiments in Tables 9 and 10 show encouraging signs and warrant further research in this direction. Our multi-task approaches have been fairly naive, and more sophisticated approaches might be able to reap the benefits of artifact knowledge. Besides incorporating artifact knowledge in a model by making it learn the artifact representation, other approaches could be considered, such as combining artifact and polyp detections to yield an uncertainty score for the model's polyp detections.
6 Conclusion

This work has contributed to a more thorough understanding of how endoscopic artifacts affect deep learning based polyp detection models. This was achieved by not only looking at artifact labels on an image level, as was done by [Bernal et al., 2017], but by having bounding box annotations, which specify the exact location of artifacts. The analysis has shown that some artifacts, such as bubbles or blur, may deteriorate polyp detection, while others, such as specularity, can actually lead to better detection capabilities. We built upon this knowledge to extend a simple baseline polyp detector by making it simultaneously learn representations of artifacts and polyps. While we have not yet established significant improvements in performance by using this method, it showed promising results, as it managed to overcome some of the artifact-related challenges that automated polyp detectors face. Future work can build upon our analysis and multi-task approaches to ultimately improve polyp detection capabilities.
Acknowledgments
R. D. S. is supported by Consejo Nacional de Ciencia y Tecnología (CONACYT), Mexico. S. A. is supported by the PRIME programme of the German Academic Exchange Service (DAAD) with funds from the German Federal Ministry of Education and Research (BMBF).
References [Akbari et al., 2018] Akbari, M., Mohrekesh, M.,Najariani, K., Karimi, N., Samavi, S., andSoroushmehr, S. R. (2018). Adaptive specular re-flection detection and inpainting in colonoscopyvideo frames. In , pages3134–3138. IEEE.[Alexandre et al., 2008] Alexandre, L. A., Nobre, N.,and Casteleiro, J. (2008). Color and position versustexture features for endoscopic polyp detection. In , volume 2, pages 38–42.IEEE.[Ali et al., 2019a] Ali, S., Zhou, F., Bailey, A.,Braden, B., East, J., Lu, X., and Rittscher, J.(2019a). A deep learning framework for quality as-sessment and restoration in video endoscopy. arXivpreprint arXiv:1904.07073 .[Ali et al., 2019b] Ali, S., Zhou, F., Daul, C.,Braden, B., Bailey, A., Realdon, S., East, J.,Wagni`eres, G., Loschenov, V., Grisan, E., Blondel,W., and Rittscher, J. (2019b). Endoscopy artifactdetection (EAD 2019) challenge dataset.
CoRR ,abs/1905.03209.[Ameling et al., 2009] Ameling, S., Wirth, S.,Paulus, D., Lacey, G., and Vilarino, F. (2009).Texture-based polyp detection in colonoscopy.In
Bildverarbeitung f¨ur die Medizin 2009 , pages346–350. Springer.[Angermann et al., 2017] Angermann, Q., Bernal,J., S´anchez-Montes, C., Hammami, M.,Fern´andez-Esparrach, G., Dray, X., Romain,O., S´anchez, F. J., and Histace, A. (2017). To-wards real-time polyp detection in colonoscopyvideos: Adapting still frame-based methodologiesfor video sequences analysis. In
Computer Assistedand Robotic Endoscopy and Clinical Image-BasedProcedures , pages 29–41. Springer.[Bae and Yoon, 2015] Bae, S.-H. and Yoon, K.-J.(2015). Polyp detection via imbalanced learning16nd discriminative feature learning.
IEEE trans-actions on medical imaging , 34(11):2379–2393.[Bernal et al., 2015] Bernal, J., S´anchez, F. J.,Fern´andez-Esparrach, G., Gil, D., Rodr´ıguez, C.,and Vilari˜no, F. (2015). Wm-dova maps for accu-rate polyp highlighting in colonoscopy: Validationvs. saliency maps from physicians.
ComputerizedMedical Imaging and Graphics , 43:99 – 111.[Bernal et al., 2012] Bernal, J., S´anchez, J., and no,F. V. (2012). Towards automatic polyp detectionwith a polyp appearance model. In
Pattern Recog-nition .[Bernal et al., 2017] Bernal, J., Tajkbaksh, N.,S´anchez, F. J., Matuszewski, B. J., Chen, H., Yu,L., Angermann, Q., Romain, O., Rustad, B., Bal-asingham, I., et al. (2017). Comparative validationof polyp detection methods in video colonoscopy:results from the miccai 2015 endoscopic visionchallenge.
IEEE transactions on medical imaging ,36(6):1231–1249.[Brandao et al., 2017] Brandao, P., Mazomenos, E.,Ciuti, G., Cali`o, R., Bianchi, F., Menciassi, A.,Dario, P., Koulaouzidis, A., Arezzo, A., and Stoy-anov, D. (2017). Fully convolutional neural net-works for polyp segmentation in colonoscopy. In
Medical Imaging 2017: Computer-Aided Diagno-sis , volume 10134, page 101340F. International So-ciety for Optics and Photonics.[Brandao et al., 2018] Brandao, P., Zisimopoulos,O., Mazomenos, E., Ciuti, G., Bernal, J.,Visentini-Scarzanella, M., Menciassi, A., Dario,P., Koulaouzidis, A., Arezzo, A., et al. (2018).Towards a computed-aided diagnosis system incolonoscopy: automatic polyp segmentation usingconvolution neural networks.
Journal of MedicalRobotics Research , 3(02):1840002.[Caruana, 1997] Caruana, R. (1997). Multitasklearning.
Machine learning , 28(1):41–75.[Castro et al., 2019] Castro, D. C., Walker, I., andGlocker, B. (2019). Causality matters in medicalimaging. arXiv preprint arXiv:1912.08142 . [Fern´andez-Esparrach et al., 2016] Fern´andez-Esparrach, G., Bernal, J., L´opez-Cer´on, M.,C´ordova, H., S´anchez-Montes, C., Rodr´ıguez deMiguel, C., and S´anchez, F. J. (2016). Exploringthe clinical potential of an automatic colonicpolyp detection method based on the creation ofenergy maps.
[Funke et al., 2018] Funke, I., Bodenstedt, S., Riediger, C., Weitz, J., and Speidel, S. (2018). Generative adversarial networks for specular highlight removal in endoscopic images. In Medical Imaging 2018: Image-Guided Procedures, Robotic Interventions, and Modeling, volume 10576, page 1057604. International Society for Optics and Photonics.
[Ganz et al., 2012] Ganz, M., Yang, X., and Slabaugh, G. (2012). Automatic segmentation of polyps in colonoscopic narrow-band imaging data. IEEE Transactions on Biomedical Engineering, 59(8):2144–2151.
[Girshick, 2015] Girshick, R. (2015). Fast R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 1440–1448.
[Girshick et al., 2014] Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 580–587.
[Gross et al., 2009] Gross, S., Stehle, T., Behrens, A., Auer, R., Aach, T., Winograd, R., Trautwein, C., and Tischendorf, J. (2009). A comparison of blood vessel features and local binary patterns for colorectal polyp classification. In Medical Imaging 2009: Computer-Aided Diagnosis, volume 7260, page 72602Q. International Society for Optics and Photonics.
[He et al., 2017] He, K., Gkioxari, G., Dollár, P., and Girshick, R. (2017). Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, pages 2961–2969.
[He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
[Hinton et al., 2015] Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531.
[Hwang et al., 2007] Hwang, S., Oh, J., Tavanapong, W., Wong, J., and De Groen, P. C. (2007). Polyp detection in colonoscopy video using elliptical shape feature. In , volume 2, pages II–465. IEEE.
[Iakovidis et al., 2005] Iakovidis, D. K., Maroulis, D. E., Karkanis, S. A., and Brokos, A. (2005). A comparative study of texture features for the discrimination of gastric polyps in endoscopic video. In , pages 575–580. IEEE.
[Jha et al., 2020] Jha, D., Smedsrud, P. H., Riegler, M. A., Halvorsen, P., de Lange, T., Johansen, D., and Johansen, H. D. (2020). Kvasir-SEG: A segmented polyp dataset. In International Conference on Multimedia Modeling, pages 451–462. Springer.
[Jia et al., 2019] Jia, X., Mai, X., Cui, Y., Yuan, Y., Xing, X., Seo, H., Xing, L., and Meng, M. Q.-H. (2019). Automatic polyp recognition in colonoscopy images using deep learning and two-stage pyramidal feature prediction. IEEE Transactions on Automation Science and Engineering.
[Kang and Gwak, 2019] Kang, J. and Gwak, J. (2019). Ensemble of instance segmentation models for polyp segmentation in colonoscopy images. IEEE Access.
[Karkanis et al., 2003] Karkanis, S. A., Iakovidis, D. K., Maroulis, D. E., Karras, D. A., and Tzivras, M. (2003). Computer-aided tumor detection in endoscopic video using color wavelet features. IEEE Transactions on Information Technology in Biomedicine, 7(3):141–152.
[Kayser et al.] Kayser, M., Soberanis-Mukul, R. D., Albarqouni, S., and Navab, N. Focal loss for artefact detection in medical endoscopy.
[Kingma and Ba, 2014] Kingma, D. P. and Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
[Li and Hoiem, 2018] Li, Z. and Hoiem, D. (2018). Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947.
[Lin et al., 2017a] Lin, T.-Y., Dollár, P., Girshick, R., He, K., Hariharan, B., and Belongie, S. (2017a). Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2117–2125.
[Lin et al., 2017b] Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017b). Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 2980–2988.
[Liu et al., 2011] Liu, H., Lu, W.-S., and Meng, M. Q.-H. (2011). De-blurring wireless capsule endoscopy images by total variation minimization. In Proceedings of 2011 IEEE Pacific Rim Conference on Communications, Computers and Signal Processing, pages 102–106. IEEE.
[Mo et al., 2018] Mo, X., Tao, K., Wang, Q., and Wang, G. (2018). An efficient approach for polyps detection in endoscopic videos based on Faster R-CNN. In .
[Mohammed et al., 2018] Mohammed, A., Yildirim, S., Farup, I., Pedersen, M., and Hovde, Ø. (2018). Y-Net: A deep convolutional neural network for polyp detection. arXiv preprint arXiv:1806.01907.
[Ren et al., 2015] Ren, S., He, K., Girshick, R., and Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems, pages 91–99.
[Ševo et al., 2016] Ševo, I., Avramović, A., Balasingham, I., Elle, O. J., Bergsland, J., and Aabakken, L. (2016). Edge density based automatic detection of inflammation in colonoscopy videos. Computers in Biology and Medicine, 72:138–150.
[Shin et al., 2018] Shin, Y., Qadir, H. A., Aabakken, L., Bergsland, J., and Balasingham, I. (2018). Automatic colon polyp detection using region based deep CNN and post learning approaches. IEEE Access, 6:40950–40962.
[Siegel et al., 2018] Siegel, R. L., Miller, K. D., and Jemal, A. (2018). Cancer statistics, 2018. CA: A Cancer Journal for Clinicians, 68(1):7–30.
[Silva et al., 2014] Silva, J., Histace, A., Romain, O., Dray, X., and Granado, B. (2014). Toward embedded detection of polyps in WCE images for early diagnosis of colorectal cancer. International Journal of Computer Assisted Radiology and Surgery, 9(2):283–293.
[Stehle, 2006] Stehle, T. (2006). Removal of specular reflections in endoscopic images. Acta Polytechnica, 46(4).
[Tajbakhsh et al., 2015] Tajbakhsh, N., Gurudu, S. R., and Liang, J. (2015). Automated polyp detection in colonoscopy videos using shape and context information. IEEE Transactions on Medical Imaging, 35(2):630–644.
[Tchoulack et al., 2008] Tchoulack, S., Langlois, J. P., and Cheriet, F. (2008). A video stream processor for real-time detection and correction of specular reflections in endoscopic images. In , pages 49–52. IEEE.
[Vázquez et al., 2017] Vázquez, D., Bernal, J., Sánchez, F. J., Fernández-Esparrach, G., López, A. M., Romero, A., Drozdzal, M., and Courville, A. (2017). A benchmark for endoluminal scene segmentation of colonoscopy images. Journal of Healthcare Engineering, 2017.
[Vázquez et al., 2016] Vázquez, D., Bernal, J., Sánchez, F. J., Fernández-Esparrach, G., López, A. M., Romero, A., Drozdzal, M., and Courville, A. C. (2016). A benchmark for endoluminal scene segmentation of colonoscopy images. CoRR, abs/1612.00799.
[Wang et al., 2018] Wang, P., Xiao, X., Brown, J. R. G., Berzin, T. M., Tu, M., Xiong, F., Hu, X., Liu, P., Song, Y., Zhang, D., et al. (2018). Development and validation of a deep-learning algorithm for the detection of polyps during colonoscopy. Nature Biomedical Engineering, 2(10):741.
[Yu et al., 2016] Yu, L., Chen, H., Dou, Q., Qin, J., and Heng, P. A. (2016). Integrating online and offline three-dimensional deep learning for automated polyp detection in colonoscopy videos. IEEE Journal of Biomedical and Health Informatics, 21(1):65–75.
[Zhou et al., 2019] Zhou, Y., He, X., Huang, L., Liu, L., Zhu, F., Cui, S., and Shao, L. (2019). Collaborative learning of semi-supervised segmentation and classification for medical images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
[Zhu et al., 2015] Zhu, R., Zhang, R., and Xue, D. (2015). Lesion detection of endoscopy images based on convolutional neural network features. In 2015 8th International Congress on Image and Signal Processing (CISP).