Intrapapillary Capillary Loop Classification in Magnification Endoscopy: Open Dataset and Baseline Methodology
Luis C. Garcia-Peraza-Herrera, Martin Everson, Laurence Lovat, Hsiu-Po Wang, Wen Lun Wang, Rehan Haidry, Danail Stoyanov, Sebastien Ourselin, Tom Vercauteren
Int. J. Comput. Assist. Radiol. Surg. manuscript No. (will be inserted by the editor)

Abstract

Purpose
Early squamous cell neoplasia (ESCN) in the oesophagus is a highly treatable condition. Lesions confined to the mucosal layer can be curatively treated endoscopically. We build a computer-assisted detection (CADe) system that can classify still images or video frames as normal or abnormal with high diagnostic accuracy.
Methods
We present a new benchmark dataset containing 68K binary-labelled frames extracted from 114 patient videos whose imaged areas have been resected and correlated to histopathology. Our novel convolutional neural network (CNN) architecture solves the binary classification task and explains what features of the input domain drive the decision-making process of the network.
Results
The proposed method achieved an average accuracy of 91%.

This work was supported through an Innovative Engineering for Health award by Wellcome Trust [WT101957]; Engineering and Physical Sciences Research Council (EPSRC) [NS/A00027/1] and a Wellcome / EPSRC Centre award [203145Z/16/Z & NS/A000050/1].

L. C. Garcia-Peraza-Herrera
Department of Medical Physics and Biomedical Engineering, UCL, London, UK
E-mail: [email protected]
M. Everson, L. Lovat, R. Haidry
Division of Surgery & Interventional Science, UCL, Department of Gastroenterology; University College Hospital NHS Foundation Trust, London, UK
H. Wang
Department of Internal Medicine, National Taiwan University, Taipei, Taiwan
W. L. Wang
Department of Internal Medicine, E-Da Hospital/I-Shou University, Kaohsiung, Taiwan
D. Stoyanov
Wellcome / EPSRC Centre for Interventional and Surgical Sciences, UCL, London, UK
L. C. Garcia-Peraza-Herrera, S. Ourselin, T. Vercauteren
School of Biomedical Engineering & Imaging Science, KCL, UK
Figure 1
Magnifying endoscopy (ME) frames extracted from videos of patients with different histopathology. Normal patients typically present a clear deep submucosal vasculature; large green-like vessels such as the one highlighted within the dashed yellow line are usually visible. Intrapapillary capillary loops (IPCL) refer to the microvasculature (pointed at by the arrows). Healthy patients tend to present thinner (yellow arrows) and less tangled IPCL patterns than those with abnormal tissue (blue arrows).
Conclusion
We believe that this dataset and baseline method may serve as a reference for future benchmarks on both video frame classification and explainability in the context of ESCN detection. A future work path of high clinical relevance is the extension of the classification to ESCN types.
Keywords
Early Squamous Cell Neoplasia (ESCN), Intrapapillary CapillaryLoop (IPCL), Class Activation Map (CAM)
1 Introduction

Oesophageal cancer is the sixth most common cause of cancer deaths worldwide [16] and a burgeoning health issue in developing nations, from Africa along a 'cancer belt' to China. The current gold standard to investigate oesophageal cancer is gastroscopy with biopsies for histological analysis. Early squamous cell neoplasia (ESCN) is a highly treatable type of oesophageal cancer, with recent advances in endoscopic therapy meaning that lesions confined to the mucosal layer can be curatively resected endoscopically with a low risk of lymph node metastasis. A computer-assisted detection (CADe) system that can classify still images or video frames as normal or abnormal with high diagnostic accuracy could provide a useful adjunct to both expert and inexpert endoscopists.

1.1 Contributions

We focus on the problem of classifying video frames as normal/abnormal. These frames are extracted from the magnification endoscopy (ME) recording of a patient. To the best of our knowledge, we introduce the first IPCL normal/abnormal open dataset containing ME video sequences correlated with histopathology. Our dataset contains 68K video frames from 114 patients.

For a small and representative sample of 158 frames (IPCL types A, B1, B2, B3) we ask 12 senior clinicians to label them as normal/abnormal and report the inter-rater agreement as Krippendorff's α coefficient [8], achieving 76%. Compared to gold standard histopathology results, the clinicians achieve an average accuracy of 94%.

Our proposed network architecture is able to explain where it is looking at prior to the generation of a class prediction. Looking at the activation maps for the abnormal class, we have observed that the network is looking at IPCL patterns when predicting abnormality. No conclusive evidence has been found that it is paying attention to large deep submucosal vessels to detect normal tissue.
We believe that this baseline method may serve as a reference for future benchmarks on both video frame classification and explainability in the context of ESCN detection.
2 Related work

Computer-aided endoscopic detection and diagnosis could offer an adjunct in the endoscopic assessment of ESCN lesions; there has been a high level of interest in recent years in developing clinically interpretable models. The use of CNNs has shown potential across several medical specialties. In gastroenterology, considerable efforts have been devoted to the detection of malignant colorectal polyps [15,5,14] and upper gastrointestinal cancer [9]. However, its utility in the endoscopic diagnosis of early oesophageal neoplasia remains in its infancy [2].

Guo et al. [4] propose a CNN that can classify images as dysplastic or non-dysplastic. Using a dataset of 6671 images, they demonstrate a per-frame sensitivity of 98% for the detection of ESCN. Using a dataset of 20 videos, they demonstrate a per-frame sensitivity of 96%. Although the results are encouraging, the size of the patient sample is limited. Given the black-box nature of CNNs, this may represent a matter of concern with regards to generalization capability. Zhao et al. [17] have also reported a CNN for the classification of IPCL patterns in order to identify ESCN. Using 1383 images, although heavily skewed towards Type B1 IPCLs, they demonstrated overall accuracies of 87% for the classification of IPCL patterns. In this study, however, the authors excluded type B3 IPCLs from the training and testing phase. The CNN also demonstrated only a 71% classification rate for normal IPCLs, indicating that it over-diagnoses normal tissue as containing type B1 IPCLs, and so representing dysplastic tissue.

Dataset: https://github.com/luiscarlosgph/ipcl
3 Dataset

This dataset will be made publicly available online upon publication and can thus serve as a benchmark for future work on detection of ESCN based on magnification endoscopy images.

3.1 Patient recruitment, endoscopic procedures and video acquisition

Patients attending for endoscopic assessment to two early squamous cell neoplasia (ESCN) referral centres in Taiwan (National Taiwan University Hospital and E-Da Hospital) were recruited with consent. Patients with oesophageal ulceration, active oesophageal bleeding or Barrett's oesophagus were excluded. Gastroscopies were performed by two expert endoscopists (WLW, HPW), either under conscious sedation or local anaesthesia. An expert endoscopist was defined as a consultant gastroenterologist performing > 50 ESCN assessments per year. All endoscopies were performed using an HD ME-NBI GIF-H260Z endoscope with an Olympus Lucera CV-290 processor (Olympus, Tokyo, Japan). A solution of water and simethicone was applied via the endoscope working channel to the oesophageal mucosa in order to remove mucus, food residue or blood. This allowed good visualization of the oesophageal mucosa and microvasculature, including IPCLs.

3.2 Correlating imaged areas with histology

Initially, a macroscopic assessment was made of the suspected lesion in an overview, with the borders of the lesion delineated by the endoscopist. The endoscopist then identified areas within the borders of the lesion on which to undertake magnification endoscopy. The IPCL patterns were interrogated using magnification endoscopy in combination with narrow-band imaging (ME-NBI). Magnification endoscopy was performed on areas of interest, and each imaged area was correlated to the worst-case histology for the whole lesion. The entire lesion was then resected by either endoscopic mucosal resection (EMR) or endoscopic submucosal dissection (ESD). Resected specimens were formalin-fixed and assessed by an expert gastrointestinal histopathologist. As is the gold standard, the worst-case histology was reported for the lesion as a whole, based on pathological changes seen within the resected specimen. Similarly to abnormal lesion areas, type A recordings (normal, healthy patients) were obtained by visual identification of healthy areas, magnification endoscopy, visual confirmation of normal vasculature and IPCL patterns, and biopsy to confirm the assessment.

3.3 Dataset description

Our IPCL dataset comprises a total of 114 patients (45 normal, 69 abnormal). Every patient has a ME-NBI video (30 fps) recorded following the protocol in Section 3.2. Raw videos can present some parts where NBI is active. In this dataset, only ME subsequences are considered.
All frames are extracted and assigned to the class normal or abnormal depending on the histopathology of the patient. They are quality controlled one-by-one (running twice over all the frames) by a senior clinician with experience in the endoscopic imaging of oesophageal cancer. Frames that are highly degraded due to lighting artifacts (e.g. blur, flares and reflections), up to the point where it is not possible (for the senior clinician) to make a visual judgement of whether they are normal or abnormal, are marked as uninformative and not used. This curation process results in a dataset of 67742 annotated frames (28078 normal, 39662 abnormal) with an average of 593 frames per patient. For each fold, patients (not frames) are randomly split into 80% training, 10% validation (used for hyperparameter tuning), and 10% testing (used for evaluation). The statistics of each individual fold are presented in the supplementary material.

3.4 Evaluation per patient clip

Let $\{\hat{y}_{f,p}\}_{f=1}^{F_p}$ be the set of estimated probabilities for the frames $f$ (out of $F_p$) belonging to patient clip $p$. Then, the estimated probability of abnormality for $p$ is computed as an average of frame probabilities:

$$P\left(X = \text{abnormal} \,\middle|\, \{\hat{y}_{f,p}\}_{f=1}^{F_p}\right) = \frac{1}{F_p} \sum_{f=1}^{F_p} \hat{y}_{f,p} \quad (1)$$

Similarly to frame predictions, a threshold ($p = 0.5$) is applied to obtain a class label for $p$. As per our data collection protocol (see Section 3.2), magnification endoscopy clips contain either normal or abnormal tissue. Hence, a correlation between $P(X = \text{abnormal} \mid \{\hat{y}_{f,p}\}_{f=1}^{F_p})$ and histopathology is expected. The analysis of clip classification errors facilitates the identification of worst cases, singling out patient-wide mistakes from negligible frame prediction errors.

4 Baseline method

In this section, we propose a reference method for IPCL binary classification with a particular focus on explainability that may serve as a baseline for future benchmarks. As is common in data-driven classification, we aim to solve for a mapping $f$ such that $f_\theta(x) \approx y$, where $x$ is an input image, $y$ the class label corresponding to $x$, and $\theta$ a vector of parameters. All the input images were preprocessed by downscaling them to a width of 256 pixels (height automatically computed from their original aspect ratio) so that we could fit a large batch of images into the GPU.
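The clip-level decision rule of Section 3.4, eq. (1), reduces to a mean over frame probabilities followed by a threshold. A minimal sketch (the frame probabilities below are hypothetical, and the 0.5 threshold is the one stated in the text):

```python
import numpy as np

def clip_probability(frame_probs):
    """Eq. (1): average the per-frame abnormality probabilities of one clip."""
    return float(np.mean(frame_probs))

def clip_label(frame_probs, threshold=0.5):
    """Binary clip label: 1 (abnormal) if the mean probability reaches the threshold."""
    return int(clip_probability(frame_probs) >= threshold)

# Hypothetical clip with F_p = 5 frame predictions
probs = [0.9, 0.8, 0.4, 0.7, 0.95]
print(clip_probability(probs))  # 0.75
print(clip_label(probs))        # 1 (abnormal)
```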
Figure 2
Proposed model ResNet-18-CAM-DS with embedded positive class activation maps at all resolutions. GAP stands for global average pooling; each resolution contributes a side loss as in (11), and the fused prediction is supervised with (10). Abnormal and normal CAMs are produced at every resolution.

To account for changes in viewpoint due to endoscope motion, random ($p = 0.5$) on-the-fly flips are applied to each image. Our baseline model is ResNet-18 [6]. The batch normalization moving average fraction is set to 0.7, and the batch size and momentum are set to 256 and 0.9 respectively, along with weight decay. The learning rate $\lambda$ is decayed (after $\approx 40$ epochs) by a constant factor, and training runs for $\approx 200$ epochs on an NVIDIA GeForce TITAN X Pascal GPU using Caffe.

The baseline network can be written as $f_\theta = r(h(g(T_x)))$, where $T_x = T_\theta(x) \in \mathbb{R}^{H \times W \times K}$ is the feature tensor obtained after processing $x$ at the deepest pipeline resolution, $K$ represents the number of feature channels, $T_x(k)$ is a matrix that represents the feature channel with index $k$, and $g$, $h$, and $r$ represent the global average pooling (GAP), fully connected (FC), and final scoring convolution layers respectively.

The FC layer $h$ represents a challenge for explainability, as relevance is redistributed when gradients flow backwards, losing its spatial connection to the prediction being made [11]. Hence, inspired by [18], we stripped out the fully connected layer of 1000 neurons from the baseline model (ResNet-18), connecting the output of the GAP directly to the neurons that predict the class score (those in layer $r$) and setting their bias to zero. Formally, this leads to $f_\theta = r(g(T_x))$, the output of the network before softmax being

$$\hat{y}^{(c)} = \sum_{k=1}^{K} w_{k,c} \underbrace{\frac{1}{HW}\sum_{i,j}}_{\text{GAP}} \underbrace{T_x(k)_{i,j}}_{\text{feature tensor}} \quad (2)$$

where $w_{k,c} \in \hat{\theta}$, and $\hat{y}^{(c)}$ is the score predicted for class $c$. Following this approach, a heatmap per class can be generated obviating the GAP layer during inference, simply computing

$$\hat{y}^{(c)}_{\text{CAM}} = \sum_{k=1}^{K} w_{k,c}\, T_x(k) \quad (3)$$

These heatmaps, called Class Activation Maps (CAMs) [18], keep a direct spatial relationship to the input, which is relevant for visual explanations. Although the architecture proposed in [18] requires removing the GAP layer to produce the CAMs, (2) can be reformulated as

$$\hat{y}^{(c)} = \underbrace{\frac{1}{HW}\sum_{i,j}}_{\text{GAP}} \underbrace{\left[\sum_{k=1}^{K} w_{k,c}\, T_x(k)\right]_{i,j}}_{\text{CAM}} \quad (4)$$

in which case the CAMs are embedded within the network pipeline as a $1 \times 1$ convolution, leading to $f_\theta = g(r(T_x))$. We refer to this architecture as ResNet-18-CAM (trained with the same learning rate as the baseline).
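The equivalence between (2) and (4) — GAP followed by the linear scoring layer versus the CAM (a 1×1 convolution) followed by GAP — can be checked numerically. A sketch with a random feature tensor (the shapes are illustrative, not the network's actual ones):

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, K, C = 8, 8, 512, 2          # spatial size, feature channels, classes
T = rng.normal(size=(H, W, K))      # feature tensor T_x
w = rng.normal(size=(K, C))         # weights w_{k,c} of the scoring layer

# Eq. (2): GAP over the spatial dimensions, then the linear scoring layer
gap = T.mean(axis=(0, 1))           # shape (K,)
scores_eq2 = gap @ w                # shape (C,)

# Eqs. (3)-(4): CAM as a 1x1 convolution over channels, then GAP of the CAM
cam = np.tensordot(T, w, axes=([2], [0]))   # shape (H, W, C), one heatmap per class
scores_eq4 = cam.mean(axis=(0, 1))          # shape (C,)

assert np.allclose(scores_eq2, scores_eq4)  # (2) and (4) give identical scores

# Eq. (8): positive part of the CAM, used only for visualisation
cam_pos = np.maximum(cam, 0.0)
```

Because both operations are linear, swapping the order of pooling and weighting leaves the class scores unchanged, which is what lets the CAM live inside the prediction pipeline.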
High-resolution CAMs are desirable to explain what the network is looking at, as abnormal microvasculature in an endoscopic image is not localized only in a single spot. In our clinical problem, two types of features could be expected to be highlighted, submucosal vessels and IPCLs, which represent endoscopic markers for ESCN [7,13,12].

Figure 3
Representative images from the testing set of fold 1 (left). Highest resolution CAM generated by ResNet-18-CAM-DS for the abnormal class, i.e. $\hat{y}^{(c)}_{\text{CAM}_t}$ with $c = 1$ (center; better viewed in the digital version). Class Activation Maps generated by ResNet-18-CAM [18] (right). In contrast to traditional CAMs generated by ResNet-18-CAM (right), ours (center) suggest that our network is looking at IPCLs to predict abnormality.

The procedure to generate the CAM proposed in [18] employs the deepest feature maps as inputs to produce the attention heatmaps. For our input images of 256×256 pixels, these feature maps have a resolution of 8×8, which is not enough to clearly highlight what is being looked at to predict abnormality. A trivial solution would be to reduce the depth of the network, but this could potentially hamper the learning of abstract features and decrease performance. In addition, the optimal number of resolution levels for the given task to balance accuracy and explainability is a hyperparameter that would need to be tuned. Instead, we propose an alternative path, modelling $f_\theta(x)$ as

$$f_\theta(x) = (f_{\theta_t} \circ f_{\theta_{t-1}} \circ \cdots \circ f_{\theta_1} \circ f_{\theta_0})(x) \quad (5)$$

where $f_{\theta_t}$ represents the function that processes the input at resolution $t$, and whose output tensor has a width and height downsampled (strided convolution) by a factor of 0.5. For an input image $x$ of size 256×256 pixels and $t = 5$, the output of $f_{\theta_5}$ is 8×8.

Given (5), let $T_{x,t}$ be the output tensor produced by $f_{\theta_t}$. Then, similarly to (4), we propose to generate a class score prediction at each resolution $t$ following

$$\hat{y}^{(c)}_t = \underbrace{\frac{1}{HW}\sum_{i,j}}_{\text{GAP}} \underbrace{\left[\sum_{k=1}^{K} w_{k,c}\, T_{x,t}(k)\right]_{i,j}}_{\text{CAM at resolution } t} \quad (6)$$

and final class scores are obtained as a sum over scores at different resolutions:

$$\hat{y}^{(c)} = \sum_t \hat{y}^{(c)}_t \quad (7)$$

As indicated by (6), prior to generating a class prediction, a CAM at resolution $t$ is produced. This heatmap contains both positive and negative contributions from the input image towards class $c$. However, for the sake of heatmap clarity, we consider only the positive contributions towards each class when generating our CAMs. That is, we want to see what part of the image contributes to normality/abnormality, as opposed to what part of the image does not contribute to normality/abnormality. Thus, our CAMs are generated following

$$\hat{y}^{(c)}_{\text{CAM}_t} = \left[\sum_{k=1}^{K} w_{k,c}\, T_{x,t}(k)\right]^+ \quad (8)$$

where $z^+ = \max(0, z)$. A loss based on the final score alone would not force the network to produce meaningful CAMs at every resolution level. Therefore, we also propose to deeply supervise the side predictions in our proposed loss:

$$\mathcal{L}\left(x, y, \hat{\theta}, \{\hat{y}^{(c)}_t\}_{c=1,t=1}^{C,T}\right) = \mathcal{L}_f\left(x, y, \hat{\theta}, \{\hat{y}^{(c)}_t\}_{c=1,t=1}^{C,T}\right) + \sum_t \mathcal{L}^t_s\left(x, y, \hat{\theta}, \{\hat{y}^{(c)}_t\}_{c=1}^{C}\right) \quad (9)$$

where $x$ is the input image, $y$ the ground truth class label, $\hat{\theta}$ the network parameters, and $\{\hat{y}^{(c)}_t\}_{c=1,t=1}^{C,T}$ represent the score predictions for each class $c$ at resolution $t$. Both $\mathcal{L}_f(\cdot)$ and $\mathcal{L}^t_s(\cdot)$ are denoted $\mathcal{L}_f$ and $\mathcal{L}^t_s$ for simplified notation. $\mathcal{L}_f$ is defined as

$$\mathcal{L}_f = -y \log\left[\sigma\!\left(\sum_t \hat{y}^{(c)}_t\right)_{c=1}\right] - (1-y)\log\left[\sigma\!\left(\sum_t \hat{y}^{(c)}_t\right)_{c=0}\right] \quad (10)$$

where $\sigma(\cdot)_c$ represents the softmax function for class index $c$. $\mathcal{L}^t_s$ is the side loss for the prediction at each different resolution $t$, defined as:

$$\mathcal{L}^t_s = -y \log\left[\sigma\!\left(\hat{y}^{(c)}_t\right)_{c=1}\right] - (1-y)\log\left[\sigma\!\left(\hat{y}^{(c)}_t\right)_{c=0}\right] \quad (11)$$

In addition to the network generating CAMs at every resolution prior to generating the scores as part of the prediction pipeline, the combined loss $\mathcal{L}$ allows for the validation of the accuracy at each resolution depth of the network. We refer to the architecture that implements the model in (5) with embedded CAMs at different resolutions following (6) and loss (9) as ResNet-18-CAM-DS (see Fig. 2).
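The deeply supervised scoring and loss of (6), (7) and (9)-(11) can be sketched numerically. The per-resolution feature tensors and weights below are random placeholders (illustrative shapes, not the trained network's):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / e.sum()

def side_score(T_t, w_t):
    """Eq. (6): per-resolution class scores = GAP of the per-class CAM."""
    cam = np.tensordot(T_t, w_t, axes=([2], [0]))  # (H, W, C) CAM at resolution t
    return cam.mean(axis=(0, 1))                   # (C,) side scores

def cross_entropy(scores, y):
    """Eqs. (10)/(11): softmax cross-entropy on a binary score vector."""
    return -np.log(softmax(scores)[y])

rng = np.random.default_rng(0)
y = 1  # ground-truth label (abnormal)
# Hypothetical feature tensors at three resolutions (spatial size halves per level)
feats = [rng.normal(size=(32, 32, 128)),
         rng.normal(size=(16, 16, 256)),
         rng.normal(size=(8, 8, 512))]
weights = [rng.normal(size=(f.shape[2], 2)) * 0.01 for f in feats]

side_scores = [side_score(T, w) for T, w in zip(feats, weights)]
final_scores = np.sum(side_scores, axis=0)         # Eq. (7): sum over resolutions
loss = cross_entropy(final_scores, y) + sum(       # Eq. (9): L_f + side losses
    cross_entropy(s, y) for s in side_scores)
```

Every resolution contributes its own cross-entropy term, which is what pressures each side CAM, not just the deepest one, to be discriminative.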
Table 1
Results for ResNet-18 (baseline model) on frame classification over the testing set of each fold of the IPCL dataset.
Measure (%) Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Sensitivity    99.–    –    –    –    –    –
Specificity    –    –    –    –    –    –
Accuracy    –    –    –    –    –    –
F score    95.–    –    –    –    –    –

Table 2
Results for ResNet-18-CAM on frame classification over the testing set of each fold of the IPCL dataset.
Measure (%) Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Sensitivity    98.–    –    –    –    –    –
Specificity    –    –    –    –    –    –
Accuracy    –    –    –    –    –    –
F score    96.–    –    –    –    –    –

Table 3
Results for ResNet-18-CAM-DS on frame classification over the testing set of each fold of the IPCL dataset.
Measure (%) Fold 1 Fold 2 Fold 3 Fold 4 Fold 5 Average
Sensitivity    99.–    –    –    –    –    –
Specificity    –    –    –    –    –    –
Accuracy    –    –    –    –    –    –
F score    94.–    –    –    –    –    –

5 Results

Our recording protocol (see Section 3.2) enforces that areas recorded in the short patient clips are biopsied. Histopathology labels (normal/abnormal) corresponding to the biopsied specimen are propagated to all the frames of the clip. It is then of interest to evaluate the agreement between the label assigned to each individual frame (based on the patient's histopathology) and the assessment made by visual inspection of IPCL patterns. A team of 12 senior clinicians with experience in endoscopic imaging of oesophageal cancer labelled 158 images from the dataset (randomly picked across patients and manually filtered so that quasi-identical images are not included). A 25% proportion per IPCL pattern class (normal, B1, B2, B3) is kept across the sample (leading to an imbalance of 25% normal, 75% abnormal). The inter-rater agreement was evaluated using Krippendorff's α coefficient, where values of 0% and 100% represent extreme disagreement and perfect agreement respectively, α ≥ 80% indicates reliable agreement, and α ≥ 66.7% allows for tentative conclusions. The α obtained for the senior clinicians was 76%, and their average F score (given in %, with a 95% confidence interval) against gold standard histopathology was 97%.

We report the quantitative classification results for ResNet-18, ResNet-18-CAM, and ResNet-18-CAM-DS in Tables 1, 2, and 3 respectively. ResNet-18-CAM-DS achieved an average sensitivity of 93%, with specificity, accuracy, and F score comparable to ResNet-18 and ResNet-18-CAM. Accuracy is only three percentage points away from the average of the clinical raters. Across all folds a total of 60 patient clips (12 per fold) are predicted to be normal/abnormal. The binary class estimation for each clip is computed following (1). Each patient in the dataset folder has a unique identification number; we refer to these numbers in this section to facilitate the search for these patients in the dataset folder. Following (1) to estimate the class of a patient clip, ResNet-18 fails on three patients: folds 1, 2, and 4 fail on patient 158 (false positive), fold 3 fails on patient 143 (false positive), and fold 5 fails on patient 66 (false negative). ResNet-18-CAM fails on two patients: 143 (false positive) on fold 3, and 66 (false negative) on fold 5. ResNet-18-CAM-DS fails only on folds 1 and 4, in patient 158 (see supplementary material for some frames of these problematic patients). In Fig. 3 a qualitative comparison is shown between the class activation maps produced for the abnormal class by ResNet-18-CAM-DS (at its highest resolution) and the standard class activation maps proposed by Zhou et al. [18]. As our system is designed as a CADe, we have computed the ROC curve (see supplementary material) to inform the consequences that several choices of sensitivity have on specificity. The AUC of the system is 95%.

6 Discussion and conclusions

Our proposed method ResNet-18-CAM-DS achieves slightly higher average accuracy (91%) than ResNet-18 and ResNet-18-CAM. The class activation maps embedded at every resolution suggest that the network is looking at IPCL patterns to assess abnormality, which aligns with clinical practice. However, we have not observed high activations over the large green submucosal vessels in the heatmaps for the normal class. This suggests that they may not be used by the network as an aid to solving the classification problem. Future work could concentrate on adding an attention mechanism to the network in order to consider such vessels as a feature of normal images.
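Krippendorff's α for nominal data, used above to quantify inter-rater agreement, is one minus the ratio of observed to expected disagreement. A from-scratch sketch for the complete-data case (every rater labels every image); this is an illustration, not the implementation the authors used:

```python
import numpy as np

def krippendorff_alpha_nominal(ratings):
    """ratings: (n_units, n_raters) array of category labels, no missing values."""
    ratings = np.asarray(ratings)
    n_units, m = ratings.shape
    cats = np.unique(ratings)
    # n_uc: number of raters assigning category c to unit u
    counts = np.stack([(ratings == c).sum(axis=1) for c in cats], axis=1)
    n_total = counts.sum()
    # Observed disagreement: within-unit pairs of unequal categories
    pairs_diff = ((counts.sum(axis=1) ** 2 - (counts ** 2).sum(axis=1)) / (m - 1)).sum()
    d_obs = pairs_diff / n_total
    # Expected disagreement from the overall category frequencies
    n_c = counts.sum(axis=0)
    d_exp = (n_total ** 2 - (n_c ** 2).sum()) / (n_total * (n_total - 1))
    return 1.0 - d_obs / d_exp

perfect = [[0, 0, 0], [1, 1, 1], [0, 0, 0]]   # 3 images, 3 raters, full agreement
print(krippendorff_alpha_nominal(perfect))     # 1.0
mixed = [[0, 0, 1], [1, 1, 1], [0, 0, 0]]      # one rater disagrees on one image
print(krippendorff_alpha_nominal(mixed))       # 0.6
```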
Declaration of conflicting interests, ethical approval and informed consent
R. J. H. has received research grant support from Pentax Medical, Cook Endoscopy, Fractyl Ltd, Beamline Ltd and Covidien plc to support research infrastructure. T. V. owns shares from Mauna Kea Technologies, Paris, France. The other authors declare that they have no conflict of interest. All procedures performed in studies involving human participants were in accordance with the ethical standards of the institutional and/or national research committee and with the 1964 Helsinki declaration and its later amendments or comparable ethical standards. The Institutional Review Board of E-Da Hospital approved this study (IRB number: EMRP-097-022, July 2017). Informed consent was obtained from all individual participants included in the study.
References
1. Cho, J.W., Choi, S.C., Jang, J.Y., Shin, S.K., Choi, K.D., Lee, J.H., Kim, S.G., Sung, J.K., Jeon, S.W., Choi, I.J., Kim, G.H., Jee, S.R., Lee, W.S., Jung, H.Y.: Lymph Node Metastases in Esophageal Carcinoma: An Endoscopist's View. Clinical Endoscopy (6), 523 (2014). DOI 10.5946/ce.2014.47.6.523
2. Everson, M., Herrera, L., Li, W., Luengo, I.M., Ahmad, O., Banks, M., Magee, C., Alzoubaidi, D., Hsu, H., Graham, D., Vercauteren, T., Lovat, L., Ourselin, S., Kashin, S., Wang, H.P., Wang, W.L., Haidry, R.: Artificial intelligence for the real-time classification of intrapapillary capillary loop patterns in the endoscopic diagnosis of early oesophageal squamous cell carcinoma: A proof-of-concept study. United European Gastroenterology Journal (2), 297–306 (2019). DOI 10.1177/2050640618821800
3. Garcia-Peraza-Herrera, L.C., Everson, M., Li, W., Luengo, I., Berger, L., Ahmad, O., Lovat, L., Wang, H.P., Wang, W.L., Haidry, R., Stoyanov, D., Vercauteren, T., Ourselin, S.: Interpretable Fully Convolutional Classification of Intrapapillary Capillary Loops for Real-Time Detection of Early Squamous Neoplasia. arXiv (2018). URL http://arxiv.org/abs/1805.00632
4. Guo, L., Xiao, X., Wu, C., Zeng, X., Zhang, Y., Du, J., Bai, S., Xie, J., Zhang, Z., Li, Y., Wang, X., Cheung, O., Sharma, M., Liu, J., Hu, B.: Real-time automated diagnosis of precancerous lesions and early esophageal squamous cell carcinoma using a deep learning model (with videos). Gastrointestinal Endoscopy (1), 41–51 (2020). DOI 10.1016/j.gie.2019.08.018
5. Hassan, C., Wallace, M.B., Sharma, P., Maselli, R., Craviotto, V., Spadaccini, M., Repici, A.: New artificial intelligence system: first validation study versus experienced endoscopists for colorectal polyp detection. Gut pp. gutjnl-2019-319914 (2019). DOI 10.1136/gutjnl-2019-319914
6. He, K., Zhang, X., Ren, S., Sun, J.: Deep Residual Learning for Image Recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778. IEEE (2016). DOI 10.1109/CVPR.2016.90
7. Inoue, H., Honda, T., Yoshida, T., Nishikage, T., Nagahama, T., Yano, K., Nagai, K., Kawano, T., Yoshino, K., Tani, M., Takeshita, K., Endo, M.: Ultra-high Magnification Endoscopy of the Normal Esophageal Mucosa. Digestive Endoscopy (2), 134–138 (1996). DOI 10.1111/j.1443-1661.1996.tb00429.x
8. Krippendorff, K.: Content Analysis: An Introduction to Its Methodology. Sage Publications, Inc. (2004). URL https://us.sagepub.com/en-us/nam/content-analysis/book258450
9. Luo, H., Xu, G., Li, C., He, L., Luo, L., Wang, Z., Jing, B., Deng, Y., Jin, Y., Li, Y., Li, B., Tan, W., He, C., Seeruttun, S.R., Wu, Q., Huang, J., Huang, D.W., Chen, B., Lin, S.B., Chen, Q.M., Yuan, C.M., Chen, H.X., Pu, H.Y., Zhou, F., He, Y., Xu, R.H.: Real-time artificial intelligence for detection of upper gastrointestinal cancer by endoscopy: a multicentre, case-control, diagnostic study. The Lancet Oncology (12), 1645–1654 (2019). DOI 10.1016/S1470-2045(19)30637-0
10. Menon, S., Trudgill, N.: How commonly is upper gastrointestinal cancer missed at endoscopy? A meta-analysis. Endoscopy International Open (02), E46–E50 (2014). DOI 10.1055/s-0034-1365524
11. Montavon, G., Samek, W., Müller, K.R.: Methods for interpreting and understanding deep neural networks. Digital Signal Processing, 1–15 (2018). DOI 10.1016/j.dsp.2017.10.011
12. Oyama, T., Inoue, H., Arima, M., Momma, K., Omori, T., Ishihara, R., Hirasawa, D., Takeuchi, M., Tomori, A., Goda, K.: Prediction of the invasion depth of superficial squamous cell carcinoma based on microvessel morphology: magnifying endoscopic classification of the Japan Esophageal Society. Esophagus (2), 105–112 (2017). DOI 10.1007/s10388-016-0527-7
13. Sato, H., Inoue, H., Ikeda, H., Sato, C., Onimaru, M., Hayee, B., Phlanusi, C., Santi, E., Kobayashi, Y., Kudo, S.E.: Utility of intrapapillary capillary loops seen on magnifying narrow-band imaging in estimating invasive depth of esophageal squamous cell carcinoma. Endoscopy (02), 122–128 (2015). DOI 10.1055/s-0034-1390858
14. Su, J.R., Li, Z., Shao, X.J., Ji, C.R., Ji, R., Zhou, R.C., Li, G.C., Liu, G.Q., He, Y.S., Zuo, X.L., Li, Y.Q.: Impact of a real-time automatic quality control system on colorectal polyp and adenoma detection: a prospective randomized controlled study (with videos). Gastrointestinal Endoscopy (2), 415–424.e4 (2020). DOI 10.1016/j.gie.2019.08.026
15. Wang, P., Berzin, T.M., Glissen Brown, J.R., Bharadwaj, S., Becq, A., Xiao, X., Liu, P., Li, L., Song, Y., Zhang, D., Li, Y., Xu, G., Tu, M., Liu, X.: Real-time automatic detection system increases colonoscopic polyp and adenoma detection rates: a prospective randomised controlled study. Gut (10), 1813–1819 (2019). DOI 10.1136/gutjnl-2018-317500
16. Zhang, Y.: Epidemiology of esophageal cancer. World Journal of Gastroenterology (34), 5598 (2013). DOI 10.3748/wjg.v19.i34.5598
17. Zhao, Y.Y., Xue, D.X., Wang, Y.L., Zhang, R., Sun, B., Cai, Y.P., Feng, H., Cai, Y., Xu, J.M.: Computer-assisted diagnosis of early esophageal squamous cell carcinoma using narrow-band imaging magnifying endoscopy. Endoscopy (04), 333–341 (2019). DOI 10.1055/a-0756-8754
18. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning Deep Features for Discriminative Localization. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2921–2929. IEEE (2016). DOI 10.1109/CVPR.2016.319
Supplementary Material
A Number of patients and frames per fold
The number of patients per fold is shown in appendix Table A1. Similarly, the number of frames is shown in appendix Table A2.
Table A1
Number of patients per fold (80% training, 10% validation, 10% testing).
Dataset    Fold 1    Fold 2    Fold 3    Fold 4    Fold 5
Table A2
Number of frames per fold.
Dataset    Fold 1    Fold 2    Fold 3    Fold 4    Fold 5
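The patient-level split described in Section 3.3 (patients, not frames, assigned 80/10/10 to training/validation/testing) can be sketched as follows; the patient IDs here are placeholders, not the dataset's actual identifiers:

```python
import random

def patient_split(patient_ids, seed=0, train=0.8, val=0.1):
    """Split patient IDs (not frames) into train/validation/test subsets."""
    ids = list(patient_ids)
    random.Random(seed).shuffle(ids)   # deterministic shuffle per fold seed
    n = len(ids)
    n_train = round(n * train)
    n_val = round(n * val)
    return (ids[:n_train],                 # 80% training
            ids[n_train:n_train + n_val],  # 10% validation
            ids[n_train + n_val:])         # 10% testing

train, val, test = patient_split(range(114))
print(len(train), len(val), len(test))  # 91 11 12
```

Splitting at the patient level prevents near-identical frames of the same clip from leaking between training and testing.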
B Qualitative classification results and cases of patient failure
Figure A1
Qualitative results for ResNet-18-CAM-DS on frame classification over the testing set of each fold. TP, TN, FP, FN stand for true positives, true negatives, false positives and false negatives respectively. Best TP refers to the abnormal image with the highest estimated probability of being abnormal. Analogously, the best TN represents the image with the lowest estimated probability. Median and worst cases are estimated in a similar fashion. The FP median and worst case of fold 5 are the same image because in the testing set of this fold there is only this false positive image.
Figure A2
Representative frames of cases of patient failure (i.e. when the average estimated class for the whole patient clip is wrong). Patient 158 (normal IPCL, left), 143 (normal IPCL, center), 66 (abnormal IPCL, right). ResNet-18 failed on all of them, ResNet-18-CAM only on 143 and 66, and ResNet-18-CAM-DS only on case 158.
C ROC result for ResNet-18-CAM-DS