Efficient video indexing for monitoring disease activity and progression in the upper gastrointestinal tract
Sharib Ali, Jens Rittscher
Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford, Oxford
ABSTRACT
Endoscopy is a routine imaging technique used for both diagnosis and minimally invasive surgical treatment. While the endoscopy video contains a wealth of information, tools to capture this information for the purpose of clinical reporting are rather poor. To date, endoscopists do not have access to tools that enable them to browse the video data in an efficient and user-friendly manner. Fast and reliable video retrieval methods could, for example, allow them to review data from previous exams and therefore improve their ability to monitor disease progression. Deep learning provides new avenues for compressing and indexing video in an extremely efficient manner. In this study, we propose to use an autoencoder for efficient video compression and fast retrieval of video images. To boost the accuracy of video image retrieval and to address data variability such as multi-modality and view-point changes, we propose the integration of a Siamese network. We demonstrate that our approach is competitive in retrieving images from 3 large-scale videos of 3 different patients when queried with samples from their previous diagnoses. Quantitative validation shows that the combined approach yields an overall improvement of 5% and 8% over classical and variational autoencoders, respectively.
Index Terms— Endoscopy, deep learning, autoencoders, Siamese network, image retrieval
1. INTRODUCTION
Due to the lack of efficient tools for indexing and retrieval, the wealth of information contained in endoscopy video is only rarely used for improving clinical reporting and diagnosis. Today, only manually selected still frames (low-quality screenshots) are included in clinical reports. In addition, it is not feasible for endoscopists to look through long videos. Computer-assisted methods for extracting clinically relevant video frames from a larger video stream do not exist. Compressing and indexing the entire video in an extremely efficient manner would open up the possibility of using corresponding data from previous exams to enable the monitoring of disease progression and response to therapy. Content-based image retrieval (CBIR) methods locate images of interest (the query) in target databases. Such techniques have already been applied to large-scale medical databases [1, 2, 3].

Oesophageal carcinoma is the sixth leading cause of cancer-related mortality and the eighth most common cancer worldwide. The overall 5-year survival of patients with oesophageal carcinoma ranges from 15% to 25%, while earlier diagnosis can result in an improved survival rate [4]. Endoscopy is a routine clinical procedure for both diagnosis and early treatment of pre-cancerous malignancies seen in the oesophagus, commonly referred to as "Barrett's Oesophagus" (BE). Patients with BE have a 30–125-fold increased lifetime risk of developing oesophageal cancer. Barrett's is defined as the substitution of the normal stratified squamous epithelium of the oesophagus with a columnar cell lining and can be visualized using an endoscope. Patients with BE must therefore undergo periodic endoscopies during which biopsies are taken to examine the cancer risk. The goal of this work is to develop an approach that can effectively support the monitoring of these patients by utilizing the information content of video endoscopy at an optimal compression, speed and retrieval accuracy, which are important factors for clinical usability.
2. RELATED WORK
Existing CBIR approaches are typically not suitable for clinical use as they utilise a very low-resolution representation of the video data, and diagnostically relevant detail is often lost. Such systems use derived features such as texture, colour and shape as well as local spatial properties to learn a low-dimensional representation; often binary coding or hashing is used. Query images are then compared with the low-dimensional representations of target images for fast image retrieval. The efficiency of CBIR directly depends on the feature extraction and representation used. Liu et al. [5] used a colour difference histogram utilizing colours and edge orientations for better feature representation. Murala et al. [1] employed local binary patterns (LBP) to extract features from low-level wavelet sub-bands in CT and MRI data. The scale-invariant feature transform (SIFT) was used in [2] to derive a bag-of-visual-words representation. Ye et al. [3] used an LBP-based 496-dimensional image histogram descriptor and a hashing technique based on random forests for real-time biopsy retargeting utilizing video endoscopy. All of these previous approaches are based on hand-crafted features and do not incorporate high-level semantics (the semantic gap).

Recently, approaches that make use of advances in deep learning have demonstrated considerable success in learning both low- and high-level semantically relevant features. Ahmad et al. [6] suggested using convolutional neural network (CNN) based salient features to retrieve endoscopic images. Convolutional kernels from the first layer of a pre-trained AlexNet model were used along with a pooling strategy to achieve a compact 96-bin histogram.
We argue that variations in endoscopy data cannot be captured using such pre-trained models, as the features in endoscopic images are very different from the natural images used to pre-train such networks. Masci and colleagues [7] demonstrated that unsupervised convolutional autoencoders can be used to extract salient features. Krizhevsky and Hinton [8] showed that deep autoencoders can be applied to extremely fast image retrieval, independent of database size, due to the high data-compression capability of autoencoders. Stacked autoencoders were used by Sharma and colleagues [9] on medical images (X-ray data).

Two challenges need to be addressed before automated image retrieval from endoscopy video can be applied in the clinical setting: 1) efficient compression: the number of images per video is extremely large (nearly 15-40 thousand), which demands large storage (nearly 1.5-4 GB per video); and 2) preservation of diagnostically relevant features: while it is necessary to preserve a high level of anatomical detail, various artefacts need to be discarded. We argue that autoencoders are extremely well suited to address the first problem. However, in the presence of the challenges posed by data variability, autoencoders can fail to accurately retrieve images in a restricted search space. In this paper, we propose to use autoencoders (AE) and a Siamese network working side-by-side to achieve better compression, and fast and accurate image retrieval. We demonstrate that combining a classical AE with a Siamese network, or a variational autoencoder (VAE, [10]) with a Siamese network, results in improved accuracy. We observed that the VAE compresses data nearly 70-fold more than the classical AE at an improved retrieval speed, but with a compromise in accuracy. However, when combined with a Siamese network, it yields very competitive retrieval results. We have compared retrieval performance with and without the Siamese network for both autoencoders on 3 different BE patient videos.
The rest of the paper is organized as follows: Section 3 briefly describes classical and variational autoencoders, Siamese networks, and our combined approach for the image retrieval task. Section 4 evaluates our combined approaches utilizing oesophageal endoscopic video data, and Section 5 concludes the paper.
3. METHOD
After providing a brief description of autoencoders and Siamese networks, we motivate why these should be combined and present our approach for efficient retrieval of video endoscopy.

Fig. 1: Autoencoder for video frame retrieval. Query and target video images are first transformed to the latent variable (LV) space using trained encoders. LVs sorted by the estimated distance (l2-norm) between query and target LVs are used to retrieve candidate frames from the target video. A Siamese network is used to further penalize dissimilarity between the query and the retrieved images. Target video images are encoded (compressed) offline (dashed rectangle) to achieve real-time image retrieval.

3.1. Autoencoder

An autoencoder is an unsupervised machine learning algorithm that is capable of learning an efficient and compressed data representation, referred to as a "coding" or "latent-space representation", say z. A reverse process, "decoding", is performed to achieve outputs similar to the input data. The decoder tries to reconstruct the input from the smaller number of bits in the bottleneck (latent space). A latent-space representation is learned when the dissimilarity between the decoder output and the input data is minimized.

Here, a convolutional autoencoder (CAE, [7]) that can be trained in an end-to-end fashion is used. Our architecture consists of 3 convolution filter layers with ReLU activation functions and a fully connected last layer (32 dense connections). A subsequent downsampling with stride 2 is performed at each layer of the encoder, and a similar architecture with stride-2 upsampling is used for the decoder. The decoder unflattens the encoder output first and then upscales it using similar-size (mirrored) convolution filters. Cross-entropy is used as the loss function with an Adadelta optimizer and ReLU activations. The image size used for training is × × . There are in total 4,089,283 trainable parameters. We trained our CAE for 500 epochs utilizing 33,000 samples for training and 15,000 for validation.
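The encoder half described above can be sketched as three stride-2 convolutions with ReLU followed by a 32-unit dense bottleneck. The following is a minimal numpy illustration; the 64×64 toy input, the 3×3 kernels, and the filter counts (8, 16, 32) are assumptions for illustration only, not the paper's exact configuration.

```python
import numpy as np

def conv2d(x, w, stride=2):
    """Naive 'valid' 2-D convolution with stride; x: (H, W, Cin), w: (k, k, Cin, Cout)."""
    k = w.shape[0]
    H = (x.shape[0] - k) // stride + 1
    W = (x.shape[1] - k) // stride + 1
    out = np.zeros((H, W, w.shape[3]))
    for i in range(H):
        for j in range(W):
            patch = x[i * stride:i * stride + k, j * stride:j * stride + k, :]
            out[i, j] = np.tensordot(patch, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def relu(x):
    return np.maximum(x, 0.0)

rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 3))            # toy input frame (illustrative size)
w1 = rng.standard_normal((3, 3, 3, 8)) * 0.1
w2 = rng.standard_normal((3, 3, 8, 16)) * 0.1
w3 = rng.standard_normal((3, 3, 16, 32)) * 0.1

h = relu(conv2d(x, w1))                          # stride-2 downsampling at each layer
h = relu(conv2d(h, w2))
h = relu(conv2d(h, w3))                          # bottleneck feature map, 7x7x32 here
wd = rng.standard_normal((h.size, 32)) * 0.01
z = h.reshape(-1) @ wd                           # flatten + dense layer -> 32-dim code
print(z.shape)
```

The decoder mirrors this structure, unflattening z and upscaling with stride-2 layers of the same kernel sizes.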
Table 1: Details on training, testing and compression. All the computations were done on an NVIDIA Tesla P100. Information regarding compression is provided for 15,000 video images, each of size × , and does not include data-loading time. Test times are provided for listing 100 retrieved images from a pool of 15,000 images for the autoencoders, and for retrieval of 10 images from 100 images (each rescaled to 1 × pixels) for the Siamese network.

| Method  | Train samples | Epochs | Train time | Data (frames) | Compressed size | Comp. time (s) | Test load (s) | Test execute (s) |
| AE      | 33k/15k       | 500    | 18/epoch   | 15k           | 160 MB          | 1.63           | 3             | 0.16             |
| VAE     | 33k/15k       | 500    | 21/epoch   | 15k           | 1.7 MB          | 1.21           | 2             | 0.18             |
| Siamese | 900/100       | 1000   | 48/epoch   | -             | -               | -              | 1.05          | 1.35             |

3.2. Variational autoencoder

In contrast to classical autoencoders, where information regarding the input data distribution is not known, variational autoencoders [10] assert the latent-space representation to be drawn from a unit normal distribution, N(0, I). Thus, such autoencoders are effective generative models that can produce samples from the learned unit normal distribution. For our image retrieval task, the encoded target embedding might not exactly match the encoded query, but VAEs have a strong ability to learn more meaningful latent representations, yielding an effective search space.

We used the same architecture (refer to Sec. 3.1) for our encoder-decoder network in the VAE. However, the final layer of the VAE encoder consists of mean µ and standard deviation σ vectors for n (= 10 in our case) latent embeddings. The actual coding z is then sampled via the unit normal distribution for decoding. The negative cross-entropy loss is minimized using an Adam optimizer. In order to push the autoencoder to learn a unit normal distribution, a second "latent loss" is used, computed as the KL-divergence between the target normal distribution and the actual coding. The total number of trainable parameters is 1,362,480, which is far fewer than for the classical autoencoder, as only the means and variances of the data are learnt. In our retrieval task, we compare only the mean embeddings of the encoded target dataset and the query images. This reduces both the computational complexity and the embedding file size (see Tab. 1). We used a 10-dimensional latent variable space and trained our VAE for 500 epochs utilizing 33,000 samples for training and 15,000 for validation.
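The two VAE ingredients described above, sampling z via a unit normal and the KL "latent loss", can be sketched as follows. This is an illustrative numpy version of the standard formulas, not the paper's training code; the reparameterization form z = µ + σ·ε is the usual way the sampling step is made differentiable.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    # z = mu + sigma * eps with eps ~ N(0, I): sampling via the unit normal
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_unit_normal(mu, log_var):
    # KL( N(mu, sigma^2) || N(0, I) ), summed over the latent dimensions;
    # this is the "latent loss" pushing the coding towards a unit normal.
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

rng = np.random.default_rng(0)
mu = np.zeros(10)        # 10-dimensional latent space, as used in the paper
log_var = np.zeros(10)   # sigma = 1 everywhere, so KL to N(0, I) is exactly 0
z = reparameterize(mu, log_var, rng)
print(z.shape, kl_to_unit_normal(mu, log_var))
```

At retrieval time only the mean vectors µ are compared, so neither the sampling step nor the σ vector needs to be stored for the target video.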
3.3. Siamese network

A Siamese network [11] learns to differentiate between two input images and consists of two identical neural networks, say N1 and N2, for doing this. The dissimilarity is computed with a contrastive loss function:

(1 − Y) · ½ D_N² + Y · ½ {max(0, m − D_N)}²,   (1)

where D_N is the Euclidean distance between the outputs of the sister networks, Y is the class label and m > 0 is the margin value. Both sister networks N1 and N2 share exactly the same weights.

We trained a Siamese network to deal with multi-modality and varying view-points in our oesophageal endoscopy dataset. For this we created a database consisting of 100 sets of images with 10 images each (1,000 in total) that included WL, NBI and 8 different viewing angles. Paired multi-modality images were generated using a trained domain-adaptation network ('cycleGAN' [12]). The cycleGAN was trained using 300 pairs of WL and NBI images. To address view-point changes in endoscopy, simulated images rotated through different angles (◦ ≤ θ ≤ ◦) were generated. An Adam optimizer was used to minimize the contrastive loss (see Eq. 1). The network was trained for 1000 epochs and 100 iterations with a learning parameter of 0.005.

3.4. Combined approach

An overview of the proposed combined approach is presented in Fig. 1. First, both the autoencoder and the Siamese network are trained separately (see Sections 3.1-3.3). Then, the trained autoencoder is used to compress the target video into a low-dimensional latent variable space vector (target LV). Offline batch processing can be used to encode all target videos needed for patient follow-up. Thanks to the very high compression ratios this method achieves, only a modest amount of storage is required to hold the compressed videos. Given a new query image, the trained autoencoder is used to project it into the latent space. The resulting latent variable space vector (source LV) is then compared with the target LVs using an L2 metric. The 100 most similar LVs are sorted by their similarity scores and passed either to a decoder or to an image list. The trained Siamese network is then used to penalise any images which do not satisfy the similarity requirement. The distance output of the Siamese network is used to produce the final ranking of the candidate images. Finally, the n best retrieved images (in our case n = 10) are presented.
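The two-stage retrieval described above (coarse L2 ranking of latent vectors, then Siamese re-ranking of a 100-candidate shortlist) can be sketched as follows, together with the contrastive loss of Eq. (1). The Siamese distance here is a stand-in (plain Euclidean on the latent codes); in the actual system it is the learned distance between the image embeddings of the two sister networks.

```python
import numpy as np

def contrastive_loss(d, y, margin=1.0):
    # Eq. (1): similar pairs (y = 0) are pulled together, dissimilar pairs
    # (y = 1) are pushed apart until their distance exceeds the margin m.
    return (1 - y) * 0.5 * d**2 + y * 0.5 * max(0.0, margin - d)**2

def retrieve(query_lv, target_lvs, siamese_dist, k_coarse=100, n_best=10):
    """Coarse L2 ranking in latent space, then Siamese re-ranking of the shortlist."""
    d = np.linalg.norm(target_lvs - query_lv, axis=1)   # L2 distance to every frame's LV
    shortlist = np.argsort(d)[:k_coarse]                # 100 most similar LVs
    scores = np.array([siamese_dist(query_lv, target_lvs[i]) for i in shortlist])
    return shortlist[np.argsort(scores)][:n_best]       # n-best after re-ranking

# Stand-in Siamese distance for this sketch only.
siamese_dist = lambda a, b: float(np.linalg.norm(a - b))

rng = np.random.default_rng(1)
target = rng.standard_normal((15000, 32))               # offline-encoded target video
query = target[1234] + 0.01 * rng.standard_normal(32)   # near-duplicate of frame 1234
best = retrieve(query, target, siamese_dist)
print(best[:1])
```

Because the target video is encoded offline, only the (15000, 32) array of codes needs to be loaded at query time, which is what makes the search fast and independent of the raw video size.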
4. EXPERIMENTS
10 oesophageal endoscopy videos were used in this study. Each video consists of more than 15,000 frames. Video images of 4 different patients were combined randomly and used in different proportions for training our networks (see Table 1), and 6 videos of 3 different patients were used for our frame retrieval test (3 for query and 3 for retrieval).

Table 2: Query frame retrieval for BE diagnosis/treatment. Average TP (true positive), FP (false positive) and P (precision) are provided for 49 query samples randomly selected from previous visits of each patient's video endoscopy. All values were calculated by an expert over the first 10 retrieved images for each case, with reference to the query samples.

| Method      | Patient 1: TP / FP / P | Patient 2: TP / FP / P | Patient 3: TP / FP / P | Average: TP / FP / P |
| VAE         | 356 / 134 / 0.73       | 351 / 139 / 0.71       | 437 / 53 / 0.89        | 381.3 / 108 / 0.77   |
| AE          | 417 / 73 / 0.85        | 393 / 97 / 0.80        | 456 / 34 / 0.93        | 422 / 68 / 0.86      |
| VAE-Siamese | 399 / 91 / 0.81        | 410 / 80 / 0.83        | 439 / 51 / 0.89        | 416 / 74 / 0.85      |
| AE-Siamese  | 437 / 53 / 0.89        | 437 / 53 / 0.89        | 473 / 17 / 0.96        | 449 / 41 / 0.91      |
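The precision values in Table 2 follow from P = TP / (TP + FP), where each patient contributes 49 queries × 10 retrieved frames = 490 judged frames (so TP + FP = 490 per cell). A quick check against two of the AE rows:

```python
# Precision as used in Table 2: fraction of retrieved frames judged correct.
def precision(tp, fp):
    return tp / (tp + fp)

p1 = precision(417, 73)   # AE, patient 1: 417 + 73 = 490 judged frames
p3 = precision(456, 34)   # AE, patient 3: 456 + 34 = 490 judged frames
print(round(p1, 2), round(p3, 2))
```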
Fig. 2: Visual analysis of the improved performance of the proposed approach.
Table 2 shows the image retrieval performance of both the sole application of autoencoders and our combined approaches. It can be seen that the Siamese network improves the average retrieval precision by 8% and 5% for the VAE and AE, respectively. AE-Siamese yields the best results for all 3 patient cases (89%, 89% and 96%) and on average 6% higher precision than VAE-Siamese. However, from Tab. 1 it can be observed that the VAE has the best compression performance, nearly 70-fold more than the AE, and, when combined with the Siamese network, a very reasonable average precision of 85% (see Tab. 2).
Fig. 2 presents a visual analysis of image retrieval for two query video images in our dataset. These query images were searched in a video of the same patient, archived 6 months earlier and consisting of nearly 21,462 image frames.
5. CONCLUSION
While autoencoders allow for better compression of large-scale endoscopy data, eliminating the need for large archival spaces, our experiments demonstrated that a Siamese network on top can be used to provide a very effective similarity score for improved retrieval accuracy of clinically significant frames. Our resulting system can thus achieve high compression while maintaining a feature representation that keeps diagnostically relevant detail in the context of monitoring Barrett's oesophagus. Our future work will include combining text information from the report with image data to obtain more accurate and meaningful image retrieval from oesophageal endoscopic videos.
Acknowledgement
SA is supported by the NIHR Oxford BRC. JR is funded by EPSRC EP/M013774/1 Seebibyte.

6. REFERENCES

[1] Subrahmanyam Murala, R. P. Maheshwari, and R. Balasubramanian, "Directional binary wavelet patterns for biomedical image indexing and retrieval," Journal of Medical Systems, vol. 36, no. 5, pp. 2865–2879, Oct 2012.
[2] Antonio Foncubierta-Rodríguez, Alba García Seco de Herrera, and Henning Müller, "Medical image retrieval using bag of meaningful visual words: Unsupervised visual vocabulary pruning with PLSA," in Proceedings of the 1st ACM International Workshop on Multimedia Indexing and Information Retrieval for Healthcare, 2013, pp. 75–82.
[3] Menglong Ye, Edward Johns, Benjamin Walter, Alexander Meining, and Guang-Zhong Yang, "An image retrieval framework for real-time endoscopic image retargeting," International Journal of Computer Assisted Radiology and Surgery, vol. 12, no. 8, pp. 1281–1292, Aug 2017.
[4] Arjun Pennathur, Michael K Gibson, Blair A Jobe, and James D Luketich, "Oesophageal carcinoma," Lancet, vol. 381, no. 9864, pp. 400–412, Feb 2013.
[5] Guang-Hai Liu and Jing-Yu Yang, "Content-based image retrieval using color difference histogram," Pattern Recognition, vol. 46, no. 1, pp. 188–198, 2013.
[6] Jamil Ahmad, Khan Muhammad, Mi Young Lee, and Sung Wook Baik, "Endoscopic image classification and retrieval using clustered convolutional features," Journal of Medical Systems, vol. 41, no. 12, p. 196, Oct 2017.
[7] Jonathan Masci, Ueli Meier, Dan Cireşan, and Jürgen Schmidhuber, "Stacked convolutional auto-encoders for hierarchical feature extraction," in Proceedings of the 21st International Conference on Artificial Neural Networks (ICANN), Springer-Verlag, 2011, pp. 52–59.
[8] Alex Krizhevsky and Geoffrey E. Hinton, "Using very deep autoencoders for content-based image retrieval," in ESANN, April 2011.
[9] S. Sharma, I. Umar, L. Ospina, D. Wong, and H. R. Tizhoosh, "Stacked autoencoders for medical image search," in International Symposium on Visual Computing (ISVC), LNCS, Springer, 2016, pp. 45–54.
[10] Diederik P. Kingma and Max Welling, "Auto-encoding variational Bayes," CoRR, 2013.
[11] Gregory Koch, Richard Zemel, and Ruslan Salakhutdinov, "Siamese neural networks for one-shot image recognition," in ICML Deep Learning Workshop, 2015.
[12] J. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in ICCV, 2017.