A Hierarchical Conditional Random Field-based Attention Mechanism Approach for Gastric Histopathology Image Classification
Yixin Li, Xinran Wu, Chen Li, Changhao Sun, Md Rahaman, Haoyuan Chen, Yudong Yao, Xiaoyan Li, Yong Zhang, Tao Jiang
Abstract
In the Gastric Histopathology Image Classification (GHIC) tasks, which are usually weakly supervised learning missions, there is inevitably redundant information in the images. Therefore, designing networks that can focus on effective distinguishing features has become a popular research topic. In this paper, to accomplish the GHIC tasks superiorly and to assist pathologists in clinical diagnosis, an intelligent
Hierarchical Conditional Random Field based Attention Mechanism (HCRF-AM) model is proposed. The HCRF-AM model consists of an Attention Mechanism (AM) module and an Image Classification (IC) module. In the AM module, an HCRF model is built to extract attention regions. In the IC module, a Convolutional Neural Network (CNN) model is trained with the selected attention regions, and then an algorithm called Classification Probability-based Ensemble Learning is applied to obtain the image-level results from the patch-level outputs of the CNN. In the experiment, a classification specificity of 96.67% is achieved on a gastric histopathology dataset with 700 images. Our HCRF-AM model demonstrates high classification performance and shows its effectiveness and future potential in the GHIC field.
Keywords
Attention mechanism · conditional random field · gastric cancer · histopathology image · image classification

1 Introduction

Gastric cancer is one of the top five most frequently diagnosed malignant tumors worldwide, according to the World Health Organisation (WHO) report [1]. It remains a deadly cancer for its high incidence and fatality rate, which leads to over 1,000,000 new cases and over 700,000 deaths per year, making it the third leading cause of cancer deaths [2]. Surgical removal of gastric cancer in the early stage without metastasis is the only possible cure. The median survival of gastric cancer rarely exceeds 12 months, and after tumor metastasis, the 5-year survival rate is under 10% [3]. Therefore, early treatment can effectively reduce the possibility of death, and an accurate estimate of the patient's prognosis is demanded. Although endoscopic ultrasonography and Computerized Tomography (CT) are the primary methods for diagnosing gastric cancer, histopathology images are considered as the gold standard for the diagnosis [4]. However, histopathology images are usually large with redundant information, which means histopathology analysis is a time-consuming, specialized task that is highly associated with pathologists' skill and experience [5]. Professional pathologists are often in short supply, and long hours of heavy work can lead to lower diagnostic quality. Thus, an intelligent diagnosis system plays a significant role in automatically detecting and categorizing histopathology images.

In recent years, Deep Learning (DL) techniques have shown significant improvements in a wide range of computer vision tasks, including the diagnosis of gastric cancer, lung cancer and breast cancer, assisting doctors in classifying and analyzing medical images. Especially,
Gastric Histopathology Image Classification (GHIC) is a weakly supervised problem, which means that an image labeled as abnormal contains abnormal tissues with cancer cells and normal tissues without cancer cells existing in the surrounding area at the same time. However, the existing networks usually fail to focus only on the abnormal regions to make their diagnosis, which leads to noise regions and redundant information, bringing negative influence on the final decision-making process and affecting the network performance [6]. Therefore, some advanced methods are proposed to incorporate visual
Attention Mechanisms (AMs) into
Convolutional Neural Networks (CNNs), which allows a deep model to adaptively focus on the related regions of an image [7]. Moreover, the fully dense annotations of pathological findings, such as the contours or bounding boxes, are not available in most cases due to their costly and time-consuming nature. Hence, we propose an intelligent
Hierarchical Conditional Random Field based Attention Mechanism (HCRF-AM) model that includes additional region-level images to guide the attention of CNNs for the GHIC tasks. The HCRF-AM model includes the AM module (where the Hierarchical Conditional Random Field (HCRF) model [8, 9] is applied to extract attention areas) and the
Image Classification (IC) module. The workflow of the proposed HCRF-AM model is shown in Fig. 1.
Fig. 1
Workflow of the proposed HCRF-AM model for GHIC.
There are three main contributions of our work: First, the AM module integrated into the network improves both the performance and the interpretability of gastric cancer diagnosis. Second, we develop the HCRF model to obtain full annotations for weakly supervised classification tasks automatically. Third, we use a publicly available gastric histopathology image dataset, which consists of 700 images, and extensive experiments on this dataset demonstrate the effectiveness of our method.

This paper is organized as follows. In Sec. 2, we review the existing methods related to automatic gastric cancer diagnosis, AMs and the Conditional Random Field (CRF). We explain our proposed method in Sec. 3. Sec. 4 elaborates the experimental settings, implementation, results and comparison. Sec. 5 compares our method to previous GHIC studies. Sec. 6 concludes this paper and discusses the future work.
2 Related Work

2.1 Automatic Gastric Cancer Diagnosis

The examination of gastric cancer slides by pathologists is the only way to diagnose gastric cancer with confidence. Researchers have devoted a considerable amount of effort to this problem, and there is a great deal of work on the automatic classification of gastric histopathological images. Here, we group the Computer Aided Diagnosis (CAD) methods for GHIC into two types: classical Machine Learning (ML) techniques and Deep Learning (DL) techniques. The classical ML methods extract handcrafted features like color [13] and texture descriptors [14] [15] and use classifiers like the Support Vector Machine (SVM) [16] [17], Random Forest (RF) [18] and Adaboost algorithm [13] to make decisions. However, these classical ML methods only consider a handful of features in the images, yielding relatively low classification accuracy.

In recent years, numerous DL models have been proposed in the literature to diagnose gastric cancer with images obtained under the optical microscope. For instance, a purely supervised feedforward CNN model for the classification of gastric carcinoma Whole Slide Images (WSIs) is introduced in [19], and the performance of the developed DL approach is quantitatively compared with traditional image analysis methods requiring the prior computation of handcrafted features. The comparative experimental results reveal that DL methods compare favorably to traditional methods. The work in [20] creates a deep residual neural network model for GHIC tasks, which has deeper and more complex structures with fewer parameters and higher accuracy. A whole slide gastric image classification method based on Recalibrated Multi-instance Deep Learning (RMDL) is proposed in [21]. The RMDL provides an effective option to explore the interrelationship of different patches and consider their various impacts on the image-level label classification. A convolutional neural network of DeepLab-v3 with the ResNet-50 architecture is applied as the binary image segmentation method in [22], and the network is trained with 2123 pixel-level annotated Haematoxylin and Eosin (H&E) stained WSIs in their private dataset. A deep neural network that can learn multi-scale morphological patterns of histopathology images simultaneously is proposed in [23]. The work of [24] contributes to reducing the number of parameters of the standard Inception-v3 network by using a depth multiplier. The output of the Inception-v3 feature extractor feeds into a Recurrent Neural Network (RNN) consisting of two Long Short-Term Memory (LSTM) layers and forms the final architecture. The models are trained to classify WSIs into adenocarcinoma, adenoma, and non-neoplastic.

Although existing methods based on DL models provide a significant performance boost in gastric histopathology image analysis, they still neglect that the images in weakly-supervised learning tasks contain large redundant regions that are insignificant in the DL process, which is the main challenge in computational pathology.

2.2 Applications of Attention Mechanism

The visual Attention Mechanism (AM) has the capacity to make a deep model adaptively focus on the related regions of an image and hence is an essential way to enhance its effectiveness in many vision tasks, such as object detection [25], [26], image captioning [27], [28] and action recognition [29]. A prediction model to analyze whole slide histopathology images is proposed in [30], which integrates a recurrent AM. The AM is capable of attending to the discriminatory regions of an image by adaptively selecting a limited sequence of locations.
An attention-based CNN is introduced in [31], where the attention maps are predicted in the attention prediction subnet to highlight the salient regions for glaucoma detection. A DenseNet based Guided Soft Attention network is developed in [32], which aims at localizing regions of interest in breast cancer histopathology images and simultaneously using them to guide the classification network. A Thorax-Net for the classification of thorax diseases on chest radiographs is constructed in [6]. The attention branch of the proposed network exploits the correlation between the class labels and the locations of pathological abnormalities by analyzing the feature maps learned by the classification branch. Finally, a diagnosis is derived by averaging and binarizing the outputs of the two branches. A CAD approach called HIENet is introduced in [33] to classify histopathology images of endometrial diseases using a CNN and AM. The Position Attention block of the HIENet is a self-AM, which is utilized to capture the context relations between different local areas in the input images. GHIC is intrinsically a weakly supervised learning problem and the location of essential areas plays a critical role in the task. Therefore, it is reasonable to combine the AMs in the classification of tissue-scale gastric histopathology images.

2.3 Applications of Conditional Random Fields

Conditional Random Fields (CRFs), as an important and prevalent type of ML method, are designed for building probabilistic models to explicitly describe the correlation of the pixels or the patches being predicted and to label sequence data. CRFs are attractive in the field of ML because they have achieved success in various research fields, such as the Name Entity Recognition problem in Natural Language Processing [34], Information Mining [35], Behavior Analysis [36], Image and Computer Vision [37], and Biomedicine [38]. In recent years, with the rapid development of DL, CRF models are usually utilized as an essential pipeline within the deep neural network in order to refine the image segmentation results. Some research incorporates them into the network architecture, while others include them in the post-processing step. In [39], a dense CRF is embedded into the loss function of a deep CNN model to improve the accuracy and further refine the model. In [40], a multi-resolution hierarchical framework (called SuperCRF), inspired by the way pathologists perceive regional tissue architecture, is introduced. The labels of the CRF single-cell nodes are connected to the regional classification results from superpixels, producing the final result. In [41], a method based on a CNN is presented for the objective of automatic Gleason grading and Gleason pattern region segmentation of images with prostate cancer pathologies, where a CRF-based post-processing is applied to the prediction. In [42], a DL convolution network based on Group Equivariant Convolution and Conditional Random Field (GECNN-CRF) is proposed. The output probability of the CNN model is able to build up the unary potential of the CRFs. The pairwise loss function, used to express the magnitude of the correlation between two blocks, is designed by the feature maps of the neighboring patches.
In our previous work [43], an environmental microorganism classification engine that can automatically analyze microscopic images using a CRF and Deep Convolutional Neural Networks (DCNN) is proposed. The experimental results show 94.2% overall segmentation accuracy. In another work [44], we suggest multilayer hidden conditional random fields (MHCRFs) to classify gastric cancer images, achieving an overall accuracy of 93%. In [8], we optimize our architecture and propose the HCRF model, which is employed to segment gastric cancer images for the first time. The results show overall better performance compared to other existing segmentation methods on the same dataset. Furthermore, we combine the AM with the HCRF model and apply them to classification tasks, obtaining preliminary research results in [45]. For more information, please refer to our previous survey paper [46].

The spatial dependencies on patches are usually neglected in previous GHIC tasks, and the inference is only based on the appearance of individual patches. Hence, we describe an AM based on the HCRF framework in this paper, which has not been applied to the problem in this field before.
3 Method

Various kinds of classifiers have been used in GHIC tasks, and CNN classifiers are proved to achieve better performance than some classical Machine Learning (ML) methods. However, the results obtained by training them directly are not so satisfying. Considering that fact, we develop the HCRF-AM model to refine the classification results further. Our proposed method consists of three main building blocks: the Attention Mechanism (AM) module, the Image Classification (IC) module, and Classification Probability-based Ensemble Learning (CPEL). The structure of our HCRF-AM model is illustrated in Fig. 2. We explain each building block in the next subsections.
Fig. 2
Overview of the HCRF-AM framework for analyzing H&E stained gastric histopathological images. (a) The example of the input dataset. (b) The AM module. (c) The IC module.

3.1 Attention Mechanism Module

The HCRF, which is an improvement of the CRF [47], has excellent attention area detection performance because it can characterize the spatial relationship of images [46]. The fundamental definition of CRFs is introduced first. The detailed information of the HCRF model, including the pixel-unary, pixel-binary, patch-unary and patch-binary potentials and their combination, is elaborated afterwards.
3.1.1 Basics of CRFs

The basic theory of the CRF is introduced in [47]: Firstly, $Y$ is the random variable of the observation sequence, and $X$ is the random variable of the corresponding label sequence. Secondly, $G = (V, E)$ represents a graph such that $X = (X_v)_{v \in V}$, so that $X$ is indexed by the nodes, or vertices, of $G$. $V$ is the set of all sites, which corresponds to the vertices of the related undirected graph $G$, whose edges $E$ construct the interactions among adjacent sites. Then $(X, Y)$ is a CRF when, conditioned on the observation sequence $Y$, the random variables $X_v$ follow the Markov property with respect to the graph: $p(X_v \mid Y, X_w, w \neq v) = p(X_v \mid Y, X_w, w \sim v)$, in which $w \sim v$ implies that $w$ and $v$ are neighbours in $G = (V, E)$. These principles demonstrate that the CRF model is an undirected graph whose nodes are separated into two disjoint sets, $X$ and $Y$, and the conditional distribution modeled is $p(X \mid Y)$.

Based on the definition of the random fields in [48], the joint distribution over the label sequence $X$ given $Y$ takes the form of Eq. (1):

$$p_\theta(x \mid y) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, x|_e, y) + \sum_{v \in V,\, k} \mu_k g_k(v, x|_v, y) \Big), \qquad (1)$$

where $y$ is the observation sequence, $x$ is the corresponding label sequence, and $x|_S$ is the set of components of $x$ associated with the vertices of the sub-graph $S$. Furthermore, from [49-51], Eq. (1) can be rewritten as Eq. (2):

$$p(X \mid Y) = \frac{1}{Z} \prod_{C} \psi_C(X_C, Y), \qquad (2)$$

where $Z = \sum_{X} \prod_{C} \psi_C(X_C, Y)$ is the normalization factor and $\psi_C(X_C, Y)$ is the potential function over the clique $C$. A clique $C$ is a subset of the vertices of the undirected graph $G$, $C \subseteq V$, such that every two distinct vertices are adjacent.
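As a concrete illustration of Eq. (2), the following toy sketch (our own example, with arbitrary potential values, not part of the original model) enumerates all labelings of a three-site chain, computes the normalization factor $Z$, and reads off the MAP labeling $\tilde{X} = \arg\max_X p(X \mid Y)$:

```python
import numpy as np
from itertools import product

# Toy CRF over 3 binary sites on a chain: p(X|Y) = (1/Z) * prod_C psi_C(X_C, Y).
# Unary potentials act on single-site cliques, pairwise potentials on edges;
# the potential values below are arbitrary illustrative numbers.
unary = np.array([[0.9, 0.1],     # psi_0(x_0 = 0), psi_0(x_0 = 1)
                  [0.4, 0.6],     # site 1
                  [0.2, 0.8]])    # site 2
pairwise = np.array([[0.8, 0.2],  # psi_(i,j)(x_i, x_j): favours equal labels
                     [0.2, 0.8]])
edges = [(0, 1), (1, 2)]

def score(x):
    """Unnormalized product of clique potentials for one labeling x."""
    s = np.prod([unary[i, xi] for i, xi in enumerate(x)])
    return s * np.prod([pairwise[x[i], x[j]] for i, j in edges])

labelings = list(product([0, 1], repeat=3))
Z = sum(score(x) for x in labelings)            # normalization factor of Eq. (2)
posterior = {x: score(x) / Z for x in labelings}
x_map = max(posterior, key=posterior.get)       # MAP labeling argmax_X p(X|Y)
print(x_map, posterior[x_map])
```

Exact enumeration is only feasible for such toy graphs; for image-sized graphs the same quantities have to be approximated.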
3.1.2 HCRF Model

Different from most CRF models, which are built up with only unary and binary potentials [49, 50], two types of higher order potentials are introduced in our work. One is a patch-unary potential to characterize the information of tissues; the other is a patch-binary potential to depict the surrounding spatial relation among different tissue areas. Our HCRF is expressed by Eq. (3):

$$p(X \mid Y) = \frac{1}{Z} \prod_{i \in V} \varphi_i(x_i; Y; w_V) \prod_{(i,j) \in E} \psi_{(i,j)}(x_i, x_j; Y; w_E) \times \prod_{m \in V_P} \varphi_m(\mathbf{x}_m; Y; w_m; w_{V_P}) \prod_{(m,n) \in E_P} \psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y; w_{(m,n)}; w_{E_P}), \qquad (3)$$

where

$$Z = \sum_{X} \prod_{i \in V} \varphi_i(x_i; Y) \prod_{(i,j) \in E} \psi_{(i,j)}(x_i, x_j; Y) \times \prod_{m \in V_P} \varphi_m(\mathbf{x}_m; Y) \prod_{(m,n) \in E_P} \psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y) \qquad (4)$$

is the normalization factor; $V$ is the set of all nodes in the graph $G$, corresponding to the image pixels; $E$ is the set of all edges in the graph $G$; $V_P$ is the set of patches divided from an image; $E_P$ represents the neighbouring patches of a single patch. The usual clique potential function contains two terms: the pixel-unary potential function $\varphi_i(x_i; Y)$ is used to measure the probability that a pixel node $i$ is labeled as $x_i \in X$, which takes values from a given set of classes $L$, for a given observation vector $Y$ [43]; the pixel-binary potential function $\psi_{(i,j)}(x_i, x_j; Y)$ is used to describe the adjacent nodes $i$ and $j$ in the graph. The spatial context relationship between them is related not only to the label of node $i$ but also to the label of its neighbour node $j$. Furthermore, $\varphi_m(\mathbf{x}_m; Y)$ and $\psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y)$ are the newly introduced higher order potentials. The patch-unary potential function $\varphi_m(\mathbf{x}_m; Y)$ is used to measure the probability that a patch node $m$ is labeled as $\mathbf{x}_m$ for a given observation vector $Y$; the patch-binary potential function $\psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y)$ is used to describe the adjacent patch nodes $m$ and $n$. $w_V$, $w_E$, $w_{V_P}$ and $w_{E_P}$ are the weights of the four potentials $\varphi_i(x_i; Y)$, $\psi_{(i,j)}(x_i, x_j; Y)$, $\varphi_m(\mathbf{x}_m; Y)$ and $\psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y)$, respectively. $w_m$ and $w_{(m,n)}$ are the weights of $\varphi_m(\cdot\,; Y)$ and $\psi_{(m,n)}(\cdot, \cdot\,; Y)$, respectively. These weights are used to find the maximum posterior labeling $\tilde{X} = \arg\max_X p(X \mid Y)$ and to further improve the image segmentation performance.

The workflow of the proposed HCRF model can be summarized as follows: First, to obtain pixel-level segmentation information, the U-Net [52] is trained to build up the pixel-level potentials. Then, in order to obtain abundant spatial segmentation information at patch level, we fine-tune three pre-trained CNNs, namely the VGG-16 [53], Inception-V3 [54] and ResNet-50 [55] networks, to build up the patch-level potentials. Thirdly, based on the pixel- and patch-level potentials, our HCRF model is structured. In the AM module, half of the abnormal images and their Ground Truth (GT) images are applied to train the HCRF, and the attention extraction model is obtained.

3.1.3 Pixel-unary Potential

The pixel-unary potential $\varphi_i(x_i; Y; w_V)$ in Eq. (3) is related to the probability weight $w_V$ of a label $x_i$ taking a value $c \in L$ given the observation data $Y$ by Eq. (5):

$$\varphi_i(x_i; Y; w_V) \propto \big( p(x_i = c \mid f_i(Y)) \big)^{w_V}, \qquad (5)$$

where the image content is characterized by the site-wise feature vector $f_i(Y)$, which may be determined by all the observation data $Y$ [56]. The probability maps $p(x_i = c \mid f_i(Y))$ at the last convolution layer of the U-Net serve as the feature maps, and the 256 × 256 pixel feature maps $F_i$ are obtained as $f_i(Y)$. So, the pixel-unary potential is updated to Eq. (6):

$$\varphi_i(x_i; Y; w_V) = \varphi_i(x_i; F_i; w_V), \qquad (6)$$

where the data $Y$ determines $F_i$.

3.1.4 Pixel-binary Potential

The pixel-binary potential $\psi_{(i,j)}(x_i, x_j; Y; w_E)$ in Eq. (3) describes the similarity of the pairwise adjacent sites $i$ and $j$ taking labels $(x_i, x_j) = (c, c')$ given the data and weights, and it is defined as Eq. (7):

$$\psi_{(i,j)}(x_i, x_j; Y; w_E) \propto \big( p(x_i = c;\; x_j = c' \mid f_i(Y), f_j(Y)) \big)^{w_E}. \qquad (7)$$

The layout of the pixel-binary potential is shown in Fig. 3. In this "lattice" (or "reseau" or "array") layout, the probability of each classified pixel is calculated by averaging the unary probabilities of its neighbourhood pixels [57]. The other procedures are the same as the pixel-unary potential calculation in Sec. 3.1.3.

Fig. 3 48-neighbourhood 'lattice' layout of the pixel-binary potential in the AM module. The average of the unary probabilities of the 48 neighbourhood pixels is used as the probability of the target pixel (central pixel in orange).
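The two pixel-level potentials can be sketched in NumPy as follows, assuming that the 48-pixel neighbourhood of Fig. 3 is a 7 × 7 window minus the central pixel (49 − 1 = 48); the shapes and weights are illustrative, not the exact training configuration:

```python
import numpy as np

def pixel_unary(prob_map, w_V=1.0):
    """Pixel-unary potential of Eq. (5): the U-Net class-probability map
    raised to the weight w_V."""
    return prob_map ** w_V

def pixel_binary(prob_map, w_E=1.0, radius=3):
    """Pixel-binary potential: average the unary probabilities over the
    48-pixel neighbourhood (7x7 window minus the centre), then raise to w_E."""
    H, W, _ = prob_map.shape
    padded = np.pad(prob_map, ((radius, radius), (radius, radius), (0, 0)),
                    mode="edge")
    acc = np.zeros_like(prob_map)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            if dy == 0 and dx == 0:
                continue                     # exclude the central pixel itself
            acc += padded[radius + dy:radius + dy + H,
                          radius + dx:radius + dx + W]
    return (acc / 48.0) ** w_E

# prob_map: per-pixel softmax output of the U-Net, shape (256, 256, 2)
prob_map = np.random.rand(256, 256, 2)
prob_map /= prob_map.sum(axis=-1, keepdims=True)
phi = pixel_unary(prob_map)
psi = pixel_binary(prob_map)
```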
3.1.5 Patch-unary Potential

In order to extract abundant spatial information, the VGG-16, Inception-V3 and ResNet-50 networks are selected to extract patch-level features. In the patch-level terms, $\alpha, \beta, \gamma$ represent the VGG-16, Inception-V3 and ResNet-50 networks, respectively. In the patch-unary potential $\varphi_m(\mathbf{x}_m; Y; w_m; w_{V_P})$ of Eq. (3), the label $\mathbf{x}_m = \{x_{(m,\alpha)}, x_{(m,\beta)}, x_{(m,\gamma)}\}$ and $w_m = \{w_{(m,\alpha)}, w_{(m,\beta)}, w_{(m,\gamma)}\}$. $\varphi_m(\mathbf{x}_m; Y; w_m; w_{V_P})$ is related to the probability of the labels $(x_{(m,\alpha)}, x_{(m,\beta)}, x_{(m,\gamma)}) = (c, c, c)$ given the data $Y$ by Eq. (8):

$$\varphi_m(\mathbf{x}_m; Y; w_m; w_{V_P}) \propto \Big( \big(p(x_{(m,\alpha)} = c \mid f_{(m,\alpha)}(Y))\big)^{w_{(m,\alpha)}} \big(p(x_{(m,\beta)} = c \mid f_{(m,\beta)}(Y))\big)^{w_{(m,\beta)}} \big(p(x_{(m,\gamma)} = c \mid f_{(m,\gamma)}(Y))\big)^{w_{(m,\gamma)}} \Big)^{w_{V_P}}, \qquad (8)$$

where the characteristics of the image data are transformed by the site-wise feature vectors $f_{(m,\alpha)}(Y)$, $f_{(m,\beta)}(Y)$ and $f_{(m,\gamma)}(Y)$, which may be determined by all the input data $Y$. For $f_{(m,\alpha)}(Y)$, $f_{(m,\beta)}(Y)$ and $f_{(m,\gamma)}(Y)$, we use the 1024-dimensional patch-level bottleneck features $F_{(m,\alpha)}$, $F_{(m,\beta)}$ and $F_{(m,\gamma)}$, obtained from the VGG-16, Inception-V3 and ResNet-50 networks pre-trained on ImageNet; we retrain their last three fully connected layers [58] using gastric histopathology images to calculate the classification probability of each class. Therefore, the patch-unary potential is updated to Eq. (9):

$$\varphi_m(\mathbf{x}_m; Y; w_m; w_{V_P}) = \varphi_m(\mathbf{x}_m; F_{(m,\alpha)}, F_{(m,\beta)}, F_{(m,\gamma)}; w_m; w_{V_P}), \qquad (9)$$

where the data $Y$ determines $F_{(m,\alpha)}$, $F_{(m,\beta)}$ and $F_{(m,\gamma)}$.

3.1.6 Patch-binary Potential

The patch-binary potential $\psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y; w_{(m,n)}; w_{E_P})$ of Eq. (3) demonstrates how likely the pairwise adjacent patch sites $m$ and $n$ are to take labels $(\mathbf{x}_m, \mathbf{x}_n) = (c, c')$ given the data and weights, and it is defined as Eq. (10):

$$\psi_{(m,n)}(\mathbf{x}_m, \mathbf{x}_n; Y; w_{(m,n)}; w_{E_P}) \propto \Big( \big(p(x_{(m,\alpha)} = c;\; x_{(n,\alpha)} = c' \mid f_{(m,\alpha)}(Y), f_{(n,\alpha)}(Y))\big)^{w_{(m,n,\alpha)}} \big(p(x_{(m,\beta)} = c;\; x_{(n,\beta)} = c' \mid f_{(m,\beta)}(Y), f_{(n,\beta)}(Y))\big)^{w_{(m,n,\beta)}} \big(p(x_{(m,\gamma)} = c;\; x_{(n,\gamma)} = c' \mid f_{(m,\gamma)}(Y), f_{(n,\gamma)}(Y))\big)^{w_{(m,n,\gamma)}} \Big)^{w_{E_P}}, \qquad (10)$$

where $\mathbf{x}_n = \{x_{(n,\alpha)}, x_{(n,\beta)}, x_{(n,\gamma)}\}$ denotes the patch labels and $w_{(m,n)} = \{w_{(m,n,\alpha)}, w_{(m,n,\beta)}, w_{(m,n,\gamma)}\}$ represents the patch weights. An eight-neighbourhood "lattice" (or "reseau" or "array") layout in Fig. 4 is designed to calculate the probability of each classified patch by averaging the unary probabilities of its neighbouring patches [57]. The other operations are identical to the patch-unary potential calculation in Sec. 3.1.5.

The core process of the HCRF can be found in Algorithm 1.

Fig. 4
Eight-neighbourhood 'lattice' layout of the patch-binary potential in the AM module. The average of the unary probabilities of the eight neighbourhood patches is used as the probability of the target patch (central patch in orange).
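The patch-level counterparts of Eqs. (8) and (10) follow the same pattern; in the sketch below, the three arrays stand for the class probabilities predicted by the fine-tuned VGG-16, Inception-V3 and ResNet-50 networks over the patch grid, and all weights are placeholders:

```python
import numpy as np

def patch_unary(p_vgg, p_inc, p_res, w=(1.0, 1.0, 1.0), w_VP=1.0):
    """Patch-unary potential of Eq. (8): a weighted product of the class
    probabilities of the three CNNs; each p_* has shape (rows, cols, classes)."""
    return (p_vgg ** w[0] * p_inc ** w[1] * p_res ** w[2]) ** w_VP

def patch_binary(p_patch, w_EP=1.0):
    """Patch-binary potential: average the unary probabilities of the eight
    neighbouring patches (3x3 window minus the centre), then raise to w_EP."""
    R, C, _ = p_patch.shape
    padded = np.pad(p_patch, ((1, 1), (1, 1), (0, 0)), mode="edge")
    acc = np.zeros_like(p_patch)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue                  # exclude the central patch itself
            acc += padded[1 + dy:1 + dy + R, 1 + dx:1 + dx + C]
    return (acc / 8.0) ** w_EP
```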
Algorithm 1 HCRF
Input: The original image, I; the real label image, L;
Output: The image segmentation result, I_seg;
1: Put the original image I into the U-Net and get p(x_i = c | f_i(Y));
2: for pixel i in the original image I do
3:   Get φ_i(x_i; Y; w_V) defined as Eq. (5);
4:   for pixel j in the neighbour nodes of pixel i do
5:     Get ψ_(i,j)(x_i, x_j; Y; w_E) defined as Eq. (7);
6:   end for
7: end for
8: Each pixel is taken as the center to get its corresponding patch;
9: Put the original image I into the three networks and get p(x_(m,α) = c | f_(m,α)(Y)), p(x_(m,β) = c | f_(m,β)(Y)) and p(x_(m,γ) = c | f_(m,γ)(Y));
10: for patch m in the original image I do
11:   Get φ_m(x_m; Y; w_m; w_VP) defined as Eq. (8);
12:   for patch n in the neighbour nodes of patch m do
13:     Get ψ_(m,n)(x_m, x_n; Y; w_(m,n); w_EP) defined as Eq. (10);
14:   end for
15: end for
16: for pixel i in the original image I do
17:   Get the corresponding patch m of pixel i;
18:   Get the normalization factor Z defined as Eq. (4);
19:   Get p(X | Y) defined as Eq. (3);
20:   Get the pixel-level classification result;
21: end for
22: Get the image I_seg for the segmentation result;
23: return I_seg;
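A compact sketch of the labeling step in lines 16-21 of Algorithm 1, assuming the patch potentials have been flattened into arrays indexed by a patch id; since $Z$ in Eq. (4) is constant for a given image, it can be dropped when taking the argmax:

```python
import numpy as np

def hcrf_label(phi_pix, psi_pix, phi_patch, psi_patch, patch_of_pixel):
    """Per-pixel HCRF decision of Eq. (3): each pixel multiplies its own
    pixel-level potentials with the potentials of the patch containing it,
    then takes the most probable class (Z is omitted: it does not change
    the argmax). phi_pix/psi_pix: (H, W, K); phi_patch/psi_patch:
    (n_patches, K); patch_of_pixel: (H, W) integer patch ids."""
    post = phi_pix * psi_pix
    post = post * phi_patch[patch_of_pixel] * psi_patch[patch_of_pixel]
    return post.argmax(axis=-1)       # pixel-level segmentation labels
```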
3.2 Image Classification Module

Firstly, the abnormal images of the IC module in the training and validation sets are sent to the trained HCRF model. The output map of this step can be used to locate the diagnostically relevant regions and guide the attention of the network for the classification of microscopic images. The next step is to threshold and mesh the output probability map. If the attention area occupies more than 50% of the area of a 256 × 256 patch, this patch is chosen as the final attention patch (this parameter is obtained by traversing the proportion from 10% to 90% with a grid optimization method). The proposed HCRF-AM method emphasizes and gives prominence to the features which own higher discriminatory power.
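A minimal sketch of this selection rule; `select_attention_patches` is a hypothetical helper name, and the input is assumed to be a binary HCRF output mask in which 1 marks attention pixels:

```python
import numpy as np

def select_attention_patches(seg_mask, patch=256, ratio=0.5):
    """Mesh the HCRF output mask into patch x patch tiles and keep the
    top-left coordinates of tiles whose attention share exceeds `ratio`."""
    H, W = seg_mask.shape
    keep = []
    for y in range(0, H - patch + 1, patch):
        for x in range(0, W - patch + 1, patch):
            if seg_mask[y:y + patch, x:x + patch].mean() > ratio:
                keep.append((y, x))   # over 50% of the tile is attention area
    return keep

mask = np.zeros((2048, 2048), dtype=np.uint8)
mask[:700, :700] = 1                  # a toy attention region
print(select_attention_patches(mask))
```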
Chemicals that are valuable for the diagnosis of gastric cancer, such as miRNA-215 [59], are also often expressed at higher levels in paracancerous tissue than in normal tissue [60], indicating the significance of adjacent tissues for the diagnosis of gastric cancer. This suggests that conserving only the specific tumor areas for the networks is not sufficient. Hence, all the images in the IC module dataset as well as the attention patches are used as input. The patches that are most likely to contain tumor areas are given more weight. Meanwhile, the neighboring patches of the attention patches are not abandoned.

Transfer Learning (TL) is a method that uses CNNs pretrained on a large annotated image database (such as ImageNet) to complete various tasks. TL focuses on acquiring knowledge from one problem and applying it to different but related problems. It essentially uses additional data so that CNNs can decode by using the features of past training experience, after which the CNNs have better generalization ability and higher efficiency [61]. In this paper, we have compared the VGG series, Inception series, ResNet series, and DenseNet series as our classifier. The final selection is based on the comprehensive classification performance and the number of parameters. We finally apply the VGG-16 network for the TL classification process, where the parameters are pre-trained on the ImageNet dataset [62]. The size of the input images is 256 × 256 pixels.

After the trained network outputs the patch-level probabilities, the Classification Probability-based Ensemble Learning (CPEL) algorithm combines them into an image-level prediction, as in Eq. (11):

$$p(c_j \mid Y_{im}) = \prod_{i=1}^{T} p(c_j \mid Y_{pa(i)}) \propto \sum_{i=1}^{T} \ln\big( p(c_j \mid Y_{pa(i)}) \big). \qquad (11)$$

Here, $c_j$ denotes the image label ($c_1$ represents normal images and $c_2$ represents abnormal images). $Y_{im}$ is the input image with a size of 2048 × 2048 pixels and $Y_{pa}$ is the input patch with a size of 256 × 256 pixels. $T$ means the number of patches contained in an input image. $p(c_j \mid Y_{im})$ represents the probability of an image being labeled as normal or abnormal; similarly, $p(c_j \mid Y_{pa})$ represents that of a patch. Additionally, in order to guarantee the accuracy of the patch probability computation, the log operation is carried out on the probabilities ($\ln(\cdot)$ means the natural logarithm of a number). The final prediction is determined by the category which owns the larger probability.
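Eq. (11) translates into a few lines of code; `cpel` is an illustrative helper name, and a small epsilon guards the logarithm against patches predicted with zero probability (an assumption on our part, not stated in the original):

```python
import numpy as np

def cpel(patch_probs, eps=1e-12):
    """Classification Probability-based Ensemble Learning, Eq. (11):
    sum the log patch probabilities per class and return the class with
    the larger score. patch_probs: shape (T, 2), rows p(c_j | Y_pa(i))."""
    log_scores = np.log(patch_probs + eps).sum(axis=0)
    return int(log_scores.argmax()), log_scores

# e.g. T = 64 patches of a 2048 x 2048 image, classes [normal, abnormal]
probs = np.random.dirichlet((1, 1), size=64)
label, scores = cpel(probs)
print(label, scores)
```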
The whole process of our HCRF-AM framework is shown in Algorithm 2.

Algorithm 2 HCRF-AM framework
Input: The image set for the training and validation sets of the CNN with binary labels, I; the real label image set for the abnormal images in I, L; the image set for the test set of the CNN, I_test;
Output: The probability of an image being labeled as normal or abnormal, p(c_j | Y_im), for each image in I_test;
1: Divide I into the abnormal image set I_ab and the normal image set I_nor according to the binary labels;
2: for image I in I do
3:   Divide I into patches and put them into the CNN;
4:   if I ∈ I_ab then
5:     Get the real label image L of I from L;
6:     Put I and L into the AM module and get the segmentation result I_seg;
7:     Divide I_seg into patches and get the patch set P_seg;
8:     for patch P_seg in P_seg do
9:       if over 50% of the pixels in P_seg are segmented as abnormal regions then
10:        P_seg is chosen as an attention region;
11:        Put P_seg into the CNN;
12:      end if
13:    end for
14:  end if
15: end for
16: Get the CNN model;
17: for image I_test in I_test do
18:   Put I_test into the CNN model and get the patch-level classification results p(c_j | Y_pa(i));
19:   Get the image-level classification result p(c_j | Y_im) defined as Eq. (11);
20: end for
21: return p(c_j | Y_im) of I_test;

4 Experiments

4.1 Experimental Settings

In this study, we use a publicly available Haematoxylin and Eosin (H&E) stained gastric histopathology image dataset to test the effectiveness of our HCRF-AM model [64], and some examples in the dataset are represented in Fig. 5.

The images in our dataset are processed with H&E stains, which is essential for identifying the various tissue types in histopathological images. In a typical tissue, nuclei are stained blue by haematoxylin, whereas the cytoplasm and extracellular matrix have varying degrees of pink staining due to eosin [65]. The images are magnified 20 times and most of the abnormal regions are marked by practical histopathologists. The image format is '*.tiff' or '*.png' and the image size is 2048 × 2048 pixels. In the normal images, the cells are arranged regularly, the nucleo-cytoplasmic ratio is low, and a stable structure can be seen. By contrast, in the abnormal images, cancerous gastric tissue usually presents nuclear enlargement. Hyperchromasia without visible cell borders and prominent perinuclear vacuolization is also a typical feature [66], [67]. In the GT images, the cancer regions are labeled in the sections.

Fig. 5
Examples in the H&E stained gastric histopathological image dataset. The column a. presents the original images of normal tissues. The original images in column b. contain abnormal regions, and column c. shows the corresponding GT images of column b. In the GT images, the brighter regions are abnormal tissues with cancer cells, and the darker regions are normal tissues without cancer cells.
The proposed HCRF-AM model consists of the AM module and the IC module, so we distribute the images in the dataset according to their needs. The allocation is represented in Table 1.
Table 1 The image allocation for the AM module and the IC module.

Image type                  AM module   IC module
Original normal images      0           140
Original abnormal images    280         280
In the AM module, 280 abnormal images and the corresponding GT images are used to train the HCRF model to acquire attention areas, and they are divided into training and validation sets with a ratio of 1:1 (the detailed information is in Sec. 3.1). The AM module data setting is represented in Table 2. Before being sent into the model, we augment the training and validation datasets six times. Furthermore, because the cellular visual features in a histopathological image are always observed on patch scales by the pathologists, we crop the original and the GT images into 256 × 256 pixel patches. Finally, we obtain 53,760 training and 53,760 validation images.
Table 2 The AM module data setting.

Image type                  Train   Validation   Sum
Original abnormal images    140     140          280
Augmented abnormal images   53760   53760        107520
In the IC module, 280 abnormal images remain and 140 normal images are applied in the CNN classification part (the detailed information is in Sec. 3.2). The IC module data setting is represented in Table 3. Among them, 70 images from each class are randomly selected for the training and validation sets, and the test set contains 70 normal images and 210 abnormal images. Similarly, we mesh these images into 256 × 256 pixel patches. So, the initial dataset of the IC module comprises 2240 training and 2240 validation images from each category.
Table 3 The IC module data setting.

Image type                 Train   Validation   Test   Sum
Original normal images     35      35           70     140
Original abnormal images   35      35           210    280
Cropped normal images      2240    2240         –      –
Cropped abnormal images    2240    2240         –      –
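The meshing of each 2048 × 2048 image into non-overlapping 256 × 256 patches used by both modules can be sketched as follows (an illustrative helper, not the authors' released code); it yields 8 × 8 = 64 patches per image, which matches the patch counts in Tables 2 and 3:

```python
import numpy as np

def mesh_into_patches(image, size=256):
    """Crop an image into non-overlapping size x size patches."""
    H, W = image.shape[:2]
    return [image[y:y + size, x:x + size]
            for y in range(0, H - size + 1, size)
            for x in range(0, W - size + 1, size)]

patches = mesh_into_patches(np.zeros((2048, 2048, 3), dtype=np.uint8))
print(len(patches))  # 64
```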
To evaluate our model, the accuracy, sensitivity, specificity, precision and F1-score metrics are used to measure the classification results. These five indicators are defined in Table 4.
Table 4 The five evaluation criteria and corresponding definitions.

Criterion     Definition
Accuracy      (TP + TN) / (TP + FN + TN + FP)
Sensitivity   TP / (TP + FN)
Specificity   TN / (TN + FP)
Precision     TP / (TP + FP)
F1-score      2 · Precision · Sensitivity / (Precision + Sensitivity)
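For reference, the five criteria of Table 4 translate directly into code; `metrics` is an illustrative helper over raw confusion-matrix counts:

```python
def metrics(TP, TN, FP, FN):
    """The five evaluation criteria of Table 4 (normal = positive class)."""
    accuracy    = (TP + TN) / (TP + FN + TN + FP)
    sensitivity = TP / (TP + FN)
    specificity = TN / (TN + FP)
    precision   = TP / (TP + FP)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, precision, f1

# e.g. the HCRF-AM test-set confusion matrix reported in Fig. 8
print(metrics(TP=53, TN=203, FP=7, FN=17))  # ~(0.914, 0.757, 0.967, 0.883, 0.815)
```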
In this paper, the samples labeled as normal are positive samples, and the samples labeled as abnormal are negative samples. In the definition of these indicators, TP denotes the true positives, which represent positive cases diagnosed as positive. TN denotes the true negatives, which indicate negative cases diagnosed as negative. FP denotes the false positives, which are negative cases diagnosed as positive, and FN denotes the false negatives, which are positive cases diagnosed as negative. The accuracy is the ratio of the number of samples correctly classified by the classifier to the total number of samples. The sensitivity reflects the proportion of correctly judged positive cases among the total positive samples, and the specificity reflects the proportion of correctly judged negative cases among the total negative samples. The precision reflects the proportion of actual positive samples among the samples determined by the classifier to be positive. The F1-score is an indicator that comprehensively considers the precision and sensitivity.

4.2 Baseline Classifier Selection

For the baseline, we compare the performance between different CNN-based classifiers and evaluate the effect of the Transfer Learning (TL) method on the initial dataset. We use the cropped images in Table 3 as the training and validation sets to build the networks, and the classification accuracy is obtained on the test set. The result is shown in Fig. 6.
Fig. 6
Comparison between the image classification performance of different CNN-based classifiers on the test set.
From Fig. 6, it is observed that the VGG-16 TL method performs the best and achieves an accuracy of 0.875, followed by the Resnet-50 and VGG-19 [53] networks. It can also be seen from Fig. 6 that the method of training models from scratch (De-novo trained CNNs) performs significantly worse than each TL algorithm in terms of classification accuracy. Therefore, the VGG-16 TL method is finally selected as the classifier in the baseline.

4.3 Evaluation of Attention Area Extraction Methods

The proposed HCRF model is compared with the state-of-the-art methods listed in Table 5 (DenseCRF, U-Net, SegNet and k-means) and four classical methods (Level-Set [70], Otsu thresholding [71], Watershed [72], and MRF [73]) when segmenting interesting regions and objects. A comparative analysis with existing work on our dataset is presented in Fig. 7. The state-of-the-art methods are all trained on the dataset in Table 2.
Fig. 7
Comparison between HCRF and other attention area extraction methods on the test set ((a), (b): two typical examples of attention area extraction results using different methods).
It can be seen that our HCRF method has better attention area extraction performance than the other existing methods in the visual comparison, where more cancer regions are correctly marked and less noise remains. The detailed information of the evaluation indexes is shown in Table 5.
Table 5 A numerical comparison of the image segmentation performance between our HCRF model and other existing methods. The first row shows the different methods. The first column shows the evaluation criteria. Dice is in the interval [0, 1], and a perfect segmentation yields a Dice of 1. RVD is an asymmetric metric, and a lower RVD means a better segmentation result. IoU is a standard metric for segmentation purposes that computes the ratio between the intersection and the union of two sets, and a higher IoU means a better segmentation result. The bold texts are the best performance for each criterion.

Criterion   Our HCRF   DenseCRF   U-Net   SegNet   Level-Set   Otsu thresholding   Watershed   k-means   MRF
Dice

The classical methods have similar results, where the entire extracted region is scattered and the abnormal areas cannot be separated. Except for recall and specificity, the proposed HCRF performs better on the other indexes compared to the state-of-the-art methods. The precision is more effective in evaluating the foreground segmentation result and the recall is more effective in evaluating the background segmentation result. Consequently, the HCRF model is suitable for extracting the attention regions and it is chosen in our following experimental steps.

In addition, based on the third-party experiments in [74], the excellent performance of our HCRF model is also verified. In their experiments, the HCRF and other state-of-the-art methods (BFC [75], SAM [76], FRFCM [77], MDRAN [78], LVMAC [79], PABVS [80], FCMRG [81]) are used for nuclei segmentation, and our HCRF model performs well, second only to the method proposed for their task in this experiment.
Table 6 The parameter settings for the TL networks.

Hyper-parameter      VGG-16
Initial input size   256 × 256

4.4 Evaluation of HCRF-AM Model

Based on the experiment results in Sec. 4.2, we choose VGG-16 as our classifier in the IC module. First, the training and validation sets in Table 3 as well as their attention areas are used to train the VGG-16 network with a Transfer Learning (TL) strategy. The validation set is applied to tune the CNN parameters and to avoid the overfitting or underfitting of the CNN during the training process. Second, the 2048 × 2048 pixel test images are meshed into 256 × 256 pixel images and sent into the trained network to obtain the patch prediction probabilities. Thirdly, the CPEL method is applied in order to acquire the final label of each 2048 × 2048 pixel image. The classification results are shown in Fig. 8 and Fig. 9. The evaluation indexes of our HCRF-AM model are about 1% to 15% higher than the baseline model. The results denote that although the test set has 280 images, which is four times the number of the training and validation sets (the figure for the abnormal images is seven times), our proposed HCRF-AM model still provides good classification performance (especially the classification accuracy of abnormal images), showing the high stability and strong robustness of our method.
Baseline on validation set (Predicted \ Actual):
             Normal   Abnormal   sum_row
Normal       24       3          27
Abnormal     11       32         43
sum_col      35       35         70

HCRF-AM model on validation set (Predicted \ Actual):
             Normal   Abnormal   sum_row
Normal       25       3          28
Abnormal     10       32         42
sum_col      35       35         70

Baseline on test set (Predicted \ Actual):
             Normal   Abnormal   sum_row
Normal       53       19         72
Abnormal     17       191        208
sum_col      70       210        280

HCRF-AM model on test set (Predicted \ Actual):
             Normal   Abnormal   sum_row
Normal       53       7          60
Abnormal     17       203        220
sum_col      70       210        280
Fig. 8
Image classification results on the validation sets and test sets. The confusion matrices present the classification results of the baseline method and our HCRF-AM method, respectively.
Fig. 9
Comparison between the image classification accuracy of the proposed HCRF-AM model and the baseline on the test set.

Moreover, it has been verified in Sec. 4.3 that the HCRF model achieves better attention region extraction performance using the GT images as the standard. A numerical comparison between the final classification results of our HCRF method and other existing methods as the attention extraction method on the test set is given in Table 7. It is indicated that the HCRF model performs better on all indexes considering the final classification performance.
Table 7 Numerical comparison of the classification results between different attention extraction methods.

Criterion     HCRF    Level-Set   DenseCRF   U-Net   Watershed   MRF   Otsu   SegNet
Accuracy      0.914
Sensitivity   0.757
Specificity   0.967
Precision     0.883
F1-score      0.815
4.5 Comparison with Existing Attention Mechanism Methods

4.5.1 Experimental Settings

In order to show the potential of the proposed HCRF-AM method for the GHIC task, it is compared with four existing AM methods, including the Squeeze-and-Excitation Network (SENet) [83], the Convolutional Block Attention Module (CBAM) [84], Non-local neural networks (Non-local) [85] and the Global Context Network (GCNet) [86]. VGG-16 has a great number of parameters and it is hard to make it converge, especially when it is integrated with other blocks [53] [87]. Based on the experiments conducted, we also find that it is tricky to facilitate the training of VGG-16 from scratch. Meanwhile, the AMs nowadays have been extensively applied to Resnet and it is popular with researchers [88] [89]. Therefore, we combine these existing attention methods with Resnet in our contrast experiments in most cases. The experimental settings of these existing methods are briefly introduced as follows: (1) SE blocks are integrated into a simple CNN. (2) CBAM is incorporated into Resnet v2 with 11 layers. (3) Non-local is applied to all residual blocks in Resnet with 34 layers. (4) GC blocks are integrated into Resnet v1 with 14 layers. They are all trained on the dataset in Table 3 and the input data size is 256 × 256 pixels.
4.5.2 Experimental Results

According to the experimental design in Sec. 4.5.1, we obtain the experimental results in Table 8.
Table 8 A comparison of the image classification results of our HCRF-AM model and other existing methods on the test set.

Ref.   Method        Accuracy   Sensitivity   Specificity
[83]   SENet+CNN     0.754      0.429         0.862
[84]   CBAM+Resnet   0.393
–      HCRF-AM       0.914
Table 8 indicates that: (1) Compared to the four state-of-the-art methods, except for sensitivity, the proposed HCRF-AM performs better on the other indexes. The overall accuracy of most methods is around 70%, apparently lower than that of ours. (2) The sensitivity of HCRF-AM is the second best, only after CBAM-Resnet, and the other two indicators of CBAM-Resnet are far lower than ours. In practical diagnosis, the specificity, which reflects the abnormal cases of correct judgement, is of particular importance. (3) The sensitivity and specificity of the SE blocks and GC blocks vary widely, with differences of around 30%. This suggests that their prediction strategy is out of balance (see further discussion in Sec. 5.3).

4.6 Computational Time

In our experiment, we use a workstation with an Intel® Core™ i7-8700k CPU at 3.20 GHz, 32 GB RAM and a GeForce RTX 2080 with 8 GB of memory. The training time of our model includes two modules, the AM module and the IC module, taking about 50 h for training 280 images (2048 × 2048 pixels).

5 Discussion
Fig. 10
Typical examples of some images in our dataset for analysis. (a) presents the original images. (b) denotes the GT images. The regions in the red curves in (b) are the abnormal regions in the GT images redrawn by our cooperative histopathologists. The red regions of (c) show the attention extraction results of the AM module.
Fig. 11
Examples of mis-classification. The row (a) presents the normal cases diagnosed as abnormal (FN). The row (b) presents the abnormal cases diagnosed as normal (FP).
For the FN samples in Fig. 11(a), some larger bleeding spots can be found in some normal samples, leading to misdiagnosis. Some images have many bright areas in the field of view, which may be caused by being at the edge of the whole slice, and these bright areas cannot provide information effectively. For the FP samples in Fig. 11(b), the cancer areas in some images of abnormal samples are small and scattered, making them insufficiently noticed in classification. Simultaneously, in some samples, the staining of the two stains is not uniform and sufficient. In some images, the diseased areas appear atypical, which increases the difficulty of classification.

5.3 Analysis of the Existing Attention Mechanisms

Recently, Attention Mechanisms (AMs) have drawn great attention from scholars and they have been extensively applied to solve practical problems in various fields. For example, the non-local network is proposed to model long-range dependencies using one layer, via a self-AM [85]. However, with the increasing area of the receptive field, the computation costs become more extensive at the same time. These AMs, which have a large memory requirement, are not suitable for GHIC tasks because the size of the histopathology images is always 2048 × 2048 pixels.

6 Conclusion and Future Work

In this paper, we develop a novel approach for GHIC using an HCRF based AM. Through experiments, we choose high-performance methods and networks in the AM and IC modules of the HCRF-AM model. In the evaluation process, the proposed HCRF method outperforms the state-of-the-art attention area extraction methods, showing the robustness and potential of our method. Finally, our method achieves a classification accuracy of 91.4% and a specificity of 96.7% on the testing images. We have compared our proposed method with some existing popular AM methods that use the same dataset to further verify the performance. Considering the advantages mentioned above, the HCRF-AM model holds the potential to be employed in a human-machine collaboration pattern for the early diagnosis of gastric cancer, which may help increase the productivity of pathologists. In the discussion part, the possible causes of misclassification in the experiment are analyzed, which provides a reference for improving the performance of the model.

Though our method provides satisfactory performance, there are a few limitations. First, our proposed HCRF model in the AM module only considers information on a single scale, which degrades the model performance. Moreover, our model can be further improved by the technique shown in [40], where the pathologists incorporate large-scale tissue architecture and context across spatial scales in order to improve single-cell classification. Second, we have investigated four kinds of DL models, using TL methods and integrating the AM into them. In the future, we can investigate other DL models and compare their results for higher classification accuracy. Finally, our AM is a weakly supervised system at present. Hence, the unsupervised learning method [91], which applies a pure transformer directly to sequences of image patches and performs well on natural image classification tasks, may be of certain reference significance to ours.
Acknowledgements
This study was supported by the National Natural Science Foundation of China (grant No. 61806047). We thank Miss Xinran Wu, as her contribution is considered as important as the first author's in this paper. We also thank Miss Zixian Li and Mr. Guoxian Li for their important discussion.
References
1. C. Wild, B. Stewart, and C. Wild.
World Cancer Report 2014 . World Health Organization,Geneva, Switzerland, 2014.2. F. Bray, J. Ferlay, I. Soerjomataram, R. Siegel, L. Torre, and A. Jemal. Global CancerStatistics 2018: GLOBOCAN Estimates of Incidence and Mortality Worldwide for 36Cancers in 185 Countries.
CA: A Cancer Journal for Clinicians , 68(6):394–424, 2018.3. M. Orditura, G. Galizia, V. Sforza, V. Gambardella, A. Fabozzi, M. Laterza, F. Andreozzi,J. Ventriglia, B. Savastano, A. Mabilia, et al. Treatment of Gastric Cancer.
World Journalof Gastroenterology: WJG , 20(7):1635, 2014.4. E. Van Cutsem, X. Sagaert, B. Topal, K. Haustermans, and H. Prenen. Gastric Cancer.
The Lancet , 388(10060):2654–2664, 2016.5. T. Elsheikh, M. Austin, D. Chhieng, F. Miller, A. Moriarty, and A. Renshaw. AmericanSociety of Cytopathology Workload Recommendations for Automated Pap Test Screening:Developed by the Productivity and Quality Assurance in the Era of Automated ScreeningTask Force.
Diagnostic Cytopathology , 41(2):174–178, 2013.6. H. Wang, H. Jia, L. Lu, and Y. Xia. Thorax-Net: An Attention Regularized Deep NeuralNetwork for Classification of Thoracic Diseases on Chest Radiography.
IEEE Journal ofBiomedical and Health Informatics , 24(2):475–485, 2019.7. L. Li, M. Xu, X. Wang, L. Jiang, and H. Liu. Attention Based Glaucoma Detection: Alarge-scale Database and CNN Model. In
Proc. of CVPR 2019 , pages 10571–10580, 2019.8. C. Sun, C. Li, J. Zhang, M. Rahaman, S. Ai, H. Chen, F. Kulwa, Y. Li, X. Li, andT. Jiang. Gastric Histopathology Image Segmentation Using a Hierarchical ConditionalRandom Field.
Biocybernetics and Biomedical Engineering , 40(4):1535–1555, 2020.9. C. Sun, C. Li, J. Zhang, F. Kulwa, and X. Li. Hierarchical Conditional Random FieldModel for Multi-object Segmentation in Gastric Histopathology Images.
Electronics Let-ters , 56(15):750–753, 2020.10. R. Zhu, R. Zhang, and D. Xue. Lesion Detection of Endoscopy Images Based on Convolu-tional Neural Network Features. In , pages 372–376, 2015.11. K. Ishihara, T. Ogawa, and M. Haseyama. Detection of Gastric Cancer Risk from X-ray Images via Patch-based Convolutional Neural Network. In , pages 2055–2059, 2017.12. R. Li, J. Li, X. Wang, P. Liang, and J. Gao. Detection of Gastric Cancer and its HistologicalType based on Iodine Concentration in Spectral CT.
Cancer Imaging , 18(1):1–10, 2018.13. J. Li, W. Li, A. Sisk, H. Ye, W. Wallace, W. Speier, and C. Arnold. A Multi-resolutionModel for Histopathology Image Classification and Localization with Multiple InstanceLearning. arXiv Preprint arXiv:2011.02679 , 2020.itle Suppressed Due to Excessive Length 2514. S. Korkmaz, A. Ak¸ci¸cek, H. B´ınol, and M. Korkmaz. Recognition of the Stomach CancerImages with Probabilistic HOG Feature Vector Histograms by Using HOG Features. In ,pages 000339–000342, 2017.15. S. Korkmaz and H. Binol. Classification of Molecular Structure Images by Using ANN, RF,LBP, HOG, and Size Reduction Methods for Early Stomach Cancer Detection.
Journalof Molecular Structure , 1156:255–263, 2018.16. H. Sharma, N. Zerbe, I. Klempert, S. Lohmann, B. Lindequist, O. Hellwich, and P. Huf-nagl. Appearance-based Necrosis Detection Using Textural Features and SVM with Dis-criminative Thresholding in Histopathological Whole Slide Images. In , pages 1–6, 2015.17. B. Liu, M. Zhang, T. Guo, and Y. Cheng. Classification of Gastric Slices Based on DeepLearning and Sparse Representation. In , pages 1825–1829, 2018.18. H. Sharma, N. Zerbe, C. B¨oger, S. Wienert, O. Hellwich, and P. Hufnagl. A ComparativeStudy of Cell Nuclei Attributed Relational Graphs for Knowledge Description and Cate-gorization in Histopathological Gastric Cancer Whole Slide Images. In , pages 61–66,2017.19. H. Sharma, N. Zerbe, I. Klempert, O. Hellwich, and P. Hufnagl. Deep ConvolutionalNeural Networks for Automatic Classification of Gastric Carcinoma Using Whole SlideImages in Digital Histopathology.
Computerized Medical Imaging and Graphics , 61:2–13,2017.20. B. Liu, K. Yao, M. Huang, J. Zhang, Y. Li, and R. Li. Gastric Pathology Image RecognitionBased on Deep Residual Networks. In , volume 2, pages 408–412, 2018.21. S. Wang, Y. Zhu, L. Yu, H. Chen, H. Lin, X. Wan, X. Fan, and P. Heng. RMDL: Recali-brated Multi-instance Deep Learning for Whole Slide Gastric Image Classification.
MedicalImage Analysis , 58:101549, 2019.22. Z. Song, S. Zou, W. Zhou, Y. Huang, L. Shao, J. Yuan, X. Gou, W. Jin, Z. Wang, X. Chen,et al. Clinically Applicable Histopathological Diagnosis System for Gastric Cancer Detec-tion Using Deep Learning.
Nature Communications , 11(1):1–9, 2020.23. S. Kosaraju, J. Hao, H. Koh, and M. Kang. Deep-Hipo: Multi-scale Receptive Field DeepLearning for Histopathological Image Analysis.
Methods , 179:3–13, 2020.24. O. Iizuka, F. Kanavati, K. Kato, M. Rambeau, K. Arihiro, and M. Tsuneki. Deep LearningModels for Histopathological Classification of Gastric and Colonic Epithelial Tumours.
Scientific Reports , 10(1):1–11, 2020.25. J. Ba, V. Mnih, and K. Kavukcuoglu. Multiple Object Recognition with Visual Attention. arXiv preprint arXiv:1412.7755 , 2014.26. W. Li, K. Liu, L. Zhang, and F. Cheng. Object Detection Based on an Adaptive AttentionMechanism.
Scientific Reports , 10(1):1–13, 2020.27. K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, R. Zemel, and Y. Bengio.Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. In
International conference on machine learning , pages 2048–2057, 2015.28. M. Liu, L. Li, H. Hu, W. Guan, and J. Tian. Image Caption Generation with DualAttention Mechanism.
Information Processing & Management , 57(2):102178, 2020.29. S. Sharma, R. Kiros, and R. Salakhutdinov. Action recognition using visual attention. arXiv preprint arXiv:1511.04119 , 2015.30. A. BenTaieb and G. Hamarneh. Predicting Cancer with a Recurrent Visual AttentionModel for Histopathology Images. In
International Conference on Medical Image Com-puting and Computer-Assisted Intervention , pages 129–137, 2018.31. L. Li, M. Xu, H. Liu, Y. Li, X. Wang, L. Jiang, Z. Wang, X. Fan, and N. Wang. ALarge-Scale Database and a CNN Model for Attention-Based Glaucoma Detection.
IEEETransactions on Medical Imaging , 39(2):413–424, 2019.32. H. Yang, J. Kim, H. Kim, and S. Adhikari. Guided Soft Attention Network for Classifi-cation of Breast Cancer Histopathology Images.
IEEE Transactions on Medical Imaging ,39(5):1306–1315, 2019.33. H. Sun, X. Zeng, T. Xu, G. Peng, and Y. Ma. Computer-aided Diagnosis in Histopatho-logical Images of the Endometrium Using a Convolutional Neural Network and AttentionMechanisms.
IEEE Journal of Biomedical and Health Informatics , 24(6):1664–1676, 2019.6 Yixin Li et al.34. X. Zhang, Y. Jiang, H. Peng, K. Tu, and D. Goldwasser. Semi-Supervised StructuredPrediction with Neural CRF Autoencoder. In
35. A. Wicaksono and S. Myaeng. Toward Advice Mining: Conditional Random Fields for Extracting Advice-Revealing Text Units. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2039–2048, 2013.
36. L. Zhuowen and K. Wang. Human Behavior Recognition Based on Fractal Conditional Random Field. In , pages 1506–1510, 2013.
37. S. Kruthiventi and R. Babu. Crowd Flow Segmentation in Compressed Domain Using CRF. In , pages 3417–3421, 2015.
38. D. Liliana and C. Basaruddin. A Review on Conditional Random Fields as a Sequential Classifier in Machine Learning. In , pages 143–148, 2017.
39. H. Qu, P. Wu, Q. Huang, J. Yi, G. Riedlinger, S. De, and D. Metaxas. Weakly Supervised Deep Nuclei Segmentation Using Points Annotation in Histopathology Images. In International Conference on Medical Imaging with Deep Learning, pages 390–400, 2019.
40. K. Zormpas-Petridis, H. Failmezger, S. Raza, I. Roxanis, Y. Jamin, and Y. Yuan. Superpixel-based Conditional Random Fields (SuperCRF): Incorporating Global and Local Context for Enhanced Deep Learning in Melanoma Histopathology. Frontiers in Oncology, 9:1045, 2019.
41. Y. Li, M. Huang, Y. Zhang, J. Chen, H. Xu, G. Wang, and W. Feng. Automated Gleason Grading and Gleason Pattern Region Segmentation Based on Deep Learning for Pathological Images of Prostate Cancer. IEEE Access, 8:117714–117725, 2020.
42. J. Dong, X. Guo, and G. Wang. GECNN-CRF for Prostate Cancer Detection with WSI. In Proceedings of 2020 Chinese Intelligent Systems Conference, pages 646–658, 2021.
43. S. Kosov, K. Shirahama, C. Li, and M. Grzegorzek. Environmental Microorganism Classification Using Conditional Random Fields and Deep Convolutional Neural Networks. Pattern Recognition, 77:248–261, 2018.
44. C. Li, H. Chen, L. Zhang, N. Xu, D. Xue, Z. Hu, H. Ma, and H. Sun. Cervical Histopathology Image Classification Using Multilayer Hidden Conditional Random Fields and Weakly Supervised Learning. IEEE Access, 7:90378–90397, 2019.
45. Y. Li, X. Wu, C. Li, C. Sun, X. Li, M. Rahaman, and H. Zhang. Intelligent Gastric Histopathology Image Classification Using Hierarchical Conditional Random Field based Attention Mechanism. In Proceedings of the 2021 13th International Conference on Machine Learning and Computing, 2021.
46. C. Li, Y. Li, C. Sun, H. Chen, and H. Zhang. A Comprehensive Review for MRF and CRF Approaches in Pathology Image Analysis. arXiv preprint arXiv:2009.13721, 2020.
47. J. Lafferty, A. McCallum, and F. Pereira. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proceedings of the Eighteenth International Conference on Machine Learning, pages 282–289, 2001.
48. P. Clifford. Markov Random Fields in Statistics. In Disorder in Physical Systems: A Volume in Honour of John M. Hammersley, pages 19–32. Oxford University Press, 1990.
49. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.
50. S. Zheng, S. Jayasumana, B. Romera-Paredes, et al. Conditional Random Fields as Recurrent Neural Networks. In Proc. of ICCV 2015, pages 1–17, 2015.
51. R. Gupta. Conditional Random Fields. Unpublished Report, IIT Bombay, 2006.
52. O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Proc. of MICCAI 2015, pages 234–241, 2015.
53. K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.
54. C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception Architecture for Computer Vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
55. K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
56. S. Kumar and M. Hebert. Discriminative Random Fields. International Journal of Computer Vision, 68(2):179–201, 2006.
57. C. Li, H. Chen, D. Xue, Z. Hu, L. Zhang, L. He, N. Xu, S. Qi, H. Ma, and H. Sun. Weakly Supervised Cervical Histopathological Image Classification Using Multilayer Hidden Conditional Random Fields. In Proc. of ITIB 2019, pages 209–221, 2019.
58. D. Kermany, M. Goldbaum, W. Cai, C. Valentim, H. Liang, S. Baxter, A. McKeown, G. Yang, X. Wu, F. Yan, et al. Identifying Medical Diagnoses and Treatable Diseases by Image-based Deep Learning. Cell, 172(5):1122–1131, 2018.
59. S. Deng, X. Zhang, Y. Qin, W. Chen, H. Fan, X. Feng, J. Wang, R. Yan, Y. Zhao, Y. Cheng, et al. miRNA-192 and -215 Activate Wnt/β-catenin Signaling Pathway in Gastric Cancer via APC. Journal of Cellular Physiology, 235(9):6218–6229, 2020.
60. M. Wang, Y. Yu, F. Liu, L. Ren, Q. Zhang, and G. Zou. Single Polydiacetylene Microtube Waveguide Platform for Discriminating microRNA-215 Expression Levels in Clinical Gastric Cancerous, Paracancerous and Normal Tissues. Talanta, 188:27–34, 2018.
61. T. Kamishima, M. Hamasaki, and S. Akaho. TrBagg: A Simple Transfer Learning Method and its Application to Personalization in Collaborative Tagging. In , pages 219–228, 2009.
62. J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-scale Hierarchical Image Database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
63. J. Kittler, M. Hatef, R. Duin, and J. Matas. On Combining Classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(3):226–239, 1998.
64. Z. Zhang and C. Lin. Pathological Image Classification of Gastric Cancer Based on Depth Learning. ACM Trans. Intell. Syst. Technol., 45(11A):263–268, 2018.
65. A. Fischer, K. Jacobson, J. Rose, and R. Zeller. Hematoxylin and Eosin Staining of Tissue and Cell Sections. Cold Spring Harbor Protocols, 2008(5):pdb.prot4986, 2008.
66. M. Miettinen and J. Lasota. Gastrointestinal Stromal Tumors: Review on Morphology, Molecular Pathology, Prognosis, and Differential Diagnosis. Archives of Pathology & Laboratory Medicine, 130(10):1466–1478, 2006.
67. M. Miettinen. Gastrointestinal Stromal Tumors (GISTs): Definition, Occurrence, Pathology, Differential Diagnosis and Molecular Genetics. Polish Journal of Pathology, 54, 2003.
68. L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. Yuille. DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
69. V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A Deep Convolutional Encoder-decoder Architecture for Image Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, 2017.
70. S. Osher and J. Sethian. Fronts Propagating with Curvature-dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations. Journal of Computational Physics, 79(1):12–49, 1988.
71. N. Otsu. A Threshold Selection Method from Gray-level Histograms. IEEE Transactions on Systems, Man, and Cybernetics, 9(1):62–66, 1979.
72. L. Vincent and P. Soille. Watersheds in Digital Spaces: An Efficient Algorithm Based on Immersion Simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(6):583–598, 1991.
73. S. Li. Markov Random Field Models in Computer Vision. In Proc. of ECCV 1994, pages 361–370, 1994.
74. Y. Kurmi and V. Chaurasia. Content-based Image Retrieval Algorithm for Nuclei Segmentation in Histopathology Images. Multimedia Tools and Applications, pages 1–21, 2020.
75. S. Zafari, T. Eerola, J. Sampo, H. Kälviäinen, and H. Haario. Segmentation of Overlapping Elliptical Objects in Silhouette Images. IEEE Transactions on Image Processing, 24(12):5942–5952, 2015.
76. Z. Wang. A Semi-automatic Method for Robust and Efficient Identification of Neighboring Muscle Cells. Pattern Recognition, 53:300–312, 2016.
77. T. Lei, X. Jia, Y. Zhang, L. He, H. Meng, and A. Nandi. Significantly Fast and Robust Fuzzy c-means Clustering Algorithm Based on Morphological Reconstruction and Membership Filtering. IEEE Transactions on Fuzzy Systems, 26(5):3027–3041, 2018.
78. Q. Vu, S. Graham, T. Kurc, M. To, M. Shaban, T. Qaiser, N. Koohbanani, S. Khurram, J. Kalpathy-Cramer, T. Zhao, et al. Methods for Segmentation and Classification of Digital Microscopy Tissue Images. Frontiers in Bioengineering and Biotechnology, 7:53, 2019.
79. Y. Peng, S. Liu, Y. Qiang, X. Wu, and L. Hong. A Local Mean and Variance Active Contour Model for Biomedical Image Segmentation. Journal of Computational Science, 33:11–19, 2019.
80. C. Yu, Y. Yan, S. Zhao, and Y. Zhang. Pyramid Feature Adaptation for Semi-supervised Cardiac Bi-ventricle Segmentation. Computerized Medical Imaging and Graphics, 81:101697, 2020.
81. C. Sheela and G. Suganthi. Morphological Edge Detection and Brain Tumor Segmentation in Magnetic Resonance (MR) Images Based on Region Growing and Performance Evaluation of Modified Fuzzy C-Means (FCM) Algorithm. Multimedia Tools and Applications, pages 1–14, 2020.
82. D. Kingma and J. Ba. Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980, 2014.
83. J. Hu, L. Shen, and G. Sun. Squeeze-and-excitation Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7132–7141, 2018.
84. S. Woo, J. Park, J. Lee, and I. Kweon. CBAM: Convolutional Block Attention Module. In Proceedings of the European Conference on Computer Vision (ECCV), pages 3–19, 2018.
85. X. Wang, R. Girshick, A. Gupta, and K. He. Non-local Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7794–7803, 2018.
86. Y. Cao, J. Xu, S. Lin, F. Wei, and H. Hu. GCNet: Non-local Networks Meet Squeeze-excitation Networks and Beyond. In Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops, 2019.
87. S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, pages 448–456, 2015.
88. M. Hammad, P. Pławiak, K. Wang, and U. Acharya. ResNet-Attention Model for Human Authentication Using ECG Signals. Expert Systems, page e12547, 2020.
89. S. Roy, S. Manna, T. Song, and L. Bruzzone. Attention-Based Adaptive Spectral-Spatial Kernel ResNet for Hyperspectral Image Classification. IEEE Transactions on Geoscience and Remote Sensing, 2020.
90. D. Mishkin and J. Matas. All You Need is a Good Init. arXiv preprint arXiv:1511.06422, 2015.
91. A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv preprint arXiv:2010.11929, 2020.