HoVer-Net: Simultaneous Segmentation and Classification of Nuclei in Multi-Tissue Histology Images
Simon Graham, Quoc Dang Vu, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak, Nasir Rajpoot
Simon Graham∗, Quoc Dang Vu∗, Shan E Ahmed Raza, Ayesha Azam, Yee Wah Tsang, Jin Tae Kwak+ and Nasir Rajpoot+

Abstract—Nuclear segmentation and classification within Haematoxylin & Eosin stained histology images is a fundamental prerequisite in the digital pathology work-flow. The development of automated methods for nuclear segmentation and classification enables the quantitative analysis of tens of thousands of nuclei within a whole-slide pathology image, opening up possibilities of further analysis of large-scale nuclear morphometry. However, automated nuclear segmentation and classification is faced with a major challenge in that there are several different types of nuclei, some of them exhibiting large intra-class variability such as the nuclei of tumour cells. Additionally, some of the nuclei are often clustered together. To address these challenges, we present a novel convolutional neural network for simultaneous nuclear segmentation and classification that leverages the instance-rich information encoded within the vertical and horizontal distances of nuclear pixels to their centres of mass. These distances are then utilised to separate clustered nuclei, resulting in an accurate segmentation, particularly in areas with overlapping instances. Then, for each segmented instance the network predicts the type of nucleus via a devoted up-sampling branch. We demonstrate state-of-the-art performance compared to other methods on multiple independent multi-tissue histology image datasets. As part of this work, we introduce a new dataset of Haematoxylin & Eosin stained colorectal adenocarcinoma image tiles, containing 24,319 exhaustively annotated nuclei with associated class labels.
Index Terms—Nuclear segmentation, nuclear classification, computational pathology, deep learning.
I. INTRODUCTION
Current manual assessment of Haematoxylin and Eosin (H&E) stained histology slides suffers from low throughput and is naturally prone to intra- and inter-observer variability [1]. To overcome the difficulty in visual assessment of tissue slides, there is a growing interest in digital pathology (DP), where digitised whole-slide images (WSIs) are acquired from glass histology slides using a scanning device. This permits efficient processing, analysis and management of the tissue specimens [2]. Each WSI contains tens of thousands of nuclei of various types, which can be further analysed in a systematic manner and used for predicting clinical outcome. Here, the type of nucleus refers to the cell type in which it is located. For example, nuclear features can be used to predict survival [3] and also for diagnosing the grade and type of disease [4]. Also, efficient and accurate detection and segmentation of nuclei can facilitate good quality tissue segmentation [5], [6], which can in turn not only facilitate the quantification of WSIs but may also serve as an important step in understanding how each tissue component contributes to disease. In order to use nuclear features for downstream analysis within computational pathology, nuclear segmentation must be carried out as an initial step.

∗ First authors contributed equally. + Last authors contributed equally. S. Graham, N. Rajpoot and S.E.A. Raza are with the Department of Computer Science, University of Warwick, UK. S. Graham is also with the Mathematics for Real-World Systems Centre for Doctoral Training, University of Warwick, UK. S.E.A. Raza is also with the Centre for Evolution and Cancer & Division of Molecular Pathology, The Institute of Cancer Research, London, UK. Q.D. Vu and J.T. Kwak are with the Department of Computer Science and Engineering, Sejong University, South Korea. A. Azam and Y.W. Tsang are with the Department of Pathology at University Hospitals Coventry and Warwickshire, Coventry, UK.
However, this remains a challenge because nuclei display a high level of heterogeneity and there is significant inter- and intra-instance variability in the shape, size and chromatin pattern between and within different cell types, disease types or even from one region to another within a single tissue sample. Tumour nuclei, in particular, tend to be present in clusters, which gives rise to many overlapping instances, providing a further challenge for automated segmentation, due to the difficulty of separating neighbouring instances.

As well as extracting each individual nucleus, determining the type of each nucleus can increase the diagnostic potential of current DP pipelines. For example, accurately classifying each nucleus as tumour or lymphocyte enables downstream analysis of tumour infiltrating lymphocytes (TILs), which have been shown to be predictive of cancer recurrence [7]. Yet, similar to nuclear segmentation, classifying the type of each nucleus is difficult, due to the high variance of nuclear appearance within each WSI. Typically, nuclei are classified using two disjoint models: one for detecting each nucleus and then another for performing nuclear classification [8], [9]. However, it would be preferable to utilise a single unified model for nuclear instance segmentation and classification.

In this paper, we present a deep learning approach for simultaneous segmentation and classification of nuclear instances in histology images. The network is based on the prediction of horizontal and vertical distances (and hence the name HoVer-Net) of nuclear pixels to their centres of mass, which are subsequently leveraged to separate clustered nuclei. For each segmented instance, the nuclear type is subsequently determined via a dedicated up-sampling branch. To the best of our knowledge, this is the first approach that achieves instance segmentation and classification within the same network.

Model code: https://github.com/vqdang/hover_net
We present comparative results on six independent multi-tissue histology image datasets and demonstrate state-of-the-art performance compared to other recently proposed methods. The main contributions of this work are listed as follows:
• A novel network, targeted at simultaneous segmentation and classification of nuclei, where horizontal and vertical distance map predictions separate clustered nuclei.
• We show that the proposed HoVer-Net achieves state-of-the-art performance on multiple H&E histology image datasets, as compared to over a dozen recently published methods.
• An interpretable and reliable evaluation framework that effectively quantifies nuclear segmentation performance and overcomes the limitations of existing performance measures.
• A new dataset of 24,319 exhaustively annotated nuclei within 41 colorectal adenocarcinoma image tiles.

II. RELATED WORK
A. Nuclear Instance Segmentation
Within the current literature, energy-based methods, in particular the watershed algorithm, have been widely utilised to segment nuclear instances. For example, [10] used thresholding to obtain the markers and the energy landscape as input for watershed to extract the nuclear instances. Nonetheless, thresholding relies on a consistent difference in intensity between the nuclei and background, which does not hold for more complex images and hence often produces unreliable results. Various approaches have tried to provide an improved marker for marker-controlled watershed. [11] used active contours to obtain the markers. [12] used a series of morphological operations to generate the energy landscape. However, these methods rely on the predefined geometry of the nuclei to generate the markers, which determines the overall accuracy of each method. Notably, [13] avoided the trouble of refining the markers for watershed by designing a method that relies solely on the energy landscape. They combined an active contour approach with nuclear shape modelling via a level-set method to obtain the nuclear instances. Despite its widespread usage, obtaining sufficiently strong markers for watershed is a non-trivial task. Some methods have departed from the energy-based approach by utilising the geometry of the nuclei. For instance, [14], [15] and [16] computed the concavity of nuclear clusters, while [17] used ellipse-fitting to separate the clusters. However, this assumes a predefined shape, which does not encompass the natural diversity of the nuclei. In addition, these methods tend to be sensitive to the choice of manually selected parameters.

The CoNSeP dataset for nuclear segmentation is available at https://warwick.ac.uk/fac/sci/dcs/research/tia/data/.

Recently, deep learning methods have received a surge of interest due to their superior performance in many computer vision tasks [18], [19], [20]. These approaches are capable
of automatically extracting a representative set of features that strongly correlate with the task at hand. As a result, they are preferable to hand-crafted approaches that rely on a selection of pre-defined features. Inspired by the Fully Convolutional Network (FCN) [21], U-Net [22] has been successfully applied to numerous segmentation tasks in medical image analysis. The network has an encoder-decoder design with skip connections to incorporate low-level information and uses a weighted loss function to assist separation of instances. However, it often struggles to split neighbouring instances and is highly sensitive to pre-defined parameters in the weighted loss function. A more recently proposed method, Micro-Net [23], extends U-Net by utilising an enhanced network architecture with weighted loss. The network processes the input at multiple resolutions and, as a result, gains robustness against nuclei of varying size. In [24], the authors developed a network that is robust to stain variations in H&E images by introducing a weighted loss function that is sensitive to the Haematoxylin intensity within the image.

Other methods exploit information about the nuclear contour (or boundary) within the network, such as DCAN [25], which utilised a dual architecture that outputs the nuclear cluster and the nuclear contour as two separate prediction maps. Instance segmentation is then achieved by subtracting the contour from the nuclear cluster prediction. [26] proposed a network to predict the inner nuclear instance, the nuclear contour and the background. The network utilised a customised weighted loss function based on the relative position of pixels within the image to improve and stabilise the inner nuclei and contour prediction. Some other methods have also utilised the nuclear contour to achieve instance segmentation. For example, [27] employed a deep learning technique for labelling the nuclei and the contours, followed by a region growing approach to extract the final instances.
[28] used the contour predictions as input into a further network for segmentation refinement. [29] proposed CIA-Net, which utilises a multi-level information aggregation module between two task-specific decoders, where each decoder segments either the nuclei or the contours. A Deep Residual Aggregation Network (DRAN) was proposed by [30] that uses a multi-scale strategy, incorporating both the nuclei and nuclear contours to accurately segment nuclei.

There have been various other methods to achieve instance separation. Instead of considering the contour, [31] proposed a deep learning approach to detect superior markers for watershed by regressing the nuclear distance map. Therefore, the network avoids making a prediction for areas with indistinct contours.

In line with these developments, the field of instance segmentation within natural images is also rapidly progressing and has had a significant influence on nuclear instance segmentation methods. A notable example is Mask-RCNN [32], where instance segmentation is achieved by first predicting candidate regions likely to contain an object and then performing deep learning based segmentation within those proposed regions.
Fig. 1: Overview of the proposed approach for simultaneous nuclear instance segmentation and classification. When no classification labels are available, the network produces the instance segmentation as shown in (a). The different colours of the nuclear boundaries represent different types of nuclei in (b).
B. Nuclear Classification
As well as performing instance segmentation, it is desirable to determine the type of each nucleus to facilitate and improve downstream analysis. It is possible for current models to differentiate between certain nuclear types in H&E; however, sub-typing of lymphocytes is an extremely hard task due to the high levels of similarity in morphological appearance between T and B lymphocytes. Typically, classifying each nucleus is done via a two-stage approach, where the first step involves either nuclear segmentation or nuclear detection. When segmentation is used as the initial step, a series of morphological and textural features are extracted from each instance, which are then used within a classifier to determine the nuclei classes. For example, [33] classified nuclei within H&E stained breast cancer images as either tumour, lymphocyte or stromal based on their morphological features. [34] performed nuclear segmentation and then classified each nucleus with an AdaBoost classifier, utilising the intensity, morphology and texture of nuclei as features. Otherwise, detection is performed as an initial step and a patch centred at the point of detection is fed into a classifier to predict the type of nucleus. [35] proposed a spatially constrained CNN that initially detects all nuclei; then, for each nucleus, an ensemble of associated patches is fed into a CNN to predict the type to be either epithelial, inflammatory, fibroblast or miscellaneous.

III. METHODS
Our overall framework for automatic nuclear instance segmentation and classification can be observed in Fig. 1 and the proposed network in Fig. 2. Here, nuclear pixels are first detected and then a tailored post-processing pipeline is used to simultaneously segment nuclear instances and obtain the corresponding nuclear types. The framework is based upon the horizontal and vertical distance maps, which can be seen in Fig. 3. In these maps, each nuclear pixel denotes either the horizontal or vertical distance of the pixel to its centre of mass.
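To make the construction of these maps concrete, the sketch below derives the horizontal and vertical ground-truth maps from an instance-labelled mask, normalising each nucleus so that values run from -1 to 1 about its centre of mass. This is an illustrative NumPy implementation for exposition, not the authors' exact code.

```python
import numpy as np

def hover_maps(inst_map):
    """Build horizontal/vertical distance-map ground truth from an
    instance-labelled mask (0 = background, 1..N = nuclei)."""
    h_map = np.zeros(inst_map.shape, dtype=np.float32)
    v_map = np.zeros(inst_map.shape, dtype=np.float32)

    def normalise(d):
        # scale negatives to [-1, 0) and positives to (0, 1] independently
        d = d.astype(np.float32)
        if (d < 0).any():
            d[d < 0] /= -d[d < 0].min()
        if (d > 0).any():
            d[d > 0] /= d[d > 0].max()
        return d

    for inst_id in np.unique(inst_map):
        if inst_id == 0:
            continue  # skip background
        ys, xs = np.nonzero(inst_map == inst_id)
        cy, cx = ys.mean(), xs.mean()          # centre of mass
        h_map[ys, xs] = normalise(xs - cx)     # horizontal distances
        v_map[ys, xs] = normalise(ys - cy)     # vertical distances
    return h_map, v_map
```

Note that background pixels and the lines crossing each centre of mass remain at 0, matching the description of the maps given above.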
A. Network Architecture
In order to extract a strong and representative set of features, we employ a deep neural network. The feature extraction component of the network is inspired by the pre-activated residual network with 50 layers [36] (Preact-ResNet50), due to its excellent performance in recent computer vision tasks [37] and robustness against input perturbation [38]. Compared to the standard Preact-ResNet50 implementation, we reduce the total down-sampling factor from 32 to 8 by using a stride of 1 in the first convolution and removing the subsequent max-pooling operation. This ensures that there is no immediate loss of information that is important for performing an accurate segmentation. Various residual units are applied throughout the network at different down-sampling levels. A series of consecutive residual units is denoted as a residual block. The number of residual units within each residual block is 3, 4, 6 and 3, applied at down-sampling levels 1, 2, 4 and 8 respectively. For clarity, a down-sampling level of 2 means that the input has a reduction in spatial resolution by a factor of 2.

Following Preact-ResNet50, we perform nearest neighbour up-sampling via three distinct branches to simultaneously obtain accurate nuclear instance segmentation and classification. We name the corresponding branches: (i) the nuclear pixel (NP) branch; (ii) the HoVer branch; and (iii) the nuclear classification (NC)
Fig. 2: Overview of the proposed architecture. (a) (Pre-activated) residual unit, (b) dense unit. m indicates the number of feature maps within each residual unit. The yellow square within the input denotes the considered region at the output. When the classification labels are not available, only the up-sampling branches in the dashed box are considered.

branch. The NP branch predicts whether or not a pixel belongs to the nuclei or the background, whereas the HoVer branch predicts the horizontal and vertical distances of nuclear pixels to their centres of mass. Then, the NC branch predicts the type of nucleus for each pixel. In particular, the NP and HoVer branches jointly achieve nuclear instance segmentation by first separating nuclear pixels from the background (NP branch) and then separating touching nuclei (HoVer branch). The NC branch determines the type of each nucleus by aggregating the pixel-level nuclear type predictions within each instance.

All three up-sampling branches utilise the same architectural design, which consists of a series of up-sampling operations and densely connected units [39] (or dense units). By stacking multiple and relatively cheap dense units, we build a large receptive field with minimal parameters, compared to using a single convolution with a larger kernel size, and we ensure efficient gradient propagation. We use skip connections [22] to incorporate features from the encoder, but utilise summation as opposed to concatenation. The consideration of low-level information is particularly important in segmentation tasks, where we aim to precisely delineate the object boundaries. We use dense units after the first and second up-sampling operations, where the number of units is 4 and 8 respectively. Valid convolution is performed throughout the two up-sampling branches to prevent poor predictions at the boundary. This results in the size of the output being smaller than the size of the input.
As opposed to using a dedicated network for each task, a shared encoder makes it possible to train the nuclear instance segmentation and classification model end-to-end and therefore reduce the total training time. Furthermore, a shared encoder can also take advantage of the shared information across multiple tasks and thus help to improve the model performance on all tasks.

Finally, if we do not have the classification labels of the nuclei, only the NP and HoVer up-sampling branches are considered. Otherwise, we consider all three up-sampling branches and perform simultaneous nuclear instance segmentation and classification.

We display an overview of the network architecture in Fig. 2, where the spatial dimension of the input is 270×270 and the output dimension of each branch is 80×80. The dashed box within Fig. 2 highlights the branches for nuclear instance segmentation. Additionally, we also show a residual unit and a dense unit within Fig. 2a and Fig. 2b. We denote m as the number of feature maps within each convolution of a given residual unit. At each down-sampling level, from left to right, m = 256, 512, 1024, 2048 respectively. We keep a fixed number of feature maps within each dense unit throughout the two branches, as shown in Fig. 2c.
1) Loss Function:
The proposed network design has 4 different sets of weights: w0, w1, w2 and w3, which refer to the weights of the Preact-ResNet50 encoder, the HoVer branch decoder, the NP branch decoder and the NC branch decoder respectively. These 4 sets of weights are optimised jointly using the loss L defined as:

L = λ_a L_a + λ_b L_b (HoVer branch) + λ_c L_c + λ_d L_d (NP branch) + λ_e L_e + λ_f L_f (NC branch)   (1)

where L_a and L_b represent the regression losses with respect to the output of the HoVer branch, L_c and L_d represent the losses with respect to the output of the NP branch and, finally, L_e
Fig. 3: Cropped image regions showing horizontal and vertical map predictions, with corresponding ground truth. Arrows highlight the strong instance information encoded within these maps, where there is a significant difference in the pixel values.

and L_f represent the losses with respect to the output of the NC branch. We choose to use two different loss functions at the output of each branch for an overall superior performance. λ_a, ..., λ_f are scalars that give weight to each associated loss function. Specifically, we set λ_b to 2 and the other scalars to 1, based on empirical selection.

Given the input image I, at each pixel i we define p_i(I; w0, w1) as the regression output of the HoVer branch, whereas q_i(I; w0, w2) and r_i(I; w0, w3) denote the pixel-based softmax predictions of the NP and NC branches respectively. We also define Γ_i(I), Ψ_i(I) and Φ_i(I) as their corresponding ground truth (GT). Ψ_i(I) is the GT of the nuclear binary map, where background pixels have the value 0 and nuclear pixels have the value 1. On the other hand, Φ_i(I) is the nuclear type GT, where background pixels have the value 0 and any integer value larger than 0 indicates the type of nucleus. Meanwhile, Γ_i(I) denotes the GT of the horizontal and vertical distances of nuclear pixels to their corresponding centres of mass. For Γ_i(I), we assign values between -1 and 1 to nuclear pixels in both the horizontal and vertical directions. We assign the value of the background, and of the line crossing the centre of mass within each nucleus, to be 0. For clarity, we denote the horizontal and vertical components of the GT HoVer map as the horizontal map Γ_{i,x} and the vertical map Γ_{i,y} respectively. Visual examples of the horizontal and vertical maps can be seen in Fig. 3.

At the output of the HoVer branch, we compute a multiple-term regression loss. We denote L_a as the mean squared error between the predicted horizontal and vertical distances and the GT.
We also propose a novel loss function L_b that calculates the mean squared error between the horizontal and vertical gradients of the horizontal and vertical maps respectively and the corresponding gradients of the GT. We formally define L_a and L_b as:

L_a = (1/n) Σ_{i=1..n} (p_i(I; w0, w1) − Γ_i(I))²   (2)

L_b = (1/m) Σ_{i∈M} (∇_x(p_{i,x}(I; w0, w1)) − ∇_x(Γ_{i,x}(I)))² + (1/m) Σ_{i∈M} (∇_y(p_{i,y}(I; w0, w1)) − ∇_y(Γ_{i,y}(I)))²   (3)

Within equation (3), ∇_x and ∇_y denote the gradient in the horizontal x and vertical y directions respectively. m denotes the total number of nuclear pixels within the image and M denotes the set containing all nuclear pixels.

At the output of the NP and NC branches, we calculate the cross-entropy loss (L_c and L_e) and the dice loss (L_d and L_f). These two losses are then added together to give the overall loss of each branch. Concretely, we define the cross-entropy and dice losses as:

CE = −(1/N) Σ_{i=1..N} Σ_{k=1..K} X_{i,k}(I) log Y_{i,k}(I)   (4)

Dice = 1 − (2 × Σ_{i=1..N} (Y_i(I) × X_i(I)) + ε) / (Σ_{i=1..N} Y_i(I) + Σ_{i=1..N} X_i(I) + ε)   (5)

where X is the ground truth, Y is the prediction, K is the number of classes and ε is a smoothness constant which we set to 1.0e−3. When calculating L_c and L_d for the NP branch, for a given pixel i, we set X_i and Y_i as Ψ_i and q_i(I; w0, w2) respectively. For L_c, we set K to 2 within equation (4) because the task of the branch is to perform binary nuclear segmentation. Similarly, for L_e and L_f at the NC branch, for a given pixel i, we substitute X_i for Φ_i(I) and Y_i for r_i(I; w0, w3) in equations (4) and (5).
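As an illustration of the HoVer-branch terms L_a and L_b in equations (2) and (3), the following NumPy sketch uses np.gradient as a stand-in for the ∇_x and ∇_y operators; it is not the authors' implementation, and the exact derivative filter may differ.

```python
import numpy as np

def hover_branch_loss(pred_h, pred_v, gt_h, gt_v, nuc_mask, lambda_b=2.0):
    """Sketch of Eqs. (2)-(3): L_a is a mean squared error over all pixels;
    L_b compares the horizontal derivative of the horizontal maps and the
    vertical derivative of the vertical maps over nuclear pixels only."""
    # L_a: plain MSE on both regression maps (Eq. 2)
    l_a = np.mean((pred_h - gt_h) ** 2) + np.mean((pred_v - gt_v) ** 2)

    # L_b: MSE of derivatives, restricted to the nuclear pixel set M (Eq. 3)
    m = nuc_mask.astype(bool)
    dx_p, dx_g = np.gradient(pred_h, axis=1), np.gradient(gt_h, axis=1)
    dy_p, dy_g = np.gradient(pred_v, axis=0), np.gradient(gt_v, axis=0)
    l_b = np.mean((dx_p - dx_g)[m] ** 2) + np.mean((dy_p - dy_g)[m] ** 2)

    # lambda_b = 2 and the remaining scalars 1, as stated above
    return l_a + lambda_b * l_b
```

A perfect prediction drives both terms to zero; the gradient term additionally penalises predictions whose maps are smooth where the GT has sharp jumps between neighbouring nuclei.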
K is set to 5 within equation (4) when calculating L_e, denoting the 4 types of nuclei that our model currently predicts and the background. Note, the value of K is chosen to reflect the number of nuclear types represented in the training set.

It must be noted that the NC branch losses L_e and L_f are only calculated when the classification labels are available. In other words, as mentioned in Section III-A, the network performs only instance segmentation if there are no classification labels given.

B. Post Processing
Within each horizontal and vertical map, pixels between separate instances have a significant difference. This can be seen in Fig. 3 and is highlighted by the arrows. Therefore, calculating the gradient can inform where the nuclei should be separated, because the output will give high values between neighbouring nuclei, where there is a significant difference in the pixel values. We define:

S_m = max(H_x(p_x), H_y(p_y))   (6)

where p_x and p_y refer to the horizontal and vertical predictions at the output of the HoVer branch and H_x and H_y refer to the horizontal and vertical components of the Sobel operator. Specifically, H_x and H_y compute the horizontal and vertical derivative approximations and are shown by the gradient maps in Fig. 1. Therefore, S_m highlights areas where there is a significant difference in neighbouring pixels within the horizontal and vertical maps, such that areas like the ones shown by the arrows in Fig. 3 will result in high values within S_m. We compute markers M = σ(τ(q, h) − τ(S_m, k)). Here, τ(a, b) is a threshold function that acts on a and sets values above b to 1 or 0 otherwise. Specifically, h and k were chosen such that they gave the optimal nuclear segmentation results. σ is a rectifier that sets all negative values to 0 and q is the probability map output of the NP branch. We obtain the energy landscape E = [1 − τ(S_m, k)] ∗ τ(q, h). Finally, M is used as the marker during marker-controlled watershed to determine how to split τ(q, h), given the energy landscape E. This sequence of events can be seen in Fig. 1.

Fig. 4: Examples highlighting the limitations of DICE2 and AJI with slightly different predictions. For better visualisation, ground truth contours (red dashed line) for each instance have been overlaid on both the predictions and original images.

TABLE I: Comparison between Prediction A and Prediction B from Fig. 4 across various measurements (DICE2, AJI, PQ).

To perform simultaneous nuclear instance segmentation and classification, it is necessary to convert the per-pixel nuclear type prediction at the output of the NC branch to a prediction per nuclear instance. For each nuclear instance, we use the majority class of the predictions made by the NC branch, i.e., the nuclear type of all pixels in an instance is assigned to be the class with the highest frequency count for that nuclear instance.

Please refer to Appendix A for a full analysis of the contribution of our proposed loss function, post-processing method and devoted classification branch.

IV. EVALUATION METRICS
A. Nuclear Instance Segmentation Evaluation
Assessment and comparison of different methods is usually given by an overall score that indicates which method is superior. However, to further investigate a method, it is preferable to break the problem into sub-tasks and measure the performance of the method on each sub-task. This enables an in-depth analysis, thus facilitating a comprehensive understanding of the approach, which can help drive forward model development. For nuclear instance segmentation, the problem can be divided into the following three sub-tasks:
• Separate the nuclei from the background
• Detect individual nuclear instances
• Segment each detected instance
In the current literature, two evaluation metrics have been mainly adopted to quantitatively measure the performance of nuclear instance segmentation: 1) Ensemble Dice (DICE2) [30], and 2) Aggregated Jaccard Index (AJI) [27]. Given the ground truth X and prediction Y, DICE2 computes and aggregates DICE per nucleus, where the Dice coefficient (DICE) is defined as 2 × |X ∩ Y| / (|X| + |Y|), and AJI computes the ratio of an aggregated intersection cardinality and an aggregated union cardinality between X and Y.

These two evaluation metrics only provide an overall score for the instance segmentation quality and therefore provide no further insight into the sub-tasks at hand. In addition, these two metrics have a limitation, which we illustrate in Fig. 4. From the figure, although prediction A only differs from prediction B by a few pixels, the DICE2 and AJI scores for B are inferior. These scores are shown in Table I. This problem arises due to over-penalisation of the overlapping regions. By overlaying the GT segment contours (red dashed line) upon the two predictions, we observe that, although the cyan-coloured instance within prediction A overlaps mostly with the cyan-coloured GT instance, it also slightly overlaps with the blue-coloured GT instance.
As a result, according to the DICE2 algorithm, the predicted cyan instance will be penalised by pixels not only coming from the dominant overlapping cyan-coloured GT instance, but also from the blue-coloured GT instance. The AJI also suffers from the same phenomenon. However, because AJI only uses the prediction and GT instance pair with the highest intersection over union, over-penalisation is less likely compared to DICE2. Over-penalisation is likely to occur when the model completely fails to detect the neighbouring instance, such as in Fig. 4. Nonetheless, when evaluating methods across different datasets, specifically on samples containing many hard to recognise nuclei such as fibroblasts or nuclei with poor staining, the number of failed detections may increase and therefore may have a negative impact on the AJI measurement. Due to the limitations of DICE2 and AJI, it is clear that there is a need for an improved, reliable quantitative measurement.

Panoptic Quality: We propose to use another metric for accurate quantification and interpretability to assess the performance of nuclear instance segmentation. Originally proposed by [40], panoptic quality (PQ) for nuclear instance segmentation is defined as:

PQ = [ |TP| / (|TP| + ½|FP| + ½|FN|) ] × [ Σ_{(x,y)∈TP} IoU(x,y) / |TP| ]   (7)

where the first term is the detection quality (DQ) and the second term is the segmentation quality (SQ); x denotes a GT segment, y denotes a prediction segment and IoU denotes intersection over union. Each (x, y) pair is mathematically proven to be unique [40] over the entire set of prediction and GT segments if their IoU(x, y) > 0.5. The detection quality (DQ) is the F1 score that is widely used to evaluate instance detection, while segmentation quality (SQ) can be interpreted as how close each correctly detected instance is to its matched GT. DQ and SQ, in a way, also provide a direct insight into the second and third sub-tasks defined above.
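For reference, PQ can be computed from a pair of instance-labelled maps as in the sketch below. This is an illustrative implementation: greedy matching is safe here precisely because, as noted above, pairs with IoU > 0.5 are unique.

```python
import numpy as np

def panoptic_quality(true, pred, iou_thres=0.5):
    """Compute PQ (Eq. 7) from two instance-labelled maps (0 = background).
    Returns (PQ, DQ, SQ)."""
    true_ids = [i for i in np.unique(true) if i != 0]
    pred_ids = [i for i in np.unique(pred) if i != 0]
    matched_ious, matched_pred = [], set()
    for t in true_ids:
        t_mask = true == t
        for p in pred_ids:
            if p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(t_mask, p_mask).sum()
            union = np.logical_or(t_mask, p_mask).sum()
            if inter / union > iou_thres:          # unique match
                matched_ious.append(inter / union)
                matched_pred.add(p)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp                        # unmatched predictions
    fn = len(true_ids) - tp                        # unmatched GT instances
    denom = tp + 0.5 * fp + 0.5 * fn
    dq = tp / denom if denom else 0.0              # detection quality (F1)
    sq = sum(matched_ious) / tp if tp else 0.0     # segmentation quality
    return dq * sq, dq, sq
```

A perfect prediction yields PQ = DQ = SQ = 1, while a missed instance lowers DQ without affecting SQ, which illustrates why the two components should be read together.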
We believe that PQ should set the standard for measuring the performance of nuclear instance segmentation methods.

Overall, to fully characterise and understand the performance of each method, we use the following three metrics: 1) DICE to measure the separation of all nuclei from the background; 2) Panoptic Quality as a unified score for comparison; and 3) AJI for direct comparison with previous publications. Panoptic quality is further broken down into its DQ and SQ components for interpretability. Note, SQ is calculated only within true positive segments and should therefore be observed together with DQ. Throughout this study, these metrics are calculated for each image and the average over all images is reported as the final value for each dataset.

B. Nuclear Classification Evaluation
Classification of the type of each nucleus is performed within the nuclear instances extracted from the instance segmentation or detection tasks. Therefore, the overall measurement for nuclear type classification should also encompass these two tasks. For all nuclear instances of a particular type t from both the ground truth and the prediction, the detection task d splits the GT and predicted instances into the following subsets: correctly detected instances (TP_d), misdetected GT instances (FN_d) and overdetected predicted instances (FP_d). Subsequently, the classification task c further breaks TP_d into correctly classified instances of type t (TP_c), correctly classified instances of types other than type t (TN_c), incorrectly classified instances of type t (FP_c) and incorrectly classified instances of types other than type t (FN_c). We then define the F_c score of each type t for combined nuclear type classification and detection as follows:

F_c^t = 2(TP_c + TN_c) / [2(TP_c + TN_c) + α_0 FP_c + α_1 FN_c + α_2 FP_d + α_3 FN_d]   (8)

where we use α_0 = α_1 = 2 and α_2 = α_3 = 1 to give more emphasis to nuclear type classification. Moreover, using the same weighting, if we further extend t to encompass all types of nuclei T (t ∈ T), the classification within TP_d is then divided into a correctly classified set A_c and an incorrectly classified set B_c. We can therefore disassemble F_c^T into:

F_c^T = 2A_c / [2(A_c + B_c) + FP_d + FN_d]
      = [2(A_c + B_c) / (2(A_c + B_c) + FP_d + FN_d)] × [A_c / (A_c + B_c)]
      = F_d × Classification Accuracy within Correctly Detected Instances   (9)

where F_d is simply the standard detection quality, like DQ, while the other term is the accuracy of nuclear type classification within correctly detected instances.

Evaluation code: https://github.com/vqdang/hover_net/src/metrics
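Equation (8) reduces to simple counting once the detection and classification subsets are tallied; the hypothetical helper below (not from the released code) sketches it with the weights stated above.

```python
def f_score_type(tp_c, tn_c, fp_c, fn_c, fp_d, fn_d,
                 a0=2.0, a1=2.0, a2=1.0, a3=1.0):
    """Eq. (8): combined detection + classification F-score for one
    nuclear type t. Defaults follow the weighting above (a0 = a1 = 2,
    a2 = a3 = 1), emphasising classification over detection errors."""
    num = 2.0 * (tp_c + tn_c)
    return num / (num + a0 * fp_c + a1 * fn_c + a2 * fp_d + a3 * fn_d)
```

With no errors of either kind the score is 1; detection errors (FP_d, FN_d) are weighted half as heavily as classification errors, matching the decomposition in equation (9).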
In the case where the GT is not exhaustively annotated for nuclear type classification, as in CRCHisto, an amount equal to the number of unlabelled GT instances in each set is subtracted from B_c and FN_c.

Finally, while IoU is utilised as the criterion in DQ for selecting the TP for detection in instance segmentation, detection methods cannot calculate the IoU. Therefore, to facilitate comparison of both instance segmentation and detection methods for the nuclear type classification tasks, for F_c^t, we utilise the notion of distance to determine whether nuclei have been detected. To be precise, we define the region within a predefined radius from the annotated centre of the nucleus as the ground truth and if a prediction lies within this area, then it is considered to be a true positive. Here, we are consistent with [35] and use a radius of 6 pixels at 20× or 12 pixels at 40×.

V. EXPERIMENTAL RESULTS
A. Datasets
As part of this work, we introduce a new dataset that we term the colorectal nuclear segmentation and phenotypes (CoNSeP) dataset, consisting of 41 H&E stained image tiles, each of size 1,000×1,000 pixels at 40× objective magnification. Images were extracted from 16 colorectal adenocarcinoma (CRA) WSIs, each belonging to an individual patient, and scanned with an Omnyx VL120 scanner within the department of pathology at University Hospitals Coventry and Warwickshire, UK. We chose to focus on a single cancer type so that we are able to display the true variation of tissue within colorectal adenocarcinoma WSIs, as opposed to other datasets that instead focus on using a small number of visual fields from various cancer types. Within this dataset, stroma, glandular, muscular, collagen, fat and tumour regions can be observed. Besides incorporating different tissue components, the 41 images were also chosen such that different nuclei types were present, including: normal epithelial; tumour epithelial; inflammatory; necrotic; muscle and fibroblast. Here, by type we are referring to the type of cell from which the nucleus originates. Within the dataset, there are many significantly overlapping nuclei with indistinct boundaries and there exist various artifacts, such as ink. As a result of the diversity of the dataset, it is likely that a model trained on CoNSeP will perform well for unseen CRA cases. For each image tile, every nucleus was annotated by one of two expert pathologists (A.A, Y-W.T). After full annotation, each annotated sample was reviewed by both of the pathologists, thereby refining their own and each other's annotations. By the end of the annotation process, each pathologist had fully checked every sample and consensus had been reached. Annotating the data in this way ensured that minimal nuclei were missed in the annotation process. However, we cannot avoid inevitable pixel-level differences between the annotation and the true nuclear boundary in challenging cases. In addition to delineating the nuclear boundaries, every nucleus was labelled as either: normal epithelial, malignant/dysplastic epithelial, fibroblast, muscle, inflammatory, endothelial or miscellaneous. Within the miscellaneous category, necrotic, mitotic and cells that could not be categorised were grouped. For our experiments, we grouped the normal and malignant/dysplastic epithelial nuclei into a single class and we grouped the fibroblast, muscle and endothelial nuclei into a class named spindle-shaped nuclei.

Overall, six independent datasets are utilised for this study. A full summary for each of them is provided in Table II. Five of these datasets are used to evaluate the instance segmentation performance, which we refer to as: CoNSeP; Kumar [27]; CPM-15; CPM-17 [30] and TNBC [31]. Example images from each of the five datasets can be seen in Fig. 5. Meanwhile, we utilise CoNSeP and a further dataset, named CRCHisto, to quantify the performance of the nuclear classification model. The CRCHisto dataset consists of the same nuclei types that are present in CoNSeP. It is also worth noting that the CRCHisto dataset is not exhaustively annotated for nuclear class labels.

This dataset is available at https://warwick.ac.uk/fac/sci/dcs/research/tia/data/.
B. Implementation and Training Details
We implemented our framework with the open source software library TensorFlow version 1.8.0 [41] on a workstation equipped with two NVIDIA GeForce 1080 Ti GPUs. During training, data augmentation including flip, rotation, Gaussian blur and median blur was applied to all methods. All networks received an input patch with a size ranging from 252×252 to 270×270, depending on the requirement of each architecture. For optimisation, we used an initial learning rate of 10^-4 and then reduced it to a rate of 10^-5 after 25 epochs. This strategy was repeated for fine-tuning. On the whole, training of the network is stable, where the usage of fully independent decoders helps the network to converge each time. The network was trained with an RGB input, normalised between 0 and 1.

C. Comparative Analysis of Segmentation Methods
Experimental Setting: We evaluated our approach by employing a fully independent comparison across the three largest known exhaustively labelled nuclear segmentation datasets: Kumar, CoNSeP and CPM-17, and utilised the metrics as described in Section IV-A. For this experiment, because we do not have the classification labels for all datasets, we perform instance segmentation without classification. This enables us to
TABLE II: Summary of the datasets used in our experiments. UHCW denotes University Hospitals Coventry and Warwickshire and TCGA denotes The Cancer Genome Atlas. Seg denotes segmentation masks and Class denotes classification labels.
                        CoNSeP     Kumar      CPM-15               CPM-17             TNBC             CRCHisto
Total Number of Nuclei  24,319     21,623     2,905                7,570              4,056            29,756
Labelled Nuclei         24,319     0          0                    0                  0                22,444
Number of Images        41         30         15                   32                 50               100
Origin                  UHCW       TCGA       TCGA                 TCGA               Curie Institute  UHCW
Magnification           40×        40×        40× & 20×            40× & 20×          40×              20×
Size of Images          1000×1000  1000×1000  400×400 to 1000×600  500×500 to 600×600 512×512          500×500
Seg/Class               Both       Seg        Seg                  Seg                Seg              Class
Number of Cancer Types  1          8          2                    4                  1                1
Fig. 5: Sample cropped regions extracted from each of the five nuclear instance segmentation datasets used in our experiments. From left to right: Kumar [27]; CoNSeP; CPM-15; CPM-17 [30] and TNBC [31]. The different colours of nuclear contours highlight individual instances.

fully leverage all data and allows us to rigorously evaluate the segmentation capability of our model. In the same way as [27], we split the Kumar dataset into two different sub-datasets: (i) Kumar-Train, a training set with 16 image tiles (4 breast, 4 liver, 4 kidney and 4 prostate) and (ii) Kumar-Test, a test set with 14 image tiles (2 breast, 2 liver, 2 kidney, 2 prostate, 2 bladder, 2 colon and 2 stomach). Note, we utilise the exact same image split used by other recent approaches [27], [31], [29], but we do not separate the test set into two subsets. We do this to ensure that the test set is large enough, ensuring a reliable evaluation. For CoNSeP, we devise a suitable train and test set that contains 26 and 14 images respectively. The images within the test set were selected to ensure the true diversity of nuclei types within colorectal tissue is represented. For CPM-17, we utilise the same split that had been employed for the challenge, with 32 images in both the training and test datasets.

We compared our proposed model to recent segmentation approaches used in computer vision [21], [44], [32], medical imaging [22] and also to methods specifically tuned for the task of nuclear segmentation [25], [23], [31], [29], [30]. We also compared the performance of our model to two open source software applications: Cell Profiler [42] and QuPath [43]. Cell Profiler is a software package for cell-based analysis, with several suggested pipelines for computational pathology. The pipeline that we adopted applies a threshold to the greyscale image and then uses a series of post processing operations. QuPath is an open source software for digital pathology and whole slide image analysis.
To achieve nuclear segmentation, we used the default parameters within the application.

Fig. 6: Sample cropped regions extracted from the CoNSeP dataset, where the colour of each nuclear boundary denotes the category: malignant/dysplastic epithelium, normal epithelium, inflammatory, muscle, fibroblast, endothelial or miscellaneous.

FCN8, SegNet, U-Net, DCAN, Mask-RCNN and DIST were implemented by the authors of the paper (S.G, Q.D.V). For Mask-RCNN, we slightly modified the original implementation by using smaller anchor boxes. The default configuration is fine-tuned for natural images and therefore this modification was necessary to perform successful nuclear segmentation. DIST was implemented with the assistance of the first author of the corresponding approach in order to ensure reliability during evaluation. This also enabled us to utilise DIST for further comparison in our experiments. For Micro-Net, we used the same implementation that was described by [23] and was implemented by the first author of the corresponding paper (S.E.A.R). For CNN3 and CIA-Net, we report the results on the Kumar dataset that are given in their respective original papers. The authors of CIA-Net and DRAN provided their segmentation output, which meant that we were able to obtain all metrics on the datasets that the models were applied to. Therefore, we report results of CIA-Net on the Kumar dataset and results of DRAN on the CPM-17 dataset. Note, for all self-implemented approaches we are consistent with our pre-processing strategy. However, DRAN, CNN3 and CIA-Net results are directly taken from their respective papers and therefore we cannot guarantee the same pre-processing steps. CNN3 and CIA-Net also use stain normalisation, whereas other methods described in this paper do not.
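Several of the compared methods (U-Net, Micro-Net) rely on a weighted cross entropy loss to help separate touching nuclei. As a reference point, a per-pixel weighted binary cross entropy can be sketched as follows; the construction of the weight map (large values on the thin background ridges between touching nuclei) is left abstract here, since each paper defines its own scheme.

```python
import numpy as np

def weighted_bce(pred, target, weights, eps=1e-7):
    """Per-pixel weighted binary cross entropy.

    `weights` has the same shape as the image and typically takes large
    values on pixels separating touching nuclei, so that mistakes there
    are penalised more heavily than elsewhere.
    """
    p = np.clip(pred, eps, 1 - eps)  # avoid log(0)
    ce = -(target * np.log(p) + (1 - target) * np.log(1 - p))
    return float(np.mean(weights * ce))
```

Doubling the weight of a pixel doubles its contribution to the loss, which is the mechanism that pushes the network to keep adjacent instances apart.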
Comparative Results: Table III and the box plots in Fig. 8a and 8b show detailed results of this experiment. Within the box plots, we choose not to show AJI, due to its limitations as discussed in Section IV-A. A large variation in performance between methods within each dataset is observed. This variation is particularly evident in the Kumar and CoNSeP datasets, where there exists a large number of overlapping nuclei. Both Cell Profiler [42] and QuPath [43] achieve sub-optimal performance for all datasets. In particular, both software applications consistently achieve a low DICE score, suggesting that their inability to distinguish nuclear pixels from the background is a major limiting factor. FCN-based approaches improve the capability of models to detect nuclear pixels, yet often fail due to their inability to separate clustered instances. For example, despite a higher DICE score than Cell Profiler and QuPath, networks built only for semantic segmentation, like FCN8 and SegNet, suffer from low PQ values. Therefore, methods that incorporate strong instance-aware techniques are favourable. Within CPM-17, there are fewer overlapping nuclei, which explains why methods that are not instance-aware are still able to achieve a satisfactory performance. We observe that the weighted cross entropy loss that is used in both U-Net and Micro-Net can help to separate joined nuclei, but its success also depends on the capacity of the network. This is reflected by the increased performance of Micro-Net over U-Net.

DCAN is able to better distinguish between separate instances than FCN8, which uses a very similar encoder based on the VGG16 network. Therefore, incorporating additional information at the output of the network can improve the segmentation performance. This is also exemplified by the fairly strong performances of CNN3, DIST, DRAN and CIA-Net. In a different way, Mask-RCNN is able to successfully separate clustered nuclei by utilising a region proposal based approach.
However, Mask-RCNN is less effective than other methods at detecting nuclear pixels, which is reflected by a lower DICE score.

TABLE III: Comparative experiments on the Kumar [27], CoNSeP and CPM-17 [30] datasets. WS denotes watershed-based post processing.

                    Kumar                             CoNSeP                            CPM-17
Methods             DICE  AJI   DQ    SQ    PQ        DICE  AJI   DQ    SQ    PQ        DICE  AJI   DQ    SQ    PQ
Cell Profiler [42]  0.623 0.366 0.423 0.704 0.300     0.434 0.202 0.249 0.705 0.179     0.570 0.338 0.368 0.702 0.261
QuPath [43]         0.698 0.432 0.511 0.679 0.351     0.588 0.249 0.216 0.641 0.151     0.693 0.398 0.320 0.717 0.230
FCN8 [21]           0.797 0.281 0.434 0.714 0.312     0.756 0.123 0.239 0.682 0.163     0.840 0.397 0.575 0.750 0.435
FCN8 + WS [21]      0.797 0.429 0.590 0.719 0.425     0.758 0.226 0.320 0.676 0.217     0.840 0.397 0.575 0.750 0.435
SegNet [44]         0.811 0.377 0.545 0.742 0.407     0.796 0.194 0.371 0.727 0.270     0.857 0.491 0.679 0.778 0.531
SegNet + WS [44]    0.811 0.508 0.677 0.744 0.506     0.793 0.330 0.464 0.721 0.335     0.856 0.594 0.779 0.784 0.614
U-Net [22]          0.758 0.556 0.691 0.690 0.478     0.724 0.482 0.488 0.671 0.328     0.813 0.643 0.778 0.734 0.578
Mask-RCNN [32]      0.760 0.546 0.704 0.720 0.509     0.740 0.474 0.619 0.740 0.460     0.850 0.684 0.848 0.792 0.674
DCAN [25]           0.792 0.525 0.677 0.725 0.492     0.733 0.289 0.383 0.667 0.256     0.828 0.561 0.732 0.740 0.545
Micro-Net [23]      0.797 0.560 0.692 0.747 0.519     0.794 0.527 0.600 0.745 0.449     0.857 0.668 0.836 0.788 0.661
DIST [31]           0.789 0.559 0.601 0.732 0.443     0.804 0.502 0.544 0.728 0.398     0.826 0.616 0.663 0.754 0.504
CNN3 [27]           0.762 0.508 -     -     -         -     -     -     -     -         -     -     -     -     -
CIA-Net [29]        0.818

Due to the reasoning given in Section IV, we place a larger emphasis on PQ to determine the success of different models. In particular, we consistently obtain an improved performance over DIST, which justifies the use of our proposed horizontal and vertical maps as a regression target. We also report a better performance than the winners of the Computational Precision Medicine and MoNuSeg challenges [30], [29], that utilised the CPM-17 and Kumar datasets respectively. Therefore, HoVer-Net achieves state-of-the-art performance for nuclear instance segmentation compared to all competing methods on multiple datasets that consist of a variety of different tissue types.
Our approach also outperforms methods that were fine-tuned for the task of nuclear segmentation.
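Since PQ drives these comparisons, its decomposition into DQ × SQ can be made concrete. The minimal numpy sketch below computes all three quantities from a pairwise IoU matrix; the function name is ours, and it relies on the property that with a matching threshold above 0.5 each GT instance can match at most one prediction, so thresholding alone yields a unique matching.

```python
import numpy as np

def panoptic_quality(iou, match_thresh=0.5):
    """Compute (DQ, SQ, PQ) from a pairwise IoU matrix.

    `iou` has shape (n_gt, n_pred); entry (i, j) is the IoU between GT
    instance i and predicted instance j.
    """
    iou = np.asarray(iou, dtype=float)
    matched = iou > match_thresh           # unique TP matches for thresh > 0.5
    tp = int(matched.sum())
    fn = iou.shape[0] - tp                 # unmatched GT instances
    fp = iou.shape[1] - tp                 # unmatched predictions
    dq = tp / (tp + 0.5 * fp + 0.5 * fn) if (tp + fp + fn) else 1.0
    sq = iou[matched].mean() if tp else 0.0  # mean IoU over TP segments only
    return dq, sq, dq * sq
```

This makes explicit why SQ must be read together with DQ: it averages only over the true positive segments, so a method with few but precise matches can still post a high SQ.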
D. Generalisation Study
Experimental Setting: The goal of any automated method is to perform well on unseen data, with high accuracy. Therefore, we conducted a large scale study to assess how all methods generalise to new H&E stained images. To analyse the generalisation capability, we assessed the ability to segment nuclei from: i) new organs (variation in nuclei shapes) and ii) different centres (variation in staining).

The five instance segmentation datasets used within our experiments can be grouped into three groups according to their origin: TCGA (Kumar, CPM-15, CPM-17), TNBC and CoNSeP. We used Kumar as the training and validation set, due to its size and diversity, whilst the combined CPM (CPM-15 and CPM-17), TNBC and CoNSeP datasets were used as three independent test sets. We split the test sets in this way in accordance with their origin. Note, for this experiment we use both the training and test sets of CPM-17 and CoNSeP to form the independent test sets. Kumar was split into three subsets, as explained in Section V-A, and Kumar-Train was used to train all models, i.e. trained with samples originating from the following organs: breast; prostate; kidney and liver. Despite all samples being extracted from TCGA, CPM samples come from the brain, head & neck and lung regions. Therefore, testing with CPM reflects the ability of the model to generalise to new organs, as mentioned above by the first generalisation criterion. TNBC contains samples from an already seen organ (breast), but the data is extracted from an independent source with different specimen preservation and staining practice. Therefore, this reflects the second generalisation criterion. CoNSeP contains samples taken from colorectal tissue, which is not represented in Kumar-Train, and is also extracted from a source independent of TCGA. Therefore, this reflects both the first and second generalisation criteria.
Also, as mentioned in Section V-A, CoNSeP contains challenging samples, where there exist various artifacts and there is variation in the quality of slide preparation. Therefore, the performance on this dataset also reflects the ability of a model to generalise to difficult samples.
Comparative Results: The results are reported in Table IV, where we only display the results of methods that employ an instance-based technique. We observe that our proposed model is able to successfully generalise to unseen data in all three cases. However, some methods prove to perform poorly with unseen data, where in particular, U-Net and DIST perform worse than other competing methods on all three datasets. Both SegNet with watershed and Mask-RCNN achieve a competitive performance across all three generalisation tests. However, similar to the results reported in Table III, Mask-RCNN is not able to distinguish nuclear pixels from the background as well as other competing methods, which has an adverse effect on the overall segmentation performance shown by PQ. On the other hand, SegNet proves to successfully detect nuclear pixels, reporting a greater DICE score than HoVer-Net on both the TNBC and CoNSeP datasets. However, the overall segmentation result for HoVer-Net is superior because it is better able to separate nuclear instances by incorporating the horizontal and vertical maps at the output of the network.

Fig. 7: Example visual results on the CPM-17, Kumar and CoNSeP datasets. For each dataset, we display the 4 models that achieve the highest PQ score from left to right. The different colours of the nuclear boundaries denote separate instances.

Fig. 8: Box plots highlighting the performance of competing methods on the (a) Kumar and (b) CoNSeP datasets.
E. Comparative Analysis of Classification Methods
Experimental Setting: We converted the top four performing nuclear instance segmentation algorithms, based on their panoptic quality on the CoNSeP dataset, such that they were able to perform simultaneous instance segmentation and classification. As mentioned in Section V-A, the nuclear categories that we use in our experiments are: miscellaneous, inflammatory, epithelial and spindle-shaped. Specifically, we compared HoVer-Net with Micro-Net, Mask-RCNN and DIST. For Micro-Net, we used an output depth of 5 rather than 2, where each channel gave the probability of a pixel being either background, miscellaneous, inflammatory, epithelial or spindle-shaped. For Mask-RCNN, there is a devoted classification branch that predicts the class of each instance and is therefore well suited to a multi-class setting. DIST performs regression at the output of the network and therefore converting the model such that it is able to classify nuclei into multiple categories is non-trivial. Instead, we add an extra 1×1 convolution at the output of the network.

TABLE IV: Comparative results, highlighting the generalisation capability of different models. All models are initially trained on Kumar and then the combined CPM [30], TNBC [31] and CoNSeP datasets are processed.

                   Combined CPM                      TNBC                              All CoNSeP
Methods            DICE  AJI   DQ    SQ    PQ        DICE  AJI   DQ    SQ    PQ        DICE  AJI   DQ    SQ    PQ
FCN8 + WS [21]     0.762 0.531 0.669 0.722 0.487     0.726 0.506 0.662 0.723 0.480     0.609 0.247 0.345 0.688 0.240
SegNet + WS [44]   0.791 0.583 0.738 0.755 0.561     0.758 0.559 0.734 0.750 0.554     0.681 0.315 0.449 0.733 0.332
U-Net [22]         0.720 0.541 0.652 0.672 0.446     0.681 0.514 0.635 0.676 0.442     0.585 0.363 0.442 0.670 0.297
Mask-RCNN [32]     0.764 0.575 0.760 0.719 0.549     0.705 0.529 0.726 0.742 0.543     0.606 0.348 0.492 0.720 0.357
DCAN [25]          0.770 0.582 0.716 0.730 0.528     0.725 0.537 0.683 0.720 0.495     0.609 0.306 0.403 0.685 0.278
Micro-Net [23]     0.792 0.615 0.716 0.751 0.542     0.701 0.531 0.656 0.753 0.497     0.644 0.394 0.489 0.722 0.356
DIST [31]          0.775 0.563 0.593 0.720 0.432     0.719 0.523 0.549 0.714 0.404     0.621 0.369 0.379 0.701 0.268
HoVer-Net

Comparative Results: We trained our models on the training set of the CoNSeP dataset and then we evaluated the model on both the test set of CoNSeP and also the entire CRCHisto dataset. Table V displays the results of the multi-class models on the CoNSeP and CRCHisto datasets respectively, where the given metrics are described in Section IV-B. For CoNSeP, along with the classification metrics, we provide PQ as an indication of the quality of instance segmentation. However, in CRCHisto, only the nuclear centroids are given and therefore, we exclude PQ from the CRCHisto evaluation because it cannot be calculated without the instance segmentation masks. We observe that HoVer-Net achieves a good quality simultaneous instance segmentation and classification, compared to competing methods. It must be noted that we should expect a lower F score for the miscellaneous class because there are significantly fewer nuclei represented. Also, there is a high diversity of nuclei types that have been grouped within this class, belonging to: mitotic; necrotic and cells that are uncategorisable.
Despite this, HoVer-Net is able to achieve a satisfactory performance on this class, where other methods fail. Furthermore, compared to other methods, our approach achieves the best F score for the epithelial, inflammatory and spindle-shaped classes. Therefore, due to HoVer-Net obtaining a strong performance for both nuclear segmentation and classification, we suggest that our model may be used for sophisticated subsequent cell-level downstream analysis in computational pathology.

VI. DISCUSSION AND CONCLUSIONS
Analysis of nuclei in large-scale histopathology images is an important step towards automated downstream analysis for diagnosis and prognosis of cancer. Nuclear features have often been used to assess the degree of malignancy [45]. However, visual analysis of nuclei is a very time consuming task because there are often tens of thousands of nuclei within a given whole-slide image (WSI). Performing simultaneous nuclear instance segmentation and classification enables subsequent exploration of the role that nuclear features play in predicting clinical outcome. For example, [4] utilised nuclear features from histology TMA cores to predict survival in early-stage estrogen receptor-positive breast cancer. Restricting the analysis to some specific nuclear types only may be advantageous for accurate analysis in computational pathology.

In this paper, we have proposed HoVer-Net for simultaneous segmentation and classification of nuclei within multi-tissue histology images that not only detects nuclei with high accuracy, but also effectively separates clustered nuclei. Our approach has three up-sampling branches: 1) the nuclear pixel branch that separates nuclear pixels from the background; 2) the HoVer branch that regresses the horizontal and vertical distances of nuclear pixels to their centres of mass and 3) the nuclear classification branch that determines the type of each nucleus. We have shown that the proposed approach achieves state-of-the-art instance segmentation performance compared to a large number of recently published deep learning models across multiple datasets, including tissues that have been prepared and stained under different conditions. This makes the proposed approach likely to translate well to a practical setting due to its strong generalisation capacity, and it can therefore be effectively used as a prerequisite step before nuclear-based feature extraction.
We have shown that utilising the horizontal and vertical distances of nuclear pixels to their centres of mass provides powerful instance-rich information, leading to state-of-the-art performance in histological nuclear segmentation. When the classification labels are available, we show that our model is able to successfully segment and classify nuclei with high accuracy.

Region proposal (RP) methods, such as Mask-RCNN, show great potential in dealing with overlapping instances because there is no notion of separating instances; instead nuclei are segmented independently. However, a major limitation of the RP methods is the difficulty in merging instance predictions between neighbouring tiles during processing. For example, if a sub-segment of a nucleus at the boundary is assigned a label, one must ensure that the remainder of the nucleus in the neighbouring tile is also assigned the same label. To overcome this difficulty, for Mask-RCNN, we utilised an overlapping tile mechanism such that we only considered non-boundary nuclei.

TABLE V: Comparative results for nuclear classification on the CoNSeP and CRCHisto datasets. F_d denotes the F score for nuclear detection, whereas F_c^e, F_c^i, F_c^s and F_c^m denote the F classification scores for the epithelial, inflammatory, spindle-shaped and miscellaneous classes respectively.

                 CoNSeP                                   CRCHisto
Methods          PQ    F_d   F_c^e F_c^i F_c^s F_c^m      F_d   F_c^e F_c^i F_c^s F_c^m
SC-CNN [35]      -     0.608 0.306 0.193 0.175 0.000      0.664 0.246 0.111 0.126 0.000
DIST [31]        0.372 0.712 0.617 0.534 0.505 0.000      0.616 0.464 0.514 0.275 0.000
Micro-Net [23]   0.430 0.743 0.615 0.592 0.532 0.117      0.638 0.422 0.518 0.249 0.059
Mask-RCNN [32]   0.450 0.692 0.595 0.590 0.520 0.098      0.639
Regarding the processing time, the average time to process a 1,000×1,000 image tile was considerably longer for Mask-RCNN than for HoVer-Net. On the other hand, the average processing time for DIST and Micro-Net was 0.600 and 0.832 seconds respectively. Mask-RCNN inherently stores a single instance per channel, which leads to very large arrays in memory when there are many nuclei in a single image patch, and this also contributes to the much longer processing time seen above. Overall, FCN methods seem to better translate to WSI processing compared to Mask-RCNN or RPN methods in general. It must be stressed that the timing is not exact and is dependent on hardware specifications and software implementation. With optimised code and sophisticated hardware, we expect these timings to be considerably different. Additionally, the inference time is also dependent on the size of the output. In particular, with a smaller output size, a smaller stride is also required during processing. For instance, if we used padded convolution in the up-sampling branches of HoVer-Net, then we observe a 5.6× speed up and the average processing time is 1.97 seconds per 1,000×1,000 tile. Finally, we note a lower F classification score for the miscellaneous category in the classification model because there are significantly fewer samples within this category and there exists high intra-class variability. Future work will involve obtaining more samples within this category, including necrotic and mitotic nuclei, to improve the class balance of the data.

ACKNOWLEDGMENTS
This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (No. 2016R1C1B2012433) and by the Ministry of Science and ICT (MSIT) (No. 2018K1A3A1A74065728). This work was also supported in part by the UK Medical Research Council (No. MR/P015476/1). NR is part of the PathLAKE digital pathology consortium, which is funded from the Data to Early Diagnosis and Precision Medicine strand of the government's Industrial Strategy Challenge Fund, managed and delivered by UK Research and Innovation (UKRI). We also acknowledge the financial support from the Engineering and Physical Sciences Research Council and Medical Research Council, provided as part of the Mathematics for Real-World Systems CDT. We thank Peter Naylor for his assistance in the implementation of the DIST network.

APPENDIX

A. ABLATION STUDIES
To gain a full understanding of the contribution of our method, we investigated several of its components. Specifically, we performed the following ablation experiments: (i) contribution of the proposed loss strategy; (ii) the Sobel-based post processing technique compared to other strategies and (iii) contribution of the dedicated classification branch. Here, we utilised the Kumar and CoNSeP datasets for (i) and (ii) due to the large number of nuclei present, whereas for (iii) we use CoNSeP and CRCHisto because we do not have the classification labels for Kumar.

Loss Terms: We conducted an experiment to understand the contribution of our proposed loss strategy. First, we used the mean squared error (MSE) of the horizontal and vertical distances L_a as the loss function of the HoVer branch and the binary cross entropy (BCE) loss L_c as the loss function for the NP branch. We refer to this combination as the standard strategy because MSE and BCE are the two most commonly used loss functions for regression and binary classification tasks respectively. Next, we introduced the MSE of the horizontal and vertical gradients L_b to the HoVer branch and the dice loss L_d to the NP branch. The intuition behind our novel L_b is that it enforces the correct structure of the horizontal and vertical map predictions and therefore helps to correctly separate neighbouring instances. The dice loss was introduced because it can help the network to better distinguish between background and nuclear pixels and is particularly useful when there is a class imbalance. We present the results in Table A1, where we observe an increase in all performance measures for our proposed multi-term loss strategy. Therefore, the additional loss terms boost the network's ability to differentiate between nuclear and background pixels (DICE) and separate individual nuclei (DQ and PQ).
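Under the naming above, the four loss terms can be sketched with numpy as follows. This is only an illustration of each term: it ignores any restriction of the HoVer losses to nuclear pixels and other implementation details of the training code, and `np.gradient` stands in for whatever gradient operator the network uses.

```python
import numpy as np

def hover_losses(hv_pred, hv_true, np_pred, np_true, eps=1e-7):
    """Illustrative versions of L_a, L_b (HoVer branch) and L_c, L_d (NP branch).

    hv_*: (H, W, 2) horizontal/vertical distance maps.
    np_*: (H, W) nuclear-pixel probabilities / binary ground truth.
    """
    # L_a: mean squared error of the distance maps themselves
    l_a = np.mean((hv_pred - hv_true) ** 2)

    # L_b: mean squared error of their spatial gradients, penalising
    # incorrect structure where neighbouring instances meet
    gx_p = np.gradient(hv_pred[..., 0], axis=1)
    gy_p = np.gradient(hv_pred[..., 1], axis=0)
    gx_t = np.gradient(hv_true[..., 0], axis=1)
    gy_t = np.gradient(hv_true[..., 1], axis=0)
    l_b = np.mean((gx_p - gx_t) ** 2) + np.mean((gy_p - gy_t) ** 2)

    # L_c: binary cross entropy of the nuclear-pixel branch
    p = np.clip(np_pred, eps, 1 - eps)
    l_c = -np.mean(np_true * np.log(p) + (1 - np_true) * np.log(1 - p))

    # L_d: dice loss, robust to the nuclei/background class imbalance
    inter = np.sum(np_pred * np_true)
    l_d = 1 - (2 * inter + eps) / (np.sum(np_pred) + np.sum(np_true) + eps)

    return l_a, l_b, l_c, l_d
```

A perfect prediction drives all four terms to (approximately) zero; L_b is the only term that reacts specifically to the shape of the distance maps rather than their pointwise values.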
In particular, there is a significant boost in the SQ for both Kumar and CoNSeP, which suggests that our proposed loss function L_b is necessary to precisely determine where nuclei should be split.

Post Processing: Usually, markers obtained from applying a threshold to an energy landscape (such as the distance map) are enough to provide a competitive input for watershed, as seen with DIST in Table III. Although HoVer-Net is not directly built upon an energy landscape, we devised a Sobel-based method to derive both the energy landscape and the markers. To compare with other methods, we implemented two further techniques for obtaining the energy landscape and the markers. We then exhaustively compared all energy landscape and marker combinations to assess which post processing strategy is the best. We start by linking HoVer to the distance map by calculating the squared sum χ² + φ², which can be seen as the distance from a pixel to its nearest nuclear centroid. In other words, this is a pseudo distance map. Additionally, the χ and φ values can be interpreted as Cartesian coordinates with each nuclear centroid as the origin. By thresholding the values between a certain range, we can obtain the markers. The results of all combinations are shown in Table A2. Note, our gradient-based post processing technique is specifically designed for the HoVer branch output.

Classification Branch: In order to assess the importance of a devoted branch for concurrent nuclear segmentation and classification, we compared the proposed three branch setup of HoVer-Net to a two branch setup. Here, the two branch setup extends the NP branch to a multi-class setting, by predicting each nuclear type at the output. Then, to obtain the binary mask, the positive channels are combined together after nuclear type prediction. Utilising three branches decouples the tasks of nuclear classification and nuclear detection, where a separate branch is devoted to each task.
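The gradient-based post processing described above can be sketched in simplified form: the horizontal and vertical maps jump in value where neighbouring instances meet, so their derivatives expose instance boundaries. In the sketch below, `np.gradient` stands in for the Sobel operator used by the authors, `edge_thresh` is an assumed cut-off rather than a value from the paper, and the final marker-controlled watershed step is omitted.

```python
import numpy as np

def instance_markers(h_map, v_map, np_mask, edge_thresh=0.4):
    """Simplified gradient-based marker/energy extraction.

    h_map / v_map: predicted horizontal and vertical distance maps
    (values roughly in [-1, 1]); np_mask: binary nuclear-pixel mask.
    """
    # Derivatives are largest at instance boundaries, where the maps
    # jump between the coordinate frames of adjacent nuclei.
    dh = np.abs(np.gradient(h_map, axis=1))
    dv = np.abs(np.gradient(v_map, axis=0))
    edges = np.maximum(dh, dv)

    # Energy landscape: low inside nuclei, high at instance boundaries.
    # Markers: nuclear pixels away from any predicted boundary; these
    # would then seed a marker-controlled watershed on `edges`.
    energy = edges * np_mask
    markers = np_mask * (edges < edge_thresh)
    return energy, markers
```

Two touching nuclei then yield two disconnected marker regions, which is exactly what the subsequent watershed needs to keep them apart.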
For this ablation study,we train on the CoNSeP training set and then process both theCoNSeP test set and the entire CRCHisto dataset.We report results in Table A3, where we observe thatutilising a separate branch devoted to the task of nuclearclassification leads to an improved overall performance ofsimultaneous nuclear instance segmentation and classificationin both the CoNSeP and CRCHisto datasets. We can see thatif the classification takes place at the output of NP branch,then the network’s ability to determine the nuclear type iscompromised. This is because the task of nuclear classificationis challenging and therefore the network benefits from theintroduction of a branch dedicated to the task of classification.R
TABLE A1: Ablation study highlighting the contribution of the proposed loss strategy.

                         Kumar                           CoNSeP
Strategy        DICE   AJI    DQ     SQ     PQ      DICE   AJI    DQ     SQ     PQ
Standard Loss   0.823  0.750  0.771  0.581  0.608   0.846  0.685  0.774  0.532  0.557
Proposed Loss
TABLE A2: Ablation study for post processing techniques: Sobel-based versus thresholding to obtain the markers, and Sobel-based versus naive conversion to obtain the energy landscape.

                                Kumar                           CoNSeP
Energy    Markers      DICE   AJI    DQ     SQ     PQ      DICE   AJI    DQ     SQ     PQ
χ² + ϕ²   Threshold    0.825  0.597  0.705  0.764  0.541   0.850  0.543  0.602  0.761  0.459
χ² + ϕ²   Sobel        0.826  0.613  0.766  0.768  0.591   0.853  0.561  0.694  0.770  0.535
Sobel     Threshold    0.825  0.614  0.715  0.772  0.554   0.850  0.566  0.617  0.775  0.479
Sobel     Sobel
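The χ² + ϕ² energy with threshold-based markers compared in Table A2 can be sketched as follows. This is a hedged illustration: the function name and the threshold value `hi` are assumptions, not values taken from the paper.

```python
import numpy as np
from scipy import ndimage as ndi

def pseudo_distance_markers(h_map, v_map, fg_mask, hi=0.3):
    """Marker extraction from the pseudo distance map chi^2 + phi^2.

    h_map, v_map : HoVer predictions in [-1, 1]
    fg_mask      : boolean nuclei-vs-background mask
    hi           : assumed upper threshold on the energy (hypothetical)
    """
    # chi^2 + phi^2 is close to 0 at each nuclear centroid and grows
    # towards 1 at the nuclear boundary, i.e. a (squared) distance map
    # measured from the nearest centroid.
    energy = h_map ** 2 + v_map ** 2
    # Pixels near a centroid, i.e. with low energy, become the markers.
    markers = fg_mask & (energy <= hi)
    labels, n = ndi.label(markers)
    return energy, labels, n
```

The resulting energy landscape and markers can then be fed to a standard marker-controlled watershed, as in the other combinations of Table A2.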
TABLE A3: Ablation study showing the contribution of the classification branch in HoVer-Net on the CoNSeP dataset. F_d denotes the F score for nuclear detection, whereas F_ec, F_ic, F_sc and F_mc denote the F classification scores for the epithelial, inflammatory, spindle-shaped and miscellaneous classes respectively.

                      CoNSeP                                   CRCHisto
Branches      PQ     F_d    F_ec   F_ic   F_sc   F_mc    F_d    F_ec   F_ic   F_sc   F_mc
NP & HoVer    0.499  0.736