[PDF] An Approach for Clustering Subjects According to Similarities in Cell Distributions within Biopsies

Abstract

In this paper, we introduce a novel and interpretable methodology to cluster subjects suffering from cancer, based on features extracted from their biopsies. Contrary to existing approaches, we propose here to capture complex patterns in the repartitions of their cells using histograms, and compare subjects on the basis of these repartitions. We describe here our complete workflow, including creation of the database, cells segmentation and phenotyping, computation of complex features, choice of a distance function between features, clustering between subjects using that distance, and survival analysis of obtained clusters. We illustrate our approach on a database of hematoxylin and eosin (H&E)-stained tissues of subjects suffering from Stage I lung adenocarcinoma, where our results match existing knowledge in prognosis estimation with high confidence.

Full PDF

AAn Approach for Clustering Subjects According toSimilarities in Cell Distributions within Biopsies

Yassine El Ouahidi, Matis Feller, Matthieu Talagas, Bastien Pasdeloup

Abstract —In this paper, we introduce a novel and interpretablemethodology to cluster subjects suffering from cancer, based on fea-tures extracted from their biopsies. Contrary to existing approaches,we propose here to capture complex patterns in the repartitions oftheir cells using histograms, and compare subjects on the basis ofthese repartitions. We describe here our complete workﬂow, includingcreation of the database, cells segmentation and phenotyping, com-putation of complex features, choice of a distance function betweenfeatures, clustering between subjects using that distance, and survivalanalysis of obtained clusters. We illustrate our approach on a databaseof hematoxylin and eosin (H&E)-stained tissues of subjects sufferingfrom Stage I lung adenocarcinoma, where our results match existingknowledge in prognosis estimation with high conﬁdence.

Keywords —Cancer, clustering, histograms, survival analysis,Wasserstein distance.

I. I

NTRODUCTION P ERSONALIZED oncololgy aims at improving the sur-vival outcome of a subject, by understanding deeplytheir disease using attributes that are speciﬁc to their ownmetabolism. Among techniques that analyze biopsies of tu-mors, most works characterize subjects based on quantitativeattributes, such as the number of tumor-inﬂtrating lymphocytes(TILs) [1], their density and area [2], or the immunoscore[3]. However, it has been recently shown that more globalorganizations of cells – and in particular TILs – can have astrong impact on the survival prognosis [4], [5], [6].In this work, we introduce a novel methodology to capturein details complex repartitions of cells within the tissues. Ourapproach can be used at various scales (cell level, clusters ofcells level, etc.), thus providing numerous and complementaryinformation regarding the tissue.Contrary to existing approaches such as [6], we do notclassify the subjects in pre-deﬁned classes. Instead, we proposeto work in a non-supervised way, to group subjects based ontheir similarities regarding the deﬁned features. This allows 1)exploration of new features that may correlate with survival;and 2) research of similar subjects for a newly seen person.Our work is organized as follows: in Section II, we givean overview of our method and its inputs; in Section III, wedeﬁne some features that capture particular organizations ofcells; then, in Section IV, we explain which distance functionwe use, as well as our choice of clustering algorithm; ﬁnally,

Y. El Ouahidi, M. Feller and B. Pasdeloup are with IMT At-lantique, Lab-STICC, UMR CNRS 6285, Brest F-29238, France. E-mail: [email protected], [email protected],[email protected]. Talagas is with the University of Brest, LIEN, F-29200 Brest, France;and the Department of Dermatology, Brest University Hospital, Brest, France.E-mail: [email protected]. in Section V, we illustrate our methodology on a databaseof H&E-stained biopsies of subjects suffering from Stage Ilung adenocarcinoma, and show that our results match existingknowledge in the ﬁeld.II. O

VERVIEW OF THE METHODOLOGY

Biopsies are provided in the form of high-resolution scansof a tissue, which have been previously stained to reveal cellkernels. H&E has been the most commonly used staining foryears due to an easy and cheap acquisition of these images.Still, more complex techniques such as multiplex immuno-histochemistry (mIHC) have been developed, which allowcapturing richer information on the cells such as distinctionof CD8/CD4/CD3 lymphocytes, albeit at a higher cost.Once a high-resolution image has been acquired, a commonpractice is to perform its segmentation and phenotyping, in or-der to localize and characterze all cells. For H&E images, toolsusing deep learning models [7] have recently shown impressiveperformance in ﬁnding cells and associating them with aphenotype (cancer/stroma/lymphocyte). Proprietary tools existfor mIHC [8], as well as more recent solutions, here againinvolving deep learning models [9]. All of these tools outputsimilar information. For each cell kernel found, they provide atleast the following attributes: x coordinate, y coordinate, and phenotype . Available phenotypes depend on the stainingtechnique, and more attributes can be output depending onthe tools ( e.g. , Keratin, PDL1, etc.).Our methodology takes as an input a set S = { s ( i ) } i ∈{ ,..., |S|} of subjects. For each subject s ∈ S , we havea set C s = { c ( i ) } i ∈{ ,..., |C s |} of cells. Each cell c ∈ C s has agiven number of attributes, which we can access as c [ a ] , with a ∈ { x , y , phenotype , . . . } .From there, our method proceeds as follows:1) Let H be the set of histograms. Design a set F = { f ( i ) : S → H} i ∈{ ,..., |F|} of features of interest, which arefunctions that associate a subject s ∈ S with a histogram H ( i ) s ∈ H . Examples of features are given in Section III;2) Evaluate all features on all subjects to obtain histograms { H ( i ) s } s ∈S ,i ∈{ ,..., |F|} .3) For each feature f ( i ) ∈ F , and each two distinct subjects s ( j ) , s ( k ) ∈ S , compute the |S| × |S| distance matrix D ( i ) [ j, k ] = d H (cid:16) H ( i ) s ( j ) , H ( i ) s ( k ) (cid:17) . Details on the distancefunction d H are given in Section IV;4) For each matrix D ( i ) , feed it to a clustering algorithm,that will output a partition of S based on their similaritywith respect to feature f ( i ) . Details on the algorithm aregiven in Section IV; a r X i v : . [ q - b i o . Q M ] J u l ) Perform survival analysis of the sub-populations foundthis way to evaluate signiﬁcance of the impact of f ( i ) on survival prognosis.Alternatively, step 3) can be changed to ﬁrst compute amatrix D which aggregates the various D ( i ) associated withindividual features. This allows merging information capturedby the features into one global matrix, which can be used as aninput for the clustering algorithm. Details on the aggregationare given in Section IV-C.III. P ROPOSED FEATURES

In this section, we propose ﬁve features that capture someinteresting cell repartitions within the tissues . Each featureis a function f ( i ) : S → H that inputs a subject s ∈ S with cells C s , and outputs a histogram H ( i ) s ∈ H . In thefollowing, function values to hist ( V ) transforms a set V ofvalues into a histogram which associates each value v ∈ V with its number of occurrences in V .In order to simplify features deﬁnition, we ﬁrst introduce anotion of ﬁltering. Let C − s ⊆ C s be any subset of the cells C s of subject s ∈ S . We note C − s [ P ] ⊆ C − s the subset of cells in C − s for which a proposition P : C − s → B is true.As an example, remember from Section II that cellsare given some attributes, including their phenotypes. As-sume here the attribute phenotype can take values in { "cancer" , "stroma" , "lymphocyte" } , we can deﬁnethe following ﬁlter: lymph : C − s → B c (cid:55)→ c [ phenotype ] = "lymphocyte" . (1)Using this ﬁlter, we can get all lymphocytes in C s as C s [ lymph ] . Similarly, we can easily deﬁne ﬁlters stroma and cancer that return cells of corresponding phenotypes. A. f (1) : distances lymphocytes – cancer cells Let us consider a subject s ∈ S with cells C s . In a ﬁrstfeature, we want to capture the proximity between lympho-cytes and cancer cells. Each computed value will then be theminimum distance between a lymphocyte and all cancer cells.To simplify this deﬁnition, we introduce a ﬁlter that, givena reference cell c ∈ C s , returns the cell within the set C − s ⊆ C s being ﬁltered which minimizes Euclidean distance to c : closest ( c ) : C − s → B c (cid:48) (cid:55)→ c (cid:48) = arg min c (cid:48)(cid:48) ∈C − s d xy ( c (cid:48)(cid:48) , c ) , (2)where: d xy ( c, c (cid:48) ) = (cid:112) ( c (cid:48) [ x ] − c [ x ]) + ( c (cid:48) [ y ] − c [ y ]) . (3)Using this ﬁlter, Algorithm 1 details computation of H (1) s .Informally, C s [ cancer ][ closest ( c )] can be translated as thecancer cell which is the closest to the given cell c . Codes for all subsequent functions will be made available upon acceptance.

Algorithm 1 f (1) ( s ) V := {} for all c ∈ C s [ lymph ] do d := d xy ( c, C s [ cancer ][ closest ( c )]) V := V ∪ { d } end for H (1) s := values to hist ( V ) return H (1) s B. f (2) : distances lymphocytes – cancer/stroma interface In a second feature, we want to capture the proximitybetween lymphocytes and the cancer/stroma interface. Eachcomputed value will then be the minimum distance betweena lymphocyte and a cancer cell at that interface.To simplify the deﬁnition of this feature we again introducea new ﬁlter, that formalizes the notion of interface betweencell types. Let C − s ⊆ C s and C − (cid:48) s ⊆ C s be two non-intersectingsubsets of the cells. Additionally, let T s be the set of trianglesobtained by the Delaunay triangulation [10] of the cells in C s ,built from the x and y cell attributes (see Figure 1 for anexample of this triangulation). Each triangle t ∈ T s is the setof three cells that deﬁne it. We can deﬁne the ﬁlter: inter ( C − (cid:48) s ) : C − s → B c (cid:55)→ ∃ t ∈ T s : c ∈ t and ∃ c (cid:48) ∈ t : c (cid:48) ∈ C − (cid:48) s , (4)which returns the subset of cells in C − s that are spatially closeto cells in C − (cid:48) s , by checking existence of a triangle deﬁned bycells of both subsets. This allows an easy deﬁnition of ﬁltersto obtain cells located at the border between two cell types,such as cancer and stroma for instance. Stroma cells at thatinterface are thus C s [ stroma ][ inter ( C s [ cancer ])] and cancerones C s [ cancer ][ inter ( C s [ stroma ])] .Using this ﬁlter, Algorithm 2 details computation of H (2) s .It is worth noting that we distinguish between lymphocytesthat are within a cancer environment and those that are in astroma environment by checking on which side of the interfacethey are located. The former will be associated with a negativedistance in the histogram. Algorithm 2 f (2) ( s ) V := {} for all c ∈ C s [ lymph ] do d c := d xy ( c, C s [ cancer ][ inter ( C s [ stroma ])][ closest ( c )]) d s := d xy ( c, C s [ stroma ][ inter ( C s [ cancer ])][ closest ( c )]) if d c < d s then d c := − d c end if V := V ∪ { d c } end for H (2) s := values to hist ( V ) return H (2) s . f (3) : distances between aggregates of lymphocytes This third feature catpures the proximity between denseaggregates of cells. Each computed value will be the minimumdistance between an aggregate of lymphocytes and the otheraggregates of such cells.Once again, we simplify the deﬁnition of the feature byintroducing a ﬁlter that will only keep cells that belong tothe same aggregate. To do so, we again use the Delaunaytriangulation of the cells. Let E s be the set of edges in theDelaunay triangulation. Then, let C − s ⊆ C s be a subset of thecells, and let E − s ⊆ E s be the subset of edges that connecttwo cells in C − s . The graph G s made of vertices C − s and edges E − s consists of disjoint connected components, which we note {G ( i ) s } i ∈{ ,..., |G s |} . We now deﬁne the following ﬁlter: cc ( i ) : C − s → B c (cid:55)→ c ∈ G ( i ) s . (5)Using this ﬁlter, Algorithm 3 details computation of H (3) s .Informally, C s [ lymph ][ cc ( j )][ closest ( c )] can be translated as the cell in the j th aggregate of lymphocytes which is the closestto the given cell c . Algorithm 3 f (3) ( s ) V := {} for all i do d i := ∞ for all j , j (cid:54) = i do d ij := ∞ for all c ∈ C s [ lymph ][ cc ( i )] do d ij := min( d ij , d xy ( c, C s [ lymph ][ cc ( j )][ closest ( c )])) end for d i := min( d i , d ij ) end for V := V ∪ { d i } end for H (3) s := values to hist ( V ) return H (3) s D. f (4) : sizes of the aggregates of lymphocytes This fourth feature catpures the variety of sizes amongaggregates of lymphocytes. Each computed value will be thenumber of lymphocytes in an aggregate of such cells.Algorithm 4 details computation of H (4) s . Algorithm 4 f (4) ( s ) V := {} for all i do n i := |C s [ lymph ][ cc ( i )] | V := V ∪ { n i } end for H (4) s := values to hist ( V ) return H (4) s E. f (5) : densities of lymphocytes at cancer/stroma interface Finally, this ﬁfth feature catpures the densities of lym-phocytes in bands around the cancer/stroma interface. Eachcomputed value will be the density of lymphocytes among allcells located in a certain interval of distance from interfacecells. As for feature f (2) , we make the distinction betweenbands within the cancer environment and those within thestroma environment, by associating negative bins with theformer. We have chosen to consider bands of µm to includemultiple cells (as kernel size of the considered tissues isapproximately − µm ).Algorithm 5 details computation of H (5) s . Algorithm 5 f (5) ( s ) band width := 20 nb cells := {} nb lymph := {} for all c ∈ C s do d c := d xy ( c, C s [ cancer ][ inter ( C s [ stroma ])][ closest ( c )]) d s := d xy ( c, C s [ stroma ][ inter ( C s [ cancer ])][ closest ( c )]) if d c < d s then d c := − d c end if band := d c //band width if band (cid:54)∈ nb cells then nb cells [ band ] := 0 nb lymph [ band ] := 0 end if nb cells [ band ] := nb cells [ band ] + 1 if c [ phenotype ] = "lymphocyte" then nb lymph [ band ] := nb lymph [ band ] + 1 end if H (5) s := {} for all band ∈ nb cells do H (5) s [ band ] := nb lymph [ band ] nb cells [ band ] end forend forreturn H (5) s IV. D

ISTANCES BETWEEN FEATURES , AND CLUSTERING

A. A distance function between histograms

In Section III, we have introduced a few features thatcapture various repartitions of particular types of cells. Eachof these features takes the form of a histogram, which is aricher representation than simple numerical features.The idea of what comes next is to cluster subjects basedon the repartitions of cells captured by these histograms. Thisimplies the necessity to choose a distance function d H betweenhistograms that complies with the captured information, i.e. ,for two histograms H ( i ) s and H ( i ) s (cid:48) – obtained by evaluatingfeature f ( i ) on two subjects s, s (cid:48) ∈ S – we want the distance d H ( H ( i ) s , H ( i ) s (cid:48) ) to be low when both histograms have similarshapes around the same values; and to increase as their shapesor locations differ. good candidate is therefore the Wasserstein distancebetween distributions (see e.g. , [11]). This distance measuresthe quantity of work required to transform a distributioninto another one. More formally, given two (positive, unit-sum) histograms H and H , it is obtained by solving anoptimization problem deﬁned as: d wass ( H , H ) = min T (cid:88) h ,h T [ h , h ] C [ h , h ] s.t. T1 = H ; T (cid:62) = H ; T ≥ , (6)where C is the matrix that deﬁnes the cost to move a unit ofmass from the h th bin of distribution H to the h th bin ofdistribution H . A solution to the problem is a matrix T ofoptimal transport between histogram H and H . This problemis of complexity O ( n ) . However, solvers are very efﬁcientand make its resolution very tractable [12].As this distance is deﬁned on distributions, we must there-fore normalize our histograms so that the sum of values equals . A direct consequence of this is that we cannot distinguishsome histograms anymore. A simple example is – on one side– a histogram consisting of a unique bin at with value ;and – on the other side – a histogram consisting of a uniquebin at with value . After normalization, both distributionsfeature a unique bin at of value .At ﬁrst glance, this may seem problematic, as some featuressuch as f (1) enforce a direct relation between the numberof lymphocytes in the tissue and the number of histogramentries. However, such quantitative aspects can be included asnumerical features during classiﬁcation if one wants to includethe total number of lymphocytes as a feature. Additionally, thisnormalization does not change the shape of the histogram,which allows us to compare – taking again the example of f (1) – the repartitions of lymphocytes around cancer cells.In the remaining of this document, we will therefore useWasserstein distance d wass between our histograms for d H ,assuming both histograms H ( i ) s and H ( i ) s (cid:48) – obtained by evalu-ating feature f ( i ) on two subjects s, s (cid:48) ∈ S – have previouslybeen normalized, i.e. , d H (cid:16) H ( i ) s , H ( i ) s (cid:48) (cid:17) = d wass  H ( i ) s (cid:12)(cid:12)(cid:12) H ( i ) s (cid:12)(cid:12)(cid:12) , H ( i ) s (cid:48) (cid:12)(cid:12)(cid:12) H ( i ) s (cid:48) (cid:12)(cid:12)(cid:12)  , (7)where | · | denotes the (cid:96) norm of all values in the histogram.The cost matrix C in Equation 6 is chosen to be linear, i.e. if there is a bin at value h and a bin at value h , the cost C [ h , h ] for transporting a unit of mass between h and h is | h − h | . Dimension of C is chosen so that each distinctvalue in H ( i ) s and H ( i ) s (cid:48) has its own bin. B. Clustering of the subjects

Now that we have deﬁned a distance function that comparesthe shapes of distributions captured by our features, we cancluster all subjects in S – in a non-supervised way – to grouptogether those that have similar repartitions of cells.For each feature f ( i ) ∈ F , and any pair of subjects s ( j ) , s ( k ) ∈ S , we can now compute a |S| × |S| matrix: D ( i ) [ j, k ] = d H (cid:16) H ( i ) s ( j ) , H ( i ) s ( k ) (cid:17) . (8) We can now feed D ( i ) to a clustering algorithm to assessthe abiliy of the feature to create meaningful sub-populations.Because of its high interpretability, we have chosen to use as-cending hierarchical classiﬁcation (AHC) with single-linkage[13] as our clustering algorithm. Starting with each subject intotheir own cluster, this algorithms iteratively merges clusters– chosen to minimize dissimilarity between merged clusters– until the entire population belongs to the same cluster.This way, it produces a dendrogram, in which the leaves areindividual subjects. As a consequence, the deeper we go inthe dendrogram (toward its leaves), the more homogeneousthe clusters we ﬁnd are.Based on D ( i ) , we use AHC to partition S into two disjointsub-populations S = S ( i )1 (cid:116) S ( i )2 as follows: • The ﬁrst sub-population S ( i )1 ⊆ S is obtained by cuttingthe dendrogram at a threshold τ AHC , and keeping thedeepest resulting sub-tree. By construction, this sub-population will be homogeneous with respect to thefeature used to build the input distances matrix; • The second sub-population S ( i )2 ⊆ S is obtained bymerging all subjects that do not belong to the ﬁrst one.By construction, these subjects will be dissimilar withrespect to the feature.The reasoning behind this approach is that we want toverify whether having one particular distribution of cells ischaracteristic of the survival prognosis, assuming the subjectswithout that particular distribution may have very personal –diverse – proﬁles.In order to determine the value of the threshold τ AHC thatwill separate sub-populations, we have made an exhaustivesearch over its possible values. Then, for each two result-ing sub-populations S ( i )1 and S ( i )2 , we have computed theirKaplan-Meier survival curves [14], and have performed a log-rank test [15] to assess how different they are. We have thenkept the value of τ AHC which minimizes the p -value of that test, i.e. , which most signiﬁcantly separates the two survival curves.The idea behind this choice is to maximize the homogeneityof S ( i )1 without any prior on the size of the clusters. C. Using multiple features

In the previous paragraphs, we have explained how tocluster subjects based on a single matrix of distances, asso-ciated with one particular feature. In practice, we would liketo combine the various features we have deﬁned, to grouptogether subjects that tend to behave similarly on multipleaspects.To do so, we want to build a matrix D that captures the factthat two subjects tend to belong to the same clusters S ( i )1 or S ( i )2 , for each feature f ( i ) ∈ F . Additionally, we want to beable to give more importance to some features in that globalmatrix, to allow a pathologist to inject additional knowledgein the aggregation.Let s ( j ) , s ( k ) ∈ S be two subjects. The requirementsdescribed previously can be integrated in the computation of as follows: D [ j, k ] = |F| (cid:88) i =1 w ( i ) · same (cid:16) s j , s k , S ( i )1 (cid:17) , (9)where: same (cid:16) s j , s k , S ( i )1 (cid:17) =  if ( s j ∈ S ( i )1 and s k ∈ S ( i )1 ) or ( s j (cid:54)∈ S ( i )1 and s k (cid:54)∈ S ( i )1 )1 otherwise , (10)and where { w ( i ) } i ∈{ ,..., |F| are given real values, weightingthe importance of features in the combination. Their valuescan be determined either using expert knowledge, or moreagnostic methods, as proposed in Section V-C.In Equation 9, the higher the value in D [ j, k ] , the moresubjects s ( j ) and s ( k ) tend to belong to different clusters.V. E XPERIMENTS

A. Database used

In order to evaluate our approach, we have chosen tofocus on Stage I lung adenocarcinoma. We have extractedall corresponding subjects from the Cancer Genome Atlas(TCGA) repository [16] to create a population S of subjects. For each of these subjects, the TCGA repositoryprovides us with high resolution scans of the subject’s biopsy,stained with H&E, as well as clinical information, includingthe subject’s time to last follow-up (TTLFU) and alive stateat that time. When combined, these two pieces of informationallow us to compute the Kaplan-Meier survival estimator ofthe population.From these subjects, we have chosen to remove thosethat were alive with a too small TTLFU (for which we don’thave enough hindsight), as well as those that were dead with atoo high TTLFU (as their tumor may have evolved too muchsince diagnosis). We have therefore set up a threshold TTLFU τ DB , below which alive subjects are dropped, and above whichdead subjects are dropped. The value of τ DB = 366 days hasbeen chosen to maximize the database size while encouragingthese criteria. After this ﬁltering, our database consists of subjects, of which have an alive state.For each subject, we have considered one scan of theirbiopsy, and have deﬁned multiple regions of interest (ROIs),chosen at hand to be located in areas that appear to includeboth cancer and stroma cells, mostly located at the tumorborder. The number of ROIs deﬁned this way depends on thesize and contents of the image. Figure 1 depicts an exampleof ROI selection in a tissue.For each identiﬁed ROI, we have segmented and phenotypedthe cells in presence using the ConvPath tool [7]. Eachcell is therefore associated with the following attributes: x coordinate, y coordinate and phenotype , that last one takingvalues in { "cancer" , "stroma" , "lymphocyte" } .Finally, to enforce our prior during ROIs selection, we haveﬁltered out those in which the ratio of cancer/stroma cells wasnot in the [0 . , . interval. After this second ﬁltering, we haveas an input a database of subjects, of which have analive state. Each of them is associated with to ROIs, with

TABLE IS

URVIVAL ANALYSIS OF THE POPULATION , BASED ON THE PROPOSEDFEATURES , TAKEN INDIVIDUALLY .Feature (cid:12)(cid:12)(cid:12) S ( i )1 (cid:12)(cid:12)(cid:12) |S| (cid:12)(cid:12)(cid:12) S ( i )2 (cid:12)(cid:12)(cid:12) |S| p -value f (1) .

55 0 .

45 0 . f (2) .

31 0 .

69 0 . f (3) .

52 0 .

48 0 . f (4) .

64 0 .

36 0 . f (5) .

19 0 .

81 0 . an average of . ROI per subject. In the remaining of thesedocuments, features for a subject will be computed per ROI,and aggregated in a single histogram before normalization.

B. Individual results per feature

First, we want to assess if the features we have proposed inSection III are signiﬁcantly correlating with survival prognosis.To do so, we apply our methodology for each feature proposedin Section III.Table I presents the obtained results per feature. For eachfeature f ( i ) ∈ F , we report respective sizes of resulting sub-populations S ( i )1 and S ( i )2 . Signiﬁcance of the clustering isanalyzed through the log-rank test of Kaplan-Meier estimatorsbetween both sub-populations. The p -value of that test is givenin the last column .Results show that features f (2) , f (3) , f (4) and f (5) separatesurvival curves signiﬁcantly ( p -value < . ). More precisely, f (2) and f (5) have a strong statistical signiﬁcance ( p -value (cid:28) . ).We show in Figures 2, 3, 4, 5 and 6 (top) the survivalcurves obtained by AHC for each feature f ( i ) ∈ F , for whichwe are reporting results in Table I. In these ﬁgures, the orangecurve is the survival curve for the cluster of subjects that arehomogeneous to the feature ( S ( i )1 ), and the blue curve is thesurvival curve for the cluster of heterogeneous subjects ( S ( i )2 ).Shaded areas delimit a conﬁdence interval of .Additionally, we show in these ﬁgures two histograms perfeature. The ﬁrst one (left) is the most central histogram withinthe cluster of homogeneous subjects, i.e. , the histogram H ( i ) s ∗ that minimizes: H ( i ) s ∗ = arg min s ∈S ( i ) j (cid:88) s (cid:48) ∈S ( i ) j d H (cid:16) H ( i ) s , H ( i ) s (cid:48) (cid:17) , (11)where j ∈ { , } is the considered sub-population. The secondhistogram (right) is the most central within the heterogeneouscluster, using that same centrality measure.The corresponding subjects can be seen as representativesfor their clusters with respect to the feature f ( i ) ∈ F used.Analyzing their histograms allows one to have an easy inter-pretation of the results of the clustering algorithm, which helpsthe pathologist understand the key aspects within distributions. All codes used to obtain these results will be made available uponacceptance.ig. 1. Manual selection of ROIs on a subject’s SVS slide (top left), closeup on one of the selected ROIs (top right), and representation of the Delaunaytriangulation of the cells in that ROI (bottom). In that representation, lymphocytes are depicted in green, cancer cells in red and stroma cells in blue. ROIsare selected to include both cancer and stroma cells in reasonable quantities.

C. Results obtained by combining features

Now that we have studied the proposed features individ-ually, we propose to merge those that have shown enoughsigniﬁcance in survival prognosis into a unique matrix D , aspresented in Section IV-C.In order to ﬁx the values of weights { w ( i ) } i in Equation 9,we propose to give more importance to features that are morestatistically signiﬁcant than the others. For a feature f ( i ) ∈ F ,let p ( i ) be the p -value of the log-rank test between survivalcurves of clusters S ( i )1 and S ( i )2 , as reported in Table I. Wechoose w (1) = 0 – due to low signiﬁcance of the feature –and { w ( i ) } i ∈{ ,..., |F|} as follows: w ( i ) = log (cid:18) p ( i ) (cid:19) . (12) The log in Equation 12 has been introduced to preventvery signiﬁcant features such as F (2) and F (5) to absorb thecontribution of other features.Using the matrix D computed as in Equation 9 withweights in Equation 12, we then use AHC and obtain twosubpopulations S and S , of which Kaplan-Meier survivalcurves are given in Figure 7. The log-rank test between thesecurves has a p -value of . · − , which indicates a verystrong signiﬁcance of the clustering. D. Discussion

A ﬁrst observation in our results is that Feature f (1) isnot signiﬁcant enough ( p -value > . ). As a reminder, thisfeature captures the distances between lymphocytes and theirclosest cancer cells. When introducing it, we expected to

000 2000 3000 4000 5000 6000 7000TTLFU (days)0.00.20.40.60.81.0 % a li v e S (1)2 (42 subjects) S (1)1 (52 subjects) Distance lymphocyte cancer cell ( µ m) N u m b e r o f o cc u rr e n ce s Fig. 2. Survival curves associated with both sub-populations S (1)1 and S (1)2 obtained by AHC using D (1) (top); histogram H (1) s ∗ associated with the mostcentral subject s ∗ ∈ S (1)1 (alive, TTLFU days) of the ﬁrst sub-population S (1)1 , obtained by feeding D (1) to the AHC clustering algorithm (bottomleft); and corresponding histogram H (1) s ∗ (alive, TTLFU days) of the central subject s ∗ ∈ S (1)2 of the other sub-population S (1)2 (bottom right). ﬁnd that subjects with a good survival prognosis would tendto have a strong proximity between such cells, as havinglymphocytes in contact with cancer cells would indicate animmune response to the tumor. However, a more carefulinspection of the feature reveals that it does not take intoaccount the information that some lymphocytes are inﬁltrated,while others are not. It follows that the algorithm cannotdistinguish between subjects that have lymphocytes close tocancer cells deep in the tumor, from those that only havelymphocytes in the stroma environment at the interface withthe tumor, without any sort of inﬁltration (in both cases,corresponding histograms would show a peak in low positivebins).This notion of inﬁltration of lymphocytes within the tumor has however been introduced into features f (2) and f (5) ,which correlate with survival with high conﬁdence. The formerfeature captures the distance between lymphocytes and thetumor/stroma interface, associating a negative distance whenlymphocytes are located in the tumor environment. The lattercan be understood as a reﬁnement of the former, and measuresthe density of lymphocytes among cells as distance to thatinterface increases. Again, a negative distance distinguishesbands that are within the tumor environment from those thatare within the stroma environment.When looking in more details at central histograms in Figure3 (bottom), it reveals that both subjects seem to have bothinﬁltrating and non-inﬁltrating lymphocytes. The absence ofbins around distance µm is an artifact of the method used

000 2000 3000 4000 5000 6000 7000TTLFU (days)0.00.20.40.60.81.0 % a li v e S (2)2 (65 subjects) S (2)1 (29 subjects) − − − −

10 0 10 20 30 40 distance lymphocyte cancer/stroma interface ( µ m) N u m b e r o f o cc u rr e n ce s − − −

25 0 25 50 75 100

Distance lymphocyte cancer/stroma interface ( µ m) N u m b e r o f o cc u rr e n ce s Fig. 3. Survival curves associated with both sub-populations S (2)1 and S (2)2 obtained by AHC using D (2) (top); histogram H (2) s ∗ associated with the mostcentral subject s ∗ ∈ S (2)1 (alive, TTLFU days) of the ﬁrst sub-population S (2)1 , obtained by feeding D (2) to the AHC clustering algorithm (left); andcorresponding histogram H (2) s ∗ of the central subject s ∗ ∈ S (2)2 (alive, TTLFU days) of the other sub-population S (2)2 (right). to determine the cancer/stroma interface. By construction,a triangle in the Delaunay triangulation of cells is part ofthe interface if it links at least one cancer and one stromacell (see Section III-B). As a consequence, lymphocytes arenot used to deﬁne the border, which leads to this artifact.A noticeable difference between both histograms is that thecentral subject in the homogeneous sub-population S (2)1 hasa less deep lymphocyte inﬁltration (around − µm ) thanthe central subject in the heterogeneous sub-population S (2)2 (around − µm ). When looking at the corresponding survivalcurves in Figure 3 (top), it reveals that subjects in the cluster S (2)2 have a better survival prognosis than the other ones,which matches known facts regarding the link between TILsand survival outcome [6]. In order to verify if that observation generalizes to all subjects within the clusters – and not onlyto the central subjects –, we have listed the minimum distanceper histogram and have analyzed the distributions of suchvalues per cluster. We obtain for cluster S (2)1 a mean of − . µm , with standard deviation . µm ; and for cluster S (2)2 a mean of − . µm , with standard deviation . µm .Mann-Whitney U test [17] indicates that both distributionsare statistically signiﬁcant ( U = 201 , p -value = 1 . e − ),which validates the observation.A similar observation can be done with feature f (5) . His-tograms in Figure 6 show that the cental subject of the homo-geneous cluster S (5)1 , which has a poorer survival prognosis,has a less deep inﬁltration of lymphocytes (around − µm )when compared with the central subject of the heterogeneous

000 2000 3000 4000 5000 6000 7000

TTLFU (days) . . . . . . % a li v e S (3)2 (45 subjects) S (3)1 (49 subjects) Distance between aggregates of lymphocytes ( µ m) N u m b e r o f o cc u rr e n ce s Distance between aggregates of lymphocytes ( µ m) N u m b e r o f o cc u rr e n ce s Fig. 4. Survival curves associated with both sub-populations S (3)1 and S (3)2 obtained by AHC using D (3) (top); histogram H (3) s ∗ associated with the mostcentral subject s ∗ ∈ S (3)1 (alive, TTLFU days) of the ﬁrst sub-population S (3)1 , obtained by feeding D (3) to the AHC clustering algorithm (left); andcorresponding histogram H (3) s ∗ of the central subject s ∗ ∈ S (3)2 (alive, TTLFU days) of the other sub-population S (3)2 (right). cluster S (5)2 (around − µm ). As for feature f (2) , we havelisted minimum band per histogram and have analyzed thedistributions of such values per cluster. We obtain for cluster S (5)1 a mean of − µm , with standard deviation . µm ; andfor cluster S (2)2 a mean of − µm , with standard deviation . µm . Mann-Whitney U test indicates that both distributionsare statistically signiﬁcant ( U = 7 . , p -value = 3 . e − ),which conﬁrms what has been discussed in the previousparagraph.Features f (3) and f (4) consider a coarser scale in the tissue,and study the link between presence of aggregates of lympho-cytes at the tumor periphery and overall survival. The formerfeature captures the distances between aggregates, while thelatter simply studies the distribution of the sizes of these aggregates. It can be seen in Figure 4 that the central subjectin the homogeneous cluster S (3)1 has a maximum distancebetween aggregates of lymphocytes (around µm ) which ismuch lower than for the central subject of the heterogeneouscluster S (3)2 (around µm ). Survival curves show that theheterogeneous sub-population has a poorer survival prognosis.A closer analysis at histograms reveals that subjects from thehomogeneous cluster have aggregates of lymphocytes that arein general close to one another, suggesting a uniform reparti-tion of aggregates of lymphocytes to ﬁght tumor progression,which denotes an efﬁcient immune response. We have listedthe maximum inter-aggregate distance per histogram and haveanalyzed the distributions of such values per cluster. We obtainfor cluster S (3)1 a mean of . µm , with standard deviation

000 2000 3000 4000 5000 6000 7000TTLFU (days)0.00.20.40.60.81.0 % a li v e S (4)2 (34 subjects) S (4)1 (60 subjects) Size of the aggregate of lymphocytes (number of cells) N u m b e r o f o cc u rr e n ce s Size of the aggregate of lymphocytes (number of cells) N u m b e r o f o cc u rr e n ce s Fig. 5. Survival curves associated with both sub-populations S (4)1 and S (4)2 obtained by AHC using D (4) (top); histogram H (4) s ∗ associated with the mostcentral subject s ∗ ∈ S (4)1 (alive, TTLFU days) of the ﬁrst sub-population S (4)1 , obtained by feeding D (4) to the AHC clustering algorithm (left); andcorresponding histogram H (4) s ∗ of the central subject s ∗ ∈ S (4)2 (alive, TTLFU days) of the other sub-population S (4)2 (right). . µm ; and for cluster S (3)2 a mean of . µm , withstandard deviation . µm . Mann-Whitney U test indicatesthat both distributions are statistically signiﬁcant ( U = 611 , p -value = 0 . ), which validates the observation.In addition to this analysis of repartition of aggregatesof lymphocytes, Figure 5 shows us that the size of theseaggregates is important for survival prognosis. As a matterof fact, it can be seen that the most central subject ofthe homogeneous cluster S (4)1 has aggregates that are a lotsmaller (maximum around µm ) than those of the centralsubject of the heterogeneous cluster S (4)2 (maximum around µm ). Since subjects in the homogeneous cluster have apoorer survival prognosis, this suggests that presence of largeaggregates of lymphocytes may be characteristic of an efﬁcient immune response. We have listed the maximum aggregate sizeper histogram and have analyzed the distributions of suchvalues per cluster. We obtain for cluster S (4)1 a mean of . lymphocytes in the largest aggregate, with standard deviation . ; and for cluster S (4)2 a mean of . lymphocytes inthe largest aggregate, with standard deviation . . Mann-Whitney U test indicates that both distributions are statisticallysigniﬁcant ( U = 192 . , p -value = 7 . e − ), which validatesthe observation.Finally, Figure 7 shows that the matrix of weighted co-occurrences of subjects within clusters signiﬁcantly separatesthe two survival curves. This suggests that sub-populationsproduced tend to be coherent across features, especially forsubjects that die after a very short period. As a consequence,

000 2000 3000 4000 5000 6000 7000TTLFU (days)0.00.20.40.60.81.0 % a li v e S (5)2 (76 subjects) S (5)1 (18 subjects) [-40,-20] [-20,0] [0,20] [20,40]Density of lymphocytes at cancer/stroma interface . . . . . . N u m b e r o f o cc u rr e n ce s [-20,0] [0,20][-80,-60] [-60,-40] [-40,-20] [20,40] [40,60]Density of lymphocytes at cancer/stroma interface . . . . . . . . . N u m b e r o f o cc u rr e n ce s Fig. 6. Survival curves associated with both sub-populations S (5)1 and S (5)2 obtained by AHC using D (5) (top); histogram H (5) s ∗ associated with the mostcentral subject s ∗ ∈ S (5)1 (alive, TTLFU days) of the ﬁrst sub-population S (5)1 , obtained by feeding D (5) to the AHC clustering algorithm (left); andcorresponding histogram H (5) s ∗ of the central subject s ∗ ∈ S (5)2 (dead, TTLFU days) of the other sub-population S (5)2 (right). this aggregation method offers an interesting approach toestimate survival prognosis based on a multi-criteria analysis.VI. C ONCLUSION

In this article, we have proposed a methodology that allowscomparing and clustering subjects on the basis of complexattributes, which capture their repartitions of cells withinbiopsies. To motivate our approach, we have introduced ﬁvefeatures, each capturing some information on spatial reparti-tions of cells, and have shown that our method could produceclusters that are signiﬁcantly different, while offering someinterpretability of the results. We have illustrated our methodon a dataset of H&E biopsies of subjects suffering from Stage I lung adenocarcinoma, and have been able to ﬁnd clusters thatmatch existing knowledge in survival prognosis estimation.Our method therefore offers a novel way of integratingcomplex features in survival analysis, in an automatic way.This allows one to explore new hypotheses regarding thelinks between cell organization patterns and survival outcome,and offers a systemic method for comparing two subjectson the basis of their personal attributes, which can helpsuggest someone a treatment that was efﬁcient to a similarsubject. Additionally, our method provides pathologists withvery readable histograms of complex organizations of cells (orcoarser elements such as cell aggregates) within tissues, whichcan be very difﬁcult to assess by human eye. Our approach canthus help them produce a more detailed and quicker diagnosis,

000 2000 3000 4000 5000 6000 7000

TTLFU (days) . . . . . . % a li v e S (82 subjects) S (12 subjects) Fig. 7. Survival curves associated with both sub-populations S and S obtained by AHC using D . thus reducing the time between diagnosis and treatment of thecancer.Perspective for extending this work are numerous. Onedirection consists in extending our catalogue of features todiscover new distributions that may correlate with prognosis.Another direction is to evaluate our methodology on morecomplex data, such as mIHC, and other families of cancers.Finally, we want to explore variations of the elements in ourapproach, such as using different clustering algorithms, orchanging the number of sub-populations produced, in case wemanipulate a database of subjects with more than two proﬁles.A CKNOWLEDGMENT

The authors would like to thank Prof. Pascal Frossardand all members of the LTS4 laboratory at EPFL, Lausanne,Switzerland; as well as the group of Dr. Michel Cuendet atCHUV, Lausanne, Switzerland, for the interesting discussionswhich eventually led to that work.R

EFERENCES[1] B. Mlecnik, G. Bindea, F. Pag`es, and J. Galon, “Tumor immunosurveil-lance in human cancers,”

Cancer and Metastasis Reviews , vol. 30, no. 1,pp. 5–12, 2011.[2] C. Reichling, J. Taieb, V. Derangere, Q. Klopfenstein, K. Le Malicot, J.-M. Gornet, H. Becheur, F. Fein, O. Cojocarasu, M. C. Kaminsky et al. ,“Artiﬁcial intelligence-guided tissue analysis combined with immuneinﬁltrate assessment predicts stage iii colon cancer outcomes in petacc08study,”

Gut , vol. 69, no. 4, pp. 681–690, 2020.[3] J. Galon, F. Pag`es, F. M. Marincola, M. Thurin, G. Trinchieri, B. A.Fox, T. F. Gajewski, and P. A. Ascierto, “The immune score as a newpossible approach for the classiﬁcation of cancer,” 2012. [4] W. H. Fridman, F. Pag`es, C. Sautes-Fridman, and J. Galon, “The immunecontexture in human tumours: impact on clinical outcome,”

NatureReviews Cancer , vol. 12, no. 4, pp. 298–306, 2012.[5] B. Yener, “Cell-graphs: image-driven modeling of structure-functionrelationship,”

Communications of the ACM , vol. 60, no. 1, pp. 74–84,2016.[6] J. Saltz, R. Gupta, L. Hou, T. Kurc, P. Singh, V. Nguyen, D. Sama-ras, K. R. Shroyer, T. Zhao, R. Batiste et al. , “Spatial organizationand molecular correlation of tumor-inﬁltrating lymphocytes using deeplearning on pathology images,”

Cell reports , vol. 23, no. 1, pp. 181–193,2018.[7] S. Wang, T. Wang, L. Yang, D. M. Yang, J. Fujimoto, F. Yi, X. Luo,Y. Yang, B. Yao, S. Lin et al. , “Convpath: A software tool for lungadenocarcinoma digital pathological image analysis aided by a convo-lutional neural network,”

EBioMedicine

ICASSP 2019-2019 IEEE InternationalConference on Acoustics, Speech and Signal Processing (ICASSP) .IEEE, 2019, pp. 1020–1024.[10] B. Delaunay et al. , “Sur la sphere vide,”

Izv. Akad. Nauk SSSR, OtdelenieMatematicheskii i Estestvennyka Nauk , vol. 7, no. 793-800, pp. 1–2,1934.[11] G. Peyr´e, M. Cuturi et al. , “Computational optimal transport,”

Founda-tions and Trends R (cid:13) in Machine Learning , vol. 11, no. 5-6, pp. 355–607,2019.[12] R. Flamary and N. Courty, “Pot python optimal transport library,”2017. [Online]. Available: https://pythonot.github.io/[13] G. J. Szekely and M. L. Rizzo, “Hierarchical clustering via jointbetween-within distances: Extending ward’s minimum variance method.” Journal of classiﬁcation , vol. 22, no. 2, 2005.[14] E. L. Kaplan and P. Meier, “Nonparametric estimation from incompleteobservations,”

Journal of the American statistical association , vol. 53,no. 282, pp. 457–481, 1958.[15] N. Mantel, “Evaluation of survival data and two new rank order statisticsrising in its consideration,”

Cancer Chemother. Rep. , vol. 50, pp. 163–170, 1966.[16] “The cancer genome atlas,” http://cancergenome.nih.gov/, 2018.[17] H. B. Mann and D. R. Whitney, “On a test of whether one of tworandom variables is stochastically larger than the other,”