Cancer Gene Profiling through Unsupervised Discovery

Abstract

Precision medicine is a paradigm shift in healthcare that relies heavily on genomics data. However, the complexity of biological interactions, the large number of genes and the lack of comparisons in the analysis of data remain a tremendous bottleneck for clinical adoption. In this paper, we introduce a novel, automatic and unsupervised framework to discover low-dimensional gene biomarkers. Our method is based on the LP-Stability algorithm, a high dimensional center-based unsupervised clustering algorithm that offers modularity with respect to metric functions as well as scalability, while automatically determining the best number of clusters. Our evaluation includes both mathematical and biological criteria. The recovered signature is applied to a variety of biological tasks, including the screening of biological pathways and functions and the characterization of tumor types and subtypes. Quantitative comparisons among different distance metrics, commonly used clustering methods and a referential gene signature used in the literature confirm the state of the art performance of our approach. In particular, our signature, which is based on 27 genes, reports at least 30 times better mathematical significance (average Dunn's Index) and 25% better biological significance (average Enrichment in Protein-Protein Interaction) than those produced by other referential clustering methods. Finally, our signature reports promising results in distinguishing immune inflammatory and immune desert tumors, while reporting a high balanced accuracy of 92% on tumor type classification and an averaged balanced accuracy of 68% on tumor subtype classification, which represent, respectively, 7% and 9% higher performance compared to the referential signature.



Enzo Battistella, Maria Vakalopoulou, Member, Roger Sun, Théo Estienne, Marvin Lerousseau, Sergey Nikolaev, Émilie Alvarez Andres, Alexandre Carré, Stéphane Niyoteka, Charlotte Robert, Nikos Paragios, Fellow Member, and Éric Deutsch

Index Terms: Clustering, Predictive Signature, Biomarkers, Genomics, Multi-tumor association

Version submitted to IEEE Transactions on Bioinformatics and Computational Biology. © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

I. INTRODUCTION

Omics data analysis, including genomics, transcriptomics and metabolomics, has greatly benefited from the tremendous advances in sequencing techniques [1], which have considerably increased the quality and the quantity of data. These omics techniques are pivotal to the development of personalized medicine, as they enable a better understanding of fine-grained molecular mechanisms [2]. In oncology, they provide a more comprehensive insight into the intricacy of the biological processes at work in cancers, giving momentum to molecular-type characterization through omics or even multi-omics approaches [3], [4]. Such a precise and robust characterization is a highly valuable asset for tumor characterization and provides significant insight into treatment.

Genomics, probably the most prominent omics technique, refers to the study of entire genomes, in contrast to genetics, which interrogates individual variants or single genes [5]. In this direction, novel methods study specific variants of genes with the aim of producing robust biomarkers, which contribute to both the response of patients to treatment [6], [7] and the association with complex and Mendelian diseases [8].
However, the relatively low number of samples per tumor subtype, along with the curse of dimensionality and the lack of ground truth, affects many of these studies [9] and may prevent the discovery of any statistically meaningful causal relation.

Unsupervised clustering is a very efficient technique for studying large high-dimensional datasets, with the aim of discovering unknown, otherwise indiscernible structures and correlations [10]. Clustering algorithms seek a grouping of the data that favors low variation within groups and high variation between groups. However, there is a large variety of clustering approaches, relying on different properties and leading to significantly different outcomes. The main challenge of clustering is the definition of a metric or similarity function capturing the notion of closeness between the objects under consideration. This includes not only the intrinsic clustering properties the algorithm seeks to optimize but also the notion of distance involved. Qualitative and quantitative evaluation is a critical step towards the effective adoption of clustering and relies on independent and reliable measures for the proper comparison of parameters and methods. Numerous existing metrics assess the quality of the clusters from a statistical point of view, such as the Silhouette Value [11], Dunn's Index [12] or, more recently, the Diversity Method [13]. In addition, in the presence of annotations, the Rand Index [14] can be used. Clustering evaluation is even more challenging in genomics, as the clusters should also be biologically informative. Protein-Protein Interaction (PPI) and the Gene Ontology (GO) have recently been introduced in this sub-domain to assess the biological soundness of the clusters through Enrichment Scores [15], [16].

The use of cluster analysis on RNA-seq transcriptomes is a widespread technique [17] whose main goal is to define groups of genes that have similar expression profiles, proposing compact signatures [8].
Version submitted to IEEE Transactions on Bioinformatics and Computational Biology on 15/12/2020 (arXiv, q-bio.GN).

Fig. 1. Proposed Framework. A general overview of the different steps of our process. Our proposed framework is composed of two steps. First, a clustering algorithm, here LP-Stability, is used to generate clusters of genes having similar expression profiles. Then, the clustering that performs best on both mathematical and biological scores is selected as a gene signature. In the second step, the generated signature is used to perform sample clustering and sample classification. The performance of this step is evaluated by analysing the distribution of the samples into the different clusters or the performance on the classification tasks; here, the target was the characterization of tumor types and subtypes.

These robust signatures are necessary to identify associations with different biological processes, such as tumor types or cancer molecular subtypes, and to highlight genes coding for proteins that interact together or participate in the same biological process [18]. The main advantage of unsupervised clustering compared to supervised approaches, or to methods guided by specific biological functions or processes, is the ability to discover unknown patterns, associations and correlations in the genome. Furthermore, unsupervised discovery offers better tractability when applied to the tremendous number of genes. In addition, studies that perform clustering relying on a priori knowledge lead to redundant signatures and loss of information [19]. This is one of the reasons why several studies focus on statistical pattern recognition methods, such as the center-based K-Means [20], the model-based CorEx [21] or the stability-based LP-Stability [22], for the identification of meaningful and predictive groups of genes as biomarkers [23]. In this direction, CorEx [16] has recently been used to generate gene signatures evaluated and optimized over ovarian tumors, addressing this specific tumor type with a high dimensional gene signature composed of several hundred genes.
To deal with this high dimensionality, studies propose methods to combine and prune existing signatures to obtain a unique compact and informative signature [19], [24].

Although dimension reduction through clustering is not new [16], the literature lacks a thorough, mathematically and biologically meaningful comparison of clustering methods on a same database. In many studies, a single evaluation metric is used and there is no relevant comparison with other algorithms. By "relevant", we mean here that the hyperparameters of the different baseline algorithms are optimized and that the algorithms are compared through a fair evaluation metric. Mathematical metrics, for instance, are highly dependent on the property the algorithm is optimizing and on the distance notion considered. This bias should be evaluated, for example through random clusters using different distance notions, to offer a fair comparison between the different algorithms. Finally, a few surveys [25] propose a thorough comparison using several evaluation criteria, albeit by reporting results shown in several other studies without actually comparing the methods on a same database with all the criteria at once.

In this paper, we introduce a novel unsupervised approach that is modular, scalable and metric free for the definition of a predictive gene signature, while proposing a complete methodology for the comparison, analysis and evaluation of genomic signatures.
The backbone of our methodology is a powerful graph-based unsupervised clustering method, the LP-Stability algorithm [22], which has been successfully adapted in various fields including, recently, genomics [26]. Our approach offers:

(i) Standardization and automation of gene clustering evaluation for the selection of the best distance notions, metrics, algorithms and hyperparameters;
(ii) Creation of generic, low dimensional signatures using the gene expressions of all coding genes, including comparisons to random signatures to highlight statistical superiority;
(iii) Systematic assessment of the biological power of gene signatures by evaluating the different tumor type and subtype associations via supervised (proving tissue-specificity and predictive power) and unsupervised (proving automatic discovery and expression power) techniques. By this, we demonstrate the power of the proposed gene signature (based on 27 genes) compared to other methods in the literature;
(iv) Thorough biological analysis of the processes involved in sample clusters via gene screening techniques, affirming the robustness of the obtained results.

II. OVERVIEW OF THE PROPOSED APPROACH

An overview of the method presented in this paper is given in Fig. 1. To evaluate and select the best gene signature, we introduce two distinct metrics checking both mathematical and biological properties. In particular, we used the mathematical assessment metric of Dunn's Index (DI) [12] and the biological one of Enrichment Score in PPI [16], which are both referential for the assessment of clustering although they have never been combined. Then, a low dimensional aggregated gene signature is defined by combining representative genes from each cluster. To prove the power of the discovered biomarker, a systematic and thorough evaluation of its biological and clinical relevance was performed. In particular, the signature was evaluated and compared through sample clustering and sample classification. As the target of the sample clustering, we chose the different tumor types and assessed the success of the clustering through sample distribution analysis and clustering evaluation metrics such as the Rand Index and Mutual Information. In addition, we used the method from [27] to obtain important genes for the samples of each cluster, which were associated with their pathways using [28]. Finally, the last evaluation criterion was the performance in categorizing the cancer types and subtypes through supervised machine learning techniques. Our proposed signature has been compared against both signatures designed from commonly used algorithms for gene clustering [16], [20] and a recently proposed prominent gene signature [24].

III. DISCOVERING CORRELATIONS IN GENE EXPRESSIONS

A. Algorithms

We consider here $n$ points $S = \{x_1, \dots, x_n\}$ in a space of dimension $m$, where the coordinates of each point $x_p$ are denoted $x_p = (x_{p1}, \dots, x_{pm})$. In this study, we consider several notions of dissimilarity $d$. If $k$ denotes the number of clusters in a clustering $C$, then the clustering is a set of clusters $C = \{C_1, \dots, C_k\}$ such that $\forall\, 1 \le i, j \le k$ with $i \ne j$, $C_i \cap C_j = \emptyset$, and $\bigcup_{1 \le i \le k} C_i = S$. The number of points in cluster $C_i$ is denoted by $n_i$. The centroid $\mu_i$ of cluster $C_i$ is the mean of the points of the cluster. Finally, we obtain a discrete random variable $X = \{X_1, \dots, X_b\}$ from a point $x \in S$ by binning its values into $b$ bins. We denote by $P(X)$ the probability mass function of $X$. The Shannon entropy $H$ of the variable $X$ is then defined by

$$H(X) = -\sum_{1 \le i \le b} P(X_i) \ln P(X_i).$$

The K-Means algorithm [20] is a very popular and simple algorithm used for data following Gaussian distributions. First, the algorithm draws an initial random set of cluster centroids. Then, until convergence, it iteratively determines $k$ clusters by assigning the points to their closest centroid and computes their new centroids $\mu_i$. The algorithm aims to solve

$$\min_{C} \sum_{i=1}^{k} \sum_{x \in C_i} d(x, \mu_i). \tag{1}$$

Only the number of clusters $k$ has to be defined beforehand. Generally, K-Means is used with the Euclidean distance for convergence reasons. A main drawback of this technique is that the random initialization is a source of nondeterminism and may cause instability in the cluster generation across runs. We address this issue by selecting the best clustering over multiple iterations with different initializations.

The CorEx algorithm [21] is a model-based algorithm that has been applied to various fields, and especially to genes [16], with great success.
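The binned Shannon entropy defined at the start of this subsection underlies the information-theoretic quantities used by CorEx. The following is a minimal sketch only: the paper does not specify the binning scheme, so equal-width bins are an assumption made here, and the function name `shannon_entropy` is hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(values, b):
    """Entropy (in nats) of `values` after equal-width binning into b bins.

    Equal-width binning is an assumption of this sketch; the source only
    states that points are discretized into b bins.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return 0.0  # a constant variable falls in a single bin
    width = (hi - lo) / b
    # Map each value to a bin index in [0, b - 1]; the maximum goes to the last bin.
    bins = [min(int((v - lo) / width), b - 1) for v in values]
    n = len(values)
    counts = Counter(bins)
    # H(X) = -sum_i P(X_i) ln P(X_i), summed over non-empty bins only.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


# Two equally populated bins give an entropy of ln 2.
h = shannon_entropy([0.0, 0.0, 1.0, 1.0], b=2)
```

Empty bins are skipped, which matches the usual convention that $0 \ln 0 = 0$.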
CorEx aims to define a set $S'$ of $k$ latent factors accounting for the most variance of the dataset $S$. Formally, it relies on the Total Correlation of discrete random variables $X_1, \dots, X_p$, defined by

$$TC(X_1, \dots, X_p) = \sum_{1 \le i \le p} H(X_i) - H(X_1, \dots, X_p), \tag{2}$$

and on the Mutual Information of two random variables,

$$MI(X_i, X_j) = \sum_{X_{ip} \in X_i} \sum_{X_{jq} \in X_j} P(X_{ip}, X_{jq}) \log \frac{P(X_{ip}, X_{jq})}{P(X_{ip})\, P(X_{jq})}, \tag{3}$$

where $P(X_{ip}, X_{jq})$ is the joint probability function and $P(X_{ip})$, $P(X_{jq})$ are the marginal probability functions. To guarantee a reliable definition of the latent variables, the algorithm minimizes the Total Correlation $TC(S \mid S')$, corresponding to the additional information brought by the points in $S$ compared to the latent factors of $S'$. Then, to obtain the clustering, each point $x_p$ is allocated to the cluster of the latent factor $f$ maximizing the mutual information $MI(X_p, f)$. Similar to K-Means, the only requirement of the CorEx algorithm is the number of clusters $k$. Moreover, K-Means is the only algorithm under consideration that directly operates on the gene expression matrix to build a statistical model without using a given distance matrix.

The LP-Stability algorithm [22] is based on linear programming and relies on the same definition of clusters as K-Means, i.e. we want to minimize the distance between each point of a cluster and the center of the cluster. However, the novelty and interest of this technique is that, instead of taking centroids as cluster centers, it defines stable cluster centers. Formally, we aim to optimize the following linear system

$$\text{PRIMAL} \equiv \min_{C} \sum_{p,q} d(x_p, x_q)\, C(p,q) \quad \text{s.t.} \quad \sum_{q} C(p,q) = 1, \quad C(p,q) \le C(q,q), \quad C(p,q) \ge 0, \tag{4}$$

where $C(p,q)$ represents the fact that $x_p$ belongs to the cluster of center $x_q$. The formula corresponds to the minimization of the distance between a point and its cluster center, while ensuring that each point belongs to one and only one cluster and that centers belong to their own cluster. The determination of the stable centers relies on the following notion of stability:

$$S(q) = \inf_{s} \{\, s : \text{with } d(q,q) + s,\ \text{PRIMAL has no optimal solution with } C(q,q) > 0 \,\}.$$

The stability of a point is the maximum penalty the point can receive while remaining an optimal cluster center in PRIMAL. Besides, to better exploit particular field constraints of the points or to better tune the number of clusters, a penalty value $S_q \ge 0$ can be added to point $q$. We then consider the penalty vector $S$ weighting the distance $d$ such that $\forall q,\ S_q \in S,\ d'(q,q) = d(q,q) + S_q$. Doing so imposes a stronger minimal stability on the cluster centers, entailing a lower number of clusters.

Let us denote by $Q$ the set of stable cluster centers. The algorithm solves the clustering using the DUAL problem

$$\text{DUAL} \equiv \max_{D} D(h) = \sum_{p \in \mathcal{V}} h_p \quad \text{s.t.} \quad h_p = \min_{q \in \mathcal{V}} h(p,q), \quad \sum_{p \in \mathcal{V}} h(p,q) = \sum_{p \in \mathcal{V}} d(x_p, x_q), \quad h(p,q) \ge d(x_p, x_q), \tag{5}$$

where $h(p,q)$ corresponds here to the minimal pseudo-distance between $x_p$ and $x_q$, and $h_p$ to the one from $x_p$. This DUAL problem is then conditioned by considering only centers in the set of stable points $Q$:

$$\text{DUAL}_Q = \max \text{DUAL} \quad \text{s.t.} \quad h_{pq} = d_{pq},\ \forall \{p,q\} \cap Q \ne \emptyset. \tag{6}$$

This method presents several advantages. It is versatile and can integrate any metric function, and it makes no prior assumptions on the number of clusters or their distribution.
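To make the role of the penalties in the PRIMAL objective of Eq. (4) concrete, the sketch below evaluates the integer version of the objective for a fixed set of centers and finds the best set by exhaustive search. This is an illustration for tiny inputs only and is not the primal-dual LP-Stability algorithm of [22]; the function names `primal_cost` and `best_centers` are hypothetical.

```python
from itertools import combinations


def primal_cost(dist, penalty, centers):
    """Objective of Eq. (4) for a fixed set of centers (integer assignment).

    Each center q pays its penalized self-distance d(q, q) + S_q;
    every other point pays the distance to its closest center.
    """
    cost = sum(dist[q][q] + penalty[q] for q in centers)
    for p in range(len(dist)):
        if p not in centers:
            cost += min(dist[p][q] for q in centers)
    return cost


def best_centers(dist, penalty):
    """Exhaustive search over all non-empty center sets (tiny inputs only)."""
    n = len(dist)
    candidates = (set(c) for k in range(1, n + 1)
                  for c in combinations(range(n), k))
    return min(candidates, key=lambda c: primal_cost(dist, penalty, c))


# Four points on a line forming two well-separated pairs.
xs = [0, 1, 10, 11]
dist = [[abs(a - b) for b in xs] for a in xs]
few = best_centers(dist, penalty=[20] * 4)   # large penalty: fewer centers
many = best_centers(dist, penalty=[0] * 4)   # no penalty: every point is a center
```

On this toy input, raising the uniform penalty reduces the number of selected centers, which is the behavior the penalty vector $S$ is described as controlling.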
LP-Stability aims to define the clustering in a global manner, seeking an automatic selection of the cluster centers. To that end, it relies on the optimization of the set of stable centers, as well as on the assignment of each observation to the most appropriate cluster, meaning the one minimizing the distance to the center. This algorithm only requires a penalty vector $S$, which influences the number of clusters.

B. Proximity Measures

To tackle the issue of the high dimensionality of the data, combined with a low ratio between the number of samples and the dimension of each sample, we studied several different distance notions. In this study, we considered the following distances: the Euclidean distance, the Cosine distance, Pearson's correlation, Spearman's rank correlation, Kendall's rank correlation and the Kullback-Leibler divergence. These standard metrics are further detailed in the Supplementary Materials.

Depending on the type of data or algorithm used, the choice of proximity measure may heavily impact the clustering performance. The different correlations take values in the range $[-1, 1]$: the value is positive when the observations evolve in a similar way for the compared variables and negative when they evolve in opposite ways. High absolute values indicate high correlations in the observations. On the other hand, high values in terms of distance indicate observations that are not similar in the specific feature space. In this work, we used $\sqrt{2(1 - c)}$, where $c$ is the correlation, to convert correlations into distances. For simplicity, distances derived from correlations are referred to as correlation-based distances in the rest of the paper.

C. Unsupervised Gene Clustering Evaluation

To evaluate the performance of gene clustering methods, we used both mathematical and biological criteria. The quality of the results was assessed using the biological relevance information brought by the Enrichment Score, while the prominent Dunn's Index statistical method was considered regarding the mathematical appropriateness of the clustering.

• Enrichment Score (ES): Enrichment is the most commonly adopted technique to assess biological relevance in an automatic manner [16]. We considered here the Enrichment in PPI by studying the proteins corresponding to the genes. Even if, contrary to enrichment in a given biological process, PPI does not integrate specific information about predefined pathways and biological processes, it fulfills our aim of an unbiased and general metric. Enrichment for a cluster represents the probability of obtaining the same number of interactions in a random set of genes of the same size as in the evaluated cluster. In particular, the cluster is considered as enriched if the p-value is below a given threshold (abbreviated th). The ES corresponds to the proportion of enriched clusters in the clustering. To calculate the ES, the Stringdb library based on the String PPI network [28] was used.

• Dunn's Index (DI): Dunn's Index [12] studies the ratio between inter-cluster and intra-cluster variance. The former is meant to be large, as the distributions in different clusters should be different. The latter has to be small, as we want points that are in a same cluster to follow a common distribution. Formally,

$$\text{Dunn}(C) = \frac{\min_{1 \le i, j \le k,\ i \ne j} \delta(C_i, C_j)}{\max_{1 \le i \le k} \Delta(C_i)},$$

where $\delta(C_i, C_j)$ is the distance between the two closest points of the clusters $C_i$ and $C_j$, and $\Delta(C_i)$ is the diameter of the cluster, i.e. the distance between the two farthest points of the cluster $C_i$. This score is highly sensitive to extreme, poorly formed clusters, which makes it well suited to our problem.

IV. DEFINITION OF A LOW DIMENSIONAL CANCER SIGNATURE

A. Signature Selection

The ultimate goal of our approach is to produce low dimensional signatures as a byproduct of the unsupervised clustering outcome. To this end, the signatures were produced by selecting the most representative gene per cluster for the clusterings with the highest ES and DI performance. For the LP-Stability algorithm, the selected genes were the stable centers that the algorithm relies on. For the rest of the algorithms, we chose as representatives the cluster medoids, which are the genes closest to the cluster centroid. This choice is motivated by the fact that the stable centers of the clustering obtained through LP-Stability are also medoids.

In addition, once a signature was selected, a redundancy analysis was performed using the STRING tool [28] to detect any biological process that was particularly over-represented, which would suggest redundancy of the information. Furthermore, the Genotype-Tissue Expression (GTEx) portal was used to assess the tissue specificity of the proposed signature. A good signature should present genes with different expression profiles over the different tissues. This tool offers a visual representation, for each gene, of its expression and regulation in different tissues. It relies on the analysis of multiple human tissues from donors to identify correlations between genotype and tissue-specific gene expression levels.

B. Sample Clustering: Discovery Power

In order to perform sample clustering, we compared several algorithms and distances. The most meaningful results were obtained with the K-Medoids method, a variant of K-Means, combined with the Spearman's rank correlation-based distance. The relevance of the obtained results was assessed by analyzing the partition of the different tumor types in the clusters. In particular, driven by known biological evidence, we considered as meaningful the associations of lung tumors (LUSC, LUAD), squamous tumors (LUSC, HNSC, CESC), gynecologic tumors (BRCA, OV, CESC) and smoking related tumors (LUSC, LUAD, BLCA, CESC, HNSC). We disregarded sample types representing only a small fraction of the total cluster size. We considered a cluster to be poorly defined if it contained too few samples, or if its sample types were distributed in the same proportions as in the whole dataset, as this would indicate random associations.

Gene screening analysis was also used to identify the genes that are expressed differently over the sample clusters, thus indicating the biological processes involved. For that, we used the SAM method [27], which aims to identify the genes that are differently expressed over two groups of samples. SAM assesses the significance of the variations of the gene expression using a statistical t-test, providing a significance score and a False Discovery Rate (FDR). To better assess the relevance of separating samples of the same tumor type, we studied the genes that are expressed differently for each tumor type in a cluster compared to all the other samples of the same tumor type. We thus pinpointed significant genes for each cluster and for each tumor type within a cluster.
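The two-group comparison behind the gene screening can be illustrated with a simplified statistic in the spirit of SAM: a pooled-variance t-like score with a small constant `s0` added to the denominator. This is only a sketch under stated assumptions, not the implementation used in the paper; the permutation-based FDR estimation of SAM is not reproduced, and the name `sam_like_score` and the choice of `s0` are hypothetical.

```python
import math


def sam_like_score(group_a, group_b, s0=0.0):
    """Relative difference in expression of one gene between two sample groups.

    Computes (mean_a - mean_b) / (pooled standard error + s0).
    With s0 = 0 this is the classical two-sample t statistic with pooled
    variance; a positive s0 damps scores of genes with very low variance.
    Requires at least three samples overall and a non-zero denominator.
    """
    n_a, n_b = len(group_a), len(group_b)
    mean_a = sum(group_a) / n_a
    mean_b = sum(group_b) / n_b
    # Sum of squared deviations within each group.
    ss = (sum((x - mean_a) ** 2 for x in group_a)
          + sum((x - mean_b) ** 2 for x in group_b))
    pooled_se = math.sqrt((1 / n_a + 1 / n_b) * ss / (n_a + n_b - 2))
    return (mean_a - mean_b) / (pooled_se + s0)


# Expression of one gene in the samples of a cluster vs. the remaining samples.
score = sam_like_score([1.0, 2.0, 3.0], [4.0, 5.0, 6.0], s0=0.1)
```

Genes would then be ranked by the absolute value of this score, with the sign indicating the group in which the gene is more expressed.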
Once more, the method in [28] was used to assess the biological relevance of the clusters and their association with different tumor types, by studying the biological processes involved. A well-defined sample clustering is characterized by different clusters presenting different enriched biological processes and pathways, while different tumor types in a same cluster should be enriched in the same ones.

C. Sample Clustering: Expression Power

To assess how well the different tumor types have been separated, we used several different metrics (formal definitions are provided in the Supplementary Materials).

• Adjusted Rand Index (ARI): a similarity measure between a clustering $C'$ and a ground truth $C$. It is based on the proportion of pairs of elements that are either in different clusters in both $C$ and $C'$ (denoted $a$) or in a same cluster in both $C$ and $C'$ (denoted $b$), adjusted for chance.
• Normalized Mutual Information (NMI): a normalized measure of the mutual dependence between a clustering and the reference group.
• Homogeneity: considering a clustering $C'$ and a ground truth $C$, it values clusters of $C$ containing elements all belonging to a same cluster in $C'$.
• Completeness: considering a clustering $C'$ and a ground truth $C$, this complement of homogeneity values clusters of $C$ having all their elements belonging to a same cluster in $C'$.
• Fowlkes-Mallows Score (FMS): it corresponds to the geometric mean of the pairwise precision and recall.
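The pair-counting metrics in the list above (Rand Index, its chance-adjusted version and the Fowlkes-Mallows Score) can be sketched in a few lines. This is a minimal standard-library illustration using the usual textbook formulas, not the code used in the paper; the function names are hypothetical, and at least two samples are assumed.

```python
import math
from itertools import combinations


def pair_counts(truth, pred):
    """Count sample pairs by agreement between ground truth and clustering."""
    tp = tn = fp = fn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1          # together in both
        elif not same_truth and not same_pred:
            tn += 1          # separated in both
        elif same_pred:
            fp += 1          # together in the clustering only
        else:
            fn += 1          # together in the ground truth only
    return tp, tn, fp, fn


def rand_index(truth, pred):
    tp, tn, fp, fn = pair_counts(truth, pred)
    return (tp + tn) / (tp + tn + fp + fn)


def adjusted_rand_index(truth, pred):
    tp, tn, fp, fn = pair_counts(truth, pred)
    total = tp + tn + fp + fn
    expected = (tp + fn) * (tp + fp) / total   # agreement expected by chance
    max_index = ((tp + fn) + (tp + fp)) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster in both partitions
    return (tp - expected) / (max_index - expected)


def fowlkes_mallows(truth, pred):
    tp, tn, fp, fn = pair_counts(truth, pred)
    if tp == 0:
        return 0.0
    # Geometric mean of pairwise precision tp/(tp+fp) and recall tp/(tp+fn).
    return tp / math.sqrt((tp + fp) * (tp + fn))


tumor_types = ["LUAD", "LUAD", "OV", "OV"]
clusters = [0, 0, 1, 2]
ari = adjusted_rand_index(tumor_types, clusters)
```

All three scores depend only on which samples are grouped together, so they are invariant to a renaming of the cluster labels.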

D. Supervised Tumor Types/SubTypes Categorization

The provided signatures were further assessed in a supervised setting in order to highlight their tissue specificity properties. The supervised framework for tumor type and subtype categorization was adapted from [29]. This classification pipeline relies on an ensemble of machine learning classifiers, exploring the ones with strong generalisation power. The best performing classifiers in terms of balanced accuracy and generalisation are combined through a probabilistic consensus schema to provide the appropriate label.

To evaluate the reported performance, we relied on classic machine learning metrics, namely balanced accuracy, weighted precision, weighted specificity and weighted sensitivity. The use of weighted metrics instead of the non-weighted ones is required here, as we are considering a multi-class classification task with very unbalanced classes. The weighted scores (WS) were defined as

$$WS = \frac{1}{N} \sum_{l} N_l S_l, \tag{7}$$

where $N$ corresponds to the total number of samples, $N_l$ to the number of samples with class label $l$, and $S_l$ to the non-weighted score in one-vs-rest classification for the class $l$.

V. IMPLEMENTATION DETAILS
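The weighted score of Eq. (7) can be sketched as follows. This is a minimal illustration with hypothetical function names, in which each per-class score $S_l$ is computed from one-vs-rest counts and weighted by the class size $N_l$; it is not the pipeline adapted from [29].

```python
from collections import Counter


def one_vs_rest_counts(y_true, y_pred, label):
    """True/false positives and negatives when `label` is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)
    fp = sum(t != label and p == label for t, p in pairs)
    fn = sum(t == label and p != label for t, p in pairs)
    tn = sum(t != label and p != label for t, p in pairs)
    return tp, fp, fn, tn


def sensitivity(tp, fp, fn, tn):
    return tp / (tp + fn) if tp + fn else 0.0


def specificity(tp, fp, fn, tn):
    return tn / (tn + fp) if tn + fp else 0.0


def precision(tp, fp, fn, tn):
    return tp / (tp + fp) if tp + fp else 0.0


def weighted_score(y_true, y_pred, score):
    """WS = (1 / N) * sum_l N_l * S_l, with S_l a one-vs-rest score."""
    n = len(y_true)
    class_sizes = Counter(y_true)  # N_l for each class label l
    return sum(n_l * score(*one_vs_rest_counts(y_true, y_pred, label))
               for label, n_l in class_sizes.items()) / n


y_true = ["BRCA", "BRCA", "BRCA", "OV"]
y_pred = ["BRCA", "BRCA", "OV", "OV"]
ws_precision = weighted_score(y_true, y_pred, precision)
```

Note that with this weighting, the weighted sensitivity coincides with the overall accuracy, since the class sizes cancel out in the sum.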

The parameters of each algorithm for the gene clustering were obtained using grid search. In order to benchmark the behavior of each algorithm for different numbers of clusters, we evaluated their performance over a range of cluster numbers, with a step that differs between the Random Clustering and K-Means algorithms on the one hand and the CorEx algorithm on the other, because of the computational complexity of the latter. LP-Stability automatically determines the number of clusters. In order to create meaningful comparisons, we adjusted the penalty vector $S$ so as to obtain approximately the same number of clusters as with the rest of the algorithms. For comparison purposes, we used the same penalty for all the genes; however, for the LP-Stability algorithm, the penalty value could be adjusted and customized depending on the importance of specific genes.

TABLE I. Description of the dataset used in this study. The different tumors and tumor types together with the corresponding number of samples are summarised: Urothelial Bladder Carcinoma (BLCA), Breast Invasive Carcinoma (BRCA), Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC), Glioblastoma Multiforme (GBM), Head and Neck Squamous Cell Carcinoma (HNSC), Liver Hepatocellular Carcinoma (LIHC), Rectum Adenocarcinoma (READ), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC) and Ovarian Cancer (OV).

Tumor Type | Clustering | Classification
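As an illustration of how Dunn's Index (Section III-C) can be computed with a correlation-based distance (Section III-B), the sketch below uses only the standard library. The conversion $\sqrt{2(1 - c)}$ is a common choice that is assumed here, Pearson's correlation stands in for any of the correlations considered, and the function names are hypothetical.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson's correlation of two equal-length, non-constant vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def correlation_distance(x, y):
    """Distance sqrt(2 (1 - c)) derived from the correlation c (assumed form)."""
    c = pearson(x, y)
    return math.sqrt(max(0.0, 2.0 * (1.0 - c)))  # clamp against rounding


def dunn_index(points, labels, dist):
    """Smallest inter-cluster gap divided by the largest cluster diameter.

    Requires at least two clusters.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    groups = list(clusters.values())
    # delta: distance between the two closest points of two distinct clusters.
    min_separation = min(dist(a, b)
                         for g, h in combinations(groups, 2)
                         for a in g for b in h)
    # Delta: distance between the two farthest points inside one cluster.
    max_diameter = max((dist(a, b) for g in groups
                        for a, b in combinations(g, 2)), default=0.0)
    return min_separation / max_diameter if max_diameter > 0 else math.inf


# Toy expression profiles of four genes over four samples, in two clusters.
genes = [(1, 2, 3, 4), (2, 4, 6, 9), (4, 3, 2, 1), (9, 6, 4, 2)]
di = dunn_index(genes, [0, 0, 1, 1], correlation_distance)
```

Passing the distance as a function keeps the index consistent with the proximity measure each clustering method relies on, as described for the reported DI values.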

For the ES, we report the behavior of the algorithms for several threshold values. Furthermore, when reporting the DI value, each method was evaluated with the same proximity measure it relies on. For K-Means, which is sensitive to initialization, we performed multiple iterations for each parameter and selected the best clustering based on DI only, to cope with the computational cost of the ES. This iterative process increases the computational time of the algorithm, but yields clusters with better statistical significance and more stable scores. Similarly, we performed repeated random clusterings and observed rather similar results; we selected the clustering with the best DI score and reported its results.

For the sample clustering, we considered 10 clusters, corresponding to the actual tumor types. For the gene screening, we selected the most significant genes, whose significance score corresponds to a q-value of FDR close to zero in most cases, while for the biological processes we considered only the most enriched processes by screening.

VI. DATASET

In this study, we based our experiments on The Cancer Genome Atlas (TCGA) dataset [30]. TCGA is a comprehensive dataset including several data types such as DNA copy number, DNA methylation, mRNA expression, miRNA expression, protein expression and somatic point mutation. We focused our study on 10 tumor types relevant for radiotherapy and/or immunotherapy. For the gene clustering part, the number of samples in our dataset is given in Table I (second column). In particular, we investigated the following types of tumors: Urothelial Bladder Carcinoma (BLCA), Breast Invasive Carcinoma (BRCA), Cervical Squamous Cell Carcinoma and Endocervical Adenocarcinoma (CESC), Glioblastoma Multiforme (GBM), Head and Neck Squamous Cell Carcinoma (HNSC), Liver Hepatocellular Carcinoma (LIHC), Rectum Adenocarcinoma (READ), Lung Adenocarcinoma (LUAD), Lung Squamous Cell Carcinoma (LUSC) and Ovarian Cancer (OV). For each sample, we had the RNA-seq reads of 20 365 genes, processed using normalized RNA-seq by Expectation-Maximization (RSEM) [31].

Several articles, such as [32], consider the challenging and important task of generating biomarkers for distinguishing tumor types and subtypes. In this study, we also focus on this task, basing our experiments on the cohort presented in [24] by selecting samples from the locations used for the gene signature. The composition of this cohort is given in Table I (third and fourth columns). For the tumor subtype characterisation, we focused on subtypes whose number of samples exceeded a fixed multiple of $n_{subtype}$. In the end, 5 different tumor types, namely BRCA, HNSC, LIHC, READ and OV, were used for subtype classification.

VII. RESULTS AND DISCUSSION

This study has been designed around three pivotal, complementary aspects. The first one relates to the gene clustering performance, assessing the definition of the signature with respect to both a mathematical (DI) and a biological (ES) metric (section VII-A). The second evaluates the ability of the signature to relevantly separate the different tumor samples in an unbiased manner, in particular through sample clustering (section VII-B). The third aspect characterizes the tissue specificity of the signature through classification tasks on tumor types and subtypes (section VII-C). To better assess the results obtained, comparisons with referential clustering methods and gene signatures are performed throughout the different evaluations. A global comparison with all references over all metrics is provided in section VII-D.

Version submitted to IEEE Transactions on Bioinformatics and Computational Biology on 15/12/2020

Fig. 2. Evaluation of the clustering performance for different Enrichment Threshold values. Panels: LP-Stability (upper left), CorEx (upper right), K-Means (lower left), Random (lower right); x-axis: number of clusters; y-axis: Enrichment Score. The figure presents the percentage of enriched clusters for the different threshold values (th), using Kendall's correlation-based distance. The largest differences between enrichment thresholds are reported for Random Clustering when the number of clusters is relatively high. For the rest of the algorithms, and especially LP-Stability, the different thresholds only slightly impact the reported results.

A. Results on Clustering Gene Data

The obtained clusters were evaluated using both mathematical and biological evaluation criteria. Starting with the biological criteria, Fig. 2 presents a comparison of the ES per algorithm for different threshold (th) values. We observed that, for the different clustering methods, the threshold does not significantly change the behavior of the ES, indicating a strong statistical significance for the clusters. This is not the case for the random signature, for which an important disparity between the different th values of the ES appears once the number of clusters is high. For the rest of the study, we use a single fixed value of th.

To select the best distance per method, we used the DI metric. In Fig. 3 one can observe the influence of the distance with respect to the number of clusters for the random and LP-Stability methods. The comparison with random clustering reveals the bias that each distance introduces in the DI score. In particular, with correlation-based distances the reported DI scores are on average markedly higher. Thus, to tackle this bias, our comparisons refer to the difference between the DI score of a clustering and that of the corresponding random clustering for the same number of clusters and distance. Based on our experiments, we also noticed that the choice of distance greatly affects the performance of the clustering algorithm, with the correlation-based distances (especially Kendall's correlation) reporting in general higher performance. To ensure the biological meaning of the clusters, we also report the performance of the different distances for ES. Once again, the superiority of correlation-based distances, both in terms of performance and stability, is indicated. Besides, only the Euclidean distance does not reach the maximal ES value. This is due to the unbalanced clusters that the Euclidean distance favors, leading to very small clusters that are less likely to be enriched.
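The bias correction described above can be illustrated with a minimal sketch. It assumes that the Kendall's correlation-based distance is 1 - tau and that the random baseline is obtained by shuffling the cluster labels (which keeps the number and sizes of clusters); both are illustrative assumptions, and the function names below are hypothetical rather than taken from our implementation.

```python
import numpy as np
from scipy.stats import kendalltau


def kendall_distance_matrix(expr):
    """Pairwise distance 1 - tau between the rows (genes) of expr.

    Assumption: the correlation-based distance is taken as 1 - tau.
    """
    n = expr.shape[0]
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            tau, _ = kendalltau(expr[i], expr[j])
            dist[i, j] = dist[j, i] = 1.0 - tau
    return dist


def dunn_index(dist, labels):
    """Smallest between-cluster distance divided by largest cluster diameter.

    Requires at least two clusters and one cluster with two or more members.
    """
    labels = np.asarray(labels)
    clusters = [np.flatnonzero(labels == c) for c in np.unique(labels)]
    max_diameter = max(dist[np.ix_(c, c)].max() for c in clusters)
    min_separation = min(
        dist[np.ix_(a, b)].min()
        for k, a in enumerate(clusters)
        for b in clusters[k + 1:]
    )
    return min_separation / max_diameter


def dunn_gain_over_random(dist, labels, n_random=20, seed=0):
    """DI of a clustering minus the mean DI of label-shuffled clusterings.

    Assumption: shuffling labels stands in for the random clustering baseline.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    random_scores = [
        dunn_index(dist, rng.permutation(labels)) for _ in range(n_random)
    ]
    return dunn_index(dist, labels) - float(np.mean(random_scores))


# Toy data: five genes increasing across samples, five decreasing.
rng = np.random.default_rng(0)
trend = np.linspace(0.0, 1.0, 30)
expr = np.vstack([
    trend + 0.02 * rng.normal(size=(5, 30)),
    -trend + 0.02 * rng.normal(size=(5, 30)),
])
labels = np.array([0] * 5 + [1] * 5)
dist = kendall_distance_matrix(expr)
print(dunn_index(dist, labels), dunn_gain_over_random(dist, labels))
```

Reporting the gain rather than the raw DI makes scores obtained with different distances comparable, since each distance is measured against its own random baseline.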
For the rest of the paper, we selected Kendall's correlation-based distance when reporting LP-Stability and Random Clustering performances.

In Table II, we summarize the performance of LP-Stability in comparison to other algorithms based on both ES and DI scores, together with the reported number of clusters. Additional information about the average Enrichment and the average computational time per algorithm is also provided in the table. The best DI is achieved with LP-Stability and Kendall's correlation-based distance. Moreover, even if almost all the methods, except K-Means, reached an Enrichment Score of 100%, LP-Stability still reports the highest average Enrichment, whereas CorEx reaches only 71%. Another interesting point from this analysis is the indication of the optimal number of clusters per algorithm. Only LP-Stability reports its best values with a larger number of clusters, while the rest of the algorithms perform best with at most 7 clusters, and at most 5 if we consider DI alone (Table II). This might seem to be an argument in favor of the other algorithms, as they are able to define a more compact signature. However, such a low number of clusters highlights a failure to characterize a clustering structure, as these algorithms favor a configuration where genes are grouped all together. This is also indicated by the low average ES and DI scores. A computational time study was presented in [26] and can be found in Supplementary Material.

A thorough comparison of the different algorithms for different numbers of clusters is presented in Fig. 4. For both DI and ES, the superiority of the proposed LP-Stability over the other algorithms can be observed, both in terms of performance and of stability for a varying number of clusters. The reported results indicate that the proposed method can generate clusters that are both mathematically and biologically meaningful.
Moreover, one can observe that for Random Clustering the reported enrichment is very high for small numbers of clusters but drops dramatically as the number of clusters grows, while the DI is very low in all cases. This highlights the need to study both the mathematical performance and the stability of the biological score, as ES alone would not give significant results.

B. Unsupervised Signature Assessment

1) Signature Selection:

The signature was selected using the method detailed in section IV-A on the clustering presenting the highest DI among the clusterings having the best ES. However, due to the relatively low number of genes in the signatures based on K-Means, CorEx or Random Clustering, the sample clusterings obtained with those signatures gave quite irrelevant, intermixed tumor types (Fig. 1 in Supplementary). To deal with this, and for comparison purposes, we used for all these algorithms gene signatures with larger numbers of genes; in the following, CorEx and K-Means signatures refer to those signatures. To facilitate future studies leveraging this work, we provide the clustering of genes and samples as additional material.

Regarding the evaluation of the enriched biological processes for the different signatures, we found that the LP-Stability signature (27 genes with Kendall's correlation-based distance) does not present any redundancy in the biological processes, in contrast to the K-Means signature (Euclidean distance), which presents several hundred enriched biological processes. Moreover, the CorEx signature (Total Correlation) presents a low biological redundancy, with only the phototransduction process being enriched.

Our proposed gene signature using LP-Stability is composed of 27 genes. Globally, these genes, whose descriptions are detailed in Supplementary Materials, are related to cell development and cell cycle (CD53, NCAPH, GNA15, GADD45GIP1, CD302, YEATS2), DNA transcription (HSFX1, CCDC30, MATR3, ASH1L, ANKRD30A, GSX1), gene expression (ZNF767, C1orf159, RPS8, ZEB2), DNA repair (RIF1), antigen recognition (ZNF767), apoptosis (C3P1, CLIP3) and mRNA splicing (SNRPG). We also have many genes specific to cancer or having a major impact on cancer (CD53, ANKRD30A, ZEB2, ADNP, SFTA3, ACBD4). All these processes are highly important and significant for cancer. We also report in Supplementary Materials, for each gene, the main tissues in which it is overexpressed according to the GTEx portal. Finally, even if many genes are related to specific tissue types such as brain, blood lymphocytes, liver or gynecologic tissues, the overall profile of each gene is unique.

Fig. 3. Evaluation of the clustering performance for different distances. Panels: Random, Dunn's Index (left); LP-Stability, Dunn's Index (middle); LP-Stability, Enrichment Score (right); x-axis: number of clusters. Distances compared: Kullback-Leibler, Pearson's, Spearman's, Kendall's, Euclidean and Cosine. The performance of the different distances is presented for Random Clustering (left) in terms of Dunn's Index and for LP-Stability clustering (middle and right) in terms of Dunn's Index and Enrichment Score. Only DI results are presented for Random, as the ES computed on a given clustering is not influenced by the distance used. Both ES and DI are presented in percentages as a function of the number of clusters. The figure highlights the superiority of the correlation-based distances, and in particular Kendall's, for both mathematical and biological aspects.

TABLE II
Comparison of the different evaluated algorithms in terms of PPI Enrichment Score (ES) at a fixed threshold, Dunn's Index (DI), average ES and computational time. The LP-Stability algorithm outperforms the rest of the algorithms, reporting the highest DI and average ES score and the lowest computational time.

Method | Best ES: ES (%) | Best ES: DI (%) | Best ES: Clusters | Best DI: ES (%) | Best DI: DI (%) | Best DI: Clusters | Average ES (%) | Average DI (%) | Time
--- | --- | --- | --- | --- | --- | --- | --- | --- | ---
Random | 100 | 36 | 2 | 100 | 36 | 2 | 54 | 19.8 | -
K-Means (Euclidean) | 85.7 | 2.5 | 7 | 50 | 15.6 | 5 | 37 | 1.2 | 3h
CorEx (Total Correlation) | 100 | 2.4 | 5 | 100 | 2.4 | 5 | 71 | 0.6 | >

Fig. 4. Evaluation of the different clustering algorithms. Panels: Enrichment Score and Dunn's Index as a function of the number of clusters, for LP-Stability, CorEx, K-Means and Random. For the different evaluated algorithms, the ES and the DI are presented in terms of the number of clusters and using Kendall's correlation-based distance. For both metrics, LP-Stability reports the highest and most stable values. Moreover, the rest of the algorithms tend to report their highest scores for a very small number of clusters, indicating their failure to discover clustering structures.

2) Sample Clustering: Discovery Power:

The predictive powers of the best signature per algorithm, together with the random signatures and the signature presented in [24], were further assessed by measuring their ability to separate the 10 different tumor types (Table I) in a completely unsupervised manner, through sample clustering. In Fig. 5, the results are presented for the LP-Stability signature (Kendall's correlation-based distance), the K-Means signature (Euclidean distance), the CorEx signature (Total Correlation), the Random Clustering signature (Kendall's correlation-based distance) and the signature from [24]. One can observe that the CorEx and Random signatures fail to properly separate the tumor types; for this reason, the rest of the section presents a detailed comparison of the K-Means and LP-Stability signatures only. A thorough analysis of the clusterings of Fig. 5 is provided in [26] and in Supplementary Materials, and shows evidence validating the relevance of the approach presented here.

Another interesting point is that the distance used greatly affects the distribution of the different tumor types in the clustering. This shows the importance of selecting the distance in combination with the algorithm. Based on our experiments, we noticed that Spearman's and Kendall's correlations provide the best sample clustering for all the algorithms. In particular, Spearman's correlation tends to better separate the different tumors into different clusters, while Kendall's seems to generate clusters that group tumor-related samples (Fig. 2 in Supplementary Materials).

To compare our results with other methods in the literature, we assessed our gene signature against a knowledge-based signature that has been proven to be appropriate for determining immune-related sample clusters [24].
The obtained tumor distribution is presented in Fig. 5, reporting quite intermixed associations. Again, LIHC and BRCA are separated properly, while the rest of the tumors are clustered in unrelated groups. This comparison indicates the need for compact signatures, highlighting at the same time the difficulty of capturing the full genome information as well as the need for an automatically computed signature to avoid redundancy and information loss.
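The role of the distance in sample clustering can be illustrated with a short sketch that clusters samples in the signature feature space using a Spearman's correlation-based distance. Two choices here are assumptions made only for illustration: the distance is taken as 1 - rho, and average-linkage hierarchical clustering from SciPy stands in for the clustering algorithm. The function names and the synthetic data are hypothetical.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import rankdata


def spearman_distance_matrix(samples):
    """Distance 1 - rho between every pair of rows (samples x signature genes).

    Spearman's rho is computed as Pearson's correlation of per-sample ranks.
    """
    ranks = rankdata(samples, axis=1)
    rho = np.corrcoef(ranks)
    dist = np.clip(1.0 - rho, 0.0, None)  # guard against rounding below zero
    np.fill_diagonal(dist, 0.0)
    return (dist + dist.T) / 2.0


def cluster_samples(samples, n_clusters):
    """Average-linkage clustering (a stand-in algorithm) on the distance."""
    dist = spearman_distance_matrix(samples)
    tree = linkage(squareform(dist, checks=False), method="average")
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Synthetic example: two tumor types, each with its own 27-gene profile.
rng = np.random.default_rng(0)
profiles = rng.normal(size=(2, 27))
samples = np.vstack([
    profiles[0] + 0.1 * rng.normal(size=(10, 27)),
    profiles[1] + 0.1 * rng.normal(size=(10, 27)),
])
print(cluster_samples(samples, n_clusters=2))
```

Swapping the rank transform for raw values (Pearson) or for a Kendall-based distance changes only `spearman_distance_matrix`, which is what makes it easy to compare distances on the same signature.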

3) Gene Screening Analysis:

Screening analysis aims to identify the significant genes for each cluster, which are then used to determine the enriched biological processes per cluster. To determine those significant genes, we used the SAM method to look for genes that are expressed differently in the samples of one tumor type in a cluster compared to the other samples of the same tumor type (see section IV-B for more details). We refer to those genes as differentially expressed genes in the remainder of the article. Besides, the SAM scores allow us to rank significant genes; in the following, the most significant genes for a cluster, or for a tumor type in a cluster, are the genes reaching the highest SAM scores. This method allows us to check whether the tumor types of a given cluster share genes related to similar biological processes, highlighting the biological relevance of the cluster. Meanwhile, it enables us to verify the relevance of the distribution of the same tumor type into different clusters by checking the absence of similar biological processes. A summary of the analysis for our proposed signature's sample clustering is presented in Table III. In this section, we provide a detailed analysis per tumor type for each cluster, for both the K-Means and LP-Stability selected signatures.

Starting with the gene screening of LP-Stability, cluster 0 is one of the most intermixed clusters. This cluster contains significant genes for different tumor types that are associated with immune, defense response and other inflammatory processes, with strong enrichment. Among the most significant genes, we can report IL4R for HNSC, an interleukin receptor that is a treatment target for multiple cancers; GNA15 for LUSC, which has been highlighted in lung cancer treatment; and, for CESC, KRT5, which has been identified as a potential biomarker to distinguish adenocarcinomas from squamous cell carcinomas [33].
Continuing our analysis with cluster 1, which includes mainly BRCA samples, its most significant gene is FPR3. This gene seems to be related to immune inflammation and multiple cancers, including breast cancer [34]. Cluster 2, which is also mainly composed of BRCA samples, was identified as a basal BRCA cluster. Indeed, its most significant gene, TTC28, is related to breast cancer and especially basal BRCA [35]. Besides, the cluster is enriched for basal plasma membrane. Cluster 3 is also a mixed cluster, mostly composed of BRCA and OV cancers. It seems to be related to mitochondrial complexes and organization. Its most significant gene is NDUFB10, which is related to breast cancer [36] and OV cancer [37], and is also correlated with decreased viability in the esophageal squamous lineage as well as in LIHC. Cluster 4 is a lung-related cluster, composed of LUAD/LUSC samples. LUAD samples are related to immune response and LUSC samples to surfactant homeostasis, which is linked to many lung diseases. The most significant gene for LUAD is SFTA3, a lung protein [38] and a biomarker distinguishing LUAD and LUSC [33], whereas the most significant gene for LUSC samples is NAPSA, which has been proven to be of relevance for LUAD tumors. Cluster 5 mostly consists of LIHC samples and groups all the LIHC samples. Similarly, cluster 7 contains all GBM tumors. Thus, the screening process is not applicable to these tumor types, as it compares samples from the same tumor type over different clusters. Cluster 6 is a luminal breast cancer cluster, related to metabolic processes that have already been studied in a breast cancer context [41]. Its most significant gene seems to be FOXA1, a gene related to Estrogen-Receptor Positive Breast Cancer and Luminal Breast Carcinoma [39]. Cluster 7 is a GBM tumors cluster. Interestingly, the next two dominant tumor types in the cluster, BRCA and BLCA, are related to cardiovascularity and blood vessels; their respective most significant genes are ANGPTL5 and FERMT2.
The latter has been highlighted in GBM proliferation [42]. Cluster 8 has no biological process linked to immune response, but presents a strong association with metabolic and structural processes. This group of processes has been found significant for BRCA [43]. The most significant genes for this cluster are AP1M2 for BLCA samples, which is involved in tyrosine-based signals and has been considered in epithelial cell studies, and UQCRH for BRCA, a gene encoding the mitochondrial Hinge protein that is important in soft tissue sarcomas and, in particular, in two breast cancer cell lines and one ovarian cancer cell line [40]. Finally, cluster 9 is a cluster of BRCA tumors; its most significant gene is CIRBP (CIRP), which is considered to be an oncogene in several cancers and in particular for BRCA. Cluster 9 presents alternative splicing and coiled coil processes.

This analysis highlights that each cluster is enriched in similar biological processes, while the processes of different clusters differ. Moreover, it reveals that even the clusters that contain different tumor types present homogeneous biological processes. One pair of clusters is especially interesting, as one contains inflamed tumor samples and the other non-inflamed samples. These two clusters contain all the CESC samples, showing once more the relevance of the LP-Stability signature, as they automatically and without any prior knowledge separate inflammatory and non-inflammatory CESC samples. This specific problem is an active field of research [44].

Fig. 5. Gene Signature Assessment via tumor distribution analysis across 10 sample clusters generated in the different signatures' feature space. The graph presents the distribution of the different tumors for the Random Clustering signature, CorEx, K-Means, a referential gene signature [24] and LP-Stability. The distribution of tumors for the Random and CorEx signatures is quite intermixed, without many associations between tumor types, while the K-Means, referential and LP-Stability signatures seem to favor some good tumor associations.

TABLE III
Analysis of the biological pathways and most significant genes per cluster for the sample clustering performed using our proposed signature of 27 genes via the LP-Stability algorithm and Kendall's correlation-based distance. The table highlights the separation between inflamed and non-inflamed tumors and the identification of well-known cancer subtypes such as BRCA.

Cluster | Significant Gene | Tumor Type | Validated Role of the Gene in Tumor Type | Biological Pathways | Key Feature of the Cluster
--- | --- | --- | --- | --- | ---
Cluster 0 | IL4R | HNSC | Treatment target | Immune and defense response; regulation of cell proliferation; interferon-gamma mediated process | An inflamed solid tumors cluster
Cluster 0 | GNA15 | LUSC | Lung cancers treatment | |
Cluster 0 | KRT5 | CESC | Biomarker distinguishing adenocarcinomas from squamous cell carcinomas [33] | |
Cluster 1 | FPR3 | BRCA | Immune inflammation related [34] | Immune response related pathways | An inflamed BRCA tumors cluster
Cluster 2 | TTC28 | BRCA | Related to basal breast cancer risk [35] | Basal plasma membrane | A basal BRCA tumors cluster
Cluster 3 | NDUFB10 | BRCA | Related to breast [36] and ovarian [37] cancers, poor prognosis for esophageal lineage and LIHC | Mitochondrial complexes and cells organization related processes | A gynecologic tumors cluster linked to LIHC
Cluster 3 | SLC39A6 | OV | Poor prognosis for esophageal lineage and LIHC | |
Cluster 4 | SFTA3 | LUAD | Related to LUAD and LUSC [33], [38] | Pathways of immune response | An inflamed lung tumors cluster
Cluster 4 | NAPSA | LUSC | Related to LUAD | Surfactant homeostasis |
Cluster 5 | | LIHC | | | A pure complete LIHC tumor cluster
Cluster 6 | FOXA1 | BRCA | Related to Breast Luminal cancer [39] | Metabolic processes | A luminal BRCA tumors cluster
Cluster 7 | | GBM | Complete GBM cluster | Response to stimulus, cardiovascularity, blood vessels related | A GBM tumors cluster with other tumors, all enriched in cardiovascularity pathways
Cluster 7 | ANGPTL5 | BRCA | Angiopoietin-like protein family | |
Cluster 7 | FERMT2 | BLCA | Related to various cancer including breast ones | |
Cluster 8 | UQCRH | BRCA | Mitochondrial Hinge protein related [40] | Metabolic processes, general compound processes | Mis-splicing related tumors
Cluster 8 | AP1M2 | BLCA | Tyrosine-based signals | |
Cluster 9 | CIRBP | BRCA | Driver of many cancers | Related to alternative splicing processes and organelles | Alternative splicing related BRCA
These two clusters provide an even more valuable insight when studying the genes IFNG, STAT1, CCR5, CXCL9, CXCL10, CXCL11, IDO1, PRF1, GZMA, MHCII and HLA-DRA, highlighted in [45] for their major role in immunotherapy. Indeed, for each tumor type in one of these clusters, all or most of these genes are differentially expressed, which is not the case for the other, supporting the specificity and clinical relevance of the separation of these clusters.

On a second level, we analyzed the distribution of the BRCA cancer samples over the different clusters, examining its clinical relevance. We chose to highlight BRCA in this comparison, as it is the most represented tumor type and presents a variety of subtypes. Using the LP-Stability signature, BRCA samples are distributed into clusters 1, 2, 3, 6, 8 and 9, featuring the main molecular subtypes of BRCA. In particular, cluster 1 contains immune inflammatory samples, cluster 2 basal samples and cluster 6 the luminal Estrogen-Receptor Positive samples. Additionally, cluster 3 is a gynecologic cluster, with BRCA samples presenting relations to OV samples. Cluster 8 features mis-splicing related tumors, which are strongly related to BRCA samples [46]. Cluster 9 is marked by alternative splicing, whose implications in cancers are well known and studied [47]. It is also interesting to report that the hallmark genes BRCA1 and BRCA2 are positively and differentially expressed in the luminal BRCA cluster, which attests to an over-expression of these genes in cluster 6. This observation is consistent with [48], where BRCA1 and BRCA2 were more expressed in luminal BRCA samples, as they are markers of good prognosis. Besides, these genes are under-expressed in cluster 3, which is coherent, as this mixed cluster groups BRCA and OV samples that are known to present a bad prognosis.

For comparison, we performed the same analysis with the sample clustering produced with the K-Means signature.
After the analysis, we observed that this signature seems to be very specific to BRCA tumors, while showing weaker relevance in the separation of other samples. Indeed, the separation of BRCA samples is rather meaningful, as in each cluster BRCA samples present relevant differentially expressed genes and enriched biological processes. However, the clusters lack homogeneity, as the different tumor types within a cluster present unrelated differentially expressed genes and enriched biological processes. Besides, the K-Means signature fails to properly characterize other tumor types. This issue might be explained by the over-representation of BRCA samples in our data set. A detailed analysis of the sample clustering using the K-Means signature can be found in Supplementary Materials.

Additionally, to assess the importance of the distance used, we considered different distances to perform the sample clustering. Our experiments confirmed that the distance giving the most biologically relevant clusters was Spearman's correlation-based distance. Moreover, after the screening analysis, we observed that the differentially expressed genes are not necessarily the genes selected in the signatures. This observation indicates that the strength of our approach is to combine genes that might not be the most informative taken individually, but whose combination allows a good compact representation of the information carried by the whole genome for cancer tumors. It is also worth mentioning that the LP-Stability signature correctly separates immune inflammatory samples from the others for all tumor types.
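The screening used throughout this subsection ranks genes by their SAM scores. As a rough illustration, the sketch below computes a SAM-style relative difference per gene between the samples of one tumor type inside a cluster and the samples of the same tumor type outside it. It covers only the scoring step: the permutation-based FDR estimation of SAM is omitted, and taking the median of the per-gene standard errors as the stabilizing constant `s0` is a simplifying assumption. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def sam_scores(group_a, group_b, s0=None):
    """SAM-style relative difference d = (mean_a - mean_b) / (s + s0) per gene.

    group_a, group_b: arrays of shape (n_samples, n_genes).
    s is the pooled standard error of the difference in means.
    """
    n_a, n_b = group_a.shape[0], group_b.shape[0]
    mean_a, mean_b = group_a.mean(axis=0), group_b.mean(axis=0)
    pooled_ss = ((group_a - mean_a) ** 2).sum(axis=0) + (
        (group_b - mean_b) ** 2
    ).sum(axis=0)
    s = np.sqrt((1.0 / n_a + 1.0 / n_b) * pooled_ss / (n_a + n_b - 2))
    if s0 is None:
        # Assumption: a simple stand-in for SAM's data-driven choice of s0.
        s0 = np.median(s)
    return (mean_a - mean_b) / (s + s0)


# Synthetic example: gene 3 is over-expressed in the in-cluster samples.
rng = np.random.default_rng(0)
in_cluster = rng.normal(size=(20, 50))
out_of_cluster = rng.normal(size=(20, 50))
in_cluster[:, 3] += 3.0
scores = sam_scores(in_cluster, out_of_cluster)
top_genes = np.argsort(-np.abs(scores))[:5]  # most significant genes first
print(top_genes)
```

The constant `s0` keeps genes with very small variance from dominating the ranking, which is the main difference from an ordinary t-statistic.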

4) Expression Power:

The expression power of our signature was further evaluated using the ARI, NMI, homogeneity, completeness and FMS metrics and compared with the rest of the signatures. We call the average of those scores the Expression Power of the signature and report it in Fig. 6. Detailed results for each score are provided in Supplementary Materials. For Random Clustering, we calculated the metrics on the average results of the sample clusterings obtained from several random signatures. Overall, K-Means and LP-Stability perform best, with the former outperforming the latter. The good performance of K-Means could be due to the good separation of BRCA clusters, the dominant tumor type in our dataset.

C. Tumor Types/Subtypes Classification Tasks

The predictive power of our proposed signature has been assessed in a supervised setting by classifying the samples according to their tumor and sub-tumor types. This experiment aims to evaluate the tissue-specific information captured by each signature. In Table IV, we report the performance on training and test sets for each signature using the same classification strategy. Our experiments highlight that even random signatures with a relatively small number of genes report good performance, with a test balanced accuracy of 84%. This shows that even a low number of genes is informative enough to achieve a good separation of tumor types. However, our proposed signature reports the highest balanced accuracy, reaching 92% and outperforming the referential signature [24], which reached a balanced accuracy of 85%.

Regarding tumor subtypes classification, results averaged over all considered tumor types are provided in Table V. Our proposed method presents the highest performance, with a test balanced accuracy of 68%, outperforming the other algorithms by at least 9%. This task is quite challenging, as we are using the same compact signature to characterize all the different tumor types at a fine molecular level. Considering the complexity of the task and the large number of different classes, the results obtained with the proposed signature are very promising. Indeed, it surpasses the random signatures' average balanced accuracy by 11% and the referential signature, devised on this specific dataset, by 9%.

D. Global Comparison

To better summarize the different results and provide a fair comparison with the random, state-of-the-art and referential signatures, a spider chart is presented in Fig. 6. The comparison focuses on different criteria: (i) criteria based on the gene clustering performance, in blue; (ii) criteria based on the informativeness of the signature for unsupervised clustering tasks, in green; and (iii) criteria based on the relevance of the signature for supervised classification tasks, in gold. Discovery Power is the proportion of tumor types that are relevantly grouped in the sample clustering according to related tumor types; the evaluation criteria are presented in section IV-B. Expression Power corresponds to the average of the following clustering scores: ARI, NMI, homogeneity, completeness and FMS; the results are provided in Supplementary. Predictive Power: Types is the balanced accuracy on the test set of the tumor types classification task; results are provided in Table IV. Predictive Power: Subtypes is the average over all tumor types of the test balanced accuracy for tumor subtypes classification; results are provided in Table V. Biological Relevance is the average ES of the gene clustering method; results are provided in Table II. Mathematical Relevance is the average DI score of the gene clustering method; results are provided in Table II. Decreasing Time Complexity is the average time taken for the gene clustering (the bigger the area in the chart, the faster); results are provided in Table II. Our proposed signature is shown to be largely superior to the random and referential signatures in all criteria except compactness. It is also superior to the other signatures designed using our pipeline with other prominent clustering methods. One interesting exception is the Tumor-Specific Expression Power of the K-Means-derived signature. The signature defined with K-Means differentiates tumor types well, as also shown by Predictive Power: Types, but does not perform well at identifying the subtypes (Predictive Power: Subtypes). This is also due to the lower Discovery Power of K-Means compared to our proposed signature.

TABLE IV
Tumor types classification performance using the average performance of sets of randomly-selected genes of the same size as the proposed signature, CorEx, K-Means, the referential [24] and our proposed signatures.

Signature | Balanced Accuracy (%) Training | Balanced Accuracy (%) Test | Weighted Precision (%) Training | Weighted Precision (%) Test | Weighted Sensitivity (%) Training | Weighted Sensitivity (%) Test | Weighted Specificity (%) Training | Weighted Specificity (%) Test
--- | --- | --- | --- | --- | --- | --- | --- | ---
Random | 96 ± 5 | 84 ± 2 | 95 ± 5 | 87 ± 3 | 94 ± 7 | 86 ± 4 | 99 ± 1 | 97 ± 1
CorEx | 100 | 85 | 100 | 90 | 100 | 91 | 100 | 98
K-Means | 100 | 90 | 100 | 94 | 100 | 94 | 100 | 98
Referential [24] | 100 | 85 | 100 | 89 | 100 | 89 | 100 | 98
Proposed | 99 | 92 | 99 | 94 | 98 | 93 | 100 | 99

TABLE V
Tumor subtypes classification performance using the average performance of sets of randomly-selected genes of the same size as the proposed signature, CorEx, K-Means, the referential [24] and our proposed signature. Only the types of tumors with more than a fixed multiple of n_subtypes samples were studied.

Signature | Balanced Accuracy (%) Training | Balanced Accuracy (%) Test | Weighted Precision (%) Training | Weighted Precision (%) Test | Weighted Sensitivity (%) Training | Weighted Sensitivity (%) Test | Weighted Specificity (%) Training | Weighted Specificity (%) Test
--- | --- | --- | --- | --- | --- | --- | --- | ---
Random | 81 ± 11 | 57 ± 9 | 85 ± 8 | 66 ± 10 | 82 ± 9 | 62 ± 7 | 87 ± 12 | 74 ± 23
CorEx | 82 ± 19 | 59 ± 14 | 83 ± 18 | 70 ± 11 | 81 ± 20 | 65 ± 8 | 94 ± 6 | 71 ± 36
K-Means | 85 ± 12 | 53 ± 24 | 89 ± 10 | 67 ± 15 | 79 ± 20 | 56 ± 19 | 96 ± 3 | 69 ± 38
Referential [24] | 90 ± 11 | 59 ± 7 | 91 ± 9 | 68 ± 10 | 90 ± 9 | 67 ± 12 | 97 ± 4 | 70 ± 35
Proposed | 85 ± 11 | 68 ± 9 | 90 ± 6 | 73 ± 13 | 82 ± 16 | 63 ± 9 | 93 ± 6 | 89 ± 6

VIII. CONCLUSIONS

In this paper, we present a framework for gene clustering definition and comparison, and for gene signature selection and evaluation in terms of redundancy, compactness and expression power. In particular, we present a mathematical and biological evaluation of gene clustering, an extensive sample clustering evaluation using quantitative and field-specific clinical and biological metrics, and a supervised approach for its association with tumor types and subtypes characterization. In this framework, we have shown the interest of using the LP-Stability algorithm, a powerful center-based clustering algorithm, for gene clustering. Moreover, the obtained clusters formulate a gene signature proving causality and strong associations with tumor phenotypes. These results compete with those reported in the literature by using a large set of different omics data. In addition, our compact signature has been compared and proved to be more expressive than a prominent knowledge-based gene signature [24]. An extensive biological analysis evidenced that the designed signature leads to sample clusters with high relevance and correlation to cancer-related processes and immune response, reporting promising results in tumor types and subtypes classification. In the future, we aim to extend the proposed method towards discovering stronger gene dependencies through higher-order relations between gene expression data, as well as further evaluation of this biomarker for therapeutic treatment selection in the context of cancer.

Fig. 6. Comparison of the different signatures. Blue: criteria based on the gene clustering performance, Green: criteria based on the informativeness of the signature for unsupervised clustering tasks and Gold: criteria based on the relevance of the signature for supervised classification tasks.

Version submitted to IEEE Transactions on Bioinformatics and Computational Biology on 15/12/2020

IX. ACKNOWLEDGEMENT

We would like to acknowledge the support of Amazon Web Services grant and Pr. Stefano Soatto for fruitful discussions. We also acknowledge ARC sign’it program and Siric Socrates INCA. This work was partially supported by the Fondation pour la Recherche Médicale (FRM; no. DIC20161236437) and by the ARC sign’it grant ARC: Grant SIGNIT201801286. We also thank Y. Boursin, M. Azoulay and Gustave Roussy Cancer Campus DTNSI team for providing the infrastructure resources used in this work.

REFERENCES

[1] A. W. Kurian, E. E. Hare, M. A. Mills, K. E. Kingham, L. McPherson, A. S. Whittemore, V. McGuire, U. Ladabaum, Y. Kobayashi, S. E. Lincoln, M. Cargill, and J. M. Ford, “Clinical evaluation of a multiple-gene sequencing panel for hereditary cancer risk assessment,” Journal of Clinical Oncology, vol. 32, no. 19, pp. 2001–2009, Jul. 2014.
[2] D. Hanahan and R. A. Weinberg, “Hallmarks of cancer: The next generation,” Cell, vol. 144, no. 5, pp. 646–674, Mar. 2011.
[3] S. Ramaswamy, P. Tamayo, R. Rifkin, S. Mukherjee, C.-H. Yeang, M. Angelo, C. Ladd, M. Reich, E. Latulippe, J. P. Mesirov et al., “Multiclass cancer diagnosis using tumor gene expression signatures,” Proceedings of the National Academy of Sciences, vol. 98, no. 26, pp. 15149–15154, 2001.
[4] F. Chen, Y. Zhang, D. Bossé, A.-K. A. Lalani, A. A. Hakimi, J. J. Hsieh, T. K. Choueiri, D. L. Gibbons, M. Ittmann, and C. J. Creighton, “Pan-urologic cancer genomic subtypes that transcend tissue of origin,” Nature Communications, vol. 8, no. 1, p. 199, 2017.
[5] Y. Hasin, M. Seldin, and A. Lusis, “Multi-omics approaches to disease,” Genome Biology, vol. 18, no. 1, p. 83, May 2017.
[6] Y.-W. Wan, Y. Qian, S. Rathnagiriswaran, V. Castranova, and N. L. Guo, “A breast cancer prognostic signature predicts clinical outcomes in multiple tumor types,” Oncology Reports, vol. 24, no. 2, pp. 489–494, 2010.
[7] R. Sun, E. J. Limkin, M. Vakalopoulou, L. Dercle, S. Champiat, S. R. Han, L. Verlingue, D. Brandao, A. Lancia, and S. A. et al., “A radiomics approach to assess tumour-infiltrating CD8 cells and response to anti-PD-1 or anti-PD-L1 immunotherapy: an imaging biomarker, retrospective multicohort study,” The Lancet Oncology, vol. 19, no. 9, pp. 1180–1191, Sep. 2018.
[8] P. D. Dunne, M. Alderdice, P. G. O. Reilly, A. C. Roddy, A. M. B. McCorry, S. Richman, T. Maughan, S. S. McDade, P. G. Johnston, D. B. Longley, E. Kay, D. G. McArt, and M. Lawler, “Cancer-cell intrinsic gene expression signatures overcome intratumoural heterogeneity bias in colorectal cancer patient classification,” Nature Communications, vol. 8, p. 15657, May 2017.
[9] E. Drucker and K. Krapfenbauer, “Pitfalls and limitations in translation from biomarker discovery to clinical utility in predictive and personalised medicine,” EPMA Journal, vol. 4, no. 1, p. 7, 2013.
[10] M. Halkidi, Y. Batistakis, and M. Vazirgiannis, “On clustering validation techniques,” Journal of Intelligent Information Systems, vol. 17, no. 2, Dec. 2001.
[11] L. Kaufmann and P. Rousseeuw, “Clustering by means of medoids,” Data Analysis based on the L1-Norm and Related Methods, pp. 405–416, Jan. 1987.
[12] F. Kovács, C. Legány, and A. Babos, “Cluster validity measurement techniques,” in . Citeseer, 2005.
[13] S. K. Kingrani, M. Levene, and D. Zhang, “Estimating the number of clusters using diversity,” Artificial Intelligence Research, vol. 7, no. 1, p. 15, Dec. 2017.
[14] L. Hubert and P. Arabie, “Comparing partitions,” Journal of Classification, vol. 2, no. 1, pp. 193–218, 1985.
[15] F. Wagner, “Go-pca: an unsupervised method to explore gene expression data using prior knowledge,” PloS one, vol. 10, no. 11, p. e0143196, 2015.
[16] S. Pepke and G. V. Steeg, “Comprehensive discovery of subsample gene expression components by information explanation: therapeutic implications in cancer,” BMC Medical Genomics, vol. 10, no. 1, Mar. 2017.
[17] L. Cowen, T. Ideker, B. J. Raphael, and R. Sharan, “Network propagation: a universal amplifier of genetic associations,” Nature Reviews Genetics, vol. 18, no. 9, pp. 551–562, 2017.
[18] S. van Dam, U. Võsa, A. van der Graaf, L. Franke, and J. P. de Magalhães, “Gene co-expression analysis for functional classification and gene–disease predictions,” Briefings in Bioinformatics, p. bbw139, Jan. 2017.
[19] L. Cantini, L. Calzone, L. Martignetti, M. Rydenfelt, N. Blüthgen, E. Barillot, and A. Zinovyev, “Classification of gene signatures for their information value and functional redundancy,” NPJ Systems Biology and Applications, vol. 4, no. 1, p. 2, 2017.
[20] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” 1967.
[21] G. Ver Steeg and A. Galstyan, “Discovering structure in high-dimensional data through correlation explanation,” in Advances in Neural Information Processing Systems, 2014, pp. 577–585.
[22] N. Komodakis, N. Paragios, and G. Tziritas, “Clustering via lp-based stabilities,” in Advances in Neural Information Processing Systems 21, 2009, pp. 865–872.
[23] M. H. Bailey, C. Tokheim, E. Porta-Pardo, S. Sengupta, D. Bertrand, A. Weerasinghe, A. Colaprico, M. C. Wendl, J. Kim, B. Reardon et al., “Comprehensive characterization of cancer driver genes and mutations,” Cell, vol. 173, no. 2, pp. 371–385, 2018.
[24] V. Thorsson, D. L. Gibbs, S. D. Brown, D. Wolf, D. S. Bortone, T.-H. O. Yang, E. Porta-Pardo, G. F. Gao, C. L. Plaisier, J. A. Eddy et al., “The immune landscape of cancer,” Immunity, vol. 48, no. 4, pp. 812–830, 2018.
[25] J. Oyelade, I. Isewon, F. Oladipupo, O. Aromolaran, E. Uwoghiren, F. Ameh, M. Achas, and E. Adebiyi, “Clustering algorithms: Their application to gene expression data,” Bioinformatics and Biology Insights, vol. 10, p. BBI.S38316, Jan. 2016. [Online]. Available: https://doi.org/10.4137/bbi.s38316
[26] E. Battistella, M. Vakalopoulou, T. Estienne, M. Lerousseau, R. Sun, C. Robert, N. Paragios, and E. Deutsch, “Gene expression high-dimensional clustering towards a novel, robust, clinically relevant and highly compact cancer signature,” in IWBBIO 2019, Granada, Spain, May 2019. [Online]. Available: https://hal.archives-ouvertes.fr/hal-02076104
[27] V. G. Tusher, R. Tibshirani, and G. Chu, “Significance analysis of microarrays applied to the ionizing radiation response,” Proceedings of the National Academy of Sciences, vol. 98, no. 9, pp. 5116–5121, 2001.
[28] D. Szklarczyk, A. L. Gable, D. Lyon, A. Junge, S. Wyder, J. Huerta-Cepas, M. Simonovic, N. T. Doncheva, J. H. Morris, P. Bork et al., “String v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets,” Nucleic Acids Research, vol. 47, no. D1, pp. D607–D613, 2018.
[29] G. Chassagnon, M. Vakalopoulou, E. Battistella, S. Christodoulidis, T.-N. Hoang-Thi, S. Dangeard, E. Deutsch, F. Andre, E. Guillo, N. Halm et al., “Ai-driven ct-based quantification, staging and short-term outcome prediction of covid-19 pneumonia,” arXiv preprint arXiv:2004.12852, 2020.
[30] R. L. Grossman, A. P. Heath, V. Ferretti, H. E. Varmus, D. R. Lowy, W. A. Kibbe, and L. M. Staudt, “Toward a shared vision for cancer genomic data,” New England Journal of Medicine, vol. 375, no. 12, pp. 1109–1112, 2016.
[31] B. Li and C. N. Dewey, “Rsem: accurate transcript quantification from rna-seq data with or without a reference genome,” BMC Bioinformatics, vol. 12, no. 1, p. 323, Aug. 2011.
[32] H. Salem, G. Attiya, and N. El-Fishawy, “Classification of human cancer diseases by gene expression profiles,” Applied Soft Computing, vol. 50, pp. 124–134, Jan. 2017.
[33] J. Xiao, X. Lu, X. Chen, Y. Zou, A. Liu, W. Li, B. He, S. He, and Q. Chen, “Eight potential biomarkers for distinguishing between lung adenocarcinoma and squamous cell carcinoma,” Oncotarget, vol. 8, no. 42, May 2017.
[34] S.-Q. Li, N. Su, P. Gong, H.-B. Zhang, J. Liu, D. Wang, Y.-P. Sun, Y. Zhang, F. Qian, B. Zhao et al., “The expression of formyl peptide receptor 1 is correlated with tumor invasion of human colorectal cancer,” Scientific Reports, vol. 7, no. 1, p. 5918, 2017.
[35] Y. Hamdi, P. Soucy, V. Adoue, K. Michailidou, S. Canisius, A. Lemaçon, A. Droit, I. L. Andrulis, H. Anton-Culver, and V. A. et al., “Association of breast cancer risk with genetic variants showing differential allelic expression: Identification of a novel breast cancer susceptibility locus at 4q21,” Oncotarget, vol. 7, no. 49, Oct. 2016.
[36] Y. Zhang and Q. Cai, “Whole-exome sequencing identifies novel somatic mutations in chinese breast cancer patients,” Journal of Molecular and Genetic Medicine, vol. 09, no. 04, 2015.
[37] J. Permuth-Wey, Y. A. Chen, Y.-Y. Tsai, Z. Chen, X. Qu, J. M. Lancaster, H. Stockwell, G. Dagne, E. Iversen, H. Risch, J. Barnholtz-Sloan, J. M. Cunningham, R. A. Vierkant, B. L. Fridley, R. Sutphen, J. McLaughlin, S. A. Narod, E. L. Goode, J. M. Schildkraut, D. Fenstermacher, C. M. Phelan, and T. A. Sellers, “Inherited variants in mitochondrial biogenesis genes may influence epithelial ovarian cancer risk,” Cancer Epidemiology Biomarkers & Prevention, vol. 20, no. 6, pp. 1131–1145, Mar. 2011.
[38] M. Schicht, F. Rausch, S. Finotto, M. Mathews, A. Mattil, M. Schubert, B. Koch, M. Traxdorf, C. Bohr, D. Worlitzsch, W. Brandt, F. Garreis, S. Sel, F. Paulsen, and L. Brauer, “SFTA3, a novel protein of the lung: three-dimensional structure, characterisation and immune activation,” European Respiratory Journal, vol. 44, no. 2, pp. 447–456, Apr. 2014.
[39] V. Cappelletti, E. Iorio, P. Miodini, M. Silvestri, M. Dugo, and M. G. Daidone, “Metabolic footprints and molecular subtypes in breast cancer,” Disease Markers, vol. 2017, pp. 1–19, 2017.
[40] P. Modena, M. A. Testi, F. Facchinetti, D. Mezzanzanica, M. T. Radice, S. Pilotti, and G. Sozzi, “UQCRH gene encoding mitochondrial hinge protein is interrupted by a translocation in a soft-tissue sarcoma and epigenetically inactivated in some cancer cell lines,” Oncogene, vol. 22, no. 29, pp. 4586–4593, Jul. 2003.
[41] G. Schramm, E.-M. Surmann, S. Wiesberg, M. Oswald, G. Reinelt, R. Eils, and R. König, “Analyzing the regulation of metabolic pathways in human breast cancer,” BMC Medical Genomics, vol. 3, no. 1, Sep. 2010.
[42] A. M. Alshabi, B. Vastrad, I. A. Shaikh, and C. Vastrad, “Identification of crucial candidate genes and pathways in glioblastoma multiform by bioinformatics analysis,” Biomolecules, vol. 9, no. 5, p. 201, May 2019. [Online]. Available: https://doi.org/10.3390/biom9050201
[43] A. Read and R. Natrajan, “Splicing dysregulation as a driver of breast cancer,” Endocrine-Related Cancer, vol. 25, no. 9, pp. R467–R478, Sep. 2018.
[44] A. M. Heeren, S. Punt, M. C. Bleeker, K. N. Gaarenstroom, J. van der Velden, G. G. Kenter, T. D. de Gruijl, and E. S. Jordanova, “Prognostic effect of different PD-L1 expression patterns in squamous cell carcinoma and adenocarcinoma of the cervix,” Modern Pathology, vol. 29, no. 7, pp. 753–763, Apr. 2016.
[45] M. Ayers, J. Lunceford, M. Nebozhyn, E. Murphy, A. Loboda, D. R. Kaufman, A. Albright, J. D. Cheng, S. P. Kang, V. Shankaran et al., “Ifn-γ–related mrna profile predicts clinical response to pd-1 blockade,” The Journal of Clinical Investigation, vol. 127, no. 8, pp. 2930–2940, 2017.
[46] E. Koedoot, L. Wolters, B. van de Water, and S. E. L. Dévédec, “Splicing regulatory factors in breast cancer hallmarks and disease progression,” Oncotarget, vol. 10, no. 57, pp. 6021–6037, Oct. 2019. [Online]. Available: https://doi.org/10.18632/oncotarget.27215
[47] B. Singh and E. Eyras, “The role of alternative splicing in cancer,” Transcription, vol. 8, no. 2, pp. 91–98, 2017.
[48] A. M. Mahmoud, V. Macias, U. Al-alem, R. J. Deaton, A. Kadjaksy-Balla, P. H. Gann, and G. H. Rauscher, “Brca1 protein expression and subcellular localization in primary breast cancer: Automated digital microscopy analysis of tissue microarrays,” PloS one, vol. 12, no. 9, p. e0184385, 2017.

Enzo Battistella holds an MSc and a computer science engineering degree from Telecom Paris. He is currently pursuing a PhD at CentraleSupelec and Gustave Roussy on gene clustering and gene signature design. He is under the joint supervision of Eric Deutsch and Nikos Paragios.

Maria Vakalopoulou is an Assistant Professor at the MICS laboratory of CentraleSupelec, University Paris-Saclay, Paris, France. Prior to that, during 2017–2018, she was a postdoctoral researcher at the same university. She received her PhD in 2017 from the National Technical University of Athens, Athens, Greece, from where she also received her Engineering Diploma degree, graduating with excellence. Her research interests include medical imagery, remote sensing, computer vision, and machine learning. She has published her research in international journals (Lancet Oncology, Radiology, European Radiology) and conferences (MICCAI, ISBI, IGARSS).

Roger Sun is a resident in radiation oncology at Assistance Publique des Hôpitaux de Paris. He is currently doing his PhD under the supervision of Eric Deutsch in the INSERM U1030 molecular radiotherapy laboratory.

Théo Estienne holds an MSc and an engineering degree in Applied Mathematics from Centrale Paris (now CentraleSupelec). He is currently pursuing a PhD in collaboration between CentraleSupelec (MICS laboratory) and Gustave Roussy. His topics are medical imaging and deep learning, more specifically the extraction of new biomarkers for brain cancer using deep learning and brain MRI registration based on deep learning.

Marvin Lerousseau is a PhD student in oncology and artificial intelligence.

Sergey Nikolaev's main research interest has evolved from evolutionary genetics to the genetics of cancer, benign tumors and non-tumoral malformative disease. In the last 8 years, he has conducted successful genetic research with next generation sequencing on various cancer types, which resulted in important scientific advances published in Nat. Communications, Journal of Pathology, Cancer Research, Nat. Genetics and New England Journal of Medicine. In January 2018 he started his own group at the Gustave Roussy Comprehensive Cancer Center located in Villejuif, Paris. At Gustave Roussy, his team studies the genetics of Non-Melanoma Skin Cancers and continues to collaborate with clinicians to study the genetics of other rare tumor types and somatic mutations in non-tumoral malformative diseases.

Émilie Alvarez Andres holds an MSc in medical physics from the University of Paris-Sud (2017). She is currently pursuing a PhD in collaboration between TheraPanacea and Gustave Roussy on brain MRI-only radiotherapy based on Deep Learning.

Alexandre Carré holds an MSc in medical physics from the University of Paris-Saclay. He is currently pursuing a PhD at Gustave Roussy on the analysis of tumor heterogeneity for the personalization of brain tumor treatment plans in radiotherapy.

Stéphane Niyoteka is a PhD student at Gustave Roussy working on the radiomic signature of lymphocyte infiltrate for patients with locally advanced cervical cancers treated with combined chemoradiation +/- immunotherapy. He holds an MSc in medical physics from the University of Paris-Saclay.

Charlotte Robert is an assistant professor at Paris-Saclay University, where she teaches medical and nuclear physics at the Faculties of Medicine and Sciences. After a thesis in physics dedicated to the optimization of semiconductor SPECT systems, she completed her training with a three-year post-doctoral position at the IMNC laboratory (Imaging and Modelization for Neurology and Oncology) in Orsay, France. This post-doc was focused on the use of PET acquisitions for dose delivery control in hadrontherapy. Since 2019, she has been coordinating research activities dedicated to artificial intelligence and multi-modal imaging for personalized radiotherapy in the radiation therapy department of Gustave Roussy Hospital.

Nikos Paragios received his B.Sc. and M.Sc. degrees from the University of Crete, Gallos, Greece, in 1994 and 1996, respectively, his Ph.D. degree from I.N.R.I.A./University of Nice/Sophia Antipolis, Nice, France, in 2000 and his HDR (Habilitation à Diriger des Recherches) degree from the same university in 2005. He is a Professor of applied mathematics at CentraleSupelec/Université Paris-Saclay, Orsay, France. He has published more than 200 papers in the areas of computer vision, biomedical imaging, and machine learning. Prof. Paragios is the Editor-in-Chief of the Computer Vision and Image Understanding journal and serves on the editorial board of the Medical Image Analysis and SIAM Journal on Imaging Sciences. He is a Senior Fellow of the Institut Universitaire de France and a Member of the Scientific Council of Safran conglomerate.

Éric Deutsch