DiviK: Divisive intelligent K-means for hands-free unsupervised clustering in biological big data

Abstract

Investigation of molecular heterogeneity provides insights into tumor origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible. Automated unsupervised learning approaches are used for this purpose. However, this kind of analysis requires a lot of experience in setting its hyperparameters and usually upfront knowledge of the number of expected substructures. Moreover, the numerous measured molecules require an additional feature engineering step to provide valuable results. In this work we propose DiviK: a scalable auto-tuning algorithm for segmentation of high-dimensional datasets, and a method to assess the quality of the unsupervised analysis. DiviK is validated on two separate high-throughput datasets acquired by Mass Spectrometry Imaging in 2D and 3D. The proposed algorithm could be one of the default choices to consider during initial exploration of Mass Spectrometry Imaging data. With comparable clustering quality, it brings the possibility of focusing on different levels of dataset nuance, while requiring no upfront specification of the number of expected structures. Finally, due to its simplicity, DiviK is easily generalizable to an even more flexible framework, with another clustering algorithm used instead of k-means. A generic implementation is freely available under the Apache 2.0 license at https://github.com/gmrukwa/divik.


Mrukwa and Polańska

SOFTWARE

DiviK: Divisive intelligent K-means for hands-free unsupervised clustering in biological big data

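Grzegorz Mrukwa and Joanna Polańska

Abstract

Background:

Investigation of molecular heterogeneity provides insights into tumor origin and metabolomics. The increasing amount of data gathered makes manual analyses infeasible. Automated unsupervised learning approaches are used for this purpose. However, this kind of analysis requires a lot of experience in setting its hyperparameters and usually upfront knowledge of the number of expected substructures. Moreover, the numerous measured molecules require an additional feature engineering step to provide valuable results.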

Results:

In this work we propose DiviK: a scalable auto-tuning algorithm for segmentation of high-dimensional datasets, and a method to assess the quality of the unsupervised analysis. DiviK is validated on two separate high-throughput datasets acquired by Mass Spectrometry Imaging in 2D and 3D.

Conclusions:

The proposed algorithm could be one of the default choices to consider during initial exploration of Mass Spectrometry Imaging data. With comparable clustering quality, it brings the possibility of focusing on different levels of dataset nuance, while requiring no upfront specification of the number of expected structures. Finally, due to its simplicity, DiviK is easily generalizable to an even more flexible framework, with another clustering algorithm used instead of k-means. A generic implementation is freely available under the Apache 2.0 license at https://github.com/gmrukwa/divik.

Keywords: machine learning; unsupervised clustering; feature engineering; high dimensional data analysis; omics; mass spectrometry imaging; tumor heterogeneity

Background

Mass Spectrometry Imaging (MSI) is widely used for molecular profile discovery, as it provides unparalleled insight into the metabolomics of a tissue sample [1, 2, 3]. Applied to tumors, MSI allows investigating heterogeneities that could indicate functional differences across tissue regions, but also varying cancer subtypes that require dedicated treatment [4, 5, 6, 7, 8, 9].

Technically, MSI is an excellent example of biological big data, with the following characteristics:

• Volume: a spatial resolution down to 5 µm leads to 10,000 to 4,000,000 spectra potentially acquired from a 1 cm² tissue sample with Time-of-Flight (ToF) spectrometers.
• Velocity: the last 3 years led to more than 4,500 MSI datasets being uploaded to the METASPACE database [10, 11].
• Variety: a dataset may consist of more than 200,000 mass channels (features) per single spectrum (observation), representing proteins, peptides and metabolites.
• Veracity: acquisition methods and various biological phenomena introduce heavy duplication of information, so capturing the most relevant nuances becomes hard when dominating high-level patterns are amplified [12]. On the other hand, feature importance may differ across separate regions of interest.

Efficient analyses of thousands of MSI spectra usually require careful feature space adaptation, despite an extensive preprocessing pipeline [13, 14, 15]. In a fully unsupervised setup, the following groups of methods exist: filtering, linear (e.g. Principal Components Analysis) and non-linear (e.g. Uniform Manifold Approximation and Projection) [16]. Filtering removes features according to some, often constant, threshold. Linear methods produce a new set of features that are linear combinations of the input features.
Non-linear methods aim to find a low-dimensional representation that best approximates distances between pairs of points.

Further, many clustering methods are known and can be divided into a few major groups based on their approach: centroid-based (e.g. k-means), connectivity-based (e.g. hierarchical clustering), distribution-based (e.g. Gaussian Mixture Modeling) and density-based (e.g. graph-cuts clustering). Sometimes a method is proposed that exploits more than a single property of the data (e.g. both connectivity and centroids); such a method is called hybrid clustering.

Finally, clustering quality can be captured via metrics like Dunn's index [17] or the Adjusted Rand Index [18]. These were introduced for low-dimensional data clustering, but studies [19, 20] compare their usefulness for high-dimensional data and subspace clustering. Although hundreds of features are already considered a high-dimensionality problem in clustering, this is still at least an order of magnitude less severe than in MSI data analyses. Therefore domain-related work must be investigated.

The first solutions for Mass Spectrometry Imaging exploit Principal Components Analysis (PCA) and agglomerative clustering [21]. Data is transformed with PCA and the components explaining 70% of the variance are selected. The authors use the Euclidean metric to capture spectra similarity and Ward linkage to provide the most meaningful results. The approach is semi-supervised, as it requires manual setting of the number of clusters, based on histology examination.

Another work features high dimensional data clustering (HDDC) [22], a hybrid approach based on the Gaussian Mixture Model (GMM). Locally relevant features are obtained as a part of the fitting process and the domain is adapted for each estimated component. Since the model is Gaussian Mixture-based, similarity of observations is measured with the Euclidean metric in some subspaces of the original feature space.
For MSI data, a combination of HDDC with edge-preserving denoising of m/z-images is applied [23]. The authors discuss previous approaches, which only smoothed the clustering results as a form of post-processing, and propose an approach oriented on m/z-images, so that the denoising provides better quality data for the clustering method. The idea of denoising is extended in [24]: the FastMap algorithm adapts the domain and K-Means clustering follows, which is simpler than HDDC. The major advancements are located in the denoising stage and are omitted here. The overlap of clusters and actual structures is the primary quality metric; however, both approaches [23, 24] compare the visual perception of cluster consistency.

Region consistency is the main idea behind EXIMS [25], a modern PCA-oriented approach. First, the contrast of the molecular image is enhanced with histogram equalization. Then, a scoring method assesses whether structures exist in the enhanced molecular image. A random distribution of counts over the image leads to a low score. Common structure types such as regions, curves, gradients and islets are recognized with the gray level co-occurrence matrix and result in a higher score. However, the score is unbounded by definition, so there is no clear threshold to discern informative peaks. Finally, the informative peaks are PCA-transformed and the fuzzy c-means algorithm is used for clustering.

With the rise of single dataset volume, K-Means and spectral clustering are further investigated in a two-step scenario to speed up computations [26]. In the first step, several random subsets of the input data are clustered independently; then the cluster representatives are grouped to form the final clusters. Cosine distance is used as the spectrum similarity measure. Hyperparameter selection (i.e. the number of eigenvectors of the connectivity graph) is performed manually to provide the visually best results.
Both methods are manually set to isolate 7 clusters.

The rising single dataset volume is also addressed in feature extraction techniques. For example, Hierarchical Stochastic Neighbor Embedding (HSNE) is claimed to be superior to classical non-linear techniques [27]. The reduced feature space is constructed hierarchically, following the Overview-First, Details-on-Demand approach. Firstly, characteristic points of the dataset (landmarks) are embedded to provide an overview. Then, a new embedding is created for local neighborhoods. Each point is assigned a likelihood of being represented by a specific landmark. The authors discuss how to use maps of likelihood for molecular segmentation given a landmark; however, clusters of landmarks are proposed manually.

Another scalable feature extraction example is the Uniform Manifold Approximation and Projection (UMAP) algorithm. It is more scalable than t-SNE and preserves more of the global structure [28]. Moreover, no restriction is made on the embedding dimension. Thus it may be used as a more general feature extraction method in MSI, as compared to the visualization-only capabilities of t-SNE [29].

The mentioned unsupervised analyses of MSI data via clustering follow a single schema: candidate peaks are selected, some domain adaptation is applied and finally a basic clustering algorithm is applied. This approach has several drawbacks.

• PCA applied to the whole dataset at once is able to capture the most variance at a high level, but at the same time discards all of the nuances. These nuances might be crucial for hierarchical analyses. Other domain adaptation methods may partially resolve the issue, but with thousands of dimensions the disproportion between low- and high-level detail is significant.
• Non-hierarchical approaches (e.g. one-step, two-step) cannot provide insight into the internal structures unless the molecular differences dominate the whole dataset.
• Only hierarchical approaches describe the relation of subgroups in the form of a clustering tree that captures context like parent, child, or sibling. This context provides additional insight into molecular diversity. For example, a parent-child relation of clusters may describe different functional subregions of a tumor region, while sibling clusters may represent other types of tumor.
• Agglomerative or bisecting divisive hierarchical algorithms provide limited information about siblings, since on each level only two objects are merged or divided. Therefore many siblings are mistreated as an artificial parent-child relation. To identify such cases, an additional post-processing step would be required.

We propose a hybrid framework that is a direct answer to the above drawbacks: Divisive intelligent K-Means (DiviK). First, it has a hierarchical nature, to provide the most comprehensive insight. Second, during the analysis, domain adaptation is performed for each region of interest (ROI) separately, going from the top level of the whole dataset down to each subcluster. Finally, before a ROI is analyzed, a check is performed whether there is evidence that diverse groups are present in the ROI (see Figure 1).

[Figure 1 flowchart: feature space adaptation; ROI diversity confirmation; unsupervised split; recursively for each subregion]

Figure 1

Proposed hierarchical framework for unsupervised clustering of MSI data with local feature domain adaptation. For each ROI an analysis is performed whether it contains any implicit structures. If structures are present, molecular segmentation is continued recursively for each subregion.
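The recursion in Figure 1 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the DiviK implementation: the function `divisive_clustering`, its three pluggable callables and the toy one-dimensional helpers are hypothetical names introduced here, and the nested dictionary is just one possible way to store the clustering tree.

```python
from typing import Callable, List, Sequence

Spectrum = Sequence[float]


def divisive_clustering(
    roi: List[Spectrum],
    adapt_features: Callable[[List[Spectrum]], List[Spectrum]],
    is_diverse: Callable[[List[Spectrum]], bool],
    split: Callable[[List[Spectrum]], List[int]],
    min_size: int = 2,
) -> dict:
    """Recursively segment `roi`; returns a nested dict describing the tree."""
    node = {"size": len(roi), "children": []}
    if len(roi) <= min_size:
        return node                      # too small to split further
    adapted = adapt_features(roi)        # local feature space adaptation
    if not is_diverse(adapted):          # ROI diversity confirmation
        return node
    labels = split(adapted)              # unsupervised split
    if len(set(labels)) < 2:
        return node                      # degenerate split, stop here
    for label in sorted(set(labels)):
        # Recurse on the original spectra so that features are re-adapted
        # locally for every subregion.
        sub = [s for s, l in zip(roi, labels) if l == label]
        node["children"].append(
            divisive_clustering(sub, adapt_features, is_diverse, split, min_size))
    return node


# Hypothetical toy components for one-dimensional "spectra".
def identity(spectra):
    return spectra


def wide_range(spectra, threshold=5.0):
    values = [s[0] for s in spectra]
    return max(values) - min(values) > threshold


def split_at_midpoint(spectra):
    values = [s[0] for s in spectra]
    middle = (max(values) + min(values)) / 2.0
    return [int(v > middle) for v in values]


data = [[0.0], [0.1], [0.2], [10.0], [10.1], [10.2]]
tree = divisive_clustering(data, identity, wide_range, split_at_midpoint)
```

In this toy run the root is split once into two leaves of three spectra each, because neither subregion passes the diversity check. Passing the components as callables mirrors the replaceability of each step that the framework relies on.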

Implementation

In this section the three main components of the DiviK framework are described: feature filtering, stop condition check and clustering. The described composition is used to obtain the results discussed further; however, it is noteworthy that all the components are replaceable. That means one could easily use e.g. the DBSCAN algorithm (or any other) instead of K-Means, as long as certain conditions are met.

Feature Space Adaptation

A method for adaptive feature filtering in biological big data is already known [30] and fine-tuned [31]. It was validated on microarray gene expression data, with a similarly massive number of features as in MSI datasets. The method is based on GMM decomposition of a histogram of feature variance, abundance, or another kind of characteristic. The work specifies which components should be used, but the number of components and the characteristics of MSI data differ, so these aspects must be carefully calibrated for MSI. Firstly, we aim for a hands-free pipeline, thus the Bayesian Information Criterion is used to select the optimal number of GMM components for filtering. Secondly, out of numerous default feature characteristics we selected the abundance logarithm and the variance logarithm as promising candidates that allow for significant discrimination of features (see Figure 2).

Figure 2

Data-driven feature filtering based on histogram decomposition. For each feature its abundance and variance are considered. Histograms of logarithms of these two characteristics are decomposed into GMM. Crossing points of neighboring components are candidate thresholds (magenta). For the abundance levels (panel A) we select the first component to remove (black). For the variance (panel B) we keep only the topmost components that together retain at least 1% of all the features.

This component is replaceable by any unsupervised feature space adaptation, including PCA, UMAP, or a filtering method other than the one proposed here.
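The candidate thresholds in Figure 2 are crossing points of neighboring GMM components. A minimal sketch of how such a crossing point could be computed for two univariate Gaussian components is given below; equating the weighted log-densities yields a quadratic equation in x. The function names and the mixture parameters are hypothetical, and fitting the mixture itself (including BIC-based selection of the number of components) is outside this sketch.

```python
import math


def weighted_pdf(x, weight, mean, std):
    """Density of one mixture component, scaled by its mixing weight."""
    z = (x - mean) / std
    return weight * math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def crossing_point(w1, mu1, s1, w2, mu2, s2):
    """Return x between mu1 and mu2 where both weighted densities are equal.

    Equating the log-densities gives a*x^2 + b*x + c = 0.
    Returns None if no crossing lies strictly between the two means.
    """
    a = 1.0 / (2.0 * s2 ** 2) - 1.0 / (2.0 * s1 ** 2)
    b = mu1 / s1 ** 2 - mu2 / s2 ** 2
    c = (mu2 ** 2 / (2.0 * s2 ** 2) - mu1 ** 2 / (2.0 * s1 ** 2)
         + math.log((w1 * s2) / (w2 * s1)))
    if abs(a) < 1e-12:
        if abs(b) < 1e-12:
            return None              # identical means and variances
        roots = [-c / b]             # equal variances: the equation is linear
    else:
        disc = b * b - 4.0 * a * c
        if disc < 0:
            return None
        roots = [(-b + sign * math.sqrt(disc)) / (2.0 * a) for sign in (1, -1)]
    low, high = sorted((mu1, mu2))
    inside = [r for r in roots if low < r < high]
    return inside[0] if inside else None


# Hypothetical components of a histogram of log-variance.
symmetric = crossing_point(0.5, 0.0, 1.0, 0.5, 4.0, 1.0)
threshold = crossing_point(0.5, 0.0, 1.0, 0.5, 4.0, 2.0)
```

For two components with equal weights and variances the threshold falls at the midpoint of the means; with unequal variances it shifts toward the narrower component.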

Stop Condition

There exist many indices to validate the quality of unsupervised segmentation. These take into account cluster separability, compactness or probabilistic measures. However, there are just a few that allow for a comparison of a multi-cluster partition to an artificial single-cluster partition. For the K-Means algorithm, the GAP statistic [32] is one of the options. It relates each partition obtained via centroid-based clustering to partitions obtained via the same method over random datasets within the same bounds. We use GAP in a two-trial scenario to construct a test for diversity inside a ROI (see Figure 3).

This component is replaceable by any kind of multidimensional data unimodality check.
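The two-trial test can be illustrated with a small, dependency-free sketch. It assumes the standard definition of the GAP statistic from [32] (mean log-dispersion of uniform reference datasets minus log-dispersion of the real data) and a very basic Lloyd's k-means with deterministic farthest-point seeding; none of the names below come from the DiviK package, and the real implementation differs in distance metric, initialization and sampling.

```python
import math
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dispersion(points, k, n_iter=25):
    """Within-cluster sum of squares after a basic Lloyd's k-means."""
    # Deterministic farthest-point seeding keeps the sketch reproducible.
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids)))
    for _ in range(n_iter):
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            groups[nearest].append(p)
        # Move each centroid to the mean of its group (keep it if empty).
        centroids = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else c
            for g, c in zip(groups, centroids)]
    return sum(sq_dist(p, c) for g, c in zip(groups, centroids) for p in g)


def gap(points, k, n_refs=10, seed=0):
    """Gap(k): mean log-dispersion of uniform references minus the real one."""
    rng = random.Random(seed)
    bounds = [(min(col), max(col)) for col in zip(*points)]
    ref_logs = []
    for _ in range(n_refs):
        ref = [tuple(rng.uniform(lo, hi) for lo, hi in bounds) for _ in points]
        ref_logs.append(math.log(dispersion(ref, k)))
    return sum(ref_logs) / n_refs - math.log(dispersion(points, k))


def is_diverse(points, n_refs=10, seed=0):
    """Two-trial test: split further only if Gap(2) exceeds Gap(1)."""
    return gap(points, 2, n_refs, seed) > gap(points, 1, n_refs, seed)


# Two well-separated synthetic groups should be reported as diverse.
rng = random.Random(1)
blob_a = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(30)]
blob_b = [(rng.gauss(50, 1), rng.gauss(50, 1)) for _ in range(30)]
two_groups_detected = is_diverse(blob_a + blob_b)
```

For clearly separated groups the single-cluster dispersion of the real data exceeds that of the uniform references, so Gap(1) is low while Gap(2) is high. For a homogeneous region the two values are close, which is why the comparison only serves as a stop condition and not as a selector of the optimal number of clusters.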

Clustering

[Figure 3 flowchart: data; draw N random uniform data sets; group uniform data sets; compute dispersions in random samples; labels and centroids; compute dispersions in real data; dispersion ratio]

Figure 3

Flowchart of GAP statistic computation used in the two-trial scenario. Spectra belonging to an ROI are artificially clustered into a single cluster and then into two clusters. The GAP statistic is compared for both cases. If the spectra form just a single group, the GAP statistic tends to be greater for the artificial single-cluster partition. Otherwise it tends to be greater for two clusters (but not necessarily optimal).

K-Means clustering is neither the most powerful nor the most efficient clustering method known. However, due to its simplicity, it is highly interpretable and open for calibration to the requirements of a specific domain. Thus we select it as a flexible baseline to demonstrate the concept of the DiviK framework. For the purpose of MSI data analysis, we adjust the distance metric [33] and the initialization method.

The K-Means algorithm requires the number of clusters to be specified upfront. Since the actual number of molecularly homogeneous regions is unknown, we use unsupervised quality metrics to guide the computations. The current implementation has Dunn's index [17] and the GAP statistic [32] already included.

This component is replaceable by any unsupervised clustering algorithm with an automated selection of the number of clusters.
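As an illustration of how an unsupervised quality metric can guide the choice of the number of clusters, the sketch below computes Dunn's index in its classical form: the smallest distance between points of different clusters divided by the largest cluster diameter. The function name and the default Euclidean distance are assumptions of this example; other variants of the index (for example centroid-based ones) exist, and the source does not state which one the package uses.

```python
import math
from itertools import combinations


def dunn_index(points, labels, dist=math.dist):
    """Classical Dunn's index; requires at least two clusters."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    groups = list(clusters.values())
    # Smallest distance between points that belong to different clusters.
    separation = min(
        dist(p, q) for a, b in combinations(groups, 2) for p in a for q in b)
    # Largest distance between points that belong to the same cluster.
    diameter = max(
        (dist(p, q) for g in groups for p, q in combinations(g, 2)),
        default=0.0)
    return separation / diameter if diameter > 0 else math.inf


points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
good = dunn_index(points, [0, 0, 1, 1])   # compact, well-separated clusters
bad = dunn_index(points, [0, 1, 0, 1])    # stretched, interleaved clusters
```

A sweep over candidate numbers of clusters would keep the partition with the highest index. The pairwise formulation is quadratic in the number of points, which is consistent with evaluating such indices on subsamples for large datasets.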

Initialization Method

Due to the gradient-like nature of K-Means clustering, an effective initialization method is needed that can approximate the global optimum of the target. Although such a method has already been proposed and verified with MSI data [33], we introduce two extensions:

• Robustness to outliers: we select the objects closest to a fixed percentile of distances, not the extreme.
• Scalability: we build a KD-Tree first, as a high-level overview of the dataset. The maximal leaf size is a fixed percentage of the input data. The linear model is built on top of the KD-Tree segments instead of all data, which greatly reduces the amount of computation. The KD-Tree is further exploited during the selection process.
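The robustness extension can be illustrated with a simplified seeding routine. This sketch makes two loud assumptions: the first seed is chosen by distance from the data mean rather than by the linear model of [33], and no KD-Tree is used. The function names are hypothetical. What it does preserve is the central idea that each seed is the object closest to a fixed percentile of distances instead of the most extreme one.

```python
import math


def closest_to_percentile(values, percentile):
    """Index of the value closest to the given percentile (nearest-rank)."""
    ranked = sorted(values)
    rank = max(1, math.ceil(percentile * len(ranked) / 100.0))
    target = ranked[rank - 1]
    return min(range(len(values)), key=lambda i: abs(values[i] - target))


def percentile_seeding(points, k, percentile=99.0):
    """Pick k seeds; each new seed is the point whose distance to its nearest
    already-chosen seed is closest to the requested percentile."""
    center = tuple(sum(col) / len(points) for col in zip(*points))
    first = closest_to_percentile(
        [math.dist(p, center) for p in points], percentile)
    seeds = [points[first]]
    while len(seeds) < k:
        nearest = [min(math.dist(p, s) for s in seeds) for p in points]
        seeds.append(points[closest_to_percentile(nearest, percentile)])
    return seeds


# Ten regular observations and one extreme outlier.
points = [(float(i),) for i in range(10)] + [(1000.0,)]
robust = percentile_seeding(points, 2, percentile=90.0)
extreme = percentile_seeding(points, 2, percentile=100.0)
```

With the 100th percentile (the classical farthest-point choice) the outlier becomes a seed, while the 90th percentile skips it and returns seeds from the bulk of the data.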

Technology

DiviK is written mostly in Python and distributed cross-platform under the Apache 2.0 license through the Python Package Index (https://pypi.org/project/divik/) and Docker Hub (https://hub.docker.com/r/gmrukwa/divik). It is designed to be simple and efficient, accessible to a non-expert, and highly reusable. Thus DiviK is accessible in Python directly and via a command-line interface (CLI). The Python API follows the scikit-learn [34] design patterns and similar package organization conventions to provide reusable building blocks. The CLI allows constructing highly flexible processing pipelines thanks to an injection-based configuration system.

Results

We evaluate the tool against two MSI datasets of different characteristics:

• Oral Squamous Cell Carcinoma (OSCC): single annotated slices from two different patients;
• mouse kidney: 75 slices from a single 3D-allocated mouse kidney, without annotation.

For each dataset, DiviK is compared to combinations of different clustering and feature extraction methods. We select k-means and spectral clustering as representatives of clustering algorithms validated for MSI. Both are used in a fully unsupervised setup, with the number of clusters selected using the GAP statistic. For feature extraction we use one of: no extraction, PCA with knee-based selection of the number of components [35], PCA on top of EXIMS-preselected features, and UMAP.

All the experiments were carried out in the Polyaxon environment [36].

Oral Squamous Cell Carcinoma

The OSCC dataset [33] is a two-dimensional MALDI-ToF MSI dataset. We selected a subset consisting of two patient samples, with 19,874 spectra and 3,714 GMM components in total. Spectra are provided with an annotation of the tissue type, allowing tumor, epithelium and healthy areas to be distinguished (Figure 4).

The dataset is segmented using all the combinations of the aforementioned methods. As the volume of 2D data is lower, we additionally include spatial clustering [24] in the comparison.

DiviK sweeps up to 10 clusters on each segmentation level. It operates with correlation distance. The minimal fraction of features that it is required to preserve is 1%. We do not split a cluster further if it already contains 200 spectra or fewer. Leaf size during the initialization is 1% of the subset size and the algorithm starts from the leaf containing the 99th percentile of the distance. When computing Dunn's and Gap indices, we sample 1,000 spectra 10 times. K-means clustering is set up to sweep up to 50 clusters, with the same criteria for computing the Gap index as DiviK and with the correlation distance as well. Spatial clustering is launched with a radius of 7. Spectral clustering uses the cosine metric during the embedding, with the number of embedding components equal to 1% of the number of features. UMAP was run with 30 neighbors, correlation distance, 500 epochs and a negative sample rate of 70 to obtain 3 components.
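Several of the configurations above rely on correlation distance. For reference, the sketch below uses the standard definition, one minus the Pearson correlation coefficient of two intensity vectors. The function name is an assumption of this example, and the exact implementation used in the experiments is not shown in the source.

```python
import math


def correlation_distance(u, v):
    """1 - Pearson correlation between two equally long intensity vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    du = [x - mean_u for x in u]
    dv = [y - mean_v for y in v]
    norm = math.sqrt(sum(x * x for x in du) * sum(y * y for y in dv))
    if norm == 0.0:
        raise ValueError("correlation is undefined for a constant vector")
    return 1.0 - sum(x * y for x, y in zip(du, dv)) / norm


same_shape = correlation_distance([1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
opposite = correlation_distance([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

The distance ignores offset and scale, so two spectra with the same profile but different overall intensity are treated as identical, while perfectly anti-correlated profiles reach the maximum value of 2.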

Figure 4

Microscope image of HNC samples annotated by a pathologist. Red: tumor; cyan: epithelium; magenta: connective tissue; yellow: muscle; green: salivary gland. Due to medical relevance we focus mostly on tumor and epithelium tissue, the origin of OSCC.

A fully unsupervised approach yields a varying number of clusters, which requires normalization before any numerical assessment. The normalization procedure is explained in Figure 5.

[Figure 5 plot axes: percentage of clusters; Dice index]

Figure 5

Cluster label normalization procedure. Red: false negative; cyan: false positive; grey: true positive tumor. In the first step a binary decision is made whether a label should belong to one of the pathologist-defined regions. Clusters are sorted by the percentage of their area covered by the ROI. They are selected sequentially to optimize the Dice index. Secondly, all ambiguous assignments are resolved via optimization of the Rand Index to form the normalized labels.
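The first step of the normalization can be sketched as a greedy selection. The example below assumes clusters and the ROI are given as sets of pixel indices and that selection stops at the first cluster that no longer improves the Dice index; both the data layout and the stopping rule are assumptions of this illustration, and the second step (resolving ambiguous assignments via the Rand Index) is not covered.

```python
def dice(selected, roi):
    """Dice index between two sets of pixel indices."""
    if not selected and not roi:
        return 1.0
    return 2.0 * len(selected & roi) / (len(selected) + len(roi))


def match_clusters_to_roi(clusters, roi):
    """Greedily pick cluster labels whose union best reconstructs `roi`.

    clusters: dict mapping cluster label -> non-empty set of pixel indices
    roi: set of pixel indices annotated as the region of interest
    """
    # Clusters with the largest share of their area inside the ROI go first.
    order = sorted(
        clusters,
        key=lambda c: len(clusters[c] & roi) / len(clusters[c]),
        reverse=True)
    chosen, covered, best = [], set(), 0.0
    for label in order:
        candidate = covered | clusters[label]
        score = dice(candidate, roi)
        if score <= best:
            break  # adding further clusters no longer improves the Dice index
        chosen, covered, best = chosen + [label], candidate, score
    return chosen, best


# Hypothetical example: a 10-pixel ROI and three clusters.
roi = set(range(10))
clusters = {
    "a": set(range(0, 6)),     # fully inside the ROI
    "b": set(range(6, 11)),    # 4 of 5 pixels inside the ROI
    "c": set(range(11, 20)),   # entirely outside the ROI
}
chosen, best = match_clusters_to_roi(clusters, roi)
```

Here clusters "a" and "b" are assigned to the ROI with a Dice index of 20/21, while "c" is rejected because including it would lower the score.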

We use this dataset to compare the limits of ROI reconstruction capabilities of specific configurations. With clusters matched to the pathologist's ROIs, the following quality metrics are considered:

• Dice Index: to capture tumor reconstruction capabilities;
• Rand Index: to capture global multi-ROI reconstruction capabilities;
• EXIMS score: to capture the spatial consistency of the clusters.

The EXIMS score is an unbounded measure, useful during comparative analyses, but its magnitude is hard to interpret. Therefore we scaled and clipped the values so that the highest multi-cluster result is 1.

Performance visualization of the DiviK algorithm and the remaining configurations is presented in Figure 6. Exact values of the quality metrics are available in Table 1. A normalized map of the clusters obtained in the process is presented in Figure 7.

[Figure 6 plot axes: Dice Index; EXIMS Score; Rand Index]

Figure 6

Quality indices of ROI composability by the obtained clusters. Next to each point is the length of the vector, used as the quality measure. The arrow indicates the top result.

Table 1

Values of quality indices measured for OSCC data.

Dice index  Relative EXIMS score  Adjusted Rand index  Overall quality  Clustering algorithm  Feature extraction method
0.4844      0.5891                0.2792               0.8122           Spectral              UMAP
0.0000      1.0000                0.0000               1.0000           Spatial               UMAP
0.0000      1.0000                0.2723               1.0364           K-Means               Knee PCA
0.5129      0.8323                0.4827               1.0903           K-Means               EXIMS PCA
0.7418      0.6449                0.5447               1.1237           Spectral              EXIMS PCA
0.5043      0.9712                0.3364               1.1449           K-Means               none
0.7238      0.7225                0.5231               1.1487           K-Means               UMAP
0.7065      0.7639                0.4985               1.1537           Spatial               Knee PCA
0.7765      0.6383                0.6082               1.1749           DiviK                 EXIMS PCA
0.7966      0.6520                0.5906               1.1868           Spectral              none
0.7540      0.7289                0.5567               1.1873           DiviK                 Knee PCA
0.7720      0.7587                0.5617               1.2195           Spatial               none
0.8369      0.6568                0.6534               1.2485           DiviK                 UMAP
0.6897      0.9891                0.4594               1.2904           Spectral              Knee PCA
0.8672      0.6977                0.7035               1.3167           Spatial               EXIMS PCA
0.7372      1.0000                0.5433               1.3560           DiviK                 none
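The caption of Figure 6 states that the overall quality is the length of the vector formed by the three indices. The short check below, written for this edition rather than taken from the authors' code, recomputes that length from the rounded values in Table 1 and also derives the median of the column, which is the value quoted in the Discussion (1.1643).

```python
import math
from statistics import median

# (Dice, relative EXIMS, adjusted Rand, overall quality, algorithm, features)
TABLE_1 = [
    (0.4844, 0.5891, 0.2792, 0.8122, "Spectral", "UMAP"),
    (0.0000, 1.0000, 0.0000, 1.0000, "Spatial", "UMAP"),
    (0.0000, 1.0000, 0.2723, 1.0364, "K-Means", "Knee PCA"),
    (0.5129, 0.8323, 0.4827, 1.0903, "K-Means", "EXIMS PCA"),
    (0.7418, 0.6449, 0.5447, 1.1237, "Spectral", "EXIMS PCA"),
    (0.5043, 0.9712, 0.3364, 1.1449, "K-Means", "none"),
    (0.7238, 0.7225, 0.5231, 1.1487, "K-Means", "UMAP"),
    (0.7065, 0.7639, 0.4985, 1.1537, "Spatial", "Knee PCA"),
    (0.7765, 0.6383, 0.6082, 1.1749, "DiviK", "EXIMS PCA"),
    (0.7966, 0.6520, 0.5906, 1.1868, "Spectral", "none"),
    (0.7540, 0.7289, 0.5567, 1.1873, "DiviK", "Knee PCA"),
    (0.7720, 0.7587, 0.5617, 1.2195, "Spatial", "none"),
    (0.8369, 0.6568, 0.6534, 1.2485, "DiviK", "UMAP"),
    (0.6897, 0.9891, 0.4594, 1.2904, "Spectral", "Knee PCA"),
    (0.8672, 0.6977, 0.7035, 1.3167, "Spatial", "EXIMS PCA"),
    (0.7372, 1.0000, 0.5433, 1.3560, "DiviK", "none"),
]


def overall_quality(dice, exims, rand):
    """Euclidean length of the (Dice, EXIMS, Rand) vector."""
    return math.sqrt(dice ** 2 + exims ** 2 + rand ** 2)


# Largest deviation between the recomputed and the tabulated overall quality.
max_error = max(
    abs(overall_quality(d, e, r) - q) for d, e, r, q, _, _ in TABLE_1)
median_quality = median(q for _, _, _, q, _, _ in TABLE_1)
divik_above_median = all(
    q > median_quality for _, _, _, q, alg, _ in TABLE_1 if alg == "DiviK")
```

The recomputed lengths agree with the tabulated column up to the rounding of the inputs, and every DiviK row lies above the median of the column.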

Mouse Kidney 3D

The mouse kidney dataset [37] is a three-dimensional MALDI-ToF MSI dataset. It consists of 75 sections from the central part of a mouse kidney, 1,362,830 spectra in total, 7,680 data points each. The dataset had already been subjected to Gaussian spectral smoothing and baseline reduction with the Top Hat algorithm.

We conduct clustering in the same scenarios as previously. There are no labels available for the spectra, thus the capabilities of heterogeneity detection must be assessed visually. Clusters are subject to a post-processing procedure similar to the one already described, to indicate the similarities between results.

The minimal fraction of features that DiviK is required to preserve is 0.5%. We do not split a cluster further if it already contains 50,000 spectra or fewer. Leaf size during the initialization is 0.1% of the subset size and the algorithm starts from the leaf containing the 95th percentile of the distance. When computing Dunn's and Gap indices, we sample 5,000 spectra 10 times.

In Figure 8 we present volumes with clusters marked. Spectral clustering is not included, as the computational complexity of the basic approach does not allow for enough scalability. On the other hand, the two-step approach [26] does not lead to convergence.

[Figure 7 panels: columns: No feature extraction, Knee PCA, EXIMS, UMAP; rows: K-Means, DiviK, Spectral clustering, Spatial clustering]

Figure 7

Results of the unsupervised analyses for OSCC data. Clusters were matched to the pathologist regions using the method described. Red: tumor region; cyan: healthy epithelium; gray: other tissue.

Discussion

Oral Squamous Cell Carcinoma

As one can see in Figure 7, heterogeneity detection capabilities vary depending on the methods selected for clustering and feature selection. For example, UMAP combined with spatial clustering does not create clusters overlapping with biological structures (Dice index 0.00, Rand index 0.00). […] (Dice index 0.77, Rand index 0.56 with no feature extraction) and UMAP (Dice index 0.72, Rand index 0.52 with K-Means) tend to exhibit high potential for capturing the necessary detail (see Table 1).

On the contrary, spatial clustering with EXIMS-based feature extraction approximates the tumor region with the highest Dice index and the top ROI composition expressed via the Rand index. However, these are not the only criteria used by medical experts during quality assessment.

Visual comparison between clusters (Figure 7) and pathologist-annotated ROIs (Figure 4) shows that spectral clustering and DiviK yield rather stable regions regardless of the feature extraction method.

Finally, the trade-off between the agreement measures and the visual consistency of clusters is assessed through the EXIMS score and the overall quality concept (see Figure 6). Naturally, the top EXIMS score is achieved for the most consistent segmentations, which yield a single cluster (spatial clustering with UMAP) or miss one of the ROIs completely (K-Means with PCA). Since these are not relevant from the medical point of view, we bound their relative EXIMS score at 1. […] overall quality a top one. Additionally, all the DiviK configurations yield overall quality over the median (1.1643).

[Figure 8 panels: columns: No feature extraction, Knee PCA, EXIMS, UMAP; rows: K-Means, DiviK]

Figure 8

Results of the unsupervised analyses for mouse kidney data.

Mouse Kidney 3D

Published benchmark datasets [37] are specifically oriented on high-scale computations. Such scale significantly increases the risk that crucial details are missed. Moreover, it renders many methods useless due to their computational complexity. Therefore additional modifications are sometimes required [26].

As one can observe in Figure 8, an algorithm yields just a single cluster much more often:

• K-Means without feature extraction;
• Spectral clustering without external feature extraction;
• Spectral clustering with UMAP feature extraction.

Structures discovered via the K-Means and DiviK algorithms look consistent when compared to other work [27, 38] using this data.

Conclusions

DiviK is a tool that provides a reasonable trade-off between the accuracy of unsupervised heterogeneity discovery and the consistency of the obtained clusters. It seems to be scalable and robust enough to cope with both small- and large-scale MSI data. By comparing it with other configurations already proven for MSI, we show that DiviK is feasible for real-world applications when applied to solve a complex multi-dimensional problem. Finally, the DiviK framework could easily be extended to support other kinds of biological big data through simple adaptation of the filters.

Availability and requirements

Project name: DiviK
Project home page: https://github.com/gmrukwa/divik/
Operating system(s): Linux, Windows, Mac
Programming language: Python (≥ […])

List of abbreviations

CLI: Command-Line Interface; DiviK: Divisive intelligent K-Means; GMM: Gaussian Mixture Modeling; HDDC: high dimensional data clustering; MSI: Mass Spectrometry Imaging; OSCC: Oral Squamous Cell Carcinoma; PCA: Principal Components Analysis; ToF: Time-of-Flight; UMAP: Uniform Manifold Approximation and Projection.

Declarations

Ethics approval and consent to participate
Not applicable.

Consent for publication
Not applicable.

Availability of data and materials
All the data is available from their original sources.

Competing interests
The authors declare that they have no competing interests.

Funding
This project was financially supported by NCBiR grant AIDA no. I029/17-POWR.03.02.00-IP.08-00-DOK/17 (GM) and NCN grant BITIMS no. UMO-2015/19/B/ST6/01736 (GM, JP).

Authors' contributions
JP and GM designed the DiviK core algorithm. GM implemented the scripts, performed the tests, and developed the package for PyPI submission. JP and GM wrote the manuscript. JP critically revised the work. All authors read and approved the final manuscript.

Acknowledgements
We would like to thank Piotr Widłak, Monika Pietrowska, Marta Gawin and Mykola Chekan from Maria Skłodowska-Curie National Research Institute of Oncology, Gliwice branch, for providing the necessary biological and MSI data acquisition background. Finally, we would like to thank Katarzyna Bednarczyk for help in data preprocessing.

Author details

Silesian University of Technology, Department of Data Mining, Akademicka 2A, 44-100 Gliwice, Poland. Netguru, Małe Garbary 9, 61-756 Poznań, Poland.

References

1. Aichler M, Walch A. MALDI Imaging mass spectrometry: current frontiers and perspectives in pathology research and practice. Laboratory investigation. 2015;95(4):422–431.
2. Miura D, Fujimura Y, Yamato M, Hyodo F, Utsumi H, Tachibana H, et al. Ultrahighly sensitive in situ metabolomic imaging for visualizing spatiotemporal metabolic behaviors. Analytical chemistry. 2010;82(23):9789–9796.
3. Hattori K, Kajimura M, Hishiki T, Nakanishi T, Kubo A, Nagahata Y, et al. Paradoxical ATP elevation in ischemic penumbra revealed by quantitative imaging mass spectrometry. Mary Ann Liebert, Inc. 140 Huguenot Street, 3rd Floor New Rochelle, NY 10801 USA; 2010.
4. Djidja MC, Claude E, Snel MF, Francese S, Scriven P, Carolan V, et al. Novel molecular tumour classification using MALDI–mass spectrometry imaging of tissue micro-array. Analytical and bioanalytical chemistry. 2010;397(2):587–601.
5. Morita Y, Ikegami K, Goto-Inoue N, Hayasaka T, Zaima N, Tanaka H, et al. Imaging mass spectrometry of gastric carcinoma in formalin-fixed paraffin-embedded tissue microarray. Cancer science. 2010;101(1):267–273.
6. Groseclose MR, Massion PP, Chaurand P, Caprioli RM. High-throughput proteomic analysis of formalin-fixed paraffin-embedded tissue microarrays using MALDI imaging mass spectrometry. Proteomics. 2008;8(18):3715–3724.
7. Quaas A, Bahar AS, von Loga K, Seddiqi AS, Singer JM, Omidi M, et al. MALDI imaging on large-scale tissue microarrays identifies molecular features associated with tumour phenotype in oesophageal cancer. Histopathology. 2013;63(4):455–462.
8. Steurer S, Borkowski C, Odinga S, Buchholz M, Koop C, Huland H, et al. MALDI mass spectrometric imaging based identification of clinically relevant signals in prostate cancer using large-scale tissue microarrays. International journal of cancer. 2013;133(4):920–928.
9. Pietrowska M, Diehl HC, Mrukwa G, Kalinowska-Herok M, Gawin M, Chekan M, et al. Molecular profiles of thyroid cancer subtypes: Classification based on features of tissue revealed by mass spectrometry imaging. Biochimica et Biophysica Acta (BBA)-Proteins and Proteomics. 2017;1865(7):837–845.
10. Palmer A, Phapale P, Chernyavsky I, Lavigne R, Fay D, Tarasov A, et al. FDR-controlled metabolite annotation for high-resolution imaging mass spectrometry. Nature methods. 2017;14(1):57–60.
11. METASPACE annotation platform: datasets summary. Accessed: 2020-06-14. https://metaspace2020.eu/datasets/summary.
12. Polanski A, Marczyk M, Pietrowska M, Widlak P, Polanska J. Signal partitioning algorithm for highly efficient Gaussian mixture modeling in mass spectrometry. PloS one. 2015;10(7).
13. Jones EA, van Remoortere A, van Zeijl RJ, Hogendoorn PC, Bovée JV, Deelder AM, et al. Multiple statistical analysis techniques corroborate intratumor heterogeneity in imaging mass spectrometry datasets of myxofibrosarcoma. PloS one. 2011;6(9):e24913.
14. Thomas SA, Race AM, Steven RT, Gilmore IS, Bunch J. Dimensionality reduction of mass spectrometry imaging data using autoencoders. In: 2016 IEEE Symposium Series on Computational Intelligence (SSCI). IEEE; 2016. p. 1–7.
15. Veselkov KA, Mirnezami R, Strittmatter N, Goldin RD, Kinross J, Speller AV, et al. Chemo-informatic strategy for imaging mass spectrometry-based hyperspectral profiling of lipid signatures in colorectal cancer. Proceedings of the National Academy of Sciences. 2014;111(3):1216–1221.

16. Postma E, van den Herik H, van der Maaten L. Dimensionality reduction: a comparative review. Journal of Machine Learning Research. 2009;10(1–41):66–71.
17. Dunn JC. Well-separated clusters and optimal fuzzy partitions. Journal of cybernetics. 1974;4(1):95–104.
18. Hubert L, Arabie P. Comparing partitions. Journal of classification. 1985;2(1):193–218.
19. Lipor J, Balzano L. Clustering quality metrics for subspace clustering. Pattern Recognition. 2020; p. 107328.
20. Rodriguez MZ, Comin CH, Casanova D, Bruno OM, Amancio DR, Costa LdF, et al. Clustering algorithms: A comparative approach. PloS one. 2019;14(1):e0210236.
21. Deininger SO, Ebert MP, Futterer A, Gerhard M, Rocken C. MALDI imaging combined with hierarchical clustering as a new tool for the interpretation of complex human cancers. Journal of proteome research. 2008;7(12):5230–5236.
22. Bouveyron C, Girard S, Schmid C. High-dimensional data clustering. Computational Statistics & Data Analysis. 2007;52(1):502–519.
23. Alexandrov T, Becker M, Deininger SO, Ernst G, Wehder L, Grasmair M, et al. Spatial segmentation of imaging mass spectrometry data with edge-preserving image denoising and clustering. Journal of proteome research. 2010;9(12):6535–6546.
24. Alexandrov T, Kobarg JH. Efficient spatial segmentation of large imaging mass spectrometry datasets with spatially aware clustering. Bioinformatics. 2011;27(13):i230–i238.
25. Wijetunge CD, Saeed I, Boughton BA, Spraggins JM, Caprioli RM, Bacic A, et al. EXIMS: an improved data analysis pipeline based on a new peak picking method for EXploring Imaging Mass Spectrometry data. Bioinformatics. 2015;31(19):3198–3206.
26. Dexter A, Race AM, Steven RT, Barnes JR, Hulme H, Goodwin RJ, et al. Two-phase and graph-based clustering methods for accurate and efficient segmentation of large mass spectrometry images. Analytical chemistry. 2017;89(21):11293–11300.
27. Abdelmoula WM, Pezzotti N, Hölt T, Dijkstra J, Vilanova A, McDonnell LA, et al. Interactive visual exploration of 3D mass spectrometry imaging data using hierarchical stochastic neighbor embedding reveals spatiomolecular structures at full data resolution. Journal of proteome research. 2018;17(3):1054–1064.
28. McInnes L, Healy J, Melville J. Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426. 2018.
29. Smets T, Verbeeck N, Claesen M, Asperger A, Griffioen G, Tousseyn T, et al. Evaluation of distance metrics and spatial autocorrelation in Uniform Manifold Approximation and Projection applied to Mass Spectrometry Imaging data. Analytical chemistry. 2019.
30. Marczyk M, Jaksik R, Polanski A, Polanska J. Adaptive filtering of microarray gene expression data based on Gaussian mixture decomposition. BMC bioinformatics. 2013;14(1):101.
31. Polanski A, Marczyk M, Pietrowska M, Widlak P, Polanska J. Initializing the EM algorithm for univariate Gaussian, multi-component, heteroscedastic mixture models by dynamic programming partitions. International Journal of Computational Methods. 2018;15(03):1850012.
32. Tibshirani R, Walther G, Hastie T. Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2001;63(2):411–423.
33. Widlak P, Mrukwa G, Kalinowska M, Pietrowska M, Chekan M, Wierzgon J, et al. Detection of molecular signatures of oral squamous cell carcinoma and normal epithelium–application of a novel methodology for unsupervised segmentation of imaging mass spectrometry data. Proteomics. 2016;16(11-12):1613–1621.
34. Buitinck L, Louppe G, Blondel M, Pedregosa F, Mueller A, Grisel O, et al. API design for machine learning software: experiences from the scikit-learn project. In: ECML PKDD Workshop: Languages for Data Mining and Machine Learning; 2013. p. 108–122.
35. Satopaa V, Albrecht J, Irwin D, Raghavan B. Finding a "kneedle" in a haystack: Detecting knee points in system behavior. In: 2011 31st international conference on distributed computing systems workshops. IEEE; 2011. p. 166–171.
36. Mourafiq M. Polyaxon: Cloud native machine learning automation platform; 2017. Web page. Available from: https://github.com/polyaxon/polyaxon