Freecyto: Quantized Flow Cytometry Analysis for the Web

Abstract

Flow cytometry (FCM) is an analytic technique that is capable of detecting and recording the emission of fluorescence and light scattering of cells or particles (that are collectively called "events") in a population. A typical FCM experiment can produce a large array of data making the analysis computationally intensive. Current FCM data analysis platforms (FlowJo, etc.), while very useful, do not allow interactive data processing online due to the data size limitations. Here we report a more effective way to analyze FCM data. Freecyto is a free, easy-to-learn, Python-flask-based web application that uses a weighted k-means clustering algorithm to facilitate the interactive analysis of flow cytometry data. A key limitation of web browsers is their inability to interactively display large amounts of data. Freecyto addresses this bottleneck through the use of the k-means algorithm to quantize the data, allowing the user to access a representative set of data points for interactive visualization of complex datasets. Moreover, Freecyto enables the interactive analyses of large complex datasets while preserving the standard FCM visualization features, such as the generation of scatterplots (dotplots), histograms, heatmaps, boxplots, as well as a SQL-based sub-population gating feature. We also show that Freecyto can be applied to the analysis of various experimental setups that frequently require the use of FCM. Finally, we demonstrate that the data accuracy is preserved when Freecyto is compared to conventional FCM software.


Nathan Wong, Daehwan Kim, Zachery Robinson, Connie Huang, and Irina M. Conboy

Department of Bioengineering and QB3, UC Berkeley, Berkeley, CA 94720, USA

arXiv [q-bio.QM]

Keywords

Flow cytometry, Big data analysis, Web application, Machine learning, Unsupervised learning, Data quantization, Software development

Introduction

Flow cytometry is broadly used in biomedicine, as exemplified by the identification of protein marker expression, determination of cell fate and cell cycle progression, analysis of pathology-caused changes (e.g. cancer-promoted, immune-skewing, etc.), testing the therapeutic efficacy of a treatment, and, more recently, gene-editing detection workflows. A common experimental setup in biomedicine relies on being able to identify specific changes between a control and an experimental cell population. The changes between control and experimental cohorts are often determined through fluorescently tagged antibodies that are specific for given proteins; the fluorescence is examined by microscopy and/or high-throughput screening using a flow cytometer.

Successful FCM experiments rely on the accuracy and resolution of the data analysis, i.e. the performance of the FCM software that provides quantitative outputs for large numbers of events. In FCM analysis, an event is constituted by the cytometer's detection of fluorescence emission and/or light scatter signals from a single cell or particle that passes through the microfluidic flow chamber. With thousands of these events, individual measures of fluorescence, size and granularity are produced, and to add complexity, these measurements can be deliberately modified by a researcher through the instrument setup, which can be changed from run to run. FCM analysis thus becomes a computational and statistical challenge that produces meaningful data only if the analysis is adequate for the experimental complexity. Inherent in this requirement, the datasets that are produced with the conventional FCM software (FlowJo, Cytobank, OpenCyto, and Webflow) are typically quite large, which complicates their interactive web analyses.

In this work we developed new FCM software that facilitates FCM data analysis while maintaining the accuracy and resolution of the data. In fact, analysis of flow cytometry experiments, despite having tens of thousands of data points, can be performed and visualized on a mobile device. Importantly, while simplifying the data analysis and offering an intuitive workflow, Freecyto preserves the key features of traditional FCM software, such as scatterplots (dotplots) of two different emissions, histograms of a fluorescent emission measurement, the side-by-side comparison of the results between the control and experimental populations, and gating on sub-populations of cells.

Similarly to FlowJo, Cytobank, OpenCyto, and Webflow, Freecyto supports machine learning applications, but it does not require the installation of specific software packages (often OS-dependent), a detailed understanding of the software workflow, or extra layers of complexity in displaying, interacting with, and sharing the FCM analysis with other researchers. Additional features of Freecyto are robust data management and data sharing: Freecyto is built on a secure centralized database management system, allowing data to be stored remotely and analyses to be shared and edited by anyone, while maintaining the safeguard of proper permissions. Notably, the decisions on instrument settings (such as changing the gain and signal intensity) and experimental set-ups (for instance, additional runs of certain cohorts) become better informed, based on real-time, user-friendly data analysis.

A key feature of Freecyto is the k-means clustering algorithm, in which data points are clustered together into k clusters based on a Euclidean distance metric. This use of the k-means algorithm as a method of data quantization is distinct from flow cytometry studies that use clustering algorithms to analyze the data. Freecyto, in contrast, uses k-means to create a reduced, representative dataset of the original, so that the user has much greater capability in analyzing the data, such as applying the stated clustering algorithms to the data. The original data is then reduced to the centers of the clusters, allowing the user to gate interactively on these centers. We show that FCM data analysis remains faithful when Freecyto is compared to the conventional FlowJo software.

By focusing and quantizing the data, Freecyto offers better control over the analysis of FCM experiments, increasing the computational feasibility of analyzing any dataset, and particularly very large ones. Because of the high-dimensional nature of flow cytometry data and the technological developments in flow cytometers, which have pushed the number of parameters and the sheer volume of data ever higher, there is a greater need for FCM software to handle increasingly large data sets. Freecyto was developed to address this challenge.

Results

K-means Quantization

While the quick visualization capabilities are sufficient for most basic flow cytometry operations, a more detailed study may require additional specialized functions, such as sub-population gating and quadrant (coordinate-system) gating. Having datasets on the magnitude of 10 or 10 events presents a significant challenge to plotting these interactively on the web. In the case of gating, having tens of thousands of points that users can lasso-select on the web is virtually impossible for personal computers and standard web browsers. Freecyto solves this problem by introducing a k-means clustering algorithm for quantizing the input data (Figure 1).

First, after running the k-means clustering algorithm, the centroids are used to construct a Voronoi diagram. Thus, the original dataset is partitioned into Voronoi cells, and each cell contains all the original points that belong to that cluster. Next, for each Voronoi cell, the variance is computed, with the centroid used as the mean of the geometric space. Finally, the within-cluster variance is plotted as a colormap within the Voronoi diagram to portray which cells contain more of the underlying variance, and the variance is summed across all Voronoi cells to reveal the elbow beyond which adding clusters yields little further reduction in within-cluster variance relative to the added computational cost.

K-means clustering (implemented with Lloyd's algorithm, with clusters initialized by k-means++ using a default seed) is an unsupervised machine-learning algorithm that is used to identify clusters of points based on each point's distance from the center of a proposed cluster. Freecyto runs this algorithm on the user-selected channels, identifying a pre-defined number of clusters and storing only the centers of these clusters. The number of clusters is either user-selected (if running locally) or approximated automatically within a range between 250 and 5000 based on the size of the dataset.
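
As an illustration of the quantization step described above, the following is a minimal sketch and not Freecyto's actual code. It assumes scikit-learn's KMeans (k-means++ initialization with a fixed seed) and synthetic two-channel data; the function name quantize and the per-cluster event counts it returns are hypothetical choices made for this example. The sketch keeps only the cluster centers, counts the events that each center represents, and computes the within-cluster variance (MSE) of each cluster.

```python
import numpy as np
from sklearn.cluster import KMeans


def quantize(events, k, seed=0):
    """Reduce an (n_events, n_channels) array to k cluster centers.

    Returns the centers, the number of events assigned to each center,
    and the within-cluster variance (MSE) of each cluster.
    """
    km = KMeans(n_clusters=k, init="k-means++", n_init=1, random_state=seed)
    labels = km.fit_predict(events)
    centers = km.cluster_centers_
    # Number of original events represented by each center.
    counts = np.bincount(labels, minlength=k)
    # Squared Euclidean distance of every event to its own center.
    sq_dist = ((events - centers[labels]) ** 2).sum(axis=1)
    # Mean squared distance per cluster (guarding against empty clusters).
    mse = np.bincount(labels, weights=sq_dist, minlength=k) / np.maximum(counts, 1)
    return centers, counts, mse


rng = np.random.default_rng(0)
# Synthetic bimodal data standing in for two transformed FCM channels.
events = np.vstack([
    rng.normal(loc=[0.2, 0.3], scale=0.05, size=(1000, 2)),
    rng.normal(loc=[0.7, 0.6], scale=0.05, size=(1000, 2)),
])

for k in (5, 20, 80):
    centers, counts, mse = quantize(events, k)
    # Average within-cluster variance, the quantity used for an elbow plot.
    print(k, centers.shape, float(mse.mean()))
```

Keeping the event counts alongside the centers is a design choice of this sketch: it lets later steps weight each center by the number of events it stands for, for example when reporting the percentage of cells inside a gate.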
Choosing k in this way simplifies the conventional k-means clustering approach and enables future development of more suitable algorithms to determine k. Freecyto's application of k-means clustering quantization vastly reduces the complexity of the flow cytometry data, without significant loss of the variability within the original dataset, as we will show in the next section. The reduced dataset that is generated is highly suitable for downstream statistical analysis, such as hierarchical clustering or dimensionality reduction to identify sub-populations of cells (Supplemental Figure 5).

Figure 1. K-means Workflow in Freecyto. (A) The process by which the original dataset is quantized, and how manual gating works on a shared data source. (B) The principles behind k-means quantization, and the Voronoi diagram computed from the reduced dataset projected on the original dataset.

Figure 2.

K-means Within-Cluster Variance Visualization of Synthetic Datasets. (A) Original spiral data (N=5000). (B) Cluster centers with Voronoi cells outlined. (C) Within-cluster variance of each Voronoi cell with increasing k, and by extension, the MSE in each cluster identified by k-means. (D) Trend of increasing clusters and the average within-cluster variance of each cluster. (E) Original bimodal data (N=10000). (F, G, H) Cluster centers and variance loss in each Voronoi cell with increasing k.

Fidelity of Data Quantization in Interactive Analysis

To quantitatively examine the quality of our reduced data set, we compute the mean-squared error (MSE) of each cluster. For the k-means algorithm, this is equivalent to computing the within-cluster variance of each cluster, because the predicted cluster center is the mean of all points in that cluster. The MSE of each cluster, as visualized by Voronoi cells, is then mapped to a color range to depict how faithfully each cluster center captures the other points in that cluster. Figures 2C and 2G show that the MSE of each cluster decreases with increasing k. Finally, the average of the MSE over all clusters is computed (Figures 2D and 2H) to show that the data lost in each cluster center decreases rapidly in exchange for small increases in the number of clusters chosen.

The quantized data can then be plotted interactively through Bokeh on a webpage and downloaded as a SQL database within the web application. In this interactive analysis portion, each flow cytometry data file is treated as a shared data source; thus, in Freecyto the user can lasso-select a sub-population of cells that are displayed in a scatterplot graph or a fluorescence channel and observe the quantized data for that sub-population of cells in the other FCM channel(s). This Freecyto feature allows the user to determine quickly and with more precision how the size of the cells or a signal for a specific marker (a cell-fate protein, for example) is related to other markers (transgene expression, for instance) for each cell in the studied population. Demo: (3:07 – 6:20)

One key question is whether our method of k-means clustering qualitatively maintains the accuracy and resolution of the data. To address this, we compared Freecyto side-by-side with the conventional FCM software FlowJo in the analysis of GFP-positive cells in a population and in studying cells in early and late stages of apoptosis (e.g. Annexin V and 7-AAD co-stain). Here we used Freecyto's coordinate-system gating, a common FCM feature, to identify the percentage of cells located within certain thresholds. As shown in Figures 3 and 4, Freecyto was as accurate as FlowJo in the resolution of these data sets, while preserving the key features of FCM software, such as allowing the user to specify fluorescence thresholds and to visualize and quantify the percentage of cells located in these quadrants (Figures 3, 4).

Moreover, the quantized data points generated by Freecyto are stored in an SQLite database, which is essential to the deep gating tool. The deep gating tool allows the user to lasso-select a sub-population of cells and graphically display only the gated cells for all advanced analysis operations. This is useful in narrowing the analysis to specific sub-populations, as well as identifying outliers in the dataset. This deep-gating function can be applied as many times as needed, and all deep gates can be reset by pressing the reset-gating button, after which the visualization and quantification of the results will reflect the original, unaltered dataset (Figures 3, 4). Both the results of the k-means quantization and the sub-populations identified from manual gating can be downloaded directly in the application.

To comparatively analyze the accuracy and capabilities of Freecyto and FlowJo, WT and GFP+ cells were mixed at five different WT:GFP+ ratios (100:0, 75:25, 50:50, 25:75, and 0:100) and run on a Guava Easycyte flow cytometer (Millipore-Sigma). The data were analyzed by FlowJo and Freecyto in parallel. As expected, the number of GFP-positive cells increased linearly from 100:0 WT:GFP+ to 0:100 WT:GFP+, which was accurately detected by both FlowJo and Freecyto.

To compare Freecyto and FlowJo in another assay commonly analyzed by flow cytometry, cell apoptosis, IMR90 human fibroblasts were treated (or not) with hydrogen peroxide (H2O2) at 200 µM for 24 h to induce apoptosis. The cells were assayed with Annexin V and 7-AAD and run on the Guava Easycyte flow cytometer (Millipore-Sigma). The results were analysed with Freecyto, yielding accurate and visually clear data. The negative control, isotype-matched IgG fluorescence, was used to set up the quadrants (Figure 4A). Early apoptotic cells positive for Annexin V can be seen in the top left quadrant and late apoptotic cells positive for both Annexin V and 7-AAD in the top right quadrant. As expected, Freecyto shows the number of Annexin V positive cells (Figure 4B). The number of cells in early and late stages of apoptosis increased with H2O2 treatment, as compared to the untreated control (Figure 4C). In summary, the analysis of apoptosis (Annexin V and 7-AAD assay) yields the predicted results and is as accurate and sensitive with Freecyto as it is with FlowJo.

Web (Uwsgi-flask-nginx) application to allow platform-agnostic, mobile-ready access to flow cytometry analysis

Several core technologies are deeply integrated into Freecyto in order to allow seamless processing and visualization of flow cytometry data. Chiefly, the integration of these technologies allows for robust storage of user data, high-throughput handling of the data (e.g. processing operations), and interactivity of the data visualizations.

Computationally expensive operations in flow cytometry, including reading and parsing data, performing visualizations, and obtaining sample statistics, are all performed server-side in Freecyto. Freecyto is hosted as a Python-flask-uwsgi-nginx application on a Digital Ocean server.

Figure 3.

Analysis of GFP positive and negative cell populations. (A) (B) The same 50:50 GFP transgenic cell ratios with the coordinates gated by FlowJo. (C) Compares Freecyto and FlowJo measurements of GFP+ cells for 100:0, 75:25, 50:50, 25:75, and 0:100 ratios. (D) Density plot created by Freecyto which outlines the density of cells after the k-means quantization is performed with 250 clusters. (E) MSE of each cluster with varying values of k. (F) The resulting density plot with varying values of k.

Figure 4.

Analysis of Apoptosis. IMR90 cells were treated with hydrogen peroxide (H2O2) at 200 µM for 24 h to induce apoptosis. The cells were then stained with Annexin V and 7-AAD. Early apoptotic cells are positive for Annexin V and are seen in the top left quadrant (Q1), and late apoptotic cells, which are positive for both Annexin V and 7-AAD, are seen in the top right quadrant (Q2). Live cells are negative for both stains (Q4). (A) Negative control: isotype-matched IgG staining (1st antibody) + secondary (FITC). (B) Untreated group. (C) H2O2 treatment group.

Figure 5.

Freecyto Application Workflow. Flowchart lanes: User Functions; Authorization and Caching; Callable API Server. Flowchart steps: user login or account creation; authorization check (Google Firebase API) and saved data repository; upload new flow cytometry experiments; query and update saved experimental analysis; query server to analyze, store, and expose data for quick visualization tasks, such as t-SNE, KDE plots, and histograms, for the selected channels (Python code to normalize data, compute statistics, generate static visualization images, and export raw data to an Excel file); store analysis; view and download completed quick visualizations and raw data; change fluorescence channels for analysis; run 'advanced analysis' for interactive analysis, and download the quantized dataset and gated subpopulations; query server to perform input data quantization and relevant downstream analysis for the selected channels (Python code to perform k-means clustering for data quantization, create interactive HTML documents for gating analysis, and export quantized and gated data to an SQLite database); cache busting by appending a random hash to repeated documents; view and share previously performed analysis.

While most flow cytometry tools have unique requirements depending on the user's operating system (OS), application dependencies (e.g. a specific version of Python packages), or computational resources (e.g. four CPU cores), Freecyto can be accessed without platform restrictions and dependencies. This application is also designed to be mobile-compatible, allowing users to access their flow cytometry analysis and also perform new flow cytometry analysis directly on their mobile devices (Figure 5).

In addition, Freecyto can be downloaded as a Flask application (open-source), so that users can install the appropriate dependencies and run the application on a local intranet (useful if users desire stricter control of flow cytometry data privacy). This also allows for greater control over default parameters and application modules, such as changing the number of reduced data points used in interactive analysis and implementing a clustering model on top of the reduced data set (Figure 5).

Demo: (0:00 – 1:00)

Parallel processing (multiprocessing) of computationally intensive analysis functions

Freecyto integrates advances in multiprocessing functionality in order to speed up traditionally expensive FCM data analysis operations. Multiprocessing is implemented when users upload multiple files, when visualizations are performed, and when the k-means algorithm is running. These operations are performed asynchronously on the server side, reducing the time it takes for the user to receive analysis outputs from their data by an order of magnitude. Through this multiprocessing, a side-by-side upload of more than five files becomes possible (Supplemental Figure 3).

User data management and authentication

Google Firestore/Datastore is integrated to store references to previously performed visualization operations. For example, the images that are generated from an experimental upload are stored in a unique directory on the server, and the references to the generated images are stored in a collection as a unique entry under the user account in Google Firestore. This prevents redundant analysis operations (e.g. when the user uploads the same experimental files) while still allowing the user to access the previously performed operation. A sortable table of previously performed experiments (50 most recent) is listed on the user home page, allowing the user to easily access previously analysed flow cytometry results.

Firebase and Google identity platform: Google and Email logins are enabled, allowing the user to create and access their user account with these authentication methods. This prevents unauthorized usage of the application, requiring the user to create an account before accessing the analysis toolkit. To promote scientific knowledge and collaborations, sharing the results of a flow cytometry experiment on Freecyto merely requires sharing the URL of the experiment. Demo: (1:00 – 1:30)

Side-by-side experiment comparisons (multiple file upload)

Freecyto supports user upload of multiple flow cytometry files as a result of the multiprocessing pipeline. For normalization of the raw input files, the user may select hyperlog, logicle, or no transformation to be applied. Logicle and hyperlog transformations normalize the flow cytometry data by transforming most events (including negatively measured values) to a normalized fluorescence value between 0 and 1. This improves on traditional free flow cytometry analysis applications, which limit the user to uploading only a single flow cytometry file at a time, though many flow cytometry experiments have anywhere from 2 to 10+ files to analyse. Freecyto's approach allows the user to upload numerous files concurrently, enabling plots to be overlaid for easy and clear visual comparison between the datasets. If overlays make it harder to discern the individual plots, individual files can also be graphed and visualized separately. Demo: (1:30 – 2:00)

Quick visualization capabilities

Freecyto is built on the principle that FCM analysis should be easy to perform and that real-time data processing expands research capabilities by allowing FCM experiments to be adjusted promptly and accurately. Freecyto's pipeline achieves this through quick visualization of scatterplots, density-estimation plots, histograms, box-whisker diagrams, and correlation tables, which are generated by Freecyto based on the selected fluorescence channels. In addition, t-SNE plots allow users to visualize segregating features of the data. The images and relevant statistics are displayed through a carousel slider (Siema) and a table, respectively.

It is integral to flow cytometry analysis to allow users to select the fluorescence channels they wish to visualize. Freecyto accomplishes this with a simple checkbox list of all possible channels. The user selects the channels they wish to visualize and presses "submit," and the images automatically update to match the selected fluorescence channels. This pipeline is designed to be minimalistic: it allows the user to quickly determine how their data looks, offering enough modularity to facilitate the most common flow cytometry analysis operations. In addition, the converted flow cytometry data can be downloaded as an Excel spreadsheet. Demo: (2:00 – 3:07)
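
The t-SNE projection mentioned in this section can be sketched as follows. This is an illustrative example and not Freecyto's own code: it assumes scikit-learn's TSNE with the perplexity of 40 stated in Materials and Methods, and it runs on synthetic four-channel data in place of transformed FCM events.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Synthetic stand-in for two cell populations measured on four channels.
data = np.vstack([
    rng.normal(loc=0.2, scale=0.05, size=(150, 4)),
    rng.normal(loc=0.7, scale=0.05, size=(150, 4)),
])

# Project the four-channel data to two dimensions for plotting.
# The number of samples must exceed the perplexity.
tsne = TSNE(n_components=2, perplexity=40, random_state=0)
embedding = tsne.fit_transform(data)

print(embedding.shape)
```

Each row of the embedding is a two-dimensional coordinate for one input row, so it can be passed directly to a scatterplot.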

Discussion

Freecyto was developed as a new data processing software for flow cytometry data and validated for enhancing the speed, convenience, and machine learning capacity of FCM data analysis while preserving accuracy. These features were validated in key FCM set-ups: studying sub-populations with variable expression of a transgene, and viability-apoptosis studies. In summary, our weighted k-means clustering algorithm turns FCM data analysis into a simple, easy-to-use online platform.

Freecyto offers all the necessary features to perform typical FCM analyses, in addition to providing the user with interactive analysis of the data, and it fills a niche when compared with other FCM software (Table 1). Freecyto is a flexible platform that allows modifications. For example, Opencyto allows users to create automated gating pipelines in R, which may address the subjectivity and time-consuming nature of manual gating, and such a feature could readily be built on top of Freecyto's existing framework. Freecyto does not innovate the existing flow cytometry analysis; instead, it innovates the approach to such analyses, thereby improving the ease and accessibility of FCM data analysis, while also providing greater flexibility and control in gating large datasets through quantization of the data with a weighted k-means clustering algorithm.

Feature | Freecyto | Opencyto | Cytobank | FlowJo
What is it? | Python web application | R software package | Cloud-based web server | Software package (OS dependent)
Summary | K-means algorithm allows interactive gating between any combination of channels in side-by-side graphs | Pipeline for automated gating algorithms (as opposed to manual gating) | Specialized service that uses many different tools, e.g. Citrus, to perform FCM analyses | Automation of repeated analyses, customizable data visualizations
Free-to-use | Yes | Yes | No | No
Requires software download | No | Yes | No | Yes
Straightforward data analysis sharing | Yes | No | Yes | No
Beginner-friendly | Yes | No | No | No
Mobile-compatible | Yes | No | No | No

Table 1.

Comparing Freecyto with other flow cytometry applications.

Conclusions

FCM analysis is essential for a broad range of biomedical studies, many of which are directly and critically important for human health. Freecyto allows for streamlined, fast, user-friendly and easy-to-share analysis of multiple FCM experiments in parallel, harnessing the reach and ease of use of the web to power and serve its analytical platform. Whereas many FCM analysis packages are expensive, require software/OS dependencies, or have a significant learning curve, Freecyto is free, web-based, and easy to use, and while simplifying FCM studies, Freecyto improves the processing of high-volume data and facilitates real-time data analysis.

As flow cytometry development continues to improve, the need for indexing and manipulating large quantities of scientific data cannot be overstated. Freecyto integrates state-of-the-art data storing and indexing features with Google Cloud, creating an interface for users to have greater confidence and connectivity with their flow cytometry data. In this regard, our k-means quantization approach might be useful and important not only in FCM but more broadly for Big Data analysis in omics, medical data for machine learning and AI, computer vision, environmental engineering, and other large-data domains.

Materials and Methods

Data Visualization

Several Python packages were used in creating this application. Flask was used to serve the web application. Google Identity (Firebase) was used to authenticate users, and Google DataStore was used to store references to previously performed experiments. Pandas, NumPy, FlowUtils, and Cytoflow were used to dynamically store and transform the raw flow cytometry data. Matplotlib, Seaborn, and Pandas were used to generate images of scatterplots, box-plots, heatmaps, and histograms. The t-distributed stochastic neighbour embedding (t-SNE) projection was performed with Scikit-learn (sklearn) with a perplexity of 40. For the interactive analysis, sklearn was used for the weighted k-means clustering. SQLite3 was used to store clustered data. Bokeh and Holoviews were used to display the interactive graphs. HTML5UP and Creative Tim Light Bootstrap Theme inspired the front-end template design of the web application.
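
To make the role of SQLite in storing and gating the clustered data more concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is not Freecyto's schema: the table name centers, the column names annexin_v, aad7 and n_events, the example rows, and the function quadrant_percentages are all hypothetical. Each row is one cluster center together with the number of original events it represents, so the percentage of cells in a quadrant is computed by summing event counts rather than counting rows.

```python
import sqlite3

# Hypothetical quantized dataset: (annexin_v, aad7, n_events) per center.
rows = [
    (0.10, 0.10, 600),  # negative for both stains
    (0.80, 0.10, 200),  # Annexin V positive only
    (0.80, 0.90, 150),  # positive for both stains
    (0.10, 0.90, 50),   # 7-AAD positive only
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE centers (annexin_v REAL, aad7 REAL, n_events INTEGER)")
conn.executemany("INSERT INTO centers VALUES (?, ?, ?)", rows)

# Fixed gate definitions; thresholds are bound as named parameters.
QUADRANTS = {
    "annexin+/7aad-": "annexin_v > :a AND aad7 <= :d",
    "annexin+/7aad+": "annexin_v > :a AND aad7 > :d",
    "annexin-/7aad+": "annexin_v <= :a AND aad7 > :d",
    "annexin-/7aad-": "annexin_v <= :a AND aad7 <= :d",
}


def quadrant_percentages(conn, annexin_thr, aad_thr):
    """Percentage of original events falling in each quadrant."""
    params = {"a": annexin_thr, "d": aad_thr}
    total = conn.execute("SELECT SUM(n_events) FROM centers").fetchone()[0]
    result = {}
    for name, condition in QUADRANTS.items():
        # Conditions come from the fixed table above, never from user input.
        gated = conn.execute(
            f"SELECT COALESCE(SUM(n_events), 0) FROM centers WHERE {condition}",
            params,
        ).fetchone()[0]
        result[name] = 100.0 * gated / total
    return result


percentages = quadrant_percentages(conn, annexin_thr=0.5, aad_thr=0.5)
print(percentages)
```

A sub-population gate works the same way: a WHERE clause selects the centers inside the gate, and the selected rows can be written to a new table for further gating.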

Multiprocessing

Multiprocessing, assuming a multi-core machine, was implemented to speed up the data visualization algorithms. Chiefly, the results of a benchmark test on a quad-core, 8 GB RAM, 2.3 GHz MacBook Pro are reported below for the static image visualizations, and for the interactive data analysis portions.

Weighted K-means Algorithm

Let $X = \{x_1, x_2, \ldots, x_n\}$ such that every $x_i$ has $d$ dimensions. Let $\Omega$ be a diagonal $d \times d$ matrix such that the diagonal entries are the weights of each dimension. $k$ is the number of clusters we want to find. $S$ is the set of all $k$ clusters such that $S = \{S_1, S_2, \ldots, S_k\}$, and $\mu_i$ is the mean of the points in $S_i$. We want to minimize the loss function:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} (x - \mu_i)^T \, \Omega \, (x - \mu_i)$$

In the default case, let the diagonal entries of $\Omega$ be 1 if the corresponding channel was selected for visualization, and 0 otherwise.

Voronoi Diagram Algorithm

Let $X = \{x_1, x_2, \ldots, x_n\}$ such that every $x_i$ has $d$ dimensions. $R$ is the set of all $k$ Voronoi regions such that $R = \{R_1, R_2, \ldots, R_k\}$, and $S$ is the set of all $k$ clusters such that $S = \{S_1, S_2, \ldots, S_k\}$. $d(\cdot, \cdot)$ is a distance metric, for which we used Euclidean distance. We want to find the region such that every point in the region is closest to the set of points described by the k-means clustering.

$$R_k = \{\, x \in X \mid d(x, S_k) \le d(x, S_j) \;\; \forall j \ne k \,\}$$

Or equivalently, because the distance of every point $x$ in $S_k$ to its mean centroid $\mu_k$ has already been minimized in the converged k-means algorithm:

$$\forall x \in S_k:\; d(x, S_k) \le d(x, S_j) \;\; \forall j \ne k \implies R_k = \{\, x \in S_k \,\}$$

Web application (open-source) licenses

• Advanced Analysis: Light bootstrap theme by Creative Tim: MIT License https://github.com/timcreative/freebies/blob/master/LICENSE.md
• Lens by HTML5UP: Creative Commons 3.0 https://html5up.net/license
• NumPy: https://github.com/numpy/numpy/blob/master/LICENSE.txt
• SciPy: https://scipy.org/scipylib/license.html
• Scikit-learn: https://scikit-learn.org/stable/
• Pandas: https://github.com/pandas-dev/pandas/blob/master/LICENSE
• Matplotlib: https://matplotlib.org/users/license.html
• Bokeh: https://github.com/bokeh/bokeh/blob/master/LICENSE.txt
• Holoviews: https://github.com/pyviz/holoviews/blob/master/LICENSE.txt
• Flask: http://flask.pocoo.org/docs/1.0/license/
• SQLAlchemy: https://docs.sqlalchemy.org/en/latest/copyright.html
• Cytoflow: https://github.com/bpteague/cytoflow/blob/master/LICENSE.txt
• FlowUtils: https://github.com/whitews/FlowUtils/blob/master/LICENSE

Myoblast cultures

Transgenic GFP+ and WT (C57.B6) mouse myoblasts were cultured in growth medium (Ham's F10, 20% Bovine Growth Serum and 5 ng/ml bFGF) on 1 µg/cm² Matrigel. Cells were washed and detached with PBS (three 37C), pelleted by centrifugation, and counted using a hemocytometer.

Cell culture and apoptotic assay

Normal human lung fibroblast cells (IMR-90) were obtained from ATCC. When cells were grown to 70% confluence, they were subcultured at dilution for later passaging.

The apoptotic assay of IMR90 was conducted with an Apoptosis Detection Kit (ab214663, Abcam) according to the manufacturer's protocol. Briefly, cells were detached using 0.05% trypsin and washed twice with PBS. Then, samples were resuspended in 1x annexin-binding buffer and incubated with 5 µL Annexin V-FITC and 5 µL 7-amino-actinomycin D (7-AAD) for 15 min at 37°C, avoiding light. Finally, events were acquired with a Guava Easycyte flow cytometer (Millipore-Sigma) and analysed by Freecyto and FlowJo software individually to quantify the distribution of cells.

Abbreviations

FCM : Flow cytometry

Event(s) : Emission(s) of fluorescence and light scattering of cells or particles

t-SNE : Barnes-Hut approximation of t-distributed stochastic neighbour embedding

K-means : Lloyd’s Algorithm with Euclidean distances for k-means clustering (k-means++ is used for cluster center initialization).

MSE : Mean squared error

WT : Wild type

GFP : Green fluorescent protein

IMR-90 : Human lung fibroblast cells

Data Availability

The datasets generated and/or analysed during the current study are available in the Freecyto GitHub repository, https://github.com/nathan2wong/freecyto/tree/master/datasets.

• Project name: Freecyto
• Project homepage: https://freecyto.com
• Demo: https://youtu.be/JlIVgxh4_YA
• Archived version: https://github.com/nathan2wong/freecyto
• Operating system(s): Platform independent
• Programming Language: Python, JavaScript
• Other requirements: Listed on GitHub
• License: BSD3
• Any restrictions to use by non-academics: License Needed

Acknowledgements

We would like to thank Alex Park for providing technical help with these studies, and Michael Conboy for the helpful suggestions on the work and the manuscript.

Funding

This work was supported by NIH R01 EB023776, R01 HL139605 and Open Philanthropy awards to IC, and the funds were used to support the data collection of the study.

Author Information

Affiliations

Department of Bioengineering and QB3, UC Berkeley, Berkeley, CA 94720, USA
Nathan Wong, Daehwan Kim, Zachery Robinson, Connie Huang, and Irina M. Conboy

Contributions

NW created the Freecyto software and wrote the manuscript. ZR provided figures, data, and analyses of the GFP cell experiment (Figure 3). DH provided figures, data, and analyses of the apoptotic cell experiment (Figure 4). CH provided figures, tables (Figure 1A, Table 1), and contributed code for downstream analysis in the Freecyto software. IC co-wrote the manuscript and contributed to design of these studies. All authors read and approved the final manuscript.

Corresponding Author

Correspondence to Nathan Wong ([email protected]) and Irina Conboy ([email protected]).

Ethics Declarations

The authors declare no competing interests.

Additional Information

Necessary Resources

Freecyto is designed to be fully compatible with a standard user setup, and very little setup is required to begin using Freecyto for your flow cytometry needs.

• A web browser with JavaScript enabled (core functions in the interactive analysis portion require JavaScript to be fully functional). Common browsers that satisfy this requirement include Google Chrome and Firefox. Mobile devices that have a mobile web browsing application can also satisfy this requirement.
• A valid Google ID or email address. This allows Freecyto to recognize the user and keep records of previous jobs performed under this user ID.
• A valid internet connection (HTTP, HTTPS) is required to access the online interface of Freecyto.

Walkthrough

To begin, navigate to freecyto.com. Note that several documentation options are available for viewing on the home page. These options include: (1) Detailed, feature-specific documentation, (2) Video run-through of the application, (3) Open-source licenses and attributions, (4) Freecyto’s privacy policy, and (5) Login URL to access the Freecyto application interface. [Supplemental Figure 1]

Next, press “advanced analysis” to access the interactive visualizations of the flow cytometry data. This is an example of the shallow gating feature, in which selecting a sub-population of cells will display that sub-population across all selected fluorescence channels. [Supplemental Figure 2]
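The shallow gating idea described above can be sketched in a few lines of Python. This is an illustrative sketch only, not Freecyto's code: the channel names (`FSC`, `SSC`, `GFP`, `RED`), the event values, the rectangular gate, and the helper functions `shallow_gate` and `channel_means` are all hypothetical. Events falling inside a box drawn on two channels are selected, and the same sub-population is then summarized on the remaining channels.

```python
def shallow_gate(events, x_channel, y_channel, x_range, y_range):
    """Return the events whose (x, y) values fall inside a rectangular gate."""
    x_lo, x_hi = x_range
    y_lo, y_hi = y_range
    return [
        e for e in events
        if x_lo <= e[x_channel] <= x_hi and y_lo <= e[y_channel] <= y_hi
    ]


def channel_means(events, channels):
    """Mean value of each requested channel over the given events."""
    if not events:
        return {}
    return {ch: sum(e[ch] for e in events) / len(events) for ch in channels}


# Hypothetical events: one dict per event, one key per channel.
events = [
    {"FSC": 120, "SSC": 80, "GFP": 900, "RED": 15},
    {"FSC": 130, "SSC": 85, "GFP": 950, "RED": 12},
    {"FSC": 400, "SSC": 300, "GFP": 20, "RED": 700},
    {"FSC": 410, "SSC": 310, "GFP": 25, "RED": 650},
    {"FSC": 125, "SSC": 90, "GFP": 870, "RED": 18},
]

# Gate on the scatter channels, then view the selection on the other channels.
selected = shallow_gate(events, "FSC", "SSC", x_range=(100, 200), y_range=(50, 100))
fraction = len(selected) / len(events)
summary = channel_means(selected, ["GFP", "RED"])
print(len(selected), fraction, summary)
```

Keeping each event as a single record means that a selection made on one pair of channels carries over to every other channel without any re-indexing.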

Supplemental Figures

Supplemental Figure 1. Freecyto Quick Visualization Walkthrough.

(A) Freecyto Homepage.

Navigate to freecyto.com and select login to continue. After clicking Login to access the Freecyto application interface, you need to create a new user account either through Google or email. If you already have an account on Freecyto, log in with those credentials.

(B) Freecyto Login Page.

Create an account using a Google or Email ID. Once you have successfully logged in, you will be able to access your personal user portal. From here, you can see all past analyses that you performed (linked to your user ID). You can also sort and search past saved analyses and access visualizations of those analyses directly and quickly by clicking on the corresponding link.

(C) Freecyto User Portal.

View previously performed analyses and access the page to create a new job. New users will have no previous experiments saved. However, each time the user uploads data or another user shares an experiment, the experiment will be listed in the table of the home page. These experiments can be sorted, indexed, and accessed without needing to repeat previously performed analysis operations. To begin a new job, click “New Job” located in the left column of the dashboard. Next, upload any number of FCS files you wish to analyze.

(D) Freecyto New Job.

Upload new FCS file(s) to begin a new analysis job. After the files have been uploaded, you will be able to access the quick visualizations page, in which the standard histograms, scatterplots, heatmaps, and box-whisker diagrams are displayed in a slideshow (image carousel) format.

(E) Freecyto Quick Visualization.

View histograms, scatterplots, box-whisker diagrams, and heatmaps of the uploaded flow cytometry data. You may also change the fluorescence channels displayed at this time by scrolling to the bottom of the page and selecting the new fluorescence channels to display.

(F) Changing the quick visualization display options.

Supplemental Figure 2. Freecyto Interactive Analysis Walkthrough.

Next, press “advanced analysis” to access the interactive visualizations of the flow cytometry data. This is an example of the shallow gating feature, in which selecting a sub-population of cells will display that sub-population across all selected fluorescence channels.

(A) Freecyto Interactive Shallow Gating.

Shallow gating to see associated fluorescence values of a selected region. Coordinate gating analysis can then be performed to determine the percentage of cells that are located within or outside the bounds of preset x and y values.

(B) Freecyto Interactive Coordinate Gating Display.

Gate flow cytometry experimental files based on specific X and Y values and see the percentage of cells within and outside these regions. Deep gating can also be performed to specifically examine sub-populations of cells.

(C) Freecyto Interactive Deep Gating Display (Before).

(D) Freecyto Interactive Deep Gating Display (After).

Supplemental Figure 3. Multiprocessing vs No Pipeline.

Plots show the time taken to process files when using multiprocessing vs. no multiprocessing for (A) Quick visualization and for (B) Advanced visualization.

Supplemental Figure 4.

Some advantages of the Freecyto analysis include multiple file upload and quick data visualization.

(A) Multiple File Upload.

You can upload multiple files here and customize available settings, such as t-SNE and KDE visualizations with the option of various transformations.

(B) Quick Visualizations.

You now have access to many different visualizations of your uploaded data, including histograms, kernel density plots, and heatmaps.

Supplemental Figure 5.

Downstream analysis of flow cytometry experiments.

(A) Visualizing the local structure of the 50:50 WT/GFP+ experiment.

Ward hierarchical clustering is performed downstream of the k-means quantization on the Spearman correlation matrix of the Green Fluorescence and Side Scatter channels. We find the 2 distinct sub-populations as expected from this experiment.

(B) Dimensionality reduction comparison.

Various dimensionality reduction techniques (PCA, t-SNE, PHATE) were performed on the same downstream data, but with all 15 channels selected as features. As expected, 2 distinct sub-populations were noted in each of these methods.