Clustering COVID-19 Lung Scans

Jacob Householder, Andrew Householder, and John Paul Gomez-Reed
Mentors: Fredrick Park and Shuai Zhang*
Mathematics Department, Whittier College, Whittier, CA 90601
*Qualcomm AI Research

Friday 4th September, 2020
Abstract
With the recent outbreak of COVID-19, creating a means to stop its spread and eventually develop a vaccine are the most important and challenging tasks that the scientific community is facing right now. The first step towards these goals is to correctly identify a patient that is infected with the virus. Our group applied an unsupervised machine learning technique to identify COVID-19 cases. This is an important topic, as COVID-19 is a novel disease currently being studied in detail, and our methodology has the potential to reveal important differences between it and other viral pneumonia. This could, in turn, enable doctors to more confidently help each patient. Our experiments utilize Principal Component Analysis (PCA), t-distributed Stochastic Neighbor Embedding (t-SNE), and the recently developed Robust Continuous Clustering algorithm (RCC). We display the performance of RCC in identifying COVID-19 patients and its ability to compete with other unsupervised algorithms, namely K-Means++ (KM++). Using a COVID-19 Radiography dataset, we found that RCC outperformed KM++; we used the Adjusted Mutual Information Score (AMI) to measure the effectiveness of both algorithms. The AMI for the two- and three-class cases of KM++ were 0.0250 and 0.054, respectively. In comparison, RCC scored 0.5044 in the two-class case and 0.267 in the three-class case, clearly showing RCC as the superior algorithm. This not only opens new possible applications of RCC, but it could potentially aid in the creation of a new tool for COVID-19 identification.
Introduction

The COVID-19 outbreak has redirected the efforts of the scientific community as a whole to help relieve the pressures on medical staff, create new effective treatments, limit the spread of the disease, and, most importantly, create a vaccine to end the pandemic. Our group has focused on the first issue, in trying to aid in the identification of COVID-19 afflicted individuals from non-COVID-19 viral pneumonia or healthy individuals. Using unsupervised machine learning, or clustering techniques, we aim to create a method that can diagnose patients given a CT lung scan. This is an important, novel, and difficult issue because of the nature of clustering algorithms; these algorithms are extremely sensitive and require a great deal of fine-tuning in order to work properly. Clustering is an essential data analysis task where objects are placed into groups based on similarity of features. Popular supervised classification models would likely yield more accurate results, but implementing supervised learning requires labeled data in every application of a given method, a potential limiting factor for novel data [12, 20]. Creating an unsupervised model with high accuracy would be a significant accomplishment because of the greater range of real-world applicability the model would have. This is namely due to the fact that it can process unlabeled data and because of the objective nature of its solution, which simply deciphers data for what it is and not by reference to any labeling [4]. The work we have done focuses on using clustering as a means to identify patterns of pulmonary tissue sequelae in X-ray image visualizations via a 2-D representation of the data. This is an important stipulation as it provides the potential for direct application of our experiment to real world scenarios.
Lastly, the proposed method can be used as an important first step in creating labeled data for supervised classification.

In this work, we utilize the new COVID-19 Radiography Database by Chowdhury et al. [11, 3]. The dataset is composed of X-ray and CT scan images separated into three classes: COVID-19 cases, Viral Pneumonia cases, and Normal lungs. We applied the K-Means++ (KM++) [1] and Robust Continuous Clustering (RCC) [13] algorithms to this database of X-ray images of patients labeled by cause of illness, with the goal of exploring the clustering properties latent in the images. We conducted two tests using this data: the first processed the dataset as it stands, while the second created a binary-class dataset by combining both the Viral Pneumonia and Healthy lung classes into one non-COVID-19 class. The images were isolated from their class labeling during processing, and the classes were only used as a means to label the visualization/graphical representation. As KM++ was found to be one of the top 10 algorithms in unsupervised machine learning [8], we wanted to test whether the claims of A. Shah and V. Koltun were true [13] and see if RCC could compete with KM++ in terms of performance and clustering capabilities. Through our work, we have found that RCC dramatically increases clustering performance on the COVID-19 database. We attribute this performance gain to the connectivity structure that the algorithm relies on [13]. It is able to identify the group of COVID-19 pneumonia cases with high accuracy, suggesting that there is information encoded in the image data set that differentiates cases of COVID-19 from other types of viral pneumonia. Figure 1 showcases lung scans of patients where the top row contains scans of those with COVID-19.
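The two-class test described above amounts to collapsing the Viral Pneumonia and Normal ground-truth labels into a single non-COVID-19 class before scoring. A minimal sketch (the label strings are assumptions based on the dataset description; the labels are never shown to the clustering algorithms themselves):

```python
import numpy as np

def to_binary_labels(labels):
    """Collapse the three-class ground truth into COVID vs. non-COVID.

    Viral Pneumonia and Normal are merged into one non-COVID-19 class;
    only the COVID-19 label is kept distinct.
    """
    labels = np.asarray(labels)
    return np.where(labels == "COVID-19", "COVID-19", "non-COVID-19")

y3 = ["COVID-19", "Normal", "Viral Pneumonia", "COVID-19"]
y2 = to_binary_labels(y3)
# y2 -> ["COVID-19", "non-COVID-19", "non-COVID-19", "COVID-19"]
```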
The bottom row consists of non-COVID-19 cases, meaning they are patients with either normal lungs or viral pneumonia.

The database used is composed of many different recently developed COVID-19 and lung/CT scan image databases of varying quality, resolution, and focus [11]. The authors of the database compiled the images to test supervised learning techniques and published their database on Kaggle to aid other researchers. Despite the varied origins and quality of the images, our findings are nonetheless interesting from an exploratory perspective, as they illuminate that there are inherent differences between COVID-19 and non-COVID-19 lung/CT scans. This has no impact on the optimization of the RCC algorithm, as it is not parameterized by the number of expected clusters; it does, however, have an impact on our performance metrics.

Visualizing high dimensional data is a difficult challenge in many data science problems. Techniques like Principal Component Analysis (PCA) [10, 6] have been used for dimension reduction for quite some time. Recently, a method of mapping higher dimensional data to a lower dimensional representation that preserves neighbor structure, t-Distributed Stochastic Neighbor Embedding (t-SNE), was introduced by van der Maaten and Hinton [16]. The t-SNE method has the ability to exhibit the global arrangement of data, in the sense of clusters, at a varying range of scales. This is extremely useful since high dimensional data, like image data, can be viewed in as few as 2 or 3 dimensions, with similar images clustering in this lower dimensional space. More details and related work on t-SNE can be found in [5, 14, 17, 15].

To evaluate the clustering performance of our experiment, we utilize the Adjusted Mutual Information (AMI) [9, 18, 19] criterion. AMI is useful in this setting as we have access to the ground truth categorization of the data set. This criterion gives us insight into the efficacy of our clustering routines while accounting for random chance.
We will go into more detail later.

The images were PCA reduced to 80 components and then normalized before being fed into the RCC routine. The RCC routine has a connectivity structure built with the Mutual k-Nearest Neighbors algorithm [2], which we set to k = 30. We noticed that image normalization is very important to obtain convergence of the RCC algorithm; it is done in the empirical results of the original paper, although not explicitly mentioned there.

We make use of four main data processing techniques in our experiment: Principal Component Analysis (PCA) for dimensionality reduction, t-Distributed Stochastic Neighbor Embedding (t-SNE) [16] for two-dimensional visualization, and K-Means++ (KM++) along with Robust Continuous Clustering (RCC) as clustering methods. Using these techniques, we are able to process the vector representations of our images in a meaningful manner.

t-SNE is a dimension reduction technique that creates a low dimensional representation of high dimensional data points. This technique is primarily used for visualization in two or three dimensions, as it preserves pairwise distances well while reducing dimension [16]. This is accomplished by computing the Euclidean distances between data points and constructing probability distributions in both the high and low dimensional spaces, such that if two high dimensional data points are close together, it is likely that their low dimensional representations are as well. We start by defining the conditional probabilities as follows:

p_{j|i} = \frac{\exp\left(-\|x_i - x_j\|^2 / 2\sigma_i^2\right)}{\sum_{k \neq i} \exp\left(-\|x_i - x_k\|^2 / 2\sigma_i^2\right)}, \qquad q_{j|i} = \frac{\exp\left(-\|y_i - y_j\|^2\right)}{\sum_{k \neq i} \exp\left(-\|y_i - y_k\|^2\right)}    (1)

where X = \{x_1, x_2, \ldots, x_N\} are the initial data points, Y = \{y_1, y_2, \ldots, y_N\} are their low dimensional counterparts, paired by index, and each \sigma_i is computed using bisection search to best represent the distances of the data points.
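The preprocessing described above (PCA to 80 components, row normalization, and a mutual k-nearest-neighbor graph with k = 30) can be sketched with scikit-learn; this is a minimal illustration, not the authors' code, and the helper names are our own:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import normalize

def preprocess(images, n_components=80):
    """PCA-reduce flattened image vectors, then L2-normalize each row,
    mirroring the preprocessing step described in the text."""
    X = PCA(n_components=n_components).fit_transform(images)
    return normalize(X)

def mutual_knn_edges(X, k=30):
    """Edges (p, q) kept only when p and q each appear in the other's
    k-nearest-neighbor list -- the mutual k-NN graph RCC builds on [2]."""
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nbrs.kneighbors(X)              # idx[:, 0] is the point itself
    neighbor_sets = [set(row[1:]) for row in idx]
    edges = []
    for p, row in enumerate(idx):
        for q in row[1:]:
            if p < q and p in neighbor_sets[q]:   # mutuality check
                edges.append((p, int(q)))
    return edges
```

Note that the mutuality requirement makes the graph sparser than an ordinary k-NN graph: an edge survives only if both endpoints rank each other among their k closest points.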
These conditional probabilities are then symmetrized for each pair of data points in order to obtain the pairwise similarity

p_{ij} = \frac{p_{j|i} + p_{i|j}}{2N}    (2)

in the high dimension, and similarly for the low dimensional probabilities q_{ij}. The algorithm then calls for minimization of the Kullback-Leibler divergence between the two pairwise distributions,

C = KL(P \,\|\, Q) = \sum_{i \neq j} p_{ij} \log\left(\frac{p_{ij}}{q_{ij}}\right),    (3)

using gradient descent. The gradient is as follows:

\frac{\delta C}{\delta y_i} = 4 \sum_{j \neq i} (p_{ij} - q_{ij})\,(y_i - y_j)\left(1 + \|y_i - y_j\|^2\right)^{-1}.

This technique allows for visualization of high dimensional data and gives a visual and spatial intuition into the relationships of the data points and the performance of clustering and classification algorithms.
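In practice, the optimization above is handled by standard implementations rather than hand-written gradient descent. A minimal sketch of producing the 2-D embedding used for visualization, using scikit-learn's `TSNE` on synthetic stand-in data (two well-separated Gaussian blobs playing the role of the PCA-reduced image vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-in for 80-component PCA-reduced scans: two separated blobs.
X = np.vstack([rng.normal(0.0, 1.0, size=(50, 80)),
               rng.normal(8.0, 1.0, size=(50, 80))])

# perplexity controls the effective neighborhood size that the
# bisection search over sigma_i targets.
Y = TSNE(n_components=2, perplexity=30.0, random_state=0).fit_transform(X)
print(Y.shape)  # (100, 2)
```

Each row of `Y` is the 2-D counterpart `y_i` of the input point `x_i`, and a scatter plot of `Y` colored by ground-truth class gives visualizations of the kind shown in Figures 2 and 3.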
RCC is a recently developed clustering technique that evolves a continuous representation of the input data set such that similar data points form tight clusters [13]. The objective function is

C(U) = \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|^2 + \lambda \sum_{(p,q) \in \xi} w_{p,q}\, \rho(\|u_p - u_q\|), \qquad \rho(y) = \frac{\mu y^2}{\mu + y^2},    (4)

where X = [x_1, x_2, \ldots], x_i \in \mathbb{R}^D, and U is the set of representative points, which are initially set to X. Most importantly, \xi is the set of edges connecting data points to one another; it is constructed by mutual k-Nearest Neighbors (m-kNN). Under m-kNN, two points are connected only if each considers the other one of its k nearest points, i.e., the points are "mutual neighbors." This gives RCC a parameter that defines the maximum number of neighbors to look for, and highlights another feature of RCC: the number of clusters does not need to be known ahead of time, setting it greatly apart from KM++. The \rho term is a robust estimator function; in our case the Geman-McClure estimator with scale parameter \mu is used. The term \lambda balances the contributions of the two terms to the objective. Below is another form of the RCC objective, which introduces line-process variables l_{p,q} for the connections formed by m-kNN:

C(U, L) = \frac{1}{2}\sum_{i=1}^{n} \|x_i - u_i\|^2 + \lambda \sum_{(p,q) \in \xi} w_{p,q}\left( l_{p,q}\|u_p - u_q\|^2 + \Psi(l_{p,q}) \right), \qquad \Psi(l_{p,q}) = \mu\left(\sqrt{l_{p,q}} - 1\right)^2.    (5)

RCC can be optimized via alternating minimization. With L fixed, the update of U reduces to the linear least-squares problem

U\left(I + \lambda \sum_{(p,q) \in \xi} w_{p,q}\, l_{p,q}\, (e_p - e_q)(e_p - e_q)^\top\right) = X,    (6)

where e_p denotes the p-th standard basis vector, while the optimal value of l_{p,q} with U fixed is

l_{p,q} = \left(\frac{\mu}{\mu + \|u_p - u_q\|^2}\right)^2.    (7)

Metric - Adjusted Mutual Information
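The two pieces of the alternating minimization that have closed forms, the Geman-McClure penalty in Eq. (4) and the line-process update in Eq. (7), can be sketched directly (a minimal illustration with our own function names, not the authors' implementation):

```python
import numpy as np

def geman_mcclure(y, mu):
    """Geman-McClure robust penalty rho(y) = mu * y^2 / (mu + y^2).
    Grows like y^2 for small y but saturates at mu for large y,
    so distant representative pairs contribute a bounded cost."""
    return mu * y**2 / (mu + y**2)

def l_update(u_p, u_q, mu):
    """Closed-form optimum of the line-process variable l_{p,q}
    for fixed representatives u_p, u_q (Eq. 7)."""
    d2 = float(np.sum((u_p - u_q) ** 2))
    return (mu / (mu + d2)) ** 2

# Coincident representatives keep the edge fully active (l = 1);
# distant ones drive l toward 0, effectively cutting the edge.
print(l_update(np.zeros(2), np.zeros(2), mu=1.0))      # 1.0
print(l_update(np.zeros(2), np.full(2, 10.0), mu=1.0))  # small, ~2.5e-05
```

This behavior of l_{p,q} is what lets RCC sever connections between dissimilar points during optimization, so that clusters emerge as connected components without the number of clusters being specified in advance.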
We calculated the Adjusted Mutual Information Score (AMI) between the dataset's ground truth labeling and the labeling produced by our algorithm to evaluate clustering performance. Mutual Information (MI) is an entropy based measure which quantifies the amount of information given by a random variable in a particular clustering, based on the probability of a particular point lying in any given cluster. The formulation of MI is given by:
MI(U, V) = \sum_{i=1}^{R} \sum_{j=1}^{C} P_{UV}(i, j) \log \frac{P_{UV}(i, j)}{P_U(i)\, P_V(j)}    (8)

where U_i and V_j are clusters in separate partitions of the same set of data, ranging over \{U_1, \ldots, U_R\} and \{V_1, \ldots, V_C\} respectively. The probability of a data point lying in a given cluster is denoted by P_U and P_V, and the joint probability between partitions is labeled P_{UV}. To obtain the AMI, the adjustment of an index for chance proposed by Hubert and Arabie [7] is applied to the formulation of the MI:

\text{Adjusted index} = \frac{\text{index} - \text{expected index}}{\text{max index} - \text{expected index}}.

Using this formula, a number between 0 and 1 is obtained which quantifies the algorithm's effectiveness: the closer the number is to 1, the more effective the algorithm is at clustering the given dataset.
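The chance-adjusted behavior above is available directly in scikit-learn; a minimal sketch on toy labelings shows the two properties we rely on, invariance to label permutation and a score near 0 for uninformative clusterings:

```python
from sklearn.metrics import adjusted_mutual_info_score

truth = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# AMI is invariant to how the clusters are named: only the grouping matters.
perfect = [2, 2, 2, 0, 0, 0, 1, 1, 1]
print(adjusted_mutual_info_score(truth, perfect))  # 1.0

# Lumping everything into one cluster carries no information about truth.
trivial = [0] * 9
print(adjusted_mutual_info_score(truth, trivial))  # 0.0
```

This permutation invariance is essential here, since an unsupervised algorithm has no way of knowing which of its clusters "should" correspond to the COVID-19 class.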
Results

It is clear that the COVID-19 data points naturally cluster in our experiment. This was initially observed in the t-SNE reduction of the true labeling of the dataset in Figure 2 (left), where COVID-19 points are colored red. The set of COVID-19 data points are similar enough to each other to have been grouped together, and different enough from the other cases to have been remapped to the exterior of the set of all considered data points. This suggests that there is enough easily accessible information to cluster our data without complex feature engineering. The first cluster analysis technique to try is therefore KM++, a robust technique that affords an established baseline. The results from KM++, which are discussed in detail below, produced a rather unsuccessful clustering. Next up is RCC, which produced very intriguing results. It was able to successfully identify the COVID-19 cases as separate from the other cases, yet was unable to separate the viral pneumonia from the healthy cases in any significant way. We suspect that this behavior is due to the similarity of COVID-19 images and the connectivity structure that RCC uses. Nonetheless, the result is notable and quite useful in its ability to discern between COVID-19 and non-COVID-19. This is significant as KM++ was unable to discriminate between any of the cases.

Figure 2: 3 class labeling
Figure 3: 2 class labeling

The AMI score for each of the clustering algorithms shows that RCC produces a significantly better clustering in this case: KM++ scored 0.0250 and 0.054 in the two- and three-class cases, while RCC scored 0.5044 and 0.267, respectively.
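The KM++ baseline and its AMI evaluation can be sketched end to end with scikit-learn; here synthetic well-separated blobs stand in for the PCA-reduced scans (an assumption for illustration only; on such easy data KM++ succeeds, unlike on the real images):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_mutual_info_score

rng = np.random.default_rng(0)
# Stand-in for 80-component PCA-reduced scans: three separated blobs.
X = np.vstack([rng.normal(c, 0.5, size=(40, 80)) for c in (0.0, 5.0, 10.0)])
truth = np.repeat([0, 1, 2], 40)

# k-means++ seeding is scikit-learn's default initialization.
pred = KMeans(n_clusters=3, init="k-means++", n_init=10,
              random_state=0).fit_predict(X)
ami = adjusted_mutual_info_score(truth, pred)
print(round(ami, 3))  # ~1.0 on this easy synthetic data
```

Note the contrast with RCC: KM++ must be told `n_clusters` up front, whereas RCC infers the grouping from its mutual k-NN connectivity structure.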
Conclusions and Future Work
The results from our experiment showed that the COVID-19 data points in the data set are clearly different from the non-COVID-19 cases. Even though RCC does not separate the COVID-19 cases and non-COVID-19 cases into two distinct classes, it is still able to create clusters that distinguish both cases. This reveals that there is latent information within COVID-19 images and that there is an underlying similarity between many COVID-19 cases.

RCC likely performs well because it uses a more refined measure of pairwise similarity than simple \ell_2 distance. The connectivity structure constructed by m-kNN places a stricter requirement on determining similarities between data points, enabling the algorithm to perform more informed data clustering.

Potential limitations of our work come from the relative ambiguity as to the reasoning for classification and from the overall small sample of COVID-19 cases in the data, which may not be representative of all real COVID-19 cases. It can be inferred that the COVID-19 cases present in the dataset are likely severe, as they warranted a professional lung scan, meaning that our findings may be limited to only the more severe COVID-19 cases. Considering this possibility, our findings are nonetheless relevant, as they still display an inherent similarity between the COVID-19 cases in the data and distinguish them from other viral pneumonia. In order to combat this potential limitation, future implementations of these techniques can be performed on cleaner, more rigorously constructed datasets.

One direction for future work is to explore the effectiveness of different feature engineering techniques on this data set. Consideration of dimensionality reduction techniques other than PCA, which may be more finely tuned to the COVID-19 dataset, is a worthwhile future route. Moreover, as new data emerges, we can further refine our experimental methodology.

Acknowledgments
We would foremost like to thank the PIC Math Program for this unique and rewarding opportunity. PIC Math is a program of the Mathematical Association of America (MAA). Support for this MAA program is provided by the National Science Foundation (NSF grant DMS-1722275) and the National Security Agency (NSA). We thank Dr. Shuai Zhang for proposing this interesting and important problem and for contributing useful suggestions throughout the project. We also thank Dr. Fred Park for his dedication, guidance, and support throughout this work. Lastly, we would also like to thank Qualcomm, Whittier College, NSF, MAA, and NSA for their assistance in our work.
References

[1] D. Arthur and S. Vassilvitskii, k-means++: the advantages of careful seeding, pp. 1027–1035, in SODA '07: Proceedings of the eighteenth annual ACM-SIAM symposium on discrete algorithms, Society for Industrial and Applied Mathematics, Philadelphia, PA, 2007.
[2] M. Brito, E. Chavez, A. Quiroz, and J. Yukich, Connectivity of the mutual k-nearest-neighbor graph in clustering and outlier detection, Statistics & Probability Letters, 35 (1997), pp. 33–42.
[3] M. E. Chowdhury, T. Rahman, A. Khandakar, R. Mazhar, M. A. Kadir, Z. B. Mahbub, K. R. Islam, M. S. Khan, A. Iqbal, N. Al-Emadi, et al., Can AI help in screening viral and COVID-19 pneumonia?, arXiv preprint arXiv:2003.13145, (2020).
[4] D. Erhan, Y. Bengio, A. Courville, P.-A. Manzagol, P. Vincent, and S. Bengio, Why does unsupervised pre-training help deep learning?, Journal of Machine Learning Research, 11 (2010), pp. 625–660, http://jmlr.org/papers/v11/erhan10a.html.
[5] G. E. Hinton and S. T. Roweis, Stochastic neighbor embedding, in Advances in Neural Information Processing Systems, 2003, pp. 857–864.
[6] H. Hotelling, Analysis of a complex of statistical variables into principal components, Journal of Educational Psychology, 24 (1933), p. 417.
[7] L. Hubert and P. Arabie, Comparing partitions, Journal of Classification, 2 (1985), pp. 193–218, https://doi.org/10.1007/BF01908075.
[8] A. K. Jain, Data clustering: 50 years beyond k-means, Pattern Recognition Letters, 31 (2010), pp. 651–666.
[9] M. Meilă, Comparing clusterings: an information based distance, Journal of Multivariate Analysis, 98 (2007), pp. 873–895.
[10] K. Pearson, On lines and planes of closest fit to systems of points in space, The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science, 2 (1901), pp. 559–572.
[11] T. Rahman, M. Chowdhury, and A. Khandakar, COVID-19 radiography database, 2020.
[12] L. Schmarje, M. Santarossa, S.-M. Schröder, and R. Koch, A survey on semi-, self- and unsupervised techniques in image classification, (2020).
[13] S. A. Shah and V. Koltun, Robust continuous clustering, Proceedings of the National Academy of Sciences, 114 (2017), pp. 9814–9819, https://doi.org/10.1073/pnas.1700770114.
[14] L. van der Maaten, Learning a parametric embedding by preserving local structure, in Artificial Intelligence and Statistics, 2009, pp. 384–391.
[15] L. van der Maaten, Accelerating t-SNE using tree-based algorithms, Journal of Machine Learning Research, 15 (2014), pp. 3221–3245.
[16] L. van der Maaten and G. Hinton, Visualizing data using t-SNE, Journal of Machine Learning Research, 9 (2008), pp. 2579–2605.
[17] L. van der Maaten and G. Hinton, Visualizing non-metric similarities in multiple maps, Machine Learning, 87 (2012), pp. 33–55.
[18] N. X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison: is a correction for chance necessary?, in Proceedings of the 26th Annual International Conference on Machine Learning, 2009, pp. 1073–1080.
[19] N. X. Vinh, J. Epps, and J. Bailey, Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance, The Journal of Machine Learning Research, 11 (2010), pp. 2837–2854.
[20]