COVID-19 Literature Topic-Based Search via Hierarchical NMF

Rachel Grotheer
Wofford College, Spartanburg, SC
[email protected]

Kyung Ha, Yihuan Huang, Pengyu Li, Xia Li, Longxiu Huang, Deanna Needell, Elizaveta Rebrova
University of California, Los Angeles, Los Angeles, CA
{kyungha, charlotte0408, erby1215, xli51}@g.ucla.edu
{huangl3, deanna, rebrova}@math.ucla.edu

Alona Kryshchenko
California State University Channel Islands, Camarillo, CA
[email protected]

Oleksandr Kryshchenko
LWS Research, West Hollywood, CA
[email protected]
Abstract
A dataset of COVID-19-related scientific literature is compiled, combining articles from several online libraries and selecting those with open access and full text available. Then, hierarchical nonnegative matrix factorization is used to organize literature related to the novel coronavirus into a tree structure that allows researchers to search for relevant literature based on detected topics. We discover eight major latent topics and 52 granular subtopics in the body of literature, related to vaccines, genetic structure and modeling of the disease and patient studies, as well as related diseases and virology. In order that our tool may help current researchers, an interactive website is created that organizes available literature using this hierarchical structure.
The appearance of the novel SARS-CoV-2 virus on the global scale has generated demand for rapid research into the virus and the disease it causes, COVID-19. However, the literature about coronaviruses such as SARS-CoV-2 is vast and difficult to sift through. This paper describes an attempt to organize existing literature on coronaviruses, other pandemics, and early research on the current COVID-19 outbreak in response to the call to action issued by the White House Office of Science and Technology Policy (Science and Policy, 2020) and posted on the Semantic Scholar (Scholar, 2020) and Kaggle (2020) websites. The original dataset posted on that site is augmented by adding articles drawn from other databases in order to make the final interactive organizational structure more robust for researchers.

Our primary goal is to create a framework for a topic-based search of papers within this dataset that is helpful to those investigating the novel coronavirus, SARS-CoV-2, and the global COVID-19 pandemic. In order to discover the latent topics present in the collection of scholarly articles, as well as to organize them into a hierarchical tree structure that allows for an interactive search, we use a modified hierarchical nonnegative matrix factorization (HNMF) approach. A website (http://covid-19-literature-clustering.net/) that allows users to walk through the topic tree based on the top keywords associated with each topic is created using this hierarchical organization of the papers.

Our methods help make sense of a vast and rapidly growing body of COVID-19 related literature. The main contributions of this paper are as follows:

• A diverse dataset of COVID-19 related scientific literature is compiled, consisting of articles with full text available drawn from several online collections.

• A tree-like soft cluster structure is created of all the papers in the dataset based on the inherent relation between their topics using hierarchical NMF.
• The best number of topics for each layer is defined as the number that produces the most consistent clustering of the dataset with random initializations of the NMF algorithm. A variance analysis method is used to identify the best number of topics on each layer.

• The effectiveness of the method is measured by exploring the coherence of each topic and the dissimilarity between the topics.

• The discovered topics and the distribution of articles into each of the topics are discussed, revealing the major areas of interest and research in the early months of the pandemic, as well as how existing epidemic literature can be effectively organized to allow efficient comparison to COVID-19 related research.

• The theoretical results are complemented with an interactive website.
Some relevant works that motivate our approach are briefly reviewed. NMF was first proposed for document clustering (Xu et al., 2003), and since then many variants of the NMF algorithm have been proposed and applied to help organize various types of data (Lee and Seung, 1999; Buciu, 2008; Kuang et al., 2015). In particular, several recent papers use NMF to find a hierarchy of topics in a set of documents. For example, Kuang and Park (2013) apply a rank-2 NMF to the recursive splitting of a text corpus and also provide an efficient on-the-fly stopping criterion. Gao et al. (2019) discuss a different version of HNMF, in which the hierarchy of topics is generated by aggregation of the topics (rather than splitting). The first application of NMF produces the initial set of the most refined topics, and the subsequent NMF iterations find supertopics in which the previous set of topics can be summarized. This approach is referred to as a bottom-to-top viewpoint, and the former as top-to-bottom.
(A soft cluster structure means that clusters can intersect, as one paper could belong to more than one topic.)

Approaches that utilize tools from neural networks, such as backpropagation, to improve the topic representations have also been developed recently (Trigeorgis et al., 2016; Le Roux et al., 2015; Sun et al., 2017; Gao et al., 2019). Tu et al. (2018) propose a hierarchical online non-negative matrix factorization method (HONMF) to generate topic hierarchies from data streams. The proposed method can dynamically adjust the topic hierarchy to adapt to the emerging, evolving, and fading process of the topics. This work most closely aligns with what we present here, and although we do not consider the online setting, our method can easily be adapted to it.

Finally, several authors have sought to address the issue of interpretability of topics discovered by NMF, especially in datasets comprised of text documents. For example, Ailem et al. (2017) apply NMF to the documents using a word embedding model, Word2Vec (Mikolov et al., 2013b), that focuses on the semantic relationship between words. We make use of this embedding to analyze the usefulness of the topics generated by examining their semantic similarity.
The dataset used is compiled from four different databases that contain scholarly articles related to COVID-19, various coronavirus diseases, other infectious diseases, and epidemiology (Scholar, 2020; Centers for Disease Control and Prevention, 2020; National Center for Biotechnology Information, 2020; bioRxiv, 2020). From each of these databases, only articles written in English that have a complete abstract and text body available are included. Punctuation, stop words, and words deemed to be irrelevant, such as "copyright" or "et al", are removed from the text body and abstract of each article, and the articles are lemmatized. Further, each word in the text body and abstract is represented by a TF-IDF embedding (Salton and Buckley, 1988). After processing and cleaning, the final dataset contains 25,663 articles. Most of these databases are regularly updated, and one of the important future directions of this work will be developing a dynamic tree structure that pulls new articles from these databases weekly.
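As an illustration of this preprocessing pipeline, the sketch below builds a TF-IDF term-document matrix with scikit-learn. The toy documents and the extra stop-word list are invented for the example, and the lemmatization step is omitted for brevity.

```python
# Sketch of the preprocessing step: build a d x n TF-IDF term-document
# matrix from raw article text (toy documents; lemmatization omitted).
from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "Coronavirus vaccine development for swine diseases.",
    "Genetic structure of the SARS coronavirus genome.",
    "Hospital studies of patients with respiratory illness.",
]

# English stop words plus words deemed irrelevant, as in the paper.
stop_words = list(text.ENGLISH_STOP_WORDS.union(["copyright", "et", "al"]))

vectorizer = TfidfVectorizer(stop_words=stop_words, lowercase=True)
X = vectorizer.fit_transform(docs).T  # transpose to d x n (terms x documents)
print(X.shape)
```

In the paper's setting, `docs` would hold the cleaned abstracts and text bodies of all 25,663 articles.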
In a vector space model, a corpus can be represented by a $d \times n$ matrix $X$, where $d$ is the size of the vocabulary and $n$ is the number of documents. The underlying assumptions in topic modeling (Blei et al., 2007) are that a latent topic can be represented as a distribution over the words, and that every document is a mixture of topics, i.e., comprises a statistical distribution of topics that can be obtained by "adding up" all of the distributions of all the topics covered. In this section, we introduce how to apply hierarchical NMF for topic detection and the creation of the hierarchical tree structure. As a preliminary step, a brief introduction to using NMF for topic detection is given.

In NMF, the corpus matrix $X \in \mathbb{R}^{d \times n}_{\geq 0}$ is decomposed into a pair of low-rank nonnegative matrices $W \in \mathbb{R}^{d \times k}$, also known as the dictionary matrix, and $H \in \mathbb{R}^{k \times n}$, also known as the coding matrix, by solving the following optimization problem:
$$\inf_{W \in \mathbb{R}^{d \times k}_{\geq 0},\; H \in \mathbb{R}^{k \times n}_{\geq 0}} \|X - WH\|_F, \qquad (1)$$
where $\|A\|_F = \big(\sum_{i,j} A_{ij}^2\big)^{1/2}$ denotes the matrix Frobenius norm.

NMF, essentially an iterative optimization algorithm, has a drawback: the objective function is usually non-convex and has multiple local minima. Therefore, a different random initialization of the NMF procedure will result in a different matrix factorization. More importantly, this changes the interpretation of the results, including the topic vector representations ($W$) as well as the relevance between articles and topics ($H$). Another possible source of variability in the algorithm is the choice of the number of topics, $k$. Different combinations of initializations of $W$, $H$, and $k$ yield different topics, leading to different article clustering results. See Section 5.1 for more discussion and implementation details in this vein.

The traditional NMF method treats the detected topics as a flat structure, which limits the representational ability of the method.
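A minimal sketch of the factorization in (1), using scikit-learn's `NMF` (which minimizes the Frobenius objective by default) on a synthetic nonnegative matrix; it also illustrates the seed sensitivity discussed above. The matrix sizes and seeds are arbitrary choices for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((50, 30)))  # synthetic d x n corpus matrix

k = 4
# Two runs with different random initializations: both factor X ~ W H with
# nonnegative factors, but they generally reach different local minima.
model_a = NMF(n_components=k, init="random", random_state=1, max_iter=500)
W_a = model_a.fit_transform(X)   # d x k dictionary (topic) matrix
H_a = model_a.components_        # k x n coding matrix

model_b = NMF(n_components=k, init="random", random_state=2, max_iter=500)
W_b = model_b.fit_transform(X)

print(np.linalg.norm(X - W_a @ H_a) / np.linalg.norm(X))  # relative error
```

Comparing the columns of `W_a` and `W_b` (e.g., by cosine similarity) shows how much the recovered topic dictionaries depend on the seed, which motivates the consistency criterion of Section 5.1.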
A hierarchical structure, such as a tree, generally provides a more comprehensive description of the data. Given the complex nature of the coronavirus literature corpus, such a hierarchical approach is appealing. In this work, a hierarchical NMF (HNMF) framework is applied which is able to detect supertopics, subtopics, and the relationship between them, creating a tree structure. The proposed HNMF algorithm is summarized in Algorithm 1.

In HNMF, NMF is first applied to the original corpus matrix X to obtain the dictionary matrix W and coding matrix H. The documents are then sorted into matrices X_1, X_2, ..., X_k, each representing a different topic, according to the coding matrix H, or into the matrix X_e that temporarily holds unassigned articles. Whether the leaves need to be further divided depends on the number of documents in each topic matrix (leaf). If the number of documents sorted into a topic is greater than a pre-specified value m, then a further division is needed. The above process is repeated until the number of documents in each leaf is less than m. For more details on the implementation of the HNMF algorithm, the reader can refer to Section 5.2.

Algorithm 1:
Hierarchical NMF
Input: corpus matrix X.
[W, H] = NMF(X, k*), where the topic number k* is chosen by Algorithm 2;
assign articles to the related topics X_1, ..., X_{k*} according to the threshold α applied to H, and any remaining articles to the "Extra Document" matrix X_e;
while some topic i contains more than m articles do
    determine the k*_i of topic i in X_i by Algorithm 2;
    [W_i, H_i] = NMF(X_i, k*_i);
    assign the documents to the subtopics by the threshold α applied to the H_i's;
    assign the rest to X_e;
end
for each article x_i in X_e do
    calculate the cosine similarity between x_i and the leaves, and assign the article to the most related leaf;
end
repeat both the while and for loops until the number of articles assigned to each topic is less than m.

This section begins with a discussion and visualization of the hierarchical tree structure obtained using Algorithm 1. Then, in Sections 4.3 and 4.4, quantitative evidence is provided that the discovered topics are reasonable. In doing this, we seek to measure both the rationality of a given topic and the similarity between topics, to evaluate whether the topics differ enough to be useful for a user.

Figure 1: Sunburst diagram of the complete hierarchical structure. The top three relevant words per topic are shown. The area of each region is proportional to the number of articles in that topic. See the appendix for the keywords associated with the third layer. The inner-circle numeric labels correspond to the topic numbers in Figure 2.
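The recursive splitting in Algorithm 1 can be illustrated with a compact sketch, under simplifying assumptions: a fixed topic number `k` per layer (rather than Algorithm 2), a fixed depth cap, and no extra-document reassignment step. The names `alpha` (soft-assignment threshold) and `m` (maximum leaf size) follow the paper; everything else is invented for the example.

```python
import numpy as np
from sklearn.decomposition import NMF

def hnmf(X, doc_ids, k, alpha, m, depth=2, prefix=""):
    """Recursively split a term-document matrix into a topic tree.

    Soft assignment: a document joins every topic whose (normalized)
    coding weight exceeds alpha, so it may land in several leaves.
    """
    model = NMF(n_components=k, init="nndsvd", max_iter=400)
    W = model.fit_transform(X)                       # d x k dictionaries
    H = model.components_                            # k x n codings
    H = H / (H.sum(axis=0, keepdims=True) + 1e-12)   # normalize per document

    tree = {}
    for i in range(k):
        members = np.where(H[i] > alpha)[0]
        name = f"{prefix}{i + 1}"
        if len(members) > m and depth > 0:           # split oversized topics
            tree[name] = hnmf(X[:, members], [doc_ids[j] for j in members],
                              k, alpha, m, depth - 1, prefix=name + "-")
        else:
            tree[name] = [doc_ids[j] for j in members]
    return tree

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((40, 60)))            # synthetic corpus
tree = hnmf(X, list(range(60)), k=3, alpha=0.2, m=10)
print(list(tree.keys()))
```

The returned nested dictionary mirrors the paper's tree: internal nodes map subtopic names (e.g., "7-1") to subtrees, and leaves hold document identifiers.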
Implementation of Algorithm 1 on the dataset results in a hierarchical clustering of the articles into eight supertopics, each with five to six subtopics. Two of these subtopics, the first and fourth subtopics of supertopic 7, are further decomposed into a third layer of subtopics, as the number of articles assigned to them is larger than the selected m in Algorithm 1. The full hierarchical tree structure is visualized in the diagram in Figure 1. Each color represents one of the eight supertopics, and the size of each slice is proportional to the number of articles that are clustered into that topic. It is important to note that only the top three words associated with each topic are shown due to space constraints, but in some cases extending the list of highly related words is necessary to clarify the difference between the subtopics. For reference, the top ten keywords associated with each topic and subtopic can be found in Appendix 7. Additionally, the five most probable words associated with each topic are displayed on the associated website to aid users in more effectively choosing the topics of personal interest.

In order to examine the structure in more depth, Figure 2 displays a branch of the resulting tree represented by word clouds, generated from the top five words associated with each topic. The size of the words in each word cloud cell is proportional to their weight in the corresponding W matrices, and thus to the probability that they are associated with that topic. In particular, the figure follows one path down the tree structure, focusing on Topic 7 and its associated subtopics, and then continuing to the subtopics of Topic 7-1. When moving to deeper layers in the tree, the general "health" and "model" topic further differentiates into subtopics ranging from public health, to animal-to-human disease transmission, to data modeling. Finally, the public health subtopic leads to clusters of articles specifically related to China or hospital care, for example.
Perhaps not surprisingly, the topic to which the highest number of articles is assigned, Topic 7, is about the general study of the disease (with the most highly associated words being "health, model, disease, case, epidemic, outbreak, public, country, population, transmission"), and is further split into two additional layers of subtopics. This is the only topic that was split into a third layer, allowing a more effective differentiation between articles covering a similar topic.

Also unsurprisingly, much of the literature, which was compiled early on during the pandemic, is clustered around the study of other coronavirus-caused diseases. Topic 8, for example, focuses on vaccine development through the lens of the Porcine Epidemic Diarrhea Virus (PEDV). Although this is a coronavirus found only in pigs, several vaccines have been developed, especially within the last seven years, since PEDV was first discovered in North America (Gerdts and Zakhartchouk, 2017). Hence, it is reasonable that this topic would be of interest to current researchers looking to develop a vaccine for SARS-CoV-2. Similarly, Topic 1 focuses on coronaviruses known to infect humans, such as SARS-CoV and MERS-CoV. Topic 4 also contains a couple of subtopics that look specifically at the genetic structure of SARS-CoV.

Figure 2: Part of the topics from HNMF and related topic coherence. The first row shows the keywords for the topics in the first layer, the second row shows the subtopics of Topic 7, and the third row shows the subtopics of Topic 7-1. The corresponding topic coherence score (see Section 4.3 for more details) is shown underneath each word cloud.
Other topics of interest focus on articles about diseases with related symptoms, although they may be caused by a different type of virus. For example, both Topics 5 and 6 examine literature related to respiratory illnesses such as influenza, though Topic 5 clusters articles more related to laboratory study and Topic 6 clusters articles more related to hospital studies and patient care.

Other major topics focus more on microbiology, including the genomic structure of the virus, cellular infection and immune response, and cell-protein interaction. Thus, the hierarchical tree structure separates papers between macro- (public health) and micro- (biological) studies of the virus, and into papers that study related viruses. This creates a clear delineation of topics for those investigating papers, and gives insight into areas of interest for early researchers of SARS-CoV-2. This organizational structure appears to be more robust and high-level than, e.g., a keyword-based search or organization.
One measure of the effectiveness of the topics discovered by HNMF is topic coherence. Topic coherence is a quantitative measure of how well the keywords that define a topic make sense as a whole to a human observer. As defined by Mimno et al. (2011), the coherence score $C_i$ for topic $i$, $i = 1, \ldots, k$, with the set $V^{(i)} = \{v_1^{(i)}, \ldots, v_P^{(i)}\}$ of the $P$ most probable words in that topic, is given by
$$C_i(V^{(i)}) = \sum_{p=2}^{P} \sum_{\ell=1}^{p-1} \log \frac{D(v_p^{(i)}, v_\ell^{(i)}) + 1}{D(v_\ell^{(i)})}, \qquad (2)$$
where $D(v_p^{(i)}, v_\ell^{(i)})$ is the number of documents in topic $i$ containing at least one occurrence of both words $v_p^{(i)}$ and $v_\ell^{(i)}$, and $D(v_\ell^{(i)})$ is the number of documents in topic $i$ containing the word $v_\ell^{(i)}$. The topic coherence scores for each of the topics in the first layer, as well as the scores for the subtopics in levels 2 and 3 for Topic 7, can be seen in the word cloud display in Figure 2. A positive coherence score indicates that the keywords form a grouping that would be recognizable to a human expert. A negative coherence score indicates that a topic is less meaningful, which may occur, for example, if the associated keywords fall into two unrelated groups, or if the keywords are seemingly random and have no obvious connection. All of our identified subtopics have large positive coherence scores, suggesting that by this metric they are understandable and useful to human users.

Another test of the usefulness of the hierarchical structure generated is to evaluate whether the topics are different enough to allow for an informative choice between them. To evaluate this, we quantify topic similarity using a metric known as the Word Mover's Distance (WMD). WMD is a popular tool for measuring distances between documents (Kusner et al., 2015). WMD utilizes
Word2Vec (Mikolov et al., 2013a), a word embedding technique, and treats each document as a set of vectors in the embedded vector space. This embedding allows the WMD metric to consider the semantic meaning of a given word, rather than just its spelling. Thus, for example, it allows for the identification of synonyms as having the same meaning in a given context despite being different words, which makes it preferable to traditional metrics such as cosine similarity or Euclidean distance. The distance between two documents A and B is defined as the minimum cumulative distance that words from document A need to travel to match exactly the words of document B.

The topic similarity across the layers and within each layer is evaluated by computing the WMD between a topic and its associated subtopics and between the subtopics themselves, where each topic is represented by its 100 most related words. The similarities between all topics in the hierarchical structure obtained from HNMF are visualized in the heat map in Figure 3. As indicated by the overall dark colors, in general each topic in the tree is dissimilar from the others.

When examining the similarities between a topic and its subtopics, results show that for a given topic, its subtopics are less correlated with each other than with their parent topic. For example, in Figure 4, for Topic 7, the similarity scores between its subtopics are much lower than the scores between the subtopics and their parent Topic 7. Similar results can be drawn for Topic 7-1 and its subtopics, as shown in Figure 5.
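Computing exact WMD requires solving an optimal transport problem; the sketch below instead implements the relaxed WMD lower bound from Kusner et al. (2015), where each word simply travels to its nearest counterpart in the other document. The tiny two-dimensional "embedding" is a toy stand-in for a trained Word2Vec model.

```python
import numpy as np

def relaxed_wmd(doc_a, doc_b, embed):
    """Relaxed WMD lower bound: each word moves to its nearest word in the
    other document; symmetrized by taking the max of the two directions."""
    def one_way(src, dst):
        return sum(min(np.linalg.norm(embed[w] - embed[v]) for v in dst)
                   for w in src) / len(src)
    return max(one_way(doc_a, doc_b), one_way(doc_b, doc_a))

# Toy embedding standing in for Word2Vec vectors.
embed = {
    "influenza": np.array([1.0, 0.0]),
    "flu":       np.array([0.9, 0.1]),
    "hospital":  np.array([0.0, 1.0]),
}

# Semantically close words yield a small distance, unrelated words a large one.
print(relaxed_wmd(["influenza"], ["flu"], embed))       # ≈ 0.1414
print(relaxed_wmd(["influenza"], ["hospital"], embed))  # ≈ 1.4142
```

In the paper's evaluation, each "document" would be a topic's 100 most related words, embedded by a Word2Vec model trained on the corpus.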
Figure 3: Topic similarity for all the topics from HNMF, measured by WMD. A dark color indicates the topics are dissimilar, while a light color indicates high similarity. Note that the topics are listed from the first layer to the third layer from top to bottom or right to left on the vertical and horizontal axes, respectively.
Figure 4: Topic similarity between Topic 7 and its subtopics, measured by WMD. Topic 7 has high topic similarity with its five subtopics (7-1, 7-2, 7-3, 7-4, 7-5), and the five subtopics have low similarity between themselves.

Figure 5: Topic similarity between Topic 7-1 and its subtopics, measured by WMD. Topic 7-1 has high topic similarity with its four subtopics (7-1-1, 7-1-2, 7-1-3, 7-1-4), and the four subtopics have low similarity between themselves.

However, there are some high similarity scores between subtopics that belong to different topics, for example the light off-diagonal spot in Figure 3 showing the similarity between Topics 6-3 and 5-3. Examining the top ten keywords associated with each topic, we find that both topics are associated with the words "influenza", "virus", and "study", indicating that both deal with studies related to the influenza virus.

The insight into the difference between the two subtopics comes from examining supertopics 5 and 6 and the keywords associated with each subtopic that do not overlap. Looking at words such as "detection" and "assay" associated with Topic 5, and "surveillance", "case", "season", and "year" associated with Topic 5-3, it appears that Topic 5-3 is more associated with detecting and monitoring the prevalence of cases of influenza in the general populace in a given flu season. On the other hand, given the keywords "patient", "hospital", "clinical", and "study" associated with the parent topic, Topic 6, as well as "patient", "child", and "respiratory" associated with Topic 6-3, it seems that Topic 6-3, while also related to influenza studies, deals more specifically with cases in a hospital setting, perhaps specifically related to children, and with examining the relationship with respiratory illness in general.

A study of similar subtopics such as these shows the effectiveness of the tree in separating related topics into more dissimilar supertopics, making navigation to articles of interest clear.
However, Algorithm 1 allows for an article to be assigned to more than one subtopic, acknowledging that a single article may be of equal interest to researchers investigating different, but related, topics.
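The coherence score in Equation (2) of the previous section can be computed directly from document-word co-occurrence counts. A minimal sketch, with invented toy documents (each keyword is assumed to occur in at least one document, so the denominator is never zero):

```python
import math

def topic_coherence(keywords, documents):
    """UMass topic coherence (Mimno et al., 2011) for one topic.

    keywords: the P most probable words, ordered by probability.
    documents: token sets for the documents assigned to this topic.
    Assumes every keyword occurs in at least one document.
    """
    def doc_count(*words):
        return sum(1 for doc in documents if all(w in doc for w in words))

    return sum(
        math.log((doc_count(keywords[p], keywords[l]) + 1)
                 / doc_count(keywords[l]))
        for p in range(1, len(keywords))
        for l in range(p)
    )

docs = [{"virus", "vaccine", "pig"},
        {"virus", "vaccine"},
        {"virus", "genome"}]

print(topic_coherence(["virus", "vaccine"], docs))  # → 0.0 (frequent co-occurrence)
print(topic_coherence(["virus", "genome"], docs))   # negative (rarely together)
```

Higher (less negative) scores indicate keywords that tend to appear in the same documents, matching the interpretation given above.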
In this section, we discuss the details of the implementation of HNMF and the construction of the hierarchical structure.
As previously discussed, the latent topics discovered by NMF are sensitive to the initial state of the algorithm, leading to different dictionaries for each topic. In order to reduce this sensitivity, we seek to find an appropriate number of topics, k*, in each layer such that if a k*-topic NMF is initialized using any two random seeds, the content in the topics discovered should be similar, as measured by cosine similarity. We define this as a consistent number of topics. Algorithm 2 summarizes the process to find the "best" number of topics, as defined in this manner, for a corpus matrix X.

In Algorithm 2, first the increment in the proportion of variance explained by adding one more cluster to split the corpus matrix X is plotted. This is calculated by looking at the singular values of X. By examining this plot (Figure 6), a range [k_1, k_2] in which a potential optimal number of topics k* can be found is obtained by noting where the proportion of variance explained starts to level off (here, k_1 = 7).

To determine the value of k* in this range, first, q + 1 seeds are randomly selected, where q is a sufficiently large number; in this case, q = 30 was used. For each number of topics k in [k_1, k_2], topic sets {T_j}_{j=1}^{q+1} are generated using each of the q + 1 random seeds for initializing NMF.

Algorithm 2:
Determine optimal number of topics
Input: integer q, corpus matrix X.
Determine a range [k_1, k_2] for the potential topic number by plotting the increment in variance explained by adding one more cluster to X;
randomly select q + 1 seeds for initialization;
for each integer k in [k_1, k_2] do
    generate topic sets {T_j}_{j=1}^{q+1} from NMF initialized by random seed j;
    generate S_j^k for j = 1, ..., q, where S_j^k is the cosine similarity matrix between the topics in T_j and T_{j+1};
    set LSS_k = ∅; for each S_j^k, j = 1, ..., q, add to LSS_k the score lss, defined as the minimum over all row maxima and column maxima of S_j^k, where s_{ab} is the (a, b)-th entry of S_j^k;
return k* = argmax_k(median(LSS_k)).

Then, the cosine similarity is calculated between each of the k topics for every consecutive pair of T_j's. The similarity scores between the topics for each pair (T_j, T_{j+1}) are stored in a matrix S_j^k in R^{k×k}. Therefore, q such matrices are generated for each k in [k_1, k_2]. For a fixed k, the minimum of all maximum entries from each column and row of each similarity matrix S_j^k is defined to be the least seed similarity (lss) score for that k. The set containing the q lss scores for a given number of topics k is denoted LSS_k. A consistent number of topics should have an overall high similarity between the topics generated for each seed. Therefore, we choose k* in [k_1, k_2] to be the "best" number of topics if the median of all its lss scores is the highest.

The boxplot in Figure 7 shows the distribution of the lss scores for k in the candidate range. In this case, 8 is chosen as the "best" number of topics, since it results in the highest median lss score.

A hierarchical NMF (see Algorithm 1) is applied to cluster the articles, where the number of topics in each layer is determined by Algorithm 2.
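The least-seed-similarity computation at the core of Algorithm 2 can be sketched as follows: topic dictionaries from consecutive seeds are compared via cosine similarity, and lss is the minimum over all row and column maxima of each similarity matrix. The synthetic matrix and parameter values are illustrative only.

```python
import numpy as np
from sklearn.decomposition import NMF
from sklearn.metrics.pairwise import cosine_similarity

def lss_scores(X, k, seeds):
    """Least-seed-similarity scores for a candidate topic number k."""
    dictionaries = []
    for s in seeds:
        W = NMF(n_components=k, init="random", random_state=s,
                max_iter=400).fit_transform(X)
        dictionaries.append(W.T)              # k topic vectors as rows
    scores = []
    for a, b in zip(dictionaries, dictionaries[1:]):
        S = cosine_similarity(a, b)           # k x k topic similarity
        # lss: minimum over all row maxima and all column maxima of S.
        lss = min(S.max(axis=1).min(), S.max(axis=0).min())
        scores.append(lss)
    return scores

rng = np.random.default_rng(0)
X = np.abs(rng.standard_normal((50, 40)))     # synthetic corpus matrix
scores = lss_scores(X, k=3, seeds=range(5))
print(np.median(scores))
```

Running this for every k in the candidate range and taking the k with the highest median score reproduces the selection rule of Algorithm 2.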
The hierarchical tree structure is established from top to bottom and consists of three layers on this dataset (see Figure 1).

Figure 6: Plot of the marginal increment in proportion of variance explained by adding another cluster to split X. It is determined that the ideal number of clusters/topics likely lies in the range [7, ], as this is where the plot starts to level off.

Figure 7: Box plot of LSS_k. Topic number 8 is the "best", as it has the highest median lss (least seed similarity) score, and should be expected to yield consistent results with random seeds.

To generate topics in the first layer, NMF is applied to the matrix X containing all the vectorized articles, resulting in a factorization with 8 topics, as determined by Algorithm 2. Next, a threshold α (in this case, α = 0. ) is chosen, and the articles in X are assigned to a topic class X_1, ..., X_8 if their corresponding document-topic correlation in the H matrix is greater than α. Note that by this definition, one article could be assigned to more than one topic class. After this, any articles not classified into one of the 8 topics are assigned to the "Extra Document" corpus, X_e. Now, the second layer of the tree consists of the text corpora X_1, ..., X_8.

For each X_i, i = 1, ..., 8, in the second layer, the topic is further subdivided into a third layer if the number of articles assigned to topic class i is more than some m (in this analysis, we chose m = ). If it is determined that the text corpus X_i needs to be divided further using NMF, the number of subtopics is chosen by Algorithm 2 and, again, articles from X_i are assigned to each subtopic based on the threshold α. As before, any articles that do not receive a classification are assigned to X_e. This process is continued for each level in the tree until each leaf contains no more than m articles.

Finally, the cosine similarity between each article in X_e and the dictionary associated with each leaf (the topic in the lowest layer of a given branch) is calculated.
Note that the dictionary of a leaf is a column of the W matrix of its parent topic. The articles in X_e are then assigned to the leaf with the highest cosine similarity. After this reassignment, the number of articles associated with each leaf is calculated again, and any leaves containing more than m articles are further subdivided.

HNMF is used to organize existing literature on coronaviruses and pandemics, and early literature on COVID-19, into an interactive structure easily searchable by researchers and available to use through a corresponding website. The topics discovered by HNMF reveal that early research of interest to the COVID-19 research community divides into diverse areas, such as research related to other coronaviruses, research related to other respiratory diseases, virology and genetic research, as well as research relating to the public health response. A topic coherence metric reveals that the topics discovered are consistent and semantically meaningful, while a topic similarity metric reveals that the topics differ sufficiently from one another to allow for a diversity of choice and areas of interest on the part of the user.

In the future, we hope to regularly update the hierarchical structure as well as the associated website as new research papers are added, both by adding new papers and by adding and deleting classifications as new research topics emerge. We hope to do this using an online version of the HNMF algorithm, such as the one in Tu et al. (2018).
References
Melissa Ailem, Aghiles Salah, and Mohamed Nadif. 2017. Non-negative matrix factorization meets word embedding. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.

The Annals of Applied Statistics, pages 6–10. IEEE.

Volker Gerdts and Alexander Zakhartchouk. 2017. Vaccines for porcine epidemic diarrhea virus and other swine coronaviruses. Veterinary Microbiology.

Partitional Clustering Algorithms, pages 215–243. Springer.

Da Kuang and Haesun Park. 2013. Fast rank-2 nonnegative matrix factorization for hierarchical document clustering. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 739–747.

Matt Kusner, Yu Sun, Nicholas Kolkin, and Kilian Weinberger. 2015. From word embeddings to document distances. In International Conference on Machine Learning, pages 957–966.

Jonathan Le Roux, John R. Hershey, and Felix Weninger. 2015. Deep NMF for speech separation. In , pages 66–70. IEEE.

Daniel D. Lee and H. Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature, 401:788–791.

Tomas Mikolov, Kai Chen, Greg S. Corrado, and Jeffrey Dean. 2013a. Efficient estimation of word representations in vector space.

Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013b. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems, pages 3111–3119.

David Mimno, Hanna Wallach, Edmund Talley, Miriam Leenders, and Andrew McCallum. 2011. Optimizing semantic coherence in topic models. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 262–272.

Gerard Salton and Christopher Buckley. 1988. Term-weighting approaches in automatic text retrieval. Information Processing & Management.

CoRR, abs/1701.08349.

George Trigeorgis, Konstantinos Bousmalis, Stefanos Zafeiriou, and Björn W. Schuller. 2016. A deep matrix factorization method for learning attribute representations. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(3):417–429.

Ding Tu, Ling Chen, Mingqi Lv, Hongyu Shi, and Gencai Chen. 2018. Hierarchical online NMF for detecting and tracking topic hierarchies in a text stream. Pattern Recognition, 76:203–214.

Wei Xu, Xin Liu, and Yihong Gong. 2003. Document clustering based on non-negative matrix factorization. In Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 267–273.
The following tables list the ten most probable keywords associated with each topic and subtopic in the tree generated by HNMF. These keywords are visible to website users to enable them to make choices to navigate through the tree. Note that for the first layer we give suggested topic titles. Not being experts in the field, these are only suggestions to give an idea of the types of research someone may be looking for within that topic.

Table 1: The top 10 keywords associated with each of the 8 topics in the first layer of the tree, with suggested topic titles:
1. Coronaviruses affecting humans
2. Cellular immune response to viral infection
3. Genetic characteristics of the virus
4. Cell-protein interaction
5. Detection and biological study of respiratory viruses
6. Clinical and hospital studies (esp. of respiratory illnesses)
7. Infection models and experiments related to public health
8. Vaccine development (esp. of the coronavirus PEDV)
Table 2: The top 10 keywords associated with each of the subtopics of Topic 1 in the 2nd layer of the tree.

Table 3: The top 10 keywords associated with each of the subtopics of Topic 2 in the 2nd layer of the tree.

Table 4: The top 10 keywords associated with each of the subtopics of Topic 3 in the 2nd layer of the tree.

Table 5: The top 10 keywords associated with each of the subtopics of Topic 4 in the 2nd layer of the tree.

Table 6: The top 10 keywords associated with each of the subtopics of Topic 5 in the 2nd layer of the tree.

Table 7: The top 10 keywords associated with each of the subtopics of Topic 6 in the 2nd layer of the tree.

Table 8: The top 10 keywords associated with each of the subtopics of Topic 7 in the 2nd layer of the tree.

Table 9: The top 10 keywords associated with each of the subtopics of Topic 8 in the 2nd layer of the tree.

Table 10: The top 10 keywords associated with each of the subtopics of Topic 7-1 in the 3rd layer of the tree.

Table 11: The top 10 keywords associated with each of the subtopics of Topic 7-4 in the 3rd layer of the tree.