A Glimpse of the First Eight Months of the COVID-19 Literature on Microsoft Academic Graph: Themes, Citation Contexts, and Uncertainties
AA Glimpse of the First
Eight
Months of the COVID ‐ Literature on Microsoft
Academic
Graph:
Themes,
Citation
Contexts, and
Uncertainties
Chaomei Chen College of Computing and Informatics Drexel University, 3141 Chestnut Street, Philadelphia, PA 19104, USA Email: [email protected] Keywords: Microsoft Academic, CiteSpace, Citation Context, Concept Tree, Uncertainty
Abstract
As scientists worldwide search for answers to the overwhelmingly unknown behind the deadly pandemic, the literature concerning COVID-19 has been growing exponentially. Keeping abreast of the body of literature at such a rapidly advancing pace poses significant challenges not only to active researchers but also to the society as a whole. Although numerous data resources have been made openly available, the analytic and synthetic process that is essential in effectively navigating through the vast amount of information with heightened levels of uncertainty remains a significant bottleneck. We introduce a generic method that facilitates the data collection and sense-making process when dealing with a rapidly growing landscape of a research domain such as COVID-19 at multiple levels of granularity. The method integrates the analysis of structural and temporal patterns in scholarly publications with the delineation of thematic concentrations and the types of uncertainties that may offer additional insights into the complexity of the unknown. We demonstrate the application of the method in a study of the COVID-19 literature.
Introduction
The COVID-19 pandemic has impacted so many people’s everyday life worldwide and it is still threatening our society as a whole. The COVID-19 pandemic is unprecedented in several ways that make it particularly challenging and threatening: it is still largely unknown of its origin and transmission routes; there may be months or even longer before COVID-19 vaccines can be expected to be a powerful line of defense; the prolonged pandemic continues to pose social, economic, and political challenges to businesses and entire industries such as airlines and international travels as well as schools and many other areas. Scientists and researchers have actively responded to the urgency and severity of the pandemic. Publications relevant to COVID-19 have increased rapidly across disciplines since the beginning of the year 2020. Several institutions and corporate organizations have contributed openly accessible datasets of COVID-19 publications, notably including the CORD-19 dataset and the Lens. The CORD-19 dataset, the COVID-19 Open Research Dataset, covers the scholarly literature of COVID-19, SARS-CoV-2, and the coronavirus group . Its initial release contained over 29,000 articles, over 13,000 of which contained full text. The CORD-19 dataset has been updated regularly. By the beginning of September 2020, the CORD-19 dataset reached 130,000 articles. It has been studied by many researchers, especially from communities that are well-equipped to nalyze and model text documents, including data science and AI for example. A scientometric researcher, however, may find the CORD-19 dataset and other similar datasets inadequate, for example, due to the lack of cited references as part of the dataset, missing elements such as abstracts, or other data coverage or quality issues. Several widely used analytic methods in the field of scientometrics rely on citations in scholarly publications to derive indicators and metrics of underlying research and its impact, for example, including the h-index and the g-index. Eugene Garfield pioneered the notion of Science Citation Index in the 1950s, which subsequently formed the basis for citation analysis and quantitative studies of science in general (Garfield 1955). Typically, authors of a scholarly publication refer to, i.e. cite, previously published works in their narratives. There are many studies of how and why authors ought to cite their references appropriately and how and why they often failed to do so (Greenberg 2009). The Web of Science, Scopus, Dimensions, the Lens, and Microsoft Academic (MA) are among the most widely used multi-disciplinary bibliographic databases (Visser, Eck et al. 2020), although they differ considerably in terms of the breadth and depth of their coverage, data quality, and accessibility. Document Co-Citation Analysis (DCA) (Small 1973) and Author Co-Citation Analysis (ACA) (White and McCain 1998) are network-based analytic methods for studying scientific literature. Networks in such studies are constructed based on co-occurrences of a pair of entities such as cited references or cited authors. The resultant networks then lend themselves to a wide variety of network analysis, modeling, and visualization techniques. Although networks can be derived from co-occurrences of words and phrases in text, references may play valuable roles at a higher level of abstraction, especially in what is known as concept symbols (Small 1978, Small 1986). A major advantage of bibliographic databases such as the Web of Science and Scopus is the provision of references cited by a scholarly publication. These databases are conveniently used by researchers to conduct citation analysis and other citation-based studies. In contrast, PubMed, a well-known resource of the literature of biomedicine, does not provide references of an article as part of its metadata. CORD-19 provides a similar depth of coverage as PubMed, i.e. the references cited by an article are not readily available from the dataset, which prevent researchers from conducting citation-based network analyses. Microsoft Academic (MA) is a major source behind the construction and dissemination of CORD-19. Microsoft Academic Graph (MAG) organizes entities and relations of scholarly publications as a graph and allows flexible ways to retrieve bibliographic data, including references cited by an article (Wang, Shen et al. 2019). Furthermore, instances of a citation of a reference, often known as citances (Nakov, Schwartz et al. 2004), are also retrievable from MAG in terms of citation contexts. Citation contexts of a reference consist of sentences in which the reference is cited along with surrounding sentences to provide a meaningful context. In the rest of the article, we will introduce a generic and reproducible method based on MA for visual analytic studies of the COVID-19 literature. In fact, the method is applicable to the literature of other topics. This method contributes to the study of a research field as follows: 1) enable anyone who is interested to construct their own CORD-19-like dataset with an enriched inclusion of cited references and citation contexts; 2) eliminate the barrier to conduct scientometric studies with the citation enriched data; 3) expand the current visual analytic studies of research literature further with the inclusion of citation contexts and enhance network-based analyses with deeper insights identified at a finer level of granularity; 4) enable analysts to interactively exploring uncertainties associated with the citations of a reference; and 5) conduct these analyses in an integrated visual analytic environment – an enhanced implementation of a idely used science mapping tool – CiteSpace (Chen 2004, Chen 2006, Chen, Ibekwe ‐ SanJuan et al. 2010, Chen 2017).
CiteSpace
CiteSpace is designed for conducting a visual analytic study of the scholarly literature of a research field, a research area, or a discipline, collectively known as a knowledge domain (Chen 2004, Chen 2006). CiteSpace constructs a series of networks of underlying entities and their relations derived from a representative dataset of the corresponding knowledge domain. Structural patterns and trends are combined with temporal patterns and indicators to inform analysts significant developments of a research field in question. A typical workflow divides a synthesized network into distinct clusters of entities such as cited references. The analyst would focus on the meaning of individual clusters and their interrelationships so as to uncover critical insights from high-levels of aggregation to lower levels. For example, applications of CiteSpace typically aim to address the following questions: What are the major thematic concentrations in the field of research in terms of clusters of co-occurring references? How are neighboring clusters connected? Which articles serve as bridges between distinct clusters? Which articles are the most representative members of a particular cluster? What would be the most accurate word or phrase to sum up the role of a cluster? Technically, CiteSpace decomposes a network into distinct clusters. Each cluster consists of a set of references that are frequently cited together by the same citing articles. In other words, the member references of a cluster are more often cited within the same article than references from different cluster. The quality of the decomposition is measured in terms of network modularity, the silhouette score of each cluster, and a cluster-size weighted silhouette mean. The closer these values are to 1.0, then higher the overall clarity of the configuration. Conceptually, the position of a reference in the underlying network representation can be used to guide our navigation. According to the Structural Hole Theory (Burt 2004), nodes located at structural holes tend to be associated with creativity, originality, and boundary-spanning. In CiteSpace, structural holes are highlighted as nodes with high betweenness centrality scores. Landmarks provide useful guidance for navigation. Highly cited articles are depicted as large tree-rings of citations per year. Articles with sharp increases in citations they received experience citation bursts, which indicate their particularly noteworthiness because of their extraordinary attraction of attention. CiteSpace supports Structural Variation Analysis (SVA), which is a predictive analytic process that can be used to identify newly published articles with transformative potentials (Chen 2012). Transformative potentials are measured in terms of significant structural variations induced by newly published articles. At a given time, a newly published article contributes to the literature by connecting previously disjoint bodies of thematic concentrations. Such connections may lead to transformative changes to the underlying knowledge structure. Transformative changes are realized if the research community follows and consolidates the promising paths forged by the innovative connections. On the other hand, not all such connections lead to changes that are responded by the research community, thus they are considered as the potential at the point of recognition other than as actual transformative changes. The role of uncertainties in representing scientific knowledge is proposed in our recent work to capture the state of the art of a research field as well as to explain what drives the dynamics of nderlying research activities (Chen and Song 2017, Chen, Song et al. 2018). We proposed a framework of uncertainties with three major types of uncertainties that are measurable at the linguistic level, namely, epistemic uncertainty (E), hedging (H), and transitional (T). Epistemic uncertainty is measured based on the presence of uncertainty cue words that indicate the unknown, incomplete, controversial, and contradictory aspects of a topic. Hedging is measured in terms of hedging words such as may, suggest, and imply. Hedging words are essential linguistic devices to render the certainty or, rather the uncertainty, of a statement more appropriately. Recently, there is a growing interest in the connection between hedging and citations (Small 2018, Small, Boyack et al. 2019). Transitional uncertainties are measured by the presence of transitional words such as however and nonetheless, which often signify a change in an argument or the possibility of multiple alternative interpretations of a situation. These uncertainties are measured in terms of information entropy. Using information entropy has several advantages over other metrics. First, information entropy has been widely used to represent uncertainties. Second, measuring uncertainties at different levels of aggregation follows naturally with information entropy. For example, the uncertainty associated with an article can be defined as the sum of all the uncertainties associated with the text passages of the article, such as its citation contexts. Similarly, the uncertainty of a cluster of cited references is naturally the sum of the uncertainties of all its member references. Currently, citation contexts are not available from the most commonly used bibliographic databases, such as the Web of Science, Scopus, Dimensions, and the Lens, except Microsoft Academic Services (MAS)(Sinha, Shen et al. 2015, Wang, Shen et al. 2019).
Microsoft
Academic
Services
Microsoft Academic Services (MAS) provides two ways to access the Microsoft Academic Graph (MAG): through API to access the MAG hosted by MAS or maintain a copy of MAG of your own. The potential and limitations of Microsoft Academic (MA) for citation analysis have been studied and compared with other commonly used resources (Hug and Brandle 2017, Hug, Ochsner et al. 2017, Thelwall 2017, Thelwall 2018, Visser, Eck et al. 2020). More recent reviews of MAS are presented in (Wang, Shen et al. 2019). An interesting study of the core concepts of classic philosophical books made a good use of the MAG (Bornmann, Wray et al. 2020). Each article in MAG has a MAG ID, for example, 2160441315 for (Chen 2004) and 2150999626 for (Chen 2006). One may retrieve publications from MAG based on their MAG IDs, DOIs, and fields of study. It is also possible to retrieve articles that cite a set of references in their MAG IDs, thus this function enables us to enrich an article’s metadata with its references. For each returned article, citation contexts provide a unique advantage of MAS over other major bibliographic databases. The integration of visual analytics of citation contexts with other citation-based analytic processes is a significant contribution of our new method. To demonstrate the application of the integrated and enhanced method, we focus on the COVID-19 literature. Instead of using the currently available CORD-19 dataset due to its lack of cited references, we construct a dataset of the COVID-19 literature directly from MAS. A major advantage of this approach is its flexibility and extensibility. Anyone who is interested would be able to perform the construction with the new version of CiteSpace on a subject of their own interest. In particular, it is possible to further extend CiteSpace to support Cascading Citation Expansion (Chen and Song 2019), which is so far only feasible in CiteSpace with Dimensions’ API. At the time of writing, the Lens API currently does not support retrieving citing articles of a iven reference, although the Lens ScholarlyWorks API is very powerful in other aspects and it is supported in CiteSpace. We constructed the MA-based COVID-19 dataset based on fields of study. The following query means a qualified record should match with at least one of the fields of study. expr=Or (Composite(F.FN=='coronavirus disease 2019'), Composite(F.FN=='severe acute respiratory syndrome coronavirus 2'), Composite(F.FN=='2019 20 coronavirus outbreak'), Composite(F.FN=='covid-19'), Composite(F.FN=='covid19') ) The dataset constructed on September 5, 2020 contains a total of 79,476 records and 7,693 of them have corresponding citation contexts. This dataset represents a snapshot of the COVID-19 literature published in the first eight months of 2020 on MAG. 77,897 records have been cited at least once. These records are stored in a format that is a superset of the Web of Science field-delimited data format. The citation contexts are stored in a separate file to maximize the interoperability. The integrated analytic functions involving citation contexts use a database on a local host. Users who do not install such databases can still use CiteSpace to process the dataset retrieved from MAS as if they were from the Web of Science. The MAS-enriched COVID-19 dataset can be studied at three levels. First, the dataset contains cited references the same way as bibliographic records of the Web of Science origin. All the visual analytic functions in CiteSpace that can be applied to the Web of Science data now can be applied to the MAS-enriched dataset. We will illustrate a visualization of the co-citation network with monthly time-sliced intervals and an overview of its major thematic concentrations (i.e. clusters of co-cited references). Second, the integrative analysis of citation contexts along with Level-1 analyses, which are visual analytic tasks typically used with tools such as CiteSpace. Citation contexts serve the source of information from which three types of uncertainties are measured. One reference can be cited by the same citing article in multiple citation contexts as well as by multiple citing articles in an even larger number of citation contexts (Ding, Liu et al. 2013, Hu, Chen et al. 2013). Making sense of key themes in the potentially diverse and extensive narratives is a time consuming and cognitively demanding task. Uncertainty scores provide an extra layer of organizing the information and highlight information that is potentially valuable to the sense-making process. Finally, at the third level, we demonstrate the application of the SVA procedure to the dataset and identify a set of newly published articles that are too young to standout in terms of their citation counts but start to show signs of transformative potential.
Data
To demonstrate the method, we constructed a dataset of the COVID-19 literature from the MAG based on matching fields of study. The resultant dataset contains publications as earlier as 2014. Interestingly, the dataset also contains publications from the future, all the way from next month to next year. Table 1 summarizes the dataset. Our subsequent discussions will focus on the subset of the first eight months in 2020.
Table 1. A self-constructed dataset of the COVID-19 literature on Microsoft Academic Graph as of September 5, 2020.
Time of Publication Unique Citing Articles Unique References Citation Contexts Unique Contexts
Table 2 lists the number of articles found on major bibliographic databases on September 5, 2020 with the identical or closely matched queries to the query used for this study. For example, the following query was used to search in the titles only in Dimensions DSL. For full text search, replace the scope title_only with full_data. search publications in title_only for "\"coronavirus disease 2019\" OR \"severe acute respiratory syndrome coronavirus 2\" OR \"2019 20 coronavirus outbreak\" OR \"covid-19\" OR covid19" return publications [id]
A search on Google Scholar used the following query: 'coronavirus disease 2019' OR 'severe acute respiratory syndrome coronavirus 2' OR '2019 20 coronavirus outbreak' OR 'covid-19' OR 'covid19'
The search on MAG used the following query: expr=Or(Composite(F.FN=='coronavirus disease 2019'), Composite(F.FN=='severe acute respiratory syndrome coronavirus 2'), Composite(F.FN=='2019 20 coronavirus outbreak'), Composite(F.FN=='covid-19'), Composite(F.FN=='covid19'))
The CORD-19 dataset is the most comprehensive one with 130,000 articles. Full text searches on Dimensions and the Lens returned over 100,000 articles. Searches in the metadata alone of course returned fewer articles than full text searches. We used fields of study to construct our demonstrative dataset to reduce the complexity of the query formation. More sophisticated strategies such as Cascading Citation Expansion can be used to optimize the overall quality of the data collection step (Chen and Song 2019). A topic search in the Web of Science returned the lowest number of articles in this group. In terms of the percentage of the CORD-19 dataset, the eb of Science represents about 23% of its volume, MAG over 62%, and full text searchers on Dimensions and Lens over 75%.
Table 2. A simple comparison between different data sources on the search query used for this study as of September 5, 2020.
Data Source Articles % of CORD-19 Search Strategy
CORD-19 Overview
The first question regarding the literature of a topic that is full of uncertainty is: what are the major components of the COVID-19 literature? What aspects of the subject are predominant in the first eight months of the pandemic? Figure 1 shows an overview of the underlying network of references that are often cited together. The nodes are references cited by citing articles, which are records from the dataset retrieved from MAG, whereas links between them represent the strengths of their co-occurrences. Groupings of references, i.e. clusters, emerge as some references are co-cited more often than others. These clusters represent concentrations of themes, although their degrees of concentration may vary widely across different clusters. Each cluster is assigned an automatically generated cluster label, for example, https://app.dimensions.ai/discover/publication https://docs.microsoft.com/en-us/academic-services/ https://scholar.google.com/schhp?hl=en https://clarivate.com/webofsciencegroup/solutions/web-of-science/ he conventional Document Co-Citation Analysis (DCA), which typically treats each reference with an equal weight for a citing article because no further data beyond the reference list is available. In this study, we utilize the additional information accessible from the corresponding citation contexts such that references frequently cited in the article can be retained, whereas references cited below average can be discarded. Users may control this option. Figure 1 . An overview of 1,330 top-cited articles in the COVID-19 literature of 77,897 articles.
Concept
Trees
The MAG-based COVID-19 dataset contains 111,360 citation contexts made by 7,693 citing articles, involving 39,995 references. Although this represents only 9.88% of the 77,897 citing articles that have at least one citation in the dataset, the available citation contexts provide enough information for us to assess the feasibility of the integrative approach. The overview in Figure 1 gives us some general idea about the most active areas of the literature. To understand what each cluster involves, one option is to characterize the thematic structure of the cluster in a hierarchical representation of its component concepts (Palla, Tibely et al. 2015, Balogh, Zagyva et al. 2019). Figure 2 shows a concept tree of Cluster
Figure 2. Making sense of a cluster (
Concept trees can be used to highlight the structure of a cluster, a reference, or a word or a phrase. Figure 3, for example, is generated from a single reference that has the highest epistemic uncertainty score. The concept tree reveals that incubation period and mean incubation period are major themes emerged from citation contexts made by subsequently published articles. Interactive inspections of the concept of mean incubation period reveal specific days mentioned in various citation contexts shown in the left window. For example, the mean incubation period of 5.2 days appears frequently in these contexts. This function can facilitate traditional systematic reviews and meta analyses as one can efficiently collect values of a specific variable from a large number of publications.
Figure 3. Making sense of major themes of citations to a specific reference (Li, Q., 2020).
Uncertainties
Each citation context consists of a few sentences in which a particular reference is cited by a citing article. The uncertainties of these sentences can be measured in terms of the presence of uncertainty cue words and how often these cue words appear in a global bibliographic database such as the entire collections of over 200 million articles on Dimensions. Semantically equivalent cue words are learned from initially hand-picked cue words as seeds (Chen, Song et al. 2018). Epistemic uncertainties are designed to capture uncertainties in our knowledge, which can be manifested by the appearance of cue words such as controversial, contradictory, inconsistent, unknown, and uncertain. While the uncertainties of citation contexts can be identified independent of the underlying topic as shown in the above example, a more valuable option is the combine the level of uncertainty with specific rhetorical cues or specific concepts. For example, we may want to see how epistemic uncertainties are associated with concrete concepts such as social distancing, face masks, and international travels. As shown in Table 3, a combination of epistemic uncertainty cues and rhetorical words on conclusions reveals additional insights into the challenges we are facing.
Table 3. Epistemic uncertainties of citation contexts containing specific rhetorical words on conclusions. Uncertainty cue words are highlighted in yellow, whereas rhetorical words are highlighted in green.
Citing Article Cited Reference Epistemic Uncertainty Citation Context: Uncertainty: uncertain/conflicting/contradict/inconsistent Rhetorical: conclusion/conclude
280 0.0100 Moreover, the contradictory findings about smoking in the literature, with 058961 studies published before the COVID-19 pandemic reporting that smoking and nicotine down-regulate ACE2 [3, 4] and other studies published during the pandemic reporting that they up-regulate ACE2 [5] [6] [7] , do not allow for solid conclusions regarding the effects of nicotine or smoking on ACE2. 3007114958 Figure 4 shows a screenshot of the Node Details window in CiteSpace for the article (Li, Guan et al. 2020), which has the largest epistemic uncertainty score in the dataset. The screenshot shows a list of citation contexts of the reference in chronological order. The length of an orange bar depicts the epistemic uncertainty score (E). Major uncertainty cue words contributing to the uncertainty score are highlighted in red, including unknown, uncertainty, and controversial. The visual representation reduces the cognitive burdens for us to manually sift through numerous citation contexts across different articles. It makes it easier for us to concentrate on the key information. We can quickly scan citation contexts and focus on ones with high uncertainty scores. It is out belief that uncertainties can provide valuable insights into the heart of research (Chen and Song 2017).
Figure 4. Uncertainties of contexts in which (Li, Guan et al. 2020) has been cited.
Figure 5 shows a concept tree of a phrase, vaccine. The concept tree on the left is the entirety of the hierarchical representation of the concepts co-occurring with the words vaccine or vaccination. Similarly, one can interactively inspect the concept tree and the underlying citation contexts.
Figure 5. Concept tree of a phrase: vaccine.
The examples shown above illustrate the application of the method on datasets retrieved from MAG. The following examples demonstrate Structural Variation Analysis (SVA), which is a predictive analytic method to identify newly published articles with transformative potentials (Chen 2012). In other words, SVA aims to identify new articles that may have profound impacts on the underlying literature in the future.
Structural
Variation
Analysis (SVA)
The structural variation analysis is built on the basis of Structural Hole Theory, which was originally proposed for social networks (Burt 2004, Chen 2012). According to the Structural Hole Theory, people’s positions in their social network or a network of their co-workers may be associated with social capitals, which may be in turn translated into competitive edge. Furthermore, positions that play a significant role in connecting different parts of the network tend to have a greater potential than other positions. From an information flow point of view, individuals who are in such positions are exposed to different ideas, perspectives, and opinions, which make them more open-minded and creative. We extended this insight to networks of scholarly publications. References located at such positions, i.e. connections are limited or completely missing, have the potential to make a profound impact on the global structure of the underlying research field. The SVA procedure looks for newly added connections that may alter the global structure or have the potential to do so. In essence, each newly published article is compared against a baseline network formed by the literature up to the publication of the article in question. Co-occurring links made by the article are examined in the baseline network to etermine whether a specific link is novel between clusters or incremental within clusters or anything else in between. Transformative potential scores are given to articles that make novel links between distinct clusters. Figure 6 shows the overview of the network we will work with for SVA. The current implementation of SVA is computationally expensive; therefore, we used a smaller network than the one we showed earlier for the demonstration. The overview shows a similar set of clusters with slight shifts of focus. The largest cluster is
Figure 6.An overview of a smaller network to illustrate SVA. The size of a disc in red depicts the epistemic uncertainty (E). The largest three discs are labeled in black background.
Figure 7 shows the same set of clusters in a more separated arrangement. The separation reveals that the largest cluster
Figure 7.The distribution of citation contexts with uncertainties is uneven. The most of uncertainties are in clusters 0, 2, and 5.
Figure 8 shows an overlay of cluster
Figure 8. References associated with the strongest sentiment of uncertainty are from Cluster
Figure 9 shows a newly published article identified with a high transformative potential based on the centrality divergence metric, one of the three structural variation metrics. The article is also ranked high on another metric – modularity change. The visualization reveals the distribution of the references cited by the article in different clusters. The next question is why these references from different areas are cited by the same citing article.
Figure 9. An article identified by SVA with a high transformative potential according to centrality divergence.
The review article (Fang and Meng 2020) was published in June 2020. Its citation contexts are not found on MAG at the time of writing. We manually located the citation contexts from the original publication. Table 4 shows the citation contexts of four of the references cited by (Fang and Meng 2020). Each of the references is cited in different sections on distinct topics, which partially explains the diverse footprint found by SVA.
Table 4. Citation contexts of the references cited by (Fang and Meng 2020).
Cluster Reference Citation Context Section Heading
Guo L , Ren L, Yang S, et al. Profiling early humoral response to diagnose novel coronavirus disease (COVID-19). Clin Infect Dis. 2020. DOI:10.1093/cid/ciaa310 Infection of SARS-CoV-2 triggers the host humoral response, leading to generation of antibodies including IgA, IgM, and IgG against SARS-CoV-2 [ ]. days while for IgG it was 14 days after symptom onset [ ]. Kimball A , Hatfield KM, Arons M, et al.; CDC COVID-19 Investigation Team. Asymptomatic and presymptomatic SARS-CoV-2 infections in residents of a long-term care skilled nursing facility - King County, Washington, March 2020. MMWR Morb Mortal Wkly Rep. 2020;69(13):377–381. A PCR screening test on 78 residents in a long-term care nursing home in Washington State resulted in the detection of 10 symptomatic, 10 pre-symptomatic, and 3 asymptomatic cases [ ]. symptomatic, 10 pre-symptomatic, and 3 asymptomatic cases [ ].
Grein J , Ohmagari N, Shin D, et al. Compassionate use of remdesivir for patients with severe Covid-19. N Engl J Med. 2020. DOI:10.1056/NEJMoa2007016 Preliminary results showed that about 68% patients with severe COVID-19 treated with compassionate-use of remdesivir had clinical improvement [ ], whereas a randomized, double-blind, and placebo-controlled multicenter trial showed remdesivir was not associated with significant clinical benefits [15], suggesting additional efficacy studies will be needed to demonstrated remdesivir’s benefit. Mehta P , McAuley DF, Brown M, et al.; HLH Across Speciality Collaboration, UK. COVID-19: consider cytokine storm syndromes and immunosuppression. Lancet. 2020;395(10229):1033–1034. Possible roles of cytokine storm syndrome that leads to critical disease and death of COVID-19 patients have been discussed [ ,55]. γ inducible protein 10, monocyte chemo-attractant protein 1, macrophage inflammatory protein 1- α , and TNF α [7,48,49, ]. ]. A practical question in constructing a systematic review of a subject is that given the current landscape of its literature, where are the promising areas to watch for future growth? Where can one expect the next breakthrough? In the case of the COVID-19 pandemic, monitoring the spread of the virus and the development of new vaccines are among the top of such a list. About 10% of the records in the MAG-based COVID-19 dataset have corresponding citation contexts available. Nearly 15% of them contained the words vaccine or vaccination. SVA enables researchers to watch activities in valleys between hills on the current literature landscape, i.e. innovative connections that link thematic islands. The literature-based discovery community pioneered by Don Swanson has been devoted it to uncover this type of connections that are missing or weak in the current literature (Chen and Song 2019). Table 5 shows a list of articles identified with the strongest transformative potentials in terms of modularity change, which is essentially consistent with the citation-based ranking, suggesting that the validity of the list can be further verified in the near future when their citation contexts become available. Most of the articles on the list are published in June 2020, slightly over two months ago. The second one on the list has already been cited 19 times. It shows how fast the COVID-19 literature is growing as well as a promising visibility of this article. The title of the article is “COVID-19: From epidemiology to treatment.” Interested readers may revisit this list regularly in the next few months and monitor whether their transformative potential is strengthened and realized or diminished and forgotten.
Table 5. Some of the articles with the strongest transformative potentials in terms of M for modularity change. C-L is for cluster linkage. C-D is for centrality divergence.
Discussion
The illustrative examples demonstrate the advantages and further potentials of integrating MAG for the study of a rapidly growing body of literature over a monthly time scale, comparing to a much longer time period typically used in contemporary scientometric studies. Our experience shows that it is feasible for end users to construct their own datasets at a pace of their own choice. More importantly, this method opens a wide range of possibilities for researchers to compare different bibliographic databases. Such possibilities until recently have been limited to the few due to the resource-demanding nature (Chen and Song 2019, Visser, Eck et al. 2020). At the basic level, we have demonstrated that end users would be able to conduct scientometric studies using commonly used methods such as DCA in the same way as they would with datasets from the long established but relatively more selective sources such as the Web of Science. At the same time, an equivalent topic search in the Web of Science returned 29,858 records, whereas CORD-19 has reached 130,000 records. MAG is in the same category as the Lens and Dimensions within the range of 70,000~90,000 records. At the next level, integrating citation contexts with the visual analytic workflow significantly extends the depth of scientometric studies. The ability to cross-reference between structural indicators and linguistic cues of uncertainties has a great potential to enrich the sense-making process when facing a rapidly growing body of the literature. The integrated method enables analysts to visually inspect all the instances in which the mean incubation period are discussed with reference to a particular study. The interactive visualization considerably reduces the overhead and cognitive load. Such improvements are promising examples of integrating network-based visualization and analytic approaches with text-based visual exploration guided by different types of uncertainty metrics. At the third level, our examples have demonstrated that the integrated approach can further push the envelope of scientometric studies and the provision of citation contexts opens up valuable opportunities for researchers to monitor and track articles identified with transformative potentials. It is also conceivable how one may extend studies along the direction demonstrated in (Bornmann, Wray et al. 2020), which resembles closed search when researchers identify oncepts they would like to trace as opposed to open-ended search when researchers would welcome concepts identified by computational models and/or interactive visual exploration. MAG has its limitations. The most noticeable one is the relatively low rate of records with citation contexts at the time of writing. We found about 10% of the records in the dataset have citation contexts. We tested the distribution of the records with citation contexts by overlaying the network of records with citation contexts on the network of all records. The result is assuring – the records with citation contexts seem to be located in the core of the larger network, whereas the missing ones tend to be located at the peripherals of the network. Even with the current level of the available citation contexts, we are able to reveal answers to some of the specific questions such as the consensus on the mean incubation period in the literature. Encouraged by the recent review of MAS (Wang, Shen et al. 2019), we believe MAS will continue to improve and the overall quality of MAG as a data source will steadily increase. The current MAS allows free API access to the MAG with a minor compromise in terms of the limit of API calls per minute and the total calls per month. MAS provides an alternative for users to host their own copy of MAG so that these restrictions can be eliminated. This is another reason why we consider MAG as a promising component to integrate in the workflow of visual analytic studies of the literature.
Conclusion
In conclusion, we have introduced a visual analytic method that can overcome a few significant shortcomings in the practice of conducting scientometric studies of the literature. The method provides a flexible and extensible approach that allows researchers to tailor the breadth and the depth of a dataset to meet their own requirements. Researchers are enabled to apply existing tools to such self-constructed datasets on topics of their interest at a pace of their own choice. The easy access to citation contexts provides valuable contributions to the workflow of visual analytic studies of the literature. Integrating citation contexts with network-centric analytic paradigms may serve as a stepping stone to a fully integrated workflow for the study of scientific literature at all levels of details as well as insightful patterns and trends.
Acknowledgements
The author would like to acknowledge the support of the Science of Science and Innovation Policy (SciSIP) Program of the National Science Foundation (Award
References
Balogh, S. G., D. Zagyva, P. Pollner and G. Palla (2019). "Time evolution of the hierarchical networks between PubMed MeSH terms." PLoS ONE (8): e0220648. Bornmann, L., K. B. Wray and R. Haunschild (2020). "Citation concept analysis (CCA): a new form of citation analysis revealing the usefulness of concepts for other researchers illustrated by exemplary case studies including classic books by Thomas S. Kuhn and Karl R. Popper." Scientometrics : 1051-1074. Burt, R. (2004). "Structural holes and good ideas." American Journal of Sociology : 349-399. Chen, C. (2004). "Searching for intellectual turning points: Progressive knowledge domain visualization." Proceedings of the National Academy of Sciences (suppl 1): 5303-5310. Chen, C. (2006). "CiteSpace II: Detecting and visualizing emerging trends and transient patterns in scientific literature." Journal of the American Society for Information Science and Technology (3): 359-377. hen, C. (2012). "Predictive effects of structural variation on citation counts." Journal of the American Society for Information Science and Technology (3): 431-449. Chen, C. (2017). "Science Mapping: A Systematic Review of the Literature." Journal of Data and Information Science (2): 1-40. Chen, C., F. Ibekwe ‐ SanJuan and J. Hou (2010). "The structure and dynamics of cocitation clusters: A multiple ‐ perspective cocitation analysis." Journal of the American Society for information Science and Technology (7): 1386-1409. Chen, C. and M. Song (2017). Representing scientific knowledge: The role of uncertainty, Springer. Chen, C. and M. Song (2019). "Visualizing a Field of Research: A Methodology of Systematic Scientometric Reviews." PLoS One (10): e0223994. Chen, C., M. Song and G. E. Heo (2018). "A scalable and adaptive method for finding semantically equivalent cue words of uncertainty." Journal of Informetrics (1): 158-180. Ding, Y., X. Liu, C. Guo and B. Cronin (2013). "The distribution of references across texts: Some implications for citation analysis." Journal of Informetrics (3): 583-592. Fang, B. and Q. H. Meng (2020). "The laboratory’s role in combating COVID-19." Critical Reviews in Clinical Laboratory Sciences (6): 400-414. Garfield, E. (1955). "Citation indexes for science: a new dimension in documentation through association of ideas." Science (108-111). Greenberg, S. A. (2009). "How citation distortions create unfounded authority: Analysis of a citation network." British Medical Journal : b2680. Hu, Z., C. Chen and Z. Liu (2013). "Where are citations located in the body of scientific articles? A study of the distributions of citation locations." Journal of Informetrics (4): 887–896. Hug, S. E. and M. P. Brandle (2017). "The coverage of Microsoft Academic: Analyzing the publication output of a university." Scientometrics (3): 1551-1571. Hug, S. E., M. Ochsner and M. P. Brandle (2017). "Citation analysis with microsoft academic." Scientometrics (1): 371-378. Li, Q., X. Guan, P. Wu, X. Wang, L. Zhou, Y. Tong, R. Ren, K. S. M. Leung, E. H. Y. Lau, J. Y. Wong, X. Xing, N. Xiang, Y. Wu, C. Li, Q. Chen, D. Li, T. Liu, J. Zhao, M. Liu, W. Tu, C. Chen, L. Jin, R. Yang, Q. Wang, S. Zhou, R. Wang, H. Liu, Y. Luo, Y. Liu, G. Shao, H. Li, Z. Tao, Y. Yang, Z. Deng, B. Liu, Z. Ma, Y. Zhang, G. Shi, T. T. Y. Lam, J. T. Wu, G. F. Gao, B. J. Cowling, B. Yang, G. M. Leung and Z. Feng (2020). "Early Transmission Dynamics in Wuhan, China, of Novel Coronavirus–Infected Pneumonia " The New England Journal of Medicine : 1199. Nakov, P. I., A. S. Schwartz and M. A. Hearst (2004). Citances: Citation Sentences for Semantic Analysis of Bioscience Text. Proceedings of the SIGIR 4 : (Article 15016). Sinha, A., Z. Shen, Y. Song, H. Ma, D. Eide, B.-J. P. Hsu and K. Wang (2015). An overview of microsoft academic service (MAS) and applications. The 24th International Conference on World Wide Web (WWW’15 Companion) Florence, Italy. Small, H. (1973). "Co-citation in the scientific literature: A new measure of the relationship between two documents." Journal of the American Society for Information Science : 265-269. Small, H. (1986). "The synthesis of specialty narratives from co-citation clusters." Journal of the American Society for Information Science (3): 97-110. mall, H. (2018). "Characterizing highly cited method and non-method papers using citation contexts: The role of uncertainty." Journal of Informetrics (2): 461-480. Small, H., K. W. Boyack and R. Klavans (2019). "Citations and certainty: a new interpretation of citation counts." Scientometrics : 1079-1092. Small, H. G. (1978). "Cited documents as concept symbols." Social Studies of Science (3): 327-340. Thelwall, M. (2017). "Microsoft Academic: a multidisciplinary comparison of citation counts with Scopus and Mendeley for 29 journals." Journal of Informetrics : 1201–1212. Thelwall, M. (2018). "Can Microsoft Academic be used for citation analysis of preprint archives? The case of the Social Science Research Network." Scientometrics : 913-928. Visser, M., N. J. v. Eck and L. Waltman. (2020). "Large-scale comparison of bibliographic data sources: Scopus, Web of Science, Dimensions, Crossref, and Microsoft Academic." from https://arxiv.org/ftp/arxiv/papers/2005/2005.10732.pdf. Wang, K., Z. Shen, C. Huang, C.-H. Wu, D. Eide, Y. Dong, J. Qian, A. Kanakia, A. Chen and R. Rogahn (2019). "A review of Microsoft Academic Services for science of science studies." Frontiers in Big Data (Article 45). White, H. D. and K. W. McCain (1998). "Visualizing a discipline: An author co-citation analysis of information science, 1972-1995." Journal of the American Society for Information Science49