Visual Exploration and Knowledge Discovery from Biomedical Dark Data
Shashwat Aggarwal (a), Ramesh Singh (b)
(a) University of Delhi
(b) National Informatics Center
Abstract
Data visualization techniques proffer efficient means to organize and present data in graphically appealing formats, which not only speeds up decision making and pattern recognition but also enables decision-makers to fully understand data insights and make informed decisions. Over time, with the rise in technological and computational resources, there has been an exponential increase in the world's scientific knowledge. However, most of it lacks structure and cannot be easily categorized and imported into regular databases. This type of data is often termed Dark Data. Data visualization techniques provide a promising solution to explore such data by allowing quick comprehension of information, the discovery of emerging trends, and the identification of relationships and patterns. In this empirical research study, we use the rich corpus of PubMed, comprising more than 30 million citations from biomedical literature, to visually explore and understand the underlying key insights using various information visualization techniques. We employ a natural language processing based pipeline to discover knowledge out of the biomedical dark data. The pipeline comprises different lexical analysis techniques: Topic Modeling to extract inherent topics and major focus areas, and Network Graphs to study the relationships between various entities such as scientific documents and journals, researchers, and keywords and terms. With this analytical research, we aim to proffer a potential solution to overcome the problem of analyzing overwhelming amounts of information and to diminish the limitation of human cognition and perception in handling and examining such large volumes of data.
Keywords:
Empirical Analysis, Dark Data, PubMed, Data Visualization, Text Mining, Knowledge Discovery, VOSviewer
1. Introduction
In today's data-centric world, the practice of data visualization has become an indispensable tool in numerous domains such as Research, Marketing, Journalism, Biology, etc. Data visualization is the art of efficiently organizing and presenting data in a graphically appealing format. It speeds up the process of decision making and pattern recognition, thereby enabling decision makers to make informed decisions. With the rise in technology, data has been growing exponentially, and the world's scientific knowledge is accessible with ease. There is an enormous amount of data available in the form of scientific articles, government reports, natural language, and images that in total contributes to around 80% of the overall data generated globally. However, most of this data lacks structure and cannot be easily categorized and imported into regular databases. This type of data is often termed Dark Data. Data visualization techniques proffer a potential solution to the problem of handling and analyzing overwhelming amounts of such information. They enable the decision maker to look at data differently and more imaginatively, and they promote creative data exploration by allowing quick comprehension of information, the discovery of emerging trends, and the identification of relationships and patterns.

In this empirical research, we visually explore and mine knowledge out of biomedical research documents present in the rich corpus of PubMed. Firstly, we use text summarization and visualization techniques like computation of raw term and document frequencies, streamline analysis for the most frequent terms inside the corpus, word clouds, etc., to get a general overview of the PubMed database. We then utilize the MALLET library [1] to perform topic modeling to extract knowledge from the biomedical database by identifying and clustering the major topics inherent in the database. Finally, we use the VOSviewer network viewer [2] to construct bibliometric networks for studying relationships between different entities like scientific documents and journals, researchers, and keywords and terms.
2. Visual Exploration of PubMed
We use the rich corpus of PubMed, which comprises more than 30 million citations and abstracts for biomedical literature from MEDLINE, life science journals, and online books [3]. It is a freely accessible database developed and maintained by NCBI. In addition to free access to MEDLINE, PubMed also provides links to free full-text articles provided by PubMed Central and third-party websites, and other facilities such as clinical queries search filters, special query pages, etc.

PubMed is a key information resource in biological sciences and medicine, primarily because of its wide diversity and manual curation. It comprises on the order of three billion bases of the human genome, rich meta-information (e.g., MeSH terms), detailed affiliation, etc., summing up to a database of 70 GB in total [4]. As of 1 August 2020, PubMed has more than 30 million records from 5500 journals, dating back to 1966, with the earliest publication available from the year 1809. PubMed supports efficient methods to search the database by author names, journal names, keywords and phrases, or any combination of these. It also enables users to download the fetched citations and abstracts for queried terms in various formats such as plain text (both Summary and Abstract), XML, PMID, CSV, and MEDLINE.

Table 1: Fundamental Statistics of PubMed
Size: 70 GB
110 Million

Words such as patient, cells, data, cancer, and gene are more prominent in the word clouds, signifying that a substantial proportion of the research concerns cancer and genomics compared to other domains. Furthermore, words such as DNA, tumour, acid, and receptors highlight the other significant areas where research has been done or is going on. One interesting fact that can be observed from these clouds is the prominence of the words High Population and Children, providing a high-level indication of the major disease cause and the majority group affected by those diseases. Alongside the word cloud, in Figure 1 we also plot the streamline graph for the seven most frequent terms across the database, depicting the variation of their relative frequency distribution across the set of documents.

Another widely used text representation technique which, apart from considering the term frequencies, also encodes their importance inside a document is TF × IDF. The TF gives the frequency of the term within a particular document, and the IDF gives the inverse document frequency of a term, i.e., a measure of the importance of the term across several documents.
TF × IDF, term frequency-inverse document frequency, is a statistical measure used to evaluate how important a word is to a document in a collection. The importance is directly proportional to the term frequency of the word in the document but is offset by the frequency of the word in the corpus.

Figure 1: (a) Word Cloud of the PubMed Corpora. (b) Streamline Graph for the top 7 most frequent words in PubMed.

In Figure 2, we visualize the TF × IDF plot computed over 50,000 full-text articles retrieved from PubMed Central. The x-axis represents the normalized decimal term representations while the y-axis gives their corresponding TF × IDF scores. All the TF × IDF scores are calculated by computing the term-document matrix using the Sklearn TF × IDF vectorizer [6]. The top 2 most significant components of the corresponding vector representation of the terms obtained from the TF × IDF vectorizer are computed using the PCA (principal component analysis) dimensionality reduction technique [7] and are visualized in Figure 2.

Figure 2: TF × IDF Score Distribution Plot over 50,000 full-text articles retrieved from PubMed Central.

We extract a number of relevant words (a.k.a. keywords) in accordance with the TF × IDF scores computed earlier, with the number determined by a certain threshold score. We visualize the distribution of the number of keywords extracted via TF × IDF score for different document lengths. For space constraints and sparsity reasons, we binned the document lengths by quartile (i.e., the bins are not of equal range but each contains the same number of documents, 25%). Figure 3 displays the box-and-whisker plot computed over 50,000 full-text articles from PubMed Central showing the distribution of keywords across different document lengths (binned by quartile). Figure 3 also shows the swarm plot, which gives a better representation of the distribution of keywords, visualizing all observations along with the underlying distribution.

As can be seen from the plots, the median number of keywords increases with the document length until the third quartile, after which the median drops. This is mainly because of TF × IDF scoring and is intuitive, since a given word is more likely to be found in a relatively longer document than in a shorter one but is not necessarily a keyword. We can also observe that the variation in the number of keywords is smaller in the first and last bins than in the second and third bins, indicating that documents which are either too short or too long contain an approximately constant number of relevant words, while moderate-length documents show high variability in their number of keywords.

Figure 3: (a) Box-and-Whisker plot and (b) Swarm Plot, computed over 50,000 full-text articles from PubMed Central, showing the distribution of keywords across different document lengths (binned by quartile).
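The scoring, thresholding, and quartile binning steps described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the pipeline used in the study: it uses the textbook weighting (relative term frequency times the logarithm of the inverse document frequency), whereas the Sklearn vectorizer by default applies a smoothed IDF and L2 normalization, so its numbers differ. The function names `tfidf`, `keywords`, and `quartile_bins`, the threshold of 0.3, and the toy documents are all hypothetical.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: score} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            # Relative term frequency times plain log inverse document frequency.
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in tf.items()
        })
    return scores


def keywords(doc_scores, threshold):
    """Terms whose score reaches the (hypothetical) threshold, sorted by name."""
    return sorted(t for t, s in doc_scores.items() if s >= threshold)


def quartile_bins(lengths):
    """Assign each document to a length quartile (0..3) by rank."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    bins = [0] * len(lengths)
    for rank, i in enumerate(order):
        bins[i] = min(3, rank * 4 // len(lengths))
    return bins


# Toy corpus standing in for the full-text articles (assumption).
docs = [
    "cancer cells cancer gene".split(),
    "patient data cancer".split(),
    "gene expression data data".split(),
    "patient care".split(),
]
scores = tfidf(docs)
kw = [keywords(s, threshold=0.3) for s in scores]
bins = quartile_bins([len(d) for d in docs])

for doc_id, (k, b) in enumerate(zip(kw, bins)):
    print(f"doc {doc_id}: quartile {b}, {len(k)} keywords {k}")
```

Counting the keywords per quartile bin yields the kind of data that a box-and-whisker or swarm plot summarizes. Binning by rank rather than by fixed length ranges keeps the same number of documents in every bin, which matches the quartile scheme in the text.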
Although the raw term frequencies, word cloud visualizations, and TF × IDF scores work quite well in practice (e.g., in summarizing a general overview of the database), these techniques fail to capture the ordered relationship between terms and sentences. To understand the contextual relationship between the different terms, we use DocuBurst. DocuBurst [8] is an online document visualization tool that takes advantage of the human-created structure in lexical databases. It is used for creating interactive visual summaries of documents, exploring keywords to uncover document themes or topics, investigating intra-document word patterns such as character relationships, comparing documents, etc. DocuBurst visualizes nouns and phrases in a hierarchically structured manner centered around a root word, which is selected either as the most prominent word in the database or as queried by the user. DocuBurst uses a pre-existing ontology, WordNet [9], to group words having related meanings together. It creates a radial, space-filling layout of hyponymy (IS-A relation) with interactive techniques of zoom, filter, and details-on-demand for the task of document visualization.
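To make the idea of a radial, space-filling hyponymy layout concrete, the sketch below assigns each word an arc whose width is proportional to the cumulative count of the word and its hyponyms. It is only an illustration: the tiny hand-written `HYPONYMS` dictionary stands in for WordNet, the counts are invented, and the sizing rule is one plausible choice that may differ from what DocuBurst actually does.

```python
# Toy IS-A hierarchy standing in for WordNet (hypothetical, hand-written).
HYPONYMS = {
    "part": ["organ", "tissue", "structure"],
    "organ": ["heart", "liver"],
    "tissue": ["muscle"],
    "structure": [],
    "heart": [],
    "liver": [],
    "muscle": [],
}


def subtree_count(word, counts):
    """Occurrences of a word plus those of all its hyponyms."""
    return counts.get(word, 0) + sum(
        subtree_count(child, counts) for child in HYPONYMS[word]
    )


def sunburst(root, counts, start=0.0, extent=360.0, depth=0):
    """Yield (word, depth, start_angle, angular_extent) for a radial layout.

    Each child receives a share of its parent's arc proportional to its
    cumulative subtree count.
    """
    yield (root, depth, start, extent)
    children = HYPONYMS[root]
    total = sum(subtree_count(c, counts) for c in children)
    angle = start
    for child in children:
        share = subtree_count(child, counts) / total if total else 0.0
        child_extent = extent * share
        yield from sunburst(child, counts, angle, child_extent, depth + 1)
        angle += child_extent


# Invented word counts for illustration only.
counts = {"part": 2, "organ": 3, "heart": 5, "liver": 2, "tissue": 4, "muscle": 4}
layout = list(sunburst("part", counts))

for word, depth, start, extent in layout:
    print(f"{'  ' * depth}{word}: start={start:.1f} extent={extent:.1f}")
```

The depth of a word gives its ring in the radial plot and the start angle and extent give its wedge, so frequently occurring branches of the hierarchy occupy visibly more of the circle.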
Figure 4: DocuBurst plot on the PubMed database with part as the root word.
We generate DocuBurst graphs over the 50,000 full-text articles that we retrieved earlier from PubMed Central. We limit our database to this subset of PubMed to meet the software and memory requirements of the tool utilized. Alongside the DocuBurst graphs, both word scores and word clouds for the selected word (shown in pink) and its co-occurring words are also displayed to summarize the content better. Figure 4 shows the DocuBurst graph with part chosen as the root word. DocuBurst hierarchically structures the radially surrounding hyponyms such as organ, structure, tissue, and system around the root word. Each surrounding hyponym is further sub-structured with its related hyponyms, thereby forming chains of correlated keyword terms which reveal the coherent document themes present in the database. Along with the DocuBurst graph, word clouds depicting words having strong correlations with terms on the DocuBurst graph are also shown. From these word clouds, we can infer that words like Cancer and Hypertension are highly correlated with terms related to body parts. The word Cancer can be seen to be highly connected with the term tissues and also with other terms like University, Research, and specific country names such as China and Germany, indicating significant work related to cancer research being carried out by universities in these countries. Similarly, it can also be observed that terms like Hypertension are highly correlated with the terms highlighted in pink, which are mostly related to the mind, thus indicating some of the body organs affected by hypertension.
Figure 5: Topic Clouds for n=5 (left) and n=7 (right) topics.
3. Topic Modelling on PubMed
In the previous section, we utilized numerous visualization techniques to visually explore and get a general overview of the PubMed corpora. In this section, we use the MALLET library to perform topic modeling to extract knowledge from the biomedical database by identifying and clustering the major topics inherent in the database.

MALLET is a Java-based package that includes sophisticated tools and a wide variety of algorithms for performing statistical natural language processing tasks like document classification, information extraction, topic modeling, etc., for analyzing large collections of dark data and extracting information out of it. It was written by Andrew McCallum and his group at the University of Massachusetts Amherst, with contributions from Fernando Pereira, Ryan McDonald, and others at the University of Pennsylvania. The topic modeling toolkit in MALLET contains several efficient, sampling-based implementations of Latent Dirichlet Allocation (LDA), Pachinko Allocation, Hierarchical LDA, etc.

Table 2: Top 10 topics along with their proportion in PubMed; computed using MALLET.

Topic Id | Proportion | Major Words in topic
1 | [missing] | cancer clinical review studies therapeutic development disease treatment potential current molecular research provide including recent strategies data role approach diseases
2 | 0.11401 | patients survival cancer analysis prognostic stage study factors group lymph significantly tumor clinical multivariate risk significant gastric metastasis ratio compared
3 | 0.10156 | cells cell cancer expression apoptosis proliferation growth migration protein pathway effect lines study effects human signaling inhibition invasion tumor induced
4 | 0.09279 | cancer risk women breast study years age screening mortality association incidence factors data population higher increased men diagnosis health rates
5 | 0.08076 | cancer care patients health quality study life treatment patient oncology medical information studies data intervention research systematic clinical guidelines palliative
6 | 0.07577 | data imaging model method analysis sensitivity based accuracy volume study values methods detection performance results diagnostic mri compared images test
7 | 0.07482 | case tumors pancreatic thyroid tumor cases diagnosis carcinoma patient lesions malignant adenocarcinoma report rare imaging biopsy benign metastatic petct primary
8 | 0.07106 | expression mir cancer gene tissues cell genes analysis tumor lung crc protein study normal significantly tissue mirnas correlated prognosis rna
9 | 0.06988 | patients treatment therapy months radiation radiotherapy dose chemotherapy median treated brain metastases local cancer lung tumor received control survival followup
10 | 0.06737 | protein signaling dna cell proteins binding role pathway activity cells transcription human cellular regulation function receptor activation gene complex show

We use Latent Dirichlet Allocation from the topic modeling toolkit to map terms and words present in the database into a low-dimensional continuous space by exploiting the word collocation patterns.
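For readers unfamiliar with sampling-based LDA, the following is a minimal collapsed Gibbs sampler written in plain Python. It is a sketch of the general technique only, not MALLET's implementation: the function name `lda_gibbs`, the hyperparameters `alpha` and `beta`, the toy documents, and the definition of a topic's proportion as its share of all token assignments are assumptions made for illustration, and MALLET may report its per-topic values differently.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=100, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents.

    Returns (topic proportions, top three words per topic).
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    n_vocab = len(vocab)
    doc_topic = [[0] * n_topics for _ in docs]                       # n(d, k)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(n_topics)]  # n(k, w)
    topic_total = [0] * n_topics                                     # n(k)
    assign = []

    # Random initial topic assignment for every token.
    for d, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(n_topics)
            zs.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(zs)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the token's current assignment from the counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Full conditional: p(z = t) is proportional to
                # (n(d,t) + alpha) * (n(t,w) + beta) / (n(t) + V * beta).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + n_vocab * beta)
                    for t in range(n_topics)
                ]
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    n_tokens = sum(topic_total)
    proportions = [n / n_tokens for n in topic_total]
    top_words = [
        sorted(vocab, key=lambda w: -topic_word[k][w])[:3]
        for k in range(n_topics)
    ]
    return proportions, top_words


# Toy documents (assumption), far smaller than any real corpus.
toy_docs = [
    "cancer tumor therapy patients survival".split(),
    "tumor cancer patients therapy".split(),
    "gene protein expression pathway".split(),
    "protein gene pathway signaling expression".split(),
    "patients survival therapy cancer".split(),
    "signaling protein gene expression".split(),
]
proportions, top_words = lda_gibbs(toy_docs, n_topics=2)
for k, (p, words) in enumerate(zip(proportions, top_words)):
    print(f"topic {k}: proportion={p:.3f} words={words}")
```

On a corpus this small the recovered topics are not guaranteed to be meaningful; the point is the sampling loop, which repeatedly reassigns each token to a topic according to how common that topic is in the document and how common the word is in the topic.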
We then use K-means, a widely used clustering algorithm, to group words into semantically related clusters according to their cosine similarity measure. Typically, only a small number of topics are present in each document, and only a small number of words have a high probability in each topic. So, we represent these topics along with the top-n most frequent terms in each topic using a Topic Cloud. A Topic Cloud is a pie chart visualization of the inherent topics inside a database. It consists of a number of topic slices, where each slice contains the most important words in that topic. The relative prominence of words in a topic is made explicit by scaling their sizes in proportion to their confidence score as computed by LDA. A topic cloud resembles a word cloud in conveying the frequency of words or phrases, but it additionally groups words semantically under similar topics, hence capturing the contextual relationship between terms and providing more significant insights into the text data with refined granularity.

We analyse the top n most significant topics inherent in PubMed, where n specifies the number of topics considered. Figure 5 shows the topic clouds for n=5 and n=7 topics respectively. Both topic clouds summarize the contents of PubMed documents, with each slice depicting the most important words in that topic. Words like patients, cancer, disease, treatment, etc. are grouped under one topic, whereas words such as data, study, and research depict another topic. On comparing the two topic clouds, we find that the topics produced for n=7 are more homogeneous, with uniform proportions, while the topics generated for n=5 are more disproportionate, indicating that n=7 is a more coherent and less noisy choice than n=5. In addition to the topic clouds representing the top five and seven most significant topics, we also compute the top 10 topics inherent in PubMed, as reported in Table 2 along with the proportion for each topic representing the confidence score as computed by LDA. The topic with the largest proportion focuses upon recent research and development strategies, such as molecular and therapeutic research primarily concerned with Cancer, indicating substantial research work related to Cancer.
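The geometry of a Topic Cloud as defined in this section (slices sized by topic proportion, words sized by their score within the topic) can be sketched as below. This is an illustrative layout computation only, not the plotting code behind Figure 5; the function name `topic_cloud_layout`, the `max_font` parameter, and all numbers in the example are hypothetical.

```python
def topic_cloud_layout(topics, max_font=40.0):
    """Compute pie-slice angles and word font sizes for a topic cloud.

    topics maps a topic label to (proportion, {word: score}).
    """
    total = sum(proportion for proportion, _ in topics.values())
    slices = []
    start = 0.0
    for label, (proportion, words) in topics.items():
        # Slice angle is proportional to the topic's share of all topics shown.
        extent = 360.0 * proportion / total
        top = max(words.values())
        # Font size is proportional to the word's score within its topic.
        fonts = {w: max_font * (s / top) for w, s in words.items()}
        slices.append(
            {"topic": label, "start": start, "extent": extent, "fonts": fonts}
        )
        start += extent
    return slices


# Invented proportions and word scores for illustration only.
topics = {
    "clinical": (0.5, {"patients": 0.08, "cancer": 0.04, "treatment": 0.02}),
    "methods": (0.25, {"data": 0.06, "study": 0.03}),
    "molecular": (0.25, {"protein": 0.05, "gene": 0.05}),
}
slices = topic_cloud_layout(topics)
for s in slices:
    print(s["topic"], round(s["start"], 1), round(s["extent"], 1), s["fonts"])
```

Because the slice angles are normalized by the sum of the displayed proportions, the same function works whether the n topics shown account for all of the corpus or only part of it.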
4. Bibliometric Citation Analysis using Network Graphs
Network visualizations, also called network graphs, are often used to visualize complex and convoluted relations between an enormous number of entities. They represent information in a hierarchically structured manner through an interconnected network of entities, highlighting the correlation between them. At its most basic level, a network graph consists of nodes and edges. Nodes represent the entities, and edges represent the relationships between those entities. Edges in the graph can be directed or undirected. Directed edges indicate the flow of information from one node to another. Undirected edges, on the other hand, show the presence of a bidirectional relationship between the two nodes.

Figure 6: Co-Authorship Networks for (a) Alzheimer's Disease and (b) Tuberculosis, generated using VOSviewer, a network and citation viewer.

One of its variants, the citation network, has been used extensively in the field of bibliometrics [10]. Citation networks proffer quick summarization and visualization of the structure inherent to a set of publications. The resulting visualizations provide insights not only into the present state of scientific research but also into potential future research directions and collaboration opportunities. Bibliometric or citation networks can be classified into direct citation networks, co-citation networks, and bibliographic coupling relations. Direct citation networks, also known as cross-citation networks, represent research documents citing each other directly as nodes in the network. These networks only offer a direct indication of relatedness between the entities. They are usually very sparse and hence relatively uncommon in research settings. The second variant of citation networks, i.e., co-citation networks, represents co-cited research documents (i.e., a pair of documents that are cited by some other group of common documents) as network entities.
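The co-citation relation, and the bibliographic coupling relation that this section also covers, can both be computed from the same citation data by counting in opposite directions. The sketch below is a minimal illustration on invented data; the dictionary `cites` and the function names are hypothetical and do not correspond to any VOSviewer interface.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each citing document mapped to the set it cites.
cites = {
    "A": {"X", "Y"},
    "B": {"X", "Y", "Z"},
    "C": {"Y", "Z"},
}


def co_citation(cites):
    """For each pair of cited documents, count the documents citing both."""
    strength = Counter()
    for refs in cites.values():
        for pair in combinations(sorted(refs), 2):
            strength[pair] += 1
    return strength


def bibliographic_coupling(cites):
    """For each pair of citing documents, count their shared references."""
    strength = Counter()
    for a, b in combinations(sorted(cites), 2):
        shared = len(cites[a] & cites[b])
        if shared:
            strength[(a, b)] = shared
    return strength


print("co-citation:", dict(co_citation(cites)))
print("coupling:", dict(bibliographic_coupling(cites)))
```

Here X and Y are co-cited twice because both A and B cite them, while A and C are coupled with strength one because Y is their only shared reference.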
The higher the number of research documents citing the two documents, the stronger the relatedness between them. Co-citation networks have been used to study researchers in the field of information science [11]. In the final variant, bibliographic coupling, two documents are said to be coupled if both cite a common research document [12]. In other words, the more references two documents have in common, the stronger the coupling relation between them.

We use VOSviewer, a freely available software tool for constructing and visualizing bibliometric networks [2], to construct networks of scientific publications and journals. Each visualization map consists of a network of objects of interest, also known as entities. Entities can be research documents, authors, or keywords, which are interconnected with other entities through edges representing citation (co-citation and bibliographic coupling), co-authorship, or co-occurrence links. Each link has a strength associated with it, represented by a positive numerical value. The higher this value, the stronger the link between the connected entities. Furthermore, each entity is assigned to exactly one cluster, so the clusters are non-overlapping and exhaustive. Entities have various attributes associated with them, for instance, the weight of an entity or the distance between two entities. The weight of an entity indicates the importance of that entity in the network. An entity with a higher weight is regarded as more important than an entity with a lower weight and is hence shown more prominently. The distance between any two entities indicates the strength of the relationship between them. The closer the entities are to each other, the more strongly they are correlated.

We create two types of visualizations, Co-Authorship and Co-occurrence Word Networks. The Co-Authorship networks link the authors of various biomedical research publications in PubMed based upon the number of publications they have co-authored. These networks help in obtaining significant insights into possible communities and groups of researchers who contribute to their field. They may also help researchers in the same or closely related fields to focus on the works of major contributors and move their own research forward. The Co-occurrence word networks, on the other hand, link keywords and term phrases which co-occur together. These networks reveal the semantic correlation among different terms, along with the major terms highlighted within each cluster. In the case of PubMed, various useful insights can be inferred from these networks with ease, for example the names of major medicines used to cure a disease, the side effects of a treatment, or the age groups or gender affected by a disease.

To create these networks, we query the PubMed database on two different topics,
Alzheimer's Disease and Tuberculosis, respectively. We obtain the resulting abstracts [...]. For the Co-Authorship networks, we select full counting, i.e., each link contributes equally, and Authors as the unit of analysis. We set the minimum number of documents an author must have published to ten, which limits our network to authors who have contributed at least ten documents related to the topic for which the network map is created. Finally, from the shortlisted authors we select the top 500 to be visualized in our network based upon their total link strength, which indicates the total strength of the co-authorship links of a given researcher with other researchers. Similarly, for the Co-occurrence Word Networks, we first select the option to ignore copyright statements to get rid of unwarranted text, extract text from both the title and abstract fields of the MEDLINE document, and select the option of full counting to count all the occurrences of a term in a document. We separate the more significant terms from the less significant ones by setting the minimum number of occurrences of a term to 8. For each of the filtered terms, VOSviewer calculates a relevance score, which represents the specificity of a term towards the topics covered by the text data. We select the top 80% most relevant terms according to the relevance score to be displayed in our network. For both visualizations discussed above, we set the minimum cluster size to 2 and use Association Strength as the normalization method for the layout algorithm.

Figure 7: Co-Occurrence Word Networks for (a) Alzheimer's Disease and (b) Tuberculosis, generated using VOSviewer, a network and citation viewer.

Figure 6(a) shows the Co-Authorship network for PubMed documents related to the topic of Alzheimer's Disease. From the figure, we can observe that some authors like
Bennett DA, Blennow K, Perry G, Zhang Y, and Wang Y have bigger node sizes compared to other authors, thereby indicating a higher proportion of work contributed by these authors in the queried field. From the same figure, we can also observe the potential clusters of authors depending upon the papers they have co-authored and the areas of their study. Authors in the purple and yellow clusters appear to work only with authors within their own clusters, while the authors of the red, blue, and cyan clusters are uniformly interspersed between different clusters. Also, the relatedness of two authors can be inferred from the distance between them in the visualization and the thickness of the link connecting them. In general, the closer two authors are located to each other, or the thicker the link connecting them, the stronger their relatedness.

Similarly, Figure 6(b) shows the Co-Authorship network for PubMed documents related to the topic of Tuberculosis. From the figure, prominent authors can be identified by their node sizes, like Harries AD, Van Soolingen, Wang J, Narayanan PR, etc. Various clusters can also be observed, shown in distinct colors. The authors in the red cluster are widely connected with authors present in other clusters, thereby indicating that these authors are a major source of the work related to tuberculosis.

In addition to the co-authorship networks, we also show the Co-Occurrence word networks in Figure 7. Various useful insights can be gathered from these networks. Firstly, from the co-occurrence word network shown in Figure 7(a), we can infer that Alzheimer's disease is related to the brain due to the presence of co-occurring terms such as brain, memory, cognition, etc. We can also infer that males are more vulnerable to Alzheimer's disease as compared to females. The potential age group suffering from Alzheimer's disease can also be identified as the middle-age to old-age group. Various side effects of Alzheimer's disease can also be found, such as memory disorders, dementia, depression, etc.

Similarly, from the co-occurrence word network for Tuberculosis shown in Figure 7(b), many important terms can be identified, such as certain types of tuberculosis like abdominal tuberculosis, pulmonary tuberculosis, neck tuberculosis, etc. The network also lists the names of certain vaccines and resistance techniques related to tuberculosis. Finally, on studying the network in depth, the names of certain places like India, North Carolina, England, etc. can be found in association with terms like healthcare workers, survey, treatment success rate, etc., indicating that these places are playing a major role in spreading information and public awareness related to the disease and are providing proper treatment to people affected by tuberculosis.
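The co-authorship settings described in this section (full counting, a minimum number of documents per author, and selection by total link strength) can be sketched as follows. VOSviewer is an interactive tool, so this is not its code: the function `co_authorship`, its parameters `min_docs` and `top_n`, and the author labels are hypothetical, and the thresholds in the example are scaled down from the values of ten documents and 500 authors used in the text.

```python
from collections import Counter
from itertools import combinations


def co_authorship(papers, min_docs=1, top_n=None):
    """Build co-authorship links with full counting.

    papers is a list of author lists, one per publication.
    Returns (link weights per author pair, total link strength per author).
    """
    # Number of documents per author, used for the minimum-documents filter.
    n_docs = Counter(a for authors in papers for a in set(authors))
    kept = {a for a, n in n_docs.items() if n >= min_docs}

    links = Counter()
    for authors in papers:
        for pair in combinations(sorted(set(authors) & kept), 2):
            # Full counting: every shared paper adds 1 to the pair's link,
            # regardless of how many authors the paper has.
            links[pair] += 1

    # Total link strength: sum of the weights of all links of an author.
    total_strength = Counter()
    for (a, b), weight in links.items():
        total_strength[a] += weight
        total_strength[b] += weight

    if top_n is not None:
        keep = {a for a, _ in total_strength.most_common(top_n)}
        links = Counter(
            {p: w for p, w in links.items() if p[0] in keep and p[1] in keep}
        )
    return links, total_strength


# Invented publications with placeholder author labels.
papers = [["A", "B", "C"], ["A", "B"], ["A", "D"], ["B", "C"], ["E"]]
links, total_strength = co_authorship(papers, min_docs=2)
print(dict(links))
print(dict(total_strength))
```

In this toy example authors D and E fall below the two-document threshold and drop out, and author B ends up with the highest total link strength, which is the quantity the text uses to pick the authors shown in the map.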
5. Conclusion
The main motivation behind the present work was to diminish the limitation of human cognition and perception in handling and examining enormous amounts of information. The present analytical work endeavored to exploit various data visualization tools and natural language processing techniques to proffer a potential solution to the problem of analyzing overwhelming amounts of information. In this empirical research, we utilized the rich corpus of PubMed to visually explore and analyze lexical and textual biomedical dark data and to mine knowledge out of it. We employed various text summarization and visualization techniques like computation of raw term and document frequencies, word clouds, TF × IDF scores, DocuBurst, etc., to get a general overview of the PubMed database. We then utilized the MALLET library to perform topic modeling to extract knowledge from the biomedical database and visualized the major topics inherent in PubMed using Topic Clouds. Finally, we used network and citation visualization techniques to construct bibliometric networks, i.e., Co-Authorship and Co-Occurrence word networks, for studying relationships between different entities like scientific documents and journals, researchers, and keywords and terms. All of the techniques that were employed to visually explore and mine knowledge from dark data, i.e., PubMed, proved to be of great help in allowing quick comprehension of information, the discovery of emerging trends, and the identification of relationships and patterns within the database.
References
[1] A. K. McCallum, MALLET: A Machine Learning for Language Toolkit, 2002. http://mallet.cs.umass.edu.
[2] N. Van Eck, L. Waltman, Software survey: VOSviewer, a computer program for bibliometric mapping, Scientometrics 84(2) (2010) 523–538.
[3] NCBI, PubMed. [Accessed on 2020-08-01].
[4] R. Roberts, PubMed Central: The GenBank of the published literature, Proceedings of the National Academy of Sciences (2001) 381–382.
[5] J. Feinberg, Wordle, 2014. [Accessed on 2018-12-04].
[6] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine learning in Python, Journal of Machine Learning Research 12 (2011) 2825–2830.
[7] I. Jolliffe, Principal Component Analysis, Springer Verlag, New York, 2002.
[8] C. Collins, S. Carpendale, G. Penn, DocuBurst: Visualizing Document Content Using Language Structure, Computer Graphics Forum (Proc. of the Eurographics/IEEE-VGTC Symposium on Visualization (EuroVis)) 28(3) (2009) 1039–1046.
[9] C. Fellbaum, WordNet: An Electronic Lexical Database, Cambridge, MA: MIT Press (1998).
[10] C. Belter, Visualizing Networks of Scientific Research,