GlassViz: Visualizing Automatically-Extracted Entry Points for Exploring Scientific Corpora in Problem-Driven Visualization Research
© 2020 IEEE. This is the author's version of the article that has been published in the proceedings of the IEEE Visualization conference. The final version of this record is available at: xx.xxxx/TVCG.201x.xxxxxxx/
Alejandro Benito-Santos* and Roberto Therón†
VisUSAL Research Group, Universidad de Salamanca, Spain
*e-mail: [email protected]  †e-mail: [email protected]

Figure 1: GlassViz interface showing entry points to a corpus of visualization research papers along with related documents and keywords: (a) Quality neighborhoods representing entry points as connected keyword groups. (b) List of documents (showing only the first nine) sorted by number of keyword tokens matching those in selection s.1. (c) Keyword tokens for each document in view b. (d.1) Ranked list of tokens appearing in view c, informing of the composition of topics in the entry point selected in s.1. (d.2) State of view d.1 when the selection in view "a" is changed to s.2.

Abstract
In this paper, we report the development of a model and a proof-of-concept visual text analytics (VTA) tool to enhance document discovery in a problem-driven visualization research (PDVR) context. The proposed model captures the cognitive model followed by domain and visualization experts by analyzing the interdisciplinary communication channel as represented by keywords found in two disjoint collections of research papers. High distributional inter-collection similarities are employed to build informative keyword associations that serve as entry points to drive the exploration of a large document corpus. Our approach is demonstrated in the context of research on visualization for the digital humanities.
Keywords: visual text analytics, literature-based discovery, visualization of scientific corpora, distributional similarity, sensemaking, methodology transfer, digital humanities
1 Introduction
Problem-driven visualization research (PDVR) [28] requires intensive collaboration between visualization and domain experts to solve problems in a specific academic discipline such as biology, sports science, computer security, or the humanities. Motivated by the increasing specialization and difficulty of said problems, this collaboration usually materializes in the celebration of workshops, parallel events and micro-conferences (e.g., BioVis, Vis4DH, CityVis, VizSec) and related specialized publication datasets. Setting aside each domain's particularities, these communities generally have to deal with the same typical problems of visualization practice (e.g., dimensionality reduction, hierarchy visualization, or color perception). To obtain insight on these topics and generate novel research ideas [14], researchers perform literature reviews on other larger datasets of visualization publications in search of techniques conceived in other domains that may assist them in solving specific problems of their own domains. This transference of knowledge between communities of practice is known in human-computer interaction (HCI) and visualization as "methodology transfer" (MT), that is, the action of utilizing available models that provide solutions to existing and unsolved problems [4, 24]. For example, under this paradigm, a digital humanist focusing on the analysis of digital editions may find interesting a visual algorithm conceived for the analysis of genetic data, or vice versa (e.g., an Arc Diagram [33]).

However, arriving at this kind of finding is seldom straightforward. A first hurdle is related to the lack of linguistic competences [28] to formulate queries that serve as entry points [14] to the dataset. To illustrate this situation, take the example of the same digital humanist willing to explore a large corpus of visualization research papers. From previous experience, she knows that the analysis of digital editions is typically related to the concepts of "network analysis" and "graph theory", which are her entry points to the dataset. However, she might not be familiar yet with other, more specific techniques that could be useful in this context, such as "graph clique" or "persistent homology." Conversely, the authors of papers containing these specific terms might not have chosen to include the more general terms "network analysis" or "graph theory" in their keyword selections for being too obvious and thus uninteresting for the audience initially addressed in their works. Therefore, these publications are effectively invisible to the digital humanist's eyes because she has not yet acquired the necessary vocabulary to formulate an adequate query for this dataset. Irremediably, in a typical setup she will have to start the search by typing keyword(s) she is familiar with, initiating an iterative sensemaking process [19, 25] that will be followed by a faceted browsing of the dataset according to its different dimensions (e.g., authors, keywords, or citations). The situation depicted in this example presents further problems: firstly, searching by general terms will return large document lists with varying degrees of relevance that the researcher needs to inspect and filter individually. Second, the subsequent browsing is performed by manual means following a chain of first-order co-occurrences of metadata items, which may rapidly become a frustrating experience for the user, especially when the data volumes are large.
To overcome these issues, we propose a distributional model and a related proof-of-concept (POC) tool that aim to capture similarities between keywords in different domains and to automate the generation of meaningful entry points to a corpus of research papers that needs to be explored. The model and tool are demonstrated in the context of a researcher working at the intersection of visualization and the humanities.

2 Related Work
Problem-Driven Visualization Research (PDVR):
PDVR brings together domain and visualization experts who collaborate to solve specific, inherently complex domain problems. Beyond technical expertise in both domains, some authors have stressed the importance of language to success in interdisciplinary research [3]. In this regard, Simon et al. explain collaborations in PDVR with a communication model [28] in which domain experts generate the problem space by providing data and driving problems, and visualization experts contribute exploratory data analysis and visualization techniques defining the design space. Following this reasoning, solutions are mappings between the problem and design spaces, and their number is defined by the breadth (or richness) of the communication channel shared by the two teams. More recently, Miller et al. [24] developed these concepts further in their Methodology Transfer Model (MTM). The MTM incorporates the notions of similarity and alignment to identify potential MTs between different knowledge domains. Our work employs distributional similarity to extend these theoretical models with other concepts drawn from information science (see next paragraph).
Literature-Based Discovery (LBD):
LBD is a knowledge extraction technique that "generates discoveries, or hypotheses, by combining what is already known in the literature" [31]. The concept was introduced in the 1980s by Don R. Swanson, an information scientist known for coining the first form of LBD, the ABC model [29]. The ABC model employs transitive inference to unveil non-trivial implicit associations between two disjoint bodies of scientific literature (source and target). It utilizes a simple yet powerful syllogism to pair knowledge fragments: if a term/concept (a-concept) is related to an intermediate term/concept (b-concept) that appears in both the source and target literatures, and the b-concept is related to a c-concept that appears only in the target literature, then we can find a relation, characterized by the b-concept, between the a-concept (which the user is familiar with) and the c-concept (which is new to the user). Specifically, we look upon recent work by Thilakaratne et al. [30], who employ word embeddings to find interesting cross-disciplinary affinities in online paper databases. As opposed to the authors, who employ paper abstracts to generate neural embeddings using the word2vec model [23], our work relies on author-assigned keywords (hereinafter simply "keywords"), which are descriptive words assigned by the authors to their research papers and have been successfully employed in the past by other researchers to "facilitate the process of understanding differences and commonalities of the various research sub-fields in visualization" [18]. Also, and despite recent efforts [8], the process by which humans extract keywords from academic texts remains mostly unknown [20]. Therefore, keywords model a unique and highly expressive language that serves as the starting point for our study.
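To make the syllogism concrete, the following minimal Python sketch (ours, not from the paper; the toy keyword sets are invented for illustration) pairs a- and c-concepts through shared b-concepts over two tiny document collections:

```python
# Minimal sketch of Swanson's ABC syllogism over keyword sets.
# The example terms are invented for illustration only.
source_docs = [  # S-literature: documents from the solution-rich domain
    {"graph clique", "network analysis"},
    {"persistent homology", "graph theory"},
]
target_docs = [  # T-literature: documents from the problem domain
    {"digital editions", "network analysis"},
    {"digital editions", "graph theory"},
]

source_vocab = set().union(*source_docs)
target_vocab = set().union(*target_docs)

b_concepts = source_vocab & target_vocab  # shared vocabulary
a_concepts = target_vocab - source_vocab  # only in the T-literature
c_concepts = source_vocab - target_vocab  # only in the S-literature

# A c-concept is a candidate discovery for an a-concept when both
# co-occur (in their respective literatures) with the same b-concept.
for b in sorted(b_concepts):
    linked_a = {a for d in target_docs if b in d for a in d & a_concepts}
    linked_c = {c for d in source_docs if b in d for c in d & c_concepts}
    for a in sorted(linked_a):
        for c in sorted(linked_c):
            print(f"{a!r} --[{b}]--> {c!r}")
```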
Visual Text Analytics (VTA) of Scientific Literature:
In recent times, some authors have started to incorporate linguistic and sensemaking models into their VTA tools to replicate the typical tasks and goals of exploring scientific texts [12]. For example, the Action Science Explorer [10] and PaperPoles [15] mimic the sensemaking process of traditional literature reviews. Concretely, PaperPoles supports the browsing of publications in a context-aware environment by requesting positive or negative queries from the user as the application workflow progresses. PaperQuest [26] employs a relevance algorithm to rank papers according to the sensemaking process of literature reviews. PaperQuest assumes that the user has one or more seed papers at her disposal to start the exploration, a concept that we implemented in GlassViz. Guo et al. [14] propose a two-stage sensemaking framework to discover novel research ideas based on previous work by Pirolli and Card [25]. Wang et al. implement two different logic flows in their system (author-based and citation-based) to mirror the traditional literature review process [32]. To the best of our knowledge, GlassViz is the first VTA tool to incorporate the sensemaking model followed by interdisciplinary visualization researchers using an LBD workflow.
3 Data Processing
We selected two research paper collections as the S- and T-literatures in our LBD setup. The T-literature (VIS4DH dataset), representing the target domain that solutions need to be imported to, comprises 221 papers on visualization for the Digital Humanities (DH) [2]. The S-literature (VIS dataset) is a set of 2117 visualization publications that appeared at the IEEE Visualization set of conferences (InfoVis, SciVis, VAST, and Vis) between the years 1991-2018 [17]. Keywords were extracted from each document, tokenized, and translated into their American English forms when necessary. Tokens matching NLTK's list of English stop words (e.g., "and" or "of") were removed from further analysis, which yielded a total of 3403 different tokens. Next, each token was light-stemmed using the Porter algorithm. Given that keywords are a very sparse feature of scientific papers, the stemming procedure had the positive effect of compressing the input vocabulary (from 3403 to 2720 tokens) by linking redundant forms together under the same root (e.g., "filtering," "filters," and "filtered" under "filter"). In addition, and despite certain limitations that we discuss in Sect. 6, the stemming algorithm helped relate documents referring to the same high-level concepts while requiring minimal human intervention (e.g., a manual classification [17]). Finally, we removed uninteresting tokens with an inverse document frequency (IDF) of less than 1.0, resulting in only one token ("visual") being discarded. Each document was treated as a bag of keyword tokens defining a vocabulary composed of three disjoint sets as per Swanson's ABC model: V_a (a-concepts, or tokens appearing exclusively in the VIS4DH dataset), V_c (c-concepts, or tokens appearing exclusively in the VIS dataset), and V_b (b-concepts, or tokens appearing in both datasets). In the end, the vocabulary sizes obtained were |V_a| = , |V_b| = , and |V_c| = .
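The following Python sketch outlines the preprocessing pipeline described above under stated assumptions: the two document lists are toy stand-ins for the VIS4DH and VIS datasets, the American-English normalization step is omitted, and a natural-logarithm IDF is assumed (the paper does not state the base). Note that on a corpus this small the IDF threshold filters more aggressively than in the paper, where only "visual" fell below it.

```python
# Sketch of the keyword-preprocessing pipeline. Requires NLTK with its
# stopword list installed: nltk.download("stopwords").
import math
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

vis4dh_docs = [["digital editions", "network analysis"],   # toy stand-ins
               ["distant reading", "graph theory"]]
vis_docs = [["graph clique", "network analysis"],
            ["persistent homology", "filtering of graphs"]]

stop = set(stopwords.words("english"))
stem = PorterStemmer().stem

def tokenize(docs):
    """Split keywords into tokens, drop stop words, stem the rest."""
    return [{stem(t) for kw in doc for t in kw.lower().split()
             if t not in stop} for doc in docs]

t_docs, s_docs = tokenize(vis4dh_docs), tokenize(vis_docs)
all_docs = t_docs + s_docs

# Drop near-ubiquitous tokens with IDF below 1.0 (log base is our
# assumption; in the paper this discarded only the token "visual").
n = len(all_docs)
idf = {t: math.log(n / sum(t in d for d in all_docs))
       for d in all_docs for t in d}
keep = {t for t, v in idf.items() if v >= 1.0}

# Partition the remaining vocabulary as per Swanson's ABC model.
v_t = {t for d in t_docs for t in d} & keep  # VIS4DH tokens
v_s = {t for d in s_docs for t in d} & keep  # VIS tokens
v_a, v_b, v_c = v_t - v_s, v_t & v_s, v_s - v_t
```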
4 System Design

Our approach relies on the extraction of entry points to guide the exploration of a scientific corpus. The extraction of the entry points is based on the following assumptions: at the beginning of this study, we observed that researchers participating in PDVR internally follow an MTM that is mainly driven by their previous experience in other projects and domains. Here, the expert initially analyzes the problem and breaks it into its constituent parts, leading to a set of themes that are matched against previous grounded knowledge. In this mental process, candidate solutions are detached from the original problem's domain and matched against the new domain in search of viable solutions. The most similar solutions are then implemented to obtain preliminary insight into the data, which is often necessary to promote discussions between stakeholders and advance the project at its early stages. Later in the design process, the team may decide to modify and/or combine these initial solutions to provide a visualization that aligns better with the data and tasks of the problem at hand [24]. Motivated by the presented situation, we extracted the following design goals and related questions at the beginning of the study, which ultimately drove the design of our distributional model and POC tool:
DG.1: Motivate a personalized exploration of scientific corpora that is tailored to the user's research aims. "What kind of knowledge does the user want to extract from a dataset?", "What can a user learn from the dataset that is useful for solving a particular domain problem?"
DG.2: Potentiate the discovery of methodologies that could potentially be transferred from other existing design spaces to the source domain. "How can we measure the degree of transferability of solutions conceived in other knowledge domains?"
DG.3: Accelerate sensemaking and language acquisition in the context of PDVR. "What are the most informative terms that best describe a dataset according to the user's level of expertise and grounded knowledge?", "What themes are especially interesting for the user?", "How can they be presented in the best possible manner to augment their comprehensibility?"
DG.4: Provide a reading order for discovered documents. "What documents are the most important in the collection for the user?"
Our theoretical model (Fig. 2) combines Swanson's and Miller et al.'s models to build automatic entry points that resemble the researchers' sensemaking model and assist them in the task of mapping problem and design spaces in different domains and bodies of literature. According to Simon et al. [28], the problem space is defined by application domains and their data, whereas the design space comprises analytical tasks and visualizations. In the diagram, we depict the idea that valid MTs consist of a series of concepts specific to each domain (a- and c-concepts) and a variable number of techniques that address a generic, high-level problem in the visualization domain (b-concepts). Thus, as per Swanson's model, solutions (or papers) in the T-literature link problems and designs containing only a- and b-concepts, while those in the S-literature contain only c- and b-concepts. Then, it should be theoretically possible to deduce recurrent terms of potential solutions by analyzing the distribution of terms in existing solutions documented in the literature of other domains and relating them to the problem(s) at hand using high-order co-occurrence. This idea is depicted in the Venn diagram at the center of the image. At the intersection of the four sets, the core terms of the elements in the four spaces meet, giving clues about the descriptions of potential solutions. Besides, more potential solutions could be found by following chains of co-occurrence that lead to peripheral intersection spaces. As we explain in the next section, our proposed model captures high-order co-occurrence of concepts to enhance the document exploration process.
We rely on the generation of keyword embeddings to detect distributional similarities between problems, data, tasks, or visualizations in the S- and T-literatures. These embeddings were generated by following the method proposed by Levy et al. [21], which requires minimal hyper-parameter tuning and is known to excel at word similarity tasks [21, 22].
Figure 2: Methodology Transfer Model (MTM) adapted from Miller et al. [24]. The model maps problems and designs found in two disjoint bodies of literature and is augmented with concepts drawn from Swanson's ABC Model for Literature-Based Discovery to automate the discovery of candidate MTs and to provide the user with informative entry points to the S-Literature.

The method starts from a pointwise mutual information (PMI) matrix that encodes the probability of a pair of keyword tokens being seen together in a document with respect to the probability of seeing those two same tokens in the union of the two corpora (see Equation 1). For all keyword token pairs in the S- and T-literatures, each cell $M_{i,j}$ represents the log odds ratio of the joint probability of $w_i$ (a keyword) and $c_j$ (any other keyword appearing with $w$ in a document $D$, its context) and the product of their marginal probabilities. The marginal probabilities were empirically obtained from the corpora by counting the number of occurrences of each token divided by the size of the union of the document collections:

$$\mathrm{PMI}(w, c) = \log \frac{P(w, c)}{P(w)\,P(c)} = \log \frac{\#(w, c) \cdot |D|}{\#(w) \cdot \#(c)} \qquad (1)$$

Given that PMI can be $-\infty$ for pairs of tokens that were never seen together in the corpus, it is customary to use the positive version of the PMI matrix, defined as:

$$\mathrm{PPMI}(w, c) = \max(\mathrm{PMI}(w, c), 0) \qquad (2)$$

Following recommendations in the literature [1, 22], we applied a light smoothing with $\alpha = 0.75$:

$$\mathrm{SPPMI}(w, c) = \log \frac{\hat{P}(w, c)}{\hat{P}(w)\,\hat{P}_{\alpha}(c)} \qquad (3)$$

where the smoothed unigram distribution of the context is:

$$\hat{P}_{\alpha}(c) = \frac{\#(c)^{\alpha}}{\sum_{c} \#(c)^{\alpha}} \qquad (4)$$

To capture high-order co-occurrence and to generate dense keyword vectors, the sparse SPPMI matrix was factorized into the product of three matrices by applying a non-parametric algebraic method, SVD, which was popularized in the NLP community with Latent Semantic Analysis (LSA) [9, 21]. If the SPPMI matrix is the matrix $M$, SVD decomposes $M$ into the product of three matrices $U \Sigma V^{T}$, where $U$ and $V$ are orthonormal ($U^{T}U = V^{T}V = I$) and $\Sigma$ is a diagonal matrix of sorted singular values of the same rank $r$ as the input matrix. Then, our resulting vector space model (VSM) is formed by dense keyword embeddings resulting from keeping only the first $k$ columns of $U$ ($k = 50$ in our case).
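As a minimal sketch of this construction, the following Python code builds the co-occurrence counts, applies Eqs. 1-4, and factorizes the resulting matrix with a truncated SVD. It continues the earlier preprocessing sketch; the smoothing exponent follows the recommendation in [22], and the cap on k is ours so the sketch also runs on toy-sized data:

```python
# Sketch of the embedding construction (Eqs. 1-4 + SVD): a smoothed,
# positive PMI matrix over keyword co-occurrences, factorized with a
# truncated SVD. Continues the previous sketch (`all_docs`, `keep`).
import numpy as np
from itertools import combinations
from scipy.sparse.linalg import svds

docs = [d & keep for d in all_docs]        # IDF-filtered token sets
vocab = sorted(set().union(*docs))
idx = {t: i for i, t in enumerate(vocab)}
n, D = len(vocab), len(docs)

pair = np.zeros((n, n))                    # co-occurrence counts #(w, c)
count = np.zeros(n)                        # unigram counts #(w)
for d in docs:
    for t in d:
        count[idx[t]] += 1
    for w, c in combinations(d, 2):
        pair[idx[w], idx[c]] += 1
        pair[idx[c], idx[w]] += 1

alpha = 0.75                               # smoothing exponent, as in [22]
p_w = count / D                            # marginal P(w)
p_c = count**alpha / (count**alpha).sum()  # smoothed context P_alpha(c)
with np.errstate(divide="ignore"):
    pmi = np.log((pair / D) / np.outer(p_w, p_c))
sppmi = np.maximum(pmi, 0.0)               # clip -inf/negatives (Eq. 2)

# Keep the top-k left singular vectors as dense embeddings (k = 50 in
# the paper; capped here so the sketch also runs on toy-sized data).
k = min(50, n - 1)
U, S, Vt = svds(sppmi, k=k)
embeddings = U                             # one k-dim vector per token
```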
5 GlassViz

In this section, we describe the design decisions that drove the development of our prototype tool by carrying out an experiment using the datasets introduced in Sect. 3. Our approach is centered around the qualitative inspection of quality local neighborhoods of a-concepts that were derived using a cosine metric [30]. According to the literature, it is customary to select between 3 and 5 nearest neighbors for this task (see [16], Section 4.1.1). Thus, we decided to extract the 4 nearest neighbors of each a-concept t_a in V_a. Tokens represented by very similar vectors (pairwise cosine distance ≤ 0.01) and thus displaying identical nearest neighbors were considered redundant for the purpose of this task and were removed (488 in total). Quality neighborhoods were defined as those containing significant similarities between a- and c-concepts and were identified by two criteria: (1) the neighborhood included at least one c-concept, and (2) when criterion 1 was met, the similarity between the a-concept and its nearest c-concept in the neighborhood fell within the first quartile of all highest similarities (dist(t_a, t_c) ≤ Q, with Q the first-quartile value).
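A minimal sketch of this extraction step follows, reusing the `embeddings`, `vocab`, `v_a`, and `v_c` names from the previous sketches. The redundant-vector removal is omitted for brevity, and the quartile is computed over similarities, which is equivalent to taking the first quartile of distances:

```python
# Sketch of the quality-neighborhood extraction: take the 4 cosine
# nearest neighbors of every a-concept and keep neighborhoods containing
# a c-concept whose similarity falls in the best quartile.
import numpy as np

norm = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True)
                     + 1e-12)
sims = norm @ norm.T                      # pairwise cosine similarities

neighborhoods, best = {}, {}
for a in v_a & set(vocab):
    i = vocab.index(a)
    order = np.argsort(-sims[i])
    nn = [vocab[j] for j in order[1:5]]   # 4 nearest neighbors (skip self)
    cs = [t for t in nn if t in v_c]
    if cs:                                # criterion 1: has a c-concept
        neighborhoods[a] = nn
        best[a] = max(sims[i, vocab.index(t)] for t in cs)

# Criterion 2: keep only neighborhoods whose best a-c similarity is in
# the top quartile of all best matches (first quartile of distances).
if best:
    q = np.quantile(list(best.values()), 0.75)
    quality = {a: nn for a, nn in neighborhoods.items() if best[a] >= q}
```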
The qualifying neighborhoods were then merged into 12 complete subgraphs that were simplified with Pathfinder networks (PFNETs), both to reduce visual clutter (DG.1) and to motivate fast vocabulary learning (DG.3) with a minimal cognitive gap. This is achieved by pruning graph edges that are not on shortest paths according to two parameters: q (the number of indirect proximities considered to build the PFNET) and r (the metric used to compute pairwise similarities) [5, 6, 27]. Concretely, we calculated minimum spanning trees (MSTs), the most concise form of a PFNET (q = n − 1, r = ∞), for the 12 complete subgraphs. Following recommendations in the literature [5], each PFNET was plotted using a force-directed algorithm [13] that placed nodes displaying high pairwise cosine similarities closer together in the chart. In this representation, the nodes depict a-, b-, or c-concepts as per Swanson's ABC model. Each MST portrays an exploration path (or entry point) to the VIS dataset that can be inspected individually in the designated areas of view 1.a (see Fig. 1). A total of 29 a-concepts (red), 16 b-concepts (yellow), and 19 c-concepts (blue) were captured. Each node shows a text label containing the most common form of the corresponding token, and its size encodes the token's absolute frequency in the union of the two literatures. In view 1.b, documents in the current selection are listed in descending order of the number of keyword tokens matching those in the current selection (DG.4). The number of documents containing any of the terms captured by the entry points was 69 for the T-literature and 297 for the S-literature (31.22% and 14.03% coverage, respectively). To the right of view 1.b, view 1.c shows the keyword tokens of each document shown in view 1.b, whereas view 1.d1 (and 1.d2 for selection s2) aggregates and presents these tokens in a rank-frequency list.
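The following sketch turns one such subgraph into an entry point by computing an MST over cosine distances with SciPy. Here a single quality neighborhood stands in for a merged subgraph, and `sims`/`vocab` are reused from the previous sketches:

```python
# Sketch of building one entry point: a minimum spanning tree, the most
# concise PFNET (q = n - 1, r = inf), over pairwise cosine distances.
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

a0 = next(iter(quality))                   # one quality neighborhood
tokens = [a0] + quality[a0]                # its a-concept plus neighbors

ids = [vocab.index(t) for t in tokens]
dist = 1.0 - sims[np.ix_(ids, ids)]        # cosine distances; near-zero
np.fill_diagonal(dist, 0.0)                # pairs were already removed

mst = minimum_spanning_tree(dist).toarray()
edges = [(tokens[i], tokens[j], 1.0 - mst[i, j])
         for i, j in zip(*np.nonzero(mst))]
for a, b, s in edges:                      # edges for a force-directed layout
    print(f"{a} -- {b} (cosine similarity {s:.2f})")
```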
By visually inspecting each of the 12 entry points in view 1.a (DG.1), the user can recognize interesting inter-collection distributional similarities between concepts describing application areas, domain problems, analytical tasks/techniques, and visualizations, as per the model introduced in Sect. 4.2. The entry points can be further inspected using a brushing+linking interaction technique. For example, when brushing the entry point in selection s1 (DG.3), the user could look at view 1.d1 to discover the most frequent tokens ("topic," "model," "text," "analyt," "dirichlet," etc.) found in documents matching any of the entry point's concepts shown in s1, allowing a first rapid interpretation of the theme. By interacting with the items in view 1.b, the user could retrieve in a pop-up view multiple metadata related to a document, i.e., its title, author list, publication year/venue, and number of keywords matching the entry point. In the same view, it can be observed that the three a-concepts in s1 can be traced to three documents in the VIS4DH dataset describing two domain problems (the analysis of international trade agreements and of bibliographic works, respectively) and a domain-specific analysis tool, a wrangling Excel script for a popular NLP toolkit among DH practitioners. Similarly, the two c-concepts "nonnegative" and "latent" can be traced to three other documents in the VIS dataset and reconstructed by the user to "nonnegative matrix factorization" and "latent Dirichlet allocation," introducing potentially interesting analysis techniques (DG.2). The same workflow could be applied to any other entry point of view 1.a, for example the one depicted in selection s2. This entry point relates the domain problem of "social justice" to the a-concept "tele-immersion" under the background theme of virtual and augmented reality (view 1.d2).
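A minimal sketch of the reading order described for view 1.b (DG.4), reusing names from the previous sketches; `entry_point` stands for the token set of a brushed entry point:

```python
# Sketch of the DG.4 reading order: rank documents by how many of their
# keyword tokens match the brushed entry point (view 1.b), then
# aggregate the matched documents' tokens into the rank-frequency list
# of view 1.d. Reuses `all_docs` and `tokens` from the sketches above.
from collections import Counter

entry_point = set(tokens)                          # brushed concepts

scores = [(len(d & entry_point), i) for i, d in enumerate(all_docs)]
ranking = sorted(((s, i) for s, i in scores if s > 0), reverse=True)

token_ranks = Counter(t for _, i in ranking for t in all_docs[i])
for tok, freq in token_ranks.most_common(10):      # view 1.d-style list
    print(tok, freq)
```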
6 Conclusions, Limitations and Future Work
We have presented a model and a related VTA prototype aimed at accelerating the process of knowledge and language acquisition in PDVR. By modeling the distribution of keywords defining the interdisciplinary communication channel as documented by research papers found in two disjoint bodies of literature, we were able to generate entry points that motivated a personal exploration of a corpus of visualization papers according to the researcher's particular needs and expectations and that required minimal user intervention. However, we identified certain limitations in our approach that are discussed hereafter. Firstly, the stemming algorithm employed to compress the input data produced some false positives that are difficult to avoid by automatic means. Concretely, this side effect could be observed in cases where keywords with different meanings were reduced to the same lexical form, for example the tokens factory (from "smart factory") and factorial (from "factorial analysis"). Also,
GlassViz does not allow the interactive tuning of certain parameters, such as the number k of singular values, the smoothing factor α, or the similarity thresholds set to detect redundant vectors and quality neighbors. To resolve these and other issues, we plan to incorporate direct manipulation techniques [11] in the future. Furthermore, the tokenization of keywords increased the difficulty of interpreting the entry points' background themes, a limitation that could be resolved by employing auxiliary n-gram statistics [7] to assist the user in reconstructing the original phrases.

Acknowledgments
The authors want to thank the three anonymous reviewers for their helpful comments. This work was supported by a grant from the Spanish Ministry of Economic Affairs and Digital Transformation under the EU CHIST-ERA agreement (PCIN-2017-064).

References

[1] A. Benito-Santos and R. Therón Sánchez. Cross-domain visual exploration of academic corpora via the latent meaning of user-authored keywords. IEEE Access, 7:98144–98160, 2019.
[2] A. Benito-Santos and R. Therón Sánchez. A Data-Driven Introduction to Authors, Readings and Techniques in Visualization for the Digital Humanities. IEEE Computer Graphics and Applications, pp. 1–1, 2020.
[3] L. J. Bracken and E. A. Oughton. 'What do you mean?' The importance of language in developing interdisciplinary research. Transactions of the Institute of British Geographers, 31(3):371–382, July 2006. doi: 10.1111/j.1475-5661.2006.00218.x
[4] R. Burkhard. Learning from architects: The difference between knowledge visualization and information visualization. In Proceedings, Eighth International Conference on Information Visualisation (IV 2004), pp. 519–524, July 2004. doi: 10.1109/IV.2004.1320194
[5] C. Chen. Visualising semantic spaces and author co-citation networks in digital libraries. Information Processing & Management, 35(3):401–420, May 1999. doi: 10.1016/S0306-4573(98)00068-5
[6] C. Chen, J. Kuljis, and R. J. Paul. Visualizing latent domain knowledge. IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), 31(4):518–529, Nov. 2001. doi: 10.1109/5326.983935
[7] J. Chuang, C. D. Manning, and J. Heer. Termite: Visualization Techniques for Assessing Textual Topic Models. In Proceedings of the International Working Conference on Advanced Visual Interfaces, AVI '12, pp. 74–77. ACM, New York, NY, USA, 2012. doi: 10.1145/2254556.2254572
[8] J. Chuang, C. D. Manning, and J. Heer. "Without the Clutter of Unimportant Words": Descriptive Keyphrases for Text Visualization. ACM Trans. Comput.-Hum. Interact., 19(3):19:1–19:29, Oct. 2012. doi: 10.1145/2362364.2362367
[9] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman. Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6):391–407, 1990. doi: 10.1002/(SICI)1097-4571(199009)41:6<391::AID-ASI1>3.0.CO;2-9
[10] C. Dunne, B. Shneiderman, R. Gove, J. Klavans, and B. Dorr. Rapid understanding of scientific paper collections: Integrating statistics, text analytics, and visualization. Journal of the American Society for Information Science and Technology, 63(12):2351–2369, 2012. doi: 10.1002/asi.22652
[11] M. El-Assady, R. Kehlbeck, C. Collins, D. Keim, and O. Deussen. Semantic Concept Spaces: Guided Topic Model Refinement using Word-Embedding Projections. IEEE Transactions on Visualization and Computer Graphics, 26(1):1001–1011, Jan. 2020. doi: 10.1109/TVCG.2019.2934654
[12] P. Federico, F. Heimerl, S. Koch, and S. Miksch. A Survey on Visual Approaches for Analyzing Scientific Literature and Patents. IEEE Transactions on Visualization and Computer Graphics, 23(9):2179–2198, Sept. 2017. doi: 10.1109/TVCG.2016.2610422
[13] T. M. J. Fruchterman and E. M. Reingold. Graph drawing by force-directed placement. Software: Practice and Experience, 21(11):1129–1164, 1991. doi: 10.1002/spe.4380211102
[14] H. Guo and D. H. Laidlaw. Topic-based Exploration and Embedded Visualizations for Research Idea Generation. IEEE Transactions on Visualization and Computer Graphics, pp. 1–1, 2018. doi: 10.1109/TVCG.2018.2873011
[15] J. He, Q. Ping, W. Lou, and C. Chen. PaperPoles: Facilitating adaptive visual exploration of scientific publications by citation links. Journal of the Association for Information Science and Technology, 70(8):843–857, 2019. doi: 10.1002/asi.24171
[16] F. Heimerl and M. Gleicher. Interactive Analysis of Word Vector Embeddings. Computer Graphics Forum, 37(3):253–265, June 2018. doi: 10.1111/cgf.13417
[17] P. Isenberg, F. Heimerl, S. Koch, T. Isenberg, P. Xu, C. D. Stolper, M. Sedlmair, J. Chen, T. Möller, and J. Stasko. Vispubdata.org: A Metadata Collection About IEEE Visualization (VIS) Publications. IEEE Transactions on Visualization and Computer Graphics, 23(9):2199–2206, Sept. 2017. doi: 10.1109/TVCG.2016.2615308
[18] P. Isenberg, T. Isenberg, M. Sedlmair, J. Chen, and T. Möller. Visualization as Seen through its Research Paper Keywords. IEEE Transactions on Visualization and Computer Graphics, 23(1):771–780, Jan. 2017. doi: 10.1109/TVCG.2016.2598827
[19] G. Klein, B. Moon, and R. R. Hoffman. Making Sense of Sensemaking 2: A Macrocognitive Model. IEEE Intelligent Systems, 21(5):88–92, Sept. 2006. doi: 10.1109/MIS.2006.100
[20] S. Lahiri. Replication of the Keyword Extraction part of the paper "'Without the Clutter of Unimportant Words': Descriptive Keyphrases for Text Visualization". arXiv e-prints, arXiv:1908.07818, Aug. 2019.
[21] O. Levy and Y. Goldberg. Neural Word Embedding as Implicit Matrix Factorization. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, eds., Advances in Neural Information Processing Systems 27, pp. 2177–2185. Curran Associates, Inc., 2014.
[22] O. Levy, Y. Goldberg, and I. Dagan. Improving Distributional Similarity with Lessons Learned from Word Embeddings. Transactions of the Association for Computational Linguistics, 3:211–225, May 2015.
[23] T. Mikolov, I. Sutskever, K. Chen, G. S. Corrado, and J. Dean. Distributed Representations of Words and Phrases and their Compositionality. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, eds., Advances in Neural Information Processing Systems 26, pp. 3111–3119. Curran Associates, Inc., 2013.
[24] M. Miller, H. Schäfer, M. Kraus, M. Leman, D. A. Keim, and M. El-Assady. Framing Visual Musicology through Methodology Transfer. In Proc. 4th Workshop on Visualization for the Digital Humanities (VIS4DH), 2019.
[25] P. Pirolli and S. Card. The sensemaking process and leverage points for analyst technology as identified through cognitive task analysis. In Proceedings of International Conference on Intelligence Analysis, pp. 2–4, 2005.
[26] A. Ponsard, F. Escalona, and T. Munzner. PaperQuest: A Visualization Tool to Support Literature Review. In Proceedings of the 2016 CHI Conference Extended Abstracts on Human Factors in Computing Systems, CHI EA '16, pp. 2264–2271. Association for Computing Machinery, San Jose, California, USA, May 2016. doi: 10.1145/2851581.2892334
[27] R. W. Schvaneveldt, ed. Pathfinder Associative Networks: Studies in Knowledge Organization. Ablex Publishing, Westport, CT, US, 1990.
[28] S. Simon, S. Mittelstädt, D. A. Keim, and M. Sedlmair. Bridging the gap of domain and visualization experts with a Liaison. In Proceedings of the Eurographics Conference on Visualization (EuroVis), vol. 2015. The Eurographics Association, Cagliari, Italy, 2015.
[29] D. R. Swanson. Fish Oil, Raynaud's Syndrome, and Undiscovered Public Knowledge. Perspectives in Biology and Medicine, 30(1):7–18, 1986. doi: 10.1353/pbm.1986.0087
[30] M. Thilakaratne, K. Falkner, and T. Atapattu. Automatic Detection of Cross-Disciplinary Knowledge Associations. In Proceedings of ACL 2018, Student Research Workshop, pp. 45–51, July 2018.
[31] M. Thilakaratne, K. Falkner, and T. Atapattu. A Systematic Review on Literature-based Discovery. ACM Computing Surveys (CSUR), Dec. 2019.
[32] Y. Wang, D. Liu, H. Qu, Q. Luo, and X. Ma. A Guided Tour of Literature Review: Facilitating Academic Paper Reading with Narrative Visualization. In Proceedings of the 9th International Symposium on Visual Information Communication and Interaction, VINCI '16, pp. 17–24. ACM, New York, NY, USA, 2016. doi: 10.1145/2968220.2968242
[33] M. Wattenberg. Arc Diagrams: Visualizing Structure in Strings. In IEEE Symposium on Information Visualization (INFOVIS 2002), pp. 110–116, Oct. 2002. doi: 10.1109/INFVIS.2002.1173155