Mapping Researchers with PeopleMap

Abstract

Discovering research expertise at universities can be a difficult task. Directories routinely become outdated, and few help in visually summarizing researchers' work or supporting the exploration of shared interests among researchers. This results in lost opportunities for both internal and external entities to discover new connections, nurture research collaboration, and explore the diversity of research. To address this problem, at Georgia Tech, we have been developing PeopleMap, an open-source interactive web-based tool that uses natural language processing (NLP) to create visual maps for researchers based on their research interests and publications. Requiring only the researchers' Google Scholar profiles as input, PeopleMap generates and visualizes embeddings for the researchers, significantly reducing the need for manual curation of publication information. To encourage and facilitate easy adoption and extension of PeopleMap, we have open-sourced it under the permissive MIT license at https://github.com/poloclub/people-map. PeopleMap has received positive feedback and enthusiasm for expanding its adoption across Georgia Tech.

Jon Saad-Falcon*, Omar Shaikh*, Zijie J. Wang*, Austin P. Wright*, Sasha Richardson†, Duen Horng (Polo) Chau*

* Georgia Institute of Technology. {jonsaadfalcon, oshaikh, jayw, apwright, polo}@gatech.edu
† Fayetteville State University. [email protected]

Figure 1: A. PeopleMap for 83 Georgia Tech researchers affiliated with the Institute of Data Engineering and Science (IDEaS), based on their research interests and publications. 1. Map View visualizes embeddings of researchers generated from their Google Scholar keywords and publication text; each dot represents one researcher. 2. Research Query allows searching for researchers and areas of research. 3. Researcher View shows detailed information (e.g., affiliation, citation count) of the researcher selected in Map View. 4. Control Panel for adjusting visualization settings (e.g., show researcher names). B. Example search results when querying for the "algorithms" research topic; darker color indicates stronger research alignment between a researcher (dot) and the query topic. C. Example researcher clustering results produced by a Gaussian mixture model; cluster distributions shown as ellipses.

Index Terms: Human-centered computing—Visualization—Visualization systems and tools

Introduction

Discovering research expertise and potential collaborators at universities can be a difficult task. While manually curated university directories currently fill this role, they are primarily designed for cataloging individuals' affiliation and contact information. Few help in visually summarizing researchers' work or supporting the exploration of shared interests among researchers. Furthermore, such directories routinely become outdated and sometimes provide inaccurate or incomplete information about the researchers as research interests and publication records evolve over time. This results in lost opportunities for both internal and external entities to nurture research collaboration and explore the diversity of research. To address this common issue shared among research institutions, our ongoing work makes the following contributions:

1. PeopleMap (Fig. 1A), an open-source interactive web-based tool that employs embeddings generated using natural language processing (NLP) techniques to visually "map out" researchers using their research interests and publications found on the researchers' Google Scholar profiles. Requiring only Google Scholar profiles as input (e.g., their URLs), PeopleMap significantly reduces the need for manual curation of publication information. While existing tools and research have primarily focused on tackling tasks such as recommending research papers and venues to publish at [1, 3, 4], we are working to contribute PeopleMap as one of the first practical tools that helps summarize and visualize researcher interests and expertise. To encourage and facilitate easy adoption and extension of PeopleMap, we have open-sourced it under the permissive MIT license. PeopleMap's code repository and detailed documentation are available at https://github.com/poloclub/people-map. All generated PeopleMaps are static web applications that can be hosted as standard web pages (e.g., as GitHub pages) without the need for any backend computation servers.

2. Deployment of PeopleMap: Early Usage and Feedback. To demonstrate the feasibility and generalizability of PeopleMap, we have successfully deployed PeopleMaps for three research units at Georgia Tech: (1) the Institute of Data Engineering and Science (IDEaS), with 83 affiliated faculty members, that serves as a unified point to connect researchers with government and industry to advance foundational data science research; (2) the Center of Machine Learning, with over 40 core faculty members; and (3) the Department of Chemistry and Biochemistry, with 32 faculty members. We have enjoyed positive feedback from leadership teams for this early deployment. Some faculty members are particularly excited about PeopleMap's interactive exploration support and its potential in helping them find colleagues to collaborate with on research projects and grant proposals. Discussion has begun on expanding PeopleMap's adoption to more research units across Georgia Tech.

PeopleMap System Design

PeopleMap's user interface has four major components:

Mapping Out Researcher Interests.

The Map View component (Fig. 1A-1) visualizes the researcher embeddings, allowing the user to explore the similarities and differences between researchers. These researcher embeddings were generated by first gathering the research interests and publications from each researcher's Google Scholar profile, using the scholarly Python library; this process only requires the researchers' Google Scholar profile URLs. We then concatenate the titles and abstracts of each researcher's publications together; for some configurations of PeopleMap's settings, researcher keywords are also added into their combined document. To normalize these combined documents, non-English characters and stopwords are removed, words are stemmed, and characters are turned lowercase. The collected data is then processed using term frequency–inverse document frequency (TFIDF) [2], which allows us to penalize common terms (low TFIDF score) shared by the whole dataset and focus on finding "characteristic" terms that differentiate (high score). The TFIDF weighting placed on a researcher's Google Scholar keywords can be adjusted using the Control Panel (Fig. 1A-4). Each researcher's embedding becomes a column in a TFIDF matrix, where each row is a term, and the cell value is the term's TFIDF score in the embedding. As there are thousands of terms, we perform principal component analysis (PCA) to reduce the dimensionality. We then perform Gaussian mixture modeling to split the overall distribution of researcher vectors into several different Gaussian distributions; each researcher vector in the Map View (Fig. 1C) is colored according to the distribution it is assigned to. These Gaussian distributions are intended to aid users in their analysis of the different fields of study among the researchers.

Finding Specific Researchers and Areas of Study.

When companies and national labs seek to collaborate with a research institution, they often need to first discover whether any researchers' interests align with theirs, and whether there is a critical mass of researchers that could sustain the research engagement. PeopleMap's Research Query tool (Fig. 1A-2) aims to support such discovery. It allows users to search for researchers based on how well their research interests align with the query topic (see Fig. 1B for example search results when querying for the "algorithms" research topic). When a user types in a Google Scholar keyword, PeopleMap visualizes how each researcher aligns with the given topic. To determine this alignment between each researcher and the field of interest, we compute the cosine similarity by using the same TFIDF researcher embeddings described earlier. Fig. 1B shows an example query result, where darker colors indicate stronger research alignment, highlighting those who tend to use the query term proportionally more in their publications than other researchers. This feature helps users more easily assess the scope of research relevance.

Footnote URLs, in order of appearance:
https://poloclub.github.io/people-map/ideas/
https://poloclub.github.io/people-map/ml/
https://poloclub.github.io/private-people-map/
https://pypi.org/project/scholarly/

Learning More About a Researcher.

The Researcher View (Fig. 1A-3) shows information related to a researcher's profile when they are highlighted or hovered over in the Map View. This profile information includes their name, affiliation, position, citation count, Google Scholar profile link, and Google Scholar keywords.
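As a rough illustration of the Research Query alignment described earlier (cosine similarity between a query term and each researcher's TFIDF embedding), the following sketch uses only the Python standard library. It is not PeopleMap's actual code: the function names, the tiny stopword list, the toy documents, and the choice of a plain `log(N / df)` inverse document frequency are all assumptions made for illustration, and stemming is omitted.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (an assumption, not PeopleMap's actual list).
STOPWORDS = {"a", "an", "and", "for", "in", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase, keep ASCII alphabetic tokens, and drop stopwords (no stemming)."""
    cleaned = "".join(c if c.isascii() and c.isalpha() else " " for c in text.lower())
    return [word for word in cleaned.split() if word not in STOPWORDS]


def tfidf_embeddings(documents):
    """Map each researcher name to a sparse {term: TFIDF weight} dictionary."""
    counts = {name: Counter(tokenize(text)) for name, text in documents.items()}
    n_docs = len(counts)
    # Document frequency: number of researchers whose document contains the term.
    doc_freq = Counter(term for c in counts.values() for term in c)
    embeddings = {}
    for name, c in counts.items():
        total = sum(c.values())
        embeddings[name] = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in c.items()
        }
    return embeddings


def cosine_similarity(u, v):
    """Cosine similarity between two sparse vectors stored as dictionaries."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def query_alignment(embeddings, query):
    """Score every researcher against the query; higher means stronger alignment."""
    query_vec = dict(Counter(tokenize(query)))
    return {name: cosine_similarity(query_vec, emb) for name, emb in embeddings.items()}


# Hypothetical combined documents (titles and abstracts would go here).
documents = {
    "Researcher A": "Approximation algorithms for graph partitioning and graph algorithms",
    "Researcher B": "Visualization tools for interactive graph exploration",
    "Researcher C": "Catalysts for organic synthesis",
}
scores = query_alignment(tfidf_embeddings(documents), "algorithms")
for name, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{name}: {score:.3f}")
```

In a map like Fig. 1B, scores of this kind would drive the color intensity of each dot; researchers whose documents never mention the query term score zero.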

Calibrating Exploration.

The Control Panel (Fig. 1A-4) allows the user to control various configurations of the Map View component. These tools include: the Show Distributions toggle, which displays the Gaussian distributions generated in the Map View (Fig. 1C); a slider that changes the number of distributions generated in the Gaussian mixture model; the Keywords Emphasis drop-down, which changes the weight placed on each researcher's Google Scholar keywords when creating their TFIDF embedding; and the Publication Set drop-down, which changes the publications (most highly cited, or most recent) used to generate the researcher embedding.
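The dimensionality reduction and clustering steps behind the Map View, including the number of distributions set from the Control Panel, can be sketched as below. This is a minimal sketch under stated assumptions rather than PeopleMap's implementation: it assumes NumPy is available, uses synthetic rows in place of real TFIDF vectors, and substitutes a small hand-written expectation-maximization loop with a deterministic farthest-point initialization for whatever Gaussian mixture implementation the tool actually uses. All function names are hypothetical.

```python
import numpy as np


def pca_project(vectors, n_dims=2):
    """Project row vectors onto their top principal components using SVD."""
    centered = vectors - vectors.mean(axis=0)
    _, _, components = np.linalg.svd(centered, full_matrices=False)
    return centered @ components[:n_dims].T


def farthest_point_init(points, k):
    """Deterministic initial means: start far from the centroid, then spread out."""
    dists = np.linalg.norm(points - points.mean(axis=0), axis=1)
    chosen = [int(dists.argmax())]
    while len(chosen) < k:
        gaps = np.linalg.norm(points[:, None, :] - points[chosen][None, :, :], axis=2)
        chosen.append(int(gaps.min(axis=1).argmax()))
    return points[chosen].copy()


def log_gaussian(points, mean, cov):
    """Log density of a multivariate normal evaluated at each row of `points`."""
    d = points.shape[1]
    diff = points - mean
    solved = np.linalg.solve(cov, diff.T).T   # cov^{-1} (x - mean) for every row
    maha = np.sum(diff * solved, axis=1)      # squared Mahalanobis distance
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)


def fit_gmm(points, k, n_iter=100, reg=1e-6):
    """Fit a k-component Gaussian mixture with EM; returns means, covs, labels."""
    n, d = points.shape
    means = farthest_point_init(points, k)
    # Start from identical spherical covariances sized to the overall spread.
    covs = np.stack([np.eye(d) * points.var(axis=0).mean()] * k)
    weights = np.full(k, 1.0 / k)
    for _ in range(n_iter):
        # E-step: responsibilities computed in log space for numerical stability.
        log_p = np.stack(
            [np.log(weights[j]) + log_gaussian(points, means[j], covs[j]) for j in range(k)],
            axis=1,
        )
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and full covariance matrices.
        nk = resp.sum(axis=0) + 1e-12
        weights = nk / n
        means = (resp.T @ points) / nk[:, None]
        for j in range(k):
            diff = points - means[j]
            covs[j] = (resp[:, j, None] * diff).T @ diff / nk[j] + reg * np.eye(d)
    return means, covs, resp.argmax(axis=1)


# Synthetic stand-in for a researchers-by-terms TFIDF matrix (one row per researcher).
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.05, size=(10, 6)) + np.array([1, 1, 1, 0, 0, 0])
group_b = rng.normal(0.0, 0.05, size=(10, 6)) + np.array([0, 0, 0, 1, 1, 1])
tfidf_like = np.vstack([group_a, group_b])

coords = pca_project(tfidf_like, n_dims=2)    # 2-D positions for the map
means, covs, labels = fit_gmm(coords, k=2)    # k plays the role of the slider value
print(labels)
```

Each fitted mean and covariance pair describes one ellipse of the kind shown in Fig. 1C, and the labels give the color group of each dot. Note that the sketch stores researchers as rows, whereas the text describes them as columns of the TFIDF matrix; the two layouts are transposes of each other.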

Conclusion and Ongoing Work

As PeopleMap continues to gain adoption, we plan to enhance it by exploring more embedding techniques, such as Transformer [5] models like BERT, to improve information extraction from researcher datasets and the visualization of researcher embeddings. Additionally, we will investigate the usability and accuracy of other topic modeling techniques, such as employing non-negative matrix factorization (NMF) [1] to identify research fields of interest in a dataset. This can potentially allow us to enhance the explorability of PeopleMap by providing visualized labels for user-selected clusters.

Additionally, we plan to conduct lab studies to evaluate PeopleMap's usability, and work with administrators and industry partners to better understand how PeopleMap could support their discovery of relevant researchers for their diverse array of research projects. As we test PeopleMap with different research entities, we will better understand how the visualization and embedding techniques may work in different conditions to identify potential constraints, such as dataset size and visualizing research interests. We look forward to more institutions adopting PeopleMap to complement their directories, so that both internal and external entities can better explore the diversity of their research expertise.

References

[1] J. Choo, C. Lee, H. Kim, H. Lee, Z. Liu, R. Kannan, C. D. Stolper, J. Stasko, B. L. Drake, and H. Park. Visirr: Visual analytics for information retrieval and recommendation with large-scale document data. In , pp. 243–244. IEEE, 2014.
[2] K. S. Jones. A statistical interpretation of term specificity and its application in retrieval. Journal of documentation, 1972.
[3] O. Küçüktunç, E. Saule, K. Kaya, and Ü. V. Çatalyürek. Theadvisor: a webservice for academic recommendation. In Proceedings of the 13th ACM/IEEE-CS joint conference on Digital libraries, pp. 433–434, 2013.
[4] E. Medvet, A. Bartoli, and G. Piccinin. Publication venue recommendation based on paper abstract. In 2014 IEEE 26th International Conference on Tools with Artificial Intelligence