Augmenting Scientific Papers with Just-in-Time, Position-Sensitive Definitions of Terms and Symbols
Andrew Head, Kyle Lo, Dongyeop Kang, Raymond Fok, Sam Skjonsberg, Daniel S. Weld, Marti A. Hearst
Andrew Head, [email protected], UC Berkeley
Kyle Lo, [email protected], Allen Institute for AI
Dongyeop Kang, [email protected], UC Berkeley
Raymond Fok, [email protected], University of Washington
Sam Skjonsberg, [email protected], Allen Institute for AI
Daniel S. Weld∗, [email protected], Allen Institute for AI
Marti A. Hearst, [email protected], UC Berkeley
ABSTRACT
Despite the central importance of research papers to scientific progress, they can be difficult to read. Comprehension is often stymied when the information needed to understand a passage resides somewhere else—in another section, or in another paper. In this work, we envision how interfaces can bring definitions of technical terms and symbols to readers when and where they need them most. We introduce ScholarPhi, an augmented reading interface with four novel features: (1) tooltips that surface position-sensitive definitions from elsewhere in a paper, (2) a filter over the paper that "declutters" it to reveal how the term or symbol is used across the paper, (3) automatic equation diagrams that expose multiple definitions in parallel, and (4) an automatically generated glossary of important terms and symbols. A usability study showed that the tool helps researchers of all experience levels read papers. Furthermore, researchers were eager to have ScholarPhi's definitions available to support their everyday reading.
CCS CONCEPTS
• Human-centered computing → Interactive systems and tools.
KEYWORDS
interactive documents, reading interfaces, scientific papers, definitions, nonce words
1 INTRODUCTION
Researchers are charged with keeping on top of immense, rapidly-changing literatures. Naturally, then, reading constitutes a major part of a researcher's everyday work. Senior researchers, such as faculty members, spend over one hundred hours a year reading the literature, consuming over one hundred papers annually [89]. And despite the formidable background knowledge that a researcher gains over the course of their career, they will still often find that papers are prohibitively difficult to read.
As they read, a researcher is constantly trying to fit the information they find into schemas of their prior knowledge, but the success of this assimilation is by no means guaranteed [8]. A researcher may struggle to understand a paper due to gaps in their own knowledge, or due to the intrinsic difficulty of reading a specific paper [8]. Reading is made all the more challenging by the fact that scholars increasingly read selectively, looking for specific information by skimming and scanning [33, 63, 90].
∗ Also with University of Washington.
Figure 1: ScholarPhi helps readers understand nonce words—unique technical terms and symbols—defined within scientific papers.
When a reader comes across a nonce word that they do not understand, ScholarPhi lets them click the word to view a position-sensitive definition in a compact tooltip. The tooltip lets the reader jump to the definition in context. It also lets them open lists of prose definitions, defining formulae, and usages of the word. ScholarPhi augments the reading interface with this and a host of other features (see Section 4) to assist readers.
We are motivated by the question "Can a novel interface improve the reading experience by reducing diversions and distractions that interrupt the reading flow?" This work takes a measured step to address the general design question by focusing on the specific case of helping readers understand cryptic technical terms and symbols defined within a paper, which are called "nonce words" in the field of linguistics. Formally, a nonce word is a word that is coined for a particular use, which is unlikely to become a permanent part of the vocabulary [59]. Because a nonce word is localized to a specific paper, a reader cannot know precisely what it means when they start reading the paper. Because it is only intended for use within a single paper, it is likely to be defined somewhere within that same paper, but finding that definition may require significant effort by the reader. By their nature, nonce words are an interesting focus for augmenting reading tools because readers will have questions about them, and those questions will be answerable exclusively by searching the text that contains them.
Figure 2: One challenge to reading a paper is making sense of the hundreds of nonce words within them.
Nonce words, like the symbols, abbreviations, and terms shown in this figure, are defined within a paper for use within that paper. As such, a reader cannot know what they mean ahead of time. Quintessential examples of nonce words in the computer science literature are mathematical symbols, and abbreviations for metrics, algorithms, and datasets.
Two aspects of nonce words constrain the design of any reading application that is built to define them. First, they are abundant. A paper can contain hundreds of them; indeed, a single passage may contain dozens closely packed together. For example, an equation may be comprised of dozens of symbols, and sub-symbols that make up those symbols. Similarly, tables of results may have rows and columns which are indexed by abbreviations for metrics, datasets, and experimental conditions (Figure 2). In such settings a reader is likely to have demands on their working memory and may also want to see definitions for multiple nonce words in the same vicinity. Second, nonce words are sometimes assigned multiple definitions within the same paper. One example is a symbol like k, which over the course of a single paper may variously stand for a dummy variable in a summation operation, the number of components in a mixture of Gaussian models, and the number of clusters output by a clustering algorithm (see the scenario in Section 4). These two aspects of nonce words raise the question of whether conventional solutions for showing definitions of terms (e.g., the electronic glossaries explored in second-language learning research [14, 94] or Wikipedia's page previews [61]) also suit a researcher who is puzzling their way through dense, cryptic, ambiguous notation.
In this work, we design, develop, and assess ScholarPhi, a prototype tool for helping readers concentrate on the cognitively demanding task of reading scientific papers by providing them efficient access to definitions of nonce words.
This paper begins with a formative study of nine readers as they read a scientific text of their own choice (Section 3.1). Most readers expressed confusion at nonce words in the text. Many readers were reluctant to look up what the words meant given the anticipated cost of doing so. This inspired the subsequent design of tools that could have answered those readers' questions while requiring so little effort that readers would actually use the tool.
We then describe design motivations for a new reading interface (Section 3.2), grounded in insights from four pilot studies of early prototypes of ScholarPhi, conducted with 24 researchers. Key insights from the research include the importance of tailoring definitions to the passage where a reader seeks to understand a nonce word, and the competing goals of providing scent (i.e., visual cues [68]) of what is defined without distracting from a reading task that is already cognitively demanding on its own.
Building on the motivations found in the pilot research, the ScholarPhi system is presented (Section 4). The basic design of ScholarPhi is one of an interactive hypertext interface. A reader's paper is augmented with subtle hyperlinks indicating which nonce words can be clicked in order to access definition information. Readers can click nonce words to access definitions for those words in a compact tooltip (Figure 1). These definitions are position-sensitive—that is, if there are multiple definitions of a nonce word in the text, ScholarPhi uses the heuristic of showing readers the most recent definition that appears before the selected usage of the nonce word. Definitions are also linked to the passage they were extracted from: a reader can click on a hyperlink next to the definition to jump to where it appears in the paper.
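The position-sensitive heuristic reduces to a "latest definition before this position" lookup over a word's definition sites. The following is a minimal sketch under the assumption that definitions are indexed by character offset; the function and data are illustrative, not the authors' implementation:

```python
from bisect import bisect_left

def most_recent_definition(definitions, usage_pos):
    """Return the definition text that most recently precedes a usage
    of a nonce word, or None if the word has not yet been defined at
    that point in the paper.

    `definitions` is a list of (char_offset, definition_text) pairs,
    sorted by offset. A definition at the usage position itself is not
    returned, so a reader inside the defining sentence is not shown
    the sentence they are already reading.
    """
    offsets = [offset for offset, _ in definitions]
    i = bisect_left(offsets, usage_pos)  # first definition at/after the usage
    return definitions[i - 1][1] if i > 0 else None

# The symbol "k" redefined over the course of one paper:
defs_of_k = [
    (120, "dummy variable in a summation"),
    (900, "number of Gaussian mixture components"),
    (2400, "number of clusters output by the algorithm"),
]
most_recent_definition(defs_of_k, 1000)  # -> "number of Gaussian mixture components"
most_recent_definition(defs_of_k, 50)    # -> None ("k" not yet defined)
```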
In addition to definitions, the tooltip makes available a list of all usages of the nonce word throughout the text (including definitions), as well as a special view of formulae that include the nonce word.
Beyond these basic affordances, ScholarPhi provides a suite of features, each of which provides readers with efficient yet non-intrusive methods for accessing information about nonce words. First, ScholarPhi provides efficient, precise selection mechanics for selecting mathematical symbols and their sub-symbols through single clicks, rather than error-prone text selections (Section 4.1). Second, ScholarPhi provides a novel filter over the paper called "declutter" that helps a reader search for information about a nonce word by low-lighting all sentences in the paper that do not include that word (Section 4.2). Third, ScholarPhi generates equation diagrams and overlays them on top of display equations, affixing labels to all symbols and sub-symbols in the equation for which definitions are available (Section 4.3). The final feature is a priming glossary comprising definitions of all nonce words that appear in a paper, prepended to the start of the document to let a reader review the nonce words for a paper before they begin to read it (Section 4.4).
The emphasis in the design of each of these features is on acknowledging the inherent complexity of the setting of scientific papers, and hence designing features for looking up definitions that are easy to invoke and minimally distracting. To enable these features, new methods were introduced for analyzing scientific papers in order to make nonce words interactive. A paper processing pipeline was built that automatically segments equations into symbols and their sub-symbols, detects all usages of a nonce word, and detects the precise bounding box locations of nonce words so that they may be clicked.
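In spirit, the "declutter" filter partitions a paper's sentences by whether they mention the selected nonce word; non-matching sentences are faded rather than hidden, preserving the page layout. An illustrative sketch, not ScholarPhi's actual implementation: word-boundary matching matters so that a one-letter symbol like k does not match inside ordinary words.

```python
import re

def declutter(sentences, nonce_word):
    """Split sentences into those kept at full opacity (they mention
    the nonce word) and those to be low-lighted (they do not).
    Matching is bounded by non-word characters so that a short symbol
    such as "k" does not match inside words like "benchmark"."""
    pattern = re.compile(r"(?<!\w)" + re.escape(nonce_word) + r"(?!\w)")
    keep, fade = [], []
    for sentence in sentences:
        (keep if pattern.search(sentence) else fade).append(sentence)
    return keep, fade

sentences = [
    "Let k denote the number of clusters.",
    "We evaluate on three benchmark datasets.",
    "We vary k from 2 to 10.",
]
keep, fade = declutter(sentences, "k")
# keep -> the two sentences that mention "k"; fade -> the rest
```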
A custom PDF annotation tool was built to facilitate the manual extraction of definitions and the annotation of nonce words with these definitions. This pipeline was sufficient for enabling us to research the design and use of ScholarPhi, and has been made available for public use (see the project repository at https://github.com/allenai/scholarphi). While details of the pipeline are not the focus of this paper, the pipeline is responsible for the operation of the system, and hence is described in Appendix A.1.
This work concludes with a controlled usability study with twenty-seven researchers (Section 5). Researchers were observed as they used three versions of ScholarPhi—one with all the features described above, one with only the "declutter" feature, and one that behaved exactly like a standard, un-augmented PDF reader. When readers had access to ScholarPhi's features, they could answer questions about a scientific paper in significantly less time, while viewing significantly less of the paper in order to come to an answer. They also reported that they found it easier to answer questions about the paper, and were more confident about their answers, when they used ScholarPhi. Researchers were also observed as they used ScholarPhi for a fifteen-minute period of unstructured reading time. Researchers made use of all of ScholarPhi's features. Feedback was overwhelmingly positive. Most participants expressed an interest in using the features "often" or "always" for future papers, with a particular emphasis on the perceived utility of definition tooltips and equation diagrams.
In summary, this work makes four contributions. First, it characterizes the problem of searching for information about nonce words as one of the challenges of reading scientific papers, grounded in a small formative study.
Second, it provides design motivations for interactive tools that define nonce words, grounded in the iterative design of a tool. Third, it presents ScholarPhi, an augmented reading interface with a suite of novel features for helping readers understand nonce words in scientific papers. Finally, it provides evidence of the usefulness of the design in searching and reading scientific papers through a controlled study with twenty-seven researchers.
2 RELATED WORK
Researchers read papers to become aware of foundational ideas and to stay apprised of the latest developments in their field. However, reading papers is difficult. Challenges in reading a paper can come from gaps in the reader's knowledge, or from ideas in the paper that are poorly explained [8]. Papers may be read out of order and piecemeal [8, 33, 63]. As a result, a passage of a paper may be read out of context. The need to assimilate information scattered across one or many papers is representative of what has been observed in the human-computer interaction community for the active reading behavior of other types of knowledge workers [2, 65, 86].
Papers that include mathematical content can impose additional demands on a reader. Reading mathematical texts often entails grappling with unfamiliar terminology and notational idioms, which can be particularly challenging for less experienced readers [82]. Self-reports from mathematicians have suggested that the process of reading math involves backtracking as a reader attempts to scaffold their understanding [92], a pattern which has also been observed in eye-tracking studies of reading math [37, 46]. When attempting to understand an equation, readers will look to nearby equations and text for clarifications [46].
While reading papers in physical volumes and print-outs used to be the norm, it is increasingly the case that researchers consult papers in digital reading applications [51, 89], particularly for some types of scholarly communication such as conference proceedings [51]. This suggests the value of investing in reading user interfaces that take advantage of the unique interactive potential of digital interfaces to augment the reading experience.
Since the beginning of human-computer interaction as a discipline, one of the foundational challenges has been equipping knowledge workers with tools that extend their cognition during reading. Vannevar Bush, in his vision of the memex, proposed a system that enabled readers to build trails across the literature, linking passages across related readings in a way that makes implicit connections clear [10]. This vision has expressed itself in many forms, from the invention of hypertext [19] to experiments with interactive books [64] and "fluid documents" that can adapt their form and content to elaborate where readers need clarification [12]. In the first decade of the CHI conference, myriad techniques were proposed to help readers navigate text using social annotations [32], augment hypertexts with glosses that could dynamically change the layout of the text [98], and provide navigational affordances that allowed readers to see overviews of document content and jump quickly to passages of interest [28, 78].
Today, many reading and editing tools show dictionary definitions on hovering over or clicking on words. The Word Wise feature in the Amazon Kindle lets readers view definitions of tricky words in the space between consecutive lines of text [54]. In 2014, Wikipedia began to roll out page previews, a feature that allowed readers to preview the content of a referenced page by hovering over a link to that page. Based on positive usability evaluation results, Wikipedia decided to make the feature a permanent fixture on the site [61]. Recent proceedings of human-computer interaction conferences have introduced prototypes that allow readers to answer their questions about how to use web pages [15], the meaning of cryptic programming syntax [30], hard-to-visualize quantities [36], and unfamiliar words from a second language [55].
ScholarPhi uses an advanced symbol-selection technique that draws on related work. Zeleznik et al. [97] introduced gestures for a multi-touch display that support the efficient selection of mathematical expressions. Bier et al. [9] designed a technique for rapid selection of entities (such as addresses) with a single click. The symbol selection mechanism in ScholarPhi can be seen as a combination of these two features, supporting single-click selection of mathematical expressions, with refinement of the selection to choose specific sub-symbols of that expression via additional clicks. In the future, ScholarPhi may support the efficient selection of many nonce words at once in a passage using fuzzy text selection techniques such as those proposed by Hinckley et al. [34] and Chang et al. [13].
ScholarPhi was designed to provide the efficiency of the visual querying present in contemporary code editors like VSCode [91], in which arbitrary text (i.e., a variable or expression) can be selected, and all other appearances of that same text are instantly highlighted everywhere else in the text. In the design of its lists of definitions and usages, ScholarPhi also draws inspiration from tools such as LiquidText [87], which support viewing lists of within-text search results side-by-side with the query term highlighted. In its design of the "declutter" filter, ScholarPhi draws on the design of visual filters already present in prototype and production tools. The fading out of content in order to direct a reader's focus to information of interest is a design pattern that has been put to use in interactive tutorials [41], in which instructions are highlighted while the rest of the user interface is faded, as well as in interactive debugging tools [24, 44].
On the whole, evidence has supported the use of embedding explanations in texts. In the context of second-language learning, embedded glosses for unfamiliar vocabulary have been shown to lead to vocabulary learning [94] and improved comprehension [88].
That said, in making texts interactive, there is a key tension between assisting the reader and distracting them. On the one hand, studies such as one run by Rott [74] suggest that the best comprehension outcomes can be achieved when all words that have glosses are marked. On the other hand, interactive texts change readers' behavior. Understandably, readers are more likely to click on words that are visibly interactive [22], leading to what has been called by some "click-happy behavior" [73]. Furthermore, studies of texts augmented with hyperlinks have sometimes shown that these augmentations led to worse comprehension of the texts, rather than better comprehension [23]. What the evidence suggests overall is that amidst the appeal of interactive reading interfaces, great care must be taken during design to make sure not to introduce features that will ultimately distract readers from the cognitively demanding task of reading.
Tools can help researchers read scientific papers in a number of ways. To reduce the need to click away from the paper currently being read, some journal publishers now allow readers to view metadata by clicking on citations [25, 71, 84]. Experimental tools have been built that augment papers with additional information about cited papers [70], bias in study design [56], and links to external learning resources [39, 53]. Tools have also supported explanations from one person to another, allowing peer reviewers [67], collaborators [96], instructors [60], strangers [26], and crowds [38] to annotate and discuss arbitrary passages of papers. Other approaches to saving the scientist time include tools to support literature search (e.g., [69, 72, 99]), summarize the text [11, 79], or rewrite passages in simpler language [43].
Reading interfaces can also help researchers by helping them navigate to information of interest within the paper. For several years, interfaces for reading PDFs have provided standard affordances for jumping to hyperlinks within the paper. Typesetting software like LaTeX can automatically embed clickable links from references to figures, equations, and sections to the content they refer to, and from citations to reference sections. Prototype tools have been built to further assist readers in finding passages about topics of interest [28], in jumping between a passage that describes research results and the relevant parts of data tables [5, 42, 48], and in jumping to passages that answer natural language questions [100]. Other research has augmented static figures in papers with interactive overlays [29, 57].
Of particular relevance to this paper is a class of experimental systems that surface explanations of terms and symbols in scientific papers. Tools have been developed that link from terms to pages that define them on Wikipedia [1], expand acronyms [81], and direct from key phrases in papers to topic pages where those phrases are defined and relevant excerpts for those topics are drawn from other papers [80].
In response to the unique challenges of reading mathematical texts, prototype tools have been designed to help readers find explanations of math expressions in the text [3, 46, 66]. The e-Proof system focuses on single, single-page proofs rather than papers. As part of a guided tour of a proof, it selectively fades out the parts of the proof that are not currently the focus of the tour [3]. Another approach is a prototype that lets readers look up the meanings of operator symbols in external knowledge bases, and reveals simplified versions of equations with details elided [47]. Of these tools, only Alcock's [3] was evaluated with human users (see [75, 76]) in a math education setting. It was found that while readers used the tools of their own accord [75], many features that were introduced to assist readers, such as audio walkthroughs of the content, got in readers' way [76]. ScholarPhi consolidates and extends features from these prior prototypes, and introduces additional features and affordances, with the goal of helping readers understand nonce words in papers. This work contributes a system that can help scale these interactions to scientific papers, and an understanding of how to tailor the interactions to better support readers, based on iterative evaluation of interactive prototypes.
The design principles for ScholarPhi are based on several rounds of iterative design, first with a formative study of how scholars currently read scientific texts, and then with increasingly fleshed-out versions of the prototype. This section simultaneously reports on these rounds of iterative design and presents the resulting design principles.
To better understand how the presence of nonce words affects the reading experience, we conducted a small formative study. Nine readers (four graduate students, five undergraduate students, referred to as R1–9 below) participated in an observational study in which they read a scientific text of their own choice. Six participants brought research papers (R1–5, R8). Five of these papers were about computer science and one was about architecture. Three participants brought instructional texts on the topics of data science (R6), experimental design (R7), and formal analysis (R9). These latter three participants were included to see whether the obstacles encountered in reading scientific papers also occurred for readers of other types of scientific texts.
Readers were asked to read their text for forty minutes. During this time, they thought aloud. Readers reported when they encountered confusing passages of text. Then, they described whether they intended to look up information to clarify their confusion. If they chose to look for such clarifying information, they described where they looked and why. Our findings were as follows:
All but one reader expressed confusion at a term used in the text (R1–3, R5–9). In some cases, the confusion was about a term that was specific to the scientific discipline of the text (R3, R5–9), such as the terms "diacritic" (R3) or "population parameter" (R6). For papers from computer science, such terms included both benchmarks used to test an algorithm (R3) as well as baselines against which an algorithm of interest was compared (R5).
In other cases, the terms causing confusion came from within the same paper. Authors introduced terms to describe their methods ("symbolic validator" (R1), "backtranslation" (R3)) that had nuanced meanings within the text, but whose meaning the reader could not summon when viewed out of the context of its definition.
Authors would invent shorthand for running examples (e.g., a test set of cow images named "cow") that they then referred back to by that shorthand (i.e., "cow") throughout the figures, which could be confusing if the reader was reading the text out of order (R5). Texts could also be sprinkled with vague back-references to assumptions (R5), analyses (R6), parameters (R8), and theorems (R9) that readers could not recall. In some cases (R6, R8), readers were not sure whether a reference referred to a passage in the current text or in another text.
Mathematical symbols were another source of confusion (R2–4, R6). Questions that readers had about symbols were of several kinds. Readers sometimes could simply not understand the meaning of a symbol (e.g., "Θs", "M", "p", "q", "x", "y", "ŷ", R2–4, R6). In other cases, they wanted information about how a set of symbols were used in combination. For example, R4 scanned the appendix of the research paper they were reading to better understand the meaning of a ratio "M/N" that appeared in one of the equations. Readers also wondered about the values that symbols were assigned (R2, R3, R6). For example, one reader (R2) wondered what value the regularization parameter λ was set to when a model was trained. Another reader (R3) wanted to see example data that could be used as inputs x and y to a translation algorithm.
Thus, confusion about terms and symbols (nonce words, in our terminology) was common among the readers in the study. Readers' strategies for resolving this confusion varied based on how important it was that they understood a nonce word. If it mattered that they understood a nonce word, a reader often attempted to infer meaning from context (R3, R6–9). If they could not surmise the meaning from context, readers would sometimes delay looking up an explanation with the hope that they might find one later in the text (R1, R3, R4, R6–9).
A drawback of this approach, described by R1, is that a reader may reach a point in the text where they lack an understanding of so many important terms that they can no longer understand the text without stopping and searching for explanations.
Eventually, many readers needed to stop reading in order to look up explanations. One participant referred to this as an undesirable "context switch" which takes them out of the "headspace" of understanding a complicated passage (R4). When looking for explanations, five readers looked elsewhere in the same text (R2–4, R8, R9). This entailed backtracking within the text (R3, R4), jumping forward (R2, R4), opening within-text glossaries (R8), and performing within-text search (i.e., "Control-F" search) within the reading application (R9). Those reading instructional texts often consulted external references like web search results (R6, R8), dictionary applications (R7), and Wikipedia (R9). One reader took a proactive approach to reducing the cost of within-paper lookups by assembling glossaries for key symbols in the margins of the text (R4, see Figure 3).
Figure 3: When researchers have trouble understanding nonce words, they look up explanations elsewhere. One researcher in the formative study proactively assembled glossaries in the margins of the paper for key symbols (above). They annotated both brief text descriptions and miniature equation diagrams (see the annotation for T(i, j)).
This study indicated that readers of scientific papers, and scientific texts more generally, frequently have questions about nonce words. To answer these questions, readers either infer answers from context, wait for an answer, or look for explanations elsewhere. While readers do look for explanations elsewhere, they try to avoid doing so as it takes them away from the text they are trying to understand. These observations suggest that readers could benefit from interfaces that make explanations of nonce words available to them without distracting them from the task of a careful reading.
The design of ScholarPhi was refined through an iterative design process lasting twelve months. Improvements to the design were motivated by feedback from 24 researchers who used prototypes of the tool to read scientific papers. Four pilot studies were conducted, each one of a prototype of ScholarPhi at a different stage of design.
• Study D (Declutter lens only): 4 researchers (D1–4)
• Study S (Side notes containing definitions, defining formulae, and usages): 4 researchers (S1–4)
• Study T (Tooltips instead of side notes): 9 researchers (T1–9)
• Study E (Equation diagrams and a complex version of the tooltip interaction flow): 9 researchers (E1–9)
The first two studies (D and S) were observation studies. Participants thought aloud as they used the tool. For the second two studies (T and E), participants read on their own, participated in a 30-minute focus group discussion, and filled out a questionnaire about their experience. Seven participants in these last two studies (T1–3, E1–4) participated in a 15-minute follow-up interview. In each study, participants read a different scientific paper. Two researchers (S2, S3) participated in multiple studies.
Participant feedback motivated improvements to the interface, which are reported in Section 4. One author analyzed transcripts from all studies following a qualitative approach. This yielded the following six design motivations for designing effective interfaces for providing in-situ explanations within scientific texts.
M1. Tailor definitions to the location of appearance.
The same nonce word can have multiple conflicting definitions throughout a paper. For example, in the paper used as stimulus in the formal study [85], the symbol T took on multiple distinct senses, including referring to the dimensionality of a vector x_t, being part of a composite symbol T(j) used to refer to a layer in a neural network, and being used as the matrix transposition operation in several display equations. Additionally, several meanings of T were never explicitly defined.
When readers used a prototype that showed definitions of all of these senses in a list, they wanted to know which ones were the most appropriate to the passage that they were reading (S1–3). Readers requested that the tool show the definitions appropriate to the place where they asked for them (S1). They also asked to see the surrounding context of a definition (S2, S3).
A related principle is eliminating redundant definitions. If a reader selected a nonce word within a passage where it was being defined, they did not wish to see a tooltip containing the definition sentence they were already reading (S1, T9).
M2. Connect readers to definitions in context.
Four readers requested the ability to jump from a definition to the passage where it appeared in the paper (S1–3, T5, T6). This would aid in judging relevance (S1–3) and in assessing what appeared to be errors in the extraction algorithm (T5).
M3. Consolidate information.
While the information that explains a nonce word can be scattered across a paper, readers want explanations that consolidate all of that information in one compact, concise package. When they clicked on a composite symbol, they wanted to see explanations of each sub-symbol that made it up (E2, E4). They also expected the interface to be able to gather explanations for semantically similar symbols that differed in their surface features, such as showing a definition for "PMA(·)" that was extracted for the function "PMA(X)" (E1).
M4. Provide scent.
In all prototypes, nonce words were marked with a light dotted underline. Readers appreciated that the underlines provided scent of which words they could click to see definitions (S2–4). Participants did not turn off this affordance, although they were provided with this option in later versions of the design.
M5. Minimize occlusion.
In two prototypes, tooltips were packed with definitions, defining formulae, and usages for symbols. Readers reported that these tooltips occluded text that they wished to see (T4, T6, E7) without providing much value beyond the first definition (T1, T4–6). Still, some readers desired tooltips as opposed to side notes, as tooltips allowed them to view definitions without losing their place in the text (E3, E4). The current prototype attempts to balance these conflicting needs by providing a compact tooltip that contains only the most recent definition of a nonce word and a few small buttons for accessing lists of definitions, defining formulae, and usages. A tooltip for a nonce word can be hidden by clicking a "close" button within the tooltip.
M6. Minimize distractions.
The user interface was revised several times to remove features that, while originally envisioned as being helpful, distracted from the reading task. One reader aptly described, "I was trying to pay more attention to the paper than the tool and the paper requires a lot of overhead to understand. So I didn't have much left over for the tool" (E1). One prototype used several highlighting colors to indicate appearances, usages, and definitions of a selected nonce word; however, this added visual clutter that was hard to understand (E3). The current prototype uses a single static highlight color. Readers were asked across multiple studies whether they found the underlines beneath nonce words distracting. They repeatedly reported that they did not (S2–4, T5, T7). However, one reader did request the ability to turn them off (E1), which has been included in all recent prototypes of the interface.
We illustrate the experience of using ScholarPhi through a set of four scenarios, where a reader wishes to know the meaning of a specific nonce word. Each scenario is chosen so that one of ScholarPhi's features is uniquely well-suited to the reader's task. To explain the design decisions underlying a feature, we refer back to findings from the formative research. Specifically, we note whenever a design choice was informed by one of the design motivations M1–6 that were introduced in Section 3.2. Implementation details can be found in Appendix A.1.
When a reader wants to know the meaning of a nonce word, ScholarPhi lets them look up the meaning by clicking the nonce word. This reveals a definition tooltip (see Figure 1).

Definition tooltips appear directly beneath the selected nonce word. This placement is intentional. By placing the definition beneath the word, as opposed to placing it in a document margin or a glossary elsewhere in the text, a reader need not divert their gaze from the text. In this way, the tooltip placement is chosen to minimize distraction (M6). Furthermore, to avoid occluding the text (M5), tooltips are compact. Their dimensions never exceed half the page width, nor are they permitted to be taller than four lines.

If there are multiple definitions of a nonce word available within the paper, ScholarPhi shows the definition that it infers as being most relevant to the context. Specifically, it uses a heuristic of showing the definition that appeared most recently before that appearance of the word. This reduces the mental effort that seeing multiple definitions of the nonce word would incur (M1) and reduces the amount of text occluded by the tooltip (M5). In the passage shown below, k refers to an index of a component in a mixture of Gaussians. See also the video figure at https://bit.ly/scholarphi-video-walkthrough.
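The "most recent preceding definition" heuristic can be sketched as a simple lookup over definitions sorted by where they occur in the paper. The following is a minimal illustration under the assumption that definitions are stored as (character offset, text) pairs; the function name and data layout are ours, not ScholarPhi's actual implementation.

```python
from bisect import bisect_left

def most_relevant_definition(definitions, position):
    """Return the definition appearing most recently before `position`.

    `definitions` is a list of (offset, text) pairs sorted by offset;
    `position` is the character offset of the selected nonce word.
    Returns None if no definition precedes the position.
    """
    offsets = [offset for offset, _ in definitions]
    i = bisect_left(offsets, position)
    if i == 0:
        return None  # no definition occurs before this appearance
    return definitions[i - 1][1]
```

For the symbol k in the scenario above, a selection between the two definitions would return the mixture-component sense, while a selection after the second definition would return the cluster-count sense.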
However, in a later passage, k is given an entirely different meaning: a parameter that controls the number of clusters output by a clustering algorithm. When the reader opens a definition tooltip in this other passage, they also see the appropriate definition.

After seeing a definition in the tooltip, a reader may want more information about the nonce word. For instance, they may want to find whether the authors recommended that a specific number of k components be used in the mixture of Gaussians. To help the reader answer questions like this, ScholarPhi connects the reader to definitions in context (M2). The reader can view the definition in context by clicking the hyperlink next to the definition (i.e., "page 2" in the figure above). ScholarPhi scrolls the paper to the definition, highlighting the sentence that the definition came from. When the reader has finished consulting the highlighted passage, they can click their web browser's "Back" button to return to the definition tooltip at their previous position in the document.

Lists of usages.
A reader can also look for more information about a nonce word by reviewing the usages of the word. To connect a reader with these usages, the definition tooltip provides three buttons. The buttons let a reader open lists of all prose definitions of the word, all defining formulae (i.e., formulae in which the nonce word appears on the left-hand side of an assignment), and all usages (i.e., passages that refer to the nonce word). Together, the buttons provide a way for readers to access a consolidated collection of everything that ScholarPhi knows about a nonce word (M3). Each button provides scent that helps a reader understand how a nonce word is defined and used (M4). By hovering over a button, the reader can see how many definitions, defining formulae, or usages there are for the nonce word. The button is disabled when no definitions, defining formulae, or usages exist. When a reader clicks the button for "all usages," the list of usages opens in a dedicated sidebar, rather than in the tooltip, to avoid occluding the text (M5). For example, below we show the usages list for the nonce word V_parse.

Each usage in the list comprises one sentence referring to the nonce word and a link to the sentence where it appears in the paper (M2). To help readers evaluate the relevance of a usage, which can contain the visual clutter of dense text and equations, the nonce word is highlighted wherever it appears in a usage.

To avoid disorienting the reader, a tooltip always makes the same information available to a reader in the same layout: buttons for lists of definitions, defining formulae, and usages, as well as a definition if one is available. If a tooltip is opened for a nonce word within the sentence where the word is defined, the definition tooltip reports, "Defined here." This way, tooltips do not distract the reader from the text with a definition they have already seen, or are about to see (M6).
If no definition exists for the nonce word, then the three buttons to access the usage lists are still shown, but those with no information behind them are grayed out.

Scent.
While some nonce words are defined in a paper, others are not. Authors may assume the meaning of a nonce word is implicit, or they may simply forget to define the nonce word. ScholarPhi provides visual scent [68] to help readers determine whether they will find a definition for a nonce word before they click on it (M4). This visual scent is provided in the form of a subtle dotted gray underline beneath the nonce word. For instance, in the following passage, readers can open definition tooltips for any of the underlined nonce words: "CoNLL-2005," "SRL," and "LISA."

So that it does not divert a user's attention from the text needlessly (M6), ScholarPhi assumes that a reader will not want to view a definition for a nonce word in a sentence that defines it, and so does not underline such nonce words. The rules for underlining symbols are more nuanced. Papers can contain composite symbols where certain sub-symbols (e.g., subscripts or superscripts) are defined, but the symbol as a whole is not. In such a case, ScholarPhi underlines the sub-symbols for which definitions are available. In the passage below, ScholarPhi underlines symbols to indicate it has definitions for "t," "X," and "r_t." Because the composite symbol "y_prpt" is defined in the sentence, it is not underlined.

Symbol selection.
In a conventional interface for reading papers, one challenge to searching for information about a symbol is simply selecting the symbol. Because the text for a symbol is often split across multiple baselines (i.e., in subscripts or superscripts), conventional text selection mechanisms may fail to select precisely those characters that belong to the symbol. To reduce the cost of accessing explanations, ScholarPhi supports efficient selection of symbols. Symbols can be selected by clicking them once (steps "1" and "2" below). Once a symbol is selected, all sub-symbols that belong to it are highlighted and can be selected with a click ("3").
By helping readers rapidly select sub-symbols, it is hoped that ScholarPhi lets readers understand the meaning of a composite symbol in terms of the meanings of its parts (M3).

Beyond the core features of definition tooltips and lists of usages, ScholarPhi provides three innovative views to help readers access definitions of nonce words when and where they need them, which we describe next.
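The selection and underlining behaviors described above imply a simple recursive structure over composite symbols. Below is a hypothetical sketch of that structure; the Symbol class, its fields, and both methods are our own illustration of the stated rules, not ScholarPhi's code.

```python
class Symbol:
    """A (possibly composite) symbol, e.g. T^(j) with sub-symbols T and j."""

    def __init__(self, tex, children=(), has_definition=False):
        self.tex = tex
        self.children = list(children)
        self.has_definition = has_definition

    def selectable(self):
        """Once a symbol is selected, it and every sub-symbol below it can
        be selected with a single click (no cross-baseline text dragging)."""
        out = [self]
        for child in self.children:
            out.extend(child.selectable())
        return out

    def underlined(self):
        """Visual-scent rule: underline the whole symbol if it has a
        definition; otherwise underline only sub-symbols that do."""
        if self.has_definition:
            return [self]
        out = []
        for child in self.children:
            out.extend(child.underlined())
        return out
```

For a composite symbol whose superscript is defined but whose whole is not, `underlined()` returns only the defined sub-symbol, matching the rule that partially defined composites receive underlines only on their defined parts.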
To help readers quickly find information about a nonce word that is scattered across a paper, ScholarPhi provides a novel feature called "decluttering." When a reader selects a nonce word, ScholarPhi "declutters" the paper, highlighting segments of text that contain matches and fading out all other sentences, in an effort to help readers scan the paper for usages.
ScholarPhi provides visual scent (M4) of where usages can be found via a conventional search bar. The search bar counts how many times the nonce word appears in the paper, and shows the page number of the usage the reader selected. While readers are expected to navigate a decluttered document by scrolling through it, the search bar also supports navigation between usages with "Next" and "Previous" buttons and arrow-key keyboard shortcuts.

Decluttering has several advantages over the list of usages: it connects readers to definitions in context by providing a view that is grounded in the text (M2), and it reduces distractions by hiding content in the paper, rather than exposing additional user interface widgets (M6). Like the list of usages, decluttering does not occlude text (M5).
Some passages are rife with nonce words. For instance, tables of empirical results are indexed by abbreviations that represent experimental conditions and measurements. Equations contain dozens of symbols. For dense passages like these, readers may desire the ability to consult the definitions for many nonce words at the same time. For display equations in particular (i.e., equations that are shown on their own line, separated from the text), ScholarPhi provides the ability to view definitions of all symbols at the same time. To see the definitions of all symbols in a display equation, a reader can click that equation. Definitions are affixed to all symbols simultaneously. Definitions are shown for symbols (e.g., "V^(j)_h") and the sub-symbols they are composed of ("h", "j"). Thus, definition information that would otherwise be split across multiple tooltips is consolidated into one place (M3). Like the definitions that appear in tooltips, the definitions for equation diagrams are position-sensitive (M1). By clicking a label for a symbol, a reader can open the definition tooltip for the symbol, providing access through the definition tooltip to the context of the definition (M2).

Brushing and linking connects the definitions to the symbols; as a reader hovers over a definition, the symbol it defines is highlighted with a more saturated color than the other symbols. Leader lines connect the definitions to the symbols. The leader lines are diagonal, proceeding straight from the definition label to the symbol. This style was chosen as opposed to orthogonal leaders (i.e., leaders comprising one horizontal and one vertical segment). While orthogonal leaders have generally been observed to have greater legibility [6], we found that diagonal lines stand out better amidst the clutter of other marks in an equation (M6).
Scientific texts like textbooks often contain glossaries that allow readers to look up definitions of terms in a predictable place. One type of glossary that can be particularly helpful to readers is what Widdowson [93, page 82] called a "priming glossary," a glossary that is shown to readers before a text to help prepare them for problematic words that may appear in the text. ScholarPhi prepends a priming glossary to scientific papers. The glossary includes a list of key terms and symbols, ordered by their appearance in the paper.

The glossary is intended to help readers in two ways. First, it lets them prime themselves on the terms and symbols that will be used in the paper. Second, it provides a reference that can be printed and viewed side-by-side with the paper. One advantage to presenting definitions in a priming glossary, as opposed to tooltips, is that definitions for all nonce words can be consolidated into one place (M3), letting a reader understand groups of related nonce words by studying all of their definitions together. Furthermore, the glossary provides scent (M4) of the density of nonce words, and the presence of definitions of those words, before readers start reading the paper.
We performed a formal remote usability study to ascertain the answers to the following questions: Do the features of ScholarPhi aid readers' ability to understand the use of nonce words when reading complex scientific papers? Do readers elect to use the features when given unstructured reading time? How are the features used to support the reading experience?

In a within-participants design, we compared the full features of ScholarPhi to a simplified version and a standard PDF reader on a series of close reading tasks on a machine learning paper. The quantitative and subjective results were strongly in favor of the affordances supplied by ScholarPhi over a standard PDF reader, with one exception.
Participants.
The criterion for inclusion was having previously read a machine learning paper. A total of 27 participants were recruited through university and company mailing lists. 18 were doctoral students, 5 were master's students, 3 were undergraduate students, and 1 was a professional researcher. 13 of the 27 participants identified their discipline as machine learning, and 21 were somewhat or very comfortable with reading machine learning papers. Participants were thanked with a $20 (USD) gift certificate. All study sessions were 1 hour long and held remotely over Zoom, a video conferencing platform; participant interactions were logged and screen activity was captured. Participants opened a version of the application in a private browser window and were asked to share their screen with the experimenters. This led to participants using the interface on a variety of screen sizes and configurations.
Stimulus Paper.
For this study, all participants read "Linguistically-Informed Self-Attention for Semantic Role Labeling" (LISA) [85]. (Several examples in Section 4 are drawn from this paper.) This paper makes for an interesting case because it won a best paper award and yet some notation is used inconsistently and some symbols are never defined explicitly.
Tasks.
Each 1-hour session ran as follows: (1) Greeting and consent form. (2) Interactive tutorial with all features on a two-page paper [18]. (3) Read the abstract of the stimulus paper. (4) Complete a timed practice question with the full interface. (5) Complete three timed test questions using each of the three test interfaces (4 minutes each), each followed by a question about confidence and ease of use. (6) Unstructured reading of the stimulus paper (15–20 minutes). (7) Questionnaire on background and subjective responses.

In the unstructured reading portion, participants were encouraged to make use of the tools if they anticipated they would be helpful. The intention of this segment was to observe which aspects of the tool were used when not under time pressure.
Interfaces.
Three interfaces were compared within-participants:

• "Basic" is a basic PDF reader with standard search functionality (specifically, being able to find words using "Control-F," with a toggle button to match case and the ability to highlight all matches).
• "Declutter" is a PDF reader with additional declutter functionality.
• "ScholarPhi" is a PDF reader with all ScholarPhi features.

Test questions.
The three multiple-choice test questions were each intended to assess a different aspect of the pain points identified by the formative studies.

• "Results": "According to Table 1, which model achieves the best recall on WSJ data when GloVe embeddings are used?"
• "Dataset": "Which text corpora is the CoNLL-2005 dataset made from? Select all that apply."
• "Symbols": "What does T (upper case) mean in this paper? Select all senses in which T is used."

Assessment Measures.
For each of the test questions, we measured the following quantitative metrics:

• "Confidence" is a five-point Likert scale variable indicating the participant's self-assessment of the following prompt: "I am confident I came up with the right answer." A score of 5 indicates strong agreement, and a score of 1 indicates strong disagreement.
• "Ease" is a five-point Likert scale variable indicating the participant's self-assessment of the following prompt: "It was easy to find the answer." A score of 5 indicates strong agreement, and a score of 1 indicates strong disagreement.
• "Time" is the number of seconds the participant spent to answer the question. It is measured from when the question first appeared on the participant's screen to when the participant clicked the next button or the question timer expired (whichever event occurred first).
• "Correct" is a binary variable indicating whether the participant's response to the question was correct. For questions requiring a response with multiple selections, a response was considered correct if it included all and only the correct selections.
• "Area" is the proportion of the full paper viewed. It is computed as the cumulative total pixel area viewed over the total available pixel area in the entire paper. It ranges between values 0 (none of the paper viewed) and 1 (entire paper viewed).
• "Distance" is a continuous variable measuring the cumulative (normalized) absolute vertical pixel distance, that is, the number of document lengths, traversed by the participant's screen. Normalization controls for different pixel heights across participants' devices. The distance between the top and bottom pixels on each page is set to 1/n_pages such that the entire paper's total height sums to 1.0; traversing the length of the paper twice would contribute 2.0.

Figure 4: Quantitative results for test questions. For continuous variables Time, Distance, and Area, the vertical bars indicate the mean; lower values are preferred. For Confidence, Ease, and Correct, higher values are preferred.

Unstructured reading task measurements. Measurements in the unstructured reading tasks included usage of key features and subjective feedback.
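The Distance normalization described above can be made concrete with a short sketch. The function and argument names below are ours; the computation simply divides cumulative vertical scroll travel by the total document height.

```python
def normalized_distance(scroll_positions_px, page_height_px, n_pages):
    """Cumulative vertical scroll distance measured in document lengths.

    Each page contributes 1/n_pages of the normalized height, so the whole
    paper sums to 1.0; scrolling the full paper twice contributes 2.0.
    `scroll_positions_px` is the sequence of recorded scroll offsets.
    """
    doc_height_px = page_height_px * n_pages
    traveled_px = sum(abs(b - a)
                      for a, b in zip(scroll_positions_px,
                                      scroll_positions_px[1:]))
    return traveled_px / doc_height_px
```

Because the result is normalized, two participants who traverse the same fraction of the paper receive the same Distance regardless of their screen resolutions.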
Assignment.
Using a repeated measures factorial design, we assigned each participant to three of nine possible configurations (interface-question pairs) while ensuring that (i) each participant observed each interface and each question type exactly once and (ii) all nine configurations had the same number of assigned participants.
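One standard way to satisfy both constraints is a cyclic Latin-square rotation over the three interfaces and three questions. The sketch below is our own illustration of such a counterbalancing scheme, not necessarily the exact rotation used in the study; it does satisfy properties (i) and (ii) for 27 participants.

```python
INTERFACES = ["Basic", "Declutter", "ScholarPhi"]
QUESTIONS = ["Results", "Dataset", "Symbols"]

def assignment_for(participant_index):
    """Assign a participant three interface-question pairs such that each
    interface and each question appears exactly once (a Latin-square row).

    Cycling the shift with the participant index covers all nine pairs in
    every block of three participants, so 27 participants yield an equal
    number of observations for every configuration.
    """
    shift = participant_index % 3
    return [(INTERFACES[i], QUESTIONS[(i + shift) % 3]) for i in range(3)]
```

For example, three consecutive participants jointly cover all nine interface-question configurations exactly once.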
Analysis.
For each of the quantitative measurements, we fit a generalized linear mixed-effects model (GLMM) with fixed effects for the interface and question factors (and a fixed-effects interaction term). Details can be found in Appendix A.2.
Reduced Controls Due to Remote Testing.
Since the study was held remotely, some standard controls could not be employed: the size of the screen, the speed of the user's computer (the PDF reader appeared to lag for some participants and not for others), and distractions in the environment (background noise could be heard for many of the participants). These differences might account for variation in performance and subjective accounts of the experience. Rather than degrading the quality of the data, these factors make the study better represent the variation we anticipate readers of this tool would experience in their own environments.
Figure 4 summarizes how the quantitative measures on the test questions vary across the three interfaces. We report results from two-sided t-test analyses of pairwise contrasts in Table 1. These results indicate which patterns shown in Figure 4 are statistically significant.

We observed that ScholarPhi outperformed the other interfaces in terms of subjective scores on Ease and Confidence (Declutter reported higher Ease than Basic, but not higher Confidence). ScholarPhi also outperformed the other interfaces in terms of time required to answer the test questions (Time); Declutter and Basic were not significantly different. While ScholarPhi and Basic performed equally on the number of participants answering test questions correctly (Correct), and Declutter reported higher Correct than both, these pairwise differences were not statistically significant. As such, we observed that participants using ScholarPhi were able to answer questions as correctly as they would using other interfaces, but in less time. Finally, we observed that participants traversed less screen Distance and viewed less Area of the paper under ScholarPhi and Declutter compared to Basic; ScholarPhi outperformed Declutter on Area but did not significantly outperform Declutter on Distance. Overall, these results suggest that even the lighter-weight version of the tool, with the Declutter overlay alone, yields benefits over the standard PDF reader, but the full set of features in ScholarPhi is especially beneficial.

Upon further inspection of the results on Correct, we found that the performance of participants on a particular question explained why ScholarPhi performed similarly to Basic (and why Declutter yielded slightly higher results than the other two). Participants performed better on both Results and Dataset using ScholarPhi, but performed very poorly on Symbols with this interface.
Recall from the discussion in Section 3.2 (M1) that the LISA paper uses the symbol T inconsistently and also does not define all senses of this symbol. We found that participants almost always answered this question incorrectly using ScholarPhi because the definitions did not show all of the usages, and the participants had the expectation that the definitions showed all of the senses of the term. This highlights an important potential drawback of a tool like ScholarPhi: it can mislead if it implies incorrect information.

Subjective responses from participants (referred to here as "readers" collectively, and P1–27 individually) were obtained both from oral comments during the study and from open-ended questions in the final questionnaire. Readers' impressions of ScholarPhi were overwhelmingly positive. Readers were enthusiastic about the support that ScholarPhi provided for the reading task. They described the tool as "cool" (P8), "very cool" (P13), "super cool" (P12), and "amazing" (P4, P16, P19). Eight of the 27 responses to the open-ended questionnaire forms contained exclamation marks conveying participant excitement for the tool. Several readers commented on the polish of the prototype (P7, P24), which reflects the careful refinement of the interface over several cycles of iterative design.

Readers appreciated ScholarPhi for three supporting roles they saw it as playing in reading tasks. First, its ability to preserve what multiple participants called "reading flow" (P16, P27). In the words of one participant, ScholarPhi helped them "focus on the aspects of the paper that interested me, and not waste time on other stuff" like reminding themselves of definitions (P4). The features provided timely reminders (P10, P21, P26), and eliminated the need to traverse "back and forth" within the paper (P11). Second, ScholarPhi helped them "check their understanding" of the meanings of nonce words (P16) and the passages of text they appeared in (P20).
Third, readers thought that ScholarPhi could help readers engage with papers that they otherwise would not have had the vocabulary to read easily (P4, P23), in effect "lowering the barrier" to reading papers in fields outside of one's expertise.
Anticipated usage.
To determine which of ScholarPhi's features would be of greatest interest to researchers in the future, and hence which features should be developed further, readers were asked to report how often they expected they would use each feature if it were available in the software they used to read papers. Readers expected they would use several of the features very frequently, including definition tooltips for symbols (16/27 "always", 8/27 "often"), definition tooltips for terms (15/27 "always", 9/27 "often"), and equation diagrams (17/27 "always", 6/27 "often"). The other features seemed to have less universal appeal, in particular declutter for symbols (5/27 "always", 13/27 "sometimes"), declutter for terms (2/27 "always", 15/27 "often"), and the priming glossary (8/27 "always", 6/27 "often"). Even amidst this variation, responses on the whole were positive. While readers could report that they "never" saw themselves using a feature, not a single participant reported they would never use one of the features.
To understand successes and gaps in the design, usage logs were collected during the unstructured reading task. All readers except for one (96%) used at least one of ScholarPhi's features during the unstructured reading time. Analysis of the aforementioned data led to the following conclusions about the usability of ScholarPhi's features:
Definition tooltips.
For most readers, tooltips were ScholarPhi's most essential feature. As noted above, it was the feature that the most participants anticipated using regularly if available in their reading interfaces. Readers appreciated tooltips for their intended purpose: their support for looking up definitions of nonce words that appeared elsewhere in the paper (P10). An additional use case was to check whether a passage the reader was consulting was indeed the definition of a nonce word, so the reader could make sure they were not missing information of interest (P2).

Readers used definition tooltips more than any other feature in ScholarPhi. All but three participants opened at least one tooltip for a symbol, and all but one participant opened at least one tooltip for a term. When readers used tooltips, they used them often: readers opened tooltips for symbols a median of 10 times and tooltips for terms a median of 5 times.

Declutter.
In contrast to tooltips, which were unanimously liked, the declutter feature saw disagreement. Some readers valued the feature, and others did not. Those who did not sometimes did not understand the point of the feature (P25), or thought the feature provided little value over the definition tooltips (P22). Others felt that the standard "Control-F" search provided a more efficient interface for searching a paper than scrolling through a paper with declutter (P2). One obstacle to using declutter was that, in contrast to standard text search, readers could only start searching with declutter if the nonce word they wanted to search for was within view. With standard text search, a search can be initiated anywhere by typing an arbitrary query into an always-available query widget. It could be frustrating not to be able to do the same with declutter, particularly if the reader wanted to temporarily deactivate declutter so that they could consult some of the hidden text, and then resume declutter for the same nonce word as before (P14).

That said, readers' behavior indicates that most readers likely expected declutter to be useful for finding answers to questions in a paper: all participants activated declutter at least once in the test task when they used an interface with only the declutter feature enabled. Several readers indicated that they believed declutter could be useful for finding information about nonce words (P6, P11, P15, P23, P26). One reader noted that the feature made the paper look "less cluttered," despite not having been told that the feature was in fact named "declutter" (P11). Furthermore, declutter could make readers feel "less overwhelmed" by the text in the paper (P27).
                   ScholarPhi vs. Declutter   Declutter vs. Basic   ScholarPhi vs. Basic
                   ŷn − ŷd        p           ŷd − ŷb      p        ŷn − ŷb      p
Confidence (1–5)   0.59
Ease (1–5)         0.93           <0.0001
Time (seconds)     -27.6                      -16.8        0.218    -45.4
Correct            -15%           0.393       +15%         0.393    0%           1.000
Distance           -0.90
Area               -11%                       -14%                  -25%         <0.0001

Table 1: Two-sided t-tests for pairwise contrasts. Reporting ŷi − ŷj and Holm-Bonferroni-corrected p-values [35], where ŷ is the estimated mean of y under the GLMM, and i, j correspond to interface options (b = Basic, d = Declutter, n = ScholarPhi). For example, the cell for (Time, ScholarPhi vs. Basic) can be interpreted as: ScholarPhi is associated with 45.4 fewer seconds in Time than Basic, on average. Correct and Area contrasts are reported as absolute, not relative, percentage-point differences. Statistically significant p-values are bolded. Further details are in Appendix A.2.

Lists of usages.
Nearly all (20/27) readers opened a list of definitions, defining formulae, or usages during the unstructured reading task. 18 readers opened a list of definitions, 3 opened a list of defining formulae, and 10 opened a list of usages. Some participants used the lists heavily; one participant opened the lists of definitions and usages eight times each (P4). Readers used the list of usages to develop an understanding of the purpose of the paper (P9) and to gather additional context to check their understanding of a term (P16).

One reader used the list of usages in a novel way, describing the list as a "guide" that supported non-linear reading (P27). The reader left the list open for minutes at a time. Because the usages pane loads usages for whatever nonce word is currently selected, they could jump from one passage to the next: finding usages of nonce words that drew their interest, jumping to those usages, and then viewing the usages of nonce words in the passage they had just jumped to. The reader believed that by supporting this reading pattern, the list allowed them to answer questions they had about the text as they were raised, rather than waiting for them to be resolved in a later passage.
Equation diagrams.
Equation diagrams were a favorite feature for many readers. More readers expected they would use this feature "always" for future readings than any other feature. Nearly all (21/27) readers opened at least one equation diagram during the fifteen-minute unstructured reading session, and most readers opened multiple; the median participant opened three equation diagrams while they read.

Priming glossary.
Among the features of ScholarPhi, priming glossaries were the least used during the reading task. A few readers (6/27) were observed consulting the priming glossary for a non-trivial amount of time, declared in our protocol to be 10 or more seconds. Although readers rarely consulted the priming glossary during the study, they saw the glossary as being useful in two scenarios. First, the priming glossary was envisioned as a useful tool for orienting to the terminology used in a paper before reading it (P13, P16). Two readers spent a substantial amount of time (i.e., two minutes (P16) and over five minutes (P1)) carefully studying the glossary at the beginning of the reading task. Second, readers hoped that the glossary might provide additional information about a nonce word that could not be found in a definition tooltip. In fact, it appeared that several readers accessed the glossary as a fallback when the definition tooltip did not contain the information that they were looking for (P3, P12, P14, P22).
Coordination of features.
Observations of readers' behavior offer evidence of the usability of the holistic set of features. The tool's features appeared to be discoverable. During the tutorial task, readers often discovered features on their own by tinkering with the interface, like the ability to jump to a definition in the paper from the definition tooltip (P2), or the presence of lists of definitions and usages (P6). Furthermore, during the unstructured reading task, readers sometimes chained together interactions with multiple ScholarPhi features as they sought information about nonce words. For instance, one participant (P6) clicked an equation to reveal a diagram, selected one of the symbols in the diagram, opened the list of definitions for the symbol, and then clicked on a link that took them to one of those definitions. Sequences of interactions like this sometimes lasted only a few seconds from start to end. Several readers chained interactions across multiple of ScholarPhi's features in a similar way (P6, P8, P13, P19). Readers' positive experiences with the individual features, as well as the tool as a whole, indicate the suitability of ScholarPhi's design for helping readers find what they need to know about nonce words in scientific papers.
The outcomes of the usability study produced the following answers to the research questions:
Do the features of ScholarPhi aid readers' ability to understand the use of nonce words when reading complex scientific papers?
Yes. When asked to answer questions requiring an understanding of nonce words, readers answered significantly more quickly with ScholarPhi than with a baseline PDF reader, while viewing significantly less of the paper.
Do readers elect to use the features when given unstructured reading time?
Yes. 96% of readers used ScholarPhi's features at least once during 15 minutes of unstructured reading time. Tooltips were the most frequently used feature: readers opened a median of 10 tooltips for symbols, and 5 for terms. Equation diagrams were opened a median of 3 times. Almost all participants opened a list of definitions, defining formulae, or usages at least once.
How are the features used to support the reading experience?
On the whole, readers used the features for the reasons expected: they referred to tooltips to remind themselves of forgotten definitions, activated declutter to find information about nonce words within a less cluttered view of the paper, and opened equation diagrams to view the definitions of many symbols at once. Readers also used the tools to support the reading experience in unconventional ways, for instance using the list of usages as a "guide" to support a non-linear, curiosity-driven reading, and skimming a section by jumping from one equation diagram to the next.
A major limitation of the usability study is its focus on a single paper, where performance was measured for only three tasks. Papers vary widely in clarity and readability. To improve the generalizability of the study, the paper was selected to be a widely-read scientific paper exhibiting some of the very problems the system was seeking to address. Furthermore, the three tasks were chosen to require an understanding of different types of nonce words: terms referring to datasets, baselines, and symbols. In the future, we will continue to evaluate ScholarPhi on a variety of research papers, as has been done to date through the iterative design process for the tool. A second limitation, which pertains to the tool's suitability for supporting unstructured reading, is that readers in the study only used the tool for 15–20 minutes, and may not have had enough time to discover limitations that would preclude them from using the tool in the future. Observations from our pilot studies have suggested that readers continue to find aspects of the tool useful after 20 minutes of reading, but longitudinal studies are necessary to better assess how readers would employ ScholarPhi in day-to-day use.
The study of ScholarPhi has revealed three opportunities for future research to advance the potential of intelligent reading interfaces to aid in the authoring and reading of scientific papers.
Readers in the formative studies, pilot studies, and usability study all asked for the ability to look up definitions of terms that resided outside of a paper. This means that readers were looking for definitions of terms that were not nonce words, but were rather jargon or domain-specific vocabulary. Readers also asked for the ability to look up information about cited papers within the paper they were currently reading. Substantial further design work is needed to provide just-in-time, relevant definitions like these that connect readers to external information sources, though prototypes of such tools are already being built (e.g., [1, 39]). A key design challenge, and an opportunity for novel research, is how to address the design motivations of providing scent, tailoring definitions, and minimizing distractions in a setting where definitions are sourced from massive corpora comprising source documents of widely varying quality (e.g., other papers, Wikipedia, or the science blogosphere). Our research suggests that if this design problem can be solved in a well-designed tool, researchers would enthusiastically embrace that tool.
Are today's machine learning models up to the task of detecting definitions of nonce words so that ScholarPhi can be used for arbitrary papers? A recent study [4] indicates that the state-of-the-art algorithms for definition detection currently have a problem of recall when it comes to detecting definitions in scientific papers. This raises the question of whether readers would still want to use ScholarPhi if some definitions were not detected by the system, or if some of the predictions were wrong. Furthermore, it is unclear how best to tune the precision-recall tradeoff of an AI method, since we do not yet know whether false positives are more detrimental to the reader experience than false negatives. Researchers in human-computer interaction have explored how users interact with an imperfect artificial intelligence [45, 95]. Tools like ScholarPhi may benefit from an analogous thread of research that explores how models for augmenting texts with interactive affordances can convey uncertainty. Conventional solutions like showing multiple, alternative predictions (i.e., definitions) may not suit the setting of scientific papers, where showing additional definitions may distract and ultimately lead to disuse of the tool. The success of both human-computer interaction research into augmented reading and applied natural language processing depends on a co-exploration of the underlying algorithms and management of user expectations at the same time.
Is there a dual of ScholarPhi that could support the task of writing clear scientific papers? Such a tool might better support the goals of ScholarPhi than a post-hoc augmented reading interface by placing a small burden on an author in order to reduce the mental effort expended by the author's many readers. Features that an author might wish for include the ability to know when they have left a nonce word undefined, when they have used the same nonce word to mean two different things (as is often the case for symbols like k), and when they are using two redundant nonce words to refer to the same thing. The same paper processing technologies that can detect definitions and relate two nonce words to each other could suit writing just as well as reading. As we saw in the development of ScholarPhi, the design exploration of augmented writing interfaces likely needs to begin with careful observations of writers to understand how lightweight, non-intrusive features can support the writing task without distracting authors.

Our formative study showed that readers find nonce words in scientific texts confusing, but may choose not to look up what the nonce words mean given the anticipated cost of doing so. The ScholarPhi system was designed to help readers concentrate on the cognitively demanding task of reading scientific papers by providing them efficient access to definitions of nonce words. The iterative design of the system revealed that systems like ScholarPhi need to tailor definitions to the passage where a reader seeks an understanding of a nonce word, provide scent, and avoid distracting readers from their reading. A usability study with 27 researchers showed that when using ScholarPhi instead of a standard PDF reader, they could answer questions that required an understanding of nonce words in less time, while viewing less of the paper.
Readers could see using ScholarPhi's definition tooltips and equation diagrams "often" or "always" if they were available in their reading interface. These strong empirical results suggest that researchers are eager and ready for tools like ScholarPhi that support the reading task by providing just-in-time, position-sensitive definitions of nonce words when and where they need them.
ACKNOWLEDGMENTS
We thank Zachary Kirby, Jocelyn Sun, Luming Chen, Nidhi Kakulawaram, RJ Pimentel, and Benjamin Barantschik for their help in designing, building, and evaluating prototypes of the ScholarPhi system. We also thank Luca Weihs, Brendan Roof, and Alvaro Herrasti for developing a prototype algorithm for localizing colorized LaTeX equations that inspired the algorithm used in the ScholarPhi pipeline. This research receives funding from the Alfred P. Sloan Foundation, the Allen Institute for AI, Office of Naval Research grant N00014-15-1-2774, and the University of Washington Washington Research Foundation/Thomas J. Cable Professorship.
REFERENCES
[1] Takeshi Abekawa and Akiko Aizawa. 2016. SideNoter: Scholarly Paper Browsing System based on PDF Restructuring and Text Annotation. In Proceedings of the International Conference on Computational Linguistics. 136–140.
[2] Annette Adler, Anuj Gujar, Beverly L. Harrison, Kenton O’Hara, and Abigail Sellen. 1998. A diary study of work-related reading: design implications for digital reading devices. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 241–248.
[3] Lara Alcock. 2009. e-Proofs: Student Experience of Online Resources to Aid Understanding of Mathematical Proofs. In Proceedings of the Conference on Research in Undergraduate Mathematics Education.
[4] Anonymized authors. Under peer review.
[5] Sriram Karthik Badam, Zhicheng Liu, and Niklas Elmqvist. 2019. Elastic Documents: Coupling Text and Tables through Contextual Visualizations for Enhanced Document Reading. IEEE Transactions on Visualization and Computer Graphics 25, 1 (January 2019), 661–671.
[6] Lukas Barth, Andreas Gemsa, Benjamin Niedermann, and Martin Nöllenburg. 2019. On the readability of leaders in boundary labeling. Information Visualization 18, 1 (2019), 110–132.
[7] Douglas Bates, Martin Mächler, Benjamin M. Bolker, and Steven C. Walker. 2015. Fitting Linear Mixed-Effects Models Using lme4. Journal of Statistical Software 67, 1 (October 2015), 1–48.
[8] Charles Bazerman. 1985. Physicists Reading Physics: Schema-Laden Purposes and Purpose-Laden Schema. Written Communication 2, 1 (January 1985), 3–23.
[9] Eric A. Bier, Edward W. Ishak, and Ed Chi. 2006. Entity Quick Click: Rapid Text Copying Based on Automatic Entity Extraction. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 562–567.
[10] Vannevar Bush. 1945. As we may think. The Atlantic.
[12] … In Proceedings of the Symposium on User Interface Software and Technology. ACM, 123–132.
[13] Joseph Chee Chang, Nathan Hahn, and Aniket Kittur. 2016. Supporting Mobile Sensemaking Through Intentionally Uncertain Highlighting. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 61–68.
[14] Ying-Hsueh Cheng and Robert L. Good. 2009. L1 glosses: Effects on EFL learners’ reading comprehension and vocabulary retention. Reading in a Foreign Language.
[15] … In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1549–1558.
[16] Rune Haubo B Christensen. 2018. Cumulative Link Models for Ordinal Regression with the R Package ordinal. http://cran.uni-muenster.de/web/packages/ordinal/vignettes/clm_article.pdf.
[17] Avital Cnaan, Nan M. Laird, and Peter Slasor. 1997. Using the general linear mixed model to analyse unbalanced repeated measures and longitudinal data. Statistics in Medicine 16, 20 (1997), 2349–2380.
[18] Joseph Paul Cohen, Henry Z. Lo, Tingting Lu, and Wei Ding. 2016. Crater Detection via Convolutional Neural Networks. (2016). arXiv:1601.00978 [cs.CV]
[19] Jeff Conklin. 1987. Hypertext: An Introduction and Survey. Computer 20, 9 (September 1987), 17–41.
[20] Robert Cudeck. 1996. Mixed-effects Models in the Study of Individual Differences with Repeated Measures Data. Multivariate Behavioral Research 31, 3 (1996), 371–403.
[21] Kenny Davila, Ritvik Joshi, Srirangaraj Setlur, Venu Govindaraju, and Richard Zanibbi. 2019. Tangent-V: Math Formula Image Search Using Line-of-Sight Graphs. In Proceedings of the European Conference on Information Retrieval. Springer, 681–695.
[22] Isabelle De Ridder. 2002. Visible or invisible links? In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 624–625.
[23] Diana DeStefano and Jo-Anne LeFevre. 2007. Cognitive load in hypertext reading: A review. Computers in Human Behavior 23, 3 (May 2007), 1616–1641.
[24] Pierre Dragicevic, Stéphane Huot, and Fanny Chevalier. 2011. Gliimpse: Animating from Markup Code to Rendered Documents and Vice Versa. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 257–262.
[25] eLife. 2013. Seeing through the eLife Lens: A new way to view research. https://elifesciences.org/inside-elife/0414db99/seeing-through-the-elife-lens-a-new-way-to-view-research.
[26] Fermat’s Library. https://fermatslibrary.com/. Last accessed September 16, 2020.
[27] Max Froumentin. Mathematical Markup Language (MathML).
[28] … In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 481–488.
[29] Tovi Grossman, Fanny Chevalier, and Rubaiat Habib Kazi. 2015. Your Paper is Dead! Bringing Life to Research Articles with Animated Figures. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 461–475.
[30] Andrew Head, Codanda Appachu, Marti A. Hearst, and Björn Hartmann. 2015. Tutorons: Generating Context-Relevant, On-Demand Explanations and Demonstrations of Online Code. In Proceedings of the Symposium on Visual Languages and Human-Centric Computing. IEEE, 3–12.
[31] Marti A. Hearst, Emily Pedersen, Lekha Patil, Elsie Lee, Paul Laskowski, and Steven Franconeri. 2020. An Evaluation of Semantically Grouped Word Cloud Designs. IEEE Transactions on Visualization and Computer Graphics 26, 9 (September 2020), 2748–2761.
[32] William C. Hill, James D. Hollan, Dave Wroblewski, and Tim McCandless. 1992. Edit wear and read wear. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 3–9.
[33] Terje Hillesund. 2010. Digital reading spaces: How expert readers handle books, the Web and electronic paper. First Monday 15, 4 (April 2010).
[34] Ken Hinckley, Xiaojun Bi, Michel Pahud, and Bill Buxton. 2012. Informal Information Gathering Techniques for Active Reading. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1893–1896.
[35] Sture Holm. 1979. A Simple Sequentially Rejective Multiple Test Procedure. Scandinavian Journal of Statistics 6, 2 (1979), 65–70.
[36] Jessica Hullman, Yea-Seul Kim, Francis Nguyen, Lauren Speers, and Maneesh Agrawala. 2018. Improving Comprehension of Measurements Using Concrete Re-Expression Strategies. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 34.
[37] Matthew Inglis and Lara Alcock. 2012. Expert and Novice Approaches to Reading Mathematical Proofs. Journal for Research in Mathematics Education 43, 4 (July 2012), 358–390.
[38] Nan Jiang and Huseyin Dogan. 2014. CrowdHiLite: A Peer Review Service to Support Serious Reading on the Screen. In Proceedings of the International BCS Human Computer Interaction Conference. 323–328.
[39] Zhuoren Jiang, Liangcai Gao, Ke Yuan, Zheng Gao, Zhi Tang, and Xiaozhong Liu. 2018. Mathematics Content Understanding for Cyberlearning via Formula Evolution Map. In Proceedings of the International Conference on Information and Knowledge Management. ACM, 37–46.
[40] KaTeX. https://katex.org. Last accessed September 16, 2020.
[41] Caitlin Kelleher and Randy Pausch. 2005. Stencils-Based Tutorials: Design and Evaluation. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 541–550.
[42] Dae Hyun Kim, Enamul Hoque, Juho Kim, and Maneesh Agrawala. 2018. Facilitating Document Reading by Linking Text and Tables. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 423–434.
[43] Yea-Seul Kim, Jessica Hullman, Matthew Burgess, and Eytan Adar. 2016. SimpleScience: Lexical Simplification of Scientific Terminology. In Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1066–1071.
[44] Amy J. Ko and Brad A. Myers. 2009. Finding Causes of Program Output with the Java Whyline. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 1569–1578.
[45] Rafal Kocielnik, Saleema Amershi, and Paul N. Bennett. 2019. Will You Accept an Imperfect AI? Exploring Designs for Adjusting End-user Expectations of AI Systems. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 411.
[46] Andrea Kohlhase, Michael Kohlhase, and Taweechai Ouypornkochagorn. 2018. Discourse Phenomena in Mathematical Documents. In Proceedings of the Conference on Intelligent Computer Mathematics. 147–163.
[47] Michael Kohlhase, Joseph Corneli, Catalin David, Deyan Ginev, Constantin Jucovschi, Andrea Kohlhase, Christoph Lange, Bogdan Matican, Stefan Mirea, and Vyacheslav Zholudev. 2011. The Planetary System: Web 3.0 & Active Documents for STEM. Procedia Computer Science.
[48] … In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 31–40.
[49] Alexandra Kuznetsova, Peter Brockhoff, and Rune H. B. Christensen. 2017. lmerTest Package: Tests in Linear Mixed Effects Models. Journal of Statistical Software 82, 13 (December 2017), 1–26.
[50] Labella.js. https://twitter.github.io/labella.js/. Last accessed September 16, 2020.
[51] Elina Late, Carol Tenopir, Sanna Talja, and Lisa Christian. 2019. Reading practices in scholarly work: from articles and books to blogs. Journal of Documentation 75, 3 (2019), 478–499.
[52] Mary J. Lindstrom and Douglas M. Bates. 1990. Nonlinear Mixed Effects Models for Repeated Measures Data. Biometrics 46, 3 (September 1990), 673–687.
[53] Xiaozhong Liu, Zhuoren Jiang, and Liangcai Gao. 2015. Scientific Information Understanding via Open Educational Resources (OER). In Proceedings of the International Conference on Research and Development in Information Retrieval.
[55] … In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 338.
[56] Iain J Marshall, Joël Kuiper, and Byron C Wallace. 2016. RobotReviewer: evaluation of a system for automatically assessing bias in clinical trials. Journal of the American Medical Informatics Association 23, 1 (January 2016), 193–201.
[57] Damien Masson, Sylvain Malacria, Edward Lank, and Géry Casiez. 2020. Chameleon: Bringing Interactivity to Static Digital Documents. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 432.
[58] Material-UI. https://material-ui.com/. Last accessed September 16, 2020.
[59] Elisa Mattiello. 2017. Analogy in Word-Formation: A Study of English Neologisms and Occasionalisms. De Gruyter Mouton.
[60] Melissa McCartney, Chazman Childers, Rachael R. Baiduc, and Kitch Barnicle. 2018. Annotated Primary Literature: A Professional Development Opportunity in Science Communication for Graduate Students and Postdocs. Journal of Microbiology & Biology Education 19, 1 (March 2018), 1–13.
[61] MediaWiki contributors. Page Previews.
[62] pdf.js. https://mozilla.github.io/pdf.js/. Last accessed September 16, 2020.
[63] David Nicholas, Peter Williams, Ian Rowlands, and Hamid R. Jamali. 2010. Researchers’ e-journal use and information seeking behaviour. Journal of Information Science 36, 4 (2010), 494–516.
[64] Don Norman. 2013. The design of everyday things. Basic Books. See pages 288–291, section “The Future of Books”.
[65] Kenton O’Hara. 1996. Towards a Typology of Reading Goals. Technical Report. Rank Xerox Research Centre.
[66] Robert Pagel and Moritz Schubotz. 2014. Mathematical Language Processing Project. In Proceedings of the Conference on Intelligent Computer Mathematics.
[67] PeerLibrary. https://peerlibrary.org/. Last accessed September 16, 2020.
[68] Peter Pirolli and Stuart K. Card. 1999. Information Foraging. Psychological Review.
[69] … In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 2264–2271.
[70] Brett Powley, Robert Dale, and Ilya Anisimoff. 2009. Enriching a Document Collection by Integrating Information Extraction and PDF Annotation. In Document Recognition and Retrieval.
[72] … In Proceedings of the Conference on Computer-Supported Cooperative Work and Social Computing. ACM, 341–346.
[73] Warren B. Roby. 1999. "What’s in a gloss?": A commentary on Lara L. Lomicka’s "To gloss or not to gloss": An investigation of reading comprehension online. Language Learning & Technology 2, 2 (January 1999), 94–101.
[74] Susanne Rott. 2007. The Effect of Frequency of Input-Enhancements on Word Learning and Text Comprehension. Language Learning 57, 2 (June 2007), 165–199.
[75] Somali Roy. 2014. Evaluating novel pedagogy in higher education: A case study of e-Proofs. Ph.D. Dissertation. Loughborough University.
[76] Somali Roy, Matthew Inglis, and Lara Alcock. 2017. Multimedia resources designed to support learning from written proofs: An eye-movement study. Educational Studies in Mathematics 96, 2 (2017), 249–266.
[77] Nipun Sadvilkar. 2020. pySBD: Python Sentence Boundary Disambiguation (SBD). https://github.com/nipunsadvilkar/pySBD. Last accessed July 27, 2020.
[78] Bill N. Schilit, Gene Golovchinsky, and Morgan N. Price. 1998. Beyond Paper: Supporting Active Reading with Free Form Digital Ink Annotations. In Proceedings of the CHI Conference on Human Factors in Computing Systems.
[82] … The Journal of Mathematical Behavior 35 (September 2014), 74–86.
[83] Noah Siegel, Nicholas Lourie, Russell Power, and Waleed Ammar. 2018. Extracting Scientific Figures with Distantly Supervised Neural Networks. In Proceedings of the Joint Conference on Digital Libraries. ACM, 223–232.
[84] Springer. https://link.springer.com. Last accessed September 16, 2020.
[85] Emma Strubell, Patrick Verga, Daniel Andor, David Weiss, and Andrew McCallum. 2018. Linguistically-Informed Self-Attention for Semantic Role Labeling. (2018). arXiv:1804.08199 [cs.CL]
[86] Craig Tashman and W. Keith Edwards. 2011. Active Reading and Its Discontents: The Situations, Problems and Ideas of Readers. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 2927–2936.
[87] Craig S. Tashman and W. Keith Edwards. 2011. LiquidText: A Flexible, Multitouch Environment to Support Active Reading. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 3285–3294.
[88] Alan Taylor. 2006. The Effects of CALL versus Traditional L1 Glosses on L2 Reading Comprehension. CALICO Journal 23, 2 (2006), 309–318.
[89] Carol Tenopir, Donald W. King, Sheri Edwards, and Lei Wu. 2009. Electronic journals and changes in scholarly article seeking and reading patterns. 61, 1 (2009), 5–32.
[90] Carol Tenopir, Elina Late, Sanna Talja, and Lisa Christian. 2019. Changes in Scholarly Reading in Finland Over a Decade: Influences of E-Journals and Social Media. Libri 69, 3 (2019), 169–187.
[91] VSCode. https://code.visualstudio.com/. Last accessed September 16, 2020.
[92] Keith Weber. 2008. How Mathematicians Determine if an Argument Is a Valid Proof. Journal for Research in Mathematics Education 39, 4 (July 2008), 431–459.
[93] H. G. Widdowson. 1978. Teaching Language as Communication. Oxford University Press.
[94] Akifumi Yanagisawa, Stuart Webb, and Takumi Uchihara. 2020. How do different forms of glossing contribute to L2 vocabulary learning from reading? A meta-regression analysis. Studies in Second Language Acquisition 42, 2 (May 2020), 411–438.
[95] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM. Paper 279.
[96] Dongwook Yoon, Nicholas Chen, François Guimbretière, and Abigail Sellen. 2014. RichReview: Blending Ink, Speech, and Gesture to Support Collaborative Document Review. In Proceedings of the Symposium on User Interface Software and Technology. 481–490.
[97] Robert Zeleznik, Andrew Bragdon, Ferdi Adeputra, and Hsu-Sheng Ko. 2010. Hands-On Math: A page-based multi-touch and pen desktop for technical work and problem solving. In Proceedings of the Symposium on User Interface Software and Technology. ACM, 17–26.
[98] Polle T. Zellweger, Bay-Wei Chang, and Jock D. Mackinlay. 1998. Fluid Links for Informed and Incremental Link Transitions. In Proceedings of the Conference on Hypertext and Hypermedia. ACM, 50–57.
[99] Xiaolong Zhang, Yan Qu, C. Lee Giles, and Piyou Song. 2008. CiteSense: Supporting Sensemaking of Research Literature. In Proceedings of the CHI Conference on Human Factors in Computing Systems. ACM, 677–680.
[100] Tianchang Zhao and Kyusong Lee. 2020. Talk to Papers: Bringing Neural Question Answering to Academic Search. In Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 30–36.
A APPENDIX
A.1 Implementation of ScholarPhi
This section describes our suite of algorithms for preparing papers to be read in ScholarPhi. These implementations, along with an interactive paper annotation tool for cleaning up the outputs of these algorithms, are available for other tool builders to use in our public repository.
Paper preprocessing.
ScholarPhi currently supports papers that have been written using the TeX typesetting language. By restricting the domain of papers to those that have TeX source, ScholarPhi is able to more precisely identify the locations of symbols and relationships between them. Given the TeX source for a paper, plain text sentences are extracted by removing macros and replacing citations and equations with placeholders. The plain text is split into a sequence of sentences using pysbd [77], a state-of-the-art sentence boundary detector. These sentences act as inputs to the algorithms for detecting definitions and usages of nonce words.
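A rough sketch of this extraction step, with simple regular expressions standing in for the pipeline's full TeX lexer and a naive splitter standing in for pysbd (the placeholder strings and helper name are illustrative, not the pipeline's actual ones):

```python
import re

CITATION_PLACEHOLDER = "CITATION"
EQUATION_PLACEHOLDER = "EQUATION"

def tex_to_sentences(tex: str) -> list:
    """Strip common TeX markup and split the result into sentences.

    A simplified stand-in for the real pipeline, which uses a full TeX
    lexer and the pysbd sentence boundary detector.
    """
    text = re.sub(r"\\cite\{[^}]*\}", CITATION_PLACEHOLDER, tex)  # citations
    text = re.sub(r"\$[^$]*\$", EQUATION_PLACEHOLDER, text)       # inline math
    text = re.sub(r"\\[a-zA-Z]+\{([^}]*)\}", r"\1", text)         # unwrap simple macros
    text = re.sub(r"\s+", " ", text).strip()
    # Naive boundary detection: split after sentence-final punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
```

For example, `tex_to_sentences(r"We use $k$ means \cite{x}. \emph{Results} follow.")` yields two sentences with the citation and equation replaced by placeholders.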
Symbol detection.
To detect symbols in a paper, ScholarPhi first extracts all equations from the paper's TeX using a custom TeX lexer. Each equation is parsed using KaTeX [40], an open source library for rendering LaTeX equations in the browser. This yields, for each equation, a representation of that equation in MathML [27], a flavor of XML where elements correspond to identifiers, operators, numbers, and combinations thereof. ScholarPhi climbs the MathML tree, building up symbols that are more and more complex, assigning those made at lower levels of the tree as sub-symbols of those made at higher levels of the tree. In this manner, composite symbols are identified.
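The bottom-up pass over the MathML tree can be sketched with the standard library's XML parser; this is a much-simplified stand-in for the instrumented KaTeX parse, and the toy MathML fragment is illustrative:

```python
import xml.etree.ElementTree as ET

def collect_symbols(mathml: str) -> list:
    """Return (symbol, sub-symbols) pairs from a MathML string, leaves first.

    Leaf <mi> elements are atomic identifiers; composite nodes such as
    <msub> become symbols whose children are their sub-symbols.
    """
    symbols = []

    def visit(node):
        if node.tag == "mi":  # identifier leaf: an atomic symbol
            symbols.append((node.text, []))
            return node.text
        parts = [visit(child) for child in node]
        names = [p for p in parts if p]
        composite = "".join(names)  # composite symbol built from its parts
        symbols.append((composite, names))
        return composite

    visit(ET.fromstring(mathml))
    return symbols

# "x" subscripted by "i", as KaTeX's MathML would nest it.
pairs = collect_symbols("<msub><mi>x</mi><mi>i</mi></msub>")
```

Here `pairs` records the two atomic symbols and one composite symbol whose sub-symbols are "x" and "i", mirroring the lower-level-to-higher-level assignment described above.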
Nonce word localization in PDFs.
To make nonce words interactive, ScholarPhi must know the positions of those words in the PDF. It is non-trivial to extract structured representations of mathematical symbols from PDFs based on the information available in PDFs. Hence, ScholarPhi makes use of a technique described by Siegel et al. [83] to find the bounding boxes of objects of interest in PDFs when TeX source files are available. Specifically, the technique perturbs the colors of the objects by detecting the text span that creates each object and wrapping the span in coloring commands. Then, the TeX document is compiled into a PDF, and simple computer vision techniques are used to detect the regions of the colorized PDF that differ from the original PDF. These regions form the bounding box for the object. To adapt this technique to the detection of symbols, ScholarPhi needs to know which spans of characters in a TeX file correspond to a symbol. Therefore, the KaTeX equation parser (see above) was instrumented to report which character spans of each TeX equation correspond to which MathML elements in the MathML tree produced by the KaTeX parser. Once the character offsets of each symbol in the paper's TeX are known, the technique by Siegel et al. [83] can be used to locate the precise bounding box of each symbol in the PDF. The bounding boxes of terms and sentences are detected using the same method. The character offsets of terms within the TeX are extracted by the custom TeX processor, which can take an arbitrary list of term names as input and determine the offsets of all appearances of those terms. The character offsets of sentences are extracted by the sentence boundary detector. (See the project repository at https://github.com/allenai/scholarphi.)
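The final image-comparison step can be sketched with plain pixel grids: compare the original and colorized renders and take the bounding box of the pixels that changed. This is a simplified stand-in for the actual computer vision code, and the grid values are illustrative:

```python
def diff_bounding_box(original, colorized):
    """Bounding box (left, top, right, bottom), inclusive, covering all
    pixels that differ between two equally sized pixel grids (row-major
    lists of rows). Returns None when the renders are identical."""
    changed = [
        (x, y)
        for y, (row_a, row_b) in enumerate(zip(original, colorized))
        for x, (a, b) in enumerate(zip(row_a, row_b))
        if a != b
    ]
    if not changed:
        return None
    xs = [x for x, _ in changed]
    ys = [y for _, y in changed]
    return min(xs), min(ys), max(xs), max(ys)

# A 4x4 "page" where recoloring one symbol changed two adjacent pixels.
before = [[0] * 4 for _ in range(4)]
after = [row[:] for row in before]
after[1][2] = after[2][2] = 1
box = diff_bounding_box(before, after)  # (2, 1, 2, 2)
```

The real pipeline rasterizes both PDFs and runs an analogous comparison per page, but the core idea is this pixel-difference bounding box.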
Term and definition detection.
For some of the prototypes assessed in Section 3.2, terms and definitions were extracted using an automatic, state-of-the-art definition recognition algorithm [4]. As is the case with most algorithms in natural language processing, the results are not 100% correct. This system is under active development as part of this project, and accuracy is anticipated to improve steadily. Because we wanted to assess the designed interface affordances of ScholarPhi without delving into issues relating to error recovery (which is a separate and relevant topic), for the usability study described in Section 5 we manually corrected and selected the terms and definitions shown to participants. To scaffold our prototyping efforts, a custom PDF annotation tool was developed (which is also included in our suite of open source tools), which supported the tagging of arbitrary text as terms, and the tagging of those terms with arbitrary lists of definitions and usages.
Usage extraction.
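The containment check at the heart of this step can be sketched as follows, assuming illustrative (start, end) character spans rather than the pipeline's actual data structures:

```python
def find_usages(word_spans, sentence_spans):
    """Group nonce-word occurrences by the sentence that contains them.

    Spans are (start, end) character offsets into the TeX source; a word
    occurrence belongs to a sentence when its span lies entirely within
    the sentence's span.
    """
    usages = {}
    for w_start, w_end in word_spans:
        for i, (s_start, s_end) in enumerate(sentence_spans):
            if s_start <= w_start and w_end <= s_end:
                usages.setdefault(i, []).append((w_start, w_end))
                break  # a word occurrence belongs to at most one sentence
    return usages

# Two sentences; a symbol appears once in each.
sentence_spans = [(0, 20), (21, 40)]
word_spans = [(4, 5), (25, 26)]
usages = find_usages(word_spans, sentence_spans)  # {0: [(4, 5)], 1: [(25, 26)]}
```

Mapping occurrences to sentences this way also identifies exactly which substrings to wrap in HTML tags for highlighting.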
Usages of a nonce word were extracted as all sentences that contained the nonce word. Containment was determined by comparing the character offsets of the sentences and nonce words where they appeared in the TeX. Defining formulae were extracted for symbols by searching for equations in which the symbol appeared on the left-hand side (i.e., to the left of a definition operator like "="). Each appearance of the nonce word in a usage was wrapped in HTML tags that allowed the nonce word to be highlighted in lists of usages in the web interface.
User interface.
The user interface builds on top of the Mozilla Foundation's open source pdf.js web viewer [62]. ScholarPhi's interactive features, including definition tooltips, lists of usages, decluttering, symbol selection, equation diagrams, and priming glossaries, are all implemented as an overlay atop the pdf.js PDF reader. Visual styling is accomplished with custom styles, using Material UI components [58] as a starting point. The features are written in 10.5k lines of React code, which complements the 10.2k lines of Python code and 200 lines of custom TeX coloring macros that are used to process the papers before they reach the user interface.

Symbols and formulae are rendered throughout the interface. Rather than show symbols using bitmaps extracted from the PDFs, TeX for equations and symbols is rendered within the browser using KaTeX. This has the advantage of rendering symbols and equations at high resolution. In addition, definitions and usages that contain equations can be rendered in views like tooltips and lists of usages, where their text must be able to wrap.

To display equation diagrams, definitions for symbols and sub-symbols are overlaid on top of the page. Labels are placed on the top and bottom boundaries of the equation, with a fixed margin between the equation and labels. Labels are spaced horizontally using a constraint-based layout algorithm implemented in Labella.js [50]. They are split evenly between the top and bottom of the equation, with label position determined by which side of the equation the symbol is closest to. Algorithms for both straight (i.e., diagonal, single-segment) and orthogonal (i.e., two-segment, horizontal-then-vertical) leader lines are implemented in the ScholarPhi code.
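The assignment of labels to the top or bottom of an equation can be sketched as a simple proximity rule; this is a stand-in for the actual layout code (which also balances label counts and delegates horizontal spacing to Labella.js), and the coordinates are illustrative:

```python
def assign_label_sides(symbol_centers, equation_top, equation_bottom):
    """Assign each symbol's label to the 'top' or 'bottom' edge of an equation.

    `symbol_centers` maps a symbol name to the vertical center of its
    bounding box (y grows downward, as in screen coordinates); a label
    goes to whichever edge its symbol is nearer, ties going to the top.
    """
    midline = (equation_top + equation_bottom) / 2
    return {
        name: "top" if y <= midline else "bottom"
        for name, y in symbol_centers.items()
    }

# An equation spanning y = 10..20; "k" sits in the upper half, "n" lower.
sides = assign_label_sides({"k": 12.0, "n": 18.0}, equation_top=10, equation_bottom=20)
# {"k": "top", "n": "bottom"}
```

After sides are chosen, each side's labels can be handed to the constraint-based horizontal layout as one independent row.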
Technical limitations.
Most aspects of ScholarPhi are currently automated, with a few limitations. As it stands, much of the document processing is automated, such as symbol detection, sentence detection, and nonce word localization in PDFs. (Some minor adjustments were made to correct errors for the usability study.) As mentioned, the current implementation is applied only to documents with TeX source, but we intend for future versions to be able to deliver full functionality on arbitrary PDFs, perhaps by making use of state-of-the-art tools for symbol extraction from scholarly documents (e.g., SymbolScraper [21]).

Definition detection is the one stage that requires further advances to the state of the art to achieve full functionality. In future work, we intend to investigate error recovery mechanisms and user-supplied corrections. We also see work like ScholarPhi as informing the direction of further advances in automated definition recognition.
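The character-offset containment check used during usage extraction (described at the start of this appendix) reduces to a simple interval test. A minimal sketch, assuming half-open (start, end) offsets into the TeX source (the function names and the exclusive-end convention are illustrative, not from the ScholarPhi codebase):

```python
def contains(sentence_span, symbol_span):
    """True if the symbol's character span lies inside the sentence's span.

    Spans are (start, end) character offsets into the TeX source, with
    `end` exclusive -- an assumed convention for this sketch.
    """
    return sentence_span[0] <= symbol_span[0] and symbol_span[1] <= sentence_span[1]


def usages(sentence_spans, nonce_word_spans):
    """Map each sentence span to the nonce-word appearances it contains."""
    return {
        sent: [sym for sym in nonce_word_spans if contains(sent, sym)]
        for sent in sentence_spans
    }
```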
A.2 Details of Statistical Analysis
A.2.1 Modeling mixed-effects in repeated measures studies.
For the analysis of results in Section 5, we use generalized linear mixed-effects models (GLMMs). GLMMs are often used to analyze repeated measures studies, in which the same subject contributes multiple (potentially correlated) measurements [52]. They have been used for such studies in medicine [17], behavioral studies [20], and even usability studies of semantic layouts [31].
A.2.2 F-tests for significant effect of interface.
For each of the quantitative measurements ($y$), we fit a GLMM with fixed effects $\beta$ for the interface ($x_1$) and question ($x_2$) factors (and a fixed-effects interaction term). Using the lme4 package in R [7], we fit the following GLMM:

$g(\mathbb{E}[y]) = \beta_0 + \gamma_j + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$   (1)

where $g$ is the link function and random intercepts $\gamma_j \sim \mathcal{N}(0, \sigma_\gamma^2)$ capture individual variation of participant $j$. Using the lmerTest R package [49], we conduct F-tests for differences in fixed-effect estimates between each interface option, repeated for each $y$. We perform Holm-Bonferroni [35] correction on the $p$-values (using R's p.adjust function). We find significance for Correct ($p = .$), Ease ($p < .$), Confidence ($p = .$), Time ($p < .$), Distance ($p = .$), Area ($p < .$), even while controlling for question- and participant-specific effects. That is to say, for these metrics, the F-test has identified that the choice of interface (Basic, Declutter, or ScholarPhi) is a significant factor. Note that the F-test does not assess which of these interfaces is more or less impactful on the metric.

A.2.3 t-tests for pairwise contrasts between interfaces.

We conduct a post-hoc analysis of pairwise contrasts to quantify the differences in mean effect of interface on $y$ under the GLMM (and controlling for question). Two-sided $t$-tests for pairwise contrasts are computed using the emmeans R package, and results are shown in Table 1.

A.2.4 Ordinal regression for Likert-scale variables.
As Ease and Confidence are measured on a 5-point Likert scale, linear GLMM-estimated means may be ill-suited for analysis, especially if Ease and Confidence are sufficiently non-Normally distributed. We additionally perform likelihood ratio tests after fitting analogous cumulative link mixed-effects models (CLMMs) provided in the ordinal R package [16]. Likelihood ratio tests, which are similar to F-tests but more conservative, yielded similar $p$-values: Ease ($p < .$), Confidence ($p = .$).

We use the identity link $g(z) = z$ for $y \in \{$Ease, Confidence, Time, Distance, Area$\}$. We use the logit link $g(p) = \log(p/(1-p))$ for $y =$ Correct, which is treated as a Bernoulli variable.

The F-test is not applicable when $y \sim$ Bernoulli, so we performed the similar, but slightly less conservative, likelihood ratio test for $y =$ Correct [49].

Because the GLMM for $y =$ Correct was fit using a logit link, direct testing of pairwise contrasts $\hat{y}_i - \hat{y}_j = \widehat{\Pr}(\text{Correct} = 1 \mid i) - \widehat{\Pr}(\text{Correct} = 1 \mid j)$ is not possible. We used the transform option in emmeans to perform the contrast tests on the log-odds scale, $\log(\Pr(\text{Correct} = 1)/\Pr(\text{Correct} = 0))$, which is linear under the GLMM, before applying the inverse-link $g^{-1}$ transformation to return to the probability $\Pr(\text{Correct} = 1)$.
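Two of the computations in this appendix are small enough to sketch directly; a minimal Python sketch (the analysis itself was done in R, and the function names here are illustrative): the Holm-Bonferroni step-down correction, matching what R's p.adjust(method="holm") computes, and the logit link with its inverse, used when testing contrasts for Correct on the log-odds scale.

```python
import math


def holm_bonferroni(p_values):
    """Holm-Bonferroni step-down adjustment, as in R's p.adjust(method="holm").

    The i-th smallest of m p-values (1-indexed) is multiplied by
    (m - i + 1) and capped at 1; adjusted values are forced to be
    non-decreasing in rank order. Results keep the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


def logit(p):
    """Logit link g(p) = log(p / (1 - p)): probability -> log-odds."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Inverse link g^{-1}(z) = 1 / (1 + e^{-z}): log-odds -> probability."""
    return 1.0 / (1.0 + math.exp(-z))
```

A pairwise contrast between two interfaces is then a difference of log-odds, logit(p_i) - logit(p_j), which is linear under the logit-link GLMM; inv_logit maps a fitted log-odds back to a probability.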