Delineating Knowledge Domains in the Scientific Literature Using Visual Information
Sean T. Yang
University of Washington, Seattle, WA
[email protected]

Po-shen Lee
University of Washington, Seattle, WA
[email protected]

Jevin D. West
University of Washington, Seattle, WA
[email protected]

Bill Howe
University of Washington, Seattle, WA
[email protected]
ABSTRACT
Figures are an important channel for scientific communication, used to express complex ideas, models, and data in ways that words cannot. However, this visual information is mostly ignored in analyses of the scientific literature. In this paper, we demonstrate the utility of using scientific figures as markers of knowledge domains in science, which can be used for classification, recommender systems, and studies of scientific information exchange. We encode sets of images into a visual signature, then use distances between these signatures to understand how patterns of visual communication compare with patterns of jargon and citation structures. We find that figures can be as effective for differentiating communities of practice as text or citation patterns. We then consider where these metrics disagree to understand how different disciplines use visualization to express ideas. Finally, we further consider how specific figure types propagate through the literature, suggesting a new mechanism for understanding the flow of ideas apart from conventional channels of text and citations. Our ultimate aim is to better leverage these information-dense objects to improve scientific communication across disciplinary boundaries.
KEYWORDS
VizioMetrics, science of science, bibliometrics, scientometrics
ACM Reference format:
Sean T. Yang, Po-shen Lee, Jevin D. West, and Bill Howe. 2016. Delineating Knowledge Domains in the Scientific Literature Using Visual Information. In Proceedings of ACM Conference, Washington, DC, USA, July 2017 (Conference'17), 10 pages. DOI: 10.475/123 4
INTRODUCTION
Increased access to publication data has contributed to the emergence of the Science of Science (SciSci) as a field of study. SciSci
studies metrics of knowledge production and the factors contributing to this production [14]. Citations and text are the primary data types for measuring influence and tracking the evolution of scientific disciplines in this field. Dong et al. [9] use citations to study the growth of science and observe the globalization of scientific development within the past century. Vilhena et al. [43] characterize cultural holes in scientific communication embedded in citation networks. However, among the studies in SciSci, the use of visualization has received little attention, despite being widely recognized as a significant communication channel within disciplines, across disciplines, and with the general public [28]. Humans perceive information presented visually better than textually [35] due to the highly developed visual cortex [44]. As a result, figures play a significant role in academic communication. The information density of a visualization or diagram can represent complex ideas in a compact form. For example, a neural network architecture diagram conveys an overview of the method used in a paper without requiring code listings or significant text. Moreover, the presence of a neural network diagram can be a better indicator that a paper involves the use of a neural network than simple text features such as the presence of the phrase "neural network." Despite the importance of figures in the scientific literature, they have received relatively little attention in the SciSci community. Viziometrics [28] is the analysis of visual information in the scientific literature. The term was adopted to distinguish this analysis from bibliometrics and scientometrics, while still conveying the common objectives of understanding and optimizing patterns of scientific influence and communication. Lee et al.
[28] have shown the relationship between visual information and the scientific impact of a paper. In this paper, we demonstrate that visual information can serve as an effective measure of similarity that can demarcate areas of knowledge in the scientific literature. Different scientific communities use visual information differently, and one can use these differences to understand communities of practice across traditional disciplines and show how ideas flow between these communities. We consider three hypotheses: H1) sub-disciplines use distinguishable patterns of visual communication just as they use distinguishable jargon; H2) these patterns expose new modalities of communication that are not identifiable by either text or the structure of the citation graph; and H3) by classifying and analyzing the use of specific types of figures, we can track the propagation and popularity of certain ideas and methods that are difficult to discern using text or citations alone (e.g., the inclusion of a neural network diagram suggests a contribution of a new neural network architecture). To test these hypotheses, we extract over 5 million scientific figures from papers on arXiv.org, process the images into low-dimensional vectors, then build a visual signature for each field by clustering the vectors and computing the frequency distribution across clusters for each discipline. We use these signatures to reason about the similarity between fields, and compare these measures to prior work in understanding scientific community structure using text [43] and the citation graph [10, 43].
Citations and text have been used to circumscribe knowledge domains, but this is the first study to show that figures can also delineate fields. We compare the pairwise distances between these three matrices using the Mantel test [32], a common statistical test of the correlation between two distance matrices. We find that visual distance is moderately correlated with citation-based metrics (r = 0.706, p = 0.0001, z-score = 5.103) and text-based metrics (r = 0.531, p = 0.0002, z-score = 5.019). We also perform hierarchical clustering on all distance matrices to provide a qualitative comparison of the results, finding that the hierarchical structure of the fields largely agrees, but with some significant exceptions. We then consider pairs of fields that are visually distinct but similar in either text distance or citation distance, suggesting differences in the visual style of how ideas are presented. For example, we find that Computation and Language is visually distinct from other
Computer Science disciplines despite being quite similar in citation distance, because the former includes far more tables of data. Finally, we consider specific cases where the use of a particular type of figure indicates a common method or idea in a way that text and citation similarity do not. We conduct a case study on two popular types of visualizations: neural network diagrams and embedding visualizations used to show clusters. The analysis indicates that visualizations can be used to make inferences about concept adoption within scientific communities. We also observe that figures reveal the uptake of neural networks earlier than citation analysis, since citation counts take years to accrue. With this case study, we show the significance of visualizations in the scientific literature, suggesting that the integration of figures into systems for bibliometric analysis, document summarization, information retrieval, and recommendation can improve performance and afford new applications. Our focus is the scientific literature, but our methods are directly applicable to other domains, including patents, web pages [2], and news. In this paper, we make the following contributions:

• We present a method for delineating scholarly disciplines based on the figures and visualizations in the literature.
• We compare this method to prior results based on citations and text and find that different fields and sub-disciplines exhibit discernible patterns of visual communication (H1).
• We find instances of fields that use similar jargon and cite similar sources, but are visually distinct, suggesting that visual patterns of communication are not redundant with other forms of communication (H2).
• We present a method for identifying specific figure types and show that the presence of these figures in a paper can be used to understand concept adoption and serve as a potential marker for tracking the evolution of scientific ideas (H3).
RELATED WORK
Citations have been extensively studied and utilized as a measure of similarity among scientific publications. Marshakova proposed co-citation analysis [33], which uses the frequency with which papers are cited together as a measure of similarity. Citations are also used to delineate the emerging nanoscience fields in [30, 47] and are applied to the design of recommendation systems [21]. However, citations only reveal structural information within the scholarly literature and ignore the rich content of the articles. Text has also received significant attention for analyzing connections within scientific disciplines and documents, especially in citation recommendation [20, 42]. Vilhena et al. [43] proposed a text-based metric to characterize the jargon distance between disciplines. However, the ambiguity and synonymity of text make text-based models less ideal [24]. Researchers have explored other aspects of a research paper for measuring the distance between disciplines. The frequency of mathematical symbols in papers is used to delineate fields by West et al. [45], but mathematical symbols are not as ubiquitous as other components. Visual communication is a significant channel for conveying scientific knowledge, but is relatively less explored. A number of studies have focused on mining scientific figures. Chart classification has been well studied by Futrelle et al. [15], Shao et al. [39], and Lee et al. [29]. Recent studies have focused on the extraction of quantitative data from scientific visualizations, including line charts [31, 40], bar charts [3], and tables [13]. Researchers have also investigated techniques for understanding the semantic messages of scientific figures. Kembhavi et al. [22] utilized a convolutional neural network (CNN) to study the problem of diagram interpretation and reasoning. Elzer et al. [12] studied the intended messages in bar charts. Several visualization-based search engines have also been presented.
DiagramFlyer [7], introduced by Chen et al., is a search engine for data-driven diagrams. VizioMetrix [27] and NOA [6] are both search engines for scientific figures over large scholarly corpora; both work by examining the captions around the figures. We see visual-based models for demarcating knowledge domains as a next step in this area of research.
DATA
The data for this study comes from arXiv, an open-access repository for pre-prints in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering, systems science, and economics. This variety of disciplines allows consideration of information exchange between fields, in contrast to more specialized repositories such as PubMed. There are 1,343,669 research papers, which include 5,009,523 figures, on arXiv through December 31, 2017.
Figure 1: Overall pipeline. Figures are mapped to vectors using ResNet-50, dimension-reduced, then organized into a histogram for each field. The distances between these histograms are used to infer relationships and information flow.
Fig. 1 shows the pipeline used to characterize scientific disciplines with visual information. Each step is explained in the corresponding numbered paragraph below.
We first embed each figure into a 2048-d feature vector using the pre-trained ResNet-50 [18] model. The figures are resized and padded with white pixels to 224 × 224 before being embedded. ResNet-50 was trained on the ImageNet [8] corpus of 1.2M natural images. Even though the model was trained on natural images, we find that the early layers of the network identify simple patterns (lines, edges, corners, curves) that are sufficiently general for the overall network to represent the combinations of edges and shapes that comprise artificial images as well. Although we posit that a custom neural network architecture could be designed to incrementally improve performance on artificial images, we do not further consider that direction in this paper.
We reduce the dimension of each figure vector using Principal Component Analysis (PCA). The high-dimensional vectors produced by ResNet-50 contain more information than is necessary for our application of computing the visual similarity between fields, and we seek to make the pipeline as efficient as possible. In addition, the ResNet model is pre-trained on natural images, while scientific figures contain far more white space, which makes their embedding vectors sparser than those of natural images. Distances tend to be inflated in high-dimensional space, reducing clustering performance [4], so we follow the typical practice of applying dimension reduction prior to clustering. Our original hypothesis was that a very low number of dimensions (10) would be sufficient to capture the differences between fields, but in our evaluation higher values (200+) produced stronger correlations with other methods of delineating fields. We considered different values of this parameter using a sample of 1.5M figures from the 5M-figure corpus. The results of the experiment are presented in Section 5.1.
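A minimal sketch of this reduction step, using scikit-learn PCA on stand-in random vectors in place of the real embeddings (256 components is one of the settings evaluated in Section 5.1):

```python
# Minimal sketch of the dimension-reduction step: PCA on 2048-d figure vectors.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
figure_vectors = rng.normal(size=(1000, 2048))  # stand-in for real ResNet embeddings

pca = PCA(n_components=256)
reduced = pca.fit_transform(figure_vectors)
print(reduced.shape)                        # (1000, 256)
print(pca.explained_variance_ratio_.sum())  # fraction of variance retained
```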
The distribution of different types of figures carries significant information about how visual communication differs across disciplines and can further represent each category. We cluster our figure corpus with k-means clustering to aggregate similar figures. Although more advanced clustering methods could provide better results, we aim to demonstrate that the approach works even with very simple methods: the objective of this paper is to show the utility of figures for potential applications, rather than to propose a specialized framework for a specific task. The experimental results are shown in Section 5.1.
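Building a visual signature from these clusters can be sketched as below (an assumed implementation: cluster the reduced vectors with k-means, then compute each field's normalized histogram over the k clusters; the field labels and sizes here are toy stand-ins, and k = 4 is the value that maximized correlation with citation distance in Section 5.1):

```python
# Sketch of building a per-field "visual signature" (assumed implementation).
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(500, 256))                        # PCA-reduced figure vectors
fields = rng.choice(["cs.CL", "hep-ph", "q-bio"], size=500)  # field of each figure

k = 4  # the value that maximized correlation with citation distance (Sec. 5.1)
labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(vectors)

def visual_signature(field: str) -> np.ndarray:
    """Normalized histogram of a field's figures over the k clusters."""
    counts = np.bincount(labels[fields == field], minlength=k)
    return counts / counts.sum()

signatures = {f: visual_signature(f) for f in np.unique(fields)}
for f, sig in signatures.items():
    print(f, np.round(sig, 2))
```

Distances between these signature vectors (e.g., Euclidean or cosine) then give the visual distance between fields.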
In this section, we describe the process of training a classifier to identify specific figure types, which we use to understand how particular styles of visualization and diagram propagate through the literature. We consider two specific examples: neural network diagrams (associated with the rapid increase of neural network methods in the literature) and clustering plots (associated with the use of unsupervised learning). Examples of these visualizations are shown in Figure 2.

Figure 2: Examples of a neural network diagram and an embedding visualization. (a) An example of a neural network diagram, borrowed from the AlexNet paper [23]. (b) An example of an embedding visualization, borrowed from the MultiDEC paper [46].

Sethi et al. [38] characterize six different figure types used to demonstrate neural network architectures. We label 10,651 figures from arXiv, comprising 1,503 neural network diagrams, 1,057 embedding visualizations, and 8,091 negative examples. For neural network diagrams, we label them according to the taxonomy suggested by Sethi et al. [38], but we exclude figures in table format. We consider a figure an embedding visualization if it is used to visualize the representation distribution of the data. The annotators make use of both images and captions to label the figures. We extract visual features from the fully connected layer of a ResNet-50 [18] model pre-trained on the 1.2M-image ImageNet dataset [8]. The figures are resized to 224 × 224 and a 2048-d numeric vector is acquired for each figure. The labeled image set is then split into training, validation, and test sets with an 8:1:1 ratio to train a deep neural network (DNN) classifier. We tune the depth of the model, the dimensions of the layers, the dropout rate, the learning rate, the decay ratio, and the number of training epochs. The architecture of the final model is shown in Figure 3 and the implementation details are shown in Table 1.

Figure 3: The architecture of the neural network diagram and embedding visualization classifier.

Table 1: Implementation details for training the neural network diagram and embedding visualization classifier.

Learning Rate | Decay | Epochs | Batch Size | Loss
0.001 | 0.001 | 150 | 256 | Categorical Cross Entropy
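The training setup in Table 1 can be sketched as below. The layer widths and dropout rate are assumptions (the paper tuned but does not list them), and "decay" is interpreted here as Adam weight decay; only the learning rate, batch size, epoch count, and loss come from Table 1.

```python
# Sketch of the figure-type classifier head (assumed architecture).
import torch
import torch.nn as nn

classifier = nn.Sequential(
    nn.Linear(2048, 512), nn.ReLU(), nn.Dropout(0.5),  # widths/dropout assumed
    nn.Linear(512, 128), nn.ReLU(), nn.Dropout(0.5),
    nn.Linear(128, 3),  # {NN diagram, embedding visualization, negative}
)
# lr and decay from Table 1; "decay" read here as weight decay.
optimizer = torch.optim.Adam(classifier.parameters(), lr=0.001, weight_decay=0.001)
loss_fn = nn.CrossEntropyLoss()  # categorical cross entropy

# One toy training step on random features, for illustration only;
# the real loop runs 150 epochs with batch size 256 (Table 1).
features = torch.randn(256, 2048)  # one batch of ResNet-50 figure vectors
targets = torch.randint(0, 3, (256,))
logits = classifier(features)
loss = loss_fn(logits, targets)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(logits.shape)  # torch.Size([256, 3])
```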
We use the Mantel test [32], a standard statistical test of the correlation between two matrices, to compare visual distance with the distance matrices created by (1) average shortest citation distance [10, 43] and (2) natural-language jargon distance [43]. Citations and text have been extensively analyzed and employed to measure the similarity among research articles, and both measures have proven successful in information retrieval and recommendation systems for scholarly documents. We therefore consider citation distance the benchmark for the task and text distance an alternative comparison.
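A minimal sketch of the Mantel test as used here, assuming the standard permutation form with Spearman rank correlation (the rank computation below ignores ties, which is fine for continuous distances):

```python
# Sketch of a permutation Mantel test with Spearman correlation (assumed form).
import numpy as np

def _spearman(a: np.ndarray, b: np.ndarray) -> float:
    # Rank correlation without tie handling (adequate for continuous values).
    ra = a.argsort().argsort().astype(float)
    rb = b.argsort().argsort().astype(float)
    return float(np.corrcoef(ra, rb)[0, 1])

def mantel(A, B, permutations=999, seed=0):
    n = A.shape[0]
    iu = np.triu_indices(n, k=1)          # upper triangle of the distance matrices
    r_obs = _spearman(A[iu], B[iu])
    rng = np.random.default_rng(seed)
    count = 1                             # the observed statistic counts as one permutation
    for _ in range(permutations):
        p = rng.permutation(n)
        if _spearman(A[iu], B[p][:, p][iu]) >= r_obs:  # permute rows and columns of B
            count += 1
    return r_obs, count / (permutations + 1)

# Toy check: a matrix against a noisy copy of itself should correlate strongly.
rng = np.random.default_rng(1)
X = rng.random((12, 12)); X = (X + X.T) / 2; np.fill_diagonal(X, 0)
Y = X + rng.normal(scale=0.05, size=X.shape); Y = (Y + Y.T) / 2; np.fill_diagonal(Y, 0)
r, p = mantel(X, Y)
print(round(r, 2), p)
```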
We compute the average shortest path between each pair of fields as a measure of similarity. Average shortest path [10] is one of the three most robust measures [5] of network topology, in addition to the clustering coefficient and the degree distribution. Vilhena et al. [43] used this method to measure distance in the citation network to compare with their text-based metric. The average shortest path is computed as follows:

D_ij = (1 / (n_i n_j)) Σ_{v_i} Σ_{v_j} d(v_i, v_j)

where n_i is the number of vertices in field i and n_j is the number of vertices in field j. The average shortest path between field i and field j, D_ij, is the average of the path lengths over all vertex pairs v_i and v_j.

Our citation graph is obtained from the SAO/NASA Astrophysics Data System (ADS) [11], a digital library portal maintaining three bibliographic databases containing more than 13.6 million records covering publications in Astronomy and Astrophysics, Physics, and the arXiv e-prints. The creation of citations in ADS [1] begins by scanning the full text of each paper to retrieve a bibcode for each reference string in the article, followed by computing a similarity score between the ADS record and the bibcode; a citation pair is generated if the similarity exceeds a threshold. This data has been used extensively in several bibliographic studies [16, 25]. There are 14,555,820 citation edges within our arXiv data corpus.

We also compare our results to text metrics based on cultural information as represented by patterns of discipline-specific jargon. Jargon distance was first proposed by Vilhena et al. [43], who quantitatively measure the communication barrier between fields using n-grams from full text.
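The average-shortest-path computation above can be sketched as follows, assuming an undirected citation graph and breadth-first search for path lengths (unreachable pairs are simply skipped in this sketch):

```python
# Sketch of the citation-distance computation D_ij via BFS (assumed implementation).
from collections import deque

def bfs_lengths(graph, source):
    """Shortest path length from source to every reachable vertex."""
    dist = {source: 0}
    q = deque([source])
    while q:
        u = q.popleft()
        for w in graph[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                q.append(w)
    return dist

def avg_shortest_path(graph, field_i, field_j):
    """Mean of d(v_i, v_j) over all reachable pairs with v_i in field i, v_j in field j."""
    total, pairs = 0, 0
    for v in field_i:
        dist = bfs_lengths(graph, v)
        for u in field_j:
            if u in dist:  # skip unreachable pairs in this sketch
                total += dist[u]
                pairs += 1
    return total / pairs

# Toy citation graph: papers a, b in field i; papers c, d in field j.
graph = {"a": ["b", "c"], "b": ["a"], "c": ["a", "d"], "d": ["c"]}
print(avg_shortest_path(graph, ["a", "b"], ["c", "d"]))  # (1+2+2+3)/4 = 2.0
```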
The jargon distance (E_ij) between field i and field j is defined as the ratio of (1) the entropy H of a random variable X_i with the probability distribution of the jargon or mathematical symbols within field i and (2) the cross entropy Q between the probability distributions in field i and field j:

E_ij = H(X_i) / Q(p_i || p_j) = ( −Σ_{x∈X} p_i(x) log p_i(x) ) / ( −Σ_{x∈X} p_i(x) log p_j(x) )

Imagine a writer from field i trying to communicate with a reader from field j. The writer has a codebook P_i that maps natural language or mathematical symbols to codewords that the reader must decode using the codebook P_j from field j. A small jargon distance indicates high communication efficiency between two fields that are closely related. This metric can easily be applied to natural-language jargon to explore how communication varies through these two channels across disciplines. We compute the jargon distance between two disciplines by applying the metric to unigrams from abstracts.

We show that the distance between visual signatures can be used to determine the overall relationships between fields in a manner similar to prior methods, but that this approach also exposes information that prior methods cannot. In Section 5.1, we present the experimental results on picking the number of dimensions and clusters. In Section 5.2, we show the capacity of visual distance to reveal the relationships across scientific disciplines by showing global agreement between visual distance and citation distance (H1). In Section 5.3, we examine each cluster to understand the visual composition and find that each cluster is dominated by a certain type of visualization, extending prior work in the life sciences that used coarse-grained labeling of figure types [28].
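The jargon-distance ratio defined above can be sketched directly from unigram counts (an assumed implementation: a small floor avoids log(0) when a word in field i never appears in field j):

```python
# Minimal sketch of the jargon distance E_ij = H(X_i) / Q(p_i || p_j).
import math
from collections import Counter

def unigram_dist(text, vocab):
    counts = Counter(text.lower().split())
    total = sum(counts.values())
    return {w: counts[w] / total for w in vocab}

def jargon_distance(text_i, text_j, eps=1e-9):
    vocab = set(text_i.lower().split()) | set(text_j.lower().split())
    p_i = unigram_dist(text_i, vocab)
    p_j = unigram_dist(text_j, vocab)
    # Entropy of field i's distribution and cross entropy against field j's.
    H = -sum(p_i[w] * math.log(p_i[w]) for w in vocab if p_i[w] > 0)
    Q = -sum(p_i[w] * math.log(max(p_j[w], eps)) for w in vocab if p_i[w] > 0)
    return H / Q

# A field shares a codebook with itself, so the ratio is exactly 1.
a = "neural network training loss gradient"
print(jargon_distance(a, a))  # 1.0
```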
In Section 5.4, we show that citation distance and visual distance disagree in certain cases, and consider one case in particular (H2). Finally, in Section 5.5, we consider cases where the presence of a particular type of figure can indicate the use of a method or concept in a way that text and citation similarity do not (H3). We demonstrate that figures in the scientific literature can serve as an indicator of concept adoption that travels faster than citation counts.
Our pipeline involves two hyperparameters: the number of dimensions to retain via PCA and the number of clusters to assume when constructing visual signatures. We determine these parameters experimentally. The results of our analysis of PCA dimensions appear in Table 2. The explained variance ratio shows the percentage of variance explained by the selected components; the variance explained grows insignificantly after 256 components. The average correlation with citation distance shows the average of the correlations between visual distance and citation distance across all numbers of centroids k (from 2 to 30). We evaluate our method by conducting the Mantel test [32] to compare the correlation between visual distance and citation distance. The results confirm our hypothesis that the correlation increases when more components are used, but converges after sufficient information is preserved. The maximum correlation to citation distance shows, for each dimensionality, the maximum correlation among the different choices of the number of centroids k; the k contributing the maximum correlation is shown in "Maximum at k = ?". Surprisingly, the maximum correlation occurs at a larger number of centroids when the dimensionality of the figure vectors is low. Our interpretation is that not enough information is preserved in a low-dimensional space. We ran a second experiment to determine the number of centroids k. Initially, we expected the correlation with other measures to be higher using larger values of k, since the diversity of figures in the literature appears vast. However, considering k = 100, 200, and 400, we found that larger values of k generate lower correlations with citation distance (correlation coefficient around 0.4), due to overfitting to rare, low-confidence clusters. Lowering k to the range of 2 to 30 performed better; these results appear in Table 2. The relatively low values of k suggest that there are relatively few modalities of visual communication in use across fields.
The maximum correlation occurred at k = 4 in most of the experiments. We further discuss the interpretation of these results in Section 5.3.

In this section, we demonstrate the ability of visual distance to characterize the relationships between fields, quantitatively and qualitatively. Quantitatively, we conduct the Mantel test [32] with the Spearman rank correlation method to compare two different distance matrices and reveal the similarity between the two structures. Qualitatively, we perform hierarchical clustering using the UPGMA algorithm [36] to visualize the hierarchical relationships across disciplines. Vilhena et al. [43] used a similar technique to qualitatively visualize how disciplines are delineated, but their data came from JSTOR, which focuses on the biological and social sciences, so it is not comparable with our task. Table 3 shows the correlation results between the different distances. The first two columns indicate the methods being compared and the Results column shows the correlations. The correlation between visual distance and citation distance (r = 0.706) is higher than the correlation between visual distance and jargon distance (r = 0.531) and comparable to the correlation between jargon distance and citation distance (r = 0.697). In the hierarchical clustering (Fig. 4), Computer Science, Statistics, Math, and Mathematical Physics are separated from the other, physics-related fields of study. There is an inconsistency between visual distance and citation distance in the field of Quantitative Biology, which is the outlier in citation distance but is assigned to the physics-related cluster in visual distance.
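The dendrogram construction can be sketched with SciPy, whose average-linkage method is UPGMA (the field names and distance values below are toy stand-ins for the real signature distances):

```python
# Sketch of UPGMA hierarchical clustering over a field-distance matrix
# (assumed implementation; scipy's "average" linkage is UPGMA).
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
from scipy.spatial.distance import squareform

fields = ["cs.CL", "cs.CV", "hep-ph", "q-bio"]
# Toy symmetric distance matrix between field signatures.
D = np.array([
    [0.0, 0.2, 0.8, 0.7],
    [0.2, 0.0, 0.9, 0.6],
    [0.8, 0.9, 0.0, 0.4],
    [0.7, 0.6, 0.4, 0.0],
])

Z = linkage(squareform(D), method="average")  # UPGMA on the condensed matrix
tree = dendrogram(Z, labels=fields, no_plot=True)
print(tree["ivl"])  # leaf order of the dendrogram
```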
We classify the figures in each cluster to understand its visual composition, using the convolutional neural network classifier from [29] to categorize figures into five categories: (1) Diagram, (2) Plot, (3) Table, (4) Photo, and (5) Equation. The classification results are shown in Fig. 5. Surprisingly, each cluster is prominently associated with a certain type of visualization: one cluster is dominated by diagrams, one by tables, one by plots, and one by photos. These results corroborate previous work that used supervised methods and manual labeling to categorize figures into the same five classes [28]. The distribution of figures helps to reveal the properties of each discipline. For instance, the Plot cluster is dominant in Quantitative Biology (48%) and Nuclear Experiment (60%), which may indicate the degree to which these fields can be considered experimental and data-driven. The distribution can further be used to group similar disciplines and separate dissimilar fields, as we showed in the previous section.
Table 2: Choosing the number of PCA dimensions and the number of clusters (k).

Dimension | Explained Variance Ratio | Average Correlation to Citation Distance | Maximum Correlation to Citation Distance | Maximum at k = ?
16  | 52.0% | 0.661 | 0.737 | 15
32  | 63.7% | 0.631 | 0.768 | 3
64  | 73.9% | 0.660 | 0.769 | 4
128 | 82.3% | 0.662 | 0.770 | 4
256 | 88.9% | 0.672 | 0.793 | 4
320 | 90.7% | 0.674 | 0.793 | 4
Figure 4: The hierarchical clustering dendrograms of visual distance (left), citation distance (middle), and jargon distance (right). Citation distance is the benchmark in our task. It shows a pattern similar to visual distance, in which Computer Science, Statistics, Math, and Mathematical Physics are separated from the rest of the disciplines. The inconsistency between citation distance and visual distance is Quantitative Biology, which is clustered with physics-related disciplines in visual distance while it is isolated in citation distance. Jargon distance, on the other hand, segregates disciplines differently from visual distance and citation distance at the high level: High Energy Physics and Nuclear are separated from the rest, while Quantitative Biology, Computer Science, and Statistics are isolated in a sub-cluster.

Table 3: The correlation results between distance matrices.

Comparison | Results
Visual Distance vs. Citation Distance | r = 0.706, p = 0.0001, z = 5.103
Visual Distance vs. Jargon Distance | r = 0.531, p = 0.0002, z = 5.019
Jargon Distance vs. Citation Distance | r = 0.697, p = 0.0001, z = 5.989
In this section, we focus on the cases in computer science where visual distance and citation distance disagree, and we validate our second hypothesis: visual patterns expose new modalities of communication that are not identifiable by either text or the structure of the citation graph. The analysis aims to answer the following questions: (1) Where are there visual differences in the disciplinary landscape when compared to citation differences? (2) What is revealed about the fields where visual differences occur? We normalize visual distance and citation distance, then subtract visual distance from citation distance to expose the discrepancies. Fig. 6 shows that there is a significant disagreement between visual distance and citation distance for the subfield Computation and Language.

Figure 5: The visual composition of each cluster. It appears that each cluster has one dominant visualization type.

Figure 6: Heat map of differences between visual and citation distance. We normalize visual distance and citation distance and subtract visual distance from citation distance to expose the discrepancies. Red indicates that two subfields are visually distant but near in citation distance. Green indicates that two subfields are distinct in citation distance but visually similar. Computation and Language is visually different from the other subfields in Computer Science but relatively close in terms of citation distance.

Red cells show the disagreements where fields are visually distinct but similar in citation distance. Green cells, in contrast, indicate disciplines that are visually similar but far apart in citation distance. We observe that
Computation and Language is generally close to all other categories in Computer Science, but visually distinct. We further examine the visual profile of Computation and Language in order to better understand the reasons for the divergence between these two distances. Fig. 7 shows the distribution of figure usage in Computation and Language (CL) and Computer Science (CS) over the past ten years. We make two observations from this stacked bar chart: (1) the Table cluster dominates the visual communication style, at over 50% in Computation and Language in 2017 compared to approximately 30% in Computer Science, and it has been growing over the past few years; (2) researchers in Computation and Language use very few figures associated with the Photo cluster. We further investigate why tables are so heavily used in Computation and Language by analyzing the cluster textually. We conduct topic modeling on the captions of the figures in the Table cluster using Non-negative Matrix Factorization (NMF) [26] with five topics. In Table 4, we display the top 10 keywords of each topic along with the ratio of the count of figures in each topic to the total count in the cluster over the past 10 years. We also examine the images in each topic to help us understand its purpose. Based on the keywords and the images, we infer that Topic 0 mostly contains tables with comparison data against other models; Topic 1 includes examples of language and words; Topic 2, similar to Topic 0, also involves comparing results between different models; Topic 3 consists of statistics about the datasets; and Topic 4 is a mix of tables and diagrams, mostly used to illustrate the architecture of LSTM models. It appears that tables comparing the accuracy of different models have grown significantly, from 46.4% (28.6% + 17.8%) in 2008 to 60% (47.6% + 12.4%) in 2017, suggesting that an empirical regime of research is dominant, perhaps due to improved access to advanced computational infrastructure, easy access to data and code, and the rapid growth of the field itself.
Figure 7: The chart shows how the distribution of the clusters evolves in Computation and Language and Computer Science over the past ten years. We observe that the Table cluster has been growing in Computation and Language, and that researchers in Computation and Language use relatively few figures from the Photo cluster.
The classifier achieves an accuracy of 0.902 on the validation set and 0.868 on the test set, with a precision of 0.741 and a recall of 0.827 on neural network diagrams. The confusion matrix of the classifier is shown in Fig. 8. The classifier tends to misclassify flow charts, bar charts, and diagrams with multiple circles as neural network diagrams, and it is also often confused between embedding visualizations and scatter plots (which are indeed quite similar). The classifier appears sufficiently effective at identifying neural network diagrams and embedding visualizations to support the following analysis.

We use the trained classifier to label 60k figures in computer science papers on arXiv and analyze the count of neural network diagrams (top line chart in Fig. 9) and embedding visualizations in computer science disciplines over time. We select four categories: Artificial Intelligence, Machine Learning, Computer Vision, and Computation and Language. These disciplines are known to be strongly involved in neural network research. We also include Computational Complexity, which has less involvement in neural network research, as a control. We also compute the count of papers
Table 4: Top 10 keywords for each topic in Cluster Table, along with the ratio of figures in each topic over time.

Topic 0: results, table, models, different, performance, best, scores, dataset, comparison, accuracy
Topic 1: words, figure, word, number, example, table, sentence, examples, sentences, used
Topic 2: et, al, 2015, 2016, 2014, 2017, 2013, results, 2011, taken
Topic 3: set, test, training, data, development, sets, table, dev, used, statistics
Topic 4: model, language, trained, baseline, lstm, proposed, models, attention, layer, performance

Year   Topic 0   Topic 1   Topic 2   Topic 3   Topic 4
2008   28.6%     25.0%     17.8%     22.9%      5.7%
2009   31.1%     26.8%     16.1%     18.6%      7.4%
2010   31.2%     24.2%     16.9%     21.0%      6.7%
2011   34.2%     25.1%     17.2%     16.3%      7.2%
2012   39.1%     22.7%     16.0%     15.5%      6.7%
2013   37.3%     21.7%     17.3%     15.9%      7.8%
2014   39.4%     19.8%     16.0%     14.9%      9.9%
2015   43.9%     18.7%     14.2%     12.2%     11.0%
2016   45.3%     18.5%     13.4%     10.4%     12.4%
2017   47.6%     17.4%     12.4%     10.1%     12.5%
Figure 8: The confusion matrix of the figure type classifier. The classifier achieves 0.868 overall accuracy.

whose abstracts include "neural network" or "deep learning" in the selected categories over time. The usage profile by field for embedding visualizations is similar to that of neural network diagrams. The trend is shown in the middle line chart in Fig. 9. Finally, we select seven influential papers in deep learning research: AlexNet [23], GAN [17], LSTM [19], ResNet [18], RNN [37], VGG [41], and Word2Vec [34]. We calculate the citation count received by each paper in each year to show the growth of these papers' influence (bottom line chart in Fig. 9). We compare these results with our visualization-based metrics to study our third hypothesis: that we can use specific types of figures to track the propagation of ideas and methods in the literature. From the three plots, we make the following observations. First, the three line charts exhibit the same tendency: a rapid rise in recent years. This common trend is not surprising; increased interest in a topic leads to both increasing citations and an increasing number of relevant diagrams across the literature. Second, the count of papers that include "neural network" in their abstracts steadily increases from 2012 to 2014 (yellow background),
Figure 9: The three line charts demonstrate the trend of recent studies in deep learning using three different media: figures (top), text (middle), and citations (bottom). Top: The number of papers that include neural network diagrams over time. Middle: The count of papers that have "neural network" or "deep learning" in their abstracts over time. Bottom: The citation count of seven selected influential papers in deep learning. The annotation of each influential paper indicates its publication time. The citation counts of the most influential papers and the use of the term "neural network" in abstracts increase quickly (yellow area), but the effect is small. The use of relevant figures increases only once authors start to truly adopt the concept in their research.

as does the citation count of one particular paper, AlexNet. But there is no increase in the use of figures during this period. The cost of mentioning "neural networks" or citing a relevant paper is low, but the cost of developing a relevant figure is high. We interpret this result as evidence that the use of a figure is better correlated with the true adoption of a concept or method, as opposed to simply acknowledging the relevance of a concept or method. After a novel idea is published, the community rapidly begins to discuss the work and, potentially, cite a relevant paper. But it takes time for the community to integrate the concept into their own research; once they have done so, the cost of developing a figure is justified, and the number of figures increases. Only when the community is truly adopting the concept do visualizations begin to emerge in the literature. Third, the number of neural network diagrams increases dramatically in 2015 in the four relevant disciplines, while, except for AlexNet, we do not see such rapid growth of received citation counts until 2017 (ResNet and VGG).
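The text-based trend in the middle chart of Fig. 9 amounts to a per-year keyword count over abstracts. A minimal sketch of that computation, using a hypothetical record layout in place of the actual arXiv metadata:

```python
from collections import Counter

# Hypothetical (year, abstract) records standing in for arXiv metadata.
papers = [
    (2013, "We study graph algorithms for shortest paths."),
    (2014, "A neural network approach to parsing."),
    (2015, "Deep learning for machine translation."),
    (2015, "Recurrent neural network language models."),
]

PHRASES = ("neural network", "deep learning")

def mentions_per_year(papers, phrases=PHRASES):
    """Count, per year, papers whose abstract mentions any target phrase."""
    counts = Counter()
    for year, abstract in papers:
        text = abstract.lower()
        if any(p in text for p in phrases):
            counts[year] += 1
    return dict(counts)

print(mentions_per_year(papers))  # → {2014: 1, 2015: 2}
```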
There is a two-year gap between the emergence of neural network diagrams and the rise in received citation counts. Figures, like text, react faster to the introduction of new ideas than aggregate citation counts do. These results both validate the use of figures as a signal of scientific communication and show that figures expose patterns not otherwise discernible.
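One way to quantify such a lead is to shift the figure-count series forward and find the lag that maximizes its Pearson correlation with the citation-count series. The sketch below uses invented yearly series constructed so that figures lead citations by two years; it illustrates the measurement, not the paper's data:

```python
import numpy as np

def best_lag(leading, trailing, max_lag=4):
    """Return (lag, r): the shift of `leading` (in years) that maximizes
    Pearson correlation with `trailing` over the overlapping window."""
    best = (0, -2.0)
    for lag in range(max_lag + 1):
        a = np.asarray(leading[: len(leading) - lag], dtype=float)
        b = np.asarray(trailing[lag:], dtype=float)
        if len(a) > 2 and a.std() > 0 and b.std() > 0:
            r = float(np.corrcoef(a, b)[0, 1])
            if r > best[1]:
                best = (lag, r)
    return best

# Hypothetical yearly counts, 2010–2017, built so figures lead by ~2 years.
figures   = [1, 1, 2, 5, 12, 30, 55, 80]
citations = [0, 0, 1, 1,  2,  6, 13, 31]
lag, r = best_lag(figures, citations)  # lag is 2 for this constructed example
```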
CONCLUSION

In this study, we demonstrate the feasibility of using visual information as a measure of similarity between fields. We show that visual distance can recover the overall relationships between fields, achieving a moderately high correlation (0.706) with citation distance. In addition, we show that visual distance still delivers valuable information where it disagrees with citation distance. We further conduct a case study on two specific types of figures: neural network diagrams and embedding visualizations. We find that the upward trend in neural network diagrams and embedding visualizations predates the rise in citation counts of influential papers in recent years, providing evidence that figures in the scientific literature are leading indicators of citations.

We plan to extend our study to more fine-grained figure labels. This extension will afford better interpretation of the correlations between figures, text, and citations and help us refine our groupings. In addition, we plan to apply these visual demarcation techniques to tasks in information retrieval and recommender systems.
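The visual-signature machinery summarized here can be sketched compactly: represent each field as a normalized histogram over figure clusters, take pairwise Euclidean distances, and correlate the resulting distance matrix with a citation-based one (the statistic a Mantel test [32] would then assess for significance). The field names and histogram values below are invented for illustration:

```python
import numpy as np

# Hypothetical visual signatures: per-field fractions of figures falling
# into each cluster, ordered as [table, plot, diagram, photo].
signatures = {
    "cs.CL": [0.50, 0.25, 0.20, 0.05],
    "cs.CV": [0.20, 0.30, 0.25, 0.25],
    "cs.CC": [0.30, 0.40, 0.28, 0.02],
}

def distance_matrix(sigs):
    """Pairwise Euclidean distances between field signatures."""
    names = sorted(sigs)
    X = np.array([sigs[n] for n in names])
    diff = X[:, None, :] - X[None, :, :]
    return names, np.sqrt((diff ** 2).sum(axis=-1))

def matrix_correlation(D1, D2):
    """Pearson correlation between the upper triangles of two distance
    matrices -- the quantity a Mantel test evaluates."""
    iu = np.triu_indices_from(D1, k=1)
    return float(np.corrcoef(D1[iu], D2[iu])[0, 1])

names, D_vis = distance_matrix(signatures)
```

In the study itself, `D_vis` would be compared against an analogous citation-distance matrix over the same fields; here the same machinery is only exercised on toy data.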
ACKNOWLEDGMENTS

This research has made use of NASA's Astrophysics Data System Bibliographic Services.
REFERENCES
[1] Alberto Accomazzi, Gunther Eichhorn, Michael J Kurtz, Carolyn S Grant, Edwin Henneken, Markus Demleitner, Donna Thompson, Elizabeth Bohlen, and Stephen S Murray. 2006. Creation and Use of Citations in the ADS. arXiv preprint cs/0610011 (2006).
[2] Bram van den Akker, Ilya Markov, and Maarten de Rijke. 2019. ViTOR: Learning to Rank Webpages Based on Visual Features. arXiv preprint arXiv:1903.02939 (2019).
[3] Rabah A Al-Zaidy and C Lee Giles. 2015. Automatic extraction of data from bar charts. In K-CAP. ACM, 30.
[4] Richard E Bellman. 1961. Adaptive Control Processes: A Guided Tour. Vol. 2045. Princeton University Press.
[5] Stefano Boccaletti, Vito Latora, Yamir Moreno, Martin Chavez, and D-U Hwang. 2006. Complex networks: Structure and dynamics. Physics Reports.
[6] In ECIR. Springer, 797–800.
[7] Zhe Chen, Michael Cafarella, and Eytan Adar. 2015. DiagramFlyer: A search engine for data-driven diagrams. In The Web Conference. ACM, 183–186.
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR. IEEE, 248–255.
[9] Yuxiao Dong, Hao Ma, Zhihong Shen, and Kuansan Wang. 2017. A Century of Science: Globalization of Scientific Collaborations, Citations, and Innovations. In KDD. ACM, 1437–1446.
[10] Stuart E Dreyfus. 1969. An appraisal of some shortest-path algorithms. Operations Research 17, 3 (1969), 395–412.
[11] Guenther Eichhorn. 1994. An overview of the astrophysics data system. Experimental Astronomy 5, 3-4 (1994), 205–220.
[12] Stephanie Elzer, Sandra Carberry, and Ingrid Zukerman. 2011. The automated understanding of simple bar charts. Artificial Intelligence.
[13] In AAAI. 599–605.
[14] Santo Fortunato, Carl T Bergstrom, Katy Börner, James A Evans, Dirk Helbing, Staša Milojević, Alexander M Petersen, Filippo Radicchi, Roberta Sinatra, Brian Uzzi, et al. 2018. Science of science. Science.
[15] In ICDAR. IEEE, 1007–1013.
[16] Eugene Garfield. 2006. The history and meaning of the journal impact factor. JAMA.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems. 2672–2680.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In CVPR. 770–778.
[19] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation 9, 8 (1997), 1735–1780.
[20] Wenyi Huang, Zhaohui Wu, Prasenjit Mitra, and C Lee Giles. 2014. RefSeer: A citation recommendation system. In JCDL. IEEE Press, 371–374.
[21] J.D. West, I. Wesley-Smith, and C.T. Bergstrom. 2016. A recommendation system based on hierarchical clustering of an article-level citation network. IEEE Transactions on Big Data 2, 2 (June 2016), 113–123. https://doi.org/10.1109/TBDATA.2016.2541167
[22] Aniruddha Kembhavi, Mike Salvato, Eric Kolve, Minjoon Seo, Hannaneh Hajishirzi, and Ali Farhadi. 2016. A diagram is worth a dozen images. In ECCV. Springer, 235–251.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems. 1097–1105.
[24] Onur Küçüktunç, Erik Saule, Kamer Kaya, and Ümit V Çatalyürek. 2012. Direction awareness in citation recommendation. (2012).
[25] Michael J Kurtz and Edwin A Henneken. 2017. Measuring metrics: a 40-year longitudinal cross-validation of citations, downloads, and peer review in astrophysics. Journal of the Association for Information Science and Technology 68, 3 (2017), 695–708.
[26] Daniel D Lee and H Sebastian Seung. 1999. Learning the parts of objects by non-negative matrix factorization. Nature.
[27] In The Web Conference Workshop on BigScholar.
[28] Poshen Lee, Jevin West, and Bill Howe. 2017. Viziometrics: Analyzing Visual Patterns in the Scientific Literature. IEEE Transactions on Big Data (2017).
[29] Poshen Lee, T. Sean Yang, Jevin West, and Bill Howe. 2017. PhyloParser: A Hybrid Algorithm for Extracting Phylogenies from Dendrograms. (2017).
[30] Loet Leydesdorff and Ping Zhou. 2007. Nanotechnology as a field of science: Its delineation in terms of journals and patents. Scientometrics 70, 3 (2007), 693–713.
[31] Xiaonan Lu, J Wang, Prasenjit Mitra, and C Lee Giles. 2007. Automatic extraction of data from 2-d plots in documents. In ICDAR, Vol. 1. IEEE, 188–192.
[32] Nathan Mantel. 1967. The detection of disease clustering and a generalized regression approach. Cancer Research 27, 2 Part 1 (1967), 209–220.
[33] IV Marshakova. 1973. Co-Citation in Scientific Literature: A New Measure of the Relationship Between Publications. Scientific and Technical Information Serial of VINITI.
[34] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[35] Douglas L Nelson, Valerie S Reed, and John R Walling. 1976. Pictorial superiority effect. Journal of Experimental Psychology: Human Learning and Memory 2, 5 (1976), 523.
[36] F James Rohlf and David R Fisher. 1968. Tests for hierarchical structure in random data sets. Systematic Biology 17, 4 (1968), 407–412.
[37] David E Rumelhart, Geoffrey E Hinton, and Ronald J Williams. 1986. Learning representations by back-propagating errors. Nature.
[38] papers. In AAAI.
[39] Mingyan Shao and Robert P Futrelle. 2005. Recognition and classification of figures in PDF documents. In International Workshop on Graphics Recognition. Springer, 231–242.
[40] Noah Siegel, Zachary Horvitz, Roie Levin, Santosh Divvala, and Ali Farhadi. 2016. FigureSeer: Parsing result-figures in research papers. In ECCV. Springer, 664–680.
[41] Karen Simonyan and Andrew Zisserman. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[42] Trevor Strohman, W Bruce Croft, and David Jensen. 2007. Recommending citations for academic papers. In SIGIR. ACM, 705–706.
[43] D. Vilhena, J. Foster, M. Rosvall, J.D. West, J. Evans, and C. Bergstrom. 2014. Finding Cultural Holes: How Structure and Culture Diverge in Networks of Scholarly Communication. Sociological Science.
[44] Colin Ware. Information Visualization: Perception for Design. Elsevier.
[45] J.D. West and J. Portenoy. 2016. Delineating Fields Using Mathematical Jargon. In JCDL Workshop on BIRNDL.
[46] Sean Yang, Kuan-Hao Huang, and Bill Howe. 2019. MultiDEC: Multi-Modal Clustering of Image-Caption Pairs. arXiv preprint arXiv:1901.01860 (2019).
[47] Michel Zitt and Elise Bassecoulard. 2006. Delineating complex scientific fields by an hybrid lexical-citation method: An application to nanosciences.