Identifying the Development and Application of Artificial Intelligence in Scientific Text
James Dunham
Center for Security and Emerging Technology, Georgetown University [email protected]
Jennifer Melot
Center for Security and Emerging Technology, Georgetown University [email protected]
Dewey Murdick
Center for Security and Emerging Technology, Georgetown University [email protected]
May 29, 2020

Abstract
We describe a strategy for identifying the universe of research publications relevant to the application and development of artificial intelligence. The approach leverages the arXiv corpus of scientific preprints, in which authors choose subject tags for their papers from a set defined by editors. We compose a functional definition of AI relevance by learning these subjects from paper metadata, and then inferring the arXiv-subject labels of papers in larger corpora: Clarivate Web of Science, Digital Science Dimensions, and Microsoft Academic Graph. This yields predictive classification F1 scores between .75 and .86 for Natural Language Processing (cs.CL), Computer Vision (cs.CV), and Robotics (cs.RO). For a single model that learns these and four other AI-relevant subjects (cs.AI, cs.LG, stat.ML, and cs.MA), we see precision of .83 and recall of .85. We evaluate the out-of-domain performance of our classifiers against other sources of topic information and predictions from alternative methods. We find that a supervised solution can generalize to identify publications that belong to the high-level fields of study represented on arXiv. This offers a method for identifying AI-relevant publications that updates at the pace of research output, without reliance on subject-matter experts for query development or labeling.

∗ We thank Kevin Boyack, Daniel Chou, Teddy Collins, Dick Klavans, and Ilya Rahkovsky for their feedback and ideas on this work. We are grateful to the team at Elsevier for extended discussions about the methodological details of a related project, and for sharing expert-curated keywords and labeled data. Zihe Yang led the replication of the Elsevier approach to identifying AI-relevant research. Neha Tiwari contributed the descriptive analysis of arXiv and conference-paper data, and assisted with model development. For replication materials, see https://github.com/georgetown-cset/ai-relevant-papers.

Introduction

Study of the applications and development of artificial intelligence faces a definitional problem: AI is a moving conceptual target, understood differently across researchers and observers of the field [11]. This presents a challenge for analysts and policy-makers [24]. The proliferating reports on AI describe only partially overlapping domains [1, 17, 2], so their conclusions may be sensitive to the delineation of the field [25]. We describe a strategy for addressing this problem and identifying a universe of AI-relevant scientific publications for use in bibliometric work.

The approach relies on the success of Cornell's arXiv project (https://arxiv.org) in attracting open-access preprints from subfields of computer science, physics, statistics, and other quantitative fields. Authors and editors choose subject tags for these papers. There are 39 subjects in computer science, including those we will consider relevant to AI: Artificial Intelligence, Computer Vision, Computation and Language (Natural Language Processing), Machine Learning, Multiagent Systems, and Robotics. The arXiv labels offer a particular ground truth defined by the participation of an expert community. Additionally, arXiv's implicit definition of subjects has the highly desirable characteristic of updating in real time, as opposed to less-favorable approaches that rely on keyword curation or annotation by subject-matter experts.
Those alternatives tend to require maintenance over time, and as we demonstrate, a query that subject-matter experts calibrate to retrieve AI-relevant publications in 2019 may struggle to surface those from 2010.

We are keenly aware that the subjects comprising AI research and applications are contestable. Rather than argue for a single delineation, we offer an approach which requires only that an operational definition is composable from the subjects available to arXiv authors. The sensitivity of all subsequent analysis to that choice of relevant subjects can be assessed through ablation. Researchers may also add or remove particular subjects as appropriate for their analyses.

We implement this approach by training SciBERT [4] classifiers on arXiv metadata and subject labels. Using the arXiv-trained models, we infer the subject relevance of papers in other corpora. The premise of identifying AI-relevant publications in this way is that a model trained on arXiv data will successfully generalize to other sets of publication data, which may significantly differ in content and subject distribution. This approach seems plausible when leveraging SciBERT's pre-training, but the risk of overfitting to arXiv and gaps in its coverage are concerns we address below with a series of results.

First, to assess performance within arXiv, we evaluate our models on a test set. We observe F1 scores between .75 and .86 for three subject-specific models, and .84 for a model trained on labels collapsed to indicate AI relevance for papers in any of six AI-relevant subjects. For comparison, we also assess a keyword-query solution and a keyword-learner hybrid developed for a recent bibliometric analysis of AI-relevant publications in Scopus [1, 17]. Evaluation against arXiv labels yields F1 scores of .55 and .59, respectively, for these methods.

We then report results from applying the models to scientific text in larger corpora: Clarivate Web of Science (WoS; https://clarivate.com/webofsciencegroup), Digital Science Dimensions, and Microsoft Academic Graph (MAG). In the absence of ground-truth arXiv labels from these sources, we assess out-of-domain performance using other sources of topic information, by showing rates of predicted subject relevance in the fields of study defined by MAG. We find that in the fields represented on arXiv, generalizing for inference in other corpora is feasible. This offers a method for identifying AI-relevant publications that updates at the pace of research output, without reliance on subject-matter experts for query development or labeling.
Related work

Scientific text offers insight into the development of a field: its analysis can identify the organization of research communities; their breakthroughs or stagnation; and progress from basic research to applications [e.g., 22, 5]. The obstacles to such inference are delineation of that field and the identification of emergent topics or technologies within it. In reference to biotech and nanotech in prior decades, Mogoutov and Kahane write, "Their content and dynamic are difficult to track at a time when they are struggling to define what they are, what they include and exclude, and how they organize and classify themselves internally" [15]. A related problem is identifying as-yet-unknown topics within a field, without the benefit of historical perspective. Even in emergent areas, the distinction between "legacy technologies" and "emerging technology" may be incremental [10]. (For a discussion of precisely what constitutes emerging technology, see [23].)

Recent analyses of AI research using query-based methods to delineate the field [16, 14, 18] have encountered these obstacles. Grappling with the problem of query development in bibliometric work on nanotechnology resulted in principled methods for term curation and their evaluation [15, 3, 9, 13], from which studies of AI could benefit. Drawing from this literature, for example, Huang et al. develop a method for retrieving "big data" research that expands from an initial set of terms across iterations of discovery, manual review, expert checks, and tuning for performance [10].

Other approaches to delineation depend on or begin with the identification of relevant journals [8] or conferences [12, 20]. While appropriate for some analytic purposes, this method risks omitting relevant research in more general-audience venues or other disciplines, which may be a particularly acute problem for AI.
In review of the variety of methods for delineating the field of AI-relevant research, we note that beyond the methodological difficulties, the criteria for a system's intelligence vary by observer and over time. In the typology developed by Russell and Norvig [19], definitions may emphasize behavior or reasoning, and evaluate it against human or rational standards. In recent survey research [11], AI researchers tended to prefer definitions that emphasized the correctness of decisions and actions, but often disagreed on what satisfied these requirements.

Our own interest in high-quality analysis of AI and its security implications requires a solution for identifying AI-relevant research that is robust to the diversification of methods, tasks, and applications over time. (The Center for Security and Emerging Technology (CSET) studies the security impacts of emerging technologies and delivers nonpartisan analysis to the policy community. See examples of reports that depend on various AI definitions at https://cset.georgetown.edu/reports.) In this context, expert query development is increasingly impractical. The solution that we describe in this paper embraces the dynamics of emerging technologies.

Data

arXiv is organized into high-level domain repositories for physics, biology, computer science, statistics, and so forth. Each of these repositories further defines a set of subjects to organize its content. Authors select one or more subjects to describe each paper they submit. Editors later review these subject tags [6]. arXiv's Computing Research Repository (CoRR) defines 39 subjects, including artificial intelligence and machine learning. (For the full taxonomy, see https://arxiv.org/category_taxonomy.)

We focus in this paper on six subjects that CoRR editors describe as related to AI: Artificial Intelligence, Computation and Language (NLP), Computer Vision and Pattern Recognition (CV), Machine Learning, Multiagent Systems, and Robotics. (In the Machine Learning subject, we include machine learning papers from the statistics repository, stat.ML; cross-posting between the two categories is automatic.) According to CoRR documentation (https://arxiv.org/corr/subjectclasses), the Artificial Intelligence subject "[c]overs all areas of AI except Vision, Robotics, Machine Learning, Multiagent Systems, and Computation and Language (Natural Language Processing)," because these areas have their own subjects. It specifically "includes Expert Systems, Theorem Proving [...], Knowledge Representation, Planning, and Uncertainty in AI." The Machine Learning subject "[c]overs all aspects of machine learning research [and] is also an appropriate primary category for applications of machine learning methods." Because these applications may have their own subject areas, CoRR documentation specifies, "If the primary domain of the application is available as another category in arXiv and readers of that category would be the main audience, that category should be primary." Some explicit examples of this are papers on CV, NLP, information retrieval, speech recognition, and neural networks.

Using arXiv submissions in these categories as training data for subject classifiers, and defining AI-relevant research as the union of their positive predictions, is a useful framework for future researchers who may have differing needs or views on what constitutes AI. Adding Neural and Evolutionary Computing or Information Retrieval papers might be warranted in future work. We exclude them here for consistency with the CoRR editors' description of the Artificial Intelligence subject, but in practice, we suggest evaluating how sensitive quantities of interest are to these choices.
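As a concrete illustration of this operational definition, the following minimal sketch derives binary training labels from a paper's arXiv subject tags. The function names and the list-of-strings representation of tags are our own; only the choice of subjects follows the text.

    # Derive training labels from arXiv subject tags. A paper is AI-relevant
    # if any of its tags falls in the six chosen subjects; stat.ML counts as
    # Machine Learning, mirroring its automatic cross-listing with cs.LG.
    AI_SUBJECTS = {"cs.AI", "cs.CL", "cs.CV", "cs.LG", "stat.ML", "cs.MA", "cs.RO"}

    def ai_relevant(tags: list[str]) -> bool:
        """Collapsed label for the binary 'all subjects' classifier."""
        return not AI_SUBJECTS.isdisjoint(tags)

    def subject_label(tags: list[str], subject: str) -> bool:
        """One-versus-all label for a single subject model, e.g. 'cs.CL'."""
        return subject in tags

    # A paper cross-listed between Robotics and Machine Learning:
    assert ai_relevant(["cs.RO", "cs.LG"])
    assert subject_label(["cs.RO", "cs.LG"], "cs.RO")

Because a subject can be added or removed by editing AI_SUBJECTS, the sensitivity analyses suggested above reduce to re-deriving labels and re-training.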
The compositional effect of including or excluding some subjects will be modest, due to patterns of cross-posting papers across related subjects. There are 3,464 papers in our data with Information Retrieval as their primary subject, and 42% also appear in one or more of the six subjects we consider AI-relevant here. Of the 2,942 papers with the primary category of Neural and Evolutionary Computing, 39% are cross-posted to at least one of our AI-relevant subjects, primarily Machine Learning.

From 2010 through 2019, authors submitted 1,060,321 papers to arXiv. (We restrict this effort to the last decade of arXiv papers to ensure reasonable numbers of papers in each subject in every year.) The largest repositories at the end of this decade, counting by papers' primary subjects, are physics (540,692), math (270,244), and computer science (194,627). Table 1 shows paper counts in the six computer science subjects we consider relevant. There are 85,670 papers whose primary subject, the first selected by authors, is one of these six. Authors can cross-post their papers under additional subjects, however, and when these cross-posts are included there are 107,380 papers across the relevant subjects.

Table 1: arXiv papers in the six AI-relevant subjects, 2010-2019, counted by primary subject and by any subject (including cross-posts).

    Subject                                Primary    Any
    Artificial Intelligence (cs.AI)          8,941    19,964
    Natural Language Processing (cs.CL)     11,881    15,361
    Computer Vision (cs.CV)                 28,309    35,254
    Machine Learning (cs.LG, stat.ML)       30,175    52,909
    Multiagent Systems (cs.MA)                 985     2,602
    Robotics (cs.RO)                         5,379     7,933
    Any of the above                        85,670   107,380

Our targets for inference are larger corpora: Clarivate's Web of Science (WoS) Core Collection, Digital Science Dimensions, and Microsoft Academic Graph (MAG). Training on arXiv is appealing for reasons we have described, but we ultimately care about performance in these more general knowledge bases, and many differences separate them. The disciplinary coverage of the larger sources is broader, spanning fields in which we expect to find no AI-relevant papers. For our analysis below, we create a combined corpus of unique English-language publications from Dimensions, MAG, and WoS in 2010 through 2019. The result after deduplication is an analytic corpus of 38.6 million publications.

Methods

From the arXiv corpus we draw two 10% samples for development and testing, stratifying by publication year and subject label. We use the resulting partition to train and evaluate solutions for identifying AI-relevant and subject-relevant publications.

Our baseline solution uses keyword matches. We use 100 terms and patterns that we developed for a variety of document-retrieval tasks in early spring 2019, in a manual process: we reviewed search results, adapted the term list, and iterated until satisfied. (See Appendix B.) If one of these terms is present in the title or abstract of a publication, we consider that publication AI-relevant. Our expectation was that this approach would achieve reasonable precision but low recall. When tested against arXiv papers, considering papers in any of the six chosen subjects to be AI-relevant, we observe precision of .76 and recall of .43 (F1 = .55).
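A minimal sketch of this matching rule, assuming the wildcard semantics given in Appendix B (* matches zero or more non-whitespace characters); only a few of the 100 terms are shown, and the word-boundary handling is our own choice rather than a documented detail:

    import re

    # A handful of the terms and patterns from Appendix B.
    TERMS = ["machine learning", "neural network", "autonomous vehicle*",
             "fac* recognition", "self driving car*"]

    def term_to_regex(term: str) -> re.Pattern:
        # Escape everything, then turn the escaped '*' back into '\S*'.
        # Word boundaries keep 'learning' from matching inside other tokens.
        pattern = re.escape(term).replace(r"\*", r"\S*")
        return re.compile(rf"\b{pattern}\b", re.IGNORECASE)

    PATTERNS = [term_to_regex(t) for t in TERMS]

    def keyword_relevant(title: str, abstract: str) -> bool:
        """AI-relevant if any term appears in the title or abstract."""
        text = f"{title} {abstract}"
        return any(p.search(text) for p in PATTERNS)

    assert keyword_relevant("Self driving cars", "We study autonomous vehicles.")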
A second approach for comparison is a keyword-classifier hybrid developed by Elsevier [21] as part of a bibliometric study of AI. The Elsevier group first extracted candidate terms from diverse textual sources, drawing from syllabi, books, patents, textbooks, the Cooperative Patent Classification scheme, and AI news coverage (https://aitopics.org). The initial result was 800,000 keywords, which the group iteratively reduced to 797 distinct and specific terms.

The Elsevier team solicited comments on this set of terms from outside subject-matter experts. Characteristically [11], however, these experts could not agree on any common set of keywords "representative enough to scope the breadth of the field and [...] specific enough to AI" [21]. The solution was for internal experts to score the terms on a three-point scale, and then task the outside experts with labeling a collection of publications that included the keywords. This account illustrates the difficulty of delineating the field by consensus, and the investment that expert labeling entails.

Ultimately, incidence of the 797 terms in the input text was the basis for a series of features: variously weighted counts and proportions of lower- and higher-scoring terms in title and abstract text. Following [21], we apply a random forest model to learn weights for these features using the training set drawn from the arXiv corpus.

We depart from a replication of the Elsevier method by training on arXiv, and the implementation details of doing so may not correspond with the original work. Using a grid search to tune hyperparameter values and evaluating performance through cross-validation, we see precision of .74 and recall of .49 (F1 = .59) in prediction of AI-relevant articles. These results outperform our baseline keyword solution. We describe this process further in Appendix C. (For implementation details and replication code, see https://github.com/georgetown-cset/ai-relevant-papers.)
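The sketch below shows the general shape of this feature construction and model. The scored terms, feature set, and hyperparameters here are placeholders, not the 797 expert-scored terms or the tuned values from [21] or our replication:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Hypothetical scored term list: term -> expert score on a three-point
    # scale. The actual terms and scores are Elsevier's.
    SCORED_TERMS = {"machine learning": 3, "neural network": 3, "heuristic": 1}

    def features(title: str, abstract: str) -> list[float]:
        """Counts and proportions of lower- and higher-scoring terms,
        computed separately for title and abstract text."""
        row = []
        for text in (title.lower(), abstract.lower()):
            hits = {t: s for t, s in SCORED_TERMS.items() if t in text}
            n_tokens = max(len(text.split()), 1)
            high = sum(1 for s in hits.values() if s >= 2)
            low = sum(1 for s in hits.values() if s < 2)
            row += [high, low, high / n_tokens, low / n_tokens]
        return row

    # Fit on arXiv-derived labels; class_weight mitigates the roughly 9:1
    # imbalance in favor of negative examples noted elsewhere in the text.
    X = np.array([features("Neural network pruning", "A machine learning study."),
                  features("Galaxy surveys", "A catalog of quasars.")])
    y = np.array([1, 0])
    clf = RandomForestClassifier(n_estimators=100, class_weight="balanced").fit(X, y)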
Lastly, we apply SciBERT [4], a BERT [7] model pre-trained on full text from Semantic Scholar, then frozen and used to embed the title and abstract text of publications for classification. Here we use the same tuning parameters as reported for the text-classification task in [4]. We consider papers tagged with any of the six subjects to be AI-relevant and train a binary "all subjects" classifier. In evaluation on the arXiv test set, we find improvements from SciBERT over the previous methods, with precision of .83 and recall of .85 (F1 = .84). We also train classifiers for AI-relevant subjects separately, one-versus-all. This effort is successful for the three subjects that correspond with well-defined application fields: NLP (F1 = .86), Computer Vision (F1 = .84), and Robotics (F1 = .75).

In Table 2, we summarize the test performance of the baseline keyword solution, the Elsevier method, and the SciBERT models. The all-subjects SciBERT model outperforms the alternative methods in the test data, and in comparison with the keyword-reliant solutions, we find appealing the availability of real-time updates from new arXiv content and the straightforward decomposability of AI-relevant research into subjects like computer vision.

Table 2: F1 scores of .84 for the all-subjects SciBERT model and between .75 and .86 for subject-specific models. Our adaptation of the Elsevier AI model [21] outperforms our keywords but falls behind the BERT models.

    Method                                         Precision  Recall   F1
    CSET Keywords                                     .76      .43    .55
    Elsevier Keyword-classifier Hybrid [21]           .74      .49    .59
    SciBERT, All Subjects                             .83      .85    .84
    SciBERT, Natural Language Processing (cs.CL)      .86      .86    .86
    SciBERT, Computer Vision (cs.CV)                  .87      .81    .84
    SciBERT, Robotics (cs.RO)                         .78      .73    .75

In Figure 1, we assess the longitudinal performance of the keyword, Elsevier hybrid, and SciBERT classifiers. The keyword solution performs best (F1 = .61) in 2019, the year we developed it. Its performance declines steadily in prior years, which we find unsurprising in a fast-moving field. Elsevier's model and the SciBERT all-subjects model exhibit the same pattern, but for different reasons.

[Figure 1 (axes: year vs. F1): Higher performance from the supervised methods in more recent years is due in large part to longitudinal imbalance in the training data. Resampling or other strategies for imbalanced data can address this as appropriate for downstream analyses. The variation in keyword performance, by contrast, is the sign of a fast-moving field.]

The appropriate response to this imbalance depends on the analytic context. (It is also possible that classification in earlier years is more difficult than in recent years, or for that matter easier, but the imbalance confounds direct evaluation.) The expansion of arXiv since 2010 is attributable to its popularity relative to traditional journals, the growth of the particular fields arXiv covers, and secular trends in research output. When training a classifier on arXiv for inference in WoS or elsewhere, one might seek the highest performance overall or prefer stable performance within strata meaningful in downstream analysis. We suggest comparing the performance of a single model to that of period-specific models if inference focuses on time-series measures.
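Before turning to out-of-domain validation, here is a minimal sketch of the frozen-embedding classification setup described above, assuming the Hugging Face transformers checkpoint allenai/scibert_scivocab_uncased. Pooling on the [CLS] vector and the logistic-regression head are illustrative simplifications, not the tuned configuration from [4]:

    import torch
    from transformers import AutoModel, AutoTokenizer
    from sklearn.linear_model import LogisticRegression

    tokenizer = AutoTokenizer.from_pretrained("allenai/scibert_scivocab_uncased")
    encoder = AutoModel.from_pretrained("allenai/scibert_scivocab_uncased")
    encoder.eval()  # frozen: used only to embed title and abstract text

    def embed(title: str, abstract: str) -> torch.Tensor:
        inputs = tokenizer(f"{title} {abstract}", truncation=True,
                           max_length=512, return_tensors="pt")
        with torch.no_grad():
            hidden = encoder(**inputs).last_hidden_state
        return hidden[0, 0]  # [CLS] vector as the document embedding

    # Train a binary head on arXiv-derived labels; at inference time the
    # same embed() is applied to WoS, Dimensions, or MAG records.
    X = torch.stack([embed("Attention is all you need", "A transformer model..."),
                     embed("Quasar catalog", "We survey distant galaxies...")]).numpy()
    y = [1, 0]
    head = LogisticRegression(max_iter=1000).fit(X, y)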
Because we lack gold labels for straightforward estimation of the models' performance outside of arXiv, we compare their predictions to other sources of subject information. MAG provides a rich taxonomy of fields of study useful for this purpose. (MAG provides field scores for each paper: the positive subset of cosine similarities between its embedding and those of fields. Here we consider a paper to belong to a field of study if its score is positive. Like arXiv subjects, MAG fields are non-exclusive; many papers have positive field scores for more than one field.) The topical scope of MAG is broader than arXiv, so we approach generalization with some caution, limiting it to fields well-represented on arXiv. During training, for example, the SciBERT classifiers encountered few papers in chemistry, medicine, or the social sciences. (We omit from prediction the MAG fields of Art, Business, Chemistry, Environmental science, Geography, History, Medicine, Philosophy, Political science, Psychology, and Sociology.)

Table 3 shows, for top-level fields along with subfields of computer science, the proportion of articles predicted relevant by each method. Each row in the table represents publications in a MAG field, and each column a method or model. "SciBERT" refers to the All Subjects model, and from left to right, the arXiv subject abbreviations refer to the Computation and Language (NLP), Computer Vision, and Robotics subject models. (We necessarily restrict this table to publications found in MAG; these are 90% of the unique articles across Dimensions, WoS, and MAG.)

Plausibly, the keyword, Elsevier, and SciBERT methods for identifying AI-relevant publications yield the highest prediction rates in artificial intelligence, computer vision, data mining, machine learning, natural language processing, pattern recognition, and speech recognition. Consistent with test performance, which showed higher recall for the all-subjects SciBERT model (.85) than the hybrid (.49) or keyword (.43) methods, the SciBERT model tends to predict much larger proportions of these fields to be relevant. The MAG fields of study are themselves estimates, however, so this is a validation exercise rather than an evaluation against ground truth.

The final columns of Table 3 give corresponding statistics for the subject-specific SciBERT models. The NLP (cs.CL) model identifies 77% of papers in MAG's natural language processing field as relevant, along with 22% of the speech recognition field and 18% of information retrieval papers. The subject model successfully discriminates between NLP papers and those in machine learning (only 7% relevant) or artificial intelligence (8%). Predictions from the computer vision (cs.CV) model identify 53% of the computer vision field and 54% of pattern recognition papers as relevant. Positive predictions from the robotics (cs.RO) model are relatively rare, but it identifies 71% of papers in the robotics subfield of engineering and mathematics as relevant, along with 17% of the simulation subfield and 11% of human-computer interaction.
Table 3: Percent of publications in each MAG field predicted relevant, by method. "SciBERT" is the all-subjects model; cs.CL, cs.CV, and cs.RO are the subject-specific models.

    MAG Field / Subfield                   Count     Keywords  Elsevier  SciBERT  cs.CL  cs.CV  cs.RO
    Biology                            8,820,224         1         1        1      0      0      1
    CS / Algorithm                       403,571        14        17       26      1      8      2
    CS / Artificial intelligence       1,243,775        39        38       66      8     31      6
    CS / Computational science            18,629         5         5        5      0      1      1
    CS / Computer architecture            15,018        11        11        7      0      1      1
    CS / Computer engineering             20,994        15        16       14      0      4      2
    CS / Computer graphics (images)       58,976        10         5       30      0     23      3
    CS / Computer hardware               115,751         5         3        6      0      2      2
    CS / Computer network                418,390         3         5        2      0      0      0
    CS / Computer security               220,493         5         5        5      0      1      1
    CS / Computer vision                 494,902        29        23       64      0     53      9
    CS / Data mining                     345,223        28        31       42      4      6      1
    CS / Data science                    105,878        14        17       17      4      1      0
    CS / Database                        102,016         6         7        9      1      2      1
    CS / Distributed computing           276,100         7        12        9      0      1      2
    CS / Embedded system                 125,784         4         4        8      0      1      4
    CS / Human–computer interaction      129,101        11        15       31      2      3     11
    CS / Information retrieval           108,145        28        27       44     18      6      0
    CS / Internet privacy                 80,802         2         3        2      1      0      0
    CS / Knowledge management            318,313         3         8        6      1      0      0
    CS / Library science                 166,741         1         1        1      1      0      0
    CS / Machine learning                360,586        51        55       71      7     15      2
    CS / Multimedia                      219,419         6         9       11      2      2      1
    CS / Natural language processing     103,318        38        41       79     77      5      0
    CS / Operating system                 50,324         2         2        2      0      0      1
    CS / Parallel computing               77,951         7         8        6      0      2      0
    CS / Pattern recognition             360,826        52        48       80      3     54      1
    CS / Programming language             50,998         5        10        9      3      0      1
    CS / Real-time computing             271,586         8        10       12      0      3      3
    CS / Simulation                      280,108         6         9       23      0      2     17
    CS / Software engineering             77,539         4        10        8      1      0      2
    CS / Speech recognition              106,362        41        37       58     22     11      1
    CS / Telecommunications               86,710         1         2        1      0      0      0
    CS / Theoretical computer science    152,733        11        16       20      1      2      1
    CS / World Wide Web                  228,179         6         9        8      3      0      0
    Economics                          3,370,477         1         2        1      0      0      0
    Engineering                        6,518,254         3         5        6      0      0      4
    Engineering / Robotics                29,488        21        25       72      0      9     71
    Geology                            1,610,737         2         2        3      1      1      2
    Materials science                  2,407,580         0         1        1      0      0      1
    Mathematics                        4,032,139         5         8        8      0      1      2
    Physics                            4,195,403         1         1        1      0      0      0
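Given per-paper predictions and MAG field scores, the rates in Table 3 reduce to a grouped mean. A sketch with pandas, using hypothetical column names and toy values in place of the real data:

    import pandas as pd

    # One row per (paper, field) pair: the MAG field score plus a model's
    # binary prediction for that paper. Column names are our own.
    df = pd.DataFrame({
        "field":   ["CS / Machine learning", "CS / Machine learning", "Biology"],
        "score":   [0.41, 0.07, 0.33],   # MAG field score (cosine similarity)
        "scibert": [1, 0, 0],            # all-subjects model prediction
    })

    members = df[df["score"] > 0]        # positive score => field membership
    rates = members.groupby("field")["scibert"].mean().mul(100).round()
    print(rates)  # percent of each field's papers predicted AI-relevant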
Discussion

Our results demonstrate high classification performance from SciBERT [4] models applied to learning arXiv subjects. Although we did not evaluate SciBERT against a comparable BERT model pre-trained on Wikipedia and the BookCorpus [7], we attribute some of this performance to transfer learning via SciBERT's embedding of scientific vocabulary after pre-training on Semantic Scholar. Within the set of topics the models saw in training on arXiv papers, inference in WoS appears feasible: we observe plausible rates of predicted relevance in MAG fields of study.

Looking forward, manual annotation is the obvious solution to our lack of labeled examples in Dimensions, MAG, and WoS. However, developing guidelines for labeling publications for AI relevance would require addressing definitional questions we sidestepped in this work; it would represent a departure from using the implicit delineation of the field provided by arXiv preprints. But we anticipate that labeling examples to approximate the boundaries of arXiv subjects, like NLP and computer vision, is far more tractable than manual labeling for AI relevance.

The arXiv corpus exhibits a class imbalance of about 9:1 in favor of negative examples. In the analytic corpus, whose topical coverage is broader, we assume the true imbalance is greater. The appropriate tuning for class performance will depend on the application.

Another major direction for future work is expanding domain generalizability, particularly in potential application areas. We have substantive interest in papers on topics unavailable in arXiv, from agriculture to medicine. We would consider reports of AI applications in trade journals to be AI-relevant in principle, for example, but we focus in this paper on a delineation of the field whose implementation may not include them. To expand into these areas, we anticipate leveraging bibliometric data in addition to text: applying scientometric methods to extend the identification of publications describing the development and applications of AI beyond arXiv's coverage.
References

[1] Artificial intelligence: How knowledge is created, transferred, and used. Technical report, Elsevier.
[2] G. C. Allen. Understanding China's AI strategy: Clues to Chinese strategic thinking on artificial intelligence and national security. Technical report, Center for a New American Security.
[3] S. K. Arora, A. L. Porter, J. Youtie, and P. Shapira. Capturing new developments in an emerging technology: An updated search strategy for identifying nanotechnology research outputs. Scientometrics, 95(1):351-370, Apr. 2013. doi: 10.1007/s11192-012-0903-6.
[4] I. Beltagy, A. Cohan, and K. Lo. SciBERT: Pretrained contextualized embeddings for scientific text. CoRR, 2019. URL http://arxiv.org/abs/1903.10676.
[5] K. W. Boyack, M. Patek, L. H. Ungar, P. Yoon, and R. Klavans. Classification of individual articles from all of science by research level. Journal of Informetrics, 8(1):1-12. doi: 10.1016/j.joi.2013.10.005.
[6] C. B. Clement, M. Bierbaum, K. P. O'Keeffe, and A. A. Alemi. On the use of arXiv as a dataset. 2019. URL http://arxiv.org/abs/1905.00075.
[7] J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. 2018. URL http://arxiv.org/abs/1810.04805.
[8] F. Gao, X. Jia, Z. Zhao, C.-C. Chen, F. Xu, Z. Geng, and X. Song. Bibliometric analysis on tendency and topics of artificial intelligence over last decade. Microsystem Technologies. doi: 10.1007/s00542-019-04426-y.
[9] C. Huang, A. Notten, and N. Rasters. Nanoscience and technology publications and patents: A review of social science studies and search strategies. The Journal of Technology Transfer, 36(2):145-172, Apr. 2011. doi: 10.1007/s10961-009-9149-8.
[10] Y. Huang, J. Schuehle, A. L. Porter, and J. Youtie. A systematic method to create search strategies for emerging technologies based on the Web of Science: Illustrated for "big data". Scientometrics, 105(3):2005-2022. doi: 10.1007/s11192-015-1638-y.
[11] P. M. Krafft, M. Young, M. Katell, K. Huang, and G. Bugingo. Defining AI in policy versus practice. URL http://arxiv.org/abs/1912.11095.
[12] F. Martínez-Plumed, B. S. Loe, P. Flach, S. Ó hÉigeartaigh, K. Vold, and J. Hernández-Orallo. The facets of artificial intelligence: A framework to track the evolution of AI. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pages 5180-5187. International Joint Conferences on Artificial Intelligence Organization. doi: 10.24963/ijcai.2018/718.
[13] D. H. Milanez, E. Noyons, and L. I. L. de Faria. A delineating procedure to retrieve relevant publication data in research areas: The case of nanocellulose. Scientometrics, 107(2):627-643. doi: 10.1007/s11192-016-1922-5.
[14] K. Miyazaki and R. Sato. Analyses of the technological accumulation over the 2nd and the 3rd AI boom and the issues related to AI adoption by firms. Pages 1-7. IEEE. doi: 10.23919/PICMET.2018.8481822.
[15] A. Mogoutov and B. Kahane. Data search strategy for science and technology emergence: A scalable and evolutionary query for nanotechnology tracking. Research Policy, 36(6):893-903, 2007. doi: 10.1016/j.respol.2007.02.005.
[16] J. Niu, W. Tang, F. Xu, X. Zhou, and Y. Song. Global research on artificial intelligence from 1990-2014: Spatially-explicit bibliometric analysis. ISPRS International Journal of Geo-Information, 5(5):66, 2016. doi: 10.3390/ijgi5050066.
[17] R. Perrault, Y. Shoham, E. Brynjolfsson, J. Clark, J. Etchemendy, B. Grosz, T. Lyons, J. Manyika, S. Mishra, and J. C. Niebles. The AI Index 2019 annual report. Technical report, AI Index Steering Committee, Human-Centered AI Institute, 2019. URL https://hai.stanford.edu/sites/g/files/sbiybj10986/f/ai_index_2019_report.pdf.
[18] J. Rincon-Patino, G. Ramirez-Gonzalez, and J. C. Corrales. Exploring machine learning: A bibliometric general approach using CiteSpace. F1000Research, 7:1240, 2018. doi: 10.12688/f1000research.15619.1.
[19] S. Russell and P. Norvig. Artificial Intelligence: A Modern Approach. Pearson, 3rd edition, 2009. ISBN 978-0-13-604259-4.
[20] A. K. Shukla, M. Janmaijaya, A. Abraham, and P. K. Muhuri. Engineering applications of artificial intelligence: A bibliometric analysis of 30 years (1988-2018). Engineering Applications of Artificial Intelligence, 85:517-532, 2019. doi: 10.1016/j.engappai.2019.06.010.
[21] M. Siebert, C. Kohler, A. Scerri, and G. Tsatsaronis. Technical background and methodology for the Elsevier's artificial intelligence report. Technical report, Elsevier.
[22] H. Small, K. W. Boyack, and R. Klavans. Identifying emerging topics in science and technology. Research Policy, 43(8):1450-1467. doi: 10.1016/j.respol.2014.02.005.
[23] A. Suominen and N. C. Newman. Exploring the fundamental conceptual units of technical emergence. Pages 1-5, 2017. doi: 10.23919/PICMET.2017.8125287.
[24] D. Tarraf, W. Shelton, E. Parker, B. Alkire, D. Carew, J. Grana, A. Levedahl, J. Leveille, J. Mondschein, J. Ryseff, A. Wyne, D. Elinoff, E. Geist, B. Harris, E. Hui, C. Kenney, S. Newberry, C. Sachs, P. Schirmer, D. Schlang, V. Smith, A. Tingstad, P. Vedula, and K. Warren. The Department of Defense Posture for Artificial Intelligence: Assessment and Recommendations. RAND Corporation. doi: 10.7249/RR4229.
[25] M. Zitt. Meso-level retrieval: IR-bibliometrics interplay and hybrid citation-words methods in scientific fields delineation. Scientometrics, 102(3):2223-2245. doi: 10.1007/s11192-014-1482-5.
Appendix A: Further results
Table A.1 reports the evaluation of keywords (Appendix B) in the full arXiv data by year. Scores are for the positive class. The "Support" column gives the number of AI-relevant articles out of the "Total" articles, where AI relevance is defined, as elsewhere, by having at least one of the six selected subject tags: cs.AI, cs.CL, cs.CV, cs.LG/stat.ML, cs.MA, and cs.RO. Performance is highest in 2019, when we generated the terms. We take the declining performance in earlier years to suggest the need for continuous maintenance of keywords.

Table A.1: Keyword performance in full arXiv data.

    Year  Precision  Recall   F1    Support    Total
    2010     .50       .27    .35     1,379     70,286
    2011     .54       .24    .33     2,025     76,605
    2012     .63       .25    .36     3,370     84,389
    2013     .65       .25    .36     4,561     92,866
    2014     .66       .31    .43     4,896     97,598
    2015     .71       .36    .48     6,663    105,128
    2016     .78       .41    .54    10,566    113,436
    2017     .77       .44    .56    15,670    123,781
    2018     .77       .48    .59    23,891    140,392
    2019     .80       .49    .61    34,359    155,840
    All      .76       .43    .55   103,380  1,060,321

In Table A.2, we show the test performance of the keyword-classifier hybrid developed by Elsevier. This solution shows improvements over our baseline keyword solution. We attribute higher performance in more recent years to longitudinal imbalance in the training data. There is also a class imbalance of about 9:1 in favor of negative examples. Its effect on performance is apparent despite the use of class weights.
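The by-year breakdowns in Tables A.1, A.2, and B.2 follow from grouping test predictions by publication year. A sketch using scikit-learn, assuming equal-length arrays y_true, y_pred, and years; the toy data stands in for real model output:

    import numpy as np
    from sklearn.metrics import precision_recall_fscore_support

    def by_year(y_true, y_pred, years):
        """Positive-class precision, recall, and F1 per publication year."""
        for year in np.unique(years):
            mask = years == year
            p, r, f1, _ = precision_recall_fscore_support(
                y_true[mask], y_pred[mask], average="binary")
            print(f"{year}: P={p:.2f} R={r:.2f} F1={f1:.2f} "
                  f"support={int(y_true[mask].sum())}")

    # Toy example: flip 20% of the true labels to simulate predictions.
    rng = np.random.default_rng(0)
    years = rng.choice([2018, 2019], size=200)
    y_true = rng.integers(0, 2, size=200)
    y_pred = y_true ^ (rng.random(200) < 0.2)
    by_year(y_true, y_pred, years)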
Table A.2: Elsevier keyword-classifier performance in arXiv test data.

            Positive class                  Negative class                 Wtd. avg.
    Year   Prec.  Rec.   F1   Support    Prec.  Rec.   F1   Support      F1   Support
    2010    .50   .31   .38      138      .99   .99   .99    6,891      .98     7,029
    2011    .50   .30   .38      202      .98   .99   .99    7,458      .97     7,660
    2012    .58   .26   .36      337      .97   .99   .98    8,102      .96     8,439
    2013    .60   .28   .39      456      .96   .99   .98    8,831      .95     9,287
    2014    .59   .31   .41      489      .96   .99   .98    9,271      .95     9,760
    2015    .69   .42   .52      666      .96   .99   .97    9,847      .95    10,513
    2016    .75   .45   .57    1,057      .95   .98   .97   10,287      .93    11,344
    2017    .74   .49   .59    1,567      .93   .98   .95   10,811      .91    12,378
    2018    .75   .55   .64    2,389      .91   .96   .94   11,650      .89    14,039
    2019    .81   .55   .66    3,436      .88   .96   .92   12,148      .86    15,584
    All     .74   .49   .59   10,737      .94   .98   .96   95,296      .92   106,033

Table B.2 gives the test performance of the all-subject SciBERT model. Like the Elsevier solution, the best results are for recent years, due to longitudinal imbalance.
Appendix B: Keywords
Table B.1: We use these terms and patterns in our baseline search strategy. Originally, we developed this list for document-retrieval tasks on a variety of knowledge bases, such as WoS, ProQuest, Dimensions, and CNKI, in early spring 2019. The * character represents a wildcard that matches zero or more non-whitespace characters.

    active learning                     incremental clustering
    adaptive learning                   information extraction
    anomaly detection                   information fusion
    artificial intelligence             information retrieval
    associative learning                k nearest neighbor
    autonomous navigation               knowledge based system*
    autonomous system*                  knowledge discovery
    autonomous vehicle*                 knowledge representation
    average link clustering             language identification
    back propagation                    machine learning
    backpropagation                     machine perception
    binary classification               machine translation
    bioNLP                              multi class classification
    boltzmann machine                   multi label classification
    character recognition               multi task learning
    classification algorithm            natural language generation
    classification label*               natural language processing
    clustering method*                  natural language understanding
    complete link clustering            neural network
    computer aided diagnosis            object recognition
    computer vision                     one shot learning
    deep learning                       pattern matching
    ensemble learning                   pattern recognition
    evolutionary algorithm              random forest
    fac* expression recognition         recommend* system*
    fac* identification                 recurrent network
    fac* recognition                    reinforcement learning
    feature extraction                  scene* classification
    feature learning                    scene* understanding
    feature matching                    self driving car*
    feature selection                   semi supervised learning
    feature vector                      sentiment classification
    feedforward network                 single link clustering
    feedforward neural network          spatial learning
    fuzzy clustering                    speech processing
    generative adversarial network      speech recognition
    gradient algorithm                  speech synthesis
    graph matching                      statistical learning
    graphical model                     strong artificial intelligence
    handwriting recognition             supervised learning
    hierarchical clustering             support vector machine
    hierarchical model                  text mining
    human robot                         text processing
    image annotation                    transfer learning
    image classification                translation system
    image matching                      unsupervised learning
    image processing                    video classification
    image registration                  video processing
    image representation                weak artificial intelligence
    image retrieval                     zero shot learning
Table B.2: All-subject SciBERT performance in arXiv test data.

            Positive class                  Negative class                 Wtd. avg.
    Year   Prec.  Rec.   F1   Support    Prec.  Rec.   F1   Support      F1   Support
    2010    .64   .63   .64      138      .99   .99   .99    6,891      .99     7,029
    2011    .65   .63   .64      202      .99   .99   .99    7,458      .98     7,660
    2012    .74   .69   .72      337      .99   .99   .99    8,102      .98     8,439
    2013    .80   .74   .77      456      .99   .99   .99    8,831      .98     9,287
    2014    .74   .73   .74      489      .99   .99   .99    9,271      .97     9,760
    2015    .78   .79   .78      666      .99   .98   .99    9,847      .97    10,513
    2016    .82   .84   .83    1,057      .98   .98   .98   10,287      .97    11,344
    2017    .83   .89   .85    1,567      .98   .97   .98   10,811      .96    12,378
    2018    .83   .90   .87    2,389      .98   .96   .97   11,650      .95    14,039
    2019    .87   .89   .88    3,436      .97   .96   .97   12,148      .95    15,584
    All     .83   .85   .84   10,737      .98   .98   .98   95,296      .97   106,033