Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017)
Muthu Kumar Chandrasekaran
School of Computing, National University of Singapore, [email protected]
Kokil Jaidka
School of Arts & Sciences, University of Pennsylvania, [email protected]
Philipp Mayr
GESIS – Leibniz Institute for the Social Sciences, [email protected]
ABSTRACT
The large scale of scholarly publications poses a challenge for scholars in information seeking and sensemaking. Bibliometrics, information retrieval (IR), text mining and NLP techniques could help in these search and look-up activities, but are not yet widely used. This workshop is intended to stimulate IR researchers and digital library professionals to elaborate on new approaches in natural language processing, information retrieval, scientometrics, text mining and recommendation techniques that can advance the state-of-the-art in scholarly document understanding, analysis, and retrieval at scale. The BIRNDL workshop at SIGIR 2017 will incorporate an invited talk, paper sessions and the third edition of the Computational Linguistics (CL) Scientific Summarization Shared Task.
CCS CONCEPTS
• Information systems → Information retrieval; Link and co-citation analysis; • Applied computing → Digital libraries and archives;
KEYWORDS
Scientometrics; Information Retrieval; Digital Libraries; NLP; Summarization; Information Extraction; Citation analysis
ACM Reference format:
Muthu Kumar Chandrasekaran, Kokil Jaidka, and Philipp Mayr. 2017. Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2017). In Proceedings of SIGIR’17, August 7–11, 2017, Shinjuku, Tokyo, Japan.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SIGIR’17, August 7–11, 2017, Shinjuku, Tokyo, Japan. © 2017 Copyright held by the owner/author(s). ACM ISBN 978-1-4503-5022-8/17/08. https://doi.org/10.1145/3077136.3084370
Over the past several years, the BIRNDL workshop and its parent workshops have been establishing themselves as the primary interdisciplinary venue for the cross-pollination of bibliometrics and information retrieval (IR) [5]. Our motivation as organizers of the workshop started from the observation that the two communities overlap only partially; yet the main discourse in both fields consists of different approaches to solving similar problems. We believe that a knowledge transfer would be profitable for both sides. A good overview of the symbiotic relationship that exists among bibliometrics, IR and natural language processing (NLP) has been presented by Wolfram [6]. A report of the past BIRNDL workshop has been published recently in The SIGIR Forum [1].
The goal of the BIRNDL workshop at SIGIR is to engage the IR community about the open problems in academic search. Academic search refers to the large, cross-domain digital repositories which index research papers, such as the ACL Anthology, arXiv, ACM Digital Library, IEEE database, Web of Science and Google Scholar. Currently, digital libraries collect and allow access to papers and their metadata (including citations) but mostly do not analyze the items they index. The scale of scholarly publications poses a challenge for scholars in their search for relevant literature. Finding relevant scholarly literature is the key theme of BIRNDL and sets the agenda for tools and approaches to be discussed and evaluated at the workshop.
Papers at the 2nd BIRNDL workshop will incorporate insights from IR, bibliometrics and NLP to develop new techniques to address open problems such as evidence-based searching, measurement of research quality, relevance and impact, the emergence and decline of research problems, identification of scholarly relationships and influences, and applied problems such as language translation, question-answering and summarization. We will also address the need for established, standardized baselines, evaluation metrics and test collections.
Towards the purpose of evaluating tools and technologies developed for digital libraries, we are organizing the 3rd CL-SciSumm Shared Task based on the CL-SciSumm corpus, which comprises over 500 computational linguistics (CL) research papers, interlinked through a citation network.
The organizers of the 2nd BIRNDL workshop at SIGIR 2017 have previously organized other workshop series at premier IR and CS venues, notably the Bibliometric-enhanced Information Retrieval (BIR) workshops in 2014, 2015 and 2016 at ECIR [4] and the NLPIR4DL workshop at ACL-IJCNLP (2009). Most recently, the BIRNDL workshop and the 2nd CL-SciSumm Shared Task were co-located with JCDL 2016 [1], where 10 research papers and 10 system papers were presented (acceptance rate: 30%). In 2017, the BIRNDL workshop takes this legacy forward with a focus on scholarly publications and data, and an updated scientific summarization Shared Task for its participants.
This workshop will be relevant to scholars in computer and information science, specializing in IR and NLP. It will also be of importance to all stakeholders in the publication pipeline: practitioners, publishers and policymakers. Today’s publishers continue to provide new ways to support their consumers in disseminating and retrieving the right published works to their audience. Formal citation metrics are increasingly a factor in decision-making by universities and funding bodies worldwide, making the need for research in applying these metrics more pressing.
http://wing.comp.nus.edu.sg/birndl-sigir2017/
http://wing.comp.nus.edu.sg/birndl-jcdl2016/
http://ceur-ws.org/Vol-1610/
Our goal is to encourage insights from IR, NLP and CL for scholarly document understanding, document analysis and retrieval in digital libraries.
The papers presented at the workshop will touch upon several topics, including (but not limited to) full-text analysis, multimedia and multilingual analysis and alignment, as well as the application of citation-based NLP, information retrieval and information seeking techniques in digital libraries. More specifically, our fields of interest include:
• Infrastructures for scientific text mining and IR
• Semantic and Network-based indexing, navigation, searching and browsing in structured data
• Discourse structure identification and argument mining from scientific papers
• Summarization and question-answering for scholarly DLs
• Recommendation for scholarly papers, reviewers, citations and publication venues
• Measurement and evaluation of quality and impact
• Metadata and controlled vocabularies for resource description and discovery; automatic metadata discovery, such as language identification
• Disambiguation issues in scholarly DLs using NLP or IR techniques; data cleaning and data quality.
The workshop will start with a keynote titled “Do ‘Future Work’ sections have a real purpose? Citation links and entailment for global scientometric questions” by Dr. Simone Teufel (University of Cambridge). This session will be followed by regular research paper presentations, overview papers and posters on the Shared Task.
The 3rd Computational Linguistics (CL) Scientific Summarization Shared Task, sponsored by Microsoft Research Asia, will be conducted as a part of this workshop. This is the first medium-scale shared task on scientific document summarization in the CL domain. It follows up on and extends the successful CL Shared Tasks conducted as a part of BIRNDL 2016 [1], and within the BiomedSumm Track at the Text Analysis Conference 2014 (TAC 2014) [2]. In the CL-SciSumm 2016 [3] Shared Task, fifteen teams from six countries signed up, and ten teams ultimately submitted and presented their results.
The Shared Task comprises three sub-tasks in automatic research paper summarization on a new corpus of research papers, as described below.
Given: A topic consisting of a Reference Paper (RP) and up to ten Citing Papers (CPs) that all contain citations to the RP. Citations in the CPs are pre-identified as the text spans (i.e., citances) that cite the RP.
Task 1a: For each citance, identify the spans of text (cited text spans) in the RP that most accurately reflect the citance.
Task 1b: For each cited text span, identify what facet of the paper it belongs to, from a predefined set of facets.
Task 2 (optional bonus task): Finally, generate a structured summary of the RP from the cited text spans of the RP. The length of the summary should not exceed 250 words.
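To make the sub-tasks concrete, the following is a minimal Python sketch of a naive baseline for Task 1a and Task 2. It is not the organizers' system or a participant's method: the `Citance` record, the function names, the Jaccard word-overlap similarity and the `top_k` cut-off are all assumptions made for illustration. Task 1b is omitted because the facet set is not listed here.

```python
from __future__ import annotations

import re
from dataclasses import dataclass


@dataclass
class Citance:
    """Hypothetical record for one pre-identified citing span in a CP."""
    citing_paper_id: str
    text: str


def tokens(text: str) -> set[str]:
    """Lowercased word types; a deliberately crude tokenizer."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cited_text_spans(citance: Citance, rp_sentences: list[str], top_k: int = 2) -> list[int]:
    """Task 1a baseline (assumed): indices of the RP sentences most similar to the citance."""
    query = tokens(citance.text)
    ranked = sorted(
        range(len(rp_sentences)),
        key=lambda i: jaccard(query, tokens(rp_sentences[i])),
        reverse=True,
    )
    return sorted(ranked[:top_k])


def summarize(rp_sentences: list[str], span_ids: list[int], max_words: int = 250) -> str:
    """Task 2 baseline (assumed): cited sentences in document order, within the word limit."""
    summary: list[str] = []
    used = 0
    for i in sorted(set(span_ids)):
        n_words = len(rp_sentences[i].split())
        if used + n_words > max_words:
            break  # stop before exceeding the 250-word cap
        summary.append(rp_sentences[i])
        used += n_words
    return " ".join(summary)
```

A system would call `cited_text_spans` once per citance, pool the returned sentence indices over all CPs of a topic, and pass them to `summarize`. Word overlap is only a stand-in here; any sentence similarity function could be substituted.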
Evaluation: Task 1 will be scored by the overlap of text spans, measured by the number of sentences in the system output vs. the gold standard. Task 2 will be scored using the ROUGE family of metrics between the system output and i) human summaries, ii) community summaries comprising the cited text spans, and iii) the Abstract section of the reference paper.
This task continues to be of interest to a broad community including those working in CL and NLP, especially in the sub-disciplines of text summarization, discourse structure in scholarly discourse, paraphrase, textual entailment and text simplification.
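The scoring for Task 1 and Task 2 can be illustrated with a short Python sketch. The exact formulas of the official evaluation are not given here, so this assumes precision, recall and F1 over sentence IDs for Task 1, and uses a simplified unigram recall (ROUGE-1 style, with clipped counts) as a stand-in for the ROUGE family. The function names are hypothetical, and this is not the official scorer or the ROUGE toolkit.

```python
from __future__ import annotations

import re
from collections import Counter


def sentence_overlap(system_ids: set[int], gold_ids: set[int]) -> tuple[float, float, float]:
    """Assumed Task 1 scoring: precision, recall and F1 over sentence IDs."""
    overlap = len(system_ids & gold_ids)
    precision = overlap / len(system_ids) if system_ids else 0.0
    recall = overlap / len(gold_ids) if gold_ids else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1


def rouge_1_recall(system_summary: str, reference_summary: str) -> float:
    """Simplified ROUGE-1 recall: share of reference unigrams matched by the system.

    Counts are clipped, so a word repeated in the reference is matched at most
    as many times as it occurs in the system summary.
    """
    sys_counts = Counter(re.findall(r"\w+", system_summary.lower()))
    ref_counts = Counter(re.findall(r"\w+", reference_summary.lower()))
    matched = sum(min(count, sys_counts[word]) for word, count in ref_counts.items())
    total = sum(ref_counts.values())
    return matched / total if total else 0.0
```

For Task 2, the same function would be applied three times per system summary, once against each reference type (human summary, community summary, abstract).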
This workshop is the first step to foster a reflection on interdisciplinarity, and the benefits that the disciplines Bibliometrics, IR and NLP can derive from it in the Digital Libraries context. The authors of accepted papers will be invited to submit extended versions of their work to the International Journal on Digital Libraries (IJDL). As an output of BIRNDL 2016, a special issue of IJDL on “Bibliometrics, Information Retrieval and Natural Language Processing in Digital Libraries” is currently in preparation. In the future, we plan to continue to host this series of workshops and Shared Tasks at prominent IR, NLP and Digital Library venues.
ACKNOWLEDGMENTS
We thank Microsoft Research Asia for their generous support in funding the development, dissemination and organization of the CL-SciSumm dataset and the Shared Task. We are also grateful to the co-organizers of the 1st BIRNDL workshop, Guillaume Cabanac, Ingo Frommholz, Min-Yen Kan and Dietmar Wolfram, for their continued support and involvement.
REFERENCES
[1] Guillaume Cabanac, Muthu Kumar Chandrasekaran, Ingo Frommholz, Kokil Jaidka, Min-Yen Kan, Philipp Mayr, and Dietmar Wolfram. 2016. Report on the Joint Workshop on Bibliometric-enhanced Information Retrieval and Natural Language Processing for Digital Libraries (BIRNDL 2016). SIGIR Forum 50, 2 (2016), 36–43. http://sigir.org/wp-content/uploads/2017/01/p036.pdf
[2] Kokil Jaidka, Muthu Kumar Chandrasekaran, Beatriz Fisas Elizalde, Rahul Jha, Christopher Jones, Min-Yen Kan, Ankur Khanna, Diego Molla-Aliod, Dragomir R Radev, Francesco Ronzano, et al. 2014. The computational linguistics summarization pilot task. In Proceedings of Text Analysis Conference. Gaithersburg, USA.
[3] Kokil Jaidka, Muthu Kumar Chandrasekaran, Sajal Rustagi, and Min-Yen Kan. 2016. Overview of the CL-SciSumm 2016 Shared Task. In Proceedings of Joint Workshop on Bibliometric-enhanced Information Retrieval and NLP for Digital Libraries (BIRNDL 2016).
[4] Philipp Mayr, Ingo Frommholz, and Guillaume Cabanac. 2016. Report on the 3rd International Workshop on Bibliometric-enhanced Information Retrieval (BIR 2016). SIGIR Forum 50, 1 (2016), 28–34. http://sigir.org/files/forum/2016J/p028.pdf
[5] Philipp Mayr and Andrea Scharnhorst. 2015. Scientometrics and Information Retrieval - weak-links revitalized. Scientometrics