Featured Research

Digital Libraries

Improving Academic Plagiarism Detection for STEM Documents by Analyzing Mathematical Content and Citations

Identifying academic plagiarism is a pressing task for educational and research institutions, publishers, and funding agencies. Current plagiarism detection systems reliably find instances of copied and moderately reworded text. However, reliably detecting concealed plagiarism, such as strong paraphrases, translations, and the reuse of nontextual content and ideas, is an open research problem. In this paper, we extend our prior research on analyzing mathematical content and academic citations. Both are promising approaches for improving the detection of concealed academic plagiarism, primarily in Science, Technology, Engineering and Mathematics (STEM). We make the following contributions: i) We present a two-stage detection process that combines similarity assessments of mathematical content, academic citations, and text. ii) We introduce new similarity measures that consider the order of mathematical features and outperform the measures in our prior research. iii) We compare the effectiveness of the math-based, citation-based, and text-based detection approaches using confirmed cases of academic plagiarism. iv) We demonstrate that the combined analysis of math-based and citation-based content features allows identifying potentially suspicious cases in a collection of 102K STEM documents. Overall, we show that analyzing the similarity of mathematical content and academic citations is a striking supplement to conventional text-based detection approaches for academic literature in the STEM disciplines.
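To illustrate what an order-sensitive similarity measure over mathematical features can look like, here is a minimal Python sketch. It is not the measure introduced in the paper: it assumes each document has already been reduced to a sequence of identifier strings, and it uses a normalized longest common subsequence (LCS) as a stand-in. The function names `lcs_length` and `order_aware_similarity` are hypothetical.

```python
from typing import Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence of two feature sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def order_aware_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """Illustrative score in [0, 1]: LCS length over the longer sequence.

    Assumption: this normalization is chosen for the sketch only.
    """
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


if __name__ == "__main__":
    doc_a = ["x", "y", "z"]
    doc_b = ["z", "y", "x"]  # same identifiers, reversed order
    print(order_aware_similarity(doc_a, doc_a))
    print(order_aware_similarity(doc_a, doc_b))
```

A purely set-based comparison would treat `doc_a` and `doc_b` as identical, whereas the order-aware score drops when the same identifiers appear in a different order.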


Improving Scholarly Knowledge Representation: Evaluating BERT-based Models for Scientific Relation Classification

With the rapid growth of research publications, there is a vast amount of scholarly knowledge that needs to be organized in digital libraries. To deal with this challenge, techniques relying on knowledge-graph structures are being advocated. Within such graph-based pipelines, inferring relation types between related scientific concepts is a crucial step. Recently, advanced techniques relying on language models pre-trained on large corpora have been widely explored for automatic relation classification. Despite the remarkable contributions that have been made, many of these methods were evaluated under different scenarios, which limits their comparability. To this end, we present a thorough empirical evaluation of eight BERT-based classification models, focusing on two key factors: 1) BERT model variants, and 2) classification strategies. Experiments on three corpora show that a domain-specific pre-training corpus helps BERT-based classification models identify the type of scientific relations. Although the strategy of predicting a single relation at a time generally achieves higher classification accuracy than the strategy of identifying multiple relation types simultaneously, the latter strategy performs more consistently across corpora with either large or small numbers of annotations. Our study aims to offer recommendations to the stakeholders of digital libraries for selecting the appropriate technique to build knowledge-graph-based systems for enhanced scholarly information organization.


Improving Text Relationship Modeling with Artificial Data

Data augmentation uses artificially created examples to support supervised machine learning, adding robustness to the resulting models and helping to account for the limited availability of labelled data. We apply and evaluate a synthetic data approach to relationship classification in digital libraries, generating artificial books with relationships that are common in digital libraries but not easily inferred from existing metadata. We find that for classification of whole-part relationships between books, synthetic data improves a deep neural network classifier by 91%. Further, we consider whether a useful new text relationship class can be learned from fully artificial training data.


In Search of Outstanding Research Advances: Prototyping the creation of an open dataset of "editorial highlights"

A long-standing research question in bibliometrics is how to identify publications that represent major advances in their fields and have a high impact in those and other areas. In this context, the term "Breakthrough" is often used, and commonly used approaches rely on citation links between publications, implicitly positing that peers who use or build upon previously published results collectively inform about their standing in terms of advancing the research frontiers. Here we argue that the "Breakthrough" concept is rooted in the Kuhnian model of scientific revolution, which has been both conceptually and empirically challenged. A more fruitful approach is to consider the various ways in which authoritative actors in the scholarly communication system signal the importance of research results. We bring different "recognition channels" into the discussion and pilot the creation of an open dataset of editorial highlights from regular lists of notable research advances. The dataset covers the last ten years and includes the "discoveries of the year" from Science magazine and La Recherche and weekly editorial highlights from Nature ("research highlights") and Science ("editor's choice"). The final dataset includes 230 entries in the "discoveries of the year" (with over 720 references) and about 9,000 weekly highlights (with over 8,000 references).


India's rank and global share in scientific research -- how data sourced from different databases can produce varying outcomes

India is emerging as a major knowledge producer in terms of its proportionate share of global research output and its overall research productivity rank. Many recent reports, both from studies commissioned by the Government of India and from independent international agencies, show India at different ranks of global research productivity (with variations as large as from 3rd to 9th place). This paper examines this contradiction and analyses why different reports place India at different ranks and what the reasons may be. For this purpose, the research output data for India, along with the ten most productive countries in the world, is analysed from three major scholarly databases: Web of Science, Scopus and Dimensions. Results show that both endogenous factors (such as database coverage variation and different subject classification schemes) and exogenous factors (such as subject selection and publication counting methodology) cause the variations in different reports. This paper reports the first part of the analysis, focusing mainly on variations due to the use of data from different databases. The policy implications of the study are also discussed.


India's rank and global share in scientific research -- how publication counting method and subject selection can vary the outcomes

During the last two decades, India has emerged as a major knowledge producer in the world; however, different reports put it at different ranks, varying from 3rd to 9th place. The recent study reports commissioned by the Department of Science and Technology (DST) and carried out by Elsevier and Clarivate Analytics rank India at 5th and 9th place, respectively. On the other hand, an independent report by the National Science Foundation (NSF) of the United States (US) ranks India at 3rd place on research output in the Science and Engineering area. Interestingly, both the Elsevier and the NSF reports use Scopus data, and yet their outcomes are different. This article therefore investigates how the use of the same database can still produce different outcomes because of differences in methodological approaches. The publication counting method used and the subject selection approach are the two main exogenous factors identified as causing these variations. The implications of the analytical outcomes are discussed with a special focus on policy perspectives.
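The following Python sketch shows how a counting method alone can change a ranking computed from the same records. The abstract does not name the methods compared, so the sketch assumes two common bibliometric conventions: whole counting (each country on a paper gets a full credit) and fractional counting (one credit is split equally among the countries on a paper). The toy records and the function names `count_publications` and `rank` are hypothetical and not taken from the article.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def count_publications(papers: Iterable[List[str]], fractional: bool = False) -> Dict[str, float]:
    """Credit countries for papers, given each paper's list of countries.

    Whole counting gives every listed country 1.0 per paper.
    Fractional counting (assumed here as an equal split) gives 1/n each.
    """
    totals: Dict[str, float] = defaultdict(float)
    for countries in papers:
        unique = set(countries)
        if not unique:
            continue
        weight = 1.0 / len(unique) if fractional else 1.0
        for country in unique:
            totals[country] += weight
    return dict(totals)


def rank(totals: Dict[str, float]) -> List[str]:
    """Countries ordered by descending credit, ties broken alphabetically."""
    return sorted(totals, key=lambda c: (-totals[c], c))


if __name__ == "__main__":
    # Toy data: three internationally co-authored papers and two single-country papers.
    papers = [["IN", "US"], ["IN", "US"], ["IN", "US"], ["DE"], ["DE"]]
    print(rank(count_publications(papers)))
    print(rank(count_publications(papers, fractional=True)))
```

In this toy data, the countries with many co-authored papers lead under whole counting, while the country with only single-country papers leads under fractional counting.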


Indicators of Open Access for universities

This paper presents a first attempt to analyse Open Access integration at the institutional level. For this, we combine information from Unpaywall and the Leiden Ranking to offer basic OA indicators for universities. We calculate the overall number of Open Access publications for 930 universities worldwide. OA indicators are also disaggregated by green, gold and hybrid Open Access. We then explore differences between and within countries and offer a general ranking of universities based on the proportion of their output which is openly accessible.
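As a rough illustration of the kind of indicator described, the Python sketch below computes an Open Access share per university and ranks universities by it. It is a sketch under stated assumptions, not the paper's pipeline: the `UniversityOutput` record, its field names, and the sample numbers are hypothetical, and each OA publication is assumed to be assigned exactly one type (green, gold or hybrid) so that the types can be summed.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UniversityOutput:
    name: str
    total: int   # all publications of the university
    green: int   # assumed disjoint OA categories
    gold: int
    hybrid: int

    def oa_total(self) -> int:
        """Number of OA publications, assuming the three types do not overlap."""
        return self.green + self.gold + self.hybrid

    def oa_share(self) -> float:
        """Proportion of the university's output that is openly accessible."""
        return self.oa_total() / self.total if self.total else 0.0


def rank_by_oa_share(universities: List[UniversityOutput]) -> List[UniversityOutput]:
    """Order universities by descending OA share, ties broken by name."""
    return sorted(universities, key=lambda u: (-u.oa_share(), u.name))


if __name__ == "__main__":
    sample = [
        UniversityOutput("University A", total=100, green=10, gold=20, hybrid=5),
        UniversityOutput("University B", total=50, green=20, gold=5, hybrid=5),
    ]
    for u in rank_by_oa_share(sample):
        print(f"{u.name}: {u.oa_share():.2f}")
```

Ranking by share rather than by the absolute number of OA publications keeps smaller universities comparable with larger ones.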


Infrastructure-Agnostic Hypertext

This paper presents a novel and formal interpretation of the original vision of hypertext: infrastructure-agnostic hypertext is independent from specific standards such as data formats and network protocols. Its model is illustrated with examples and references to existing technologies that allow for implementation and integration in current information infrastructures such as the Internet.


Innovation and Revenue: Deep Diving into the Temporal Rank-shifts of Fortune 500 Companies

Research and innovation are an important agenda item for any company that wants to remain competitive in the market. The relationship between innovation and revenue is a key metric for companies when deciding how much to invest in future research. Two important parameters for evaluating innovation are the quantity and quality of scientific papers and patents. Our work studies the relationship between innovation and patenting activities for several Fortune 500 companies over a period of time. We perform a comprehensive study of the patent citation dataset available in the Reed Technology Index collected from the US Patent Office. We observe several interesting relations between parameters such as the number of (i) patent applications, (ii) patent grants, and (iii) patent citations and the Fortune 500 ranks of companies. We also study how these parameters vary over the years and derive causal explanations for these trends with qualitative and intuitive reasoning. To facilitate reproducible research, we make the entire processed patent dataset publicly available at this https URL.


Interdisciplinary Relationships Between Biological and Physical Sciences

Several interdisciplinary areas have appeared at the interface between biological and physical sciences. In this work, we suggest a complex network-based methodology for analyzing the interrelationships between some of these interdisciplinary areas, including Bioinformatics, Computational Biology, and Biochemistry, among others. This approach has been applied to data derived from Wikipedia. Related reviews from the scientific literature are also considered as a reference, yielding a bipartite hypergraph which can be used to gain insights about the interrelationships underlying the considered interdisciplinary areas. Several interesting results are obtained, including a greater interconnection of the considered interdisciplinary areas with biological than with physical sciences. Good agreement was also found between the network obtained from Wikipedia and the interrelationships revealed by the literature reviews. At the same time, the former network was found to exhibit more intricate relationships than the hypergraph derived from the literature reviews.
