Featured Research

Digital Libraries

Dot-Science Top Level Domain: academic websites or dumpsites?

Dot-science was launched in 2015 as a new academic top-level domain (TLD) intended to provide 'a dedicated, easily accessible location for global Internet users with an interest in science'. The main objective of this work is to determine the general scholarly usage of this top-level domain. In particular, three questions are pursued: usage (number of web domains registered under dot-science), purpose (main function and category of the websites linked to these web domains), and impact (the websites' visibility and authority). To this end, 13,900 domain names were gathered through ICANN's Domain Name Registration Data Lookup database. Each web domain was subsequently categorized, and data on web impact were obtained from Majestic's API. The results indicate that the dot-science top-level domain is scarcely adopted by the academic community and is mainly used by registrar companies for reselling purposes (35.5% of all web domains were parked). The websites receiving the highest number of backlinks were generally non-academic websites applying intensive link-building practices and offering leisure or even fraudulent content. Majestic's Trust Flow metric proved to be an effective method for filtering reputable academic websites. As regards primary academic-related dot-science web domain categories, 1,175 web domains (8.5% of all those registered) were found, mainly personal academic websites (342 web domains), blogs (261) and research groups (133). All the dubious content reveals bad practices on the Web, where the tag 'science' is fundamentally used as a mechanism to deceive search engine algorithms.

Digital Libraries

Dr. Strangelove or: how I learned to stop worrying and love the citations

Citations are becoming more and more important in a researcher's career. But how can they be used in the best possible way? This is a satirical paper showing a bad trend currently emerging in citation practices, due to the intensive use of citation metrics. I am putting this on the arXiv and on ResearchGate. Should you be interested in publishing this paper in a journal of which you are editor, let me know.

Digital Libraries

EXmatcher: Combining Features Based on Reference Strings and Segments to Enhance Citation Matching

Citation matching is a challenging task because of problems such as the variety of citation styles, mistakes in reference strings and the quality of identified reference segments. The classic citation matching configuration used in this paper combines a blocking technique with a binary classifier. Three possible inputs (reference strings, reference segments, and a combination of the two) were tested to find the most efficient strategy for citation matching. In the classification step, we describe the effect that the probabilities of reference segments can have on citation matching. Our evaluation on a manually curated gold standard showed that input data combining reference segments and reference strings leads to the best result. In addition, using the segmentation probabilities slightly improves the result.
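
The blocking-plus-classifier configuration described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the blocking key (year plus first-author surname), the two similarity features, the averaging rule with a 0.8 threshold standing in for a trained binary classifier, and all record fields and sample records are assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
import re


def blocking_key(record):
    # Hypothetical blocking key: publication year plus first-author surname.
    return (record["year"], record["surname"].lower())


def build_blocks(candidates):
    # Group candidate records so a reference is compared only within its block.
    blocks = defaultdict(list)
    for candidate in candidates:
        blocks[blocking_key(candidate)].append(candidate)
    return blocks


def similarity(a, b):
    # Character-level similarity in [0, 1] after lowercasing and dropping punctuation.
    def clean(text):
        return " ".join(re.sub(r"[^a-z0-9 ]+", " ", text.lower()).split())
    return SequenceMatcher(None, clean(a), clean(b)).ratio()


def is_match(reference, candidate, threshold=0.8):
    # Stand-in for a trained binary classifier: average one string-based
    # feature and one segment-based feature, then apply an assumed threshold.
    string_sim = similarity(reference["raw"], candidate["raw"])
    title_sim = similarity(reference["title"], candidate["title"])
    return (string_sim + title_sim) / 2 >= threshold


def match(reference, blocks):
    # Classify only the candidates that share the reference's blocking key.
    candidates = blocks.get(blocking_key(reference), [])
    return [c["id"] for c in candidates if is_match(reference, c)]


# Invented toy records, for illustration only.
CANDIDATES = [
    {"id": "c1", "year": 2020, "surname": "Smith",
     "title": "Citation matching with blocking",
     "raw": "Smith, J. (2020). Citation matching with blocking. Journal of Examples."},
    {"id": "c2", "year": 2020, "surname": "Smith",
     "title": "Deep learning for image segmentation",
     "raw": "Smith, A. (2020). Deep learning for image segmentation. Vision Letters."},
    {"id": "c3", "year": 2019, "surname": "Jones",
     "title": "Citation matching with blocking",
     "raw": "Jones, K. (2019). Citation matching with blocking. Journal of Examples."},
]
REFERENCE = {
    "year": 2020, "surname": "Smith",
    "title": "Citation matching with blocking",
    "raw": "J. Smith. Citation matching with blocking. J. of Examples, 2020",
}

matched_ids = match(REFERENCE, build_blocks(CANDIDATES))
```

Blocking keeps the number of pairwise comparisons small: here the reference is scored only against the two records in the (2020, "smith") block, and the record from 2019 is never compared. In a real system the threshold rule would be replaced by a classifier trained on labeled pairs.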

Digital Libraries

Early Indicators of Scientific Impact: Predicting Citations with Altmetrics

Identifying important scholarly literature at an early stage is vital to the academic research community and other stakeholders such as technology companies and government bodies. Due to the sheer amount of research published and the growth of ever-changing interdisciplinary areas, researchers need an efficient way to identify important scholarly work. The number of citations a given research publication has accrued has been used for this purpose, but these take time to occur and longer to accumulate. In this article, we use altmetrics to predict the short-term and long-term citations that a scholarly publication could receive. We build various classification and regression models and evaluate their performance, finding neural networks and ensemble models to perform best for these tasks. We also find that Mendeley readership is the most important factor in predicting the early citations, followed by other factors such as the academic status of the readers (e.g., student, postdoc, professor), followers on Twitter, online post length, author count, and the number of mentions on Twitter, Wikipedia, and across different countries.

Digital Libraries

Early insights into the Arabic Citation Index

The Arabic Citation Index (ARCI) was launched in 2020. This study gives an overview of the scientific literature available in this new database. Using metadata available in scientific publications, I analyse ARCI to characterize the scientific literature published in Arabic. First, I describe the data and the methods used in the analyses. As of October 2020, ARCI indexed 65,208 records covering the 2015-2019 period. Second, I explore the literature distributions at various levels (research domains, countries, languages, open access). Close to 99% of the documents indexed are articles. Results reveal a concentration of publications in the Arts & Humanities and Social Sciences fields. Most journals indexed in ARCI are currently published in Egypt, Algeria, Iraq, Jordan and Saudi Arabia. Around 7% of publications in ARCI are published in languages other than Arabic. Then, I use an unsupervised machine learning model, LDA (Latent Dirichlet Allocation), and the text mining algorithms of VOSviewer to uncover the main topics in ARCI. These methods are particularly useful for understanding the topical structure of ARCI. Finally, I suggest a few research opportunities after discussing the results of this study.

Digital Libraries

Economic Power, Population, and the Size of Astronomical Community

The number of astronomers registered with the IAU for a country is known to correlate with its GDP. However, the robustness of this relationship can be doubted, because the fraction of astronomers joining the IAU differs from country to country. Here we revisit this correlation using data updated as of 2017, and find a similar correlation when using the total count of astronomers and astrophysicists with PhD degrees working in each country instead of the number of IAU members. We confirm the existence of two subgroups in the correlation. One group consists of advanced European countries with a long history of modern astronomy, while the other consists of countries that have experienced recent rapid economic development. To look for causation in the correlation, we obtain the long-term variations of the number of astronomers, population, and GDP for a number of countries, and find that the number of astronomers per citizen in recently developing countries has increased more rapidly with GDP per capita than in fully developed countries. We also collect demographic data on the Korean astronomical community. From these findings we estimate the proper size of the Korean astronomical community by considering the society's economic power and population. The current number of PhD astronomers working in Korea is approximately 310, but it should be 550, which is large enough to be comparable to and competitive with the sizes of the Spanish, Canadian, and Japanese astronomical communities. We discuss how to overcome the vulnerability of the Korean astronomical community, based on statistics of the national R&D expenditure structure compared with those of other major advanced countries.

Digital Libraries

Effect of forename string on author name disambiguation

In author name disambiguation, author forenames are used to decide which name instances are disambiguated together and how likely they are to refer to the same author. Despite this crucial role, the effect of forenames on the performance of heuristic (string matching) and algorithmic disambiguation is not well understood. This study assesses the contribution of forenames to author name disambiguation using multiple labeled datasets under varying ratios and lengths of full forenames, reflecting real-world scenarios in which an author is represented by forename variants (synonyms) and some authors share the same forenames (homonyms). Results show that increasing the ratio of full forenames substantially improves the performance of both heuristic and machine-learning-based disambiguation. Performance gains from algorithmic disambiguation are pronounced when many forenames are initialized or homonyms are prevalent. As the ratio of full forenames increases, however, these gains become marginal compared to the performance of string matching. Using only a small portion of each forename string does not much reduce the performance of either heuristic or algorithmic disambiguation compared to using full-length strings. These findings provide practical suggestions, such as restoring initialized forenames to a full-string format via record linkage, for improved disambiguation performance.
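
A heuristic (string matching) rule of the kind this abstract refers to can be sketched as follows. This is an assumed, simplified rule rather than the one evaluated in the study: two name instances are treated as the same author when their surnames are equal and one forename is a prefix of the other. The function names, the optional `length` truncation parameter, and the sample names are all hypothetical.

```python
def forenames_compatible(a, b, length=None):
    # Normalise: lowercase and drop the trailing period of an initial.
    a, b = a.lower().rstrip("."), b.lower().rstrip(".")
    # Optionally keep only the first `length` characters of each forename.
    if length is not None:
        a, b = a[:length], b[:length]
    # An initial or prefix is compatible with any forename that starts with it.
    return a.startswith(b) or b.startswith(a)


def same_author(name_a, name_b, length=None):
    # Each name is a (surname, forename) pair; surnames must match exactly.
    (surname_a, forename_a), (surname_b, forename_b) = name_a, name_b
    return (surname_a.lower() == surname_b.lower()
            and forenames_compatible(forename_a, forename_b, length))


# Invented names, for illustration only.
full_vs_full = same_author(("Lee", "Minho"), ("Lee", "Minji"))
initial_vs_first = same_author(("Lee", "M."), ("Lee", "Minho"))
initial_vs_second = same_author(("Lee", "M."), ("Lee", "Minji"))
truncated_to_3 = same_author(("Lee", "Minho"), ("Lee", "Minji"), length=3)
truncated_to_4 = same_author(("Lee", "Minho"), ("Lee", "Minji"), length=4)
```

With full forenames the two invented authors stay apart, while the initialized instance "M." is compatible with both, which is how initialization can merge distinct authors. Truncating to three characters merges them as well, whereas four characters are already enough to separate this particular pair.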

Digital Libraries

Examining Citations of Natural Language Processing Literature

We extracted information from the ACL Anthology (AA) and Google Scholar (GS) to examine trends in citations of NLP papers. We explore questions such as: how well cited are papers of different types (journal articles, conference papers, demo papers, etc.)? How well cited are papers from different areas within NLP? Notably, we show that only about 56% of the papers in AA are cited ten or more times. CL Journal has the most cited papers, but its citation dominance has lessened in recent years. On average, long papers get almost three times as many citations as short papers, and papers on sentiment classification, anaphora resolution, and entity recognition have the highest median citations. The analyses presented here, and the associated dataset of NLP papers mapped to citations, have a number of uses, including understanding how the field is growing and quantifying the impact of different types of papers.

Digital Libraries

Exploring WorldCat Identities as an altmetric information source: A library catalog analysis experiment in the field of Scientometrics

Assessing the impact of scholarly books is a difficult research evaluation problem. Library Catalog Analysis facilitates the quantitative study, at different levels, of the impact and diffusion of academic books based on data about their availability in libraries. The WorldCat global catalog collates data on library holdings, offering a range of tools including the novel WorldCat Identities. This is based on author profiles and provides indicators relating to the availability of their books in library catalogs. Here, we investigate this new tool to identify its strengths and weaknesses based on a sample of Bibliometrics and Scientometrics authors. We review the problems that this entails and compare Library Catalog Analysis indicators with Google Scholar and Web of Science citations. The results show that WorldCat Identities can be a useful tool for book impact assessment but the value of its data is undermined by the provision of massive collections of ebooks to academic libraries.

Digital Libraries

Exploring the Effects of Data Set Choice on Measuring International Research Collaboration: an Example Using the ACM Digital Library and Microsoft Academic Graph

International research collaboration (IRC) measurement is important because countries can and want to benefit from international collaboration, but performing the same measurement procedure on different data sets can lead to different results. This study aims to explore the effects of data set choice on IRC measurement.
