Crowdsourcing open citations with CROCI -- An analysis of the current status of open citations, and a proposal
Ivan Heibi, Silvio Peroni and David Shotton
{ivan.heibi2, silvio.peroni}@unibo.it
Digital Humanities Advanced Research Centre (DHARC), Department of Classical Philology and Italian Studies, University of Bologna, via Zamboni 32, 40126 Bologna (Italy)
david.shotton@oerc.ox.ac.uk
Oxford e-Research Centre, University of Oxford, 7 Keble Rd, Oxford OX1 3QG (United Kingdom)
Abstract
In this paper, we analyse the current availability of open citations data in one particular dataset, namely COCI (the OpenCitations Index of Crossref open DOI-to-DOI citations; http://opencitations.net/index/coci) provided by OpenCitations. The results of these analyses show a persistent gap in the coverage of the currently available open citation data. In order to address this specific issue, we propose a strategy whereby the community (e.g. scholars and publishers) can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.
Introduction
The availability of open scholarly citations – i.e. citation data that are structured, separate, open, identifiable and available [...]
1. What is the ratio between open citations vs. closed citations within each category of scholarly entities included in COCI (i.e. journals, books, proceedings, datasets, and others)?
2. Which are the top twenty publishers in terms of the number of open citations received by their own publications, according to the citation data available in COCI?
3. To what degree are the publishers highlighted in the previous analysis themselves contributing to the open citations movement, according to the data available in Crossref?
The results of these analyses show a persistent gap in the coverage of the currently available open citation data. To address this specific issue, we have developed a novel strategy whereby members of the community of scholars, authors, editors and publishers can directly involve themselves in crowdsourcing open citations, by uploading their citation data via the OpenCitations infrastructure into our new index, CROCI, the Crowdsourced Open Citations Index.
Methods and material
[...] citing entities, to check the participation of these top twenty publishers in terms of the number of open citations they were themselves publishing in response to the open citation movement sponsored by I4OC. Details of all these analyses are available online in CC0 (Heibi et al., 2019).
Results
First (RQ1) we determined the numbers of open citations and closed citations received by the entities in the Crossref dump. All the entity types retrieved from Crossref were aligned to one of the following five categories: journal, book, proceedings, dataset, other – the mapping between Crossref types and the five types we used in our analysis is illustrated in the description of the table “croci_types.csv” in (Heibi et al., 2019). The outcomes are summarised in Figure 1. For each of the publication categories considered, the number of open citations available in COCI is greater than the number of closed citations to these entities within the Crossref database (to which COCI does not have access), with the categories proceedings and dataset having the largest ratios. Analysis of the Crossref data shows that there are in total ~4.1 million DOIs that have received no open citations and at least one closed citation. Conversely, there are ~10.7 million DOIs that have received no closed citations and at least one open citation in COCI. Most of the papers in both these categories have received very few citations.
The outcome of the second analysis (RQ2) shows which publishers are receiving the most open citations. To this end, we considered all the open citations recorded in COCI, and compared them with the number of closed citations to these same entities recorded in Crossref. Figure 2 shows the top twenty publishers that received the greatest number of open citations. Elsevier is the first publisher according to this ranking, but it also records the highest number of closed citations received (~97M open vs. ~105.5M closed). The highest ratio in terms of open citations vs. closed citations was recorded by IEEE publications (ratio 6.25 to 1), while the lowest ratio was for the American Chemical Society (ratio 0.73 to 1).
Figure 1. The number of open citations (available in COCI) vs. closed citations (according to Crossref data) received by the cited entities within COCI, analyzed and grouped according to five distinct categories. [Note that the vertical axis has a logarithmic scale].
Figure 2. The top twenty publishers sorted in decreasing order according to the number of open citations the entities they published have received, according to the open citation data within COCI. We accompany this count with the number of closed citations to the entities published by each of them according to the values available in Crossref.
Considering the twenty publishers listed in Figure 2, we wanted additionally to know their current support for the open citation movement (RQ3). The results of this analysis (made by querying the Crossref API on 24 January 2019) are shown in Figure 3. Among the top ten publishers shown in Figure 2, i.e. those who themselves received the largest numbers of open citations, only five, namely Springer Nature, Wiley, the American Physical Society, Informa UK Limited, and Oxford University Press, are participating actively in the open publication of their own citations through Crossref. It is noteworthy that JSTOR contributes very few references to Crossref, while the many citations directed towards its own holdings place JSTOR twelfth in the list of publishers receiving open citations (Figure 2). However, as the last column of Figure 3 shows, all the major publishers listed here are failing to submit reference lists to Crossref for a large number of the publications for which they submit metadata, that number being the difference between the value in the last column for that publisher and the combined values in the preceding three columns. JSTOR is the worst in this regard, submitting references with only 0.53% of its deposits to Crossref, while the American Physical Society is the best, submitting references with 96.54% of its publications recorded in Crossref.
Additional information about these analyses, including the code and the data we have used to compute all the figures, is available as a Jupyter notebook at https://github.com/sosgang/pushing-open-citations-issi2019/blob/master/script/croci_nb.ipynb.
Figure 3. The contributions to open citations made by the twenty publishers listed in Figure 2, as of 24 January 2019, according to the data available through the Crossref API. The counts listed in the first three results columns of this table refer to the number of publications, in the categories closed, limited and open, for which each publisher has submitted metadata to Crossref that include the publication’s reference list. The percentage values given in parentheses show the percentage of publications in each category whose metadata submitted to Crossref include the reference lists, these percentages being obtained by dividing the values in each column by the total number of publications for which that publisher has submitted metadata to Crossref, shown in the fourth results column.
It should be stressed that a very large number of potentially open citations are totally missing from the Crossref database, and consequently from COCI, for the simple reason that many publishers, particularly smaller ones with limited technical and financial resources, but also all the large ones shown in Figure 3 and most of the others, are simply not depositing with Crossref the reference lists for any or all of their publications.
Discussion
According to the data retrieved, the open DOI-to-DOI citations available in COCI exceed the number of closed DOI-to-DOI citations recorded in Crossref for every publication category, as shown in Figure 1. The journal category is the one receiving the most open citations overall, as expected considering the historical and present importance of journals in most areas of the scholarly ecosystem. However, the number of closed citations to journal articles within Crossref is also of great significance, since these 322 million closed citations represent 43% of the total. It is important to note that about one third of these closed citations to journal articles (according to Figure 2) are references to entities published by Elsevier, and that references from within Elsevier’s own publications constitute the largest proportion of these closed citations, since Elsevier is the largest publisher of journal articles. Thus, Elsevier’s present refusal to open its article references is contributing significantly to the invisibility of Elsevier’s own publications within the corpus of open citation data that is being increasingly used by the scholarly community for discovery, citation network visualization and bibliometric analysis. It is also worth mentioning the discrepancy between the citations available in COCI, which come from the data contained in the open and limited Crossref datasets as of 3 October 2018, and those available within those same Crossref datasets as of 24 January 2019. The most significant difference relates to IEEE.
While the citations present in COCI include those from IEEE publications to other entities prior to November 2018 (since in October 2018 its article metadata with references were present within the Crossref limited dataset), in November 2018 this scholarly society decided to close the main part of its Crossref references, and thus from that moment they became unavailable to Crossref Metadata Plus members such as OpenCitations, as highlighted in Figure 3. Thus, IEEE citations from articles whose metadata were submitted to Crossref after the date of this switch to closed can no longer be automatically ingested into COCI. To date, the majority of the citations present in Crossref that are not available in COCI come from just three publishers: Elsevier, the American Chemical Society and University of Chicago Press (Figure 3). In fact, considering the average value of 18.6 DOI-to-DOI citation links for each citing entity – calculated by dividing the total number of citations in COCI by the number of citing entities in the same dataset – these three publishers are holding more than 214 million DOI-to-DOI citations that could potentially be opened. (The IEEE citation data that were in the Crossref ‘limited’ category as of October 2018 are actually included in COCI, although those from that organization’s more recent publications will no longer be, as mentioned above.) We think it is deeply regrettable and almost incomprehensible that any professional organization, learned society or university press, whose primary mission is to serve the interests of the practitioners, scholars and readers it represents, should choose not to open all its publications’ reference lists as a public good, whatever secondary added-value services it chooses to build on top of the citations that those reference lists contain.
CROCI, the Crowdsourced Open Citations Index
We propose CROCI, the Crowdsourced Open Citations Index, into which individuals identified by ORCID identifiers may deposit citation information that they have a legal right to submit, and within which these submitted citation data will be published under a CC0 public domain waiver to emphasize and ensure their openness for every kind of reuse without limitation. Since citations are statements of fact about relationships between publications (resembling statements of fact about marriages between individual persons), they are not subject to copyright, although their specific textual arrangements within the reference lists of particular publications may be. Thus, the citations from which the reference list of an author’s publication has been composed may legally be submitted to CROCI, although the formatted reference list cannot be. Similarly, citations extracted from within an individual’s electronic reference management system and presented in the requested format may be legally submitted to CROCI, irrespective of the original sources of these citations.
To populate CROCI, we ask researchers, authors, editors and publishers to provide us with their citation data organised in a simple four-column CSV file (“citing_id”, “citing_publication_date”, “cited_id”, “cited_publication_date”), where each row depicts a citation from the citing entity (“citing_id”, giving the DOI of the citing entity) published on a certain date (“citing_publication_date”, with the date value expressed in ISO format “yyyy-mm-dd”), to the cited entity (“cited_id”, giving the DOI of the cited entity) published on a certain date (“cited_publication_date”, again with the date value expressed in ISO format “yyyy-mm-dd”). The submitted dataset may contain an individual citation, groups of citations (for example those derived from the reference lists of one or more publications on a particular topic), or entire citation collections.
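As an illustration of the four-column layout just described, the following is a minimal Python sketch of how a contributor might prepare such a file. It is not part of the OpenCitations tooling: the function names, the two DOIs and the dates are made up for the example, and the date check (which also tolerates year-only, year-and-month and empty values) is our assumption about what a reasonable pre-submission check could look like.

```python
import csv
import io
import re

# Column names as given in the description of the CSV file.
COLUMNS = ["citing_id", "citing_publication_date",
           "cited_id", "cited_publication_date"]

# Assumption: accept yyyy, yyyy-mm or yyyy-mm-dd; empty dates are tolerated.
DATE_PATTERN = re.compile(r"^\d{4}(-\d{2}(-\d{2})?)?$")


def is_valid_row(row):
    """Return True if both identifiers are present and dates are empty or ISO-like."""
    if not row["citing_id"].strip() or not row["cited_id"].strip():
        return False
    for key in ("citing_publication_date", "cited_publication_date"):
        value = row[key].strip()
        if value and not DATE_PATTERN.match(value):
            return False
    return True


def write_citations(rows, handle):
    """Write the valid rows to a file-like object; return how many were written."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    written = 0
    for row in rows:
        if is_valid_row(row):
            writer.writerow(row)
            written += 1
    return written


# Hypothetical citations: the DOIs and dates below are placeholders, not real data.
example_rows = [
    {"citing_id": "10.1234/citing.example",
     "citing_publication_date": "2019-01-24",
     "cited_id": "10.5678/cited.example",
     "cited_publication_date": "2013"},
    {"citing_id": "10.1234/citing.example",
     "citing_publication_date": "24/01/2019",  # not ISO format, so skipped
     "cited_id": "10.5678/other.example",
     "cited_publication_date": ""},
]

buffer = io.StringIO()
rows_written = write_citations(example_rows, buffer)
print(buffer.getvalue())
```

In practice the same function can be given a file opened with open("citations.csv", "w", newline="") instead of the in-memory buffer used here.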
Should any of the submitted citations be already present within CROCI, these duplicates will be automatically detected and ignored. The date information given for each citation should be as complete as possible, and minimally should be the publication years of the citing and cited entities. However, if such date information is unavailable, we will try to retrieve it automatically using OpenCitations technologies already available. DOIs may be expressed in any of a variety of valid alternative formats, e.g. “https://doi.org/10.1038/502295a”, “http://dx.doi.org/10.1038/502295a”, “doi: 10.1038/502295a”, “doi:10.1038/502295a”, or simply “10.1038/502295a”. An example of such a CSV citations file can be found at https://github.com/opencitations/croci/blob/master/example.csv. As an alternative to submissions in CSV format, contributors can submit the same citation data using the Scholix format (Burton et al., 2017) – an example of such format can be found at https://github.com/opencitations/croci/blob/master/example.scholix. Submission of such a citation dataset in CSV or Scholix format should be made as a file upload either to Figshare (https://figshare.com) or to Zenodo (https://zenodo.org). For provenance purposes, the ORCID personal identifier of the submitter of these citation data should be explicitly provided in the metadata or in the description of the Figshare/Zenodo object. Once such a citation data file upload has been made, the submitter should inform OpenCitations of this fact by adding a new issue to the GitHub issue tracker of the CROCI repository at https://github.com/opencitations/croci/issues. OpenCitations will then process each submitted citation dataset and ingest the new citation information into CROCI. These CROCI citations will be made available at http://opencitations.net/index/croci using a REST API and a SPARQL endpoint, and will additionally be published periodically as data dumps in Figshare, all releases being under CC0 waivers.
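To make the handling of alternative DOI formats and duplicate citations more concrete, here is a minimal Python sketch of one way the accepted DOI forms could be reduced to a single canonical string before duplicates are compared. It is an illustration only and does not describe the actual OpenCitations ingestion code: the function names are hypothetical, and lowercasing the DOI and matching the prefixes case-insensitively are our assumptions (DOI names are case-insensitive).

```python
import re

# Prefixes taken from the DOI formats listed in the text.
# Assumption: they are matched case-insensitively.
_PREFIX = re.compile(r"^(https?://(dx\.)?doi\.org/|doi:\s*)", re.IGNORECASE)


def normalize_doi(value):
    """Strip a resolver URL or "doi:" prefix and lowercase the remaining DOI."""
    return _PREFIX.sub("", value.strip()).lower()


def deduplicate(citations):
    """Keep the first occurrence of each (citing, cited) pair after normalisation."""
    seen = set()
    unique = []
    for citing, cited in citations:
        key = (normalize_doi(citing), normalize_doi(cited))
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# The five equivalent forms given in the text for the same DOI.
forms = [
    "https://doi.org/10.1038/502295a",
    "http://dx.doi.org/10.1038/502295a",
    "doi: 10.1038/502295a",
    "doi:10.1038/502295a",
    "10.1038/502295a",
]
normalized = {normalize_doi(form) for form in forms}

# Two hypothetical rows describing the same citation in different notations;
# "10.1234/Citing.Example" is a made-up DOI used only for illustration.
pairs = deduplicate([
    ("https://doi.org/10.1234/Citing.Example", "doi:10.1038/502295a"),
    ("10.1234/citing.example", "10.1038/502295a"),
])
```

Comparing normalised pairs rather than raw strings means that the same citation submitted by two contributors in different notations is counted once.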
We propose in future to enable combined searches over all the OpenCitations indexes, including COCI and CROCI. We are confident that the community will respond positively to this proposal of a simple method by which the number of open citations available to the academic community can be increased, in particular since the data files to be uploaded have a very simple structure and thus should be easy to prepare. In particular, we hope for submissions of citations from within the reference lists of authors’ green OA versions of papers published by Elsevier, IEEE, ACS and UCP, and from publishers not already submitting publication metadata to Crossref, so as to address existing gaps in open citations availability. We look forward to your active engagement in this initiative to further increase the availability of open scholarly citations.