Merits and Limits: Applying open data to monitor open access publications in bibliometric databases
Aliakbar Akbaritabar ∗, Stephan Stahlschmidt †
Abstract
Identifying and monitoring Open Access (OA) publications might seem a trivial task, but practical efforts prove otherwise. Contradictory information often arises depending on the metadata employed. We strive to assign an OA status to publications in Web of Science (WOS) and Scopus while complementing it with different sources of OA information to resolve contradictory cases. We linked publications from WOS and Scopus via DOIs and ISSNs to Unpaywall, Crossref, DOAJ and ROAD. Only about 50% of articles and reviews from WOS and Scopus could be matched via a DOI to Unpaywall. Matching with Crossref yielded 56 distinct licences, which in many cases define the legally binding access status of publications. But only 44% of publications hold a single licence on Crossref, while more than 50% have no licence information submitted to Crossref. Contrasting OA information from Crossref licences with Unpaywall, we found contradictory cases amounting to more than 25% overall, which might be partially explained by the inclusion or exclusion of green OA. A further manual check found about 17% of OA publications that are not accessible and 15% of non-OA publications that are accessible through publishers’ websites. These preliminary results suggest that identifying the OA status of publications is a difficult and currently unfulfilled task.
Keywords : Open Access, Unpaywall, Crossref, Web Of Science, Scopus
Introduction
Open access (henceforth OA) in scholarly communication describes unrestricted access to published peer-reviewed documents written by and addressed to researchers. These documents have traditionally been disseminated via publications in scientific journals, which charge for access to the respective content. Stimulated by a call for greater openness and transparency in general (“open science”), the OA movement has nowadays been accepted as one, though not the only, alternative for the dissemination of scholarly documents. Even publishers seem to embrace this new model as providing a suitable infrastructure while at the same time securing their own economic interests.

This intermixture of interests has resulted in not one but several forms of OA publications, such as Gold, Hidden Gold, Hybrid, Green, Delayed, Bronze and Black, which are mainly based on right-to-access and pay-to-publish models and depend on the venue where the OA publication is accessible.

Due to the individual ascription of single publications to one or several of these categories and the decentralized structure of the scientific publishing market with a variety of diverse publishers, the identification of OA is less trivial than it might seem. Even large bibliometric data providers rely on external information to provide information on OA (see https://clarivate.com/blog/easing-access-to-open-access-clarivate-analytics-partners-with-impactstory/), and most large-scale undertakings by the scientometric community to obtain reliable information on OA prevalence rely on web crawlers (Archambault et al., 2013; Piwowar et al., 2018).

Inspired by the Hybrid OA Dashboard (Jahn, 2017), we used the licensing information that publishers supply to the publisher association Crossref, which details the legally binding access status, to identify OA publications. We determined the OA status of all publications retrieved from the 2017 in-house databases of Web of Science (henceforth WOS) and Scopus by comparing them with two sources of OA information, i.e., Unpaywall and Crossref. In Section 2, we present our data and methods. In Section 3 we present our findings, and we discuss our main results in Section 4.

∗ German Centre for Higher Education Research and Science Studies (DZHW), Schützenstr. 6a, Berlin, 10117 (Germany); [email protected] (corresponding author)
† German Centre for Higher Education Research and Science Studies (DZHW); [email protected]
arXiv [cs.DL], Feb

Data & Method
We queried all publications from Scopus and WOS in the in-house databases of 2017. The data included each article’s unique database ID and its DOI. We matched those DOIs with the Unpaywall database from April 18th. Additionally, we used the journals’ ISSNs provided by Wohlgemuth, Rimmert, & Winterhager (2016) (and the updated version in Rimmert, Bruns, Lenke, & Taubert (2017)) to identify Gold OA publications. They use different known OA indexes, e.g., DOAJ (Directory of Open Access Journals, https://doaj.org/) and ROAD (Directory of Open Access scholarly Resources, https://road.issn.org/), and determine whether the respective ISSN is listed in those databases. They differentiate between ISSN and ISSNL, which is more fine-grained by adding a specific ISSN to some special issues. We tried both ISSN and ISSNL; since the latter matched more records, we use the ISSNL in the analysis presented in the Results section.

It is necessary to note that some publications had multiple licence URLs in the Crossref database. We followed a four-step procedure to ensure that only one licence status was used per publication (see Table 2 for the frequencies of these publications):
1. If a publication had only one record in the Crossref database, whether it had an OA, non-OA or unclear licence or no licence information (i.e. NA), we used this status and categorized the publication as a unique one.
2. If a publication had multiple OA licence URLs, we removed the duplicates and categorized it as OA.
3. If a publication had a mixture of OA and non-OA licence URLs, we removed the duplicates and categorized it as OA.
4. If a publication had multiple non-OA licence URLs, we removed the duplicates and categorized it as non-OA.

Note: our attempts to send a large number of requests to the Crossref API (even while using the Plus service, through both the rcrossref package in R and more fine-grained httr requests sent directly to the Crossref API) ran into timeout and response-time errors, so we used the in-house snapshot of the Crossref data instead. This meant parsing a large corpus of JSON files, which can be time-consuming depending on the goals of the analysis. Any effort to automate the proposed OA identification procedure needs to overcome technical issues like this.

A research assistant checked the unique licences (a total of 56) that we extracted from Crossref against information available online to categorize them as OA or non-OA. We used this categorization in parallel with established OA identification procedures (e.g., searching for a journal’s ISSN in DOAJ and ROAD for Gold OA identification) to ensure a higher level of robustness in our results.

To determine whether a publication was OA, we applied a multi-category view separating Gold, Hidden Gold, Hybrid and Delayed OA. While doing so, we arrived at a new category, Probable Hybrid OA. Our investigation strategy for each category was as follows:
• Gold OA: As described earlier, we used the ISSNs provided by Rimmert et al. (2017) to determine Gold OA. We matched the respective ISSN (from both WOS and Scopus) with DOAJ and ROAD. If the respective ISSN was listed in one of those directories, the publication was categorized as Gold OA. We compared Gold OA from DOAJ and ROAD with our research assistant’s categorization of Crossref licences after the manual check of unique licence URLs.
• Hidden Gold OA: We used metadata from WOS and Scopus to determine the journal issue and looked at the licences of all publications in a single issue. If all publications had OA licences but the ISSN was not indexed in DOAJ or ROAD, we categorized them as Hidden Gold OA.
• Hybrid OA: If an issue had at least one non-OA publication while having one or more OA publications, we categorized the OA publications as Hybrid OA.
• Probable Hybrid OA: If an issue did not have a non-OA publication while having one or more OA publications, and some publications in the issue did not have licence information, we categorized them as Probable Hybrid OA.
• Delayed OA: In all of the above cases, we looked into delays based on Crossref metadata (the difference in days between the day of publication and the date the licence was assigned to the publication, as described in CrossRef-API (2019); this time period is known as embargo time) to determine whether they were Delayed. Each of the above categories was therefore split into two groups, delayed and not-delayed. If a publication had multiple licence URLs on Crossref, we checked their respective delay times: if any of those were not-delayed, we categorized the publication as such, while if any of the licences were delayed, the publication was identified as a delayed one.
• Closed Access: Strictly speaking, if the number of publications in an issue was equal to the number of non-OA publications and the ISSN was not indexed in DOAJ or ROAD, we categorized them as Closed Access.
• NA (Not available): A publication that did not fit into any of the above categories or did not have a licence URL to determine its condition was categorized as NA. The number of NAs is higher than the number of Closed Access publications, since we aimed to keep the definitions as strict as possible.

Table 1: Articles and reviews from WOS and Scopus matched with Unpaywall
Data Source                                          Frequency    Percent
WOS (matched Unpaywall, only articles & reviews)     11,661,206   57.5%
WOS (only articles & reviews 2000-2017)              20,280,606   -
Scopus (matched Unpaywall, only articles & reviews)  14,188,983   53.48%
Scopus (only articles & reviews 2000-2017)           26,532,295   -
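The four-step licence resolution and the issue-level category rules described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names, the status labels as strings, and the dictionary layout (DOI mapped to licence status) are hypothetical. Three points are assumptions because the text leaves them open: mixtures of non-OA and unclear licences fall back to non-OA, a not-delayed licence takes precedence over delayed ones, and non-OA or unlicensed publications in a mixed issue are labelled NA.

```python
from typing import Dict, Iterable

OA, NON_OA, UNCLEAR, NA = "OA", "non-OA", "unclear", "NA"


def resolve_licence_status(statuses: Iterable[str]) -> str:
    """Collapse the licence statuses recorded for one DOI into a single status."""
    unique = set(statuses)  # drop duplicate statuses (steps 2 to 4)
    if not unique:
        return NA  # no licence information on Crossref
    if len(unique) == 1:
        return next(iter(unique))  # steps 1, 2 and 4: one distinct status
    if OA in unique:
        return OA  # step 3: any OA licence wins in a mixture
    # Assumption: non-OA/unclear mixtures are not covered by the text.
    return NON_OA if NON_OA in unique else UNCLEAR


def is_delayed(delays_in_days: Iterable[int]) -> bool:
    """Flag a publication as delayed based on the embargo days of its licences."""
    delays = list(delays_in_days)
    # Assumption: one licence without delay makes the publication not-delayed.
    if any(d <= 0 for d in delays):
        return False
    return any(d > 0 for d in delays)


def classify_issue(statuses: Dict[str, str], issn_in_doaj_or_road: bool) -> Dict[str, str]:
    """Assign an OA category to every publication (keyed by DOI) of one issue."""
    if issn_in_doaj_or_road:
        return {doi: "Gold OA" for doi in statuses}
    values = list(statuses.values())
    n_oa = values.count(OA)
    n_non_oa = values.count(NON_OA)
    if values and n_oa == len(values):
        return {doi: "Hidden Gold OA" for doi in statuses}
    if values and n_non_oa == len(values):
        return {doi: "Closed Access" for doi in statuses}
    if n_oa and n_non_oa:
        oa_label = "Hybrid OA"
    elif n_oa:
        # No non-OA publication, but some lack a usable licence status.
        oa_label = "Probable Hybrid OA"
    else:
        oa_label = NA
    # Assumption: everything that is not OA in a mixed issue stays NA.
    return {doi: (oa_label if s == OA else NA) for doi, s in statuses.items()}
```

In use, `resolve_licence_status` would be applied once per DOI to the manually categorized licence URLs, and `classify_issue` once per journal issue to the resulting statuses. The delayed and not-delayed split is kept as a separate flag so that it can be combined with any category.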
Results
We present the results in two main parts, one regarding Unpaywall and the other on licences extracted from Crossref. We then present the comparison between Unpaywall and Crossref and the results of our manual checks on random samples for robustness of the results.

Table 1 shows the number of articles and review papers from WOS and Scopus with an equivalent record in the Unpaywall database. It also presents the total number of articles and review papers in WOS/Scopus to provide a baseline for comparison. Unpaywall covers more than 50% in both cases, while coverage of WOS is slightly higher (which may be due to a different indexing philosophy or DOI completeness). In the following tables (in the Unpaywall results), publications are limited to articles and review papers published in 2000-2017.

Figure 1 presents the distribution of journals and publications indexed in WOS (top) and Scopus (bottom) matched with the Unpaywall database, with the ISSNs cross-checked against DOAJ. Missing on DOAJ in these figures refers to journals whose ISSN was missing from the Rimmert et al. (2017) data, so we could not check whether the ISSN is listed in DOAJ, while Others means the ISSN existed in Rimmert et al. (2017) but was not listed as OA in DOAJ. The share of publications that do not have a matching ISSN in DOAJ (meaning they are not Gold OA) and are identified as OA in Unpaywall is notable in both figures (labelled “Missing on DOAJ | Unpaywall OA”). They could be other OA types (green, hybrid, hidden gold).

Figure 1: Publications indexed in WOS (top) and Scopus (bottom) matched with Unpaywall database and crosschecked the ISSNs with DOAJ (Gold OA) (X-axis denotes the years, Y-axis denotes the number of publications in each year)

We matched publications to Crossref data from April 2018 and found 56 distinct licence types across all publications. Table 2 presents a descriptive view of whether publications have licence information recorded in Crossref. It shows that about 50% of publications from WOS or Scopus with a matching DOI indexed in Crossref do not have a licence URL. Some publications had more than one licence record in Crossref (for example, 7 DOIs each have 6 licence records on Crossref). In case of multiple licences, if a publication had at least one OA licence, we categorized it as OA.

Table 2: Number of licences per DOI found in Crossref
Number of licences per DOI   Frequency of DOIs   Percent
0                            9,892,208           51.41
1                            8,520,158           44.28
2                              824,975            4.29
3                                5,770            0.03
5                                   25            0.00
6                                    7            0.00

Figure 2 presents the Gold, Hidden Gold, Hybrid and Delayed OA status of the publications from WOS (top) and Scopus (bottom) as trends over the years. We limited the years to 2000-2017 to show the most recent trends. To make these figures more readable, we removed NA (publications without a matching DOI or without licence information on Crossref).

Figure 2: Comparison of OA publications 2000-2017 (WOS (top) and Scopus (bottom) data matched with Crossref) (X-axis denotes the years, Y-axis denotes the number of publications in each year)

Tables 3 and 4 present the OA status comparison between Unpaywall and Crossref in WOS and Scopus publications, respectively. Note that the Crossref OA status in the tables is the categorization we developed using the respective licence URLs. We double-checked the contradictory cases and improved our white-list of OA licences, but some contradictions remain (e.g., Unpaywall declares publications as OA while they are closed access according to their Crossref licences or, vice versa, licences on Crossref are open access while the publication is declared as non-OA on Unpaywall). Overall, contradictory cases amount to 27.95% in WOS and 27.57% in Scopus, which might partly be explained by the wider scope of Unpaywall, which also includes green OA publications that might not be identified via licence information only.

Table 3: OA status comparison between Unpaywall and Crossref in WOS publications
Crossref OA Status   Unpaywall OA Status   Frequency    Percent
Closed Access        Closed Access         4,452,185    38.18
NA                   Closed Access         3,512,794    30.12
NA                   Open Access           1,770,612    15.18
Closed Access        Open Access           1,363,525    11.69
Open Access          Open Access             435,516     3.73
Open Access          Closed Access           126,354     1.08
Closed Access        NA                           26     0.00
NA                   NA                           19     0.00

Tables 5 and 6 present the results of our research assistant’s manual check of whether an article’s PDF file is accessible from the publisher’s website, compared with the respective licence in Crossref, the OA status we manually assigned to those licence URLs, and the OA status from Unpaywall. It is interesting to see that there are publications defined as Non-OA while their PDF is accessible from the publisher (14.42% in WOS and 14.98% in Scopus) and, vice versa, OA publications (based on either Unpaywall, Crossref or both) that are not accessible online (17.57% in WOS and 16.74% in Scopus). Note also the contradictory cases between Crossref and Unpaywall, where metadata from one shows OA and the other Closed, which require further investigation (22.98% in WOS and 22.91% in Scopus; these percentages are quite close to the contradictions observed in the overall sample presented in Tables 3 and 4). Our effort to complement these databases shows that none of them could be used in isolation. We aim to follow up and use the PDF URLs provided by Unpaywall at large scale to check the share of publications that can be accessed.
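The overall contradiction shares quoted for Tables 3 and 4 can be re-derived from the table frequencies. The following Python sketch is an illustration, not the authors' code. The rule in `is_contradictory` is an assumption inferred from the reported percentages: a row counts as contradictory when Unpaywall reports Open Access and Crossref does not (including Crossref NA), or when Crossref reports Open Access and Unpaywall reports Closed Access. The helper names are hypothetical.

```python
from typing import Callable, List, Tuple

Row = Tuple[str, str, int]  # (Crossref status, Unpaywall status, frequency)

# Frequencies copied from Table 3 (WOS) and Table 4 (Scopus).
TABLE_3_WOS: List[Row] = [
    ("Closed Access", "Closed Access", 4_452_185),
    ("NA", "Closed Access", 3_512_794),
    ("NA", "Open Access", 1_770_612),
    ("Closed Access", "Open Access", 1_363_525),
    ("Open Access", "Open Access", 435_516),
    ("Open Access", "Closed Access", 126_354),
    ("Closed Access", "NA", 26),
    ("NA", "NA", 19),
]
TABLE_4_SCOPUS: List[Row] = [
    ("Closed Access", "Closed Access", 5_138_444),
    ("NA", "Closed Access", 4_635_801),
    ("NA", "Open Access", 2_201_936),
    ("Closed Access", "Open Access", 1_549_902),
    ("Open Access", "Open Access", 502_510),
    ("Open Access", "Closed Access", 160_132),
    ("NA", "NA", 15),
    ("Open Access", "NA", 4),
    ("Closed Access", "NA", 1),
]


def is_contradictory(crossref: str, unpaywall: str) -> bool:
    """Assumed rule for a contradiction between the two sources."""
    if unpaywall == "Open Access":
        return crossref != "Open Access"  # includes Crossref NA
    return crossref == "Open Access" and unpaywall == "Closed Access"


def share(rows: List[Row], predicate: Callable[[str, str], bool]) -> float:
    """Exact percentage of publications in rows that satisfy the predicate."""
    total = sum(freq for _, _, freq in rows)
    hits = sum(freq for c, u, freq in rows if predicate(c, u))
    return 100.0 * hits / total


def rounded_sum(rows: List[Row], predicate: Callable[[str, str], bool]) -> float:
    """Sum of the row percentages after rounding each to two decimals."""
    total = sum(freq for _, _, freq in rows)
    parts = [round(100.0 * freq / total, 2) for c, u, freq in rows if predicate(c, u)]
    return round(sum(parts), 2)


if __name__ == "__main__":
    print(f"WOS exact share:    {share(TABLE_3_WOS, is_contradictory):.2f}%")
    print(f"WOS rounded sum:    {rounded_sum(TABLE_3_WOS, is_contradictory):.2f}%")
    print(f"Scopus exact share: {share(TABLE_4_SCOPUS, is_contradictory):.2f}%")
```

Under this assumed rule, summing the rounded row percentages for WOS gives the 27.95% quoted in the text, while the share computed directly from the frequencies is marginally higher because of rounding. The same `share` helper could be applied to the random-sample tables by swapping in their rows and a different predicate.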
Conclusions
It is clear that publishing as OA has been on the rise in recent years. This trend is observed similarly in WOS and Scopus (Scopus has higher raw publication counts, but the trends are identical) and with OA identification based on both Unpaywall and Crossref. Still, the majority of publications are closed access. We observed that despite the high coverage of Unpaywall (more than 50% of articles and reviews in both WOS and Scopus), it does not provide enough metadata (as of April 2018) for OA categorization and could therefore be limiting for large-scale OA monitoring in the leading bibliometric databases. Licence information from Crossref is more detailed and offers a good opportunity to complement Unpaywall metadata. Although we overcame these downsides by complementing the databases with each other, manual random checks still revealed further contradictions between them. Some publications were OA (based on their licences or Unpaywall status) while their PDF files were not accessible through publishers’ websites. Some publications were closed access while their PDF files were accessible. We found that multiple records or multiple licence entries for some publications are an issue that needs to be seriously considered in OA monitoring. Although we tested different scenarios in OA identification, there are still publications that do not fit into any of the scenarios, and we had to categorize them as NA (since we wanted to keep the Closed Access definition as strict as possible). These are the publications that need to be studied further, and the metadata of the OA databases are usually lacking for them. We propose that OA monitoring activities benefit from our approach of complementing the metadata from OA databases, i.e. Unpaywall and Crossref, while noting that there are contradictions between these sources. Our effort to complement these databases shows that none of them could be used in isolation.

Table 4: OA status comparison between Unpaywall and Crossref in Scopus publications
Crossref OA Status   Unpaywall OA Status   Frequency    Percent
Closed Access        Closed Access         5,138,444    36.21
NA                   Closed Access         4,635,801    32.67
NA                   Open Access           2,201,936    15.52
Closed Access        Open Access           1,549,902    10.92
Open Access          Open Access             502,510     3.54
Open Access          Closed Access           160,132     1.13
NA                   NA                           15     0.00
Open Access          NA                            4     0.00
Closed Access        NA                            1     0.00

Table 5: Random sample OA status check on publications from WOS
PDF Manually accessible?   Licence status   Pub OA?                Frequency   Percent
PDF Accessible             Open Access      Unpaywall OA           104         46.85
No Access to PDF           Closed Access    Unpaywall non-OA        44         19.82
No Access to PDF           Open Access      Unpaywall non-OA        18          8.11
No Access to PDF           Closed Access    Unpaywall OA            16          7.21
PDF Accessible             Closed Access    Unpaywall OA            16          7.21
PDF Accessible             Closed Access    Unpaywall non-OA        14          6.31
No Access to PDF           Open Access      Unpaywall OA             5          2.25
NA                         Closed Access    Unpaywall non-OA         1          0.45
No Access to PDF           Closed Access    Missing on Unpaywall     1          0.45
PDF Accessible             NA               Unpaywall non-OA         1          0.45
PDF Accessible             Open Access      Unpaywall non-OA         1          0.45
PDF Accessible             NA               Unpaywall OA             1          0.45
References
Archambault, E., Amyot, D., Deschamps, P., Nicol, A., Rebout, L., & Roberge, G. (2013). Peer-reviewed papers at the European and world levels—2004-2011.