Do Authors Deposit on Time? Tracking Open Access Policy Compliance
Drahomira Herrmannova
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
orcid.org/ [email protected]

Nancy Pontika
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
orcid.org/ [email protected]

Petr Knoth
Knowledge Media Institute, The Open University, Milton Keynes, United Kingdom
orcid.org/ [email protected]
ABSTRACT
Recent years have seen fast growth in the number of policies mandating Open Access (OA) to research outputs. We conduct a large-scale analysis of over 800 thousand papers from repositories around the world published over a period of 5 years to investigate: a) if the time lag between the date of publication and the date of deposit in a repository can be effectively tracked across thousands of repositories globally, and b) if introducing deposit deadlines is associated with a reduction of the time from acceptance to public availability of research outputs. We show that after the introduction of the UK REF 2021 OA policy, this time lag has decreased significantly in the UK and that the policy introduction might have accelerated the UK's move towards immediate OA compared to other countries. This supports the argument for the inclusion of a time-limited deposit requirement in OA policies.

CCS CONCEPTS
• Information systems → Data mining; Digital libraries and archives; • Applied computing → Publishing;

KEYWORDS
Open Access, Scholarly Data, Data Mining, Research Evaluation, Research Excellence Framework, REF
1 INTRODUCTION

More than seventeen years have passed since the definition of Open Access (OA) was agreed [4]. OA, which refers to scientific literature that is online and available free of cost to the end user, questions the traditional publishing business model relying on paywalls and advocates for a shift towards alternative, more cost-effective publishing models delivering free access to research outputs for all [19, 21–23]. These arguments have been gradually influencing researchers, research organisations, and funders, resulting in the creation of new OA policies. As of January 2019, according to the Registry of Open Access Repository Mandates and Policies (https://roarmap.eprints.org/), there are 732 institutional and 85 funder OA policies globally.

OA policies provide authors with criteria for making their research outputs available as OA [17]. These criteria typically include when and where the research outputs should be deposited or published and what version of the manuscript (e.g. pre-print vs. post-print) should be made openly accessible [17]. Arguably one of the most significant OA policies, the UK Research Excellence Framework (REF) 2021 Open Access Policy, was introduced in the UK in March 2014 [8]. The significance of this policy lies in two aspects: 1) the requirement to make research outputs OA is linked to performance review, creating a strong incentive for compliance [25, 29, 30], and 2) it affects over 5% of global research outputs. Under this policy, only compliant research outputs will be evaluated in the national Research Excellence Framework. Over 52 thousand academic staff from 154 UK universities submitted over 190 thousand research outputs in the most recent REF (2014) [20].

(The term "immediate OA" is usually used to refer to outputs that are available immediately upon publication without any embargo periods. In this context, we do not consider embargoes and use it simply to mean availability upon publication.)
The UK REF 2021 OA policy is not the only major nation-wide development – the U.S. Public Access Plan [27] introduced in 2013 and the European Commission supported "Plan S" [5] are just two more examples of a global shift towards Open Access.

The problem.
The growth of OA and the introduction of new policies, such as the REF 2021 Open Access Policy, has brought forth important questions and implications, some universal and some policy-specific. Even when authors deposit their work in OA repositories, does this happen immediately, or is the deposit delayed? What effect does the introduction of policies have on the practice of publishing OA? Is there evidence to support that introducing OA policies reduces the time from acceptance to the open availability of research outputs? More importantly, how can compliance with OA policies be tracked, particularly when specific time-frames for making research outputs OA are in place? While recent studies analysing compliance with OA policies [12, 14] and the prevalence of OA [18] have focused on whether articles are eventually made openly available, they have not taken into consideration the time lag between the acceptance/publication of an article and its online availability (deposit into an OA repository). Two existing studies which have taken deposit dates into consideration [25, 29] are now outdated, are not easily reproducible, and have not used these dates to assess compliance (i.e. to understand whether authors deposit on time in accordance with existing policies) but instead used these dates to study policy effectiveness (i.e. to understand whether certain types of policies shorten the time between publication and deposit). If we can measure the time lag between publication and deposit, can we assist authors and institutions in improving their compliance with OA policies?

JCDL 2019, June 2019, Urbana-Champaign, IL, USA
Research questions.
In this paper we analyse the time lag between article publication dates and the dates of their deposit into OA repositories. We will further refer to the time lag between these dates simply as deposit time lag. We analyse deposit time lag across country, time, repository, and discipline. Furthermore, we investigate whether introducing a mandatory policy in the UK – the REF 2021 Open Access Policy, which requires depositing research outputs within a specific period – affected this time lag.

To study deposit time lag and compliance with the policy, we use data from Crossref, the largest Digital Object Identifier (DOI) registration agency, and from CORE, the largest full text aggregation service collecting OA research outputs from institutional and subject repositories and from journals around the world [13]. After matching article metadata from Crossref and from CORE, we analyse the time lag between the publication dates we receive from Crossref and the deposit dates we receive from CORE. Using this data, we answer the following research questions:

(1) How does deposit time lag vary across time, country, institution, and discipline?
(2) What proportion of UK research outputs was not deposited on time to comply with the REF 2021 OA Policy?
(3) Is the REF 2021 OA policy affecting how soon publications are made OA?
(4) How does the change in the deposit time lag in the UK over the past several years compare to other countries?

Findings.
We show that the time between publication and de-posit has globally significantly decreased. We also show that whilethere are notable differences in deposit time lag of different subjects,there are even larger differences between different institutions, evenwhen considering only publications from the same discipline. Thissuggests institutions may be stronger drivers of OA than disciplineculture. Furthermore, we show the introduction of the UK REF OAPolicy might have accelerated the UK’s move towards immediateOA compared to other countries.
Contributions.
We present a method for automated tracking of deposit time lag which can be applied to research outputs worldwide. Using this method, we provide the first large-scale analysis of deposit time lag. Ours is also the first study to quantitatively analyse deposit time lag in relation to the REF 2021 OA Policy. Our results support the argument for the inclusion of a time-limited deposit requirement in OA policies. Finally, to support further studies on the deposit of research outputs into OA repositories, we release our dataset of 800 thousand publications and the source code of our analysis.

Outline.
This paper is organised as follows. First, in Section 2 we review previous work related to our study. Next, in Section 3 we describe our data collection process and the methodology used in our analysis. In Section 4 we explain how we prepare our dataset, and in Section 5 we present the results of our analysis. Finally, Section 6 discusses limitations of the present work and future goals.

(Existing studies sometimes refer to the difference between the publication and deposit dates as "deposit latency" [25, 29]. However, because the term "latency" is in computer science typically associated with a different meaning, we chose to use the term "time lag" instead.)

https://core.ac.uk/
https://github.com/oacore/jcdl_2019

2 RELATED WORK

In this section we discuss work related to our research. In particular, we focus on two topics: 1) studies that try to estimate the proportion of all research publications that are openly accessible and 2) studies that analyse compliance with specific OA policies. We close this section by discussing the differences between our study and previous work.

Particularly in recent years, many studies have been conducted that have tried to estimate the proportion of existing research that is available as OA [1–3, 6, 11, 14, 18]. While an earlier study identified OA articles using manual Google search [3], the later studies use automated methods based on web crawling [6, 11], database searching [14, 18], or a combination of both [1, 2]. One of the two most recent studies has estimated the proportion of OA articles to be at least 28% overall (a finding similar to [6, 11]), with 45% of articles published in 2015 being OA [18]. The most recent study we know of [14] has utilised the same method as [18], but focused on publications subject to OA policies of selected funders, revealing that two thirds of these publications were available as OA.

Two of the studies [6, 14] are of particular interest because they investigated the proportion of OA articles in relation to specific policies.
Gargouri et al. [6] have demonstrated that the proportion of OA articles at institutions with OA policies was three times as high as at institutions without them. Interestingly, the study has also shown that not all articles were made available online upon publication but were instead deposited retrospectively. Larivière and Sugimoto [14] investigated twelve funders (the European Research Council and eleven funders from the UK, US and Canada) which implemented OA policies. The study has revealed significant differences in the proportion of OA publications between different funders, even when considering funders from the same discipline. In particular, funders which required depositing into a repository upon publication had a significantly higher proportion of OA articles than funders which allowed deposit after publication. While the authors have observed differences between disciplines, finding significant variations between funders within the same discipline has led the authors to conclude that the funding agency may be a stronger driver of OA publishing than the culture within a discipline.

The above-mentioned studies look at how many publications are available as OA compared to how many publications appear behind paywalls. However, as Gargouri et al. [6] have indirectly shown, the open online availability of a publication does not necessarily ensure compliance with a given policy. A number of policies, including the UK REF 2021 OA Policy and the US National Institutes of Health (NIH) Public Access Policy, require deposit by a certain date – three months after acceptance in the case of the REF 2021 OA Policy and upon publication in the case of the NIH Public Access Policy. The approach utilised by the above-mentioned works would typically mean even publications which were deposited retrospectively could be considered compliant with these two policies.

Only a handful of studies have investigated specific details of existing policies [12, 25, 29]. Vincent-Lamarre et al.
[29] analysed research articles published by 67 institutions with an OA mandate, i.e. an OA policy which was mandatory rather than recommended. The studied mandates were broken down into eight specific conditions such as deposit timing and embargo length, and the study investigated how these conditions relate to mandate compliance. They found that one value for three of the eight conditions (immediate deposit required, deposit required for performance evaluation, unconditional opt-out allowed for the OA requirement but no opt-out for the deposit requirement) was strongly associated with higher deposit rates as well as with lower deposit time lag. Swan et al. [25] have conducted a similar study and compared specific policy conditions with deposit rates and time lag for 122 institutions with mandatory OA policies. Similarly as in the case of [29], the authors have identified three criteria which were associated with improved deposit rates (deposit mandatory, deposit cannot be waived, deposit should be linked with research evaluation). Khoo and Lay [12] have focused on embargo periods and studied the rate at which neuroscientists in Australia and Canada publish in journals with embargo periods that are not compliant with funder policies, i.e. are longer than 12 months. Interestingly, they observed no reduction in the number of articles published in journals with non-compliant embargo periods after new funder policies were introduced in Australia and Canada, despite these policies being mandatory.

In the present work we investigate how much time it takes for authors to deposit their articles in OA repositories in relation to when these articles get published. Our work differs from the aforementioned studies in a number of ways.
In contrast to [29] and [25], who correlated deposit time lag with specific policy conditions, we instead analyse how deposit time lag differs across a number of dimensions such as country and discipline. We also address what we envision as a future step in assisting the OA movement – automated and reproducible tracking of policy compliance. By utilising the CORE aggregator, which harvests content from thousands of repositories globally, we are able to study how many publications get deposited in multiple places and whether recognising these multiple copies can enable faster access to research. Ours is also the first study to quantitatively analyse the UK REF 2021 OA Policy.
3 METHODOLOGY

In this section, we describe the datasets and the methodology used to answer our research questions. As one of the aims of this work is to study compliance with the UK REF 2021 OA Policy, we start by introducing the policy.

Compliance with the REF 2021 OA Policy is met when authors deposit (self-archive) the post-print (also called the "author accepted manuscript," i.e. the author's final version of the manuscript where all the peer review suggestions have been addressed but without the publisher's typesetting) into an institutional or a subject repository within three months from the acceptance of the publication [8, 24]. The policy affects journal articles and conference proceedings with an International Standard Serial Number (ISSN), which constitute the majority (77%) of outputs submitted to the latest REF [10]. Although the policy was introduced in 2014, the implementation period started in April 2016 to allow universities to create the necessary infrastructure for tracking compliance.

To collect the data needed for the analysis of deposit time lag world-wide, we use the following data sources:

• Crossref is the largest DOI registration agency. Crossref stores publication metadata associated with each DOI that is
registered with the service. At the time of writing, Crossref contained 103 million records.

• CORE is the world's largest OA aggregation service [15], collecting OA research outputs from institutional and subject repositories and from journals worldwide [13]. As such, CORE provides a single interface for accessing data from repositories around the world. At the time of writing, CORE aggregated content from over 3,700 repositories and contained 135 million article records. While there are other services, such as OpenAIRE and BASE, which aggregate data from repositories, OpenAIRE has an order of magnitude smaller dataset (25 million records) and neither BASE nor OpenAIRE make their datasets publicly available for download and analysis. Furthermore, judging from the user interfaces of both, deposit dates do not appear to be available.

Figure 1: A visual depiction of the publishing and data collection process, which is started by a submission and an acceptance of a publication. Steps mentioned in the REF 2021 OA Policy are shown in red. The dates we acquire from the two databases and use to calculate the deposit time lag are highlighted with a blue frame.

Figure 1 shows Crossref and CORE along with the data they collect and depicts the process of how published articles get entered into these systems. The process is started when an author submits and a publisher accepts a manuscript. The REF 2021 Open Access Policy stipulates that the author's final version of the manuscript (i.e. the post-print) must be deposited into a repository within three months of acceptance. The acceptance and deposit steps, which are mentioned in the policy, are shown in red in the figure.

Upon receiving the author's final version of the manuscript, the publisher registers this manuscript with Crossref. Crossref then stores metadata associated with the publication, including the date of publication.
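Crossref encodes publication dates as nested "date-parts" arrays. A minimal sketch of turning such a field into a concrete date, assuming the first-day-of-month fallback described in Section 4 (the helper name is ours, not part of any released code):

```python
from datetime import date

def parse_date_parts(date_parts):
    """Convert a Crossref-style ``date-parts`` entry (e.g. [[2017, 9]])
    into a datetime.date, substituting 1 for a missing month or day.
    (Records missing the month entirely are filtered out in Section 4.)"""
    parts = date_parts[0]
    year = parts[0]
    month = parts[1] if len(parts) > 1 else 1
    day = parts[2] if len(parts) > 2 else 1
    return date(year, month, day)

# A trimmed example of the structure returned by the Crossref REST API.
record = {"published-print": {"date-parts": [[2017, 9]]}}
print(parse_date_parts(record["published-print"]["date-parts"]))  # 2017-09-01
```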
Furthermore, once the author's final version of the manuscript is deposited in a repository, the metadata of the publication, including the date it was deposited into the repository, is propagated into CORE through its aggregation service.

(Subject repositories aggregated by CORE include e-print repositories such as ArXiv, which is often used to deposit pre-prints as well as post-prints. The latest REF 2021 submission guidelines state e-print repositories will be considered acceptable for compliance purposes [26]. We have therefore included these repositories in our analysis.)

The REF 2021 OA Policy requires papers to be deposited into a repository within a certain time frame relative to the date of acceptance. However, when the policy was introduced, the date of acceptance was not tracked by Crossref or by most repositories
and other databases. Although Crossref metadata now contain an accepted field, this field is only populated for a small fraction of publications (this will be further discussed in Section 4.6). Furthermore, while repositories have since the introduction of the policy created infrastructure for recording the acceptance date, this date is unlikely to be available for publications published prior to the policy taking effect and for non-UK publications. Consequently, the acceptance date does not allow us to study compliance with the policy over time or compare the UK to other countries. Therefore, to measure deposit time lag and non-compliance with the policy, we use dates of publication instead of acceptance dates.
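The resulting classification rule can be sketched as a small function. This is only an illustration under our own assumptions: the function name is ours, and the three-month window is approximated as 92 days, since the policy counts from acceptance while we observe only publication dates:

```python
from datetime import date

# Approximation of the REF 2021 three-month window (an assumption:
# the policy counts three months from acceptance; we measure from
# the publication date, the closest date available at scale).
WINDOW_DAYS = 92

def compliance_category(published: date, deposited: date) -> str:
    """Classify a paper by its deposit time lag.

    'non-compliant' has 100% precision (every paper in it truly missed
    the window); 'likely-compliant' has 100% recall (every truly
    compliant paper falls in it), mirroring the definitions in the text.
    """
    lag = (deposited - published).days
    return "non-compliant" if lag > WINDOW_DAYS else "likely-compliant"

print(compliance_category(date(2017, 1, 1), date(2017, 8, 1)))  # non-compliant
print(compliance_category(date(2017, 1, 1), date(2017, 2, 1)))  # likely-compliant
```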
As mentioned above, we use Crossref and CORE to collect the data for our analysis. More specifically, we use Crossref to obtain publication dates and ISSN numbers, and CORE to obtain deposit dates, repository names, and, for institutional repositories, also locations (specifically the country of the repository).

Additionally, to ensure correct deposit dates for older documents, we have applied the following procedure. CORE harvests documents from repositories using the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). The OAI-PMH metadata do not contain a deposit date field, but only a last update field. Thus, the last update field will contain the deposit date of an article up until the article's metadata is updated in the repository. The metadata also does not distinguish which version of the article is presented. In September 2018, CORE created infrastructure which allows it to store the first date it receives as the deposit date and any subsequent dates as dates of updates. To ensure correct deposit dates for documents deposited prior to September 2018, we created web scrapers for the following repositories: repositories using DSpace, EPrints, or Invenio software, and additional individual scrapers for ArXiv and Zenodo. The choice of repositories we created scrapers for was made based on a) the availability of deposit dates on the website and b) whether we were able to match a repository page URL to a specific OAI-PMH metadata record.

Furthermore, we used Mendeley to obtain information about publications' subjects using the profiles of those who read the publications. Mendeley is a reference manager that can be used to manage a research library and provides an API that can be queried to obtain information about how many people have added a certain publication to their libraries. When users create Mendeley accounts, they are asked about their fields of study.
We have used the information about how many users from each field of study have bookmarked a certain publication to categorise publications into subject categories. The details of how we did this are described in Section 4.5.

Based on the available data, for the analysis of the REF 2021 OA Policy we can assign each publication to one of the following compliance categories:

Non-compliant: a publication has been deposited into a repository and its first date of deposit is later than three months after its original date of publication. This category may not include all non-compliant publications, as some may fall into the "likely compliant" category below, depending on their actual date of acceptance. However, using this classification, we can be certain that all publications within the non-compliant category are indeed non-compliant, i.e. this category will have 100% precision but not 100% recall.

Likely compliant: a publication has been deposited into a repository and its deposit date is within a three-month period of its original publication date or earlier. This category may include some non-compliant publications, depending on the actual date of acceptance. However, given the way it is defined, we can be certain that all truly compliant publications will fall into this category, i.e. this category will have 100% recall but not 100% precision.

4 DATASET

We started by obtaining a complete data dump from Crossref and CORE. Our Crossref data dump was obtained in May 2018 and our CORE dump in March 2019 (the reason why our CORE dump was obtained later was to allow enough time for publications to be deposited and aggregated by CORE). We then filtered out all documents with a missing title, year of publication, or author names. Additionally, we filtered out any Crossref documents where the metadata contained only the year of publication but not the month of publication. If the day of publication, but not the year or month, was missing, we used the first day of the month as the day when the paper was published, e.g.
if we knew a paper was published in 2017-09, we replaced the date with 2017-09-01. Finally, we removed all documents from both datasets which were published prior to 2013. After this filtering we were left with 18,753,649 CORE articles and 15,832,311 Crossref articles.

Title, year of publication, and the last name of the first author were then used to merge the two datasets. As not all documents in CORE contain a DOI, we were unable to use DOIs to match documents between Crossref and CORE. On the other hand, title, author, and year information are available for most documents. Matching documents by title, year, and first author name is a strict approach which results in lower recall, because authors may not be listed in the correct order, different spelling or hyphenation of the titles and author names may be used, etc. However, this approach produces cleaner and more reliable data (the accuracy of this matching method is 95.27%; a more detailed analysis of the accuracy is provided in Section 4.1), and for the purposes of the analysis this was our aim.

Titles and author names were cleaned by removing all characters other than alphanumeric characters and underscores, and by converting all characters to lowercase. Additionally, we normalised the text by replacing accented characters and special characters appearing in non-English alphabets with their non-accented/English versions (e.g. by converting "François" to "Francois"). The data was then merged using an exact match on the title, year of publication, and last name of the first author. Because one article can be deposited in multiple repositories (for example if the authors of the article are affiliated with different institutions and all deposit the article in their respective repositories), we have additionally grouped all CORE articles that were matched to the same Crossref article into one record using the Crossref DOI. This grouping reduces the size of the dataset by about half a million records, and the merged and grouped dataset contains 1,589,469 rows. Finally, we have used our repository scrapers (Section 3.1) to obtain correct deposit dates. We were able to obtain deposit dates for 808,984 documents in our dataset. Table 1 shows the final dataset size.

Table 1: Dataset size.

Unique CORE articles: 948,044
Unique Crossref articles: 808,984
Links between Crossref & CORE: 985,175
Final dataset size (after grouping): 808,984

Table 2: Examples of differences between DOIs obtained from Crossref and from CORE.

Example 1
CORE DOI: 10.1002/2016jd026252/abstract
Crossref DOI: 10.1002/2016jd026252

Example 2
CORE DOI: 10.1088/0031-8949
Crossref DOI 1: 10.1088/0031-8949/2013/t156/014026
Crossref DOI 2: 10.1088/0031-8949/90/9/095101
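The cleaning and merge-key construction described above can be sketched as follows. This is an illustration under our own naming, not the released analysis code:

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Strip accents, lowercase, and keep only alphanumerics and
    underscores, mirroring the cleaning applied before the merge."""
    text = unicodedata.normalize("NFKD", text)
    # Drop combining marks so that e.g. "François" becomes "Francois".
    text = text.encode("ascii", "ignore").decode("ascii")
    return re.sub(r"[^0-9a-z_]", "", text.lower())

def match_key(title: str, year: int, first_author_last_name: str) -> tuple:
    """Key used for the exact-match merge of Crossref and CORE records."""
    return (normalise(title), year, normalise(first_author_last_name))

print(normalise("François"))  # francois
```

Records from both datasets sharing the same key are then treated as the same article.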
As the results of our analysis are impacted by the above-mentioned matching method, we need to be confident that the accuracy of the matching is high. To measure this accuracy, we compare DOIs between all pairs of matched documents. There are 985,175 document pairs in total (Table 1), out of which 354,897 don't have a DOI in CORE (36.02%). Of the remaining 630,278 that have a DOI both in Crossref and in CORE, 595,202 have exactly matching DOIs (94.43% of the 630 thousand pairs) and 35,076 have DOIs that do not match (5.57%).

We have investigated the non-matches and observed that they are often caused by minor differences which seem like errors introduced during the deposit in the repository. More specifically, DOIs obtained from CORE often have additional text appended at the end (Table 2, Example 1) while clearly referring to the same document. This is not the case for the opposite scenario, as CORE DOIs with missing characters can often match multiple Crossref DOIs (Table 2, Example 2). There are 5,264 DOI pairs (15.01% of the non-matching DOI pairs) where the Crossref DOI is a substring of the CORE DOI, i.e. the CORE DOI contains additional characters. If we consider these as correct matches, the accuracy of the matching method is 95.27%.

Given the 95.27% matching accuracy, we estimate that 338,110 document pairs which do not have a DOI in CORE were matched correctly. If we were to match documents by DOIs instead, we would have missed these. Furthermore, evaluating the accuracy of the method would have been more time consuming (it would require a manual check) and would likely be less precise.

Figure 2: Country distribution of publications in our dataset. The column labelled "n/a" represents publications deposited in repositories without a country code (e.g. ArXiv).
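The asymmetric DOI-validation rule from Section 4.1 (accept a CORE DOI with extra trailing characters, but not one with missing characters) can be sketched as follows; the function name is ours:

```python
def dois_match(core_doi: str, crossref_doi: str) -> bool:
    """Treat a pair of DOIs as matching when they are equal, or when the
    CORE DOI merely has extra characters appended to the Crossref DOI
    (Table 2, Example 1). The opposite direction is not accepted, since
    a truncated CORE DOI can match several Crossref DOIs (Example 2)."""
    return core_doi == crossref_doi or core_doi.startswith(crossref_doi)

print(dois_match("10.1002/2016jd026252/abstract", "10.1002/2016jd026252"))  # True
print(dois_match("10.1088/0031-8949", "10.1088/0031-8949/2013/t156/014026"))  # False
```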
We are interested in studying the differences in deposit time lag at different institutions. However, Crossref only contains affiliation information for a small subset of the publications in our dataset – 129,405 (~16%) documents have affiliation information for at least one author. Therefore, as an approximation, we use information about publications' repositories instead, i.e. we assume authors deposit publications into repositories of institutions they are affiliated with.

There are 728 unique repositories in the dataset; each publication was deposited into 1.16 repositories on average, and the largest number of repositories per publication is 31. On the other hand, there are on average 1,286 publications per repository, while 315 repositories contain fewer than 100 publications and 255 fewer than 50. Appendix A, Table 3 presents the ten largest repositories.
To assign publications to countries we use information about repository locations. Figure 2 shows the distribution of publications per country for the top 20 countries. Publications affiliated with multiple countries are represented as a full publication for each country (instead of counting only the relevant fraction of the publication).

There are several possible reasons why a large number of publications in our dataset are from the UK. Firstly, the UK had a leading role in the adoption and implementation of repositories compared to other countries. Furthermore, depositing into a repository is now a requirement included in the REF 2021 OA Policy.
In all experiments we use the date of publication we obtained from Crossref instead of the date of publication from CORE, as Crossref metadata typically contains more detailed information (e.g. year, month, and day vs. just year). Figure 3 shows the age of publications in our dataset.

As part of our study we are interested in analysing deposit time lag in the UK with regard to the UK OA policy. To understand how many publications in our dataset are from the UK, we distinguish them in the figure by colour – blue represents UK publications, while green represents all other publications. The drop in publication count in 2018 is due to us not having data for the complete year (we collected data from Crossref in May 2018). The drop in 2017 is likely caused by late deposits – it is possible that some publications from 2017 had not been deposited yet due to looser policy requirements, authors forgetting to deposit, publisher embargoes, etc.

Figure 3: Age of publications in our dataset. Publications with at least one author affiliated with a UK institution are shown in blue, while publications without a UK-based author (labelled "rest of the world" – RoW – in the figure) are shown in green.

Figure 4: Subject distribution of publications in our dataset.
Figure 4 shows subject distribution of publications in our dataset.For publications with multiple subjects we only counted the rele-vant proportion towards each subject. For example, a publicationassigned to two subjects is counted as 0.5 towards each subject.The subjects were obtained from Mendeley in the following way.We used Crossref DOIs to query the Mendeley API to obtain themetadata Mendeley stores for each article. This metadata containsinformation about how many readers from each of Mendeley’s 28subjects saved each article in their Mendeley library. Each articlewas then tagged with the subject in which it accumulated the mostreaders – e.g. if an article was read by 20 people in “Medicine andDentistry” and by 5 people in “Immunology”, we would tag thearticle with the subject “Medicine and Dentistry”. In case multiplesubjects had the same number of readers the article was tagged with https://dev.mendeley.com/ A B C D n/aREF 2021 panel025,00050,000 o f pape r s Figure 5: Distribution of UK publications in our dataset intothe four main REF 2021 assessment panels. all of those subjects. According to [7], reader counts in Mendeleytend to be skewed towards certain disciplines. The obtained subjectsare therefore only an approximation.We were able to obtain Mendeley metadata for 664,277 publica-tions (~82%). There are 19 readers per publication on average. Usingour subject tagging method described above, 86,731 documentswere tagged with multiple subjects (~11%). Out of those, 65,419were tagged with two subjects (75%) and 15,390 with three subjects(18%), while the rest (5,922, or 7%) was tagged with between fourand ten subjects. 
While these numbers are lower than existing estimates of the proportion of interdisciplinary research [28], this could be due to our tagging method.

Additionally, we manually assigned each of the Mendeley subject categories to one of the four REF 2021 Main Assessment Panels. These panels are "A: Medicine, health and life sciences", "B: Physical sciences, engineering and mathematics", "C: Social sciences", and "D: Arts and humanities". The mapping between Mendeley subjects and REF 2021 panels we used is shown in Appendix A, Table 4. Figure 5 shows the distribution of UK publications in our dataset between the four REF 2021 assessment panels.

Crossref metadata contains an accepted field which, according to the Crossref API documentation (https://github.com/Crossref/rest-api-doc/blob/master/api_format.md), contains the "date on which a work was accepted, after being submitted, during a submission process". We have analysed this field for the 800 thousand articles in our dataset. However, we found only 975 articles with the date of acceptance populated. Additionally, for 684 of these (70%) this date was the same as the date of publication, and for 272 (28%) the date of acceptance was later than the date of publication, showing that the date of acceptance in Crossref is in 99.9% of cases not available and in 98% of cases where it is available, it is incorrect. Therefore, we do not use this date in further analysis.

As the REF 2021 Open Access Policy applies only to publications with an ISSN, we have included Crossref ISSN numbers in our dataset. We found that 55,014 publications do not have an ISSN number; 12,463 of those are from a UK institution. In our analysis of compliance with the REF 2021 OA Policy we have excluded these 12 thousand publications, as the policy does not apply to them.

Do Authors Deposit on Time? Tracking Open Access Policy Compliance. JCDL 2019, June 2019, Urbana-Champaign, IL, USA

Figure 6: Overall deposit time lag for five countries with the most publications in our dataset. Each bar in the histogram represents one week. The vertical red lines represent 3 months after the date of publication.
To calculate deposit time lag for publications in our dataset, we subtracted dates of publication from deposit dates and expressed the difference in days. As a result, negative values mean an article was deposited before being published and positive values mean it was deposited after being published. A histogram of deposit time lag for all publications in our dataset is shown in Appendix B, Figure 14.
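This subtraction can be expressed as a minimal sketch, assuming Python date objects for both dates:

```python
from datetime import date

def deposit_time_lag(published: date, deposited: date) -> int:
    """Deposit time lag in days: negative if the article was deposited
    before publication, positive if deposited after."""
    return (deposited - published).days
```

For example, an article published on 1 March 2016 and deposited on 20 February 2016 has a lag of -10 days.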
Figure 6 reveals significant differences in deposit time lag between the five countries with the highest number of publications in our dataset. UK publications appear to have the shortest deposit time lag of all five countries, with a large number of articles deposited before or at the time of publication. US publications display a similar pattern; however, deposit time lag in the US peaks a few weeks after publication. On the other hand, Italy, Switzerland, and the Netherlands show a long-tail distribution where deposits peak at the time of publication but decrease more slowly than in the case of the UK and the US. Furthermore, a large proportion of publications from these countries is deposited with long delays.

Next, we wanted to compare how deposit time lag in these countries has changed over time. One way of doing this is by using all data available to us to calculate the average deposit time lag per country and year. This approach has limitations, which we will illustrate in the following example. Consider the deposit dates present in our dataset for articles published in 2013 and in 2017. While articles published in 2013 had just over six years during which they could have been deposited in a repository (our dataset goes until early 2019), publications from 2017 had, in contrast, a much shorter time to appear in a repository. It is possible some publications from both years have not been deposited yet, but this is more likely for publications from 2017. This affects yearly deposit time lag in a way which slightly underestimates (decreases) deposit time lag for all publication years, but especially for newer publications.

Another option is to use a maximum limit on deposit time lag and filter out all publications which were deposited later than within a specified time frame. To give an example, consider limiting deposit time lag to one year. In this case, only publications from 2013 that were deposited within a year of their publication date (but none of the publications deposited later) would be compared to the same set from 2017. This affects yearly deposit time lag in a way which slightly underestimates (decreases) deposit time lag for all years, but especially for older publications, due to late deposits becoming less common over time.

As we are not aware of a better way to compare deposit time lag across years that would alleviate the limitations of both of the above-mentioned approaches at the same time, we use both approaches in conjunction.

Figure 7: Average deposit time lag per year for five countries with the most publications in our dataset. Figure was created using all available data.

Figure 8: Average deposit time lag per year for five countries with the most publications in our dataset. Figure was created by filtering out all publications which were deposited after a year of being published.

Figures 7 and 8 show average deposit time lag per year and country. In the case of Figure 7, the deposit time lag was calculated using all available data, while in the case of Figure 8 it was calculated using a one-year maximum deposit time lag limit. In the case of Figure 8, the year 2018 was excluded as we do not have a complete year of data for it. An additional figure created by applying a maximum deposit time lag limit of two years is shown in Appendix B, Figure 15.

The figures reveal several interesting trends. Since 2016, the deposit time lag of UK publications is the lowest of all five countries and is negative in Figure 7 in 2018 (-3.69 days). In fact, this has not always been the case and, when considering all data including late deposits (Figure 7), the UK was fourth of the selected five countries in 2013 and 2014. Interestingly, this change in average deposit time lag in the UK coincides with the introduction of the REF 2021 OA Policy in 2014.
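The two averaging approaches, all available data versus a capped deposit window, can be sketched with one helper. This is illustrative only; records are assumed to be (publication_year, lag_in_days) pairs.

```python
from statistics import mean

def yearly_mean_lag(records, max_lag_days=None):
    """Mean deposit time lag per publication year. If max_lag_days is
    given, publications deposited later than the cap are filtered out,
    mirroring the capped variant used for Figure 8."""
    by_year = {}
    for year, lag in records:
        if max_lag_days is not None and lag > max_lag_days:
            continue  # drop late deposits under the capped variant
        by_year.setdefault(year, []).append(lag)
    return {y: mean(lags) for y, lags in sorted(by_year.items())}
```

Calling the helper twice, once without and once with a cap, yields the two views compared in Figures 7 and 8.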
When considering only publications deposited within a year (Figure 8), the UK started as the first of the selected five countries; however, its average deposit time lag increased in 2014. A possible explanation is the introduction of the REF 2021 OA policy: researchers started shifting their deposit habits to comply with the policy and, as a result, deposited more often, but it took time for this shift to become common practice.

There has been a decreasing trend in deposit time lag for all countries, particularly since 2016. Italy has seen the largest decrease in average deposit time lag, from 706 days in 2013 to 48 days in 2018 in the case of Figure 7, and from 244 in 2013 to 86 in 2017 in the case of Figure 8. In 2013, the Italian government passed legislation requiring all research in which at least 50% of funding was public to be made OA [16]. While we are not aware of any specific deposit time frames associated with this requirement, it is possible it affected deposit practice.

Finally, we analyse deposit time lag with respect to the UK REF 2021 Open Access Policy. To do this, we assign each UK publication to one of the two compliance categories described in Section 3.2: "definitely non-compliant" (publications with a deposit time lag of more than 90 days) and "likely compliant" (publications with a deposit time lag of 90 days or less). The proportion of publications belonging to each category per year is shown in Figure 9.

Figure 9: Proportion of non-compliant and potentially compliant UK publications per year.

The figure shows that prior to the REF 2021 OA Policy taking effect in 2016, more than 50% of publications each year were deposited later than three months after the date of publication. However, the situation changed after the policy took effect in April 2016. In 2017, 80% of papers were made available in an OA repository within three months of the date of publication, or even earlier. While we do not yet have complete data for 2018 (our sample contains data until May 2018), we can observe that compliance is still increasing.
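The two compliance categories can be expressed as a simple classifier. This is a sketch; the 90-day limit is an approximation of the policy's three-month deposit window.

```python
def ref_compliance_category(lag_days: int, limit: int = 90) -> str:
    """Classify a publication by its deposit time lag: deposits within
    ~three months (90 days) of publication are 'likely compliant',
    later deposits are 'definitely non-compliant'."""
    if lag_days <= limit:
        return "likely compliant"
    return "definitely non-compliant"
```

Note that negative lags (deposit before publication) fall into the "likely compliant" category by construction.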
Our next question is whether there is a difference between the deposit time lag of different repositories and how this has changed over time. Figure 10 shows deposit time lag per year for all repositories with more than 100 publications in a given year. To produce this figure, we have calculated the following two statistics for each repository:

(1) Single repository deposit time lag. Deposit time lag with respect to the publication's deposit date in a given repository. In this case, we do not take into account that a publication may have been deposited into multiple repositories. For example, if a publication was deposited into the University of Cambridge repository, we only consider the date of deposit into this repository.

(2) Any repository deposit time lag. Deposit time lag calculated with respect to the publication's deposit date in any repository. For example, if a publication was deposited into the University of Cambridge repository as well as elsewhere, we simply use the first of the two dates to calculate deposit time lag.

Figure 10: Deposit time lag per repository and year. The full lines show "single repository deposit time lag" and the dashed lines show "any repository deposit time lag".

To produce the full lines in Figure 10, we have sorted the repositories according to their "single repository deposit time lag" values from the lowest to the highest. The dashed lines were produced the same way, but using the "any repository deposit time lag" values. The figure reveals significant differences between repositories, which have reduced over time, but remain high. For 2013 publications, the difference between the repositories with the lowest and the highest "single repository deposit time lag" was 1,982 days, and the standard deviation across all repositories was 377 days. In 2017, these numbers dropped to 991 days and a standard deviation of 108 days. The figure also reveals that by aggregating data from all repositories, the deposit time lag can be lowered.

We have produced a similar figure for UK repositories showing the proportion of "likely compliant" publications per repository. Similarly to Figure 10, Figure 11 was produced by calculating two statistics for each repository: single repository compliance (full lines), i.e. the proportion of likely compliant publications when considering deposits only in a single repository, and any repository compliance (dashed lines), i.e. the proportion of likely compliant publications with respect to their deposit date in any repository. In both cases, the repositories were sorted from the most to the least compliant. It can be seen that repository compliance has increased rapidly from 2014 onward, particularly between 2015 and 2016. As the UK REF 2021 OA Policy was introduced in 2014, it may be one of the reasons for this increase. The figure also shows that aggregating research outputs from multiple repositories may help improve repository compliance.
Figure 11: Proportion of likely compliant publications per repository and year. The full and dashed lines show "single" and "any" repository compliance, respectively.

Finally, we investigated whether there were any differences in deposit time lag between different subjects. Figure 12 shows average deposit time lag per subject in 2013 and 2017. To produce this figure we have removed a single subject (Decision Sciences) with fewer than 100 publications in one year. The figure shows that while there were significant differences between subjects in 2013, these had largely diminished by 2017. The figure also reveals smaller differences between subjects than the differences observed between repositories shown in Figure 10. In 2013, the difference between the highest and the lowest average deposit time lag per subject was 532 days and the standard deviation across all subjects was 107 days. In 2017, the range was 295 days and the standard deviation was 57 days. On the other hand, as we have shown in Section 5.2, the range and standard deviation across all repositories were 1,982 and 377 days in 2013, and 991 and 108 days in 2017. If we consider only publications from a single subject, the differences between repositories remain high. For example, using only publications from "Physics and Astronomy" (our largest subject), the range and standard deviation were 1,787 and 370 days in 2013, and 940 and 174 days in 2017. The situation is similar for other subjects. This suggests institutional policies, particularly when harmonised with funder policies, may be stronger drivers of OA than disciplinary culture.

Figure 12: Average deposit time lag per subject in 2013 and 2017. The bars in the figure are not stacked but instead placed on top of each other, i.e. the bars of both years have the same baseline of zero.

Finally, Figure 13 shows the proportion of likely compliant and non-compliant publications across the four main REF 2021 assessment panels (Section 4.5) in 2013 and in 2017. The figure shows there has been a significant increase in compliance over the five-year period, which has been similar across all four panels.

Figure 13: Proportion of likely compliant and non-compliant publications per each of the main REF 2021 assessment panels.
Our findings indicate that deposit time lag has been decreasing globally. However, we have observed major differences in deposit time lag across institutions and significant differences between subjects. Furthermore, we have shown that deposit time lag has been shortening over the last 5 years both globally and in the UK. Our results suggest that the REF 2021 OA Policy likely helped to reduce deposit time lag. The results outlined in this paper present a preliminary study of deposit time lag and compliance with existing OA policies. There are many areas where this study could be enhanced and broadened.

The matching of articles between Crossref and CORE was done by means of the articles' metadata (titles, years of publication, and first author names). This is a strict approach that may result in lower recall due to minor differences in metadata, such as listing authors in an incorrect order, typos, differences in punctuation, etc. While our present study has been precision oriented, i.e. our aim was to produce as clean data as possible, in the future we would like to improve our recall. This would also allow us to study deposit rates, i.e. the proportion of articles that get deposited into OA repositories compared to articles that do not, in addition to deposit time lag. Improving our recall could be done in a number of ways. For example, in addition to the metadata we already use for the matching, we could utilise all other metadata available to us, such as abstracts, and employ looser matching techniques such as those used in article deduplication [9].

For this initial study we make the assumption that if a metadata record is in the repository, the full text is also deposited. This is because validating whether the full text is deposited is a complicated process which is outside of the scope of this work.
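The strict metadata matching described above could be sketched as follows. This is an illustrative approximation, not the authors' exact implementation; the record fields title, year, and first_author are assumed names.

```python
import re
import unicodedata

def normalise(text: str) -> str:
    """Lower-case, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", text)
    text = "".join(c for c in text if not unicodedata.combining(c))
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def records_match(a: dict, b: dict) -> bool:
    """Strict, precision-oriented match on normalised title,
    publication year, and first author name."""
    return (normalise(a["title"]) == normalise(b["title"])
            and a["year"] == b["year"]
            and normalise(a["first_author"]) == normalise(b["first_author"]))
```

Normalisation absorbs superficial differences (case, accents, punctuation), but any substantive metadata discrepancy still causes a miss, which is why this approach trades recall for precision.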
The OAI-PMH protocol does not guarantee that a link to the publication full text will be in the metadata even if the full text was deposited into the repository. To check if an article's full text was deposited, we would have to crawl all links provided in the OAI-PMH metadata and correctly match the identified documents to the publication metadata. Therefore, as our present study focuses on deposit time lag rather than the presence of the full text, we decided not to perform this check.

As our analysis relies on deposit dates, publications that have never been deposited into a repository are not included in our study. Consequently, this means that the proportion of publications that are potentially compliant with the REF 2021 OA Policy is compared against non-compliant but deposited publications, rather than all publications. To quantify missing deposits, we would have to be able to correctly match all CORE publications to their Crossref metadata. This is out of the scope of our study, as its focus is on deposit time lag rather than the analysis of the proportion of missing deposits. However, to allow for as many publications as possible to be included in our study, we collected deposit dates almost a year (in March 2019) after collecting publication metadata (May 2018).
The aim of this study was to investigate how much time it takes for authors to deposit their articles in OA repositories in relation to when these articles get published. Furthermore, our goal was to investigate if OA policies might have reduced this time, and if compliance with such policies can be effectively tracked. We collected dates of publication and deposit dates for 800 thousand articles published around the world between 2013 and 2018, and compared the difference between these dates across time, country, subject, and repository.

We have shown that the time between publication and deposit has decreased significantly over the 2013-2017 period globally, by 472 days per country on average across all countries in our dataset. We have also shown that after the introduction of the UK REF 2021 OA Policy, this decrease has accelerated in the UK, and in 2018 the mean difference between publication and deposit dates became negative (-3.69 days), meaning that, as of early 2018, on average, UK publications potentially become OA immediately or even slightly before publication. The key message of our paper is that this observation supports the argument for the inclusion of a strictly time-limited deposit requirement in OA policies. Furthermore, our work demonstrates that countries which now have a time frame on deposits included in their OA policies can develop reliable tracking mechanisms for monitoring the effects of such policies.

Based on the presented methodology, we have developed a tool for tracking the time lag between article publication and deposit which relies on data from thousands of repositories. We hope the tool will be useful to authors, funders, and institutions who intend to improve the accessibility of research and improve compliance with existing OA policies. To support further studies on the deposit of research outputs in OA repositories, we release our dataset of 800 thousand publications and the source code of our analysis (http://github.com/oacore/jcdl_2019).
REFERENCES
[1] Eric Archambault, Didier Amyot, Philippe Deschamps, Aurore Nicol, Francoise Provencher, Lise Rebout, and Guillaume Roberge. 2014. Proportion of open access papers published in peer-reviewed journals at the European and world levels–1996-2013. Report, European Commission DG Research & Innovation (2014).
[2] Eric Archambault, Didier Amyot, Philippe Deschamps, Aurore Nicol, Lise Rebout, and Guillaume Roberge. 2013. Proportion of open access peer-reviewed papers at the European and world levels–2004-2011. Report, European Commission DG Research & Innovation (Aug 2013).
[3] Bo-Christer Björk, Patrik Welling, Mikael Laakso, Peter Majlender, Turid Hedlund, and Gudni Gudnason. 2010. Open access to the scientific journal literature: situation 2009. PloS one.
[5] European Commission. 2018. 'Plan S' and 'cOAlition S' – Accelerating the transition to full and immediate Open Access to scientific publications. https://europa.eu/!hw84rX. Accessed: 2018-11-20.
[6] Yassine Gargouri, Vincent Larivière, Yves Gingras, Les Carr, and Stevan Harnad. 2012. Green and Gold Open Access Percentages and Growth, by Discipline. arXiv e-prints (Jun 2012). Preprint, https://arxiv.org/abs/1206.3664.
[7] Robin Haunschild and Lutz Bornmann. 2016. Normalization of Mendeley reader counts for impact assessment. Journal of Informetrics.
Database
Insights 27, 1 (2014). https://doi.org/10.1629/2048-7754.115
[11] Madian Khabsa and C. Lee Giles. 2014. The number of scholarly documents on the public web. PloS one 9, 5 (2014), e93949.
[12] Shaun Yon-Seng Khoo and Belinda Po Pyn Lay. 2018. A very long embargo: Journal choice reveals active non-compliance with funder open access policies by Australian and Canadian neuroscientists. Liber Quarterly 28, 1 (2018).
[13] Petr Knoth and Zdenek Zdrahal. 2012. CORE: Three Access Levels to Underpin Open Access. D-Lib Magazine 18, 11/12 (Nov 2012). https://doi.org/10.1045/november2012-knoth
[14] Vincent Larivière and Cassidy R. Sugimoto. 2018. Do authors comply when funders enforce open access to research? Nature.
PeerJ
The Guardian (24 Apr 2012). Accessed: 2018-11-19.
[22] Stuart Shieber. 2013. Why open access is better for scholarly societies. https://blogs.harvard.edu/pamphlet/2013/01/29/why-open-access-is-better-for-scholarly-societies/. Accessed: 2018-11-19.
[23] Peter Suber. 2003. The taxpayer argument for open access. SPARC Open Access Newsletter (4 Sep 2003). http://nrs.harvard.edu/urn-3:HUL.InstRepos:4725013
[24] Alma Swan. 2014. HEFCE announces Open Access policy for the next REF in the UK: Why this Open Access policy will be a game-changer. Impact of Social Sciences Blog.
Nature
Journal of the Association for Information Science and Technology 67, 11 (2016), 2815–2828.
[30] Jingfeng Xia, Sarah B. Gilchrist, Nathaniel X. P. Smith, Justin A. Kingery, Jennifer R. Radecki, Marcia L. Wilhelm, Keith C. Harrison, Michael L. Ashby, and Alyson J. Mahn. 2012. A review of open access self-archiving mandate policies. portal: Libraries and the Academy 12, 1 (2012), 85–102.
A DATA PREPARATION AND STATISTICS
Table 3: The ten largest repositories in our dataset.
Name                                          Publications
ArXiv e-Print Archive                         97,594
White Rose Research Online                    24,019
ZORA                                          20,617
Utrecht University Repository                 20,304
Enlighten                                     19,267
Radboud Repository                            17,837
ZENODO                                        17,100
Università di Roma La Sapienza Repository     14,795
Online Research @ Cardiff                     14,261
Università di Padova Repository               14,077
Table 4: Mapping of Mendeley subjects to REF 2021 Main Panels.
Mendeley subject                                        REF Main Panel
Agricultural and Biological Sciences                    A
Arts and Humanities                                     D
Biochemistry, Genetics and Molecular Biology            A
Business, Management and Accounting                     C
Chemical Engineering                                    B
Chemistry                                               B
Computer Science                                        B
Decision Sciences                                       C
Design                                                  D
Earth and Planetary Sciences                            B
Economics, Econometrics and Finance                     C
Energy                                                  B
Engineering                                             B
Environmental Science                                   B
Immunology and Microbiology                             A
Linguistics                                             D
Materials Science                                       B
Mathematics                                             B
Medicine and Dentistry                                  A
Neuroscience                                            A
Nursing and Health Professions                          A
Pharmacology, Toxicology and Pharmaceutical Science     A
Philosophy                                              D
Physics and Astronomy                                   B
Psychology                                              A
Social Sciences                                         C
Sports and Recreations                                  C
Unspecified                                             n/a
Veterinary Science and Veterinary Medicine              A

B ADDITIONAL FIGURES

Figure 14: Deposit time lag in days for all publications in our dataset. The histogram was created by aggregating 30 days at a time, i.e. each bar represents one month. The y-axis is logarithmic and the vertical red line represents 3 months after the date of publication.

Figure 15: Average deposit time lag per year for five countries with the most publications in our dataset. Figure was created by filtering out all publications which were deposited later than within two years of being published.