Large coverage fluctuations in Google Scholar: a case study
Alberto Martín-Martín and Emilio Delgado López-Cózar
Facultad de Comunicación y Documentación, Universidad de Granada (Spain)
Abstract
Unlike other academic bibliographic databases, Google Scholar intentionally operates in a way that does not maintain coverage stability: documents that stop being available to Google Scholar's crawlers are removed from the system. This can also affect Google Scholar's citation graph (citation counts can decrease). Furthermore, because Google Scholar is not transparent about its coverage, the only way to directly observe coverage loss is through regular monitoring of Google Scholar data. Because of this, few studies have empirically documented this phenomenon. This study analyses a large decrease in coverage of documents in the field of Astronomy and Astrophysics that took place in 2019 and its subsequent recovery, using longitudinal data from previous analyses and a new dataset extracted in 2020. Documents from most of the larger publishers in the field disappeared from Google Scholar despite continuing to be available on the Web, which suggests an error on Google Scholar's side. Disappeared documents did not reappear until the following index-wide update, many months after the problem was discovered. The slowness with which Google Scholar is currently able to resolve indexing errors is a clear limitation of the platform both for literature search and bibliometric use cases.
Introduction
Academic bibliographic databases, and especially those that generate citation graphs, usually implement document inclusion policies that rarely allow records of documents to be removed once they have entered the system. In a bibliographic database that is intended for literature discovery, coverage stability is a desirable property, if we assume that users intuitively expect a system to retrieve the same documents over time given the same query (in addition to new documents that also meet the search criteria). This property is especially critical for some literature search use cases such as those carried out for systematic reviews, where reproducibility of the process is essential (Gusenbauer & Haddaway, 2020; Haddaway & Gusenbauer, 2020). In a citation index, the disappearance of a document would affect the citation counts of all its cited documents, impeding some types of citation analysis.
Google Scholar continues to be widely used for literature discovery, and sometimes as a data source for research evaluations. Some reasons for this are its comprehensive coverage, and that it is free to access. However, unlike other analogous tools, Google Scholar intentionally operates in a way that does not maintain coverage stability (Delgado López-Cózar et al., 2019). Instead, Google Scholar mirrors the general-purpose Google search engine, and subordinates the continued inclusion of documents in its index to their ongoing availability on the Web (as well as their continued compliance with Google Scholar’s technical guidelines). Its documentation declares that this approach was chosen to provide a current reflection of the academic web at any given time (Google Scholar, n.d.-b).
As a result of this policy, coverage in Google Scholar not only increases when new documents are indexed, but can also decrease when indexed documents become unavailable to Google Scholar’s crawlers. To learn whether documents have stopped being available on the Web, Google Scholar carries out two complete recrawls of its index approximately every year (Google Scholar, n.d.-a). If Google Scholar’s crawlers are not able to access a document during one of these recrawls, it is removed from the index.
In some cases, documents that have been removed are still findable in Google Scholar, because they are available from other sources on the Web which Google Scholar also indexes, or because they are cited in other works indexed in the system (in these cases, the document is kept as a [CITATION]-type record). However, in other cases, and usually after one of these major index recrawls, documents stop being findable in Google Scholar altogether, with the added consequence that citation counts of the documents that they cite also suffer a decrease (provided that cited reference metadata was available for the removed documents).
If we consider Google Scholar merely as a gateway, a digital non-place (Augé, 1995) through which users navigate to the places where academic documents can be accessed, rather than as a data source in its own right that could serve as a record of the inherently cumulative academic knowledge and the interactions that occur between academics, then the decision to not display information about documents that are no longer accessible is understandable (if an airport loses a flight route, it does not make sense for it to keep displaying information about it). Furthermore, under this point of view, decreasing citation counts may not be an overly concerning issue, as their purpose would only be to serve as one of the parameters that is used to rank documents in a search, a purpose for which a certain amount of inaccuracy in citation counts can probably be tolerated.
However, Google Scholar is perceived as a bibliographic data source by many users, which makes decreasing coverage problematic. For example, researchers sometimes report author-level indicators calculated by Google Scholar in research evaluation processes (hiring, promotion, grant applications…), and seeing these figures decrease overnight and without explanation can be a cause of concern and confusion to them.
It is probably the number of questions about this phenomenon that has led Google Scholar to include a clarification in its help pages about why this occurs (Figure 1).
Figure 1: Extract from Google Scholar help page that explains why citation counts of documents sometimes decrease in this platform
The issue of decreasing coverage in Google Scholar is compounded by the fact that there is no public information on the sources that are covered by this search engine. Therefore, users have no way of knowing when certain collections of documents are removed from the index, opening the possibility of situations in which users may decide to rely on this platform based on assumptions regarding its coverage that no longer hold true.
This lack of transparency means that the only way to directly observe coverage loss is through regular monitoring of Google Scholar data: recording states of the index at regular intervals. This makes this phenomenon difficult to analyse, as extracting data from Google Scholar is very time-consuming (Else, 2018), and it is difficult to anticipate which documents are going to be dropped by Google Scholar and when. Because of this, few studies have empirically documented this issue.
In November 2017, while carrying out an annual update of a directory of Spanish academic journals covered by Google Scholar Metrics (an annual product released by Google Scholar that calculates a 5-year h-index for journals), Delgado López-Cózar & Martín-Martín (2018) noticed that the number of available Spanish journals in this product had sharply decreased compared to previous years, breaking the general growing trend observed since 2012: while the 2016 edition of the directory contained 1,101 Spanish journals, in the 2017 edition only 599 journals could be found (Delgado López-Cózar & Martín-Martín, 2019). Journals from all fields disappeared, but Law journals were particularly affected, going from 156 journals in 2016 to 35 journals in 2017. Because Spanish Law journals do not yet have a strong presence on the Web, we turned our attention to the largest bibliographic database that focuses on academic content published in Spain: Dialnet. This database is still the only window through which many journals published in Spain are visible on the Web, and therefore was considered likely to be involved in this blackout of Spanish scientific production in Google Scholar.
After searching content available from Dialnet in Google Scholar and comparing the results to previous web domain analyses that we had carried out in the past (Delgado López-Cózar et al., 2019), it was confirmed that most of the content from Dialnet had disappeared at some point from the search engine. This analysis was published in the Spanish LIS-focused mailing list Iwetel, where Dialnet’s Technical Director confirmed that they had been aware of this issue since June 2017. Apparently Google Scholar had detected that the metadata of a small batch of old records from Dialnet were inconsistent with metadata for the same documents found in other web sources, and decided to remove most of Dialnet from its index under suspicion of providing incorrect metadata. Dialnet promptly fixed this issue, but its records did not return to Google Scholar results until the following complete recrawl of the index was made public in January 2018. Thus, users interested in this content and who knew it to be covered in the past were unknowingly underserviced by Google Scholar for more than half a year. In this case, the issue was more difficult to detect because Dialnet did not contain cited reference metadata that Google Scholar could access and use in its citation graph, and therefore citation counts were not affected. This episode revealed that despite its known errors and limitations (Orduna-Malea et al., 2017), Google Scholar subjects its data to certain quality-control measures. This is, to our knowledge, the first empirically documented case of a large coverage fluctuation in Google Scholar.
In 2019, the editor of the journal Astronomy & Astrophysics denounced an apparently similar case (Forveille, 2019). In March 2019 the journal was notified by several researchers that citation counts to documents of this journal had decreased “by an order of magnitude” in their personal Google Scholar profiles. The editor contacted Google Scholar, who acknowledged the error and promised to remedy it. However, this would not be visible in the platform until the next complete recrawl of the index. This resulted in a sharp decrease of the h-index of this journal in the 2019 edition of Google Scholar Metrics, which was computed after Google Scholar was made aware of the issue, but using the yet uncorrected index: while the 2018 edition displayed an h5-index of 115 for this journal, in the 2019 edition this figure dropped to 52. Given the inherent resistance of a high h-index to small random changes in the underlying citation counts (probably the reason why Google Scholar favors this indicator), this signaled a very significant drop in coverage of documents in this field.
The large drop in Astronomy & Astrophysics documents in Google Scholar was also noticed by Martín-Martín et al. (2018, 2021). Analysing a collection of citations to a sample of highly-cited documents from all subject areas collected in 2018 to compare relative differences in coverage in Google Scholar, Scopus, and Web of Science, they found that Google Scholar was able to find 98% of the citations found by Web of Science, and 97% of the citations found by Scopus. Additionally, 30% of all citations were only found in Google Scholar. In 2019 the data was collected again using the same sample of highly-cited documents, but the citations to Astronomy & Astrophysics documents obtained from Google Scholar had radically changed (unlike in other subject categories, where relative differences among data sources remained mostly the same as in 2018): Google Scholar was only able to find 60% of all Web of Science citations, and 60% of all Scopus citations.
Since the citation data extracted from Web of Science and Scopus in 2019 contained the same citations that were extracted in 2018, plus the new citations included in these systems between the two points of extraction, this large difference also signaled a significant drop in coverage in Google Scholar.
The datasets extracted from Google Scholar for Martín-Martín et al. (2018, 2021) provide us with an opportunity to analyse this case in more detail. Therefore, the goal of this study is to document this case, to try to find out the cause of this sudden drop in coverage of documents in the field of Astronomy & Astrophysics, and to check whether this issue was resolved in subsequent recrawls of Google Scholar’s index.
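The argument that a fall in the h5-index from 115 to 52 cannot be explained by small random changes in citation counts can be illustrated with a minimal sketch. The function below follows the textbook definition of the h-index; it is not Google Scholar Metrics' actual implementation, and the citation counts are invented purely for illustration.

```python
def h_index(citation_counts):
    """Largest h such that at least h documents have h or more citations each."""
    ranked = sorted(citation_counts, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Hypothetical citation counts for ten documents (not taken from the study).
before = [120, 90, 60, 45, 30, 12, 8, 5, 3, 1]

# Small perturbation: every document loses a single citation.
small_noise = [count - 1 for count in before]

# Large loss: every count falls by roughly an order of magnitude.
large_loss = [count // 10 for count in before]

print(h_index(before), h_index(small_noise), h_index(large_loss))
```

In this toy example the h-index is unchanged by the small perturbation and only falls when citation counts collapse across the board, which is the kind of behaviour the argument above relies on.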
Methods
The datasets extracted from Google Scholar and analysed in Martín-Martín et al. (2018, 2021) were used. These datasets contain the lists of citing documents to a sample of 2,515 highly-cited documents from 2006 that Google Scholar released in 2017 with the name Google Scholar Classic Papers (GSCP). In this product, Google Scholar displayed the top 10 most cited documents published in 2006 in each of 252 subject categories. For more information about this product, see Orduna-Malea et al. (2018).
Since in this study we are particularly interested in the coverage of Astronomy & Astrophysics documents in Google Scholar, only the 10 highly-cited documents in this field in GSCP (Table 1), and the list of citing documents in Google Scholar for each of these documents, are analysed here.
Table 1: Highly-cited documents in Astronomy & Astrophysics in Google Scholar Classic Papers
The list of documents that cite each document in Table 1 was extracted from Google Scholar on three different occasions: April-May 2018, May-June 2019, and April 2020. To do this, a custom scraper was used (Martín-Martín, 2018).
The metadata extracted from Google Scholar for each of these citing documents was enriched by complementing it with metadata available in the HTML meta tags of the webpages where Google Scholar found these documents, and the metadata available in CrossRef’s and arXiv’s public APIs. A DOI was found for 79% of the citations in the 2018 dataset, 76% of the citations in the 2019 dataset, and 77% of the citations in the 2020 dataset.
Citations across the three datasets were matched based on (in this order) Google Scholar’s internal document identification codes, the URLs of the webpages where Google Scholar found the citing documents, the DOIs of the citing documents, and a combination of title and author similarity, in a similar way as described in Martín-Martín et al. (2018, 2021):
1. For each pair of datasets A and B and a seed highly-cited document X, all citing documents with a document id (Google Scholar ID, URL, or DOI) that cite X according to A were matched to all citing documents with a corresponding document id that cite X according to B.
2. For each of the unmatched documents citing X in A and B, a further comparison was carried out. The title of each unmatched document citing X in A was compared to the titles of all the unmatched documents citing X in B, using the restricted Damerau-Levenshtein distance (optimal string alignment) (Damerau, 1964; Levenshtein, 1966). The pair of citing documents which returned the highest title similarity (1 is perfect similarity) was selected as a potential match. This match was considered successful if either of the following conservative heuristics was met:
   1. The title similarity was at least 0.8, and the title of the citing document was at least 30 characters long (to avoid matches between short, undescriptive titles such as “Introduction”).
   2. The title similarity was at least 0.7, and the first author of the citing document was the same in A and B.
First, citations in the 2019 dataset were matched to citations in the 2018 dataset. The result of that matching was in turn matched to the citations in the 2020 dataset.
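The two matching steps described above can be sketched as follows. This is a minimal illustration and not the code actually used in the study: the dictionary keys (gs_id, url, doi, title, first_author) are hypothetical, titles are lowercased before comparison, and the similarity is assumed to be 1 minus the optimal string alignment distance divided by the length of the longer title, a normalisation that the text does not specify.

```python
def osa_distance(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def title_similarity(a, b):
    """Similarity in [0, 1]; the normalisation is an assumption of this sketch."""
    a, b = a.lower().strip(), b.lower().strip()
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - osa_distance(a, b) / longest


def passes_heuristics(doc_a, doc_b):
    """Apply the two conservative heuristics to a candidate pair."""
    sim = title_similarity(doc_a["title"], doc_b["title"])
    if sim >= 0.8 and len(doc_a["title"]) >= 30:
        return True
    same_author = doc_a["first_author"].lower() == doc_b["first_author"].lower()
    return sim >= 0.7 and same_author


def match_citations(docs_a, docs_b):
    """Return (index_in_a, index_in_b) pairs for citing documents found in both."""
    matches = []
    unmatched_a = list(range(len(docs_a)))
    unmatched_b = set(range(len(docs_b)))
    # Step 1: exact identifier matches, tried in the order given in the text.
    for field in ("gs_id", "url", "doi"):
        index_b = {}
        for j in sorted(unmatched_b):
            value = docs_b[j].get(field)
            if value:
                index_b.setdefault(value, j)
        remaining = []
        for i in unmatched_a:
            j = index_b.get(docs_a[i].get(field))
            if j is not None and j in unmatched_b:
                matches.append((i, j))
                unmatched_b.discard(j)
            else:
                remaining.append(i)
        unmatched_a = remaining
    # Step 2: best title match among the leftovers, accepted only if it
    # satisfies one of the two heuristics.
    for i in unmatched_a:
        if not unmatched_b:
            break
        best_j = max(sorted(unmatched_b),
                     key=lambda j: title_similarity(docs_a[i]["title"],
                                                    docs_b[j]["title"]))
        if passes_heuristics(docs_a[i], docs_b[best_j]):
            matches.append((i, best_j))
            unmatched_b.discard(best_j)
    return matches


# Invented records for illustration only.
docs_2018 = [
    {"gs_id": "A1", "title": "Dark energy constraints from supernova surveys", "first_author": "Smith"},
    {"doi": "10.1/x", "title": "Galaxy clustering in the local universe: a new catalogue", "first_author": "Lee"},
    {"title": "Introduction", "first_author": "Kim"},
]
docs_2019 = [
    {"title": "Galaxy clustering in the local universe: a new catalog", "first_author": "Lee"},
    {"gs_id": "A1", "title": "Dark energy constraints", "first_author": "Smith"},
    {"title": "Introduction", "first_author": "Park"},
]
print(match_citations(docs_2018, docs_2019))
```

In this example the first record is matched by its identifier, the second by title similarity, and the two "Introduction" records are left unmatched because the title is shorter than 30 characters and the first authors differ.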
Results
A simple observation of the citation counts reported by Google Scholar over the years for the 10 highly-cited documents in the field of Astronomy & Astrophysics in GSCP already reveals a large fluctuation (Figure 2).
Four of the 10 documents suffered a sharp decrease in citations in 2019. In the most extreme case (document
Astronomy & Astrophysics, but in
The Astronomical Journal, published by IOP Science. Indeed, of the four documents that show a drop in citation counts in 2019, only one was published in
Astronomy & Astrophysics (document
The Astronomical Journal with decreased citation counts in 2019 (document
Nuclear Physics A (document
Astronomy & Astrophysics was not the only journal to be affected by the drop in coverage in 2019, which is expected, since whichever documents disappeared from Google Scholar likely cited articles from various journals in the field.
Six of the ten documents (
Monthly Notices of the Royal Astronomical Society (Oxford University Press), one in The Astrophysical Journal (IOP Science), and one in Physical Review D (American Physical Society). This does not rule out the possibility that they lost citations in 2019; it only indicates that the growth/loss balance was positive. Nevertheless, this suggests that some documents could have been more affected than others.
Of the four documents that clearly lost citations in 2019, three seem to recover them by 2020 and continue receiving citations in 2021. This suggests that the coverage loss was indeed temporary, and that citations were recovered at some point between summer of 2019 and spring of 2020, probably after the second complete recrawl of the index in 2019.
Figure 2: Citation counts reported by Google Scholar over the years to the 10 highly-cited documents in the field of Astronomy & Astrophysics in GSCP
One of the documents (
Nuclear Physics A, it was discovered that there are actually two documents with the same title and the same authors, published in two different journals (Figure 3). We believe it is probable that in 2017 and 2018, these two documents were incorrectly merged into one record (combining the citations of the two), and that in 2019 they were separated, as they remain in 2021. Since this is a different kind of Google Scholar error than the one we are documenting here, this document and its citations were excluded from further analysis in this study.
We analysed the list of documents that cited each of the nine highly-cited documents according to Google Scholar at three points in time: 2018, 2019, and 2020. In 2018, 21,907 citations were extracted. In 2019, 15,042 were extracted, while in 2020, 25,195 citations to these nine documents were found in Google Scholar. Of the 21,907 citations found in 2018, 8,840 (40%) were missing in 2019. In 2020, however, 96% of the citations available in the 2018 dataset had reappeared.
To find out exactly which documents caused the decrease in citation counts, the citations found in 2018 to each of the nine highly-cited documents were grouped by the publisher of the citing document, and by whether or not the citation was also found in 2019 and 2020 (Figure 4). Missing documents are mostly concentrated in document
Figure 3: Two documents with the same title and authors, published in different journals. It is possible that Google Scholar incorrectly merged these two records into one (combining citation counts) in 2017 and 2018, and separated them in 2019. Screenshot taken on 14/02/2021
Figure 4: Citations found in 2018 for each of the 9 highly-cited documents, grouped by publisher of citing document and by whether or not the citation was found in 2019 and 2020
The citing documents that were present in the three datasets (data extracted in 2018, 2019, and 2020) came with citation counts of their own, which provides us with an opportunity to gauge to what extent each publisher was affected by the loss in coverage in 2019 using a larger sample than the 10 highly-cited documents in GSCP. In this regard, documents published by EDP Sciences (publisher of the journal Astronomy & Astrophysics) were severely affected by the loss of coverage (Figure 5), followed by documents published by the American Astronomical Society, which were affected to a much lower extent. Out of the 724 documents published by EDP Sciences that were available in the three datasets, 404 (58%) reported at least 10 fewer citations in the 2019 dataset than in the 2018 dataset, whereas out of the 2,604 documents published by the American Astronomical Society present in the three datasets, 141 (5%) reported at least 10 fewer citations in the 2019 dataset than in the 2018 dataset. For the remaining publishers, lower citation counts in the 2019 dataset than in the 2018 dataset were even more uncommon.
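The per-publisher comparison described above amounts to counting, for each publisher, the documents whose reported citation count fell by at least 10 between the 2018 and 2019 datasets. A minimal sketch of that calculation follows; the record structure, the field names, and the example values are hypothetical and are not taken from the study's data.

```python
from collections import defaultdict


def share_with_large_drop(records, min_drop=10):
    """Per publisher, the share of documents that lost at least `min_drop` citations."""
    totals = defaultdict(int)
    dropped = defaultdict(int)
    for rec in records:
        totals[rec["publisher"]] += 1
        if rec["citations_2018"] - rec["citations_2019"] >= min_drop:
            dropped[rec["publisher"]] += 1
    return {publisher: dropped[publisher] / totals[publisher] for publisher in totals}


# Invented records: one entry per citing document present in both datasets.
records = [
    {"publisher": "Publisher A", "citations_2018": 120, "citations_2019": 40},
    {"publisher": "Publisher A", "citations_2018": 15, "citations_2019": 14},
    {"publisher": "Publisher B", "citations_2018": 60, "citations_2019": 63},
    {"publisher": "Publisher B", "citations_2018": 33, "citations_2019": 20},
    {"publisher": "Publisher B", "citations_2018": 8, "citations_2019": 9},
    {"publisher": "Publisher B", "citations_2018": 50, "citations_2019": 49},
]
shares = share_with_large_drop(records)
print(shares)
```

The threshold of 10 citations mirrors the one used in the text; it is exposed as a parameter here so that the sensitivity of the result to that choice can be checked.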
Discussion and conclusions
The results confirm that this is another case of a large coverage fluctuation in Google Scholar. Similarly to what happened during the Dialnet blackout event, users interested in Astronomy and Astrophysics content during 2019 who expected Google Scholar to have comprehensive coverage of this field could have been unknowingly underserviced for a period of 6 to 9 months. Unlike in the Dialnet event, however, the effects were quickly felt because of the drastic drops in citation counts of documents in the area, especially in the journal Astronomy & Astrophysics, which are confirmed in this study. Furthermore, although this analysis has not been able to discern the specific cause of the error, the fact that documents from many large publishers disappeared from the platform despite continuing to be available on the Web points to a mistake on Google Scholar’s side.
Figure 5: Distribution of (log-transformed) citation counts of citing documents in 2018, 2019, and 2020, grouped by publisher
Although there are reports that Google Scholar sometimes contacts content providers when an issue like this arises (Delgado López-Cózar & Martín-Martín, 2018), the slowness with which it is currently able to resolve this kind of issue is a clear limitation of the platform for literature search use cases as well as for bibliometric use cases.
References
Augé, M. (1995). Non-places: Introduction to an Anthropology of Supermodernity. Verso.
Damerau, F. J. (1964). A technique for computer detection and correction of spelling errors. Communications of the ACM, (3), 171–176. https://doi.org/10.1145/363958.363994
Delgado López-Cózar, E., & Martín-Martín, A. (2018). Apagón digital de la producción científica española en Google Scholar. Anuario ThinkEPI, 265–276. https://doi.org/10.3145/thinkepi.2018.40
Delgado López-Cózar, E., & Martín-Martín, A. (2019). Indice H de las revistas cientificas españolas en Google Scholar Metrics 2014-2018. https://doi.org/10.13140/RG.2.2.36649.13923
Delgado López-Cózar, E., Orduna-Malea, E., & Martín-Martín, A. (2019). Google Scholar as a data source for research assessment. In W. Glaenzel, H. Moed, U. Schmoch, & M. Thelwall (Eds.), Springer Handbook of Science and Technology Indicators. Springer.
Else, H. (2018, April 11). How I scraped data from Google Scholar. Nature. https://doi.org/10.1038/d41586-018-04190-5
Forveille, T. (2019). A&A ranking by Google. Astronomy & Astrophysics, E1. https://doi.org/10.1051/0004-6361/201936429
Google Scholar. (n.d.-a). Search Tips: Content coverage. Google Scholar. Retrieved December 2, 2021, from https://web.archive.org/web/20210212175858/https://scholar.google.com/intl/es/scholar/help.html
Google Scholar. (n.d.-b). Search Tips: Inclusion and corrections. Google Scholar. Retrieved December 2, 2021, from https://web.archive.org/web/20210212175858if_/https://scholar.google.com/intl/es/scholar/help.html
Gusenbauer, M., & Haddaway, N. (2020). Research Synthesis Methods, (2), 181–217. https://doi.org/10.1002/jrsm.1378
Haddaway, N., & Gusenbauer, M. (2020, February 3). A broken system – why literature searching needs a FAIR revolution. Impact of Social Sciences. https://blogs.lse.ac.uk/impactofsocialsciences/2020/02/03/a-broken-system-why-literature-searching-needs-a-fair-revolution/
Levenshtein, V. I. (1966). Binary codes capable of correcting deletions, insertions, and reversals. Soviet Physics Doklady, (8), 707–710.
Martín-Martín, A. (2018). Code to extract bibliographic data from Google Scholar. https://doi.org/10.5281/zenodo.1481076
Martín-Martín, A., Orduna-Malea, E., Thelwall, M., & Delgado López-Cózar, E. (2018). Google Scholar, Web of Science, and Scopus: A systematic comparison of citations in 252 subject categories. Journal of Informetrics, (4), 1160–1177. https://doi.org/10.1016/J.JOI.2018.09.002
Martín-Martín, A., Thelwall, M., Orduna-Malea, E., & Delgado López-Cózar, E. (2021). Google Scholar, Microsoft Academic, Scopus, Dimensions, Web of Science, and OpenCitations’ COCI: A multidisciplinary comparison of coverage via citations. Scientometrics, (1), 871–906. https://doi.org/10.1007/s11192-020-03690-4
Orduna-Malea, E., Martín-Martín, A., & Delgado López-Cózar, E. (2017). Google Scholar as a source for scholarly evaluation: A bibliographic review of database errors. Revista Española de Documentación Científica, (4), e185. https://doi.org/10.3989/redc.2017.4.1500
Orduna-Malea, E., Martín-Martín, A., & Delgado López-Cózar, E. (2018). Classic papers: Using Google Scholar to detect the highly-cited documents. 23rd International Conference on Science and Technology Indicators