What Library Digitization Leaves Out: Predicting the Availability of Digital Surrogates of English Novels

Abstract

Library digitization has made more than a hundred thousand 19th-century English-language books available to the public. Do the books which have been digitized reflect the population of published books? An affirmative answer would allow book and literary historians to use holdings of major digital libraries as proxies for the population of published works, sparing them the labor of collecting a representative sample. We address this question by taking advantage of exhaustive bibliographies of novels published for the first time in the British Isles in 1836 and 1838, identifying which of these novels have at least one digital surrogate in the Internet Archive, HathiTrust, Google Books, and the British Library. We find that digital surrogate availability is not random. Certain kinds of novels, notably novels written by men and novels published in multivolume format, have digital surrogates available at distinctly higher rates than other kinds of novels. As the processes leading to this outcome are unlikely to be isolated to the novel and the late 1830s, these findings suggest that similar patterns will likely be observed during adjacent decades and in other genres of publishing (e.g., non-fiction).


Allen Riddell † Troy J. Bassett *

First version: 22 August 2019
This version: 1 September 2020

† Indiana University Bloomington.
* Purdue University Fort Wayne.

Bulk library digitization has made hundreds of thousands of digital surrogates of 19th-century books available to researchers and the general public. Researchers are only beginning to explore possible uses of these surrogates. Book historians and bibliographers, for example, can inspect title pages and interiors of books of interest, eliminating the need to travel to distant libraries or even their own libraries.

Many of the more exciting potential uses of digital surrogates of 19th-century books held by major digital libraries (British Library, Google Books, HathiTrust, and Internet Archive) involve the analysis of book contents. Today it is possible to imagine reconstructing, with the aid of optical character recognition, a detailed view of, say, the range of themes, plots, and settings discussed in works of fiction published in the British Isles during the 19th century. But this project and similar projects depend on the population of books with digital surrogates resembling the larger population of published books. If these two populations do not resemble each other, then researchers learn only about the collections of libraries contributing digital surrogates. They learn nothing about the history of publishing in general.

In this study we look at the availability of digital surrogates of novels first published in the British Isles during the late 1830s. This is a period for which we have lists of every novel which was published. We ask whether the population of novels with digital surrogates resembles the population of published novels. (For example, is the share of novels written by women roughly the same in both populations?) We ask this question because the consequences of an affirmative answer are profound. If the two populations resemble each other, researchers can use novels which are already readily accessible to explore aggregate characteristics and trends in the history of publishing in the British Isles. That is, they need not worry that certain kinds of novels tend to be missing from the corpus of novels with digital surrogates.

Starting around 2004 organizations including Google, Microsoft, and the Internet Archive digitized millions of volumes held by major university and state libraries. Participating libraries included the University of California, Harvard University, Oxford University, and the British Library. Although books published during the 20th century are, in general, protected by intellectual monopolies of varying lengths, the digital surrogates of books published during the 19th century tend to be freely accessible online.

Questions about the “coverage” of digital libraries have dogged researchers wanting to make inferences about publishing activity during the 19th century based on available digital surrogates. The hope that available digital surrogates might adequately reflect the population of books published during the 19th century is far from wishful thinking, especially if this hope is restricted to certain geographic areas, genres of publishing, or certain decades. Most titles published as books in the 19th century survive due to their physical durability and generous print runs (often in the thousands of copies). That most books survive is also due to legal deposit laws which mandate that publishers send copies of new books to at least one (state) library. In the case of the British Isles, two of the five legal deposit libraries (the British Library and Oxford University) participated in early library digitization efforts. The participation of these legal deposit libraries in mass digitization projects makes plausible the proposition that, for this location and period, the population of available digital surrogates may approximate the population of editions published.

Digitization of general circulation collections of state and research libraries began in 2004 when the internet search and advertising company Google launched the Google Books Project. Significant early partnerships with Google Books included Oxford, Harvard, the University of Michigan, the New York Public Library, and Stanford University. Google Books would go on to scan more than thirty million volumes from various libraries. Other libraries, working with the Open Content Alliance (OCA) and contracting with the non-profit Internet Archive for digitization services, began digitizing their collections in 2005. Early OCA participating libraries included the British Library, the University of Illinois, and the Boston Public Library. Between 2005 and 2008 the OCA’s digitization work received significant funding from the software company Microsoft as part of the Live Book Search service. In 2013 the Internet Archive reported that it had digitized two million books.

Library digitization continues in the present. Many large libraries, including several of those mentioned above, operate their own digitization programs. Libraries often make the digital surrogates they produce available on their own websites, often via links in the relevant records in their online catalogs. In North America, a copy of a newly created digital surrogate will likely be deposited at HathiTrust, a library consortium which accepts deposits of digitized volumes from member libraries.

In the interest of concision, we will refer to the four major digital libraries for English-language books published during the 19th century (Internet Archive, HathiTrust, Google Books, and the British Library) as “the major digital libraries,” omitting reference to their focus on English-language books and English-speaking regions of the world.

Holdings of digital surrogates overlap. In many cases, a digital surrogate available from one major digital library is available from one or more additional digital libraries. HathiTrust, for example, makes available many (but not all) public domain digital surrogates which were created by Google Books. As a consequence, these volumes are typically also available from Google Books. As the Internet Archive was commissioned by the British Library to digitize some of its collections, many of the copies digitized are available from both the Internet Archive and the British Library. Thanks largely to the efforts of Aaron Swartz, the Internet Archive includes copies of hundreds of thousands of digital surrogates available from Google Books. Although overlap is considerable, there are always digital surrogates which are only available from one of the four major digital libraries. For this reason, checking each digital library separately is typically required to identify whether or not an English-language book has a publicly available digital surrogate.

A researcher interested in locating a digital surrogate of an English-language 19th-century book will typically perform a search using the websites of HathiTrust, Internet Archive, and Google Books. If their initial search yields no results, they may use a general purpose search engine (e.g., Bing, Google, DuckDuckGo) and search specific library catalogs if they believe page images may be available at a library which is not connected with Google Books, HathiTrust, or the Internet Archive. The British Library is an important example of such a library and many of its digital surrogates can be found only by using the library’s online catalog.

Numerous academic articles have made use of digital surrogates from one or more of the major digital libraries. One important line of research in information and library science attempts to make use of the text of digital surrogates to advance bibliographic goals. (The “text” of a digital surrogate is the machine-readable text produced by optical character recognition (OCR).) For example, Bamman, Carney, Gillick, Hennesy, and Sridhar (2017) use the text derived from digital surrogates held by HathiTrust to predict the date of first publication of a work.
As titles lacking (reliable) publication dates are relatively common during the 19th century, this research allows scholars to make inferences, supported by the words appearing in a text, about an undated volume’s likely publication date.

Research more closely related to the material presented in this paper characterizes the coverage of one or more of the major digital libraries with respect to some (ideal) benchmark, often for a specific kind of book or document. Jones (2011) explores Google Books’ coverage of 19th-century books, using as a benchmark the five-volume Catalogue of the Library of the Boston Athenæum, 1807–1871. Using a random sample ( N = ) from the Boston Athenæum catalog, Jones (2011) attempts to locate matching digital surrogates in Google Books, finding a match in 235 cases (59%). Sare (2012) considers Google Books’ and HathiTrust’s coverage of US government documents published between 1943 and 1976. Like Jones (2011), Sare (2012) uses a random sample ( N = ) from the population of government documents. Sare (2012) finds that 436 documents (28%) were available in some form from HathiTrust and that 809 documents (53%) were available from Google Books. Sare (2012)’s study is distinctive in that it makes use of an exhaustive list of items which clearly defines a population: every US government document sent to depository libraries between 1943 and 1979. Jones (2011), by contrast, uses a library catalog, leaving open the question of what books were published but never collected by this particular library. Although both approaches are extremely valuable, we use a method more similar to the one used by Sare (2012) because it avoids having to reason about missing items in a particular library collection.

Anyone who has had occasion to search online for page images of books published during the 19th century knows that not every edition has a digital surrogate. Countless 19th-century editions survive in libraries which have not digitized their collections. Many libraries digitized some but not all 19th-century books in their collections. Many factors influence whether a given edition has a digital surrogate available in at least one of the major digital libraries. We have found it useful to group the obvious factors into two categories.

Factors in the first category relate to collecting practices. These factors influence whether or not a library contributing to the major digital libraries is likely to have a copy of a given edition in its holdings. (A library possessing a copy of an edition is, of course, a prerequisite for the library’s contributing a digital surrogate.) The existence of legal deposit requirements for publishers after 1710 gives us reason to believe that collecting practices may not be particularly influential, at least for the legal deposit libraries contributing to the major digital libraries (Oxford and the British Library). If publishers in the British Isles reliably sent copies of new books to Oxford and the British Library, then every 19th-century novel should, in principle, be found in the holdings of these two libraries. For these two libraries there may be no “collecting practices” worth mentioning. For North American libraries, by contrast, collecting practices may well influence which 19th-century English-language novels are in their holdings, especially those originally published in the British Isles.

Factors in the second category concern digitization practices at contributing libraries. Anecdotal experience with bulk digitization programs suggests that digitization programs did not, as a rule, selectively digitize certain books on library shelves. Shelves of books and floors of shelves were chosen for digitization, not individual books. But the shelves or floors of shelves selected for digitization may not have been chosen uniformly at random. Shelves in special collections tended to be skipped, for example. Books may also have been passed over for technical reasons. Digitization equipment used in bulk digitization programs is typically not able to process very large or very small items. Large-format books containing maps, for example, tended to be left on shelves. Although the practice of passing over very large and very small items is not one we anticipate being relevant in the case of the 19th-century novel, it is an unambiguous case of digitization practices yielding an unrepresentative sample of library collections.

In this section we will discuss four features associated with a novel which may influence collecting practices or digitization practices: print run, (sub)genre, author gender, and format. Our initial thinking was that only (sub)genre influences both collecting and digitization practices. The remaining features seem likely to influence only collecting practices.

The print run of an edition seems very likely to influence whether or not a given edition has a digital surrogate available in one or more of the major digital libraries. Books with higher first-edition print runs were more likely to be collected than books with lower print runs. Such books tended to be more popular as high print runs tend to reflect a publisher’s prospective judgment of demand. More popular books, all other things being equal, seem more likely to have been targets for library and private collection. Moreover, books with higher print runs would be more likely to survive in libraries because a greater number of copies implies (mechanically) that a greater opportunity for collection existed. Because the print run of a novel is typically unknown, print run seems likely to be only relevant to a consideration of collecting practices.
The genre of a novel could also have influenced whether or not a given edition has a digital surrogate available. Genre may be a factor which influences collection practices and digitization practices. Certain genres may have been systematically targeted for collection during the 19th century. Libraries may have judged collecting works in certain genres (e.g., historical novels) more desirable. Genre could also influence digitization practices. If novels in certain genres were segregated and shelves containing them tended to be passed over (or preferentially selected), then a book’s genre could influence whether or not a digital surrogate is available. For example, if works classified “juvenile fiction” were, for whatever reason, shelved in a different area than other works of prose fiction, we can imagine a scenario in which these shelves tended to be passed over. If libraries facing budget or time constraints were unable to scan all their holdings, we speculate that they might have selected one “type” of fiction rather than the other.

Format and author gender are two remaining factors which may have influenced collecting practices. (We have no reason to believe that they directly influenced digitization practices.) The format used for the first edition of a 19th-century novel was a rough indication of the work’s prestige. Having a work published in a multivolume format (typically two or three volumes) was regarded as more desirable than having a work published in a single-volume format. The distinction is vaguely analogous to the contemporary division between works whose first edition uses a hardcover format and those which use a paperback format. If libraries used format to guide their collecting practices, we would expect that format would predict a novel having a first edition digital surrogate.

Author gender may also have influenced collecting practices. The prevalence and impact of bias against women novelists by text industry intermediaries (e.g., reviewers, publishers) during the 19th century has been the subject of sustained discussion. Women novelists may have tended to write novels in specific genres (e.g., juvenile fiction). Novels in these genres may have been less likely to be targets for library collection. If those involved in library collection practices used gender as an indicator of or proxy for collection-worthiness then author gender may predict digital surrogate availability. In the discussion that follows, we will assess not only whether format and author gender affect the availability of digital surrogates but, crucially, whether the intersection of format and author gender affects availability as well. While individually format and author gender could each affect availability, together the factors could reinforce the effects of bias. If libraries are truly digitizing their collections at random, then there should be no discernible difference between the availabilities of digitized surrogates by format, author gender, or their combination.

This article focuses on two research questions:

1. Do the novels which have been digitized reflect the population of published novels? For example, is the share of novels written by women roughly the same in both populations?
2. If they do not, which kinds of novels are over- or under-represented?

Our approach is easy to describe. We first gather a list of novels first published in the British Isles in 1836 and 1838. The list includes every first-edition novel published in 1838 and a simple random sample of first-edition novels published in 1836. For each novel in the population, we search the major digital libraries for a digital surrogate. If we find at least one digital surrogate, we record this fact.

The most important materials in this investigation are two exhaustive lists of novels. These lists contain records of novels published during 1836 and 1838. (1836 and 1838 are the most recent two years for which exhaustive bibliographies of novels exist.) These two lists enumerate the population of novels which could, in principle, have been digitized. Both these lists include author gender annotations and novel format information. Author gender typically matches the gender of the historical individual credited with authorship. (Appendix A describes the gender annotation procedure in detail.) T. Bassett (n.d.) provides an exhaustive list of the 94 novels published in 1838. We gather information about digital surrogate availability for all these novels. Garside, Mandal, Ebbes, Koch, and Schöwerling (2004) provides an exhaustive list of the 90 novels published in 1836. This list is part of a larger (exhaustive) bibliography covering 1770 to 1836. To economize on time, we sample uniformly at random (without replacement) from these novels ( n = ). We add this 1836 random sample to the list of 1838 novels primarily to address the remote possibility that text industry output during the year 1838 might have been exceptional in some unanticipated way.

In this study we use an inclusive definition of a novel. A novel is a work of prose fiction of at least 90 pages not addressed primarily to children under the age of 13. This is the definition used by T. Bassett (n.d.). This definition is more inclusive than the one used by the standard bibliography covering 1770–1836 developed by Garside and collaborators. Garside and collaborators focused on works labeled as “novels” by contemporaries and therefore exclude prose fiction which they characterize as didactic-religious as well as prose fiction which they classify as juvenile fiction (but which likely addressed some adult readers). Our inclusive definition has the virtue of being easier to apply. It requires fewer judgments by domain experts. For example, the definition used by Garside and collaborators requires an expert to decide whether or not contemporaries referred to the book as a novel. Reaching a decision on this question might require consulting contemporary reviews and publisher advertisements. Switching between the two definitions for novels published during 1836 is made easy by the fact that Garside et al. (2004) lists works which fell short of their definition in appendices. To use the inclusive definition for 1836 works, we include works in these appendices which qualify under the more permissive definition.

Using the combined list of 126 novels published during the late 1830s, we search the major digital libraries for first edition digital surrogates. That is, we use the relevant web interfaces to search the holdings of the major digital libraries. We count a novel as having a digital surrogate if a digital surrogate exists for the first edition. If the novel is a multivolume novel, all volumes must be available for the novel to count as having a digital surrogate. We count as first editions versions of the novel which were published in the same year as the first edition by the first edition publisher. For example, an “export edition” destined for Canada with a variant title page could count as a first edition digital surrogate if it was published in the same year as the first edition by the same publisher.
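The counting rule just described (same year, same publisher, every volume present) can be sketched in Python. This is only an illustration and not the authors' procedure, which relied on manual searches of library web interfaces. The `Scan` record, its field names, and the example publisher name are hypothetical, and the sketch assumes each scan corresponds to one physical volume.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scan:
    """One digitized volume found in a digital library (hypothetical record)."""
    year: int        # publication year of the scanned copy
    publisher: str   # publisher named on the title page
    volume: int      # volume number within the edition (1 for single-volume)


def has_first_edition_surrogate(scans, first_year, first_publisher, n_volumes):
    """Return True if every volume of the first edition has at least one scan.

    A scan counts toward the first edition when it shares the first
    edition's year and publisher, which also admits variants such as
    export editions issued in the same year by the same publisher.
    """
    volumes_found = {
        s.volume
        for s in scans
        if s.year == first_year and s.publisher == first_publisher
    }
    return volumes_found >= set(range(1, n_volumes + 1))


# A three-volume novel with only two volumes scanned does not count.
partial = [Scan(1838, "Publisher A", 1), Scan(1838, "Publisher A", 2)]
print(has_first_edition_surrogate(partial, 1838, "Publisher A", 3))
```

Representing the volumes found as a set makes the "all volumes must be available" condition a single superset comparison, and duplicate scans of the same volume are ignored automatically.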
We find that of the 126 novels, 106 (84%) have at least one digital surrogate available in the major digital libraries. Table 1 shows the counts of novels with digital surrogates by author gender and novel format.

Category                                 Has digital surrogate   Total
Woman-author, single-volume              19 (68%)                28
Woman-author, multivolume                20 (91%)                22
Man-author, single-volume                17 (85%)                20
Man-author, multivolume                  42 (95%)                44
Unknown-gender author, single-volume     5 (56%)                 9
Unknown-gender author, multivolume       3 (100%)                3
All novels                               106 (84%)               126

Table 1: Availability of digital surrogates of 1836 and 1838 novels by author gender and novel format.

To address our research questions we need to estimate credible intervals for the underlying rates at which novels from the late 1830s (1835–1839) have first edition digital surrogates. In order to form reasonable beliefs about the rate of digital surrogate availability for years in the late 1830s we must write down a model which allows our observations from 1836 and 1838 to inform our beliefs about digital surrogate availability for the population of late 1830s novels. Such a model should also permit us to estimate the rate of availability for novels in different categories, where categories are defined by author gender and novel format. Were we to find rates which resembled each other, this would count as evidence in favor of the theory that the processes leading to a novel having a first edition digital surrogate are largely random.

To estimate the credible intervals for each category of novels, we use a hierarchical model with Beta priors and Binomial sampling distributions for each of the six categories of novels. In symbols, $y_i \sim \mathrm{Binomial}(N_i, p_i)$ for $i = 1, \ldots, 6$, with a Beta prior on each $p_i$, where $p_1, p_2, \ldots, p_6$ are the unobserved rates (which we estimate from the data) at which each category of novels has first edition digital surrogates, $N_1, N_2, \ldots, N_6$ are the number of novels in each category, and $y_1, y_2, \ldots, y_6$ are the number of novels we observe having available digital surrogates in each category. Our model treats the 1836 and 1838 observations as if they were randomly sampled from the population of late 1830s novels. This assumption simplifies the model considerably. We think the decision to treat the observations as indistinguishable from observations in the population is reasonable. We know of no reason why the specific publication year of a late-1830s novel would influence collection or digitization processes. Moreover, we can perform rudimentary checks of this modeling assumption by verifying that digitization rates do not vary much by year. For example, the empirical proportion of 1836 woman-authored novels which have digital surrogates is very similar to the proportion of 1838 woman-authored novels which have digital surrogates (75% and 79%).

The use of a model to characterize uncertainty would be valuable even if it were restricted to the analysis of 1838 titles, a year for which our data includes every title published. In this case the uncertainty estimates can be understood as a description of the probability that a hitherto unacknowledged 1838 novel in a given category will be observed to have a digital surrogate. Novels previously unknown to bibliographers do occasionally surface. For example, a handful of novels first published between 1800 and 1829 were discovered after the publication of the landmark bibliography of Garside and Schöwerling (2000), a bibliography of the period believed to be exhaustive at its time of publication. So a model characterizing the posterior probability of digital surrogate availability would be valuable even if we had observations of availability for all late-1830s novels.

We estimate the posterior probability distributions with Markov Chain Monte Carlo using the software Stan. Code and data are published under the following DOI: 10.5281/zenodo.4010771.
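As a rough illustration of the Beta-Binomial reasoning in this section, the following Python sketch computes a posterior mean for each of the six categories and a Monte Carlo estimate of the probability that one category's rate exceeds another's. It is a deliberate simplification and not the authors' Stan model: it assumes an independent, fixed Beta(1, 1) prior for every category rather than a hierarchical prior, so its numbers will differ somewhat from the published estimates. The counts are those of Table 1; the function names, the number of draws, and the seed are illustrative choices.

```python
import random

# (novels with a first edition surrogate, novels in category), from Table 1
COUNTS = {
    ("woman", "single"): (19, 28),
    ("woman", "multi"): (20, 22),
    ("man", "single"): (17, 20),
    ("man", "multi"): (42, 44),
    ("unknown", "single"): (5, 9),
    ("unknown", "multi"): (3, 3),
}

# Assumption: a flat Beta(1, 1) prior per category (the paper's model is hierarchical).
PRIOR_A, PRIOR_B = 1.0, 1.0


def posterior_params(y, n):
    """Conjugate update: Beta(a, b) prior plus y successes in n trials."""
    return PRIOR_A + y, PRIOR_B + (n - y)


def posterior_mean(y, n):
    a, b = posterior_params(y, n)
    return a / (a + b)


def prob_rate_greater(cat_hi, cat_lo, draws=20000, seed=0):
    """Monte Carlo estimate of P(p[cat_hi] > p[cat_lo]) under the posteriors."""
    rng = random.Random(seed)
    a1, b1 = posterior_params(*COUNTS[cat_hi])
    a2, b2 = posterior_params(*COUNTS[cat_lo])
    wins = sum(
        rng.betavariate(a1, b1) > rng.betavariate(a2, b2) for _ in range(draws)
    )
    return wins / draws


for category, (y, n) in COUNTS.items():
    print(category, round(posterior_mean(y, n), 2))
print(prob_rate_greater(("man", "multi"), ("woman", "single")))
```

Because the Beta prior is conjugate to the Binomial likelihood, this simplified version needs no MCMC. A hierarchical model such as the one fitted in Stan additionally shares information across categories, which matters most for the sparsely observed unknown-gender categories.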

Analysis

Figure 1: Estimated rates of digital surrogate availability ($p_1, p_2, \ldots, p_6$). Points indicate posterior means, bars show 50% credible intervals.

Digital surrogate availability varies by author gender and format. Posterior estimates of availability rates appear in Figure 1. These rates were estimated using the model and data described in the previous section. As described in the previous section, the model allows us to characterize uncertainty about the rates, something we cannot accomplish using the counts of available digital surrogates alone (Table 1). Although there is uncertainty about the rates, especially for the unknown-gender author novels, several differences are obvious. The mean estimate for the rate at which women-author, single-volume novels from the late 1830s have digital surrogates is 69%. The mean estimate for the rate for men-author, multivolume novels from the late 1830s is 94%. Taking account of uncertainty about the rates, it is virtually certain (99% probability) that the latter rate is higher than the former. We are also virtually certain that the rate at which men-author, multivolume novels have digital surrogates is higher than the rate at which unknown-gender author, single-volume novels have surrogates.

Multivolume novels are more likely to have digital surrogates than single-volume novels. The rate at which multivolume novels by women have digital surrogates is very likely greater than the rate at which single-volume novels by women have digital surrogates. (“Very likely” means that the posterior probability of the inequality is greater than 90% but less than 99%.) The pattern holds true for multivolume and single-volume novels by men authors and by unknown-gender authors. The mean estimates of the rates of digitization for multivolume and single-volume novels by men are 0.94 and 0.83, respectively. The mean estimates of the format-specific rates for unknown-gender author novels are 0.86 and 0.62, respectively. Our relative lack of confidence in the difference between rates for unknown-gender authors, despite the large difference between the mean estimates, is due to the remaining uncertainty about the rates. This uncertainty is due to the small number of unknown-gender titles in the sample (12 out of 126). With so few observations there is only so much confidence we can gain about the rates of availability for these novels.

An alternative way of thinking about these findings is to consider what might happen if a researcher were to sample 200 first-edition late-1830s novels from the major digital libraries. Such 200-novel samples will tend to have about 74 novels by women, 37 of which would be single-volume novels. Had the samples been taken at random from the population of published novels, however, we would expect to see 79 novels by women, 44 of which would be single-volume novels. Hence sampling from the major digital libraries instead of the population will tend to undercount women-author, single-volume novels by about 17%. Single-volume novels by unknown-gender authors would be undercounted by 25%. Novels which are over-counted in this scenario are primarily multivolume novels by men. Samples from the major digital libraries will tend to over-represent this group by 13%.

Our main finding is that different kinds of novels have digital surrogates at different rates. Format and author gender can be used to predict digital surrogate availability. This result makes using samples from the major digital libraries inadvisable, as these samples will not reflect the population of published novels. Yet the result raises a question: Why do we observe these patterns in the availability of novels? Are the differences due to library collecting practices? Or is there perhaps evidence that digitization practices are driving the differential availability of novels in different categories?
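The 200-novel thought experiment described earlier in this section can be approximated directly from the Table 1 counts. The sketch below is an assumption-laden illustration: it uses the raw observed counts rather than the posterior rate estimates from the model, so its results can differ by a novel or a few percentage points from the figures quoted in the text. The function and variable names are illustrative.

```python
# Table 1: category -> (novels with a surrogate, all published novels)
TABLE_1 = {
    ("woman", "single"): (19, 28),
    ("woman", "multi"): (20, 22),
    ("man", "single"): (17, 20),
    ("man", "multi"): (42, 44),
    ("unknown", "single"): (5, 9),
    ("unknown", "multi"): (3, 3),
}


def expected_counts(sample_size, weights):
    """Expected number of novels per category in a random sample."""
    total = sum(weights.values())
    return {cat: sample_size * w / total for cat, w in weights.items()}


def by_women(counts):
    """Sum the expected counts over the two woman-author categories."""
    return sum(v for (gender, _), v in counts.items() if gender == "woman")


published = {cat: n for cat, (_, n) in TABLE_1.items()}
digitized = {cat: y for cat, (y, _) in TABLE_1.items()}

from_population = expected_counts(200, published)
from_libraries = expected_counts(200, digitized)

print("novels by women:", round(by_women(from_population)),
      "(population) vs", round(by_women(from_libraries)), "(digital libraries)")
for cat in TABLE_1:
    # Relative over- or under-representation when sampling from the libraries
    change = from_libraries[cat] / from_population[cat] - 1
    print(cat, f"{change:+.0%}")
```

Negative percentages mark categories that a sample drawn from the digital libraries would undercount relative to a sample drawn from the population of published novels.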
Category                                       Has digital surrogate in British Library   Total
woman-author, single-volume, 1836              4 (44%)                                    9
woman-author, multivolume, 1836                0 (0%)                                     10
man-author, single-volume, 1836                6 (60%)                                    10
man-author, multivolume, 1836                  0 (0%)                                     25
unknown-gender author, single-volume, 1836     4 (67%)                                    6
unknown-gender author, multivolume, 1836       0 (0%)                                     2
woman-author, single-volume, 1838              13 (59%)                                   22
woman-author, multivolume, 1838                6 (38%)                                    16
man-author, single-volume, 1838                8 (47%)                                    17
man-author, multivolume, 1838                  14 (42%)                                   33
unknown-gender author, single-volume, 1838     1 (25%)                                    4
unknown-gender author, multivolume, 1838       0 (0%)                                     2
All novels                                     56 (36%)                                   156

Table 2: Availability of digital surrogates from the British Library by year, author gender, and novel format. Note that no 1836 multivolume novels have been digitized. Totals differ from Table 1 because we check all 1836 novels (instead of a random sample) for a digital surrogate at the British Library.

Although addressing this question is beyond the scope of this article, we did observe, during the course of our investigation, a conspicuous pattern in the data which may help us begin to understand what is happening. We describe this pattern briefly in this section.

As a byproduct of our data collection process, our data include, for each title, whether or not the title has a digital surrogate from the British Library. These annotations permit us to ask whether or not there is evidence of digitization practices at the British Library which could be contributing to the differential availability across novel categories. (Because the British Library is a legal deposit library, we assume that collecting practices do not influence digital surrogate availability since the British Library received a copy of all novels.) Table 2 summarizes the availability for the titles.

To assemble the data for Table 2 we did some additional data collection focused on the British Library. We looked for digital surrogates for an additional 30 randomly sampled 1836 novels. We only looked for digital surrogates at the British Library. We did not check to see if these novels had surrogates at other digital libraries. Table 2 reports on digital surrogate availability at the British Library for these 1836 novels and all 1838 novels.

We find, much to our surprise, considerable evidence of a digitization practice at the British Library which discriminates on the basis of format: the British Library did not scan any multivolume novels from 1836. Although we were aware that very large (and very small) books were skipped by many digitization projects due to equipment being designed for books of certain (standard) sizes, we did not anticipate that multivolume works would be omitted from bulk digitization. The absence of digital surrogates for 1836 multivolume novels is unlikely to be due to chance. If the rate at which 1836 novels at the British Library have digital surrogates is 22.6% (the overall rate for 1836 titles, ignoring format), then the probability that 0 of the 37 multivolume novels would have been digitized is 1 in 13,000. We have no explanation for this pattern.
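The figure of roughly 1 in 13,000 can be checked directly from the counts in Table 2. The sketch below assumes, as the text does, that each 1836 novel is digitized independently at the overall 1836 rate. The variable names are illustrative and do not come from the study's own code.

```python
# Counts for 1836 novels, taken from Table 2.
digitized_1836 = 4 + 0 + 6 + 0 + 4 + 0   # novels with a British Library surrogate
total_1836 = 9 + 10 + 10 + 25 + 6 + 2    # all 1836 novels checked
multivolume_1836 = 10 + 25 + 2           # 1836 multivolume novels

# Overall 1836 digitization rate, ignoring format.
rate = digitized_1836 / total_1836

# Probability that none of the multivolume novels is digitized, assuming
# each is digitized independently at the overall rate (an assumption).
p_none = (1 - rate) ** multivolume_1836

print(f"overall 1836 rate = {rate:.3f}")
print(f"P(0 of {multivolume_1836} digitized) = {p_none:.2e}, "
      f"about 1 in {1 / p_none:,.0f}")
```

The calculation treats format as irrelevant under the null hypothesis, which is why the pooled 1836 rate is used rather than the single-volume rate.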

Figure 2: Estimated rates of British Library digital surrogate availability for single-volume novels published in 1836 and 1838. Points indicate posterior means, bars show 50% credible intervals.

Author gender, by contrast, does not predict digital surrogate availability in the British Library. Restricting our attention to single-volume novels, we find that digital surrogate availability does not appear to vary by gender in the British Library (Figure 2). (We model availability rates using the model described in Section 3.1, reducing the number of categories from six to three.) This result is consistent with the belief that (single-volume) novels in the British Library were indeed digitized at random. Because digitization procedures at different libraries resemble each other (in some cases, the firm or organization contracted to do the digitization is the same), this finding may offer weak support for the belief that digitization practices at other libraries did not discriminate among books in their holdings on the basis of author gender.
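To give a rough sense of the quantities plotted in Figure 2, the following sketch computes posterior means and 50% credible intervals for the three single-volume categories. It is not the model of Section 3.1: it assumes an independent Beta(1, 1) prior for each category and pools the 1836 and 1838 counts from Table 2, both of which are simplifying assumptions made here for illustration.

```python
import random
import statistics

# Single-volume counts pooled over 1836 and 1838, taken from Table 2:
# (novels with a British Library surrogate, novels checked).
counts = {
    "woman": (4 + 13, 9 + 22),
    "man": (6 + 8, 10 + 17),
    "unknown": (4 + 1, 6 + 4),
}

rng = random.Random(0)  # fixed seed so the Monte Carlo draws are reproducible


def summarize(k, n, draws=20000):
    """Posterior mean and central 50% interval under an assumed Beta(1, 1) prior."""
    a, b = k + 1, n - k + 1          # posterior is Beta(k + 1, n - k + 1)
    mean = a / (a + b)
    samples = [rng.betavariate(a, b) for _ in range(draws)]
    q1, _, q3 = statistics.quantiles(samples, n=4)  # quartiles of the draws
    return mean, q1, q3


summary = {gender: summarize(k, n) for gender, (k, n) in counts.items()}
for gender, (mean, lower, upper) in summary.items():
    print(f"{gender}: mean = {mean:.2f}, 50% interval = ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the three intervals overlap substantially, which is in line with the observation that availability does not appear to vary by author gender among single-volume novels.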

This paper demonstrates the feasibility of studying the coverage of digital libraries using an exhaustive list of published documents (here, novels). That this approach improves on the strategy of evaluating the coverage of a digital library by reference to another larger collection or catalog of unknown comprehensiveness is easy to appreciate. An exhaustive list of published books is a stable reference point. Given such a list, comparing different collections is straightforward: each collection has some calculable percentage of the documents in the exhaustive list. Determining the share of documents not present in any of the collections is also straightforward. Comparing several collections without such a fixed point is, by contrast, complicated. A given digital library may have impressive coverage with respect to one collection but not another. Although an ersatz reference point in such a case would be the union of the items in the several collections, such a reference point is not usable in future work which attempts to compare new collections which contain previously unseen items. Using an exhaustive list avoids this complication. Future research on the coverage of digital libraries should seek, whenever possible, to use fixed reference points such as exhaustive lists.

Our work also suggests a broader research program, the data-intensive study of trans-Atlantic library collection practices. Digital surrogate availability permits distinguishing between export editions and first editions which have been transported across the Atlantic. Export editions aside, very few copies of first editions of books printed in the British Isles and intended for audiences in the British Isles were transported across the Atlantic Ocean. The overwhelming majority remained in the British Isles. The existence of one such book in a North American library, revealed by the existence of a digital surrogate, is a signal that someone (perhaps a library donor or individual responsible for expanding a collection) valued the book enough to finance its transportation across more than 2500 nautical miles. The existence of several such books is an even stronger signal. For researchers interested in the reception of a book, this signal offers evidence of non-negligible readership, analogous to the way the existence of a second edition hints that (a publisher believed) a book sold well in its initial outing. As every digital surrogate available from the major digital libraries indicates the contributing library, the necessary information for this genre of research is readily available.

A more immediate need, however, is a fuller account of which novels are more likely to have digital surrogates. Are 19th-century novels by women and novels by authors of unknown gender always less likely to have digital surrogates? Or is the phenomenon confined to the first half of the 19th century? Does novel format strengthen or attenuate this tendency? Answering these questions will produce a more complete narrative of how the major digital libraries’ holdings differ from the population of 19th-century novels. This, in turn, will yield clues about which library collecting and digitization practices contribute to the differential availability of digital surrogates.
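The comparison against a fixed reference point described in this section, in which each collection is measured against an exhaustive list, can be sketched in a few lines. The function name, the titles, and the holdings below are invented for illustration and are not drawn from the study's data.

```python
def coverage_report(reference, collections):
    """Return each collection's share of the reference list and the share
    of reference titles that appear in none of the collections."""
    reference = set(reference)
    shares = {
        name: len(reference & set(items)) / len(reference)
        for name, items in collections.items()
    }
    # Titles on the reference list held by at least one collection.
    held_somewhere = set().union(*(set(v) for v in collections.values())) & reference
    missing_share = 1 - len(held_somewhere) / len(reference)
    return shares, missing_share


# Toy data (hypothetical): an exhaustive list and two collections.
reference = ["novel_a", "novel_b", "novel_c", "novel_d"]
collections = {
    "library_x": ["novel_a", "novel_b", "title_not_on_list"],
    "library_y": ["novel_b", "novel_c"],
}

shares, missing_share = coverage_report(reference, collections)
print(shares)
print(missing_share)
```

Because every share is computed against the same exhaustive list, adding a new collection later does not change the figures already reported for the others, which is the stability property the text attributes to exhaustive lists.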

The primary limitation of our study is that it only provides information about digital libraries' coverage of new prose fiction published during the late 1830s. The digital libraries' holdings of non-fiction books and books (fiction and non-fiction) published during other periods may indeed reflect the population. Moreover, as our study narrowly concerns previously unpublished prose fiction, it may be the case that digital libraries' holdings of subsequent editions (i.e., editions other than first editions) published in the late 1830s reflect the population of published subsequent editions. There are, to our knowledge, no systematic studies of collecting and digitization practices that permit us to generalize our findings beyond the late 1830s.

Another weakness of the study is that it does not distinguish between physical holdings and digital surrogates. A more nuanced investigation would have identified, for each library contributing digital surrogates, which books were held but not digitized. Such an investigation would provide insight into the collection and digitization practices of specific libraries. Our study avoided doing this because we assumed, based on experience working on bibliographic projects, that the legal deposit libraries of Oxford and the British Library possessed copies of every novel. This assumption may merit revisiting. Some novels which were once in the collections of the legal deposit libraries may have been lost or destroyed.

In this paper we describe the availability of digital surrogates of novels first published in the British Isles during the late 1830s. We compare the set of novels which have at least one digital surrogate available from the major digital libraries with the population of novels published during the period. We find that digital surrogate availability differs by author gender and format. Multivolume novels by men are most likely to have at least one digital surrogate. Single-volume novels, novels by women, and novels by authors of unknown gender are less likely to have digital surrogates. Future research may offer an account of the causes of the differential availability of surrogates. We speculate that library collecting practices play a role: for instance, a library may tend to collect novels in subgenres in which men authors predominate. Equally so, library digitization practices may play a role in the availability of surrogates: for instance, the British Library appears to have excluded multivolume novels published in 1836 from bulk digitization efforts.

The period under examination, the late 1830s, was chosen because it includes the most recent years for which exhaustive bibliographies exist. We have no reason to believe the late 1830s was exceptional in the relevant sense. A novel being published in the late 1830s seems unlikely, by itself, to influence the likelihood that the novel will have been collected or digitized by a library. Moreover, the existence of conspicuous biases in collecting or digitization practices for novels makes it more likely that biases also exist which influence the collecting of non-fiction works. Absent evidence to the contrary, it seems prudent for researchers to assume that gathering samples of fiction or non-fiction works published in the British Isles during the 1830s and 1840s from the major digital libraries will yield samples which do not reflect the population.

References

Appendix A: Gender Annotations

The gender annotation procedure used in both exhaustive bibliographies is believed to be essentially the same. There is one minor difference which only concerns novels which are written by an unknown author.

If the historical individual who wrote the novel is known, their gender is used. Thanks to the labors of literary historians, bibliographers, and genealogists, the historical individual is typically known. If, however, the historical individual who wrote the novel is unknown, a gender annotation different than “unknown” will be made if one of the following conditions is met:

- The author’s name (or pseudonym) is strongly associated with individuals of one gender rather than another. For example, a novel by a “Lady of Rank” would be coded as a woman-authored novel. Example: The Glanville Family (1838).
- The author indicates their gender in paratextual material such as a preface. For example, a non-narrative preface by the author may indicate the author’s gender by the use of gendered pronouns.

This method of arriving at author gender annotations strongly resembles the method used by Garside and Schöwerling (2000). In handling novels whose authors are unknown, we depart from that method in one respect: whereas Garside and collaborators tend to use only information appearing on the title page, we also use information appearing in non-narrative prefaces.

About Google Books, http://books.google.com/googlebooks/history.html, June 2011.
Katherine Bode, “The Equivalence of ‘Close’ and ‘Distant’ Reading; or, Toward a New Object for Data-Rich Literary History,” Modern Language Quarterly 78, no. 1 (March 1, 2017): 77–106, doi:10.1215/00267929-3699787.
Isabella Alexander, Copyright Law and the Public Interest in the Nineteenth Century (Bloomsbury Publishing, March 3, 2010), 62–63.
About Google Books.
Tim Wu, “What Ever Happened to Google Books?,” New Yorker, September 2015, issn: 0028-792X.
Brewster Kahle, “Internet Archive Forums: Books Scanning to Be Publicly Funded,” May 26, 2008, https://archive.org/post/194217/books-scanning-to-be-publicly-funded.
For texts written in languages other than English the holdings of these digital libraries are marginal by comparison to other digital libraries (e.g., Norway’s Nasjonalbiblioteket, France’s Bibliothèque Nationale de France, and Germany’s Bayerische Staatsbibliothek).
Laura Sare, “A Comparison of HathiTrust and Google Books Using Federal Publications,” Practical Academic Librarianship: The International Journal of the SLA Academic Division 2, no. 1 (2012): 1–25.
Brewster Kahle, “Public Access to the Public Domain: Copyright Week,” January 14, 2014, https://blog.archive.org/2014/01/14/public-access-to-the-public-domain-copyright-week/.
David Bamman et al., “Estimating the Date of First Publication in a Large-Scale Digital Library,” in Proceedings of the 17th ACM/IEEE Joint Conference on Digital Libraries (IEEE Press, 2017), 149–158.
Edgar Jones, “Google Books as a General Research Collection,” Library Resources & Technical Services 54, no. 2 (2011): 77–89.
Ibid.
Sare, “A Comparison of HathiTrust and Google Books Using Federal Publications.”
Ibid.
John Feather, Publishing, Piracy and Politics: An Historical Study of Copyright in Britain (New York, NY: Mansell, 1994), 97, isbn: 978-0-7201-2135-3.
Personal communication with Daniel Wilson (The Alan Turing Institute) on November 16, 2019. The Alan Turing Institute is located in and affiliated with the British Library.
Although information about the print run of specific editions is typically lost, we believe there may be serviceable proxies available. For example, a book which a publisher anticipated being popular may have been more likely to be printed using the “prestige” three-volume (“triple decker”) format. Charles E. Lauterbach and Edward S. Lauterbach, “The Nineteenth Century Three-Volume Novel,” The Papers of the Bibliographical Society of America.
Gaye Tuchman, Edging Women Out: Victorian Novelists, Publishers, and Social Change (New Haven: Yale University Press, 1989), isbn: 0300043163 (alk. paper); Anne E. Boyd, “‘What! Has She Got into the “Atlantic?”’ Women Writers, the Atlantic Monthly, and the Forming of the American Canon,” American Studies 39, no. 3 (September 1998): 5–36, issn: 0026-3079, doi:10.1353/amsj.v39i3.2695.
Troy J. Bassett, At the Circulating Library.
James Raven and Antonia Forster, The English Novel, 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles, ed. Peter Garside, James Raven, and Rainer Schöwerling, vol. 1 (Oxford: Oxford University Press, 2000), isbn: 978-0-19-818318-1; Peter Garside and Rainer Schöwerling, The English Novel, 1770-1829: A Bibliographical Survey of Prose Fiction Published in the British Isles, ed. Peter Garside, James Raven, and Rainer Schöwerling, vol. 2 (Oxford: Oxford University Press, 2000), isbn: 978-0-19-818318-1.
Verifying that a digital surrogate exists takes a considerable amount of time. The catalogs of each digital library must be searched separately. Even when a surrogate is found which appears to be a match for the given novel (i.e., the title, author, and publication year match), page images must be inspected. It is common to find a North American or French edition of a novel which is published in the same year as the first edition. Verifying that all volumes of a multivolume work are present is also time consuming. Oxford, for example, frequently binds together the separate volumes of a multivolume novel into one “volume”. So a single “volume” of a multivolume novel at Oxford needs to be carefully checked to verify it is complete. Our use of a random sample of 1836 novels reflects a desire to economize on time.
Bassett, At the Circulating Library.
The average number of novels published in 1836 and 1838 is 92. If we assume that 92 novels were published in 1835, 1837, and 1839, then the size of the late-1830s population is 460 novels.
These novels are included in a series of six updates to the bibliography (e.g., Garside, Belanger, & Mandal, 2001). The number of novels “discovered” is very small relative to the size of the bibliography. (The 1800-1829 bibliography has more than 2,000 titles.) For example, the first update features 10 newly discovered novels.
Bob Carpenter et al., “Stan: A Probabilistic Programming Language,”