Applications of Machine Learning in Document Digitisation
Christian M. Dahl, Torben S. D. Johansen, Emil N. Sørensen, Christian E. Westermann, Simon F. Wittrock
February 8, 2021
Abstract
Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that "large and detailed" usually implies "costly and difficult", especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. We instead advocate the use of modern machine learning techniques to automate the digitisation process. We give an overview of the potential for applying machine digitisation for data collection through two illustrative applications. The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to construct a treatment indicator. Moreover, it allows an assessment of assignment compliance. The second application uses attention-based neural networks for handwritten text recognition in order to transcribe age and birth and death dates from a large collection of Danish death certificates. We describe each step in the digitisation pipeline and provide implementation insights.

∗ Acknowledgements: We thank Peter Sandholdt Jensen, Joseph Price, and Michael Rosholm for useful comments. We also thank Søren Poder for contributing his expertise on digitisation of historical documents. We gratefully acknowledge support from Rigsarkivet (Danish National Archive) and Aarhus Stadsarkiv (Aarhus City Archive) who have supplied large amounts of scanned source material. We also gratefully acknowledge support from DFF who has funded the research project "Inside the black box of welfare state expansion: Early-life health policies, parental investments and socio-economic and health trajectories" (grant 8106-00003B) with PI Miriam Wüst.
† Department of Business and Economics, University of Southern Denmark
‡ School of Economics, University of Bristol

1. INTRODUCTION
Big data have brought new opportunities in economic research (Varian, 2014; Einav and Levin, 2014). However, big data are not confined to being contemporary. A recent review by Gutmann et al. (2018) highlights that large collections of scanned historical documents are essential examples of big data, and Gutmann et al. (2018) describe some of the challenges of harnessing such information in research. In particular, Gutmann et al. (2018) mention the prospects of automated record linking by applying machine learning (ML) to transcribed records. However, they do not comment on the important opportunities of using ML to automate the data collection from the raw images.

Traditionally, historical data have been collected manually either by research assistants, using (possibly paid) crowdsourcing, or by complete outsourcing to a transcription company. Manual data collection has limited scalability, and this reduces the value of large scanned document collections as they are difficult to operationalise for statistical analysis. ML methods provide a potential solution to this problem. These methods are easily scalable to millions of documents, they are fast relative to human transcription, and they provide reproducible results.

The economic literature does not have well-described applications of deep and machine learning to the collection of data from scanned documents. Similarly, while abundant with methods and models, the ML literature also lacks discussion of complete solutions to this problem, see e.g. Nagy (2016) for a general overview of document digitisation. Often the focus is on improving and benchmarking specific models in isolation using standardised datasets (cf. Graves, Liwicki, et al., 2008; Bluche, Ney, et al., 2014; Lee and Osindero, 2016). Such work is of limited practical use when implementing a complete data collection pipeline where the documents are non-standard and multiple models (e.g. for transcription and layout classification) need to work in unison.

Amazon Mechanical Turk is an example of paid crowdsourcing where workers are paid an amount every time they solve some pre-specified task. Alternatives such as Zooniverse provide the infrastructure to do crowdsourcing but otherwise rely on volunteers.

2. EARLY-LIFE CARE IN DENMARK - LAYOUT CLASSIFICATION
Interventions and estimation of treatment effects are central topics in both theoretical and applied economic research. However, prior to estimation, we need an assignment of each individual to a treatment or control group. Often treatment assignment is inferred from an intervention or policy that has (quasi) randomly assigned each individual, e.g. Angrist and Krueger (1991). This section considers a policy where a subset of infants was made eligible to participate in an expanded care programme. The participants in the programme received additional home visits from nurses. Enrolment in the programme was governed by the date of birth: individuals born in the first three days of each month were eligible to receive additional monitoring. The details of approximately 95,000 infants (whether enrolled or not) were collected in journals kept by the health care system. The journals have previously been described and used by Biering-Sørensen et al. (1980), and Bjerregaard et al. (2014) have manually transcribed a small subset of the contents to study birth weight and breastfeeding. The infants who received additional monitoring have a specific follow-up table in their journal only if the monitoring took place, i.e. the presence of the table is decided by actual treatment, not eligibility. The journals have been scanned and are available as digital images. While parts of the journals have previously been digitised, the presence of the treatment table was not recorded. Figure 1 illustrates the pages in a typical journal.

The journals have been made available through the DFF funded research project "Inside the black box of welfare state expansion: Early-life health policies, parental investments, and socio-economic and health trajectories" (grant 8106-00003B) with Miriam Wüst as PI.

Figure 1: Example of a typical nurse journal. The third page shows the treatment table. We are blacking out sensitive information.

In the following, we construct a treatment indicator using unsupervised ML by analysing the layout of each journal page and thereby identifying the group of children that actually received follow-up care. We compare this ML-based detection to an intention-to-treat (ITT) indicator inferred from the three-day policy and find that there is non-compliance. This illustrates that statistical models applied to images can, even without transcription, provide important information in applied economic research. Our dataset contains 95,313 journals with a total page count of 261,926.
Since the treated individuals can be identified by the presence of a particular page in their journal, we can use layout classification to detect treatment. If a page in their journal is classified as having the treatment table layout, then the individual is classified as treated. We did not have access to a labelled dataset to train a supervised classifier for the treatment page – as will often be the case in practice. Thus, we pursue an unsupervised approach where we rely only on the scanned images without labels. Note that we still need to manually construct an evaluation dataset to probe the performance of the applied method. However, if the unsupervised method is found to perform sub-par, then the evaluation set can serve as the basis for training a supervised classifier. In this sense, any manual transcription is not wasted. We will show an example of a supervised classifier for the same purposes in Section 3.

The documents are scanned and stored digitally as image files. Images consist of coloured dots called pixels. Each pixel is characterised by a location and a colour. Stacking a certain number of pixels horizontally and vertically forms an image. Thus, we can consider an image of h × w pixels to be an h × w matrix where each entry corresponds to a single pixel. In grayscale photos, each pixel can only attain white, black, and shades in-between. This is represented by a byte (8 bits) specifying a value between 0 and 255, with 0 being black and 255 white. The core of the machine digitisation process can be formulated as various statistical learning problems where we model different aspects of the visual information to learn a mapping from the image matrix into a representation that is suitable for economic analysis. Learning this mapping presents an array of challenges, in particular because the image matrix can be of very high dimension. A feature is a lower-dimensional variable that captures some aspect of the high-dimensional image, hopefully in a way that is more informative than the raw image data itself. Convolutional neural networks, see Goodfellow et al. (2016, Chp. 9), learn to extract such features when trained for image classification, and it turns out that these features are generally informative despite being trained on a specific dataset (Simonyan and Zisserman, 2015). This property is exploited in transfer learning where parts of a neural network trained on one set of images are applied for a new task on a different set of images (Pan and Yang, 2009). The VGG16 network is an example of a deep convolutional neural network that was trained on over 1 million photos to distinguish between 1,000 objects (Simonyan and Zisserman, 2015). Based on the concept of transfer learning, we use this pre-trained network to extract features from the journal images. This is a useful trick that can provide informative features without the need to train more sophisticated feature extractors or models. It works similarly to traditional feature extractors, e.g. SIFT (Lowe, 2004) and SURF (Bay et al., 2008), but the feature representation is learned instead of manually engineered. The classification part of VGG16 is discarded – we do not care about the original classification task – and we only keep the convolutional network. Each journal page is passed through the VGG16 convolutional network and we obtain a 512-dimensional feature vector that describes some aspects of the visual information.

Next, we use unsupervised methods to explore the features. The features are clustered using DBSCAN (Ester et al., 1996) – a density-based clustering algorithm. Pages with similar layout should cluster together as they share a similar VGG16 feature vector. To visualise the feature space, we embed the features into two dimensions using t-SNE, see Figure 2.

Figure 2: 2D t-SNE visualisation of the feature space of the journal pages (axes: Embedding X and Embedding Y). Each point represents a journal page and the colours correspond to the labels assigned by the clustering algorithm. Pages with similar layout cluster together. The embeddings have been subsampled to reduce cluttering, so only 30,000 randomly sampled embeddings are displayed. There is a total of 37 clusters which are manually annotated. The treatment pages are contained in four clusters.
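To make this concrete, the following is a minimal sketch of the feature extraction and clustering just described, using torchvision and scikit-learn. The folder name, image preprocessing, and DBSCAN parameters are illustrative assumptions rather than our exact settings.

```python
# Sketch: VGG16 feature extraction + DBSCAN clustering of page layouts.
import glob

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

# Keep only the convolutional part of VGG16; discard the classifier head.
vgg = models.vgg16(pretrained=True).features.eval()
pool = torch.nn.AdaptiveAvgPool2d(1)  # collapse the spatial grid -> 512-d vector

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = []
with torch.no_grad():
    for path in sorted(glob.glob("journal_pages/*.jpg")):  # hypothetical folder
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        f = pool(vgg(x)).flatten()  # 512-dimensional feature vector per page
        features.append(f.numpy())
features = np.stack(features)

# Density-based clustering; eps/min_samples must be tuned to the collection.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(features)

# 2D embedding for visual inspection (as in Figure 2).
embedding = TSNE(n_components=2).fit_transform(features)
```

The appeal of this design is that no component requires labelled training data; only the evaluation below relies on manual review.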
Table 1: Confusion matrix for the ML treatment detection model. The frequencies are based on a randomly sampled and manually reviewed validation set of 4,000 journals (10,914 pages).

                          ML detection
Ground truth      Treated   Not Treated   Total
Treated               234             0     234
Not Treated             0         3,766   3,766
Total                 234         3,766

To probe the performance, we constructed an evaluation dataset consisting of the 10,914 pages of 4,000 randomly selected journals. For each journal, we recorded the presence of the treatment table. The dataset was reviewed twice and 234 journals with treatment were found. Table 1 provides a confusion matrix for the ML treatment detection in the evaluation sample. All 234 treated and 3,766 untreated individuals are correctly classified by DBSCAN with zero false positives/negatives despite heavy class imbalance.
Table 2: Treatment indicator. Policy assignment is based on an official assignment rule which offered all children born in the first three days of each month enrolment in the nurse visiting programme. The ML assignment is based on the machine learning model and bases assignment on the presence of the treatment page in the journals. This allows for assessment of compliance in addition to the intention-to-treat effect, i.e., the date-of-birth assignment mechanism.

                        Policy detection (ITT)   ML detection
Treated                                  7,912          5,735
 - Born 1st-3rd                          7,912          4,247
 - Born 4th-31st                             0            455
Non-compliers                                -          4,120

Note that the classifier could obtain an accuracy of 3,766/4,000 ≈ 94% by simply classifying every individual as untreated. Obviously, this is not desirable and highlights the need for other performance measures. An option is to consider the precision and recall (Murphy, 2012, p. 184–185). In our context, precision is the number of individuals predicted as treated that are truly treated, while recall is the number of individuals detected as treated compared to the total number of treated. These two measures are especially relevant in detection and retrieval tasks such as those considered here (cf. Murphy, 2012, p. 185). In the results from the unsupervised method, both precision and recall are unity. While this is highly satisfactory, it is only an estimate of the true performance as we only use a subset of the data for evaluation. It is conceivable that the method can make some mistakes across the whole collection of 261,926 pages.
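For concreteness, the precision and recall implied by Table 1 can be computed directly; the snippet below is a minimal illustration of the two measures.

```python
# Precision and recall for the treatment-detection confusion matrix (Table 1).
true_positives = 234    # treated, detected as treated
false_positives = 0     # untreated, detected as treated
false_negatives = 0     # treated, detected as untreated

precision = true_positives / (true_positives + false_positives)  # = 1.0
recall = true_positives / (true_positives + false_negatives)     # = 1.0
print(f"precision={precision:.2f}, recall={recall:.2f}")
```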
Apart from the performance of the classifier itself, the results from the layout detection provide valuable insights on treatment assignment. From Table 2 it is evident that not all eligible individuals received the follow-up visits. 7,912 children are eligible but only 4,247 individuals born in the three-day eligibility period actually received a visit. This reveals an issue with non-compliance (3,665 + 455 = 4,120 non-compliers) which might have implications when estimating treatment effects, see e.g. Angrist, Imbens, et al. (1996). The performance of the ML detection is very encouraging, and in addition the ML approach also reveals details about the intervention that would otherwise have been lost, unless the whole collection of 261,926 pages was manually reviewed.

Actually, the class imbalance is even more severe. The ML model classifies pages, and in the evaluation set only 234/10,914 ≈ 2.14% of the pages contain the treatment table. In light of this, the high recall of the model is especially satisfying.

Although 7,912 individuals born between the 1st and the 3rd appears low when considering the sample size of roughly 95,000 children, we observe birth date for only 84,659 children. Hence, 9.35% of these children are born between the 1st and the 3rd, which is still slightly lower than expected (three days out of an average month of about 30.4 days, i.e. roughly 9.9%).

Keep in mind that the nurse journals are very uniformly scanned. Documents with more variation in quality might benefit from a supervised approach. For example, we found that the unsupervised method did not generalise well to the death certificates (Section 3).

3. MORTALITY IN DENMARK - HANDWRITTEN TEXT RECOGNITION
In Denmark, the use of death certificates was introduced at the national level in 1832. A death certificate documents the death of a single individual and contains a table with fields for name, birth date, cause-of-death etc. The documents are stored on paper at the Danish archives; one certificate is one page. Due to privacy restrictions, the publicly available death certificates are restricted to the timespan 1832–1942. The Danish National Archive and volunteers have scanned a large collection of certificates and made them available online as digital images. Around 2.5 million death certificates are available for download and, with approximately 10–12 fields per certificate, this amounts to 25–30 million individual fields to transcribe. Also, additional death certificates are continuously being scanned and added to the collection. We have a subcollection of approximately 250,000 death certificates across multiple years and locations. These are not randomly sampled but reflect the order in which the archives scanned the documents. This is not a major drawback as our main purpose is to illustrate the ML transcription pipeline.

A search for more information on these cases is ongoing.

See the description at the Danish National Archive (in Danish).

Figure 3: A sample of the different pages in the collection of death certificates. The examples are not exhaustive.

The collection contains several different page layouts, see Figure 3. We focus on the so-called type B certificates; 44,903 out of 250,000 certificates are type B. Figure 4 shows a type B certificate with manual highlighting of the three relevant fields.

Figure 4: The birth date, death date and age fields on a type B death certificate.

We split the transcription process into three steps:

(1) Layout classification: Separate the type B death certificates from the other types.
(2) Table segmentation: Extract an image for each of the selected fields in the pre-printed form.
(3) Transcription: Transcribe the extracted field images for birth date, death date, and age.

The sequence of steps (1)-(3) is called the ML pipeline, see the illustration in Figure 5. The pipeline consumes raw images of death certificates and produces string transcriptions for each field without human interaction. In the following sections, we describe each of the pipeline steps and evaluate the performance. Section 3.1 describes the transcribed datasets used for training and evaluation. The pipeline steps are considered in Sections 3.2–3.4 while Section 3.5 compares the ML pipeline to crowdsourced transcription.

Figure 5: The three steps in the ML data collection pipeline. (1) The collection of source images is sorted by layout. (2) The form in the document of interest is segmented into field images. (3) Each field image is transcribed into a digital string.
3.1. Datasets

We train and evaluate three separate models for the ML pipeline: one model for layout classification and two for transcription. In this process, we rely on several datasets which are outlined in Table 3 and described in detail below. The table segmentation method does not need training, so there is no dataset for this step.

(1) Dates: A training dataset with 11,630 dates and an evaluation dataset containing 1,000 dates. Data is approximately balanced between birth and death dates. The datasets are used to train and evaluate the transcription model for birth and death dates. Images are 320 × 50 pixels and the ground truth transcriptions are stored as strings in a standardised format. See Figure 6.
Figure 6: Examples of the images in the date datasets.

(2) Ages: The training and evaluation datasets contain 11,072 and 1,000 ages respectively. The datasets are used to train and evaluate the transcription model for age. Images are 230 × 75 pixels and the ground truth transcriptions are stored as strings and exclude the age suffix, i.e., years, months, days, or hours. Non-integer ages are also excluded. See Figure 7 for examples.

Table 3: Overview of the number of samples in each of the training and evaluation datasets used in the ML pipeline.

                       Training   Evaluation   Image size
Dates                    11,630        1,000   320 × 50 pixels
Ages                     11,072        1,000   230 × 75 pixels
Layouts                   7,000        2,184   Variable
Crowdsourced dates            -       46,526   320 × 50 pixels
Figure 7: Examples of the images in the age datasets.

(3) Layouts: The training and evaluation datasets contain 7,000 and 2,184 pages respectively. They are used to train and evaluate the layout classification model for detecting certificates of type B. The images are of varying size and the ground truth layout type is stored as an indicator variable. The images are similar to those shown in Figure 3.
(4) Crowdsourced dates: A dataset containing 23,263 complete death certificates that intersect with our collection of death certificates. Transcriptions are only available for birth and death dates, not for age. This dataset is used for evaluating the end-to-end performance of the pipeline.

The evaluation and training datasets (1)-(3) are constructed by manually transcribing a random sample of field images from the death certificates. They have been verified twice by different individuals and images with segmentation errors have been removed. We ensure there is no overlap between the training and evaluation datasets. The crowdsourced dataset (4) is freely available online from the Danish National Archive. Anybody can contribute to these transcriptions by editing them through an online interface.

See (in Danish) https://bit.ly/2VnFLzb

3.2. Layout classification

Layout classification refers to the process of organising a collection of documents by common layout structure, e.g. a common table, heading, or pre-printed landmarks. Chen and Blostein (2007) provide a relevant but methodologically outdated introduction to this problem. The layout type is important as the table segmentation step relies on a pre-defined template that must match the pre-printed structure in the image. Due to variations in layout across the collection of death certificates (see Figure 3), we need to construct one template for each type. Every time we fit a template to a certificate, we need the template and certificate types to match. This is straightforward if the certificates are sorted according to layout, as in this case all images in a given class will share the same template.

As we saw in Figure 3, an image X_i of a death certificate can belong to one of K layout types that can be distinguished visually, e.g. "Type B", "Type A", and so on. Let the layout type of image i be Y_i = k with k = 1, 2, ..., K. The K types do not need to mimic the visual types in the documents; we can easily focus on "Type B" and label everything else as "Other". We are interested in learning a model for the probability P(Y_i = k | X_i) such that we can infer the most likely layout type ŷ_i = arg max_k P(Y_i = k | X_i). This resembles a conventional K-class classification problem with the only difference that X_i is an image.

We already saw an example of unsupervised layout classification in the context of the nurse journals in Section 2 where we emphasised the importance of constructing a lower-dimensional feature that describes the raw image. In the supervised setting, there are two main considerations: (1) what features should we use, i.e. how do we construct a feature g(X_i) that represents the high-dimensional information in X_i, and (2) what classifier should we use, i.e. what model do we choose for the probability P(Y_i = k | g(X_i)).
This cannot be answered definitively and depends on the application. There are many equally viable approaches. In the nurse journal application, we solved (1) by using the features from a pre-trained neural network (VGG16) and used these directly in a cluster analysis. A modern supervised end-to-end approach would be to train a convolutional neural network (CNN) to classify the pages using raw images as input, i.e. g(X_i) = X_i. In this case, the neural network solves both (1) and (2) as it learns both the feature extractor and the classifier during training.

We take a simpler approach and use the visual Bag-of-Words (BoW) method to classify the layout type of the certificates. This method provides an intuitive definition of features in terms of visual landmarks or "words". Bag-of-words is a technique originally developed to classify chunks of text according to their content (Murphy, 2012, p. 87). It operates by determining the frequency of words (i.e. the frequency of some global set of words – the bag-of-words) in each text document. The method has been applied successfully in the field of computer vision, see Csurka et al. (2004) and Sivic and Zisserman (2009), where we instead operate with a bag of visual words, i.e. chunks of images.

The visual bag-of-words model creates features on the basis of a codebook or dictionary. This dictionary is constructed by extracting key points from a training dataset and clustering them such that we obtain M groups of key points where key points within a group appear similar in some sense. If we think of a training dataset that consists of photographs of animals, we could have a key point cluster related to eyes, one for ears, etc. Each of these clusters is a visual word. The visual words are similar to the feature clusters we discovered in the nurse journals using the unsupervised method, i.e. those in Figure 2. Here we extract the key points using Speeded-Up Robust Features (SURF) (Bay et al., 2008) as this has historically been the common choice, see Csurka et al. (2004).

When the dictionary has been constructed, we are ready to create the actual feature vectors for the images. For a given training image i, we extract SURF key points and assign each of these to the M key point clusters based on distance. We then count the number of features from the image that belong to each of the M key point clusters and construct a vector of normalised frequencies which serves as the final feature vector of the image. Note that the size M of the dictionary determines the dimensionality of the final feature vector. In the classification step, we train a model to classify each feature vector (and thus each image) into one of the K classes. The classifier can be of any type. Here we use support vector machines with radial kernels, see the introduction in James et al. (2013, Chp. 9). There are several tuning parameters in the BoW model (dictionary size, margin etc.) which we selected using leave-one-out cross-validation. The computational burden lies in the initial keypoint extraction and clustering. When the appropriate features have been extracted, it is very fast to fit the actual classifiers and hence it is computationally fast to do cross-validation for the classifier hyperparameters. For an end-to-end neural network it is substantially slower to do hyperparameter optimisation as the whole network will need to be re-trained – possibly for hours – for each set of hyperparameters. This highlights that simpler classifiers can be useful as a first step before developing more complex models.

Note that for large training sets, runtime for support vector machines with radial basis kernels increases from roughly O(n²) towards O(n³) and another classifier should be used.
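To make the visual BoW pipeline concrete, the following sketch implements dictionary construction and classification with OpenCV and scikit-learn. We substitute ORB key points for SURF, since SURF is only available in the opencv-contrib package, and the dictionary size and the data-loading conventions are illustrative assumptions.

```python
# Sketch: visual bag-of-words layout classifier (key points -> dictionary ->
# normalised frequency vectors -> SVM). Hyperparameters are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

M = 200  # dictionary size = dimensionality of the final feature vector
detector = cv2.ORB_create()  # stand-in for SURF (SURF needs opencv-contrib)

def descriptors(gray):
    _, desc = detector.detectAndCompute(gray, None)
    return np.zeros((0, 32), np.uint8) if desc is None else desc

def bow_vector(gray, dictionary):
    # Assign each key point to its nearest visual word and count frequencies.
    desc = descriptors(gray).astype(np.float32)
    if len(desc) == 0:
        return np.zeros(M)
    words = dictionary.predict(desc)
    hist = np.bincount(words, minlength=M).astype(np.float64)
    return hist / hist.sum()  # normalised frequency vector

def fit_bow_classifier(train_images, train_labels):
    # (1) Cluster all training key points into M visual words ...
    all_desc = np.vstack([descriptors(im) for im in train_images])
    dictionary = KMeans(n_clusters=M).fit(all_desc.astype(np.float32))
    # (2) ... then fit an RBF support vector machine on the BoW vectors.
    X = np.stack([bow_vector(im, dictionary) for im in train_images])
    return dictionary, SVC(kernel="rbf").fit(X, train_labels)
```

As noted above, once the dictionary is built, re-fitting the SVC for different hyperparameters is cheap, which is what makes cross-validation practical here.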
3.2.2. Results

The classifier is trained on the layout dataset which contains 7,000 full page images of randomly sampled death certificates together with an indicator for their ground truth layout type. We consider K = 4 distinct layout classes: (1) type A certificates, (2) type B certificates (our focus), (3) all other certificates and (4) empty pages. We evaluate the classifier on the corresponding layout evaluation dataset (2,184 certificates) and the confusion matrix is given in Table 4.

Table 4: Confusion matrix for the BoW layout classification. The frequencies are based on a randomly sampled and manually reviewed evaluation set of 2,184 death certificates. Death certificates are classified into four classes: Empty, A, B and Other. In this application we are only interested in the type B certificate.

                         ML detection
Ground truth    Empty    Other      A      B   Total
Empty              13        1      0      0      14
Other               2    1,535      0      0   1,537
A                   0        0    109      0     109
B                   0        0      0    524     524
Total              15    1,536    109    524

From the confusion matrix we see that the classifier provides highly satisfactory performance with only two false positives and one false negative in document classes we are not interested in. For the type B certificates the class-wise precision and recall are unity. We use the trained BoW classifier to predict across the entire collection of ∼250,000 death certificates and find that 44,903 certificates are type B. These are extracted for the table segmentation step.

The other classes (i.e. type A and empty) were used for a different application we do not discuss here.
3.3. Table segmentation

In the table segmentation step, the goal is to extract smaller images corresponding to each field (or cell) of a larger form (or table) in the source image. Coüasnon and Lemaitre (2014) provide a general introduction to the topic. There are two components to this problem: (1) identifying the structure of the table in the source image, and (2) exploiting the structure to extract the field images. We apply a simple template-based approach where a predefined template is fitted over the source image using a set of landmark points.

In ongoing work, we are considering how to solve (1) using edge-detection neural networks. However, in this paper we rely on standard filtering operations from the computer vision literature to binarise the source image and find straight horizontal and vertical lines.

Let P_0 = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be a set of points that define a template (e.g. the corners of a table) and analogously let P = {(x'_1, y'_1), (x'_2, y'_2), ..., (x'_K, y'_K)} be a set of points in an image that we think roughly corresponds to those in the template. Note that the number of points in the two sets do not need to be the same. Point registration is the problem of aligning the points in P_0 (the template) over the points in P (the image) (Besl and McKay, 1992). This is illustrated in Figure 10 where we align template points P_0 (blue dots) with the image landmarks P (red dots). We can intuitively understand this as finding a transformation function T that applies some transform to all points in P_0 such that T(P_0) aligns with P in some distance metric d. The point set registration methods differ in their choice of distance d and the constraints they put on the transformation T. The non-rigid version of the CPD method in Myronenko and Song (2010) assumes that the points in the image P are generated by a Gaussian distribution "around" the template points P_0 and it puts only mild regularity conditions on the transform T. In particular, it assumes that points move freely but coherently. Myronenko and Song (2010) model the problem using a Gaussian mixture model. We rely directly on their algorithm to align our template to the image by supplying our template points P_0 and image landmarks P. The algorithm applies the standard Expectation Maximisation (EM) method to solve the maximum likelihood problem.

It has been suggested that the CPD method can be improved by using a neural network to learn the transformation (Li et al., 2019) but we do not consider this here. There are also examples of more general learning-based table segmentation that does not rely on a pre-specified template, see e.g. Clinchant et al. (2018).
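To make the registration step concrete, the sketch below aligns a toy template to noisy landmarks using pycpd, an open-source Python implementation of CPD (not necessarily the implementation we used); the point sets are toy examples.

```python
# Sketch: non-rigid CPD alignment of template points P_0 to image landmarks P.
import numpy as np
from pycpd import DeformableRegistration

P0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # template
P = np.array([[0.1, 0.0], [1.05, 0.1], [0.0, 0.95],
              [0.98, 1.1], [0.5, 0.5]])  # detected landmarks (extra point ok)

# X is the target (image landmarks), Y the source (template) to be deformed.
reg = DeformableRegistration(X=P, Y=P0)
transformed_P0, params = reg.register()  # EM iterations under the hood

# transformed_P0 now overlays the image; it guides the cuts into field images.
```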
The initial template is constructed by finding a well-scanned death certificate where we can manually extract a number of points which can be used as anchors for the template, see Figure 8. This is only done once and these points comprise the template point set P_0.

Figure 8: An example of a set of points that define a template.

After the template has been established, each type B death certificate is subjected to a set of morphological operations (erosion and dilation) to find pixels belonging to table rows or columns. The image is first thresholded (i.e. converted from gray scale into black/white where all pixels are either 0 or 255) and afterwards erosion and dilation operators are applied. Erosion and dilation are common non-probabilistic ways to extract straight (vertical or horizontal) lines in computer vision problems, see e.g. Szeliski (2010). The morphological operations require careful hand-tuning to find the optimal parameters for the document type but these parameters only need to be tuned once for the whole collection. When the image has been reduced to its line structure, we apply a corner detector to obtain the landmark point set P, see Figure 9.

Figure 9: Landmark detection using morphological operations and a corner detector.

Figure 10: Point set alignment. Blue dots represent the template while red dots are the landmarks found in the source document.

Next, we apply the CPD algorithm to align the template points in P_0 to the image landmarks in P. This iterative fitting process is illustrated in Figure 10. Even if extra landmark points are detected, the CPD algorithm remains robust as long as the detected landmark points P are sufficiently spread out along the axes, i.e. if the noise is fairly uniform. Once the points have been aligned, we use the estimated transformation T to fit the template onto the image. The template then guides the cuts that separate the larger image into individual field images. This is illustrated in Figure 11. Figures 13-14 show examples of the final segmented fields for dates. We do not report quantitative measures of segmentation performance but note that the end-to-end transcription accuracy in Section 3.5 will include any errors produced by the segmentation step if these make the field images unreadable.

The field images could be further segmented to limit noise and assure that the position and scale of the text is similar between samples. A possible way to implement this is a Mask R-CNN network (He, Gkioxari, et al., 2017) which could be trained to predict regions of noise and handwritten text respectively. The text regions are then extracted and re-aligned on a blank image. However, this adds further complexity to the problem by introducing an additional segmentation model. The Mask R-CNN would need training data which requires manual annotation of text outlines to create a ground truth dataset. We will instead run the transcription model directly on the field images without any additional segmentation/cleaning and show that this is a viable approach.

Figure 11: Example of field images extracted using the fitted template.
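A minimal version of the landmark detection can be written with OpenCV as below; the kernel lengths and corner-detector settings are illustrative and would require the hand-tuning described above.

```python
# Sketch: extract straight lines with morphology, then detect corner landmarks.
import cv2
import numpy as np

def detect_landmarks(gray_page):
    # Binarise: dark ink becomes white foreground on black background.
    _, binary = cv2.threshold(gray_page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Erode/dilate with long thin kernels so only straight lines survive.
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    horiz = cv2.dilate(cv2.erode(binary, horiz_kernel), horiz_kernel)
    vert = cv2.dilate(cv2.erode(binary, vert_kernel), vert_kernel)
    table_lines = cv2.bitwise_or(horiz, vert)

    # Corners of the recovered line structure serve as the landmark set P.
    corners = cv2.goodFeaturesToTrack(table_lines, maxCorners=200,
                                      qualityLevel=0.01, minDistance=20)
    return corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))
```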
3.4. Transcription

Transcription is the process of converting an image of text into a string representation. Assume we have sequences of the form Y_i = (Y_{i,1}, Y_{i,2}, ..., Y_{i,T}) where T is the maximum sequence length and each Y_{i,t} is a random variable on the sample space Ω, the dictionary of tokens. In our application Ω contains the digits 0, 1, ..., 9, names of the months, and some special characters.

Model.
We utilise an attention-based neural network suggested by Xu et al. (2015) for image captioning and repurpose it for transcription of handwritten text. Originally, the model by Xu et al. (2015) predicts a string of words that describes the contents of an input image, e.g. if it is supplied with an image of a bird sitting in a tree then the model would predict the sentence "A bird sitting in a tree". However, the model of Xu et al. (2015) is generally applicable to tasks that involve image inputs and sequence outputs – which is exactly what characterises the transcription problem. A similar attention-based model for OCR was proposed by Lee and Osindero (2016) and it has been demonstrated that more complex multi-head attention models can transcribe whole paragraphs of handwritten text (multiple lines) without prior line segmentation, see Bluche, Louradour, et al. (2017).

In our implementation of Xu et al. (2015), we replace the final embedding output with a softmax layer. This allows the model to predict probabilities for each of the tokens in our dictionary Ω. In particular, assume that we have an image X_i and we are at step t in the prediction of the corresponding sequence Y_i = (Y_{i,1}, ..., Y_{i,T}). The model predicts the step t conditional probabilities

    P(Y_{i,t} = k | X_i, Y_{i,t-1} = ŷ_{i,t-1}, ..., Y_{i,1} = ŷ_{i,1}),   k ∈ Ω,

where (ŷ_{i,t-1}, ..., ŷ_{i,1}) are the previously predicted tokens in the sequence.

The model architecture is depicted in Figure 12. First, the feature network encodes the input image into a feature map. The attention mechanism A then applies soft-attention to this feature map by predicting weights in [0, 1] and taking the element-wise product of the features and weights. As noted by Xu et al. (2015), soft-attention is fully differentiable so we can use standard back-propagation to train the network. The attention-weighted feature map is passed to a recurrent neural network (RNN) cell D. This cell predicts the next token in the sequence and updates its hidden state. The hidden state is fed to the attention mechanism which updates the attention weights, i.e. "where should we look next?" This supplies the recurrent cell with a new weighting of the feature map. The RNN cell uses the hidden state and the weighted features to predict the next token in the sequence. This procedure is repeated until an end-of-sequence token is predicted by the model.

Figure 12: Transcription model based on the image captioning model in Xu et al. (2015). The feature network extracts features from the input image. The attention mechanism, denoted by A, weighs the features and supplies these as input to an RNN cell, D, that predicts one token of the sequence. The hidden state of the RNN cell is updated, and the updated state is passed to the attention network to decide on the next weights for the feature map. The process repeats until an end-of-sequence prediction is made.

An alternative approach is a neural network with the Connectionist Temporal Classification (CTC) loss. Such networks have achieved state-of-the-art results on unconstrained transcription (Graves, Liwicki, et al., 2008), i.e. in settings where there are no restrictions on the words the model can transcribe – implying that Ω contains the alphabet. Graves, Fernández, et al. (2006) suggested the CTC loss for sequence labelling. The CTC-based approach is a natural alternative to the attention model we use, but we do not pursue it here.

This is less straightforward for transcription of names. However, it is possible to construct dictionaries based on the most common names and only transcribe these. Our initial tests of this approach have been promising for reading names in heavily cluttered and poorly segmented fields, but we do not present the results here.

In our work, we have found that the soft attention mechanisms are significantly easier to train compared to their hard attention counterparts. This is also noted by Lee and Osindero (2016).
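As an illustration of this architecture, the sketch below implements one step of a soft-attention decoder in PyTorch. The layer shapes, the additive attention scoring, and the use of a GRU cell are our own simplifying assumptions for exposition; the sketch conveys the structure of Figure 12, not the exact model.

```python
# Sketch: one decoding step of a soft-attention transcription model.
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)  # attention scores
        self.cell = nn.GRUCell(feat_dim, hidden_dim)      # RNN cell "D"
        self.out = nn.Linear(hidden_dim, vocab_size)      # softmax layer

    def forward(self, features, hidden):
        # features: (batch, locations, feat_dim); hidden: (batch, hidden_dim)
        h = hidden.unsqueeze(1).expand(-1, features.size(1), -1)
        weights = torch.softmax(
            self.score(torch.cat([features, h], dim=2)).squeeze(2), dim=1)
        # Soft attention: weights in [0, 1], element-wise weighted features.
        context = (weights.unsqueeze(2) * features).sum(dim=1)
        hidden = self.cell(context, hidden)  # update the hidden state
        token_logprobs = torch.log_softmax(self.out(hidden), dim=1)
        return token_logprobs, hidden, weights
```

Repeating this step, feeding the updated hidden state back into the attention scoring, yields one token per iteration until the end-of-sequence token is emitted.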
Prediction. Given an image X_i we want to predict the corresponding sequence Y_i ∈ Ω^T. At each sequence step t, we use the model to predict probabilities over the tokens in the dictionary Ω. Due to dependence in the sequence, the predicted probabilities across the dictionary at step t depend on all the previously predicted tokens at steps (t − 1, t − 2, ..., 1). Finding the most likely complete sequence would therefore require evaluating on the order of |Ω|^T different sequences for each sample we want to predict. Hence it is necessary to rely on greedy algorithms to find a feasible (but possibly suboptimal) solution. We use beam search as suggested by Xu et al. (2015).

Evaluation. We evaluate the performance of the transcription models using two metrics. One relates to the average accuracy across individual tokens and the other to the accuracy of complete sequences. To fix notation, let y_i = (y_{i,1}, ..., y_{i,k_i}) be a realised ground truth sequence and ŷ_i = (ŷ_{i,1}, ..., ŷ_{i,h_i}) a predicted sequence for image X_i = x_i, i = 1, ..., n where n is the size of the evaluation dataset. Assume that the ground truth sequence is always padded so k_i ≥ h_i and that the padding value is chosen such that y_{i,j} ≠ ŷ_{i,j} for j > h_i. Moreover, let I(·) be the indicator function of some predicate. We define the Token Accuracy (TA) and m-error Sequence Accuracy (SA_m) by

    TA = \frac{1}{\sum_i k_i} \sum_{i=1}^{n} \sum_{j=1}^{k_i} I(y_{i,j} = \hat{y}_{i,j}),   (1)

    SA_m = \frac{1}{n} \sum_{i=1}^{n} I\left( \sum_{j=1}^{k_i} I(y_{i,j} \neq \hat{y}_{i,j}) \leq m \right).   (2)

TA has an interpretation as the average accuracy across all predicted tokens. SA_m is the proportion of correctly predicted sequences when m mistakes are allowed in the sequence. The token and sequence accuracies are closely related to the character and word accuracies in the HTR literature, see e.g. Graves, Liwicki, et al. (2008). However, our dictionary contains tokens which can be chosen arbitrarily, i.e. a token can be a character or a word.

Note that both TA and SA are sensitive to alignment, so while a single substitution produces one sequence error, missing or adding an extra token would misalign the entire sequence. For example, in the following stylised prediction the model misses a single token but this causes four errors in the predicted sequence because of misalignment:

    Ground truth:  <2>      <0>      <sep>    <1>      <9>      <0>
    Prediction:    <2>      <0>      <1>      <9>      <0>      <pad>
                   Correct  Correct  Error    Error    Error    Error

Hence missing a token can impose a heavy accuracy penalty when using these measures. Another possibility for measuring performance would be various string distances, e.g. the Levenshtein distance. These can be adapted to tokens instead of characters and do not have the alignment problem. We do not consider this further but note that relying on string distance measures would only improve the metrics for our models.

See https://rll.byu.edu/
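Returning to the prediction step, a minimal beam search over step-wise token probabilities could look as follows; `step_fn` is a hypothetical callable wrapping the trained model, and the beam width is an illustrative choice.

```python
# Sketch: beam search over step-wise token log-probabilities.
# step_fn(prefix) is assumed to return {token: log_probability} for the
# next position, given the tokens predicted so far.
def beam_search(step_fn, end_token="<end>", max_len=11, beam_width=5):
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            if prefix and prefix[-1] == end_token:
                candidates.append((prefix, logp))  # already finished
                continue
            for token, token_logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), logp + token_logp))
        # Keep only the beam_width most probable partial sequences instead
        # of enumerating all |Omega|^T possibilities.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)
        beams = beams[:beam_width]
    return max(beams, key=lambda c: c[1])[0]
```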
We train separate models for dates and ages but both are based on the same attention neural network. Below we detail the choices of dictionary, hyperparameters and the training procedure used to train the two models on the type B death certificates.
Dictionary.
The age transcription model uses a dictionary consisting of the digits (0, 1, ..., 9) together with start and end markers to delimit the sequence. The final age dictionary is Ω_age = {<start>, <0>, <1>, ..., <9>, <end>}. For example, the age 42 would be tokenised as <start>, <4>, <2>, <end>. Note that the model cannot transcribe non-integer ages (e.g. 2½) or any suffixes such as years, months, days, and hours. It would be straightforward to add additional tokens but since our ground truth transcriptions do not contain this information we cannot train the model to recognise them.

The date transcription model uses a larger dictionary, see Table 7 in the Appendix. Each month is considered a separate token. Similarly to the age example, the date September 20th, 90 would be tokenised as <start>, <2>, <0>, <sep>, <september>, <sep>, <9>, <0>, <end>, where <sep> denotes a separator token.
Data.
The date model is trained on 11,630 manually transcribed dates. The age model is trained on 11,072 manually transcribed ages. To boost the amount of training data, we apply an augmentation procedure to create multiple distorted versions of each original training sample. Prior to training, the mean and variance are estimated using a sample of 20,000 augmented images. The mean and variance are then used to standardise the pixel values in each training image. The ground truth sequences (respectively for dates and ages) are tokenised according to the defined dictionaries and they are padded to all have the same length T. For age this length is T = 4 tokens while it is T = 11 tokens for dates. During training the models are fed batches of pairs {(x_i, y_i)}_i where each training pair consists of a field image x_i and its corresponding ground truth sequence y_i.
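The text does not spell out the exact distortions, so the sketch below illustrates one plausible augmentation procedure with torchvision; the transform types and their ranges are assumptions.

```python
# Sketch: create randomly distorted copies of each training field image.
# The transform ranges are illustrative, not the paper's exact settings.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=2, translate=(0.02, 0.05),
                   scale=(0.95, 1.05), shear=2, fill=255),
    T.ColorJitter(brightness=0.2, contrast=0.2),
])

def augmented_copies(image, n_copies=10):
    # One original field image -> several distorted training samples.
    return [augment(image) for _ in range(n_copies)]
```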
Hyperparameters. The transcription models require various hyperparameters that must be selected by the researcher. Some parameters are intuitive, e.g. the size of the input images, but others are more abstract such as the number of hidden nodes in a specific layer of the neural network. There is limited guidance for choosing these parameters. For the best performance, it is necessary to do hyperparameter optimisation where several sets of parameters are evaluated on a tuning dataset. We do not consider this and instead use manually selected values that yield acceptable results on the death certificates.

The feature network is the convolutional part of a pre-trained ResNet-101 network (He, Zhang, et al., 2016) which we fine-tune during training with its own learning rate. The feature map is obtained by applying adaptive max pooling after the final layer of the ResNet-101 feature network. Both the main and feature network learning rates are decayed by a factor of 0.25 at steps 10,000 and 25,000. The initial hidden state of the RNN cell is learned from the features using a fully-connected layer. We apply dropout during training. The models are trained for a fixed number of steps implying that each sample is encountered on average 162 times. Note that the training datasets are augmented; while the model does encounter the same original sample multiple times, the sample will have different random distortions.
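In PyTorch, a training setup of this kind could be wired as follows. The learning-rate values, the decoder placeholder, and the pooled grid size are assumptions; only the decay factor 0.25 and the decay steps 10,000 and 25,000 are taken from the text.

```python
# Sketch: ResNet-101 feature network + separate learning rates + step decay.
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet101(pretrained=True)
feature_net = nn.Sequential(*list(resnet.children())[:-2],   # drop pool + fc
                            nn.AdaptiveMaxPool2d((10, 20)))  # assumed grid

decoder = nn.GRUCell(2048, 512)  # placeholder for the attention decoder

optimizer = torch.optim.Adam([
    {"params": feature_net.parameters(), "lr": 1e-5},  # placeholder value
    {"params": decoder.parameters(), "lr": 1e-4},      # placeholder value
])
# Decay both learning rates by 0.25 at training steps 10,000 and 25,000;
# the scheduler is stepped once per training batch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000, 25_000], gamma=0.25)
```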
We evaluate the performance of the transcription models using the TA and SA_m metrics from Equations 1-2. This section focuses on the performance of the ML transcription models in isolation, so all results are conditional on the transcription models receiving perfectly segmented images. The trained transcription models are used to predict the images in the date and age evaluation datasets using beam search. Prior to prediction, the images are standardised to zero mean and unit variance using the mean and variance estimated from the entire evaluation dataset. No other pre-processing is applied. Transcription of a single image happens in less than 75ms.

This is common practice and as straightforward as it sounds. We take the average and standard deviation over all pixels in all images, and then respectively subtract and divide each pixel by these values.

Table 5 shows the performance of the two transcription models – for date and age respectively – on their corresponding evaluation sets. We present results both for training with and without augmentation of the training dataset. The token accuracy (TA) for dates is 97.9% with augmentation and 92.2% without. For ages the TA is 98.5% with augmentation and 96.6% without. These are the average accuracies of predicting a single token correctly. Figures 13-14 show random samples of incorrect and correct dates as predicted by the ML date transcription model.

Table 5: Token (TA) and Sequence (SA) Accuracy on the ground truth evaluation sets. The first row ("Real") uses only ground truth training samples, while the second row shows the result when training is conducted on the augmented dataset. The date training set contains 11,630 samples and its evaluation set 1,000 samples; the age training set contains 11,072 samples and its evaluation set 1,000 samples.

                  Dates                              Ages
              TA      SA_0    SA_1    SA_2       TA      SA_0    SA_1
Real         .922    .661    .933    .981       .966    .936    .988
Augmented    .979    .905      -       -        .985      -       -

In Table 5, the zero-error sequence accuracies SA_0 are significantly lower than the token accuracies for both models. Under augmentation, the date model achieves an SA_0 of 90.5% while the age model achieves an SA_0 of about 97%.

Figure 13: Random sample of incorrect predictions from the date transcription model.

Figure 14: Random sample of correct predictions from the date transcription model. The label in the upper right corner displays the model prediction in the format day-month-year.

It is apparent from Table 5 that an increase in the number of allowed mistakes per sequence improves the accuracy significantly. E.g. if we allow one mistake (i.e. one substitution) in the date sequence, the one-error sequence accuracy is above 98%. The gains from augmentation are notable given that the models are trained on only 11,630 dates and 11,072 ages. If our training datasets were larger, the payoff from augmentation would be smaller. However, the differences display the benefits of augmentation to boost performance in smaller training datasets.
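The pixel standardisation described above amounts to a few lines of numpy, assuming the images are stacked into a single array:

```python
# Sketch: standardise pixel values to zero mean and unit variance.
import numpy as np

def standardise(images):
    # images: array of shape (n_images, height, width), float pixel values
    mean = images.mean()
    std = images.std()
    return (images - mean) / std
```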
3.5. Comparison with crowdsourced transcription

We consider a comparison to a crowdsourced dataset. The usefulness of this comparison is twofold: (1) it allows us to evaluate if the ML approach produces transcriptions that are on par with those obtainable from crowdsourcing, and (2) it provides a measure of the end-to-end performance of the ML transcription pipeline, including errors of segmentation, transcription, and to an extent layout classification.

Our own evaluation dataset – as used in Table 5 – was manually reviewed to exclude (1) images where segmentation errors obscured the text in the image, and (2) images belonging to a document of the wrong layout type (i.e. anything other than type B death certificates). Evaluating the model on this dataset provides a clean measure of the transcription model in isolation. However, it does not give any insights on the transcription performance when the model might receive a badly segmented image. The crowdsourced dataset is different in this respect as it is constructed by humans looking at the raw document, finding the date field, and transcribing the text. This implies that the crowdsourced transcriptions cannot be impacted by segmentation or layout classification errors. By running the entire pipeline on the raw images and comparing the final transcription output to the crowdsourced transcription, we can get a measure of the overall performance of the pipeline in practice. Of course, this relies on the assumption that the crowdsourced data are perfectly transcribed, which is not likely to be the case.

The comparison does not provide us with a good measure of the recall of the layout classifications as we only consider certificates that are in both datasets. There might be type B certificates that (1) have not been detected by our ML model and (2) are not in the crowdsourced dataset. These will be missed here.

To get a baseline indication of the quality of the crowdsourced dataset, we compare it against our ground truth training and evaluation sets (overlap of 2,864 documents) and find that the dates are identical in 96.30% of the cases, see Table 6. For comparison, the performance of the transcription model on the evaluation set (i.e. perfectly segmented images) is 90.5%.

Table 6: Zero-error Sequence Accuracy SA_0 for the ML date model (trained with augmentation) and crowdsourcing, both evaluated on our date ground truth datasets. The evaluation set for the ML model consists of 1,000 samples, while the evaluation set for the crowdsourced predictions consists of 2,864 samples as we can pool both our training and evaluation ground truth datasets in this case. The training and evaluation sets have been twice manually reviewed and dates that are unreadable due to bad segmentation have been removed. Age is not directly transcribed in the crowdsourced dataset and hence excluded here. Columns 1-2 give sequence accuracy for the whole date, while columns 3-8 give sequence accuracy on the individual components of the date (day, month and year). Note that the comparison takes into account common date formatting, e.g. that the dates 01-10-2000 and 1-10-2000 convey the same point in time.

                   Date              Day               Month             Year
                ML     Crowd      ML     Crowd      ML     Crowd      ML     Crowd
SA_0           .905    .963      .960    .983      .970    .987      .972    .988

We then run the entire pipeline on the raw scans and compare the output to the crowdsourced transcriptions in the overlapping collection of 23,263 documents – each containing a birth and death date for a total of 46,526 dates – where we filter out any overlap with the training sample used to train the ML model. Note that the crowdsourced dataset does not contain transcriptions of age. The dates predicted by the ML model and crowdsourcing are identical in 83.66% of the cases, and for 89.96% of the dates the difference is less than one calendar year. This is a substantial difference compared to Table 5 where the ML sequence accuracy was 90.5%. As elaborated above, the performance difference relative to Table 5 stems from two sources: (1) noise in the crowdsourced dataset, and (2) the other pipeline steps prior to transcription. Thus, unless (1) is large, this gives an approximation to the end-to-end performance of the whole pipeline. We should keep in mind that the 83.66% sequence accuracy allows for zero mistakes in the predicted sequence and that this is the expected performance if we – without any pre-processing or adjustments – feed a collection of raw scans to the pipeline. Also, given the noise in the crowdsourced dataset, we can argue that 83.66% might be a slightly conservative estimate unless crowdsourcing participants and ML make the exact same mistakes on the exact same documents.

Not completely identical: we take common differences in formatting into account, so e.g. 01-3-2000 and 1/3-2000 would be considered equal.

Using the ML approach, it is cheap to transcribe additional fields on the documents as it only requires a training sample. As we have seen, the training sample can be much smaller than the full collection of documents. This can be exploited to produce higher sequence accuracy rates if there are internal correspondences between the fields in the source document. For example, the death certificates contain both birth and death date and age. These three fields should be internally consistent. If they do not match then either (1) the source document contains a mistake or (2) the ML model made a mistake. If we transcribe both age and dates and exploit the correspondence between these fields, we can filter out 5,767 potentially erroneous records among the 23,263 documents (46,526 field images), as sketched below.
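A sketch of such an internal consistency filter is given below; the record format, the use of implied age in years, and the one-year tolerance are illustrative assumptions.

```python
# Sketch: flag records where birth date + transcribed age contradict the
# death date. Records that disagree are routed to manual review.
from datetime import date

def is_consistent(birth: date, death: date, age_years: int,
                  tolerance_years: int = 1) -> bool:
    implied_age = (death - birth).days / 365.25
    return abs(implied_age - age_years) <= tolerance_years

# Example: transcriptions that imply different ages are flagged.
print(is_consistent(date(1850, 3, 1), date(1910, 5, 20), 60))  # True
print(is_consistent(date(1850, 3, 1), date(1910, 5, 20), 47))  # False
```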
Hence, relationships between fields can provide automatic verification and be used to flag problematic records for manual review. This method can of course also be applied in a manual/crowdsourcing context, but in this case transcription of additional fields is more costly.

In addition to the accuracy rates, we also compare the distributions of the data obtained from the ML and crowdsourcing approaches. Any systematic bias in the ML model would produce deviations from the empirical distribution observed in the crowdsourced dataset. Also – based on the discussion above – we expect internal consistency between ML transcribed ages and dates. Figure 15 compares kernel density estimates of the age distribution produced by (red line) ML age transcriptions, (blue line) ML date transcriptions, and (green line) crowdsourced date transcriptions. Note that the figure only displays ages in the interval [0; 100] – any ages outside this interval are discarded. In addition, we discard all predicted dates where the year does not contain four digits. For some dates in the source documents, the year has been abbreviated to only the last two digits (e.g. 1890 becomes 90 or 1910 becomes 10). We cannot easily infer the first two digits, as most of the death certificates span the time 1850–1950 so the leading digits could be either 18 or 19. This is not a shortcoming of the transcription model but rather a lack of information in the source documents. In Figure 15, we see that the distribution of ML age transcriptions is very close to the age distribution implied by the crowdsourced date transcriptions. The age distribution implied by the ML date transcriptions also appears similar, although with a notable difference around ages 75–85. This deviation is not surprising as the dates contain longer sequences to transcribe (relative to age) and the ML model needs to transcribe both birth and death date correctly (at least down to the year) to get an approximately correct age prediction.
Figure 15: Lifetime distributions. Kernel density estimates of the lifetime distribution implied by ML age transcriptions (red), ML date transcriptions (blue), and crowdsourced date transcriptions (green) in an overlapping sample of 23,263 individuals restricted to the age interval [0; 100]. Implied age refers to the difference between the birth and death dates in years.

Motivated by the notable difference in performance between the whole pipeline and transcription only, we manually review some of the cases where the crowdsourced and model predictions differ. This reveals, albeit qualitatively, that most discrepancies are related to segmentation issues where the CPD segmentation obscures the year in the date fields. The CPD segmentation template makes an arbitrary cut to separate the age and death date fields. The year is the rightmost component of the date and hence most likely to be impacted by this cut – refer to the positioning of the death date and age fields in Figure 4. This explanation is corroborated if we look at the accuracy rates for the individual date components with the crowdsourced dataset as ground truth: day accuracy is 95.92%, month accuracy is 96.98%, and year accuracy is 89.11%. Clearly, the accuracy for the year component is notably lower. When discovered, such issues can easily be corrected in the ML pipeline. Had the transcription been done manually, it would be much more costly to correct systematic transcription errors.

As a concluding remark, keeping in mind the resources and time needed to perform manual or crowdsourced transcription, we note that the ML end-to-end accuracy of 83.66% – trained on only 11,630 dates and 11,072 ages – might in many cases be acceptable, especially for large document collections that are otherwise infeasible to transcribe.

It should be noted that the ML model in this application has not been carefully optimised and it has only been trained on a small training dataset. It is conceivable that the model can perform substantially better. In addition, improvements in segmentation would also positively impact the end-to-end accuracy. The previously discussed within-field segmentation using Mask R-CNN (He, Gkioxari, et al., 2017) would be an option for this. Also, in practical applications, the model can be used to speed up transcription while retaining manual review of each (or some) predictions. The reviewed predictions can then be used to re-train and improve the model.

There is also the possibility to rely on the confidence predicted by the model and remove transcriptions where the model is highly uncertain. We could then apply a threshold on the prediction probabilities and manually review sequences with predicted probabilities below the threshold. Although, it is questionable if these probabilities can be understood as a general measure of confidence, see e.g. the discussion of calibration in Guo et al. (2017).

4. CONCLUDING REMARKS
We have provided two motivating examples that apply ML to collect data from scanned documents. First, we have shown that (unsupervised) layout classification can be a useful tool in intervention studies where treatment compliance is inferred from documents. As an exploratory step, the layout classification can reveal the composition of pages within the collection and details about assignment. We demonstrated on a clean dataset of nurse journals that a simple unsupervised method can achieve precision and recall around 1. Second, we have demonstrated an end-to-end ML pipeline that transcribes handwritten dates and ages from a large collection of death certificates. Once such a pipeline is trained, the difference in cost between transcribing 250,000 and 2 million documents is negligible as opposed to the cost of the equivalent manual transcription.
REFERENCES
Abramitzky, Ran et al. (2019). Automated linking of historical data. Tech. rep. National Bureau of Economic Research.
Angrist, Joshua D, Guido W Imbens, and Donald B Rubin (1996). "Identification of causal effects using instrumental variables". In: Journal of the American Statistical Association 91.434, pp. 444–455.
Angrist, Joshua D and Alan B Krueger (1991). "Does Compulsory School Attendance Affect Schooling and Earnings?" In: The Quarterly Journal of Economics 106.4, pp. 979–1014.
Athey, Susan and Stefan Wager (2017). "Efficient Policy Learning". In: arXiv preprint 1702.02896. url: https://arxiv.org/abs/1702.02896.
Bailey, Martha et al. (2017). How Well Do Automated Linking Methods Perform? Lessons from US Historical Data. Tech. rep. National Bureau of Economic Research.
Bay, Herbert et al. (2008). "Speeded-up robust features (SURF)". In: Computer Vision and Image Understanding 110.3, pp. 346–359.
Besl, Paul J. and Neil D. McKay (1992). "A method for registration of 3-D shapes". In: Sensor Fusion IV: Control Paradigms and Data Structures. Vol. 1611. International Society for Optics and Photonics, pp. 586–606.
Biering-Sørensen, Fin, Jørgen Hilden, and Knud Biering-Sørensen (1980). "Breast-feeding in Copenhagen, 1938-1977: Data on more than 365,000 infants". In: Danish Medical Bulletin 27, pp. 42–48.
Bjerregaard, Lise G et al. (2014). "Effects of body size and change in body size from infancy through childhood on body mass index in adulthood". In: International Journal of Obesity 38.
Bluche, Théodore, Jérôme Louradour, and Ronaldo Messina (2017). "Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention". In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, pp. 1050–1055.
Bluche, Théodore, Hermann Ney, and Christopher Kermorvant (2014). "A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling for Handwriting Recognition". In: Statistical Language and Speech Processing. Ed. by Laurent Besacier, Adrian-Horia Dediu, and Carlos Martin-Vide. Cham: Springer International Publishing, pp. 199–210.
Chen, Nawei and Dorothea Blostein (2007). "A survey of document image classification: problem statement, classifier architecture and performance evaluation". In: International Journal of Document Analysis and Recognition (IJDAR) 10.1, pp. 1–16.
Cho, Kyunghyun et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 1724–1734.
Clinchant, S. et al. (2018). "Comparing Machine Learning Approaches for Table Recognition in Historical Register Books". In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 133–138.
Coüasnon, Bertrand and Aurélie Lemaitre (2014). "Recognition of Tables and Forms". In: Handbook of Document Image Processing and Recognition. Ed. by David Doermann and Karl Tombre. London: Springer London, pp. 647–677.
Csurka, Gabriella et al. (2004). "Visual categorization with bags of keypoints". In: Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1. Prague, pp. 1–22.
Einav, Liran and Jonathan Levin (2014). "Economics in the age of big data". In: Science 346.6210.
Ester, Martin et al. (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 226–231.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press.
Graves, Alex, Santiago Fernández, et al. (2006). "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks". In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
Graves, Alex, Marcus Liwicki, et al. (2008). "A novel connectionist system for unconstrained handwriting recognition". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.5, pp. 855–868.
Guo, Chuan et al. (2017). "On Calibration of Modern Neural Networks". In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, pp. 1321–1330.
Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman (2018). "Learning to Read by Spelling: Towards Unsupervised Text Recognition". In: arXiv preprint 1809.08675. url: https://arxiv.org/abs/1809.08675.
Gutmann, Myron P., Emily Klancher Merchant, and Evan Roberts (2018). ""Big Data" in Economic History". In: The Journal of Economic History 78.1.
Harris, Chris and Mike Stephens (1988). "A Combined Corner and Edge Detector". In: Proceedings of the Alvey Vision Conference, pp. 147–151.
He, K., G. Gkioxari, et al. (2017). "Mask R-CNN". In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
He, Kaiming, Xiangyu Zhang, et al. (2016). "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
James, Gareth et al. (2013). An Introduction to Statistical Learning. Vol. 112. Springer.
Kingma, Diederik P and Jimmy Ba (2014). "Adam: A method for stochastic optimization". In: arXiv preprint 1412.6980. url: https://arxiv.org/abs/1412.6980.
Lee, Chen-Yu and Simon Osindero (2016). "Recursive recurrent nets with attention modeling for OCR in the wild". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2231–2239.
Li, X., L. Wang, and Y. Fang (2019). "PC-Net: Unsupervised Point Correspondence Learning with Neural Networks", pp. 145–154.
Lowe, David G (2004). "Distinctive image features from scale-invariant keypoints". In: International Journal of Computer Vision 60.2, pp. 91–110.
Maaten, Laurens van der and Geoffrey Hinton (2008). "Visualizing Data using t-SNE". In: Journal of Machine Learning Research 9, pp. 2579–2605.
Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Myronenko, Andriy and Xubo Song (2010). "Point set registration: Coherent point drift". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 32.12, pp. 2262–2275.
Nagy, George (2016). "Disruptive developments in document recognition". In: Pattern Recognition Letters 79, pp. 106–112.
Nion, T. et al. (2013). "Handwritten Information Extraction from Historical Census Documents". In: 2013 12th International Conference on Document Analysis and Recognition, pp. 822–826.
Pan, Sinno Jialin and Qiang Yang (2009). "A survey on transfer learning". In: IEEE Transactions on Knowledge and Data Engineering 22.10, pp. 1345–1359.
Simonyan, Karen and Andrew Zisserman (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: International Conference on Learning Representations. url: https://arxiv.org/abs/1409.1556.
Sivic, Josef and Andrew Zisserman (2009). "Efficient visual search of videos cast as text retrieval". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.4, pp. 591–606.
Szeliski, Richard (2010). Computer Vision: Algorithms and Applications. Springer.
Varian, Hal R. (2014). "Big Data: New Tricks for Econometrics". In: Journal of Economic Perspectives 28.2, pp. 3–28.
Xu, Kelvin et al. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". In: International Conference on Machine Learning, pp. 2048–2057.
Ye, Junting and Steven Skiena (2019). "The Secret Lives of Names? Name Embeddings from Social Media". In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3000–3008.
APPENDIX

Dates dictionary

Description                      Tokens
Day and year digits              <0>, <1>, <2>, <3>, <4>, <5>, <6>, <7>, <8>, <9>
Months                           <january>, <february>, ..., <december>
Separators                       <sep>
Sequence markers and padding     <start>, <end>, <pad>

Table 7: Dictionary Ω_date. The month, separator, and sequence-marker tokens are denoted <january>, ..., <december>, <sep>, <start>, <end>, and <pad>, matching the notation used in Section 3.4.