Applications of Machine Learning in Document Digitisation
Christian M. Dahl, Torben S. D. Johansen, Emil N. Sørensen, Christian E. Westermann, Simon F. Wittrock
February 8, 2021
Abstract
Data acquisition forms the primary step in all empirical research. The availability of data directly impacts the quality and extent of conclusions and insights. In particular, larger and more detailed datasets provide convincing answers even to complex research questions. The main problem is that "large and detailed" usually implies "costly and difficult", especially when the data medium is paper and books. Human operators and manual transcription have been the traditional approach for collecting historical data. We instead advocate the use of modern machine learning techniques to automate the digitisation process. We give an overview of the potential for applying machine digitisation for data collection through two illustrative applications. The first demonstrates that unsupervised layout classification applied to raw scans of nurse journals can be used to construct a treatment indicator. Moreover, it allows an assessment of assignment compliance. The second application uses attention-based neural networks for handwritten text recognition in order to transcribe age and birth and death dates from a large collection of Danish death certificates. We describe each step in the digitisation pipeline and provide implementation insights.

∗ Acknowledgements: We thank Peter Sandholdt Jensen, Joseph Price, and Michael Rosholm for useful comments. We also thank Søren Poder for contributing his expertise on digitisation of historical documents. We gratefully acknowledge support from Rigsarkivet (Danish National Archive) and Aarhus Stadsarkiv (Aarhus City Archive) who have supplied large amounts of scanned source material. We also gratefully acknowledge support from DFF who has funded the research project "Inside the black box of welfare state expansion: Early-life health policies, parental investments and socio-economic and health trajectories" (grant 8106-00003B) with PI Miriam Wüst.
† Department of Business and Economics, University of Southern Denmark
‡ School of Economics, University of Bristol

1. INTRODUCTION
Big data have brought new opportunities in economic research (Varian, 2014; Einav and Levin, 2014). However, big data are not confined to being contemporary. A recent review by Gutmann et al. (2018) highlights that large collections of scanned historical documents are essential examples of big data, and Gutmann et al. (2018) describe some of the challenges of harnessing such information in research. In particular, Gutmann et al. (2018) mention the prospects of automated record linking by applying machine learning (ML) to transcribed records. However, they do not comment on the important opportunities of using ML to automate the data collection from the raw images.

Traditionally, historical data have been collected manually either by research assistants, using (possibly paid) crowdsourcing, or by complete outsourcing to a transcription company. Manual data collection has limited scalability, and this reduces the value of large scanned document collections as they are difficult to operationalise for statistical analysis. ML methods provide a potential solution to this problem. These methods are easily scalable to millions of documents, they are fast relative to human transcription, and they provide reproducible results.

The economic literature does not have well-described applications of deep and machine learning to the collection of data from scanned documents. Similarly, while abundant with methods and models, the ML literature also lacks discussion of complete solutions to this problem, see e.g. Nagy (2016) for a general overview of document digitisation. Often the focus is on improving and benchmarking specific models in isolation using standardised datasets (cf. Graves, Liwicki, et al., 2008; Bluche, Ney, et al., 2014; Lee and Osindero, 2016). Such work is of limited practical use when implementing a complete data collection pipeline where the documents are non-standard and multiple models (e.g. for transcription and layout classification) need to work in unison.

Amazon Mechanical Turk is an example of paid crowdsourcing where workers are paid an amount every time they solve some pre-specified task. Alternatives such as Zooniverse provide the infrastructure to do crowdsourcing but otherwise rely on volunteers.

2. EARLY-LIFE CARE IN DENMARK - LAYOUT CLASSIFICATION
Interventions and estimation of treatment effects are central topics in both theoretical and applied economic research. However, prior to estimation, we need an assignment of each individual to a treatment or control group. Often treatment assignment is inferred from an intervention or policy that has (quasi) randomly assigned each individual, e.g. Angrist and Krueger (1991). This section considers a policy where a subset of infants was made eligible to participate in an expanded care programme. The participants in the programme received additional home visits from nurses. Enrolment in the programme was governed by the date of birth: individuals born in the first three days of each month were eligible to receive additional monitoring. The details of approximately 95,000 infants (whether enrolled or not) were collected in journals kept by the health care system. The journals have previously been described and used by Biering-Sørensen et al. (1980), and Bjerregaard et al. (2014) have manually transcribed a small subset of the contents to study birth weight and breastfeeding. The infants who received additional monitoring have a specific follow-up table in their journal only if the monitoring took place, i.e. the presence of the table is decided by actual treatment, not eligibility. The journals have been scanned and are available as digital images. While parts of the journals have previously been digitised, the presence of the treatment table was not recorded. Figure 1 illustrates the pages in a typical journal.

The journals have been made available through the DFF funded research project "Inside the black box of welfare state expansion: Early-life health policies, parental investments, and socio-economic and health trajectories" (grant 8106-00003B) with Miriam Wüst as PI.

Figure 1: Example of a typical nurse journal. The third page shows the treatment table. We are blacking out sensitive information.

In the following, we construct a treatment indicator using unsupervised ML by analysing the layout of each journal page and thereby identifying the group of children that actually received follow-up care. We compare this ML-based detection to an intention-to-treat (ITT) indicator inferred from the three-day policy and find that there is non-compliance. This illustrates that statistical models applied to images can, even without transcription, provide important information in applied economic research. Our dataset contains 95,313 journals with a total page count of 261,926.
Since the treated individuals can be identified by the presence of a particular page in their journal, we can use layout classification to detect treatment. If a page in their journal is classified as having the treatment table layout, then the individual is classified as treated. We did not have access to a labelled dataset to train a supervised classifier for the treatment page – as will often be the case in practice. Thus, we pursue an unsupervised approach where we rely only on the scanned images without labels. Note that we still need to manually construct an evaluation dataset to probe the performance of the applied method. However, if the unsupervised method is found to perform sub-par, then the evaluation set can serve as the basis for training a supervised classifier. In this sense, any manual transcription is not wasted. We will show an example of a supervised classifier for the same purposes in Section 3.

The documents are scanned and stored digitally as image files. Images consist of coloured dots called pixels. Each pixel is characterised by a location and a colour. Stacking a certain number of pixels horizontally and vertically forms an image. Thus, we can consider an image of h × w pixels to be an h × w matrix where each entry corresponds to a single pixel. In grayscale photos, each pixel can only attain white, black, and shades in-between. This is represented by a byte (8 bits) specifying a value between 0 and 255, with 0 being black and 255 white. The core of the machine digitisation process can be formulated as various statistical learning problems where we model different aspects of the visual information to learn a mapping from the image matrix into a representation that is suitable for economic analysis. Learning this mapping presents an array of challenges, in particular because the image matrix can be of very high dimension. A feature is a lower-dimensional variable that captures some aspect of the high-dimensional image, hopefully in a way that is more informative than the raw image data itself. Convolutional neural networks, see Goodfellow et al. (2016, Chp. 9), learn to extract such features when trained for image classification, and it turns out that these features are generally informative despite being trained on a specific dataset (Simonyan and Zisserman, 2015). This property is exploited in transfer learning where parts of a neural network trained on one set of images are applied for a new task on a different set of images (Pan and Yang, 2009). The VGG16 network is an example of a deep convolutional neural network that was trained on over 1 million photos to distinguish between 1,000 objects (Simonyan and Zisserman, 2015). Based on the concept of transfer learning, we use this pre-trained network to extract features from the journal images. This is a useful trick that can provide informative features without the need to train more sophisticated feature extractors or models. It works similarly to traditional feature extractors, e.g. SIFT (Lowe, 2004) and SURF (Bay et al., 2008), but the feature representation is learned instead of manually engineered. The classification part of VGG16 is discarded – we do not care about the original classification task – and we only keep the convolutional network. Each journal page is passed through the VGG16 convolutional network and we obtain a 512-dimensional feature vector that describes some aspects of the visual information.

Next, we use unsupervised methods to explore the features. The features are clustered using DBSCAN (Ester et al., 1996) – a density-based clustering algorithm. Pages with similar layout should cluster together as they share a similar VGG16 feature vector. To visualise the feature space, we embed the features into two dimensions using t-SNE, see Figure 2.

Figure 2: 2D t-SNE visualisation of the feature space of the journal pages (axes: Embedding X and Embedding Y). Each point represents a journal page and the colours correspond to the labels assigned by the clustering algorithm. Pages with similar layout cluster together. The embeddings have been subsampled to reduce cluttering, so only 30,000 randomly sampled embeddings are displayed. There is a total of 37 clusters which are manually annotated. The treatment pages are contained in four clusters.
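To make this concrete, the following is a minimal sketch of the feature extraction and clustering just described, using torchvision and scikit-learn. The folder name, image preprocessing, and DBSCAN parameters are illustrative assumptions rather than our exact settings.

```python
# Sketch: VGG16 feature extraction + DBSCAN clustering of page layouts.
import glob

import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image
from sklearn.cluster import DBSCAN
from sklearn.manifold import TSNE

# Keep only the convolutional part of VGG16; discard the classifier head.
vgg = models.vgg16(pretrained=True).features.eval()
pool = torch.nn.AdaptiveAvgPool2d(1)  # collapse the spatial grid -> 512-d vector

preprocess = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

features = []
with torch.no_grad():
    for path in sorted(glob.glob("journal_pages/*.jpg")):  # hypothetical folder
        x = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
        f = pool(vgg(x)).flatten()  # 512-dimensional feature vector per page
        features.append(f.numpy())
features = np.stack(features)

# Density-based clustering; eps/min_samples must be tuned to the collection.
labels = DBSCAN(eps=0.5, min_samples=10).fit_predict(features)

# 2D embedding for visual inspection (as in Figure 2).
embedding = TSNE(n_components=2).fit_transform(features)
```

The appeal of this design is that no component requires labelled training data; only the evaluation below relies on manual review.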
Table 1: Confusion matrix for the ML treatment detection model. The frequencies are based on a randomly sampled and manually reviewed validation set of 4,000 journals (10,914 pages).

                          ML detection
Ground truth      Treated   Not Treated   Total
Treated               234             0     234
Not Treated             0         3,766   3,766
Total                 234         3,766

To probe the performance, we constructed an evaluation dataset consisting of the 10,914 pages of 4,000 randomly selected journals. For each journal, we recorded the presence of the treatment table. The dataset was reviewed twice and 234 journals with treatment were found. Table 1 provides a confusion matrix for the ML treatment detection in the evaluation sample. All 234 treated and 3,766 untreated individuals are correctly classified by DBSCAN with zero false positives/negatives despite heavy class imbalance.
Table 2: Treatment indicator. Policy assignment is based on an official assignment rule which offered all children born in the first three days of each month enrolment in the nurse visiting programme. The ML assignment is based on the machine learning model and bases assignment on the presence of the treatment page in the journals. This allows for assessment of compliance in addition to the intention-to-treat effect, i.e., the date-of-birth assignment mechanism.

                        Policy detection (ITT)   ML detection
Treated                                  7,912          5,735
 - Born 1st-3rd                          7,912          4,247
 - Born 4th-31st                             0            455
Non-compliers                                -          4,120

Note that the classifier could obtain an accuracy of 3,766/4,000 ≈ 94% by simply classifying every individual as untreated. Obviously, this is not desirable and highlights the need for other performance measures. An option is to consider the precision and recall (Murphy, 2012, p. 184–185). In our context, precision is the number of individuals predicted as treated that are truly treated, while recall is the number of individuals detected as treated compared to the total number of treated. These two measures are especially relevant in detection and retrieval tasks such as those considered here (cf. Murphy, 2012, p. 185). In the results from the unsupervised method, both precision and recall are unity. While this is highly satisfactory, it is only an estimate of the true performance as we only use a subset of the data for evaluation. It is conceivable that the method can make some mistakes across the whole collection of 261,926 pages.
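For concreteness, the precision and recall implied by Table 1 can be computed directly; the snippet below is a minimal illustration of the two measures.

```python
# Precision and recall for the treatment-detection confusion matrix (Table 1).
true_positives = 234    # treated, detected as treated
false_positives = 0     # untreated, detected as treated
false_negatives = 0     # treated, detected as untreated

precision = true_positives / (true_positives + false_positives)  # = 1.0
recall = true_positives / (true_positives + false_negatives)     # = 1.0
print(f"precision={precision:.2f}, recall={recall:.2f}")
```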
Apart from the performance of the classifier itself, the results from the layout detection provide valuable insights on treatment assignment. From Table 2 it is evident that not all eligible individuals received the follow-up visits. 7,912 children are eligible but only 4,247 individuals born in the three-day eligibility period actually received a visit. This reveals an issue with non-compliance (3,665 + 455 = 4,120 non-compliers) which might have implications when estimating treatment effects, see e.g. Angrist, Imbens, et al. (1996). The performance of the ML detection is very encouraging, and in addition the ML approach also reveals details about the intervention that would otherwise have been lost, unless the whole collection of 261,926 pages was manually reviewed.

Actually, the class imbalance is even more severe. The ML model classifies pages, and in the evaluation set only 234/10,914 ≈ 2.14% of the pages contain the treatment table. In light of this, the high recall of the model is especially satisfying.

Although 7,912 individuals born between the 1st and the 3rd appears low when considering the sample size of roughly 95,000 children, we observe birth date for only 84,659 children. Hence, 9.35% of these children are born between the 1st and the 3rd, which is still slightly lower than expected (three days out of an average month of about 30.4 days, i.e. roughly 9.9%).

Keep in mind that the nurse journals are very uniformly scanned. Documents with more variation in quality might benefit from a supervised approach. For example, we found that the unsupervised method did not generalise well to the death certificates (Section 3).

3. MORTALITY IN DENMARK - HANDWRITTEN TEXT RECOGNITION
In Denmark, the use of death certificates was introduced at the national level in 1832. A death certificate documents the death of a single individual and contains a table with fields for name, birth date, cause-of-death etc. The documents are stored on paper at the Danish archives; one certificate is one page. Due to privacy restrictions, the publicly available death certificates are restricted to the timespan 1832–1942. The Danish National Archive and volunteers have scanned a large collection of certificates and made them available online as digital images. Around 2.5 million death certificates are available for download and, with approximately 10–12 fields per certificate, this amounts to 25–30 million individual fields to transcribe. Also, additional death certificates are continuously being scanned and added to the collection. We have a subcollection of approximately 250,000 death certificates across multiple years and locations. These are not randomly sampled but reflect the order in which the archives scanned the documents. This is not a major drawback as our main purpose is to illustrate the ML transcription pipeline.

A search for more information on these cases is ongoing.

See the description at the Danish National Archive (in Danish).

Figure 3: A sample of the different pages in the collection of death certificates. The examples are not exhaustive.

The collection contains several different page layouts, see Figure 3. We focus on the so-called type B certificates; 44,903 out of 250,000 certificates are type B. Figure 4 shows a type B certificate with manual highlighting of the three relevant fields.

Figure 4: The birth date, death date and age fields on a type B death certificate.

We split the transcription process into three steps:

(1) Layout classification: Separate the type B death certificates from the other types.
(2) Table segmentation: Extract an image for each of the selected fields in the pre-printed form.
(3) Transcription: Transcribe the extracted field images for birth date, death date, and age.

The sequence of steps (1)-(3) is called the ML pipeline, see the illustration in Figure 5. The pipeline consumes raw images of death certificates and produces string transcriptions for each field without human interaction. In the following sections, we describe each of the pipeline steps and evaluate the performance. Section 3.1 describes the transcribed datasets used for training and evaluation. The pipeline steps are considered in Sections 3.2–3.4 while Section 3.5 compares the ML pipeline to crowdsourced transcription.

Figure 5: The three steps in the ML data collection pipeline. (1) The collection of source images is sorted by layout. (2) The form in the document of interest is segmented into field images. (3) Each field image is transcribed into a digital string.
3.1. Datasets

We train and evaluate three separate models for the ML pipeline: one model for layout classification and two for transcription. In this process, we rely on several datasets which are outlined in Table 3 and described in detail below. The table segmentation method does not need training, so there is no dataset for this step.

(1) Dates: A training dataset with 11,630 dates and an evaluation dataset containing 1,000 dates. Data is approximately balanced between birth and death dates. The datasets are used to train and evaluate the transcription model for birth and death dates. Images are 320 × 50 pixels and the ground truth transcriptions are stored as strings in a standardised format. See Figure 6.
Figure 6: Examples of the images in the date datasets.

(2) Ages: The training and evaluation datasets contain 11,072 and 1,000 ages respectively. The datasets are used to train and evaluate the transcription model for age. Images are 230 × 75 pixels and the ground truth transcriptions are stored as strings and exclude the age suffix, i.e., years, months, days, or hours. Non-integer ages are also excluded. See Figure 7 for examples.

Table 3: Overview of the number of samples in each of the training and evaluation datasets used in the ML pipeline.

                       Training   Evaluation   Image size
Dates                    11,630        1,000   320 × 50 pixels
Ages                     11,072        1,000   230 × 75 pixels
Layouts                   7,000        2,184   Variable
Crowdsourced dates            -       46,526   320 × 50 pixels
Figure 7: Examples of the images in the age datasets.

(3) Layouts: The training and evaluation datasets contain 7,000 and 2,184 pages respectively. They are used to train and evaluate the layout classification model for detecting certificates of type B. The images are of varying size and the ground truth layout type is stored as an indicator variable. The images are similar to those shown in Figure 3.
(4) Crowdsourced dates: A dataset containing 23,263 complete death certificates that intersect with our collection of death certificates. Transcriptions are only available for birth and death dates, not for age. This dataset is used for evaluating the end-to-end performance of the pipeline.

The evaluation and training datasets (1)-(3) are constructed by manually transcribing a random sample of field images from the death certificates. They have been verified twice by different individuals and images with segmentation errors have been removed. We ensure there is no overlap between the training and evaluation datasets. The crowdsourced dataset (4) is freely available online from the Danish National Archive. Anybody can contribute to these transcriptions by editing them through an online interface.

See (in Danish) https://bit.ly/2VnFLzb

3.2. Layout classification

Layout classification refers to the process of organising a collection of documents by common layout structure, e.g. a common table, heading, or pre-printed landmarks. Chen and Blostein (2007) provide a relevant but methodologically outdated introduction to this problem. The layout type is important as the table segmentation step relies on a pre-defined template that must match the pre-printed structure in the image. Due to variations in layout across the collection of death certificates (see Figure 3), we need to construct one template for each type. Every time we fit a template to a certificate, we need the template and certificate types to match. This is straightforward if the certificates are sorted according to layout, as in this case all images in a given class will share the same template.

As we saw in Figure 3, an image X_i of a death certificate can belong to one of K layout types that can be distinguished visually, e.g. "Type B", "Type A", and so on. Let the layout type of image i be Y_i = k with k = 1, 2, ..., K. The K types do not need to mimic the visual types in the documents; we can easily focus on "Type B" and label everything else as "Other". We are interested in learning a model for the probability P(Y_i = k | X_i) such that we can infer the most likely layout type ŷ_i = arg max_k P(Y_i = k | X_i). This resembles a conventional K-class classification problem with the only difference that X_i is an image.

We already saw an example of unsupervised layout classification in the context of the nurse journals in Section 2 where we emphasised the importance of constructing a lower-dimensional feature that describes the raw image. In the supervised setting, there are two main considerations: (1) what features should we use, i.e. how do we construct a feature g(X_i) that represents the high-dimensional information in X_i, and (2) what classifier should we use, i.e. what model do we choose for the probability P(Y_i = k | g(X_i)).
This cannot be answered definitively and depends on the application. There are many equally viable approaches. In the nurse journal application, we solved (1) by using the features from a pre-trained neural network (VGG16) and used these directly in a cluster analysis. A modern supervised end-to-end approach would be to train a convolutional neural network (CNN) to classify the pages using raw images as input, i.e. g(X_i) = X_i. In this case, the neural network solves both (1) and (2) as it learns both the feature extractor and the classifier during training.

We take a simpler approach and use the visual Bag-of-Words (BoW) method to classify the layout type of the certificates. This method provides an intuitive definition of features in terms of visual landmarks or "words". Bag-of-words is a technique originally developed to classify chunks of text according to their content (Murphy, 2012, p. 87). It operates by determining the frequency of words (i.e. the frequency of some global set of words – the bag-of-words) in each text document. The method has been applied successfully in the field of computer vision, see Csurka et al. (2004) and Sivic and Zisserman (2009), where we instead operate with a bag of visual words, i.e. chunks of images.

The visual bag-of-words model creates features on the basis of a codebook or dictionary. This dictionary is constructed by extracting key points from a training dataset and clustering them such that we obtain M groups of key points where key points within a group appear similar in some sense. If we think of a training dataset that consists of photographs of animals, we could have a key point cluster related to eyes, one for ears, etc. Each of these clusters is a visual word. The visual words are similar to the feature clusters we discovered in the nurse journals using the unsupervised method, i.e. those in Figure 2. Here we extract the key points using Speeded-Up Robust Features (SURF) (Bay et al., 2008) as this has historically been the common choice, see Csurka et al. (2004).

When the dictionary has been constructed, we are ready to create the actual feature vectors for the images. For a given training image i, we extract SURF key points and assign each of these to the M key point clusters based on distance. We then count the number of features from the image that belong to each of the M key point clusters and construct a vector of normalised frequencies which serves as the final feature vector of the image. Note that the size M of the dictionary determines the dimensionality of the final feature vector. In the classification step, we train a model to classify each feature vector (and thus each image) into one of the K classes. The classifier can be of any type. Here we use support vector machines with radial kernels, see the introduction in James et al. (2013, Chp. 9). There are several tuning parameters in the BoW model (dictionary size, margin etc.) which we selected using leave-one-out cross-validation. The computational burden lies in the initial keypoint extraction and clustering. When the appropriate features have been extracted, it is very fast to fit the actual classifiers and hence it is computationally fast to do cross-validation for the classifier hyperparameters. For an end-to-end neural network it is substantially slower to do hyperparameter optimisation as the whole network will need to be re-trained – possibly for hours – for each set of hyperparameters. This highlights that simpler classifiers can be useful as a first step before developing more complex models.

Note that for large training sets, runtime for support vector machines with radial basis kernels increases from roughly O(n²) towards O(n³) and another classifier should be used.
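To make the visual BoW pipeline concrete, the following sketch implements dictionary construction and classification with OpenCV and scikit-learn. We substitute ORB key points for SURF, since SURF is only available in the opencv-contrib package, and the dictionary size and the data-loading conventions are illustrative assumptions.

```python
# Sketch: visual bag-of-words layout classifier (key points -> dictionary ->
# normalised frequency vectors -> SVM). Hyperparameters are illustrative.
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

M = 200  # dictionary size = dimensionality of the final feature vector
detector = cv2.ORB_create()  # stand-in for SURF (SURF needs opencv-contrib)

def descriptors(gray):
    _, desc = detector.detectAndCompute(gray, None)
    return np.zeros((0, 32), np.uint8) if desc is None else desc

def bow_vector(gray, dictionary):
    # Assign each key point to its nearest visual word and count frequencies.
    desc = descriptors(gray).astype(np.float32)
    if len(desc) == 0:
        return np.zeros(M)
    words = dictionary.predict(desc)
    hist = np.bincount(words, minlength=M).astype(np.float64)
    return hist / hist.sum()  # normalised frequency vector

def fit_bow_classifier(train_images, train_labels):
    # (1) Cluster all training key points into M visual words ...
    all_desc = np.vstack([descriptors(im) for im in train_images])
    dictionary = KMeans(n_clusters=M).fit(all_desc.astype(np.float32))
    # (2) ... then fit an RBF support vector machine on the BoW vectors.
    X = np.stack([bow_vector(im, dictionary) for im in train_images])
    return dictionary, SVC(kernel="rbf").fit(X, train_labels)
```

As noted above, once the dictionary is built, re-fitting the SVC for different hyperparameters is cheap, which is what makes cross-validation practical here.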
3.2.2. Results

The classifier is trained on the layout dataset which contains 7,000 full page images of randomly sampled death certificates together with an indicator for their ground truth layout type. We consider K = 4 distinct layout classes: (1) type A certificates, (2) type B certificates (our focus), (3) all other certificates and (4) empty pages. We evaluate the classifier on the corresponding layout evaluation dataset (2,184 certificates) and the confusion matrix is given in Table 4.

Table 4: Confusion matrix for the BoW layout classification. The frequencies are based on a randomly sampled and manually reviewed evaluation set of 2,184 death certificates. Death certificates are classified into four classes: Empty, A, B and Other. In this application we are only interested in the type B certificate.

                         ML detection
Ground truth    Empty    Other      A      B   Total
Empty              13        1      0      0      14
Other               2    1,535      0      0   1,537
A                   0        0    109      0     109
B                   0        0      0    524     524
Total              15    1,536    109    524

From the confusion matrix we see that the classifier provides highly satisfactory performance with only two false positives and one false negative in document classes we are not interested in. For the type B certificates the class-wise precision and recall are unity. We use the trained BoW classifier to predict across the entire collection of ∼250,000 death certificates and find that 44,903 certificates are type B. These are extracted for the table segmentation step.

The other classes (i.e. type A and empty) were used for a different application we do not discuss here.
3.3. Table segmentation

In the table segmentation step, the goal is to extract smaller images corresponding to each field (or cell) of a larger form (or table) in the source image. Coüasnon and Lemaitre (2014) provide a general introduction to the topic. There are two components to this problem: (1) identifying the structure of the table in the source image, and (2) exploiting the structure to extract the field images. We apply a simple template-based approach where a predefined template is fitted over the source image using a set of landmark points.

In ongoing work, we are considering how to solve (1) using edge-detection neural networks. However, in this paper we rely on standard filtering operations from the computer vision literature to binarise the source image and find straight horizontal and vertical lines.

Let P_0 = {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} be a set of points that define a template (e.g. the corners of a table) and analogously let P = {(x'_1, y'_1), (x'_2, y'_2), ..., (x'_K, y'_K)} be a set of points in an image that we think roughly corresponds to those in the template. Note that the number of points in the two sets do not need to be the same. Point registration is the problem of aligning the points in P_0 (the template) over the points in P (the image) (Besl and McKay, 1992). This is illustrated in Figure 10 where we align template points P_0 (blue dots) with the image landmarks P (red dots). We can intuitively understand this as finding a transformation function T that applies some transform to all points in P_0 such that T(P_0) aligns with P in some distance metric d. The point set registration methods differ in their choice of distance d and the constraints they put on the transformation T. The non-rigid version of the CPD method in Myronenko and Song (2010) assumes that the points in the image P are generated by a Gaussian distribution "around" the template points P_0 and it puts only mild regularity conditions on the transform T. In particular, it assumes that points move freely but coherently. Myronenko and Song (2010) model the problem using a Gaussian mixture model. We rely directly on their algorithm to align our template to the image by supplying our template points P_0 and image landmarks P. The algorithm applies the standard Expectation Maximisation (EM) method to solve the maximum likelihood problem.

It has been suggested that the CPD method can be improved by using a neural network to learn the transformation (Li et al., 2019) but we do not consider this here. There are also examples of more general learning-based table segmentation that does not rely on a pre-specified template, see e.g. Clinchant et al. (2018).
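To make the registration step concrete, the sketch below aligns a toy template to noisy landmarks using pycpd, an open-source Python implementation of CPD (not necessarily the implementation we used); the point sets are toy examples.

```python
# Sketch: non-rigid CPD alignment of template points P_0 to image landmarks P.
import numpy as np
from pycpd import DeformableRegistration

P0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # template
P = np.array([[0.1, 0.0], [1.05, 0.1], [0.0, 0.95],
              [0.98, 1.1], [0.5, 0.5]])  # detected landmarks (extra point ok)

# X is the target (image landmarks), Y the source (template) to be deformed.
reg = DeformableRegistration(X=P, Y=P0)
transformed_P0, params = reg.register()  # EM iterations under the hood

# transformed_P0 now overlays the image; it guides the cuts into field images.
```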
The initial template is constructed by finding a well-scanned death certificate where we can manually extract a number of points which can be used as anchors for the template, see Figure 8. This is only done once and these points comprise the template point set P_0.

Figure 8: An example of a set of points that define a template.

After the template has been established, each type B death certificate is subjected to a set of morphological operations (erosion and dilation) to find pixels belonging to table rows or columns. The image is first thresholded (i.e. converted from gray scale into black/white where all pixels are either 0 or 255) and afterwards erosion and dilation operators are applied. Erosion and dilation are common non-probabilistic ways to extract straight (vertical or horizontal) lines in computer vision problems, see e.g. Szeliski (2010). The morphological operations require careful hand-tuning to find the optimal parameters for the document type but these parameters only need to be tuned once for the whole collection. When the image has been reduced to its line structure, we apply a corner detector to obtain the landmark point set P, see Figure 9.

Figure 9: Landmark detection using morphological operations and a corner detector.

Figure 10: Point set alignment. Blue dots represent the template while red dots are the landmarks found in the source document.

Next, we apply the CPD algorithm to align the template points in P_0 to the image landmarks in P. This iterative fitting process is illustrated in Figure 10. Even if extra landmark points are detected, the CPD algorithm remains robust as long as the detected landmark points P are sufficiently spread out along the axes, i.e. if the noise is fairly uniform. Once the points have been aligned, we use the estimated transformation T to fit the template onto the image. The template then guides the cuts that separate the larger image into individual field images. This is illustrated in Figure 11. Figures 13-14 show examples of the final segmented fields for dates. We do not report quantitative measures of segmentation performance but note that the end-to-end transcription accuracy in Section 3.5 will include any errors produced by the segmentation step if these make the field images unreadable.

The field images could be further segmented to limit noise and assure that the position and scale of the text is similar between samples. A possible way to implement this is a Mask R-CNN network (He, Gkioxari, et al., 2017) which could be trained to predict regions of noise and handwritten text respectively. The text regions are then extracted and re-aligned on a blank image. However, this adds further complexity to the problem by introducing an additional segmentation model. The Mask R-CNN would need training data which requires manual annotation of text outlines to create a ground truth dataset. We will instead run the transcription model directly on the field images without any additional segmentation/cleaning and show that this is a viable approach.

Figure 11: Example of field images extracted using the fitted template.
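A minimal version of the landmark detection can be written with OpenCV as below; the kernel lengths and corner-detector settings are illustrative and would require the hand-tuning described above.

```python
# Sketch: extract straight lines with morphology, then detect corner landmarks.
import cv2
import numpy as np

def detect_landmarks(gray_page):
    # Binarise: dark ink becomes white foreground on black background.
    _, binary = cv2.threshold(gray_page, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

    # Erode/dilate with long thin kernels so only straight lines survive.
    horiz_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    vert_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    horiz = cv2.dilate(cv2.erode(binary, horiz_kernel), horiz_kernel)
    vert = cv2.dilate(cv2.erode(binary, vert_kernel), vert_kernel)
    table_lines = cv2.bitwise_or(horiz, vert)

    # Corners of the recovered line structure serve as the landmark set P.
    corners = cv2.goodFeaturesToTrack(table_lines, maxCorners=200,
                                      qualityLevel=0.01, minDistance=20)
    return corners.reshape(-1, 2) if corners is not None else np.empty((0, 2))
```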
3.4. Transcription

Transcription is the process of converting an image of text into a string representation. Assume we have sequences of the form Y_i = (Y_{i,1}, Y_{i,2}, ..., Y_{i,T}) where T is the maximum sequence length and each Y_{i,t} is a random variable on the sample space Ω, the dictionary of tokens. In our application Ω contains the digits 0, 1, ..., 9, names of the months, and some special characters.

Model.
We utilise an attention-based neural network suggested by Xu et al. (2015) for image captioning and repurpose it for transcription of handwritten text. Originally, the model by Xu et al. (2015) predicts a string of words that describes the contents of an input image, e.g. if it is supplied with an image of a bird sitting in a tree then the model would predict the sentence "A bird sitting in a tree". However, the model of Xu et al. (2015) is generally applicable to tasks that involve image inputs and sequence outputs – which is exactly what characterises the transcription problem. A similar attention-based model for OCR was proposed by Lee and Osindero (2016) and it has been demonstrated that more complex multi-head attention models can transcribe whole paragraphs of handwritten text (multiple lines) without prior line segmentation, see Bluche, Louradour, et al. (2017).

In our implementation of Xu et al. (2015), we replace the final embedding output with a softmax layer. This allows the model to predict probabilities for each of the tokens in our dictionary Ω. In particular, assume that we have an image X_i and we are at step t in the prediction of the corresponding sequence Y_i = (Y_{i,1}, ..., Y_{i,T}). The model predicts the step t conditional probabilities

    P(Y_{i,t} = k | X_i, Y_{i,t-1} = ŷ_{i,t-1}, ..., Y_{i,1} = ŷ_{i,1}),   k ∈ Ω,

where (ŷ_{i,t-1}, ..., ŷ_{i,1}) are the previously predicted tokens in the sequence.

The model architecture is depicted in Figure 12. First, the feature network encodes the input image into a feature map. The attention mechanism A then applies soft-attention to this feature map by predicting weights in [0, 1] and taking the element-wise product of the features and weights. As noted by Xu et al. (2015), soft-attention is fully differentiable so we can use standard back-propagation to train the network. The attention-weighted feature map is passed to a recurrent neural network (RNN) cell D. This cell predicts the next token in the sequence and updates its hidden state. The hidden state is fed to the attention mechanism which updates the attention weights, i.e. "where should we look next?" This supplies the recurrent cell with a new weighting of the feature map. The RNN cell uses the hidden state and the weighted features to predict the next token in the sequence. This procedure is repeated until an end-of-sequence token is predicted by the model.

Figure 12: Transcription model based on the image captioning model in Xu et al. (2015). The feature network extracts features from the input image. The attention mechanism, denoted by A, weighs the features and supplies these as input to an RNN cell, D, that predicts one token of the sequence. The hidden state of the RNN cell is updated, and the updated state is passed to the attention network to decide on the next weights for the feature map. The process repeats until an end-of-sequence prediction is made.

An alternative approach is a neural network with the Connectionist Temporal Classification (CTC) loss. Such networks have achieved state-of-the-art results on unconstrained transcription (Graves, Liwicki, et al., 2008), i.e. in settings where there are no restrictions on the words the model can transcribe – implying that Ω contains the alphabet. Graves, Fernández, et al. (2006) suggested the CTC loss for sequence labelling. The CTC-based approach is a natural alternative to the attention model we use, but we do not pursue it here.

This is less straightforward for transcription of names. However, it is possible to construct dictionaries based on the most common names and only transcribe these. Our initial tests of this approach have been promising for reading names in heavily cluttered and poorly segmented fields, but we do not present the results here.

In our work, we have found that the soft attention mechanisms are significantly easier to train compared to their hard attention counterparts. This is also noted by Lee and Osindero (2016).
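As an illustration of this architecture, the sketch below implements one step of a soft-attention decoder in PyTorch. The layer shapes, the additive attention scoring, and the use of a GRU cell are our own simplifying assumptions for exposition; the sketch conveys the structure of Figure 12, not the exact model.

```python
# Sketch: one decoding step of a soft-attention transcription model.
import torch
import torch.nn as nn

class AttentionDecoderStep(nn.Module):
    def __init__(self, feat_dim, hidden_dim, vocab_size):
        super().__init__()
        self.score = nn.Linear(feat_dim + hidden_dim, 1)  # attention scores
        self.cell = nn.GRUCell(feat_dim, hidden_dim)      # RNN cell "D"
        self.out = nn.Linear(hidden_dim, vocab_size)      # softmax layer

    def forward(self, features, hidden):
        # features: (batch, locations, feat_dim); hidden: (batch, hidden_dim)
        h = hidden.unsqueeze(1).expand(-1, features.size(1), -1)
        weights = torch.softmax(
            self.score(torch.cat([features, h], dim=2)).squeeze(2), dim=1)
        # Soft attention: weights in [0, 1], element-wise weighted features.
        context = (weights.unsqueeze(2) * features).sum(dim=1)
        hidden = self.cell(context, hidden)  # update the hidden state
        token_logprobs = torch.log_softmax(self.out(hidden), dim=1)
        return token_logprobs, hidden, weights
```

Repeating this step, feeding the updated hidden state back into the attention scoring, yields one token per iteration until the end-of-sequence token is emitted.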
Prediction. Given an image X_i we want to predict the corresponding sequence Y_i ∈ Ω^T. At each sequence step t, we use the model to predict probabilities over the tokens in the dictionary Ω. Due to dependence in the sequence, the predicted probabilities across the dictionary at step t depend on all the previously predicted tokens at steps (t − 1, t − 2, ..., 1). Finding the most likely complete sequence would therefore require evaluating on the order of |Ω|^T different sequences for each sample we want to predict. Hence it is necessary to rely on greedy algorithms to find a feasible (but possibly suboptimal) solution. We use beam search as suggested by Xu et al. (2015).

Evaluation. We evaluate the performance of the transcription models using two metrics. One relates to the average accuracy across individual tokens and the other to the accuracy of complete sequences. To fix notation, let y_i = (y_{i,1}, ..., y_{i,k_i}) be a realised ground truth sequence and ŷ_i = (ŷ_{i,1}, ..., ŷ_{i,h_i}) a predicted sequence for image X_i = x_i, i = 1, ..., n where n is the size of the evaluation dataset. Assume that the ground truth sequence is always padded so k_i ≥ h_i and that the padding value is chosen such that y_{i,j} ≠ ŷ_{i,j} for j > h_i. Moreover, let I(·) be the indicator function of some predicate. We define the Token Accuracy (TA) and m-error Sequence Accuracy (SA_m) by

    TA = \frac{1}{\sum_i k_i} \sum_{i=1}^{n} \sum_{j=1}^{k_i} I(y_{i,j} = \hat{y}_{i,j}),   (1)

    SA_m = \frac{1}{n} \sum_{i=1}^{n} I\left( \sum_{j=1}^{k_i} I(y_{i,j} \neq \hat{y}_{i,j}) \leq m \right).   (2)

TA has an interpretation as the average accuracy across all predicted tokens. SA_m is the proportion of correctly predicted sequences when m mistakes are allowed in the sequence. The token and sequence accuracies are closely related to the character and word accuracies in the HTR literature, see e.g. Graves, Liwicki, et al. (2008). However, our dictionary contains tokens which can be chosen arbitrarily, i.e. a token can be a character or a word.

Note that both TA and SA are sensitive to alignment, so while a single substitution produces one sequence error, missing or adding an extra token would misalign the entire sequence. For example, in the following stylised prediction the model misses a single token but this causes four errors in the predicted sequence because of misalignment:

    Ground truth:  <2>      <0>      <sep>    <1>      <9>      <0>
    Prediction:    <2>      <0>      <1>      <9>      <0>      <pad>
                   Correct  Correct  Error    Error    Error    Error

Hence missing a token can impose a heavy accuracy penalty when using these measures. Another possibility for measuring performance would be various string distances, e.g. the Levenshtein distance. These can be adapted to tokens instead of characters and do not have the alignment problem. We do not consider this further but note that relying on string distance measures would only improve the metrics for our models.

See https://rll.byu.edu/
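Returning to the prediction step, a minimal beam search over step-wise token probabilities could look as follows; `step_fn` is a hypothetical callable wrapping the trained model, and the beam width is an illustrative choice.

```python
# Sketch: beam search over step-wise token log-probabilities.
# step_fn(prefix) is assumed to return {token: log_probability} for the
# next position, given the tokens predicted so far.
def beam_search(step_fn, end_token="<end>", max_len=11, beam_width=5):
    beams = [((), 0.0)]  # (token prefix, cumulative log-probability)
    for _ in range(max_len):
        candidates = []
        for prefix, logp in beams:
            if prefix and prefix[-1] == end_token:
                candidates.append((prefix, logp))  # already finished
                continue
            for token, token_logp in step_fn(prefix).items():
                candidates.append((prefix + (token,), logp + token_logp))
        # Keep only the beam_width most probable partial sequences instead
        # of enumerating all |Omega|^T possibilities.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)
        beams = beams[:beam_width]
    return max(beams, key=lambda c: c[1])[0]
```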
We train separate models for dates and ages but both are based on the same attention neural network. Below we detail the choices of dictionary, hyperparameters and the training procedure used to train the two models on the type B death certificates.
Dictionary.
The age transcription model uses a dictionary consisting of the digits (0, 1, ..., 9) together with start and end markers to delimit the sequence. The final age dictionary is Ω_age = {<start>, <0>, <1>, ..., <9>, <end>}. For example, the age 42 would be tokenised as <start>, <4>, <2>, <end>. Note that the model cannot transcribe non-integer ages (e.g. 2½) or any suffixes such as years, months, days, and hours. It would be straightforward to add additional tokens but since our ground truth transcriptions do not contain this information we cannot train the model to recognise them.

The date transcription model uses a larger dictionary, see Table 7 in the Appendix. Each month is considered a separate token. Similarly to the age example, the date September 20th, 90 would be tokenised as <start>, <2>, <0>, <sep>, <september>, <sep>, <9>, <0>, <end>, where <sep> denotes a separator token.
Data.
The date model is trained on 11,630 manually transcribed dates. The age model is trained on 11,072 manually transcribed ages. To boost the amount of training data, we apply an augmentation procedure to create multiple distorted versions of each original training sample. Prior to training, the mean and variance are estimated using a sample of 20,000 augmented images. The mean and variance are then used to standardise the pixel values in each training image. The ground truth sequences (respectively for dates and ages) are tokenised according to the defined dictionaries and they are padded to all have the same length T. For age this length is T = 4 tokens while it is T = 11 tokens for dates. During training the models are fed batches of pairs {(x_i, y_i)}_i where each training pair consists of a field image x_i and its corresponding ground truth sequence y_i.
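The text does not spell out the exact distortions, so the sketch below illustrates one plausible augmentation procedure with torchvision; the transform types and their ranges are assumptions.

```python
# Sketch: create randomly distorted copies of each training field image.
# The transform ranges are illustrative, not the paper's exact settings.
import torchvision.transforms as T

augment = T.Compose([
    T.RandomAffine(degrees=2, translate=(0.02, 0.05),
                   scale=(0.95, 1.05), shear=2, fill=255),
    T.ColorJitter(brightness=0.2, contrast=0.2),
])

def augmented_copies(image, n_copies=10):
    # One original field image -> several distorted training samples.
    return [augment(image) for _ in range(n_copies)]
```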
Hyperparameters. The transcription models require various hyperparameters that must be selected by the researcher. Some parameters are intuitive, e.g. the size of the input images, but others are more abstract such as the number of hidden nodes in a specific layer of the neural network. There is limited guidance for choosing these parameters. For the best performance, it is necessary to do hyperparameter optimisation where several sets of parameters are evaluated on a tuning dataset. We do not consider this and instead use manually selected values that yield acceptable results on the death certificates.

The feature network is the convolutional part of a pre-trained ResNet-101 network (He, Zhang, et al., 2016) which we fine-tune during training with its own learning rate. The feature map is obtained by applying adaptive max pooling after the final layer of the ResNet-101 feature network. Both the main and feature network learning rates are decayed by a factor of 0.25 at steps 10,000 and 25,000. The initial hidden state of the RNN cell is learned from the features using a fully-connected layer. We apply dropout during training. The models are trained for a fixed number of steps implying that each sample is encountered on average 162 times. Note that the training datasets are augmented; while the model does encounter the same original sample multiple times, the sample will have different random distortions.
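In PyTorch, a training setup of this kind could be wired as follows. The learning-rate values, the decoder placeholder, and the pooled grid size are assumptions; only the decay factor 0.25 and the decay steps 10,000 and 25,000 are taken from the text.

```python
# Sketch: ResNet-101 feature network + separate learning rates + step decay.
import torch
import torch.nn as nn
import torchvision.models as models

resnet = models.resnet101(pretrained=True)
feature_net = nn.Sequential(*list(resnet.children())[:-2],   # drop pool + fc
                            nn.AdaptiveMaxPool2d((10, 20)))  # assumed grid

decoder = nn.GRUCell(2048, 512)  # placeholder for the attention decoder

optimizer = torch.optim.Adam([
    {"params": feature_net.parameters(), "lr": 1e-5},  # placeholder value
    {"params": decoder.parameters(), "lr": 1e-4},      # placeholder value
])
# Decay both learning rates by 0.25 at training steps 10,000 and 25,000;
# the scheduler is stepped once per training batch.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[10_000, 25_000], gamma=0.25)
```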
We evaluate the performance of the transcription models using the TA and SA_m metrics from Equations 1-2. This section focuses on the performance of the ML transcription models in isolation, so all results are conditional on the transcription models receiving perfectly segmented images. The trained transcription models are used to predict the images in the date and age evaluation datasets using beam search. Prior to prediction, the images are standardised to zero mean and unit variance using the mean and variance estimated from the entire evaluation dataset. No other pre-processing is applied. Transcription of a single image happens in less than 75ms.

This is common practice and as straightforward as it sounds. We take the average and standard deviation over all pixels in all images, and then respectively subtract and divide each pixel by these values.

Table 5 shows the performance of the two transcription models – for date and age respectively – on their corresponding evaluation sets. We present results both for training with and without augmentation of the training dataset. The token accuracy (TA) for dates is 97.9% with augmentation and 92.2% without. For ages the TA is 98.5% with augmentation and 96.6% without. These are the average accuracies of predicting a single token correctly. Figures 13-14 show random samples of incorrect and correct dates as predicted by the ML date transcription model.

Table 5: Token (TA) and Sequence (SA) Accuracy on the ground truth evaluation sets. The first row ("Real") uses only ground truth training samples, while the second row shows the result when training is conducted on the augmented dataset. The date training set contains 11,630 samples and its evaluation set 1,000 samples; the age training set contains 11,072 samples and its evaluation set 1,000 samples.

                  Dates                              Ages
              TA      SA_0    SA_1    SA_2       TA      SA_0    SA_1
Real         .922    .661    .933    .981       .966    .936    .988
Augmented    .979    .905      -       -        .985      -       -

In Table 5, the zero-error sequence accuracies SA_0 are significantly lower than the token accuracies for both models. Under augmentation, the date model achieves an SA_0 of 90.5% while the age model achieves an SA_0 of about 97%.

Figure 13: Random sample of incorrect predictions from the date transcription model.

Figure 14: Random sample of correct predictions from the date transcription model. The label in the upper right corner displays the model prediction in the format day-month-year.

It is apparent from Table 5 that an increase in the number of allowed mistakes per sequence improves the accuracy significantly. E.g. if we allow one mistake (i.e. one substitution) in the date sequence, the one-error sequence accuracy is above 98%. The gains from augmentation are notable given that the models are trained on only 11,630 dates and 11,072 ages. If our training datasets were larger, the payoff from augmentation would be smaller. However, the differences display the benefits of augmentation to boost performance in smaller training datasets.
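The pixel standardisation described above amounts to a few lines of numpy, assuming the images are stacked into a single array:

```python
# Sketch: standardise pixel values to zero mean and unit variance.
import numpy as np

def standardise(images):
    # images: array of shape (n_images, height, width), float pixel values
    mean = images.mean()
    std = images.std()
    return (images - mean) / std
```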
3.5. Comparison with crowdsourced transcription

We consider a comparison to a crowdsourced dataset. The usefulness of this comparison is twofold: (1) it allows us to evaluate if the ML approach produces transcriptions that are on par with those obtainable from crowdsourcing, and (2) it provides a measure of the end-to-end performance of the ML transcription pipeline, including errors of segmentation, transcription, and to an extent layout classification.

Our own evaluation dataset – as used in Table 5 – was manually reviewed to exclude (1) images where segmentation errors obscured the text in the image, and (2) images belonging to a document of the wrong layout type (i.e. anything other than type B death certificates). Evaluating the model on this dataset provides a clean measure of the transcription model in isolation. However, it does not give any insights on the transcription performance when the model might receive a badly segmented image. The crowdsourced dataset is different in this respect as it is constructed by humans looking at the raw document, finding the date field, and transcribing the text. This implies that the crowdsourced transcriptions cannot be impacted by segmentation or layout classification errors. By running the entire pipeline on the raw images and comparing the final transcription output to the crowdsourced transcription, we can get a measure of the overall performance of the pipeline in practice. Of course, this relies on the assumption that the crowdsourced data are perfectly transcribed, which is not likely to be the case.

The comparison does not provide us with a good measure of the recall of the layout classifications as we only consider certificates that are in both datasets. There might be type B certificates that (1) have not been detected by our ML model and (2) are not in the crowdsourced dataset. These will be missed here.

To get a baseline indication of the quality of the crowdsourced dataset, we compare it against our ground truth training and evaluation sets (overlap of 2,864 documents) and find that the dates are identical in 96.30% of the cases, see Table 6. For comparison, the performance of the transcription model on the evaluation set (i.e. perfectly segmented images) is 90.5%.

Table 6: Zero-error Sequence Accuracy SA_0 for the ML date model (trained with augmentation) and crowdsourcing, both evaluated on our date ground truth datasets. The evaluation set for the ML model consists of 1,000 samples, while the evaluation set for the crowdsourced predictions consists of 2,864 samples as we can pool both our training and evaluation ground truth datasets in this case. The training and evaluation sets have been twice manually reviewed and dates that are unreadable due to bad segmentation have been removed. Age is not directly transcribed in the crowdsourced dataset and hence excluded here. Columns 1-2 give sequence accuracy for the whole date, while columns 3-8 give sequence accuracy on the individual components of the date (day, month and year). Note that the comparison takes into account common date formatting, e.g. that the dates 01-10-2000 and 1-10-2000 convey the same point in time.

                   Date              Day               Month             Year
                ML     Crowd      ML     Crowd      ML     Crowd      ML     Crowd
SA_0           .905    .963      .960    .983      .970    .987      .972    .988

We then run the entire pipeline on the raw scans and compare the output to the crowdsourced transcriptions in the overlapping collection of 23,263 documents – each containing a birth and death date for a total of 46,526 dates – where we filter out any overlap with the training sample used to train the ML model. Note that the crowdsourced dataset does not contain transcriptions of age. The dates predicted by the ML model and crowdsourcing are identical in 83.66% of the cases, and for 89.96% of the dates the difference is less than one calendar year. This is a substantial difference compared to Table 5 where the ML sequence accuracy was 90.5%. As elaborated above, the performance difference relative to Table 5 stems from two sources: (1) noise in the crowdsourced dataset, and (2) the other pipeline steps prior to transcription. Thus, unless (1) is large, this gives an approximation to the end-to-end performance of the whole pipeline. We should keep in mind that the 83.66% sequence accuracy allows for zero mistakes in the predicted sequence and that this is the expected performance if we – without any pre-processing or adjustments – feed a collection of raw scans to the pipeline. Also, given the noise in the crowdsourced dataset, we can argue that 83.66% might be a slightly conservative estimate unless crowdsourcing participants and ML make the exact same mistakes on the exact same documents.

Not completely identical: we take common differences in formatting into account, so e.g. 01-3-2000 and 1/3-2000 would be considered equal.

Using the ML approach, it is cheap to transcribe additional fields on the documents as it only requires a training sample. As we have seen, the training sample can be much smaller than the full collection of documents. This can be exploited to produce higher sequence accuracy rates if there are internal correspondences between the fields in the source document. For example, the death certificates contain both birth and death date and age. These three fields should be internally consistent. If they do not match then either (1) the source document contains a mistake or (2) the ML model made a mistake. If we transcribe both age and dates and exploit the correspondence between these fields, we can filter out 5,767 potentially erroneous records among the 23,263 documents (46,526 field images), as sketched below.
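A sketch of such an internal consistency filter is given below; the record format, the use of implied age in years, and the one-year tolerance are illustrative assumptions.

```python
# Sketch: flag records where birth date + transcribed age contradict the
# death date. Records that disagree are routed to manual review.
from datetime import date

def is_consistent(birth: date, death: date, age_years: int,
                  tolerance_years: int = 1) -> bool:
    implied_age = (death - birth).days / 365.25
    return abs(implied_age - age_years) <= tolerance_years

# Example: transcriptions that imply different ages are flagged.
print(is_consistent(date(1850, 3, 1), date(1910, 5, 20), 60))  # True
print(is_consistent(date(1850, 3, 1), date(1910, 5, 20), 47))  # False
```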
Hence, relationships between fields can provide automatic verification and be used to flag problematic records for manual review. This method can of course also be applied in a manual/crowdsourcing context, but in this case transcription of additional fields is more costly.

In addition to the accuracy rates, we also compare the distributions of the data obtained from the ML and crowdsourcing approaches. Any systematic bias in the ML model would produce deviations from the empirical distribution observed in the crowdsourced dataset. Also – based on the discussion above – we expect internal consistency between ML transcribed ages and dates. Figure 15 compares kernel density estimates of the age distribution produced by (red line) ML age transcriptions, (blue line) ML date transcriptions, and (green line) crowdsourced date transcriptions. Note that the figure only displays ages in the interval [0; 100] – any ages outside this interval are discarded. In addition, we discard all predicted dates where the year does not contain four digits. For some dates in the source documents, the year has been abbreviated to only the last two digits (e.g. 1890 becomes 90 or 1910 becomes 10). We cannot easily infer the first two digits, as most of the death certificates span the time 1850–1950 so the leading digits could be either 18 or 19. This is not a shortcoming of the transcription model but rather a lack of information in the source documents. In Figure 15, we see that the distribution of ML age transcriptions is very close to the age distribution implied by the crowdsourced date transcriptions. The age distribution implied by the ML date transcriptions also appears similar, although with a notable difference around ages 75–85. This deviation is not surprising as the dates contain longer sequences to transcribe (relative to age) and the ML model needs to transcribe both birth and death date correctly (at least down to the year) to get an approximately correct age prediction.
Figure 15: Lifetime distributions. Kernel density estimates of the lifetime distribution implied by ML age transcriptions (red), ML date transcriptions (blue), and crowdsourced date transcriptions (green) in an overlapping sample of 23,263 individuals restricted to the age interval [0; 100]. Implied age refers to the difference between the birth and death dates in years.

Motivated by the notable difference in performance between the whole pipeline and transcription only, we manually review some of the cases where the crowdsourced and model predictions differ. This reveals, albeit qualitatively, that most discrepancies are related to segmentation issues where the CPD segmentation obscures the year in the date fields. The CPD segmentation template makes an arbitrary cut to separate the age and death date fields. The year is the rightmost component of the date and hence most likely to be impacted by this cut – refer to the positioning of the death date and age fields in Figure 4. This explanation is corroborated if we look at the accuracy rates for the individual date components with the crowdsourced dataset as ground truth: day accuracy is 95.92%, month accuracy is 96.98%, and year accuracy is 89.11%. Clearly, the accuracy for the year component is notably lower. When discovered, such issues can easily be corrected in the ML pipeline. Had the transcription been done manually, it would be much more costly to correct systematic transcription errors.

As a concluding remark, keeping in mind the resources and time needed to perform manual or crowdsourced transcription, we note that the ML end-to-end accuracy of 83.66% – trained on only 11,630 dates and 11,072 ages – might in many cases be acceptable, especially for large document collections that are otherwise infeasible to transcribe.

It should be noted that the ML model in this application has not been carefully optimised and it has only been trained on a small training dataset. It is conceivable that the model can perform substantially better. In addition, improvements in segmentation would also positively impact the end-to-end accuracy. The previously discussed within-field segmentation using Mask R-CNN (He, Gkioxari, et al., 2017) would be an option for this. Also, in practical applications, the model can be used to speed up transcription while retaining manual review of each (or some) predictions. The reviewed predictions can then be used to re-train and improve the model.

There is also the possibility to rely on the confidence predicted by the model and remove transcriptions where the model is highly uncertain. We could then apply a threshold on the prediction probabilities and manually review sequences with predicted probabilities below the threshold. Although, it is questionable if these probabilities can be understood as a general measure of confidence, see e.g. the discussion of calibration in Guo et al. (2017).

4. CONCLUDING REMARKS
We have provided two motivating examples that apply ML to collect data from scanned documents. First, we have shown that (unsupervised) layout classification can be a useful tool in intervention studies where treatment compliance is inferred from documents. As an exploratory step, the layout classification can reveal the composition of pages within the collection and details about assignment. We demonstrated on a clean dataset of nurse journals that a simple unsupervised method can achieve precision and recall around 1. Second, we have demonstrated an end-to-end ML pipeline that transcribes handwritten dates and ages from a large collection of death certificates. Once such a pipeline is trained, the difference in cost between transcribing 250,000 and 2 million documents is negligible as opposed to the cost of the equivalent manual transcription.
REFERENCES
Abramitzky, Ran et al. (2019). Automated linking of historical data. Tech. rep. National Bureau of Economic Research.
Angrist, Joshua D, Guido W Imbens, and Donald B Rubin (1996). "Identification of causal effects using instrumental variables". In: Journal of the American Statistical Association 91.434, pp. 444–455.
Angrist, Joshua D and Alan B Krueger (1991). "Does Compulsory School Attendance Affect Schooling and Earnings?" In: The Quarterly Journal of Economics 106.4, pp. 979–1014.
Athey, Susan and Stefan Wager (2017). "Efficient Policy Learning". In: arXiv preprint 1702.02896. url: https://arxiv.org/abs/1702.02896.
Bailey, Martha et al. (2017). How Well Do Automated Linking Methods Perform? Lessons from US Historical Data. Tech. rep. National Bureau of Economic Research.
Bay, Herbert et al. (2008). "Speeded-up robust features (SURF)". In: Computer Vision and Image Understanding 110.3, pp. 346–359.
Besl, Paul J. and Neil D. McKay (1992). "A method for registration of 3-D shapes". In: Sensor Fusion IV: Control Paradigms and Data Structures. Vol. 1611. International Society for Optics and Photonics, pp. 586–606.
Biering-Sørensen, Fin, Jørgen Hilden, and Knud Biering-Sørensen (1980). "Breast-feeding in Copenhagen, 1938-1977: Data on more than 365,000 infants". In: Danish Medical Bulletin 27, pp. 42–48.
Bjerregaard, Lise G et al. (2014). "Effects of body size and change in body size from infancy through childhood on body mass index in adulthood". In: International Journal of Obesity 38.
Bluche, Théodore, Jérôme Louradour, and Ronaldo Messina (2017). "Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention". In: 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR). Vol. 1. IEEE, pp. 1050–1055.
Bluche, Théodore, Hermann Ney, and Christopher Kermorvant (2014). "A Comparison of Sequence-Trained Deep Neural Networks and Recurrent Neural Networks Optical Modeling for Handwriting Recognition". In: Statistical Language and Speech Processing. Ed. by Laurent Besacier, Adrian-Horia Dediu, and Carlos Martin-Vide. Cham: Springer International Publishing, pp. 199–210.
Chen, Nawei and Dorothea Blostein (2007). "A survey of document image classification: problem statement, classifier architecture and performance evaluation". In: International Journal of Document Analysis and Recognition (IJDAR) 10.1, pp. 1–16.
Cho, Kyunghyun et al. (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, pp. 1724–1734.
Clinchant, S. et al. (2018). "Comparing Machine Learning Approaches for Table Recognition in Historical Register Books". In: 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), pp. 133–138.
Coüasnon, Bertrand and Aurélie Lemaitre (2014). "Recognition of Tables and Forms". In: Handbook of Document Image Processing and Recognition. Ed. by David Doermann and Karl Tombre. London: Springer London, pp. 647–677.
Csurka, Gabriella et al. (2004). "Visual categorization with bags of keypoints". In: Workshop on Statistical Learning in Computer Vision, ECCV. Vol. 1. Prague, pp. 1–22.
Einav, Liran and Jonathan Levin (2014). "Economics in the age of big data". In: Science 346.6210.
Ester, Martin et al. (1996). "A density-based algorithm for discovering clusters in large spatial databases with noise". In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96). AAAI Press, pp. 226–231.
Goodfellow, Ian, Yoshua Bengio, and Aaron Courville (2016). Deep Learning. MIT Press.
Graves, Alex, Santiago Fernández, et al. (2006). "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks". In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376.
Graves, Alex, Marcus Liwicki, et al. (2008). "A novel connectionist system for unconstrained handwriting recognition". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.5, pp. 855–868.
Guo, Chuan et al. (2017). "On Calibration of Modern Neural Networks". In: Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, pp. 1321–1330.
Gupta, Ankush, Andrea Vedaldi, and Andrew Zisserman (2018). "Learning to Read by Spelling: Towards Unsupervised Text Recognition". In: arXiv preprint 1809.08675. url: https://arxiv.org/abs/1809.08675.
Gutmann, Myron P., Emily Klancher Merchant, and Evan Roberts (2018). ""Big Data" in Economic History". In: The Journal of Economic History 78.1.
Harris, Chris and Mike Stephens (1988). "A Combined Corner and Edge Detector". In: Proceedings of the Alvey Vision Conference, pp. 147–151.
He, K., G. Gkioxari, et al. (2017). "Mask R-CNN". In: 2017 IEEE International Conference on Computer Vision (ICCV), pp. 2980–2988.
He, Kaiming, Xiangyu Zhang, et al. (2016). "Deep residual learning for image recognition". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778.
James, Gareth et al. (2013). An Introduction to Statistical Learning. Vol. 112. Springer.
Kingma, Diederik P and Jimmy Ba (2014). "Adam: A method for stochastic optimization". In: arXiv preprint 1412.6980. url: https://arxiv.org/abs/1412.6980.
Lee, Chen-Yu and Simon Osindero (2016). "Recursive recurrent nets with attention modeling for OCR in the wild". In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2231–2239.
Li, X., L. Wang, and Y. Fang (2019). "PC-Net: Unsupervised Point Correspondence Learning with Neural Networks", pp. 145–154.
Lowe, David G (2004). "Distinctive image features from scale-invariant keypoints". In: International Journal of Computer Vision 60.2, pp. 91–110.
Maaten, Laurens van der and Geoffrey Hinton (2008). "Visualizing Data using t-SNE". In: Journal of Machine Learning Research 9, pp. 2579–2605.
Murphy, Kevin P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press.
Myronenko, Andriy and Xubo Song (2010). "Point set registration: Coherent point drift". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 32.12, pp. 2262–2275.
Nagy, George (2016). "Disruptive developments in document recognition". In: Pattern Recognition Letters 79, pp. 106–112.
Nion, T. et al. (2013). "Handwritten Information Extraction from Historical Census Documents". In: 2013 12th International Conference on Document Analysis and Recognition, pp. 822–826.
Pan, Sinno Jialin and Qiang Yang (2009). "A survey on transfer learning". In: IEEE Transactions on Knowledge and Data Engineering 22.10, pp. 1345–1359.
Simonyan, Karen and Andrew Zisserman (2015). "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: International Conference on Learning Representations. url: https://arxiv.org/abs/1409.1556.
Sivic, Josef and Andrew Zisserman (2009). "Efficient visual search of videos cast as text retrieval". In: IEEE Transactions on Pattern Analysis and Machine Intelligence 31.4, pp. 591–606.
Szeliski, Richard (2010). Computer Vision: Algorithms and Applications. Springer.
Varian, Hal R. (2014). "Big Data: New Tricks for Econometrics". In: Journal of Economic Perspectives 28.2, pp. 3–28.
Xu, Kelvin et al. (2015). "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention". In: International Conference on Machine Learning, pp. 2048–2057.
Ye, Junting and Steven Skiena (2019). "The Secret Lives of Names? Name Embeddings from Social Media". In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 3000–3008.
APPENDIX

Dates dictionary

Description                      Tokens
Day and year digits              <0>, <1>, <2>, <3>, <4>, <5>, <6>, <7>, <8>, <9>
Months                           <january>, <february>, ..., <december>
Separators                       <sep>
Sequence markers and padding     <start>, <end>, <pad>

Table 7: Dictionary Ω_date. The month, separator, and sequence-marker tokens are denoted <january>, ..., <december>, <sep>, <start>, <end>, and <pad>, matching the notation used in Section 3.4.