EASTER: Efficient and Scalable Text Recognizer
Kartik Chaudhary
Optum, UnitedHealth Group, Bengaluru, [email protected]
Raghav Bali
Optum, UnitedHealth Group, Bengaluru, [email protected]
ABSTRACT
Recent progress in deep learning has led to the development of Optical Character Recognition (OCR) systems which perform remarkably well. Most research has centred on recurrent networks [4, 8, 11–13, 24, 25, 29] as well as complex gated layers [2, 14, 30], which make the overall solution complex and difficult to scale. In this paper, we present an Efficient And Scalable TExt Recognizer (EASTER) to perform optical character recognition on both machine printed and handwritten text. Our model utilises only 1-D convolutional layers without any recurrence, which enables parallel training with considerably less data. We experimented with multiple variations of our architecture, and one of the smallest variants (in terms of depth and number of parameters) performs comparably to complex RNN-based choices. Our 20-layered deepest variant outperforms RNN architectures by a good margin on benchmark datasets like IIIT-5k and SVT. We also showcase improvements over the current best results on the offline handwritten text recognition task. Finally, we present data generation pipelines with an augmentation setup to generate synthetic datasets for both handwritten and machine printed text.
1 INTRODUCTION
Text is a ubiquitous entity in natural images and most real world datasets like scanned documents, restaurant menu cards, receipts, tax forms, license plates, etc. These datasets may contain text in both printed and handwritten formats. Extracting text information from such datasets is a complex task due to the variety of writing styles, and more so due to the limitation of ground truth. Optical Character Recognition (OCR) systems have been in existence for quite some time now [9, 23]. Improvements and research in deep learning, with CNN and LSTM based OCR solutions [3, 4, 8, 11–13, 24, 25, 29], have taken the field by storm. The results from these solutions are leaps and bounds ahead of traditional solutions like Tesseract [28]. The downside of these deep learning based solutions is their dependence on huge amounts of data and compute.

Handwritten Text Recognition (HTR) is an even more involved process, with countless variations of style. While OCR for printed text has seen good improvements, HTR still remains a challenge. Lack of training data adds to the list of issues. Moreover, models trained for printed text do not generalise well (even with transfer learning) to HTR tasks.

Ingle et al., in their work "A scalable handwritten text recognition system" [14], showcase improvements on limited datasets with architectures that have recurrent connections. They present Gated Recurrent Convolutional Layers (GRCLs) as specialised convolutional layers which perform recurrence along depth, as compared to LSTMs which do the same along the time dimension. In this paper we address the problems of training data volume and compute requirements together by presenting three variants of our fully convolutional architecture, devoid of recurrent connections. These models handle both handwritten and machine printed texts.
Being fully convolutional (using only 1-dimensional convolutions) enables the development of smaller, faster and parallel-trainable models. This further reduces the barrier for deployment and scalability. Coquenet et al. [7] challenge the notion of recurrent networks by presenting gated convolutional networks; they present results which improve upon, or are on par with, CNN-BiLSTM based networks. Works by Shi et al. [27], Jaderberg et al. [15] and Lee and Osindero [18] improve further by utilising specialised attention mechanisms, complex recurrent convolutions and other enhancement techniques to solve the tasks of OCR and HTR.

Another major difference between EASTER and the models described in [2, 3, 7, 14] (apart from architectural differences) is the training data. We focused our model on performing OCR on word level inputs, i.e. each input is a single word. This restriction was based upon practical and deployment considerations of our application. This simplification also eases our dataset preparation and training process, but does not limit the model's capability to handle line level inputs. We confirm our claims by showcasing performance on line level inputs as well. This shows the robustness of our model and its generalisation capabilities, all without using RNNs.

Our work is inspired by research in the field of Automatic Speech Recognition (ASR). Similar to text recognition, the task of speech recognition works upon a sequential input where label alignment is not trivial. ASR-related works by Li et al. and Collobert et al. rely on non-recurrent architectures built from multiple repeating block structures composed of different sub-components. They also utilise residual connections and experiment with different depths to achieve state of the art performance. To the best of our knowledge, this is the first work that leverages only 1-D convolutions for the task of offline handwriting recognition without any recurrence. EASTER, despite being a simple architecture, outperforms more complex models in terms of training time, volume of training data and performance (word and character error rates). Figure 1 presents the overall setup.

The remainder of this paper is organised as follows: Section 2 describes how we prepare different datasets to train EASTER for the tasks of OCR on machine printed text and HTR on handwritten text, and outlines the image augmentation pipeline used to infuse variance in the training dataset. Section 3 discusses the EASTER architecture in detail. In sections 4 and 5 we discuss different experiments with variations of the EASTER architecture and present results across different datasets, respectively. Section 6 concludes the paper.

Figure 1: The EASTER pipeline consists of a configurable data generator, an image augmentation pipeline and the recurrence-free EASTER architecture. Given a list of data patterns (names, addresses, phone numbers, etc.) along with a list of fonts, styles, etc., the configurable data generator can generate training datasets of the required size. The augmentation pipeline helps add perturbations to generate realistic samples.

2 DATASETS
To prepare training data for handwritten and machine printed text, we applied different methodologies. Preparing training data for handwritten text is more difficult than for machine printed text: the first difficulty is the availability of annotated handwritten text, followed by the large variation in handwriting styles. Manually preparing such datasets is time consuming and cost ineffective. In the following subsections we cover data preparation methodologies for both tasks in detail, followed by details on augmentation of these datasets.
2.1 Handwritten Text
We utilised three different approaches to prepare the training dataset for the task of handwritten OCR. The IAM handwriting dataset [21] contains stroke information for handwritten text collected from various contributors in an unconstrained setting. We leverage the images from the offline subset, which have corresponding transcriptions available, as input samples. This dataset has limited samples, yet provides line level data points with different handwriting styles and text patterns.

The next approach was to synthetically generate handwritten text data, similar to how we prepared the machine printed dataset. For this we leveraged the method presented by Alex Graves [10] to generate handwritten text. Similar to GRCL [14], we enabled the model to generate different styles and variations as well. We used specific hyper-parameters to control the mixture density layer and sampling temperature to generate usable samples. The final approach was to manually write as well as label samples, to make sure we cover a wide variety of text patterns and styles. See figure 2 for samples from our handwritten text dataset.

Together, these three approaches led to the training dataset we utilised for training our HTR model. Note that for some experiments we utilised only specific subsets of this dataset; more details are in the experiments section.
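The sampling temperature mentioned above trades legibility against stylistic variation. The following sketch (our own illustration, not Graves' implementation) shows the usual mechanics of temperature-scaled sampling from a model's output logits: low temperatures sharpen the distribution towards the most likely choice, high temperatures flatten it.

```python
import numpy as np

def sample_with_temperature(logits, temperature, rng=None):
    """Sample an index from unnormalised logits after temperature scaling.

    Lower temperatures sharpen the distribution (cleaner, more conservative
    samples); higher temperatures increase variation.
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()                         # numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()  # softmax
    return rng.choice(len(probs), p=probs)
```

In a handwriting generator, the same scaling is typically applied to the mixture density layer's component weights before each pen-stroke is sampled.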
2.2 Machine Printed Text
Collecting a machine printed dataset from within the organisation and from different public sources has certain restrictions and issues. These range from datasets being very clean, to limited fonts, to containing only specific types of patterns. Instead of collecting and curating data from such sources, we devised an ingenious method for synthetically generating a machine printed text dataset. The first step was to assemble a list of common patterns of real world text: street addresses, names, dollar amounts, etc. The second step was to prepare a repository of different font styles, strokes and formatting (underline, italics, etc.). The final step was to use these patterns and styles as inputs to a probabilistic synthetic text data generator. The generator was designed with configurable settings to adjust the variation in styles, patterns, length of text samples and overall dataset size. See figure 2 for samples from our machine printed text dataset. Using this method, we virtually have the ability to generate infinite such datasets for training and improving our models.
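The three steps above can be sketched as a small pattern-filling generator. The pattern tokens, filler values and style names below are illustrative assumptions; the paper does not publish its generator's interface.

```python
import random
import string

# Hypothetical pattern tokens and fillers, for illustration only.
FILLERS = {
    "NAME": lambda rng: rng.choice(["John Doe", "Priya Shah", "Wei Chen"]),
    "AMOUNT": lambda rng: f"${rng.randint(1, 9999)}.{rng.randint(0, 99):02d}",
    "DIGITS": lambda rng: "".join(rng.choice(string.digits) for _ in range(5)),
}
STYLES = ["plain", "bold", "italic", "underline"]

def generate_sample(pattern, rng):
    """Fill a pattern like 'Paid {AMOUNT} to {NAME}' and pick a render style."""
    text = pattern
    for token, fill in FILLERS.items():
        while "{" + token + "}" in text:
            text = text.replace("{" + token + "}", fill(rng), 1)
    return {"text": text, "style": rng.choice(STYLES)}

def generate_dataset(patterns, size, seed=42):
    """Draw `size` samples, choosing a pattern and a style per sample."""
    rng = random.Random(seed)
    return [generate_sample(rng.choice(patterns), rng) for _ in range(size)]
```

A rendering step (drawing each `text` in its chosen font/style onto an image) would follow; dataset size and style variance are controlled by `size`, `patterns` and the filler distributions.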
2.3 Augmentation
Text available in the real world is rarely in noise free conditions. Issues such as bad scan quality, contrast problems and faded ink (especially in the case of handwritten text) complicate the tasks of OCR and HTR even more. To develop models which can handle such issues, we added certain augmentations to our datasets. The augmentation pipeline was designed to add perturbations which help mimic real world scenarios. Inspired by the degradation pipeline in [14] and the augmentation library imgaug [17], we added augmentations to both the machine printed and handwritten training datasets. Noisy backgrounds are a common feature, hence Gaussian noise, salt and pepper, fog, speckle, random lines/strokes, etc. were used. To handle different sizes, we made use of padding across the four edges, or a combination of two to three edges, to generate different sizes. Perspective related augmentations involved rotation, warping, dilation, shear, etc. These augmentation techniques were applied to our datasets using a probabilistic pipeline, to ensure enough variability and representation. See figure 3 for sample augmentations.

Figure 2: Synthetically generated random samples; (a) handwritten text samples, (b) machine printed text samples for a random name, email address, dollar value and street address.

Figure 3: Augmented samples. Each row represents a random sample from the synthetic dataset, where the left image is the initial output and the right image represents the output after application of one or several augmentations.

3 EASTER ARCHITECTURE
OCR and HTR both involve taking images containing text as input and generating the corresponding text as output. Recurrent architectures make use of LSTMs to capture sequential dependency, followed by Connectionist Temporal Classification (CTC) loss [11] to help train models without explicitly aligning inputs and their labels. The basic architecture we present here is a 1-D convolutional network inspired by research in the field of Automatic Speech Recognition (ASR). Works by Li et al. [20], Collobert et al. [6] and Pratap et al. [26] highlight the effectiveness of convolutional networks in handling sequence to sequence (ASR) tasks without recurrent connections. We extend the same thought process to the fields of OCR and HTR using EASTER (see figure 5).
3.1 EASTER Encoder
EASTER follows a block approach wherein each block consists of multiple repeating sub-blocks. Each sub-block comprises a 1-D convolutional layer with multiple filters, followed by layers for normalisation, ReLU and dropout. We utilise padding to maintain the dimensions of the input slice. Each EASTER architecture has 1 preprocessing block and 3 post-processing blocks; the pre and post processing blocks follow a similar block structure. In our experiments, we found Batch Normalisation to outperform other normalisation techniques, similar to the findings in [20]. Figure 4 shows a sample EASTER block. Table 1 shows the structure of a 3x3 EASTER model, outlining the number of blocks, sub-blocks, number of filters and other hyper-parameters.
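The forward pass of a single sub-block (1-D convolution, then batch normalisation, ReLU and dropout) can be sketched in numpy. Shapes and hyper-parameters here are illustrative assumptions, not the paper's exact configuration, and the batch statistics are computed over the time axis for simplicity.

```python
import numpy as np

def conv1d_same(x, kernels):
    """1-D convolution with 'same' padding.

    x: (time, in_channels); kernels: (width, in_channels, out_channels).
    """
    width, _, out_ch = kernels.shape
    pad = width // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    t = x.shape[0]
    out = np.empty((t, out_ch))
    for i in range(t):
        window = xp[i:i + width]  # (width, in_channels)
        out[i] = np.tensordot(window, kernels, axes=([0, 1], [0, 1]))
    return out

def sub_block(x, kernels, gamma, beta, drop_rate=0.2, training=False, eps=1e-5):
    """Conv1D -> BatchNorm -> ReLU -> Dropout, one EASTER-style sub-block."""
    y = conv1d_same(x, kernels)
    mean, var = y.mean(axis=0), y.var(axis=0)       # per-channel statistics
    y = gamma * (y - mean) / np.sqrt(var + eps) + beta
    y = np.maximum(y, 0.0)                          # ReLU
    if training:
        mask = np.random.default_rng(0).random(y.shape) >= drop_rate
        y = y * mask / (1.0 - drop_rate)            # inverted dropout
    return y
```

Note that the 'same' padding keeps the time dimension unchanged, matching the paper's statement that padding maintains the dimensions of the input slice.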
3.2 Decoder: Connectionist Temporal Classification
The Connectionist Temporal Classification (CTC) method [11] is used both to train our models and to infer results from them. The characters (our OCR/HTR vocabulary) in the input image vary in width and spacing; CTC enables us to handle such a task without needing to align input images with their ground truth.

We denote the training dataset as $D = \{X, Y\}$, where $X$ is the input image for transcription and $Y$ is the label or ground truth. Assuming a vocabulary set $L$, then $Y \in L^{s}$, where $s$ represents the label length. CTC generates an output at every time step $t$. The intermediate output contains a sequence of repeating consecutive characters with $\epsilon$ (denoting blank) in between; thus, we add the blank symbol $\epsilon$ to the vocabulary set $L$:

$L^{+} = L \cup \{\epsilon\}$  (1)

The objective is to minimise the negative log probability of obtaining $Y$ given an input $X$:

$\text{Objective} = -\sum_{(X,Y) \in D} \log p(Y|X)$  (2)

The final output is obtained by merging consecutive repeating characters delimited by $\epsilon$. For instance, an output sequence like $1\epsilon bb\epsilon\epsilon a$ maps to $1ba$. To obtain such an output, we define a function $\gamma$ which squeezes repeating characters into a single occurrence and removes blanks ($\epsilon$); $\gamma$ thus maps an intermediate repeating sequence (denoted $\pi$) to the final output $y$:

$p(y|X) = \sum_{\pi:\,\gamma(\pi)=y} p(\pi|X)$  (3)

$p(\pi|X) = \prod_{t=1}^{T} y^{t}_{\pi_t}$  (4)

where $y^{t}_{\pi_t}$ is the probability of generating label $\pi_t$ at time $t$. The predicted label $y$ for input $X$ is then given as:

$y = \gamma\big(\arg\max_{\pi}\, p(\pi|X)\big)$  (5)

Because the intermediate output contains far more $\epsilon$'s than actual characters, the model becomes biased towards the blank class ($\epsilon$). To address this problem, Li and Wang [19] show multiple ways of adjusting the class weights for CTC. These weighting strategies address the class imbalance and result in faster convergence. The class weighted CTC method is denoted as:

$\text{WeightedCTC}(y|X) = -\sum_{t}\sum_{k} \alpha_k\, y^{t}_{k} \log y^{t}_{k}$  (6)

where $y^{t}_{k}$ is the generated output for class $k$ at time $t$, and

$\alpha_k = \begin{cases} 1-\alpha & \text{if } k = \epsilon \\ \alpha & \text{otherwise} \end{cases}$  (7)

In our experiments we saw significant improvements in performance with weighted CTC while training on small datasets. The most basic architecture for EASTER is shown in figure 5: a 3x3 architecture with separate preprocessing and post-processing blocks.

Table 1: EASTER 3x3: 9 blocks, each consisting of 3 1-D convolutional sub-blocks, plus 1 preprocessing block and 3 post-processing blocks. Overall, the model contains 14 layers with 1 million trainable parameters.

Figure 4: Components of an EASTER block. Each block contains multiple repeating sub-blocks consisting of layers for 1-D convolution, batch normalisation, ReLU and dropout. Different blocks utilise different numbers of convolutional filters and other hyper-parameters.
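The collapse function $\gamma$ and greedy decoding of equation (5) can be illustrated with a short, self-contained sketch (symbol indices and names are ours, not from the paper's code):

```python
import numpy as np

BLANK = 0  # index of the blank symbol (epsilon) in the vocabulary

def collapse(path, blank=BLANK):
    """The gamma function: merge consecutive repeats, then drop blanks.

    e.g. [1, blank, 2, 2, blank, blank, 3] -> [1, 2, 3]
    """
    out, prev = [], None
    for label in path:
        if label != prev and label != blank:
            out.append(label)
        prev = label
    return out

def greedy_ctc_decode(probs, blank=BLANK):
    """probs: (T, |L|+1) per-timestep distribution over vocabulary + blank.

    Greedy decoding approximates argmax over paths by taking the most
    likely symbol at each time step independently, then applying gamma.
    """
    best_path = np.argmax(probs, axis=1)
    return collapse(best_path.tolist(), blank)
```

With the mapping {epsilon: 0, "1": 1, "b": 2, "a": 3}, the path for the paper's example "1(eps)bb(eps)(eps)a" collapses to [1, 2, 3], i.e. "1ba".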
3.3 Training Details
For a typical training input, we first transform the input image to grayscale and scale it down to a height of 40 pixels. The network is able to handle variable width inputs and requires no additional transformations. Individual characters have specific local structures, and we utilise overlapping 1-D convolutions to exploit them. 1-D convolutions also capture short term sequential dependencies across sliding frames, and further assist in capturing the underlying temporal aspects of the sequence. The EASTER block architecture enables it to learn higher level features without the need for recurrence or any specialised gating mechanism. We trained our models using Keras [5] with a TensorFlow [1] backend. The smallest model, with half a million (0.5M) parameters, achieves its best results in under 1 hour of training; this enables faster turnaround time for retraining and experiments, and eases deployment. We experimented with two more variants: one with an increased number of filters, and the other with more depth. The deepest variant has 28M parameters and outperforms RNN based models by a good margin on our internal dataset.
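The input preprocessing described above (grayscale conversion plus an aspect-preserving rescale to a 40-pixel height, width left variable) can be sketched in numpy; the function names and nearest-neighbour interpolation are our illustrative choices, not the paper's exact implementation.

```python
import numpy as np

TARGET_HEIGHT = 40  # fixed input height used by the paper

def to_grayscale(rgb):
    """rgb: (H, W, 3) array -> (H, W) luminance, standard ITU-R 601 weights."""
    return rgb @ np.array([0.299, 0.587, 0.114])

def resize_to_height(img, height=TARGET_HEIGHT):
    """Nearest-neighbour rescale to a fixed height, preserving aspect ratio.

    Width stays variable, matching the network's variable-width input.
    """
    h, w = img.shape
    new_w = max(1, round(w * height / h))
    rows = (np.arange(height) * h / height).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return img[rows][:, cols]

def preprocess(rgb):
    return resize_to_height(to_grayscale(np.asarray(rgb, dtype=np.float64)))
```

Because only the height is fixed, batches of different-width images would be padded to the widest sample at training time.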
4 EXPERIMENTS AND RESULTS
4.1 Handwritten Text Recognition
We evaluate EASTER's performance on the IAM dataset for the task of HTR. To enable comparison with GRCL [14], we follow the same process of training our models on different combinations of the IAM dataset. Ingle et al. first train GRCL on the IAM-Offline dataset, which contains only 6,161 samples, and report results on the IAM-Offline test set. We repeat the same experiment with EASTER-5x3 (the 20 layer variant) and showcase improvements on both WER and CER by a good margin. We also went a step further and trained our model on only the first 3,000 of the 6,161 samples, i.e. only about 50% of the actual training data. EASTER-5x3's performance on 3k samples was observed to be better than GRCL's with 6,161 training samples.

In the second experiment performed by Ingle et al., smaller input samples are concatenated in different combinations to form a larger training dataset of 511,524 samples, significantly larger than in the first experiment. We performed this experiment without the augmentation pipeline and observe better performance from EASTER. Table 2 summarises the two experiments.

We observed a similar performance boost on internal datasets. The improvements in WER and CER, along with the need for less training data and a lighter model (in terms of trainable parameters and memory footprint), helped us meet production requirements as well.

Figure 5: Basic EASTER architecture with 1 preprocessing block and 3 post-processing blocks.

Table 2: Results on the IAM Offline dataset, measured using Word Error Rate (WER) and Character Error Rate (CER). GRCL refers to Gated Recurrent Convolutional Layers as presented by Ingle et al. [14].

| Model  | Training Dataset      | # Samples | Augmentation | WER  | CER  |
|--------|-----------------------|-----------|--------------|------|------|
| EASTER | IAM-Off               | 3,000     | No           | —    | —    |
| GRCL   | IAM-Off               | 6,161     | No           | 35.2 | 14.1 |
| EASTER | IAM-Off               | 6,161     | No           | —    | —    |
| GRCL   | IAM-Off + IAM-On-Long | 511,524   | No           | 22.3 | 8.8  |
| EASTER | IAM-Off + IAM-On-Long | 511,524   | No           | —    | —    |
| GRCL   | IAM-Off + IAM-On-Long | 511,524   | Yes          | 17.0 | 6.7  |
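The WER and CER metrics reported above are edit-distance based; a minimal reference implementation (ours, for illustration) is:

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (characters or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: character-level edits / reference length."""
    return edit_distance(reference, hypothesis) / max(1, len(reference))

def wer(reference, hypothesis):
    """Word Error Rate: word-level edits / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    return edit_distance(ref, hyp) / max(1, len(ref))
```

Both rates are typically averaged over a test set and reported as percentages, as in Table 2.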
4.2 Machine Printed and Scene Text Recognition
We performed multiple separate experiments using EASTER for the task of OCR as well. The first experiment was based on the dataset prepared using our data generation pipeline; we utilised this dataset to benchmark performance for internal datasets and use cases. The second set of experiments involved preparing our model for some of the benchmark datasets for text extraction from natural images:

• IIIT-5k [22]: This dataset consists of 2,000 training images collected from the internet, with about 3,000 test samples. The dataset also provides 50 and 100 word lexicons, which we do not use in our experiments.
• Google-SVT [31]: Another interesting dataset, consisting of only 257 training images and 647 test images. Unlike Wang and Hu, we do not use the lexicon in our experiments.

The models with the best results on these benchmark datasets utilise a range of training techniques. Since the volume of training data in these datasets is not enough, the Synth-90k [16] dataset is used as a training set. This dataset consists of realistic synthetic images: 900k test images and about 7 million for training. Other techniques, like preprocessing input crops to handle skew, rotation, contrast, super-resolution, etc., are also applied in some of these works. For our benchmarking experiments we only make use of the Synth-90k dataset for training our models, and apply no preprocessing other than resizing.

Our experimental setup consists of the 20 layer deep variant, i.e. EASTER-5x3, with residual connections. This model has 28 million trainable parameters and a vocabulary of 62 characters from the English alphabet (26 lower-case letters + 26 upper-case letters + 10 numerals). We did not apply any augmentations to our training dataset and performed greedy decoding; no additional post-processing steps, such as language models, were applied. There were two major reasons for experimenting without a language model. Firstly, training a language model requires additional time and data, which is not always available. Secondly, and more importantly, the use of a language model slows down the inference pipeline in practice. Since our aim was to develop a deployable and usable OCR/HTR model, we experimented without a language model for EASTER.

Compared to [15, 18, 27], EASTER comfortably improves upon the WER and CER performance on the IIIT-5k dataset. Our model achieves 86.76% word accuracy with a 4.56% character error rate. In the case of Google-SVT, our model achieves a word accuracy of 78.51% with a 9.7% character error rate. We did not observe any improvements when using the training images from Google-SVT for fine-tuning. Our model improves upon the best result on IIIT-5k, while nearly achieving benchmark results on Google-SVT. It is important to note that in both cases inference was done with a greedy decoder, without relying on the lexicon. The results and comparison are shown in detail in table 3.
5 DISCUSSION
Although our work achieves compelling results in many cases, the results are less favourable in some scenarios. Figure 6 highlights some of the failure modes on the IIIT-5k and Google-SVT datasets. We observed that even though the transcriptions do not match the ground truth, the outputs seem to be honest mistakes. These mistakes can largely be attributed to issues like bad image quality, along with distortions which completely transform certain characters. For instance, consider the distortions of the characters "f" and "G" in figure 6(b) and figure 6(e) respectively: they are distorted to such an extent that it is nearly impossible to pick the correct character without extensive post-processing. There are also cases where specific fonts seem to confuse the model (see figures 6(f) and 6(g)). While distortions are tricky to handle, font related issues can be tackled using more training examples.

The handwritten text recognition task is slightly more complex than the machine printed case. Though EASTER outperforms benchmarks both in terms of performance and in training and compute requirements, it does face challenges when presented with overlapping and highly cursive handwriting. Figure 7 presents a few failure modes. Most issues in the highlighted examples can be attributed to unintelligible scribbles or, at times, hard to distinguish shapes. For instance, in the second example in figure 7, the word "our" is misread as "ow", which is still a reasonable guess given that the model does not use a language model or other post-processing steps.

It is also important to present cases which highlight the robustness of our setup. The EASTER setup was largely trained on straight crops, with a few augmentations catering to rotation and skew of the overall text. Despite not being explicitly trained on curved crops, the model is able to handle such cases with ease. Figure 8 presents a few such examples of successful transcription of curved text. Additional samples visualising model performance on machine printed and handwritten text are available in the appendix.

Table 3: Word accuracy on scene text benchmarks.

| Model            | SVT   | IIIT-5k |
|------------------|-------|---------|
| Jaderberg et al. | 71.1% | -       |
| Lee and Osindero | 80.7% | 78.4%   |
| Shi et al.       | 80.8% | 78.2%   |
| Wang and Hu      | —     | —       |
| EASTER-5x3       | 78.5% | 86.76%  |

Figure 6: Samples where EASTER fails to transcribe correctly. Each example consists of ground truth followed by model output. Samples (b) and (e) can be attributed to distortions, while samples (f) and (g) are associated with font related issues.

Figure 7: EASTER for handwritten text. Samples showcase scenarios where overlapping and highly cursive handwriting lead to incorrect transcriptions.

Figure 8: EASTER is a robust model which handles crops containing curved text as well. Each example consists of ground truth followed by model output.
6 CONCLUSION
We presented a fast, scalable and recurrence free architecture called EASTER for handwritten and machine printed text recognition tasks. Inspired by fully convolutional architectures for automatic speech recognition, we discussed the building blocks and training process for EASTER. We described the dataset preparation process for both handwritten and machine printed text, along with a data augmentation pipeline, and discussed the impact of weighted CTC on faster convergence. Finally, we presented results on benchmark datasets for both HTR and OCR tasks: EASTER achieves significant improvements in WER and CER, and its performance on internal datasets is near state of the art. We also showcased that the model performs equally well on line level data. Due to its smaller number of trainable parameters and correspondingly smaller memory footprint, EASTER is easier to train and faster to use in production applications without much tooling. As future work, we plan to utilise attention mechanisms to improve the decoding stage, and to look into quantisation techniques to further reduce the memory footprint.
ACKNOWLEDGMENTS
We would like to thank Kishore V Ayyadevara and Yeshwanth Reddy for their work and contributions to the OCR project and for making sure it gets widespread adoption, Vineet Shukla for helpful discussions and inputs to improve the solution, and the whole OCR team for their contributions.

REFERENCES
[1] Martín Abadi et al. 2015. TensorFlow: Large-scale machine learning on heterogeneous systems. https://www.tensorflow.org
[2] Vol. 1. IEEE, 646–651.
[3] Fedor Borisyuk, Albert Gordo, and Viswanath Sivakumar. 2018. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. https://doi.org/10.1145/3219819.3219861
[4] D. Castro, B. L. D. Bezerra, and M. Valenca. 2018. Boosting the deep multidimensional long-short-term memory network for handwritten recognition systems. In ICFHR 2018. IEEE Computer Society, Los Alamitos, CA, USA, 127–132. https://doi.org/10.1109/ICFHR-2018.2018.00031
[5] François Chollet et al. 2015. Keras. https://keras.io
[6] Ronan Collobert, Christian Puhrsch, and Gabriel Synnaeve. 2016. Wav2Letter: An end-to-end ConvNet-based speech recognition system. CoRR abs/1609.03193. http://arxiv.org/abs/1609.03193
[7] Denis Coquenet, Yann Soullard, Clément Chatelain, and Thierry Paquet. 2019. Have convolutions already made recurrence obsolete for unconstrained handwritten text recognition?
[8] In Proceedings of the International Conference on Frontiers in Handwriting Recognition, ICFHR.
[9] IEEE Transactions on Pattern Analysis and Machine Intelligence 33, 4 (2010), 767–779.
[10] Alex Graves. 2013. Generating sequences with recurrent neural networks. arXiv:1308.0850
[11] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, 369–376.
[12] Alex Graves, Marcus Liwicki, Santiago Fernández, Roman Bertolami, Horst Bunke, and Jürgen Schmidhuber. 2008. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 31, 5 (2008), 855–868.
[13] Alex Graves and Jürgen Schmidhuber. 2009. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, 545–552.
[14] R. Reeve Ingle, Yasuhisa Fujii, Thomas Deselaers, Jonathan Baccash, and Ashok C. Popat. 2019. A scalable handwritten text recognition system. arXiv:1904.09150
[15] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep structured output learning for unconstrained text recognition. arXiv:1412.5903
[16] Max Jaderberg, Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Synthetic data and artificial neural networks for natural scene text recognition. arXiv:1406.2227
[17] Alexander B. Jung. 2018. imgaug. https://github.com/aleju/imgaug. [Online; accessed 30-Oct-2018].
[18] Chen-Yu Lee and Simon Osindero. 2016. Recursive recurrent nets with attention modeling for OCR in the wild. CoRR abs/1603.03101.
[19] Hongzhu Li and Weiqiang Wang. 2019. A novel re-weighting method for connectionist temporal classification. arXiv:1904.10619
[20] Jason Li, Vitaly Lavrukhin, Boris Ginsburg, Ryan Leary, Oleksii Kuchaiev, Jonathan M. Cohen, Huyen Nguyen, and Ravi Teja Gadde. 2019. Jasper: An end-to-end convolutional neural acoustic model. arXiv:1904.03288
[21] Urs-Viktor Marti and Horst Bunke. 2002. The IAM-database: An English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition.
[22] In BMVC.
[23] Shunji Mori, Hirobumi Nishida, and Hiromitsu Yamada. 1999. Optical Character Recognition. John Wiley & Sons, Inc.
[24] Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. 2014. Dropout improves recurrent neural networks for handwriting recognition. IEEE, 285–290.
[25] Arik Poznanski and Lior Wolf. 2016. CNN-N-Gram for handwriting word recognition. In CVPR 2016, 2305–2314. https://doi.org/10.1109/CVPR.2016.253
[26] Vineel Pratap, Awni Hannun, Qiantong Xu, Jeff Cai, Jacob Kahn, Gabriel Synnaeve, Vitaliy Liptchinsky, and Ronan Collobert. 2018. wav2letter++: The fastest open-source speech recognition system. CoRR abs/1812.07625. http://arxiv.org/abs/1812.07625
[27] Baoguang Shi, Xiang Bai, and Cong Yao. 2015. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. CoRR abs/1507.05717.
[28] Ray Smith. 2007. An overview of the Tesseract OCR engine. In Ninth International Conference on Document Analysis and Recognition (ICDAR 2007), Vol. 2. IEEE, 629–633.
[29] Paul Voigtlaender, Patrick Doetsch, and Hermann Ney. 2016. Handwriting recognition with large multidimensional long short-term memory recurrent neural networks. IEEE, 228–233.
[30] Jianfeng Wang and Xiaolin Hu. 2017. Gated recurrent convolution neural network for OCR. In Advances in Neural Information Processing Systems.
[31] Kai Wang, Boris Babenko, and Serge Belongie. 2011. End-to-end scene text recognition. IEEE, 1457–1464.
APPENDIX
A few additional samples showcase EASTER's transcription performance on not so clear machine printed samples (figure 9). Figure 10 showcases additional samples of successful transcription of handwritten text.

Figure 9: EASTER for machine printed text. Despite unclear images and distortions, transcriptions are of high quality. Each example consists of ground truth followed by model output.

Figure 10: EASTER for handwritten text. Each example consists of ground truth followed by model output.