Classification of Handwritten Names of Cities and Handwritten Text Recognition using Various Deep Learning Models
Daniyar Nurseitov, Kairat Bostanbekov, Maksat Kanatov, Anel Alimova, Abdelrahman Abdallah, Galymzhan Abdimanap
Advances in Science, Technology and Engineering Systems Journal, Vol. 5, No. 2, XX-YY (2020)
ASTES Journal, ISSN: 2415-6698
Satbayev University, Almaty, Kazakhstan
National Open Research Laboratory for Information and Space Technologies, Almaty, Kazakhstan
MSc Machine Learning & Data Science, Satbayev University, Almaty, Kazakhstan
Keywords: Deep Learning; convolutional neural networks; recurrent neural networks; Russian handwriting recognition; Connectionist Temporal Classification

Abstract: This article discusses the problem of handwriting recognition in the Kazakh and Russian languages. This area is poorly studied, since the literature contains almost no work in this direction. We describe various approaches and recent achievements in the development of handwriting recognition models for Cyrillic script. The first model uses deep convolutional neural networks (CNNs) for feature extraction and a fully connected multilayer perceptron (MLP) for word classification. The second model, called SimpleHTR, uses CNN and recurrent neural network (RNN) layers to extract information from images. We also applied the Bluche and Puigcerver models to compare the results. Due to the lack of available open datasets in Russian and Kazakh, we collected data comprising handwritten names of countries and cities: 42 different Cyrillic words, each written more than 500 times in different handwriting. We also used the handwritten database of the Kazakh and Russian languages (HKR), a new database of Cyrillic words (not only countries and cities) created by the authors of this work.

This paper is an extension of work originally presented at the International Conference on Electronics, Computer and Computation (ICECCO) [1].

Handwriting text recognition (HTR) is the process of converting handwritten characters or phrases into a format that a computer understands. It has attracted an active community of academic researchers over the past few years, since advances in this subject help to automate various kinds of routine tasks and office work.
An example could be a historian's painstaking search for a particular document among heaps of handwritten ancient manuscripts, which requires a huge amount of time. Converting these manuscripts into a digital format using HTR algorithms would allow the historian to find the data within a few seconds. Other examples of routine work that needs automation are the tasks associated with signature verification, writer identification, and others. Digitized handwritten text could contribute to the automation of many companies' business processes, simplifying human work. Our national postal service, for instance, does not have an automated mail processing system that recognizes handwritten addresses on an envelope. The operator has to process the data of any incoming correspondence manually. Automation of this mail-registration cycle would dramatically decrease the postal service's mail delivery costs.

The key advances in HTR for mail processing are primarily aimed at solving the problems of locating the region of interest in the images, text segmentation, and eliminating interference from text background noise, such as missing or ambiguous bits, spots on paper, and skew.

The whole cycle of recognizing handwritten addresses on correspondence using machine learning, from start to end, consists of the following steps:
• Letters are put face-up on a moving conveyor.
• A snapshot is taken at a certain place on the conveyor.
• The machine processes the snapshot and extracts the addresses of both the sender and the receiver.
• The address is passed to the sorting and tracking system.

*Corresponding author: Abdelrahman Abdallah, Satbayev University, Almaty, Kazakhstan, [email protected]

Any supervised machine learning problem requires labeled input data on which to train the model.
In our case, it is necessary to train at least two models: one to determine the areas of the image where the text is located, and another to recognize words. Forms for collecting handwriting samples by keyword were designed and distributed. A dataset was formed from scanned images of the front sides of envelopes with handwritten text, for training the detection of areas of interest in the image. A model was trained to detect an area of handwritten text on the face of an envelope. The algorithm for segmenting the detected text block into lines and words was implemented using the construction of histograms.

Offline handwritten address recognition is a special case of offline Cursive Word Recognition (CWR). The main difference is that the set of words for recognition is limited to words that can occur in addresses. To solve the problem of handwriting recognition, machine learning methods are used, namely the RNN- and CNN-based HTR models of Bluche [2] and Puigcerver [3].

The Russian and Kazakh languages are very difficult and challenging when it comes to recognizing text, since writers can join characters together, as in Figure 1, so segmentation into characters can be impossible.

Figure 1: An example of Russian text that is difficult to recognize: "Мы лишились шишок" ("My lishilis shishok")

This project aims at further study of the challenge of classifying Russian handwritten text and translating handwritten text into digital format. Handwritten text is a very broad concept, and for our purposes we decided to restrict the scope of the project by specifying what we mean by handwritten text. In this project we took on the task of classifying handwritten words, which may be written as connected sequences. This research can be combined with algorithms that segment a line image into word images, which in turn can be combined with algorithms that segment an image of an entire handwritten page into line images.
Our research takes the form of an end-to-end user program: a fully functional model that helps the user transform handwritten text into digital format. The aim of the work is to implement a handwriting recognition system able to recognize Russian and Kazakh handwritten words written by different writers. We use the HKR database [4] for the training, validation, and test sets. The main contributions, covering several key techniques proposed in our system, can be highlighted as follows:
1. Pre-processing of the input snapshot (noise elimination, horizontal alignment). An unprocessed snapshot is forwarded as input data at this step. Here the noise is reduced, and the object's angle of rotation is measured about the axis perpendicular to the image plane.
2. Segmentation of the text into word areas. At this stage, handwritten words are detected and cut into rectangular areas within the text for further recognition.
3. Word recognition. After the snapshot has been successfully segmented into separate word areas, direct word recognition begins.

In this article we evaluated models using two methods. In the first method, the standard performance measures are used for all results presented: the character error rate (CER) and the word error rate (WER) [5]. The CER is defined via the Levenshtein distance as the number of character substitutions (S), insertions (I), and deletions (D) required to turn one string into the other, divided by the total number of characters in the ground-truth word (N): CER = (S + I + D) / N. Similarly, the WER is calculated as the number of word substitutions (Sw), insertions (Iw), and deletions (Dw) necessary to transform one string into the other, divided by the total number of ground-truth words (Nw): WER = (Sw + Iw + Dw) / Nw.
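The CER and WER defined above can be computed with a short Levenshtein routine. The sketch below is illustrative Python, not the evaluation code used in the paper:

```python
def levenshtein(a, b):
    """Edit distance: minimum substitutions, insertions, and deletions
    needed to turn sequence a into sequence b (standard DP, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(predicted, truth):
    """Character error rate: (S + I + D) / N over characters."""
    return levenshtein(predicted, truth) / len(truth)

def wer(predicted, truth):
    """Word error rate: (Sw + Iw + Dw) / Nw over whitespace-split words."""
    return levenshtein(predicted.split(), truth.split()) / len(truth.split())
```

Because `levenshtein` only compares elements, the same routine serves both rates: characters for CER, word lists for WER.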
The second method is calculation of the character accuracy rate (CAR) and the word accuracy rate (WAR).

This article considers four main models based on artificial neural networks (ANNs). Russian and Kazakh handwriting recognition is implemented using a deep CNN [6], the SimpleHTR model [7], Bluche [2], and Puigcerver [3].

The paper is structured as follows: Section 2 describes the related work, Section 3 presents the description of the models proposed in the work, comprehensive results are presented in Section 4, and conclusions in Section 5.

In offline HTR, input features are extracted and selected from images; then an ANN or HMM is used to predict the probabilities and decode them into the final text. The main disadvantage of HMMs is that they cannot model long sequences of data. HMMs have been commonly used for offline HTR because they achieved good results in automatic speech recognition [8, 9]. The basic idea is that handwriting can be perceived as a series of ink signals from left to right, similar to the sequence of acoustic signals in speech. The inspiration for hybrid HMM research models came from:
1. offline HTR [10, 11],
2. offline HTR using conventional HMMs [12],
3. automatic speech recognition using hybrid HMM/ANN models [13, 14],
4. and online HTR [15].

On the other hand, RNNs such as the gated recurrent unit (GRU) [16] and long short-term memory (LSTM) [17] can solve this problem. RNN models have shown remarkable abilities in sequence-to-sequence learning tasks such as speech recognition [18], machine translation [19], video summarization [20], automated question answering [21], and others.

To transform a two-dimensional image for offline HTR, an encoder and decoder are used. The task is solved by GRU and LSTM layers, which take information and features from many directions. These handwriting sequences are fed into RNN networks. Due to the use of Connectionist Temporal Classification (CTC) [22], the input features require no segmentation.
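As an illustration of why CTC needs no segmentation, best-path (greedy) CTC decoding can be sketched in a few lines: take the most likely class at each time step, collapse repeats, and drop blanks. The blank index and the two-letter alphabet below are assumptions made for the example:

```python
import numpy as np

BLANK = 0  # index of the CTC blank symbol (assumed to be class 0 here)

def ctc_greedy_decode(probs, alphabet):
    """Best-path CTC decoding of a (T, C) matrix of per-step class
    probabilities: argmax per step, collapse repeats, remove blanks."""
    best = np.argmax(probs, axis=1)  # most likely class at each time step
    collapsed = [k for i, k in enumerate(best) if i == 0 or k != best[i - 1]]
    return "".join(alphabet[k - 1] for k in collapsed if k != BLANK)
```

Note how a blank between two identical argmax classes keeps a doubled letter, while adjacent repeats without a blank collapse to one: this is exactly the alignment freedom CTC exploits during training.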
One of the key benefits of the CTC algorithm is that it does not need any segmented labeled data: the CTC algorithm aligns the data with the output for us.

A new system called Multilingual Text Recognition Networks (MuLTReNets) was proposed by Zhuo [23]. The main modules of MuLTReNets are the feature extractor, the script identifier, and the handwriting recognizer. To convert text images into features shared by the script identifier and the recognizer, the feature extractor combines spatial and temporal information. The handwriting recognizer adopts LSTM and CTC to perform sequence decoding. The accuracy of script identification reaches 99.9%.

A new attention-based fully gated convolutional RNN was proposed by Abdallah [24]; this model was trained and tested on the HKR dataset [4]. This work shows the effect of the attention mechanism and the gated layer on selecting relevant features. Attention is one of the most influential ideas in the deep learning community. It can be used in image captioning [25], automated question answering [21], and other deep learning problems, achieving good results. The attention mechanism was first used in the context of neural machine translation with Seq2Seq models. The model also achieved the state of the art on the offline Kazakh and Russian handwriting dataset [4]: the Atten-CNN-BGRU architecture achieves 4.5% CER and 19.2% WER on the first test dataset of HKR, and 6.4% CER and 24.0% WER on the second test dataset.

A multi-task learning scheme in Tassopoulou's [26] research teaches the model to perform decompositions of the target sequence with target units of varying granularity, from fine to coarse. They consider this approach a way of using n-gram knowledge in the training cycle, indirectly, while the final recognition is made using only the unigram output.
Unigram decoding of such a multi-task method demonstrates the capacity of the internal representations learned during the training phase from the various n-grams. In this research, they pick n-grams as target units and vary the subword granularity from unigrams to four-grams. The proposed model, even evaluated only on the unigram task, outperforms its single-task counterpart by an absolute 2.52% WER and 1.02% CER with greedy decoding and without any computational overhead during inference, indicating that an implicit language model is successfully learned.

For classification, Weldegebriel [27] proposes a hybrid of two super-classifiers: a CNN and Extreme Gradient Boosting (XGBoost). In this integrated model, the CNN serves as an automatic feature extractor for raw images, and XGBoost uses the extracted features as input for recognition and classification. The error rate of the hybrid model is compared to that of a CNN with a conventional fully connected output layer: in the classification of the handwritten test dataset images, the fully connected layer and the XGBoost classifier achieved error rates of 46.30% and 16.12%, respectively. As a classifier, XGBoost gave better results than the conventional fully connected layer.

A fully convolutional handwriting model suggested by Petroski [28] takes a handwriting sample of unknown length and generates an arbitrary symbol stream. The dual-stream architecture uses both local and global context and avoids the need for strong pre-processing steps such as symbol alignment correction, as well as complex post-processing steps such as connectionist temporal classification, dictionary matching, or language models. The model is agnostic across Latin-script languages using over 100 unique symbols and is shown to be very competitive with state-of-the-art dictionary-based methods on the standard IAM [29] and RIMES [30] datasets.
On IAM, the fine-tuned model achieves 8.71% WER and 4.43% CER, and reaches 5.68% WER and 2.22% CER on RIMES.

A fully convolutional network architecture proposed by Ptucha [31] also outputs symbol streams of arbitrary length from handwritten text. The pre-processing stage normalizes the input blocks to a canonical representation, which negates the need for expensive recurrent symbol alignment. To correct oblique word fragments, a lexicon is introduced and a probabilistic character error rate is added. On both lexicon-based and arbitrary handwriting recognition benchmarks, their multi-stage convolutional approach was the first to show state-of-the-art performance. The final convolutional method achieves 8.22% WER and 4.70% CER.

For Arabic handwritten character recognition, Younis [32] proposed a deep-neural-network approach using CNN models with regularization techniques such as batch normalization to prevent overfitting. The deep CNN was applied to the AIA9k [33] and AHCD [34] databases, and the classification accuracy for the two datasets was 94.8% and 97.6%, respectively.

In this section, four architectures for handwriting recognition are discussed:
1. Deep CNN models [6].
2. The SimpleHTR model [7].
3. Bluche [2].
4. Puigcerver [3].
In this section, we will describe two types of datasets:
The first dataset contains handwritten city names in Cyrillic. It comprises 21,000 images of various handwriting samples (names of countries and cities). We enlarged this dataset for training by collecting 207,438 images from available forms and samples.
The second dataset, HKR for Handwritten Kazakh & Russian Database [4], consists of distinct words (or short phrases) written in the Russian and Kazakh languages (about 95% Russian and 5% Kazakh words). This final dataset was divided into training (70%), validation (15%), and test (15%) sets. The test dataset itself was split into two sub-datasets (7.5% each): the first was named TEST1 and consisted of words that were not included in the training and validation datasets; the other was named TEST2 and consisted of words that were included in the training dataset but written in completely different handwriting styles. The main purpose of splitting the test dataset into TEST1 and TEST2 was to check the difference in accuracy between recognition of unseen words and of words seen at the training stage but in unseen handwriting styles.

In this experiment, the classical approach of classifying images with various deep CNN models was used. To obtain the right distribution of records, image pre-processing strategies and data augmentation methods were used. Three types of models were considered in the experiment:
1. a simple CNN model [35],
2. MobileNet [36],
3. MobileNet with small adjustments.
Experiment 1
In this experiment, a CNN model was used to train on and evaluate the dataset. This model includes two convolutional layers and a softmax layer [37], which produces probabilities for classification. Such a model is commonly used in character classification, for example on the MNIST handwritten digit database [38]. The image data fed to the model had shape 512x61x1. 10% of the dataset was used to evaluate the model.
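A minimal tf.keras sketch of such a two-convolution classifier, assuming the 512x61x1 input size from the text; the filter counts, dense width, and optimizer are illustrative choices, not taken from the paper:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 42  # the first dataset has 42 city/country name classes

model = models.Sequential([
    layers.Input(shape=(512, 61, 1)),            # grayscale word image
    layers.Conv2D(32, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Conv2D(64, (3, 3), activation="relu"),
    layers.MaxPooling2D((2, 2)),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(NUM_CLASSES, activation="softmax"),  # class probabilities
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Unlike the sequence models discussed later, this classifier treats each word image as a single class out of 42, so no CTC stage is needed.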
Experiments 2 and 3
In this section we describe the MobileNet architecture, which consists of 30 layers; the MobileNet model is shown in Figure 3 (for more information, see [36]). This architecture contains 1x1 convolution layers, batch normalization, the ReLU activation function, an average pooling layer, and a softmax layer, which is used for classification. In Experiment 2 we trained the model for 150 iterations with the Adadelta optimization method [39], where the initial learning rate (LR) =

Figure 2: Some samples of the dataset
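Since Experiment 2 relies on Adadelta, the update rule from [39] can be sketched in NumPy. This is an illustrative implementation of the published rule, not the training code; `rho` and `eps` take the commonly cited defaults:

```python
import numpy as np

def adadelta_step(grad, state, rho=0.95, eps=1e-6):
    """One Adadelta update (Zeiler, 2012). Step sizes adapt from running
    averages of squared gradients and squared updates, so no global
    learning rate has to be hand-tuned."""
    Eg2, Edx2 = state
    Eg2 = rho * Eg2 + (1 - rho) * grad ** 2            # accumulate E[g^2]
    dx = -np.sqrt(Edx2 + eps) / np.sqrt(Eg2 + eps) * grad  # scaled step
    Edx2 = rho * Edx2 + (1 - rho) * dx ** 2            # accumulate E[dx^2]
    return dx, (Eg2, Edx2)
```

On a toy quadratic f(x) = x^2, repeated calls shrink x toward the minimum without any hand-tuned learning rate, which is the property that makes the "initial learning step" above mostly a formality.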
This experiment in our studies used the SimpleHTR system developed by Harald Scheidl [7]. The proposed system makes use of an ANN in which several CNN layers extract features from the input image. The output of these layers is fed to an RNN, which propagates information through the sequence. The RNN output contains probabilities for each symbol in the sequence. To predict the final text, decoding algorithms are applied to the RNN output. CTC functions (Figure 4) are responsible for decoding the probabilities into the final text. To improve recognition accuracy, decoding can also use a language model [22].

CTC is used during training: the RNN output is a matrix containing the symbol probabilities for each time step, from which CTC computes the loss. At inference, the CTC decoding algorithm converts these symbol probabilities into the final text. Then, to improve accuracy, an algorithm is used that performs a word search in a dictionary. However, the time it takes to look up words depends on the size of the dictionary, and such decoding cannot handle arbitrary character strings, including numbers.
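The dictionary-lookup step can be approximated with Python's standard difflib; the lexicon below is hypothetical (a few city names, as the paper restricts recognition to address words), and the similarity cutoff is an arbitrary choice for the sketch:

```python
import difflib

# Hypothetical lexicon limited to words that can occur in addresses.
LEXICON = ["Алматы", "Астана", "Москва", "Минск", "Киев"]

def correct(decoded, lexicon=LEXICON, cutoff=0.5):
    """Snap a raw CTC-decoded string to the closest dictionary word;
    return it unchanged when nothing is similar enough (e.g. digits),
    mirroring the limitation described above."""
    matches = difflib.get_close_matches(decoded, lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else decoded
```

`get_close_matches` ranks candidates by `SequenceMatcher` similarity, so its cost grows with the lexicon size, which is exactly the dictionary-size dependence the text points out.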
Figure 3: The MobileNet architecture
CNN: the extracted feature sequence contains 256 features per time step.

RNN: the relevant information is propagated by the RNN through this sequence. LSTM is one of the popular RNN implementations: it carries information over long distances and offers more efficient training than a typical RNN. The RNN output sequence is mapped to a 32x80 matrix.

CTC: during neural network training, CTC receives the RNN output matrix and the ground-truth text and computes the loss value. At inference, CTC receives only the matrix and decodes it into the final text. Both the ground-truth text and the recognized text can be at most 32 characters long.

Figure 4: The SimpleHTR model, where green icons are operations and pink icons are data streams
Input: a gray-value image of size 128x32. Images in the dataset usually do not have exactly this size, so they are resized (without distortion) until they are either 128 pixels wide or 32 pixels high. The image is then copied into a white target image of size 128x32. Finally, the gray values are standardized, which simplifies training of the neural network.
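The preprocessing steps above can be sketched as follows. This is a pure-Python stand-in for illustration: the helper names (`fit_size`, `standardize`) and the example image sizes are invented for the sketch, not taken from the SimpleHTR code.

```python
# Preprocessing sketch for the 128x32 gray input described above:
# scale to fit the target without distortion, pad to 128x32 with white,
# then standardize the gray values.

def fit_size(w, h, target_w=128, target_h=32):
    """Largest (w, h) that fits the target while keeping aspect ratio."""
    scale = min(target_w / w, target_h / h)
    return max(1, round(w * scale)), max(1, round(h * scale))

def standardize(pixels):
    """Zero-mean, unit-variance gray values (simplifies NN training)."""
    n = len(pixels)
    mean = sum(pixels) / n
    var = sum((p - mean) ** 2 for p in pixels) / n
    std = var ** 0.5 or 1.0  # avoid division by zero on flat images
    return [(p - mean) / std for p in pixels]

print(fit_size(400, 50))  # prints: (128, 16) -- width-limited, padded below
print(fit_size(60, 60))   # prints: (32, 32)  -- height-limited, padded right
```

A real pipeline would perform the resize and the paste onto the white 128x32 canvas with an image library; only the size arithmetic and normalization are shown here.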
The Bluche model [2] proposes a new neural network architecture for modern handwriting recognition as an alternative to multidimensional LSTM RNNs. The model is based on a deep convolutional encoder of the input image and a bidirectional LSTM decoder predicting sequences of characters. Its goal is to produce generic, multilingual, and reusable features with the convolutional encoder, leveraging more data for transfer learning.

The encoder in the Bluche model contains a 3x3 Conv layer with 8 features, a 2x4 Conv layer with 16 features, a 3x3 gated Conv layer, a 3x3 Conv layer with 32 features, a 3x3 gated Conv layer, a 2x4 Conv layer with 64 features, and a 3x3 Conv layer with 128 features. The decoder contains 2 bidirectional LSTM layers of 128 units with a 128-unit dense layer between them. Figure 5 shows the Bluche architecture.
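The gating in the encoder's gated Conv layers can be illustrated element-wise: the layer's linear response is multiplied by a learned sigmoid gate, letting the network suppress irrelevant features. This is a minimal sketch of the mechanism only, not the Bluche implementation; the toy feature and gate values are made up.

```python
import math

# Gating mechanism of a "gated Conv" layer (sketch): each feature value
# is modulated by a sigmoid of a second, learned response.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_response(feature, gate):
    """Element-wise gated output: feature * sigmoid(gate)."""
    return [f * sigmoid(g) for f, g in zip(feature, gate)]

# A strongly negative gate (sigmoid ~ 0) suppresses its feature;
# a strongly positive gate (sigmoid ~ 1) passes it through.
print(gated_response([2.0, 2.0], [-10.0, 10.0]))  # ~ [0.0, 2.0]
```

In the full layer, `feature` and `gate` would each be the output of a convolution over the same input, so the gate is learned jointly with the features.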
Modern approaches to offline HTR, such as the Puigcerver model [3], depend heavily on multidimensional LSTM networks. The Puigcerver model achieves a high recognition rate with a large number of parameters (around 9.6 million). This suggests that multidimensional dependencies, theoretically modelled by multidimensional recurrent layers, might not be essential, at least in the lower layers of the system, to achieve high recognition accuracy. The Puigcerver model has three important parts:

• Convolutional blocks: these include a 2-D Conv layer with a 3x3 kernel size and a horizontal and vertical stride of 1. The number of filters is equal to 16n at the n-th Conv layer.

• Recurrent blocks: bidirectional 1D-LSTM layers form the recurrent blocks; they process the input image column-wise, from left to right and from right to left, and the outputs of the two directions are concatenated depth-wise.

• Linear layer: the output of the recurrent 1D-LSTM blocks is fed to a linear layer to predict the output labels. Dropout (with probability 0.5) is applied before the linear layer to prevent overfitting.
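As a rough illustration of the convolutional blocks described above, the sketch below counts the weights of a stack of 3x3 Conv layers whose n-th layer has 16n filters. The number of blocks (5), the single-channel gray input, and the inclusion of biases are assumptions for the example; they show only that the convolutional part accounts for a small fraction of the ~9.6 million total parameters, most of which sit in the LSTM layers.

```python
# Parameter-count sketch for Puigcerver-style convolutional blocks:
# 3x3 kernels, 16n filters at the n-th block (assumed: 5 blocks,
# 1 input channel, one bias per filter).

def conv_params(in_ch, out_ch, k=3):
    """Weights plus biases of one k x k convolution layer."""
    return (k * k * in_ch + 1) * out_ch

blocks, in_ch, total = 5, 1, 0
for n in range(1, blocks + 1):
    out_ch = 16 * n           # 16, 32, 48, 64, 80 filters
    total += conv_params(in_ch, out_ch)
    in_ch = out_ch
print(total)  # prints: 92544
```

Under these assumptions the convolutional stack holds fewer than 0.1 million parameters, i.e. about 1% of the model's total.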
Figure 5: Bluche HTR model
In the new Puigcerver version multidimensional LSTM is used. This allows the latter model to use unconstrained two-dimensional information, potentially capturing long-term dependencies along each axis. Figure 6 shows the Puigcerver architecture.

All models were implemented using Python and the deep learning library TensorFlow [40]. TensorFlow allows transparent use of highly optimized mathematical operations on GPUs through Python; a computational graph is defined in the Python script to specify all operations necessary for the computations. The plots for the report were generated using the matplotlib library for Python, and the illustrations were created using Inkscape, a vector graphics editor similar to Adobe Illustrator. The experiments were run on a machine with two Intel Xeon E5-2680 CPUs, four NVIDIA Tesla K20X GPUs, and 100 GB of RAM. The use of a GPU reduced the training time of the models by approximately a factor of 3; however, this speed-up was not closely monitored throughout the project, so it could have varied.
Figure 6: Puigcerver HTR model
This section discusses the results of the four models on two different datasets. The deep CNN models were trained and evaluated on the first dataset, the SimpleHTR model was trained and evaluated on both datasets, and finally the Bluche and Puigcerver models were trained and evaluated on the second dataset.
For the current experiments, only ten classes from the first dataset were selected: Kazakhstan, Belarus, Kyrgyzstan, Tajikistan, Uzbekistan, Nur-Sultan, Almaty, Aktau, Aktobe, and Atyrau.
Experiment 1
The simple CNN model result is presented first. There were 150 iterations in the learning process, and the effects of the first 10 iterations are shown in Figure 7. In Figure 7, the model goes into an over-trained state after the first iteration: the results on the training data improve rapidly while the results on the test data, on the contrary, degrade. This means that the model overfits on this dataset. For this experiment, a minibatch size of 32 was used and the learning rate was lr =

Experiment 2 and 3
Figure 8 shows the results after the 10th iteration of learning; as we can see, the model only starts learning correctly after the 3rd iteration.
Figure 7: First experiment results: (a) Model accuracy, (b) Model error
The Adadelta optimization method was used, and in tracking the learning behavior we found that from the initial learning rate value onward the predictions become more correct. However, in the 6th iteration the model shows a large resonance relative to the entire graph. The reasons for this behavior are still being studied; one probable reason could be the small amount of data.

The results of the experiments show that the MobileNet model is better than the simple CNN network because MobileNet is pre-trained on a large dataset for feature extraction. This suggests that the CNN model did not have enough data to train its feature extractor. To test this hypothesis, affine transformation methods such as image stretching and other distortions were designed to further augment the data, and the experiments were repeated. After 10 epochs, training was stopped early because the models overfitted due to the small dataset used in these experiments.

Figure 8: Second experiment results using MobileNet (lr = )

The SimpleHTR model was trained, validated, and tested on two different datasets. In order to launch the model learning process on our own data, the following steps were taken:

• A word dictionary was created from the annotation files.
• A DataLoader was written for reading and pre-processing the image dataset and reading the annotation file belonging to the images.
• The dataset was divided into two subsets: 90% for training and 10% for validation of the trained model.

To improve the accuracy and decrease the error rate we suggest the following steps: firstly, increase the dataset by using data augmentation; secondly, add more CNN layers and increase the input size; thirdly, remove the noise in the images and handle the cursive writing style; fourthly, replace the LSTM by a bidirectional GRU; and finally, use token passing or word beam search decoding to constrain the output to dictionary words.
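The affine-transformation augmentation mentioned above can be sketched as random stretch/shear matrices applied to pixel coordinates. The parameter ranges and helper names here are assumptions for illustration, not the exact distortions used in the experiments.

```python
import random

# Data-augmentation sketch: sample a random affine matrix combining a
# horizontal stretch and a horizontal shear, then map pixel coordinates
# through it. Ranges below are illustrative assumptions.

def random_affine(stretch=0.2, shear=0.1, rng=random):
    sx = 1.0 + rng.uniform(-stretch, stretch)  # horizontal stretch factor
    sh = rng.uniform(-shear, shear)            # horizontal shear factor
    # 2x2 affine matrix [[sx, sh], [0, 1]] applied to (x, y) coordinates
    return ((sx, sh), (0.0, 1.0))

def apply_affine(m, x, y):
    (a, b), (c, d) = m
    return a * x + b * y, c * x + d * y

random.seed(0)
m = random_affine()
print(apply_affine(m, 10.0, 4.0))
```

Each training image would be warped with a freshly sampled matrix (resampling pixels accordingly), multiplying the effective dataset size while keeping labels unchanged.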
First Dataset
For learning on the collected data, the SimpleHTR model was trained on 42 names of countries and cities written in different handwriting styles; this data was augmented tenfold. Two tests were performed: with cursive word alignment and without alignment. After training, the values obtained on the validation data are presented in Table 1. The table shows SimpleHTR recognition accuracy for different decoding methods (best path, beam search, word beam search). Best path decoding uses only the NN output and calculates an estimate by taking the most probable character at each position. Beam search also uses only the NN output, but exploits more information from it and hence provides a more detailed result. Beam search with a character-level LM additionally scores character sequences, which further improves the outcome.

Figure 9: Third experiment results using MobileNet (lr = )

An image with a phrase was submitted to the input, and in Figure 11 we can see the recognized word "South Kazakhstan" ("Yuzhno-Kazakhstanskaya" in Russian) with a probability of 86 percent.

Figure 10: Example of an image with the phrase "South Kazakhstan" in Russian
Figure 11: Result of recognition
Second Dataset (HKR Dataset)
The SimpleHTR model showed a 20.13% character error rate (CER) on the first test set and a 1.55% CER on the second. We also evaluated the SimpleHTR model by per-character accuracy rate (Figure 12). The word error rate (WER) was 58.97% for TEST1 and 11.09% for TEST2. The result for TEST2 shows that the model can recognize words that exist in the training dataset but are written in completely different handwriting styles. The TEST1 result shows that recognition is poor for words that do not occur in the training and validation datasets.

Figure 12: Character accuracy rate for the SimpleHTR model
After the training, validation, and test datasets were prepared and the models were trained, comparative evaluation experiments were conducted. The Bluche and Puigcerver models were trained on the second dataset (HKR dataset). We evaluated these models with the standard performance measures used for all results presented: CER and WER. For all models, a minibatch size of 32 was used with early stopping after 20 epochs without improvement in the validation loss value and lr =

Table 2: CER and WER for Bluche and Puigcerver
Algorithm     TEST1 CER   TEST1 WER   TEST2 CER   TEST2 WER
Bluche        16.15%      59.64%      10.15%      37.49%
Puigcerver    73.43%      96.89%      54.75%      82.91%

Figures 13 and 14 present the character accuracy rate, which shows how well each model detects each character.
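The CER and WER figures reported here are both Levenshtein edit distance normalized by the reference length, computed over characters (CER) or over whitespace-split words (WER). A minimal sketch follows; the city-name strings are only toy inputs for the example.

```python
# CER/WER sketch: Levenshtein edit distance (substitutions, insertions,
# deletions) normalized by the length of the reference.

def edit_distance(ref, hyp):
    """Single-row dynamic-programming Levenshtein distance."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (r != h)) # substitution
    return d[len(hyp)]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(ref, hyp) / len(ref)

def wer(ref, hyp):
    """Word error rate: edits per reference word."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())

print(cer("Almaty", "Almati"))       # one substitution in six characters
print(wer("Almaty city", "Almaty"))  # one deleted word out of two
```

The same `edit_distance` works on both strings and word lists, which is why CER and WER share the implementation.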
Figure 13: Bluche HTR model performance on the TEST1 and TEST2 datasets
Figure 14: Puigcerver HTR model performance on the TEST1 and TEST2 datasets

Two interrelated problems are considered in this paper: classification of handwritten names of cities and HTR using various deep learning models. The first model is used for classification of handwritten city names based on a deep CNN, and the other three models (SimpleHTR, Bluche, Puigcerver) are used for HTR and contain CNN layers, RNN layers, and the CTC decoding algorithm.

Experiments on classification of handwritten names of cities were conducted using various machine learning methods, and the following recognition accuracies were obtained on the test data: 1) 55.3% for the CNN; 2) for the SimpleHTR recurrent CNN, 57.1% using the best path decoding algorithm, 58.3% for beam search, and 75.1% for word beam search. The best result was shown by word beam search, which uses a dictionary for the final correction of the recognized text.

Experiments on HTR were also conducted with various deep learning methods, and the following recognition results were obtained for the two test datasets: 1) the Bluche model achieved 16.15% CER and 59.64% WER on the first test dataset and 10.15% CER and 37.49% WER on the second; 2) the Puigcerver model showed 73.43% CER and 96.89% WER on the first test dataset and 54.75% CER and 82.91% WER on the second; 3) the SimpleHTR model showed 20.13% CER and 58.97% WER on the first test dataset and 1.55% CER and 11.09% WER on the second.
Acknowledgment
The work was carried out within the framework of grant project No. AR05135175 with the support of the Ministry of Education and Science of the Republic of Kazakhstan.
References

[1] N. Daniyar, B. Kairat, K. Maksat, A. Anel, "Classification of handwritten names of cities using various deep learning models," in 2019 15th International Conference on Electronics, Computer and Computation (ICECCO), 1-4, IEEE, 2019, doi:10.1109/ICECCO48375.2019.9043266.
[2] T. Bluche, R. Messina, "Gated convolutional recurrent neural networks for multilingual handwriting recognition," in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, 646-651, IEEE, 2017, doi:10.1109/ICDAR.2017.111.
[3] J. Puigcerver, "Are multidimensional recurrent layers really necessary for handwritten text recognition?" in 2017 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), volume 1, 67-72, IEEE, 2017, doi:10.1109/ICDAR.2017.20.
[4] D. Nurseitov, K. Bostanbekov, D. Kurmankhojayev, A. Alimova, A. Abdallah, "HKR For Handwritten Kazakh & Russian Database," arXiv preprint arXiv:2007.03579, 2020.
[5] V. Frinken, H. Bunke, Continuous Handwritten Script Recognition, 391-425, Springer London, London, 2014.
[6] doi:10.1109/ICEngTechnol.2017.8308186.
[7] H. Scheidl, Handwritten text recognition in historical documents, Master's thesis (Diplom-Ingenieur in Visual Computing), Technische Universität Wien, Vienna, 2018.
[8] A. El-Yacoubi, M. Gilloux, R. Sabourin, C. Y. Suen, "An HMM-based approach for off-line unconstrained handwritten word modeling and recognition," IEEE Transactions on Pattern Analysis and Machine Intelligence, (8), 752-760, 1999.
[9] "... off-line handwriting recognition: a comprehensive survey," IEEE Transactions on Pattern Analysis and Machine Intelligence, (1), 63-84, 2000.
[10] doi:10.1109/ICDAR.2003.1227707.
[11] H. Bunke, S. Bengio, A. Vinciarelli, "Offline recognition of unconstrained handwritten texts using HMMs and statistical language models," IEEE Transactions on Pattern Analysis and Machine Intelligence, (6), 709-720, 2004, doi:10.1109/TPAMI.2004.14.
[12] J. Gorbe-Moya, S. E. Boquera, F. Zamora-Martínez, M. J. C. Bleda, "Handwritten Text Normalization by using Local Extrema Classification," PRIS, 164-172, 2008.
[13] Y. Bengio, "A connectionist approach to speech recognition," in Advances in Pattern Recognition Systems Using Neural Network Technologies, 3-23, World Scientific, 1993.
[14] "... Neural Network based Speech Recognition in Loquendo ASR," [Online], 2010.
[15] A. Graves, M. Liwicki, H. Bunke, J. Schmidhuber, S. Fernández, "Unconstrained on-line handwriting recognition with recurrent neural networks," in Advances in Neural Information Processing Systems, 577-584, 2008.
[16] J. Chung, Ç. Gülçehre, K. Cho, Y. Bengio, "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling," CoRR, 2014.
[17] S. Hochreiter, J. Schmidhuber, "Long Short-Term Memory," Neural Computation, (8), 1735-1780, 1997, doi:10.1162/neco.1997.9.8.1735.
[18] A. Y. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, A. Y. Ng, "Deep Speech: Scaling up end-to-end speech recognition," CoRR, 2014.
[19] I. Sutskever, O. Vinyals, Q. V. Le, "Sequence to Sequence Learning with Neural Networks," CoRR, 2014.
[20] N. Srivastava, E. Mansimov, R. Salakhudinov, "Unsupervised learning of video representations using LSTMs," in International Conference on Machine Learning, 843-852, 2015.
[21]
[22]
[23] doi:10.1016/j.patcog.2020.107555.
[24] A. Abdallah, M. Hamada, D. Nurseitov, "Attention-based Fully Gated CNN-BGRU for Russian Handwritten Text," arXiv preprint arXiv:2008.05373, 2020.
[25] L. Huang, W. Wang, J. Chen, X.-Y. Wei, "Attention on attention for image captioning," in Proceedings of the IEEE International Conference on Computer Vision, 4634-4643, 2019, doi:10.1109/ICCV.2019.00473.
[26] V. Tassopoulou, G. Retsinas, P. Maragos, "Enhancing Handwritten Text Recognition with N-gram sequence decomposition and Multitask Learning," 2020, doi:10.13140/RG.2.2.29336.83200.
[27] H. T. Weldegebriel, H. Liu, A. U. Haq, E. Bugingo, D. Zhang, "A New Hybrid Convolutional Neural Network and eXtreme Gradient Boosting Classifier for Recognizing Handwritten Ethiopian Characters," IEEE Access, 17804-17818, 2019.
[28] F. P. Such, D. Peri, F. Brockler, H. Paul, R. Ptucha, "Fully convolutional networks for handwriting recognition," in 2018 16th International Conference on Frontiers in Handwriting Recognition (ICFHR), 86-91, IEEE, 2018, doi:10.1109/ICFHR-2018.2018.00024.
[29] U.-V. Marti, H. Bunke, "The IAM-database: an English sentence database for offline handwriting recognition," International Journal on Document Analysis and Recognition, (1), 39-46, 2002, doi:10.1007/s100320200071.
[30] E. Grosicki, M. Carré, E. Augustin, F. Prêteux, "La campagne d'évaluation RIMES pour la reconnaissance de courriers manuscrits," in Colloque International Francophone sur l'Ecrit et le Document, 2006.
[31] R. Ptucha, F. P. Such, S. Pillai, F. Brockler, V. Singh, P. Hutkowski, "Intelligent character recognition using fully convolutional neural networks," Pattern Recognition, 604-613, 2019, doi:10.1016/j.patcog.2018.12.017.
[32] K. S. Younis, "Arabic handwritten character recognition based on deep convolutional neural networks," Jordanian Journal of Computers and Information Technology (JJCIT), (3), 186-200, 2017, doi:10.5455/jjcit.71-1498142206.
[33] M. Torki, M. E. Hussein, A. Elsallamy, M. Fayyaz, S. Yaser, "Window-Based Descriptors for Arabic Handwritten Alphabet Recognition: A Comparative Study on a Novel Dataset," CoRR, 2014.
[34] A. El-Sawy, M. Loey, H. El-Bakry, "Arabic handwritten characters recognition using convolutional neural network," WSEAS Transactions on Computer Research, 11-19, 2017.
[35] Y. Kim, "Convolutional Neural Networks for Sentence Classification," CoRR, 2014.
[36] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, H. Adam, "MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications," CoRR, 2017.
[37] R. Memisevic, C. Zach, M. Pollefeys, G. E. Hinton, "Gated Softmax Classification."
[38] Y. LeCun, C. Cortes, "MNIST handwritten digit database," 2010.
[39] M. D. Zeiler, "ADADELTA: An Adaptive Learning Rate Method," CoRR, 2012.
[40] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, X. Zheng, "TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems," CoRR, 2016.
[41] A. Vinciarelli, J. Luettin, "A new normalization technique for cursive handwritten words," Pattern Recognition Letters, (9), 1043-1050, 2001.