Writer-Aware CNN for Parsimonious HMM-Based Offline Handwritten Chinese Text Recognition
Zi-Rui Wang, Jun Du, Jia-Ming Wang
Abstract
Recently, the hybrid convolutional neural network hidden Markov model (CNN-HMM) has been introduced for offline handwritten Chinese text recognition (HCTR) and has achieved state-of-the-art performance. However, modeling each of the large vocabulary of Chinese characters with a uniform and fixed number of hidden states requires high memory and computational costs and makes the tens of thousands of HMM state classes confusing. Another key issue of CNN-HMM for HCTR is the diversified writing style, which leads to model strain and a significant performance decline for specific writers. To address these issues, we propose a writer-aware CNN based on parsimonious HMM (WCNN-PHMM). First, PHMM is designed using a data-driven state-tying algorithm to greatly reduce the total number of HMM states, which not only yields a compact CNN by state sharing of the same or similar radicals among different Chinese characters but also improves the recognition accuracy due to the more accurate modeling of tied states and the lower confusion among them. Second, WCNN integrates each convolutional layer with one adaptive layer fed by a writer-dependent vector, namely, the writer code, to extract the irrelevant variability in writer information and improve recognition performance. The parameters of the writer-adaptive layers are jointly optimized with the other network parameters in the training stage, while a multiple-pass decoding strategy is adopted to learn the writer code and generate recognition results. Validated on the ICDAR 2013 competition set of the CASIA-HWDB database, the more compact WCNN-PHMM with a 7360-class vocabulary can achieve a relative character error rate (CER) reduction of 16.6% over the conventional CNN-HMM without considering language modeling. By adopting a powerful hybrid language model (N-gram language model and recurrent neural network language model), the CER of WCNN-PHMM is reduced to 3.17%. Moreover, the state-tying results of PHMM explicitly show the information sharing among similar characters and the confusion reduction of tied state classes. Finally, we visualize the learned writer codes and demonstrate their strong relationship with the writing styles of different writers. To the best of our knowledge, WCNN-PHMM yields the best results on the ICDAR 2013 competition set, demonstrating its power when enlarging the size of the character vocabulary.
Zi-Rui Wang, Jun Du, and Jia-Ming Wang are with the National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, Anhui, P. R. China; email: [email protected], [email protected], [email protected]
Index Terms
Offline handwritten Chinese text recognition, writer-aware CNN, parsimonious HMM, state tying, adaptation, hybrid language model.
I. INTRODUCTION
The robust recognition of handwritten text lines in an unconstrained writing style plays an important role in many applications, such as machine scoring, express sorting and document recognition. Specifically, handwritten Chinese text recognition (HCTR) has been intensively studied as a popular research topic for many years [1], [2]. However, it remains a challenging problem due to the large vocabulary and the diversity of writing styles. Moreover, offline HCTR, which is the focus of this study, is more difficult than online HCTR [3], [4], as the ink trajectory information is missing.

In general, the research efforts for offline HCTR can be divided into two categories: oversegmentation-based approaches and segmentation-free approaches. The former approaches [5], [6], [7], [8] often build several modules, first including character oversegmentation, character classification, and modeling of the linguistic and geometric contexts, and then incorporating them to calculate the score for the path search. The recent work in [8], with the neural network language model, adopted three different CNN models to replace the conventional character classifier, segmentation and geometric models to achieve the best performance of oversegmentation-based methods on the ICDAR 2013 competition dataset [9]. By contrast, segmentation-free approaches do not need to explicitly segment text lines. One early approach to text line modeling [10] used the Gaussian mixture model hidden Markov model (GMM-HMM). Another recent approach [11] utilized the multidimensional long short-term memory recurrent neural network (MDLSTM-RNN), which was inspired by well-verified LSTM-RNN approaches [12] for the recognition of handwritten Western languages with a small set of character classes. The MDLSTM-RNN approach is quite flexible due to the connectionist temporal classification (CTC) technique [13], which avoids explicit segmentation. In [14], the authors employed a CNN and an LSTM neural network under the HMM framework to obtain a significant improvement over the LSTM-HMM model. In [15], the authors used a separable MDLSTM-RNN (SMDLSTM-RNN) with CTC loss, instead of the
traditional LSTM-CTC method. More recently, the authors in [16] proposed a novel aggregation cross-entropy loss for sequence recognition, which was shown to exhibit competitive performance for offline HCTR. In [17], we verified that combining a hybrid deep CNN-HMM (DCNN-HMM) with a powerful language model could achieve the best reported results of the segmentation-free approaches on the ICDAR 2013 competition dataset.

However, the impressive results reported in recently proposed oversegmentation-based and segmentation-free approaches [8], [16], [17] highly depend on the use of strong language models (LMs) built with a large number of text corpora, which partially masks the weakness of the character models and makes the comparison of character models unfair. Actually, the large vocabulary of Chinese characters and the diversified writing styles of text lines still limit the performance of deep learning methods based on character modeling. For example, in our DCNN-HMM work [17], the number of output nodes in the DCNN, i.e., the total state class number, was 19900, obtained by modeling 3980 characters with a 5-state HMM for each. Obviously, a further increase of the vocabulary size could potentially lead to a data sparsity problem and high computation and memory costs, which makes the training of CNNs difficult. Moreover, similar radicals among different Chinese characters should be shared by the same states to reduce ambiguity in the decoding stage. Another key issue is that free-style writing usually causes a mismatch between the distributions of the training and testing datasets, which significantly degrades the recognition accuracy for certain writers.

To address these two main problems, we propose a novel writer-aware CNN based on parsimonious HMM (WCNN-PHMM). First, PHMM is designed using a data-driven state-tying algorithm to freely compress the total number of HMM states. A binary decision tree with a data-driven question set is adopted to represent one fixed-position HMM state of all character classes. In this way, it can not only yield a compact CNN by state sharing of the same or similar radicals among different Chinese characters but also improve the recognition accuracy due to the more accurate modeling of tied states and the lower confusion among them. Second, WCNN embeds one linear adaptive layer fed by a writer-dependent vector (namely, the writer code) into each convolutional layer, which extracts the irrelevant variability of writer information to improve recognition performance. In the training stage, all writer codes and the parameters of the adaptation layers are initialized randomly and then jointly optimized with the other network parameters using the writer-specific data. In the recognition stage, with the initial recognition results from the first-pass decoding with the writer-independent CNN-PHMM model, an unsupervised
adaptation is performed to generate the writer code for the subsequent decoding of WCNN-PHMM. Furthermore, in order to overcome the data sparseness problem of the traditional N-gram LM (NLM) [18], similar to [8], we introduce a recurrent neural network LM (RNNLM) [19] to form a hybrid LM (HLM).

The main contributions of this study can be summarized as follows:

• The new structure WCNN-PHMM is presented to tackle two key issues for offline HCTR: the large vocabulary and the diversity of writing styles.

• A general adaptive training approach is proposed that can be integrated with any type of CNN to create writer-aware models. To the best of our knowledge, this paper is the first study of writer adaptation for offline HCTR.

• The fast and compact design of PHMM via state tying improves the recognition accuracy. More importantly, compared with other segmentation-free approaches, PHMM can yield even better recognition accuracy when enlarging the size of the character vocabulary by fully leveraging more training data and class information sharing.

• The effectiveness of WCNN-PHMM is visually illustrated by analyses of the state-tying results and the learned writer codes.

• The proposed WCNN-PHMM demonstrates the best reported character error rate (CER) (8.42%) for a 7360-class vocabulary on the ICDAR 2013 competition set without using language models. By adopting a powerful HLM, the CER of WCNN-PHMM can be further reduced to 3.17%.

The remainder of this paper is organized as follows. Section II introduces related work. Section III gives an overview of the proposed framework. Section IV elaborates on the details of WCNN-PHMM. Section V reports the experimental results and analyses. Finally, Section VI concludes.
II. RELATED WORK
In this section, we describe related work, including the basic principles of mainstream approaches for offline HCTR, model compression, and writer adaptation.
A. Basic principles for offline HCTR
Offline HCTR can be formulated as a Bayesian decision problem:

$$\hat{C} = \arg\max_{C} p(C|X) = \arg\max_{C} p(X|C)\,p(C) \quad (1)$$

where X is the feature sequence of a given text line image and C = {C_1, C_2, ..., C_n} is the underlying n-character sequence. In oversegmentation-based approaches [6], the posterior probability p(C|X) can be computed by searching for the optimal segmentation path and the corresponding posterior probability of the character sequence by combining the character classifier, the segmentation model and the geometric/language model. Regarding segmentation-free approaches, the CTC-based and HMM-based approaches are two mainstream frameworks. In the CTC-based approach [15], a special blank character class and a defined many-to-one mapping function are introduced to directly compute p(C|X) with the forward-backward algorithm [13]. For the HMM-based approach [17], p(C|X) can be reformulated via the conditional probability p(X|C) and the prior probability p(C). More details will be provided in Section III.

B. Model compression
State tying can be regarded as belonging to a more general field, i.e., model compression [20]. With the emergence of deep learning [21], many studies have focused on building compact and fast CNNs for practicability. Regarding the reduction in the number of parameters and the computation complexity of convolutional layers, research efforts can be divided roughly into low-rank decomposition [22], pruning [23], quantization [24] and compact network design [25]. Aside from these methods, a key issue with CNN-HMM-based offline HCTR [17] is the large vocabulary problem, which leads to tens of thousands of output nodes (corresponding to HMM states) in the CNN architecture. This heavy overhead in the output layer of the CNN not only requires high memory and computation costs but also yields more confusion among state classes and CNN training difficulties. To handle this problem, inspired by the early work in speech recognition [26], [27], we introduce state tying via decision trees to freely compress the output layer of the CNN model. Considering the particularity of HCTR and the difficulty of defining an effective question set for the Chinese language, in our previous work [31], we successfully invented a data-driven state-tying approach for a huge set of HMMs representing Chinese characters and achieved promising recognition performance.
Fig. 1. Handwritten examples of different writers with the same transcript.

It should be noted that, if we simply reduce the state number for each character, the recognition accuracy will decline dramatically due to the lack of resolution for text line modeling [17].
C. Writer adaptation
Writer adaptation is similar to other topics, such as transfer learning [33] and speaker adaptation [32], where the distribution of the test data differs from that of the training data [36]. In offline HCTR, as shown in Fig. 1, the writing styles can be quite different, which makes the recognition accuracy for unseen writers unpredictable. Compared with handwritten Chinese character recognition (HCCR), aside from the morphological variations within characters, the writing orientation and ligatures make HCTR much more challenging. In general, there are two mainstream methodologies to achieve writer adaptation: one is to adopt writer-specific data to guide the writer-independent classifier toward the new distribution of the particular writer; the other is to extract writer-independent features for the classifier. More specifically, this process might be supervised, semisupervised or unsupervised, depending on whether the adaptation writer-specific data are labeled. Usually, unsupervised adaptation needs to reuse the test data. Besides, it depends on adequate writer data. In some applications, such as the machine scoring of essays [35], the recognition rate is the most important factor to be considered, and there are enough writer-specific data available to adopt adaptation techniques for improving the recognition rate. Moreover, the research on writer adaptation can be divided into feature-space and model-space approaches based on the part on which the adaptation parameters operate [37]. To the best of our knowledge, for Chinese handwriting recognition, almost all efforts on writer adaptation focus on the HCCR task. One such method uses a linear feature transformation to adapt the writing styles via discriminative linear regression (DLR) [38], [39], which has been verified to be effective when incorporated with a prototype-based classifier and an NN-based classifier. Another representative
Fig. 2. Illustration of a text line modeled by cascading character HMMs.

method introduces style transfer mapping (STM) [36] for learning a linear transformation to project writer-specific data onto a style-free space. As a flexible adaptation method, STM can work on the outputs of both fully connected layers [40], [41] and convolutional layers [56]. A recent study [42] uses adversarial learning to transform writer-dependent features into writer-independent features under the guidance of printed data. However, there are very few studies on writer adaptation for the more challenging HCTR problem. Inspired by [43], [44], in [45] we proposed an unsupervised writer adaptation strategy for DNN-HMM-based HCTR.

This study is comprehensively extended from our previous conference papers [31], [45] with the following new contributions: 1) the proposed PHMM is introduced with more technical details and verified for a more promising CNN-HMM, rather than the DNN-HMM in [31]; 2) we present a novel unsupervised adaptation strategy with writer codes and adaptation layers to guide the convolutional layers in CNN-HMM, rather than using the fully connected layers in DNN-HMM [45]; 3) WCNN-PHMM combines the two techniques to yield a compact and high-performance model; 4) instead of the NLM, the HLM is used to further improve performance; and 5) all experiments are redesigned to verify the effectiveness of WCNN-PHMM, and detailed analyses are described to give the readers a deep understanding of our approach.
III. SYSTEM OVERVIEW
Our system follows the basic HMM framework [17] in which the handwritten text line is modeled by a series of cascading HMMs, each representing one character, as illustrated in Fig. 2. The mathematical principle of the HMM can be represented by rewriting the formula p(X|C) p(C) in
Eq. (1):

$$p(X|C)\,p(C) = \sum_{S}\left[\pi(s_0)\prod_{t=1}^{T} a_{s_{t-1}s_t}\prod_{t=0}^{T} p(x_t|s_t)\right]\prod_{i=1}^{n} p(C_i|C_{i-1}, C_{i-2}, ..., C_1) \quad (2)$$

$$= \sum_{S}\left[\pi(s_0)\prod_{t=1}^{T} a_{s_{t-1}s_t}\prod_{t=0}^{T} \frac{p(s_t|x_t)\,p(x_t)}{p(s_t)}\right]\prod_{i=1}^{n} p(C_i|C_{i-1}, C_{i-2}, ..., C_1) \quad (3)$$

where X = {x_0, x_1, x_2, ..., x_T} is a (T+1)-frame observation sequence of one text line image. p(X|C), which can be called the character model, is the conditional probability of X given C, corresponding to a sequence of HMMs with the hidden state sequence S = {s_0, s_1, s_2, ..., s_T}. Each HMM with a set of states represents one character class. With HMMs, p(X|C) can be decomposed at the frame level: π(s_0) is the initial state probability, a_{s_{t-1}s_t} is the state transition probability from frame t−1 to t, p(x_t|s_t) is the output probability of x_t given s_t, p(s_t) is the prior probability of state s_t estimated from the training set, p(s_t|x_t) is the posterior probability of state s_t given x_t, and p(x_t) is independent of the character sequence. As mentioned in [17], a GMM can be used to calculate p(x_t|s_t) in Eq. (2) for the GMM-HMM system, while a DNN/CNN can be adopted to compute p(s_t|x_t) in Eq. (3) for the DNN-HMM/CNN-HMM system.

Meanwhile, p(C), namely the language model, is the probability of an n-character sequence C = {C_1, C_2, ..., C_n} and can be decomposed as the product over i of p(C_i|C_{i-1}, C_{i-2}, ..., C_1). However, the number of these values, V^i even for a moderate vocabulary size V, is too large to be accurately estimated, so the so-called N-gram LM cannot realistically depend on all i−1 conditioning histories C_1, C_2, ..., C_{i-1} to compute the term p(C_i|C_{i-1}, C_{i-2}, ..., C_1); instead, only the previous N−1 characters are used as the conditioning history. Obviously, a higher order N leads to a more powerful language model, which can significantly improve the recognition accuracy. In this work, the SRILM toolkit [46] is employed to generate a 5-gram LM. To further enhance the ability of the LM, we linearly interpolate a standard NLM with an RNNLM to form an HLM.

In the training stage, we first build the conventional GMM-HMM system as in [17]. Then, the state-tying GMM-HMM system (GMM-PHMM) can be generated using the proposed decision-tree algorithm to greatly reduce the total number of states, i.e., the dimension of the CNN output layer. Meanwhile, state-level forced alignment is conducted to obtain frame-level labels for the subsequent CNN cross-entropy training. After the conventional CNN is trained, a series of adaptation layers with the writer codes as the input are appended in parallel to form the WCNN.
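To make the use of Eq. (3) concrete, the following sketch (an illustration under assumed array shapes, not the authors' released code) converts frame-level CNN state posteriors into the scaled log-likelihoods that replace p(x_t|s_t) during decoding:

```python
import numpy as np

def scaled_log_likelihoods(posteriors, state_priors, eps=1e-10):
    """Convert CNN state posteriors into HMM emission scores, following Eq. (3).

    posteriors  : (T, S) array of p(s_t | x_t) for every frame and state.
    state_priors: (S,)  array of p(s_t), estimated by counting state labels
                  in the forced-aligned training data.
    Returns a (T, S) array of log[p(s_t | x_t) / p(s_t)], used in place of
    log p(x_t | s_t); p(x_t) is constant over character sequences and does
    not affect the arg max in Eq. (1).
    """
    return np.log(posteriors + eps) - np.log(state_priors + eps)

# toy usage: 4 frames, 3 (tied) states
post = np.array([[0.7, 0.2, 0.1],
                 [0.1, 0.8, 0.1],
                 [0.2, 0.2, 0.6],
                 [0.3, 0.3, 0.4]])
priors = np.array([0.5, 0.3, 0.2])
print(scaled_log_likelihoods(post, priors))
```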
Fig. 3. Comparison between the conventional CNN-HMM and the proposed WCNN-PHMM.
With the writer-specific training data, the writer codes and the parameters of the adaptation layers of WCNN are jointly optimized.

In the testing stage, with the initial recognition results from the first-pass decoding using CNN-PHMM, the codes of unknown writers are learned from random initialization via WCNN for the second-pass decoding. This process can be iteratively conducted for multipass decoding to refine the recognition results and the writer codes.

IV. WCNN-PHMM

Fig. 3 illustrates the two main innovations of our proposed WCNN-PHMM architecture over the conventional CNN-HMM in [17], namely, the compact design of the output layer and the writer-aware convolutional layers. In the following subsections, we elaborate on the three basic components of WCNN-PHMM: the convolutional neural network, state tying for PHMM, and writer code-based adaptive training for WCNN. To help readers understand clearly, Table I first describes acronyms that are frequently used in this paper.
TABLE I
ACRONYM DESCRIPTION

Acronym   Description
CNN       Convolutional neural network
WCNN      Writer-aware convolutional neural network
TCNN      Tied-state convolutional neural network
HMM       Hidden Markov model
PHMM      Parsimonious hidden Markov model
CER       Character error rate
Fig. 4. Illustration of the tied-state design for the CNN output layer.

For example, according to Table I, the system WCNN-PHMM means that characters are modeled by the PHMM, where the WCNN is used to compute the posterior probabilities of the tied states.
A. Convolutional neural network
As shown in Fig. 3, a CNN [47] successively consists of stacked convolutional layers (Conv), optionally followed by spatial pooling, one or more fully connected layers (FC) and a softmax layer. For the convolutional and pooling layers, each layer is a three-dimensional tensor organized as a set of planes called feature maps, while the fully connected layer and the softmax layer are the same as those in a conventional DNN. Inspired by the locally sensitive, orientation-selective neurons in the visual system of cats [48], each unit in a feature map is constrained to connect to a local region in the previous layer, which is called the local receptive field. Two contiguous local receptive fields are usually s pixels (referred to as the stride) apart in a certain direction. Usually, all units in the same feature map of a convolutional layer share a set of weights, each computing a dot product between its weights and the local receptive field in the previous layer, followed by batch normalization (BN) [49] and a nonlinear activation function. Meanwhile, the units in a pooling layer perform a spatial average or max operation over their local receptive field to reduce the spatial resolution and noise interference. Accordingly, the key information for identifying the pattern is retained. We formalize the operations in a convolutional layer as:

$$O_{i,j,k} = f\Big(BN\Big(\sum_{m,n,l} I_{(i-1)\times s+m,\,(j-1)\times s+n,\,l}\, W_{m,n,k,l} + B_k\Big)\Big) \quad (4)$$

where I_{i,j,k} is the value of the input unit in feature map k at row i and column j, while O_{i,j,k} corresponds to the output unit; W_{m,n,k,l} is the connection weight between a unit in feature map k of the output and a unit in channel l of the input, with an offset of m rows and n columns between the output unit and the input unit. B_k is the k-th value of the bias vector B for all units in feature map k. BN is used to handle the change of the distribution in each layer by simply normalizing the layer inputs [49], which can yield an obvious improvement in the HCTR task [17]. f is a nonlinear function, i.e., the ReLU [50] used in this study.
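As a concrete illustration of Eq. (4), the following PyTorch sketch builds one convolutional block with batch normalization and ReLU; the 3 × 3 kernel follows Fig. 3, but the channel numbers and the input size in the toy usage are assumptions, not the exact configuration used in the experiments.

```python
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """One convolutional layer as in Eq. (4): conv -> BN -> ReLU."""

    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # 3x3 kernels as in Fig. 3; padding keeps the spatial size when stride=1
        self.conv = nn.Conv2d(in_channels, out_channels,
                              kernel_size=3, stride=stride, padding=1)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.bn(self.conv(x)))

# toy usage: a batch of 8 single-channel 40x40 frame images (assumed shapes)
x = torch.randn(8, 1, 40, 40)
block = ConvBlock(in_channels=1, out_channels=100)
print(block(x).shape)  # torch.Size([8, 100, 40, 40])
```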
B. State tying for PHMM

Fig. 4 illustrates the main motivation of our proposed algorithm to tie HMM states, namely, fully utilizing the partial similarities of characters (e.g., radicals). State tying is completed using a binary decision tree in which the question for each node of the tree is automatically generated by a data-driven algorithm. If each character is represented by a 5-state HMM, then 5 trees are built, with each representing one positioned HMM state to cluster all character classes. Suppose S is the set of HMM states in one nonleaf node of a tree and L(S) is the log-likelihood of S generating the training dataset with F frames. Then, by the attached question q, which is selected from an automatically generated question set, this node with S is split into two children nodes, namely, a left node with a subset S_l and a right node with a subset S_r, to maximize the log-likelihood increase with respect to q in the current node:

$$\Delta L = L(S_l(q)) + L(S_r(q)) - L(S) \quad (5)$$

where L(S), L(S_l(q)) and L(S_r(q)) are the log-likelihoods of the state set in the current node, its left node and its right node, respectively. Based on the assumptions that all tied states in S share a common mean μ(S) and variance Σ(S), and that tying the states does not change the frame/state alignment, a reasonable approximation of L(S) via the Gaussian output distribution N is given by:

$$L(S) = \sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)\,\ln N(o_f;\mu(S),\Sigma(S)) = -\frac{1}{2}\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)\left[D\ln(2\pi) + \ln|\Sigma(S)| + D_M^2(o_f)\right] \quad (6)$$

where D_M(o_f) is the Mahalanobis distance:

$$D_M(o_f) = \sqrt{(o_f-\mu(S))^{\top}(\Sigma(S))^{-1}(o_f-\mu(S))}. \quad (7)$$

In Eq. (6), γ_s(o_f) is the posterior probability of the D-dimensional feature vector o_f at the f-th frame being generated by state s. μ(S) and Σ(S) can be estimated as:

$$\mu(S) = \frac{\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)\,o_f}{\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)} \quad (8)$$

$$\Sigma(S) = \frac{\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)(o_f-\mu(S))(o_f-\mu(S))^{\top}}{\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)}. \quad (9)$$

Using Eq. (9), we have the following derivation for the last term in Eq. (6):

$$\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)\,D_M^2(o_f) = \sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)\,\mathrm{Tr}\big\{(o_f-\mu(S))^{\top}(\Sigma(S))^{-1}(o_f-\mu(S))\big\}$$
$$= \sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)\,\mathrm{Tr}\big\{(\Sigma(S))^{-1}(o_f-\mu(S))(o_f-\mu(S))^{\top}\big\}$$
$$= \mathrm{Tr}\Big\{(\Sigma(S))^{-1}\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f)(o_f-\mu(S))(o_f-\mu(S))^{\top}\Big\}$$
$$= \mathrm{Tr}\big\{(\Sigma(S))^{-1}\Sigma(S)\big\}\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f) = D\sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f) \quad (10)$$

where Tr{·} denotes the trace of a square matrix. If we further define the notation:

$$\gamma(S) = \sum_{f=1}^{F}\sum_{s\in S}\gamma_s(o_f) \quad (11)$$
Fig. 5. Fraction of a generated tree for the first state of a 5-state HMM.
Then, Eq. (6) can be rewritten as:

$$L(S) = -\frac{1}{2}\gamma(S)\left[\ln|\Sigma(S)| + D + D\ln(2\pi)\right] \quad (12)$$

Thus, the log-likelihood L(S) depends only on the pooled state occupancy γ(S) and the pooled state variance Σ(S). Both can be calculated from the saved parameters of state occupancy counts, means, and variances for all HMM states during the preceding Baum-Welch re-estimation. Initially, all corresponding states are placed in the root node of a tree. Then, the above algorithm is conducted in a top-down manner to build this binary tree until a fixed threshold is reached. Finally, a merge operation of leaf nodes is conducted using a minimum priority queue in a bottom-up manner by computing the log-likelihood decrease, until the target tied-state number is reached.

To generate the question set, all feature frames of characters are placed in the root node of a binary decision tree, and then a k-means (k = 2) algorithm is used to find an optimal partition, which aims to maximize the log-likelihood of the frames under the assumption of a single Gaussian distribution. This procedure is conducted in a top-down manner until each node contains only one character class. One question of a nonleaf node can be obtained from all reachable leaves of this node. All questions form our question set for the state tying. There are 5 trees in total, as each character is modeled by a 5-state HMM. In Fig. 5, a fraction of a generated tree for the first state is illustrated.
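The split criterion of Eqs. (5) and (12) can be evaluated directly from the pooled statistics, as in the following minimal sketch; it assumes the occupancy counts, means, and covariances of the two candidate child sets have already been accumulated during Baum-Welch re-estimation, and the variable names are illustrative only.

```python
import numpy as np

def log_likelihood(gamma, cov):
    """Eq. (12): L(S) = -0.5 * gamma(S) * [ln|Sigma(S)| + D + D*ln(2*pi)]."""
    d = cov.shape[0]
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * gamma * (logdet + d + d * np.log(2.0 * np.pi))

def pooled_stats(gamma_l, mu_l, cov_l, gamma_r, mu_r, cov_r):
    """Pool the single-Gaussian statistics of two child sets into their parent."""
    gamma = gamma_l + gamma_r
    mu = (gamma_l * mu_l + gamma_r * mu_r) / gamma
    # second-order moments combine linearly; the pooled covariance follows
    second = (gamma_l * (cov_l + np.outer(mu_l, mu_l))
              + gamma_r * (cov_r + np.outer(mu_r, mu_r))) / gamma
    return gamma, mu, second - np.outer(mu, mu)

def split_gain(gamma_l, mu_l, cov_l, gamma_r, mu_r, cov_r):
    """Eq. (5): log-likelihood gain of splitting a node into S_l and S_r."""
    gamma, _, cov = pooled_stats(gamma_l, mu_l, cov_l, gamma_r, mu_r, cov_r)
    return (log_likelihood(gamma_l, cov_l)
            + log_likelihood(gamma_r, cov_r)
            - log_likelihood(gamma, cov))

# toy usage with 2-dimensional features: well-separated children give a positive gain
print(split_gain(120.0, np.zeros(2), np.eye(2), 80.0, np.array([3.0, 1.0]), np.eye(2)))
```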
In Table II, we summarize the differences of state tying between HCTR and speech recognition (SR).

TABLE II
THE DIFFERENCES OF STATE TYING IN HCTR AND SR

                  HCTR                                              SR
Original Signal   Two-dimensional image                             One-dimensional speech
Object            The states of characters in the same position     The states of tri-phones with the same central phone
Motivation        Similar radicals existing among characters        Data sparseness problem of tri-phones
Categories        Tens of thousands                                 Hundreds
Question Set      Data driven                                       Data driven or artificial rules

First, the original signal in HCTR is a two-dimensional image, while the signal in SR is one-dimensional speech. Second, the motivation of state tying in HCTR is to overcome the difficulty of training and decoding in CNN-HMM caused by the many similar radicals among tens of thousands of characters, while state tying in SR is introduced for the data sparseness problem of tri-phones. Third, considering the way of modeling in HCTR, we only tie the states of characters in the same position to capture similar radicals more accurately. For SR, state tying is usually conducted on the states of tri-phones with the same central phone. Finally, for HCTR, the question set used in state tying totally depends on character-based features, while the question set in SR can be predefined artificially according to pronunciation characteristics.
C. Adaptive training for WCNN based on writer code
As shown in Fig. 3, the conventional CNN used for offline HCTR does not explicitly incorporate the writer information in either the training or testing stage. However, the writing style can play an essential role in the final CER as an irrelevant variability for recognizing the character class. Accordingly, a learnable vector (the writer code) is introduced to represent the writing style of each writer. If we consider the CNN architecture to implicitly integrate both feature extraction and the classifier, then the proposed design of WCNN in Fig. 3 can be viewed as a joint feature and model adaptive training strategy.

To guide the CNN with writer information, two key components, i.e., the writer codes and the adaptation layers, are randomly initialized and can be optimized using the back-propagation algorithm. The code of the r-th writer is a G-dimensional vector V^r directly connected with all adaptation layers. The p-th adaptation layer can be represented by a K × G matrix A^p. The writer code is fed into the adaptation layer and transformed into a new vector Q^{r,p}:
Fig. 6. Illustration of a convolutional layer with the writer code in WCNN.

$$Q^{r,p} = A^p V^r. \quad (13)$$

With the writer information Q, the corresponding p-th convolutional layer of WCNN can be reformulated as:

$$O^{r,p}_{i,j,k} = f\big(BN\big(M^p_{i,j,k} + Q^{r,p}_k\big)\big) \quad (14)$$

where

$$M^p_{i,j,k} = \sum_{l,m,n} I^p_{(i-1)\times s+m,\,(j-1)\times s+n,\,l}\, W^p_{m,n,k,l} + B^p_k. \quad (15)$$

In Eqs. (14)-(15), I^p_{i,j,k}, O^p_{i,j,k}, W^p_{m,n,k,l}, and B^p_k are the corresponding items of Eq. (4) for the p-th convolutional layer. The writer information Q^{r,p}_k, which is the k-th value of the bias vector Q^{r,p}, is newly added as a bias to build writer-aware convolutional layers. The key innovation of the WCNN architecture is illustrated in Fig. 6.

Suppose we use P adaptation layers with the parameter set A = {A^p | p = 1, ..., P}. In the training stage, a well-trained CNN-HMM or CNN-PHMM system is first used to initialize WCNN with the writer-independent parameter set Λ. Assume we have R writers in the training dataset with the corresponding writer code set V = {V^r | r = 1, ..., R}. Then, the cross-entropy criterion is minimized with respect to the writer-aware parameter set {A, V} in WCNN:

$$E(A, V) = -\sum_{t=1}^{N_B} \log p(s_t | X_t, \Lambda, A, V) \quad (16)$$

where the WCNN output p(s_t | X_t, Λ, A, V) is the posterior probability of the reference state s_t given the input image X_t within the sliding window, and N_B is the minibatch size of the stochastic gradient descent algorithm. In our implementation, we process the text lines one by one; thus, N_B equals the number of frames of each text line. Please note that, for each frame X_t, the input parallel writer code vector is selected from V according to the writer information. With the random initialization, we jointly update {A, V} using backpropagation and SGD:

$$A^p \leftarrow A^p - \varepsilon_{tr}\frac{\partial E(A, V)}{\partial A^p},\qquad V^r \leftarrow V^r - \varepsilon_{tr}\frac{\partial E(A, V)}{\partial V^r} \quad (17)$$

where ε_tr is the step size in the training stage, which is initially set to 0.001 and decreased by a factor of 0.8 after each update with 5 million frames. We summarize the training procedure of WCNN in Algorithm 1.
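The following PyTorch sketch illustrates Eqs. (13)-(15): the writer code is projected by a per-layer adaptation matrix A^p and added as a channel-wise bias before batch normalization. The module interface and the shapes in the toy usage are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class WriterAwareConv(nn.Module):
    """One writer-aware convolutional layer: O = ReLU(BN(conv(I) + A V))."""

    def __init__(self, in_channels, out_channels, code_dim):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # adaptation layer A^p: maps the G-dim writer code to a K-dim bias (Eq. 13)
        self.adapt = nn.Linear(code_dim, out_channels, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x, writer_code):
        q = self.adapt(writer_code)                 # Q^{r,p} = A^p V^r      (Eq. 13)
        q = q.view(1, -1, 1, 1)                     # channel-wise add, as in Fig. 6
        return self.relu(self.bn(self.conv(x) + q)) # Eqs. (14)-(15)

# toy usage: frames of one text line from writer r with a 200-dim code
frames = torch.randn(16, 100, 20, 20)
writer_code = torch.zeros(200, requires_grad=True)   # learnable V^r
layer = WriterAwareConv(in_channels=100, out_channels=300, code_dim=200)
print(layer(frames, writer_code).shape)   # torch.Size([16, 300, 20, 20])
```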
Algorithm 1
The training procedure of WCNN.
Input:
The writer-independent parameter set Λ generated using a conventional CNN-HMM/CNN-PHMM system; the randomly initialized writer-aware parameter set {A, V}; the minibatch-level training dataset with the state label and writer information in each frame.
Steps:
1. Randomly select one minibatch and set the input writer code of each frame using the writer information and V.
2. Calculate all required derivatives using backpropagation.
3. Update the adaptation layer parameters and writer codes {A, V} using Eq. (17).
4. Go to step 1 until the convergence condition is satisfied.
Output:
The parameter set of WCNN {Λ, A, V}.

In the recognition stage, for the data of an unknown writer, multipass decoding is conducted. In the first-pass decoding, we use only CNN-HMM/CNN-PHMM with the parameter set Λ to generate the recognition results, which are adopted as the state labels for updating the writer code vector of this unknown writer in the next pass. In the second pass, we perform the adaptation by minimizing the cross-entropy criterion with respect to the writer code V^U:

$$E'(V^U) = -\sum_{t=1}^{N'_B} \log p(s^U_t | X^U_t, \Lambda, A, V^U). \quad (18)$$

Similar to Eq. (16), X^U_t is the t-th input frame of the unknown writer, while s^U_t is its corresponding state label from the first-pass recognition. The batch size N'_B refers to the number of frames of each text line.
Algorithm 2
The adaptation/recognition procedure of WCNN.
Input:
The WCNN parameter set {Λ, A}; the minibatch-level dataset of an unknown writer; the randomly initialized writer code V^U.
Steps:
1. Generate the state labels via first-pass decoding using Λ.
2. Perform the adaptation to refine V^U using Eq. (19).
3. Conduct decoding using {Λ, A, V^U} of WCNN.
4. Go to step 2 for alternating adaptation and recognition until a specified number of decoding passes is reached.
Output:
The writer code V^U and the recognition results.

Please note that we do not use V^U from the training stage; instead, we randomly initialize the code V^U of the unknown writer. Accordingly, we can update V^U as:

$$V^U \leftarrow V^U - \varepsilon_{ts}\frac{\partial E'(V^U)}{\partial V^U} \quad (19)$$

where ε_ts is the step size in the testing stage, which is set to 0.001. Then, we conduct a second-pass decoding using {Λ, A, V^U} of WCNN. The adaptation and recognition processes can be conducted alternately and iteratively until a specified number of decoding passes is reached. We summarize the adaptation/recognition procedure of WCNN in Algorithm 2.
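A minimal PyTorch sketch of the unsupervised code update in Eqs. (18)-(19): only the writer code is optimized against the first-pass state labels, while Λ and A stay frozen. The model interface, the number of update steps, and the function names are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def adapt_writer_code(model, frames, first_pass_states, code_dim=200,
                      steps=10, lr=1e-3):
    """Refine the code V^U of an unknown writer (Eq. 19); Lambda and A are frozen."""
    writer_code = torch.zeros(code_dim, requires_grad=True)   # initialized V^U
    optimizer = torch.optim.SGD([writer_code], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # assumed interface: model(frames, writer_code) returns frame-level
        # state log-posteriors of shape (num_frames, num_tied_states)
        log_post = model(frames, writer_code)
        loss = F.nll_loss(log_post, first_pass_states)   # cross-entropy, Eq. (18)
        loss.backward()
        optimizer.step()
    return writer_code.detach()
```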
D. Hybrid language model

The HLM is a linear interpolation of a traditional NLM and an RNNLM. Considering that all calculations in Eq. (2) are performed in the logarithmic domain, the HLM is represented as:

$$\log p_{HLM}(C) = \omega \log p_{NLM}(C) + (1-\omega) \log p_{RNNLM}(C) \quad (20)$$

where p_NLM(C) is the probability of the n-character sequence C = {C_1, C_2, ..., C_n} computed by the NLM, while p_RNNLM(C) is obtained from the RNNLM. ω is a hyperparameter to adjust the ratio between the NLM and the RNNLM. For the RNNLM, a simple RNN with three layers, including an input layer, a hidden layer and an output layer, is used. At time step i, the input vectors consist of a 1-of-V coding R_i that represents the previous character C_{i-1}, and the previous hidden layer output H_{i-1}. The output of the hidden layer is computed as:

$$H_i = f(W_{H,V} R_i + W_{H,H} H_{i-1}) \quad (21)$$

where W_{H,V} and W_{H,H} are learnable matrices of size H × V and H × H, respectively. The activation function f is the sigmoid. In the output layer, using the history information H_i, the probabilities of the predicted characters at time step i are estimated:

$$P_i = g(W_{V,H} H_i) \quad (22)$$

where g is the softmax function and W_{V,H} is a V × H learnable matrix. Naturally, for a predicted character C_i at time step i, we have:

$$p_{RNNLM}(C_i | C_{i-1}, C_{i-2}, ..., C_1) = P_i(C_i). \quad (23)$$

Finally,

$$p_{RNNLM}(C) = \prod_{i=1}^{n} p_{RNNLM}(C_i | C_{i-1}, C_{i-2}, ..., C_1) = \prod_{i=1}^{n} P_i(C_i). \quad (24)$$

In this work, the dimension of the hidden layer is set to 300, ω is 0.5, and the weights {W_{H,V}, W_{H,H}, W_{V,H}} in the RNNLM are optimized using truncated BPTT [52].
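The following sketch illustrates Eqs. (20)-(24) with a toy RNNLM forward pass and the log-domain interpolation; the weight values, the handling of the first character, and all names are illustrative assumptions rather than the trained models used in the experiments.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rnnlm_log_prob(char_ids, W_hv, W_hh, W_vh):
    """Eqs. (21)-(24): log p_RNNLM(C) for a sequence of character indices."""
    h_dim, v_dim = W_hv.shape
    H = np.zeros(h_dim)                   # initial hidden state
    log_p = 0.0
    for i in range(1, len(char_ids)):
        R = np.zeros(v_dim)
        R[char_ids[i - 1]] = 1.0          # 1-of-V coding of the previous character
        H = sigmoid(W_hv @ R + W_hh @ H)  # Eq. (21)
        P = softmax(W_vh @ H)             # Eq. (22)
        log_p += np.log(P[char_ids[i]])   # Eqs. (23)-(24) in the log domain
    return log_p

def hybrid_lm_log_prob(log_p_nlm, log_p_rnnlm, omega=0.5):
    """Eq. (20): log-domain interpolation of the NLM and the RNNLM."""
    return omega * log_p_nlm + (1.0 - omega) * log_p_rnnlm

# toy usage: vocabulary of 10 characters, hidden size 4, random weights
rng = np.random.default_rng(0)
W_hv, W_hh, W_vh = rng.normal(size=(4, 10)), rng.normal(size=(4, 4)), rng.normal(size=(10, 4))
log_p_rnn = rnnlm_log_prob([3, 1, 4, 1, 5], W_hv, W_hh, W_vh)
print(hybrid_lm_log_prob(-12.3, log_p_rnn))
```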
TABLE III
THE INFORMATION OF THE CASIA-HWDB DATABASES.
V. EXPERIMENTS

We designed a set of experiments to validate and explain the effectiveness of the proposed method for offline HCTR. All experiments were implemented with the Kaldi [29] and PyTorch [30] toolkits using NVIDIA GeForce GTX 1080Ti GPUs. Additionally, we plan to release our source code in the near future.
A. Dataset and metrics
We conducted the experiments on a widely used database for HCTR released by the Institute of Automation of the Chinese Academy of Sciences (CASIA) [53], [54]. To train the character
models, both the offline isolated handwritten Chinese character datasets (HWDB1.0 and HWDB1.1) and the training sets of the offline handwritten Chinese text datasets (HWDB2.0, HWDB2.1, and HWDB2.2) were used. The detailed information, including the number of classes, writers, lines, and characters for each dataset, is shown in Table III. In total, 3,980 classes (Chinese characters, symbols, garbage) were formed with 4,091,599 samples. To train the language model, the training sets of the offline handwritten Chinese text datasets HWDB2.0-2.2 and news data downloaded from the Internet were used. All the news data were checked to exclude the text of the test set. The whole corpus contains approximately ten million characters. The ICDAR 2013 competition set with 60 writers unseen in the training dataset was adopted as the evaluation set [9]. The CER was computed as:

$$CER = \frac{N_s + N_i + N_d}{N} \quad (25)$$

where N is the total number of character samples in the evaluation set, and N_s, N_i and N_d denote the numbers of substitution errors, insertion errors and deletion errors, respectively. Firstly, to focus on character modeling, we did not use additional language models.
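The CER of Eq. (25) equals the Levenshtein distance between the recognized and reference character sequences divided by the reference length; the following sketch computes it with a standard dynamic program (illustrative code, not the official evaluation tool).

```python
def cer(ref, hyp):
    """Eq. (25): CER = (N_s + N_i + N_d) / N via the Levenshtein distance."""
    n, m = len(ref), len(hyp)
    # d[i][j] = minimum edit operations between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i                       # deletions
    for j in range(m + 1):
        d[0][j] = j                       # insertions
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution / match
    return d[n][m] / max(n, 1)

print(cer("截止到昨日下午6时", "截上到昨日下午6时"))  # one substitution -> 1/9
```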
B. Experiments on state tying of PHMM

1) Comparison between CNN-HMM and CNN-PHMM: We first compared CNN-HMM with CNN-PHMM according to the best configuration in our previous work [17], i.e., there were 16 weight layers (14 Conv and 2 FC layers) and the number of channels increased from 100 to 700. The image patch of each frame was passed through a stack of 3 × 3 convolutional layers.
TABLE IV
CER (%) COMPARISON BETWEEN CNN-HMM AND CNN-PHMM BASED ON DIFFERENT SETTINGS OF AVERAGE STATES PER HMM.

TABLE V
PRACTICAL ISSUE COMPARISON OF DIFFERENT SETTINGS OF AVERAGE STATES PER HMM FOR THE CORRESPONDING CNN-PHMM SYSTEM IN TABLE IV. N_M AND N_T REPRESENT THE MODEL SIZE AND RUN-TIME LATENCY, RESPECTIVELY, WHICH ARE NORMALIZED BY THOSE OF THE CNN-HMM SYSTEM WITH 5 STATES PER HMM.

to the lack of adequate resolution. Notably, the number of output nodes of the CNN was 3,980 ×
2) Analysis of state tying:
In Fig. 7, we list representative examples of tied Chinese characters from positioned states 1 to 5 in our CNN-PHMM system.
Fig. 7. Examples of tied Chinese characters with similar radicals.

It was quite intuitive and reasonable that most of the tied Chinese characters shared the same or similar radicals, although the state-tying process was purely data driven with diversified writing styles. This result could explain why there was a large amount of parameter redundancy in the conventional untied CNN-HMM model. We also give partial results of the data-driven question set in Fig. 8. In total, there were 7,938 questions generated. It could be observed that the related characters in one question were similar, which demonstrated the effectiveness of the k-means clustering algorithm. Overall, the proposed state-tying method has two advantages. First, because the total number of states corresponds to the size of the CNN output layer, having fewer categories makes CNN training easier and speeds up the recognizer. Second, reducing parameter redundancy can potentially increase the number of training samples for the tied states from different characters.

For further analysis, we draw the learning curves during training for the conventional CNN and the tied-state CNN (TCNN) in Fig. 9. Obviously, the learning curve of TCNN was always below that of CNN. More interestingly, the gap between the two curves significantly increased in the beginning stage and then decreased to a relatively stable value as an increasing amount of training data was used. We believe that the compact design of the CNN output layer not only made the CNN model easier to train and more effective to classify but also fully utilized the training data by state tying.
Fig. 8. Partial results of the generated question set for tree-based state tying.
Fig. 9. Training loss comparison between CNN and TCNN.
C. Experiments on writer adaptive training for WCNN

1) The configuration of WCNN:
As shown in Fig. 4, there are two key factors for writer-adaptive training: the number of adaptation layers P and the dimension of the writer code G. As the number of adaptation layers increases, they are linked to the convolutional layers in order from the input layer to the output layer. Table VI compares different settings of the adaptation layer number P and the writer code dimension G in WCNN-PHMM. P = 0 denotes the CNN-PHMM system without writer adaptive training.
TABLE VI
CER (%) COMPARISON OF DIFFERENT SETTINGS OF ADAPTATION LAYER NUMBER P AND WRITER CODE DIMENSION G IN WCNN-PHMM.
Fig. 10. CER (%) comparison between WCNN-PHMM and CNN-PHMM for each writer of the competition set.

Please note that second-pass decoding was adopted as the default for WCNN-PHMM. When the writer code dimension was fixed at 200, the CER decreased from 9.54% to 8.96% as P increased from 0 to 5. The performance saturated when more than 5 adaptation layers were used, due to the limited adaptation data. Another interesting observation is that the performance of WCNN-PHMM was not sensitive to the writer code dimension, with a good tradeoff at G = 200. Thus, we use the configuration of P = 5 and G = 200 in the following experiments.

To further demonstrate the effectiveness of writer adaptive training, we make a CER comparison between WCNN-PHMM and CNN-PHMM for each writer in Fig. 10. Consistent improvements could be obtained for most of the 60 writers, and there were only 5 exceptions (No. 6, No. 14, No. 43, No. 48, No. 54). Especially for those writers with relatively high CERs, significant gains could be achieved; e.g., the CER was reduced from 15.11% to 9.66% for writer No. 1, with a relative CER reduction of 36.1%.
2) WCNN with/without state tying:
In Section V-C1, we illustrated that WCNN could yield additional gains over CNN on top of PHMM using state tying. In this section, as shown in Fig. 11, we compare the relative CER reduction (%) of WCNN over CNN with/without state tying for different settings of text lines on the competition set. For the CNN-HMM system without state tying, the best configured 5-state HMM in Table IV was used. In the competition set, the number of text lines for each writer ranged from 44 to 82. Overall, using all handwritten text lines of one writer for unsupervised adaptation, the CERs could be reduced from 10.02% to 9.55% (CNN-HMM vs. WCNN-HMM) and from 9.54% to 8.96% (CNN-PHMM vs. WCNN-PHMM). These stable performance gains indicated that the proposed writer-adaptive training method was effective for systems with/without state tying (PHMM/HMM). Regarding the performance with respect to the amount of adaptation data, we observed that, on average, only 15 handwritten text lines for each writer were enough to start improving the recognition accuracy through unsupervised adaptation. When the number of text lines was reduced to 10, the relative CER reduction was limited, i.e., 0.5% and 1.1% for WCNN-PHMM and WCNN-HMM, respectively. Furthermore, when we continued to reduce the number of text lines to 5, the CERs increased compared with the respective baselines. More interestingly, with increased adaptation data, the CER reduction of WCNN over CNN for the PHMM system with state tying became more significant than that for the HMM system without state tying, which implies that, as more handwritten data are collected from one writer, the proposed unsupervised adaptation via WCNN-PHMM can recognize handwritten text lines from this writer more accurately. Thus, the proposed WCNN-PHMM is a good demonstration of a compact model with adaptive capability.
3) Multiple-pass decoding of WCNN-PHMM:
The basic intuition in the adaptation stage is that better targets can promote the learning of the writer code and thus produce beneficial feedback on the decoding results. By using the results of the second-pass decoding based on WCNN-PHMM to generate better targets for the learning of the test writer codes, a third-pass decoding is conducted to obtain our final results. As shown in Table VII, the multiple-pass decoding can improve the recognition results (from 8.96% to 8.64%), which demonstrates that our intuition is right. We also list the run time comparison for different numbers of passes. In order to make a fair comparison, all experiments here were evaluated on the same machine, and we normalized the decoding time of the first pass to 1. The relative time consumption of the n-pass decoding (n = 2, 3) included two parts: the adaptation time and the decoding time. Although we could obtain a remarkable improvement via adaptation, the time consumption increased linearly with the number of decoding passes.
Fig. 11. The relative CER reduction (%) of WCNN over CNN with/without state tying for different settings of text lines on the competition set.

TABLE VII
CER (%) AND TIME CONSUMPTION COMPARISONS OF MULTIPLE-PASS DECODING OF THE WCNN-PHMM SYSTEM.

Multiple-pass Decoding    CER (%)   Decoding Time   Adaptation Time
First-pass (CNN-PHMM)     9.54      1.00            0.00
Second-pass               8.96      1.98            0.47
Third-pass                8.64
To address this problem, the acceleration of the CNN and fast adaptation will be investigated in our future work.
4) Visualization analysis for writer code:
To better understand why adaptation based on the writer code improves recognition performance, we adopted the t-SNE [55] technique to visualize the generated writer codes by reducing their dimension to 2. In Fig. 12(a), the distribution of several writer codes with the same transcripts on the competition set is shown. Correspondingly, we list their handwriting in Fig. 12(b). Interestingly, the distance between different writers in Fig. 12(a) was a strong indicator of the similarity of the writing styles of different writers. For example, all the distances of the ID pairs (31, 33), (32, 34), and (39, 40) were small, while the corresponding writing styles for those pairs were quite similar, as observed from the handwritten text lines, which demonstrates that the learned writer code indeed carries the writer information.
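A short sketch of this visualization step is given below, assuming the learned codes have already been collected into a matrix with one 200-dimensional row per writer; the random codes are placeholders, and scikit-learn's TSNE is one possible tool (the paper does not specify the implementation used).

```python
import numpy as np
from sklearn.manifold import TSNE

# writer_codes: one learned 200-dim code per test writer (placeholder values here)
writer_codes = np.random.randn(10, 200)
embedded = TSNE(n_components=2, perplexity=5, random_state=0).fit_transform(writer_codes)
for writer_id, (x, y) in enumerate(embedded, start=31):
    print(f"writer {writer_id}: ({x:.2f}, {y:.2f})")
```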
Fig. 12. Visualization analysis of several writer codes on the competition set. (a) The t-SNE visualization of several writer codes. (b) The corresponding handwriting examples of different writers.
D. Comparison of different language models
Table VIII shows the CER comparison of different language models. First, to demonstrate the scalability of our approach, we also conducted the corresponding 7360-class vocabulary experiments for the different HMM systems. Please note that all the classes and writer data in HWDB1.0-HWDB1.2 were used in the 7360-class experiments, rather than the subset listed in Table III that was used in the 3980-class experiments. Thus, the output layer sizes of the CNN in the CNN-HMM system and the WCNN in the WCNN-PHMM system were 36800 and 22080 for the 7360-class experiments, respectively, as illustrated in Fig. 3. Although the confusion among the 7360 classes is higher, the CER of the 7360-class CNN-HMM was only slightly increased, from 10.02% to 10.1%, thus demonstrating the robustness of the HMM system. A surprising observation was that the CER of the 7360-class CNN-PHMM was remarkably reduced to 9.17%, from 9.54% for the 3980-class CNN-PHMM, which might be because the larger amount of training data used for the 7360-class case was better utilized and shared among different classes (compared with the 3980-class case) due to the use of our state-tying algorithm. Correspondingly, the recognition performance of WCNN-PHMM was also improved from the 3980-class case to the 7360-class case, i.e., 8.60% and 8.42% for the second-pass decoding and the third-pass decoding, respectively.
TABLE VIII
CER (%) COMPARISON OF DIFFERENT LANGUAGE MODELS.

Method       Vocabulary   Without LM   NLM    HLM
CNN-HMM      3980         10.02        3.72   3.54
CNN-HMM      7360         10.1         3.82   3.58
CNN-PHMM     3980         9.54         3.57   3.44
CNN-PHMM     7360         9.17         3.52   3.35
WCNN-PHMM    3980         8.64         3.39   3.27
WCNN-PHMM    7360         8.42         3.33   3.17
Second, by adding a language model, a great improvement could be obtained for all the systems. Besides, compared with the NLM, all systems that used the HLM performed better; e.g., relative CER reductions of 6.3%, 4.8% and 4.8% could be obtained for the 7360-class CNN-HMM, CNN-PHMM and WCNN-PHMM, respectively. It is reasonable that a weak character model benefits more from a powerful language model.
E. Overall comparison and error analysis
Table IX shows an overall comparison of our proposed method and other state-of-the-art methods without/with a language model on the ICDAR 2013 competition set. We list the state-of-the-art oversegmentation-based methods, heterogeneous CNN [7] and CNNs-RNNLM [8], and the segmentation-free methods, SMDLSTM-CTC [15] and CNN-ACE [16], in Table IX for comparison. With the same configuration of vocabulary size (4 more garbage classes are adopted in our HMM system), the proposed WCNN-PHMM yielded the best performance whether a language model was employed or not. Moreover, as shown in Table VIII, by using a powerful language model (HLM), the CNN-HMM and CNN-PHMM with one-pass decoding could still outperform the other methods.

For error analysis, we provide two examples in Fig. 13. In the left part of the figure, the conventional CNN-HMM misrecognized the first character of the text line, while CNN-PHMM generated the correct result. A reasonable explanation is that the left radical of the character in the brown box became easier to recognize because state tying could potentially learn the radical's parameters better with more shared training samples from other characters. In the right part of the figure, CNN-PHMM made a substitution error (red), while WCNN-PHMM corrected this mistake. Arguably, even humans could confuse this handwritten character in isolation without any prior knowledge. However, by learning the writing style of this particular writer using the writer code, our WCNN-PHMM could correctly recognize it.
TABLE IX
PERFORMANCE COMPARISON OF OUR PROPOSED METHOD AND OTHER STATE-OF-THE-ART METHODS WITHOUT/WITH LANGUAGE MODELS ON THE COMPETITION SET.

Method            Vocabulary   Without LM   With LM
WCNN-PHMM         3980         8.64         3.27
WCNN-PHMM         7360         8.42         3.17
Wu et al. [15]    2672         9.98         7.39
Wu et al. [15]    7356         13.36        9.62
Wang et al. [7]   7356         11.21        5.98
Wu et al. [8]     7356         -            3.80
Xie et al. [16]   7357         8.75         3.78

Besides, the HMM-based approaches can assign each image frame to a certain state belonging to a character. Once the recognition process is completed, the segmentation information between different characters can be naturally obtained. Fig. 14 shows the segmentation results of different HMM-based systems, i.e., CNN-HMM, CNN-PHMM and WCNN-PHMM. The red lines are the boundaries of different characters. For many characters, such as the characters within the green dotted boxes, the CNN-PHMM and WCNN-PHMM provided more accurate boundaries than the CNN-HMM. For the characters within the blue dotted boxes, we observed that the WCNN-PHMM could still give the right boundaries while the CNN-PHMM and CNN-HMM failed.

Finally, in Figs. 15(a) and 15(b), we explain and analyze the scores of the reference states of the underlying characters from the CNN outputs for CNN-HMM, CNN-PHMM, and WCNN-PHMM. Fig. 15(a) shows the comparison of the state posterior probability (SPP) of the frames for the reference character class in the brown box of Fig. 13. CNN-PHMM consistently generated higher SPPs than CNN-HMM for all frames of the sequence. Similarly, in Fig. 15(b), corresponding to the character class in the red box of Fig. 13, WCNN-PHMM always yielded higher SPPs than CNN-PHMM.
VI. CONCLUSION
In this study, we propose a novel WCNN-PHMM architecture for offline handwritten Chinese text recognition to handle two key issues: the large vocabulary of Chinese characters and the diversity of writing styles. By combining the parsimonious HMM based on state tying and unsupervised adaptation based on the writer code, our new approach demonstrates its superiority to
Ground truth: 经 济 过 热 阶 段
CNN-PHMM: 往 济 过 热 阶 段
WCNN-PHMM: 经 济 过 热 阶 段
Ground truth: 设 立 了 由 新 员 工
CNN-HMM: 没 立 了 由 新 员 工
CNN-PHMM: 设 立 了 由 新 员 工
Fig. 13. Two examples of recognition results for different HMM systems.
Fig. 14. Comparison of segmentation results of different HMM systems.

other state-of-the-art approaches according to both the experimental results and the analysis. However, the current code-based adaptation simply depends on the backpropagation of the network, which means that adequate data is important. Besides, the 1-D HMM cannot provide the up-and-down information of characters. For future work, we will investigate meta-learning to reduce the dependence on data during adaptation and a more advanced way of achieving recognition and segmentation by using a 2-D HMM. Furthermore, we will aim to accelerate the CNN to reduce the decoding time.
Fig. 15. Comparison of reference state posterior probability (SPP) for different HMM systems. (a) Comparison of the reference SPP of the frames for CNN-HMM and CNN-PHMM. (b) Comparison of the reference SPP of the frames for CNN-PHMM and WCNN-PHMM.
ACKNOWLEDGMENTS
This work was supported in part by the National Key R&D Program of China under contract No. 2017YFB1002202, the National Natural Science Foundation of China under Grant Nos. 61671422 and U1613211, the Key Science and Technology Project of Anhui Province under Grant No. 17030901005, and the MOE-Microsoft Key Laboratory of USTC.
REFERENCES

[1] H. Fujisawa, "Forty years of research in character and document recognition–an industrial perspective," Pattern Recognition, vol. 41, no. 8, pp. 2435-2446, 2008.
[2] C.-L. Liu and L. Yue, "Advances in Chinese document and text processing," World Scientific, vol. 2, 2017.
[3] Z.-C. Xie, Z.-H. Sun, L.-W. Jin, H. Ni, and T. Lyons, "Learning spatial-semantic context with fully convolutional recurrent network for online handwritten Chinese text recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 8, pp. 1903-1917, 2017.
[4] X.-D. Zhou, D.-H. Wang, F. Tian, C.-L. Liu, and M. Nakagawa, "Handwritten Chinese/Japanese text recognition using semi-Markov conditional random fields," IEEE Trans. Pattern Anal. Mach. Intell., vol. 35, no. 10, pp. 2413-2426, 2013.
[5] N.-X. Li and L.-W. Jin, "A Bayesian-based probabilistic model for unconstrained handwritten offline Chinese text line recognition," Proc. IEEE SMC, 2010, pp. 3664-3668.
[6] Q.-F. Wang, F. Yin, and C.-L. Liu, "Handwritten Chinese text recognition by integrating multiple contexts," IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 8, pp. 1469-1481, 2012.
[7] S. Wang, L. Chen, L. Xu, W. Fan, J. Sun, and S. Naoi, "Deep knowledge training and heterogeneous CNN for handwritten Chinese text recognition," Proc. ICFHR, 2016, pp. 84-89.
[8] Y.-C. Wu, F. Yin, and C.-L. Liu, "Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models," Pattern Recognition, vol. 65, pp. 251-264, 2017.
[9] F. Yin, Q.-F. Wang, X.-Y. Zhang, and C.-L. Liu, "ICDAR 2013 Chinese handwriting recognition competition," Proc. ICDAR, 2013, pp. 1464-1470.
[10] T.-H. Su, T.-W. Zhang, D.-J. Guan, and H.-J. Huang, "Off-line recognition of realistic Chinese handwriting using segmentation-free strategy," Pattern Recognition, vol. 42, no. 1, pp. 167-182, 2009.
[11] R. Messina and J. Louradour, "Segmentation-free handwritten Chinese text recognition with LSTM-RNN," Proc. ICDAR, 2015, pp. 171-175.
[12] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber, "A novel connectionist system for improved unconstrained handwriting recognition," IEEE Trans. Pattern Anal. Mach. Intell., vol. 31, no. 5, pp. 855-868, 2009.
[13] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, "Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks," Proc. ICML, 2006, pp. 369-376.
[14] D. Suryani, P. Doetsch, and H. Ney, "On the benefits of convolutional neural network combinations in offline handwriting recognition," Proc. ICFHR, 2016.
[15] Y.-C. Wu, F. Yin, Z. Chen, and C.-L. Liu, "Handwritten Chinese text recognition using separable multi-dimensional recurrent neural network," Proc. ICDAR, 2017, pp. 79-84.
[16] Z.-C. Xie, Y.-X. Huang, Y.-Z. Zhu, L.-W. Jin, Y.-L. Liu, and L.-L. Xie, "Aggregation cross-entropy for sequence recognition," Proc. CVPR, 2019, pp. 6538-6547.
[17] Z.-R. Wang, J. Du, W.-C. Wang, J.-F. Zhai, and J.-S. Hu, "A comprehensive study of hybrid neural network hidden Markov model for offline handwritten Chinese text recognition," IJDAR, accepted.
[18] S. Katz, "Estimation of probabilities from sparse data for the language model component of a speech recognizer," IEEE Trans. Acoust. Speech Signal Process., vol. 35, no. 3, pp. 400-401, 1987.
[19] T. Mikolov, M. Karafiát, L. Burget, J. Černocký, and S. Khudanpur, "Recurrent neural network based language model," Proc. INTERSPEECH, 2010, pp. 1045-1048.
[20] C. Bucilua, R. Caruana, and A. Niculescu-Mizil, "Model compression," Proc. KDD, 2006.
[21] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436-444, 2015.
[22] X. Zhang, J. Zou, K.-M. He, and J. Sun, "Accelerating very deep convolutional networks for classification and detection," IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 10, pp. 1943-1955, 2016.
[23] Y.-H. He, X.-Y. Zhang, and J. Sun, "Channel pruning for accelerating very deep neural networks," Proc. ICCV, 2017.
[24] C. Leng, H. Li, S.-H. Zhu, and R. Jin, "Extremely low bit neural network: Squeeze the last bit out with ADMM," arXiv:1707.09870, 2017.
[25] X.-Y. Zhang, X.-Y. Zhou, M.-X. Lin, and J. Sun, "ShuffleNet: An extremely efficient convolutional neural network for mobile devices," arXiv:1707.01083, 2017.
[26] S.-J. Young, J.-J. Odell, and P.-C. Woodland, "Tree-based state tying for high accuracy acoustic modelling,"
Proc. workshopon Human Language Technology , pp. 307-312, 1994.[27] S. Young et al ., The HTK Book (Revised for HTK version 3.4.1), Cambridge University, 2009.[28] G. Hinton, L. Deng, D. Yu, G. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. Sainath, and B.Kingsbury, “Deep neural networks for acoustic modeling in speech recognition,”
IEEE Signal Processing Magazine , Vol.29, No. 6, pp.82-97, 2012.[29] D. Povey, A. Ghoshal, et al., “The kaldi speech recognition toolkit,”
Proc. ASRU , 2011.[30] P. Adam, et al., “Automatic differentiation in pytorch,”
NIPS-W , 2017.[31] W.-C. Wang, J. Du and Z.-R. Wang, “Parsimonious HMMs for offline handwritten Chinese text recognition,”
Proc. ICFHR ,2018.[32] C.-J. Leggetter and P.-C. Woodland, “Maximum likelihood linear regression for speaker adaptation of continuous densityhidden Markov models,”
Proc. Computer speech and language , pp. 171-185, 1995.[33] Sinno, J.-L pan and Q. Yang “A survey on transfer learning,”
Proc. IEEE Transactions on knowledge and data engineering ,Vol. 22, No. 10, pp. 1345-1359, 2010.[34] M.-L. Yu, P.C.-K. Kwok, C.-H. Leung and K.-W. Tse, “Segmentation and recognition of Chinese bank check amounts,”
IJDAR , Vol. 3, pp.207-217, 2001.[35] B. Bridgeman, C. Trapani and Y. Attali, “Comparison of human and machine scoring of essays: Differences by gender,ethnicity, and country,”
Applied Measurement in Education , Vol. 25, pp.27-40, 2012.[36] X.-Y. Zhang and C.-L. Liu, “Writer adaptation with style transfer mapping,”
IEEE Trans. on Pattern Analysis and MachineIntelligence , Vol. 35, No. 7, pp. 1773-1787, 2013.[37] J. Du and Q. Huo, “A discriminative linear regression approach to adaptation of multi-prototype based classifiers and itsapplications for Chinese OCR,”
Pattern Recognition , Vol. 46, No. 8, pp. 2313-2322, 2013.[38] J. Du, J.-S. Hu, B. Zhu, S. Wei, and L.-R. Dai, “Writer adaptation using bottleneck features and discriminative linearregression for online handwritten Chinese character recognition,”
Proc. ICFHR , 2014, pp. 311-316.[39] J. Du, J.-F. Zhai, J.-S. Hu, B. Zhu, S. Wei, and L.-R. Dai, “Writer adaptive feature extraction based on convolutionalneural networks for online handwritten Chinese character recognition,”
Proc. ICDAR , 2015, pp.841-845.
September 23, 2019 DRAFT2 [40] H.-M. Yang, X.-Y. Zhang, F. Yin, Z.-B. Luo and C.-L. Liu, “Unsupervised adaptation of neural networks for Chinesehandwriting recognition,”
Proc. ICFHR , pp. 512-517, 2016.[41] X.-Y. Zhang, Y. Bengio, and C.-L. Liu, “Online and offline handwritten chinese character recognition: A comprehensivestudy and new benchmark”
Pattern Recognition , Vol. 38, pp.348-360, 2016.[42] I.-J. Goodfellow, J.-P. Abadie, M. Mirza, B. Xu, D.-W. Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative adversarialnets,”
Proc. NIPS , pp.2672-2680, 2014.[43] O. Abdel-Hamid and H. Jiang, “Fast speaker adaptation of hybrid NN/HMM model for speech recognition based ondiscriminative learning of speaker code,”
Proc. ICASSP , 2013, pp.7942-7946.[44] S.-F. Xue, O. Abdel-Hamid, H. Jiang, L.-R. Dai, and Q.-F. Liu, “Fast adaptation of deep neural network based ondiscriminant codes for speech recognition,”
IEEE/ACM Trans. on Audio, Speech, and Language Processing , Vol. 22, No.12, pp.1713-1725, 2014.[45] Zi-Rui Wang and Jun Du, “Writer Code Based Adaptation of Deep Neural Network for Offline Handwritten Chinese TextRecognition,”
Proc. ICFHR , 2016, pp.311-316.[46] A. Stolcke, “SRILM: an extensible language modeling toolkit,”
Proc. ICSLP , 2002, pp.901-904.[47] L. Yann, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,”
Proceedingsof the IEEE , Vol. 11, pp.2278-2324, 1986.[48] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,”
Journal of Physiology , Vol. 160, pp.106-154, 1962.[49] S. Ioffe and C. Szegedy, “Batch normalization: accelerating deep network training by reducing internal covariate shift,” arXiv:1502.03167 , 2015.[50] A. Krizhevsky, I. Sutskever, and G. Hinton, “Imagenet classification with deep convolutional neural networks,”
Proc. NIPS ,2012, pp.1097-1105.[51] Y.-P. Zhang, S. Liang, S. Nie, W.-J. Liu and S.-Y. Peng, “Robust offline handwritten character recognition through exploringwriter-independent features under the guidance of printed data,”
Pattern Recognition letters , Vol. 106, pp.20-26, 2018.[52] T. Mikolov, S. Kombrink, L. Burget, J.H. ernock, and S. Khudanpur, “Extensions of recurrent neural network languagemodel,”
Proc. ICASSP , 2011, pp.5528-5531.[53] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “CASIA online and offline Chinese handwriting databases,”
Proc. ICDAR ,2011, pp.37-41.[54] C.-L. Liu, F. Yin, D.-H. Wang, and Q.-F. Wang, “Online and offline handwritten Chinese character recognition:benchmarking on new databases,”
Pattern Recognition , Vol. 46, No. 1, pp.155-162, 2013.[55] L. van der Maaten, and G. Hinton, “Visualizing data using t-SNE,”
Machine Learning , Vol. 9, pp.2579-2605, 2008.[56] H.-M Yang, X.-Y Zhang, F. Yin, J. Sun and C.-L. Liu, “Deep transfer mapping for unsupervised writer adaptation,”
Proc.ICFHR , 2018., 2018.