A Novel Re-weighting Method for Connectionist Temporal Classification

Hongzhu Li and Weiqiang Wang*

University of Chinese Academy of Sciences, Beijing, China
Abstract
The connectionist temporal classification (CTC) enables end-to-end sequence learning by maximizing the probability of correctly recognizing sequences during training. With an extra blank class, the CTC implicitly converts recognizing a sequence into classifying each timestep within the sequence. But the CTC loss is not intuitive for such a classification task, so the class imbalance within each sequence, caused by the overwhelming blank timesteps, is a knotty problem. In this paper, we define a piece-wise function as the pseudo ground-truth to reinterpret the CTC loss based on sequences as the cross entropy loss based on timesteps. The cross entropy form makes it easy to re-weight the CTC loss. Experiments on text recognition show that the weighted CTC loss solves the class imbalance problem as well as facilitates the convergence, generally leading to better results than the CTC loss. Besides this, the reinterpretation of CTC, as a brand new perspective, may be potentially useful in some other situations.
1. Introduction
The connectionist temporal classification (CTC) (Graves & Gomez, 2006) is a commonly used method in sequence recognition tasks, including speech recognition (Graves & Jaitly, 2014; Miao et al., 2016; Kim et al., 2017), text recognition (Alex et al., 2009; He et al., 2016b; Shi et al., 2017; Borisyuk et al., 2018) and so on. With an extra blank class, the output at each timestep in the sequence indicates either a specific label or no label. The outputs over all timesteps constitute a sequence of labels and blanks, named a path. A path is transformed into a label sequence by removing the repeated labels and then the blanks in it, and different paths can correspond to the same label sequence. The CTC-based training is to maximize the probability of the correct labelling, which is calculated by summing up the probabilities of all the corresponding paths.

* Corresponding author: [email protected]

Figure 1. (Graves & Gomez, 2006) Evolution of the network output and CTC error signal during training. Lines with different colors denote different labels, and the dashed line is the blank class.

Some previous works try to improve the CTC with regularization or re-weighting/re-sampling heuristics. Hu et al. (2018) propose a maximum entropy based regularization for CTC (EnCTC) to enhance exploration during training and obtain models with better generalization. They also propose an entropy-based pruning algorithm (EsCTC) to rule out unreasonable paths. For weakly-supervised action labelling in video, Huang et al. (2016) introduce the Extended Connectionist Temporal Classification (ECTC) framework to enforce the consistency of all possible paths with frame-to-frame visual similarities. To solve the imbalance problem for sequence recognition, Feng et al. (2019) modify the traditional CTC by fusing focal loss with it, and thus make the model attend to the low-frequency samples at the training stage. All these works treat a sequence or a path as an example, and calculate the loss or perform the re-sampling on the basis of sequences or paths.

Different from them, we propose to treat each timestep in a sequence as an individual example, and regard the sequence recognition task as a classification task for each timestep. The classification of each timestep is similar to the classification branch in object detection (Liu et al., 2016), where the blank class corresponds to the background category and the labels correspond to the objects.

Figure 2. The accuracy degradation phenomenon during CTC-based training. The network configuration is given in Table 1(b). The training set is Synth100 and the test set is Train, described in Section 3.1. The learning rate is 0.01 and the batch size is 32; other implementation details can be found in Section 3.2.

In this case, the CTC may suffer from a classic problem in object detection: the imbalance between background and object samples. As shown in Figure 1, the outputs of a trained CTC network tend to form a series of spikes separated by strongly predicted blanks. It means only a few timesteps are label samples, and the rest are all blank samples, which are much more numerous than the former. According to the error signals, the label samples are the hard examples during training, but they become less hard as the network converges. By then, the network updating may be overwhelmed by the blanks.

In our experiments, we observe a phenomenon of accuracy degradation that supports this speculation. When a network is trained with the CTC loss and tested on the training set, the recognition accuracy starts to decrease after certain iterations and becomes very unstable. But if the batch normalization is performed within each mini-batch without using global statistics, the accuracy becomes reasonable, as illustrated in Figure 2. It shows that the network updating is unstable, so the means/variances averaged over past iterations do not suit the current network weights, which is probably caused by the overwhelming blanks. Therefore, common heuristics for object detection to solve the class imbalance, such as online hard example mining (OHEM) (Shrivastava et al., 2016), focal loss (Lin et al., 2017) and GHM (Li et al., 2019), can be introduced to improve the CTC.

In this paper, we propose a novel method to re-weight the CTC, offering both a theoretical basis and successful experimental experience.
The main contributions are summarized as follows: (1) We reinterpret the CTC loss for sequence labelling as the cross entropy loss for a classification problem, providing a new perspective to modify the CTC. (2) To deal with the class imbalance, we propose some weighted CTC losses, and demonstrate their effectiveness by comparison experiments on scene text recognition. The proposed weighted CTC has several advantages over the original CTC, including (1) preventing the accuracy degradation phenomenon, (2) alleviating the negative effects caused by imbalanced training data, and (3) facilitating the convergence of models, which means better performance and shorter training time.
2. Method
The CTC (Graves & Gomez, 2006) is proposed for labelling sequence data within a single network architecture that doesn't need pre-segmentation and post-processing. The basic idea is to interpret the network outputs as a conditional probability distribution over all possible output label sequences. Given this distribution, an objective function can be derived that directly maximises the probabilities of the correct label sequences.

At each timestep, the network outputs a probability distribution over the label set $L' = L \cup \{\text{blank}\}$, where $L$ contains all the labels in the task and the extra blank represents 'no label'. The activation $y_{tk}$ is interpreted as the probability of observing label $k$ of $L'$ at time $t$. Given the length-$T$ input sequence $x$, we get the conditional probability $p(\pi \mid x)$ of observing a particular path $\pi$ through the lattice of label observations:

$$p(\pi \mid x) = \prod_{t=1}^{T} y_{t\pi_t}, \quad \forall \pi \in L'^T, \tag{1}$$

where $\pi_t$ is the label observed at time $t$ along path $\pi$, and $L'^T$ is the set of length-$T$ paths over $L'$.

Paths are mapped onto label sequences by an operation $\mathcal{B}$ that simply removes the repeated labels and then the blanks in a sequence. For a given label sequence $l \in L^{\leq T}$, more than one $\pi$ corresponds to it, e.g. $\mathcal{B}(aa{-}{-}ab{-}) = \mathcal{B}({-}a{-}abb) = aab$, where '$-$' denotes the blank. We can evaluate the conditional probability of $l$ as the sum of the probabilities of all the corresponding paths:

$$p(l \mid x) = \sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x). \tag{2}$$

The CTC loss function is defined as the negative log probability of correctly labelling the sequence:

$$\mathrm{CTC}(l, x) = -\ln p(l \mid x). \tag{3}$$

During training, to backpropagate the gradient through the output layer, we need the derivatives of the loss function with respect to the outputs $\{a_{tk} \mid t \in [1, T], k \in L'\}$ before the activation function is applied.
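The path and sequence probabilities in Equations (1)-(3) can be made concrete with a small brute-force sketch (the helper names are illustrative; a practical implementation uses the dynamic-programming algorithm discussed later, since enumerating all $|L'|^T$ paths is only feasible for tiny $T$):

```python
import itertools, math

BLANK = "-"

def collapse(path):
    """B: remove repeated labels, then remove blanks."""
    kept = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in kept if c != BLANK)

def ctc_loss(probs, labels, target):
    """-ln p(l|x) by brute force: probs[t] is a dict y_t over L';
    sum p(pi|x) over every path that collapses to the target -- Eqs. (1)-(3)."""
    p_target = 0.0
    for path in itertools.product(labels, repeat=len(probs)):
        if collapse(path) == target:
            p_target += math.prod(probs[t][k] for t, k in enumerate(path))
    return -math.log(p_target)
```

For example, with two timesteps of uniform outputs over $\{-, a, b\}$, exactly the paths "-a", "a-" and "aa" collapse to "a", so $p(l \mid x) = 3/9$ and the loss is $\ln 3$.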
For the softmax activation function

$$y_{tk} = \frac{e^{a_{tk}}}{\sum_{k'} e^{a_{tk'}}}, \tag{4}$$

where $k'$ ranges over $L'$, the derivative with respect to $a_{tk}$ is

$$\frac{\partial\, \mathrm{CTC}(l, x)}{\partial a_{tk}} = y_{tk} - \frac{1}{p(l \mid x)} \sum_{\pi \in \mathcal{B}^{-1}(l):\, \pi_t = k} p(\pi \mid x), \tag{5}$$

where $\sum_{\pi \in \mathcal{B}^{-1}(l):\, \pi_t = k} p(\pi \mid x)$ is the sum of the probabilities of all the paths corresponding to $l$ that go through the label $k$ at time $t$.

When the network is used for prediction, the predictions over all timesteps are converted into a label sequence. Since the computational complexity grows exponentially with the length of the path, it is not practical to find the most probable label sequence $\hat{l}$. There are many approximate alternatives, and best path decoding is one of the most commonly used methods. It assumes that the most probable output will correspond to $\hat{l}$:

$$\hat{l} \approx \mathcal{B}(\pi^*), \quad \text{where } \pi^* = \arg\max_{\pi} p(\pi \mid x). \tag{6}$$

It is not guaranteed to find the most probable label sequence, but the solution is good enough in most cases and the computation procedure is trivial.

The cross entropy (CE) is used to estimate the distance between two probability distributions. Given ground-truth $y'$ and network outputs $y$, the cross entropy loss is defined as

$$\mathrm{CE}(y', y) = -\sum_k y'_k \ln(y_k), \tag{7}$$

where $k$ ranges over all the classes, and $y_k$ and $y'_k$ are the model's estimated and ground-truth probabilities for class $k$ respectively. Letting $\{a_k\}$ be the model's outputs before the softmax activation function is applied, the loss function derivative with respect to $a_k$ is

$$\frac{\partial\, \mathrm{CE}(y', y)}{\partial a_k} = y_k - y'_k. \tag{8}$$

The focal loss (Lin et al., 2017) is designed to address the one-stage object detection scenario in which there is an extreme imbalance between foreground and background classes during training. Let $y \in \{\pm 1\}$ specify the ground-truth class for binary classification, and let $p \in [0, 1]$ denote the model's estimated probability for the class with label $y = 1$.
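Best path decoding in Equation (6) is simple to implement; a minimal sketch (illustrative helper name, per-timestep distributions given as dicts):

```python
def best_path_decode(probs, blank="-"):
    """Pick the arg-max label at each timestep to get pi*, then apply B:
    collapse repeated labels, then drop blanks -- Eq. (6)."""
    path = [max(dist, key=dist.get) for dist in probs]
    decoded = []
    for i, label in enumerate(path):
        if (i == 0 or label != path[i - 1]) and label != blank:
            decoded.append(label)
    return "".join(decoded)
```

For the path "-aa-b" the decoder returns "ab", matching $\mathcal{B}$ applied to the arg-max path.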
For notational convenience, define $p_t$ as

$$p_t = \begin{cases} p & \text{if } y = 1 \\ 1 - p & \text{otherwise.} \end{cases} \tag{9}$$

The main idea of focal loss is to reshape the loss function to down-weight easy examples and thus focus training on hard negatives. On the basis of the cross entropy loss

$$\mathrm{CE}(p_t) = -\ln(p_t), \tag{10}$$

a modulating factor $(1 - p_t)^\gamma$ is added, with tunable focusing parameter $\gamma \geq 0$. The focal loss is defined as

$$\mathrm{FL}(p_t) = -(1 - p_t)^\gamma \ln(p_t). \tag{11}$$

To address class imbalance, a common method is to introduce a weighting factor $\alpha \in [0, 1]$ for class 1 and $1 - \alpha$ for class $-1$. For notational convenience, $\alpha_t$ is defined analogously to $p_t$. The $\alpha$-balanced variant of the focal loss is defined as

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^\gamma \ln(p_t). \tag{12}$$

Treating the sequence recognition as the classification of each timestep, we rewrite the CTC loss into the form of a cross entropy loss. Given an input sequence $x$ and its ground-truth label sequence $l$, the network outputs probability distributions $Y = \{y_{tk} \mid t \in [1, T], k \in L'\}$ over the $T$ timesteps of the sequence. We define $y_t = \{y_{tk} \mid k \in L'\}$ as the predicted probability distribution for the sample of timestep $t$, and assume there is a corresponding ground-truth probability distribution $y'_t = \{y'_{tk} \mid k \in L'\}$. The cross entropy loss of correctly labelling the sequence should be

$$\mathrm{CTC}(l, x) = \sum_t \mathrm{CE}(y'_t, y_t) = -\sum_t \sum_k y'_{tk} \ln(y_{tk}). \tag{13}$$

A feasible solution for $y'_t$ can be found by following the conditions below:

$$y'_{tk} = \frac{1}{p(l \mid x)} \sum_{\pi \in \mathcal{B}^{-1}(l):\, \pi_t = k} p(\pi \mid x), \qquad \frac{\partial y'_{tk}}{\partial y_{t'k'}} = 0, \quad \forall\, t, t' \in [1, T],\; k, k' \in L'. \tag{14}$$

We can get the derivative of $\sum_t \mathrm{CE}(y'_t, y_t)$ with respect to $a_{tk}$:

$$\frac{\partial \sum_t \mathrm{CE}(y'_t, y_t)}{\partial a_{tk}} = y_{tk} - y'_{tk}, \tag{15}$$

which equals the CTC loss function derivative in Equation (5).

According to Equations (1) and (2), $p(\pi \mid x)$ and $p(l \mid x)$ are calculated on the basis of $Y$ and $l$. It seems unreasonable that the derivative of $y'_{tk}$ with respect to $y_{t'k'}$ equals zero when $y'_{tk}$ depends on $Y$. We argue that the definition of $y'_t$ is valid, and elaborate on it as follows.

At first, we define an intermediate variable $M = \{m_{tk} \mid t \in [1, T], k \in L'\}$, where

$$m_{tk} = \frac{1}{p(l \mid x)} \sum_{\pi \in \mathcal{B}^{-1}(l):\, \pi_t = k} p(\pi \mid x), \tag{16}$$

also denoted by $M = f(Y, l)$ for expression convenience. In this function, $(Y, l)$ is a $(T|L'| + |l|)$-dimensional independent variable, whose value is different for each sequence in each mini-batch. Over the entire training phase, $(Y, l)$ takes finitely many discrete values. Indexing the values of $(Y, l)$ by $i$, we define the ground-truth $Y'$ as a piecewise function

$$Y' = g(Y, l) = \begin{cases} f(Y_i, l_i) & \text{if } \exists\, i,\ (Y, l) \in U(Y_i, l_i) \\ \text{whatever} & \text{otherwise,} \end{cases} \tag{17}$$

where $U(\cdot)$ stands for a neighborhood. It is easy to see that

$$Y'_i = g(Y_i, l_i) = f(Y_i, l_i), \qquad \partial Y'_i / \partial Y_i = g'(Y_i, l_i) = 0. \tag{18}$$

Omitting $i$ and substituting $Y = \{y_{tk}\}$ and $Y' = \{y'_{tk}\}$ into the above equation, we can get Equation (14).

Some examples of $Y$ and the corresponding $Y'$ are illustrated in Figure 3 for an intuitive perception.

Figure 3. Examples of predicted probability distributions $Y$ and the corresponding ground-truths $Y'$. This is the same sequence during different iterations. Lines with different colors denote different labels, and the dashed line is the blank class.

Given the cross entropy loss function in Equation (13), we can apply weighting methods for classification tasks to improve CTC. We should notice that, compared with the ground-truth in a general classification task, $y'_t$ does not follow a probability distribution where one of the classes has a probability of 1 and the other classes have a probability of 0, so there is no ground-truth class.
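To make the pseudo ground-truth concrete, here is a brute-force sketch of Equation (16) (illustrative helper names; it enumerates all paths, so it only works for toy-sized inputs). One property worth checking: because every path passes through exactly one label at each timestep, each $y'_t$ is itself a valid probability distribution:

```python
import itertools, math

BLANK = "-"

def collapse(path):
    """B: remove repeated labels, then remove blanks."""
    kept = [c for i, c in enumerate(path) if i == 0 or c != path[i - 1]]
    return "".join(c for c in kept if c != BLANK)

def pseudo_ground_truth(probs, labels, target):
    """m_tk = (1 / p(l|x)) * sum of p(pi|x) over paths with pi_t = k -- Eq. (16).
    Returns the per-timestep pseudo ground-truth distributions and p(l|x)."""
    T = len(probs)
    p_l = 0.0
    m = [{k: 0.0 for k in labels} for _ in range(T)]
    for path in itertools.product(labels, repeat=T):
        if collapse(path) != target:
            continue
        p_path = math.prod(probs[t][k] for t, k in enumerate(path))
        p_l += p_path
        for t, k in enumerate(path):
            m[t][k] += p_path
    return [{k: m[t][k] / p_l for k in labels} for t in range(T)], p_l
```

With two timesteps of uniform outputs over $\{-, a\}$ and target "a", the admissible paths are "-a", "a-" and "aa", giving $p(l \mid x) = 0.75$ and $y'_{1a} = 0.5/0.75$.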
We adopt the weighting method in two different ways to accommodate this situation. One way is to assign weights to different classes, called class weighting. The other way is sample weighting, where each sample weight is calculated based on $y'_t$.

At first, we introduce weighting factors to balance the label and blank samples. The class-weighted CTC loss function is defined as

$$\mathrm{CTC}_{cs}(l, x) = -\sum_t \sum_k \alpha_k y'_{tk} \ln(y_{tk}), \tag{19}$$

where

$$\alpha_k = \begin{cases} 1 - \alpha & \text{if } k = \text{blank} \\ \alpha & \text{otherwise} \end{cases} \tag{20}$$

is the weighting factor for class $k$, and $\alpha \in [0, 1]$ is a tunable parameter. The sample-weighted CTC loss function is

$$\mathrm{CTC}_{sp}(l, x) = -\sum_t \alpha_t \sum_k y'_{tk} \ln(y_{tk}), \tag{21}$$

where $\alpha_t$ is the weighting factor for sample $t$, defined as

$$\alpha_t = \alpha (1 - y'_{t,\text{blank}}) + (1 - \alpha) y'_{t,\text{blank}}. \tag{22}$$

It is easy to see that when $\alpha$ is taken as 0.5, the above two weighted CTC losses are equivalent to the CTC loss.

We introduce focal loss to CTC, naming it the connectionist temporal focal loss (CTFL), to focus the training process on hard samples. Extending focal loss from binary classification to the multi-class case is straightforward. Defining $p_t$ as the estimated probability for the ground-truth class, $(1 - p_t)^\gamma$ is used to down-weight easy samples, where $\gamma \geq 0$ is a tunable focusing parameter. But as analyzed before, there is no ground-truth class for $y'_t$. In this case, we extend the sample weights of focal loss to the class-weight form $|y_{tk} - y'_{tk}|^\gamma$, where $|y_{tk} - y'_{tk}|$ denotes the distance between the estimated and ground-truth probabilities for class $k$. For class weighting, we use the distances as the class weights of each sample, and define the class-weighted CTFL as

$$\mathrm{CTFL}_{cs}(l, x) = -\sum_t \sum_k |y_{tk} - y'_{tk}|^\gamma\, y'_{tk} \ln(y_{tk}). \tag{23}$$

As with sample weighting, each sample weight is calculated by summing the distances over all classes.
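A minimal sketch of the class- and sample-weighted losses in Equations (19)-(22), assuming the pseudo ground-truth distributions $y'_t$ have already been computed (helper name illustrative). At $\alpha = 0.5$ both reduce to one half of the unweighted cross entropy sum, i.e. the same loss up to a constant scale:

```python
import math

def weighted_ctc(probs, gt, alpha=0.5, mode="class", blank="-"):
    """Class-weighted (Eqs. 19-20) and sample-weighted (Eqs. 21-22) CTC.
    probs[t] is the predicted distribution y_t, gt[t] the pseudo
    ground-truth y'_t, both dicts over the label set L'."""
    loss = 0.0
    for y_t, gt_t in zip(probs, gt):
        if mode == "class":
            # alpha_k = 1 - alpha for blank, alpha otherwise -- Eq. (20)
            loss -= sum(((1 - alpha) if k == blank else alpha)
                        * gt_t[k] * math.log(y_t[k]) for k in y_t)
        else:
            # alpha_t interpolates by the blank mass of y'_t -- Eq. (22)
            a_t = alpha * (1 - gt_t[blank]) + (1 - alpha) * gt_t[blank]
            loss -= a_t * sum(gt_t[k] * math.log(y_t[k]) for k in y_t)
    return loss
```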
The sample-weighted CTFL is given as

$$\mathrm{CTFL}_{sp}(l, x) = -\sum_t \Big( \sum_k |y_{tk} - y'_{tk}|^\gamma \Big) \Big( \sum_k y'_{tk} \ln(y_{tk}) \Big). \tag{24}$$

It is easy to notice that when the value of $\gamma$ is 0, CTFL degenerates into the CTC loss.

See the appendix for the loss derivatives and formula derivation processes.
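A sketch of CTFL in Equations (23)-(24) under the same assumptions (pseudo ground-truth given; helper name illustrative). At $\gamma = 0$ the class-weighted form reduces exactly to the cross entropy sum, and the sample-weighted form reduces to it up to a constant factor of $|L'|$:

```python
import math

def ctfl(probs, gt, gamma=2.0, mode="class"):
    """Connectionist temporal focal loss: class-weighted (Eq. 23) or
    sample-weighted (Eq. 24) form. probs[t] = y_t, gt[t] = y'_t (dicts)."""
    loss = 0.0
    for y_t, gt_t in zip(probs, gt):
        if mode == "class":
            # per-class modulating factor |y_tk - y'_tk|^gamma -- Eq. (23)
            loss -= sum(abs(y_t[k] - gt_t[k]) ** gamma
                        * gt_t[k] * math.log(y_t[k]) for k in y_t)
        else:
            # one weight per timestep: the distances summed over classes -- Eq. (24)
            w_t = sum(abs(y_t[k] - gt_t[k]) ** gamma for k in y_t)
            loss -= w_t * sum(gt_t[k] * math.log(y_t[k]) for k in y_t)
    return loss
```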
3. Experiments
To evaluate the effects of the weighted CTC losses, we compare them with the CTC loss in terms of the convergence process and recognition performance of the models. For all the experiments, the accuracy refers to sequence accuracy, i.e. the percentage of testing images correctly recognized. Although different losses are adopted for training, the reported loss is always the CTC loss, for it is an indicator of the probability of correctly recognizing a sequence according to Equation (3).
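Sequence accuracy as used here is simply exact-match accuracy over whole label sequences; a trivial sketch (illustrative function name):

```python
def sequence_accuracy(predictions, ground_truths):
    """Fraction of images whose predicted label sequence matches the
    ground-truth word exactly (no partial credit)."""
    assert len(predictions) == len(ground_truths) and predictions
    correct = sum(p == g for p, g in zip(predictions, ground_truths))
    return correct / len(predictions)
```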
For all the following experiments, we use the synthetic dataset released by (Jaderberg et al., 2014) as the training data. The dataset consists of 8 million word images and their corresponding ground-truth words. All the images are generated by a synthetic data engine using a 90k word dictionary, and are of different sizes. For training efficiency, we construct a training set Synth consisting of 32 × 256 images: At first, all word images are scaled to height 32 without changing their aspect ratios. If the scaled width is larger than 256, we continue to scale the image to 32 × 256. If there is enough room for the next scaled image, we append it to the current image after 20 columns of zeros. Otherwise, we pad the scaled image to width 256 with zeros. Besides, we construct another training set Synth100, which is more balanced between different classes. It is the same as Synth but contains 100 times extra copies of the images containing digits. The character number of each class in Synth and Synth100 is displayed in Figure 4.

There are four popular benchmarks for scene text recognition used for model performance evaluation, namely IIIT5k-word (IIIT5k), Street View Text (SVT), ICDAR 2003 (IC03) and ICDAR 2013 (IC13).
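The packing rule described above can be sketched as follows (a purely illustrative helper operating on (width, height) pairs rather than real images; the actual pipeline works on pixel arrays):

```python
def pack_widths(word_sizes, height=32, canvas_w=256, gap=20):
    """Illustrative sketch of the packing rule: scale each (w, h) word box
    to height 32 keeping the aspect ratio, cap the width at 256, and append
    it to the current 256-wide canvas after a 20-column gap of zeros when it
    fits; otherwise start a new canvas (padded to width 256 with zeros)."""
    canvases, current, used = [], [], 0
    for w, h in word_sizes:
        scaled_w = min(round(w * height / h), canvas_w)
        needed = scaled_w if not current else gap + scaled_w
        if used + needed <= canvas_w:
            current.append(scaled_w)
            used += needed
        else:
            canvases.append(current)
            current, used = [scaled_w], scaled_w
    if current:
        canvases.append(current)
    return canvases
```

For example, a 300×50 crop is rescaled to width 192 at height 32, which no longer fits after a 64-wide word plus the 20-column gap, so it starts a new canvas.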
IIIT5k (Mishra et al., 2012) contains 3,000 cropped word images collected from the Internet.
SVT (Kai et al., 2012) contains 647 word images cropped from 249 street-view images collected from Google Street View.
IC03 (Lucas et al., 2003) contains 251 scene images; we discard words that either contain non-alphanumeric characters or have less than three characters, and get 860 cropped word images.
IC13 (Karatzas et al., 2013) contains 1,095 word images in total; we discard words that contain non-alphanumeric characters, and get 1,015 word images with ground-truths. In addition, we construct a subset Train with the first 64,000 images taken from the training set to evaluate the model performance on the training data.

Figure 4. The number of characters for each class in Synth and Synth100. The word images containing digits in Synth100 are 100 times more numerous than those in Synth.
There are two networks used in our experiments: one is CRNN (Shi et al., 2017), and the other is a CNN that replaces the BLSTM layers in CRNN with residual blocks (He et al., 2016a). The network configurations are summarized in Table 1.

Unless otherwise stated, we use the CNN as the default network and Synth100 as the default training set in our experiments. We implement the network architecture within the Caffe (Jia et al., 2014) framework, with a custom implementation for the loss layer. Networks are trained with stochastic gradient descent (SGD). The weight decay is 0.0005, and the momentum is 0.9. The initial learning rate is 0.01, and it is decreased by a factor of 0.1 after every fixed number of iterations, denoted as the learning rate step. Three different training strategies are used in our experiments: bs32-400k: batch size = 32, learning rate step = 100,000, max iterations = 400,000; bs32-800k: batch size = 32, learning rate step = 200,000, max iterations = 800,000; bs256-40k: batch size = 256, learning rate step = 20,000, max iterations = 40,000. For all the experiments, we get the recognition results by lexicon-free best path decoding (Graves & Gomez, 2006).
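The solver settings and the three training strategies can be collected into a small config sketch (names are illustrative; the values are those stated above):

```python
# Solver settings and training strategies from the text (names illustrative).
SOLVER = {"optimizer": "SGD", "weight_decay": 0.0005,
          "momentum": 0.9, "base_lr": 0.01, "lr_factor": 0.1}

STRATEGIES = {
    "bs32-400k": {"batch_size": 32,  "lr_step": 100_000, "max_iters": 400_000},
    "bs32-800k": {"batch_size": 32,  "lr_step": 200_000, "max_iters": 800_000},
    "bs256-40k": {"batch_size": 256, "lr_step": 20_000,  "max_iters": 40_000},
}

def lr_at(iteration, strategy):
    """Step decay: multiply the learning rate by 0.1 every lr_step iterations."""
    steps = iteration // STRATEGIES[strategy]["lr_step"]
    return SOLVER["base_lr"] * SOLVER["lr_factor"] ** steps
```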
Table 1. Network configurations. 'Conv' is short for convolutional layer, and 'MP' is short for max pooling layer. 'c' stands for channels, which denotes the number of feature maps for a convolutional layer, the number of hidden units for a BLSTM layer, and the bottleneck channels for a residual unit. 'k', 's', 'p' stand for kernel, stride and padding sizes respectively. 'bn' stands for batch normalization, and 'softmax' stands for the softmax activation function. The residual unit used here is the full pre-activation version proposed in (He et al., 2016a). (a) CRNN: a stack of convolutional and max pooling layers followed by two BLSTM layers and a softmax output. (b) CNN: the same convolutional stack with the BLSTM layers replaced by residual units.

We propose four weighted forms of the CTC loss, and compare their algorithmic complexities to CTC. According to their loss function derivatives, the gradient updating procedures are performed based on $y_{tk}$ and $y'_{tk}$, which are also calculated for the CTC loss. First, a softmax activation is applied to get the normalized network outputs $\{y_{tk}\}$, whose time complexity is $O(T|L'|)$ for the CPU implementation and $O(|L'|)$ for the parallel GPU implementation. Then a dynamic-programming algorithm similar to the forward-backward algorithm for HMMs (Rabiner, 1993) is performed to calculate $\{y'_{tk}\}$, whose time complexity is $O(T|l|)$ for both CPU and GPU implementations. For the CTC loss, there is the final subtraction, whose time complexity is $O(T|L'|)$ for CPU and $O(1)$ for GPU. Meanwhile, the weighted losses need additional calculations of the weights. We obtain the time complexity of CTC by summing up the above terms, and list the time complexity of the additional calculation for each weighted loss in Table 2. It is obvious that the additional calculation does not change the time complexity, so the amount of additional calculation in the weighted losses is acceptable. This is also validated by experiments, where the changes in training time are negligible.

The space complexity of CTC is O(T|l|). Since the algorithm makes the most of the original space and keeps the additional space complexity of a weighted CTC down to O(T), the original space complexity is unchanged.

In one word, the proposed weighted CTC losses have the same time and space complexity as the CTC loss.

Table 2. The additional time complexity and training time for each weighted loss, compared with the CTC loss. Trn. Time denotes the training time spent for 100,000 iterations.

METHOD    COMPLEXITY-CPU        COMPLEXITY-GPU       TRN. TIME
CTC       O(T|l|) + O(T|L'|)    O(T|l|) + O(|L'|)    — min
CTCcs     O(T|L'|)              O(|L'|)              — min
CTCsp     O(T)                  O(1)                 — min
CTFLcs    O(T|L'|)              O(|L'|)              — min
CTFLsp    O(T|L'|)              O(|L'|)              — min
We train a group of models with each weighted CTC loss under variable hyper-parameters. The training strategy is bs32-400k. The convergence processes of the models are shown in Figure 5, and the recognition performances are illustrated in Figure 6. Note that when α is taken as 0.5 and γ is taken as 0, a weighted CTC loss becomes the CTC loss.

Figure 5. The convergence processes of models trained with the weighted CTC losses under variable hyper-parameters: (a) CTCcs, (b) CTCsp, (c) CTFLcs, (d) CTFLsp. The thinner lines indicate the CTC loss, and the wider lines denote the recognition accuracies on the training set.
The parameter α in CTCcs and CTCsp is used to adjust the ratio of influences coming from label and blank samples. Figure 5(a) shows that the effect of CTCcs is not ideal. Figure 5(b) suggests that the accuracy degradation is caused by excess blank samples, which is consistent with our class imbalance speculation, for the degradation is more obvious under a lower α. CTCsp can prevent the accuracy degradation by focusing more attention on label samples, but it brings no obvious improvement in the recognition performance, as shown in Figure 6(b). Based on our experience, CTCsp benefits the model performance in some situations, but the improvements are minor compared with CTFL. So we ignore the α-weighting for the rest of this paper, and focus on discussing the effects of CTFL.

Figure 6. Recognition accuracies on the four test sets and the training set. Each subfigure corresponds to the performances of models trained with a weighted CTC loss under variable hyper-parameters: (a) CTCcs, (b) CTCsp, (c) CTFLcs, (d) CTFLsp.

According to Figures 5(c) and 5(d), CTFL can prevent the accuracy degradation, as well as improve the recognition performance on the training set. CTFLsp performs slightly better than CTFLcs in recognition accuracy. Moreover, compared with CTFLcs, CTFLsp has no negative impact on the CTC loss, i.e. the probability of correctly recognizing a sequence. It means that CTFLsp will not damage the recognition performance when the decoding method is based on the probability. Figures 6(c) and 6(d) suggest γ = 2 for CTFLcs and γ = 1 for CTFLsp. These values are adopted for the rest of our experiments. However, it is possible that the optimal values of the parameters are different for different tasks.

To investigate the effectiveness of CTFL for imbalanced classes, we train a set of models with the CTC loss and CTFL on Synth and Synth100 respectively. The training strategy is bs32-400k. The recognition performances on the four test sets are presented in Table 3. The results are also illustrated in Figure 7 to provide a direct perception.

Comparing the models trained with the CTC loss, there are large margins between the accuracies on IIIT5k, IC03 and IC13 of the models trained on Synth and Synth100. Each accuracy margin is consistent with the digits ratio of the corresponding test set according to Table 3. Therefore, we speculate that the model trained with CTC on Synth cannot correctly recognize digits, and this is caused by the severe class imbalance in Synth as shown in Figure 4.
Table 3. Recognition accuracies (%) on four English scene text datasets. Digits Ratio (%) indicates the percentage of word examples containing digits in the dataset. Synth and Synth100 in the first column denote the training sets for the network; CTC and CTFL in the second column denote the loss functions used during training.

TRN. SET   METHOD        IIIT5K  SVT   IC03  IC13
           Digits Ratio  9.2     0     3.8   8.4
SYNTH      CTC           72.9    —     —     —
           CTFLcs        —       —     —     —
           CTFLsp        —       —     —     —
SYNTH100   CTC           80.0    75.6  88.7  —
           CTFLcs        —       —     —     —
           CTFLsp        —       —     —     —

Figure 7. The illustration of the recognition performances in Table 3.
The CTFL focuses the network on hard samples during training, and thus improves the network's recognition ability for disadvantaged classes, i.e. digits in this case. Comparing the models trained on Synth, the models trained with CTFL achieve much better performances than CTC, and their performances are nearly as good as those of models trained on the balanced training set Synth100.
The convergence process of a model is influenced by the training settings, including the batch size, learning rate, network architecture, etc. To evaluate the effect of CTFL on convergence for different training settings, in addition to the experiments above, we conduct another two groups of comparison experiments. The training settings and model performances for the three groups of experiments are shown in Table 4.

Table 4. The performances of models trained with different loss functions under three groups of training settings. Trn. St. is short for training settings, which include the network configuration and the training strategy. Note that CTC(800k) denotes the model trained with the CTC loss under strategy bs32-800k.

TRN. ST.         METHOD      IIIT5K  SVT   IC03  IC13  TRAIN
CNN, BS32-400K   CTC         80.0    75.6  88.7  —     —
                 CTFLcs      —       —     —     —     —
                 CTFLsp      —       —     —     —     —
CNN, BS256-40K   CTC         77.2    71.4  86.9  84.1  85.5
                 CTFLcs      —       —     —     —     —
                 CTFLsp      —       —     —     —     —
CRNN, BS32-400K  CTC         76.4    69.1  87.2  82.8  84.0
                 CTC(800K)   77.1    72.3  87.3  82.8  86.3
                 CTFLcs      —       —     —     —     —
                 CTFLsp      —       —     —     —     —

Comparing the models trained with the CTC loss over different training settings, the CNN and strategy bs32-400k lead to the best model performance. The convergence process is illustrated in Figure 5 with γ = 0. Accuracy degradation appears early during training, but as the learning rate decreases, the degradation gradually disappears and does not affect the final model performance. In this situation, the CTFL prevents the accuracy degradation and stabilizes the convergence process, but its effect of facilitating the convergence and improving the performance is not so obvious.

Compared with training strategy bs32-400k, the strategy bs256-40k leads to under-fitting models. According to Table 4, the accuracies of the models trained with the CTC loss drop by about 2-4% due to the different training strategy. The convergence processes under this strategy are illustrated in Figure 8. Compared with the CTC-based training, the CTFL greatly facilitates the convergence and leads to a much better model performance. We think that the advantages of CTFL over CTC are more obvious when the CTC-based training suffers from under-fitting. This opinion is also supported by the third group of experiments, where the under-fitting is caused by the network structure, for recurrent structures are usually difficult to train. As shown in Figure 9, the models trained with CTFL easily outperform the model trained with CTC in convergence, even compared with the CTC model trained for double the time.

On one hand, it usually takes a lot of effort to find the proper training settings. The CTFL facilitates the convergence, thus ensuring a relatively reasonable model performance for various training settings. On the other hand, under the same training settings, models trained with CTFL always achieve better or similar performances within fewer training iterations.
Convolutional recurrent neural network (CRNN) (Shi et al., 2017) is one of the state-of-the-art approaches in CTC-based text recognition, and it is adopted in our experiments. The results of CRNN in Table 4 fall behind the results reported in (Shi et al., 2017), because the training sets and test sets are different. Although the training and test sets are built from the same datasets, the details of the building process can be different, which affects the experimental results. As shown in Table 5, we test the trained model released by (Shi et al., 2017) (downloaded from their code webpage, and supposed to have a similar performance to the reported results) on our test sets, and get somewhat inferior results. To ensure a fair comparison, we compare results obtained on our test sets. Besides, we adopt the same training set and rescaling strategy as described in (Shi et al., 2017), despite slightly different training settings. According to Table 5, the CTC model gets similar results to the released CRNN model, and the model trained with CTFL achieves better results.

Figure 8. The convergence processes of CNN models trained under strategy bs256-40k.

Figure 9. The convergence processes of CRNN models trained under strategy bs32-400k or bs32-800k.
Table 5. The results of models with the CRNN architecture. Our models are trained with SGD, batch size 64, for 600k iterations on rescaled × images, then 200k iterations on variable-size images.

METHOD                                        IIIT5K   SVT    IC03   IC13
Reported (Shi et al., 2017)                   81.2     82.7   91.9   89.6
Results on our test sets (Shi et al., 2017)   80.3     81.6   90.0   86.2
CTC                                           80.2     79.9   90.9   86.6
CTFL_sp
4. Conclusion
In this paper, we provide a new perspective to modify the CTC method. The basic idea is to treat each timestep in a sequence, instead of the sequence itself, as a sample. Accordingly, we reinterpret the CTC loss over sequences as a cross entropy loss over timesteps through a pseudo ground-truth. The cross entropy form makes it possible for the CTC loss to cooperate with re-weighting strategies. We introduce label/blank weighting and focal loss to CTC and obtain four weighted CTC losses. The experiments show that the sample-weighted CTFL generally performs best among them. The proposed losses are proven to have the same complexity as CTC, and they offer the following benefits: eliminating the accuracy degradation, better performance when trained on imbalanced data, and faster and better convergence in some cases. Apart from the re-weighting method in this paper, the reinterpretation of CTC may be potentially useful in other contexts, which is worthy of exploration in the future.
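As a concrete illustration of the re-weighting idea, the following is a minimal NumPy sketch (ours, not the authors' implementation) of the sample-weighted cross entropy form of Equ. (35)-(36); the pseudo ground-truth `y_prime` is assumed to be given, e.g. recovered from the CTC gradient as in Equ. (30), and the function name `sample_weighted_ce` is illustrative:

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sample_weighted_ce(logits, y_prime, alpha=0.75, blank=0):
    """Sample-weighted cross entropy form of CTC (Equ. (35)-(36)):
    timesteps whose pseudo ground-truth is dominated by the blank
    class receive weight 1 - alpha instead of alpha."""
    y = np.clip(softmax(logits), 1e-12, None)
    alpha_t = alpha * (1 - y_prime[:, blank]) + (1 - alpha) * y_prime[:, blank]
    per_t = -(y_prime * np.log(y)).sum(axis=1)  # cross entropy per timestep
    return (alpha_t * per_t).sum()

# Toy example: 3 timesteps, blank + 2 labels, uniform network outputs.
logits = np.zeros((3, 3))
y_prime = np.array([[0.9, 0.1, 0.0],   # blank-dominated -> small weight
                    [0.1, 0.8, 0.1],   # label-dominated -> large weight
                    [1.0, 0.0, 0.0]])
loss = sample_weighted_ce(logits, y_prime)
# With uniform outputs each timestep's cross entropy is ln 3, so
# loss = (0.3 + 0.7 + 0.25) * ln 3.
```

Because the weights multiply per-timestep cross entropy terms rather than the sequence-level likelihood, any re-weighting scheme defined on timesteps plugs in the same way.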
References
Alex, G., Marcus, L., Santiago, F., Roman, B., Horst, B., and Jürgen, S. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 31(5):855–868, 2009.

Borisyuk, F., Gordo, A., and Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 71–79, 2018.

Feng, X., Yao, H., and Zhang, S. Focal CTC loss for Chinese optical character recognition on unbalanced datasets. Complexity, 2019:11, 2019. doi: 10.1155/2019/9345861.

Graves, A. and Gomez, F. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, pp. 369–376, 2006.

Graves, A. and Jaitly, N. Towards end-to-end speech recognition with recurrent neural networks. In Xing, E. P. and Jebara, T. (eds.), Proceedings of the 31st International Conference on Machine Learning, volume 32 of Proceedings of Machine Learning Research, pp. 1764–1772, Beijing, China, 2014. PMLR.

He, K., Zhang, X., Ren, S., and Sun, J. Identity mappings in deep residual networks. In European Conference on Computer Vision, pp. 630–645, 2016a.

He, P., Huang, W., Qiao, Y., Chen, C. L., and Tang, X. Reading scene text in deep convolutional sequences. In Thirtieth AAAI Conference on Artificial Intelligence, pp. 3501–3508, 2016b.

Hu, L., Sheng, J., and Changshui, Z. Connectionist temporal classification with maximum entropy regularization. 2018.

Huang, D.-A., Fei-Fei, L., and Niebles, J. C. Connectionist temporal modeling for weakly supervised action labeling. In European Conference on Computer Vision (ECCV), pp. 137–153. Springer, 2016.

Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. Synthetic data and artificial neural networks for natural scene text recognition. arXiv preprint, 2014.

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., and Long, J. Caffe: Convolutional architecture for fast feature embedding. pp. 675–678, 2014.

Kai, W., Babenko, B., and Belongie, S. End-to-end scene text recognition. In IEEE International Conference on Computer Vision, 2012.

Karatzas, D., Shafait, F., Uchida, S., Iwamura, M., Bigorda, L. G. I., Mestre, S. R., Mas, J., Mota, D. F., Almazan, J. A., and Heras, L. P. D. L. ICDAR 2013 robust reading competition. In International Conference on Document Analysis & Recognition, 2013.

Kim, S., Hori, T., and Watanabe, S. Joint CTC-attention based end-to-end speech recognition using multi-task learning. pp. 4835–4839, 2017.

Li, B., Liu, Y., and Wang, X. Gradient harmonized single-stage detector. AAAI, 2019.

Lin, T. Y., Goyal, P., Girshick, R., He, K., and Dollar, P. Focal loss for dense object detection. IEEE Transactions on Pattern Analysis & Machine Intelligence, PP(99):2999–3007, 2017.

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C. Y., and Berg, A. C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37, 2016.

Lucas, S. M., Panaretos, A., Sosa, L., Tang, A., Wong, S., and Young, R. ICDAR 2003 robust reading competitions. Proc. of the ICDAR, 7(2-3):105–122, 2003.

Miao, Y., Gowayyed, M., and Metze, F. EESEN: End-to-end speech recognition using deep RNN models and WFST-based decoding. In Automatic Speech Recognition & Understanding, 2016.

Mishra, A., Alahari, K., and Jawahar, C. V. Scene text recognition using higher order language priors. 2012.

Rabiner, L. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77, 1993.

Shi, B., Bai, X., and Yao, C. An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition. IEEE Transactions on Pattern Analysis & Machine Intelligence, 39(11):2298, 2017.

Shrivastava, A., Gupta, A., and Girshick, R. Training region-based object detectors with online hard example mining. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 761–769, 2016.
A. Cross Entropy Form of CTC
In the paper, we define a pseudo ground-truth $\mathbf{y}'_t = \{ y'_{tk} \mid k \in L' \}$, where

\[
y'_{tk} = \frac{\sum_{\pi \in \mathcal{B}^{-1}(l):\, \pi_t = k} p(\pi \mid x)}{p(l \mid x)}, \qquad
\frac{\partial y'_{tk}}{\partial y_{t'k'}} = 0, \quad \forall\, t, t' \in [1, T],\; k, k' \in L', \tag{25}
\]

to reinterpret the CTC loss as a sum of cross entropy losses. To this end, we need to prove that $\mathbf{y}'_t$ is a feasible solution of

\[
\mathrm{CTC}(l, x) = \sum_t \mathrm{CE}(\mathbf{y}'_t, \mathbf{y}_t) = -\sum_t \sum_k y'_{tk} \ln(y_{tk}), \tag{26}
\]

i.e., that given the definition of $\mathbf{y}'_t$, Equ. (26) holds. In the ideal situation where both the CTC loss and the cross entropy loss are 0, the equality holds. Therefore, we only need to prove that their derivatives are equivalent.

Let $\{ a_{tk} \mid t \in [1, T], k \in L' \}$ denote the unnormalized network outputs, which are normalized with the softmax activation,

\[
y_{tk} = \mathrm{softmax}(a_{tk}) = \frac{e^{a_{tk}}}{\sum_{k'} e^{a_{tk'}}}. \tag{27}
\]

It is easy to see that

\[
\frac{\partial y_{t'k'}}{\partial a_{tk}} =
\begin{cases}
0 & \text{if } t' \neq t, \\
y_{tk}(1 - y_{tk}) & \text{if } t' = t,\ k' = k, \\
- y_{tk}\, y_{tk'} & \text{if } t' = t,\ k' \neq k.
\end{cases} \tag{28}
\]

The derivative of the cross-entropy-form CTC with respect to $y_{tk}$ can be calculated as

\[
\frac{\partial \sum_t \mathrm{CE}(\mathbf{y}'_t, \mathbf{y}_t)}{\partial y_{tk}}
= -\frac{\partial \sum_{t',k'} y'_{t'k'} \ln(y_{t'k'})}{\partial y_{tk}}
= -\left( y'_{tk} \frac{\partial \ln(y_{tk})}{\partial y_{tk}} + \ln(y_{tk}) \frac{\partial y'_{tk}}{\partial y_{tk}} \right)
= -\frac{y'_{tk}}{y_{tk}}, \tag{29}
\]

and its derivative with respect to $a_{tk}$ as

\[
\begin{aligned}
\frac{\partial \sum_t \mathrm{CE}(\mathbf{y}'_t, \mathbf{y}_t)}{\partial a_{tk}}
&= \sum_{t',k'} \frac{\partial \sum_t \mathrm{CE}(\mathbf{y}'_t, \mathbf{y}_t)}{\partial y_{t'k'}} \frac{\partial y_{t'k'}}{\partial a_{tk}} \\
&= \left( -\frac{y'_{tk}}{y_{tk}} \right) y_{tk}(1 - y_{tk}) + \sum_{k' \neq k} \left( -\frac{y'_{tk'}}{y_{tk'}} \right) ( - y_{tk}\, y_{tk'} ) \\
&= y'_{tk}\, y_{tk} - y'_{tk} + \sum_{k' \neq k} y'_{tk'}\, y_{tk}
= y_{tk} \sum_{k'} y'_{tk'} - y'_{tk} \\
&= y_{tk}\, \frac{\sum_{\pi \in \mathcal{B}^{-1}(l)} p(\pi \mid x)}{p(l \mid x)} - y'_{tk}
= y_{tk} - y'_{tk}. \tag{30}
\end{aligned}
\]

This is equal to the derivative of CTC given in the paper, so Equ. (26) holds.

B. Derivation Process of
CTC$_{cs}$

The class-weighted CTC loss function is

\[
\mathrm{CTC}_{cs}(l, x) = -\sum_t \sum_k \alpha_k\, y'_{tk} \ln(y_{tk}), \tag{31}
\]

where

\[
\alpha_k =
\begin{cases}
1 - \alpha & \text{if } k = \mathrm{blank}, \\
\alpha & \text{otherwise.}
\end{cases} \tag{32}
\]

The derivative of $\mathrm{CTC}_{cs}$ with respect to $y_{tk}$ can be calculated as

\[
\frac{\partial \mathrm{CTC}_{cs}(l, x)}{\partial y_{tk}}
= -\frac{\partial \sum_{t',k'} \alpha_{k'}\, y'_{t'k'} \ln(y_{t'k'})}{\partial y_{tk}}
= -\frac{\alpha_k\, y'_{tk}}{y_{tk}}, \tag{33}
\]

and its derivative with respect to $a_{tk}$ as

\[
\begin{aligned}
\frac{\partial \mathrm{CTC}_{cs}(l, x)}{\partial a_{tk}}
&= \frac{\partial \mathrm{CTC}_{cs}}{\partial y_{tk}} \frac{\partial y_{tk}}{\partial a_{tk}} + \sum_{k' \neq k} \frac{\partial \mathrm{CTC}_{cs}}{\partial y_{tk'}} \frac{\partial y_{tk'}}{\partial a_{tk}} \\
&= \left( -\frac{\alpha_k\, y'_{tk}}{y_{tk}} \right) y_{tk}(1 - y_{tk}) + \sum_{k' \neq k} \left( -\frac{\alpha_{k'}\, y'_{tk'}}{y_{tk'}} \right) ( - y_{tk}\, y_{tk'} ) \\
&= \alpha_k\, y'_{tk}\, y_{tk} - \alpha_k\, y'_{tk} + \sum_{k' \neq k} \alpha_{k'}\, y'_{tk'}\, y_{tk}
= y_{tk} \sum_{k'} \alpha_{k'}\, y'_{tk'} - \alpha_k\, y'_{tk}. \tag{34}
\end{aligned}
\]

C. Derivation Process of
CTC$_{sp}$

The sample-weighted CTC loss function is

\[
\mathrm{CTC}_{sp}(l, x) = -\sum_t \alpha_t \sum_k y'_{tk} \ln(y_{tk}), \tag{35}
\]

where

\[
\alpha_t = \alpha \left( 1 - y'_{t,\mathrm{blank}} \right) + (1 - \alpha)\, y'_{t,\mathrm{blank}}. \tag{36}
\]

Since the derivative of $\alpha_{t'}$ with respect to $y_{tk}$ is

\[
\frac{\partial \alpha_{t'}}{\partial y_{tk}}
= \frac{\partial \left( \alpha (1 - y'_{t',\mathrm{blank}}) + (1 - \alpha)\, y'_{t',\mathrm{blank}} \right)}{\partial y_{tk}} = 0, \tag{37}
\]

its derivative with respect to $a_{tk}$ is also

\[
\frac{\partial \alpha_{t'}}{\partial a_{tk}} = 0. \tag{38}
\]

The derivative of $\mathrm{CTC}_{sp}$ with respect to $a_{tk}$ is therefore

\[
\frac{\partial \mathrm{CTC}_{sp}(l, x)}{\partial a_{tk}}
= -\frac{\partial \sum_{t'} \alpha_{t'} \sum_{k'} y'_{t'k'} \ln(y_{t'k'})}{\partial a_{tk}}
= -\alpha_t\, \frac{\partial \sum_{k'} y'_{tk'} \ln(y_{tk'})}{\partial a_{tk}}
= \alpha_t \left( y_{tk} - y'_{tk} \right). \tag{39}
\]

D. Derivation Process of
CTFL$_{cs}$

The class-weighted CTFL is defined as

\[
\mathrm{CTFL}_{cs}(l, x) = -\sum_t \sum_k \left| y_{tk} - y'_{tk} \right|^{\gamma} y'_{tk} \ln(y_{tk}), \tag{40}
\]

and its derivative with respect to $y_{tk}$ is

\[
\frac{\partial \mathrm{CTFL}_{cs}(l, x)}{\partial y_{tk}}
= -y'_{tk} \left| y_{tk} - y'_{tk} \right|^{\gamma - 1}
\left( \mathrm{sign}(y_{tk} - y'_{tk})\, \gamma \ln(y_{tk}) + \frac{\left| y_{tk} - y'_{tk} \right|}{y_{tk}} \right)
= -\frac{z_{tk}}{y_{tk}}, \tag{41}
\]

and its derivative with respect to $a_{tk}$ is

\[
\begin{aligned}
\frac{\partial \mathrm{CTFL}_{cs}(l, x)}{\partial a_{tk}}
&= \frac{\partial \mathrm{CTFL}_{cs}}{\partial y_{tk}} \frac{\partial y_{tk}}{\partial a_{tk}} + \sum_{k' \neq k} \frac{\partial \mathrm{CTFL}_{cs}}{\partial y_{tk'}} \frac{\partial y_{tk'}}{\partial a_{tk}} \\
&= \left( -\frac{z_{tk}}{y_{tk}} \right) y_{tk}(1 - y_{tk}) + \sum_{k' \neq k} \left( -\frac{z_{tk'}}{y_{tk'}} \right) ( - y_{tk}\, y_{tk'} )
= y_{tk} \sum_{k'} z_{tk'} - z_{tk}, \tag{42}
\end{aligned}
\]

where

\[
z_{tk} = y'_{tk} \left| y_{tk} - y'_{tk} \right|^{\gamma - 1} \left( \mathrm{sign}(y_{tk} - y'_{tk})\, \gamma\, y_{tk} \ln(y_{tk}) + \left| y_{tk} - y'_{tk} \right| \right). \tag{43}
\]

E. Derivation Process of
CTFL$_{sp}$

The sample-weighted CTFL is given as

\[
\mathrm{CTFL}_{sp}(l, x) = -\sum_t \left( \sum_k \left| y_{tk} - y'_{tk} \right|^{\gamma} \right) \left( \sum_k y'_{tk} \ln(y_{tk}) \right), \tag{44}
\]

and its derivative with respect to $a_{tk}$ is

\[
\frac{\partial \mathrm{CTFL}_{sp}(l, x)}{\partial a_{tk}}
= \left( y_{tk} - y'_{tk} \right) \sum_{k'} \left| y_{tk'} - y'_{tk'} \right|^{\gamma}
+ \left( y_{tk} \sum_{k'} z_{tk'} - z_{tk} \right) \sum_{k'} y'_{tk'} \ln(y_{tk'}), \tag{45}
\]

where

\[
z_{tk} = \mathrm{sign}(y_{tk} - y'_{tk})\, \gamma\, y_{tk} \left| y_{tk} - y'_{tk} \right|^{\gamma - 1}. \tag{46}
\]

To simplify the calculation, we assume that the partial derivative of the sample weight $\sum_k | y_{tk} - y'_{tk} |^{\gamma}$ with respect to $a_{tk}$ is negligible,

\[
\frac{\partial \sum_k \left| y_{tk} - y'_{tk} \right|^{\gamma}}{\partial a_{tk}} \approx 0, \tag{47}
\]

so the simplified derivative of $\mathrm{CTFL}_{sp}$ can be calculated as

\[
\frac{\partial \mathrm{CTFL}_{sp}(l, x)}{\partial a_{tk}}
= -\frac{\partial \sum_t \left( \sum_k | y_{tk} - y'_{tk} |^{\gamma} \right) \left( \sum_k y'_{tk} \ln(y_{tk}) \right)}{\partial a_{tk}}
\approx -\left( \sum_{k'} \left| y_{tk'} - y'_{tk'} \right|^{\gamma} \right) \frac{\partial \sum_{k'} y'_{tk'} \ln(y_{tk'})}{\partial a_{tk}}
= \left( y_{tk} - y'_{tk} \right) \sum_{k'} \left| y_{tk'} - y'_{tk'} \right|^{\gamma}. \tag{48}
\]
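The derivations above can be checked numerically. The following NumPy sketch is our own, not the authors' code; the helper names `ctc_loss_and_pseudo_gt`, `fd_grad` and `collapse` are illustrative, and the brute-force path enumeration is feasible only for tiny T and K. It verifies Equ. (30), (34), (39), (42)-(43) and (45)-(46) against central finite differences, with γ = 2 and the pseudo ground-truth held constant as in Equ. (25):

```python
import itertools
import numpy as np

def softmax(a):
    e = np.exp(a - a.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fd_grad(f, a, eps=1e-6):
    """Central finite-difference gradient of the scalar f w.r.t. logits a."""
    g = np.zeros_like(a)
    for idx in np.ndindex(*a.shape):
        ap, am = a.copy(), a.copy()
        ap[idx] += eps
        am[idx] -= eps
        g[idx] = (f(ap) - f(am)) / (2 * eps)
    return g

def collapse(path, blank=0):
    """The map B: remove repeated labels, then blanks."""
    out = []
    for p in path:
        if not out or p != out[-1]:
            out.append(p)
    return tuple(p for p in out if p != blank)

def ctc_loss_and_pseudo_gt(a, label, blank=0):
    """Brute-force -ln p(l|x) and the pseudo ground-truth y'_tk of
    Equ. (25), by enumerating every path of length T."""
    T, K = a.shape
    y = softmax(a)
    p_l, y_p = 0.0, np.zeros((T, K))
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) != label:
            continue
        p = np.prod([y[t, k] for t, k in enumerate(path)])
        p_l += p
        for t, k in enumerate(path):
            y_p[t, k] += p
    return -np.log(p_l), y_p / p_l

rng = np.random.default_rng(0)
T, K, blank, alpha, gamma = 4, 3, 0, 0.75, 2.0
a = rng.normal(size=(T, K))
label = (1, 2)

# Equ. (25)/(30): each y'_t is a distribution, and the CTC gradient is y - y'.
_, yp = ctc_loss_and_pseudo_gt(a, label)
y = softmax(a)
assert np.allclose(yp.sum(axis=1), 1.0)
assert np.allclose(fd_grad(lambda x: ctc_loss_and_pseudo_gt(x, label)[0], a),
                   y - yp, atol=1e-5)

# The weighted losses treat y' as a constant target, so fix it here.
ak = np.where(np.arange(K) == blank, 1 - alpha, alpha)        # Equ. (32)
at = alpha * (1 - yp[:, blank]) + (1 - alpha) * yp[:, blank]  # Equ. (36)
ctc_cs = lambda x: -(ak * yp * np.log(softmax(x))).sum()                # (31)
ctc_sp = lambda x: -(at * (yp * np.log(softmax(x))).sum(1)).sum()       # (35)
ctfl_cs = lambda x: -(np.abs(softmax(x) - yp) ** gamma
                      * yp * np.log(softmax(x))).sum()                  # (40)
ctfl_sp = lambda x: -((np.abs(softmax(x) - yp) ** gamma).sum(1)
                      * (yp * np.log(softmax(x))).sum(1)).sum()         # (44)

# Equ. (34) and Equ. (39).
assert np.allclose(fd_grad(ctc_cs, a),
                   y * (ak * yp).sum(1, keepdims=True) - ak * yp, atol=1e-5)
assert np.allclose(fd_grad(ctc_sp, a), at[:, None] * (y - yp), atol=1e-5)

# Equ. (42)-(43).
d = y - yp
z = yp * np.abs(d) ** (gamma - 1) * (np.sign(d) * gamma * y * np.log(y)
                                     + np.abs(d))
assert np.allclose(fd_grad(ctfl_cs, a),
                   y * z.sum(1, keepdims=True) - z, atol=1e-5)

# Equ. (45)-(46) is the exact gradient; Equ. (48) drops the focal-weight term.
w = (np.abs(d) ** gamma).sum(1, keepdims=True)
zs = np.sign(d) * gamma * y * np.abs(d) ** (gamma - 1)
ce = (yp * np.log(y)).sum(1, keepdims=True)
assert np.allclose(fd_grad(ctfl_sp, a),
                   d * w + (y * zs.sum(1, keepdims=True) - zs) * ce, atol=1e-5)
```

Note that the last check confirms Equ. (45) exactly, while the simplified gradient of Equ. (48) keeps only its first term.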