PP-OCR: A Practical Ultra Lightweight OCR System
Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, Haoshuang Wang
Baidu Inc.
{duyuning, yangyehua}@baidu.com

Abstract
Optical Character Recognition (OCR) systems have been widely used in various application scenarios, such as office automation (OA) systems, factory automation, online education and map production. However, OCR is still a challenging task due to the variety of text appearances and the demand for computational efficiency. In this paper, we propose a practical ultra lightweight OCR system, i.e., PP-OCR. The overall model size of PP-OCR is only 3.5M for recognizing 6622 Chinese characters and 2.8M for recognizing 63 alphanumeric symbols. We introduce a bag of strategies to either enhance the model ability or reduce the model size. The corresponding ablation experiments with real data are also provided. Meanwhile, several pre-trained models for Chinese and English recognition are released, including a text detector (trained on 97K images), a direction classifier (600K images) and a text recognizer (17.9M images). Besides, the proposed PP-OCR is also verified on several other language recognition tasks, including French, Korean, Japanese and German. All of the above-mentioned models are open-sourced and the code is available in the GitHub repository https://github.com/PaddlePaddle/PaddleOCR.
OCR (Optical Character Recognition), a technology that aims to recognize text in images automatically as shown in Figure 1, has a long research history and a wide range of application scenarios, such as document digitization, identity authentication, digital financial systems and vehicle license plate recognition. Moreover, in factories, products can be managed more conveniently by extracting their text information automatically. Students' offline homework or test papers can be digitized with an OCR system to make communication between teachers and students more efficient. OCR can also be used for labeling the points of interest (POI) of a street view image, improving map production efficiency. Rich application scenarios endow OCR technology with great commercial value and, at the same time, with many challenges.
Various Text Appearances
Text in images can generally be divided into two categories: scene text and document text. Scene text refers to text in natural scenes, as shown in Figure 3, which usually varies dramatically due to factors such as perspective, scaling, bending, clutter, fonts, multilingual content, blur and illumination. Document text, as shown in Figure 4, is more often encountered in practical applications. It raises different problems caused by high text density and long text lines. In addition, document image text recognition often comes with the need to structure the results, which introduces a new hard task.

Figure 1: Some image results of the proposed PP-OCR system.

Copyright © 2021, All rights reserved.
Computational Efficiency
In practice, the images that need to be processed are usually massive in number, which makes high computational efficiency an important criterion for designing an OCR system. Considering the cost, CPUs are preferred over GPUs. In particular, the OCR system needs to run on embedded devices such as cell phones in many scenarios, which makes it necessary to consider the model size. Trading off model size against performance is difficult but of great value. In this paper, we propose a practical ultra lightweight OCR system, named PP-OCR, which consists of three parts: text detection, detected box rectification and text recognition, as shown in Figure 2.

Figure 2: The framework of the proposed PP-OCR. The model size in the figure is for Chinese and English character recognition. For alphanumeric symbol recognition, the model size of text recognition is from 1.6M to 0.9M. The rest of the models are the same size.

Figure 3: Some images containing scene text.

Text Detection
The purpose of text detection is to locate the text areas in the image. In PP-OCR, we use Differentiable Binarization (DB) (Liao et al. 2020), which is based on a simple segmentation network, as the text detector. The simple post-processing of DB makes it very efficient. To further improve its effectiveness and efficiency, the following six strategies are used: light backbone, light head, removing the SE module, cosine learning rate decay, learning rate warm-up and FPGM pruner. Finally, the model size of the text detector is reduced to 1.4M.
Detection Box Rectification
Before recognizing the detected text, the text box needs to be transformed into a horizontal rectangular box for subsequent text recognition. This is easy to achieve with a geometric transformation, since the detection frame is composed of four points. However, the rectified boxes may be reversed. Thus, a classifier is needed to determine the text direction. If a box is determined to be reversed, it is flipped. Training a text direction classifier is a simple image classification task. We adopt the following four strategies to enhance the model ability and reduce the model size: light backbone, data augmentation, input resolution and PACT quantization. Finally, the model size of the text direction classifier is 500KB.

Figure 4: Some images containing document text.
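The flipping step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the crop is a plain list of pixel rows, and the names `rotate_180`, `fix_direction`, `reversed_score` and the confidence threshold of 0.9 are hypothetical and do not come from the PP-OCR code base.

```python
def rotate_180(image):
    """Rotate a 2-D image (a list of pixel rows) by 180 degrees."""
    return [row[::-1] for row in image[::-1]]


def fix_direction(image, reversed_score, threshold=0.9):
    """Flip the rectified crop only if the classifier is confident it is reversed.

    reversed_score: hypothetical classifier probability that the crop is upside down.
    threshold: assumed confidence threshold, not a value from the paper.
    """
    if reversed_score >= threshold:
        return rotate_180(image)
    return image


crop = [[1, 2, 3],
        [4, 5, 6]]
flipped = fix_direction(crop, reversed_score=0.97)    # classifier says "reversed"
unchanged = fix_direction(crop, reversed_score=0.20)  # classifier says "upright"
```

Gating the flip on a confidence threshold is a design choice of this sketch: it avoids flipping crops on which the classifier is unsure.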
Text Recognition
In PP-OCR, we use CRNN (Shi, Bai, and Yao 2016), which is widely used and practical for text recognition, as the text recognizer. CRNN integrates feature extraction and sequence modeling. It adopts the Connectionist Temporal Classification (CTC) loss to avoid the inconsistency between prediction and label. To enhance the model ability and reduce the model size of the text recognizer, the following nine strategies are used: light backbone, data augmentation, cosine learning rate decay, feature map resolution, regularization parameters, learning rate warm-up, light head, pre-trained model and PACT quantization. Finally, the model size of the text recognizer is only 1.6M for Chinese and English recognition and 900KB for alphanumeric symbol recognition.

Figure 5: Architecture of the text detector DB. This figure comes from the paper of DB (Liao et al. 2020). The red and gray rectangles show the backbone and head of the text detector, respectively.

In order to implement a practical OCR system, we construct a large-scale dataset for Chinese and English recognition as an example. Specifically, the text detection dataset has 97K images, the direction classification dataset has 600K images and the text recognition dataset has 17.9M images. A small amount of the data is selected to conduct ablation experiments quickly and choose the appropriate strategies. We conduct many ablation experiments to show the effects of the different strategies in Figure 2. Besides, we also verify the proposed PP-OCR system on the recognition of other languages, including alphanumeric symbols, French, Korean, Japanese and German.

The rest of the paper is organized as follows. In Section 2, we present the bag of model enhancement or slimming strategies. Experimental results are discussed in Section 3 and conclusions are drawn in Section 4.
In this section, the details of the six strategies for enhancing the model ability or reducing the model size of the text detector are introduced. Figure 5 shows the architecture of the text detector DB.
Light Backbone
The size of the backbone has a dominant effect on the model size of a text detector. Therefore, light backbones should be selected for building ultra lightweight models. With the development of image classification, the MobileNetV1, MobileNetV2, MobileNetV3 and ShuffleNetV2 series are often used as light backbones. Each series comes in different scales. PaddleClas (https://github.com/PaddlePaddle/PaddleClas/) provides the CPU inference time and accuracy of more than 20 kinds of backbones. As shown in Figure 6, MobileNetV3 achieves higher accuracy when the prediction time is the same. As for the choice of scale, we adopt MobileNetV3_large_x0.5 to balance accuracy and efficiency empirically. Incidentally, PaddleClas provides up to 24 series of image classification network structures and training configurations, together with pretrained weights and evaluation metrics for 122 models, such as ResNet, ResNet_vd, SEResNeXt, Res2Net, Res2Net_vd, DPN, DenseNet, EfficientNet, Xception, HRNet, etc.

Figure 6: The performance of some light backbones on ImageNet 1000 classification, including the MobileNetV1, MobileNetV2, MobileNetV3 and ShuffleNetV2 series. The inference time is tested on Snapdragon 855 (SD855) with the batch size set to 1.
Light Head
The head of the text detector is similar to the FPN (Lin et al. 2017) architecture in object detection and fuses feature maps of different scales to improve the detection of small text regions. To make it convenient to merge feature maps of different resolutions, [...] × [...] convolution is often used to reduce the feature maps to the same number of channels (we use "inner channels" for short). The probability map and the threshold map are generated from the fused feature map with convolutions whose size also depends on the inner channels. Thus the number of inner channels has a great influence on the model size. When the inner channels are reduced from 256 to 96, the model size is reduced from 7M to 4.1M, while the accuracy declines only slightly.

Figure 7: Architecture of the SE block. This figure comes from the paper (Hu, Shen, and Sun 2018).

Figure 8: Comparison of different ways of learning rate decay.

Remove SE
SE is short for squeeze-and-excitation (Hu, Shen, and Sun 2018). As shown in Figure 7, SE blocks model the inter-dependencies between channels explicitly and re-calibrate channel-wise feature responses adaptively. Because SE blocks can clearly improve the accuracy of vision tasks, the search space of MobileNetV3 contains them and the MobileNetV3 architecture includes numerous SE blocks. However, when the input resolution is large, such as [...] × [...], it is hard to estimate the channel-wise feature responses with the SE block. The accuracy improvement is limited, but the time cost is very high. When the SE blocks are removed from the backbone, the model size is reduced from 4.1M to 2.5M, with no effect on the accuracy.

Cosine Learning Rate Decay
The learning rate is the hyperparameter that controls the learning speed. The lower the learning rate, the slower the loss value changes. Although a low learning rate ensures that no local minimum is missed, it also means that convergence is slow. In the early stage of training, the weights are in a randomly initialized state, so we can set a relatively large learning rate for faster convergence. In the late stage of training, the weights are close to the optimal values, so a relatively small learning rate should be used. Cosine learning rate decay has become the preferred learning rate reduction strategy for improving model accuracy. During the entire training process, cosine learning rate decay keeps a relatively large learning rate, so its convergence is slower, but the final convergence accuracy is better. Figure 8 compares the different ways of learning rate decay.

Figure 9: Illustration of FPGM Pruner. This figure comes from the paper (He et al. 2019b).
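As a rough illustration of the cosine learning rate decay discussed above, the following sketch computes the commonly used half-cosine schedule without restarts. The exact schedule used in PP-OCR is not given in the text, so the formula, the function name `cosine_decay_lr` and the example values (`base_lr=0.001`, 100 steps) are assumptions.

```python
import math


def cosine_decay_lr(step, total_steps, base_lr):
    """Half-cosine decay from base_lr (step 0) down to 0 (step total_steps)."""
    # Clamp the step so the function is safe outside [0, total_steps].
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


# Learning rate at the start, middle and end of a 100-step run.
schedule = [cosine_decay_lr(s, 100, 0.001) for s in (0, 50, 100)]
```

The rate stays close to `base_lr` for a large part of training and only drops quickly in the middle, which matches the observation above that the learning rate remains relatively large during most of the process.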
Learning Rate Warm-up
He et al. (2019a) show that learning rate warm-up can help to improve accuracy in image classification. At the beginning of the training process, a too large learning rate may result in numerical instability, so a small learning rate is recommended. Once the training process is stable, the initial learning rate is used. For text detection, the experiments show that this strategy is also effective.
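A minimal sketch of learning rate warm-up follows. It assumes a linear ramp from a small starting value to the initial learning rate; the ramp shape, the function name `warmup_lr` and the numbers in the example are illustrative assumptions, not the PP-OCR settings.

```python
def warmup_lr(step, warmup_steps, base_lr, start_lr=0.0):
    """Linearly increase the learning rate from start_lr to base_lr.

    After warmup_steps the initial learning rate base_lr is returned unchanged,
    so a decay schedule can take over from there.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return start_lr + (base_lr - start_lr) * step / warmup_steps


# Learning rate during and after a 10-step warm-up.
ramp = [warmup_lr(s, 10, 0.001) for s in (0, 5, 10, 20)]
```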
FPGM Pruner
Pruning is another method to improve the inference efficiency of a neural network model. In order to avoid the performance degradation caused by model pruning, we use FPGM (He et al. 2019b) to find the unimportant sub-network in the original models. FPGM uses the geometric median as the criterion, and each filter in a convolution layer is considered as a point in Euclidean space. The geometric median of these points is then calculated and the filters with similar values are removed, as shown in Figure 9. The compression ratio of each layer is also important for pruning a model. Pruning every layer uniformly usually leads to significant performance degradation. In PP-OCR, the pruning sensitivity of each layer is calculated according to the method in (Li et al. 2016) and then used to evaluate the redundancy of each layer.
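The geometric-median criterion can be sketched as follows. This is a simplified illustration, not the PaddleSlim implementation: it uses a common approximation in which the filters with the smallest summed distance to all other filters of the layer are treated as closest to the geometric median and selected for pruning. The function name `fpgm_select_filters`, the flattened toy filters and the prune ratio are hypothetical.

```python
import math


def fpgm_select_filters(filters, prune_ratio):
    """Return the indices of the filters to prune in one layer.

    filters: list of flattened filters (each a list of floats).
    prune_ratio: fraction of the filters of this layer to remove.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    n = len(filters)
    num_prune = int(n * prune_ratio)
    # A small summed distance means the filter lies near the geometric median,
    # i.e. it is the most replaceable by the remaining filters.
    scores = [sum(dist(f, g) for g in filters) for f in filters]
    order = sorted(range(n), key=lambda i: scores[i])
    return sorted(order[:num_prune])


toy_filters = [
    [0.0, 0.0],
    [10.0, 0.0],
    [0.0, 10.0],
    [3.0, 3.0],  # lies between the other three filters
]
pruned = fpgm_select_filters(toy_filters, prune_ratio=0.25)
```

In a per-layer scheme such as the one described above, `prune_ratio` would differ from layer to layer according to the measured sensitivity.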
In this section, the details of the four strategies for enhancing the model ability or reducing the model size of the direction classifier are introduced.
Light Backbone
As in the text detector, we adopt MobileNetV3 as the backbone of the direction classifier. Because this task is relatively simple, we use MobileNetV3_small_x0.35 to balance accuracy and efficiency empirically. Larger backbones bring no further accuracy improvement.
Data Augmentation
Yu et al. (2020) show several image processing operations for training a text recognizer, such as rotation, perspective distortion, motion blur and Gaussian noise. These operations are referred to as BDA (Base Data Augmentation) for short. They are randomly applied to the training images. The experiments show that BDA is also useful for training the direction classifier. Besides BDA, some new data augmentation operations have been proposed recently for improving image classification, for example AutoAugment (Cubuk et al. 2019), RandAugment (Cubuk et al. 2020), CutOut (DeVries and Taylor 2017), RandErasing (Zhong et al. 2020), HideAndSeek (Singh and Lee 2017), GridMask (Chen 2020), Mixup (Zhang et al. 2017) and Cutmix (Yun et al. 2019). However, the experiments show that most of them do not work for direction classifier training, except for RandAugment and RandErasing. RandAugment works best. Eventually, we apply BDA and RandAugment to the training images of the direction classifier.
Input Resolution
In general, when the input resolution of a normalized image is increased, the accuracy also improves. Since the backbone of the direction classifier is very light, increasing the resolution moderately does not noticeably increase the computation time. In most of the previous text recognition methods, the height and width of a normalized image are set as [...] and [...], respectively. However, in PP-OCR, the height and width are set as [...] and [...], respectively, to improve the accuracy of the direction classifier.

PACT Quantization
Quantization allows a neural network model to have lower latency, smaller volume and lower computational power consumption. At present, quantization is mainly divided into two categories: offline quantization and online quantization. Offline quantization refers to a fixed-point quantization method that uses techniques such as KL divergence and moving average to determine the quantization parameters and does not require retraining. Online quantization determines the quantization parameters during the training process, which can provide less quantization loss than the offline mode.

PACT (PArameterized Clipping acTivation) (Choi et al. 2018) is a new online quantization method that removes some outliers from the activations in advance. After removing the outliers, the model can learn more appropriate quantization scales. The formula for PACT to preprocess the activations is as follows:

$$
y = \mathrm{PACT}(x) = 0.5\,(|x| - |x - \alpha| + \alpha) =
\begin{cases}
0, & x \in (-\infty, 0) \\
x, & x \in [0, \alpha) \\
\alpha, & x \in [\alpha, +\infty)
\end{cases}
\tag{1}
$$

The activation preprocessing of the ordinary PACT method is based on the ReLU function. All activation values greater than a certain threshold are truncated. However, the activation functions in MobileNetV3 are not only ReLU, but also hard swish. Using ordinary PACT quantization leads to a higher quantization loss. Therefore, we modify the formula of the activation preprocessing as follows to reduce the quantization loss:

$$
y = \mathrm{PACT}(x) =
\begin{cases}
-\alpha, & x \in (-\infty, -\alpha) \\
x, & x \in [-\alpha, \alpha) \\
\alpha, & x \in [\alpha, +\infty)
\end{cases}
\tag{2}
$$

Figure 10: Architecture of the text recognizer CRNN. This figure comes from the paper (Shi, Bai, and Yao 2016). The red and gray rectangles show the backbone and head of the text recognizer, respectively.

We use the improved PACT quantization method to quantize the direction classifier model.
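The two clipping rules described above, the ordinary ReLU-based PACT preprocessing and the modified symmetric version, can be written as scalar functions. The function names `pact_relu` and `pact_symmetric` are hypothetical, and the sketch covers only the clipping of a single activation value, not the learning of α or the quantization itself.

```python
def pact_relu(x, alpha):
    """Ordinary PACT clipping: 0 below zero, x on [0, alpha), alpha above."""
    return 0.5 * (abs(x) - abs(x - alpha) + alpha)


def pact_symmetric(x, alpha):
    """Modified PACT clipping: values are limited to the range [-alpha, alpha]."""
    return max(-alpha, min(x, alpha))


alpha = 2.0
samples = [-5.0, -1.0, 1.0, 5.0]
relu_clipped = [pact_relu(x, alpha) for x in samples]
symmetric_clipped = [pact_symmetric(x, alpha) for x in samples]
```

The symmetric version keeps negative activations such as -1.0, which the ReLU-based rule maps to 0. This matters for activation functions like hard swish that can produce negative outputs.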
In addition, L2 regularization with a coefficient of 0.001 is added to the PACT parameters to improve the model robustness.

The implementation of the above FPGM pruner and PACT quantization is based on PaddleSlim. PaddleSlim is a toolkit for model compression. It contains a collection of compression strategies, such as pruning, fixed-point quantization, knowledge distillation, hyperparameter search and neural architecture search.

In this section, the details of the nine strategies for enhancing the model ability or reducing the model size of the text recognizer are introduced. Figure 10 shows the architecture of the text recognizer CRNN.
Light Backbone
As in text detection, we adopt MobileNetV3 as the backbone of the text recognizer. MobileNetV3_small_x0.5 is selected to balance accuracy and efficiency empirically. If the model size is not a major concern, MobileNetV3_small_x1.0 is also a good choice: the model size increases by only 2M, while the accuracy improves noticeably.
Data Augmentation
Besides BDA (Base Data Augmentation), which is often used in text recognition as mentioned earlier, TIA (Luo et al. 2020) is also an effective data augmentation method for text recognition. As shown in Figure 11, a set of fiducial points is first initialized on the image. The points are then moved randomly to generate a new image through a geometric transformation. In PP-OCR, we apply BDA and TIA to the training images of the text recognizer.

PaddleSlim: https://github.com/PaddlePaddle/PaddleSlim/

Figure 11: Illustration of the data augmentation method TIA. This figure comes from the paper (Luo et al. 2020).

Figure 12: Illustration of the modification of the feature map resolution. The table comes from the paper (Howard et al. 2019).

Cosine Learning Rate Decay
As mentioned for text detection, cosine learning rate decay has become the preferred learning rate reduction method. The experiments show that the cosine learning rate decay strategy is also effective in enhancing the model ability for text recognition.
Feature Map Resolution
In order to adapt to multilingual recognition, particularly Chinese recognition, the height and width of the CRNN input in PP-OCR are set as [...] and [...]. With this input, the strides of the original MobileNetV3 are not appropriate for text recognition. As shown in Figure 12, to keep more horizontal information, we modify the stride of every down sampling feature map except the first one from (2,2) to (2,1). To keep more vertical information, we further modify the stride of the second down sampling feature map from (2,1) to (1,1). Thus, the stride of the second down sampling feature map, s, affects the resolution of the whole feature map and the accuracy of the text recognizer dramatically. In PP-OCR, s is set as (1,1) to achieve better performance empirically.

Regularization Parameters
Overfitting is a common term in machine learning. Put simply, it means that the model performs well on the training data but poorly on the test data. To avoid overfitting, many regularization methods have been proposed. Among them, weight decay is one of the most widely used. An L2 regularization term (L2 decay) is added to the final loss function. With the help of L2 regularization, the weights of the network tend to take smaller values, the parameters of the entire network tend toward 0, and the generalization performance of the model improves accordingly. For text recognition, L2 decay has a great influence on the accuracy.
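The effect of L2 decay on the weights can be illustrated with a small sketch. It assumes the penalty `coeff * sum(w^2)` and a plain gradient-descent update; some frameworks scale the penalty by 0.5, and the models in this paper are trained with Adam, so the update rule, the function names and the numbers below are illustrative assumptions only.

```python
def l2_penalty(weights, coeff):
    """L2 regularization term that is added to the task loss."""
    return coeff * sum(w * w for w in weights)


def sgd_step_with_l2(weights, grads, lr, coeff):
    """One gradient step on (task loss + L2 penalty).

    The penalty contributes the extra gradient 2 * coeff * w, which pulls
    every weight toward zero in addition to the task gradient.
    """
    return [w - lr * (g + 2.0 * coeff * w) for w, g in zip(weights, grads)]


weights = [1.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]  # isolate the effect of the penalty
shrunk = sgd_step_with_l2(weights, zero_grads, lr=0.1, coeff=0.01)
```

With a zero task gradient each weight is multiplied by `1 - 2 * lr * coeff`, which is the shrinking behaviour described above.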
Learning Rate Warm-up
As in text detection, learning rate warm-up also helps text recognition. The experiments show that this strategy is effective for text recognition as well.
Light Head
A fully connected layer is usually used to map the sequence features to the predicted characters. The dimension of the sequence features has an impact on the model size of a text recognizer, especially for Chinese recognition, which involves more than 6 thousand characters. Meanwhile, a higher dimension does not necessarily give a stronger sequence feature representation. In PP-OCR, the dimension of the sequence features is set to 48 empirically.
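To see why the sequence feature dimension matters for a large character set, the parameter count of the final fully connected layer can be computed directly. This sketch counts only that layer, and it assumes 6622 characters plus one extra CTC blank class; the blank class and the function name `fc_parameter_count` are assumptions, and the rest of the head is ignored.

```python
def fc_parameter_count(feature_dim, num_classes, bias=True):
    """Number of weights (and biases) of a fully connected layer."""
    return feature_dim * num_classes + (num_classes if bias else 0)


NUM_CLASSES = 6622 + 1  # assumption: 6622 characters plus one CTC blank

params_256 = fc_parameter_count(256, NUM_CLASSES)
params_48 = fc_parameter_count(48, NUM_CLASSES)
print(params_256, params_48)  # 1702111 324527
```

Under these assumptions, reducing the dimension from 256 to 48 shrinks this layer by roughly a factor of five.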
Pre-trained Model
If the training data is limited, existing networks that were trained on a large dataset such as ImageNet can be fine-tuned to achieve fast convergence and better accuracy. Transfer learning in image classification and object detection shows that this strategy is effective. In real scenes, the data available for text recognition is often limited. If models are first trained with tens of millions of samples, even synthesized ones, the accuracy can be significantly improved by starting from those models. We demonstrate the effectiveness of this strategy through experiments.
PACT Quantization
We adopt a quantization scheme similar to that of the direction classifier to reduce the model size of the text recognizer, except that the LSTM layers are skipped. Those layers are not quantized at present because of the complexity of LSTM quantization.
DataSets
As shown in Table 1, in order to implement a practical OCR system, we construct a large-scale dataset for Chinese and English recognition as an example.

Task | Training Total | Training Real | Training Synthesis | Validation Real
Text Detection | 97K | 68K | 29K | 500
Direction Classification | 600K | 100K | 500K | 310K
Text Recognition | 17.9M | 1.9M | 16M | 18.7K

Table 1: Statistics of dataset for Chinese and English Recognition.

Task | Character Number | Training Total | Training Real | Training Synthesis | Validation Total | Validation Real | Validation Synthesis
Chinese and English Recognition | 6622 | 17.9M | 1.9M | 16M | 18.7K | 18.7K | 0
Alphanumeric Symbols Recognition | 63 | 15M | 0 | 15M | 12K | 12K | 0
French Recognition | 118 | 1.08M | 0 | 1.08M | 80K | 0 | 80K
Japanese Recognition | 4399 | 0.99M | 0 | 0.99M | 80K | 0 | 80K
Korean Recognition | 3636 | 0.94M | 0 | 0.94M | 80K | 0 | 80K
German Recognition | 131 | 1.96M | 0 | 1.96M | 170K | 0 | 170K

Table 2: Statistics of dataset for multilingual recognition.

For text detection, there are 97K training images and 500 validation images. Among the training images, 68K are real scene images, which come from some public datasets and Baidu image search. The public datasets used include LSVT (Sun et al. 2019), RCTW-17 (Shi et al. 2017), MTWI 2018 (He and Yang 2018), CASIA-10K (He et al. 2018), SROIE (Huang et al. 2019), MLT 2019 (Nayef et al. 2019), BDI (Karatzas et al. 2011), MSRA-TD500 (Yao et al. 2012) and CCPD 2019 (Xu et al. 2018). Most of the training images from Baidu image search are document text images. The remaining 29K synthetic images mainly focus on scenarios with long text, multi-direction text and text in tables. All the validation images come from real scenes.

For direction classification, there are 600K training images and 310K validation images. Among the training images, 100K are real scene images, which come from the public datasets (LSVT, RCTW-17, MTWI 2018). They are horizontal text images obtained by rectifying and cropping the ground truth regions of the images. The remaining 500K synthetic images mainly focus on reversed text. We use vertical fonts to synthesize some text images and then rotate them horizontally. All the validation images come from real scenes.

For text recognition, there are 17.9M training images and 18.7K validation images. Among the training images, 1.9M are real scene images, which come from some public datasets and Baidu image search. The public datasets used include LSVT, RCTW-17, MTWI 2018 and CCPD 2019. The remaining 16M synthetic images mainly focus on scenarios with different backgrounds, translation, rotation, perspective transformation, line disturbance, noise, vertical text and so on. The corpus of the synthetic images comes from the real scene images. All the validation images also come from real scenes.

In order to conduct ablation experiments quickly and choose the appropriate strategies, we select 4K images from the real scene training images for text detection, and 300K from the real scene training images for text recognition.

In addition, we collected 300 images from different real application scenarios to evaluate the overall OCR system, including contract samples, license plates, nameplates, train tickets, test sheets, forms, certificates, street view images, business cards, digital meters, etc. Figure 3 and Figure 4 show some images of the test set.

Furthermore, to verify the proposed PP-OCR for other languages, we also collect corpora for alphanumeric symbol recognition, French recognition, Korean recognition, Japanese recognition and German recognition, and then synthesize text line images for text recognition. Some images for alphanumeric symbol recognition come from the public datasets ST (Gupta, Vedaldi, and Zisserman 2016) and SRN (Yu et al. 2020). Table 2 shows the statistics. Since MLT 2019 for text detection includes multilingual images, the text detector for Chinese and English recognition can also support multi-language text detection. Due to the limited data, we have not found proper data to train a multilingual direction classifier.

The data synthesis tool used in text detection and text recognition is modified from text render (Sanster 2018).
Implementation Details
We use the Adam optimizer to train all the models and adopt cosine learning rate decay as the learning rate schedule. The initial learning rate, batch size and number of epochs for the different tasks can be found in Table 4. Once the trained models are obtained, FPGM pruner and PACT quantization can be used to reduce the model size further, with the trained models serving as the pre-trained ones. The training processes of FPGM pruner and PACT quantization are similar to the previous ones.

In the inference period, HMean is used to evaluate the performance of a text detector. Accuracy is used to evaluate the performance of a direction classifier or a text recognizer. F-score is used to evaluate the performance of an OCR system. To calculate the F-score, a text recognition result counts as correct only if both the location is accurate and the text is identical. GPU inference time is tested on a single T4 GPU. CPU inference time is tested on an Intel(R) Xeon(R) Gold 6148. We use the Snapdragon 855 (SD 855) to evaluate the inference time of the quantized models.
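HMean, the detector metric named above, is the harmonic mean of precision and recall. The short sketch below computes it; the function name `hmean` is hypothetical, and the example values are the precision and recall reported for the 96-inner-channel detector in the ablation results of this paper.

```python
def hmean(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


# Precision 0.6677 and recall 0.5524 give an HMean of about 0.6046.
score = hmean(0.6677, 0.5524)
```

Because the harmonic mean is dominated by the smaller of the two values, a detector cannot reach a high HMean by trading away recall for precision or the reverse.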
Table 5 compares the performance of the different backbones for text detection. HMean, the model size and the in [...]

Inner channels of the head | Remove SE | Cosine Learning Rate Decay | Learning Rate Warm-up | Precision | Recall | HMean | Model Size (M) | Inference Time (CPU, ms)
256 | | | | 0.6821 | 0.5560 | 0.6127 | 7 | 406
96 | | | | 0.6677 | 0.5524 | 0.6046 | 4.1 | 213
96 | √ | [...]

Table 7 compares the performance of different backbones for direction classification. The accuracies of MobileNetV3 at the different scales ([...]) are close. The model size and the inference time of MobileNetV3_small_x0.35 are much better. Besides, ShuffleNetV2 is used to train a direction classifier in some previous work. The table shows that ShuffleNetV2 is not a good choice in terms of accuracy, model size or inference time.

Table 9 shows the ablation study of data augmentation for direction classification. The baseline accuracy of the text direction classifier without data augmentation is only 88.79%. When we adopt BDA (base data augmentation), the accuracy is boosted by 2.55%. We also verified that RandomErasing [...]

[Table: Input Resolution, PACT Quantization, Accuracy, Model Size (M), Inference Time (SD 855, ms)]

[...] × [...] to [...] × [...], the classification accuracy improves but the prediction speed is basically unchanged. Furthermore, we also verified that the quantization strategy is effective in accelerating the text direction classifier. The model size is reduced by 45.9% and the inference time is reduced by 25.86%. The accuracy improves slightly.

Table 10 compares the performance of the different backbones for text recognition. The accuracy, the model size and the inference time vary greatly across the different scales of MobileNetV3. In PP-OCR, we choose MobileNetV3_small_x0.5 to balance accuracy and efficiency.

Number of channels | Accuracy | Model Size (M) | Inference Time (CPU, ms)
256 | 0.6556 | 23 | 17.27
96 | 0.6673 | 8 | 13.36
64 | 0.6642 | 5.6 | 12.64
48 | 0.6581 | 4.6 | 12.26

Table 11: Ablation study of the number of channels in the head for text recognition. Only BDA is used for data augmentation.

Table 11 compares the number of channels in the CRNN head for text recognition. Reducing the number of channels from 256 to 48 reduces the model size from 23M to 4.6M and the inference time by nearly 30%, while the accuracy is not affected. We can see that the number of channels in the head has a great influence on the model size of a lightweight text recognizer.

Table 12 shows the ablation study of data augmentation, cosine learning rate decay, the stride of the second down sampling feature map, the L2 decay regularization parameter and learning rate warm-up for text recognition.

To verify the advantages of each strategy, the basic experimental setting is strategy S1. When using BDA, the accuracy improves by 3.12%. Data augmentation is very necessary for text recognition. When we further adopt cosine learning rate decay, the accuracy improves by 1.47%. Cosine learning rate decay is an effective strategy for text recognition. Next, when we increase the feature map resolution by reducing the stride of the second down sampling feature map from (2,1) to (1,1), the accuracy improves by 5.27%. Then, when we further adjust the L2 decay from 0 to [...]e−[...], the accuracy improves by 3.4%. The feature map resolution and L2 decay have a great influence on the performance. Finally, using learning rate warm-up, the accuracy improves by 0.62%. Using TIA data augmentation, the accuracy improves by 0.91%. Learning rate warm-up and TIA are also effective strategies for text recognition.

Table 13 shows the ablation study of PACT quantization for text recognition. When we use PACT quantization, the model size is reduced by 67.39% and the inference time is reduced by 8.3%. Since the LSTM layers are not quantized, the acceleration is not obvious. However, the accuracy improves significantly. Therefore, PACT quantization is also an effective strategy for reducing the model size of a text recognizer.

In the end, we will illustrate the effect of pre-trained [...]

Strategy | Data Augmentation | Cosine Learning Rate Decay | Stride | L2 Decay | Learning Rate Warm-up | Accuracy | Inference Time (CPU, ms)
S1 | NO | | (2,1) | 0 | | 0.5193 | 11.84
S2 | BDA | | (2,1) | 0 | | 0.5505 | 11.84
S3 | BDA | √ | (2,1) | 0 | | 0.5652 | 11.84
S4 | BDA | √ | (1,1) | 0 | | 0.6179 | 12.96
S5 | BDA | √ | (1,1) | [...]

Table 14 shows the ablation study of the pruner and quantization for the OCR system. When we use the slimming approaches, the model size is reduced by 55.7% and the inference time is reduced by 12.42%. The F-score is not affected. The inference time includes the pre-processing and post-processing of each part of the system. Therefore, FPGM pruner and PACT quantization are also effective strategies for reducing the model size.

To compare the proposed ultra lightweight OCR system with a large-scale OCR system, we also train a large-scale OCR system that uses Res18_vd as the text detector backbone and Res34_vd as the text recognizer backbone. Table 15 shows the comparison. The F-score of the large-scale OCR system is higher than that of the ultra lightweight OCR system, but the model size and the inference time of the ultra lightweight system are clearly better.

Model Type | F-score | Model Size (M) | Inference Time, CPU (ms) | Inference Time, T4 GPU (ms)
Ultra lightweight | 0.5193 | 8.1 | 421 | 137
Large scale | 0.5414 | 155.1 | 1199 | 204

Table 15: Comparison between the ultra lightweight OCR system and the large-scale one.

Figure 13 and Figure 14 show some image results of the proposed PP-OCR system for Chinese and English recognition. Figure 15 shows some image results of the proposed PP-OCR system for multilingual recognition.
In this paper, we propose a practical ultra lightweight OCR system, PP-OCR, whose overall model size is only 3.5M for recognizing 6622 Chinese characters and 2.8M for recognizing 63 alphanumeric symbols. We introduce a bag of strategies to either enhance the model ability or reduce the model size. The corresponding ablation experiments are also provided. Meanwhile, several practical ultra lightweight OCR models trained on large-scale datasets are released.
Figure 13: Some image results of the proposed PP-OCR system for Chinese and English recognition.
Figure 14: Some image results of the proposed PP-OCR system for Chinese and English recognition.
Figure 15: Some image results of the proposed PP-OCR system for multilingual recognition.