On-Device Document Classification using multimodal features
Sugam Garg
Samsung R&D [email protected]
Harichandana
Samsung R&D [email protected]
Sumit Kumar
Samsung R&D [email protected]
ABSTRACT
From small screenshots to large videos, documents take up a bulk of space in a modern smartphone. Documents on a phone can accumulate from various sources, and with the high storage capacity of mobiles, hundreds of documents are accumulated in a short period. However, searching or managing documents remains an onerous task, since most search methods depend on meta-information or only the text in a document. In this paper, we showcase that a single modality is insufficient for classification and present a novel pipeline to classify documents on-device, thus preventing any private user data transfer to a server. For this task, we integrate an open-source library for Optical Character Recognition (OCR) and our novel model architecture in the pipeline. We optimise the model for size, a necessary metric for on-device inference. We benchmark our classification model with a standard multimodal dataset, FOOD-101, and showcase competitive results with the previous State of the Art with 30% model compression.
CCS CONCEPTS
• Multimodal features → Fusion Network.
KEYWORDS
Multimodal classification, on-device classification
ACM Reference Format:
Sugam Garg, Harichandana, and Sumit Kumar. 2021. On-Device Document Classification using multimodal features. In CODS COMAD 2021, January 2–4, 2021, Bangalore, India. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3430984.3431030
INTRODUCTION
With the advent of smartphones with internal memory in GBs, there is a plethora of documents which can be present on a mobile phone. Some of these are private, while some are downloaded just while browsing the internet. Existing search mechanisms rely heavily on the name of the document, which can be random and may not represent the content of the document properly. Thus, an important document may get lost in the clutter, providing a bad user experience. The automatic organization of documents based on their content will immensely increase a user's satisfaction.
Since the contents of a document in a smartphone are personal, sending the document or its content to a server for such processing may lead to privacy and latency issues. Hence, in this paper, we present a light-weight architecture to classify documents on-device, which improves user experience as well as preserves privacy.

Traditionally, document classification considers only the text content of a document. The words from a document are converted into vectors, and these vectors are used to compute sentence as well as document vector representations. Sparse Composite Document Vector (SCDV) [16] calculates document vectors using soft clustering over word vectors. One popular model, the Hierarchical Attention Network (HAN), uses word- and sentence-level attention for classifying documents [20]. [3] stipulate that a simple BiLSTM architecture with appropriate regularization yields competitive accuracy and F1-score. [2] established state-of-the-art results for document classification by fine-tuning BERT [8] and demonstrated that BERT can be distilled into a much simpler, single-layered, lightweight BiLSTM model that provides competitive accuracy. A recently published paper [1] proposed an approach (HAHNN) that also takes the text structure of a document into account. However, all these models are huge, often containing hundreds of millions of parameters, making on-device deployment infeasible.

Most of the work on document classification is based on text extracted from the document. However, we believe that the organization of the text in the document is also an essential factor for document classification. For example, a boarding pass may be easily identified by the structuring of its data, such as the placement of the passenger name, gate number, etc., without having to read the actual text. In this paper, we also consider the organization of text in the virtual space of the document as a feature, in the form of an image. For this task, we create an in-house dataset of documents with human-annotated class categories. To validate the efficacy of our multimodal approach, we present results on an open-source multimodal dataset. For on-device deployment, we develop a quantized version of our model. The main contributions of our work are as follows:
• We develop a novel multimodal architecture, which considers visual and text features as input for on-device classification.
• We evaluate on a popular text and image dataset, as there is no standard dataset for on-device document classification.
• We propose a novel pipeline for on-device document classification that takes a PDF document as input and gives its class as output.
RELATED WORK
Multimodal learning brings out some unique challenges for researchers, given the heterogeneity of data. [5] captures the challenges, methods, and applications of multimodal learning. Document classification is a subjective problem where the classes and data depend on the use case being targeted.
Modality: Text (Baselines)
SVD | 1st Layer | 2nd Layer | 3rd Layer | Accuracy | Size (MB)
Yes | 2000 | 2000 | 500 | 85.39 | 40
Yes | 2000 | 1000 | 500 | 85.33 | 20
No | 2000 | 1000 | 500 | 86.7 | 946

Modality: Image (Baselines)
Batch Norm | Dropout | Layers Frozen | Accuracy | Size (MB)
No | Yes | 53 | 65.6 | 17
Yes | Yes | 53 | 66.1 | 17
Yes | Yes | 31 | 65.76 | 17

Modality: Text + Image (Previous Work)
Model | Accuracy | Size (MB)
Wang et al. 2015 [19] | 85.1 | >534
Kiela et al. 2018 [13] | 90.8 | >230

Modality: Text + Image (Fusion)
Merge Method | Accuracy | Size (MB)
Max | 86.18 | 12
Concatenate | 89.8 | 13
Average | 82.9 | 12
Highway | 88.03 | 15
Table 1: Accuracy and model size (in megabytes) of fusion models compared to baselines and previous works. (Exact model details of previous works are unknown; we therefore report the minimum model size implied by the visual model used by each.)

[4] classified documents of type questionnaire, memo, etc. and showcased that integrating an additional modality offers a more robust representation. They used Tesseract-OCR to extract text and generate document embeddings, and MobileNetV2 to learn visual features. This approach showed a 3% boost over pure-image accuracy on the Tobacco3482 and RVL-CDIP datasets. In 2015, a new image and text dataset, the UPMC Food-101 dataset, with 100K images and 101 classes, was proposed by [19]. The researchers built a search engine that retrieves the relevant recipes given an image by using both text and visual features. [13] verified the performance of multimodal methods on large datasets, and compared various fusion methods with their own method of discretizing continuous features obtained from visual representations. This work demonstrated the feasibility of multimodal methods on large datasets, and results showed that multimodal models outperform FastText [?] and the continuous-only approach regardless of the type of fusion. To the best of our knowledge, our document class types have not been used in any document classification method. Amongst all the possible ways to fuse and co-learn representations of different modalities, we choose late fusion for our problem. We want individual modalities to also be able to classify documents, in case the device constraints do not permit a full multimodal classifier. Moreover, it is hard to see low-level interactions between the visual and text modalities in a document, as the image of a document hardly describes its textual content.

On-device document classification is at a nascent stage, where previous work is sparse. Due to the subjectivity of our task, it is difficult to create a standard dataset that suits the needs of all. Thus, to tackle this lack of a dataset, we create a small dataset consisting of 5 classes, decided using an internal survey. But, to truly check the efficacy of our model, we needed a dataset which contains multimodal features. For this, we chose the FOOD-101 [19] dataset, which contains recipes and images of 101 popular food categories. We use this dataset to benchmark our model's performance and to showcase that our multimodal architecture learns representations of both modalities, and we present our experiments below.
Text
The text input of the FOOD-101 dataset is the food recipe. We pre-process this text input by performing stop-word removal and lemmatization using the NLTK [15] Porter Stemmer algorithm. Further, we remove high-frequency (greater than 100,000 occurrences) and low-frequency (fewer than 5 occurrences, to account for spelling/parsing errors) words from the text, since the food category of a recipe is likely to be determined by rare words. Moreover, the maximum number of words in a recipe after eliminating stop words was roughly 100,000. Due to this huge size, it was not practical to train a sequential neural model such as a CNN or RNN for building an on-device classifier, since the time complexity of such models is directly proportional to the number of words in a sequence. Thus, we use Tf-Idf as feature vectors and train two fully connected layers and a softmax layer on top of this. To identify the recipe category, the order of the recipe text is rarely useful; thus the loss of sequential information due to Tf-Idf vectors does not hamper model performance. But the dimensionality of these Tf-Idf vectors depends on the vocabulary size, and thus it rendered a model of size 720 MB, since the number of parameters in the first fully connected (FC) layer is directly proportional to the size of the input vector, i.e., the vocabulary size. Tf-Idf vectors are often sparse and low-rank. So, to resolve this, we use Truncated-SVD to reduce the rank of these high-dimensional sparse vectors and train our classifier on these low-dimensional vectors. We demonstrate in Table 1 that our Tf-Idf vectors were indeed low-rank and that, with SVD, the model gives an accuracy of 85%, a 1.5% reduction from the Tf-Idf model but with a reduction in model size of more than 95%, since the size of the input vector to the first FC layer goes from the size of the vocabulary to the rank of the SVD output.
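A minimal sketch of this text pipeline, assuming scikit-learn and TensorFlow/Keras are available; the frequency thresholds, SVD rank and dense-layer sizes here are illustrative placeholders rather than the exact values used in our experiments:

```python
# Sketch: Tf-Idf features reduced with Truncated-SVD, followed by a small
# fully connected classifier. Hyperparameters are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
import tensorflow as tf

def build_text_features(train_texts, svd_rank=200):
    # Tf-Idf over the filtered vocabulary; min_df/max_df roughly approximate
    # the low-/high-frequency word removal described in the text.
    vectorizer = TfidfVectorizer(min_df=5, max_df=0.95)
    tfidf = vectorizer.fit_transform(train_texts)        # sparse (N, |V|)
    # Reduce the sparse, low-rank Tf-Idf matrix to a dense low-dimensional one.
    svd = TruncatedSVD(n_components=svd_rank)
    features = svd.fit_transform(tfidf)                  # dense (N, svd_rank)
    return vectorizer, svd, features

def build_text_classifier(input_dim, num_classes):
    # Two fully connected layers and a softmax on top of the SVD features.
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(input_dim,)),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
```

The first FC layer now scales with the SVD rank instead of the vocabulary size, which is what drives the more-than-95% reduction in model size noted above.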
Figure 1: Fusion Model Architecture.
Image
We use a pre-trained MobileNet [10] to train a classifier for visual features. MobileNet serves as an ideal choice for an on-device classifier as it is optimized for both space and latency. We use transfer learning to retrain a pre-trained MobileNet on our dataset. Our image classifier consists of MobileNet, a pooling layer, two dense layers and a final softmax layer. We freeze the training of the MobileNet parameters and train only the final layers for the first 15 epochs of training. We treat the number of layers unfrozen after the 15th epoch as a hyperparameter. We use batch normalization [12] and dropout [17] to improve the generalization and performance of our model.
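A sketch of this visual classifier, assuming the Keras MobileNet application with ImageNet weights; the dense-layer sizes, dropout rate and the unfreezing helper are illustrative:

```python
# Sketch: MobileNet-based image classifier with a frozen backbone, pooling,
# batch normalization, dropout, two dense layers and softmax.
import tensorflow as tf

def build_image_classifier(num_classes, input_shape=(224, 224, 3)):
    backbone = tf.keras.applications.MobileNet(
        input_shape=input_shape, include_top=False, weights="imagenet")
    backbone.trainable = False  # frozen for the first training stage

    inputs = tf.keras.Input(shape=input_shape)
    x = backbone(inputs, training=False)
    x = tf.keras.layers.GlobalAveragePooling2D()(x)
    x = tf.keras.layers.BatchNormalization()(x)
    x = tf.keras.layers.Dense(128, activation="relu")(x)
    x = tf.keras.layers.Dropout(0.5)(x)
    x = tf.keras.layers.Dense(64, activation="relu")(x)
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(x)
    return tf.keras.Model(inputs, outputs), backbone

def unfreeze_from(backbone, layer_index):
    # After the initial epochs, unfreeze the backbone from `layer_index` on;
    # the index itself is treated as a hyperparameter in the text.
    backbone.trainable = True
    for layer in backbone.layers[:layer_index]:
        layer.trainable = False
```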
Fusion
We build upon the late fusion strategy of [19] and show improved accuracy with significant compression in model size. We use our pre-trained MobileNet-based image classifier and the text classifier to train a fusion classifier. We transfer the pre-softmax layer features of both networks and merge the features from both modalities. We train only the layers succeeding these merged features and build a classifier as shown in Figure 1. We use different methods, F(x), of merging these features, which we discuss below:
Concatenation: We concatenate the vectors from both modalities and train a dense layer on top of the concatenated vector, i.e.,

$o(x_n) = W(U x_n^{t} \oplus V x_n^{v})$,   (1)

where $W$, $U$, $V$ are the weight matrices of the dense layers and $x_n^{t}$ and $x_n^{v}$ are the pre-softmax text and visual feature representations, respectively.

Average: Here, we retain the softmax layers of the pre-trained models. We merge the outputs of the softmax layers using a component-wise average and train a dense layer on that average, i.e.,

$o(x_n) = W\,\mathrm{avg}\big(\mathrm{softmax}(U x_n^{t}), \mathrm{softmax}(V x_n^{v})\big)$,   (2)

Max: We combine the information from both modalities using a component-wise maximum:

$o(x_n) = W \max(U x_n^{t}, V x_n^{v})$,   (3)

Gating Layer post Concatenation: We concatenate the features of both modalities and train a highway layer [18] on top of it:

$y_n = U x_n^{t} \oplus V x_n^{v}$, $\quad g = W y_n + b$, $\quad t = \mathrm{sigmoid}(g)$, $\quad o(x_n) = t * g + (1 - t) * y_n$,   (4)

We use ReLU [7] as our non-linear function after merging the layers.
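A sketch of this late-fusion head in the Keras functional API, assuming text_feat and image_feat hold the pre-softmax features of the frozen pre-trained classifiers; only the concatenation (Eq. 1) and highway (Eq. 4) variants are shown, and the layer sizes are illustrative:

```python
# Sketch: late fusion of pre-softmax text and visual features.
import tensorflow as tf
from tensorflow.keras import layers

def concat_fusion(text_feat, image_feat, num_classes):
    # Eq. (1): o(x_n) = W(U x_t (+) V x_v)
    u = layers.Dense(64, activation="relu")(text_feat)
    v = layers.Dense(64, activation="relu")(image_feat)
    y = layers.Concatenate()([u, v])
    return layers.Dense(num_classes, activation="softmax")(y)

def highway_fusion(text_feat, image_feat, num_classes):
    # Eq. (4): gate the concatenated features with a highway layer.
    u = layers.Dense(64, activation="relu")(text_feat)
    v = layers.Dense(64, activation="relu")(image_feat)
    y = layers.Concatenate()([u, v])
    units = int(y.shape[-1])
    g = layers.Dense(units)(y)               # g = W y + b
    t = layers.Activation("sigmoid")(g)      # transform gate
    one_minus_t = layers.Lambda(lambda z: 1.0 - z)(t)
    out = layers.Add()([layers.Multiply()([t, g]),
                        layers.Multiply()([one_minus_t, y])])
    return layers.Dense(num_classes, activation="softmax")(out)
```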
The accuracy and model sizes of the different fusion strategies, compared to the baselines and previous works, are shown in Table 1. With our concatenation model, we were able to match the performance achieved by the gated model of [13] with a reduction in model size, as they use a 152-layer ResNet [9] to capture visual features while we build upon MobileNet, which has fewer parameters than ResNet. We observed a significant improvement in the classification accuracy of food images with the multimodal classifier as compared to the individual modality classifiers. For specific classes, as shown in Table 2, we observed an improvement in classification accuracy with a fusion of image and text features.

Class | Text | Image | Fusion
scallops | 0.125 | 0.625 | 0.5
breakfast_burrito | 0.6 | 0.2 | 0.5
sashimi | 0.75 | 0.25 | 1
fish_and_chips | 0.72 | 0.27 | 0.63
apple_pie | 0.333 | 0.555 | 0.666

Table 2: Accuracy of the fusion classifier and single-modality classifiers for some food categories. Here, accuracy is the fraction of correct top-1 predictions for that class.

Category | Count
Travel | 118
Personal information | 100
Receipts | 98
Papers/books | 102
Misc | 94

Table 3: Class split of the document dataset.
ON-DEVICE DOCUMENT CLASSIFICATION
We define on-device document classification as the task of categorising real documents on a user's device into topics. These topics were decided based on an internal survey of ~100 people. For this task, we created an in-house dataset of ~512 documents, as presented in Table 3. For simplicity, we consider only PDF documents, but the architecture can be easily extended to any other document type (Word, Excel, etc.). The architecture, shown in Figure 2, is divided into 3 parts: data extraction, feature generation and classification.
Data Extraction: While libraries such as PDFBox exist to extract text from a PDF, using them would restrict our document set to only PDFs. To create a framework for processing all document types, we use OCR to extract text. We convert the PDF documents into a set of images and use ML Kit (https://firebase.google.com/docs/ml-kit/recognize-text) to extract text from the images. We then use the images and the extracted text as input to our multimodal architecture presented in Figure 1, with some changes to the model parameters. For now, we consider only the first page of the PDF document, since a document may consist of tens of pages and processing all of them would increase latency.

Feature Generation: OCR text extraction is highly erratic for PDF documents since it parses information line by line. For example, if a PDF has a column layout, like research papers, the sequence of text will be lost. Thus, a word-order-based model may fail for this task. We use regex-based text filtering and the pre-processing approach mentioned in the previous section. Further, to tackle noise due to watermarks, etc., we use a filter to remove non-English words from the tokenized text. We maintain a pre-defined vocabulary of 60,000 English words to extract Tf-Idf vectors for a document. This vocabulary covers 98% of the words present in our dataset. For image features, we use the image generated while extracting text as input to the visual modality network.
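A sketch of this filtering step, assuming a pre-compiled set of English vocabulary words is available; the regex and helper name are illustrative:

```python
# Sketch: clean OCR output and keep only known-English tokens before Tf-Idf.
import re

def clean_ocr_text(raw_text, english_vocab):
    # Keep alphabetic tokens only; numbers, punctuation and OCR noise are dropped.
    tokens = re.findall(r"[A-Za-z]+", raw_text.lower())
    # Remove tokens not in the fixed English vocabulary (e.g. watermark
    # fragments or parsing noise).
    return " ".join(t for t in tokens if t in english_vocab)
```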
Classification: We use the architecture presented above for the FOOD-101 dataset for this task. We tweak the network parameters to account for the complexity of this specific task and dataset.
Figure 2: Pipeline of the on-device document classification framework.
Modality | Accuracy
Text | 59.62%
Visual | 71.88%
Text + Visual | 84.38%
Table 4: Top-1 accuracy of single-modality and multimodal models.

For the text classification, we reduced the rank of the Tf-Idf vector to 200 using SVD. Following this, we trained two dense layers of 64 and 32 units, respectively, before adding a final softmax layer for classification. For the image classification, we use the MobileNet architecture and train a pooling layer of 512 units and a dense layer of 64 units. We followed the same approach of pre-training the text and image classifiers separately, and trained a fusion classifier on top of them. The results of the single modalities, as well as the fusion model, are presented in Table 4.
Training Methodology: We use Stochastic Gradient Descent with decay as our optimiser, with an initial learning rate of 0.01 and a decay of 0.8 after every epoch. For training the fusion classifier, we follow a multi-stage transfer learning approach, inspired by [11]. We experiment with various strategies for unfreezing layer weights during training. We train a base model for 90 epochs without unfreezing the layers of the pre-trained models. The MobileNet architecture consists of repeated blocks of point-wise and depth-wise separable convolutions, so we unfreeze weights at the end of a block of MobileNet and at the end of the second layer of our text model architecture. The layer numbers mentioned below refer to the actual layer number, in order, of our model as reported in Keras's [6] model summary. Firstly, we unfreeze the pre-trained weights of the individual modality models from the 53rd layer of the network after the 30th epoch, with and without resetting the learning rate. Secondly, we unfreeze the model weights from the 81st layer after the 30th epoch and from the 53rd layer after the 60th epoch, resetting the learning rate at both instances. The accuracy and loss variation is presented in Figure 3. Since our dataset is dissimilar from the dataset on which MobileNet was pre-trained, we observed better performance by resetting the learning rate in stages.
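A sketch of this multi-stage schedule in Keras, assuming the fusion model and its MobileNet backbone are already built; the stage boundaries and layer indices follow the second strategy described above, while the helper names are illustrative:

```python
# Sketch: SGD with per-epoch decay (0.8) and multi-stage unfreezing, with the
# learning rate reset whenever new layers are unfrozen.
import tensorflow as tf

INITIAL_LR, DECAY = 0.01, 0.8

def lr_for_epoch(epochs_since_reset):
    # Learning rate decays by 0.8 every epoch since the last reset.
    return INITIAL_LR * (DECAY ** epochs_since_reset)

def train_in_stages(model, backbone, train_ds, val_ds):
    # Stage 1: train only the top layers for 30 epochs (backbone frozen).
    # Stage 2: unfreeze from layer 81, reset the learning rate, train to epoch 60.
    # Stage 3: unfreeze from layer 53, reset again, train to epoch 90.
    stages = [(None, 0, 30), (81, 30, 60), (53, 60, 90)]
    for unfreeze_idx, start, end in stages:
        if unfreeze_idx is not None:
            backbone.trainable = True
            for layer in backbone.layers[:unfreeze_idx]:
                layer.trainable = False
        schedule = tf.keras.callbacks.LearningRateScheduler(
            lambda epoch, lr, s=start: lr_for_epoch(epoch - s))
        model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=INITIAL_LR),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        model.fit(train_ds, validation_data=val_ds,
                  initial_epoch=start, epochs=end, callbacks=[schedule])
```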
On-Device Execution
Depending upon the extension of a file, we can choose an appropriate renderer. For our PDF documents, we use the Android PdfRenderer (https://developer.android.com/reference/android/graphics/pdf/PdfRenderer) for rendering. Once the rendering is done, we extract a bitmap, which is processed to obtain the image features.

Figure 3: Comparing validation loss and accuracy of different training techniques. (a) Loss curve of models. (b) Accuracy curve of models.
To extract the text of the document, we use Google ML Kit (https://firebase.google.com/docs/ml-kit/recognize-text), which provides an on-device API for text extraction. For simplicity, only the English language is considered; however, it is possible to extract text for different languages on-device, as explained in [14]. The trained fusion model is quantized using the TensorFlow Lite (https://tensorflow.org/lite) post-training quantization method. The quantization led to an accuracy reduction of ~0.5%, a minor impact considering that it leads to a model compression of 75%. The size of the final model is ~13 MB. The total execution time of this pipeline on a document is 4.6 s, of which 3.5 s is taken by OCR.
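A minimal sketch of this post-training quantization step with the TensorFlow Lite converter; the paths and function name are illustrative:

```python
# Sketch: post-training quantization of the trained fusion model with TFLite.
import tensorflow as tf

def quantize_fusion_model(saved_model_dir, out_path="fusion_model.tflite"):
    converter = tf.lite.TFLiteConverter.from_saved_model(saved_model_dir)
    # Default post-training quantization, which compresses weights by roughly
    # 4x (about 75%) at the cost of a small accuracy drop.
    converter.optimizations = [tf.lite.Optimize.DEFAULT]
    tflite_model = converter.convert()
    with open(out_path, "wb") as f:
        f.write(tflite_model)
    return out_path
```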
CONCLUSION
The storage capacity of smartphones is ever-increasing, which leads to a vast accumulation of documents on-device. Such clutter inhibits a user from retrieving relevant documents. Moreover, with the internet becoming increasingly multimodal, we should leverage the information offered by different modalities for a better understanding of content. With this work, we show that different modalities indeed contribute towards an increased understanding of documents. We achieve ~90% accuracy with our fusion network on the FOOD-101 dataset, matching the previous best with a reduction in model size. We also present a feasible framework to classify documents on-device. We acknowledge that the size of our dataset is not sufficient to provide conclusive evidence, but we hope that our work serves as a precursor for others to contribute to this field of on-device document classification.

REFERENCES
[1] Jader Abreu, Luis Fred, David Macêdo, and Cleber Zanchettin. 2019. Hierarchical Attentional Hybrid Neural Networks for Document Classification. arXiv preprint arXiv:1901.06610 (2019).
[2] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. DocBERT: BERT for Document Classification. arXiv preprint arXiv:1904.08398 (2019).
[3] Ashutosh Adhikari, Achyudh Ram, Raphael Tang, and Jimmy Lin. 2019. Rethinking complex neural network architectures for document classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4046–4051.
[4] Nicolas Audebert, Catherine Herold, Kuider Slimani, and Cédric Vidal. 2019. Multimodal deep networks for text and image-based document classification. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, 427–443.
[5] Tadas Baltrušaitis, Chaitanya Ahuja, and Louis-Philippe Morency. 2018. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence 41, 2 (2018), 423–443.
[6] François Chollet et al. 2015. Keras. https://keras.io.
[7] George E Dahl, Tara N Sainath, and Geoffrey E Hinton. 2013. Improving deep neural networks for LVCSR using rectified linear units and dropout. IEEE, 8609–8613.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[9] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 770–778.
[10] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. 2017. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017).
[11] Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146 (2018).
[12] Sergey Ioffe and Christian Szegedy. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015).
[13] Douwe Kiela, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2018. Efficient large-scale multi-modal classification. In Thirty-Second AAAI Conference on Artificial Intelligence.
[14] Sumit Kumar, Gopi Ramena, Manoj Goyal, Debi Mohanty, Ankur Agarwal, Benu Changmai, and Sukumar Moharana. 2020. On-Device Information Extraction from Screenshots in form of tags. In Proceedings of the 7th ACM IKDD CoDS and 25th COMAD. 275–281.
[15] Edward Loper and Steven Bird. 2002. NLTK: the natural language toolkit. arXiv preprint cs/0205028 (2002).
[16] Dheeraj Mekala, Vivek Gupta, Bhargavi Paranjape, and Harish Karnick. 2016. SCDV: Sparse Composite Document Vectors using soft clustering over distributional representations. arXiv preprint arXiv:1612.06778 (2016).
[17] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1 (2014), 1929–1958.
[18] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. 2015. Highway networks. arXiv preprint arXiv:1505.00387 (2015).
[19] Xin Wang, Devinder Kumar, Nicolas Thome, Matthieu Cord, and Frederic Precioso. 2015. Recipe recognition with large multimodal food dataset. IEEE, 1–6.
[20] Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical attention networks for document classification. In