Deep Structured Feature Networks for Table Detection and Tabular Data Extraction from Scanned Financial Document Images
Siwen Luo, Mengting Wu, Yiwen Gong, Wanying Zhou, Josiah Poon
School of Computer Science, The University of Sydney
Email: {mewu2491, ygon7645, wzho6305}@uni.sydney.edu.au, {siwen.luo, josiah.poon}@sydney.edu.au

Abstract—Automatic table detection in PDF documents has achieved great success, but tabular data extraction remains challenging due to integrity and noise issues in detected table areas. Accurate data extraction is extremely crucial in the finance domain. Motivated by this, this research proposes automated table detection and tabular data extraction from financial PDF documents. Our method consists of three main processes: detecting table areas with a Faster R-CNN (Region-based Convolutional Neural Network) model with a Feature Pyramid Network (FPN) on each page image, extracting contents and structures with a compound layout segmentation technique based on optical character recognition (OCR), and formulating regular expression rules for table header separation. The tabular data extraction feature is embedded with rule-based filtering and restructuring functions that are highly scalable. We annotate a new Financial Documents dataset with table regions for the experiment, on which the detection model obtains excellent table detection performance. The main contributions of this paper are the Financial Documents dataset with table-area annotations, the superior detection model, and the rule-based layout segmentation technique for tabular data extraction from PDF files.
Index Terms—Applications of document analysis, Applications of deep learning to document analysis, Document image processing
I. INTRODUCTION
Tables in PDF documents usually contain the essence of the information; however, locating and extracting tables manually is time-consuming and error-prone. As descendants of the PostScript page description language, PDF files are hard for machines to read, which is why the efficiency of algorithms for text and structural information recognition is the main limitation for table extraction [1]. As a classical topic in pattern recognition, table detection has been approached from diverse perspectives, such as tag detection on text-based documents and region detection on page images. Tabular data extraction from table images is also challenging, since we want to discard noise while preserving the table's original structure.

Most research treats table detection and tabular extraction separately, while few works consider them as one united process. There are difficulties in each field. First, not all PDF files are text-based: PDF pages can be static images or scanned graphs without tags, so a tag classifier is neither suitable nor useful on document images, and label classification designed for text-based documents [2] performs erroneously on unstructured tables. Second, there are different table formats: framed or unframed, line-based or irregular, single or combined. End-to-end deep learning models [3] prefer structured tables for cell recognition. Furthermore, it is hard to maintain the original layout and keep the table contents complete. Text noise impacts the detection results of the currently popular Nurminen algorithm, and missing headers are another common issue in content extraction with related tools [4]. In addition, few datasets provide a large volume of text-based tables for table detection and extraction. Automatic and smart table extraction is therefore a complex and difficult task.

To overcome the difficulties mentioned above, we propose in this paper a method that extracts tabular data from table images detected by an object recognition model. Our table detection model embeds a cutting-edge object detection algorithm with feature pyramid networks; it is tested on the Financial Documents dataset, where its performance is competitive. Detected table images are processed by OCR-based layout segmentation techniques to output data frames that contain intact table contents and original structures. Generally, table extraction only covers content retrieval and structure rebuilding, while the separation of header and body is usually treated as a separate topic. To separate the table schema from the records, we adopt a seq2seq model combined with a beam-search architecture and regular expression rules to extract headers. In this paper, we take account of the practical utility of tables and exploit semantic separation to support header division. The resulting data frames are automatically stored in a database along with summary sheets of the detection results. The workflow is demonstrated in Figure 1. We annotate our own dataset, which contains 3238 page images with diverse embedded tables. All page images come from 108 Financial Documents, of which 78 files are training data and 30 files are testing data.
The main contributions of this paper are a new annotated dataset for table detection, the optimal table detection performance of our Faster R-CNN FPN model, effective table segmentation techniques, and a functional program that automates the whole tabular data extraction process. We jointly consider table detection and layout extraction as one topic and improve the practical utilization of tables after detection and extraction.

The rest of this paper is organized as follows. Section II reviews related work. Section III elaborates on the proposed methodology, detailing our Faster R-CNN with FPN model for detection and the hybrid OCR-based segmentation and restructuring techniques for tabular data extraction. Section IV presents the experimental results and evaluation metrics, and Section V draws the conclusion.

Fig. 1: Flow chart of the procedure of table detection and extraction

II. RELATED WORK
Plenty of previous research concentrates on table detection or content extraction from PDF documents independently; few works consider both as a whole topic.
A. PDF Page Object Detection
Table detection is a specialization of object detection for PDF document images. Li et al. [5] used a conditional random field (CRF) classification model as a line-region-type and link classifier for page object detection. Their deep structured predictor performed well on detecting formulas, tables and figures within PDF document images, and it mainly classifies regions into different classes. Das et al. [6] applied a deep convolutional neural network, VGG-16, for document structure learning and constructed levels of transfer learning to train the model on image segments. Their approach is more suitable for region-based classification of scanned PDF images. Loc et al. [7] used a fully convolutional network (FCN) to detect watermarking regions; FCN-based approaches are powerful for region segmentation on stable contents such as tables and headers.
B. Table and Structure Recognition
There are remarkable works on extracting all table data from PDF files, covering both table detection and structure recognition. Paliwal et al. [3] demonstrated a novel end-to-end deep learning model, TableNet, based on a pre-trained VGG-19, to detect tabular regions and column masks. They noted that the model was built for detection and extraction from scanned images and needs strictly vertical table layouts as input for column detection, so the extraction is limited by the table format. Prasad et al. [8] proposed CascadeTabNet, a Cascade mask region-based CNN high-resolution network model combining transfer learning, image transformation and data augmentation to improve the process. They used this approach for end-to-end tabular region detection and recognized the structural table cells from document images. Li et al. [2] presented a new image-based dataset, TableBank, for table detection and recognition from PDF documents. To test the quality of the dataset, they used the Faster R-CNN architecture with a Region Proposal Network (RPN) to detect table areas and adopted an image-to-markup model [9] to extract the tabular structure.

Faster R-CNN is a typical deep learning model for object detection. Ren et al. [10] introduced the RPN into the Fast R-CNN model to share feature maps, and the resulting Faster R-CNN model performed effectively on real-time object detection. For text and structure extraction, OCR is a powerful technique that "reads" all word positions within documents. Paliwal et al. [3] used Tesseract OCR to formulate rules for row segmentation covering different problems with line demarcations.
III. METHODOLOGY
A. System Overview
In this paper, we aim to detect tabular regions on page images through a deep learning architecture and to extract the embedded text from the detected table images in order to rebuild the layout. We therefore use a hybrid Faster R-CNN model with an FPN framework to locate tabular regions on page images. To extract text and structure, we integrate row and column segmentation rules based on the OCR technique and design a deep learning language model, Sequence-to-Sequence (seq2seq), to separate headers from instances.

Our main procedure covers the conversion from PDF files to page images, the detection of table regions on page images, the extraction of the cell information embedded in each table region image, the reconstruction of the table layout, and the separation of the table schema from the instances. After detection and extraction, the system outputs structured tables and saves them in a database for further usage. We explain all strategies in the following two sections.

Fig. 2: Procedure of bounding box detection from page image using Faster-RCNN model

We use the Faster R-CNN model as the baseline for object recognition and introduce the FPN backbone to build feature pyramids, which improves both the speed and the accuracy of the model. PDF files, the original input of our proposed deep learning model, come in various types, including text-based and image-based pages, and cannot be used directly as input for predicting table bounding boxes. Therefore, we convert all PDF files into page images of stable size at a resolution of 200 dots per inch (dpi), which keeps acceptable image quality; a minimal conversion sketch is given at the end of this subsection. These unified page images improve the learning capability of our proposed model.

Faster R-CNN [10] is a classical deep learning detector that uses a single convolutional neural network (CNN) to make region proposals and bounding box regressions, rather than using Selective Search to generate a pile of potential regions. In our detection model, the page image is fed to the CNN, which produces a convolutional feature map. The Region Proposal Network (RPN) passes a sliding window over the CNN feature map and outputs four potential bounding boxes for table regions in each window, together with the corresponding expected quality scores. Region of Interest (RoI) pooling layers then reshape the table region proposals from the feature maps into fixed-size 7 × 7 features, with anchor sizes growing up to 512, 1024 and 2048 pixels on the pyramid levels P2, P3, P4, P5 and P6 respectively, as shown in Figure 3. In this way, we input images at a single scale and quickly build a feature pyramid with strong semantic information at all scales without any obvious cost [12].

Fig. 3: Two pathways of Feature Pyramid Networks for extracting feature maps
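As a concrete illustration of the page-image conversion step above, the following minimal sketch rasterizes a PDF at 200 dpi. The paper does not name its conversion tool; pdf2image (with a poppler backend) and the helper name pdf_to_page_images are illustrative assumptions.

    # Sketch: convert a PDF into fixed-dpi page images for the detector.
    # pdf2image is an assumed tool choice; the paper only states the 200 dpi setting.
    from pdf2image import convert_from_path

    def pdf_to_page_images(pdf_path, out_dir, dpi=200):
        # Rasterize every page at a fixed dpi so all page images share a
        # stable size and acceptable quality.
        pages = convert_from_path(pdf_path, dpi=dpi)  # one PIL image per page
        paths = []
        for i, page in enumerate(pages):
            path = f"{out_dir}/page_{i:03d}.png"
            page.save(path, "PNG")
            paths.append(path)
        return paths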
B. Text and Structure Extraction

All table regions on page images are located and cropped by the detection model based on the deep learning architecture. Within page images, table areas are displayed in diverse formats where the segmentation between cells can be done by lines or white spaces. In most previous experiments, table extraction is treated as structure recognition, which needs a large and complex annotated dataset for cell classification. However, it is hard for object detection to capture the complete and detailed layout, as it relies on diverse and complex region annotations over huge datasets. Therefore, we put forward an integrated layout segmentation architecture combined with the OCR technique to simplify the extraction process and reconstruct the tabular structure depending only on the original layout. Py-tesseract, a typical OCR tool, can "read out" the spatial positions of word patches embedded in images. The approach is capable of dealing with most structured and numerical tables with strictly horizontal and vertical distributions, for which the foreground text can be segmented cleanly from the background image. We designed a sequential procedure from word extraction to row and column segmentation for efficient layout reconstruction.
1) Spatial Character Extraction of Text:
Our segmentation rule is built on the horizontal and vertical positions of text in the table, which differs from typical cell region recognition by detection models. Region recognition needs a manually annotated dataset and a deep learning architecture and, as a result, depends on a vast amount of pre-work and external tools compared with our approach. The OCR technique provides strong support for extracting all word positions. According to the features of the input image, we set a flexible OCR engine mode, either a single mode or a merged mode based on the Legacy and neural-net Long Short-Term Memory (LSTM) engines, and we additionally treat the entire table as a single uniform block of text. Each word embedded in the tabular image is therefore obtained with its verbose spatial data: line number, word sequence number, horizontal and vertical coordinates, etc. All characters are captured from the spatial distribution of the text and used as row and column demarcations for table rebuilding.
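The word-level extraction described above can be realized with pytesseract's image_to_data call, which returns the line number, word number and bounding box of every recognized word. The sketch below assumes this API; the default of OEM 2 (Legacy and LSTM engines combined) and PSM 6 (a single uniform block of text) mirrors the engine and block settings described in this subsection, while the helper name extract_words is our own.

    # Sketch: word-level spatial extraction with pytesseract (Tesseract OCR).
    # --psm 6 treats the whole table crop as one uniform block of text;
    # --oem selects the Legacy engine, the LSTM engine, or both combined.
    import pytesseract
    from PIL import Image

    def extract_words(table_image_path, oem=2, psm=6):
        config = f"--oem {oem} --psm {psm}"
        data = pytesseract.image_to_data(
            Image.open(table_image_path),
            config=config,
            output_type=pytesseract.Output.DICT,
        )
        words = []
        for i, text in enumerate(data["text"]):
            if text.strip():  # skip empty OCR tokens
                words.append({
                    "text": text,
                    "line": data["line_num"][i],
                    "word": data["word_num"][i],
                    "left": data["left"][i],
                    "top": data["top"][i],
                    "width": data["width"][i],
                    "height": data["height"][i],
                })
        return words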
2) Reconstruction of Tabular Layout:
In most table structure recognition models, performance is impacted by the diversity of table formats and the complexity of manual cell annotations. In contrast, our spatial segmentation rule is formulated from the original distribution of word patches, so the reconstruction complies with the table formation and is not limited by complex table layouts. Identical line numbers and ascending word numbers extracted by the OCR technique segment the table rows, with words sorted successively by vertical and horizontal position. Spaces between adjacent words identify the words belonging to one cell, and the largest number of cells in any row equals the number of table columns. Furthermore, the horizontal centers of the cells in this longest row act as the standard column demarcations. We map the column demarcations to the horizontal positions of the cells in each row to segment the table columns. Wherever a row lacks a row title, its non-blank cells are merged into the cells of the row above. This row and column segmentation procedure is at the same time the table layout reconstruction, and it maintains the original structure as accurately as possible; a sketch follows Figure 4.

Fig. 4: Row and column segmentation based on spatial positions of cells
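A minimal sketch of this segmentation rule follows, operating on the word dictionaries produced by the OCR step above. The gap threshold used to decide whether two adjacent words share a cell is a free parameter we introduce for illustration (the paper does not state a value), and the upward merge of title-less rows is omitted for brevity.

    # Sketch: row and column segmentation from word positions.
    # gap_threshold (pixels) is an assumed parameter, not a value from the paper.
    def rebuild_table(words, gap_threshold=15):
        if not words:
            return []
        # Group words into rows by OCR line number, left to right.
        rows = {}
        for w in sorted(words, key=lambda w: (w["line"], w["left"])):
            rows.setdefault(w["line"], []).append(w)

        # Merge horizontally adjacent words into one cell when the gap is small.
        def to_cells(row):
            cells = [[row[0]]]
            for prev, cur in zip(row, row[1:]):
                gap = cur["left"] - (prev["left"] + prev["width"])
                if gap <= gap_threshold:
                    cells[-1].append(cur)
                else:
                    cells.append([cur])
            return [{"text": " ".join(w["text"] for w in c),
                     "center": (c[0]["left"] + c[-1]["left"] + c[-1]["width"]) / 2}
                    for c in cells]

        cell_rows = [to_cells(r) for r in rows.values()]

        # The row with the most cells fixes the standard column demarcations.
        anchor = max(cell_rows, key=len)
        columns = [c["center"] for c in anchor]

        # Map every cell to the nearest column center.
        table = []
        for row in cell_rows:
            line = [""] * len(columns)
            for cell in row:
                j = min(range(len(columns)),
                        key=lambda k: abs(columns[k] - cell["center"]))
                line[j] = cell["text"]
            table.append(line)
        return table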
3) Separation of Table Header:
Although the table structure is extracted by the segmentation rule, contents cannot be divided by position demarcations alone. After the row segmentation process, headers may be super-contents that spread over multiple cells. A seq2seq model in conjunction with a beam-search optimization schema [13] inherits and extends the estimation of the globally best next-word sequence. Given a header content, the model predicts the occurrence probability of the following word; if the first word of the next cell falls among the candidate words, that cell is marked as a header. This technique uses deep learning modelling and can be improved continually as the set of candidate words grows. Considering the size of the word library, we design an alternative method, regular expressions, to classify headers and instances. The premise is that tables keep consistent structures and headings. Since all instances have unified data types, which differ clearly from headers, we determine the boundary of the table schema by comparing the similarity of row values. Using a rule-based program to specify what content we expect in a header row works effectively in our program, as sketched below.
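A minimal sketch of the regular-expression alternative follows: financial table bodies are predominantly numeric, so the first row in which most cells look like numbers, currency amounts or percentages is taken as the start of the instances. The exact pattern and the majority ratio are illustrative assumptions, not the precise rules of our implementation.

    # Sketch: split header rows from instance rows by data type.
    # The pattern and the 0.5 majority ratio are assumed illustrative choices.
    import re

    NUMERIC = re.compile(r"^\(?-?[$€£]?\s*\d[\d,.]*\s*%?\)?$")

    def split_header(table, ratio=0.5):
        for i, row in enumerate(table):
            values = [c for c in row if c]
            if values and sum(bool(NUMERIC.match(c)) for c in values) / len(values) >= ratio:
                return table[:i], table[i:]   # header rows, instance rows
        return table[:1], table[1:]           # fallback: first row is header

Chained with the earlier sketches, header, body = split_header(rebuild_table(extract_words(img))) yields the separated schema and records.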
IV. EXPERIMENT RESULTS
A. Dataset
We conducted experiments on the Financial Documents dataset, a customized dataset in which 78 PDF files are used as training data and 30 PDF files as testing data for the performance evaluation. The testing data contains 438 tables within the page images of the Financial Documents. We convert all PDF files into page images and use them to test the performance of our detection model. Furthermore, we run our table extraction approach on the testing set for evaluation.
B. Data Annotation
Few text-based datasets contain a large volume of page images for table detection and extraction. To improve training effectiveness, we annotated 108 Financial Documents collected from the Business School of the University of Sydney. All PDF files are text-based, so all table areas are stable and structured. After converting each page of the PDF documents into image files, we labelled each qualified table with the category Table and used rectangles to outline the tabular regions in every image; there may be several tables on a single page image. In total, we drew bounding boxes on 3238 page images, of which 2800 images are training data and 438 images are testing data. More than 4000 tables in the training set support the training process and improve the effectiveness of the learning procedure.

Fig. 5: Samples from annotated Financial Documents Dataset for table detection stage
C. Implementation Details
We built and implemented the Faster R-CNN model with adjusted configurations. The model uses pre-trained backbones and is fine-tuned on the financial document dataset with a training schedule of 450 iterations. The learning rate is set to 0.0015, the batch size to 128 and the momentum to 0.9; all other configurations are kept at their defaults. The pre-trained models and baselines were obtained on Big Basin servers with 8 NVIDIA V100 GPUs with NVLink, using software such as PyTorch 1.3, CUDA 9.2 and cuDNN 7.4.2 or 7.6.3. Our experiments were conducted on the Google Colaboratory platform with a P100 PCIE GPU with 15.90 GB of GPU memory, an Intel(R) Xeon(R) CPU @ 2.20 GHz and 12.72 GB of RAM.
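The paper does not name the training framework; the server and software description above matches the Detectron2 Model Zoo, so the following minimal sketch assumes Detectron2 with a ResNeXt-101 FPN configuration. The dataset name is a hypothetical placeholder, and interpreting the batch size of 128 as the per-image RoI batch size is an assumption.

    # Minimal training sketch, assuming Detectron2 (not confirmed by the paper).
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    cfg = get_cfg()
    cfg.merge_from_file(model_zoo.get_config_file(
        "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url(
        "COCO-Detection/faster_rcnn_X_101_32x8d_FPN_3x.yaml")  # pre-trained backbone
    cfg.DATASETS.TRAIN = ("financial_documents_train",)  # hypothetical registered name
    cfg.DATASETS.TEST = ()
    cfg.MODEL.ROI_HEADS.NUM_CLASSES = 1             # single class: Table
    cfg.MODEL.ROI_HEADS.BATCH_SIZE_PER_IMAGE = 128  # "batch size 128" read as RoI batch
    cfg.SOLVER.IMS_PER_BATCH = 2                    # images per batch on a single GPU
    cfg.SOLVER.BASE_LR = 0.0015                     # learning rate from the paper
    cfg.SOLVER.MOMENTUM = 0.9                       # momentum from the paper
    cfg.SOLVER.MAX_ITER = 450                       # 450 training iterations

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)
    trainer.train()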
D. Evaluation
To evaluate the performance of our detection model, we adopt average precision (AP), average recall (AR) and F1 score at diverse Intersection over Union (IoU) thresholds. The experimental results for table detection are measured by these standard metrics. The precision rate is computed as precision = TP / (TP + FP) and the recall rate as recall = TP / (TP + FN), where TP, FP and FN denote true positives, false positives and false negatives. The F1 score is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall).

We uniformly define a true positive detection and use these counts to compute precision and recall. A recognized region must include the table header and all instances, ensuring that the whole ground-truth table is captured, and the area within the bounding box must be free of any noise that impacts the purity of the tabular region. The other elements of the confusion matrix are false positives, i.e., bounding boxes that do not contain a table, and false negatives, i.e., actual tables with incorrect bounding boxes or without bounding boxes.

We set different IoU thresholds on the overlap between a detection and the ground truth to compute the evaluation metrics. IoU determines whether a table region is correctly detected by measuring the overlap of the detected polygons; GTP denotes the Ground Truth Polygon of the table region and DTP the Detected Table Polygon. IoU ranges from 0 to 1, where 1 indicates a perfect match. During evaluation, different IoU threshold values decide whether a region counts as correctly detected; for example, a threshold of 0.5 means that a detected region is considered correct once its IoU with the ground truth exceeds 0.5.
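For concreteness, the sketch below computes the IoU of a ground truth polygon (GTP) and a detected table polygon (DTP), simplified here to axis-aligned boxes, and derives precision, recall and F1 from the resulting match counts; the function names are our own.

    # Sketch: IoU between two axis-aligned boxes (x1, y1, x2, y2),
    # followed by precision/recall/F1 over TP/FP/FN counts.
    def iou(gtp, dtp):
        x1, y1 = max(gtp[0], dtp[0]), max(gtp[1], dtp[1])
        x2, y2 = min(gtp[2], dtp[2]), min(gtp[3], dtp[3])
        inter = max(0, x2 - x1) * max(0, y2 - y1)
        area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
        union = area(gtp) + area(dtp) - inter
        return inter / union if union else 0.0

    def prf1(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        return precision, recall, f1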
E. Results and Analysis
TABLE I: Evaluation Result for Detection Model
Models: FR-RCNN-R50-FPN-1x, FR-RCNN-R50-FPN-3x, FR-RCNN-R101-DC5-3x, FR-RCNN-R101-C4-3x, and ours (ResNeXt-101 + FPN). Metrics: AP at IoU 0.5, 0.75 and 0.5:0.95; AR; F1.
1) Table Detection:
Table I shows the comparison of the proposed method and other baselines. We fine-tuned the following table detection models on our Financial Documents dataset: FR-RCNN-R50-FPN is the Faster R-CNN model based on ResNet-50 with an FPN backbone, FR-RCNN-R101-DC5 is the Faster R-CNN model based on ResNet-101 with dilations in conv5, and FR-RCNN-R101-C4 is the Faster R-CNN model using a ResNet conv4 backbone with a conv5 head. As the table shows, our proposed method achieves better results on all metrics. This is mostly because we implemented Faster R-CNN in conjunction with FPN on a ResNeXt-101 backbone. ResNeXt-101 requires minimal extra effort in designing each path, and its 101 layers double the depth of the ResNet-50 used by the FR-RCNN-R50-FPN-1x and FR-RCNN-R50-FPN-3x models. Thus, while the AP at IoU 0.5 of these models is similar to ours, the AP over IoU 0.5 to 0.95 of our model is clearly higher. Moreover, unlike ResNet, the neurons on one ResNeXt path are not connected to the neurons on other paths. In addition, compared with models based on Faster R-CNN with DC5 and C4 backbones, such as the FR-RCNN-R101-DC5-3x and FR-RCNN-R101-C4-3x models respectively, the combination of FPN with 101 layers uses standard Conv and FC heads for mask and box prediction and therefore obtains the best accuracy. The full experimental results are shown in Table I.

Fig. 6: Three samples representing the procedure of table detection and extraction. Each row shows the stages from original page images (a), (d), (g), to detection results (b), (e), (h), and to extraction results (c), (f), (i)
2) Table Extraction:
In the experiment on table extraction, we present the tables stored in the database (as in Figure 6) after applying our segmentation and restructuring rules to the table regions, since our approach relies on the real spatial positions of sub-cells rather than on an object recognition architecture. The table schema is the set of table headers separated from the table instances, and it shows the table structure and data types; all values in the table body are specific attributes of records. We extract tables from the original PDF page images and save them in the database for practical use. The accuracy of text extraction depends on the OCR technique and the language of the original PDF documents. Our approach extracts all text within tables from the Financial Documents dataset apart from special characters. The diverse tabular formats in PDF files increase the difficulty of structure reconstruction: each instance can be restructured completely, but minor errors remain in assigning values to attributes.
V. CONCLUSION
In this paper, we propose a PDF table detector that locates tabular regions in PDF documents using a Faster R-CNN model with an FPN backbone, together with a hybrid layout segmentation and reconstruction approach based on the OCR technique. We convert the original PDF documents into page images, detect table areas with the deep learning model and extract all text and structure from spatial locations. We manually annotated a new dataset for table detection and extraction, and our proposed method achieves outstanding performance on detecting table areas. Table schema and instances are separated effectively and saved in a database for practical use. In the future, we intend to add further object classes, such as figure, formula and text, to the annotated dataset to improve the accuracy of table detection. There are also some limitations in extracting special characters from table regions, so optimizing the text extraction method is another focus.
REFERENCES
[1] S. Khusro, A. Latif, and I. Ullah, "On methods and tools of table detection, extraction and annotation in PDF documents," Journal of Information Science, vol. 41, no. 1, pp. 41–57, 2015.
[2] M. Li, L. Cui, S. Huang, F. Wei, M. Zhou, and Z. Li, "TableBank: Table benchmark for image-based table detection and recognition," arXiv preprint arXiv:1903.01949, 2019.
[3] S. S. Paliwal, D. Vishwanath, R. Rahul, M. Sharma, and L. Vig, "TableNet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images," in 2019 International Conference on Document Analysis and Recognition (ICDAR). IEEE, 2019, pp. 128–133.
[4] A. Kekare, A. Jachak, A. Gosavi, and P. Hanwate, "Techniques for detecting and extracting tabular data from PDFs and scanned documents: A survey," Tabula, vol. 7, no. 01, 2020.
[5] X.-H. Li, F. Yin, and C.-L. Liu, "Page object detection from PDF document images by deep structured prediction and supervised clustering," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 3627–3632.
[6] A. Das, S. Roy, U. Bhattacharya, and S. K. Parui, "Document image classification with intra-domain transfer learning and stacked generalization of deep convolutional neural networks," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 3180–3185.
[7] C. V. Loc, J.-C. Burie, and J.-M. Ogier, "Document images watermarking for security issue using fully convolutional networks," in 2018 24th International Conference on Pattern Recognition (ICPR). IEEE, 2018, pp. 1091–1096.
[8] D. Prasad, A. Gadpal, K. Kapadni, M. Visave, and K. Sultanpure, "CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2020, pp. 572–573.
[9] Y. Deng, A. Kanervisto, J. Ling, and A. M. Rush, "Image-to-markup generation with coarse-to-fine attention," in International Conference on Machine Learning, 2017, pp. 980–989.
[10] S. Ren, K. He, R. Girshick, and J. Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks," in Advances in Neural Information Processing Systems, 2015, pp. 91–99.
[11] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie, "Feature pyramid networks for object detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 2117–2125.
[12] J. Hui, "Understanding feature pyramid networks for object detection (FPN)," 2018.
[13] S. Wiseman and A. M. Rush, "Sequence-to-sequence learning as beam-search optimization," arXiv preprint arXiv:1606.02960, 2016.