TableLab: An Interactive Table Extraction System with Adaptive Deep Learning
NANCY XIN RU WANG, IBM Research, USA
DOUGLAS BURDICK, IBM Research, USA
YUNYAO LI, IBM Research, USA
Table extraction from PDF and image documents is a ubiquitous task in the real world. Perfect extraction quality is difficult to achieve with one single out-of-the-box model due to (1) the wide variety of table styles, (2) the lack of training data representing this variety, and (3) the inherent ambiguity and subjectivity of table definitions between end-users. Meanwhile, building customized models from scratch can be difficult due to the expensive nature of annotating table data. We attempt to solve these challenges with TableLab by providing a system where users and models seamlessly work together to quickly customize high-quality extraction models with a few labelled examples for the user's document collection, which contains pages with tables. Document collections often contain tables created with a limited set of templates or similar structures. Given an input document collection, TableLab therefore first detects tables with similar structures (templates) by clustering embeddings from the extraction model. It then selects a few representative table examples already extracted with a pre-trained base deep learning model. Via an easy-to-use user interface, users provide feedback on these selections without necessarily having to identify every single error. TableLab then applies such feedback to finetune the pre-trained model and returns the results of the finetuned model back to the user. The user can choose to repeat this process iteratively until obtaining a customized model with satisfactory performance.

CCS Concepts: • Human-centered computing → User interface design.

Additional Key Words and Phrases: table extraction, neural networks, label correction
ACM Reference Format:
Nancy Xin Ru Wang, Douglas Burdick, and Yunyao Li. 2021. TableLab: An Interactive Table Extraction System with Adaptive Deep Learning. In IUI '21 Companion, April 14–17, 2021, College Station, TX, USA. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3397482.3450718
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
© 2021 Copyright held by the owner/author(s).
Manuscript submitted to ACM

1 INTRODUCTION
Recently, there has been increasing interest in extracting complex structures such as tables from PDF and image documents [1]. Table extraction involves identifying the border and the cell structure of each document table such that it can be displayed in a structured format like HTML. The motivation for TableLab came from requests from industry professionals for the ability to easily create ground truth data and customize models for extracting tables from their specific document collections. TableLab accomplishes this by addressing the following table extraction challenges.

First, there is great diversity of table formatting across different document types and sources. Tables from invoices are formatted differently than those from scientific articles or financial reports, with the visual clues across sources providing conflicting information about the table border and/or structure. Thus, creation of a single high-quality model to support table extraction from the wide diversity of document types is difficult if not impossible, considering that even humans can disagree about table definitions from the same source document (see Figure 1). Despite the diversity of table formats encountered in real-world settings, the user's needs and table extraction expectations are ultimately the most important. TableLab leverages this observation by supporting finetuning of a high-quality table extraction model, trained on hundreds of thousands of tables, using a small number of user-labelled examples.

TableLab also supports efficient labelling of tables in documents, which involves two sub-problems. First, how do we select the most useful examples for labelling, i.e., those which improve finetuned model accuracy the most? Second, how do we effectively label individual example tables, particularly table structure where the same error repeatedly occurs? The mechanisms TableLab uses to improve labeller efficiency are described in the overview section.
Fig. 1. Example document with ambiguous tables. Whether the main invoice table should be one or two tables will depend on downstream tasks for the user.
2 RELATED WORK
Many deep-learning solutions have been applied to the table extraction problem in recent years. Examples for table and cell region detection include [2, 4, 6], while [4–7, 9] address table structure extraction. However, none of these solutions are able to extract tables exactly for all documents from all domains, due to the wide variety and ambiguity of the problem (see Figure 1 for an example) [1, 3]. Additionally, labelled data is tedious to create. There are a few existing large-scale datasets for scientific papers [9] and financial reports [8], but many documents in business are confidential. Research into ease of labelling and active learning for tables is not as well studied as table structure extraction. Hoffswell et al. [3] design a system to help users repair extracted tables with a mobile interface. However, their users are unable to directly improve the extraction model with their annotations.
3 OVERVIEW
Current table extraction systems extract tables without the option to give feedback. Since these systems do not work well on all document types, users can be frustrated by the lower quality of extractions without the ability to improve them. Our system finetunes models in an iterative fashion: the user collaborates with the model to quickly label examples and see improvements. In our system, we first use deep learning models to detect tables and cells and to generate table structure. Using visual embeddings derived from the model, we cluster documents into templates in order to recommend specific pages for users to label, balancing ease of labelling against impact on the model across a large variety of styles. Thanks to this recommendation system, we minimize the amount of labelled data required. As well, since our table extraction model is modular in nature, labels for some components (e.g., table borders) can immediately improve results in others (e.g., cell borders), so that the user does not need to repair every error in the table extraction process.

Fig. 2. System architecture of TableLab. Details for each step are described in the overview section.
To begin, we apply our table extraction module (based on the GTE framework [8]) to the user's document collection. We provide a few base model weights that have been pre-trained on different document types, so that users can select the one that best matches their collection. After the deep models have been applied, we input the resulting table and cell bounding boxes, as well as the document text snippets (scanned and image documents are first processed with an OCR engine to extract the text), into our structure clustering model. This model determines the row and column assignment and the content of each cell, such that the table can be represented in a structured format such as HTML.
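As a toy illustration of the kind of coordinate-based grouping a structure model performs, the sketch below clusters detected cell bounding boxes into rows and columns by their top-left coordinates. The names and tolerance are hypothetical; TableLab's actual structure clustering model is considerably more involved.

```python
def group_by_axis(boxes, axis, tol=5.0):
    """Group boxes whose start coordinate on the given axis
    (0 = x for columns, 1 = y for rows) lies within `tol` pixels
    of the first box in the current group."""
    groups = []
    for box in sorted(boxes, key=lambda b: b[axis]):
        if groups and abs(box[axis] - groups[-1][0][axis]) <= tol:
            groups[-1].append(box)
        else:
            groups.append([box])
    return groups

# Each box is (x, y, width, height) for one detected cell.
cells = [(10, 10, 40, 12), (60, 11, 40, 12),
         (10, 30, 40, 12), (61, 31, 40, 12)]

rows = group_by_axis(cells, axis=1)   # cluster on y -> rows
cols = group_by_axis(cells, axis=0)   # cluster on x -> columns
print(len(rows), len(cols))           # 2 2
```

Each (row group, column group) pair then yields a grid coordinate for the cell, from which an HTML table can be emitted.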
After the deep learning models have been applied to the collection, the visual embeddings from the detection models are used to cluster the document collection into templates. After clustering, the lowest- and highest-confidence pages of each template are selected for user labelling. Labels for the lowest-confidence pages will provide the most benefit to the model, while the highest-confidence pages should be easy and quick to label, allowing for faster feedback. An icon for each label recommendation type is shown beside each page in the user interface.
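The clustering-and-selection step can be sketched as follows. The embeddings and confidences here are synthetic, and the minimal k-means helper and variable names are hypothetical stand-ins, not TableLab's actual API:

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Minimal k-means: returns a cluster label per row of X."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # assign each point to its nearest center
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(axis=0)
    return labels

rng = np.random.default_rng(1)
embeddings = rng.normal(size=(100, 16))   # one visual embedding per page
confidences = rng.uniform(size=100)       # model confidence per page

labels = kmeans(embeddings, k=4)          # pages -> templates
recommended = []
for t in range(4):
    pages = np.flatnonzero(labels == t)
    if len(pages) == 0:
        continue
    # lowest confidence: most informative; highest: quick to verify
    recommended.append((int(pages[np.argmin(confidences[pages])]),
                        int(pages[np.argmax(confidences[pages])])))
print(recommended)  # one (hard, easy) page pair per template
```

The pairing gives the user both an impactful page and a fast-to-check page per template.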
When the initial table structure has been extracted and label recommendations determined, the user is able to view the extracted tables and provide feedback as needed. In a typical case, the user can first adjust the table border. This prompts the system to redo the structure clustering, producing an updated extracted table. Sometimes this results in a completely correct extraction, and the user may submit the page for finetuning at this point. Otherwise, the user has full control to merge or split cells, or whole columns and rows, similar to manipulations in a spreadsheet program. By leveraging the layout of the text snippet positions, users can split and move cell content by text chunks rather than word by word. The user may also edit a text snippet by typing in the content and adjusting its bounding box.

Fig. 3. TableLab in model selection and table editing view. The leftmost panel contains previews of each page in the input document collection with detected table boundaries, their confidence, and template type, as well as a yellow or red tag if they are recommended for labelling. In the center, the user may view the page in Overlay mode, with table boxes (and text, when a text box is selected) overlaid as shown, or magnify the contents in Magnify mode. The user can interact with the boxes to change their size and location. The right panel shows the extracted table and our table editing features. We also show the additional panel where users can view their labelling progress, finetune models with labelled data, select the model used to extract tables, add additional documents, and download the table annotations.
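The chunk-level splitting described above can be illustrated with a toy sketch (hypothetical names, not TableLab's actual code): when a user splits a cell, each text snippet moves whole to one side based on its bounding-box position, rather than being cut word by word.

```python
def split_cell(snippets, split_x):
    """snippets: list of (text, x0, x1) spans inside one cell.
    Assign each snippet to the left or right of split_x by the
    midpoint of its horizontal extent."""
    left = [s for s in snippets if (s[1] + s[2]) / 2 < split_x]
    right = [s for s in snippets if (s[1] + s[2]) / 2 >= split_x]
    return left, right

snippets = [("Invoice #1042", 10, 90), ("2021-02-14", 120, 180)]
left, right = split_cell(snippets, split_x=100)
print([s[0] for s in left], [s[0] for s in right])
# ['Invoice #1042'] ['2021-02-14']
```

Keeping snippets intact means the user never has to reassemble a date or identifier that a word-level split would have torn apart.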
When the page has the correct table extraction, the user may submit the page for model finetuning and apply the customized model to their collection for improved extraction results. If there are additional errors, the user may make additional corrections and repeat the finetuning process. For a typical collection, we find that one finetuning round is generally enough to correct the rest of the collection, but this depends largely on the diversity of the collection itself.
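The overall feedback loop can be summarized with a small sketch. All functions here are hypothetical stand-ins for TableLab's components, and the toy "model" is just a score that improves with each finetuning round:

```python
def finetune_until_satisfied(model, collection, extract, get_feedback,
                             finetune, max_rounds=3):
    """Iterate: extract -> collect user corrections -> finetune,
    stopping when the user submits no further corrections."""
    for _ in range(max_rounds):
        extractions = extract(model, collection)
        corrections = get_feedback(extractions)
        if not corrections:              # user is satisfied
            return model, extractions
        model = finetune(model, corrections)
    return model, extract(model, collection)

# Toy simulation: corrections keep coming while quality is below 0.9,
# and each finetuning round raises quality by 0.2.
model = 0.6
extract = lambda m, docs: [m] * len(docs)
get_feedback = lambda ex: ["fix"] if ex[0] < 0.9 else []
finetune = lambda m, cor: min(1.0, m + 0.2)

final, results = finetune_until_satisfied(model, ["doc"], extract,
                                          get_feedback, finetune)
print(final)  # 1.0
```

The loop terminates either when the user stops providing corrections or after a fixed budget of rounds, mirroring the one-round-is-usually-enough observation above.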
4 DEMO DETAILS
TableLab is developed in React and Flask, while the model (GTE) is developed with TensorFlow. The models preloaded in the demonstration were trained on PubLayNet and PubTabNet, which are large datasets from the scientific papers domain. The documents shown for detection and correction labelling are from FinTabNet, which contains tables from annual reports of S&P 500 companies. We demonstrate TableLab's ability to customize models on this new domain, whose tables have different styles.
There are three main use cases for our demo. First, users can simply visualize table extraction results with TableLab. Second, AI engineers and scientists can use our tool to quickly create ground truth labels for their documents. Finally, end users can create custom models with their private document collections with our interactive TableLab system.

REFERENCES
[1] Douglas Burdick, Marina Danilevsky, Alexandre V Evfimievski, Yannis Katsis, and Nancy Wang. 2020. Table extraction and understanding for scientific and enterprise applications. Proceedings of the VLDB Endowment 13, 12 (2020), 3433–3436.
[2] Azka Gilani, Shah Rukh Qasim, Muhammad Imran Malik, and Faisal Shafait. 2017. Table Detection Using Deep Learning. In International Conference on Document Analysis and Recognition (ICDAR), Vol. 01. 771–776.
[3] Jane Hoffswell and Zhicheng Liu. 2019. Interactive Repair of Tables Extracted from PDF Documents on Mobile Devices. In CHI.
[4] Minghao Li, Lei Cui, Shaohan Huang, Furu Wei, Ming Zhou, and Zhoujun Li. 2019. TableBank: Table Benchmark for Image-based Table Detection and Recognition. ArXiv abs/1903.01949 (2019).
[5] Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure. 2020. CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops. 572–573.
[6] Sebastian Schreiber, Stefan Agne, Ivo Wolf, Andreas Dengel, and Sheraz Ahmed. 2017. DeepDeSRT: Deep Learning for Detection and Structure Recognition of Tables in Document Images. In International Conference on Document Analysis and Recognition (ICDAR), Vol. 01. 1162–1167.
[7] Chris Tensmeyer, Vlad I Morariu, Brian Price, Scott Cohen, and Tony Martinez. 2019. Deep Splitting and Merging for Table Structure Decomposition. In International Conference on Document Analysis and Recognition (ICDAR). IEEE, 114–121.
[8] Xinyi Zheng, Doug Burdick, Lucian Popa, and Nancy Xin Ru Wang. 2020. Global Table Extractor (GTE): A Framework for Joint Table Identification and Cell Structure Recognition Using Visual Context. arXiv preprint arXiv:2005.00589 (2020).
[9] Xu Zhong, Elaheh ShafieiBavani, and Antonio Jimeno Yepes. 2019. Image-based table recognition: data, model, and evaluation. arXiv preprint arXiv:1911.10683 (2019).