On Extracting Data from Tables that are Encoded using HTML
Juan C. Roldán∗, Patricia Jiménez, Rafael Corchuelo
University of Seville, ETSI Informática
Avda. Reina Mercedes s/n, Sevilla E-41012, Spain
Abstract
Tables are a common means to display data in human-friendly formats. Many authors have worked on proposals to extract those data back since this has many interesting applications. In this article, we summarise and compare many of the proposals to extract data from tables that are encoded using HTML and have been published between and . We first present a vocabulary that homogenises the terminology used in this field; next, we use it to summarise the proposals; finally, we compare them side by side. Our analysis highlights several challenges to which no proposal provides a conclusive solution and a few more that have not been addressed sufficiently; simply put, no proposal provides a complete solution to the problem, which seems to suggest that this research field shall keep active in the near future. We have also realised that there is no consensus regarding the datasets and the methods used to evaluate the proposals, which hampers comparing the experimental results.
Keywords:
HTML documents; web tables; table mining; data extraction.
1. Introduction
Tables are a common means of displaying data in web documents because people can easily spot and interpret them [3, 5]. The estimations are as high as hundreds of millions; for instance, Lehmberg et al. [35] and Galkin et al. [25] found and billion tables in different editions of the Common Web Crawl, respectively, and Crestan and Pantel [16] found . billion tables in their own crawl. Cafarella et al. [5] also highlighted the explosion of consumer demand for data that comes from tables thanks to the increasing popularity of voice assistants and infobox-like search results.

In this context, data extraction consists in transforming tables into structured formats that focus on their data and abstract away from how they are displayed.

∗ Corresponding author.
Email addresses: [email protected] (Juan C. Roldán), [email protected] (Patricia Jiménez), [email protected] (Rafael Corchuelo)

Preprint submitted to Knowledge-Based Systems, April 3, 2019.

Data extraction has many applications to text mining [24, 64, 65], data (meta-)search [3, 9, 18, 26, 44, 51, 63–65], query expansion [16], document summarisation [40, 64], question answering [1, 20, 44, 46, 65], knowledge discovery [9, 22, 26, 32, 44, 46], knowledge base construction [17, 72], knowledge augmentation [1, 9, 18, 20, 56, 57, 67], synonym finding [1, 3, 39], improving accessibility [43, 47, 49, 64, 65], textual advertising [15], data compression [2, 49], or creating linked data [22, 33], just to mention a few common ones.

It is not surprising then that many researchers have worked on a variety of proposals to extract data from tables, which has motivated others to write articles in which they summarise and compare them. Lopresti and Nagy [41, 42] presented a definition of table, with a focus on how they are encoded and displayed, and motivated the need to extract data from them; they summarised some data extraction techniques, as well as some techniques to integrate the resulting data. Hurst [27] introduced the problem and then reported on some of the challenges regarding locating tables and their cells; he paid special attention to reporting on the evaluation of the proposals and concluded that common evaluation methods are not suitable. Zanibbi et al. [70] described the extraction tasks as abstract machine-learning procedures in which input documents are first modelled and then mapped onto observations that are transformed prior to performing inference; they analysed many existing proposals according to how they address the steps of the previous procedure; they also highlighted the need for common evaluation methods. Costa-Silva et al.
[14] discussed what a table is and what makes it different from a diagram; they then listed many proposals to implement the tasks involved in extracting data from tables and compared them using several comparison frameworks; they also criticised common evaluation methods and contributed some specific-purpose evaluation measures. Embley et al. [21] first discussed the definition of table and then motivated the need to extract data from them by describing many applications; they listed some proposals to locate tables and their cells, but their emphasis was on the tasks to classify the cells, to group them, and to interpret the tables.

The previous articles focus on the proposals that were published between and . Unfortunately, there is not a recent article that summarises and compares the proposals that were published later, which motivated us to work on it. Our focus is on proposals that work on tables that are encoded using HTML because there has been a steady shift towards encoding them using this language [3, 18, 35], which provides specific-purpose tags and has become pervasive. We have analysed proposals that were published between and , we have defined a vocabulary that homogenises the terminology used in this field, we have used it to summarise the proposals as homogeneously as possible, and we have compared them side by side using several objective characteristics. We have identified several challenges to which no proposal provides a conclusive solution and also several challenges that have not been addressed sufficiently; addressing them in the future shall definitely help produce solutions that increase the range of tables from which data can be extracted correctly. We have also realised that there is not a standardised evaluation method, which hampers the experimental comparison.

The rest of the article is organised as follows: Section 2 introduces the vocabulary that we have compiled; Section 3 summarises the proposals that we have analysed using the previous vocabulary; Section 4 compares them side by side using objective characteristics; finally, Section 5 concludes the article.
2. Vocabulary
In this section, we have made a point of integrating the many complementary terms that are commonly used in the literature under a common vocabulary. We first report on the vocabulary that is related to tables themselves and then on the vocabulary that is related to extracting data from them. We illustrate most of the concepts with a couple of examples.
Unfortunately, there is not a consensus definition in the literature regarding what a table is. Many authors focus on the encoding since they define them as whatever one can encode within HTML table tags [3, 7, 9, 16, 18, 28, 30, 32, 34, 38, 47, 49, 59, 66, 68], which is a pragmatic approach; a few also refer to the display of data, since they define tables as grids in which data are located in cells in a manner that lines and/or styles ease interpreting them [22, 24, 26, 28, 30, 32, 46, 49]. There is only one proposal that deviates a little from the previous approaches [21] since the authors focus on the data model behind the tables, independently from how they are displayed; their proposal, however, works on tables in which data are arranged in grids.

Neither is there a consensus taxonomy of tables. Most authors differentiate between data tables, which provide data to be extracted, and non-data tables, which are used for layout purposes or to provide utilities. Many of them also make a difference between listings, forms, matrices, and enumerations [16, 18, 26, 30, 34, 44, 46, 66], although the exact terminology used is very divergent; there is also a proposal in which tables are classified according to whether they have headers or not [22].

In the previous discussion, there are three key concepts, namely: encoding, cell, and table, which we define below.
Definition 1 (Encodings):
An encoding is a specification of how a table must be displayed to a person. Common encodings include pre-formatted text, images, and mark-ups. In a table that is encoded using pre-formatted text, the data are arranged in lines, they are aligned to their corresponding columns using blanks, and the cells may be delimited using, for instance, dashes, vertical bars, or tabulators. In a table that is encoded using an image, there is a graphic canvas onto which the data and the lines that delimit the cells, if any, are drawn using bitmaps or vectors. Contrarily, mark-ups provide a variety of tags that help encode the tables, their cells and, hopefully, additional information that helps interpret them. There are several mark-up languages available [58], but our focus is on HTML due to its pervasiveness in the Web. HTML provides an array of table-related tags, namely: table, thead, tbody, tfoot, col, colgroup, th, tr, td, and caption. It is relatively easy to extract data from tables that are encoded using the previous tags. Unfortunately, real-world tables have a variety of intricacies that hamper the extraction process, namely: some tables are encoded using a subset of table-related tags that hardly help locate them and their cells, which does not help interpret them; other tables are encoded using listing tags (ul, ol, dl, li, dd, and dt) [9, 20, 36, 37]; lately, it is also relatively common to find tables that are encoded using block tags (div and span) due to their ability to create responsive layouts [50]; and, generally speaking, there are many tables that are encoded using a variety of tags that are not actually related to tables, but look like tables when they are displayed [24, 26]. □

Definition 2 (Cells):
A cell is a box that provides contents to a table. Cells can be classified along several axes, namely: a) According to how they are segmented, cells can be single cells, which occupy exactly one position in the grid of a table, or spanned cells, which occupy more than one position. b) According to whether their contents are complete or not, cells can be classified as single-part cells, whose contents are complete, and multi-part cells, which provide partial contents that must be somehow merged with the contents of other cells. c) According to their function, they can be classified as meta-data cells, whose contents are labels that help people understand other contents in the table, data cells, whose contents provide the data that must be extracted, decorator cells, which provide irrelevant contents, and context-data cells, which provide captions, notes, or factorised data. d) According to how their contents must be interpreted, cells can be classified as factorised cells, whose contents must be borrowed from adjacent cells, void cells, which are not intended to provide any contents, atomic cells, whose contents cannot be decomposed further, and structured cells, whose contents can be decomposed into a mixture of data and meta-data. □
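The four classification axes above can be modelled directly as data types. The following is a minimal sketch, not part of any surveyed proposal; all identifier names are ours, while the categories and their meanings are taken from the definition.

```python
# A sketch of the four classification axes of Definition 2 as Python
# enumerations; the identifiers are ours, the categories are the definition's.
from dataclasses import dataclass
from enum import Enum

class Segmentation(Enum):
    SINGLE = "single"                # occupies exactly one grid position
    SPANNED = "spanned"              # occupies more than one grid position

class Completeness(Enum):
    SINGLE_PART = "single-part"      # contents are complete
    MULTI_PART = "multi-part"        # must be merged with other cells

class Function(Enum):
    META_DATA = "meta-data"          # labels that help interpret the table
    DATA = "data"                    # the data to be extracted
    DECORATOR = "decorator"          # irrelevant contents
    CONTEXT_DATA = "context-data"    # captions, notes, or factorised data

class Interpretation(Enum):
    FACTORISED = "factorised"        # contents borrowed from adjacent cells
    VOID = "void"                    # not intended to provide contents
    ATOMIC = "atomic"                # cannot be decomposed further
    STRUCTURED = "structured"        # mixes data and meta-data

@dataclass
class Cell:
    contents: str
    segmentation: Segmentation = Segmentation.SINGLE
    completeness: Completeness = Completeness.SINGLE_PART
    function: Function = Function.DATA
    interpretation: Interpretation = Interpretation.ATOMIC

# A structured cell: both meta-data ("N:", "P:") and data in one box.
risk = Cell("N: +1.60 P: -11.04", interpretation=Interpretation.STRUCTURED)
print(risk.interpretation.value)  # structured
```

Note that the axes are orthogonal: a single cell may be, for instance, both spanned and structured, which is why they are modelled as independent fields.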
Definition 3 (Tables):
A table is a collection of cells that are arranged in rows and columns within a grid, where lines and/or styles are typically used to help people interpret them. There are cases in which some context data are provided in the text that surrounds a table, i.e., captions, notes, and factorised data. The cells in a table are typically grouped as follows according to their functions: headers, which are groups of meta-data cells, tuples, which are groups of data cells, and separators, which are groups of decorator cells. Typically, headers are arranged on the first few rows and/or columns, but we have found some tables in which they are interwoven with tuples for the sake of readability; it was the case of long listings with many tuples, in which it makes sense to repeat the header rows or columns every few tuples, or wide listings/forms with many headers, in which it makes sense to split the header rows or columns to narrow them. Data tables can be broadly classified as follows: a) listings, in which the headers, if any, occupy either the first few rows or columns and the tuples are arranged in the remaining rows or columns, respectively; b) forms, in which the headers, if any, occupy either the first few rows or columns and there is a single tuple that is arranged row- or column-wise, respectively; c) matrices, in which the headers occupy both the first few rows and columns, and all of the data cells constitute a single tuple; and d) enumerations, in which there are no headers and each individual cell can be considered a tuple. According to Crestan and Pantel [16], this taxonomy covers roughly of the data tables in their -billion table repository; the authors mention that it is arguable that the remaining tables can be considered actual data tables or that they are frequent enough to be representative. □

Figure 1: A sample table.

Example 1:
Figure 1 shows a horizontal listing taken from a document on a share market. The black, solid lines help delimit the boundaries of the cells in the grid; the greyed, dashed lines represent the boundaries of a few cells that exist in the encoding of the table, but are not visible to the reader because they are used for layout purposes. Most of the data are displayed on an × grid, but there are also some context data in the surrounding text.

Cells like “Time” are single because they occupy exactly one position in the grid; on the contrary, cell “Volume” is spanned because it occupies two positions in the grid. The meta-data cells occupy the first two rows; for instance, “Risk” is one such meta-data cell. Contrarily, cell “ ” below it is a data cell and cell “Acme Inc.” is a decorator cell. The caption of the table is displayed within a bottom cell that spans the whole table, which is considered a context-data cell; realise that there are additional context data: there is a note regarding cell “AAL” and there is a factorised datum regarding the date of the report, which complements the times provided in some cells. Cells like “ ” are single-part cells because their contents are complete; on the contrary, cells like “ ” are multi-part cells because it is necessary to merge the contents of two cells so that the contents of the resulting cell are complete. A cell like “ ” is atomic since its contents cannot be decomposed further; contrarily, a cell like “N: +1.60 P: -11.04” is a structured cell because it provides both meta-data and data, which means that it can be decomposed further. The empty cell below cell “VISA” very likely factorises the ticker since there are two tuples regarding this company at 09:00 and 17:00, respectively; contrarily, the four empty cells on the right of cell “ ” are very likely void cells that indicate that no data are available.

The table has seven headers, namely: “ ”, “Sample/Ticker”, “Sample/Time”, “Today/Volume”, “Today/Risk”, “Yesterday/Volume”, and “Yesterday/Risk”. It provides five tuples, the first of which is (“ ”, “KML”, “ ”, “ ”, “N: -0.95 P: -8.48”, “ ”, “ ”, “N: -2.15 P: -0.07”, “ ”). It also has a separator at the fifth row, which shows an advertisement.

In our context, data extraction refers to a process that transforms the tables in an input document into record sets. A record is a data structure in which the individual data in a tuple are endowed with semantics by means of descriptors that are computed from the meta-data provided by the corresponding table; in cases in which the table does not provide enough meta-data, the descriptors must be generated artificially.

Costa-Silva et al. [14] did a good job at identifying the tasks of which the data-extraction process is composed, namely: location, segmentation, functional analysis, structural analysis, and interpretation. Note, however, that their focus was on tables that are encoded using pre-formatted text or images, which means that they need not make tables that provide data apart from tables that are intended for layout purposes or to provide utilities. The latter are very common on today's Web, which motivated Cafarella et al. [3], for instance, to introduce a task to discriminate data tables from non-data tables.

Before feeding the record sets returned by data extraction into a particular application, it is commonly necessary to perform some of the following integration tasks: semantisation [25, 45, 54, 55, 60, 63, 71], which maps either the descriptors onto the terminology box of a particular ontology or the tuples onto its assertion box [19]; union [23], which merges record sets that provide similar data; finding primary keys [62], which determines which components of the tuples identify them as univocally as possible; record linkage [8, 11, 12], which finds different records that refer to the same actual entities; augmentation [6, 52, 67], which joins record sets on the same topic to complete the information that they provide individually; and cleaning [10, 31, 61], which fixes data.
Note that the integration tasks are orthogonal to data extraction because they are independent from the source of the record sets, which is the reason why they fall out of the scope of this article.

In the previous discussion, there are three key concepts: record set, extraction task, and data extraction, which we define below.
Definition 4 (Record set):
A record set is a collection of records. A record is a map that associates a set of descriptors to each of the components of a tuple. A descriptor is a structured label that endows the components of a tuple with the semantics provided by the meta-data in the corresponding headers or structured cells; if not enough meta-data are available, then descriptors must be generated artificially. We make three types of descriptors apart, namely: simple descriptors, which correspond to the contents of a single meta-data cell, field descriptors, which correspond to the contents of several adjacent meta-data cells, and artificial descriptors, which are used when not enough meta-data are available. In listings and forms, every component of the tuples has one associated descriptor; in matrices, they have two associated descriptors; in enumerations, the descriptors must be created from the meta-data in the cells, if any; in other cases, they must be generated artificially. □
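The distinction between the three types of descriptors can be sketched as a small function that maps each column's header path onto a descriptor. This is an illustration of the definition, not a surveyed algorithm; the function name and the input format are ours.

```python
# A sketch, with invented helper names, of how the three kinds of descriptors
# of Definition 4 can be derived from header paths: simple descriptors from a
# single meta-data cell, field descriptors from several adjacent meta-data
# cells, and artificial ones ("[i]") when the headers are ambiguous.
from collections import Counter

def descriptors(header_paths):
    """Map each column's header path, e.g. ("B", "C"), to a descriptor."""
    joined = [".".join(p) for p in header_paths]  # field-access notation
    counts = Counter(joined)
    seen, result = Counter(), []
    for label in joined:
        if counts[label] == 1:
            result.append(label)                  # simple or field descriptor
        else:                                     # not enough meta-data:
            seen[label] += 1                      # disambiguate artificially
            result.append(f"{label}[{seen[label]}]")
    return result

# One simple header, two ambiguous two-level headers, one unambiguous one.
print(descriptors([("A",), ("B", "C"), ("B", "C"), ("B", "D")]))
# ['A', 'B.C[1]', 'B.C[2]', 'B.D']
```

The array-access suffix is only one possible convention for artificial descriptors; any scheme that makes otherwise identical labels apart would do.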
Definition 5 (Extraction tasks):
The tasks involved in extracting data from a table are the following [14]: a) location, which searches the input document for the excerpts in which tables are encoded and returns them; b) segmentation, which searches for the cells of which a table is composed; c) discrimination, which classifies a table as either a data table or a non-data table, but further sub-classification is possible; d) functional analysis, which classifies the cells according to their functions; e) structural analysis, which groups cells into at least headers and tuples; and f) interpretation, which produces record sets building on the results of the previous tasks. □
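For tables that are encoded using the table-related tags of Definition 1, the first two tasks admit a naive, tag-based sketch like the following; the class name is ours, and the intricacies discussed earlier (spanned cells, nested tables, listing- or block-tag encodings) are deliberately ignored.

```python
# A naive sketch of the location and segmentation tasks for tables that are
# encoded with table-related tags. Spanned cells, nested tables, and block-tag
# encodings are deliberately ignored; the class name is ours.
from html.parser import HTMLParser

class NaiveSegmenter(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tables, self._rows, self._row, self._cell = [], None, None, None

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._rows = []                    # location: a table tag found
        elif tag == "tr" and self._rows is not None:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []                    # segmentation: start of a cell

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self._rows.append(self._row)
            self._row = None
        elif tag == "table" and self._rows is not None:
            self.tables.append(self._rows)
            self._rows = None

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

s = NaiveSegmenter()
s.feed("<p>text</p><table><tr><th>Ticker</th></tr><tr><td>KML</td></tr></table>")
print(s.tables)  # [[['Ticker'], ['KML']]]
```

As the survey below shows, many proposals rely on essentially this approach, which is precisely why tables that avoid table-related tags defeat them.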
Definition 6 (Data extraction):
Data extraction refers to a process that organises the extraction tasks into a pipeline so that they can achieve their goal. Zanibbi et al. [70] and Costa-Silva et al. [14] reported on the many common inter-dependencies amongst the extraction tasks. □
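A sequential arrangement of the tasks can be sketched as follows. Every stage below is a trivial placeholder standing in for the very different implementations surveyed in this article; the function names and the document format are ours.

```python
# A sketch of Definition 6: the extraction tasks chained into a sequential
# pipeline. Each stage is a placeholder; actual proposals implement them very
# differently and often interleave them.
def locate(document):           return document["excerpts"]
def segment(excerpt):           return excerpt                   # find the cells
def discriminate(tables):       return [t for t in tables if t["is_data"]]
def analyse_functions(table):   return table                     # classify cells
def analyse_structure(table):   return table                     # headers/tuples
def interpret(table):           return [{"rows": table["rows"]}]

def extract(document):
    tables = [segment(e) for e in locate(document)]
    tables = discriminate(tables)
    tables = [analyse_structure(analyse_functions(t)) for t in tables]
    return [record for t in tables for record in interpret(t)]

doc = {"excerpts": [{"is_data": False, "rows": 9},   # a layout (non-data) table
                    {"is_data": True,  "rows": 2}]}  # a data table
print(extract(doc))  # [{'rows': 2}]
```

The strictly linear composition is only the simplest arrangement; the inter-dependencies reported by Zanibbi et al. [70] and Costa-Silva et al. [14] mean that real pipelines often feed results back into earlier tasks.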
Example 2:
Figure 2 illustrates a sample data extraction process in which we have organised the tasks into a sequential pipeline.

Figure 2: A sample data extraction process.

The location task finds two excerpts in the input document that seem to have tables; the segmentation task is responsible for finding the individual cells of which the tables are composed, plus the context data that are associated with them; the discrimination task makes a difference between the table on the left, which seems to be a menu that does not provide any data, and the table on the right, which seems to be a table that provides data; the functional analysis task makes meta-data cells apart from data cells; the structural analysis task groups the meta-data cells into four headers and the data cells into two tuples; finally, the interpretation task produces a record set with three records.

Regarding the descriptors, we illustrate them using the usual field-access notation for simple and field descriptors and the usual array-access notation for artificial descriptors. For instance, header “A/A” results in a simple descriptor of the form “A” because both cells were actually a vertically-spanned cell in the original table. On the contrary, header “B/C” results in a field descriptor of the form “B.C” in which it is clear that whatever “C” represents is subordinated to whatever “B” represents; note that this descriptor is ambiguous since there are two columns of the table with the same header. In such cases, the table does not provide enough meta-data and the columns must be made apart by means of artificial descriptors, that is, “B.C[1]” and “B.C[2]”. Obviously, header “B/D” results in a field descriptor of the form “B.D”.

The records extracted are the following: {"A": "e", "B.C[1]": "f", "B.C[2]": "g", "B.D": "h"}, {"A": "i", "B.C[1]": "j", "B.C[2]": "k", "B.D": "l"}, and {"$caption": "Table 1: XXX"}. Realise that the last record uses a special simple descriptor to indicate that the corresponding datum is the caption of the table.
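Given the descriptors and the tuples produced by the earlier tasks, the interpretation step of this example reduces to pairing them up. The following sketch, with invented helper names, reproduces the first two records above under that assumption.

```python
# A sketch of the interpretation task applied to the listing of Figure 2,
# assuming location, segmentation, and the analyses have already produced the
# descriptors and the tuples; the function name is ours.
def interpret(descriptors, tuples):
    """Zip each tuple with the descriptors to produce a record set."""
    return [dict(zip(descriptors, t)) for t in tuples]

records = interpret(["A", "B.C[1]", "B.C[2]", "B.D"],
                    [("e", "f", "g", "h"), ("i", "j", "k", "l")])
print(records[0])  # {'A': 'e', 'B.C[1]': 'f', 'B.C[2]': 'g', 'B.D': 'h'}
```

Context data such as the caption would be emitted as an extra record with a special descriptor (here, "$caption"), exactly as in the example above.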
Table : Tasks addressed by each proposal.

3. Summary of proposals

In this section, we summarise many proposals that have not been surveyed previously, cf. Table . Note that our summaries are intended to provide an overall picture using the vocabulary that we have presented previously; please, refer to the original articles for the full descriptions.

3.1. Location

Yoshida et al. [ ], Elmeleegy et al. [ ], Lautert et al. [ ], Braunschweig et al. [ ], Embley et al. [ ], and Nishida et al. [ ] did not pay attention to the location task. Chen et al. [ ], Cohen et al. [ ], Hurst [ ], Yang and Luk [ ], Kim and Lee [ ], Jung and Kwon [ ], Okada and Miura [ ], Cafarella et al. [3], Son and Park [59], Chu et al. [9], Eberius et al. [18], Milošević et al. [44], Wu et al. [66], and Liao et al. [38] reported on naive approaches that consisted in extracting every HTML excerpt with a table tag; Penn et al. [49], Wang and Hu [64], and Crestan and Pantel [16] followed the same approach but discarded tables with nested tables. The other proposals provide more sophisticated approaches.

Lerman et al. [36, 37] focused on tables that are encoded using listing tags. Their proposal works as follows: a) first, the input documents are tokenised and the tokens are assigned to lexical types; b) then, the smallest input document is taken as a base template; c) the remaining documents are then iteratively compared to the template in order to make the sequences of tokens that appear exactly once apart from the others; d) finally, the excerpts of the document that have the largest repetitive sequence of tokens are returned. The authors did not evaluate their procedure in isolation, but their complete system.

Gatterbauer et al.
[26] presented a visual approach that analyses the bounding boxes used to display the elements of a document in an attempt to identify tables, lists, and so-called aligned graphics that represent tabular structures. Their proposal works as follows: a) first, they apply some heuristics to locate elements in the DOM tree that are likely to be a part of a table, a list, or an aligned graphic; the authors mention that they have compiled a collection of over twenty such heuristics, but they document only twelve of them in their paper; b) then, they apply an algorithm that searches for so-called frames, which are collections of elements that are rendered so that they form a box; c) then, the frames are expanded in four orthogonal directions by finding elements whose bounding boxes are near to each other; d) finally, the excerpts that correspond to the extended frames are returned. The authors evaluated their proposal on tables from their own repository plus additional tables from Wang and Hu's [64] repository.

Fumarola et al. [24] also presented a visual approach. Their proposal works as follows: a) it first creates a bounding box that encloses the whole input document; b) it then iterates recursively and creates a bounding box for every element in that document; c) next, it analyses the positions of the inner bounding boxes and finds those that are laid out in a row- or a column-wise manner; d) then, the corresponding excerpts are returned. The authors did not evaluate their procedure in isolation, but their complete system.

Ling et al. [39] presented a proposal whose focus is on locating context-data cells. It works as follows: a) first, it locates the elements in the input document that have a table tag; b) then, it extracts some context data from the title tag; c) next, it segments the text around the tables and aligns the resulting segments using a multiple string alignment algorithm; d) finally, the segments that are repetitive enough are considered context-data cells.
The authors did not evaluate their procedure in isolation, but their whole system.

3.2. Segmentation

Penn et al. [49], Yoshida et al. [69], Hurst [28], Wang and Hu [64], Kim and Lee [32], Okada and Miura [47], Cafarella et al. [3], Crestan and Pantel [16], Fumarola et al. [24], Lautert et al. [34], Son and Park [59], Braunschweig et al. [1], Eberius et al. [18], Milošević et al. [44], Wu et al. [66], Nishida et al. [46], and Liao et al. [38] did not report on any proposals to implement this task. Chen et al. [7], Yang and Luk [68], and Jung and Kwon [30] relied on a naive approach that searches for the cells using specific tags. The other proposals provide more sophisticated approaches.

Lerman et al. [36] focused on tables that are encoded using listing tags; implicitly, they assumed that tuples are shown in a row-wise manner. Prior to segmentation, the authors applied a document alignment method to detect the template of the documents and their repetitive segments, which are very likely to contain the lists. Once the lists are located, their proposal works as follows: a) first, the segments are grouped according to their separators; b) then, DataPro is invoked on the previous groups to learn patterns that characterise their data; c) then, for each segment in each group, it computes binary features that indicate whether it matches the previous patterns or not; d) next, the AutoClass clustering algorithm is invoked to learn the optimal number of clusters and to learn a set of rules that assign new segments to the most similar cluster; e) finally, the data in each cluster are assumed to be a column of the corresponding table, which facilitates identifying the cells in a row-wise manner. The evaluation was performed on the tables from a repository with documents that were taken from different sources. The authors did not evaluate their procedure in isolation, but their complete system.

Cohen et al. [13] relied on some transformations that help normalise tables before they are segmented.
Their proposal works as follows: a) the HTML structure is cleaned up using HTML Tidy and the extra cells generated by this tool are removed; b) structured cells are divided into multiple atomic cells by splitting inner tables, paragraphs, or pre-formatted text; c) spanned cells are split into several cells unless this results in more cells than the height or the width of the table. The authors did not evaluate their procedure in isolation, but their complete system.

Lerman et al. [37] segmented tables by learning a probabilistic model from the repetitive segments into which they decompose tables that are encoded using listing tags; they assumed that tuples are shown in a row-wise manner. Their proposal works as follows: a) lists are split into columns according to candidate separators, which can be tags or punctuation symbols; b) some content features are then computed on each column and their siblings; c) then, an inference algorithm learns a probabilistic model from the previous features; d) the parameters are then used to find the best column assignment for a segment, which is the one that maximises the probability of the features observed given the model. Their evaluation was performed on the tables from a repository with tables from web sites on book sellers, property taxes, white reports, and corrections. They also experimented with a constraint satisfaction approach that was less accurate.

Gatterbauer et al. [26] presented a proposal that requires identifying the spatial relationships between the individual cells of a table.
It works as follows: a) it computes the boxes that represent the elements of an input document, taking into account their content, padding, border, and margin areas according to the CSS2 visual formatting model; b) it then overlays a grid that helps identify each box by means of the co-ordinates of its upper-left corner and its lower-right corner; c) then, it aligns the boxes according to their horizontal and vertical projections; d) next, an adjacency relation is computed according to how distant the cells are; e) finally, some cells are selected and a recurrent expansion algorithm is invoked in an attempt to explore the adjacency relation to find their neighbours. The authors evaluated their proposal on the tables provided by a repository with documents that were retrieved from search engines, from Wang and Hu's [64] repository, or written by the authors. The authors did not evaluate their procedure in isolation, but their complete system.

Elmeleegy et al. [20] tried to find columns by checking how similar the cells in a table are. The similarity is analysed by means of their data types and delimiters. The authors used two resources: a large-scale language model, which helps identify sentences that should not be split because they have previously occurred within a cell, and a corpus of tables, which helps identify data that appear in the same column in other tables.
Their proposal works as follows: a) each row is split into a (possibly different) number of columns using two scoring functions, namely: a field quality score, which measures the quality of an individual column candidate, and a field-to-field consistency score, which measures the likelihood that two column candidates are actually the same column; b) then, it sets the number of columns to the most frequent one; c) padding columns are added to rows that have fewer columns than expected and some columns are merged otherwise; d) finally, the segmentation of cells is refined by checking the consistency amongst the cells on a per-column basis; e) if the consistency check fails, the procedure is re-launched. Their evaluation was performed on tables from different domains plus additional tables that were randomly sampled from the Web.

Ling et al. [39] assumed that tables can be segmented building on their td tags; their key contribution regards how to find context data. Their proposal works as follows: a) it first uses a number of heuristics to generate candidate context data, namely: tokens in between some punctuation marks, the longest common sub-sequences, pieces of text that can be wikified [53], and pieces of text that vary from document to document but are located at the same position; b) then, the previous context data are added to the original table as additional columns; c) finally, a pairwise adaptation of the Multiple Sequence Alignment algorithm is used to segment the context data. The evaluation was performed on
20 000 tables that were picked from a repository with million tables from different web sites.

Chu et al. [9] also focused on finding the columns of a data table. Their proposal works as follows: a) each row is tokenised using a set of user-defined delimiters; b) then candidate columns are generated using two approaches: a seed tuple is provided and the system discards segmentations that are very different; a custom pruning procedure that borrows some ideas from the well-known A* procedure is also used; c) it then measures the similarity of each column using lexical and semantic similarity functions that are averaged (the former computes the difference regarding the number of tokens, characters, and pattern-based types; the latter computes the point-wise mutual information function); d) the process is repeated until a segmentation that maximises similarity is found. Their evaluation was performed on million data tables that were transformed into lists; they used additional tables encoded as lists from five different domains.

Embley et al. [22] presented a proposal that works as follows: a) the input documents are transformed into a representation that preserves the contents only; b) spanned cells are split and their contents are copied verbatim to the resulting cells; c) then, every row with more than two empty cells is considered to provide context data; d) finally, the right-most bottom non-empty cell is considered to be the last cell in the table. (Note that they can work on tables that come from spreadsheets, in which it is not uncommon to find empty cells that are not actually part of any table.) The authors did not evaluate their procedure in isolation, but their complete system.

Lerman et al. [36], Yoshida et al. [69], Lerman et al. [37], Elmeleegy et al. [20], Ling et al. [39], Braunschweig et al. [1], Chu et al. [9], Embley et al. [22], and Milošević et al. [44] did not pay attention to the discrimination task. The other proposals provide sophisticated approaches.

Chen et al.
[7] devised a proposal to discriminate tables by means of heuristics. It works as follows: a) a cell similarity measure is computed by combining string similarity, named entity similarity, and number similarity functions; b) then, the tables whose cells do not exceed a threshold regarding the number of similar neighbour cells are discarded; c) finally, tables with fewer than two cells or tables with many links, forms, or figures are also discarded. The evaluation was performed on tables from their own repository with documents on airlines from the Chinese Yahoo! site.

Penn et al. [49] also devised a heuristic-based approach. Their proposal works as follows: a) tables that do not have multiple rows and columns are discarded; b) tables whose cells have more than one non-text-formatting tag are also discarded; c) finally, tables whose cells have more than a user-defined number of words are also discarded. The authors also mentioned that a desirable feature would be to take syntactic and semantic similarity into account, but they did not explore this idea. They experimented with an unspecified number of tables from their own repository with documents from sites on news, television, radio, and companies.

Cohen et al. [13] devised a proposal that builds on machine learning a classifier. It works as follows: a) some structural and content features are computed from a learning set with tables that are pre-classified as either data tables or non-data tables; b) then, several classifiers are machine-learnt and evaluated; c) the classifier that achieves the best effectiveness is selected to implement the discrimination task.
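By way of illustration, the select-the-best-classifier procedure in steps b) and c) can be sketched as follows. This is a minimal sketch under our own assumptions: the two stand-in learners, the feature names, and the toy data are illustrative, not the classifiers or features that the authors actually used.

```python
# Minimal sketch: train several candidate learners on a labelled set of table
# feature vectors and keep the one that is most effective on a held-out set.
# Both learners and the "numeric_ratio" feature are illustrative assumptions.

def train_majority(train):
    """Learns a classifier that always predicts the most frequent label."""
    labels = [label for _, label in train]
    majority = max(set(labels), key=labels.count)
    return lambda features: majority

def train_threshold(train):
    """Learns a cut-off: predicts 'data' when enough cells look numeric."""
    ratios = [f["numeric_ratio"] for f, label in train if label == "data"]
    cutoff = min(ratios) if ratios else 1.0
    return lambda features: ("data" if features["numeric_ratio"] >= cutoff
                             else "non-data")

def accuracy(classifier, evaluation):
    """Fraction of evaluation examples the classifier labels correctly."""
    hits = sum(1 for features, label in evaluation
               if classifier(features) == label)
    return hits / len(evaluation)

def select_best(learners, train, evaluation):
    """Trains every candidate learner and returns the most effective one."""
    scored = [(accuracy(learner(train), evaluation), name)
              for name, learner in learners.items()]
    return max(scored)[1]
```

On a learning set where data tables have mostly numeric cells, `select_best` would prefer the threshold learner over the majority baseline, mirroring step c).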
The authors experimented with Multinomial Naive Bayes, Maximum Entropy, Winnow, and a decision tree learner that was based on C4.5; their conclusion was that the best results were achieved using Winnow. They evaluated their proposal using a -trial approach on tables from their own repository; in each trial, of the tables were used for learning and the remaining for evaluation purposes.

Hurst [28] presented another machine-learning approach in which he also took visual features into account. He performed his evaluation on data tables and non-data tables from his own repository; they were randomly grouped into five sets from which of the tables were selected for learning purposes and for evaluation purposes. The results confirmed that Naive Bayes achieved the best results when the whole set of features was used, whereas Winnow worked better when only geometric features were used.

Wang and Hu [64] devised another machine-learning proposal that relies on structural and content features that are used to feed a custom decision tree learner; some of the features need to be transformed into real values using Naive Bayes or k-NN. The content features rely on the words found in the input documents, which requires a large learning set so as to minimise the chances that a classifier is applied to a document with a word that was not in the learning set. The evaluation was performed using -fold cross evaluation on
11 477 tables from their own repository with documents from Google's directories.

Yang and Luk [68] reported on another heuristic-based method. Their proposal works as follows: a) tables that have th tags are considered data tables; b) tables that do not only contain links, forms, or images are also considered data tables; c) meta-data and data cells are then located using some user-defined patterns; d) tables that do not have both meta-data and data cells are discarded. They evaluated their method on tables from their own repository, which was assembled with random documents from the Web.

Kim and Lee [32] used heuristics and an algorithm to check how similar the cells are. Their proposal works as follows: a) tables are considered data tables if they contain caption or th tags and there are td tags at the right or the bottom sides; b) they are discarded if they have a single cell, if they have nested tables, or if they seem to have meta-data cells only; c) if they have too many links, images, or empty cells, then they are also discarded; d) then, it checks that the cells selected previously are consistent using some user-defined patterns; e) if the degree of similarity per row or column does not exceed a pre-defined threshold, then the corresponding table is discarded. The evaluation was performed on
11 477 tables from Wang and Hu's [64] repository.

Jung and Kwon [30] presented a machine-learning proposal. It works as follows: a) it first removes empty rows and columns, splits spanned cells by duplicating their contents, and discards tables with only one cell; b) then, it computes many structural, visual, and content features of the table to find out if it has meta-data cells, in which case the table is assumed to have data; c) finally, a C4.5 learner is fed with the input features and the classified tables. The evaluation was performed using -fold cross evaluation on
10 000 tables from their own repository plus some tables from Wang and Hu's [64] repository.

Gatterbauer et al. [26] reported on an approach that identifies tables using some display heuristics. Their proposal works as follows: a) elements with td, th, and div tags are considered candidate tables; b) it tries to identify frames that rely on those elements, which are assumed to be tables; c) overlapping tables are discarded; d) tables are also discarded if, after removing separator columns and rows, they have fewer than three rows, a single cell is more than the total size of the table, or they contain cells with more than words. The evaluation was performed on tables from their own repository.

Okada and Miura [47] devised another machine-learning approach that requires binarising discrete features before feeding them into an ID3 learner. The evaluation was performed using -fold cross evaluation on data tables and non-data tables from their own repository.

Cafarella et al. [3] proposed another machine-learning approach. Their proposal works as follows: a) it considers tables that have at least four cells, are not embedded in HTML forms, and are not calendars; b) the tables that meet the previous criteria are classified as either data or non-data tables by a person; c) then, a statistical classifier is machine-learnt from a dataset that vectorises the previous tables using both structural and content features that are intended to measure how consistent the cells are. They evaluated their proposal using -fold cross evaluation over several thousand tables from their own repository.

Crestan and Pantel [16] also presented a machine-learning proposal. It works as follows: a) tables that have fewer than four cells or have cells with more than characters are discarded; b) next, some structural and content features are computed; c) then, a Gradient Boosted Decision Tree classification model is machine-learnt.
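The kind of pre-filter described in step a) can be sketched as follows. This is a minimal sketch under our own assumptions: the minimum of four cells comes from the description above, but the `max_chars` threshold is a placeholder because the concrete value was not recoverable from this text.

```python
# Minimal sketch of a heuristic pre-filter for candidate data tables: discard
# tables that have too few cells or cells whose contents are too long to be
# tabular data. max_chars is a placeholder value, not the authors' setting.

def passes_prefilter(table, min_cells=4, max_chars=100):
    """table is a list of rows; each row is a list of cell strings."""
    cells = [cell for row in table for cell in row]
    if len(cells) < min_cells:
        return False        # too few cells to be a data table
    if any(len(cell) > max_chars for cell in cells):
        return False        # a cell holding running text, not data
    return True
```

Only the tables that pass this cheap filter need to be vectorised and fed into the classifier, which keeps the expensive steps off most non-data tables.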
The evaluation was conducted on tables from their own repository by performing -fold cross evaluation without overlapping.

Fumarola et al. [24] proposed a heuristic-based approach. Their proposal works as follows: a) it groups the elements whose bounding boxes are arranged in a grid; b) it then computes their similarity by comparing their DOM trees; c) next, it computes the number of nodes in each group; d) if the similarity in a group is above a user-defined threshold and the difference in the number of nodes is below another user-defined threshold, then it is considered a data table. The evaluation was performed on tables that were gathered from Gatterbauer et al.'s [26] repository.

Lautert et al. [34] devised a machine-learning proposal that builds on neural networks. It works as follows: a) it computes some structural, visual, and content features; b) then, it uses them to machine-learn a perceptron with one hidden layer and resilient propagation; c) it has one output neuron per type of data table, which is encoded using a score in the range [0..1]; the classification is performed in two steps, namely: the first one uses features to classify the tables into the corresponding types and the second step uses the previous features plus the type of table output by the previous classifier. The evaluation was performed on a repository with
342 795 tables that were gathered randomly.

Son and Park [59] also tried a machine-learning approach. Their proposal works as follows: a) it selects every DOM node with tag table and their corresponding parents; b) the features described by Wang and Hu [64] are then computed to create a learning set; c) finally, an SVM classifier is machine-learnt using a kernel that works with structural features plus a kernel that works with content features; the structure kernel is based on two other kernels, one of which works on the table nodes and the other on the corresponding parent nodes. The authors performed -fold cross evaluation on a subset of
11 477 tables from Wang and Hu's [64] repository; roughly of the tables were used for learning purposes and roughly were used for evaluation purposes.

Eberius et al. [18] devised a proposal that builds on machine learning a classifier. It works as follows: a) some heuristics are applied to filter most non-data tables out, namely: tables with fewer than two rows or columns, tables with an invalid HTML structure, and tables that cannot be displayed correctly; b) some structural and content features are then computed regarding the tables and some of their subregions in order to compute local features; c) two alternatives are then tried: learning one classifier for every table type, or using one classifier to discriminate between data and non-data tables and an additional classifier to classify some kinds of data tables; d) several classifiers are machine-learnt and evaluated, namely: CART, C4.5, SVM, and Random Forest; e) the classifier that achieves the best effectiveness is selected to implement the discrimination task. They evaluated their proposal on a repository with
24 654 tables from the October 2014 Common Crawl. According to their experience, the best results were achieved with Random Forest.

Wu et al. [66] provided a method to cluster tables that are similar according to their structure. Their proposal works as follows: a) for every two tables, it computes the set of paths that correspond to caption, td, and th tags; b) then, the similarity between the paths of every two tables is computed; c) then, tables are clustered according to their local density plus the previous similarities; d) now, for each cluster, clustering is performed again building on the paths that lead to elements with tags li, span, or div; e) finally, a so-called artificial judgment method is used to decide on the class of each cluster. The authors used a repository with tables from the Wikipedia to evaluate their system, but no results were provided regarding this task.

Nishida et al. [46] devised a proposal that analyses a subset of cells at the top-left corner of a table using a deep neural network. It works as follows: a) for each td or th tag, an embedding is generated by tokenising words, tags, and row and column indexes; b) each token is encoded as a one-hot vector; c) an LSTM with an attention mechanism is then used to obtain a semantic representation of each cell; d) a convolutional neural network is then connected to three residual units and applied to vectorise the input table; e) finally, a classification layer is used. The authors learnt the network using tables from web sites, and evaluated the results on
60 678 tables from web sites; the documents were selected from the April 2016 Common Crawl. They also experimented with an ensemble of five neural networks, which attained the best results.

Liao et al. [38] presented a heuristic-based approach that takes into account the existence of nested data tables. It works as follows: a) tables with a th or caption tag are considered data tables; b) tables with a large number of pictures, frames, forms, or script tags are discarded; c) tables with a small number of elements or many empty cells are discarded, too; d) tables with too many homogeneous contents in their rows are considered incomplete data tables, which must be stitched to other sibling tables to create a complete data table. They evaluated their method on tables from different sites.

Lerman et al. [36], Penn et al. [49], Cohen et al. [13], Hurst [28], Wang and Hu [64], Lerman et al. [37], Okada and Miura [47], Crestan and Pantel [16], Elmeleegy et al. [20], Fumarola et al. [24], Lautert et al. [34], Son and Park [59], Chu et al. [9], Eberius et al. [18], Nishida et al. [46], and Liao et al. [38] did not report on any proposals to implement the functional analysis task. Gatterbauer et al. [26] presented a naive approach that matches the structure of a table to a number of pre-defined structures in which it is also relatively easy to find the meta-data cells. Ling et al. [39] and Wu et al. [66] assumed that meta-data cells can be easily located by searching for th tags. Braunschweig et al. [1] also presented a naive solution since they assumed that meta-data cells are located on the first row. The other proposals provide more sophisticated approaches.

Chen et al. [7] devised a proposal that is based on row/column similarity.
It works as follows: a) it first divides the input table into blocks using the spanned cells as boundaries; b) it then compares how similar the last row/column in each block is to the previous ones using string, named-entity, and number similarity functions; c) then the right-most and/or bottom-most rows/columns that are similar to the last row/column are considered to contain data cells and the others are considered to contain meta-data cells. The evaluation was not performed on this task, but on their whole system.

Yoshida et al. [69] suggested using ontologies. Their proposal works as follows: a) for each cell in a table, it computes the ratio of times that its content is recorded in the ontology; b) these ratios are then used to feed the Expectation-Maximisation algorithm in order to learn a classifier that tells a few subtypes of listings apart; c) once the exact type of listing is clear, identifying meta-data cells is relatively easy and the rest of the cells are assumed to be data cells. (Note that the authors assume that the input tables are data tables, which is the reason why this cannot be considered a discrimination proposal.) They evaluated their proposal on tables that were randomly sampled from a repository with
35 232 tables.

Yang and Luk [68] applied some heuristics to differentiate rows with meta-data cells from rows with data cells. Their proposal works as follows: a) a row is considered to have meta-data cells if it has at most the average number of cells per row, if it contains no structured cells, or if its visual features are different from the visual features of the other rows; b) then, it tries to detect if the input table is a listing or a matrix; c) once the table structure is identified, it is easy to identify the meta-data. (Note that the authors assume that the input tables are data tables, which is the reason why this cannot be considered a discrimination proposal.) The authors did not report on their experimental results regarding this task, but on their whole system.

Kim and Lee [32] devised a proposal that first attempts to classify the input table. It works as follows: a) in the case of tables with one single row or column, the first cell is considered to be a meta-data cell and the rest are considered to be data cells; b) in the case of tables with two rows and two columns that do not have any spanned cells, both the first row and column are considered to have meta-data cells and the bottom-right cell is considered to be a data cell; c) tables with two rows/columns and three or more columns/rows whose upper-left cell spans a whole row/column are discarded; otherwise, if the first row/column has some spanned cells (but not all), then the first column/row is assumed to have meta-data cells and the others are assumed to have data cells; d) otherwise, the similarity of the cells is checked per rows and columns using the following functions: a lexical similarity function that focuses on the data types and the length of the contents, and a semantic similarity function that builds on some user-provided key words and patterns.
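The kind of lexical similarity function described in step d) can be sketched as follows. This is a minimal sketch under our own assumptions: the crude type tags and the equal weighting of type and length are illustrative choices, not the functions that Kim and Lee actually used.

```python
# Minimal sketch of a lexical similarity between two cells that focuses on
# their inferred data types and the lengths of their contents. The type tags
# and the 50/50 weighting are illustrative assumptions.

def data_type(cell):
    """A crude type tag for a cell's content: numeric, alphabetic, or mixed."""
    stripped = cell.replace(".", "", 1).replace("-", "", 1)
    if stripped.isdigit():
        return "numeric"
    if cell.replace(" ", "").isalpha():
        return "alphabetic"
    return "mixed"

def lexical_similarity(cell_a, cell_b):
    """Returns a score in [0, 1]: 1 means same type and similar length."""
    type_score = 1.0 if data_type(cell_a) == data_type(cell_b) else 0.0
    longest = max(len(cell_a), len(cell_b), 1)
    length_score = 1.0 - abs(len(cell_a) - len(cell_b)) / longest
    return (type_score + length_score) / 2
```

Aggregating such scores per row or column and comparing the result against a threshold gives the consistency check of step e).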
The authors did not provide any experimental results regarding this task.

Jung and Kwon [30] proposed a heuristic-based technique to locate the meta-data within the tables. Their proposal works as follows: a) cells with a th tag are assumed to have meta-data; b) if the table can be partitioned into two blocks with the same background colour or font, then the top and/or the left blocks are assumed to contain meta-data; c) if the cells in a row or column have some user-defined contents or match some user-defined patterns, then they are also considered to contain meta-data; d) spanned cells that are embedded in td tags are also assumed to have meta-data as long as they are located on the top-left areas of the table; their adjacent cells are also considered to have meta-data; e) if the top-right cell is empty, then it is likely that the cells in the first row or column have meta-data; f) a probability is finally computed for every cell building on the previous heuristics, and the cells whose probability exceeds a threshold are then considered to be meta-data cells whereas the others are assumed to be data cells. The evaluation was performed using -fold cross evaluation on
10 000 tables from their own repository plus the tables from Wang and Hu's [64] repository.

Cafarella et al. [3] devised a machine-learning proposal. It works as follows: a) a learning set is assembled with data tables in which the cells are classified as either meta-data or data cells; b) in cases in which a table does not have any meta-data cells, synthetic cells are created and the meta-data is fed from a separate database with similar tables; c) some structural and content features are computed for each cell; d) a classifier is machine-learnt from the previous features; e) the results of the classifier are used to enrich the other database. The authors evaluated their proposal by means of -fold cross evaluation on a repository with tables that were gathered from the Web.

Embley et al. [22] devised a heuristic-based proposal that searches for four critical cells that help delimit where the meta-data and the data cells are located. These cells are referred to as CC1, CC2, CC3, and CC4. CC1 and CC2 identify the top left-most region such that the cells to the right of and below that region are mostly meta-data cells; CC3 and CC4 identify the bottom right-most region whose cells are mostly data cells. (Note that CC4 is identified in their segmentation task.) Their proposal works as follows: a) it sets CC1 to the top left-most cell and CC2 to the bottom left-most cell; b) it then iteratively shifts CC2 upwards or rightwards while searching for the minimum set of cells between CC1 and CC2 that result in headers that can identify the cells between CC2 and CC4; c) then, CC3 is set to the first cell below and to the right of CC2 that does not belong to an empty row or column; d) after that, footnotes are identified in cells whose contents start with a footnote-mark symbol; e) finally, it analyses some dependencies amongst the meta-data cells to find out the order in which they must be grouped. The evaluation was performed on a repository with tables that was provided by Padmanabhan et al.
[48].

Milošević et al. [44] restricted their attention to tables from the PubMed Central repository. Their focus is on identifying the meta-data cells, since the other cells are considered data cells by default. Their proposal works in three phases. In the first phase, it searches for thead tags; if they are found, then the inner th tags are assumed to encode the meta-data cells and their procedure finishes. Otherwise, the second phase is intended to find meta-data cells at the top rows as follows: a) they examine the syntactic similarity of cells on a per-column basis; the cells at the top whose syntactic type is different from the cells below, if any, are considered meta-data cells as long as the cells in the same rows in adjacent columns are also considered meta-data cells; b) if a cell in the first row spans several columns, then it is assumed to have meta-data, as well as the cells in the rows below, until a non-spanned cell is found; c) the cells at the top that are between horizontal lines are considered meta-data if they are marked with a thead tag and they are not empty; in cases in which only one cell in a row has meta-data, the authors refer to it as a super row. The third phase is intended to find meta-data cells on the left columns as follows: a) the cells on the left-most column that are spanned are meta-data cells and so are the cells on the right until the first non-spanned cell is found; b) the first column below a super row is considered to have meta-data cells that are referred to as stubs. They used a repository with tables from which tables were randomly selected to evaluate their proposal.

Penn et al. [49], Hurst [28], Wang and Hu [64], Kim and Lee [32], Jung and Kwon [30], Gatterbauer et al. [26], Okada and Miura [47], Cafarella et al. [3], Crestan and Pantel [16], Lautert et al. [34], Son and Park [59], Eberius et al. [18], Embley et al. [22], Nishida et al. [46], and Liao et al. [38] did not report on any proposals to implement the structural analysis task.
Chen et al. [7] presented a naive proposal that works on tables that provide data about a single entity, so all of the data cells form a single tuple; regarding the meta-data cells, they group them into headers horizontally or vertically after expanding spanned cells. Yoshida et al. [69] presented a naive proposal that classifies tables into a number of categories, which makes identifying the tuples quite a trivial task. Elmeleegy et al. [20] also assumed that the tuples within tables that are encoded as lists are always laid out row-wise. Ling et al. [39] and Braunschweig et al. [1] assumed that tuples are displayed row-wise or column-wise depending on the number of meta-data or data cells found in the first few rows or columns. Chu et al. [9] also presented a naive approach that assumes that the tuples within tables that are encoded as lists are always laid out row-wise. Wu et al. [66] presented an additional naive approach since they just identify tuples in horizontal listings. The other proposals provide more sophisticated approaches.

Lerman et al. [36] used a couple of algorithms to detect row-wise tuples. Their proposal works as follows: a) first, it uses DataPro to find the patterns that describe the data in each column; b) such patterns can be interpreted as tags that allow transforming a table into a sequence of symbols; c) then, a version of ALERGIA is used to infer a finite automaton from those sequences; d) the automaton is then transformed into a regular expression; e) finally, it identifies repeating sub-patterns that correspond to the tuples in the original table. No experimentation was performed regarding this task.

Cohen et al. [13] presented a proposal that relies on four so-called builders, namely: one that focuses on meta-data cells that cut in on the table, one that focuses on columns of headers, another that focuses on rows of headers, and an additional one that takes the tag paths into account.
The builders are fed into a FOIL-based system in order to learn a classification rule that allows identifying both horizontal and vertical tuples. No experimental results were reported regarding this specific task, only regarding their whole system.

Yang and Luk [68] presented a proposal that specialises in numerical tables. It works as follows: a) first, it removes the headers of the input table; b) then, it checks whether the tuples seem to be one-dimensional or two-dimensional using some heuristics; c) the type of cells is analysed using pre-defined patterns in order to label numeric data cells; d) given the types of cells and the dimensionality of the tuples, their proposal tries to match a number of pre-defined patterns that help identify the tuples. The evaluation was performed on one-dimensional and two-dimensional tables.

Lerman et al. [37] devised two proposals to identify tuples, namely: a constraint-solving technique and a probabilistic technique. The former works as follows: a) it models the cells in the tables using Boolean variables; b) it then adds constraints to ensure that each cell belongs to a single tuple, only contiguous cells can be assigned to the same tuple, and two cells cannot be in the same position in the same table; c) then a constraint solver is used to find a solution to the constraints. The latter works as follows: a) it uses a set of observable variables that model the types of tokens in the data cells, and a set of hidden variables, which provide the tuple number or the column number to which every cell belongs; b) a probabilistic model is then learnt by assuming a number of dependencies between token types, cells, columns, neighbour columns, format, or tuple numbers; c) finally, the contents of the hidden variables are inferred building on the probabilistic model. Their evaluation was performed on the tables from their own repository, which were gathered from web sites on book sellers, property taxes, white reports, and prisons.

Fumarola et al.
[24] presented a proposal that was described very shallowly. It seems to work on so-called candidate lists, which are sets of cells that correspond to different columns and form a single tuple; each candidate list is a sub-tree of the DOM tree and they all are required to satisfy some structural similarity constraints, including a minimum size in terms of nodes. The evaluation was performed on tables from Gatterbauer et al.'s [26] repository.

Milošević et al. [44] identify the tuples according to how the meta-data cells are arranged within a table. If meta-data cells are at the top-most rows and on the left-most columns, then the table is a matrix with a single tuple that consists of the whole set of data cells; if there are meta-data cells at the top, but not on the left, then the table is a listing in which each row is a tuple; if there are not any meta-data cells, every single data cell corresponds to a tuple. (Note that this proposal cannot be considered a discrimination proposal since it assumes that the input tables are data tables; recall that their focus was on tables in PubMed publications, which are tables with scientific data.) They used a repository with tables from which tables were selected to evaluate their proposal.

Lerman et al. [36], Penn et al. [49], Yoshida et al. [69], Cohen et al. [13], Hurst [28], Wang and Hu [64], Lerman et al. [37], Kim and Lee [32], Jung and Kwon [30], Gatterbauer et al. [26], Okada and Miura [47], Crestan and Pantel [16], Elmeleegy et al. [20], Fumarola et al. [24], Lautert et al. [34], Ling et al. [39], Son and Park [59], Braunschweig et al. [1], Chu et al. [9], Eberius et al. [18], Nishida et al. [46], and Liao et al. [38] did not report on this task.

Most of the other authors reported on naive solutions.
Chen et al.'s [7] proposal works on tables with a single tuple that can spread across several blocks, each of which has its own headers; for each component of the tuple, it creates field descriptors by joining the meta-data cells in the corresponding rows and/or columns. Embley et al.'s [22] proposal is similar, but they focused on tables with a single block. Milošević et al. [44] reported on a naive approach, too: in matrices or listings, they create descriptors for each component from the meta-data in the corresponding column and/or row; in enumerations, they use the caption of the table as a descriptor for every component in the tuples. Yang and Luk [68] proposed a similar procedure, but it takes multiple header rows or columns into account, in which case the cells are simply merged to create field descriptors, as well as cells that contain both meta-data and data, in which case the meta-data are transformed into simple descriptors.

The proposal by Cafarella et al. [3] goes a step further in cases in which a table does not provide any meta-data cells. In such cases, they collect the data on a per-column basis and attempt to find the most similar data in the ACSDb database, which is a resource that has many data with correct descriptors. The authors did not report on the evaluation of this task. Wu et al. [66] also went a bit further since they used several ad-hoc interpretation methods depending on the structure of the table identified in the discrimination task. They only reported on a method to extract information from horizontal listings with headers using some heuristics that are related to how the th and the td tags encode a subject-predicate-object relation. They conducted their experimentation on a repository with horizontal listings from Wikipedia. The authors evaluated their proposal on a repository with tables that were randomly selected from the Wikipedia.
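The naive descriptor-building approaches above can be sketched as follows. This is a minimal sketch under our own assumptions: it hard-codes one common layout (column headers on the first row and row headers on the first column) and a "/" separator, whereas the actual proposals also handle other layouts.

```python
# Minimal sketch of naive field-descriptor building: each data cell gets a
# descriptor obtained by joining the meta-data cells in its row and column.
# The assumed layout (headers on row 0 and column 0) is an illustration only.

def field_descriptors(table):
    """table is a list of rows of strings; returns {(row, col): descriptor}."""
    descriptors = {}
    for r, row in enumerate(table[1:], start=1):
        for c, cell in enumerate(row[1:], start=1):
            column_header = table[0][c]   # meta-data cell above
            row_header = row[0]           # meta-data cell on the left
            descriptors[(r, c)] = f"{row_header} / {column_header}"
    return descriptors
```

For a matrix like `[["", "Population", "Area"], ["Seville", "x", "y"]]`, the data cell at position (1, 1) would be described as "Seville / Population", which is the kind of joined descriptor that Chen et al. and Embley et al. produce.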
Comparison of proposals

In this section, we compare the proposals that we have summarised in the previous section by means of a comparison framework with both general and task-specific characteristics.

The general characteristics are the following: a) Foundation: it is a hint on the technique behind each proposal. b) Tables required: it is the minimum number of tables required for a proposal to work; the fewer tables required, the better. c) Effectiveness: it is the extent to which a proposal succeeds in implementing a task correctly according to an effectiveness measure; the higher the effectiveness, the better. d) Efficiency: it is the amount of computing power that a proposal requires to implement a task; the more efficient (i.e., the less computing power is required), the better. e) Resources: it refers to the resources that a user must provide so that a proposal can work properly; the fewer resources, the better. f) Features: it refers to the features onto which the input data must be projected in order to machine-learn a predictor or to make a decision according to a heuristic. Features can be either structural, which are related to the HTML or the DOM representation of the input documents, visual, which are related to how they are displayed, or content features, which are related to the contents of the cells. g) Parameters: it refers to the settings that must be tuned so that a proposal works well, which can be either pre-defined, learnable, or user-defined parameters. Pre-defined parameters have a value that the authors of a proposal have found generally appropriate; they are preferable to learnable parameters, whose values must be experimentally learnt by the user; in turn, they are preferable to user-defined parameters, which must be set by the user using his or her intuition; the fewer parameters, the better.

Note that it is easy to make decisions building on the general characteristics that we presented above since we have characterised their preferred values; the same applies to the task-specific characteristics that we describe in the following subsections. The only exceptions are the foundation characteristic and the features characteristic: it is not generally clear whether a heuristic-based approach is preferable to a machine-learning approach or vice versa, or whether structural, visual, or content features are preferable to each other. Note, too, that effectiveness and efficiency are decision-making characteristics, but the figures provided by one author are not generally comparable to the figures provided by another author because they evaluated their proposals using different approaches, learning sets, evaluation sets, and machinery.
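As an illustration of the features characteristic, a table can be projected onto a handful of structural and content features as follows. This is a minimal sketch; the concrete features and names are ours, not taken from any specific proposal, and real proposals use richer (and often visual) feature sets.

```python
import re

# Project a table (a grid of cell strings) onto a few illustrative features.
# th_count is assumed to come from the HTML parse (number of th cells).

def table_features(grid, th_count=0):
    cells = [c for row in grid for c in row]
    n = len(cells) or 1
    numeric = sum(bool(re.fullmatch(r"[\d.,%€$-]+", c.strip())) for c in cells)
    return {
        "rows": len(grid),                          # structural
        "cols": max(len(r) for r in grid),          # structural
        "th_ratio": th_count / n,                   # structural (header tags)
        "numeric_ratio": numeric / n,               # content
        "avg_len": sum(len(c) for c in cells) / n,  # content
    }
```

A vector like this is what a machine-learnt predictor would consume, or what a heuristic would threshold on.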
Table 2 summarises our comparison regarding location proposals. The task-specific characteristics are the following: a)
Body encodings: it refers to how the tables that a proposal can locate must be encoded; the more kinds of encodings are identified, the better. b)
Context-data encodings: it refers to how context-data cells are encoded; the more kinds of encodings are identified, the better.

[Table 2: Comparison of location proposals.]

Regarding the general characteristics, it is surprising that all of the location proposals are based on heuristics; there is no record in the literature of a single proposal that has tried a machine-learning approach. Most proposals can work on a single table, but the ones by Lerman et al. [36, 37] and Ling et al. [39] require at least a pair of tables to perform table alignment. None of the proposals was presented in isolation, but as a component of a larger system, which is the reason why no author reported on effectiveness or efficiency. Note that only the proposal by Gatterbauer et al.
[26] projects the input documents onto structural and visual features in order to apply their heuristics; note, too, that it is the only one that requires a pre-defined parameter. The proposal by Crestan and Pantel [16] is the only one that requires the user to set a learnable parameter.

Regarding the task-specific characteristics, most of the proposals locate tables that are encoded using tabular tags, a few focus on tables that are encoded using listing tags, and only Gatterbauer et al.’s [26] and Fumarola et al.’s [24] proposals are independent of the tags used since they analyse how the input documents are displayed. Note, too, that the vast majority of proposals focus on locating the tables themselves, not their context data. Chen et al. [7] and Ling et al. [39] are the exceptions: the former presents a simple approach that searches for caption tags, and the latter presents a more sophisticated approach that analyses the title tags and the text that surrounds the tables.

Table 3 summarises our comparison regarding segmentation proposals. The task-specific characteristics are the following: a)
Spanned cells: it describes whether a proposal is able to identify cells that span multiple columns and/or rows; a proposal that can identify spanned cells is better than a proposal that cannot. b) Multi-part cells: it describes whether a proposal is able to identify cells that provide partial contents and must be merged; a proposal that can identify multi-part cells is better than a proposal that cannot. c) Context data: it describes whether a proposal can identify context data or not; a proposal that can identify context data is better than a proposal that cannot.

Regarding the general characteristics, it is easy to realise that only the proposals by Lerman et al. [36, 37] have tried machine-learning approaches; the others rely on heuristics that their authors have proven to work well in practice. Furthermore, most of them can work on as few as one input table, except for the ones by Lerman et al. [36, 37] and Ling et al. [39]. Unfortunately, many of the authors did not report on the effectiveness of their proposals; the others reported on precision, recall, and/or the F1 score. Only Elmeleegy et al. [20] and Chu et al. [9] reported on the efficiency of their approaches; their figures reveal that the algorithms behind the scenes might not be scalable enough. Regarding the resources required, only the proposals by Elmeleegy et al. [20] and Ling et al. [39] require the user to provide a few, but they do not seem to be difficult to find. Only the proposals by Lerman et al. [36, 37] and Gatterbauer et al. [26] require the input tables to be projected onto some simple features. Regarding the parameters, only the proposals by Elmeleegy et al. [20], Ling et al. [39], and Chu et al.
[9] require the user to set a few.

[Table 3: Comparison of segmentation proposals.]

Regarding the task-specific characteristics, it is surprising that many proposals do not make an attempt to analyse spanned cells and that none of them can identify multi-part cells, both of which are very common in practice. It is also surprising that only the proposals by Chen et al. [7], Ling et al. [39], and Embley et al. [21] can identify context data, which are also very common in practice; unfortunately, the proposal by Chen et al. [7] cannot be considered a general solution to the problem since it is very naive.

Tables 4–7 summarise our comparison regarding discrimination proposals. The only task-specific characteristic is
Types of data tables, which refers to the kinds of data tables that a proposal can discriminate; the more types can be discriminated, the better.

Regarding the general characteristics, it is easy to realise that some of the proposals use a machine-learning approach and the rest use heuristic-based approaches. The former require at least two tables to learn a predictor that implements the discrimination task, whereas the latter can generally work on a single table. Except for Wu et al.’s [65], the other authors report on effectiveness measures that are specific to this task; most of the authors selected precision, recall, and the F1 score as effectiveness measures; the exceptions are Cohen et al. [13], Lautert et al. [34], and Nishida et al. [46], who report on the F1 score only, Okada and Miura [47], who reported on accuracy, and Fumarola et al. [24], who reported on recall only. Apparently, the effectiveness of the machine-learning proposals is higher than the effectiveness of the heuristic-based proposals; however, due to the differences in the evaluation processes, this conclusion is not sound. Unfortunately, only Son and Park [59] and Eberius et al. [18] reported on the efficiency of their proposals, which does not seem to be very good according to their figures; Wu et al. [65] did not report on the efficiency of their proposal, but they mentioned that it relies on a linear clustering algorithm. The only proposals that require resources are the ones by Eberius et al. [18] and Nishida et al. [46]; fortunately, the resources do not seem to be a major obstacle since they consist of a corpus that was gathered from Wikipedia. The ones that rely on machine learning project the input data onto a space of structural, visual, and/or content features that seem simple to compute.
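A heuristic-based discrimination step of the kind discussed above can be sketched as a few thresholds over such features. The thresholds and the feature names below are illustrative assumptions of ours, not values taken from any of the cited proposals.

```python
# Illustrative heuristic discrimination: is this table a data table, or is it
# merely used for page layout, navigation, or as a form?

def looks_like_data_table(rows, cols, th_ratio, link_ratio, form_ratio):
    if rows < 2 or cols < 2:
        return False       # degenerate tables are usually layout artifacts
    if form_ratio > 0.5:   # cells dominated by input/select tags: a form
        return False
    if link_ratio > 0.8:   # cells dominated by anchors: likely navigation
        return False
    # Header cells are strong (but not necessary) evidence of a data table.
    return th_ratio > 0.0 or link_ratio < 0.3
```

A machine-learnt predictor would replace these hand-picked thresholds with a model trained over labelled tables, which is precisely the trade-off the comparison discusses.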
Regarding their parameters, most of them have pre-defined parameters for which the authors recommend some values that are expected to work generally well; none of the proposals requires any learnable parameters, but a few require user-defined parameters.

Regarding the task-specific characteristics, the only proposals that can sub-classify data tables are the following ones: Crestan and Pantel [16] distinguish amongst listings, forms, matrices, and enumerations; Lautert et al. [34], Eberius et al. [18], and Nishida et al. [46] distinguish amongst listings, forms, and matrices; and Liao et al. [38] distinguish between complete and incomplete tables (which are encoded as independent tables, but must be stitched together so that they can be properly interpreted).

Table 8 summarises our comparison regarding functional analysis proposals.
[Tables: Comparison of discrimination proposals.]
'(cid:7)(cid:17)(cid:6)(cid:7)!(cid:17)(cid:6)(cid:31)(cid:2)(cid:4)(cid:4)(cid:30)(cid:15)((cid:7)(cid:4)(cid:2)((cid:19)%(cid:7)(cid:7)(cid:7)(cid:7)(cid:7)(cid:7)(cid:7)(cid:7)"(cid:17)(cid:8)(cid:19) (cid:29) (cid:22)(cid:11)(cid:9)(cid:24). (cid:19)(cid:14)(cid:6)(cid:3)(cid:13)(cid:9)(cid:13)(cid:22) (cid:10)(cid:11)& (cid:23)(cid:24)(cid:6)(cid:7)(cid:4)(cid:6)(cid:16)(cid:2)(cid:9)(cid:12)(cid:13) (cid:10)&(cid:7)0(cid:12)(cid:12)(cid:11)(cid:11)(cid:7)(cid:4)(cid:2) (cid:8)(cid:5)(cid:19)1 (cid:27))(cid:5)(cid:6)(cid:2)((cid:5)(cid:7)(cid:2)(cid:15)$(cid:7)(cid:19)(cid:4)(cid:2)(cid:15)$(cid:2)(cid:6)$(cid:7)$(cid:5))(cid:30)(cid:2)(cid:4)(cid:30)(cid:17)(cid:15)(cid:7)(cid:17)!(cid:7)(cid:4)&(cid:5)(cid:7)(cid:15)(cid:3)(cid:31) (cid:5)(cid:6)(cid:7)(cid:17)!(cid:7)"(cid:5)(cid:8)(cid:8)(cid:19)'(cid:7)"(cid:5)(cid:8)(cid:8)(cid:19)(cid:7) (cid:25)(cid:13)(cid:14)(cid:26)(cid:5)(cid:7)(cid:6)(cid:10)(cid:10)(cid:9)(cid:18)(cid:9)(cid:14)(cid:3) +(cid:22)(cid:11)(cid:9).5/(cid:22)(cid:11)(cid:9).(cid:10)(cid:21) (cid:30) (cid:22)(cid:11)(cid:9).(cid:10) (cid:27)(cid:28)(cid:12)(cid:26)(cid:5)(cid:7)(cid:6)(cid:10)(cid:10)(cid:9)(cid:18)(cid:9)(cid:14)(cid:3)(cid:10) +(cid:22)(cid:11)(cid:9).(cid:13)/(cid:22)(cid:11)(cid:9).(cid:13)(cid:21) (cid:31) (cid:22)(cid:11)(cid:9).(cid:13) (cid:12)(cid:11)'(cid:10)(cid:13).&(cid:7)0(cid:31)(cid:30)(cid:8)(cid:8)(cid:30)(cid:17)(cid:15)(cid:19)(cid:7)(cid:17)!(cid:7)(cid:4)(cid:2) (cid:8)(cid:5)(cid:19)1 6(cid:30)(cid:18)(cid:30) th (cid:4)(cid:2)((cid:19)%(cid:7)(cid:8)(cid:17)"(cid:2)(cid:8)(cid:7)"(cid:5)(cid:8)(cid:8)(cid:7)(cid:8)(cid:5)(cid:15)((cid:4)&(cid:7)(cid:2))(cid:5)(cid:6)(cid:2)((cid:5)(cid:7)(cid:2)(cid:15)$(cid:7))(cid:2)(cid:6)(cid:30)(cid:2)(cid:15)"(cid:5)%(cid:7)(cid:8)(cid:17)"(cid:2)(cid:8)(cid:7)(cid:6)(cid:2)(cid:4)(cid:30)(cid:17)(cid:7)(cid:17)!(cid:7) span '(cid:7) th '(cid:7) a '(cid:7) img '(cid:7) input '(cid:7) select '(cid:7) font '(cid:7) br '(cid:7) ul 
'(cid:7)(cid:17)(cid:6)(cid:7) ol (cid:4)(cid:2)((cid:19)(cid:9) /(cid:2)(cid:4)(cid:30)(cid:17)(cid:7)(cid:17)!(cid:7)(cid:2)(cid:8) (cid:29)(cid:14)(cid:6)(cid:2)(cid:4)(cid:3)(cid:14)(cid:10) (cid:15)(cid:6)(cid:3)(cid:6)(cid:30)(cid:14)(cid:2)(cid:14)(cid:3)(cid:10) (cid:27)(cid:31) (cid:14)(cid:10)(cid:26)(cid:12)(cid:18)(cid:26)(cid:17)(cid:6)(cid:2)(cid:6)(cid:26)(cid:2)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)!(cid:14)(cid:18)(cid:14)(cid:3)(cid:14)(cid:13)(cid:5)(cid:14) (cid:29)(cid:12)(cid:4)(cid:13)(cid:17)(cid:6)(cid:2)(cid:9)(cid:12)(cid:13) (cid:27)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)(cid:26)(cid:3)(cid:14)"(cid:4)(cid:9)(cid:3)(cid:14)(cid:17) (cid:23)(cid:18)(cid:18)(cid:14)(cid:5)(cid:2)(cid:9)(cid:16)(cid:24)(cid:14)(cid:13)(cid:14)(cid:10)(cid:10) (cid:23)(cid:18)(cid:18)(cid:9)(cid:16)(cid:5)(cid:9)(cid:14)(cid:13)(cid:5)(cid:31) !(cid:14)(cid:10)(cid:12)(cid:16)(cid:4)(cid:3)(cid:5)(cid:14)(cid:10) T a b l e : C o m p a r i s o n o f d i s c r i m i n a t i o np r o p o s a l s ( P a r t ) . 
(cid:2)(cid:3)(cid:4)(cid:5)(cid:2)(cid:4)(cid:3)(cid:6)(cid:7) (cid:8)(cid:9)(cid:10)(cid:4)(cid:6)(cid:7) (cid:11)(cid:12)(cid:13)(cid:2)(cid:14)(cid:13)(cid:2) (cid:15)(cid:3)(cid:14)(cid:16)(cid:17)(cid:14)(cid:18)(cid:9)(cid:13)(cid:14)(cid:17) (cid:19)(cid:14)(cid:6)(cid:3)(cid:13)(cid:6)(cid:20)(cid:7)(cid:14) (cid:21)(cid:10)(cid:14)(cid:3)(cid:16)(cid:17)(cid:14)(cid:18)(cid:9)(cid:13)(cid:14)(cid:17) (cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:3)(cid:6)(cid:7)(cid:8) (cid:9)(cid:10)(cid:11)(cid:12) (cid:13)(cid:7)(cid:2)(cid:14)(cid:5)(cid:4)(cid:15)(cid:16)(cid:17)(cid:18) (cid:9)(cid:19) (cid:20)(cid:21)(cid:22) (cid:20)(cid:21)(cid:22) (cid:23)(cid:16)(cid:24)(cid:16)(cid:7)(cid:6)(cid:15)(cid:16)(cid:5)(cid:25)(cid:3)(cid:5)(cid:26)(cid:3)(cid:26)(cid:5)(cid:27)(cid:4)(cid:15)(cid:3)(cid:5)(cid:6)(cid:28)(cid:7)(cid:4)(cid:14)(cid:29)(cid:3)(cid:7)(cid:26)(cid:30)(cid:6)(cid:7)(cid:3)(cid:30)(cid:7)(cid:2)(cid:14)(cid:5)(cid:4)(cid:15)(cid:3)(cid:31)(cid:4)(cid:17)(cid:14)(cid:16)(cid:5)(cid:25)(cid:8) (cid:13)(cid:2)(cid:5) (cid:26)!!(cid:3)(cid:31)(cid:16)(cid:14)(cid:5)(cid:6)(cid:17)(cid:30)(cid:4)(cid:20)(cid:16)(cid:14)(cid:27)(cid:16)(cid:31)(cid:6)(cid:3)(cid:4)(cid:5)(cid:3)(cid:6)(cid:7)(cid:8) (cid:9)(cid:10)(cid:11)" (cid:1)(cid:9)(cid:13)(cid:22)(cid:7)(cid:14) ' ((cid:10)(cid:8))) (cid:23)(cid:13)(cid:10)(cid:14)(cid:24)(cid:20)(cid:7)(cid:14) ' ! 
((cid:10)(cid:8)*(cid:11) (cid:20)(cid:21)(cid:22) (cid:1)(cid:16)&(cid:16)$(cid:4)(cid:31)(cid:16)(cid:6) +(cid:26)&(cid:4)(cid:17)(cid:3)(cid:4)(cid:24)(cid:28)(cid:4)(cid:31)(cid:31)(cid:16)(cid:17)(cid:18)(cid:3)(cid:14)(cid:16),(cid:4)(cid:29)(cid:3)-(cid:23)+.(cid:3)(cid:26)(cid:2)(cid:5)$(cid:2)(cid:5)(cid:3)(cid:14)(cid:16),(cid:4)(cid:29)(cid:3)(cid:17)(cid:2)(cid:24)(cid:28)(cid:4)(cid:15)(cid:3)(cid:26)!(cid:3)(cid:15)(cid:4)(cid:14)(cid:16)(cid:31)(cid:2)(cid:6)(cid:7)(cid:3)(cid:2)(cid:17)(cid:16)(cid:5)(cid:14)(cid:29)(cid:3)(cid:17)(cid:2)(cid:24)(cid:28)(cid:4)(cid:15)(cid:3)(cid:26)!(cid:3)!(cid:2)(cid:7)(cid:7)(cid:25)(cid:3)(cid:30)(cid:26)(cid:17)(cid:17)(cid:4)(cid:30)(cid:5)(cid:4)(cid:31)(cid:3)(cid:7)(cid:6)(cid:25)(cid:4)(cid:15)(cid:14)(cid:29)(cid:3)(cid:17)(cid:2)(cid:24)(cid:28)(cid:4)(cid:15)(cid:3)(cid:26)!(cid:3)%(cid:4)(cid:16)(cid:18)(cid:27)(cid:5)(cid:4)(cid:31)(cid:3)(cid:7)(cid:6)(cid:25)(cid:4)(cid:15)(cid:14)(cid:29)(cid:3)(cid:13)(cid:20)(cid:20)(cid:3)!(cid:16)(cid:7)(cid:5)(cid:4)(cid:15)(cid:14)(cid:29)(cid:3)(cid:23)/ " ((cid:10)(cid:8)*2 (cid:20)(cid:21)(cid:22) 5(cid:24)$(cid:5)(cid:25)(cid:3)(cid:30)(cid:4)(cid:7)(cid:7)(cid:3)(cid:5)(cid:27)(cid:15)(cid:4)(cid:14)(cid:27)(cid:26)(cid:7)(cid:31) -(cid:6)(cid:25)(cid:26)(cid:2)(cid:5)(cid:3)(cid:5)(cid:6)(cid:18)(cid:3)(cid:5)(cid:27)(cid:15)(cid:4)(cid:14)(cid:27)(cid:26)(cid:7)(cid:31)(cid:29)(cid:3)(cid:30)(cid:26)(cid:17)(cid:14)(cid:16)(cid:14)(cid:5)(cid:4)(cid:17)(cid:30)(cid:25)(cid:3)(cid:5)(cid:27)(cid:15)(cid:4)(cid:14)(cid:27)(cid:26)(cid:7)(cid:31)(cid:8)(cid:3) (cid:13)(cid:26)(cid:24)$(cid:7)(cid:4)(cid:5)(cid:4)(cid:29)(cid:3)(cid:16)(cid:17)(cid:30)(cid:26)(cid:24)$(cid:7)(cid:4)(cid:5)(cid:4)(cid:8) (cid:25)(cid:14)(cid:6)(cid:2)(cid:4)(cid:3)(cid:14)(cid:10) (cid:15)(cid:6)(cid:3)(cid:6)(cid:24)(cid:14)(cid:2)(cid:14)(cid:3)(cid:10) 
(cid:26)(cid:27)(cid:28)(cid:14)(cid:10)(cid:29)(cid:12)(cid:18)(cid:29)(cid:17)(cid:6)(cid:2)(cid:6)(cid:29)(cid:2)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)(cid:30)(cid:14)(cid:18)(cid:14)(cid:3)(cid:14)(cid:13)(cid:5)(cid:14) (cid:25)(cid:12)(cid:4)(cid:13)(cid:17)(cid:6)(cid:2)(cid:9)(cid:12)(cid:13) (cid:26)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)(cid:29)(cid:3)(cid:14)(cid:31)(cid:4)(cid:9)(cid:3)(cid:14)(cid:17) (cid:23)(cid:18)(cid:18)(cid:14)(cid:5)(cid:2)(cid:9)(cid:16) (cid:14)(cid:13)(cid:14)(cid:10)(cid:10) (cid:23)(cid:18)(cid:18)(cid:9)(cid:16)(cid:5)(cid:9)(cid:14)(cid:13)(cid:5)(cid:27) (cid:30)(cid:14)(cid:10)(cid:12)(cid:16)(cid:4)(cid:3)(cid:5)(cid:14)(cid:10) T a b l e : C o m p a r i s o n o f d i s c r i m i n a t i o np r o p o s a l s ( P a r t ) . (cid:2)(cid:3)(cid:4)(cid:5)(cid:2)(cid:4)(cid:3)(cid:6)(cid:7) (cid:8)(cid:9)(cid:10)(cid:4)(cid:6)(cid:7) (cid:11)(cid:12)(cid:13)(cid:2)(cid:14)(cid:13)(cid:2) (cid:15)(cid:3)(cid:14)(cid:16)(cid:17)(cid:14)(cid:18)(cid:9)(cid:13)(cid:14)(cid:17) (cid:19)(cid:14)(cid:6)(cid:3)(cid:13)(cid:6)(cid:20)(cid:7)(cid:14) (cid:21)(cid:10)(cid:14)(cid:3)(cid:16)(cid:17)(cid:14)(cid:18)(cid:9)(cid:13)(cid:14)(cid:17) (cid:1)(cid:2)(cid:3)(cid:4)(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:11)(cid:11) (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:18) (cid:19)(cid:20)(cid:21) (cid:19)(cid:20)(cid:21) (cid:22)(cid:15)(cid:4)(cid:15)(cid:23)(cid:13)(cid:23)(cid:5)(cid:14)(cid:3)(cid:24)(cid:13)(cid:15)(cid:14)(cid:3)(cid:25)(cid:5)(cid:16)(cid:15)(cid:23)(cid:15)(cid:8)(cid:7)(cid:14)(cid:15)(cid:6)(cid:26) (cid:19)(cid:27) (cid:19)(cid:27)(cid:28)(cid:27)(cid:16)(cid:2)(cid:15)(cid:25)(cid:7)(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:11)(cid:18) 
(cid:29)(cid:30)(cid:31)(cid:3)(cid:17)(cid:6)(cid:7)(cid:6)(cid:15)(cid:27)(cid:4)(cid:5)(cid:23)(cid:7)(cid:30)(cid:15)(cid:23)(cid:15)(cid:16)(cid:7)(cid:6)(cid:15)(cid:27)(cid:4) (cid:18) !(cid:11)(cid:9)" (cid:19)(cid:20)(cid:21) '(cid:27)(cid:23)(cid:7)(cid:15)(cid:4)((cid:16)(cid:31)(cid:3)(cid:17)(cid:15))(cid:15)(cid:17)(cid:5)(cid:27)(cid:4)(cid:6)(cid:27)(cid:8)(cid:27)*(cid:26) (cid:29)(cid:22)(cid:5) q (cid:5)(cid:31)(cid:7)(cid:14)(cid:7)(cid:23)(cid:3)(cid:6)(cid:3)(cid:14)(cid:5)+(cid:7)(cid:13)(cid:6)(cid:27)(cid:5)(cid:7)(cid:25),(cid:13)(cid:16)(cid:6)(cid:3)(cid:25)- (cid:19)(cid:27) (cid:19)(cid:27)(cid:28)(cid:7)(cid:4)*(cid:5)(cid:7)(cid:4)(cid:25)(cid:5).(cid:13)/ (cid:10)(cid:11)(cid:11)(cid:10) (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:18) (cid:19)(cid:20)(cid:21) (cid:19)(cid:20)(cid:21) (cid:1)(cid:3)(cid:8)(cid:8)(cid:5)(cid:17)(cid:27)(cid:13)(cid:4)(cid:6)(cid:5)(cid:6)(cid:2)(cid:14)(cid:3)(cid:16)(cid:2)(cid:27)(cid:8)(cid:25) (cid:19)(cid:27) (cid:19)(cid:27)0(cid:15)(cid:23)(cid:5)(cid:7)(cid:4)(cid:25)(cid:5).(cid:3)(cid:3) (cid:10)(cid:11)(cid:11)& (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:18) (cid:19)(cid:20)(cid:21) (cid:19)(cid:20)(cid:21) (cid:1)(cid:3)(cid:8)(cid:8)(cid:5)(cid:2)(cid:27)(cid:23)(cid:27)*(cid:3)(cid:4)(cid:3)(cid:15)(cid:6)(cid:26)(cid:5)(cid:6)(cid:2)(cid:14)(cid:3)(cid:16)(cid:2)(cid:27)(cid:8)(cid:25) 0(cid:3)(cid:26)(cid:5)1(cid:27)(cid:14)(cid:25)(cid:16)2(cid:5)(cid:8)(cid:3)(cid:30)(cid:15)(cid:17)(cid:7)(cid:8)(cid:5)(cid:31)(cid:7)(cid:6)(cid:6)(cid:3)(cid:14)(cid:4)(cid:16)(cid:9) (cid:19)(cid:27) (cid:19)(cid:27)3(cid:13)(cid:4)*(cid:5)(cid:7)(cid:4)(cid:25)(cid:5)01(cid:27)(cid:4) (cid:10)(cid:11)(cid:11)4 (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:18) !(cid:11)(cid:9)%4$!(cid:11)(cid:9)%%5 + !(cid:11)(cid:9)%" (cid:19)(cid:20)(cid:21) 
(cid:22)(cid:3)(cid:6)(cid:7)((cid:25)(cid:7)(cid:6)(cid:7)(cid:5)(cid:6)(cid:3)(cid:30)(cid:6)(cid:5)(cid:7)(cid:4)(cid:25)(cid:5)(cid:31)(cid:7)(cid:6)(cid:6)(cid:3)(cid:14)(cid:4)(cid:16)2(cid:5)(cid:16)(cid:15)(cid:23)(cid:15)(cid:8)(cid:7)(cid:14)(cid:15)(cid:6)(cid:26)(cid:5)(cid:6)(cid:2)(cid:14)(cid:3)(cid:16)(cid:2)(cid:27)(cid:8)(cid:25)(cid:9) (cid:19)(cid:27) (cid:19)(cid:27)6(cid:7)(cid:6)(cid:6)(cid:3)(cid:14)7(cid:7)(cid:13)(cid:3)(cid:14)(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:11)" (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:18) !(cid:11)(cid:9)" , - . / 0 12 3 (cid:19)(cid:20)(cid:21) '(cid:27)(cid:23)(cid:7)(cid:15)(cid:4)((cid:16)(cid:31)(cid:3)(cid:17)(cid:15))(cid:15)(cid:17)(cid:5)(cid:27)(cid:4)(cid:6)(cid:27)(cid:8)(cid:27)*(cid:26) (cid:19)(cid:27) (cid:19)(cid:27)(cid:1)(cid:7))(cid:7)(cid:14)(cid:3)(cid:8)(cid:8)(cid:7)(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:11)% '(cid:3)(cid:6)(cid:3)(cid:17)(cid:6)(cid:5)(cid:17)(cid:8)(cid:7)(cid:16)(cid:16)(cid:15))(cid:15)(cid:3)(cid:14)2(cid:5)(cid:14)(cid:3))(cid:3)(cid:14)(cid:3)(cid:4)(cid:17)(cid:3)(cid:5)(cid:23)(cid:7)(cid:6)(cid:17)(cid:2)(cid:15)(cid:4)*(cid:9) (cid:10)8 (cid:22)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)(cid:23)(cid:24)(cid:9)(cid:2)(cid:25)(cid:23)(cid:25)(cid:14)(cid:6)(cid:17)(cid:14)(cid:3)(cid:10) !(cid:11)(cid:9)% (cid:22)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)(cid:23)(cid:24)(cid:9)(cid:2)(cid:25)(cid:12)(cid:4)(cid:2)(cid:23)(cid:25)(cid:14)(cid:6)(cid:17)(cid:14)(cid:3)(cid:10) !(cid:11)(cid:9)"&$!(cid:11)(cid:9)%(cid:11) (cid:19)(cid:20)(cid:21) (cid:21)(cid:1)9'7(cid:25)(cid:7)(cid:6)(cid:7)7(cid:7)(cid:16)(cid:3) '(cid:15)(cid:23)(cid:3)(cid:4)(cid:16)(cid:15)(cid:27)(cid:4)(cid:16) 
$(cid:7)(cid:6)(cid:15)(cid:27)(cid:5)(cid:27))(cid:5)(cid:17)(cid:27)(cid:8)(cid:13)(cid:23)(cid:4)(cid:16)(cid:5)1(cid:15)(cid:6)(cid:2)(cid:5)(cid:8)(cid:27)1(cid:3)(cid:14)(cid:17)(cid:7)(cid:16)(cid:3):(cid:5)(cid:31)(cid:13)(cid:4)(cid:6)(cid:13)(cid:7)(cid:6)(cid:15)(cid:27)(cid:4):(cid:5)(cid:27)(cid:14)(cid:5)(cid:4)(cid:27)(cid:4)((cid:16)(cid:6)(cid:14)(cid:15)(cid:4)*(cid:5)(cid:25)(cid:7)(cid:6)(cid:7)(cid:5)(cid:15)(cid:4)(cid:5)(cid:6)(cid:2)(cid:3)(cid:5))(cid:15)(cid:14)(cid:16)(cid:6)(cid:5)(cid:14)(cid:27)12(cid:5)(cid:14)(cid:7)(cid:6)(cid:15)(cid:27)(cid:5)(cid:27))(cid:5)(cid:17)(cid:27)(cid:8)(cid:13)(cid:23)(cid:4)(cid:16)(cid:5)1(cid:15)(cid:6)(cid:2)(cid:5)(cid:4)(cid:27)(cid:4)((cid:16)(cid:6)(cid:14)(cid:15)(cid:4)*(cid:5)(cid:25)(cid:7)(cid:6)(cid:7)(cid:5)(cid:15)(cid:4)(cid:5)7(cid:27)(cid:25)(cid:26)2(cid:5)(cid:14)(cid:7)(cid:6)(cid:15)(cid:27)(cid:5)(cid:27))(cid:5)(cid:17)(cid:27)(cid:8)(cid:13)(cid:23)(cid:4)(cid:16)(cid:5)1(cid:15)(cid:6)(cid:2)(cid:5)(cid:8)(cid:3)(cid:4)*(cid:6)(cid:2)(cid:5)(cid:16)(cid:23)(cid:7)(cid:8)(cid:8)(cid:3)(cid:14)(cid:5)(cid:6)(cid:2)(cid:7)(cid:4)(cid:5)(cid:6)(cid:2)(cid:3)(cid:5))(cid:15)(cid:14)(cid:16)(cid:6)(cid:5)(cid:14)(cid:27)1(cid:9) (cid:19)(cid:27) (cid:19)(cid:27).(cid:15)(cid:4)*(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:18); (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:10)8 (cid:19)(cid:20)(cid:21) (cid:19)(cid:20)(cid:21) 9(cid:3)(cid:23)(cid:7)(cid:4)(cid:6)(cid:15)(cid:17)(cid:5)(cid:3)(cid:24)(cid:13)(cid:15)<(cid:7)(cid:8)(cid:3)(cid:4)(cid:17)(cid:3)(cid:5)(cid:6)(cid:2)(cid:14)(cid:3)(cid:16)(cid:2)(cid:27)(cid:8)(cid:25) (cid:19)(cid:27) (cid:19)(cid:27)=(cid:14)(cid:7)(cid:13)(cid:4)(cid:16)(cid:17)(cid:2)1(cid:3)(cid:15)*(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:18)& (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:10)8 
(cid:19)(cid:20)(cid:21) (cid:19)(cid:20)(cid:21) (cid:19)(cid:27) (cid:19)(cid:27)(cid:29)(cid:23)7(cid:8)(cid:3)(cid:26)(cid:5)(cid:3)(cid:6)(cid:5)(cid:7)(cid:8)(cid:9) (cid:10)(cid:11)(cid:18)4 (cid:12)(cid:3)(cid:13)(cid:14)(cid:15)(cid:16)(cid:6)(cid:15)(cid:17)(cid:16) (cid:18) (cid:21)(cid:17)(cid:17)!(cid:11)(cid:9) (cid:5)!(cid:5)(cid:11)(cid:9) (cid:26)(cid:14)(cid:18)(cid:14)(cid:3)(cid:14)(cid:13)(cid:5)(cid:14) (cid:11)(cid:12)(cid:13)(cid:2)(cid:14)(cid:27)(cid:2)(cid:16)(cid:17)(cid:6)(cid:2)(cid:6)(cid:23)(cid:5)(cid:14)(cid:7)(cid:7)(cid:10)(cid:28)(cid:18)(cid:18)(cid:9)(cid:16)(cid:5)(cid:9)(cid:14)(cid:13)(cid:5)(cid:29)(cid:28)(cid:18)(cid:18)(cid:14)(cid:5)(cid:2)(cid:9)(cid:16)(cid:30)(cid:14)(cid:13)(cid:14)(cid:10)(cid:10)(cid:22)(cid:6)(cid:20)(cid:7)(cid:14)(cid:10)(cid:23)(cid:3)(cid:14)(cid:31)(cid:4)(cid:9)(cid:3)(cid:14)(cid:17) (cid:12)(cid:4)(cid:13)(cid:17)(cid:6)(cid:2)(cid:9)(cid:12)(cid:13) (cid:15)(cid:6)(cid:3)(cid:6)!(cid:14)(cid:2)(cid:14)(cid:3)(cid:10)(cid:26)(cid:14)(cid:10)(cid:12)(cid:16)(cid:4)(cid:3)(cid:5)(cid:14)(cid:10) (cid:14)(cid:6)(cid:2)(cid:4)(cid:3)(cid:14)(cid:10) "(cid:14)(cid:5)(cid:12)(cid:3)(cid:6)(cid:2)(cid:12)(cid:3)(cid:10) T a b l e : C o m p a r i s o n o ff un c t i o n a l a n a l y s i s p r o p o s a l s . he task-specific characteristics are the following: a) Context-data cells: itdescribes if a proposal is able to make context-data cells apart from the others;a proposal that can identify context-data cells is better than a proposal thatcannot.
b) Decorators: it refers to the ability of a proposal to identify decorator cells; a proposal that can find decorators is better than a proposal that cannot.

Regarding the general characteristics, some of the proposals rely on heuristics and the rest rely on machine-learning approaches. Most of them can work on as few as a single table; the exceptions are the proposals by Cafarella et al. [3] and Chen et al. [7], which need to compare at least two tables. Many of the authors report on the effectiveness of their proposals; note that most of the reported measures are not high, which means that there is enough room for improvement regarding this task. Unfortunately, only Embley et al. [21] reported on the efficiency of their proposal, which seems scalable enough. Regarding the resources required, Yoshida et al.'s [69] and Gatterbauer et al.'s [26] proposals require domain-specific ontologies, whereas Cafarella et al.'s [4] requires a publicly-available database. The proposal by Cafarella et al. [4] is the only one that projects the input data onto a space of simple structural and content features. The proposal by Yoshida et al. [69] requires a pre-defined parameter that is auto-adjusted, and the proposals by Yang and Luk [68], Kim and Lee [32], and Milošević et al. [44] require another pre-defined parameter for which the authors provide a default value; the only proposals that require user-defined parameters are the ones by Chen et al. [7], Kim and Lee [32], Jung and Kwon [30], and Ling et al. [39].

Regarding the task-specific characteristics, note that only the proposal by Milošević et al. [44] can identify some decorator cells and context-data cells. This is a bit surprising since, in our experience, these kinds of cells are very common in practice.

Table 9 summarises our comparison regarding structural analysis proposals.

[Table 9: Comparison of structural analysis proposals.]

The task-specific characteristics are the following: a)
Header structure: it describes the kinds of headers that a proposal can identify according to their structure, namely: none, which means that it can analyse tables without headers; simple, which means that it can analyse simple headers that consist of one meta-data cell only; and complex, which means that it can identify complex headers that consist of multiple meta-data cells; the more header structures a proposal can identify, the better. b) Header layout: it describes the kinds of headers that a proposal can identify according to how they are laid out, namely: none, which means that it can identify that a table does not have any headers; single, which means that it can identify headers in the first rows and/or columns of a table; horizontally repeated, which means that it can identify headers that are repeated every few rows; vertically repeated, which means that it can identify headers that are repeated every few columns; and split, which means that it can identify series of headers that are split across several non-adjacent rows or columns; the more header layouts a proposal can identify, the better. c)
Tuple dimensionality: it describes the dimensionality of the tuples that a proposal can identify, namely: 0 if it can identify the tuples in an enumeration, 1 if it can identify the tuples in a listing or a form, and 2 if it can identify the tuple in a matrix; the more tuple dimensionalities a proposal can identify, the better. d) Tuple multiplicity: it describes the number of tuples that a table is intended to show, namely: 1 in the case of forms and matrices, and * in the case of listings and enumerations; the more tuple multiplicities a proposal can identify, the better. e) Tuple orientation: it describes the orientations that it can identify, namely: none in the case of matrices and enumerations, and horizontal or vertical in the case of listings and forms; the more tuple orientations a proposal can identify, the better. f) Separators: it describes whether a proposal can identify separator rows and/or columns; a proposal that can identify separators is better than a proposal that cannot.

Regarding the general characteristics, many proposals rely on heuristic-based approaches; the exceptions are the proposals by Lerman et al. [36, 37], which leverage some grammar-induction techniques, and Cohen et al.'s [13] proposal, which leverages inductive logic programming. Most of the proposals require as few as one input table; the exceptions are the proposals by Lerman et al. [36, 37], Yoshida et al. [69], and Fumarola et al. [24], which require two tables for comparison purposes.
Unfortunately, only Yang and Luk [68], Elmeleegy et al. [20], and Milošević et al. [44] reported on the effectiveness of their proposals, and none of the authors reported on their efficiency. Note that none of the proposals require projecting the input data onto a space of features, except for the one by Chen et al. [7]. Note, too, that Chen et al.'s [7], Yang and Luk's [68], and Fumarola et al.'s [24] proposals are the only ones that have parameters.

Regarding the task-specific characteristics, it is surprising that most of the proposals assume that tables either do not have any headers or have only simple ones; the exception is Milošević et al.'s [44] proposal. It is also surprising that the only proposal that can identify single and split headers is the one by Yoshida et al. [69]. Regarding the tuple dimensionality, only the proposals by Yang and Luk [68] and Milošević et al. [44] can tell uni-dimensional tuples apart from two-dimensional tuples; Milošević et al.'s [44] can also deal with zero-dimensional tuples; the proposal by Fumarola et al. [24] implicitly assumes that the tuples in a table are zero-dimensional and does not make an attempt to analyse the structure of the corresponding cells; the other proposals implicitly assume that the tuples are uni-dimensional. Regarding the tuple multiplicity, it is interesting to see that all of the proposals assume that tables may display more than one tuple; simply put, they cannot tell listings apart from forms. Regarding the tuple orientation, most proposals implicitly assume that the tuples are oriented horizontally; the only exceptions are the proposals by Yoshida et al. [69], Cohen et al. [13], and Yang and Luk [68], which can tell horizontal tuples apart from vertical tuples. It is surprising that none of the proposals that we have surveyed can identify separators, even though they are very common in practice.
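To make the structural-analysis characteristics above concrete, the following sketch shows the kind of simple heuristic on which many of these proposals build. It is purely illustrative (it is not the algorithm of any surveyed proposal) and assumes that a table has already been parsed into rows of (text, is-header) cells:

```python
# Illustrative heuristic, not taken from any surveyed proposal: guess the
# header layout and the tuple orientation of a table from where its <th>
# (header) cells sit. A table is a list of rows; each cell is (text, is_th).

def analyse_structure(table):
    """Return a (header_layout, tuple_orientation) pair."""
    first_row_is_headers = all(is_th for _, is_th in table[0])
    first_col_is_headers = all(row[0][1] for row in table)
    if first_row_is_headers and first_col_is_headers:
        # Headers on both axes suggest a matrix, whose single tuple is
        # two-dimensional and has no orientation.
        return "single", "none"
    if first_row_is_headers:
        return "single", "horizontal"  # headers on top, tuples laid out as rows
    if first_col_is_headers:
        return "single", "vertical"    # headers on the left, tuples as columns
    # No explicit headers: fall back on the implicit assumption that most
    # surveyed proposals make, namely horizontally-oriented tuples.
    return "none", "horizontal"

listing = [[("Name", True), ("Age", True)],
           [("Ann", False), ("34", False)]]
print(analyse_structure(listing))  # ('single', 'horizontal')
```

Real proposals refine such cues with visual and lexical features, such as formatting tags and cell data types, precisely because many tables on the Web mark no cell as a header at all.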
Table 10 summarises our comparison regarding interpretation proposals.

[Table 10: Comparison of interpretation proposals.]

The task-specific characteristics are the following: a)
Descriptors: it reports on the kind of descriptors that a proposal can assign to the data in a table; the more kinds of descriptors a proposal can generate, the better. b) Empty contents: it refers to the ability of a proposal to distinguish empty cells whose contents are factorised from cells that are actually empty; a proposal that can tell factorised and void cells apart is better than one that cannot. c)
Content structure: it refers to the ability of a proposal to make a difference between atomic cells and structured cells; a proposal that can make a difference between atomic cells and structured cells is better than a proposal that cannot.

Regarding the general characteristics, most proposals rely on heuristics that have proven to work well in practice; the only exception is the proposal by Cafarella et al. [4], which uses a reference matching approach. Wu et al. [65] were the only authors who reported on effectiveness, but they measured precision only; unfortunately, none of the proposals report on efficiency. Cafarella et al.'s [4] proposal is the only one that requires a publicly-available resource. None of the proposals project the input data onto a feature space and none of them require any parameters to be set.

Regarding the task-specific characteristics, all of the proposals can generate simple descriptors; only the proposals by Chen et al. [7], Yang and Luk [68], and Milošević et al. [44] can generate field descriptors. Unfortunately, none of the proposals can make a difference between factorised cells and void cells. Regarding making a difference between atomic and structured cells, it seems that only the proposal by Yang and Luk [68] can deal with this problem.
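The criteria above (void versus non-empty cells, atomic versus structured contents) can be illustrated with a small heuristic. The following sketch is ours, for illustration only; no surveyed proposal works this way, and the delimiter set is an assumption:

```python
# Illustrative sketch of the "content structure" criterion: classify a cell's
# contents as void, atomic, or structured. The delimiter set is an assumption;
# actual proposals rely on far richer analyses.
import re

# Assumed separators that suggest several values packed into one cell.
STRUCTURED_DELIMITERS = re.compile(r"[;,/]|\s{2,}|\n")

def classify_cell(text: str) -> str:
    """Return 'void', 'atomic', or 'structured' for a cell's text contents."""
    stripped = text.strip()
    if not stripped:
        # Whether a void cell is actually empty or holds a factorised value
        # cannot be decided from the cell alone; it requires table context.
        return "void"
    if STRUCTURED_DELIMITERS.search(stripped):
        return "structured"  # several values packed into one cell
    return "atomic"          # a single value
```

Note that the hardest of the three distinctions, factorised versus actually-empty cells, is deliberately left unresolved here: it cannot be made locally, which is precisely why so few proposals address it.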
5. Conclusions
This article summarises and compares many proposals that have been published between and regarding extracting data from tables that are encoded using HTML. The problem is not trivial insofar as many tables are encoded using a subset of table-related tags that help locate and segment them, but do not provide a clue on the function of the cells or their structure; many others are encoded using listing tags, block tags, or other tags that look like a table when they are displayed, which hampers locating and segmenting them.

Our analysis makes it clear that none of the proposals that we have listed provides a complete solution to the data-extraction problem. Most of them address only some of the tasks involved, and they differ regarding the problems that they address within each task. Regarding the location task, most proposals focus on tables that are encoded using table-related tags, there are a couple that focus on listing tags, and also a couple that are independent from the tags used; what seems an actual challenge is to identify context data, since the few proposals that take this problem into account are very naive. Regarding the segmentation task, it is surprising that no proposal can identify multi-part cells and that most of them do not attempt to segment the context data. The discrimination task is the one that has been paid more attention, but not many proposals attempt to go further than telling non-data tables apart from data tables; recent proposals attempt to classify data tables into more categories since this definitely helps interpret them. Regarding the functional analysis task, it is surprising that almost none of the proposals pays attention to identifying context-data cells or decorator cells. Regarding the structural analysis task, the problems that have received no or very little attention are identifying split headers and zero- and two-dimensional tuples.
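The location task discussed above is easy only for the simplest case, namely tables that are explicitly encoded with table-related tags. The following sketch, which is ours and only illustrative, locates top-level `table` elements with the Python standard library; by construction it misses "tables" rendered with listing or block tags, which is exactly the hard case that the surveyed proposals address:

```python
# Illustrative sketch of the simplest table-location strategy: count the
# top-level <table> elements of an HTML document. Tables rendered with
# listing or block tags are invisible to this approach.
from html.parser import HTMLParser

class TableLocator(HTMLParser):
    TABLE_TAGS = {"table"}  # assumed entry point; real systems also use rendering cues

    def __init__(self):
        super().__init__()
        self.depth = 0  # current nesting depth inside <table> elements
        self.count = 0  # number of top-level tables found

    def handle_starttag(self, tag, attrs):
        if tag in self.TABLE_TAGS:
            if self.depth == 0:
                self.count += 1  # count only outermost tables
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.TABLE_TAGS and self.depth > 0:
            self.depth -= 1

def locate_tables(html: str) -> int:
    locator = TableLocator()
    locator.feed(html)
    return locator.count
```

For instance, a document whose only table is nested inside another table yields a count of one, since only the outermost element is taken as a candidate.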
Regarding the interpretation tasks, creating artificial descriptors in cases in which not enough meta-data are available, analysing whether an empty value is actually empty or factorised, and analysing the structure of the contents of a cell are problems that have not been addressed sufficiently. Addressing these problems would help expand the kinds of tables from which data can be extracted.

Last, but clearly not least, the evaluation of the proposals is also a very relevant problem. We have found that many authors have used Wang and Hu's [64] repository in addition to their own repositories; unfortunately, the subsets of tables selected were different and their sizes range from as many as 342 795 tables to a hundred tables or fewer. Definitely, recent repositories like DWDTC [18] or WDC [35] will help. We have also found many authors who used k-fold cross-validation, but there is not a general consensus; there is not even a consensus regarding the value of k in the cases in which this procedure was used. As a conclusion, the experimental results reported are not comparable to each other. Neither is it common to find figures regarding efficiency, which makes it difficult to realise whether a proposal might work well in a production scenario. Jiménez et al. [29] set a foundation regarding how to evaluate information extraction proposals in general, but they did not focus on the tasks involved in extracting information from tables that are encoded using HTML.

Summing up: extracting data from tables that are encoded in HTML is an active research field in which we expect new results to be published in the near future. We hope that this article helps researchers sift through the state-of-the-art proposals in this field.

Acknowledgments
The work by Juan C. Roldán, Patricia Jiménez, and Rafael Corchuelo was supported by the Spanish R&D programme with grants TIN2013-40848-R and TIN2016-75394-R. The work by Juan C. Roldán was also supported by the Fulbright programme.
References

[1] K. Braunschweig, M. Thiele, and W. Lehner. From web tables to concepts: a semantic normalization approach. In ER, pages 247–260, 2015. doi: 10.1007/978-3-319-25264-3_18.
[2] A. L. Buchsbaum, D. F. Caldwell, K. W. Church, G. S. Fowler, and S. Muthukrishnan. Engineering the compression of massive tables: an experimental approach. In SODA, pages 175–184, 2000. URL http://dl.acm.org/citation.cfm?id=338219.338249.
[3] M. J. Cafarella, A. Y. Halevy, D. Z. Wang, E. Wu, and Y. Zhang. WebTables: exploring the power of tables on the Web. PVLDB, 1(1):538–549, 2008.
[4] M. J. Cafarella, A. Y. Halevy, Y. Zhang, D. Z. Wang, and E. Wu. Uncovering the relational Web. In WebDB, 2008. URL http://webdb2008.como.polimi.it/images/stories/WebDB2008/paper30.pdf.
[5] M. J. Cafarella, A. Y. Halevy, H. Lee, J. Madhavan, C. Yu, D. Z. Wang, and E. Wu. Ten years of web tables. PVLDB, 11(12):2140–2149, 2018. doi: 10.14778/3229863.3240492.
[6] M. Cannaviccio, L. Ariemma, D. Barbosa, and P. Merialdo. Leveraging wikipedia table schemas for knowledge graph augmentation. In WebDB, pages 5:1–5:6, 2018. doi: 10.1145/3201463.3201468.
[7] H.-H. Chen, S.-C. Tsai, and J.-H. Tsai. Mining tables from large scale HTML texts. In COLING, pages 166–172, 2000. URL http://aclweb.org/anthology/C00-1025.
[8] P. Christen. Data Matching: Concepts and Techniques for Record Linkage, Entity Resolution, and Duplicate Detection. Springer, 2012. doi: 10.1007/978-3-642-31164-2.
[9] X. Chu, Y. He, K. Chakrabarti, and K. Ganjam. TEGRA: table extraction by global record alignment. In SIGMOD, pages 1713–1728, 2015. doi: 10.1145/2723372.2723725.
[10] X. Chu, J. Morcos, I. F. Ilyas, M. Ouzzani, P. Papotti, N. Tang, and Y. Ye. KATARA: a data cleaning system powered by knowledge bases and crowdsourcing. In SIGMOD Conference, pages 1247–1261, 2015. doi: 10.1145/2723372.2749431.
[11] A. Cimmino and R. Corchuelo. On feeding business systems with linked resources from the Web of Data. In BIS, pages 307–320, 2018. doi: 10.1007/978-3-319-93931-5_22.
[12] A. Cimmino and R. Corchuelo. A hybrid genetic-bootstrapping approach to link resources in the Web of Data. In HAIS, pages 145–157, 2018. doi: 10.1007/978-3-319-92639-1_13.
[13] W. W. Cohen, M. Hurst, and L. S. Jensen. A flexible learning system for wrapping tables and lists in HTML documents. In WWW, pages 232–241, 2002. doi: 10.1145/511446.511477.
[14] A. Costa-Silva, A. M. Jorge, and L. Torgo. Design of an end-to-end method to extract information from tables. IJDAR, 8(2-3):144–171, 2006. doi: 10.1007/s10032-005-0001-x.
[15] E. Crestan and P. Pantel. A fine-grained taxonomy of tables on the Web. In CIKM, pages 1405–1408, 2010. doi: 10.1145/1871437.1871633.
[16] E. Crestan and P. Pantel. Web-scale table census and classification. In WSDM, pages 545–554, 2011. doi: 10.1145/1935826.1935904.
[17] X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: a web-scale approach to probabilistic knowledge fusion. In KDD, pages 601–610, 2014. doi: 10.1145/2623330.2623623.
[18] J. Eberius, K. Braunschweig, M. Hentsch, M. Thiele, A. Ahmadov, and W. Lehner. Building the Dresden Web Table Corpus: a classification approach. In BDC, pages 41–50, 2015. doi: 10.1109/BDC.2015.30.
[19] V. Efthymiou, O. Hassanzadeh, M. Rodríguez-Muro, and V. Christophides. Matching web tables with knowledge base entities: from entity lookups to entity embeddings. In ISWC, pages 260–277, 2017. doi: 10.1007/978-3-319-68288-4_16.
[20] H. Elmeleegy, J. Madhavan, and A. Y. Halevy. Harvesting relational tables from lists on the Web. VLDB J., 20(2):209–226, 2011. doi: 10.1007/s00778-011-0223-0.
[21] D. W. Embley, M. Hurst, D. P. Lopresti, and G. Nagy. Table-processing paradigms: a research survey. IJDAR, 8(2-3):66–86, 2006. doi: 10.1007/s10032-006-0017-x.
[22] D. W. Embley, M. S. Krishnamoorthy, G. Nagy, and S. C. Seth. Converting heterogeneous statistical tables on the Web to searchable databases. IJDAR, 19(2):119–138, 2016. doi: 10.1007/s10032-016-0259-1.
[23] J. Fan, M. Lu, B. C. Ooi, W.-C. Tan, and M. Zhang. A hybrid machine-crowdsourcing system for matching web tables. In ICDE, pages 976–987, 2014. doi: 10.1109/ICDE.2014.6816716.
[24] F. Fumarola, T. Weninger, R. Barber, D. Malerba, and J. Han. Extracting general lists from web documents: a hybrid approach. In IEAAIE, pages 285–294, 2011. doi: 10.1007/978-3-642-21822-4_29.
[25] M. Galkin, D. Mouromtsev, and S. Auer. Identifying web tables: supporting a neglected type of content on the Web. In KESW, pages 48–62, 2015. doi: 10.1007/978-3-319-24543-0_4.
[26] W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak. Towards domain-independent information extraction from web tables. In WWW, pages 71–80, 2007. doi: 10.1145/1242572.1242583.
[27] M. Hurst. Layout and language: challenges for table understanding on the Web. In WDA, pages 27–30, 2001. URL http://wda2001.csc.liv.ac.uk/Papers/12_hurst_wda2001.pdf.
[28] M. Hurst. Classifying TABLE elements in HTML. In WWW, 2002.
[29] P. Jiménez, R. Corchuelo, and H. A. Sleiman. ARIEX: automated ranking of information extractors. Knowl.-Based Syst., 93:84–108, 2016. doi: 10.1016/j.knosys.2015.11.004.
[30] S.-W. Jung and H.-C. Kwon. A scalable hybrid approach for extracting head components from web tables. IEEE Trans. Knowl. Data Eng., 18(2):174–187, 2006. doi: 10.1109/TKDE.2006.19.
[31] Z. Khayyat, I. F. Ilyas, A. Jindal, S. Madden, M. Ouzzani, P. Papotti, J. Quiané-Ruiz, N. Tang, and S. Yin. BigDansing: a system for big data cleansing. In SIGMOD Conference, pages 1215–1230, 2015. doi: 10.1145/2723372.2747646.
[32] Y.-S. Kim and K.-H. Lee. Detecting tables in web documents. Eng. Appl. of AI, 18(6):745–757, 2005. doi: 10.1016/j.engappai.2005.01.009.
[33] C. A. Knoblock, P. A. Szekely, E. E. Fink, D. Degler, D. Newbury, R. Sanderson, K. Blanch, S. Snyder, N. Chheda, N. Jain, R. R. Krishna, N. B. Sreekanth, and Y. Yao. Lessons learned in building linked data for the American Art Collaborative. In ISWC, pages 263–279, 2017. doi: 10.1007/978-3-319-68204-4_26.
[34] L. R. Lautert, M. M. Scheidt, and C. F. Dorneles. Web table taxonomy and formalization. SIGMOD Record, 42(3):28–33, 2013. doi: 10.1145/2536669.2536674.
[35] O. Lehmberg, D. Ritze, R. Meusel, and C. Bizer. A large public corpus of web tables containing time and context meta-data. In WWW, pages 75–76, 2016. doi: 10.1145/2872518.2889386.
[36] K. Lerman, C. Knoblock, and S. Minton. Automatic data extraction from lists and tables in web sources. In IJCAI, 2001.
[37] K. Lerman, L. Getoor, S. Minton, and C. A. Knoblock. Using the structure of web sites for automatic segmentation of tables. In SIGMOD, pages 119–130, 2004. doi: 10.1145/1007568.1007584.
[38] T. Liao, T. Liu, S. Zhang, and Z. Liu. Research on web table positioning technology based on table structure and heuristic rules. In AISC, pages 351–360, 2018. doi: 10.1007/978-3-319-67071-3_41.
[39] X. Ling, A. Y. Halevy, F. Wu, and C. Yu. Synthesizing union tables from the Web. In IJCAI, 2013.
[40] M.-L. Lo, K.-L. Wu, and P. S. Yu. TabSum: a flexible and dynamic table summarization approach. In ICDCS, pages 628–635, 2000. doi: 10.1109/ICDCS.2000.840979.
[41] D. P. Lopresti and G. Nagy. Automated table processing: an (opinionated) survey. In GREC, pages 109–134, 1999.
[42] D. P. Lopresti and G. Nagy. A tabular survey of automated table processing. In GREC, pages 93–120, 2000. doi: 10.1007/3-540-40953-X_9.
[43] J. Mankoff, H. Fait, and T. Tran. Is your web page accessible? A comparative study of methods for assessing web page accessibility for the blind. In CHI, pages 41–50, 2005. doi: 10.1145/1054972.1054979.
[44] N. Milošević, C. Gregson, R. Hernandez, and G. Nenadic. Disentangling the structure of tables in scientific literature. In NLDB, pages 162–174, 2016. doi: 10.1007/978-3-319-41754-7_14.
[45] V. Mulwad, T. Finin, Z. Syed, and A. Joshi. Using Linked Data to interpret tables. In COLD, 2010. URL http://ceur-ws.org/Vol-665/MulwadEtAl_COLD2010.pdf.
[46] K. Nishida, K. Sadamitsu, R. Higashinaka, and Y. Matsuo. Understanding the semantic structures of tables with a hybrid deep neural network architecture. In AAAI, pages 168–174, 2017. URL http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14396.
[47] H. Okada and T. Miura. Detection of layout-purpose table tags based on machine learning. In UAHCI, pages 116–123, 2007. doi: 10.1007/978-3-540-73283-9_14.
[48] R. K. Padmanabhan, R. C. Jandhyala, M. S. Krishnamoorthy, G. Nagy, S. C. Seth, and W. Silversmith. Interactive conversion of web tables. In GREC, pages 25–36, 2009. doi: 10.1007/978-3-642-13728-0_3.
[49] G. Penn, J. Hu, H. Luo, and R. T. McDonald. Flexible web document analysis for delivery to narrow-bandwidth devices. In ICDAR, pages 1074–1078, 2001. doi: 10.1109/ICDAR.2001.953951.
[50] C. Peterson. Learning Responsive Web Design. O'Reilly, 2014.
[51] R. Pimplikar and S. Sarawagi. Answering table queries on the Web using column keywords. PVLDB, 5(10):908–919, 2012. doi: 10.14778/2336664.2336665.
[52] F. Qi, X. Wu, and N. Wang. Building top-k consistent results for web table augmentation. In WISA, pages 74–79, 2017. doi: 10.1109/WISA.2017.30.
[53] L.-A. Ratinov, D. Roth, D. Downey, and M. Anderson. Local and global algorithms for disambiguation to Wikipedia. In ACL, pages 1375–1384, 2011.
[54] X. Ren, Z. Wu, W. He, M. Qu, C. R. Voss, H. Ji, T. F. Abdelzaher, and J. Han. CoType: joint extraction of typed entities and relations with knowledge bases. In WWW, pages 1015–1024, 2017. doi: 10.1145/3038912.3052708.
[55] D. Ritze and C. Bizer. Matching web tables to DBpedia: a feature utility study. In EDBT, pages 210–221, 2017. doi: 10.5441/002/edbt.2017.20.
[56] A. D. Sarma, L. Fang, N. Gupta, A. Y. Halevy, H. Lee, F. Wu, R. Xin, and C. Yu. Finding related tables. In SIGMOD, pages 817–828, 2012. doi: 10.1145/2213836.2213962.
[57] Y. A. Sekhavat, F. D. Paolo, D. Barbosa, and P. Merialdo. Knowledge base augmentation using tabular data. In LDOW, 2014. URL http://ceur-ws.org/Vol-1184/ldow2014_paper_02.pdf.
[58] J. L. Sierra, A. Fernández-Valmayor, and B. Fernández-Manjón. From documents to applications using markup languages. IEEE Software, 25(2):68–76, 2008. doi: 10.1109/MS.2008.36.
[59] J. W. Son and S.-B. Park. Web table discrimination with composition of rich structural and content information. Appl. Soft Comput., 13(1):47–57, 2013. doi: 10.1016/j.asoc.2012.07.025.
[60] M. Taheriyan, C. A. Knoblock, P. A. Szekely, and J. L. Ambite. Learning the semantics of structured data sources. J. Web Semant., 37-38:152–169, 2016. doi: 10.1016/j.websem.2015.12.003.
[61] I. Taleb, R. Dssouli, and M. A. Serhani. Big Data pre-processing: a quality framework. In IEEE Intl. Congress on Big Data, pages 191–198, 2015. doi: 10.1109/BigDataCongress.2015.35.
[62] F. Tschirschnitz, T. Papenbrock, and F. Naumann. Detecting inclusion dependencies on very many tables. ACM Trans. Database Syst., 42(3):18:1–18:29, 2017. doi: 10.1145/3105959.
[63] P. Venetis, A. Y. Halevy, J. Madhavan, M. Pasca, W. Shen, F. Wu, G. Miao, and C. Wu. Recovering semantics of tables on the Web. PVLDB, 4(9):528–538, 2011. doi: 10.14778/2002938.2002939.
[64] Y. Wang and J. Hu. Detecting tables in HTML documents. In DAS, pages 249–260, 2002. doi: 10.1007/3-540-45869-7_29.
[65] K.-L. Wu, S.-K. Chen, and P. S. Yu. Dynamic refinement of table summarization for m-commerce. In WECWIS, pages 179–186, 2002. doi: 10.1109/WECWIS.2002.1021257.
[66] X. Wu, C. Cao, Y. Wang, J. Fu, and S. Wang. Extracting knowledge from web tables based on DOM tree similarity. In KSEM, pages 302–313, 2016. doi: 10.1007/978-3-319-47650-6_24.
[67] M. Yakout, K. Ganjam, K. Chakrabarti, and S. Chaudhuri. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. In SIGMOD Conference, pages 97–108, 2012. doi: 10.1145/2213836.2213848.
[68] Y. Yang and W.-S. Luk. A framework for web table mining. In WIDM, pages 36–42, 2002. doi: 10.1145/584931.584940.
[69] M. Yoshida, K. Torisawa, and J. Tsujii. A method to integrate tables of the World Wide Web. In WDA, pages 31–34, 2001. URL http://wda2001.csc.liv.ac.uk/Papers/13_yoshida_wda2001.pdf.
[70] R. Zanibbi, D. Blostein, and J. R. Cordy. A survey of table recognition. IJDAR, 7(1):1–16, 2004. doi: 10.1007/s10032-004-0120-9.
[71] M. Zhang and K. Chakrabarti. InfoGather+: semantic matching and annotation of numeric and time-varying attributes in web tables. In SIGMOD Conference, pages 145–156, 2013. doi: 10.1145/2463676.2465276.
[72] X. Zhang, Y. Chen, J. Chen, X. Du, and L. Zou. Mapping entity-attribute web tables to web-scale knowledge bases. In