Automatic Integration Issues of Tabular Data for On-Line Analysis Processing
Yuzhao Yang∗, Jérôme Darmont∗∗, Franck Ravat∗, Olivier Teste∗
∗ Institut de Recherche en Informatique de Toulouse (IRIT), CNRS (UMR 5505), Université de Toulouse, 118 Route de Narbonne, 31069 Toulouse Cedex 9, France, {Yuzhao.Yang, Franck.Ravat, Olivier.Teste}@irit.fr
∗∗ ERIC UR 3083, Université de Lyon, Université Lumière Lyon 2, 5 avenue Pierre Mendès France, F-69676 Bron Cedex, France, [email protected]
Abstract.
Companies and individuals produce numerous tabular data. The objective of this position paper is to draw up the challenges posed by the automatic integration of data in the form of tables so that they can be cross-analyzed. We provide a first automatic solution for the integration of such tabular data to allow On-Line Analysis Processing. To fulfill this task, the features of tabular data must be analyzed and the challenges of automatic multidimensional schema generation must be addressed. Hence, we propose a typology of tabular data and an automatic process based on different steps to integrate one or more data sources.
Business Intelligence (BI) plays an important role in numerous companies and administrations to efficiently support decision making processes. With the current digitization trend, even small companies and organizations can exploit a large amount of data every day, and the rise of open data makes various data even more accessible. Nevertheless, the implementation of a BI project needs to be realized by people who have professional knowledge and deep skills in BI technologies such as data warehousing and data visualization. Such projects are also usually expensive and time-consuming. As a result, it is necessary to find a solution to automate the BI process to allow small enterprises, organizations and even individuals without deep technical expertise to easily analyze data. Up to now, there is no platform that achieves this goal.

In current BI systems, data are extracted and stored in a data warehouse to allow On-Line Analysis Processing (OLAP) and visual data rendering. Thus, automating the data warehousing process is crucial to allow non-specialists to exploit such approaches. There exist various forms of data, but most of the data in small enterprises and organizations, as well as most open data, are in tabular form from spreadsheet software. Although there are commercial BI tools allowing the exploitation of tabular data, such as Excel, Qlikview or Tableau, none of them automates the multidimensional analysis of tabular data.

Unlike relational databases, which are well-structured and where we can easily retrieve keys, cardinalities and other table metadata, or XML files, which sometimes provide schemas such as Document Type Definitions (DTDs), spreadsheet tabular data do not directly provide such information. Moreover, there are different types and structures of tabular data, which makes it difficult to automatically identify labels, values and aggregates.
Without input from the user, it is hard to understand the semantics of some data and to choose the appropriate measures for OLAP. Different tables also need to be integrated by matching and mapping. Facing all these requirements and problems, we propose a possible solution in this paper. Table identification is the first significant step we introduce to reach our goal. To carry out this step, we also define an extended typology of tabular data to classify different table types.

The remainder of this paper is organized as follows. In Section 2, we review the related works about automatic multidimensional schema generation and typologies of tabular data, respectively. In Section 3, we define our typology of tabular data. In Section 4, we present our tabular data integration solution and discuss related perspectives. Finally, in Section 5, we conclude this paper and hint at future research.

OLAP (On-Line Analytical Processing) systems allow analysts to improve decision-making processes by consulting and analyzing aggregated data. A multidimensional schema organizes data according to analysis subjects (facts) associated with analysis axes (dimensions). Each fact is composed of measures. Each dimension contains one or several analysis viewpoints (hierarchies). Each hierarchy organizes analysis data at various granularities.

There are different approaches for the generation of multidimensional schemas (Romero and Abelló, 2009). Top-down (also called demand-driven) approaches start from user requirements and generate the schema to satisfy these requirements, manually or automatically. Conversely, bottom-up approaches are mostly automatic or semi-automatic solutions that generate the schema from the data source. Moreover, hybrid approaches take both user requirements and the data source into account.
Our work focuses on bottom-up approaches because the users we target do not necessarily know or anticipate precise requirements.

Most bottom-up processes are designed for relational databases. (Phipps and Davis, 2002) introduce an automatic data warehouse design approach whose input is an Entity-Relationship (ER) schema. They consider numeric fields as measures; the more numeric fields a table contains, the more likely it is to be a fact, so they create for each table a multidimensional schema by taking the numeric fields of the table as measures. Then, they identify the datetime-type data to create temporal dimensions. The fields of the table that are not keys, numeric or dates become dimensions. Finally, they complete dimensions by verifying whether the tables connected to the fact are on the many side of the relationship, and complete hierarchies by verifying the many-to-one relationships of the remaining tables. However, they use queries to evaluate which candidate schemas best meet the user's needs, which is done manually. Thus, the process is not fully automated.

(I.-Y. Song et al., 2007) also propose a semi-automatic method to generate star schemas from ER diagrams. The main difference with the approach by (Phipps and Davis, 2002) is that the fact table is chosen by calculating the Connection Topology Value (CTV) of each entity, which is a composite function of the topology value of direct and indirect many-to-one relationships. The selected entities are those that have a CTV higher than a threshold calculated from the mean and standard deviation of the CTVs. In the end, the user must check the redundancy of the time dimension, merge the related dimensions and rename the elements.

There are also methods for other data types. (Golfarelli et al., 2001) propose a semi-automatic method to construct a conceptual data warehouse schema from XML sources.
In their approach, schema design is mainly based on DTDs, as they define the elements and attributes and describe the relationships between elements. They simplify the DTDs and create a DTD graph that represents the structure of the XML. Then, based on the facts chosen by the designer and the DTD graph, they build an attribute tree representing the conceptual schema.

Eventually, data mining methods are also used to automate multidimensional schema generation. (Usman et al., 2011) propose to use hierarchical clustering. Their architecture includes a data mining layer to preprocess the dataset and cluster data, and an automatic schema generation layer to generate the multidimensional schema. The solution is implemented and tested with the ForestCoverType dataset, but the authors specify neither the similarity metric they use nor the schema generation algorithm.

In summary, existing multidimensional schema generation approaches are designed for specific data sources, such as relational databases and XML documents, and they are not fully automated. There is no such approach specially designed for tabular data. Moreover, the automatic generation of multidimensional schemas for tabular data is specific, and there are still many challenges to face compared to other data types (Section 1).
There is a lot of research concerning tabular data, especially web tables, and different table classification methods exist. (Wang and Hu, 2002) and (Crestan and Pantel, 2011) divide web tables into two types: genuine or relational knowledge tables, and non-genuine or layout tables, based on whether tables contain relational knowledge or are used only for grouping contents for easy viewing. We are only interested in genuine tables/relational knowledge tables, because non-genuine/layout tables are only used for navigational or formatting purposes on the web.

(Crestan and Pantel, 2011) further divide relational knowledge tables into listing tables, attribute/value tables, matrix tables, enumeration tables and form tables. This classification method is refined and completed by (Lautert et al., 2013), who add new types of relational knowledge tables: concise tables, nested tables, multivalued tables and split tables.

(Yoshida et al., 2001) define nine table types according to whether attributes are arranged vertically or horizontally, whether there is a header and how the header is positioned. These types are presented as a graph. (Milosevic et al., 2016) give a classification of structural tables by dimensionality: one-dimensional tables, two-dimensional tables and multidimensional tables, which can also be typed into super-row tables and multi-tables. (Chen and Cafarella, 2013) also introduce the concept of hierarchical spreadsheets, which contain headers with multiple levels.

All these classifications describe tabular data by different aspects, such as the direction of the arrangement of data, the characteristics of cells, the dimensionality of tables, the functionality of tables, etc. However, when considering all table types, none of these classifications is complete.
For example, the case of a table whose header is distributed over two rows due to space limits is considered in the classification of (Yoshida et al., 2001), but not by the others, while the matrix table is not considered in their classification.
Data are presented differently in various types of tables, and extracting them from different types of tables is thus difficult. Therefore, identifying the type of a table is important and requires a complete classification of tabular data. Thus, we propose a new typology of tabular data (FIG. 1) that fuses and complements the classifications from Section 2.2 with respect to three aspects: the structure of tables, the content of cells and headers, which are respectively in gray, blue and green in the figure.

FIG. 1 – Typology of tabular data
The structure of tables aspect concerns the dimensionality of data and the arrangement of rows and columns. With respect to dimensionality, tables are classified into one-dimensional tables and multidimensional tables, which also bear sub-types.

FIG. 2 – Cross table, Hierarchical header table

1. One-dimensional tables have only one dimension.
— Listing tables list values belonging to a same attribute.
— In horizontal tables, data are arranged horizontally, i.e., each column represents one attribute and each row represents a tuple containing values of different attributes.
— In vertical tables, data are arranged vertically, i.e., each row represents one attribute and each column represents a tuple containing values of different attributes.
2. Multidimensional tables have multiple dimensions.
— Super-row tables contain multiple dimensions arranged into different levels of one row/column, as classified and exemplified by (Milosevic et al., 2016).
— Cross tables are usually two-dimensional tables where one dimension is arranged in columns and another in rows (table in FIG. 2). The value of each cell is determined by the dimensions of the column and the row in which it is located. In cases where there are more than two dimensions, we can have cross tables for different dimensions.
This aspect concerns the characteristics of cell content, as classified and exemplified by (Lautert et al., 2013).

1. Simple cell tables contain only "normal" cells, without any of the exceptions listed below.
2. Merged cell tables contain cells that are gathered or merged into one cell.
— Category-merged cell tables contain cells merged together because they belong to the same category.
— Value-merged cell tables contain cells with the same value merged into one cell.
3. Nested tables contain tables nested into cells.
4. Multivalued tables contain cells holding multiple values.
— Simple multivalued tables contain multivalued cells whose values belong to the same domain.
— Composed multivalued tables contain multivalued cells whose values belong to different domains.
We classify table headers with respect to the following characteristics.

1. No-header tables refer to tables without any header.
2. Simple header tables contain only one-level "usual" headers.
3. Hierarchical header tables contain headers with multiple levels (Table 1).

Simple header tables and hierarchical header tables are subdivided with respect to header arrangement.
— Each header in single header tables is arranged in only one single row/column, without repetition.

FIG. 3 – Distributed header table

— The header of distributed header tables is distributed over different rows/columns (table in FIG. 3).

FIG. 4 – Duplicated header table

— Duplicated header tables have headers repeated many times (table in FIG. 4).

(Zhang and Balog, 2020) define the elements of a table, including table page title, table caption, table headings, table cells, table rows, table columns and table entities. Table page title and table caption are explanatory elements of a table. Table entities concern the content and topic of a table. All the other elements are considered in our proposed classification. Moreover, our classification is based on the state of the art: all the possible cases mentioned in existing tabular data typologies are taken into account. Thus, we believe that this typology is complete.
FIG. 5 illustrates our approach, which aims to automatically integrate tabular data for OLAP analysis. We first extract and identify tables (Section 4.1). The data source may contain one or several tables. If there is only one table, we transform the table type (Section 4.2), select the measures (Section 4.3), detect functional dependencies (Section 4.4) and finally generate the multidimensional schema (Section 4.5). If there are several tables, we either match the tables into one table by using similarity measures and proceed as above, or generate a multidimensional schema for each table as above and match the schemas.

FIG. 5 – Overview of our tabular data integration approach
Since the users we target are individuals, small enterprises or organizations without knowledge of BI, data sources may be of various types, e.g., CSV files, spreadsheet files, HTML tables or even images. There may also be several tables in some sources. Therefore, the first thing to do is to identify the sources and extract the tables within. Many table extraction methods are surveyed in (Zhang and Balog, 2020), especially for web tables. There also exist deep learning methods to detect tables in images (Schreiber et al., 2017); (Paliwal et al., 2019). Let us first discuss our integration solution for the one-table case. We generalize it in Section 4.6.
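As a minimal illustration of this extraction step, the following sketch handles only the simplest case: a CSV export in which blank rows separate distinct tables (a simplifying assumption; HTML tables and images require the surveyed methods instead).

```python
import csv
import io

def split_tables(text):
    """Sketch: in a spreadsheet's CSV export, consecutive non-empty rows
    form one table and blank rows separate tables. This is an assumption
    for illustration, not a general table extraction method."""
    tables, current = [], []
    for row in csv.reader(io.StringIO(text)):
        if any(cell.strip() for cell in row):
            current.append(row)
        elif current:          # blank row closes the current table
            tables.append(current)
            current = []
    if current:
        tables.append(current)
    return tables

raw = "a,b\n1,2\n\nx,y,z\n7,8,9\n"
tables = split_tables(raw)
print(len(tables))  # 2: the file holds two tables separated by a blank row
```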
First, we should transform multidimensional source tables into a one-dimensional table. (Milosevic et al., 2016) propose a helpful method allowing the identification of multidimensional table structures. However, multidimensional tables may already contain obvious multidimensional structures that we could employ directly for the construction of a multidimensional schema. If this were the case, transforming the multidimensional table would be useless and time-wasting. In this article, we propose to perform the transformation to get a uniform representation of tables, so that we can easily get the attributes and values for the following steps of our approach. We consider multidimensional tables as a special case that we will process specifically.

For tables containing non-simple cells, it is necessary to convert these cells into simple cells by decomposing merged cells and multivalued cells, adding new cells or columns. Distributed and duplicated headers need to be detected with the algorithm of (Yoshida et al., 2001) in order to reform the table as a single-header table. In hierarchical header tables, hierarchical relations should be identified as candidates for hierarchies in the multidimensional model. Hierarchies can be extracted with the method proposed by (Chen and Cafarella, 2013). Eventually, there may sometimes be no attribute header at all in a table. In this case, we can use column identification methods (Zhang and Balog, 2020) to retrieve the semantic label of columns by matching them with knowledge bases.
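For a cross table, the transformation into the uniform one-dimensional (horizontal) representation amounts to an unpivot. A sketch with pandas, on a toy cross table with hypothetical channel names and values:

```python
import pandas as pd

# A small cross table: the 'Year' dimension is arranged in rows and the
# 'Channel' dimension in columns; each cell holds one value.
cross = pd.DataFrame(
    {"Channel A": [100, 120], "Channel B": [80, 95]},
    index=pd.Index([2018, 2019], name="Year"),
)

# Unpivot into the one-dimensional representation: one attribute per
# column, one tuple per row, ready for the following steps.
flat = cross.reset_index().melt(
    id_vars="Year", var_name="Channel", value_name="Hours"
)
print(flat.shape)  # (4, 3): 2 years x 2 channels, 3 attributes
```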
To integrate a one-dimensional table for OLAP, we must first define measures. Measures are usually obtained by aggregation and calculation (max, min, count, etc.), so most of them are numeric data. Normally, during the data warehouse design process, facts and measures are chosen by the user. As we aim to automate this process, we choose all numeric attributes in tables as candidate measures. However, not all numeric data can be considered as measures. For example, some numeric data may be IDs or Booleans. Yet, we can use the semantic labeling of numeric data to determine whether an attribute can be treated as a measure. (Alobaid et al., 2019) propose algorithms to identify the type of numeric data, including nominal data, ordinal data, intervals and ratios. Intervals and ratios are more likely to be measures.

This automatic process can provide candidate measures to the users, but we may wrongly choose or omit some measures. For instance, a measure calculated by the count of an attribute does not need to be numeric. Calculating measures is also a big problem, because without knowing the exact user requirements, we cannot know the exact aggregation function. We can only propose basic functions such as count, sum, average, minimum and maximum, while measures may also be calculated by complex formulas on one or several attributes. Hence, if the user has specific requirements, we should also let her/him specify measures and correct the candidate measures we propose automatically.
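A simple heuristic filter along these lines (not the semantic labeling of (Alobaid et al., 2019), just an illustrative sketch) can already discard Boolean-like and ID-like numeric columns:

```python
def candidate_measures(columns):
    """Sketch: keep numeric columns, but drop Boolean-like columns (only
    0/1 values) and ID-like columns (all-distinct integers), which are
    unlikely to be measures. The thresholds are illustrative assumptions."""
    candidates = []
    for name, values in columns.items():
        numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool)
            for v in values
        )
        if not numeric:
            continue                      # textual attribute: dimension material
        if set(values) <= {0, 1}:
            continue                      # Boolean-like
        if all(float(v).is_integer() for v in values) and len(set(values)) == len(values):
            continue                      # ID-like
        candidates.append(name)
    return candidates

columns = {
    "channel": ["A", "B", "C"],          # textual
    "id": [1, 2, 3],                     # ID-like
    "is_public": [0, 1, 1],              # Boolean-like
    "speech_rate": [0.32, 0.41, 0.38],   # plausible measure (ratio)
}
print(candidate_measures(columns))  # ['speech_rate']
```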
The objective of this step is to determine the multidimensional model's dimensions and associated hierarchies. To meet these needs, we apply the principles of functional dependencies between attributes. There exist many algorithms to discover functional dependencies (Liu et al., 2012); (Papenbrock et al., 2015). For a relation schema R, let X and Y be subsets of the attributes of R. In a functional dependency X → Y, X is usually called the left-hand side and Y the right-hand side. We are only interested in functional dependencies with a single attribute in the right-hand side, and we compute the minimal cover, i.e., the minimal set of functional dependencies able to infer all the functional dependencies. Hence, we obtain the elementary functional dependencies, such that no strict subset of the left-hand side determines the right-hand side. Moreover, we delete all the transitivity and pseudo-transitivity dependencies. We are also only interested in functional dependencies whose left-hand side has a single attribute, since there is one attribute in each level of a hierarchy.

To make sure that the functional dependencies we discover conform to the dependency relationships of attributes in the real world, we hypothesize that there is enough data, in terms of quantity and variety, to represent real dependency relationships. Moreover, there should be no erroneous nor empty data; but if this were the case, we could detect approximate functional dependencies. There are many methods for approximate functional dependency discovery (Liu et al., 2012), which propose different error measures. We can set a satisfaction threshold to decide the approximate functional dependencies. We can also search for and complete empty data by querying open data (Eberius et al., 2012).
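Restricted to single-attribute sides, FD discovery admits a very naive sketch (real discovery algorithms such as those evaluated by (Papenbrock et al., 2015) are far more sophisticated): X → Y holds on the data if every value of X is associated with exactly one value of Y.

```python
from itertools import permutations

def single_attribute_fds(rows, attributes):
    """Naive sketch of FD discovery restricted to single-attribute sides,
    the only form kept for hierarchy building: X -> Y holds if each value
    of X maps to a single value of Y over all rows."""
    fds = []
    for x, y in permutations(attributes, 2):
        mapping, holds = {}, True
        for row in rows:
            expected = mapping.setdefault(row[x], row[y])
            if expected != row[y]:
                holds = False
                break
        if holds:
            fds.append((x, y))
    return fds

rows = [
    {"city": "Lyon", "country": "France"},
    {"city": "Lyon", "country": "France"},
    {"city": "Toulouse", "country": "France"},
]
found = single_attribute_fds(rows, ["city", "country"])
print(found)  # [('city', 'country')]: country -> city fails on 'France'
```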
Given all the functional dependencies that satisfy our requirements, we can connect them to infer all hierarchies, and even draw a functional dependency diagram to visualize them. For example, let us consider a product order table with attributes idCustomer, nameCustomer, cityCustomer, countryCustomer, classCustomer, idProduct, nameProduct, categoryProduct and quantity. With quantity chosen as the measure, we get a fact (F, quantity). If the data of the table satisfy the hypothesis mentioned in Section 4.4, applying the functional dependency detection algorithm may yield: idCustomer → nameCustomer, nameCustomer → cityCustomer, cityCustomer → countryCustomer, nameCustomer → classCustomer, idProduct → nameProduct, nameProduct → categoryProduct. We can then obtain three hierarchies, but we cannot know whether an attribute is a parameter or a weak attribute, so we consider all the attributes as parameters. Thus, the three hierarchies are: (H1, <idCustomer, nameCustomer, cityCustomer, countryCustomer>), (H2, <idCustomer, nameCustomer, classCustomer>) and (H3, <idProduct, nameProduct, categoryProduct>).
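The chaining of FDs into hierarchies can be sketched as a path enumeration over the FD graph, assuming the input is already a minimal cover with transitive dependencies removed, as described above:

```python
def build_hierarchies(fds):
    """Sketch: chain single-attribute FDs X -> Y into hierarchy paths,
    starting from roots (attributes that never appear on a right-hand
    side). Assumes a minimal cover (no transitive dependencies)."""
    right_sides = {y for _, y in fds}
    successors = {}
    for x, y in fds:
        successors.setdefault(x, []).append(y)
    roots = [x for x in successors if x not in right_sides]

    def paths(node):
        nexts = successors.get(node, [])
        if not nexts:
            return [(node,)]
        return [(node,) + p for n in nexts for p in paths(n)]

    return [p for r in roots for p in paths(r)]

# The running example's minimal cover, with the paper's attribute names:
fds = [("idCustomer", "nameCustomer"), ("nameCustomer", "cityCustomer"),
       ("cityCustomer", "countryCustomer"), ("nameCustomer", "classCustomer"),
       ("idProduct", "nameProduct"), ("nameProduct", "categoryProduct")]
hierarchies = build_hierarchies(fds)
print(len(hierarchies))  # 3 hierarchies, as in the example
```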
All the hierarchies that share the same root belong to the same dimension. As we can see in FIG. 6, we get two dimensions: (D1, {idCustomer, nameCustomer, cityCustomer, countryCustomer, classCustomer}, {H1, H2}) and (D2, {idProduct, nameProduct, categoryProduct}, {H3}).

The final step of our approach simply consists in linking measures to dimensions. We do not need to verify whether measures depend on the totality of the root parameters, because measures are obtained by the aggregation of attributes: if we aggregate over all root parameters, measures are certainly dependent on the totality of the root parameters, since there is no redundancy after aggregation. Therefore, we obtain a conceptual multidimensional model (nameSchema, F, {D1, D2}).

If there are equivalent attributes (X and Y being subsets of a relation schema R, X and Y are equivalent if X → Y and Y → X), there is probably a parameter and its weak attributes inside an equivalent attribute set, so we should know which attribute may be an ID, so that it can be considered as the parameter. We could also introduce semantic solutions to solve these problems.

Our approach currently applies to the one-table case. In the case of multiple tables, we propose two potential matching solutions, using different schema matching and data matching technologies with the help of similarity measures. We can either match all tables into one single table and automatically generate the multidimensional model of that table, or automatically generate the multidimensional model of each table and match these schemas into one single schema. The fusion can also be described by a manipulation algebra (Ravat et al., 2005).
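The grouping of hierarchies into dimensions by shared root, and the final assembly of the conceptual model, can be sketched as follows (the schema name is a hypothetical placeholder):

```python
def group_into_dimensions(hierarchies):
    """Sketch: hierarchies sharing the same root parameter form one
    dimension; a dimension's attribute set is the union of the levels
    of its hierarchies."""
    dimensions = {}
    for h in hierarchies:
        entry = dimensions.setdefault(h[0], {"attributes": set(), "hierarchies": []})
        entry["attributes"].update(h)
        entry["hierarchies"].append(h)
    return dimensions

hierarchies = [
    ("idCustomer", "nameCustomer", "cityCustomer", "countryCustomer"),
    ("idCustomer", "nameCustomer", "classCustomer"),
    ("idProduct", "nameProduct", "categoryProduct"),
]
dimensions = group_into_dimensions(hierarchies)
# Linking the fact (with its measure) to the dimensions yields the model;
# "orderSchema" is an illustrative name, not imposed by the method.
schema = ("orderSchema", ("F", ["quantity"]), sorted(dimensions))
print(len(dimensions))  # 2 dimensions: customer and product
```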
Since we are still studying the integration of multisource tables, we have only implemented the automatic multidimensional schema generation algorithm for a one-table source, in Python. To illustrate our approach, we test it with a CSV file from the open data site of the French government. This file records the speaking time of men and women, corresponding to more than a million hours of programs broadcast from 1995 to February 28, 2019.

FIG. 7 – Test file excerpt
As illustrated in FIG. 7, the file contains 7 columns: media type, channel name, is public channel, year, women expression rate, speech rate and number of analyzed hours. There are 21 radio stations and 34 TV channels recorded in the file. The result of our algorithm is shown in FIG. 8. We detect 3 measures: women expression rate, speech rate and number of analyzed hours. We get 3 hierarchies; since 2 of them share the same root, they belong to the same dimension, so we get 2 dimensions.

FIG. 8 – Result of our algorithm
The user may also need the men expression rate; she/he can simply add this measure with a simple calculation (1 - women expression rate). The user can also name the dimensions; we then get the multidimensional schema (FIG. 9).

FIG. 9 – Final multidimensional schema
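Such a user-defined derived measure reduces to a formula over an existing column; a sketch on hypothetical rows mirroring FIG. 7's columns:

```python
# Hypothetical rows; the derived measure is a user-supplied formula
# applied to an existing candidate measure.
rows = [{"channel": "A", "women_expression_rate": 0.31},
        {"channel": "B", "women_expression_rate": 0.45}]
for row in rows:
    # round() tames floating-point noise in 1 - rate
    row["men_expression_rate"] = round(1 - row["women_expression_rate"], 10)
print(rows[0]["men_expression_rate"])  # 0.69
```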
In this paper, we study how to automate the integration of tabular data to generate multidimensional data warehouses. First, we propose a taxonomy of tabular data to help in the table identification step. Then, our integration solution consists in identifying the table, transforming it into a one-dimensional table and generating the multidimensional schema by exploiting functional dependencies. Matching the tables and then generating a multidimensional schema, or generating a multidimensional schema for each table and then matching the schemas, are two possible solutions for the integration of multiple tables.

In ongoing work, we will implement and refine the ideas that we propose. We will test the two possible methods for the generalization to multiple tables and evaluate their feasibility and performance to decide which one to use. Once we succeed in automatically generating the conceptual multidimensional schema, we will define an automatic process to populate the schema from the data and the content of the source table.
Acknowledgements
The research depicted in this paper is funded by the French National Research Agency (ANR), project ANR-19-CE23-0005 BI4people (Business Intelligence for the people).
References
Alobaid, A., E. Kacprzak, and O. Corcho (2019). Typology-based semantic labeling of numeric tabular data. Semantic Web 1(0), 1–5.

Chen, Z. and M. Cafarella (2013). Automatic web spreadsheet data extraction. In SSW '13: Proceedings of the 3rd International Workshop on Semantic Search Over the Web, pp. 1–8.

Crestan, E. and P. Pantel (2011). Web-scale table census and classification. In WSDM '11: Proceedings of the Fourth ACM International Conference on Web Search and Data Mining, pp. 545–554.

Eberius, J., M. Thiele, K. Braunschweig, and W. Lehner (2012). DrillBeyond: Enabling business analysts to explore the web of open data. In Proceedings of the VLDB Endowment, Volume 5.

Golfarelli, M., S. Rizzi, and B. Vrdoljak (2001). Data warehouse design from XML sources. In DOLAP '01: Proceedings of the 4th ACM International Workshop on Data Warehousing and OLAP, pp. 40–47.

I.-Y. Song, R. Khare, and B. Dai (2007). SAMSTAR: A semi-automated lexical method for generating star schemas from an entity-relationship diagram. In Proceedings of the ACM Tenth International Workshop on Data Warehousing and OLAP, pp. 9–16.

Lautert, L. R., M. M. Scheidt, and C. F. Dorneles (2013). Web table taxonomy and formalization. ACM SIGMOD Record 42(3), 28–33.

Liu, J., J. Li, C. Liu, and Y. Chen (2012). Discover dependencies from data - a review. IEEE Transactions on Knowledge and Data Engineering 24(2), 251–264.

Milosevic, N., C. Gregson, R. Hernandez, and G. Nenadic (2016). Disentangling the structure of tables in scientific literature. In Natural Language Processing and Information Systems, Volume 9612, pp. 162–174.

Paliwal, S., V. D, R. Rahul, M. Sharma, and L. Vig (2019). TableNet: Deep learning model for end-to-end table detection and tabular data extraction from scanned document images. In , pp. 128–133.

Papenbrock, T., J. Ehrlich, J. Marten, T. Neubert, J. Rudolph, M. Schönberg, J. Zwiener, and F. Naumann (2015). Functional dependency discovery: An experimental evaluation of seven algorithms. In Proceedings of the VLDB Endowment, Volume 8, pp. 1082–1093.

Phipps, C. and K. C. Davis (2002). Automating data warehouse conceptual schema design and evaluation. In Proceedings of the 4th International Workshop on Design and Management of Data Warehouses, pp. 23–32.

Ravat, F., O. Teste, and G. Zurfluh (2005). Manipulation et fusion de données multidimensionnelles. Revue des Nouvelles Technologies de l'Information (RNTI-E-3) - Extraction et Gestion des Connaissances I, 349–354.

Romero, O. and A. Abelló (2009). A survey of multidimensional modeling methodologies. International Journal of Data Warehousing and Mining 5(2), 1–23.

Schreiber, S., S. Agne, I. Wolf, A. Dengel, and S. Ahmed (2017). DeepDeSRT: Deep learning for detection and structure recognition of tables in document images. In , pp. 1162–1167.

Usman, M., S. Asghar, and S. Fong (2011). Hierarchical clustering model for automatic OLAP schema generation. Journal of E-Technology 2(1), 9–20.

Wang, Y. and J. Hu (2002). A machine learning based approach for table detection on the web. In Proceedings of the 11th International Conference on World Wide Web, pp. 242–250.

Yoshida, M., K. Torisawa, and J. Tsujii (2001). A method to integrate tables of the World Wide Web. In Proceedings of the 1st International Workshop on Web Document Analysis, pp. 31–34.

Zhang, S. and K. Balog (2020). Web table extraction, retrieval, and augmentation: A survey. ACM Transactions on Intelligent Systems and Technology 11(2), 1–35.