AMALGAM: A Matching Approach to fairify tabuLar data with knowledGe grAph Model
Rabia Azzi and Gayo Diallo
BPH Center/INSERM U1219, Univ. Bordeaux, F-33000, France, [email protected]
Abstract.
In this paper we present
AMALGAM, a matching approach to fairify tabular data with the use of a knowledge graph. The ultimate goal is to provide a fast and efficient approach to annotate tabular data with entities from a background knowledge source. The approach combines lookup and filtering services with text pre-processing techniques. Experiments conducted in the context of the 2020 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, with both the Column Type Annotation and Cell Type Annotation tasks, showed promising results.
Keywords:
Tabular Data, Knowledge Graph, Entity Linking
Making web data comply with the FAIR principles (Findable, Accessible, Interoperable and Reusable) has become a necessity in order to facilitate their discovery and reuse [1]. The value for knowledge discovery of implementing FAIR is to improve data integration, data cleaning, data mining, machine learning and knowledge discovery tasks. Successfully implemented FAIR principles will increase the value of data by making them findable and accessible and by resolving semantic ambiguities. Good data management is not a goal in itself, but rather the key conduit leading to knowledge discovery and acquisition, and to subsequent data and knowledge integration and reuse by the community after the data publication process [2].

Semantic annotation could be considered as a particular knowledge acquisition task [3,4,5]. The semantic annotation process may rely on formal metadata resources described with an ontology, sometimes even with multiple ontologies thanks to the use of semantic repositories [6]. Over the last years, tables have become one of the most used formats to share results and data. In this field, a set of systems for matching web tables to knowledge bases has been developed [7,8]. They can be categorized into two main tasks: structure and semantic annotation. Structure annotation deals with tasks such as data type prediction and table header annotation [9]. Semantic annotation involves matching table elements to a KG [10], e.g., columns to classes and cells to entities [11,12].

Recent years have seen an increasing number of works on Semantic Table Interpretation. In this context, SemTab 2020 has emerged as an initiative which aims at benchmarking systems that deal with annotating tabular data with entities from a KG, referred to as table annotation [13]. SemTab is organised into three tasks, each one with several evaluation rounds.
For the 2020 edition for instance, it involves: (i) assigning a semantic type (e.g., a KG class) to a column (CTA); (ii) matching a cell to a KG entity (CEA); (iii) assigning a KG property to the relationship between two columns (CPA).

Our goal is to automatically annotate tabular data on the fly. Thus, our annotation approach is fully automated, as it does not need prior information regarding entities or metadata standards. It is fast and easy to deploy, as it takes advantage of existing systems like Wikidata and Wikipedia to access entities.
Various research works have addressed the issue of semantic table annotation. The most popular approaches which deal with the three above-mentioned tasks rely on a supervised learning setting, where candidate entities are selected by classification models [14]. Such systems include (i) MTab [15], which combines a voting algorithm and probability models to solve critical problems of the matching tasks; (ii) DAGOBAH [16], aiming at semantically annotating tables with Wikidata and DBpedia entities; more precisely, it performs cell and column annotation and relationship identification, via a pipeline starting from a pre-processing step up to enriching an existing knowledge graph using the table information; (iii) ADOG [17], a system focused on leveraging the structure of a well-connected ontology graph, extracted from different knowledge graphs, to annotate structured or semi-structured data. The latter approach combines in novel ways a set of existing technologies and algorithms to automatically annotate structured and semi-structured records, and takes advantage of the native graph structure of ontologies to build a well-connected network of ontologies from different sources. (iv) Another example is described in [18]. Its process is split into a Candidate Generation and a Candidate Selection phase. The former involves looking for relevant entities in knowledge bases, while the latter involves picking the top candidate using various techniques such as heuristics (the 'TF-IDF' approach) and machine learning (the Neural Network Ranking model).

In [19] the authors present TableMiner, a learning approach for semantic table interpretation. This is essentially done by improving annotation accuracy through innovative use of various types of contextual information, both inside and outside tables, as features for inference. It then reduces computational overheads by adopting an incremental, bootstrapping approach that starts by creating preliminary and partial annotations of a table using 'sample' data in the table, then uses the outcome as a 'seed' to guide the interpretation of the remaining contents. Also following a machine learning approach, [20] proposes Meimei, which combines a latent probabilistic model with multi-label classifiers.
Other alternative approaches address only a single specific task. Thus, in the work of [21], the authors focus on column type prediction for tables without any metadata. Unlike traditional lexical matching-based methods, they follow a deep prediction model that can fully exploit tables' contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN), and inter-column semantic features learned by a knowledge base (KB) lookup and query answering algorithm. It exhibits good performance not only on individual table sets, but also when transferring from one table set to another. In the same vein, a work conducted by [22] proposes Taipan, which is able to recover the semantics of tables by identifying subject columns using a combination of structural and semantic features.

From the Web tables point of view, various works could be mentioned. Thus, in [23] an iterative matching approach is described. It combines both schema and entity matching and is dedicated to matching a large set of HTML tables with a cross-domain knowledge base. Similarly, TabEL uses a collective classification technique to disambiguate all mentions in a given table [24]. Instead of using a strict mapping of types and relations into a reference knowledge base, TabEL uses soft constraints in its graphical model to sidestep errors introduced by an incomplete or noisy KB. It outperforms previous work on multiple datasets.

Overall, all the above-mentioned approaches are based on a learning strategy. However, for real-time applications, there is a need to get the result as fast as possible. Another main limitation of these approaches is their reproducibility. Indeed, key explicit information concerning study parameters (particularly randomization control) and software environment is lacking. The ultimate goal with AMALGAM, which could be categorized as a tabular data to KG matching system, is to provide a fast and efficient approach for the tabular data to KG matching task.
AMALGAM is designed according to the workflow in Fig. 1. There are three main phases, which consist in, respectively, pre-processing, context annotation, and tabular data to KG matching. The first two steps are identical for both the CEA and CTA tasks.
Fig. 1. Workflow of AMALGAM.

Tables Pre-Processing.
It is common to have missing values in datasets. Besides, the content of a table can have different types (string, date, float, etc.). The aim of the pre-processing step is to ensure that loading a table happens without any error. For instance, a textual encoding where some characters are loaded as noisy sequences, or a text field with an unescaped delimiter, will cause the considered record to have an extra column, etc. Loading with an incorrect encoding might strongly affect the lookup performance. To overcome this issue, AMALGAM relies on the Pandas library (https://pandas.pydata.org/) to fix all noisy textual data in the tables being processed.

Fig. 2. Illustration of a table structure.
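As an illustration of this loading step, the following is a minimal sketch, not AMALGAM's actual code: the choice of fallback encodings and the policy of skipping malformed rows are assumptions made here for the example.

```python
import io
import pandas as pd

def load_table(raw_bytes: bytes) -> pd.DataFrame:
    """Load a CSV table defensively: try UTF-8 first, fall back to
    Latin-1 (which accepts any byte sequence), and skip records whose
    unescaped delimiters would produce an extra column."""
    text = ""
    for encoding in ("utf-8", "latin-1"):
        try:
            text = raw_bytes.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    return pd.read_csv(
        io.StringIO(text),
        dtype=str,              # keep everything textual for the lookup step
        keep_default_na=False,  # do not turn empty cells into NaN
        on_bad_lines="skip",    # drop rows with a wrong column count
    )

table = load_table("col0,col1\nGrande Prairie,650\nSundre,1093\n".encode("utf-8"))
```

Keeping every cell as a string (rather than letting Pandas infer types) avoids surprises such as "650" becoming the integer 650 before the entity lookup.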
Annotation context.
We consider a table as a two-dimensional tabular structure (see Fig. 2(A)) which is composed of an ordered set of x rows and y columns. Each intersection between a row and a column determines a cell c_ij with the value v_ij, where 1 ≤ i ≤ x and 1 ≤ j ≤ y. To identify the attribute label of a column, also referred to as header detection (the CTA task), the idea consists in annotating all the items of the column using entity linking; the attribute label is then estimated using a random entity linking. The annotation context is represented by the list of items in the same column (see Fig. 2(B)). For example, the context of the first column in Fig. 2 is represented by the items [A1, B1, ..., n]. Following the same logic, we consider that all cells in the same row describe the same context. More specifically, the first cell of the row describes the entity and the following cells the associated properties. For instance, the context of the first row in Fig. 2 is represented by the list of items [A1, A2, A3, A4].

Assigning a semantic type to a column (CTA). The CTA task consists in assigning a Wikidata KG entity to a given column. It can be performed by exploiting the process described in Fig. 3. The Wikidata KG allows looking up a Wikidata item by the title of its corresponding page on Wikipedia or other Wikimedia family sites, using a dedicated API. In our case, the main information needed from the entity is the list of its instance of (P31), subclass of (P279) and part of (P361) statements. To do so, a parser was developed to retrieve this information from the built Wikidata request. For example, "Grande Prairie" provides the following results: [list of towns in Alberta:Q15219391, village in Alberta:Q6644696, city in Alberta:Q55440238]. To achieve this, our methodology combines the wbsearchentities and parse actions provided by the API. It could be observed that in this task, many items were not annotated. This is because tables contain incorrectly spelled terms. Therefore, before implementing the other tasks, a spell check component is required.

As per the literature [25], a spell-checker is a crucial language tool of natural language processing (NLP), used in applications like information extraction, proofreading, information retrieval, social media and search engines. In our case, we compared several approaches and libraries: Textblob, Spark NLP, Gurunudi, Wikipedia api, Pyspellchecker, Serpapi. A comparison of these approaches can be found in Table 1.

Table 1.
Comparison of approaches and libraries related to spell-checking.
Name            Category         Strengths/Limitations
Textblob        NLP              Spelling correction, easy to use
Spark NLP       NLP              Pre-trained, text analysis
Gurunudi        NLP              Pre-trained, text analysis, easy to use
Wikipedia api   Search engines   Search/suggestion, easy to use, unlimited access
Pyspellchecker  Spell checking   Simple algorithm, no pre-trained model, easy to use
Serpapi         Search engines   Limited access for free
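The Wikidata lookup used for the CTA task (a wbsearchentities query followed by extraction of the P31/P279/P361 statements) can be sketched as follows. This is our illustrative reconstruction, not the paper's code; the JSON layout follows the public Wikidata API, and `search_entities` performs a live network call while `typing_claims` is a pure parser.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_entities(label: str, limit: int = 5) -> list:
    """Query the wbsearchentities action for candidate Wikidata items
    whose label matches a cell value (network call, English labels)."""
    params = urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": label,
        "language": "en",
        "format": "json",
        "limit": limit,
    })
    with urllib.request.urlopen(f"{WIKIDATA_API}?{params}", timeout=10) as resp:
        return json.load(resp).get("search", [])

def typing_claims(entity_doc: dict) -> list:
    """Extract QIDs from the 'instance of' (P31), 'subclass of' (P279)
    and 'part of' (P361) statements of a Wikidata entity document."""
    qids = []
    for prop in ("P31", "P279", "P361"):
        for claim in entity_doc.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                qids.append(value["id"])
    return qids
```

For "Grande Prairie", the parsed statements would yield type candidates such as Q55440238 ("city in Alberta"), matching the example given above.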
Fig. 3. Assigning a semantic type to a column (CTA).

Our choice is oriented towards Gurunudi and the Wikidata API, with a post-processing step consisting in validating the output using fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy) to keep only the results whose ratio is greater than the threshold of 90%. For example, let us take the expression "St Peter's Seminarz": after using the Wikidata API we get "St Peter's seminary", and the ratio of fuzzy string matching is 95%.

We are now able to perform the CTA task. In the trivial case, the result of an item lookup is a single record; the best matching entity is chosen as the result. In the other cases, where there is more than one result, no annotation is produced for the CTA task. Finally, if there is no result after the lookup, another one is performed using the output of the spell check for the item. At the end of these lookups, the matched results are stored in a nested dictionary [item:claims]. The most relevant candidate, obtained by counting the number of occurrences, is selected.

Library links: Textblob (https://textblob.readthedocs.io/en/dev/), Spark NLP (https://nlp.johnsnowlabs.com/), Gurunudi (https://github.com/guruyuga/gurunudi), Wikipedia api (https://wikipedia.readthedocs.io/en/latest/code.html), Pyspellchecker (https://github.com/barrust/pyspellchecker), Serpapi (https://serpapi.com/spell-check).
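The two pieces just described, the ≥ 90% ratio gate on spell-check output and the occurrence count over a column's candidate classes, can be illustrated with a minimal sketch. The standard library's difflib is used here as a stand-in for fuzzywuzzy's ratio, and all helper names are ours, not AMALGAM's.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Optional

def fuzzy_ratio(a: str, b: str) -> float:
    """Percentage similarity between two labels (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def accept_correction(original: str, suggestion: str, threshold: float = 90.0) -> bool:
    """Keep a spell-check suggestion only if it stays close to the input."""
    return fuzzy_ratio(original, suggestion) >= threshold

def most_common_class(candidate_classes: List[str]) -> Optional[str]:
    """Pick the class that was matched most often across a column's cells."""
    counts = Counter(candidate_classes)
    return counts.most_common(1)[0][0] if counts else None
```

Under this metric, "St Peter's Seminarz" vs "St Peter's seminary" scores about 94.7%, so the correction passes the gate (the paper reports 95% with fuzzywuzzy), while an unrelated suggestion falls well below the threshold and is discarded.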
Algorithm 1: CTA task
Input: Table T
Output: Annotated Table T′
foreach col_i ∈ T do
    candidates_col ← ∅
    foreach el ∈ col do
        label ← el.value
        candidates ← wd-lookup(label)
        if candidates.size = 1 then
            candidates_col(k, candidates)
        else if candidates.size = 0 then
            new_label ← spell-check(label)
            candidates ← wd-lookup(new_label)
            if candidates.size = 1 then
                candidates_col(k, candidates)
            end
        end
    end
    annotate(T′.col.i, getMostCommonClass(candidates_col))
end

Matching a cell to a KG entity (CEA). The CEA task can be performed by exploiting the process described in Fig. 4. Our approach reuses the process of the CTA task with the necessary adaptations. The first step is to get all the statements for the first item of the list context. The process is the same as for CTA; the only difference is when the result provides more than one record. In this case, we create a nested dictionary with all candidates. Then, to disambiguate the candidate entities, we use the concept of the column generated with the CTA task. Next, a lookup is performed by using the other items of the list context in the claims of the first item. If the item is found, it is selected as the target entity; if not, the lookup is performed with the item using the Wikidata API (if the result is empty, no annotation is produced).

With this process, it is possible to reduce the errors associated with the lookup. Let us take the value "650" in row 0 of the table in Fig. 4 for instance. If we look it up directly in Wikidata, we can get many results. However, if we first check the statements of the first item of the list, "Grande Prairie", it is more likely to successfully identify the item.
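The row-context check just described ("650" being matched inside the claims of "Grande Prairie") can be sketched as below. The claims layout is simplified to a property-to-values mapping, and the names and sample values are illustrative, not AMALGAM's own.

```python
from typing import Dict, List, Optional

def match_in_claims(cell_value: str, first_item_claims: Dict[str, List[str]]) -> Optional[str]:
    """Return the property whose values contain the cell value, if any.

    `first_item_claims` maps property ids to the stringified values of
    the row's first entity (e.g. derived from its Wikidata statements).
    """
    for prop, values in first_item_claims.items():
        if cell_value in values:
            return prop
    return None

# Toy claims for "Grande Prairie" (illustrative values only):
# P2044 is Wikidata's "elevation above sea level", P17 is "country".
claims = {
    "P2044": ["650"],
    "P17": ["Canada"],
}
```

With these toy claims, `match_in_claims("650", claims)` finds the elevation property directly, so no ambiguous global lookup for "650" is needed; only when no property matches does the sketch fall back to the Wikidata API.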
Fig. 4. Matching a cell to a KG entity (CEA).

Algorithm 2: CEA processing task
Input: Table T, ColsContext
Output: Annotated Table T′
foreach row_i ∈ T do
    FirstEl_properties ← ∅
    foreach el ∈ row do
        label ← el.value
        if el = 0 then
            FirstEl_properties ← GetProperties(label, ColsContext)
        end
        if Prop-lookup(label) ≠ ∅ then
            annotate(T′.row.i.el, candidates.value)
        else
            candidates ← wd-lookup(label, ColsContext)
            if candidates.size = 1 then
                annotate(T′.row.i.el, candidates.value)
            else if candidates.size = 0 then
                new_label ← spell-check(label)
                candidates ← wd-lookup(new_label, ColsContext)
                if candidates.size = 1 then
                    annotate(T′.row.i.el, candidates.value)
                end
            end
        end
    end
end

The evaluation of AMALGAM was done in the context of the SemTab 2020 challenge. This challenge is subdivided into 4 successive rounds containing respectively 34294, 12173, 62614 and 22387 CSV tables to annotate. For example, Table 2 lists Alberta towns with additional information such as the country and the elevation above sea level. The evaluation metrics are respectively the F1 score and the Precision [26].

Table 2.
List of Alberta towns, extracted from SemTab Round 1.

col0            col1             col2    col3                     col4  col5
Grande Prairie  city in Alberta  canada  Sexsmith                 650   Alberta
Sundre          town in Alberta  canada  Mountain View County     1093  Alberta
Peace River     town in clberta  Canada  Northern Sunrise County  330   Alberta
Vegreville      town in Alberta  canada  Mundare                  635   Alberta

(The lowercase "canada" and the misspelled "clberta" are reproduced as they appear in the challenge data.)
Tables 3, 4, 5 and 6 report the evaluation of CTA and CEA for rounds 1, 2, 3 and 4 respectively. It can be observed that AMALGAM handles the two tasks properly, in particular the CEA task. Regarding the CTA task, the lower results can be explained by new revisions of Wikidata items in their revision history, and by possible spelling errors in the contents of the tables. For instance, "rural district of Lower Saxony" became "district of Lower Saxony" after the 16 April 2020 revision. A possible solution to this issue is to retrieve the history of the different revisions, by parsing Wikidata history dumps, and to use them in the lookup. This is a possible extension of this work. Another observed issue is that spelling errors greatly impact the lookup efficiency.
Table 3. Results of Round 1.

TASK  F1 Score  Precision
CTA   0.724     0.727
CEA   0.913     0.914
Table 4. Results of Round 2.

TASK  F1 Score  Precision
CTA   0.926     0.928
CEA   0.921     0.927
Table 5. Results of Round 3.

TASK  F1 Score  Precision
CTA   0.869     0.873
CEA   0.877     0.892
Table 6. Results of Round 4.

TASK  F1 Score  Precision
CTA   0.858     0.861
CEA   0.892     0.914
From the round 1 experience, we specifically focused on the spell-check process of items to improve the results of the CEA and CTA tasks in round 2. Two API services, from Wikipedia and Gurunudi respectively (presented in Sect. 3), were used for spelling correction. According to the results in Table 4, both the F1-Score and the Precision improved. From these rounds, we observed that a single-word term is often ambiguous, as it may refer to more than one entity. In Wikidata, there is only one article (one entry) for each concept. However, there can be many equivalent titles for a concept due to the existence of synonyms, etc. These synonymy and ambiguity issues make it difficult to match the correct item. For example, the term "Paris" may refer to various concepts such as "the capital and largest city of France", "son of Priam, king of Troy", or "county seat of Lamar County, Texas, United States". This led us to introduce a disambiguation process during rounds 3 and 4. For these two last rounds, we updated the annotation algorithm by integrating the concept of the column obtained during the CTA task into the linking phase. We showed that the two tasks can be performed relatively successfully with AMALGAM, achieving higher than 0.86 in precision. However, the automated disambiguation of items proved to be a more challenging task.
In this paper, we described AMALGAM, a matching approach to enable tabular datasets to be FAIR compliant by making them explicit through their annotation with a knowledge graph, in our case Wikidata. Its advantage is that it allows performing both the CTA and CEA tasks in a timely manner. These tasks can be accomplished quickly through the combination of lookup services and spell-check techniques. The results achieved in the context of the SemTab 2020 challenge show that it handles table annotation tasks with a promising performance. Our findings suggest that the matching process is very sensitive to spelling errors. Thus, as future work, improved spell-checking techniques will be investigated. Further, to process such errors, context-based spell-checkers are needed: often the string is very close in spelling, and context could help reveal which word makes the most sense. Moreover, the approach will be improved by finding a trade-off between effectiveness and efficiency.
References
1. Wilkinson, M., Dumontier, M., Aalbersberg, I., et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
2. Wilkinson, M.-D., et al.: The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data (1), 1–9 (2016).
3. Diallo, G., Simonet, M., Simonet, A.: An Approach to Automatic Ontology-Based Annotation of Biomedical Texts. In: Ali, M., Dapoigny, R. (eds) Advances in Applied Artificial Intelligence. IEA/AIE 2006. Lecture Notes in Computer Science, vol 4031. Springer, Berlin, Heidelberg (2006).
4. Dramé, K., Mougin, F., Diallo, G.: Large scale biomedical texts classification: a kNN and an ESA-based approaches. J. Biomedical Semantics 7(40) (2016).
5. Handschuh, S.: Semantic Annotation of Resources in the Semantic Web. Semantic Web Services. Springer, Berlin, Heidelberg, 135–155 (2007).
6. Diallo, G.: Efficient Building of Local Repository of Distributed Ontologies. In: 7th IEEE International Conference on SITIS, Dijon, pp. 159–166 (2011).
7. Subramanian, A., Srinivasa, S.: Semantic Interpretation and Integration of Open Data Tables. In: Geospatial Infrastructure, Applications and Technologies: India Case Studies, pp. 217–233. Springer Singapore (2018).
8. Taheriyan, M., Knoblock, C.-A., Szekely, P., Ambite, J.-L.: Learning the semantics of structured data sources. Web Semantics: Science, Services and Agents on the World Wide Web (38), 152–169 (2016).
9. Zhang, L., Wang, T., Liu, Y., Duan, Q.: A semi-structured information semantic annotation method for Web pages. Neural Computing and Applications (11), 6491–6501 (2019).
10. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: Representation, acquisition and applications. CoRR abs/2002.00388 (2020).
11. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings. In: LNCS, pp. 260–277. Springer International Publishing (2017).
12. Eslahi, Y., Bhardwaj, A., Rosso, P., Stockinger, K., Cudre-Mauroux, P.: Annotating Web Tables through Knowledge Bases: A Context-Based Approach. In: 2020 7th Swiss Conference on Data Science (SDS), pp. 29–34. IEEE (2020).
13. Hassanzadeh, O., Efthymiou, V., Chen, C., Jimenez-Ruiz, E., Srinivas, K.: SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching - 2020 Data Sets, October 2020.
14. Hassanzadeh, O., Efthymiou, V., Chen, C., Jimenez-Ruiz, E., Srinivas, K.: SemTab 2019: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching - 2019 Data Sets (Version 2019).
15. Nguyen, P., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab: Matching Tabular Data to Knowledge Graph using Probability Models. Proceedings of the SemTab Challenge co-located with the 18th ISWC conference (2019).
16. Chabot, Y., Labbe, T., Liu, J., Troncy, R.: DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System. Proceedings of the SemTab Challenge co-located with the 18th ISWC conference (2019).
17. Oliveira, D., Aquin, M.: ADOG - Annotating Data with Ontologies and Graphs. Proceedings of the SemTab Challenge co-located with the 18th ISWC conference (2019).
18. Thawani, A., Hu, M., Hu, E., Zafar, H., Divvala, N.-T., Singh, A., Qasemi, E., Szekely, P., Pujara, J.: Entity Linking to Knowledge Graphs to Infer Column Types and Properties. Proceedings of the SemTab Challenge co-located with ISWC'19 (2019).
19. Zhang, Z.: Effective and efficient Semantic Table Interpretation using TableMiner+. Semantic Web, IOS Press (6), 921–957 (2017).
20. Takeoka, K., Oyamada, M., Nakadai, S., Okadome, T.: Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables. Proceedings of the AAAI Conference on Artificial Intelligence, 281–288 (2019).
21. Chen, J., Jimenez-Ruiz, E., Horrocks, I., Sutton, C.: Learning Semantic Annotations for Tabular Data. Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI-19), 2088–2094 (2019).
22. Ermilov, I., Ngomo, A.-C.N.: TAIPAN: Automatic Property Mapping for Tabular Data. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (eds) Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10024. Springer, Cham (2016).
23. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML Tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics - WIMS '15, pp. 1–6. ACM Press (2015).
24. Bhagavatula, C.-S., Noraset, T., Downey, D.: TabEL: Entity Linking in Web Tables. In: Proceedings of The Semantic Web - ISWC 2015, Springer International Publishing, pp. 425–441 (2015).
25. Shashank, S., Shailendra, S.: Systematic review of spell-checkers for highly inflectional languages. Artificial Intelligence Review 53