AMALGAM: A Matching Approach to fairify tabuLar data with knowledGe grAph Model
Rabia Azzi and Gayo Diallo
BPH Center/INSERM U1219, Univ. Bordeaux, F-33000, France, [email protected]
Abstract.
In this paper we present
AMALGAM, a matching approach to fairify tabular data with the use of a knowledge graph. The ultimate goal is to provide a fast and efficient approach to annotate tabular data with entities from a background knowledge source. The approach combines lookup and filtering services with text pre-processing techniques. Experiments conducted in the context of the 2020 Semantic Web Challenge on Tabular Data to Knowledge Graph Matching, with both the Column Type Annotation and Cell Type Annotation tasks, showed promising results.
Keywords:
Tabular Data, Knowledge Graph, Entity Linking
Making web data comply with the FAIR principles (Findable, Accessible, Interoperable and Reusable) has become a necessity in order to facilitate their discovery and reuse [1]. The value for knowledge discovery of implementing FAIR is to improve data integration, data cleaning, data mining, machine learning and knowledge discovery tasks. Successfully implemented FAIR principles will increase the value of data by making them findable and accessible and by resolving semantic ambiguities. Good data management is not a goal in itself, but rather the key conduit leading to knowledge discovery and acquisition, and to subsequent data and knowledge integration and reuse by the community after the data publication process [2].

Semantic annotation could be considered as a particular knowledge acquisition task [3,4,5]. The semantic annotation process may rely on formal metadata resources described with an ontology, sometimes even with multiple ontologies thanks to the use of semantic repositories [6]. Over the last years, tables have become one of the most used formats to share results and data. In this field, a set of systems for matching web tables to knowledge bases has been developed [7,8]. They can be categorized into two main tasks: structure and semantic annotation. Structure annotation deals with tasks such as data type prediction and table header annotation [9]. Semantic annotation involves matching table elements to a KG [10], e.g., columns to classes and cells to entities [11,12].

Recent years have seen an increasing number of works on Semantic Table Interpretation. In this context, SemTab 2020 has emerged as an initiative which aims at benchmarking systems that deal with annotating tabular data with entities from a KG, referred to as table annotation [13]. SemTab is organised into three tasks, each one with several evaluation rounds.
For the 2020 edition for instance, it involves: (i) assigning a semantic type (e.g., a KG class) to a column (CTA); (ii) matching a cell to a KG entity (CEA); (iii) assigning a KG property to the relationship between two columns (CPA).

Our goal is to automatically annotate tabular data on the fly. Thus, our annotation approach is fully automated, as it does not need prior information regarding entities or metadata standards. It is fast and easy to deploy, as it takes advantage of existing systems like Wikidata and Wikipedia to access entities.
Various research works have addressed the issue of semantic table annotation. The most popular approaches which deal with the three above-mentioned tasks rely on a supervised learning setting, where candidate entities are selected by classification models [14]. Such systems include (i) MTab [15], which combines a voting algorithm and probability models to solve critical problems of the matching tasks; (ii) DAGOBAH [16], aiming at semantically annotating tables with Wikidata and DBpedia entities; more precisely, it performs cell and column annotation and relationship identification, via a pipeline starting from a pre-processing step up to enriching an existing knowledge graph using the table information; (iii) ADOG [17], a system focused on leveraging the structure of a well-connected ontology graph, extracted from different knowledge graphs, to annotate structured or semi-structured data. The latter approach combines in novel ways a set of existing technologies and algorithms to automatically annotate structured and semi-structured records, and takes advantage of the native graph structure of ontologies to build a well-connected network of ontologies from different sources. (iv) Another example is described in [18]. Its process is split into a Candidate Generation and a Candidate Selection phase. The former involves looking for relevant entities in knowledge bases, while the latter involves picking the top candidate using various techniques such as heuristics (the 'TF-IDF' approach) and machine learning (the Neural Network Ranking model).

In [19] the authors present TableMiner, a learning approach for semantic table interpretation. This is essentially done by improving annotation accuracy through innovative use of various types of contextual information, both inside and outside tables, as features for inference. It then reduces computational overheads by adopting an incremental, bootstrapping approach that starts by creating preliminary and partial annotations of a table using 'sample' data in the table, then uses the outcome as a 'seed' to guide the interpretation of the remaining contents. Also following a machine learning approach, [20] proposes Meimei, which combines a latent probabilistic model with multi-label classifiers.
Other alternative approaches address only a single specific task. Thus, in the work of [21], the authors focus on column type prediction for tables without any metadata. Unlike traditional lexical matching-based methods, they follow a deep prediction model that can fully exploit tables' contextual semantics, including table locality features learned by a Hybrid Neural Network (HNN), and inter-column semantic features learned by a knowledge base (KB) lookup and query answering algorithm. It exhibits good performance not only on individual table sets, but also when transferring from one table set to another. In the same vein, a work conducted by [22] proposes Taipan, which is able to recover the semantics of tables by identifying subject columns using a combination of structural and semantic features.

From the Web tables point of view, various works could be mentioned. Thus, in [23] an iterative matching approach is described. It combines both schema and entity matching and is dedicated to matching a large set of HTML tables with a cross-domain knowledge base. Similarly, TabEL uses a collective classification technique to disambiguate all mentions in a given table [24]. Instead of using a strict mapping of types and relations into a reference knowledge base, TabEL uses soft constraints in its graphical model to sidestep errors introduced by an incomplete or noisy KB. It outperforms previous work on multiple datasets.

Overall, all the above-mentioned approaches are based on a learning strategy. However, for real-time applications, there is a need to get the result as fast as possible. Another main limitation of these approaches is their reproducibility. Indeed, key explicit information concerning study parameters (particularly randomization control) and software environment is lacking. The ultimate goal with AMALGAM, which could be categorized as a tabular data to KG matching system, is to provide a fast and efficient approach for the tabular data to KG matching task.
AMALGAM is designed according to the workflow in Fig. 1. There are three main phases, which consist in, respectively, pre-processing, context annotation, and tabular data to KG matching. The first two steps are identical for both the CEA and CTA tasks.
Fig. 1. Workflow of AMALGAM.

Tables Pre-Processing.
It is common to have missing values in datasets. Besides, the content of a table can have different types (string, date, float, etc.). The aim of the pre-processing step is to ensure that loading a table happens without any error. For instance, a textual encoding where some characters are loaded as noisy sequences, or a text field with an unescaped delimiter, will cause the considered record to have an extra column, etc. Loading with an incorrect encoding might strongly affect the lookup performance. To overcome this issue, AMALGAM relies on the Pandas library (https://pandas.pydata.org/) to fix all noisy textual data in the tables being processed.

Fig. 2. Illustration of a table structure.
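As an illustration of this loading step, the following is a minimal sketch, not AMALGAM's actual code: the choice of fallback encodings and the policy of skipping malformed rows are assumptions made here for the example.

```python
import io
import pandas as pd

def load_table(raw_bytes: bytes) -> pd.DataFrame:
    """Load a CSV table defensively: try UTF-8 first, fall back to
    Latin-1 (which accepts any byte sequence), and skip records whose
    unescaped delimiters would produce an extra column."""
    text = ""
    for encoding in ("utf-8", "latin-1"):
        try:
            text = raw_bytes.decode(encoding)
            break
        except UnicodeDecodeError:
            continue
    return pd.read_csv(
        io.StringIO(text),
        dtype=str,              # keep everything textual for the lookup step
        keep_default_na=False,  # do not turn empty cells into NaN
        on_bad_lines="skip",    # drop rows with a wrong column count
    )

table = load_table("col0,col1\nGrande Prairie,650\nSundre,1093\n".encode("utf-8"))
```

Keeping every cell as a string (rather than letting Pandas infer types) avoids surprises such as "650" becoming the integer 650 before the entity lookup.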
Annotation context.
We consider a table as a two-dimensional tabular structure (see Fig. 2(A)) which is composed of an ordered set of x rows and y columns. Each intersection between a row and a column determines a cell c_ij with the value v_ij, where 1 ≤ i ≤ x and 1 ≤ j ≤ y. To identify the attribute label of a column, also referred to as header detection (the CTA task), the idea consists in annotating all the items of the column using entity linking; the attribute label is then estimated using a random entity linking. The annotation context is represented by the list of items in the same column (see Fig. 2(B)). For example, the context of the first column in Fig. 2 is represented by the items [A1, B1, ..., n]. Following the same logic, we consider that all cells in the same row describe the same context. More specifically, the first cell of the row describes the entity and the following cells the associated properties. For instance, the context of the first row in Fig. 2 is represented by the list of items [A1, A2, A3, A4].

Assigning a semantic type to a column (CTA). The CTA task consists in assigning a Wikidata KG entity to a given column. It can be performed by exploiting the process described in Fig. 3. The Wikidata KG allows looking up a Wikidata item by the title of its corresponding page on Wikipedia or other Wikimedia family sites, using a dedicated API. In our case, the main information needed from the entity is the list of its instance of (P31), subclass of (P279) and part of (P361) statements. To do so, a parser was developed to retrieve this information from the built Wikidata request. For example, "Grande Prairie" provides the following results: [list of towns in Alberta:Q15219391, village in Alberta:Q6644696, city in Alberta:Q55440238]. To achieve this, our methodology combines the wbsearchentities and parse actions provided by the API. It could be observed that in this task, many items were not annotated. This is because tables contain incorrectly spelled terms. Therefore, before implementing the other tasks, a spell check component is required.

As per the literature [25], a spell-checker is a crucial language tool of natural language processing (NLP), used in applications like information extraction, proofreading, information retrieval, social media and search engines. In our case, we compared several approaches and libraries: Textblob, Spark NLP, Gurunudi, Wikipedia api, Pyspellchecker, Serpapi. A comparison of these approaches can be found in Table 1.

Table 1.
Comparison of approaches and libraries related to spell-checking.
Name            Category         Strengths/Limitations
Textblob        NLP              Spelling correction, easy to use
Spark NLP       NLP              Pre-trained, text analysis
Gurunudi        NLP              Pre-trained, text analysis, easy to use
Wikipedia api   Search engines   Search/suggestion, easy to use, unlimited access
Pyspellchecker  Spell checking   Simple algorithm, no pre-trained model, easy to use
Serpapi         Search engines   Limited access for free
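The Wikidata lookup used for the CTA task (a wbsearchentities query followed by extraction of the P31/P279/P361 statements) can be sketched as follows. This is our illustrative reconstruction, not the paper's code; the JSON layout follows the public Wikidata API, and `search_entities` performs a live network call while `typing_claims` is a pure parser.

```python
import json
import urllib.parse
import urllib.request

WIKIDATA_API = "https://www.wikidata.org/w/api.php"

def search_entities(label: str, limit: int = 5) -> list:
    """Query the wbsearchentities action for candidate Wikidata items
    whose label matches a cell value (network call, English labels)."""
    params = urllib.parse.urlencode({
        "action": "wbsearchentities",
        "search": label,
        "language": "en",
        "format": "json",
        "limit": limit,
    })
    with urllib.request.urlopen(f"{WIKIDATA_API}?{params}", timeout=10) as resp:
        return json.load(resp).get("search", [])

def typing_claims(entity_doc: dict) -> list:
    """Extract QIDs from the 'instance of' (P31), 'subclass of' (P279)
    and 'part of' (P361) statements of a Wikidata entity document."""
    qids = []
    for prop in ("P31", "P279", "P361"):
        for claim in entity_doc.get("claims", {}).get(prop, []):
            value = claim.get("mainsnak", {}).get("datavalue", {}).get("value", {})
            if isinstance(value, dict) and "id" in value:
                qids.append(value["id"])
    return qids
```

For "Grande Prairie", the parsed statements would yield type candidates such as Q55440238 ("city in Alberta"), matching the example given above.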
Fig. 3. Assigning a semantic type to a column (CTA).

Our choice is oriented towards Gurunudi and the Wikidata API, with a post-processing step consisting in validating the output using fuzzywuzzy (https://github.com/seatgeek/fuzzywuzzy) to keep only the results whose ratio is greater than the threshold of 90%. For example, let us take the expression "St Peter's Seminarz": after using the Wikidata API we get "St Peter's seminary", and the ratio of fuzzy string matching is 95%.

We are now able to perform the CTA task. In the trivial case, the result of an item lookup is a single record; the best matching entity is chosen as the result. In the other cases, where there is more than one result, no annotation is produced for the CTA task. Finally, if there is no result after the lookup, another one is performed using the output of the spell check for the item. At the end of these lookups, the matched results are stored in a nested dictionary [item:claims]. The most relevant candidate, obtained by counting the number of occurrences, is selected.

Library links: Textblob (https://textblob.readthedocs.io/en/dev/), Spark NLP (https://nlp.johnsnowlabs.com/), Gurunudi (https://github.com/guruyuga/gurunudi), Wikipedia api (https://wikipedia.readthedocs.io/en/latest/code.html), Pyspellchecker (https://github.com/barrust/pyspellchecker), Serpapi (https://serpapi.com/spell-check).
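The two pieces just described, the ≥ 90% ratio gate on spell-check output and the occurrence count over a column's candidate classes, can be illustrated with a minimal sketch. The standard library's difflib is used here as a stand-in for fuzzywuzzy's ratio, and all helper names are ours, not AMALGAM's.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List, Optional

def fuzzy_ratio(a: str, b: str) -> float:
    """Percentage similarity between two labels (case-insensitive)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()

def accept_correction(original: str, suggestion: str, threshold: float = 90.0) -> bool:
    """Keep a spell-check suggestion only if it stays close to the input."""
    return fuzzy_ratio(original, suggestion) >= threshold

def most_common_class(candidate_classes: List[str]) -> Optional[str]:
    """Pick the class that was matched most often across a column's cells."""
    counts = Counter(candidate_classes)
    return counts.most_common(1)[0][0] if counts else None
```

Under this metric, "St Peter's Seminarz" vs "St Peter's seminary" scores about 94.7%, so the correction passes the gate (the paper reports 95% with fuzzywuzzy), while an unrelated suggestion falls well below the threshold and is discarded.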
Algorithm 1: CTA task
Input: Table T
Output: Annotated Table T′
foreach col_i ∈ T do
    candidates_col ← ∅
    foreach el ∈ col do
        label ← el.value
        candidates ← wd-lookup(label)
        if candidates.size = 1 then
            candidates_col(k, candidates)
        else if candidates.size = 0 then
            new_label ← spell-check(label)
            candidates ← wd-lookup(new_label)
            if candidates.size = 1 then
                candidates_col(k, candidates)
            end
        end
    end
    annotate(T′.col.i, getMostCommonClass(candidates_col))
end

Matching a cell to a KG entity (CEA). The CEA task can be performed by exploiting the process described in Fig. 4. Our approach reuses the process of the CTA task with the necessary adaptations. The first step is to get all the statements for the first item of the list context. The process is the same as for CTA; the only difference is when the result provides more than one record. In this case, we create a nested dictionary with all candidates. Then, to disambiguate the candidate entities, we use the concept of the column generated with the CTA task. Next, a lookup is performed by using the other items of the list context in the claims of the first item. If the item is found, it is selected as the target entity; if not, the lookup is performed with the item using the Wikidata API (if the result is empty, no annotation is produced).

With this process, it is possible to reduce the errors associated with the lookup. Let us take the value "650" in row 0 of the table in Fig. 4 for instance. If we look it up directly in Wikidata, we can get many results. However, if we first check the statements of the first item of the list, "Grande Prairie", it is more likely to successfully identify the item.
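The row-context check just described ("650" being matched inside the claims of "Grande Prairie") can be sketched as below. The claims layout is simplified to a property-to-values mapping, and the names and sample values are illustrative, not AMALGAM's own.

```python
from typing import Dict, List, Optional

def match_in_claims(cell_value: str, first_item_claims: Dict[str, List[str]]) -> Optional[str]:
    """Return the property whose values contain the cell value, if any.

    `first_item_claims` maps property ids to the stringified values of
    the row's first entity (e.g. derived from its Wikidata statements).
    """
    for prop, values in first_item_claims.items():
        if cell_value in values:
            return prop
    return None

# Toy claims for "Grande Prairie" (illustrative values only):
# P2044 is Wikidata's "elevation above sea level", P17 is "country".
claims = {
    "P2044": ["650"],
    "P17": ["Canada"],
}
```

With these toy claims, `match_in_claims("650", claims)` finds the elevation property directly, so no ambiguous global lookup for "650" is needed; only when no property matches does the sketch fall back to the Wikidata API.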
Fig. 4. Matching a cell to a KG entity (CEA).

Algorithm 2: CEA processing task
Input: Table T, ColsContext
Output: Annotated Table T′
foreach row_i ∈ T do
    FirstEl_properties ← ∅
    foreach el ∈ row do
        label ← el.value
        if el = 0 then
            FirstEl_properties ← GetProperties(label, ColsContext)
        end
        if Prop-lookup(label) ≠ ∅ then
            annotate(T′.row.i.el, candidates.value)
        else
            candidates ← wd-lookup(label, ColsContext)
            if candidates.size = 1 then
                annotate(T′.row.i.el, candidates.value)
            else if candidates.size = 0 then
                new_label ← spell-check(label)
                candidates ← wd-lookup(new_label, ColsContext)
                if candidates.size = 1 then
                    annotate(T′.row.i.el, candidates.value)
                end
            end
        end
    end
end

The evaluation of AMALGAM was done in the context of the SemTab 2020 challenge. This challenge is subdivided into 4 successive rounds containing respectively 34294, 12173, 62614 and 22387 CSV tables to annotate. For example, Table 2 lists Alberta towns with additional information such as the country and the elevation above sea level. The evaluation metrics are respectively the F1 score and the Precision [26].

Table 2.
List of Alberta towns, extracted from SemTab Round 1.

col0            col1             col2    col3                     col4  col5
Grande Prairie  city in Alberta  canada  Sexsmith                 650   Alberta
Sundre          town in Alberta  canada  Mountain View County     1093  Alberta
Peace River     town in clberta  Canada  Northern Sunrise County  330   Alberta
Vegreville      town in Alberta  canada  Mundare                  635   Alberta

(The lowercase "canada" and the misspelled "clberta" are reproduced as they appear in the challenge data.)
Tables 3, 4, 5 and 6 report the evaluation of CTA and CEA for rounds 1, 2, 3 and 4 respectively. It can be observed that AMALGAM handles the two tasks properly, in particular the CEA task. Regarding the CTA task, the lower results can be explained by new revisions of Wikidata items in their revision history, and by possible spelling errors in the contents of the tables. For instance, "rural district of Lower Saxony" became "district of Lower Saxony" after the 16 April 2020 revision. A possible solution to this issue is to retrieve the history of the different revisions, by parsing Wikidata history dumps, and to use them in the lookup. This is a possible extension of this work. Another observed issue is that spelling errors greatly impact the lookup efficiency.
Table 3. Results of Round 1.

TASK  F1 Score  Precision
CTA   0.724     0.727
CEA   0.913     0.914
Table 4. Results of Round 2.

TASK  F1 Score  Precision
CTA   0.926     0.928
CEA   0.921     0.927
Table 5. Results of Round 3.

TASK  F1 Score  Precision
CTA   0.869     0.873
CEA   0.877     0.892
Table 6. Results of Round 4.

TASK  F1 Score  Precision
CTA   0.858     0.861
CEA   0.892     0.914
From the round 1 experience, we specifically focused on the spell-check process of items to improve the results of the CEA and CTA tasks in round 2. Two API services, from Wikipedia and Gurunudi respectively (presented in Sect. 3), were used for spelling correction. According to the results in Table 4, both the F1-Score and the Precision improved. From these rounds, we observed that a single-word term is often ambiguous, as it may refer to more than one entity. In Wikidata, there is only one article (one entry) for each concept. However, there can be many equivalent titles for a concept due to the existence of synonyms, etc. These synonymy and ambiguity issues make it difficult to match the correct item. For example, the term "Paris" may refer to various concepts such as "the capital and largest city of France", "son of Priam, king of Troy", or "county seat of Lamar County, Texas, United States". This led us to introduce a disambiguation process during rounds 3 and 4. For these two last rounds, we updated the annotation algorithm by integrating the concept of the column obtained during the CTA task into the linking phase. We showed that the two tasks can be performed relatively successfully with AMALGAM, achieving higher than 0.86 in precision. However, the automated disambiguation of items proved to be a more challenging task.
In this paper, we described AMALGAM, a matching approach to enable tabular datasets to be FAIR compliant by making them explicit through their annotation with a knowledge graph, in our case Wikidata. Its advantage is that it allows performing both the CTA and CEA tasks in a timely manner. These tasks can be accomplished quickly through the combination of lookup services and spell-check techniques. The results achieved in the context of the SemTab 2020 challenge show that it handles table annotation tasks with a promising performance. Our findings suggest that the matching process is very sensitive to spelling errors. Thus, as future work, improved spell-checking techniques will be investigated. Further, to process such errors, context-based spell-checkers are needed: often the string is very close in spelling, and context could help reveal which word makes the most sense. Moreover, the approach will be improved by finding a trade-off between effectiveness and efficiency.
References
1. Wilkinson, M., Dumontier, M., Aalbersberg, I., et al.: The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
2. Wilkinson, M.-D., et al.: The FAIR Guiding Principles for scientific data management and stewardship. Scientific Data (1), 1–9 (2016).
3. Diallo, G., Simonet, M., Simonet, A.: An Approach to Automatic Ontology-Based Annotation of Biomedical Texts. In: Ali, M., Dapoigny, R. (eds) Advances in Applied Artificial Intelligence. IEA/AIE 2006. Lecture Notes in Computer Science, vol 4031. Springer, Berlin, Heidelberg (2006).
4. Dramé, K., Mougin, F., Diallo, G.: Large scale biomedical texts classification: a kNN and an ESA-based approaches. J. Biomedical Semantics 7(40) (2016).
5. Handschuh, S.: Semantic Annotation of Resources in the Semantic Web. Semantic Web Services. Springer, Berlin, Heidelberg, 135–155 (2007).
6. Diallo, G.: Efficient Building of Local Repository of Distributed Ontologies. In: 7th IEEE International Conference on SITIS, Dijon, pp. 159–166 (2011).
7. Subramanian, A., Srinivasa, S.: Semantic Interpretation and Integration of Open Data Tables. In: Geospatial Infrastructure, Applications and Technologies: India Case Studies, pp. 217–233. Springer Singapore (2018).
8. Taheriyan, M., Knoblock, C.-A., Szekely, P., Ambite, J.-L.: Learning the semantics of structured data sources. Web Semantics: Science, Services and Agents on the World Wide Web (38), 152–169 (2016).
9. Zhang, L., Wang, T., Liu, Y., Duan, Q.: A semi-structured information semantic annotation method for Web pages. Neural Computing and Applications (11), 6491–6501 (2019).
10. Ji, S., Pan, S., Cambria, E., Marttinen, P., Yu, P.S.: A survey on knowledge graphs: Representation, acquisition and applications. CoRR abs/2002.00388 (2020).
11. Efthymiou, V., Hassanzadeh, O., Rodriguez-Muro, M., Christophides, V.: Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings. In: LNCS, pp. 260–277. Springer International Publishing (2017).
12. Eslahi, Y., Bhardwaj, A., Rosso, P., Stockinger, K., Cudre-Mauroux, P.: Annotating Web Tables through Knowledge Bases: A Context-Based Approach. In: 2020 7th Swiss Conference on Data Science (SDS), pp. 29–34. IEEE (2020).
13. Hassanzadeh, O., Efthymiou, V., Chen, C., Jimenez-Ruiz, E., Srinivas, K.: SemTab 2020: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching - 2020 Data Sets, October 2020.
14. Hassanzadeh, O., Efthymiou, V., Chen, C., Jimenez-Ruiz, E., Srinivas, K.: SemTab 2019: Semantic Web Challenge on Tabular Data to Knowledge Graph Matching - 2019 Data Sets (Version 2019).
15. Nguyen, P., Kertkeidkachorn, N., Ichise, R., Takeda, H.: MTab: Matching Tabular Data to Knowledge Graph using Probability Models. Proceedings of the SemTab Challenge co-located with the 18th ISWC conference (2019).
16. Chabot, Y., Labbe, T., Liu, J., Troncy, R.: DAGOBAH: An End-to-End Context-Free Tabular Data Semantic Annotation System. Proceedings of the SemTab Challenge co-located with the 18th ISWC conference (2019).
17. Oliveira, D., Aquin, M.: ADOG - Annotating Data with Ontologies and Graphs. Proceedings of the SemTab Challenge co-located with the 18th ISWC conference (2019).
18. Thawani, A., Hu, M., Hu, E., Zafar, H., Divvala, N.-T., Singh, A., Qasemi, E., Szekely, P., Pujara, J.: Entity Linking to Knowledge Graphs to Infer Column Types and Properties. Proceedings of the SemTab Challenge co-located with ISWC'19 (2019).
19. Zhang, Z.: Effective and efficient Semantic Table Interpretation using TableMiner+. Semantic Web, IOS Press (6), 921–957 (2017).
20. Takeoka, K., Oyamada, M., Nakadai, S., Okadome, T.: Meimei: An Efficient Probabilistic Approach for Semantically Annotating Tables. Proceedings of the AAAI Conference on Artificial Intelligence, 281–288 (2019).
21. Chen, J., Jimenez-Ruiz, E., Horrocks, I., Sutton, C.: Learning Semantic Annotations for Tabular Data. Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI-19), 2088–2094 (2019).
22. Ermilov, I., Ngomo, A.-C.N.: TAIPAN: Automatic Property Mapping for Tabular Data. In: Blomqvist, E., Ciancarini, P., Poggi, F., Vitali, F. (eds) Knowledge Engineering and Knowledge Management. EKAW 2016. Lecture Notes in Computer Science, vol 10024. Springer, Cham (2016).
23. Ritze, D., Lehmberg, O., Bizer, C.: Matching HTML Tables to DBpedia. In: Proceedings of the 5th International Conference on Web Intelligence, Mining and Semantics - WIMS '15, pp. 1–6. ACM Press (2015).
24. Bhagavatula, C.-S., Noraset, T., Downey, D.: TabEL: Entity Linking in Web Tables. In: Proceedings of The Semantic Web - ISWC 2015, Springer International Publishing, pp. 425–441 (2015).
25. Shashank, S., Shailendra, S.: Systematic review of spell-checkers for highly inflectional languages. Artificial Intelligence Review 53