Information Extraction From Co-Occurring Similar Entities
Nicolas Heist [email protected] and Web Science GroupUniversity of Mannheim, Germany
Heiko Paulheim [email protected] and Web Science GroupUniversity of Mannheim, Germany
ABSTRACT
Knowledge about entities and their interrelations is a crucial factor of success for tasks like question answering or text summarization. Publicly available knowledge graphs like Wikidata or DBpedia are, however, far from being complete. In this paper, we explore how information extracted from similar entities that co-occur in structures like tables or lists can help to increase the coverage of such knowledge graphs. In contrast to existing approaches, we do not focus on relationships within a listing (e.g., between two entities in a table row) but on the relationship between a listing's subject entities and the context of the listing. To that end, we propose a descriptive rule mining approach that uses distant supervision to derive rules for these relationships based on a listing's context. Extracted from a suitable data corpus, the rules can be used to extend a knowledge graph with novel entities and assertions. In our experiments we demonstrate that the approach is able to extract up to 3M novel entities and 30M additional assertions from listings in Wikipedia. We find that the extracted information is of high quality and thus suitable to extend Wikipedia-based knowledge graphs like DBpedia, YAGO, and CaLiGraph. For the case of DBpedia, this would result in an increase of covered entities by roughly 50%.
CCS CONCEPTS
• Information systems → Information extraction; Data extraction and integration; Association rules.

KEYWORDS
Entity co-occurrence, Information extraction, Novel entity detection, CaLiGraph, DBpedia
ACM Reference Format:
Nicolas Heist and Heiko Paulheim. 2021. Information Extraction From Co-Occurring Similar Entities. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3442381.3449836
This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.

WWW '21, April 19–23, 2021, Ljubljana, Slovenia. © 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License. ACM ISBN 978-1-4503-8312-7/21/04. https://doi.org/10.1145/3442381.3449836

In tasks like question answering, text summarization, or entity disambiguation, it is essential to have background information about the involved entities. With entity linking tools like DBpedia Spotlight [19] or Falcon [26], one can easily identify named entities in text and retrieve the respective entity in a background entity hub of the linking tool (e.g. in a wiki like Wikipedia or in a knowledge graph like DBpedia [14]). This is, however, only possible if the entity in question is contained in the respective entity hub [29].

The trend of entities added to publicly available knowledge graphs in recent years indicates that they are far from being complete. The number of entities in Wikidata [31], for example, grew by 37% in the time from October 2019 (61.7M) to October 2020 (84.5M). In the same time, the number of statements increased by 41% from 770M to 1085M (see https://tools.wmflabs.org/wikidata-todo/stats.php). According to [9], Wikidata describes the largest number of entities and comprises – in terms of entities – other open knowledge graphs to a large extent. Consequently, this problem applies to all public knowledge graphs, and particularly so for long-tail and emerging entities [6].

Automatic information extraction approaches can help mitigate this problem if the approaches can make sure that the extracted information is of high quality. While the performance of open information extraction systems (i.e. systems that extract information from general web text) has improved in recent years [4, 16, 27], the quality of extracted information has not yet reached a level where an integration into knowledge graphs like DBpedia should be done without further filtering.

Figure 1: Simplified view on the Wikipedia page of Gilby Clarke with a focus on its title, sections, and listings.

The extraction of information from semi-structured data is in general less error-prone and has already proved to yield high-quality results: for example, DBpedia itself is extracted primarily from Wikipedia infoboxes; further approaches use the category system of Wikipedia [10, 28, 33] or its list pages [11, 24]. Many more approaches focus on tables (in Wikipedia or the web) as a semi-structured data source to extract entities and relations (see [36] for a comprehensive survey). The focus of recent web table-based approaches like Zhang et al. [35] is set on recognizing entities and relationships within a table. Considering Fig. 1, the table below the section
Solo albums may be used to discover the publication years of albums (relation extraction) or discover additional unknown albums that are listed in further rows below Rubber and Swag (entity and type detection).

The focus of this paper is broader with respect to two dimensions: First, we extract information from any kind of structure where similar entities co-occur. In Fig. 1, we would consider both tables and lists (e.g. the list in the section Albums with Guns N' Roses). We refer to these co-occurrence structures as listings. Second, we consider only the subject entities (SE) of listings. In our previous work we defined SE with respect to Wikipedia list pages as "the instances of the concept expressed by the list page" [11]. Considering the List of Japanese speculative fiction writers, its SE comprise all Japanese speculative fiction writers mentioned in listings of the page. While in [11] the concept of SE is made explicit by the list page, we deal with arbitrary listings in this paper. We thus assume the concept may not be explicit or it may be indicated as part of the page in which the listing appears (e.g. in the table header, or the page title). Therefore, we will further refer to each entity in a listing appearing as an instance of a common concept as a subject entity.

The purpose of this work is to exploit the relationship between the SE of a listing and the listing context. For Fig. 1, this means we extract that all SE on the page's listings are albums with the artist Gilby Clarke, that The Spaghetti Incident? is an album by Guns N' Roses, and so on.

To that end, we propose to learn these characteristics of a listing with respect to the types and contextual relations of its SE. In an ideal setting we know the SE of a listing and we are able to retrieve all information about them from a knowledge graph – the characteristics of a listing are then simply the types and relations that are shared by all SE. But uncertainty is introduced by several factors:

• SE can only be determined heuristically. In previous work [11], we achieved a precision of 90% for the recognition of SE in Wikipedia listings.
• Cross-domain knowledge graphs are not complete. According to the open world assumption (OWA), the absence of a fact in a knowledge graph does not imply its incorrectness.
• Web tables have a median of 6 rows (according to the WDC Web Table Corpus 2015: http://webdatacommons.org/webtables/), and Wikipedia listings have a median of 8 rows. Consequently, many listings only have a small number of SE from which the characteristics can be inferred.

As a result, considering each listing in isolation either leads to a substantial loss of information (as listings with insufficient background information are disregarded) or to a high generalization error (as decisions are made based on insufficient background information).

We observe that the context of a listing is often a strong indicator for its characteristics. In Fig. 1, the title of the top section
Discography indicates that its listings contain some kind of musical works, and the section title Albums with Guns N' Roses provides more detailed information. Our second observation is that these patterns repeat when looking at a coherent data corpus. The Wikipedia page of Axl Rose (https://en.wikipedia.org/wiki/Axl_Rose), for example, contains the same constellation of sections.

Considering listing characteristics with respect to their context can thus yield more general insights than considering every listing in isolation. For example, the musical works of many artists in Wikipedia are listed under the top section Discography. Hence, we could learn the axioms

∃topSection.{"Discography"} ⊑ MusicalWork   (1)

and

∃topSection.{"Discography"} ⊑ ∃artist.{<PageEntity>}   (2)

which are then applicable to any listing with the top section Discography in Wikipedia.
In this work, we frame the task of finding descriptive rules for listings based on their context as an association rule mining problem [1]. We define rule metrics that take the inherent uncertainty into account and make sure that rules are frequent (rule support), correct (rule confidence), and consistent over all listings (rule consistency). Furthermore, we present an approach that executes the complete pipeline from identification of SE to the extraction of novel entities and assertions with Wikipedia as data corpus. To find a reasonable balance between correctness and coverage of the rules, we set the thresholds based on a heuristic that takes the distribution of named entity tags over entities as well as existing knowledge in a knowledge graph into account. Applying the approach, we show that we can enhance the knowledge graphs DBpedia with up to 2.9M entities and 8.3M assertions, and CaLiGraph with up to 3M entities and 30.4M assertions with an overall correctness of more than 90%.

To summarize, the contributions of this paper are as follows:

• We formulate the task of information extraction from co-occurring similar entities in listings and show how to derive descriptive rules for listing characteristics based on the listing context (Sec. 3).
• We present an approach that learns descriptive rules for listings in Wikipedia and is capable of extracting several millions of novel entities and assertions for Wikipedia-based knowledge graphs (Sec. 4).
• In our evaluation we demonstrate the high quality of the extracted information and analyze the shortcomings of the approach (Sec. 5).

The produced code is part of the CaLiGraph extraction framework (http://caligraph.org) and publicly available at https://github.com/nheist/CaLiGraph.

The work presented in this paper is a flavour of knowledge graph completion, more precisely, of adding new entities to a knowledge graph [22]. We use rules based on page context to infer facts about co-occurring entities. In particular, we focus on co-occurrence of entities within document listings, where co-occurrence refers to proximity in page layout. Hence, in this section, we discuss related works w.r.t. knowledge graph completion from listings, exploitation of listing context, as well as rule learning for knowledge graphs.
Knowledge graph completion using information in web tables has already been an active research area in the last several years. In 2016, Ritze et al. [25] profiled the potential of web tables in the WDC Web Table Corpus. Using the T2K Match framework, they match web tables to DBpedia and find that the best results for the extraction of new facts can be achieved using knowledge-based trust [5] (i.e., judging the quality of a set of extracted triples by their overlap with the knowledge base). Zhang et al. [35] present an approach for detection of novel entities in tables. They first exploit lexical and semantic similarity for entity linking and column heading property matching. In a second step they use the output to detect novel entities in table columns. Oulabi and Bizer [21] tackle the same problem for Wikipedia tables with a bootstrapping approach based on expert-defined rules. Macdonald and Barbosa [17] extract new facts from Wikipedia tables to extend the Freebase knowledge base. With an LSTM that uses contextual information of the table, they extract new facts for 28 relations.

Lists have only very sparsely been used for knowledge graph completion. Paulheim and Ponzetto [24] frame the general potential of list pages as a source of knowledge in Wikipedia. They propose to use a combination of statistical and NLP methods to extract knowledge and show that, by applying them to a single list page, they are able to extract a thousand new statements.

Compared to all previously mentioned approaches, we take an abstract view on listings by considering only their subject entities. This provides the advantage that rules can be learned from and applied to arbitrary listings. In addition to that, we do not only discover novel entities, but also discover relations between those entities and the page subject.

In our previous work [11], we have already presented an approach for the identification of novel entities and the extraction of facts in Wikipedia list pages. List pages are pages in Wikipedia that start with List of and contain listings (i.e., tables or lists) of entities for a given topic (e.g. List of Japanese speculative fiction writers). The approach is divided into two phases: In a first phase, a dataset of tagged entities from list pages is extracted. With distant supervision from CaLiGraph, a knowledge graph with a detailed type hierarchy derived from Wikipedia categories and list pages, a part of the mentioned entities is heuristically labeled as subject entities and non-subject entities. In a second phase, the dataset is enriched with positional, lexical, and statistical features extracted from the list pages. On the basis of this data, an XGBoost classifier is able to identify more than two million subject entities with an average precision of 90%. As not all the information about the subject entities is contained in the knowledge graphs DBpedia and CaLiGraph, they can be enhanced with the missing information.

In this work, we reuse the approach presented in [11] for identifying subject entities. Further, as it is the only approach that also works with arbitrary listings, we use it as a baseline in our experiments. As, in its current state, it only works for list pages in Wikipedia, we extend it to arbitrary pages with a simple frequency-based approach.
As tables are the more actively researched type of listings, we focus here on the types of context used when working with tables. The most obvious source of context is found directly on the page where the table is located. This page context is, for example, used by InfoGather [34] to detect possible synonyms in table headers for means of table matching. Zhang [38] distinguishes between "in-table" features like the table header, and "out-table" features like captions, page title, and text of surrounding paragraphs. With both kinds of features, they perform entity disambiguation against Freebase.

The previously mentioned approach of Macdonald and Barbosa [17] focuses on tables in Wikipedia and hence uses specific context features like section titles, table headers and captions, and the text in the first paragraph of the table's section. Interestingly, they do not only discover relations between entities in the table, but also between a table entity and the page subject. MENTOR [2] leverages patterns occurring in headers of Wikipedia tables to consistently discover DBpedia relations. Lehmberg et al. [15] tackle the problem of small web tables with table stitching, i.e., they combine several small tables with a similar context (e.g., same page or domain and a matching schema) into one large table, making it easier to extract facts from it.

Apart from page context, many approaches use the context of entities in tables to improve extraction results. Zhang et al. [37] generate new sub-classes to a taxonomy for a set of entities. To that end, they find the best-describing class using the context of the entities. In particular, they use the categories of the entities as well as the immediate context around the entities on the page. Another approach that uses entity categories as context is TableNet [7]. They leverage the context to find schematically similar or related tables for a given table in Wikipedia.

In our experiments with Wikipedia, we use section headers as page context and types in the knowledge graph as entity context. However, the definition of context in our approach is kept very generic on purpose. By doing that, we are able to incorporate additional context sources like section text or entity categories to improve extraction results. This, however, also comes with an increase in rule complexity and, consequently, run time.
Rule-based knowledge graph completion approaches typically generate rules either on instance-level (rules that add new facts for individual instances) or on schema-level (rules that add additional schematic constraints). AMIE+ [8] and AnyBURL [18] are instance-level rule learners inspired by inductive logic programming (ILP). The former uses
top-down, the latter bottom-up rule learning to generate rules in the fashion of born(X, A) ∧ capital(A, Y) ⇒ citizen(X, Y). DL-Learner [13] is an ILP-based approach on schema-level which finds description logic patterns for a set of instances. A related approach uses statistical schema induction [30] to derive additional schema constraints (e.g. range restrictions for predicates).

The above mentioned approaches are merely link prediction approaches, i.e. they predict new relations between entities already contained in the knowledge graph. The same holds for the omnipresent knowledge graph embedding approaches [32]. Such approaches are very productive when enough training data is available and they provide exact results especially when both positive and negative examples are given. In the setting of this paper, we are working with (more or less) noisy external data.

With regard to instance- versus schema-level, our approach can be regarded as a hybrid approach that generates rules for sets of entities, which are in turn used to generate facts on an instance-level. In this respect, our approach is similar to C-DF [33] which uses Wikipedia categories as an external data source to derive the characteristics of categories. To that end, they derive lexical patterns from category names and contained entities.

In this paper, we apply rule learning to co-occurring entities in Wikipedia. While existing approaches have only considered explicit co-occurrence, i.e., categories or list pages, we go beyond the state of the art by considering arbitrary listings in Wikipedia, as the one shown in Fig. 1.

In this paper, we consider a data corpus D from which co-occurring entities can be extracted (e.g., listings in Wikipedia or a collection of spreadsheets). Furthermore, we assume that a knowledge graph which contains a subset of those entities can be extended with information learned about the co-occurring entities.

The knowledge graph K is a set of assertions about its entities in the form of triples {(s, p, o) | s ∈ E, p ∈ P, o ∈ E ∪ T ∪ L} defined over sets of entities E, predicates P, types T, and literals L. We refer to statements about the types of an entity (i.e., p = rdf:type, o ∈ T) as type assertions (TA ⊂ K), and to statements about relations between two entities (i.e., o ∈ E) as relation assertions (RA ⊂ K). With K* ⊇ K, we refer to the idealized complete version of K (K* is merely a theoretical construct, since a complete knowledge graph of all entities in the world cannot exist). With regard to the OWA this means that a fact is incorrect if it is not contained in K*.

The data corpus D contains a set of listings Φ, where each listing φ ∈ Φ contains a number of subject entities SE_φ. Our task is to identify statements that hold for all subject entities SE_φ in a listing φ. We distinguish taxonomic and relational information that is expressed in K. The taxonomic information is a set of types that is shared by all SE of a listing:

T_φ = {t | t ∈ T, ∀s ∈ SE_φ: (s, rdf:type, t) ∈ K*},   (3)

and the relational information is a set of relations to other entities which is shared by all SE of a listing:

R_φ = {(p, o) | p ∈ P ∪ P⁻, o ∈ E, ∀s ∈ SE_φ: (s, p, o) ∈ K*}.   (4)
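To make these definitions concrete, the following minimal Python sketch intersects the types and (predicate, object) pairs of a listing's subject entities over a toy triple set. The triple set, the entity names, and the RDF_TYPE constant are illustrative assumptions; note also that the sketch can only intersect over the known graph K rather than the idealized K*, which is precisely why the frequency-based estimates introduced below are needed.

```python
# Minimal sketch (not the authors' implementation): shared types and relations of a
# listing's subject entities, following the definitions of T_phi and R_phi.
RDF_TYPE = "rdf:type"  # illustrative constant for the type predicate

# Hypothetical toy knowledge graph as a set of (subject, predicate, object) triples.
KG = {
    ("Rubber", RDF_TYPE, "Album"),
    ("Rubber", "artist", "Gilby Clarke"),
    ("Swag", RDF_TYPE, "Album"),
    ("Swag", "artist", "Gilby Clarke"),
}

def shared_types(subject_entities, kg):
    """Types asserted for every subject entity of a listing (T_phi over the known graph)."""
    type_sets = [{o for s, p, o in kg if s == e and p == RDF_TYPE} for e in subject_entities]
    return set.intersection(*type_sets) if type_sets else set()

def shared_relations(subject_entities, kg):
    """(predicate, object) pairs asserted for every subject entity (R_phi over the known graph)."""
    rel_sets = [{(p, o) for s, p, o in kg if s == e and p != RDF_TYPE} for e in subject_entities]
    return set.intersection(*rel_sets) if rel_sets else set()

listing_se = ["Rubber", "Swag"]
print(shared_types(listing_se, KG))      # {'Album'}
print(shared_relations(listing_se, KG))  # {('artist', 'Gilby Clarke')}
```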
From these characteristics of listings, we can derive all the additional type assertions

TA⁺ = ⋃_{φ∈Φ} {(s, rdf:type, t) | s ∈ SE_φ, t ∈ T_φ} \ TA   (5)

and additional relation assertions

RA⁺ = ⋃_{φ∈Φ} {(s, p, o) | s ∈ SE_φ, (p, o) ∈ R_φ} \ RA   (6)

that are encoded in Φ and missing in K. Furthermore, TA⁺ and RA⁺ can contain additional entities that are not yet contained in K, as there is no restriction for subject entities of Φ to be part of K. For the sake of readability, we will only describe the case of R_φ for the remainder of this section, as T_φ is – notation-wise – a special case of R_φ with p = rdf:type and o ∈ T.

Due to the incompleteness of K, it is not possible to derive the exact set of relations R_φ for every listing in Φ. Hence, our goal is to derive an approximate version R̂_φ by using φ and the knowledge about SE_φ in K. Similar to the rule learner AMIE+ [8], we use the partial completeness assumption (PCA) to generate negative evidence. The PCA implies that if (s, p, o) ∈ K then ∀o′: (s, p, o′) ∈ K* ⇒ (s, p, o′) ∈ K. In other words, if K makes some assertions with a predicate p for a subject s, then we assume that K contains every p-related information about s.

Following from the PCA, we use the count of entities with a specific predicate-object combination in a set of entities E

count(E, p, o) = |{s | s ∈ E, (s, p, o) ∈ K}|   (7)

and the count of entities having predicate p with an arbitrary object

count(E, p) = |{s | s ∈ E, ∃o′: (s, p, o′) ∈ K}|   (8)

to compute a maximum-likelihood-based frequency of a specific predicate-object combination occurring in E:

freq(E, p, o) = count(E, p, o) / count(E, p).   (9)

From Eq. 9 we first derive a naive approximation of a listing's relations by including all relations with a frequency above a defined threshold τ_freq:

R̂^freq_φ = {(p, o) | p ∈ P ∪ P⁻, o ∈ E, freq(SE_φ, p, o) > τ_freq}.   (10)

As argued in Sec. 1.1, we improve this naive frequency-based approximation by learning more general patterns that describe the characteristics of listings using their context. (Here, the entities in SE_φ may occur both in the subject as well as in the object position. But for a more concise notation, we use only (p, o)-tuples and introduce the set of inverse predicates P⁻ to express that SE may also occur in object position. This is, however, only a notation and the inverse predicates do not have to exist in the schema.)
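A minimal sketch, under the same toy assumptions as above (the knowledge graph is a set of (subject, predicate, object) triples), of the counts and maximum-likelihood frequency of Eqs. 7–9 and the naive threshold-based approximation of Eq. 10; the candidate pairs and the default threshold are illustrative.

```python
# Illustrative sketch of Eqs. 7-10, not the reference implementation.
def count_po(entities, p, o, kg):
    """count(E, p, o): entities in E that have the specific (p, o) assertion in the graph."""
    return sum(1 for e in entities if (e, p, o) in kg)

def count_p(entities, p, kg):
    """count(E, p): entities in E that have predicate p with an arbitrary object."""
    return sum(1 for e in entities if any(s == e and pred == p for s, pred, _ in kg))

def freq(entities, p, o, kg):
    """freq(E, p, o) = count(E, p, o) / count(E, p) (Eq. 9); defined as 0 if p is unseen for E."""
    denominator = count_p(entities, p, kg)
    return count_po(entities, p, o, kg) / denominator if denominator else 0.0

def naive_listing_relations(subject_entities, candidate_pairs, kg, tau_freq=0.5):
    """Naive approximation of a listing's relations (Eq. 10): keep pairs above tau_freq."""
    return {(p, o) for p, o in candidate_pairs if freq(subject_entities, p, o, kg) > tau_freq}
```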
Table 1: Exemplary context (ζ), type frequency (TF), and relation frequency (RF) vectors for a set of listings extracted from D. While ζ is extracted directly from D, TF and RF are retrieved via distant supervision from K.

Listing  | ζ             | TF                     | RF
φ_1      | (1 0 1 ... 1) | (0.2 0.9 0.0 ... 0.1) | (0.9 0.1 0.0 ... 0.1)
φ_2      | (0 1 1 ... 0) | (0.0 0.2 0.0 ... 0.9) | (0.0 0.0 0.0 ... 0.2)
φ_3      | (0 0 0 ... 0) | (0.7 0.7 0.0 ... 0.0) | (0.0 0.0 0.0 ... 0.4)
...      |               |                        |
φ_{n−1}  | (1 0 0 ... 1) | (0.8 0.9 0.0 ... 0.0) | (0.0 0.9 0.0 ... 0.0)
φ_n      | (1 0 0 ... 1) | (0.7 1.0 0.0 ... 0.3) | (0.0 0.0 0.8 ... 0.0)

Hypothesis 1. The context ζ_φ of a listing φ in D contains such information about R_φ that it can be used to find subsets of Φ with similar R.

Let Table 1 contain the information about all listings in D. A listing φ is defined by its context ζ_φ (which can in theory contain any information about φ, from the title of its section to an actual image of the listing), the type frequencies (t_1, t_2, .., t_x) ∈ TF_φ, and the relation frequencies (r_1, r_2, .., r_y) ∈ RF_φ. Listings φ_1, φ_{n−1}, and φ_n have overlapping context vectors. t_2 has a consistently high frequency over all three listings. It is thus a potential type characteristic for this kind of listing context. Furthermore, r_1 has a high frequency in φ_1, r_2 in φ_{n−1}, and r_3 in φ_n – if the three relations share the same predicate, they may all express a similar relation to an entity in their context (e.g. to the subject of the page).

In a concrete scenario, the context vector (1 0 0 ... 1) might indicate that the listing is located on the page of a musician under the section Solo albums. t_2 holds the frequency of the type Album in this listing, and r_1 to r_3 describe the frequencies of the relations (artist, Gilby Clarke), (artist, Axl Rose), and (artist, Slash).

We formulate the task of discovering frequent co-occurrences of context elements and taxonomic and relational patterns as an association rule mining task over all listings in D. Association rules, as introduced by Agrawal et al. [1], are simple implication patterns originally developed for large and sparse datasets like transaction databases of supermarket chains. To discover items that are frequently bought together, rules of the form X ⇒ Y are produced, with X and Y being itemsets. In the knowledge graph context, they have been used, e.g., for enriching the schema of a knowledge graph [23, 30].

For our scenario, we need a mapping from a context vector ζ ∈ Z to a predicate-object tuple. Hence, we define a rule r, its antecedent r_a, and its consequent r_c as follows:

r: r_a ∈ Z ⇒ r_c ∈ (P ∪ P⁻) × (T ∪ E ∪ X).   (11)

As a rule should be able to imply relations to entities that vary with the context of a listing (e.g. to Gilby Clarke as the page's subject in Fig. 1), we introduce X as the set of placeholders for context entities (instead of Gilby Clarke, the object of the rule's consequent would be <PageEntity>).
We say a rule antecedent r_a matches a listing context ζ_φ (short: r_a ≃ ζ_φ) if the vector of ζ_φ is 1 wherever the vector of r_a is 1. In essence, ζ_φ must comprise r_a. Accordingly, we need to find a set of rules R, so that for every listing φ the set of approximate listing relations

R̂^rule_φ = ⋃_{r∈R} {r_c | r_a ≃ ζ_φ}   (12)

resembles the true relations R_φ as closely as possible. Considering all the listings in Fig. 1, their R̂^rule_φ should, among others, contain the rules

topSection("Discography") ⇒ (type, MusicalWork)   (13)

and

topSection("Discography") ⇒ (artist, <PageEntity>).   (14)

It is important to note that these rules can be derived from listings with differing context vectors. All listings only have to have in common that their top section has the title Discography and that the contained entities are of the type MusicalWork with the page entity as artist. Still, the individual listings may, for example, occur in sections with different titles.
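The rule matching of Eqs. 11 and 12 can be sketched as follows; the Rule class, the three-dimensional toy context vectors, and the interpretation of position 0 as topSection("Discography") are assumptions made purely for illustration.

```python
# Sketch of rules over binary context vectors (cf. Table 1) and their application (Eq. 12).
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class Rule:
    antecedent: Tuple[int, ...]   # binary context pattern r_a
    consequent: Tuple[str, str]   # (predicate, type/entity/placeholder) pair r_c

def matches(antecedent, context):
    """r_a ≃ ζ_φ: the context is 1 wherever the antecedent is 1."""
    return all(c == 1 for a, c in zip(antecedent, context) if a == 1)

def rule_based_relations(rules, context):
    """Approximate listing relations (Eq. 12): union of the consequents of all matching rules."""
    return {r.consequent for r in rules if matches(r.antecedent, context)}

# Toy versions of rules (13) and (14); position 0 stands for topSection("Discography") here.
rules = [
    Rule((1, 0, 0), ("type", "MusicalWork")),
    Rule((1, 0, 0), ("artist", "<PageEntity>")),
]
print(rule_based_relations(rules, (1, 0, 1)))  # both consequents are implied
```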
(Note that Eqs. 1 and 2 are the axiom equivalents of Eqs. 13 and 14. For better readability, we use the description logics notation of Eqs. 1 and 2 from here on. Further, instead of a binary vector, we use a more expressive notation for the listing context in our examples; the notations are trivially convertible by one-hot-encoding.)

In original association rule mining, two metrics are typically considered to judge the quality of a rule X ⇒ Y: the support of the rule antecedent (how often does X occur in the dataset), and the confidence of the rule (how often does X ∪ Y occur in relation to X). Transferring the support metric to our task, we count the absolute frequency of a particular context occurring in Φ. Let Φ_{r_a} = {φ | φ ∈ Φ, r_a ≃ ζ_φ}, then we define the support of the rule antecedent r_a as

supp(r_a) = |Φ_{r_a}|.   (15)

Due to the incompleteness of K, the values of Y are in our case no definitive items but maximum-likelihood estimates of types and relations. With respect to these estimates, a good rule has to fulfil two criteria: it has to be correct (i.e. frequent with respect to all SE of the covered listings) and it has to be consistent (i.e. consistently correct over all the covered listings). We define the correctness, or confidence, of a rule as the frequency of the rule consequent over all SE of a rule's covered listings:

conf(r) = Σ_{φ∈Φ_{r_a}} count(SE_φ, p_{r_c}, o_{r_c}) / Σ_{φ∈Φ_{r_a}} count(SE_φ, p_{r_c}),   (16)

and we define the consistency of a rule using the mean absolute deviation of an individual listing's confidence to the overall confidence of the rule:

cons(r) = 1 − Σ_{φ∈Φ_{r_a}} |freq(SE_φ, p_{r_c}, o_{r_c}) − conf(r)| / supp(r_a).   (17)

While a high confidence ensures that the overall assertions generated by the rule are correct, a high consistency ensures that few listings with many SE do not outvote the remaining covered listings.
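A small sketch of how the three metrics could be computed for one candidate rule. To keep it self-contained, the per-listing counts of Eqs. 7 and 8 for the rule's consequent are assumed to be precomputed and passed in as a list with one entry per listing covered by the antecedent (Φ_{r_a}); this data layout is an illustrative assumption.

```python
# Sketch of the rule metrics of Eqs. 15-17.
def rule_metrics(per_listing_counts):
    """per_listing_counts: list of (count_po, count_p) pairs, one per covered listing."""
    supp = len(per_listing_counts)                                   # Eq. 15
    total_po = sum(c_po for c_po, _ in per_listing_counts)
    total_p = sum(c_p for _, c_p in per_listing_counts)
    conf = total_po / total_p if total_p else 0.0                    # Eq. 16
    freqs = [c_po / c_p if c_p else 0.0 for c_po, c_p in per_listing_counts]
    mad = sum(abs(f - conf) for f in freqs) / supp if supp else 0.0
    cons = 1.0 - mad                                                 # Eq. 17
    return supp, conf, cons

# Three covered listings in which 4/5, 3/3, and 2/4 subject entities support the consequent:
print(rule_metrics([(4, 5), (3, 3), (2, 4)]))  # (3, 0.75, ~0.82)
```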
To select an appropriate set of rules R from all the candidate rules R* in the search space, we have to pick reasonable thresholds for the minimum support (τ_supp), the minimum confidence (τ_conf), and the minimum consistency (τ_cons). By applying these thresholds, we find our final set of descriptive rules

R = {r | r ∈ R*, supp(r_a) > τ_supp ∧ conf(r) > τ_conf ∧ cons(r) > τ_cons}.   (18)

Typically, the choice of these thresholds is strongly influenced by the nature of the dataset D and the extraction goal (correctness versus coverage).

Wikipedia is a rich source of listings, both in dedicated list pages as well as in sections of article pages. Hence, we use it as a data corpus for our experiments. In Sec. 6, we discuss other appropriate corpora for our approach. Due to its structured and encyclopedic nature, Wikipedia is a perfect application scenario for our approach. We can exploit the structure by building very expressive context vectors. Obviously, this positively influences the quality of extraction results. Still, the definition of the context vector is kept abstract on purpose to make the approach applicable to other kinds of web resources as well. However, an empirical evaluation of the practicability or performance of the approach for resources outside of the encyclopedic domain is out of scope of this paper.
Fig. 2 gives an overview of our extraction approach. The input of the approach is a dump of Wikipedia as well as an associated knowledge graph. In the Subject Entity Discovery phase, listings and their context are extracted from the Wikipedia dump and subject entities are identified (Sec. 4.3). Subsequently, the existing information in the knowledge graph is used to mine descriptive rules from the extracted listings (Sec. 4.4). Finally, the rules are applied to all the listings in Wikipedia in order to extract new type and relation assertions (Sec. 4.5).
We pick Wikipedia as a data corpus for our experiments as it bringsseveral advantages:
Structure.
Wikipedia is written in an entity-centric style with a focus on facts. Listings are often used to provide an overview of a set of entities that are related to the main entity. Due to the encyclopedic style and the peer-reviewing process, it has a consistent structure. Especially section titles are used consistently for specific topics. Wikipedia has its own markup language (Wiki markup), which allows a more consistent access to interesting page structures like listings and tables than plain HTML.
Entity Links.
If a Wikipedia article is mentioned in another article, it is typically linked in the Wiki markup (a so-called blue link). Furthermore, it is possible to link to an article that does not (yet) exist (a so-called red link). As Wikipedia articles can be trivially mapped to entities in Wikipedia-based knowledge graphs like DBpedia, since they create one entity per article, we can identify many named entities in listings and their context without the help of an entity linker.

For our experiments, we use a Wikipedia dump of October 2016 which is, at the time of the experiments, the most recent dump that is compatible with both DBpedia and CaLiGraph. In this version, Wikipedia contains 6.9M articles, 2.4M of which contain listings with at least two rows. In total, there are 5.1M listings with a row count median of 8, mean of 21.9, and standard deviation of 76.8. Of these listings, 1.1M are tables, and 4.0M are lists.
Apart from the already tagged entities via blue and red links, we have to make sure that any other named entity in listings and their context is identified as well. This is done in two steps:

In a first step, we expand all the blue and red links in an article. If a piece of text is linked to another article, we make sure that every occurrence of that piece of text in the article is linked to the other article. This is necessary as, by convention, other articles are only linked at their first occurrence in the text.

In a second step, we use a named entity tagger to identify additional named entities in listings. To that end, we use a state-of-the-art entity tagger from spaCy. This tagger is trained on the OntoNotes5 corpus, and thus not specifically trained to identify named entities in short text snippets like they occur in listings. Therefore, we specialize the tagger by providing it Wikipedia listings as additional training data with blue links as positive examples. In detail, the tagger is specialized as follows:

• We retrieve all listings in Wikipedia list pages as training data.
• We apply the plain spaCy entity tagger to the listings to get named entity tags for all mentioned entities.
• To make these tags more consistent, we use information from DBpedia about the tagged entities: We look at the distribution of named entity tags over entities with respect to their DBpedia types and take the majority vote (see the sketch after this list). For example, if 80% of entities with the DBpedia type Person are annotated with the tag PERSON, we use PERSON as label for all these entities.
• Using these consistent named entity tags for blue-link entities, we specialize the spaCy tagger.
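The majority-vote step of the third bullet can be sketched as follows; the observation format and the toy data are assumptions for illustration only.

```python
# Sketch of making named-entity tags consistent per DBpedia type via majority vote.
from collections import Counter, defaultdict

def majority_tag_per_type(tagged_entities):
    """tagged_entities: iterable of (entity, dbpedia_type, ne_tag) observations."""
    tags_by_type = defaultdict(Counter)
    for _entity, dbpedia_type, ne_tag in tagged_entities:
        tags_by_type[dbpedia_type][ne_tag] += 1
    # e.g. if 80% of entities typed Person are tagged PERSON, PERSON wins the vote
    return {t: counts.most_common(1)[0][0] for t, counts in tags_by_type.items()}

observations = [
    ("Gilby Clarke", "Person", "PERSON"),
    ("Axl Rose", "Person", "PERSON"),
    ("Slash", "Person", "ORG"),  # noisy tag, outvoted
]
print(majority_tag_per_type(observations))  # {'Person': 'PERSON'}
```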
We apply the approach from [11] for the identification of subject entities in listings. In short, we use lexical, positional, and statistical features to classify entities as subject or non-subject entities (refer to Sec. 2.1 for more details). Despite being developed only for listings in list pages, the classifier is applicable to any kind of listing in Wikipedia. A disadvantage of this broader application is that the classifier is not trained in such a way that it ignores listings used for organisational or design purposes (e.g. summaries or timelines). These have to be filtered out in the subsequent stages.
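To give a flavor of the lexical, positional, and statistical features mentioned above, the following sketch computes a few such features for a single entity mention; the keys of the input dictionary are hypothetical stand-ins for the parsed listing model of [11], whose actual feature set is considerably richer and feeds an XGBoost classifier.

```python
# Illustrative per-mention features for subject-entity classification (assumed data model).
def mention_features(mention):
    """mention: dict describing one entity mention in a listing row."""
    text = mention["text"]
    return {
        "column_index": mention["column_index"],                   # positional
        "is_first_column": int(mention["column_index"] == 0),      # positional
        "token_count": len(text.split()),                          # lexical
        "starts_uppercase": int(text[:1].isupper()),               # lexical
        "entity_share_in_column": mention["column_entity_share"],  # statistical
    }

print(mention_features({"text": "The Spaghetti Incident?", "column_index": 0,
                        "column_entity_share": 0.9}))
```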
After expanding all the blue and red links on the pages, the dataset contains 5.1M listings with 60.1M entity mentions. Wiki markup is parsed with WikiTextParser (https://github.com/5j9/wikitextparser); for the linking conventions, see https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Linking. The entity tagger is taken from spaCy (https://spacy.io) and the OntoNotes corpus is available at https://catalog.ldc.upenn.edu/LDC2013T19.

Figure 2: An overview of the approach with exemplary outputs of the individual phases.
The search space for rule candidates is defined by the listing context. Thus, we choose the context in such a way that it is expressive enough to be an appropriate indicator for T_φ and R_φ, and concise enough to explore the complete search space without any additional heuristics.

We exploit the fact that Wikipedia pages of a certain type (e.g., musicians) mostly follow naming conventions for the sections of their articles (e.g., albums and songs are listed under the top section Discography). Further, we exploit that the objects of the SE's relations are usually either the entity of the page, or an entity mentioned in a section title. We call these typical places for objects the relation targets. In Fig. 1, Gilby Clarke is an example of a PageEntity target, and Guns N' Roses as well as Nancy Sinatra are examples for SectionEntity targets. As a result, we use the type of the page entity, the top section title, and the section title as listing context. Additionally, we use the type of entities that are mentioned in section titles. This enables the learning of more abstract rules, e.g., to distinguish between albums listed in a section describing a band:

∃pageEntityType.{Person} ⊓ ∃topSection.{"Discography"} ⊓ ∃sectionEntityType.{Band} ⊑ Album,

and songs listed in a section describing an album:

∃pageEntityType.{Person} ⊓ ∃topSection.{"Discography"} ⊓ ∃sectionEntityType.{Album} ⊑ Song.
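A minimal sketch of how such a listing context could be assembled from the four features named above (page entity type, top section title, section title, section entity type); the input dictionary is a hypothetical stand-in for the parsed page model. The categorical values correspond to the binary context vector of the task formulation via one-hot encoding.

```python
# Sketch of building the listing context used as rule search space (assumed page model).
def listing_context(listing):
    return {
        "pageEntityType": listing["page_entity_type"],             # e.g. "Person"
        "topSection": listing["top_section_title"],                # e.g. "Discography"
        "section": listing["section_title"],                       # e.g. "Solo albums"
        "sectionEntityType": listing.get("section_entity_type"),   # e.g. "Band" or None
    }

example = {
    "page_entity_type": "Person",
    "top_section_title": "Discography",
    "section_title": "Albums with Guns N' Roses",
    "section_entity_type": "Band",
}
print(listing_context(example))
```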
We want to pick the thresholds in such a way that we tolerate some errors and missing information in K, but do not allow many over-generalized rules that create incorrect assertions. Our idea for a sensible threshold selection is based on two assumptions:

Assumption 1. Being based on a maximum-likelihood estimation, rule confidence and consistency roughly order rules by the degree of prior knowledge we have about them.

Assumption 2.
Assertions generated by over-generalized rules contain substantially more random noise than assertions generated by good rules.

Assumption 1 implies that the number of over-generalized rules increases with the decrease of confidence and consistency. As a consequence, Assumption 2 implies that the amount of random noise increases with decrease of confidence and consistency.

To measure the increase of noise in generated assertions, we implicitly rely on existing knowledge in K by using the named entity tags of subject entities as a proxy. This works as follows: For a subject entity e that is contained in K, we have its type information T_e from K and we have its named entity tag ψ_e from our named entity tagger. Going over all SE of listings in Φ, we compute the probability of an entity with type t having the tag ψ by counting how often they co-occur:

tagprob(t, ψ) = |{e | ∃φ ∈ Φ: e ∈ SE_φ ∧ t ∈ T_e ∧ ψ = ψ_e}| / |{e | ∃φ ∈ Φ: e ∈ SE_φ ∧ t ∈ T_e}|.   (19)
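The estimation of Eq. 19 can be sketched as follows; the observation format (one entry per subject entity known in K, holding its types and its named-entity tag) is an assumption for illustration.

```python
# Sketch of Eq. 19: probability of a named-entity tag given a knowledge-graph type.
from collections import defaultdict

def tag_probabilities(subject_entity_observations):
    """observations: iterable of (entity_types, ne_tag) pairs, one per distinct subject entity."""
    type_counts = defaultdict(int)
    type_tag_counts = defaultdict(int)
    for entity_types, ne_tag in subject_entity_observations:
        for t in entity_types:
            type_counts[t] += 1
            type_tag_counts[(t, ne_tag)] += 1
    return {(t, tag): n / type_counts[t] for (t, tag), n in type_tag_counts.items()}

observations = [
    ({"Person", "Agent"}, "PERSON"),
    ({"Person", "Agent"}, "PERSON"),
    ({"Album", "Work"}, "ORG"),
]
print(tag_probabilities(observations)[("Person", "PERSON")])  # 1.0
```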
For example, for the DBpedia type Album, we find the tag probabilities WORK_OF_ART: 0.49, ORG: 0.14, PRODUCT: 0.13, PERSON: 0.07, showing that album titles are rather difficult to recognize. For the type Person and the tag PERSON, on the other hand, we find a probability of 0.86.

We can then compute the tag-based probability for a set of assertions A by averaging over the tag probability that is produced by the individual assertions. To compute this metric, we compare the tag of the assertion's subject entity with some kind of type information about it. This type information is either the asserted type (in case of a type assertion), or the domain of the predicate (in case of a relation assertion):

tagfit(A) = (1/|A|) Σ_{(s,p,o)∈A} tagprob(o, ψ_s)          if p = rdf:type,
tagfit(A) = (1/|A|) Σ_{(s,p,o)∈A} tagprob(domain_p, ψ_s)   otherwise.   (20)

(We use the domain of the predicate p as defined in K. In case of p ∈ P⁻, we use the range of the original predicate.)

While we do not expect the named entity tags to be perfect, our approach is based on the idea that the tags are consistent to a large extent. By comparing the tagfit of assertions produced by rules with varying levels of confidence and consistency, we expect to see a clear decline as soon as too many noisy assertions are added.

Fig. 3 shows the tagfit for type and relation assertions generated with varying levels of rule confidence and consistency. Our selection of thresholds is indicated by blue bars, i.e. we set the thresholds to the points where the tagfit has its steepest drop. The thresholds are picked conservatively to select only high-quality rules by selecting points before an accelerated decrease of cumulative tagfit. But more coverage-oriented selections are also possible; in Fig. 3d, for example, a lower threshold is also a valid option.

An analysis of rules with different levels of confidence and consistency has shown that a minimum support for types is not necessary. For relations, a support threshold of 2 is helpful to discard over-generalized rules. Further, we found that it is acceptable to pick the thresholds independently from each other, as the turning points for a given metric don't vary significantly when varying the remaining metrics.

Applying these thresholds, we find an overall number of 5,294,921 type rules with 369,139 distinct contexts and 244,642 distinct types. Further, we find 3,028 relation rules with 2,602 distinct contexts and 516 distinct relations. 949 of the relation rules have the page entity as target, and 2,079 have a section entity as target. Among those rules are straightforward ones like

∃pageEntityType.{Person} ⊓ ∃topSection.{"Acting filmography"} ⊑ ∃actor.{<PageEntity>},

and more specific ones like

∃pageEntityType.{Location} ⊓ ∃topSection.{"Media"} ⊓ ∃section.{"Newspapers"} ⊑ Periodical_literature.

We apply the rules selected in the previous section to the complete dataset of listings to generate type and relation assertions. Subsequently, we remove any duplicate assertions and assertions that already exist in K. To get rid of errors introduced during the extraction process (e.g. due to incorrectly extracted subject entities or incorrect rules), we employ a final filtering step for the generated assertions: every assertion producing a tagprob ≤ 1/3 is discarded. The rationale behind the threshold is as follows: Types have typically one and sometimes two corresponding named entity tags (e.g. the tag PERSON for the DBpedia type
Person, or the tags ORG and FAC for the type School). As tag probabilities are relative frequencies, we make sure that, with a threshold of 1/3, at most two tags are accepted for any given type. For the tag probabilities of type Album from Sec. 4.4.2, the only valid tag is WORK_OF_ART. As a consequence, any assertions of the form (s, rdf:type, Album) with s having a tag other than WORK_OF_ART are discarded.
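The tag-based quality estimate of Eq. 20 and the final per-assertion filter can be sketched as follows. The dictionary structures (tag probabilities keyed by (type, tag), a subject-to-tag lookup, and a predicate-to-domain lookup) are assumptions; for inverse predicates the range of the original predicate would be used instead of the domain, which is omitted here for brevity. The default threshold of one third reflects the rationale given above.

```python
# Sketch of tagfit (Eq. 20) and the tag-based assertion filter (assumed data structures).
RDF_TYPE = "rdf:type"

def assertion_tagprob(assertion, entity_tags, tagprob, predicate_domains):
    """tagprob for one assertion: check the subject's NE tag against the asserted type
    or, for relation assertions, against the domain of the predicate."""
    s, p, o = assertion
    type_for_check = o if p == RDF_TYPE else predicate_domains[p]
    return tagprob.get((type_for_check, entity_tags[s]), 0.0)

def tagfit(assertions, entity_tags, tagprob, predicate_domains):
    """Eq. 20: average tagprob over a set of assertions."""
    scores = [assertion_tagprob(a, entity_tags, tagprob, predicate_domains) for a in assertions]
    return sum(scores) / len(scores) if scores else 0.0

def filter_assertions(assertions, entity_tags, tagprob, predicate_domains, threshold=1/3):
    """Keep only assertions whose individual tagprob exceeds the threshold."""
    return [a for a in assertions
            if assertion_tagprob(a, entity_tags, tagprob, predicate_domains) > threshold]
```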
Tab. 2 shows the number of generated type and relation assertions before and after the tag-based filtering. The number of inferred types is listed separately for DBpedia and CaLiGraph. For relations, we show two kinds: The entry Relations lists the number of extracted assertions from rules. As DBpedia and CaLiGraph share the same set of predicates, these assertions are applicable to both graphs. Furthermore, as Relations (via CaLiGraph), we list the number of relations that can be inferred from the extracted CaLiGraph types via restrictions in the CaLiGraph ontology. CaLiGraph contains more than 300k of such restrictions that imply a relation based on a certain type. For example, the ontology contains the value restriction

Pop_rock_song ⊑ ∃genre.{Pop music}.

As we extract the type Pop_rock_song for the Beach Boys song At My Window, we infer the fact (At My Window, genre, Pop music).
For CaLiGraph, we find assertions for 3.5M distinct subject entities with 3M of them not contained in the graph. For DBpedia, we find assertions for 3.1M distinct subject entities with 2.9M of them not contained. The unknown subject entities are, however, not disambiguated yet. Having only small text snippets in listings as information about these entities, a disambiguation with general-purpose disambiguation approaches [39] is not practical. We thus leave this as an own research topic for future work. For an estimation of the actual number of novel entities, we rely on previous work [11], where we analyzed the overlap for red links in list pages. In that paper, we estimate an overlap factor of 1.07 which would – when applied to our scenario – reduce the number of actual novel entities to roughly 2.8M for CaLiGraph and 2.7M for DBpedia. In relation to the current size of those graphs, this would be an increase of up to 38% and 54%, respectively [9].

Figure 3 (panels a–d: type confidence, type consistency, relation confidence, relation consistency): tagfit of assertions generated from rules in a specified confidence or consistency interval. Bars show scores for a given interval (e.g. (0.75, 0.80]), lines show cumulative scores (e.g. (0.75, 1.00]). Blue bars indicate the selected threshold.

Table 2: Number of generated assertions after removing existing assertions (Raw), and after applying tag-based filtering (Filtered).

Assertion Type            | Raw        | Filtered
Types (DBpedia)           | 11,459,047 | 7,721,039
Types (CaLiGraph)         | 47,249,624 | 29,128,677
Relations                 | 732,820    | 542,018
Relations (via CaLiGraph) | 1,381,075  | 796,910

Table 3: Correctness of manually evaluated assertions.

Assertion Type    | Approach        | #Assertions | Sample Size | Correctness (%)
Types (DBpedia)   | frequency-based | 6,680,565   | 414         | 91.55
Types (CaLiGraph) | frequency-based | 26,676,191  | 2,000       | 89.40
Relations         | frequency-based | 392,673     | 1,000       | 93.80

In our performance evaluation, we judge the quality of generated assertions from our rule-based approach. As a baseline, we additionally evaluate assertions generated by the frequency-based approach (we set τ_freq to τ_conf and disregard listings with less than three subject entities). The evaluated assertions are created with a stratified random sampling strategy. The assertions are thus distributed proportionally over all page types (like Person or Place) and sampled randomly within these.

The labeling of the assertions is performed by the authors with the procedure as follows: For a given assertion, first the page of the listing is inspected, then – if necessary and available – the page of the subject entity. If a decision cannot be made based on this information, a search engine is used to evaluate the assertion. Samples of the rule-based and frequency-based approaches are evaluated together and in random order to ensure objectivity.

Tab. 3 shows the results of the performance evaluation. In total, we evaluated 2,000 examples per approach for types and 1,000 examples per approach for relations. The taxonomy of CaLiGraph comprises the one of DBpedia. Thus, we evaluated the full sample for CaLiGraph types and report the numbers for both graphs, which is the reason why the sample size for DBpedia is lower. For relations, we only evaluate the ones that are generated directly from rules and not the ones inferred from CaLiGraph types, as the correctness of the inferred relations directly depends on the correctness of CaLiGraph types.
The evaluation results in Tab. 3 show that the information extracted from listings in Wikipedia is of an overall high quality. The rule-based approach yields a larger number of assertions with a higher correctness for both types and relations.

For both approaches, the correctness of the extracted assertions is substantially higher for DBpedia. The reason for that lies in the differing granularity of knowledge graph taxonomies. DBpedia has 764 different types while CaLiGraph has 755,441, with most of them being more specific extensions of DBpedia types. For example, DBpedia might describe a person as Athlete, while CaLiGraph describes it as Olympic_field_hockey_player_of_South_Korea. The average depth of predicted types is 2.06 for the former and 3.32 for the latter.

While the asserted types are very diverse (the most predicted type is Agent with 7.5%), asserted relations are dominated by the predicate genus with 69.8%, followed by isPartOf (4.4%) and artist (3.2%). This divergence cannot be explained with a different coverage: In DBpedia, 72% of entities with type Species have a genus, and 69% of entities with type MusicalWork have an artist. But we identify two other influencing factors: Wikipedia has very specific guidelines for editing species, especially with regard to standardization and formatting rules (see https://species.wikimedia.org/wiki/Help:General_Wikispecies). In addition to that, the genus relation is functional and hence trivially fulfilling the PCA. As our approach is strongly relying on this assumption and it potentially inhibits the mining of practical rules for non-functional predicates (like, for example, for artist), we plan on investigating this relationship further.

The inferred relations from CaLiGraph types are not evaluated explicitly. However, their quality can be estimated from the correctness of restrictions in CaLiGraph, which is reported to be 95.6% [10], and from the correctness of the extracted CaLiGraph types (Tab. 3).
Table 4: Error types partitioned by cause. The occurrence values are given as their relative frequency (per 100) in the samples evaluated in Tab. 3.

Error type                           | Type | Relation
(1) Entity parsed incorrectly        |      |
(2) Wrong subject entity identified  |      |
(3) Rule applied incorrectly         |      |
(4) Semantics of listing too complex |      |
For CaLiGraph, the frequency-based approach finds assertions for 2.5M distinct subject entities (2.1M of them novel). While the rule-based approach finds 9% more assertions, its assertions are distributed over 40% more entities (and over 43% more novel entities). This demonstrates the capabilities of the rule-based approach to apply contextual patterns to environments where information about actual entities is sparse.

Further, we analyzed the portion of evaluated samples that applies to novel entities and found that the correctness of these statements is slightly better (between 0.1% and 0.6%) than the overall correctness. Including CaLiGraph types, we find an average of 9.03 assertions per novel entity, with a median of 7. This is, again, due to the very fine-grained type system of CaLiGraph. For example, for the rapper Dizzle Don, which is a novel entity, we find 8 types (from Agent over Musician to American_rapper) and 4 relations: (occupation, Singing), (occupation, Rapping), (birthPlace, United States), and (genre, Hip hop music).
With Tab. 4, we provide an analysis of error type frequencies for the rule-based approach on the basis of the evaluated sample. (1) is caused by the entity linker, mostly due to incorrect entity borders. For example, the tagger identifies only a part of an album title. (2) is caused by errors of the subject entity identification approach, e.g. when the approach identifies the wrong column of a table as the one that holds subject entities. (3) can have multiple reasons, but most often the applied rule is over-generalized (e.g. implying Football_player when the listing is actually about athletes in general) or applied to the wrong listing (i.e., the context described by the rule is not expressive enough). Finally, (4) happens, for example, when a table holds the specifications of a camera, as this cannot be expressed with the given set of predicates in DBpedia or CaLiGraph.

Overall, most of the errors are produced by incorrectly applied rules. This is, however, unavoidable to a certain extent as knowledge graphs are not error-free and the data corpus is not perfect. A substantial portion of errors is also caused by incorrectly parsed or identified subject entities. Reducing these errors can also have a positive impact on the generated rules as correct information about entities is a requirement for correct rules.
In this work, we demonstrate the potential of exploiting co-occurring similar entities for information extraction, and especially for the discovery of novel entities. We show that it is possible to mine expressive descriptive rules for listings in Wikipedia which can be used to extract information about millions of novel entities.

To improve our approach, we are investigating more sophisticated filtering approaches for the generated assertions to reduce the margin from raw to filtered assertions (see Tab. 2). Furthermore, we are experimenting with more expressive rules (e.g. by including additional context like substring patterns or section text) to improve our Wikipedia-based approach.

At the moment, we extract entities from single pages. While entity disambiguation on single pages is quite simple (on a single Wikipedia page, it is unlikely that the same surface form refers to different entities), the disambiguation of entities across pages is a much more challenging problem. Here, entity matching across pages is required, which should, ideally, combine signals from the source pages as well as constraints from the underlying ontology.

Furthermore, we work towards applying our approach to additional data corpora. Since the only language-dependent ingredient of our approach is the named entity tagging, and the entity tagger we use in our experiments has models for various languages (https://spacy.io/models), our approach can also be extended to various language editions of Wikipedia. Besides Wikipedia, we want to apply the approach to wikis in the Fandom universe containing more than 380k wikis on various domains (among them many interesting wikis for our approach, like for example WikiLists: https://list.fandom.com/wiki/Main_Page). For background knowledge, we plan to rely on existing knowledge graphs in this domain like DBkWik [12] or TiFi [3]. In the longer term, we want to extend the applicability of the approach towards arbitrary web pages, using microdata and RDFa annotations [20] as hooks for background knowledge.

REFERENCES
[1] Rakesh Agrawal, Tomasz Imieliński, and Arun Swami. 1993. Mining association rules between sets of items in large databases. 207–216.
[2] Matteo Cannaviccio, Lorenzo Ariemma, Denilson Barbosa, and Paolo Merialdo. 2018. Leveraging Wikipedia table schemas for knowledge graph augmentation. 1–6.
[3] Cuong Xuan Chu, Simon Razniewski, and Gerhard Weikum. 2019. TiFi: Taxonomy Induction for Fictional Domains. In
The World Wide Web Conference. 2673–2679.
[4] Luciano Del Corro and Rainer Gemulla. 2013. ClausIE: clause-based open information extraction. In The World Wide Web Conference. 355–366.
[5] Xin Luna Dong, Evgeniy Gabrilovich, Kevin Murphy, Van Dang, Wilko Horn, Camillo Lugaresi, Shaohua Sun, and Wei Zhang. 2015. Knowledge-Based Trust: Estimating the Trustworthiness of Web Sources. VLDB Endowment 8, 9 (2015), 938–949.
[6] Michael Färber, Achim Rettinger, and Boulos El Asmar. 2016. On emerging entity detection. In European Knowledge Acquisition Workshop. Springer, 223–238.
[7] Besnik Fetahu, Avishek Anand, and Maria Koutraki. 2019. TableNet: An approach for determining fine-grained relations for Wikipedia tables. In The World Wide Web Conference. 2736–2742.
[8] Luis Galárraga, Christina Teflioudi, Katja Hose, and Fabian M Suchanek. 2015. Fast rule mining in ontological knowledge bases with AMIE+. The VLDB Journal 24, 6 (2015), 707–730.
[9] Nicolas Heist, Sven Hertling, Daniel Ringler, and Heiko Paulheim. 2020. Knowledge Graphs on the Web – an Overview. Studies on the Semantic Web 47 (2020), 3–22.
[10] Nicolas Heist and Heiko Paulheim. 2019. Uncovering the Semantics of Wikipedia Categories. In International Semantic Web Conference. Springer, 219–236.
[11] Nicolas Heist and Heiko Paulheim. 2020. Entity Extraction from Wikipedia List Pages. In Extended Semantic Web Conference. Springer, 327–342.
[12] Sven Hertling and Heiko Paulheim. 2020. DBkWik: extracting and integrating knowledge from thousands of wikis. Knowledge and Information Systems 62, 6 (2020), 2169–2190.
[13] Jens Lehmann. 2009. DL-Learner: learning concepts in description logics. The Journal of Machine Learning Research 10 (2009), 2639–2642.
[14] Jens Lehmann et al. 2015. DBpedia – a large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
[15] Oliver Lehmberg and Christian Bizer. 2017. Stitching web tables for improving matching quality. VLDB Endowment 10, 11 (2017), 1502–1513.
[16] Guiliang Liu, Xu Li, Jiakang Wang, Mingming Sun, and Ping Li. 2020. Extracting Knowledge from Web Text with Monte Carlo Tree Search. In The Web Conference 2020. 2585–2591.
[17] Erin Macdonald and Denilson Barbosa. 2020. Neural Relation Extraction on Wikipedia Tables for Augmenting Knowledge Graphs. 2133–2136.
[18] Christian Meilicke, Melisachew Wudage Chekol, Daniel Ruffinelli, and Heiner Stuckenschmidt. 2019. Anytime Bottom-Up Rule Learning for Knowledge Graph Completion. 3137–3143.
[19] Pablo N Mendes et al. 2011. DBpedia spotlight: shedding light on the web of documents. 1–8.
[20] Robert Meusel, Petar Petrovski, and Christian Bizer. 2014. The webdatacommons microdata, rdfa and microformat dataset series. In International Semantic Web Conference. Springer, 277–292.
[21] Yaser Oulabi and Christian Bizer. 2019. Using weak supervision to identify long-tail entities for knowledge base completion. In International Conference on Semantic Systems. Springer, 83–98.
[22] Heiko Paulheim. 2017. Knowledge graph refinement: A survey of approaches and evaluation methods. Semantic Web 8, 3 (2017), 489–508.
[23] Heiko Paulheim and Johannes Fürnkranz. 2012. Unsupervised generation of data mining features from linked open data. 1–12.
[24] Heiko Paulheim and Simone Paolo Ponzetto. 2013. Extending DBpedia with Wikipedia List Pages. Vol. 1064. CEUR Workshop Proceedings, 85–90.
[25] Dominique Ritze, Oliver Lehmberg, Yaser Oulabi, and Christian Bizer. 2016. Profiling the potential of web tables for augmenting cross-domain knowledge bases. In The World Wide Web Conference. 251–261.
[26] Ahmad Sakor, Kuldeep Singh, Anery Patel, and Maria-Esther Vidal. 2020. Falcon 2.0: An entity and relation linking tool over Wikidata. 3141–3148.
[27] Gabriel Stanovsky, Julian Michael, Luke Zettlemoyer, and Ido Dagan. 2018. Supervised open information extraction. 885–895.
[28] Fabian M Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: a core of semantic knowledge. In The World Wide Web Conference. 697–706.
[29] Marieke Van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo, and Jörg Waitelonis. 2016. Evaluating entity linking: An analysis of current benchmark datasets and a roadmap for doing a better job. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16). 4373–4379.
[30] Johanna Völker and Mathias Niepert. 2011. Statistical schema induction. In Extended Semantic Web Conference. Springer, 124–138.
[31] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: a free collaborative knowledgebase. Commun. ACM 57, 10 (2014), 78–85.
[32] Quan Wang, Zhendong Mao, Bin Wang, and Li Guo. 2017. Knowledge graph embedding: A survey of approaches and applications. IEEE Transactions on Knowledge and Data Engineering 29, 12 (2017), 2724–2743.
[33] Bo Xu, Chenhao Xie, Yi Zhang, Yanghua Xiao, Haixun Wang, and Wei Wang. 2016. Learning defining features for categories. 3924–3930.
[34] Mohamed Yakout, Kris Ganjam, Kaushik Chakrabarti, and Surajit Chaudhuri. 2012. InfoGather: entity augmentation and attribute discovery by holistic matching with web tables. 97–108.
[35] Shuo Zhang et al. 2020. Novel Entity Discovery from Web Tables. In The Web Conference 2020. 1298–1308.
[36] Shuo Zhang and Krisztian Balog. 2020. Web Table Extraction, Retrieval, and Augmentation: A Survey. ACM Transactions on Intelligent Systems and Technology (TIST) 11, 2 (2020), 1–35.
[37] Shuo Zhang, Krisztian Balog, and Jamie Callan. 2020. Generating Categories for Sets of Entities. 1833–1842.
[38] Ziqi Zhang. 2014. Towards efficient and effective semantic table interpretation. In International Semantic Web Conference. Springer, 487–502.
[39] Ganggao Zhu and Carlos A Iglesias. 2018. Exploiting semantic similarity for named entity disambiguation in knowledge graphs.