Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases
Gerhard Weikum, Luna Dong, Simon Razniewski, Fabian Suchanek
Submitted to Foundations and Trends in Databases
Gerhard Weikum, Max Planck Institute for Informatics, [email protected]
Luna Dong, [email protected]
Simon Razniewski, Max Planck Institute for Informatics, [email protected]
Fabian Suchanek, Telecom Paris University, [email protected]
September 25, 2020
Abstract
Equipping machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics.
This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.
Enhancing computers with "machine knowledge" that can power intelligent applications is a long-standing goal of computer science [323]. This formerly elusive vision has become practically viable today, made possible by major advances in knowledge harvesting. This comprises methods for turning noisy Internet content into crisp knowledge structures on entities and relations. The knowledge harvesting methodology has enabled the automatic construction of knowledge bases (KB): collections of machine-readable facts about the real world. Today, publicly available KBs provide millions of entities (such as people, organizations, locations and creative works like books, music etc.) and billions of statements about them (such as who studied where, which country has which capital, or which singer performed which song). Proprietary KBs deployed at major companies comprise knowledge at an even larger scale, with one or two orders of magnitude more entities.
A prominent use case where knowledge bases have become a key asset is web search. When we send a query like "dylan protest songs" to Baidu, Bing or Google, we obtain a crisp list of songs such as Blowin' in the Wind, Masters of War, or A Hard Rain's a-Gonna Fall. So the search engine automatically detects that we are interested in facts about an individual entity – Bob Dylan in this case – and asks for specifically related entities of a certain type – protest songs – as answers. This is feasible because the search engine has a huge knowledge base in its back-end data centers, aiding in the discovery of entities in user requests (and their contexts) and in finding concise answers.
The KBs in this setting are centered on individual entities, containing (at least) the following backbone information:
• entities like people, places, organizations, products, events, such as Bob Dylan or the Stockholm City Hall
• the semantic classes to which entities belong, for example ⟨Bob Dylan, type, singer-songwriter⟩, ⟨Bob Dylan, type, poet⟩
• relationships between entities, such as ⟨Bob Dylan, created, Blowin' in the Wind⟩, ⟨Bob Dylan, won, Nobel Prize in Literature⟩
Some KBs also contain validity times such as
• ⟨Bob Dylan, married to, Sara Lownds, [1965,1977]⟩
This temporal scoping is optional, but very important for the life-cycle management of a KB as the real world evolves over time. In the same vein of long-term quality assurance, KBs may also contain constraints and provenance information.
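To make the shape of such entity-centric statements concrete, the following minimal sketch (our own illustration; the names Statement, subject, predicate and valid are chosen for exposition and not prescribed by any particular KB) shows how statements with an optional validity time could be represented in a few lines of Python.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Statement:
    subject: str                             # canonical entity identifier, e.g. "Bob_Dylan"
    predicate: str                           # relation identifier, e.g. "married_to"
    obj: str                                 # entity identifier or literal value
    valid: Optional[Tuple[int, int]] = None  # optional validity interval (start year, end year)

kb = {
    Statement("Bob_Dylan", "type", "singer-songwriter"),
    Statement("Bob_Dylan", "created", "Blowin'_in_the_Wind"),
    Statement("Bob_Dylan", "won", "Nobel_Prize_in_Literature"),
    Statement("Bob_Dylan", "married_to", "Sara_Lownds", valid=(1965, 1977)),
}

# Retrieve all statements about one subject.
for s in kb:
    if s.subject == "Bob_Dylan":
        print(s.predicate, s.obj, s.valid)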
The concept of a comprehensive KB goes back to pioneering work in Artificial Intelligence on universal knowledge bases in the 1980s and 1990s, most notably, the Cyc project at MCC in Austin [322] and the WordNet project at Princeton [159]. However, these knowledge collections have been hand-crafted and curated manually. Thus, the knowledge acquisition was inherently limited in scope and scale. With the Semantic Web vision in the early 2000s, domain-specific ontologies [551] have been developed, but these were also manually created. In the first decade of the 2000s, automatic knowledge harvesting from Web and text sources became a major research avenue, and has made substantial practical impact. Knowledge harvesting is the core methodology for the automatic construction of large knowledge bases, going beyond manually compiled knowledge collections like Cyc or WordNet.
These achievements are rooted in academic research and community projects. Salient projects that started in the 2000s are DBpedia [18], Freebase [51], KnowItAll [153], WebOfConcepts [110], WikiTaxonomy [456] and YAGO [562]. More recent projects with publicly available data include BabelNet [418], ConceptNet [548], DeepDive [527], EntityCube (aka. Renlifang) [425], KnowledgeVault [129], NELL [71], Probase [631], WebIsALOD [232], Wikidata [600], and XLore [617]. More on the history of KB technology can be found in the overview article [245].
At the time of writing this survey, the largest general-purpose KBs with publicly accessible contents are Wikidata (wikidata.org), BabelNet (babelnet.org), DBpedia (dbpedia.org), and YAGO (yago-knowledge.org). They contain millions of entities, organized in hundreds to hundreds of thousands of semantic classes, and hundreds of millions to billions of relational statements on entities. These and other knowledge resources are interlinked at the entity level, forming the Web of Linked Open Data [225, 244].
Over the 2010s, knowledge harvesting has been adopted at big industrial stakeholders [429], and large KBs have become a key asset in a variety of commercial applications, including semantic search (see, e.g., [35, 484]), analytics (e.g., aggregating by entities), recommendations (see, e.g., [195]), and data integration (i.e., to combine heterogeneous datasets in and across enterprises). Examples are the Google Knowledge Graph [536], the use of KBs in IBM Watson [163], the Amazon Product Graph [127, 133], the Alibaba e-Commerce Graph [360], the Baidu Knowledge Graph [26], Microsoft Satori [467], Wolfram Alpha [248] as well as domain-specific knowledge bases in business, finance, life sciences, and more (e.g., at Bloomberg [375]).
In addition, KBs have found wide use as a source of distant supervision for a variety of tasks in natural language processing, such as entity linking.
Knowledge bases enable or enhance a wide variety of applications.
Semantic Search and Question Answering:
All major search engines have some form of KB as a background asset. Whenever a user's information need centers around an entity or a specific type of entities, such as singers, songs, tourist locations, companies, products, sports events etc., the KB can return a precise and concise list of entities rather than merely giving "ten blue links" to web pages. The earlier example of asking for "dylan protest songs" is typical for this line of semantic search. Even when the query is too complex or the KB is not complete enough to enable entity answers, the KB information can help to improve the ranking of web-page results by considering the types and other properties of entities. Similar use cases arise in enterprises as well, for example, when searching for customers or products with specific properties, or when forming a new team with employees who have specific expertise and experience.
An additional step towards user-friendly interfaces is question answering (QA) where the user poses a full-fledged question in natural language and the system aims to return crisp entity-style answers from the KB or from a text corpus or a combination of both. An example for KB-based QA is "Which songs written by Bob Dylan received Grammys?"; answers include All Along the Watchtower, performed by Jimi Hendrix, which received a Hall of Fame Grammy Award. An ambitious example that probably requires tapping into both KB and text would be "Who filled in for Bob Dylan at the Nobel Prize ceremony in Stockholm?"; the answer is Patti Smith.
Overviews on semantic search and question answering with KBs include [35, 119, 320, 484, 589].
Language Understanding and Text Analytics:
Both written and spoken language are full of ambiguities. Knowledge is the key to mapping surface phrases to their proper meanings, so that machines interpret language as fluently as humans. AI-style use cases include machine translation, and conversational assistants like chatbots. Prominent examples include Amazon's Alexa, Apple's Siri, Google's Assistant and new chatbot initiatives [2], and Microsoft's Cortana.
In these applications, world knowledge plays a crucial role. Consider, for example, sentences like "Jordan holds the record of 30 points per match" or "The forecast for Jordan is a record high of 110 degrees". The meaning of the word "Jordan" can be inferred by having world knowledge about the basketball champion Michael Jordan and the Middle East country Jordan.
Understanding entities (and their properties and relations) in text is also key to large-scale analytics over news articles, scientific publications, review forums, or social media discussions. For example, we can identify mentions of products (and associated consumer opinions), link them to a KB, and then perform comparative and aggregated studies. We can even incorporate filters and groupings on product categories, geographic regions etc., by combining the textual information with structured data from the KB or from product and customer databases. All this can be enabled by the KB as a clean and comprehensive repository of entities (see, e.g., [523] for a survey on the core task of entity linking).
A trending example of semantic text analytics is detecting gender bias in news and other online content (see, e.g., [565]). By identifying people in the text and looking up their gender in the KB, we can compute statistics over male vs. female people in political offices or on company boards. If we also extract earnings from movies and ask the KB to give us actors and actresses, we can shed light on potential unfairness in the movie industry.
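As a toy illustration of how such background knowledge resolves ambiguity, the following sketch (entirely our own; the candidate entities and their context terms are hand-picked and hypothetical) scores candidates for the mention "Jordan" by word overlap with context terms that a KB could supply. Real entity-linking systems are far more elaborate (see [523]).

# Candidate entities with KB-derived context terms (hypothetical, hand-picked for illustration).
candidates = {
    "Michael_Jordan": {"basketball", "nba", "points", "record", "bulls", "match"},
    "Jordan_(country)": {"amman", "middle", "east", "forecast", "degrees", "record", "temperature"},
}

def disambiguate(sentence: str) -> str:
    words = set(sentence.lower().replace(".", "").split())
    # Pick the candidate whose KB context overlaps most with the sentence.
    return max(candidates, key=lambda e: len(candidates[e] & words))

print(disambiguate("Jordan holds the record of 30 points per match"))            # Michael_Jordan
print(disambiguate("The forecast for Jordan is a record high of 110 degrees"))   # Jordan_(country)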
Visual Understanding:
For detecting objects and concepts in images (and videos), computer vision has made great advances using machine learning. The training data for these tasks are collections of semantically annotated images, which can be viewed as visual knowledge bases. The most well-known example is ImageNet ([115]) which has populated a subset of WordNet concepts with a large number of example images. A more recent and more advanced endeavor along these lines is VisualGenome ([295]). By themselves, these assets already go a long way, but their value can be further boosted by combining them with additional world knowledge. For example, knowing that lions and tigers are both predators from the big cat family and usually prey on deer or antelopes, can help to automatically label scenes as "cats attack prey". Likewise, recognizing landmark sites such as the Brandenburg Gate and having background knowledge about them (e.g., other sites of interest in their vicinity) helps to understand details and implications of an image. In such computer vision and further AI applications, a KB often serves as an informed prior for machine learning models, or as a reference for consistency (or plausibility) checks.
Data Cleaning:
Coping with incomplete and erroneous records in large heterogeneous data is a classical topic in database research (see, e.g., [472]). The problem is more timely and pressing than ever. Data scientists and business analysts want to rapidly tap into diverse datasets, for comparison, aggregation and joint analysis. So different kinds of data need to be combined and fused, more or less on the fly and thus largely depending on automated tools. This trend amplifies the crucial role of identifying and repairing missing and incorrect values.
In many cases, the key to spotting and repairing errors, or to inferring missing values, is consistency across a set of records. For example, suppose that a database about music has a new tuple stating that Jeff Bezos won the Grammy Award. A background knowledge base would tell that Bezos is an instance of types like businesspeople, billionaires, company founders etc., but there is no type related to music. As the Grammy is given only for songs, albums and musicians, the tuple about Bezos is likely a data-entry error. In fact, the requirement that Grammy winners, if they are of type person, have to be musicians, can be encoded into a logical consistency constraint. Several KBs contain such consistency constraints. They typically include:
• type constraints, e.g.: a Grammy winner who belongs to the type person must also be an instance of type musician (or a sub-type),
• functional dependencies, e.g.: for each year and each award category, there is exactly one Grammy winner,
• inclusion dependencies, e.g.: composers are also musicians and thus can win a Grammy, and all Grammy winners must have at least one song to which they contributed (i.e., the set of Grammy winners is a subset of the set of people with at least one song),
• disjointness constraints, e.g.: songs and albums are disjoint, so no piece of music can simultaneously win both of these award categories for the Grammy,
• temporal constraints, e.g.: the Grammy award is given only to living people, so that someone who died in a certain year cannot win it in any later year.
Data cleaning as a key stage in data integration is surveyed by [256, 255] and the articles in [328].
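As a minimal sketch of the first kind of constraint above, assuming a toy dictionary of class memberships taken from a background KB (the entity names and type labels are our own), a type-constraint check could flag the erroneous tuple as follows.

# Hypothetical class memberships looked up in a background KB.
kb_types = {
    "Jeff_Bezos": {"person", "businessperson", "billionaire", "company_founder"},
    "Billie_Eilish": {"person", "musician", "singer"},
}

# Type constraint: a Grammy winner that is a person must also be a musician.
def violates_grammy_constraint(winner: str) -> bool:
    types = kb_types.get(winner, set())
    return "person" in types and "musician" not in types

new_tuples = [("Jeff_Bezos", "won", "Grammy_Award"),
              ("Billie_Eilish", "won", "Grammy_Award")]
for subj, pred, obj in new_tuples:
    if pred == "won" and obj == "Grammy_Award" and violates_grammy_constraint(subj):
        print("likely data-entry error:", subj)   # flags Jeff_Bezos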
Are knowledge bases part of the Semantic Web, and can they be used only for Semantic Web applications?
The big revival of knowledge bases in this millennium originated from research projects in the Semantic Web community. This is why some design choices favor models and methods from the Semantic Web. For example, the RDF data model is popular among KBs, and the query language of choice is often close to the SPARQL language rather than SQL. However, it is easy to move KBs into the ecosystem of other data models and their tool suites. In particular, KBs can be stored, accessed and managed by relational database systems as well, and can be used also with NoSQL platforms such as Apache Spark and with cloud-based services. Likewise, it is easy to combine KBs with popular tools for machine learning such as TensorFlow, SciPy etc.
How is knowledge harvesting related to the field of information extraction?
Information extraction (IE) (see, e.g., [511, 162, 208]) comprises methodologies for recognizing and semantically annotating meaningful units in natural language and other kinds of noisy contents (e.g., ad-hoc tables in web pages, or query-and-click logs), building on text mining and machine learning. Given an arbitrary input, IE aims at a best-effort job on computing value-added mark-up. Knowledge harvesting leverages IE methods, but it is output-driven. To construct a high-quality KB, judicious choices about input sources and extraction strategies are crucial considerations. For example, we often want to "pick low-hanging fruit" first, for high quality, and tap into noisier sources only subsequently and with specific focus and customized techniques.
Why do we need machine knowledge, when we already have end-to-end machine learning working so well?
Machine learning (ML), especially deep neural networks, works well when there is sufficient training data with gold-standard labels. However, there are fundamental reasons why ML alone is not a full solution. First, training data is the typical bottleneck when tackling new applications, so that a lot of time and money needs to be spent on compiling and organizing the relevant data. This cost arises in each and every application again and again. Machine knowledge is an easily re-usable, versatile asset that can simplify and accelerate these expensive steps. Second, even the best deep learning methods are far from near-human quality in identifying crisp statements in complex, noisy and ambiguous texts and other input sources. There are fundamental reasons for these limitations: ML is based on the paradigm of learning with iid samples from a data distribution and applying the trained model to new samples from the same distribution (iid = independently identically distributed). In contrast, humans do not rely on situative data alone, but interpret observations and learn with a rich body of background experience – the human's general world knowledge. Therefore, machine learning and machine knowledge are complementary pillars of modern AI. The more a machine knows, the better it can learn; and better learning enables acquiring more and deeper knowledge.
Are knowledge bases simply some sort of databases?
Knowledge bases are data, and can, of course, be stored in standard databases. But their scope and use cases make them special in several ways. First, a KB is a reference repository of entities, types and vocabulary with open-ended scope, modeling broad domains or enterprise-wide knowledge or even aiming for universal encyclopedic coverage. To this end, KBs include a rich taxonomy of types, far more expressive than the schemas of usual databases. Second, as a consequence, KBs operate under the Open World Assumption, which allows data to be incomplete, and KBs are grown with continuous curation. Third, to fulfill these requirements, KBs need to continuously adapt and enhance their schemas for types and properties, following the paradigm of agile data spaces [206] rather than traditional "schema first" databases. This aspect led to the pragmatic choice for the RDF data model with binary relations, to allow new types and new properties to be added in a light-weight manner.
Are knowledge bases useful for enterprises?
KBs are useful as reference data in many ways. They contain encyclopedic knowledge about the world's notable entities, including people, places, events, and organizations. This can serve as background knowledge in enterprises and their applications. In the travel and tourism industry, for example, KBs can contribute knowledge about vacation sites, natural or cultural points of interest, and geography. Several of the publicly available KBs are rich in this kind of geo-spatial knowledge. Industrial applications can combine this with their data about hotels, flights, commercial tours etc.
There are also KBs specifically geared for a vertical domain or a company. In the health domain, for example, KBs can collect background knowledge about diseases, drugs, symptoms, therapies and their properties. In a company, a KB can provide relevant knowledge about customers, products, product categories, sales regions, and so on.
Industrial applications require involving domain experts within the enterprise. The methodology presented in this article can help such teams automate and accelerate their endeavors.
Can high-tech startups benefit from knowledge bases?
KBs contribute to methods and tools for language understanding, data cleaning, machine-learning-based AI, semantic reasoning, and knowledge modeling. Therefore, several startups have set out to offer KB technology as a product. Some of these turned into big success stories: Freebase (by Metaweb Technologies, Inc.) was acquired by Google and kick-started the Google Knowledge Graph, and DeepDive (by Lattice Data, Inc.) was acquired by Apple.
This article covers methods for automatically constructing and curating large knowledge bases from web and text sources. We hope that it will be useful for doctoral students and faculty interested in a wide spectrum of topics – from machine knowledge and data quality to machine learning and data science as well as applications in web content mining and natural language understanding. In addition, this article aims to be useful also for industrial researchers and practitioners working on semantic technologies for web, social media, or enterprise contents, including all kinds of applications where sense-making from text or semi-structured data is an issue. Prior knowledge on natural language processing or statistical learning is not required; we will introduce relevant methods as they are needed (or at least give specific pointers to literature).
The article is organized into ten chapters. Chapter 2 gives foundational basics on knowledge representation and discusses the design space for building a KB. Chapters 3, 4 and 5 cover the methodology for constructing the core of a KB that comprises entities and types. Chapter 3 discusses tapping premium sources with rich and clean semi-structured contents, and Chapter 4 addresses knowledge harvesting from textual contents. Chapter 5 specifically focuses on the important issue of canonicalizing entities into unique representations. Chapters 6 and 7 extend the scope of the KB by methods for discovering and extracting attributes of entities and relations between entities. Chapter 6 focuses on the case where a schema is designed upfront for the properties of interest. Chapter 7 discusses the case of discovering new property types for attributes and relations that are not (yet) specified in the KB schema. Chapter 8 discusses the issue of quality assurance for KB curation and the long-term maintenance of KBs. Chapter 9 presents several case studies on specific KBs including industrial knowledge graphs (KGs). We conclude in Chapter 10 with key lessons and an outlook on where the theme of machine knowledge may be heading.
Knowledge bases, KBs for short, comprise salient information about entities, semantic classes to which entities belong, attributes of entities, and relationships between entities. When the focus is on classes and their logical connections such as subsumption and disjointness, knowledge repositories are often referred to as ontologies. In database terminology, this is referred to as the schema. The class hierarchy alone is often called a taxonomy. The notion of KBs in this article covers all these aspects of knowledge, including ontologies and taxonomies.
This chapter presents foundations for casting knowledge into formal representations. Knowledge representation has a long history, spanning decades of AI research, from the classical model of frames to recent variants of description logics. Overviews on this spectrum are given by [504] and [551]. In this article, we restrict ourselves to the knowledge representation that has emerged as a pragmatic consensus for entity-centric knowledge bases (see [564] for an extended discussion). More on the wide spectrum of knowledge modeling can be found in the survey [245].
The most basic element of a KB is an entity.
An entity is any abstract or concrete object of fiction or reality.
This definition includes people, places, products and also events and creative works (books, poems, songs etc.), real people (living or dead) as well as fictional people (e.g., Harry Potter), and also general concepts such as empathy and Buddhism. KBs take a pragmatic approach: they model only entities that match their scope and purpose. A KB on writers and their biographies would include Shakespeare and his drama Macbeth, but it may not include the characters of the drama such as King Duncan or Lady Macbeth. However, a KB for literature scholars – who want to analyze character relationships in literature content – should include all characters from Shakespeare's works.
Individual Entities (aka. Named Entities):
We often narrow down the set of entities of interest by emphasizing uniquely identifiable entities and distinguishing them from general concepts.
An individual entity is an entity that can be uniquely identified against all other entities.
To uniquely identify a location, we can use its geo-coordinates: longitude and latitude with a sufficiently precise resolution. To identify a person, we would – in the extreme case – have to use the person's DNA sequence, but for all realistic purposes a combination of full name, birthplace and birthdate is sufficient. In practice, we are typically even coarser and just use location or person names when there is near-universal social consensus about what or who is denoted by the name. Wikipedia article names usually follow this principle of using names that are sufficiently unique. For these reasons, individual entities are also referred to as named entities.
Identifiers and Labels:
To denote an entity unambiguously, we need a name that can refer to only a single entity. Such an identifier can be a unique name, but identifiers can also be specifically introduced keys such as URLs for web sites, ISBNs for books (which can even distinguish different editions of the same book), DOIs for publications, Google Scholar URLs or ORCID IDs for authors, etc. In the data model of the Semantic Web, RDF (for Resource Description Framework) [602], identifiers always take the form of URIs (Uniform Resource Identifiers, a generalization of URLs).
An identifier for an entity is a string of characters that uniquely denotes the entity.
As identifiers are not necessarily of a form that is nicely readable and directly interpretable by a human, we often want to have human-readable labels or names in addition. An entity can have several such labels, called synonyms or alias names (different names, same meaning). When different entities share a label such as "Hamlet", this label is a homonym (same name, different meanings).
Entities of interest often come in groups where all elements have a shared characteristic. For example, Bob Dylan, Elvis Presley and Lisa Gerrard are all musicians and singers. We capture this knowledge by organizing entities into classes or, synonymously, types.
A class (or type) is a named set of entities that share a common trait. An element of that set is called an instance of the class.
For Dylan, Presley and Gerrard, classes of interest include: musicians and singers with all three as instances, guitarists with Dylan and Presley as instances, men with these two, women containing only Gerrard, and so on.
Note that an entity can belong to multiple classes, and classes can relate to each other in terms of their members as being disjoint (e.g., men and women), overlapping, or one subsuming the other (e.g., musicians and guitarists). Classes can be quite specific; for example, left-handed electric guitar players could be a class in the KB containing Jimi Hendrix, Paul McCartney and others.
It is not always obvious whether something should be modeled as an entity or as a class. We could construct, for every entity, a singleton class that contains just this entity. Classes of interest typically have multiple instances, though. By this token, we do not consider general concepts such as love, Buddhism or pancreatic cancer as classes, unless we were interested in specific instances (e.g., the individual cancer of one particular patient).
Taxonomies (aka. Class Hierarchies):
By relating the instance sets of two classes, we can specify invariants that must hold between the classes, most notably, subsumption, also known as the subclass/superclass relation. By combining these pairwise invariants across all classes, we can thus construct a class hierarchy. We refer to this aspect of the KB as a taxonomy.
Class A is a subclass of (is subsumed by) class B if all instances of A must also be instances of B.
For example, the classes singers and guitarists are subclasses of musicians because every singer and every guitarist is a musician. We say that class X is a direct subclass of Y if there is no other class that subsumes X and is subsumed by Y. Classes can have multiple superclasses, but there should not be any cycles in the subsumption relation. For example, left-handed electric guitar players are a subclass of both left-handed people and guitarists.
A taxonomy is a directed acyclic graph, where the nodes are classes and there is an edge from class X to class Y if X is a direct subclass of Y.
Note that these invariants do not just describe the current instances of such class pairs, but actually prescribe that the invariant holds for all possible instance sets. So the taxonomy acts like a database schema, and is instrumental for keeping the KB consistent. In database terminology, a subclass/superclass pair is also called an inclusion dependency. In the Semantic Web, the RDFS extension of the RDF model (RDFS = RDF Schema) allows specifying such constraints [604]. Other Semantic Web models, notably OWL, also support disjointness constraints (aka. mutual exclusion), for example, specifying that men and women are disjoint.
One of the largest taxonomic repositories is the WordNet lexicon [159], comprising more than 100,000 classes. Figure 2.1 visualizes an excerpt of the WordNet taxonomy. The nodes are classes, called word senses in WordNet; the edges indicate subsumption.
Figure 2.1: Excerpt from the WordNet Taxonomy. (Figure omitted in this text version; it shows a fragment of the class hierarchy under Person, with subclasses such as Creator, Artist, Scientist, Adventurer and Musician, and instrument-specific musician classes such as Accordionist, Bassist, Cellist, Guitarist and Harpist, connected by subsumption edges.)
In linguistic terminology, this lexical relation is called hypernymy: an edge connects a more special class, called hyponym, with a generalized class, called hypernym. [203] gives an overview of this kind of lexical resource. Further examples of (potentially unclean) taxonomies include the Wikipedia category system or product catalogs.
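As a quick way to inspect such hypernym chains, the following sketch uses the open-source nltk library and its WordNet interface (this assumes nltk and its WordNet corpus are installed; the synset names are WordNet's own, and the exact chain printed may vary across WordNet versions).

# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

guitarist = wn.synset("guitarist.n.01")
# Walk the hypernym (superclass) chain upwards until a root synset is reached.
chain = [guitarist]
while chain[-1].hypernyms():
    chain.append(chain[-1].hypernyms()[0])
print(" -> ".join(s.name() for s in chain))
# e.g. guitarist.n.01 -> musician.n.01 -> ... -> person.n.01 -> ... -> entity.n.01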
Subsumption vs. Part-Whole Relation:
Class subsumption should not be confused or conflated with the relationship between parts and wholes. For example, a soprano saxophone is part of a jazz band. This does not mean, however, that every soprano saxophone is a jazz band. Likewise, New York is part of the USA, but New York is not a subclass of the USA.
Instance-of vs. Subclass-of:
Some KBs do not make a clear distinction between classes and instances, and they collapse the instance-of and subclass-of relations into a single is-a hierarchy. Instead of stating that Bob Dylan is an instance of the class singers and that singers are a subclass of musicians, they would view all three as general entities and connect them in a generalization graph.
Entities have properties such as birthdate, birthplace and height of a person, prizes won, books, songs or software written, and so on. KBs capture these in the form of mathematical relations:
A relation or relationship for the instances of classes C1, ..., Cn is a subset of the Cartesian product C1 × ... × Cn, along with an identifier (i.e., unique name) for the relation.
For example, we can state the birthdate and birthplace of Bob Dylan in the relational form:
⟨Bob Dylan, 1941-05-24, Duluth (Minnesota)⟩ ∈ birth
where birth is the identifier of the relation. This instance of the birth relation is a ternary tuple, that is, it has three arguments: the person entity, the birthdate, and the birthplace. The underlying Cartesian product of the relation is persons × dates × cities.
In logical notation, we also write R(x1, ..., xn) instead of ⟨x1, ..., xn⟩ ∈ R, and we refer to R as a predicate. The number of arguments, n, is called the arity of R. The domain C1 × ... × Cn is also called the relation's type signature.
As most KBs are of encyclopedic nature, the instances of a relation are often referred to as facts. We do not want to exclude knowledge that is not fact-centric (e.g., commonsense knowledge with a socio-cultural dimension); so we call relational instances more generally statements. The literature also speaks of facts, and sometimes uses the terminology assertion as well. For this article, the three terms statement, fact and assertion are more or less interchangeable.
In logical terms, statements are grounded expressions of first-order predicate logic (where "grounded" means that the expression has no variables). In the KB literature, the term "relation" is sometimes used to denote both the relation identifier R and an instance ⟨x1, ..., xn⟩. We avoid this ambiguity, and more precisely speak of the relation and its (relational) tuples.
Attributes of Entities:
In the above example about the birth relation, we made use of the class dates. By stating this, we consider individual dates, such as 1941-05-24, entities. It is a design choice whether we regard numerical expressions like dates, heights or monetary amounts as entities or not. Often, we want to treat them simply as values for which we do not have any additional properties. In the RDF data model, such values are called literals. Strings such as nicknames of people (e.g., "Air Jordan") are another frequent type of literals.
We introduce a special class of relations with two arguments where the first argument is an entity of interest, such as Bob Dylan or Michael Jordan (the basketball player), and the second argument is a value of interest, such as their heights "171 cm" and "198 cm", respectively.
The case for binary relations with values as second argument largely corresponds to the modeling of entity attributes in database terminology. Such relations are restricted to be functions: for each entity as first argument there is only one value for the second argument. We denote attributes in the same style as other relational properties, but we use numeric or string notation to distinguish the literals from entities:
⟨Michael Jordan, 198cm⟩ ∈ height or height(Michael Jordan, 198cm)
⟨Michael Jordan, "Air Jordan"⟩ ∈ nickname or nickname(Michael Jordan, "Air Jordan")
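A minimal sketch of this functionality requirement (using the height values quoted above and our own function naming) checks that no entity is assigned two distinct values for the same attribute.

height = [("Michael_Jordan", "198 cm"), ("Bob_Dylan", "171 cm")]

def is_functional(pairs):
    # An attribute relation is a function if no entity has two distinct values.
    seen = {}
    for entity, value in pairs:
        if entity in seen and seen[entity] != value:
            return False
        seen[entity] = value
    return True

print(is_functional(height))                               # True
print(is_functional(height + [("Bob_Dylan", "180 cm")]))   # False: second, conflicting value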
Relations between Entities:
In addition to their attributes, entities are characterized by their relationships with other entities, for example, the birthplaces of people, prizes won, songs written or performed, and so on. Mathematical relations over classes, as introduced above, are the proper formalism for representing this kind of knowledge. The frequent case of binary relations captures the relationship between exactly two entities.
Some KBs focus exclusively on binary relations, and the Semantic Web data model RDF has specific terminology and formal notation for this case of so-called subject-predicate-object triples, or SPO triples, or merely triples for short.
The RDF model restricts the three roles in a subject-predicate-object (SPO) triple as follows:
• S must be a URI identifying an entity,
• P must be a URI identifying a relation, and
• O must be a URI identifying an entity for a relationship between entities, or a literal denoting the value of an attribute.
As binary relations can be easily cast into a labeled graph – with node labels for S and O and edge labels for P – knowledge bases that focus on SPO triples are widely referred to as knowledge graphs. SPO triples are often written in the form ⟨S, P, O⟩ or as S P O, with P being the relation between subject and object. Examples of SPO triples are:
Bob Dylan married to Sara Lownds
Bob Dylan composed Blowin' in the Wind
Blowin' in the Wind composed by Bob Dylan
Bob Dylan has won Nobel Prize in Literature
Bob Dylan type Nobel Laureate
The examples also illustrate the notion of inverse relations: composed by is inverse to composed, and can also be written as composed⁻¹: ⟨S, O⟩ ∈ P ⇔ ⟨O, S⟩ ∈ P⁻¹.
The last example in the above table shows that an entity belonging to a certain class can also be written as a binary relation, with type as the predicate, following the RDF standard. It also shows that knowledge can sometimes be expressed either by class membership or by a binary-relation property. In this case, the latter adds information (Nobel Prize in Literature) and the former is convenient for querying (about all Nobel Laureates). Moreover, having a class Nobel Laureate allows us to define further relations and attributes with this class as domain. To get the benefits of all this, we may want to have both of the example triples in the KB.
An advantage of binary relations is that they can express facts in a self-contained manner, even if some of the arguments for a higher-arity relation are missing or the instances of the relations are only partly known. For example, if we know only Dylan's birthplace but not his birthdate (or vice versa), capturing this in the ternary relation birth is a bit awkward as the unknown argument would have to be represented as a null value (i.e., a placeholder for an unknown or undefined value). In database systems, null values are standard practice, but they often make things complicated. In KBs, the common practice is to avoid null values and prefer binary relations where we can simply have a triple for the known argument (birthplace) and nothing else.
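As one concrete (but by no means mandated) way to work with such triples in practice, the following sketch uses the open-source rdflib library to store three of the example triples under a made-up namespace and to retrieve them with a SPARQL query; the namespace URI and entity identifiers are our own.

# pip install rdflib
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/kb/")
g = Graph()
g.add((EX.Bob_Dylan, EX.composed, EX.Blowin_in_the_Wind))
g.add((EX.Bob_Dylan, EX.has_won, EX.Nobel_Prize_in_Literature))
g.add((EX.Bob_Dylan, RDF.type, EX.Nobel_Laureate))

# SPARQL query over the small graph: everything stated about Bob Dylan.
query = "SELECT ?p ?o WHERE { <http://example.org/kb/Bob_Dylan> ?p ?o }"
for p, o in g.query(query):
    print(p, o)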
Higher-Arity Relations:
Some KBs emphasize binary relations only, leading to the notion of knowledge graphs (KGs). However, ternary and higher-arity relations can play a big role, and these cannot be directly captured by a graph.
At first glance, it may seem that we can always decompose a higher-arity relation into multiple binary relations. For example, instead of introducing the ternary relation birth: person × date × city, we can alternatively use two binary relations: birthdate: person × date and birthplace: person × city. In this case, no information is lost by using simpler binary relations. Another case where such decomposition works well is the relation that contains all tuples of parents, sons and daughters: children: person × boys × girls. This could be equally represented by two separate relations sons: person × boys and daughters: person × girls. In fact, database design theory tells us that this decomposition is a better representation, based on the notion of multi-valued dependencies [588].
However, not every higher-arity relation is decomposable without losing information. Consider a quaternary relation won: person × award × year × field capturing who won which prize in which year for which scientific field. Instances would include ⟨Marie Curie, Nobel Prize, 1903, physics⟩ and ⟨Marie Curie, Nobel Prize, 1911, chemistry⟩. If we simply split these 4-tuples into a set of binary-relation tuples (i.e., SPO triples), we would end up with:
⟨Marie Curie, Nobel Prize⟩, ⟨Marie Curie, 1903⟩, ⟨Marie Curie, physics⟩,
⟨Marie Curie, Nobel Prize⟩, ⟨Marie Curie, 1911⟩, ⟨Marie Curie, chemistry⟩.
Leaving the technicality of two identical tuples aside, the crux here is that we can no longer reconstruct in which year Marie Curie won which of the two prizes. Joining the binary tuples using database operations would produce spurious tuples, namely, all four combinations of 1903 and 1911 with physics and chemistry.
The Semantic Web data model RDF and its associated W3C standards (including the SPARQL query language) support only binary relations. They therefore exploit clever ways of encoding higher-arity relations into a binary representation, based on techniques related to reification [603]. Essentially, each instance of the higher-arity relation is given an identifier of type statement and that identifier is combined with the original relation's arguments into a set of binary tuples.
For the n-ary relation instance R(X1, X2, ..., Xn) the reified representation consists of the set of binary instances type(id, statement), arg1(id, X1), arg2(id, X2), ..., argn(id, Xn), where id is an identifier.
With this technique, the triple ⟨id type statement⟩ asserts the existence of the higher-arity tuple, and the additional triples fill in the arguments. In some KBs, the technique is referred to as compound objects, as the ⟨id type statement⟩ is expanded into a set of facets, often called qualifiers, with the number of facets even being variable (see, e.g., [230]). Reification can be applied to binary relations as well (if desired): the representation of ⟨S P O⟩ then becomes ⟨id type statement⟩, ⟨id hasSubject S⟩, ⟨id hasPredicate P⟩, ⟨id hasObject O⟩. Our use case for reification is higher-arity relations, though, most importantly, to capture events and their different aspects. Attaching provenance or belief information to statements is another case for reification.
The identifier id could be a number or a URI (as required by RDF).
The names of the facets of arguments argi (i = 1..n) can be arbitrarily chosen, but often capture certain properties that can be aptly reflected in their names. For example, a tuple for the higher-arity relation wonAward may result in the following triples:
id type statement
id hasPredicate wonAward
id winner Marie Curie
id award Nobel Prize
id year 1903
id field physics
The example additionally includes a triple that encodes the name of the property wonAward for which the n-tuple holds. Strictly speaking, we could then drop the ⟨id type statement⟩ triple without losing information, and the remaining triples are a typical knowledge representation for n-ary predicates (e.g., for events): one triple for the predicate itself and one for each of the n-ary predicate arguments. If we want to emphasize two of the arguments as major subject and object, we could also use a hybrid form with triples like ⟨id hasSubject Marie Curie⟩, ⟨id hasObject Nobel Prize⟩, ⟨id year 1903⟩, ⟨id field physics⟩.
The advantage of the triples representation is that it stays in the world of binary relations, and the notion of a knowledge graph still applies. In a graph model, the SPO triple that encodes the existence of the original n-ary R instance is often called a compound node (e.g., in Freebase and in Wikidata), and serves as the "gateway" to the qualifier triples.
The downside of reification and related techniques for casting n-ary relations into RDF is that they make querying more difficult, if not tedious. It requires more joins, and considering paths and not just single edges when dealing with compound nodes in the graph model. For this reason, some KBs have also pursued hybrid representations where for each higher-arity relation, the most salient pair of arguments are represented as a standard binary relation and reification is used only for the other arguments.
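A small sketch of this reification scheme (our own helper function and naming, mirroring the triples above) expands an n-ary statement into a statement identifier plus one triple per facet.

import itertools

_counter = itertools.count(1)

def reify(predicate, **facets):
    """Turn an n-ary statement into binary (S, P, O) triples via a fresh statement id."""
    stmt_id = f"stmt_{next(_counter)}"
    triples = [(stmt_id, "type", "statement"), (stmt_id, "hasPredicate", predicate)]
    triples += [(stmt_id, facet, value) for facet, value in facets.items()]
    return triples

for triple in reify("wonAward", winner="Marie_Curie", award="Nobel_Prize",
                    year=1903, field="physics"):
    print(triple)
# ('stmt_1', 'type', 'statement'), ('stmt_1', 'hasPredicate', 'wonAward'),
# ('stmt_1', 'winner', 'Marie_Curie'), ('stmt_1', 'award', 'Nobel_Prize'),
# ('stmt_1', 'year', 1903), ('stmt_1', 'field', 'physics')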
An important objective for a clean knowledge base is the uniqueness of the subjects, predicates and objects in SPO triples and other relational statements. We want to capture every entity and every fact about it exactly once, just like an enterprise database should contain every customer and her orders and account balance once and only once. As soon as redundancy creeps in, this opens the door for variations of the same information and hence potential inconsistency. For example, if we include two entities in our KB, Bob Dylan and Robert Zimmerman (Dylan's real name), without knowing that they are the same, we could attach different facts to them that may eventually contradict each other. Furthermore, we would distort the results of counting queries (counting two people instead of one person). This motivates the following canonicalization principle:
Each entity, class and property in a KB is canonicalized by having a unique identifier and being included exactly once.
For entities this implies the need for named entity disambiguation, also known as entity linking ([523]). For example, we need to infer that Bob Dylan and Robert Zimmerman are the same person and should have him as one entity with two different labels rather than two entities with different identifiers. The same principle should hold for classes, for example, avoiding that we have both guitarists and guitar players, and for properties as well.
We strive to avoid redundancy and the resulting ambiguities and potential inconsistencies. However, this goal is not always perfectly achievable in the entire life-cycle of KB creation, growth and curation. Some KBs settle for softer standards and allow diverse representations for the same facts to co-exist, effectively demoting entities and relations into literal values. Here is an example of such a softer (and hence less desirable, but still useful) representation:
Bob Dylan has won "Nobel Prize in Literature"
Bob Dylan has won "Literature Nobel Prize"
Bob Dylan has won award "Nobel"
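A toy sketch of the normalization that canonicalization requires (the alias table is hypothetical and hand-written; real KBs use entity linking as discussed in Chapter 5) maps such surface forms to one canonical identifier before statements are added.

# Hypothetical alias table mapping surface forms to canonical entity identifiers.
aliases = {
    "nobel prize in literature": "Nobel_Prize_in_Literature",
    "literature nobel prize": "Nobel_Prize_in_Literature",
    "nobel": "Nobel_Prize_in_Literature",   # in practice this short form would need context to resolve
}

def canonicalize(label: str) -> str:
    return aliases.get(label.strip().lower(), label)

raw = [("Bob_Dylan", "has_won", "Nobel Prize in Literature"),
       ("Bob_Dylan", "has_won", "Literature Nobel Prize")]
canonical = {(s, p, canonicalize(o)) for s, p, o in raw}
print(canonical)   # collapses to a single canonical statement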
In addition to the grounded statements about entities, classes and relational properties, KBs can also contain intensional knowledge in the form of logical constraints and rules. The purpose of constraints is to enforce the consistency of the KB: grounded statements that violate a constraint cannot be entered. For example, we do not allow a second birthdate for a person, as the birthdate property is a function, and we require creators of songs to be musicians (including composers and bands). The former is an example of a functional dependency, and the latter is an example of a type constraint. We discuss consistency constraints and their crucial role for KB curation in Chapter 8 (particularly, Section 8.3.1). The purpose of rules is to derive additional knowledge that logically follows from what the KB already contains. For example, if Bob is married to Sara, then Sara is married to Bob, by the symmetry of the spouse relation. We discuss rules in Section 8.3.2.
Machines cannot create any knowledge on their own; all knowledge about our world is created by humans (and their instruments) and documented in the form of encyclopedias, scientific publications, books, daily news, all the way to contents in online discussion forums and other social media. What machines actually do to construct a knowledge base is to tap into these sources, harvest their nuggets of knowledge, and refine and semantically organize them into a formal knowledge representation. This process of distilling knowledge from online contents is best characterized as knowledge harvesting [622, 621, 620].
This big picture opens up a design space for how we go about harvesting online contents towards large-scale knowledge bases. Depending on the kinds of sources we tap into and the standards that we set for the output we want to achieve, there is a variety of design choices. Figure 2.2 depicts this design space. Note that the connections between methods and their associated inputs and outputs are merely indicative for typical approaches; they are not meant to be exhaustive. For example, NLP tools and Deep Learning are useful for discovering entities and their types as well, but they face higher complexity and usually yield lower quality than the simpler methods based on rules and patterns.
Input Sources:
There is a wide spectrum of input sources to be considered. The top part of Figure 2.2 shows some notable design points, with difficulty increasing from left to right. The difficulties arise from the decreasing ratio of valuable knowledge to noise in the respective sources.
To build a high-quality KB, we advocate to start with the cleanest sources, called premium sources in the figure. These include well-organized and curated encyclopedic content like Wikipedia. For example, Wikipedia's set of article names is a great source for KB construction, as these names constitute the world's notable entities in reasonably standardized form: millions of entities with human-readable unique labels. First harvesting these entities (and cues about their classes, e.g., via Wikipedia categories) forms a strong backbone for subsequent extensions and refinements. This design choice can be seen as an instantiation of the folk wisdom to "pick low-hanging fruit" first, widely applied in systems engineering.

Figure 2.2: Design Space for Knowledge Harvesting. (Figure omitted in this text version; it connects inputs of increasing difficulty, namely premium sources (Wikipedia, WordNet, GeoNames, Librarything, ...), semi-structured data (infoboxes, tables, lists, ...), high-quality text (news articles, Wikipedia, ...), difficult text (books, interviews, ...), text documents and web pages, online forums and social media, and queries and clicks, via methods, namely rules and patterns, logical inference, statistical inference, NLP tools, and deep learning, to outputs, namely entity names and classes, entities in taxonomy, relational statements, invariants and constraints, and canonicalized statements.)

Beyond Wikipedia as a general-purpose source, knowledge harvesting should generally start with the most authoritative high-quality sources for the domains of interest. For example, for a KB about movies, it would be a mistake to disregard IMDB as it is the world's largest and cleanest – manually constructed – repository of movies, characters, actors, keywords and phrases about movie plots, etc. Likewise, we must not overlook sources like GeoNames and OpenStreetMap for geographic entities, GoodReads and Librarything for books, MusicBrainz (or even Spotify's catalog) for music, DrugBank for medical drugs, and so on. Note that some sources may be proprietary and require licensing.
After considering premium sources, the next step is to tap into semi-structured elements in online data, like infoboxes (in Wikipedia and other wikis), tables, lists, headings, category systems, etc. This is almost always easier than comprehending textual contents. However, if we aim at rich coverage of facts about entities, we eventually have to extract knowledge from natural-language text as well – ranging from high-quality sources like Wikipedia articles all the way to user contents in online forums and other social media. Finally, mass-user data about online behavior – like queries and clicks – is yet another source, which comes with a large amount of noise and potential bias.
From these considerations it should be obvious that we generally face a precision-recall trade-off. In Figure 2.2, the precision of extracted knowledge tends to decrease from left to right, and the recall typically increases from left to right (with exceptions, though). That is, sources on the left end often yield highly accurate KBs but limited coverage, whereas sources on the right end usually yield less accurate KBs but with higher coverage. We define precision and recall as follows:
The precision of a KB of statements is the ratio (number of correct statements in the KB) / (number of statements in the KB).
The recall of a KB of statements is the ratio (number of correct statements in the KB) / (number of correct statements in the real world).
Precision can be evaluated by inspecting a KB alone, but recall can only be estimated, as the complete real-world knowledge is not known to us. However, we can sample on a per-entity basis and compare what a KB knows about an entity (in the form of relational tuples) against what a human can learn from reading the Wikipedia article or other high-coverage sources about the entity.
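A minimal sketch of such an evaluation (our own function names; the statements are toy examples) computes precision from a manually assessed sample of KB statements and a per-entity recall estimate against a human-compiled reference set.

def precision_estimate(sampled_statements, is_correct):
    # Fraction of a manually assessed sample of KB statements that is judged correct.
    return sum(1 for s in sampled_statements if is_correct(s)) / len(sampled_statements)

def recall_estimate(kb_statements, reference_statements):
    # Fraction of the reference statements for one entity (e.g., distilled by a human
    # from the entity's Wikipedia article) that the KB also contains.
    return len(kb_statements & reference_statements) / len(reference_statements)

kb = {("Bob_Dylan", "born_in", "Duluth"),
      ("Bob_Dylan", "won", "Nobel_Prize_in_Literature")}
reference = kb | {("Bob_Dylan", "created", "Blowin'_in_the_Wind")}
print(precision_estimate(list(kb), lambda s: True))   # 1.0 if an assessor confirms both sampled statements
print(recall_estimate(kb, reference))                 # 2 of 3 reference statements covered, ~0.67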
Output Scope and Quality:
Depending on our goals on precision and recall of a KB and our choice on dealing with the trade-off, we can expect different kinds of outputs from a knowledge-harvesting machinery. Figure 2.2 lists major options in the bottom part.
The minimum that every KB-building endeavor should have is that all entities in the KB are semantically typed in a system of classes. Without any classes, the KB would merely be a flat collection of entities, and a lot of the mileage that search applications get from KBs is through the class system. For example, queries or questions about "singers who are also poets" can be answered by intersecting the entities of two classes (returning, e.g., Bob Dylan, Leonard Cohen, Serge Gainsbourg).
As for the entities themselves, some KBs do not normalize, lacking unique identifiers and including real-world duplicates. Such a KB of names, containing, for example, both Bob Dylan and Robert Zimmerman as if they were two entities (see Section 2.1.4), can still be useful for many applications. However, a KB with disambiguated names and canonicalized representation of entities clearly offers more value for high-quality use cases such as data analytics.
Larger coverage and application value come from capturing also properties of entities, in the form of relational statements. Often, but not necessarily, this goes hand in hand with logical invariants about properties, which could be acquired by rule mining or by hand-crafted modeling using expert or crowdsourced inputs. Analogously to surface-form versus canonicalized entities, relational statements also come in these two forms: with a diversity of names for the same logical relation, or with unique names and no redundancy (see Section 2.1.4 for examples).
Methodological Repertoire:
To go from input to output, knowledge harvesting has a variety of options for its algorithmic methods and tools. The following is a list of the most notable options; there are further choices, and practical techniques often combine several of the listed options.
• Rules and Patterns: When inputs have rigorous structure and the desired output quality mandates conservative techniques, rule-based extraction can achieve the best results. The System T project at IBM is a prime example of rule-based knowledge extraction for industrial-strength applications (see, e.g., [89]). A toy pattern-extraction sketch follows at the end of this section.
• Logical Inference: Using consistency constraints can often eliminate spurious candidates for KB statements, and deduction rules can generate additional statements of interest. Both cases require reasoning with computational logics. This is usually combined with other paradigms such as extraction rules or statistical inference.
• Statistical Inference: Distilling crisp knowledge from vague and ambiguous text content or semi-structured tables and lists often builds on the observation that there is redundancy in content sources: the same KB statement can be spotted in (many) different places. Thus we can leverage statistics and corresponding inference methods. In the simplest case, this boils down to frequency arguments, but it can be much more elaborate, considering different statistical measures and joint reasoning. In particular, statistical inference can be combined with logical invariants, for example, by probabilistic graphical models ([126]).
• NLP Tools: Modern tools for natural language processing (see, e.g., [146, 273]) encompass a variety of methods, from rule-based to deep learning. They reveal structure and text parts of interest, such as dependency-parse trees for syntactic analysis, pronoun resolution, identification of entity names, sentiment-bearing phrases, and much more. However, as language becomes more informal with incomplete sentences and colloquial expressions (incl. social-media slang such as "LOL"), mainstream NLP does not always work well.
• Deep Learning: The most recent addition to the methodological repertoire is deep neural networks, trained in a supervised or distantly supervised manner (see, e.g., [68, 186]). The sweet spot here is when there is a large amount of "gold-standard" labeled training data, often in combination with learning so-called embeddings from large text corpora. Thus, deep learning is most naturally used for increasing the coverage of a KB after initial population, such that the initial KB can serve as a source of distant supervision.
In Figure 2.2, the edges between inputs and methods and between outputs and methods indicate choices for methods being applied to different kinds of inputs and outputs. Note that this is not meant to exclude further choices and additional combinations.
The outlined design space and the highlighted options are by no means complete, but merely reflect some of the prevalent choices as of today. We will largely use this big picture as a "roadmap" for organizing material in the following chapters. However, there are further options and plenty of underexplored (if not unexplored) opportunities for advancing the state of the art in knowledge harvesting.
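To make the rules-and-patterns option above concrete, here is a toy Hearst-style pattern (the regular expression and the example sentence are ours, and this is not the System T approach cited above) that extracts class/instance candidates from phrases of the form "<class> such as <instances>".

import re

# Toy Hearst-style pattern: "<class noun> such as <Instance>, <Instance>, ..."
pattern = re.compile(
    r"(\w+) such as ((?:[A-Z][\w']+(?: [A-Z][\w']+)*)(?:, (?:[A-Z][\w']+(?: [A-Z][\w']+)*))*)")

text = ("She admired protest singers such as Bob Dylan, Joan Baez "
        "and visited cities such as Duluth.")
for cls, instances in pattern.findall(text):
    for instance in instances.split(", "):
        print(instance, "type", cls)
# Bob Dylan type singers
# Joan Baez type singers
# Duluth type cities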
This chapter presents a powerful method for populating a knowledge base with entities and classes, and for organizing these into a systematic taxonomy. This is the backbone that any high-quality KB – broadly encyclopedic or focused on a vertical domain – must have. Following the rationale of our design-space discussion, we focus here on knowledge harvesting from premium sources such as Wikipedia or domain-specific repositories such as GeoNames for spatial entities or GoodReads and Librarything for the domain of books. This emphasizes the philosophy of "picking low-hanging fruit first" for the best benefit/cost ratio.
We recommend starting every KB construction project by tapping one or a few premium sources first. Such sources should have the following characteristics:
• authoritative high-quality content about entities of interest,
• high coverage of many entities, and
• clean and uniform representation of content, like having clean HTML markup or even wiki markup, unified headings and structure, well-organized lists, informative categories, and more.
Distilling some of the contents from such sources into machine-readable knowledge can create a strong core KB with a good ratio of “mileage per effort”. In particular, relatively simple extraction and cleaning methods go a long way already. The core KB can then be further expanded from other sources – with more advanced methods as presented in subsequent chapters.
Wikipedia:
For a general-purpose encyclopedic knowledge base,
Wikipedia is presumably the most suitable starting point, with its huge number of entities, highly informative descriptions and annotations, and quality assurance via curation by a large community and a sophisticated system of moderators. The English edition of Wikipedia (https://en.wikipedia.org) contains more than 6 million articles with 500 words of text on average (as of July 1, 2020), all with unique names, most of which feature individual entities. These include more than 1.5 million notable people, more than 750,000 locations of interest, more than 250,000 organizations, and instances of further major classes including events (e.g., sports tournaments, natural disasters, battles and wars) and creative works (e.g., books, movies, musical pieces).
Another great starting point, with even more entities (ca. 100 million), would be the
Wikidata knowledge base ( https://wikidata.org ), populated with entity-centric facts (SPO triples) by a knowledge-sharing community [600]. Wikidata is already a full-fledged KB, in a formal representation following the RDF data model. So there is no point in illustrating knowledge extraction from Wikidata. Moreover, Wikidata largely focuses on capturing basic biographic facts about entities, like birthdate, birthplace, spouses and children for people, or city, country and geo-coordinates for buildings and landmarks, and so on. Its type system is large, but most entities belong only to one or two types, whereas Wikipedia often offers several tens of highly informative categories that characterize an entity. Last but not least, Wikidata does not have full-text articles about entities with rich descriptions, lists, tables and more. We will come back to Wikidata in Chapter 9, including a case for integration with another knowledge source (see Section 9.1).
Wikipedia serves as an archetype of knowledge-sharing communities, which can be seen as “proto-KBs”: the right contents for a KB, but not yet in the proper representation. Another case in point would be the Chinese encyclopedia Baidu Baike with almost 20 million articles ( https://baike.baidu.com ). In this chapter, we focus on the English Wikipedia as an exemplary case. We will see that Wikipedia alone does not lend itself to building a clean KB as easily as one would hope. Therefore, we combine input from Wikipedia with another premium source: the
WordNet lexicon [159] as a key asset for the KB taxonomy.
Geographic Knowledge:
For building a KB about geographic and geo-political entities, like countries, cities, rivers, mountains, natural and cultural landmarks, Wikipedia itself is a good starting point, but there are very good alternatives as well. Wikivoyage is a travel-guide wiki with specialized articles about travel destinations. GeoNames is a huge repository of geographic entities, from mountains and volcanos to churches and city parks, more than 10 million in total. If city streets, highways, shops, buildings and hiking trails are of interest, too, then OpenStreetMap is another premium source to consider (or alternatively commercial maps if you can afford to license them). Even commercial review forums such as TripAdvisor could be of interest, to include hotels, restaurants and tourist services.
These sources complement each other, but they also overlap in entities. Therefore, simply taking their union as an entity repository is not a viable solution. Instead, we need to carefully integrate the sources, using techniques for entity matching to avoid duplicates and to combine their different pieces of knowledge for each entity (see Chapter 5, especially Section 5.2).
To obtain an expressive and clean taxonomy of classes, we could tap each of the sources separately, for example, by interpreting categories as semantic types. But again, simply taking a union of several category systems does not make sense. Instead, we need to find ways of aligning equivalent (and possibly also subsumption) pairs of categories, as a basis for constructing a unified type hierarchy. For example, can and should we map craters from one source to volcanoes in a second source, and how are both related to volcanic national parks? This alignment and integration is not an easy task, but it is still much simpler than extracting all volcanoes and craters from textual contents in a wide variety of diverse web pages.
Knowledge about Movies:
For the movie domain, we are primarily interested in entities like movies, directors, actors, producers, soundtrack music, contributors to special effects etc. IMDB (Internet Movie Database) is by far the best source of information for this scope. This premium source is commercial and disallows crawling, but it offers periodic dumps of its core data for downloading (subject to licensing conditions).
However, advanced users may ask for more: a movie KB should also provide convenient access to additional knowledge about the life of the movie contributors, for example, how often they are divorced and how they started their careers. To this end, we could combine IMDB entities with selected articles from
Wikipedia or entries from
Wikidata , but such combinations involve non-trivial knowledge integration tasks. Moreover, although IMDB is huge, it does not have perfect coverage of the world’s film footage, for example, missing many Bollywood productions, African movies or lesser-known documentaries. Therefore, merging its repository with entities from other sites would be desirable, with appropriate entity matching and type alignment, similar to the Wikipedia case discussed earlier.
An even better KB for movie aficionados would also cover the contents of movies and their characters. A rich source about popular movies and TV series is fan-community wikis, many of which are hosted on a common wiki-hosting platform. On these wikis, a large number of fictitious characters are systematically organized in semantic types, and are annotated with crisp statements about their traits and relationships in the respective stories. The type labels allow finding favorite villains, heroes, wizards, witches, and more. However, if we want a truly integrated and clean KB, we need to align these types with the categories that IMDB provides for real people and some movie characters, for example, to consider the gender and race of actors and their roles.
Health Knowledge:
Another vertical domain of great importance for society is health: building a KB with entity instances of diseases, symptoms, drugs, therapies etc. There is no direct counterpart to Wikipedia for this case, but there are large and widely used tagging catalogs and terminology lexicons like MeSH and UMLS (including the SNOMED clinical terminology for healthcare), and these can be treated as analogs to Wikipedia: rich categorization but not always semantically clean. The next step would then be to clean these raw assets, using methods like the presented ones, and populate the resulting classes with entities.
For the latter, additional premium sources could be considered: either Wikipedia articles about biomedical entities, or curated structured sources such as DrugBank or Disease Ontology ( http://disease-ontology.org/ ), and also human-oriented Web portals like the one by the Mayo Clinic. Research projects along the lines of this knowledge integration and taxonomy construction for health include KnowLife/DeepLife [150, 148], Life-INet [487] and Hetionet [234]; see also [270] for a general discussion of health knowledge.
Many premium sources come with a rich category system: assigning pages to relevant categories that can be viewed as proto-classes but are too noisy to be considered as a semantic type system. Wikipedia, as our canonical example, organizes its articles in a hierarchy of more than 1.9 million categories (as of July 1, 2020). For example,
Bob Dylan (the entity corresponding to article en.wikipedia.org/wiki/Bob_Dylan ) is placed in categories such as American male guitarists, Pulitzer Prize winners, Songwriters from Minnesota etc., and Blowin’ in the Wind (corresponding to en.wikipedia.org/wiki/Blowin’_in_the_Wind ) is in categories such as Songs written by Bob Dylan, Elvis Presley songs, Songs about freedom and Grammy Hall of Fame recipients, among others.
Using these categories as classes with their respective entities, it seems we could effortlessly construct an initial KB. So are we done already?
Unfortunately, the Wikipedia category system is almost a class taxonomy, but only almost. We face the following difficulties:
• High Specificity of Categories: The direct categories of entities (i.e., leaves in the category hierarchy) tend to be highly specific and often combine multiple classes into one multi-word phrase. Examples are American male singer-songwriters, or Nobel laureates absent at the ceremony. For humans, it is obvious that this implies membership in classes singers, men, guitar players, Nobel laureates etc., but for a computer, the categories are initially just noun phrases.
• Drifting Super-Categories:
By considering also super-categories (i.e., non-leaf nodes in the hierarchy) and the paths in the category system, we could possibly generalize the leaf categories and derive broader classes of interest, such as men, American people, musicians, etc. However, the Wikipedia category system exhibits conceptual drifts where super-categories imply classes that are incompatible with those of the corresponding leaves and the entity itself. Figure 3.1 shows excerpts of the category hierarchy for the entities Bob Dylan and Blowin’ in the Wind. By transitivity, the super-categories would imply that Bob Dylan is a location, a piece of art, a family and a kind of reproduction process (generalizing the “sex” category). For the example song, the category system alone would likewise lead to flawed or meaningless classes: locations, buildings, singers, actions, etc.
Figure 3.1: Example categories and super-categories from Wikipedia (excerpts of the category hierarchy for the entities Bob Dylan and Blowin’ in the Wind).
• Entity Types vs. Associative Categories: Some of the super-categories are general concepts, for example, Applied Ethics and Free Will in Figure 3.1. Some of the edges between a category and its immediate super-category are conceptual leaps, for example, moving from songs and works to the respective singers in Figure 3.1.
All this makes sense when the category hierarchy is viewed as a means to support user browsing in an associative way, but it is not acceptable for the taxonomic backbone of a clean KB. For example, queries about buildings in North America may return Blowin’ in the Wind as an answer, and statistical analytics on prizes, say by geo-region or gender, would confuse the awards and the awardees.
In the following we present a simple but powerful methodology to leverage the Wikipedia categories as raw input, with thorough cleansing of their noun-phrase names and integration with upper-level taxonomies for clean KB construction, based on the works of [456, 457, 562, 240] (applied in the WikiTaxonomy and YAGO projects).
Head Words of Category Names:
As virtually all category names are noun phrases, a basic building block for uncovering their semantics is to parse these multi-word phrases into their syntactic constituents. This task is known as noun-phrase parsing in NLP (see, e.g., [273] and [146]). In general, noun phrases consist of nouns, adjectives, determiners (like “the” or “a”), pronouns, coordinating conjunctions (“and” etc.), prepositions (“by”, “from” etc.), and possibly even further word variants.
A typical first step is to perform part-of-speech tagging, or POS tagging for short. This tags each word with its syntactic sort: noun, adjective etc. Nouns are further classified into common nouns, which can have an article, e.g., “guitarist”, and proper nouns, which denote names (e.g., “Minnesota”). POS tagging usually works by dynamic programming over a pre-trained statistical model of word-variant sequences in large corpora. The subsequent noun-phrase parsing computes a syntactic tree structure for the POS-tagged word sequence, inferring which word modifies or refines which other word. This is usually based on stochastic context-free grammars, again using some form of dynamic programming. Later chapters in this article will make intensive use of such NLP methods, too.
The root of the resulting tree is called the head word, and this is what we are mostly after. For example, for “American male guitarists” the head word is “guitarists” and for “songwriters from Minnesota” it is “songwriters”. The head word is preceded by so-called pre-modifiers, and followed by post-modifiers. Sometimes, words from these modifiers can be combined with the head word to form a semantically meaningful class as well (e.g., “female guitarists”).
Equipped with this building block, we can now tackle the desired category cleaning. Our goal is to distinguish taxonomic categories (such as “guitarists”) from associative categories (such as “music”). The key idea is the heuristic that plural-form common nouns are likely to denote classes, whereas singular-form nouns tend to correspond to general concepts (e.g., “free will”). The reason for this is that classes regroup several instances, and a plural form is thus a strong indicator for a class. We can possibly relax this heuristic to consider also singular-form nouns where the corresponding plural form occurs frequently in corpora such as news. For example, if a Wikipedia category were named “jazz band” rather than “jazz bands” we should accept it as a class, while still disregarding categories such as “free will” or “ethics” (where “wills” is very rare, and “ethics” is a singular word despite ending with “s”). These ideas can be cast into the following algorithm.
Algorithm for Category Cleaning
Input: Wikipedia category name c: leaf node or non-leaf node
Output: semantic class label or null
1. Run noun-phrase parsing to identify head word h and modifier structure: c = pre_1 .. pre_k h post_1 .. post_l.
2. Test if h is in plural form or has a frequently occurring plural form. If not, return null. Optionally, consider also pre_i .. pre_k h as class candidates, with increasing i from 1 to k.
3. For a leaf category c, return h (and optionally additional class labels pre_i .. pre_k h).
4. For a non-leaf category c, test if the class candidate h is a synonym or hypernym (i.e., generalization) of an already accepted class (including h). If so, keep it; otherwise, discard it.
The rationale for the additional test in Step 4 is that non-leaf categories in Wikipedia are often merely associative, as opposed to denoting semantically proper super-classes (see discussion above). So we impose this additional scrutiny, while still being able to harvest the cases when head words of super-categories are meaningful outputs (e.g., Musicians and Men in Figure 3.1). The test itself can be implemented by looking up head words in existing dictionaries like WordNet [159] or Wiktionary, which list synonyms and hypernyms for many words. This is a prime case of harnessing a second premium source.
Class Candidates from Wikipedia Articles:
We have so far focused on Wikipedia categories as a source of semantic class candidates. However, normal Wikipedia articles may be of interest as well. Most articles represent individual entities, but some feature concepts, among which some may qualify as classes. For example, the articles https://en.wikipedia.org/wiki/Cover_version and https://en.wikipedia.org/wiki/Aboriginal_Australians correspond to classes as they have instances, whereas https://en.wikipedia.org/wiki/Empathy is a singleton concept as it has no individual entities as instances of interest.
Simple but effective heuristics to capture these cases have been studied by [198, 440]. The key idea is that an article qualifies as a class if its textual body mentions the article’s title in both singular and plural forms. For example, “cover version” and “cover versions” are both present in the article about cover versions (of songs), but the article on empathy does not refer to “empathys”. Obviously, this technique is just another building block that should be combined with other heuristics and statistical inference.
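To make the head-word heuristic of the category-cleaning algorithm concrete, here is a minimal sketch using the spaCy NLP library; the model name and the example categories are merely illustrative, and the corpus-frequency relaxation for singular names as well as the WordNet test of Step 4 are omitted.

import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline; assumed to be installed

def class_label(category_name):
    doc = nlp(category_name)
    head = next(tok for tok in doc if tok.head == tok)   # syntactic head of the noun phrase
    if head.tag_ in ("NNS", "NNPS"):                     # plural noun -> likely a taxonomic class
        return head.lemma_.lower()                       # e.g. "guitarist"
    return None                                          # singular heads like "free will" are discarded

for cat in ["American male guitarists", "Songwriters from Minnesota", "Free will"]:
    print(cat, "->", class_label(cat))

A fuller version would also emit pre-modifier combinations such as “male guitarists” as additional class candidates.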
By applying the category cleaning algorithm to all Wikipedia categories, we can obtain a rich set of class labels for each entity. However, as the Wikipedia community does not enforce strict naming standards, we could arrive at duplicates for the same class, for example, accepting both guitarist and guitar player. Moreover, as we are mostly harvesting the leaf-node categories and expect to prune many of the more associative super-categories, our KB taxonomy may end up with many disconnected classes. To fix these issues, we resort to pre-existing high-quality taxonomies like
WordNet [159]. This lexicon already covers more than a hundred thousand concepts and classes – called word senses or synsets for sets of synonyms – along with a clean structure for hypernymy. Alternatively, we could consider Wiktionary. Both of these lexical resources also have multilingual extensions, covering a good fraction of mankind’s languages. See also [203] for a general overview of this kind of lexical resources.
A major caveat, however, is that WordNet has hardly any entity-level instances for its classes; you can think of it as an un-populated upper-level taxonomy. The same holds for Wiktionary. The goal now is to align the class candidates harvested from Wikipedia with the classes in WordNet.
Similarity between Categories and Classes:
The key idea is to perform a similarity test between a class candidate from Wikipedia and potentially corresponding classes in WordNet. In the simplest case, this is just a surface-form string similarity. For example, the Wikipedia-derived candidate “guitar player” has high similarity with the WordNet entry “guitarist”. There are two problems to address, though. First, we could still observe low string similarity for two matching classes, for example, “award” from Wikipedia against “prize” in WordNet. Second, we can find multiple matches with high similarity, for example “building” from Wikipedia matching two different senses in WordNet, namely, building in the sense of a man-made structure (e.g., houses, towers etc.) and building in the sense of a construction process (e.g., building a KB). We have to make the right choice among such alternatives for ambiguous words.
The solution for the first problem – similarity-based matching – is to consider contexts as well. WordNet, and Wiktionary alike, provide synonym sets as well as short descriptions (so-called glosses) for their entries, and the Wikipedia categories can be contextualized by related words occurring in super-categories or (articles for) their instances. This way, we are able to map “award” to “prize” because WordNet has the entry “prize, award (something given for victory or superiority in a contest or competition or for winning a lottery)”, stating that “prize” and “award” are synonyms (for this specific word sense). More generally, we could consider also entire neighborhoods of WordNet entries defined by hypernyms, hyponyms, derivationally related terms, and more. Such contextualized lexical similarity measures have been investigated in research on word sense disambiguation (WSD), see [417, 203] for overviews.
Another approach to strengthen the similarity comparisons is to incorporate word embeddings such as Word2Vec [384] or GloVe [447] (or even deep neural networks along these lines, such as BERT [118]). We will not go into this topic now, but will come back to it in Section 4.5.
For the second problem – ambiguity – we could apply state-of-the-art WSD methods, but it turns out that there is a very simple heuristic that works so well that it is hardly outperformed by any advanced WSD method. It is known as the most frequent sense (MFS) heuristic: whenever there is a choice among different word senses, pick the one that is more frequently used in large corpora such as news or literature. Conveniently, the WordNet team has already manually annotated large corpora with WordNet senses, and has recorded the frequency of each word sense. It is thus easy to identify, for each given word, its most frequent meaning. For example, the MFS for “building” is indeed the man-made structure. There are exceptions to the MFS heuristic, but they can be handled in other ways.
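As an illustration, the MFS lookup can be sketched on top of NLTK’s WordNet interface (assuming the WordNet corpus is installed): NLTK returns the synsets of a word ordered by the frequency of their sense annotations, so the first synset approximates the most frequent sense. The fuller alignment step would additionally compute string and gloss similarities before this tie-breaking.

from nltk.corpus import wordnet as wn   # requires the WordNet corpus to be downloaded

def most_frequent_sense(word):
    synsets = wn.synsets(word, pos=wn.NOUN)       # noun senses, ordered by annotated frequency
    return synsets[0] if synsets else None

sense = most_frequent_sense("building")
print(sense.name())                               # 'building.n.01' -- the man-made structure
print(sense.definition())                         # gloss, usable for contextual similarity
print([h.name() for h in sense.hypernyms()])      # hypernyms for linking into the taxonomy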
Putting Everything Together:
Putting these considerations together, we arrive at the following heuristic algorithm for aligning Wikipedia categories and WordNet senses.
Algorithm for Alignment with WordNet
Input: Class name c derived from Wikipedia category
Output: synonym or hypernym in WordNet, or null
1. Compute string or lexical similarity of c to WordNet entries s (for candidates s with a certain overlap of character-level substrings). Then pick the s with highest similarity if this is above a given threshold; otherwise return null.
2. If the highest-similarity entry s is unambiguous (i.e., the same word has only this sense), then return the WordNet sense for s. If s is an ambiguous word, then return the MFS for s (or use another WSD method for c and the s candidates).
Once we have mapped the accepted Wikipedia-derived classes onto WordNet, we have a complete taxonomy, with the upper-level part coming from the clean WordNet hierarchy of hypernyms. The last thing left to decide for the alignment task is whether the category-based class c is synonymous to the identified WordNet sense s or whether s is a hypernym of c. The latter occurs when c does not have a direct counterpart in WordNet at all. For example, we could keep the category c = “singer-songwriter”, but WordNet does not have this class at all. Instead we should align c to singer or songwriter or both. If WordNet did not have an entry for songwriters, we should map c to the next hypernym, which is composer.
This final issue can be settled heuristically, for example, by assuming a synonymy match if the similarity score is very high and assuming a hypernym otherwise, or we could resort to leveraging information-theoretic measures over additional text corpora (e.g., the full text of all Wikipedia articles). Specifically, for a pair c and s (e.g., songwriter and composer – a case of hypernymy), a symmetry-breaking measure is the conditional probability P[s|c] estimated from co-occurrence frequencies of words. If P[s|c] ≫ P[c|s], that is, c sort of implies s but not vice versa, then s is likely a hypernym of c, not a synonym. Various measures along these lines are investigated in [618, 183].
The presented methodology can also be adapted to other cases of taxonomy alignment, for example, to GeoNames, Wikivoyage and OpenStreetMap (and Wikipedia or Wikidata) about geographic categories and classes (see Section 3.1).
A different paradigm for aligning Wikipedia categories with WordNet classes, as prime examples of premium sources, has been developed by [418] for constructing the
BabelNet knowledge base. It is based on a candidate graph derived from the Wikipedia category at hand and potential matches in WordNet, and then uses graph metrics such as shortest paths or graph algorithms like random walks to infer the best alignment. Figure 3.2 illustrates this approach. The graph construction extracts the head word of interest and salient context words from Wikipedia – “play” as well as “film” and “fiction” in the example. Then all approximate matches are identified in WordNet, and their respective local neighborhoods are added to the graph, casting WordNet’s lexical relations like hypernymy/hyponymy, holonymy/meronymy (whole-part) etc. into edges. Edges could even be weighted based on similarity or salience metrics.
In the example, we have two main candidates to which “play” could refer: “play, drama” or “play (sports)” (WordNet contains even more). To rank these candidates, the simplest method is to look at how close they are to the Wikipedia-based start nodes “play”, “film” and “fiction”, for example, by aggregating the shortest paths from start nodes to candidate-class nodes. Alternatively, and typically better performing, we can use methods based on random walks over the graph, analogously to how Google’s original PageRank and Personalized PageRank measures were computed ([59, 263]).
The walk starts at the Wikipedia-derived head word “play” and randomly traverses edges to visit other nodes – where the probabilities for picking edges should be proportional to edge weights (i.e., uniform over all outgoing edges if the edges are unweighted). Occasionally, the walk could jump back to the start node “play”, as decided by a probabilistic coin toss. By repeating this random procedure sufficiently often, we obtain statistics about how often each node is visited. These statistics converge to well-defined stationary visiting probabilities as the walk length (or the number of repetitions after jumping back to the start node) approaches infinity.
Figure 3.2: Example for graph-based alignment (Wikipedia–WordNet candidate graph for the category “British Plays Adapted into Films”, with WordNet sense candidates such as “play, drama” and “play (sport)”).
The candidate class with the highest visiting probability is the winner: “play, drama” in the example. Such random-walk methods are amazingly powerful, easy to implement and widely applicable. We will see other use cases in later chapters.
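This ranking can be sketched with networkx’s personalized PageRank, which computes exactly such stationary visiting probabilities with restarts; the graph below is a hand-made miniature of Figure 3.2, and all node names and edge choices are illustrative only.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("play", "play, drama"), ("play", "play (sport)"),        # word and its candidate senses
    ("film", "film, movie"), ("fiction", "fiction (sense)"),  # context words and their senses
    ("play, drama", "literary work"), ("fiction (sense)", "literary work"),
    ("play, drama", "show"), ("film, movie", "show"),
    ("play (sport)", "plan of action"),
])

# restart mass concentrated on the Wikipedia-derived start words
scores = nx.pagerank(G, alpha=0.85, personalization={"play": 0.6, "film": 0.2, "fiction": 0.2})

best = max(["play, drama", "play (sport)"], key=lambda n: scores[n])
print(best)   # expected: "play, drama", which is better connected to the context senses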
There are various extensions of the presented method. Wikipedia category names do not just indicate class memberships, but often reflect other relations as well. For example, for Bob Dylan being in the category Nobel Laureates in Literature we can infer a relational triple ⟨Bob Dylan, has won, Literature Nobel Prize⟩. Such extensions have been developed by [415]; we will revisit these techniques in Chapter 6. Another line of generalization is to leverage the taxonomies from the presented methods as training data to learn generalized patterns in category names and other Wikipedia structures (infobox templates, list pages etc.). This way, the taxonomy can be further grown and refined. Such methods have been developed in the Kylin/KOG project by [626, 625].
Relationship with Ontology Alignment:
The alignment between Wikipedia categories and WordNet classes can be seen as a special case of ontology alignment (see, e.g., [551] for an overview, [20, 331] for representative state-of-the-art methods, and http://oaei.ontologymatching.org/ for a prominent benchmark series). Here, the task is to match classes and properties of one ontology with those of a second ontology, where the two ontologies are given in crisp logical forms like RDFS schemas or, even better, in the OWL description-logic language [601]. Ontology matching is in turn highly related to the classical task of database schema matching [123].
The case of matching Wikipedia categories and WordNet classes is a special case, though, for two reasons. First, WordNet has hardly any instances of its classes, and we chose to ignore the few existing ones. Second, the upper part of the Wikipedia category hierarchy is more associative than taxonomic, so that it had to be cleaned first (as discussed in Section 3.2). For these reasons, the case of Wikipedia and WordNet benefits from tailored alignment methods, and similar situations are likely to arise also for domain-specific premium sources.
Beyond Wikipedia:
We used Wikipedia and WordNet as exemplary cases of premium sources, and pointed out a few vertical domains for wider applicability of the presented methods. Aligning and enriching pre-existing knowledge sources is also a key pillar for industrial-strength KBs about retail products (see, e.g., [117, 133]). More discussion on this use case is offered in Section 9.5.
Apart from this mainstream, similar cases for knowledge integration can be made for less obvious domains, too, examples being food [15, 218] or fashion [277, 189]. Food KBs have integrated sources like the FoodOn ontology [136], the nutrients catalog https://catalog.data.gov/dataset/food-and-nutrient-database-for-dietary-studies-fndds and a large recipe collection [369], and fashion KBs could make use of contents from catalogs such as https://ssense.com . More exotic verticals to which the Wikipedia-inspired methodology has been carried over are fictional universes such as Game of Thrones, the Simpsons, etc., with input from rich category systems of fan-community wikis. Recent research on this topic includes works by [231] and [97]. Finally, another non-standard theme for KB construction is how-to knowledge: organizing human tasks and procedures for solving them in a principled taxonomy. Research in this direction includes [641, 98].
We summarize this chapter by the following take-home lessons.
• For building a core KB, with individual entities organized into a clean taxonomy of semantic types, it is often wise to start with one or a few premium sources. Examples are Wikipedia for general-purpose encyclopedic knowledge, GeoNames and Wikivoyage for geo-locations, or IMDB for movies.
• A key asset are the categories by which entities are annotated in these sources. As categories are often merely associative, designed for manual browsing, this typically involves a category cleaning step to identify taxonomically clean classes.
• To construct an expressive and clean taxonomy, while harvesting two or more premium sources, it is often necessary to integrate different type systems. This can be achieved by alignment heuristics based on NLP techniques (such as noun phrase parsing), or by random walks over candidate graphs.
This chapter presents advanced methods for populating a knowledge base with entities and classes (aka. types), by tapping into textual and semi-structured sources. Building on the previous chapter’s insights on harvesting premium sources first, this chapter extends the regime for inputs to discovering entities and type information in Web pages and text documents. This will often yield noisier output, in the form of entity duplicates. The following chapter, Chapter 5, will address this issue by presenting methods for canonicalizing entities into unique subjects in the KB.
Harvesting entities and their classes from premium sources goes a long way, but it is bound to be incomplete when the goal is to fully cover a certain domain such as music or health, and to associate all entities with their relevant classes. Premium sources alone are typically insufficient to capture long-tail entities, such as less prominent musicians, songs and concerts, as well as long-tail classes such as left-handed cello players or cover songs in a different language than the original. In this section, we present a suite of methods for automatically extracting such additional entities and classes from sources like web pages and other text documents.
In addressing this task, we typically leverage that premium sources already give us a taxonomic backbone populated with prominent entities. This leads to various discovery tasks:
1. Given a class T containing a set of entities E = {e_1 ... e_n}, find more entities for T that are not yet in E.
2. Given an entity e and its associated classes T = {t_1 ... t_k}, find more classes for e not yet captured in T.
3. Given an entity e with known names, find additional (alternative) names for e, such as acronyms or nicknames. This is often referred to as alias name discovery.
4. Given a class t with known names, find additional names for t. This is sometimes referred to as paraphrase discovery.
In the following, we organize different approaches by methodology rather than by these four tasks, as most methods apply to several tasks.
In principle, there is also a case where the initial repository of entities and classes is empty – that is, when there is no premium source to be harvested first. This is the case for ab-initio taxonomy induction from noisy observations, which will be discussed in Section 4.6.
An important building block for all of the outlined discovery tasks is to detect mentions of already known entities in web pages and text documents. These mentions do not necessarily take the form of an entity’s full name as derived from premium sources. For example, we want to be able to spot occurrences of
Steve Jobs and
Apple Inc. in a sentence such as “Apple co-founder Jobs gave an impressive demo of the new iPhone.” Essentially, this is a string matching task where we compare known names of entities in the existing KB against textual inputs. The key asset here is to have a rich dictionary of alias names from the KB. Early works on information extraction from text made extensive use of name dictionaries, called gazetteers, in combination with NLP techniques (POS tagging etc.) and string patterns. Two seminal projects of this sort are the
GATE toolkit [106, 107] and the
UIMA framework [164].
A good dictionary should include
• abbreviations (e.g., “Apple” instead of “Apple Inc.”),
• acronyms (e.g., “MS” instead of “Microsoft”),
• nicknames and stage names (e.g., “Bob Dylan” vs. his real name “Robert Zimmerman”, or “The King” for “Elvis Presley”),
• titles and roles (e.g., “CEO Jobs” or “President Obama”), and possibly even
• rules for deriving short-hand names (e.g., “Mrs. Y” for female people with last name “Y”).
Where do we get such dictionaries from? This is by itself a research issue, tackled, for example, by [76]. The easiest approach is to exploit redirects in premium sources and hyperlink anchors. Whenever a page with name X is redirected to an official page with name Y and whenever a hyperlink with anchor text X points to a page Y, we can consider X an alias name for entity Y. In Wikipedia, for example, a page with title “Elvis” ( https://en.wikipedia.org/w/index.php?title=Elvis ) redirects to the proper article ( https://en.wikipedia.org/wiki/Elvis_Presley ), and the page about the Safari browser contains a hyperlink with anchor “Apple” that points to the article https://en.wikipedia.org/wiki/Apple_Inc. . This approach extends to links from Wikipedia disambiguation pages: for example, the page https://en.wikipedia.org/wiki/Robert_Zimmerman lists 11 people with this name, including a link to https://en.wikipedia.org/wiki/Bob_Dylan . Of course, this does not resolve the ambiguity of the name, but it gives us one additional alias name for Bob Dylan. Hyperlink anchor texts in Wikipedia were first exploited for entity alias names by [67], and this simple technique was extended to hyperlinks in arbitrary web pages by [550]. A special case of interest is to harvest multilingual names from interwiki links in Wikipedia (connecting different language editions) or non-English web pages linking to English Wikipedia articles. For example, French pages with anchor text “Londres” link to the article https://en.wikipedia.org/wiki/London . All this is low-hanging fruit from an engineering perspective, and gives great mileage towards rich dictionaries for entity names.
An analogous issue arises for class names as well. Here, redirect and anchor texts are useful, too, but are fairly sparse. The WordNet thesaurus and the Wiktionary lexicon contain synonyms for many word senses (e.g., “vocalists” for singers, and vice versa) and can serve to populate a dictionary of class paraphrases.
The ideas that underlie the above heuristics can be cast into a more general principle of strong co-occurrence:
Strong Co-Occurrence Principle:
If an entity or class name X co-occurs with name Y in a context with cue Z, then Y is (likely) an alias name for X.
This principle can be instantiated in various ways, depending on what we consider as context cue Z:
• The cue Z is a hyperlink where X is the link target and Y is the anchor text.
• The cue Z is a specific wording in a sentence, like “also known as”, “aka.”, “born as”, “abbreviated as”, “for short” etc.
• The context cue Z is a query-click pair (observed by a search engine), where X is the query and Y is the title of a clicked result (with many clicks by different users).
• The context cue Z is the frequent occurrence of X in documents about Y (e.g., Wikipedia articles, biographies, product reviews etc.).
For example, when many users who query for “Apple” (or “MS”) subsequently click on the Wikipedia article or homepage of Apple Inc. (or “Microsoft”), we learn that “Apple” is a short-hand name for the company. This co-clicking technique has been studied by [577]. Extending this to textual co-occurrence (i.e., the last of the above itemized cases) comes with a higher risk of false positives, but could still be worthwhile for cases like short names or acronyms for products. The technique can be tuned towards either precision or recall by thresholding on the observation frequencies and by making the context cue more or less restrictive. More advanced techniques for learning co-occurrence cues about alias names – so-called synonym discovery – have been investigated by [469], among others.
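A minimal sketch of such an alias dictionary and a greedy dictionary-based mention spotter is shown below; the (alias, entity) pairs are made-up stand-ins for what would be harvested from redirects, anchor texts and disambiguation pages.

from collections import defaultdict

alias_pairs = [                       # toy harvest; real dictionaries have millions of pairs
    ("Elvis", "Elvis_Presley"), ("The King", "Elvis_Presley"),
    ("Apple", "Apple_Inc."), ("MS", "Microsoft"),
    ("Robert Zimmerman", "Bob_Dylan"), ("Dylan", "Bob_Dylan"),
]
aliases = defaultdict(set)
for alias, entity in alias_pairs:
    aliases[alias.lower()].add(entity)

def spot_mentions(text, max_len=3):
    """Greedy longest-match lookup of known aliases over the token sequence."""
    tokens = text.split()
    i, mentions = 0, []
    while i < len(tokens):
        for j in range(min(len(tokens), i + max_len), i, -1):   # try longest span first
            span = " ".join(tokens[i:j]).strip(".,").lower()
            if span in aliases:
                mentions.append((" ".join(tokens[i:j]), sorted(aliases[span])))
                i = j
                break
        else:
            i += 1
    return mentions

print(spot_mentions("Apple co-founder Jobs gave an impressive demo of the new iPhone."))

Note that spotting only surfaces candidate entities; ambiguous aliases still map to sets of entities and need to be canonicalized later (Chapter 5).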
To discover more entities of a given class or more classes of a given entity, a powerful approach is to consider specific patterns that co-occur with input (class or entity) and desired output (entity and class). For example, a text snippet like “singers such as Bob Dylan, Elvis Presley and Frank Sinatra” suggests that Dylan, Presley and Sinatra belong to the class of singers. Such patterns have been identified in the seminal work of Marti Hearst [224], and are thus known as
Hearst patterns. They are a special case of the strong co-occurrence principle where the Hearst patterns serve as context cues.
In addition to the “such as” pattern, the most important Hearst patterns are: “X like Y” (with class X in plural form and entity Y), “X and other Y” (with entity X and class Y in plural form), and “X including Y” (with class X in plural form and entity Y).
Some of the Hearst patterns also apply to discovering subclass relations between classes. In the pattern “X including Y”, X and Y could both be classes, for example, “singers including rappers”. In fact, the patterns alone cannot distinguish between observations of entity-class relations (types) versus subclass relations. Additional techniques can be applied to identify which surface strings denote entities and which ones refer to classes. Simple heuristics can already go a long way: for example, words that start with an uppercase letter are often entities whereas common nouns in plural form are more likely class names. Full-fledged approaches make use of dictionaries as discussed above or more advanced methods for entity recognition, discussed further below.
Hand-crafted patterns are useful also for discovering entities in semi-structured web contents like lists and tables. For example, if a list heading or a column header denotes a type, then the list or column entries could be considered as entities of that type. The pattern for this purpose would refer to HTML tags that mark headers and entries. Of course, this is just a crude heuristic that has a non-negligible risk of failing. We will discuss more advanced methods that handle this case more robustly. In particular, Sections 6.2.1.5 and 6.3 go into depth on extraction from semi-structured contents for the more general scope of entity properties.
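A minimal sketch of spotting the “such as” Hearst pattern with a regular expression is given below; the pattern and example sentence are illustrative, and real extractors would operate on POS-tagged or dependency-parsed text rather than raw strings.

import re

# class phrase (lowercase words) "such as" an enumeration of capitalized names
SUCH_AS = re.compile(r"(?P<cls>[a-z][a-z ]+?)\s+such as\s+(?P<ents>[A-Z][^.;]+)")

def extract_instances(sentence):
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        cls = m.group("cls").split()[-1]                 # crude head word of the class phrase
        for ent in re.split(r",|\band\b", m.group("ents")):
            if ent.strip():
                pairs.append((ent.strip(), cls))
    return pairs

print(extract_instances("The festival features singers such as Bob Dylan, Elvis Presley and Frank Sinatra."))
# expected: [('Bob Dylan', 'singers'), ('Elvis Presley', 'singers'), ('Frank Sinatra', 'singers')]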
Multi-anchored Patterns:
Hearst patterns may pick up spurious observations. For example, the sentence “protest songs against war like Universal Soldier” could erroneously yield that
Universal Soldier is an instance of the class wars. One way of making the approach more robust is to include part-of-speech tags or even dependency-parsing trees (see [146, 273] for these NLP basics) in the specification of patterns. Another approach is to extend the context cue and strengthen its role. In addition to the pattern itself, we can demand that the context contains at least one additional entity which is already known to belong to the observed class. For example, the text “singers such as Elvis Presley” alone may be considered insufficient evidence to accept Elvis as a singer, but the text “singers such as Elvis Presley and Frank Sinatra” would have a stronger cue if
Frank Sinatra is a known singer already. This idea has been referred to as doubly-anchored patterns in the literature [292] for the case of observing two entities of the same class. With a strong cue like “singers such as”, insisting on a known witness may be an overkill, but the principle equally applies to weaker cues, for example, “voices such as ...” for the target class singers.
Multi-anchored patterns are particularly useful when going beyond text-based Hearst patterns by considering strong co-occurrence in enumerations, lists and tables. For simplicity, consider only the case of tables in web pages – as opposed to relational tables in databases. The co-occurring entities are typically the names in the cells of the same column, and the class is the name in the column header. Due to the ambiguity of words and the ad-hoc nature of web tables, the spotted entities in the same column may be very heterogeneous, mixing up apples and oranges. For example, a table column on Oscar winners could have both actors and movies as rows. Thus, we may incorrectly learn that
Godfather is an actor and
Bob Dylan is a movie. To overcome these difficulties, we can require that a new entity name is accepted only when it co-occurs with a certain number of known entities that belong to the proper class ([109]), for example, at least 10 actors in the same column for a table of 15 rows. Needless to say, all these are still heuristics that may occasionally fail, but they are easy to implement, powerful and of practical value.
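The column-witness heuristic can be sketched in a few lines; the entity names are toy data, and the witness threshold from the text is scaled down to fit the small example.

known_actors = {"Marlon Brando", "Al Pacino", "Meryl Streep", "Jack Nicholson"}

def accept_column(cells, known, min_witnesses=3):
    witnesses = [c for c in cells if c in known]
    if len(witnesses) < min_witnesses:
        return []                                   # too little evidence for this column
    return [c for c in cells if c not in known]     # remaining cells become entity candidates

column = ["Marlon Brando", "Al Pacino", "Meryl Streep", "Frances McDormand", "Godfather"]
print(accept_column(column, known_actors))          # candidates still need type checks, e.g. "Godfather"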
Pre-specified patterns are inherently limited in their coverage. This motivates approaches for automatically learning patterns, using initial entity-type pairs and/or initial patterns for distant supervision. For example, when frequently observing a phrase like “great voice in” for entities of type singers, this phrase could be added to a set of indicative patterns for discovering more singers. This idea is captured in the following principle of statement-pattern duality, first formulated by [58] (see also [478] in the context of question answering):
Principle of Statement-Pattern Duality
When correct statements about entities x (e.g., x belonging to class y) frequently co-occur with textual pattern p, then p is likely a good pattern to derive statements of this kind.
Conversely, when statements about entities x frequently co-occur with a good pattern p, then these statements are likely correct.
Thus, observations of good statements and good patterns reinforce each other; hence the name statement-pattern duality.
This insightful paradigm gives rise to a straightforward algorithm where we start with statements in the KB as seeds (and possibly also with pre-specified patterns), and then iterate between deriving patterns from statements and deriving statements from patterns.
Algorithm for Seed-based Pattern Learning
Input: Seed statements in the form of known entities for a class
Output: Patterns for this class, and new entities of the class
Initialize: S ← seed statements; P ← ∅ (or pre-specified patterns like Hearst patterns)
Repeat
1. Pattern discovery:
- search for mentions of entities x ∈ S in a web corpus, and identify co-occurring phrases;
- generalize phrases into patterns by substituting x with a placeholder $X;
- analyze frequencies of patterns (and other statistics);
- P ← P ∪ frequent patterns;
2. Statement expansion:
- search for occurrences of patterns p ∈ P in the web corpus, and identify co-occurring entities;
- analyze frequencies of entities co-occurring with multiple patterns (and other statistics);
- S ← S ∪ frequent entities;
A toy example for running this algorithm is shown in Table 4.1.
Table 4.1: Toy example for Seed-based Pattern Learning, with seed “Elvis Presley”
Pattern                       New entities
“singers like $X”             Nina Simone
“$X’s vocal performance”      Amy Winehouse; sentence: “Queen’s vocal performance led by Freddie Mercury ...” → new entity: Queen
“voice of $X”                 Francoise Hardy; sentence: “The great voice of Donald Trump got loud and angry.” → new entity: Donald Trump
...                           ...
In the example, phrases such as “$X’s vocal performance” (with $X as a placeholder for entities) are generalized patterns. They co-occur with at least one but typically multiple of the entities in S known so far, and their strength is the cumulative frequency of these occurrences. Newly discovered patterns also co-occur with seed entities, and this would further strengthen the patterns’ usefulness. The NELL project [394] has run a variant of this algorithm at large scale, and has found patterns for musicians, such as “original song by X”, “ballads reminiscent of X”, “bluesmen, including X”, “was later covered by X”, “also performed with X”, “X’s backing bands”, and hundreds more (see http://rtw.ml.cmu.edu/rtw/kbbrowser/predmeta:musician ).
Despite its elegance, the algorithm, in this basic form, has severe limitations:
1. Over-specific patterns:
Some patterns are overly specific. For example, a possible pattern “$X and her chansons” would apply only to female singers. This can be overcome by generalizing patterns into regular expressions over words and part-of-speech tags [153]: “$X ∗ and PRP chansons” in this example, where ∗ is a wildcard for any word sequence and PRP requires a personal pronoun. Likewise, the pattern “$X’s great voice” could be generalized into “$X’s JJ voice” to allow for other adjectives (with part-of-speech tag JJ) such as “haunting voice” or “angry voice”. Moreover, instead of considering these as sequences over the surface text, the patterns could also be derived from paths in dependency-parsing trees (see, e.g., [65, 402, 561]).
2. Open-ended iterations:
In principle, the loop could be run forever. Recall will continue to increase, but precision will degrade with more iterations, as the acquired patterns get diluted. So we need to define a meaningful stopping criterion. A simple heuristic could be to consider the fraction of seed occurrences obtained at the end of iteration i, as the new patterns should still co-occur with known entities. So a sudden drop in the observations of seeds would indicate a notable loss of quality.
3. False positives and pattern dilution:
Even after one or two iterations, some of the newly observed statements are false positives: Queen is a band, not a singer, and Donald Trump does not have a lot of musical talent. This is caused by picking up overly broad or ambiguous patterns. For example, “Grammy winner” applies to bands as well, and “$X’s great voice” may be observed in sarcastic news about politics, besides music.
As already stated for points 1 and 2, these weaknesses can be ameliorated by extending the method. The hardest issue is point 3. To mitigate the potential dilution of patterns, a number of techniques have been explored. One is to impose additional constraints for pruning out misleading patterns and spurious statements; we will discuss these in Chapter 6 for the more general task of acquiring relational statements. A second major technique is to compute statistical measures of pattern and statement quality after each iteration, and use these to drop doubtful candidates [3]. In the following, we list some of the salient measures that can be leveraged for this purpose.
The support of pattern p for seed statements S, supp(p, S), is the ratio of the frequency of joint occurrences of p with any of the entities x ∈ S to the co-occurrence frequency of any pattern with any x ∈ S:
supp(p, S) = Σ_{x∈S} freq(p, x) / Σ_q Σ_{x∈S} freq(q, x)
where Σ_q ranges over all observed patterns q (possibly with a lower bound on absolute frequency) and freq(·) is the total number of observations of a pattern or pattern-entity pair.
The confidence of pattern p with regard to seed statements S is the ratio of the frequency of p jointly with seed entities x ∈ S to the frequency of p with any entities:
conf(p, S) = freq(p, S) / freq(p)
The diversity of pattern p with regard to seed statements S, div(p, S), is the number of distinct entities x ∈ S that co-occur with pattern p:
div(p, S) = |{x ∈ S : freq(p, x) > 0}|
We can also contrast the positive occurrences of a pattern p, that is, co-occurrences with correct statements from S, against the negative occurrences with statements known to be incorrect. To this end, we have to additionally compile a set of incorrect statements as negative seeds, for example, specifying that the Beatles and Barack Obama are not singers to prevent noisy patterns that led to acquiring Queen and Donald Trump as new singers. Let us denote these negative seeds as S⁻.
This allows us to revise the definition of confidence:
Given positive seeds S and negative seeds S⁻, the confidence of pattern p, conf(p), is the ratio of positive occurrences to occurrences with either positive or negative seeds:
conf(p) = Σ_{x∈S} freq(p, x) / ( Σ_{x∈S} freq(p, x) + Σ_{x∈S⁻} freq(p, x) )
These quality measures can be used to restrict the acquired patterns to those for which support, confidence or diversity – or any combination of these – are above a given threshold. Moreover, the measures can also be carried over to the observed statements. This is again based on the principle of statement-pattern duality. To this end, we now identify a subset P⁺ of good patterns using the statistical quality measures.
The confidence of statement x (i.e., that an entity x belongs to the class of interest) is the normalized aggregate frequency of co-occurring with good patterns, weighted by the confidence of these patterns:
conf(x) = Σ_{p∈P⁺} freq(x, p) · conf(p) / Σ_q freq(x, q)
where Σ_q ranges over all observed patterns. That is, we achieve perfect confidence in statement x if it is observed only in conjunction with good patterns and all these patterns have perfect confidence 1.0.
The diversity of statement x is the number of distinct patterns p ∈ P⁺ that x co-occurs with:
div(x) = |{p ∈ P⁺ : freq(p, x) > 0}|
For both of these measures, variations are possible as well as combined measures. Diversity is a useful signal to avoid that a single pattern drives the acquisition of statements, which would incur a high risk of error propagation. Rather than relying directly on these measures, it is also possible to use probabilistic models or random-walk techniques for scoring and ranking newly acquired statements. In particular, algorithms for set expansion (aka. concept expansion) can be considered to this end (e.g., [612, 223, 608, 87]).
We can now leverage these considerations to extend the seed-based pattern learning algorithm. The key idea is to prune, in each round of the iterative method, both patterns and statements that do not exceed certain thresholds for support, confidence and/or diversity. Conversely, we can promote, in each round, the best statements to the status of seeds, to incorporate them into the calculation of the quality statistics for the next round.
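The following sketch shows how these statistics could be computed from a toy co-occurrence table and used to prune patterns; the counts and thresholds are arbitrary, and the pattern and entity names follow the toy example of Table 4.1.

from collections import Counter

freq = Counter({                                   # (pattern, entity) -> observation count
    ("singers like $X", "Nina Simone"): 12,
    ("$X's vocal performance", "Amy Winehouse"): 8,
    ("$X's vocal performance", "Queen"): 5,
    ("voice of $X", "Francoise Hardy"): 6,
    ("voice of $X", "Donald Trump"): 4,
})
S = {"Nina Simone", "Amy Winehouse", "Francoise Hardy"}    # positive seeds
S_neg = {"Queen", "Donald Trump"}                          # negative seeds

def support(p):
    num = sum(f for (q, x), f in freq.items() if q == p and x in S)
    den = sum(f for (q, x), f in freq.items() if x in S)
    return num / den if den else 0.0

def confidence(p):                                 # revised confidence with negative seeds
    pos = sum(f for (q, x), f in freq.items() if q == p and x in S)
    neg = sum(f for (q, x), f in freq.items() if q == p and x in S_neg)
    return pos / (pos + neg) if pos + neg else 0.0

def diversity(p):
    return len({x for (q, x), f in freq.items() if q == p and x in S and f > 0})

patterns = {q for (q, _) in freq}
good = {p for p in patterns if confidence(p) >= 0.7 and diversity(p) >= 1}
for p in sorted(patterns):
    print(p, round(support(p), 2), round(confidence(p), 2), diversity(p), p in good)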
For example, when observing Amy Winehouse as a high-confidence statement in some round, we can add her to the seed set, this way enhancing the informativeness of the statistics in the next round. We sketch this extended algorithm, whose key ideas have been developed by [3] and [153].
Extended algorithm for Seed-based Pattern Learning
Input: Seed statements in the form of known entities for a class
Output: Patterns for this class, and new entities of the class
Initialize: S ← seed statements; S⁺ ← S // acquired statements; P ← ∅ (or pre-specified patterns like Hearst patterns)
Repeat
1. Pattern discovery:
- same steps as in base algorithm
- P ← P \ { patterns below quality thresholds }
2. Statement expansion:
- same steps as in base algorithm
- S⁺ ← S⁺ ∪ { newly acquired statements }
- S ← S ∪ { new statements above quality thresholds }
All of the above assumes that patterns are either matched or not. However, it is often the case that a pattern is almost but not exactly matched, with small variations in the wording or using synonymous words. For example, the pattern “Nobel prize winner $X” (for scientists for a change) could be considered as approximately matched by phrases such as “Nobel winning $X” or “Nobel laureate $X”. If these phrases are frequent and we want to consider them as evidence, we can add a similarity kernel sim(p, q) to the observation statistics, based on edit distance or n-gram overlap, where n-grams are sub-sequences of n consecutive characters. This simply extends the frequency of pattern p into a weighted count freq(p) = Σ_q sim(p, q), where Σ_q ranges over all approximate matches q of p in the corpus with sim(p, q) > θ, and θ is a pruning threshold to eliminate weakly matching phrases.
The presented techniques are also applicable to acquiring subclass/superclass pairs (for the subclass-of relation, as opposed to instance-of). For example, we can detect from patterns in web pages that rappers and crooners are subclasses of singers. This has been further elaborated in work on ontology/taxonomy learning like [363] and [153].
A major alternative to learning patterns for entity discovery is to devise end-to-end machine learning models. In contrast to the paradigm of seed-based distant supervision of Subsection 4.3, we now consider fully supervised methods that require labeled training data in the form of annotated sentences (or other text snippets). Typically, these methods work well only if the training data has a substantial size, in the order of ten thousand samples or higher. By exploiting large corpora with appropriate markup, or annotations from crowdsourcing workers, or high-quality outputs of employing methods like those for premium sources, such large-scale training data is indeed available today. On the first direction, Wikipedia is again a first-choice asset, as it has many sentences where a named entity appears and is marked up as a hyperlink to the entity’s Wikipedia article.
In the following, we present two major families of end-to-end learning methods. The first one is probabilistic graphical models where a sequence of words, or, more generally, tokens, is mapped into the joint state of a graph of random variables, with states denoting tags for the input words. The second approach is deep neural networks for classifying the individual words of an input sequence onto a set of tags. Thus, both of these methods are geared for the task of sequence labeling, also known as sequence tagging.
This family of models considers a set of coupled random variables that take a finite set of tags as values. In our application, the tags are primarily used to demarcate entity names in a token sequence, like an input sentence or other snippet. Each random variable corresponds to one token in the input sequence, and the coupling reflects short-distance dependencies (e.g., between the tags for adjacent or nearby tokens). In the simplest and most widely used case, the coupling is pair-wise such that a random variable for an input token depends only on the random variable for the immediately preceding token. This setup amounts to learning conditional probabilities for subsequent pairs of tags as a function of the input tokens.
As the input sequence is completely known upfront, we can generalize this to each random variable being a function of all tokens, or of all kinds of feature functions over the entire token sequence.
Conditional Random Fields (CRF):
The most successful method from this family of probabilistic graphical models is known as Conditional Random Fields, or CRFs for short, originally developed by [304]. More recent tutorials are by [569] on foundations and algorithms, and [511] on applying CRFs for information extraction. CRFs are in turn a generalization of the prior notion of Hidden Markov Models (HMMs), the difference lying in the incorporation of feature functions over the entire input sequence versus only considering two successive tokens.

A Conditional Random Field (CRF), operating over an input sequence $X = x_1 \ldots x_n$, is an undirected graph with a set of finite-state random variables $Y = \{Y_1 \ldots Y_m\}$ as nodes and pair-wise couplings of variables as edges. An edge between variables $Y_i$ and $Y_j$ denotes that their value distributions are coupled. Conversely, in the absence of an edge, two variables are conditionally independent given their neighbors. More precisely, the following Markov condition is postulated for all variables $Y_i$ and all possible values t:
$$P[Y_i = t \mid x_1 \ldots x_n,\, Y_1 \ldots Y_{i-1}, Y_{i+1} \ldots Y_m] = P[Y_i = t \mid x_1 \ldots x_n,\, \text{all } Y_j \text{ with an edge } (Y_i, Y_j)]$$

Strictly speaking, the coupling of random variables may go beyond pairs by introducing factor nodes (or factors for short) for dependencies between two or more variables. In this section, we restrict ourselves to the basic case of pair-wise coupling. Moreover, we assume that the graph forms a linear chain: a so-called linear-chain CRF. Often the variables correspond one-to-one to the input tokens, so each token $x_i$ is associated with variable $Y_i$ (and we have n = m for the number of tokens and variables).

CRF Training:
The training of a CRF from labeled sequences, in the form of $(X, Y)$ value pairs, involves the posterior likelihoods
$$P[Y_i \mid X] = P[Y_i = t_i \mid x_1 \ldots x_n,\, \text{all neighbors } Y_j \text{ of } Y_i]$$
With feature functions $f_k$ over input $X$ and subsets $Y_c \subset Y$ of coupled random variables (with known values in the training data), this can be shown to be equivalent to
$$P[Y \mid X] \sim \frac{1}{Z} \prod_c \exp\Big(\sum_k w_k \cdot f_k(X, Y_c)\Big)$$
with an input-independent normalization constant Z, k ranging over all feature functions, and c ranging over all coupled subsets of variables, the so-called factors. For a linear-chain CRF, the factors are all pairs of adjacent variables:
$$P[Y \mid X] \sim \frac{1}{Z} \prod_i \exp\Big(\sum_k w_k \cdot f_k(X, Y_{i-1}, Y_i)\Big)$$
The parameters of the model are the feature-function weights $w_k$; these are the output of the training procedure. The training objective is to choose the weights $w_k$ so as to minimize the error between the model's maximum-posterior Y values and the ground-truth values, aggregated over all training samples. The error, or loss function, can take different forms, for example, the negative log-likelihood that the trained model generates the ground-truth tags.

As with all machine-learning models, the objective function is typically combined with a regularizer to counter the risk of overfitting to the training data. The training procedure is usually implemented as a form of (stochastic) gradient descent (see, e.g., [68] and further references given there). This guarantees convergence to a local optimum and empirically approximates the global optimum of the objective function fairly well (see [569]). For the case of linear-chain CRFs, the optimization is convex, so we can always approximately achieve the global optimum.
CRF Inference:
When a trained CRF is presented with a previously unseen sentence, the inference stage computes the posterior values of all random variables, that is, the tag sequence for the entire input that has the maximum likelihood given the input and the trained model with weights $w_k$:
$$Y^* = \operatorname{argmax}_Y \; P[Y \mid X, \text{all } w_k]$$
For linear-chain CRFs, this can be computed using dynamic programming, namely, variants of the Viterbi algorithm (for HMMs). For general CRFs, other – more expensive – techniques like Monte Carlo sampling or variational inference are needed. Alternatively, the CRF inference can also be cast into an Integer Linear Program (ILP) and solved by optimizers like the Gurobi software, following [500]. We will go into more depth on ILPs as a modeling and inference tool in Chapter 8, especially Section 8.5.3.
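To make this concrete, the following is a minimal sketch of training and applying a linear-chain CRF tagger with the third-party sklearn-crfsuite library; the toy training sentences, the feature choices and the tag set are illustrative assumptions, not the setup of any specific system discussed here.

import sklearn_crfsuite

def word_features(sent, i):
    # hand-crafted feature functions over the input sequence, in the spirit of f_k(X, ...)
    word = sent[i]
    return {
        "word.lower": word.lower(),                                  # lexical identity
        "word.istitle": word.istitle(),                              # capitalization cue
        "word.isupper": word.isupper(),
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",     # left context
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",  # right context
    }

# toy labeled sequences; real training needs thousands of annotated sentences
train_sents = [["Dylan", "composed", "Hurricane"], ["Elvis", "lived", "in", "Memphis"]]
train_tags  = [["NE", "O", "NE"],                  ["NE", "O", "O", "NE"]]

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = train_tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)                       # learns the feature-function weights w_k

test = ["Presley", "recorded", "Hound", "Dog"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))  # Viterbi decoding

The feature dictionaries play the role of the feature functions; the library learns one weight per feature-tag combination plus tag-transition weights, exactly the linear-chain setting described above.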
CRF for Part-of-Speech Tagging:
A classical application of CRF-based learning is part-of-speech tagging: labeling each word in an input sentence with its word category, like noun (NN), verb (VB), preposition (IN), article (DET) etc. Figure 4.1 shows two examples for this task, with their correct output tags.

Figure 4.1: Examples for CRF-based Part-of-Speech Tagging (input tokens $x_1 \ldots x_5$ with output variables $Y_1 \ldots Y_5$: "Time flies like an arrow" → NN VB IN DET NN; "Fruit flies like a banana" → NN NN VB DET NN)
The intuition why this works so well is that word-sequence frequencies as well as tag-sequence frequencies from large corpora can inform the learner, in combination with other feature functions, to derive very good weights. For example, nouns are frequently followed by verbs, verbs are frequently followed by prepositions, and some word pairs are composite noun phrases (such as "fruit flies"). Large corpus statistics also help to cope with exotic or even nonsensical inputs. For example, the sentence "Pizza flies like an eagle" would be properly tagged as NN VB IN DET NN because, unlike fruit flies, there is virtually no mention of pizza flies in any corpus.
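Off-the-shelf taggers make it easy to try such examples. The snippet below uses NLTK's pre-trained tagger (a statistical sequence model in the same spirit), under the assumption that the required NLTK resources can be downloaded; note that it emits Penn Treebank tags (DT, VBZ, ...), which differ slightly from the simplified tags in Figure 4.1.

import nltk

# resource name may vary slightly across NLTK versions
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Time flies like an arrow", "Fruit flies like a banana"]:
    tokens = sentence.split()          # simple whitespace tokenization for the toy example
    print(nltk.pos_tag(tokens))        # list of (word, tag) pairs, e.g. ('like', 'IN')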
CRF for Named Entity Recognition (NER) and Typing:
For the task at hand, entity discovery, the CRF tags of interest are primarily
NE for Named Entity and O for Others. For example, the sentence "Dylan composed the song Sad-eyed Lady of the Lowlands, about his wife Sara Lownds, while staying at the Chelsea hotel" should yield the tag sequence NE O O O NE NE NE NE NE O O O NE NE O O O O NE NE. Many entity mentions correspond to the part-of-speech tag NNP, for proper noun, that is, nouns that should not be prefixed with an article in any sentence, such as names of people. However, this is not sufficient, as it would accept false positives like abstractions (e.g., "love" or "peace") and would miss out on multi-word names that include non-nouns, such as song or book titles (e.g., "Sad-eyed Lady of the Lowlands"), and names that come with an article (e.g., "the Chelsea hotel"). For these reasons, CRFs for
Named Entity Recognition, or NER for short, have been specifically developed, with training over annotated corpora. The seminal work on this is [166]; advanced extensions are implemented in the Stanford CoreNLP software suite ([367]).

As the training of a CRF involves annotated corpora, instead of merely distinguishing entity mentions versus other words, we can piggyback on the annotation effort and incorporate more expressive tags for different types of entities, such as people, places, products (incl. songs and books). This idea has indeed been pursued already in the work of [166], integrating coarse-grained entity typing into the CRF for NER. The simple change is to move from output variables with tag set {NE, O} to a larger tag set like {PERS, LOC, ORG, MISC, O}, denoting persons (PERS), locations (LOC), organizations (ORG), entities of miscellaneous types (MISC) such as products or events, and non-entity words (O). For the above example about Bob Dylan, we should then obtain the tag sequence PERS O O O MISC MISC MISC MISC MISC O O O PERS PERS O O O O LOC LOC. The way the CRF is trained and used for inference stays the same, but the training data requires annotations for the entity types. Later work has even devised (non-CRF) classifiers for more fine-grained entity typing, with tags for hundreds of types, such as politicians, scientists, artists, musicians, singers, guitarists, etc. [168, 344, 413, 93]. An easy way of obtaining training data for this task is to consider hyperlink anchor texts in Wikipedia as entity names and derive their types from Wikipedia categories, or directly from a core KB constructed by methods from Chapter 3. An empirical comparison of various NER methods with fine-grained typing is given by [365].

Widely used feature functions for CRF-based NER tagging, or features for other kinds of NER/type classifiers, include the following:
• part-of-speech tags of words and their co-occurring words in left-hand and right-hand proximity,
• uppercase versus lowercase spelling,
• word occurrence statistics in type-specific dictionaries, such as dictionaries of people names, organization names, or location names, along with short descriptions (from yellow pages and so-called gazetteers),
• co-occurrence frequencies of word-tag pairs in the training data,
• further statistics for word n-grams and their co-occurrences with tags.
There is also substantial work on domain-specific NER, especially for the biomedical domain (see, e.g., [170] and references given there), and also for chemistry, restaurant names and menu items, and titles of entertainment products. In these settings, domain-specific dictionaries play a strong role as input for feature functions [476, 519].

In recent years, neural networks have become the most powerful methodology for supervised machine learning when sufficient training data are available. This holds for a variety of NLP tasks [186], including Named Entity Recognition.

The most primitive neural network is a single perceptron, which takes as input a set of real numbers, aggregates them by weighted summation, and applies a non-linear activation function (e.g., logistic function or hyperbolic tangent) to the sum to produce its output. Such building blocks can be connected to construct entire networks, typically organized into layers. Networks with many layers are called deep networks.
As a loose metaphor, one may think of the nodes as neurons and the interconnecting edges as synapses. The weights of incoming edges (for the weighted summation) are the parameters of such neural models, to be learned from labeled training data. The inputs to each node are usually entire vectors, not just single numbers, and the top layer's outputs are real values for regression models or, after applying a softmax function, scores for classification labels. The loss function for the training objective can take various forms of error measures, possibly combined with regularizers or constraint-based penalty terms. For neural learning, it is crucial that the loss function is differentiable in the model parameters (i.e., the weights), and that gradients can be backpropagated through the entire network. Under this condition, training is effectively performed by methods for stochastic gradient descent (see, e.g., [68]), and modern software libraries (e.g., TensorFlow) support scaling out these computations across many processors. For inference, with new inputs outside the training data, input vectors are simply fed forward through the network by performing matrix and tensor operations at each layer.
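The following minimal numpy sketch spells out these ingredients for a single hidden layer; all dimensions and the random weights are illustrative placeholders (training would adjust the weight matrices by backpropagation and stochastic gradient descent).

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(300, 128)), np.zeros(128)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(128, 5)), np.zeros(5)       # hidden layer -> 5 tag scores

def forward(x):                        # x: 300-dimensional input vector (e.g., a word embedding)
    h = np.tanh(x @ W1 + b1)           # weighted summation + non-linear activation
    scores = h @ W2 + b2               # one score per output label
    return np.exp(scores) / np.exp(scores).sum()   # softmax -> label probabilities

print(forward(rng.normal(size=300)).round(3))       # untrained weights, so an arbitrary distribution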
LSTM Models:
There are various families of neural networks, with different topologies for interconnecting layers. For NLP, where the input to the entire network is a text sequence, so-called LSTM networks (for "Long Short-Term Memory") have become prevalent (see [515, 186] and references there). They belong to the broader family of recurrent neural networks with feedback connections between nodes of the same layer. This allows these nodes to aggregate latent state computed from seeing an input token and the latent state derived from the preceding tokens. To counter potential bias from processing the input sequence in forward direction alone, Bi-directional LSTMs (or
Bi-LSTMs for short) connect nodes in both directions.

We can think of LSTMs as the neural counterpart of CRFs. A key difference, however, is that neural networks do not require the explicit modeling of feature functions. Instead, they take the raw data (in vectorized form) as inputs and automatically learn latent representations that implicitly capture features and their cross-talk.

Figure 4.2 gives a pictorial illustration of an LSTM-based neural network for NER, applied to a variant of our Bob Dylan example sentence. The outputs of the forward LSTM and the backward LSTM are combined into a latent data representation, for example, by concatenating vectors. On top of the bi-LSTM, further network layers (learn to) compute the scores for each tag, typically followed by a softmax function to choose the best label. The output sequence of tags is slightly varied here, by prefixing each tag with its role in a subsequence of identical tags: B for Begin, I for In, and E for End. This serves to distinguish the case of a single multi-word mention from the case of different mentions without any interleaving "Other" words. The E tags are not really needed as a following B tag indicates the next mention anyway, hence E is usually omitted. This simple extension of the tag set is also adopted by CRFs and other sequence labeling learners; we disregarded this earlier for simplicity.

Figure 4.2: Illustration of LSTM network for NER (the input sequence "Dylan composed Sad-eyed Lady of the Lowlands at the Chelsea hotel" is mapped via input vectors, bi-LSTM layers, learned representations and additional network layers onto the output tags B-PERS O B-Misc I-Misc I-Misc I-Misc E-Misc O O B-LOC E-LOC)

LSTM-based neural networks can be combined with a CRF on top of the neural layers [250, 361, 306], this way combining the strengths of the two paradigms. Other enhancements (see [330] for a survey of neural NER) include additional bi-LSTM layers for learning character-level representations, capturing character n-grams in a latent manner. This can leverage large unlabeled corpora, analogously to the role of dictionaries in feature-based taggers. This line of methods has also extended the scope of fine-grained entity typing, yielding labels for thousands of types [93].

Overall, deep neural networks, in combination with CRFs, tend to outperform other methods for NER whenever a large amount of training data is at hand. When training data is not abundant, for example, in specific domains such as health, pattern-based methods and feature-driven graphical models (incl. CRFs) are still a good choice.
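As an illustration of the architecture sketched in Figure 4.2, here is a skeletal bi-LSTM tagger in PyTorch; vocabulary size, tag set and dimensions are assumptions for the sketch, and a production model would add pre-trained embeddings, character-level layers and typically a CRF output layer.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)   # forward + backward states -> tag scores

    def forward(self, token_ids):                  # token_ids: (batch, seq_len) integer-encoded tokens
        states, _ = self.lstm(self.emb(token_ids))
        return self.out(states)                    # per-token tag logits

tagger = BiLSTMTagger(vocab_size=50_000, num_tags=9)   # e.g. B/I tags for PERS, LOC, ORG, MISC plus O
logits = tagger(torch.randint(0, 50_000, (1, 12)))     # one dummy sentence of 12 tokens
print(logits.argmax(dim=-1))                           # predicted tag ids (untrained, hence arbitrary)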
Machine learning methods, and especially neural networks, do not operate directly on text, as they require numeric vectors as inputs. A popular way of casting text into this suitable form is by means of embeddings.

Word Embedding:
Embeddings of words (or multi-word phrases) are real-valued vectors of fixed dimensionality, such that the distance between two vectors (e.g., by their cosine) reflects the relatedness (sometimes called "semantic similarity") of the two words, based on the respective contexts in which the words typically occur.

Word embeddings are computed (or "learned") from co-occurrences and neighborhoods of words in large corpora. The rationale is that the meaning of a word is captured by the contexts in which it is often used, and that two words are highly related if they are used in similar contexts. This is often referred to as "distributional semantics" or "distributed semantics" of words [324]. This hypothesis should not be confused with two words directly co-occurring together. Instead, we are interested in indirect co-occurrences where the contexts of two input words share many words. For example, the words "car" and "automobile" have the same semantics, but rarely co-occur directly in the same text span. The point rather is that both often co-occur with third words such as "road", "traffic", "highway" and names of car models.

Technically, this becomes an optimization problem. Given a large set of text windows C of length k+1 with word sequences $w_0 \ldots w_k$, we aim to compute a fixed-length vector $\vec{w}$ for each word w such that the error for predicting the word's surrounding window $C(w_t) = w_{t-k/2} \ldots w_t \ldots w_{t+k/2}$, from all word-wise vectors alone, is minimized. This consideration leads to a non-convex continuous optimization with the objective function [384]:
$$\text{maximize} \sum_{C} \;\; \sum_{j \in C(w_t),\, j \neq t} \log \frac{\exp(\vec{w}_j^{\,T} \cdot \vec{w}_t)}{\sum_v \exp(\vec{v}^{\,T} \cdot \vec{w}_t)}$$
where the outermost sum ranges over all possible text windows (with overlapping windows). The dot product between the output vectors $\vec{w}_j$ and $\vec{w}_t$ reflects overlapping-context likelihoods of word pairs, and the softmax function normalizes these scores by considering all possible words v. Intuitively, the objective is maximized if the resulting word vectors can predict the surrounding window from a given word with high accuracy. This specific objective is known as the skip-gram model; there are also other variations, with a similar flavor. Computing solutions for the non-convex optimization is typically done via gradient descent methods.

Embedding vectors can be readily plugged into machine-learning models, and they are a major asset for the power of neural networks for NLP tasks. A degree of freedom is the choice of the dimensionality of the vectors. For most use cases, this is set to a few hundred, say 300, for robust behavior.

The most popular models and tools for this line of text embeddings are word2vec, by [384], and GloVe, by [447]. Both come with pre-packaged embeddings derived from news and other collections, but applications can also compute new embeddings from customized corpora. The word2vec approach has been further extended to compute embeddings for short paragraphs and entire documents (called doc2vec).
Important earlier work on latent text representations, with similar but less expressive models, includes Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) ([114, 243, 48]).

A recent, even more advanced way of computing and representing embeddings is by training deep neural networks for word-level or sentence-level prediction tasks, and then keeping the learned model as a building block for training an encompassing network for downstream tasks (e.g., question answering or conversational chatbots). The pre-training utilizes large corpora like the full text of all Wikipedia articles or the Google Books collection. A typical objective function is to minimize the error in predicting a masked-out word given a text window of successive words. The embeddings for all words are jointly given by the learned weights of the network's synapses (i.e., connections between neurons), with 100 million real numbers or even more. Popular instantiations of this approach are ELMo [448] and BERT [118].
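For the skip-gram model described above, a sketch with the gensim library (API of gensim 4.x) could look as follows; the two toy sentences are a stand-in for a large tokenized corpus, and all parameter values are illustrative.

from gensim.models import Word2Vec

corpus = [
    ["dylan", "recorded", "hurricane", "a", "protest", "song"],
    ["presley", "recorded", "hound", "dog", "a", "rock", "song"],
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5, sg=1, min_count=1)  # sg=1: skip-gram

vec = model.wv["song"]                  # the learned embedding vector for "song"
print(model.wv.most_similar("song"))    # neighbors by cosine; meaningful only with a real corpus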
Embeddings for Word and Entity Pair Relatedness:
Embedding vectors are not directly interpretable; they are just vectors of numbers. However, we can apply linear algebra operators to them to obtain further results. Embeddings are additive and subtractive, which allows forming analogies of the form:
$$\vec{man} - \vec{king} \approx \vec{woman} - \vec{queen}$$
$$\vec{France} - \vec{Paris} \approx \vec{Germany} - \vec{Berlin}$$
$$\vec{Einstein} - \vec{scientist} \approx \vec{Messi} - \vec{footballer}$$
So we can solve an equation like
$$\vec{Rock\ and\ Roll} - \vec{Elvis\ Presley} = \vec{Folk\ Rock} - \vec{X}$$
yielding
$$\vec{X} = \vec{Folk\ Rock} - \vec{Rock\ and\ Roll} + \vec{Elvis\ Presley} \approx \vec{Bob\ Dylan}$$
Most importantly for practical purposes, we can compare the embeddings of two words(or phrases) by computing a distance measure between their respective vectors, typically,the cosine or the scalar product. This gives us a measure of how strongly the two words arerelated to each other, where (near-) synonyms would often have the highest relatedness.
Embedding-based Relatedness:
For two words v and w, their relatedness can be computed as $\cos(\vec{v}, \vec{w})$ from their embedding vectors $\vec{v}$ and $\vec{w}$.

The absolute values of the relatedness scores are not crucial, but we can now easily order related words by descending scores. For example, for the word "rock", the most related words and short phrases are "rock n roll", "band", "indie rock" etc., and for "knowledge" we obtain the most salient words "expertise", "understanding", "knowhow", "wisdom" etc. We will later see that such relatedness measures are very useful for many sub-tasks in knowledge base construction.

The embedding model can capture not just words or other text spans, but we can also apply it to compute distributional representations of entities. This is achieved by associating each entity in the knowledge base with a textual description of the entity, typically the Wikipedia article about the entity (but possibly also the external references given there, homepages of people and organizations, etc.). Once we have per-entity vectors, we can again compute cosine or scalar-product distances for entity pairs. This results in measures for entity-entity relatedness. Moreover, by coupling the computations of per-word and per-entity embeddings, we also obtain scores for entity-word relatedness, which is often handy when we need salient keywords or keyphrases for an entity. For example, the embedding for Elvis Presley should be close to the embeddings for "king", "rock n roll", etc. Technical details for these models can be found in [616, 677, 638]; the latter includes data and code for the wikipedia2vec tool.

Such embeddings have also been computed from domain-specific data sources, most notably, for biomedical entities and terminology, with consideration of the standard MeSH vocabulary. Resources of this kind include BioWordVec ([663]) and BioBERT ([314]). An important predecessor to all these works is the semantic relatedness model of [171], which was the first to harness Wikipedia articles for this purpose.

A related, recent direction is knowledge graph (KG) embeddings (see [611] for a survey). These kinds of embeddings capture the neighborhood of entities in an existing graph-structured KB. They do not use textual inputs, however, and serve different purposes. We will discuss KG embeddings in Chapter 8, specifically Section 8.4.
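A minimal sketch of embedding-based relatedness: given pre-computed word or entity vectors (here random stand-ins for trained embeddings), candidates are ranked by cosine similarity, as in the "rock" example above.

import numpy as np

def cosine(v, w):
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

rng = np.random.default_rng(0)
vectors = {name: rng.normal(size=300)              # placeholders; real systems load trained vectors
           for name in ["rock", "rock n roll", "band", "indie rock", "wisdom"]}

query = vectors["rock"]
ranking = sorted(((cosine(query, v), name) for name, v in vectors.items() if name != "rock"),
                 reverse=True)
print(ranking)   # with real embeddings, the music-related terms would rank highest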
Assuming that we can extract a large pool of types, and optionally also entities for them, the task discussed here is to construct a taxonomic tree or DAG (directed acyclic graph) for these types – without assuming any prior structure such as WordNet. In the literature, the problem is also referred to as taxonomy induction [541, 457], as its output is a generalization of bottom-up observations. The input can take different forms, for example, starting from the noisy set of Wikipedia categories (but ignoring the graph structure), or from noisy and sparse pairs of hyponym-hypernym candidates derived by applying patterns to large text and web corpora. A good example for the latter is the
WebIsALOD project (http://webisa.webdatacommons.org/), which used more than 50 patterns to extract candidate pairs and a supervised classifier to prune out the noisiest ones [232]. This collection, and others of similar flavor, does not strictly focus on hypernymy but also captures meronymy/holonymy (part-of) and, to some extent, instance-type pairs. Hence the broader term
IsA in the project name.
Methods for Wikipedia Categories:
Seminal work that considered all Wikipedia categories as noisy type candidates and thesubcategory-supercategory pairs as hypernymy candidates was the
WikiTaxonomy project [456, 457]. Its approach can be characterized by three steps:
Wikipedia-based Taxonomy Induction:
• Category Cleaning: eliminating noisy categories that do not really denote types.
• Category-Pair Classification: using a rule-based classifier to eliminate pairs that do not denote hypernymy.
• Taxonomy Graph Construction: building a tree or DAG from the remaining types and hypernymy pairs.
The first step is very similar to the techniques presented in Section 3.2. State-of-the-art techniques for this purpose are discussed in [440]. The second step is based on heuristic but powerful rules that compare stems or lemmas of head words in multi-word noun phrases. The following shows two examples for rules:
• For sub-category S and direct super-category C: if head(S) is the same as head(C), then this is likely a hyponym-hypernym pair (e.g., S = "American baritones", C = "baritones by nationality").
• If the head word of one category appears in the other category only as a modifier, that is, not as its head word, then this is likely not a good pair (e.g., S = "American baritones", C = "baritone saxophone players").
A simplified sketch of this head-word heuristic is given below. Additional rules are used to refine the first case and to handle other cases. This includes considering instances of a category, at the entity level, and comparing their set of categories against the category at hand.

For the third step, graph construction, the method applies transitivity to build a multi-rooted graph, eliminates cycles by removing as few edges as possible, and connects all roots of the resulting DAG to the universal type entity.

An industrial-strength variation and extension of the presented method is discussed in [117]. Another Wikipedia component that has been considered as noisy input for taxonomy induction is infobox templates. The Wikipedia community has developed a large number of different templates for people, musicians, bands, songs, albums etc. They are instantiated in highly varying numbers, and there is redundancy, for example, different templates for songs, some used more than others. [625] proposed a learning approach, using SVM classifiers and CRF-like graphical models, to infer a clean taxonomy from this noisy data.
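A rough sketch of the head-word rules could look as follows; the head extraction (last word before a preposition, with naive de-pluralization) is a deliberate simplification of what a parser plus lemmatizer would provide, and the function names are illustrative.

PREPOSITIONS = {"by", "of", "in", "from", "with"}

def head(category):
    words = category.lower().split()
    for i, w in enumerate(words):
        if w in PREPOSITIONS:            # "baritones by nationality" -> "baritones"
            words = words[:i]
            break
    return words[-1].rstrip("s")         # crude singularization of the head word

def hypernym_candidate(sub, sup):
    """Heuristic decision whether (sub, sup) is a plausible hyponym-hypernym pair."""
    if head(sub) == head(sup):
        return True                      # "American baritones" / "baritones by nationality"
    if head(sub) in sup.lower() and head(sub) != head(sup):
        return False                     # "American baritones" / "baritone saxophone players"
    return False                         # remaining cases would need additional rules / instance checks

print(hypernym_candidate("American baritones", "baritones by nationality"))    # True
print(hypernym_candidate("American baritones", "baritone saxophone players"))  # False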
Taxonomies from Catalogs, Networks and User Behavior:
Alternatively to Wikipedia categories, other catalogs of categories can be processed in a similar manner, for example, the DMOZ directory of web sites (https://dmoz-odp.org/) or the Icecat open product catalog (https://icecat.biz/). Some methods combine information from catalogs with topical networks, for example, connecting business categories, users and reviews on sites such as Yelp or TripAdvisor, and potentially also informative terms from user reviews. Examples of such methods are [609, 520]. Last but not least, recent methods
AutoKnow pipeline [133], discussed further in Section 9.5.
Folksonomies from Social Tags:
The general approach has also been carried over to build taxonomies from social tagging ,resulting in so-called folksonomies [199]. The input is a set of items like images or webpages of certain types that are associated with concise tags to annotate items, such as“sports car”, “electric car”, “hybrid auto” etc. If the number of items and the taggingcommunity are very large, the frequencies and co-occurrences of (words in) tags provide cuesabout proper types as well as type pairs where one is subsumed by the other. Data miningtechniques (related to association rules) can then be applied to clean such a large but noisycandidate pool, and the subsequent DAG construction is straightforward (e.g.,[233, 247,262]).Another target for similar techniques are fan communities (e.g., hosted at http://fandom.com aka. Wikia), which have collaboratively built extensive but noisy category and taggingsystems for entertainment fiction like movie series or TV series (e.g., Lord of the Rings,Game of Thrones, The Simpsons etc.) [97, 231].
Methods for Web Contents:
Early approaches spotted entity names in web-page collections and clustered these byvarious similarity measures (e.g., [141]). Using Hearst patterns and other heuristics, typelabels are then derived for each cluster. Such techniques can be further refined for scoringand ranking the outputs (e.g., using label propagation with random-walk techniques [571]).The seminal
KnowItAll project [153] advanced this line of research by a suite of scoringand classification techniques to enhance the output quality. It also studied tapping into listsof named entities as a source of type cues. For example, headers or captions of lists mayserve as type candidates, and pairs of lists where one mostly subsumes the other in terms ofelements (with some tolerance for exceptions) can be viewed as candidates for hypernymy.In the
Probase project, noisy candidates for hypernymy pairs were mined from the Web index of a major search engine [631]. First, Hearst patterns were liberally applied to this huge text collection. Then, a probabilistic model was used to prune out noise and infer likely candidates for hyponym-hypernym pairs, based on (co-)occurrence frequencies. The approach led to a huge but still noisy and incomplete taxonomy. Also, the resulting types are not canonicalized, meaning that synonymous type names may appear as different nodes in the taxonomy with different neighborhoods of hyponyms and hypernyms. Nevertheless, for use cases like query recommendation in web search, such a large collection of taxonomic information can be a valuable asset.

Recent works have approached taxonomy induction as a supervised machine-learning task, using factor graphs or neural networks, or by reinforcement learning (see, e.g., [31, 529, 368]).
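To illustrate the liberal application of Hearst patterns in pipelines of this kind, the following sketch applies a single "such as" pattern with a regular expression; real systems use many more patterns, part-of-speech constraints, conjunction handling, and corpus-scale statistics for pruning (the toy corpus below is a made-up stand-in).

import re
from collections import Counter

# "<class> such as <Instance>" with a crude one/two-word class and a capitalized instance
PATTERN = re.compile(r"([A-Za-z][a-z]+(?: [a-z]+)?)\s+such as\s+([A-Z][a-z]+(?: [A-Z][a-z]+)?)")

corpus = [
    "He admired protest singers such as Bob Dylan and Joan Baez.",
    "Cities such as Berlin attract many musicians.",
]

pair_counts = Counter()
for sentence in corpus:
    for cls, inst in PATTERN.findall(sentence):
        pair_counts[(cls.strip().lower(), inst.strip())] += 1   # conjuncts like "and Joan Baez" need extra handling

print(pair_counts)   # these frequencies feed the support/confidence-style pruning discussed earlier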
Methods for Query-Click Logs:
Search engine companies have huge logs of query-click pairs: user-issued keyword queries and subsequent clicks on web pages after seeing the preview snippets of top-ranked results. When a sufficiently large fraction of queries is about types (aka. classes), such as "American song writers" or "pop music singers from the midwest", one can derive various signals towards inferring type synonymy and pairs for the IsA relation:
• Surface cues in query strings: frequent patterns of query formulations that indicate type names. Examples are queries that start with "list of", or noun-phrase query strings with a prefix (or head word) known to be a type followed by a modifier, such as "musicians who were shot" or "IT companies started in garages".
• Co-Clicks: pages that are (frequently) clicked for two different queries. For example, if the queries "American song writers" and "Americana composers" have many clicks in common, they could be viewed as synonymous types (a toy computation of this cue is sketched below).
• Overlap of query and page title: the word-level n-gram overlap between the query string and the title of a (frequently) clicked page. For example, if the query "pop music singers from the midwest" often leads to clicking the page with title "baritone singers from the midwest", this pair is a candidate for the IsA relation (or, specifically, hypernymy between types, if the two strings are classified to denote types, not entity instances or other phrases).
A variety of methods have been devised to harness these cues for inferring taxonomic relations (synonymy and hypernymy, or more coarsely IsA) by [24, 441, 347, 346]. These involve scoring and ranking the candidates, so that different slices can be compiled depending on whether the priority is precision or recall. By incorporating word-embedding-based similarities and learning techniques, the directly observed cues can also convey generalizations, for example, inferring that "crooners from Mississippi" are a subtype of "singers from the midwest".

Some of the resulting collections of types and type-name pairs are much richer and more fine-grained than the taxonomies that hinge on Wikipedia-like sources. Their strength is that they cover very specific types absent in current KBs, such as "musicians who were shot" (e.g., John Lennon), "musicians who died at 27" (e.g., Jim Morrison, Amy Winehouse, etc.), or "IT companies started in garages" (e.g., Apple), along with sets of paraphrases for these (e.g., "27 club" for "musicians who died at 27"). Such repositories are very useful for query suggestions (i.e., auto-completion or re-formulations) and explorative browsing (see, e.g., [347, 346]), but they do not (yet) reach the semantic rigor and near-human quality of full-fledged knowledge bases.
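As a toy illustration of the co-click cue mentioned above, the following sketch computes Jaccard overlaps of click sets from a made-up query-click log; the threshold is an arbitrary illustration.

from collections import defaultdict
from itertools import combinations

click_log = [                                  # hypothetical (query, clicked page) pairs
    ("american song writers", "page/bob_dylan"),
    ("american song writers", "page/carole_king"),
    ("americana composers",   "page/bob_dylan"),
    ("americana composers",   "page/carole_king"),
    ("pop music singers",     "page/madonna"),
]

clicks = defaultdict(set)
for query, page in click_log:
    clicks[query].add(page)

for q1, q2 in combinations(clicks, 2):
    jaccard = len(clicks[q1] & clicks[q2]) / len(clicks[q1] | clicks[q2])
    if jaccard > 0.5:
        print(f"candidate type synonyms: {q1!r} ~ {q2!r} ({jaccard:.2f})")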
Discussion:
Overall, the methods presented in this section have not yet achieved taxonomies of betterquality and much wider coverage than those built directly from premium sources (seeChapter 3). Nevertheless, the outlined methodologies for coping with noisier input areof interest and value, for example, for query suggestion by search engines and towardsconstructing domain-specific KBs (e.g. on health where user queries could be valuable cues;see [286] and references there).
The following are key points to remember.
• The task of entity discovery involves finding more entities for a given type as well as finding more informative types for a given entity. These two goals are intertwined, and many methods in this chapter apply to both of them.
• To discover entity names in web contents, dictionaries and patterns are an easy and very effective way. Patterns can be hand-crafted, such as Hearst patterns, or automatically computed by seed-based distantly supervised learning, following the principle of statement-pattern duality. For assuring the quality of newly acquired entity-type pairs, quantitative measures like support and confidence must be considered.
• When sufficient amounts of labeled training data are available, in the form of annotated sentences, end-to-end supervised learning is a powerful approach. This is typically cast into a sequence tagging task, known as Named Entity Recognition (NER) and Named Entity Typing. Methods for this purpose are based on probabilistic graphical models like CRFs, or deep neural networks like LSTMs, or combinations of both.
• A useful building block for all these methods is word and entity embeddings, which latently encode the degree of relatedness between pairs of words or entities.
• While most of these methods start with a core KB that already contains a (limited) set of entities and types, it is also possible to compute IsA relations by ab-initio taxonomy induction from text-based observations only. This has potential for obtaining more long-tail items, but comes at a higher risk of quality degradation.
The entity discovery methods discussed in Chapter 4 may inflate the KB with alias namesthat refer to the same real-world entity. For example, we may end up with entity namessuch as “Elvis Presley”, “Elvis” and “The King”, or “Harry Potter Volume 1” and “HarryPotter and the Philosopher’s Stone”. If we treated all of them as distinct entities, we wouldend up with redundancy in the KB and, eventually, inconsistencies. For example, the birthand death dates for
Elvis Presley and
Elvis could be different, causing uncertainty aboutthe correct dates. For some KB applications, this kind of inconsistency may not cause muchharm, as long as humans are satisfied with the end results, such as finding songs for asearch-engine query about
Elvis . However, applications that combine , compare and reason with KB data, such as entity-centric analytics or recommendations, need to be aware ofcases when two names denote the same entity. For example, counting book mentions formarket studies should properly combine the two variants of the same Harry Potter bookwhile avoiding conflation with other book titles that denote different volumes of the series.Likewise, a user should not erroneously get recommendations for a book that she alreadyread.This motivates why a high-quality KB needs to tame ambiguity by canonicalizing entitymentions, creating one entry for all observations of the same entity regardless of namevariants. The task comes in a number of different settings. The most widely studied case is called
Entity Linking (EL) , where we assume an existingKB with a rich set of canonicalized entities (e.g., harvested from premium sources) and weobserve a new set of mentions in additional inputs like text documents or web tables. Whenthe input is text, the task is also known as
Named Entity Disambiguation (NED) inthe computational linguistics community. Historically, the so-called
Wikification task [383,386] has aimed to map both named entities and general concepts onto Wikipedia articles(including common nouns such as “football”, which could mean either American football orEuropean football aka. soccer, or the ball itself). This leads to the broad task of
Word Sense Disambiguation (WSD). Text with entity mentions also contains general words that are often equally ambiguous. For example, words like "track" and "album" (cf. Figure 5.1) can have several and quite different meanings, referring to music (as in the figure) or to completely different topics. The WSD task is to map these surface words (and possibly also multi-word phrases) onto their proper word senses in WordNet or onto Wikipedia articles [417, 399, 203]. For the mission of KB construction, general concepts and WSD are out of scope.

Figure 5.1 gives an example for the EL task. In the input text on the left side, NER methods can detect mentions, and these need to be mapped to their proper entities in the KB, shown on the right-hand side. Candidate entities can be determined based on surface cues like string similarity of names. In the example, this leads to many candidates for the first name Bob, but also for the highly ambiguous mentions "Hurricane", "Carter" and "Washington". As each mention has so many mapping options, we face a complex combinatorial problem. Wikipedia knows more than 700 people with first name (or nickname) Bob, and Wikidata contains many more.
Figure 5.1: Example for Entity Linking. The input text (left) reads "Hurricane, about Carter, is one of Bob's tracks. It is played in the film with Washington."; the candidate entities (right) include Hurricane (song), Hurricane cocktail, Jimmy Carter, Rubin Carter, Bob Dylan, Robert Kennedy, Washington, DC, George Washington and Denzel Washington.
To compute, or learn to compute, the correct mapping, all methods consider varioussignals that relate input and output:
Mention-Entity Popularity:
If an entity is frequently referred to by the name of the mention, this entity is a likely candidate. For example, "Carter" and "Washington" most likely denote the former US president
Jimmy Carter and
Washington, DC.
Mention-Entity Context Similarity:
Mentions have surrounding text, which can be compared to descriptions of entitiessuch as short paragraphs from Wikipedia or keyphrases derived from such texts.For example, the context words “tracks” and “played” are cues towards music andmusicians, and “film with” suggests that “Washington” is an actor or actress.
Entity-Entity Coherence:
In meaningful texts, different entities do not co-occur uniformly at random. Whywould someone write a document about
Jimmy Carter drinking a
Hurricane (cocktail) together with
Robert Kennedy and
George Washington ?For two entities to co-occur, a semantic relationship should hold between them. Theexisting KB may have such prior knowledge that can be harnessed. For example,
Bob Dylan has composed Hurricane (song), and its lyrics are about the African-American boxer Rubin Carter, aka. Hurricane, who was wrongfully convicted of murder in the 1970s and later released after 20 years in prison. By mapping the mentions to these inter-related entities, we obtain a highly coherent interpretation.

In Figure 5.1, the edges between mentions and candidate entities indicate mention-entity similarities, and the edges among candidate entities indicate entity-entity coherence. Obviously, for quantifying the strength of similarity and coherence, these edges should be weighted, which is not shown in the figure. Some edge weights are stronger than others, and these are the cues for inferring the proper mapping. In Figure 5.1, these indicative edges are drawn as thicker blue lines.

Algorithmic and learning-based methods for EL are based on these three components. EL is intensively researched and applied not just for KB construction, with comprehensive surveys by [523, 343, 370] and widely used benchmarks (e.g., [494]). This chapter discusses major families of methods and their building blocks.
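To make the interplay of the three signals concrete, here is a toy scoring sketch for ranking the candidate entities of a single mention; the numbers and the linear weighting are illustrative assumptions, and actual EL methods optimize such scores jointly over all mentions, often with learned weights.

def score(popularity, context_similarity, coherence, alpha=0.3, beta=0.4, gamma=0.3):
    # simple weighted combination of mention-entity popularity, context similarity,
    # and coherence with the other entities chosen for the input text
    return alpha * popularity + beta * context_similarity + gamma * coherence

candidates = {                            # hypothetical signal values for the mention "Carter"
    "Rubin Carter": (0.10, 0.70, 0.90),   # boxer: strong context and coherence with Bob Dylan
    "Jimmy Carter": (0.80, 0.20, 0.10),   # president: popular, but wrong context here
}
best = max(candidates, key=lambda e: score(*candidates[e]))
print(best)   # -> "Rubin Carter"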
The EL task has many variations and extensions. An important case is the treatment of out-of-KB entities: mentions that denote entities that are not (yet) included in the KB. This situation often arises with emerging entities, such as newly created songs or books, people or organizations that suddenly become prominent, and long-tail entities such as garage bands or small startups. In such cases, the EL method has an additional option to map a mention to null, meaning that none of the already known entities is a proper match. This may hold even if the KB has reasonable string matches for the name itself. For example, the Nigerian saxophonist Peter Udo, who played with the Orchestra Baobab, is not included in any major KB (to the best of our knowledge), but there are many matches for the string "Peter Udo", as this is also a German first name. A good EL method needs to calibrate its linking decisions to avoid spurious choices, and should map mentions with low confidence in being KB entities to null. Such long-tail entities may become candidates to be included later in the KB life-cycle. We will revisit this issue in Chapter 8 on KB curation, specifically Section 8.6.3.
The initial KB against which EL methods operate is inevitably incomplete, regarding both coverage of entities and coverage of different names for the known entities. The former is addressed by awareness of out-of-KB entities. The latter calls for grouping mentions into equivalence classes that denote the same entities. In NLP, this task is known as coreference resolution (CR); in the world of structured data, its counterpart is the entity matching (EM) problem (see Section 5.2).

The CR task is highly related to the EL setting, as illustrated by Figure 5.2. Here, the input text contains underdetermined phrases like "the album", "the singer" and "wife" (or say "his wife"). Longer texts will likely contain pronouns as well, such as "she", "her", "it", "they", etc. All these are not immediately linkable to KB entities. However, we can first aim to identify to which other mentions these coreferences refer, this way computing equivalence classes of mentions. In Figure 5.2, possible groupings are indicated by edges between mentions, and the correct ones are marked by thick lines in blue.

The ideal output would thus state that "Desire" and "the album" denote the same entity, and by linking one of the two mentions to the Bob Dylan album
Desire (album), EL covers both mentions. In general, however, the grouping and linking will be partial, meaning that some coreferences may be missed and some of the coreference groups may still be unlinkable – either because of remaining uncertainty or because the proper entity does not exist in the KB. Although this partial picture may look unsatisfying, it does give valuable information for KB construction and completion:
• Mentions in the same coreference group linked to a KB entity may be added as alias names, or simply textual cues, for an existing entity.
• Coreference groups that cannot be linked to a KB entity can be captured as candidates for new entities to be added, or at least reconsidered, later.
For example, if we pick up mentions like "Peter Udo", "the sax player", "the Nigerian saxophonist" and "he" as a coreference group, not only can we assert that he is not an existing KB entity, but we already have informative cues about what type of entity this is and even a gender cue.

Methods for coreference resolution over text inputs, and for coupling this with entity linking, can be rule-based (see, e.g., [471, 313, 143]), based on CRF-like graphical models (see, e.g., [142]) or based on neural learning (see, e.g., [100, 315, 272]). The latter benefits
from feeding large unlabeled corpora into the model training (via embeddings such as BERT; e.g., [272], see also Section 4.5).

Figure 5.2: Example for Combined Entity Linking and Coreference Resolution. The input text reads "Hurricane is on Bob's Desire. The album also contains a track about Sara, the singer's former wife."; the mentions Hurricane, Bob, Desire, "the album", Sara, "the singer" and "wife" are connected among each other (coreference candidates) and to candidate entities such as Hurricane (song), Hurricane cocktail, Bob Dylan, Robert Kennedy, Desire (album), USS Desire (ship), Sara Lownds, Mia Sara and Sarah Connor.

The CR task mostly focuses on short-distance mentions, often looking at successive sentences or paragraphs only. However, the task can be extended to compute coreference groups across distant paragraphs or even across different documents. This aims to mark up an entire corpus with equivalence classes of mentions, and partial linking of these classes to KB entities. The extended task is known as cross-document coreference resolution (CCR), first studied by [25]. Methods along these lines include inference with CRF-like graphical models (e.g., [534]) and hierarchical clustering (e.g., [144]).
Benefit of CR for KB construction:
Partial linking of mentions to KB entities helps coreference resolution by capturing distant signals. Conversely, having good candidates for coreference groups is beneficial for EL, as it provides richer context for individual mentions. In addition to these dual benefits, coreference groups can be important for extracting types and properties of entities, when the latter are expressed with pronouns or underdetermined phrases (e.g., "the album"). For example, when given the text snippet "Hurricane was not exactly a big hit. It is a protest song about racism.", we can infer the type protest song and its super-type song only with the help of CR. Likewise, in the example of Figure 5.2, we can acquire knowledge about Bob Dylan's ex-wife only when considering coreferences. So CR can improve extraction recall and thus the coverage of knowledge bases.
Computing equivalence classes over a set of observed entity mentions is also a frequent task in data cleaning and integration. For example, when faced with two or more semantically overlapping databases or other datasets (incl. web tables), we need to infer which records or table rows correspond to the same entity. This task of entity matching (EM) or duplicate detection, historically called record linkage, is a long-standing problem in computer science [140, 160].

In a nutshell, all EM methods leverage cues from comparing records in a matching-candidate pair:
• Name similarity: The higher the string similarity between two names, the higher the likelihood that they denote matching entities (e.g., strings like "Sara" and "Sarah" being close).
• Context similarity: As context of a database record or table cell that denotes an entity of interest, we should consider the full record or row, as the data in such proximity often denote related entities or salient attribute values. Additionally, it can be beneficial to contextualize the mention in a table cell by considering the other cells in the same column, for example, as signals for the specific entity type of a cell (e.g., when all values in a table column are song titles).
• Consistency constraints: When a pair of rows from two tables is matched, this rules out matching one of the two rows to a third one, assuming that there are no duplicates within each table. With duplicates or when we consider more than two tables as input, matchings need to satisfy the transitivity of an equivalence relation.
• Distant knowledge:
By connecting highly related entities, using a background KB, we can establish matching contexts despite low scores on string similarity. For example, a singer's name (e.g., Mick Jagger) and the name of his or her band (e.g., Rolling Stones) could be very different, but there is nevertheless a strong connection (e.g., in matching a Stones song, using either one or both of the related names).

These ingredients can be fed into matching rules generated from human-provided samples (e.g., [284, 533]), or used as input to supervised learning or probabilistic graphical models (e.g., [537, 40, 624, 475, 559, 291]) or neural networks (e.g., [405, 675]). Similar techniques can be applied also to spotting matching entity pairs across datasets in the Web of (Linked) Open Data [225, 385]; see the survey by [419] on this link discovery problem. For query processing, the problem takes the form of finding joinable tables and the respective join columns and values (see, e.g., [671, 318]).

As EM may operate over large databases with millions of records, it faces a scalability challenge: conceptually comparing all record pairs from two databases may entail quadratic complexity and is bound to be intractable in practice. To overcome this potential show-stopper, the input needs to be partitioned into manageable blocks, based on a variety of blocking techniques. The simplest approach is to partition the records of both sides by the name string of the entities of interest (e.g., person names, company names, or songs or books, etc.) or by one or more of the most informative attributes (e.g., birthdate, address, etc.). More advanced techniques pre-compute fingerprints that approximate string similarity scores, such as min-hash sketches for n-gram overlap (and edit distance) [60, 82], and use these for partitioning. Because of these techniques, blocking-based EM algorithms are also referred to as similarity joins or fuzzy joins. The actual EM step then applies more refined techniques to compare records between blocks in the same partition. State-of-the-art methods do not rely on a single partitioning, but use adaptive and iterative blocking, so as to reduce the false negative rate and boost the overall accuracy (e.g., [623, 135, 96]).

Entity matching between databases or other structured data repositories is of interest for KB construction when incorporating premium sources into an existing KB. For example, when adding entities from GeoNames to a Wikipedia-derived KB, the detection of duplicates (to be eliminated from the ingest) is essentially an EM task. There is ample literature on EM methods for integrating heterogeneous databases and, recently, in the context of so-called data lakes [385]. Therefore, we do not discuss methods in more depth and instead refer to surveys and best-practice papers by [287, 416, 94, 135, 96].

All EL methods are based on quantitative measures for mention-entity popularity, mention-entity similarity and entity-entity coherence. This section elaborates on these measures. Throughout the following, we assume that many entities in the KB, and especially the prominent ones, are uniquely identified also in Wikipedia. It is very easy for a KB to align its entities with Wikipedia articles. This holds for large KBs like Wikidata, even if the KB itself is not built from Wikipedia as a premium source. We will use Wikipedia articles as a background asset for various aspects of EL methods.
As most texts, like news, books and social media, are about prominent entities, the popularity of an entity determines a prior likelihood of it being selected by an EL algorithm. For example, the Web content about Elvis Presley is an order of magnitude larger than about the less known musician Elvis Costello. Therefore, when EL sees a mention "Elvis", the probability that this denotes
Elvis Presley is a priori much higher than for Elvis Costello. To measure the global popularity of entities, a variety of indicators can be used:
• the length of an entity's Wikipedia article (or "homepage" in domain-specific platforms such as IMDB for movies or Goodreads for books),
• the number of incoming links of the Wikipedia page,
• the number of page visits based on Wikipedia usage statistics,
• the amount of user activity, such as clicks or likes, referring to the entity in a domain-specific platform or in social media, and more.
While considering global popularity is useful, it can be misleading and insufficient as a prior probability. For example, for the mention "Trump", the most likely entity is Donald Trump, but for the mention "Donald", although Donald Trump is still a candidate, the more likely entity is
Donald Duck . This suggests that we should consider the combination of mentionand entity for estimating popularity.The most widely used estimator for mention-entity popularity is based on Wikipedialinks [383, 386], exploiting the observation that href anchor texts are often short names andthe pages to which they link are canonicalized entities already.
Link-based Mention-Entity Popularity:
The mention-entity popularity score for mention m and entity e is proportional to the occurrence frequency of hyperlinks with href anchor text "m" that point to the main page about e. This includes redirects within Wikipedia as well as interwiki links between different language editions.

Obviously, this works only for entities that have Wikipedia articles, and it hinges on sufficiently many href links pointing to these articles. On the other hand, the popularity score is useful only for prominent entities and will not be a good signal for long-tail entities anyway. Nevertheless, the approach can be generalized for larger coverage, by considering all kinds of Web pages with links to Wikipedia [550]. Further alternatives are to leverage query-click logs, by considering names in search-engine queries as mentions and subsequent clicks on Wikipedia articles or other kinds of homepages as linked entities.

The most obvious cue for inferring that mention m denotes entity e is to compare their surface strings. For m, this is given by the input text, for example, "Trump" or "President Trump". For e, we can consider the preferred (i.e., official or most widely used) label for the entity, such as "Donald Trump" or "Donald John Trump", but also alias names that are already included in the KB, such as "the US president". String similarity between names, like edit distance or n-gram overlap, can then score how good a match e is for m, this way ranking the candidate entities. In doing this, we can consider weights for tokens at the word or even character level. The weights can be derived from frequency statistics, in the spirit of IR-style idf weights. The weights and the similarity measure can be type-specific or domain-specific, dealing differently with, say, person names, organization acronyms, song titles, etc.

Beyond the basic comparison of m and names for e, a fairly obvious approach is to leverage the mention context, that is, the text surrounding m, and compare it to concise descriptions of entities or other kinds of entity contextualization. The mention context can take the form of a single sentence, single paragraph or entire document, possibly in weighted variants (e.g., weights decreasing with distance from m). The entity context depends on the richness of the existing KB. Entities can be augmented with descriptions taken, for example, from their Wikipedia articles (e.g., the first paragraph stating the most salient points). Alternatively, the types of entities, their Wikipedia categories and other prominently appearing entities in a Wikipedia article (i.e., outgoing links) can be used for a contextualized representation. In essence, we create a pseudo-document for each candidate entity, and we may even expand these by external texts such as news articles about entities, or highly related concepts for the entity types using WordNet and other sources.

In the example of Figure 5.1, the mention "Hurricane" has words like "track" and "played" in its proximity, and the mention "Washington" is accompanied by the word "film". These should be compared to entity contexts with words like song, music etc. for Hurricane (song) versus beverage, alcohol, rum etc. for
Hurricane (cocktail).

The following are some of the widely used measures for scoring the mention-entity context similarity.

Bag-of-Words Context Similarity:
Both m and e are represented as tf-idf vectors derived from bags of words (BoW). Their similarity is the scalar product or cosine between these vectors:
$$sim(cxt(m), cxt(e)) = \overrightarrow{BoW(cxt(m))} \cdot \overrightarrow{BoW(cxt(e))}$$
or analogously for cosine. Some variants restrict the BoW representation to informative keywords from both contexts, using statistical and entropy measures to identify the keywords.

A generalization of keyword-based contexts is to focus on characteristic keyphrases [238]: multi-word phrases that are salient and specific for candidate entities. Keyphrase candidates can be derived from entity types, category names, href anchor texts in an entity's Wikipedia article, and other sources along these lines. For example, Rubin Carter (as a candidate for the mention "Carter" in Figure 5.1) would be associated with keyphrases such as "African-American boxer", "people convicted of murder", "overturned convictions", "racism victim", "nickname The Hurricane" etc. Such phrases can be gathered from noun phrases n in the Wikipedia page (or other entity description) for e, and then scored and filtered by criteria like pointwise mutual information (PMI):
$$weight(n \mid e) \sim \log \frac{P[n, e]}{P[n] \cdot P[e]}$$
or other information-theoretic entropy measures, with probabilities estimated from (co-)occurrence frequencies (in Wikipedia). Intuitively, the best keyphrases for entity e should frequently co-occur with e, but should not be globally frequent for all entities. The context of e, cxt(e), then becomes the weighted set of keyphrases n for which weight(n|e) is above some threshold.

Comparing a set of phrases against the mention context is a bit more difficult than for the Bag-of-Words representations. The reason is that exact matches of multi-word phrases are infrequent, and we need to pay attention to similar phrases with some words missing, different word order, and other variations. [238] proposed a word-proximity-aware window-based model for such approximate matches. For example, the keyphrase "racism victim" can be matched by "victim in a notorious case of racism".

Keyphrase Context Similarity:
Representing cxt(e) as a weighted set KP of keyphrases n, and cxt(m) as a sequence of words conceptually broken down into a set W of (overlapping) small text windows of bounded length, the similarity is computed by aggregating the following scores:
• for each n ∈ KP identify the best matching window ω ∈ W;
• for each such ω compute the (sub-)set of words w ∈ n ∩ ω and their maximum positional distance δ in ω;
• aggregate the word matches for n in ω with consideration of δ and the entity-specific weight for w (treating w as if it were a keyphrase by itself);
• aggregate these per-keyphrase scores over all n ∈ KP.

This template can be varied and extended in a number of ways.
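One possible rendering of this template in Python; the coverage and distance discounts below are simplified stand-ins for the proximity-aware scoring of [238]:

```python
def keyphrase_context_similarity(keyphrases, context_tokens, window_size=10):
    """keyphrases: dict mapping a phrase (tuple of words) to its entity-specific
    weight. Slides windows over the mention context, finds the best window per
    keyphrase, and discounts partial matches by their positional spread delta."""
    windows = [context_tokens[i:i + window_size]
               for i in range(max(1, len(context_tokens) - window_size + 1))]
    total = 0.0
    for phrase, weight in keyphrases.items():
        best = 0.0
        for window in windows:
            positions = [i for i, tok in enumerate(window) if tok in phrase]
            if not positions:
                continue
            delta = max(positions) - min(positions) + 1   # positional spread
            coverage = len({window[i] for i in positions}) / len(phrase)
            best = max(best, weight * coverage / delta)
        total += best    # aggregate per-keyphrase scores over all n in KP
    return total
```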
With the advent of latent embeddings (see Subsection 4.5), both BoW and keyphrase models may seem to be superseded by word2vec-like and other kinds of embeddings, which implicitly capture also synonyms and other strongly related terms. However, at the word level alone, the embeddings are susceptible to over-generalization and drifting focus. For example, the word embedding for “Hurricane” brings out highly related terms that have nothing to do with the song or the boxer (e.g., “storm”, “damage”, “deaths” etc.). So it is important to use entity-centric embeddings like the ones for wikipedia2vec [638], and ideally, these should reflect multi-word phrases as well.

Embedding-based Context Similarity:

With embedding vectors for cxt(m) and cxt(e), the context similarity between m and e is cos(cxt(m), cxt(e)).

Each of these context-similarity models has its sweet spots as well as limitations. Choosing the right model thus depends on the topical domain (e.g., business vs. music) and the language style of the input texts (e.g., news vs. social media).

Whenever an input text contains several mentions, EL methods should compute their corresponding entities jointly, based on the principle that co-occurring mentions usually map to semantically coherent entities. To this end, we need to define measures for entity-entity relatedness that capture this notion of coherence.

One of the most powerful and surprisingly simple measures exploits the rich link structure of Wikipedia. The idea is to consider two entities as highly related if the hyperlink sets of their Wikipedia pages have a large overlap. More specifically, the incoming links are a good signal. For example, two songs like “Hurricane” and “Sara” are highly related because they are both linked to from the articles about
Bob Dylan, Desire (album) and more (e.g., the other involved musicians). Likewise,
Bob Dylan and
Elvis Presley are notably related as there are quite a few Wikipedia categories that link to both of them. Intuitively, in-links are considered more informative than out-links, as outgoing links tend to refer to more general entities and concepts whereas incoming links often have the direction from more general to more specific. These considerations have given rise to the following link-based definition of entity-entity coherence.
Link-based Entity-Entity Coherence:
For two entities e and f with Wikipedia articles that have incoming-link sets In(e) and In(f), their coherence score is

  1 − [ log(max{|In(e)|, |In(f)|}) − log(|In(e) ∩ In(f)|) ] / [ log(|U|) − log(min{|In(e)|, |In(f)|}) ]

where U is the total set of known entities (e.g., Wikipedia articles about named entities).

This approach was pioneered by [386], and is thus sometimes referred to as the Milne-Witten metric. Similarly to link-based popularity measures, we can generalize this Wikipedia-centric link-overlap model to other settings. In a search engine’s query-and-click log, entities whose pages both appear in the clicked-pages set for the same queries (so-called “co-clicks”) should have a high relatedness score. In domain-specific content portals and web communities, such as Goodreads and LibraryThing for books or IMDB and Rottentomatoes for movies, the overlap of the user sets who expressed liking the same entity is a good measure for entity-entity relatedness. All these can be seen as instantiations of the strong co-occurrence principle (see Subsection 4.2).
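A minimal sketch of this score, assuming the incoming links are available as Python sets; the clipping at zero for unrelated entities is a common convention, not part of the formula above:

```python
import math

def link_coherence(in_e, in_f, num_entities):
    """Milne-Witten-style coherence from incoming-link sets in_e, in_f;
    num_entities is |U|, the total number of known entities."""
    overlap = len(in_e & in_f)
    if overlap == 0:
        return 0.0   # disjoint in-link sets: treat as unrelated
    numerator = math.log(max(len(in_e), len(in_f))) - math.log(overlap)
    denominator = math.log(num_entities) - math.log(min(len(in_e), len(in_f)))
    return max(0.0, 1.0 - numerator / denominator)
```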
When entities are represented as weighted sets of keywords (BoW) or weighted sets of keyphrases (KP), their relatedness can be captured by measures for the (weighted) overlap of these sets.

BoW-based and KP-based Entity-Entity Coherence:

Consider two entities e and f with associated keyword sets E and F whose entries x have weights w_E(x) and w_F(x), respectively. The coherence between e and f can be measured by the weighted Jaccard metric:

  Σ_{x ∈ E ∩ F} min{w_E(x), w_F(x)} / max{w_E(x), w_F(x)}

The extension to keyphrases is more sophisticated. It needs to consider also partial matches between keyphrases for e and for f (e.g., “rock and roll singer” vs. “rock ’n’ roll musician”), following the same ideas as the KP-based context similarity model (see [238] for details).
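In code, this weighted-overlap coherence is a one-liner over the shared keywords (a sketch; the keyword sets are assumed to be dicts from keyword to weight):

```python
def weighted_jaccard_coherence(w_e, w_f):
    """Weighted-overlap coherence between two keyword-to-weight dicts,
    summing the per-keyword min/max ratio over the shared keywords."""
    return sum(min(w_e[x], w_f[x]) / max(w_e[x], w_f[x])
               for x in w_e.keys() & w_f.keys())
```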
Analogously to the context-similarity aspect, embeddings are a strong alternative for entity-entity coherence as well. They are straightforward to apply.

Embedding-based Entity-Entity Coherence:

With embedding vectors for entities e and f, their relatedness for EL coherence is cos(e, f).

All these purely text-based coherence measures – keywords, keyphrases, embeddings – have the advantage that they can be computed solely from textual entity descriptions. There is no need for Wikipedia-style links, and not even for any relations between entities. This setting has been called EL with a linkless KB in [336]. That prior work, which predates embedding-based methods, made use of latent topic models (in the style of LDA [48]) for linkless EL. Today, word2vec-style embeddings or even BERT-like language models [118] (see Section 4.5) seem to be the more powerful choice, but they all fall into the same architecture presented above.

Yet another way of defining and computing entity-entity relatedness is by means of random walks over an existing knowledge graph (e.g., the link structure of Wikipedia), see, for example, [197]. Here each entity is represented by the (estimated) probability distribution of reaching related entities by random walks with restart. The relatedness score between entities can be defined as the relative entropy between two distributions.
The above are major cases within a wider space of measures for entity-entity coherence. A good discussion of the broader design space and empirical comparisons can be found in [557, 75, 455]. Ultimately, however, the best choice depends on the domain of interest (e.g., business vs. music) and the style of the ingested texts (e.g., news vs. social media).
All EL methods aim to optimize a scoring or ranking function that maximizes a combination of mention-entity popularity, mention-entity context similarity and entity-entity coherence. This can be formalized as follows.
EL Optimization Problem:
Consider an input text with entity mentions M = {m_1, m_2, ...}, each of which has entity candidates E(m_i) = {e_i1, e_i2, ...}, together forming a pool of target entities E = {e_1, e_2, ...}. The goal is to find a, possibly partial, function φ: M → E that maximizes the objective

  α · Σ_{m} pop(m, φ(m))
  + β · Σ_{m} sim(cxt(m), cxt(φ(m)))
  + γ · Σ_{e,f} { coh(e, f) | ∃ m, n ∈ M: m ≠ n, e = φ(m), f = φ(n) }

where α, β, γ are tunable hyper-parameters, pop denotes mention-entity popularity, cxt the context of mentions and entities, sim the contextual similarity, and coh the pair-wise coherence between entities.
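The objective itself is easy to state in code. The following sketch scores a given (possibly partial) mapping φ; pop, sim, coh and cxt are assumed to be supplied as functions, e.g., the measures defined earlier in this section:

```python
def el_objective(phi, pop, sim, coh, cxt, alpha, beta, gamma):
    """Score a (possibly partial) mention-to-entity mapping phi: dict m -> e."""
    score = alpha * sum(pop(m, e) for m, e in phi.items())
    score += beta * sum(sim(cxt(m), cxt(e)) for m, e in phi.items())
    mentions = list(phi)
    for i, m in enumerate(mentions):
        for n in mentions[i + 1:]:          # all pairs of distinct mentions
            score += gamma * coh(phi[m], phi[n])
    return score
```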
In principle, coherence could even be incorporated over the set of all entities in the image of φ, but the practical sweet spot is to break this down into pair-wise terms which allow more robust estimators.

Combinatorially, the function φ is the solution of a combined selection-and-assignment problem: selecting a subset of the entities and assigning the mentions onto them. We aim for the globally best solution, with a choice for φ that bundles the linking of all mentions together. Algorithmically, this optimization opens up a wide variety of design choices: from unsupervised scoring to graph-based algorithms all the way to neural learning.

The basic choice, found in the classical works of [120, 67, 383, 103, 386, 379], is to view EL as a local optimization problem (notwithstanding its global nature): for each mention in the input text, we compute its best-matching entity from a pool of candidates. This is carried out for each mention independently of the other mentions, hence the adjective local. By this restriction, coherence is largely disregarded, but some aspects can still be incorporated via clever ways of contextualization. Most notably, the method of [386] first identifies all unambiguous mentions and expands their contexts by (the descriptions of) their respective entities. With this enriched context, the method then optimizes a combination of popularity and similarity (i.e., setting γ to zero in the general problem formulation). Generalized techniques for enhancing mention context and entity context via document retrieval have been devised by [336]. The literature on EL contains further techniques along these lines.

These approaches can be used to score and rank mention-entity pairs, but can likewise be broken down into their underlying scoring components as features for learning a ranker. Such methods have been explored using support vector machines and other kinds of classifiers, picking the entity with the highest classification confidence (e.g., [67, 386, 138, 525, 312]). Alternatively, more advanced learning-to-rank (LTR) regression models [350, 329] have been investigated (e.g., [669, 477, 197]), for example, with learning from pairwise preferences between entity candidates for the same mention. Note that these supervised learning methods hinge on the availability of a labeled training corpus where mentions are marked up with ground-truth entities.

By factoring entity-entity relatedness scores into the context of candidate entities, context-similarity methods already go some way in considering global coherence, examples being [103, 104]. Nevertheless, more powerful EL methods make decisions on mapping mentions to entities jointly for all mentions of the input text. This family of methods is referred to as
Collective Entity Linking. Many of these can be seen as operating on an EL candidate graph.

For an entity linking task, the
EL Candidate Graph consists of
• a set M of mentions and a set E of entities as nodes,
• a set ME of mention-entity edges with weights derived from popularity and similarity scores, and
• a set EE of entity-entity edges with weights derived from relatedness scores between entities.

Figure 5.1, in Section 5.1, showed an example for such a candidate graph, with edge weights omitted.

[161] proposed a relatively simple but effective and highly efficient approach for incorporating entity-entity edge weights into the scoring of mention-entity edges. Given a candidate mapping m → e, its score is augmented by the coherence of e with all candidate entities for all other mentions in the graph:

  score(m → e) = · · · + Σ_{n ≠ m} Σ_{f: (n,f) ∈ ME} sim(n, f) · coh(e, f)

The intuition here is that a candidate e for m is rewarded if e has highly weighted edges or paths with all other entities in the entire candidate graph.

Obviously, the graph still contains spurious entities, but we expect those to have low coherence with others anyway. Conversely, entities for unambiguous mentions and entities with tight connections to many others in the graph have a strong influence on the decisions for all mentions. Hence the collective flavor of the method.

Motivated by these considerations, a generalization is to consider entire subgraphs that connect the best cues for joint mappings. This leads to a powerful framework, although it entails more expensive algorithms. More specifically, we are interested in identifying dense subgraphs in the candidate graph such that there is at most one mention-entity edge for each mention (or exactly one if we insist on linking all mentions). Density refers to high edge weights, where both mention-entity weights and entity-entity weights are aggregated over the subgraph. We assume the weights of the two edge types are calibrated before being combined (e.g., via hyper-parameters like α, β, γ).

EL based on Dense Subgraph:
Given a candidate graph with nodes M ∪ E and weighted edges ME ∪ EE, the goal is to compute the densest subgraph S, with nodes S.M ⊆ M and S.E ⊆ E and edges S.ME ⊆ ME and
S.EE ⊆ EE, maximizing

  aggr{ weight(s) | s ∈ S.ME ∪ S.EE }

subject to the constraint that for each m ∈ M there is at most one e ∈ S.E such that (m, e) ∈ S.ME. In this objective, aggr denotes an aggregation function over edge weights, a natural choice being summation (or the sum normalized by the number of nodes or edges in the subgraph).

Note that the resulting subgraph is not necessarily connected, as a text may be about different groups of entities, related within a group but unrelated across. However, this situation should be rare, and the method could enforce a connected subgraph. Also, it may be desirable to have identical mentions in a document mapped to the same entity (e.g., all mentions “Carter” linked to the same person), at least when the text is not too long. This can be enforced by an additional constraint.
Figure 5.3: Example for Dense Subgraphs in an EL Candidate Graph (mentions “Bob”, “Hurricane”, “Carter” with candidate entities Hurricane (song), Rubin Carter, Jimmy Carter, Bob Dylan and Robert Kennedy; edges carry weights)
Figure 5.3 illustrates this subgraph-based method with a simple example: 3 mentions and 5 candidate entities, with one mention being unambiguous. There are two subgraphs with high values for their total edge weight, shown in blue and red. The one in blue corresponds to the ground-truth mapping (assuming the context is still the Bob Dylan song), with a total weight of 30. The one in red is an alternative solution (centered on the two prominent politicians, one of which was a strong advocate against racism), with a total edge weight of 31. Both are substantially better than other choices, such as mapping the three mentions to Robert Kennedy, the song and Rubin Carter, which has a total edge weight of 23. However, the best subgraph by using sum for weight aggregation is the wrong output in red, albeit by a tiny margin only.

This observation motivates the following alternative choice for the aggregation function aggr: within the subgraph of interest, instead of paying attention only to the total weight, we consider the weakest link, that is, the edge with the lowest weight. The goal then is to maximize this minimum weight, making the weakest link as strong as possible. Using this objective, the best subgraph in Figure 5.3 is the one with Bob Dylan, the song and Rubin Carter, as the lowest edge weight in the respective subgraph is 3, whereas it is 2 for the alternative subgraph with Robert Kennedy, the song and Jimmy Carter.
EL based on Max-Min Subgraph:
For a given candidate graph, the goal is to compute the subgraph S that maximizes

  min{ weight(s) | s ∈ S.ME ∪ S.EE }

subject to the constraint that for each m ∈ M there is at most one e ∈ S.E such that (m, e) ∈ S.ME.

Both variants of the dense-subgraph approach are NP-hard, due to the constraint about mention-entity edges. However, there are good approximation algorithms, including greedy removal of weak edges as well as stochastic search to overcome local optima.
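One possible greedy rendering of this idea in Python, simplified for illustration (it is not the exact procedure of the works cited below): repeatedly drop the entity with the smallest total attached weight, as long as every mention keeps at least one candidate. For the max-min variant, one would instead remove the entity whose weakest attached edge is smallest, keeping track of the best subgraph seen so far.

```python
def greedy_dense_subgraph_el(mention_edges, entity_edges):
    """mention_edges: dict (m, e) -> weight; entity_edges: dict (e, f) -> weight.
    Greedily removes the entity node with the smallest total attached weight,
    never removing the last remaining candidate of any mention."""
    entities = {e for (_, e) in mention_edges}
    mentions = {m for (m, _) in mention_edges}

    def candidates(m):
        return [e for (mm, e) in mention_edges if mm == m and e in entities]

    def attached_weight(e):
        w = sum(wt for (_, f), wt in mention_edges.items() if f == e)
        w += sum(wt for (a, b), wt in entity_edges.items()
                 if (a == e and b in entities) or (b == e and a in entities))
        return w

    while True:
        removable = [e for e in entities
                     if all(e not in candidates(m) or len(candidates(m)) > 1
                            for m in mentions)]
        if not removable:
            break
        entities.discard(min(removable, key=attached_weight))

    # map each mention to its best surviving candidate
    return {m: max(candidates(m), key=lambda e: mention_edges[(m, e)])
            for m in mentions}
```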
Using the same EL candidate graph as before, an alternative approach that also has efficientimplementations, is based on random walks with restarts , essentially the same principlethat underlies
Personalized Page Rank [219]. First, edge weights in the candidate graph are re-calibrated to become proper transition probabilities, and a small restart probability is chosen to jump back to the starting node of a walk. Conceptually, we initiate such walks on each of the mentions, making probabilistic decisions for traversing both mention-entity and entity-entity edges many times, and occasionally jumping back to the origin. In the limit, as the walk length approaches infinity, the visiting frequencies of the various nodes converge to stationary visiting probabilities, which are then interpreted as scores for mapping mentions to entities. An actual implementation would bound the length of each walk, but walk repeatedly to obtain samples towards better approximation. Alternatively, iterative numeric algorithms from linear algebra, most notably Jacobi iteration, can be applied to the transition matrix of the graph, until some convergence criterion is reached for the best entity candidates of every mention.
EL based on Random Walks with Restart:
Given a candidate graph with weights, re-calibrate the weights into proper transition probabilities. For each mention m ∈ M
• approximate the visiting probabilities of the possible target entities e ∈ E with (m, e) ∈ ME, and
• map m to the entity e with the highest probability.

Although these algorithms make linking decisions one mention at a time, they do capture the essence of collective EL, as the walks involve the entire candidate graph and the stationary visiting probabilities take the mutual coherence into account. EL methods based on random walks and related techniques include, for example, [399, 196, 451, 190].
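A compact numeric rendering of this scheme, assuming NumPy and a row-stochastic transition matrix over all mention and entity nodes (a sketch with hypothetical inputs, not a specific system's implementation):

```python
import numpy as np

def visiting_probabilities(P, start, restart=0.15, iterations=100):
    """Stationary visiting probabilities of a random walk with restart.
    P: (n x n) row-stochastic transition matrix over the candidate graph;
    start: index of the mention node where the walk restarts."""
    n = P.shape[0]
    r = np.zeros(n)
    r[start] = 1.0
    pi = r.copy()
    for _ in range(iterations):   # power iteration until (approximate) convergence
        pi = (1.0 - restart) * (P.T @ pi) + restart * r
    return pi   # the mention is mapped to its candidate with the highest pi value
```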
The coherence-aware graph-based methods can also be cast into probabilistic graphical models, like CRFs and related models. They can be seen as reasoning over a joint probability distribution

  P[m_1, m_2, ..., e_1, e_2, ..., d]

with mentions m_i, entities e_j and the context given by document d. This denotes the likelihood that a document d contains entities e_1, e_2, ... and that these entities are textually expressed in the form of mentions m_1, m_2, .... Obviously, this high-dimensional distribution is not tractable. So it is factorized by making model assumptions and mathematical transformations, such as

  P[m_1, m_2, ..., e_1, e_2, ..., d] = Π_{i,j} P[m_i | e_j, d] · Π_{j,k} P[e_j, e_k]

where P[m_i | e_j, d] is the probability of m_i expressing e_j in the context of d, and P[e_j, e_k] is the probability of the two entities co-occurring in the same, semantically meaningful document.

This kind of probabilistic reasoning can be cast into a CRF model or factor graph (cf. Section 4.4) as follows.

CRF Model for Entity Linking:

For each mention m_i with entity candidates E_i = {e_i1, e_i2, ...}, the model has a random variable X_i with values from E_i. These variables capture the probabilities P[m_i | e_j]. For each candidate entity e_k (for any of the mentions), the model has a binary random variable Y_k that is true if e_k is mentioned in the document and false otherwise. These variables capture probabilities P[e_k | d] of entity occurrence in the document. All variables are assumed to be conditionally independent, except for the following coupling factors:
• X_i, Y_j are coupled if e_j is a candidate for m_i,
• Y_j, Y_k are coupled for all pairs e_j, e_k.

Figure 5.4 depicts an example, showing how the candidate graph for our running example can be cast into the CRF structure with variables for each mention and entity node, and coupling factors for each of the edges.

Unlike the CRF models for NER, discussed in Section 4.4, this kind of CRF does not operate on token sequences but on graph-structured input. So it falls under a more general class of Markov Random Fields (MRF), but the approach is nevertheless widely referred to as a CRF model. The joint distribution of all X_i and Y_j variables is factorized, by the Markov assumption, according to the cliques in the graph. This way, we obtain one “clique potential” or “coupling factor” per clique, often with restriction to binary cliques, coupling two variables (i.e., the edges in the graph). Training such a model entails learning per-clique weights (or estimating conditional probabilities for the coupled variables), typically using gradient-descent techniques. Alternatively, similarity and coherence scores can be used for these purposes, at least for a good initialization of the gradient-descent optimization.
Figure 5.4: Example for a CRF derived from the EL candidate graph (variables X1–X3 for the mentions “Bob”, “Hurricane”, “Carter”, and Y1–Y5 for the candidate entities Hurricane (song), Rubin Carter, Jimmy Carter, Bob Dylan and Robert Kennedy)

Inference on the values of variables, given a new text document as input, is usually limited to joint MAP inference: computing the combination of variable values that has the highest posterior likelihood. This involves Monte Carlo sampling, belief propagation, or variational calculus (cf. Section 4.4 on CRFs for NER).

CRF-based EL was first developed by [297], with a variety of enhancements in follow-up works such as [142, 178, 422].

Many CRF-like models can also be cast into
Integer Linear Programs (ILP): a discrete optimization problem with constraints [500, 498].
ILP Model for Entity Linking:
For each mention m_i with entity candidates E_i = {e_i1, e_i2, ...}, the model has a binary decision variable X_ij set to 1 if m_i truly denotes e_j. For each pair of candidate entities e_k, e_l (for any of the mentions), the model has a binary decision variable Y_kl set to 1 if e_k and e_l are indeed both mentioned in the input text.

The objective function for the ILP is to maximize the data evidence for the choice of 0-1 values for the X_ij and Y_kl variables:

  maximize  β · Σ_{i,j} weight(m_i, e_j) · X_ij + γ · Σ_{k,l} weight(e_k, e_l) · Y_kl

with weights corresponding to similarity and coherence scores and hyper-parameters β, γ.

The constraints specify that mappings are functions, couple the X_ij and Y_kl variables, and may optionally capture transitivity among identical mentions or (obvious) coreferences, if desired:
• Σ_j X_ij ≤ 1 for each mention m_i;
• Y_kl ≥ X_ik + X_jl − 1, stating that Y_kl must be 1 if both e_k and e_l are chosen as mapping targets;
• X_ik ≤ X_jk and X_ik ≥ X_jk for all k, for identical mentions m_i, m_j;
• X_ij ∈ {0, 1} and Y_kl ∈ {0, 1} for all i, j, k, l.

Solving such an ILP is computationally expensive: NP-hard in the worst case and also costly in practice for large instances. However, there are very efficient ILP solvers, such as Gurobi, which can handle reasonably sized inputs such as short news articles with tens of mentions and hundreds of candidate entities. Larger inputs could have their candidate space pruned first by other, simpler, techniques. Moreover, ILPs can be relaxed into LPs, linear programs with continuous variables, followed by randomized rounding. Often, this yields very good approximations for the discrete optimization (see also [297]).
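As an illustration, here is a sketch of this program using the open-source PuLP modeler (the choice of PuLP is ours; Gurobi, mentioned above, would be a production-grade alternative). Note that for a maximizing solver with positive coherence weights, the Y variables additionally need upper-bound couplings, so that coherence rewards can only be collected when the corresponding entities are actually selected:

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def el_ilp(mention_weights, entity_weights, beta=1.0, gamma=1.0):
    """mention_weights: dict (i, j) -> weight(m_i, e_j);
    entity_weights: dict (k, l) -> weight(e_k, e_l).
    Returns the selected mention-entity pairs (i, j)."""
    prob = LpProblem("entity_linking", LpMaximize)
    X = {ij: LpVariable(f"x_{ij[0]}_{ij[1]}", cat=LpBinary) for ij in mention_weights}
    Y = {kl: LpVariable(f"y_{kl[0]}_{kl[1]}", cat=LpBinary) for kl in entity_weights}

    prob += (beta * lpSum(w * X[ij] for ij, w in mention_weights.items())
             + gamma * lpSum(w * Y[kl] for kl, w in entity_weights.items()))

    for i in {i for (i, _) in mention_weights}:      # mappings are functions
        prob += lpSum(X[(i2, j)] for (i2, j) in mention_weights if i2 == i) <= 1

    selectors = {}                                   # entity -> X variables choosing it
    for (i, j) in mention_weights:
        selectors.setdefault(j, []).append(X[(i, j)])
    for (k, l) in entity_weights:
        xs_k, xs_l = selectors.get(k, []), selectors.get(l, [])
        prob += Y[(k, l)] <= lpSum(xs_k)             # upper-bound couplings
        prob += Y[(k, l)] <= lpSum(xs_l)
        for xk in xs_k:                              # lower-bound coupling as above
            for xl in xs_l:
                prob += Y[(k, l)] >= xk + xl - 1

    prob.solve()
    return [ij for ij, var in X.items() if var.value() == 1]
```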
Early work on EL (e.g., [67, 386, 138, 477, 525, 104, 312]) already pursued machine learning for ranking entity candidates, building on labeled training data in the form of ground-truth mention-entity pairs in corpora (most notably, Wikipedia articles or annotated news articles). These methods used support vector machines, logistic regression and other learners, all relying on feature engineering, with features about mention contexts and a suite of cues for entity-entity relatedness. More recently, with the advent of deep neural networks, these feature-based learners have been superseded by end-to-end architectures without feature modeling. However, these methods still, and perhaps even more strongly, hinge on sufficiently large collections of training samples in the form of correct mention-entity pairs.

Recall from Section 4.4 that neural networks require real-valued vectors as input. Thus, a key point in applying neural learning to the EL problem is the embeddings of the inputs: mention context (both short-distance and long-distance), entity description and most strongly related entities, and more. This neural encoding is already a learning task by itself, successfully addressed, for example, by [169, 639, 640, 200]. The jointly learned embeddings are fed into a deep neural network, with a variety of architectures like LSTM, CNN, Feed-Forward, Attention learning, Transformer-style, etc. The output of the neural classifier is a scoring of entity candidates for each mention. For end-to-end training, a typical choice for the loss function is softmax over the cross-entropy between predictions and ground-truth distribution. Figure 5.5 illustrates such a neural EL architecture. As embedding vectors are fairly restricted in size, mention contexts can be captured at different scopes: short-distance like sentences as well as long-distance like entire documents. By the nature of neural networks, the “cross-talk” between mentions and entities and among entities is automatically considered, capturing similarity as well as coherence.
Figure 5.5: Illustration of a Neural EL Architecture (mention-context embeddings, e.g., for “Hurricane, about Carter, is one of Bob’s tracks ...”, and entity embeddings, e.g., for Jimmy Carter, Rubin Carter, Bob Dylan, Robert Kennedy, are fed through neural network layers – LSTM, CNN, FFN or others – which output scores of entity candidates)
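A minimal sketch of the feed-forward variant of such a scorer, assuming PyTorch; the architecture and dimensions are illustrative, not taken from a specific system, and LSTM/CNN context encoders and entity-entity cross-talk are omitted:

```python
import torch
import torch.nn as nn

class NeuralELScorer(nn.Module):
    """Scores candidate entities for one mention by feeding the concatenation
    of the mention-context embedding and each entity embedding through an FFN."""
    def __init__(self, ctx_dim, ent_dim, hidden_dim=256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(ctx_dim + ent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, ctx_emb, cand_embs):
        # ctx_emb: (ctx_dim,); cand_embs: (num_candidates, ent_dim)
        ctx = ctx_emb.unsqueeze(0).expand(cand_embs.size(0), -1)
        return self.ffn(torch.cat([ctx, cand_embs], dim=-1)).squeeze(-1)

# end-to-end training uses softmax cross-entropy over the candidate scores:
# scores = scorer(ctx_emb, cand_embs)
# loss = nn.functional.cross_entropy(scores.unsqueeze(0), gold_idx.unsqueeze(0))
```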
Neural networks for EL are trained end-to-end (e.g., [151, 282, 406, 518, 252]) on labeled corpora where mentions are marked up with their proper entities, using gradient descent techniques. Some of these methods integrate EL with the NER task, jointly spotting mentions and linking them. In the literature, Wikipedia full-text with hyperlink targets as ground truth provides ample training samples. However, the articles all follow the same encyclopedic style. Therefore, the learned models, albeit achieving excellent performance on withheld Wikipedia articles, do not easily carry over to text with different stylistic characteristics and neither to domain-specific settings such as biomedical articles or health discussion forums. Another large resource for training is the
WikiLinks corpus [535], which comprises Web pages with links to Wikipedia articles. This captures a wider diversity of text styles, but the ground-truth labels have been compiled automatically, hence containing errors.

Overall, it seems that neural EL is not yet as mature and successful as its neural counterparts for NER (see Section 4.4), as it is easier to obtain training data for NER. Neural EL shines when trained with large labeled collections and the downstream texts to which the learned linker is applied have the same general characteristics. When training samples are scarce or the use-case data characteristics substantially deviate from the training data, it is much harder for neural EL to compete with feature-based unsupervised methods.

Very recently, methods for transfer learning have been integrated into neural EL (e.g., [357, 629]). These methods are trained on one labeled collection, but applied to a different collection which does not have any labels and has a disjoint set of target entities. A major asset to this end is the integration of large-scale language embeddings, like BERT [118], which covers both training and target domains. The effect is an implicit capability of “reading comprehension”, which latently captures relevant signals about context similarity and coherence. Transfer learning still seems a somewhat brittle approach, but such methods will be further advanced, leveraging even larger language embeddings, such as GPT-3 based on a neural network with over 100 billion parameters [62].

Regardless of future advances along these lines, we need to realize that EL comes in many different flavors: for different domains, text styles and objectives (e.g., precision vs. recall). Therefore, flexibly configurable, unsupervised methods with explicit feature engineering will continue to play a strong role. This includes methods that require tuning a handful of hyper-parameters, which can be done by domain experts or using a small set of labeled samples.
In addition to text documents, KB construction also benefits from tapping semi-structured contents with lists and tables. In the following, we focus on the case of ad-hoc web tables as input, to exemplify EL over semi-structured data. Table 5.1 shows an example with ambiguous mentions such as “Elvis”, “Adele”, “Columbia”, “RCA” as well as abbreviated or slightly misspelled names (e.g., “Pat Garrett” should be the album
Pat Garrett & Billy the Kid).

  Name      | Title                     | Album       | Label    | Year
  Bob Dylan | Hurricane                 | Desire      | Columbia | 1976
  Bob Dylan | Sara                      | Desire      | Columbia | 1976
  Bob Dylan | Knockin on Heavens Door   | Pat Garrett | Columbia | 1973
  Elvis     | Cant Help Falling in Love | Blue Hawaii | RCA      | 1961
  Adele     | Make You Feel My Love     | n/a         | XL       | 2008

Table 5.1: Example for an entity linking task over web tables

Note that such tables are usually surrounded by text – within web pages, with table captions, headings etc. – which can be harnessed as additional context.

From a traditional database perspective, it seems that the best cue for EL over tables is to exploit the table schema, that is, column headers and perhaps inferrable column types. However, these tables are very different from well-designed databases: they are hand-crafted in an ad-hoc manner, and their column names are often not exactly informative (e.g., “Name”, “Title”, “Label” are very generic). Thus, it seems that the EL problem is much harder for tables. However, we can leverage the tabular structure to guide the search for the proper entities. Specifically, we pay attention to same-row mentions and same-column mentions:
• Same-row mentions are most tightly related. So their coherence should be boosted in the objective function.
• Same-column mentions are not directly related, but they are typically of the same type, such as musicians, songs, music albums and record labels. So the objective function should incorporate a soft constraint for per-column homogeneity.

By taking these design considerations into account, the EL optimization can be varied as follows.

EL Optimization for Web Tables:
Consider a table with c columns, r rows and entity mentions m_ij, where i, j are the row and column where the mention occurs. Each m_ij has a set of entity candidates E(m_ij). All mentions together are denoted as M, and the pool of target entities overall as E. For ease of notation, we assume that all table cells are entity mentions (i.e., disregarding the fact that some columns are about literal values). The goal is to find a, possibly partial, function φ: M → E that maximizes the objective

  α · Σ_{m_ij ∈ M} pop(m_ij, φ(m_ij))
  + β_1 · Σ_{m_ij ∈ M} sim(rowcxt(m_ij), cxt(φ(m_ij)))
  + β_2 · Σ_{m_ij ∈ M} sim(doccxt(m_ij), cxt(φ(m_ij)))
  + γ · Σ_{e,f ∈ E} { coh(e, f) | m_ij, m_ik ∈ M: j ≠ k, e = φ(m_ij), f = φ(m_ik) }
  + δ · Σ_{j=1..c} hom{ type(e) | e = φ(m_ij), i = 1..r }

where α, β_1, β_2, γ, δ are tunable hyper-parameters. As before, pop denotes mention-entity popularity and cxt the context of mentions and entities, in two variants for mentions: rowcxt for same-row cells, doccxt for the entire document. sim denotes contextual similarity and coh pair-wise coherence between same-row entities. hom is a measure of type homogeneity and specificity.

This framework leaves many choices for the underlying measures: defining specifics of the two context models, defining the measures for type homogeneity and specificity, and so on. For example, hom may combine the fraction of per-column entities that have a common type and the depth of the type in the KB taxonomy. The latter is important to avoid overly generic types like entity, person or artefact. The inference of column types has been addressed as a problem by itself (e.g., [593, 85]). The case of lists, which can be viewed as single-column tables, has received special attention, too (e.g., [524, 228]).

The literature on EL over tables, most notably [338, 39, 253, 493], discusses a variety of viable design choices in depth. [660] is a recent survey on knowledge extraction from web tables.

Algorithmically, many of the previously presented methods for text-based EL carry over to the case of tables. Scoring and ranking methods can simply extend their objective functions to the above table-specific model. Graph-based methods, including dense subgraphs, random walks and CRF-based inference, merely have to re-define their input graphs accordingly. The seminal work of [338] did this with a probabilistic graphical model, integrating the column type inference in a joint learning task.

Figure 5.6 illustrates this graph-construction step: edges denote CRF-like coupling factors or guide random walks (only some edges are shown). In the figure, the type homogeneity is depicted by couplings with the prevalent column type. Alternatively, the CRF could couple all mention pairs in the same column including the column header [39].
Figure 5.6: Illustration of an EL Graph for Tables (candidate entities of table cells are coupled with each other and with per-column type nodes ColType1, ColType2)
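As one possible instantiation of the hom term, the following sketch combines the fraction of per-column entities sharing a type with the depth of that type in the taxonomy, as suggested above; types_of and type_depth are hypothetical accessors into the KB:

```python
from collections import Counter

def column_homogeneity(column_entities, types_of, type_depth):
    """Fraction of entities in one column that share the most common type,
    scaled by the type's taxonomy depth to penalize overly generic types
    such as entity or person."""
    counts = Counter(t for e in column_entities for t in types_of(e))
    if not counts:
        return 0.0
    # prefer the type covering most entities; break ties by specificity (depth)
    best_type, n = max(counts.items(), key=lambda kv: (kv[1], type_depth(kv[0])))
    return (n / len(column_entities)) * type_depth(best_type)
```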
Iterative Linking:
A simple yet powerful principle that can be combined with virtually all EL methods is to make linking decisions in multiple rounds, based on the mapping confidence (see, e.g., [196, 421, 451]). Initially, only unambiguous mentions are mapped, unless there is uncertainty on whether they could denote out-of-KB entities. In the next round, only those mentions are mapped for which the method has high confidence. After every round, all similarity and coherence measures are updated, triggering updates to the graph or other model on which EL operates. As more and more entities are mapped, they create a more focused context for subsequent rounds. For the running example, suppose that we can map “Hurricane” to the song with high confidence. Once this context about music is established, the confidence in linking “Bob” to Bob Dylan, rather than any other prominent Bobs, is boosted.
Domain-specific Methods:
There are numerous variations and extensions of EL methods, including domain-specific approaches, for example, for mentions of proteins, diseases etc. in biomedical texts (see, e.g., [17, 170, 108, 267] and references given there), and multi-lingual approaches where training data is available only in some languages and transferred to the processing of other languages (see, e.g., [531] and references there). The case for domain-specific methods can also be made for music (e.g., names of songs, albums, bands – which include many common words and appear incomplete or misspelled), bibliography with focus on author names and publication titles, and even business where company acronyms are common and product names have many variants.
Methods for Specific Text Styles:
There are approaches customized to specific kinds of text styles, most notably, social media posts such as tweets (e.g., [351, 526, 116]) with characteristics very different from encyclopedic pages or news articles. Yet another specific kind of input is search-engine queries, when users refer to entities by telegraphic phrases (e.g., “Dylan songs covered by Grammy and Oscar winners”). For EL over queries, [513] developed powerful methods based on probabilistic graphical models.
Methods for Specific Entity Types:
Finally, there are also specialized EL methods for disambiguating geo-spatial entities, such as “Twin Towers” (e.g., in Kuala Lumpur, or formerly in New York), and temporal entities such as “Mooncake Festival”. Methods for these types of entities are covered, for example, by [321, 508, 83] for spatial mentions, aka. toponyms, and [555, 301] for temporal expressions, aka. temponyms. A notable project where spatio-temporal entities have been annotated at scale is the GDELT news archive, supporting event-oriented knowledge discovery [316].
Key points to remember from this chapter are the following:
• Entity Linking (EL) (aka. Named Entity Disambiguation) is the task of mapping entity mentions in web pages (detected by NER, see Chapter 4) onto uniquely identified entities in a KB or similar repository. This is a key step for constructing canonicalized
KBs. Full-fledged EL methods also consider the case of out-of-KB entities, where a mention should be mapped to null rather than any entity in the KB.
• The input sources for EL can be text documents or semi-structured contents such as lists or tables in web pages. In the text case, related tasks like coreference resolution (CR) for pronouns and common noun phrases, or even general word sense disambiguation, may be incorporated as well.
• Entity Matching (EM) is a variation of the EL task where the inputs are structured data sources, such as database tables. The goal here is to map mentions in data records from one source to those of a different source and, this way, compute equivalence classes. There is not necessarily a KB as a reference repository.
• EL and EM leverage a variety of signals, most notably: a-priori popularity measures for entities and name-entity pairs, context similarity between mentions and entities, and coherence between entities that are considered as targets for different mentions in the same input. Specific instantiations may exploit existing links like those in Wikipedia, text cues like keywords and keyphrases, or embeddings for latent encoding of such contexts.
• EL methods can be chosen from a spectrum of paradigms, spanning graph algorithms, probabilistic graphical models, feature-based classifiers, all the way to feature-less neural learning. A good choice depends on the prioritization of complexity, efficiency, precision, recall, robustness and other quality dimensions.
• None of the state-of-the-art EL methods seems to universally dominate the others. There is (still) no one-size-fits-all solution. Instead, a good design choice depends on the application setting: scope and scale of entities under consideration (e.g., focus on vertical domains or entities of specific types), language style and structure of inputs (e.g., news articles vs. social media vs. scientific papers), and requirements of the use case (e.g., speed vs. precision vs. recall).
Using methods from the previous chapters, we can now assume that we have a knowledge base that has a clean and expressive taxonomy of semantic types (aka. classes) and that these types are populated with a comprehensive set of canonicalized (i.e., uniquely identified) entities.

The next step is to enrich the entities with properties in the form of SPO triples, covering both
• attributes with literal values such as the birthday of a person, the year when a song or album was released, the maximum speed and energy consumption of a car model, etc., and
• relations with other entities such as birthplace, spouse, composers and musicians for a song or album, manufacturer of a car, etc.

In this chapter, we present methods for extracting such SPO triples; most of these can handle both attributes and relations in a more or less unified way. We will see that many of the principles (e.g., statement-pattern duality), key ideas (e.g., pattern learning) and methodologies (e.g., CRFs or neural networks) of the previous chapters are applicable here as well.

Assumptions:
Best-practice methods build on a number of assumptions that are justified by already having a clean and large KB of entities and types.
• Argument Spotting:
Given input content in the form of a text document, Web page, list or table, we can spot and canonicalize arguments for the subject and object of a candidate triple. This assumption is valid because we already have methods for entity discovery and linking. As for attribute values, specific techniques for dates, monetary numbers and other quantities (with units) can be harnessed for spotting and normalization (e.g., [362, 502, 9, 555]).
• Target Properties:
We assume that, for each type of entities, we have a fairly good understanding and a reasonable initial list of which properties are relevant to capture in the KB. For example, we should know upfront that for people in general we are interested in birthdate, birthplace, spouse(s), children, organizations worked for, awards, etc., and for musicians, we additionally need to harvest songs composed or performed, albums released, concerts given, instruments played, etc. These lists are unlikely to be complete, but they provide the starting point for this chapter. We will revisit and relax the assumption in Chapter 7 on the construction and evolution of open schemas.
• Type Signatures:
We assume that each property of interest has a type signature such that we know the domain and range of the property upfront. This is part of the KB schema (or ontology). By associating properties with types, we already have the domain, but we require also that the range is specified in terms of data types for both attributes (literal values) and relations (entity types). This enables high-precision knowledge acquisition, as we can leverage type constraints for de-noising. For example, we will assume the following specifications:

  birthdate: person x date
  birthplace: person x location
  spouse: person x person
  worksFor: person x organization
Schema Repositories of Properties:
Where do these pre-specified properties of interest and their type signatures come from? Spontaneously, one may think this is a leap of faith, but on second thought, there are major assets already available.
• Early KB projects like Yago and Freebase demonstrated that it is entirely feasible, with limited effort, to manually compile schemas (aka. ontologies) for relevant properties.
Freebase comprised several thousand properties with type signatures.
• Frameworks like schema.org [193] have specified vocabularies for types and properties. These are not populated with entities, but one can easily use the schemas to drive the KB population. Currently, schema.org comprises mostly business-related types (ca. 1000) and their salient properties.
• There are rich catalogs and thesauri that cover a fair amount of vocabulary for types and properties. Some are well organized and clean, for example, the icecat.biz catalog of consumer products. Others are not quite so clean, but can still be starting points towards a schema, an example being the UMLS thesaurus for the biomedical domain.
• Domain-specific KBs, say on food, health or energy, can start with some of the above repositories and would then require expert efforts to extend and refine their schemas. This is manual work, but it is not a huge endeavor, as the KB is very unlikely to require more than a few thousand types and properties. For health KBs, for example, some tens of types and properties already cover a useful slice (e.g., [150, 605]).
The easiest and most effective way of harvesting attribute values and relational arguments, for given entities and a target property, is again to tap premium sources like Wikipedia (or IMDB, Goodreads etc. for specific domains). They feature entity-specific pages, and their structure follows fairly rigid templates. Therefore, extraction patterns can be specified with relatively little effort, most notably, in the form of regular expressions (regex) over the text and existing markup of the target pages. The underlying assumption for the viability of this approach is:
Consistent Patterns in Single Web Site:
In a single web site, all (or most) pages about entities of the same type (e.g., musicians or writers) exhibit the same patterns to express certain properties (e.g., their albums or their books, respectively). A limited amount of diversity and exceptions needs to be accepted, though.
Figure 6.1:
Examples of Wikipedia Infoboxes

Within Wikipedia, semi-structured elements like infoboxes, categories, lists, headings, etc. provide the best opportunity for harvesting facts by regular expressions. Consider the infoboxes shown in Figure 6.1 for three musicians (introducing new ones for a change, to give us a break from Bob Dylan and Elvis Presley). Our goal is to extract, say, the dates and places of birth of these people, to populate the birthdate attribute and birthplace relation. For these examples, the
Born fields provide this knowledge, with small variations, though, such as showing only the year for Nive Nielsen or repeating the person name for Jimmy Page. The following regular expressions specify the proper extractions, coping with the variations. For simplicity of explanation, we restrict ourselves to the birth year and birth city.

  birth year X:  Born .* (X = (1|2)[0-9]{3}) .*
  birth city Y:  Born .* ([0-9]{4}|")") (Y = [A-Z]([a-z])+) .*
In these expressions, “.*” denotes a wildcard for any token sequence, “|” and “[...]” denote disjunctions and ranges of tokens, “{...}” and “+” are repetition factors, and putting “)” itself in quotes is necessary to distinguish this token from the parenthesis symbol used to group sub-structures in a regex. Note that the specific syntax for regex varies among different pattern-matching tools.

Intuitively, the regex for birth year finds a subsequence X that has exactly four digits and starts with 1 or 2 (disregarding, for simplicity, people who were born more than 1020 years ago). The regex for birth cities identifies the first alphabetic string that starts with an upper-case letter and follows a digit or closing parenthesis.

This is still not perfectly covering all possible cases. For example, cities could be multi-word noun phrases (e.g., New Orleans). We do not show more complex expressions for ease of explanation. It is straightforward to extend the approach for both i) completeness, like extracting the full date rather than merely the year and the exact place, and ii) diversity of showing this in infoboxes. On the latter aspect, the moderation of Wikipedia has gone a long way towards standardizing infobox conventions by templates, but it could still be (and earlier was the case) that some people’s infoboxes show fields birth place, place of birth, born in, birth city, country of birth, etc. Nevertheless, it is limited effort to manually specify wide-coverage and robust regex patterns for hundreds of attributes and relations. The YAGO project, for example, did this for about 100 properties in a few days of single-person work [563]. Industrial knowledge bases harvest deterministic patterns from Web sites that are fed by back-end databases, such that each entity page has the very same structure (e.g., IMDB pages for the cast of movies).

To ease the specification of regex patterns, methods have been developed that merely require marking up examples of the desired output in a small set of pages, sometimes even supported by visual tools (e.g., [507, 162, 209]). For restricted kinds of patterns, it is then possible to automatically learn the regex or, equivalently, the underlying finite-state automaton, essentially inferring a regular grammar from examples of the language. This methodology applied to pattern extraction has become known as wrapper induction [300, 545, 299, 410, 36]. The survey [511] covers best-practice methods, with emphasis on CRF-based learning. Wrapper induction is a standard building block for information extraction today.
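Returning to the birth-year and birth-city patterns above: the schematic notation translates directly into executable regexes. A small Python rendering on hypothetical flattened Born field strings (the field values are invented for illustration):

```python
import re

# Hypothetical flattened "Born" infobox fields
born_fields = [
    "Born: James Patrick Page 9 January 1944 Heston, Middlesex, England",
    "Born: 1979 (age 41) Nuuk, Greenland",
]

BIRTH_YEAR = re.compile(r"Born.*?((?:1|2)[0-9]{3})")
BIRTH_CITY = re.compile(r"Born.*?(?:[0-9]{4}|\))[,\s]+([A-Z][a-z]+)")

for field in born_fields:
    year = BIRTH_YEAR.search(field)
    city = BIRTH_CITY.search(field)
    print(year.group(1) if year else None,
          city.group(1) if city else None)
# prints: 1944 Heston  /  1979 Nuuk
```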
Neither manually specified nor learned regex patterns are perfect: there could always be an unanticipated variation among the pages that are processed. An extremely useful technique to prune out false positives among the extracted results is semantic type checking, utilizing the a-priori knowledge of type signatures for the properties of interest (see Section 6.1). If we expect birthplace to have cities as its range rather than countries or even continents, we can test an extracted argument for this relation against the specification. The types themselves can be looked up in the existing KB of entities and classes, after running entity linking on the extracted argument. This technique substantially improves the precision of regex-based KB population [563, 240]. It equally applies to literal values of attributes if there are pre-specified patterns, for example, for dates, monetary values or quantities with units.
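In code, such a type check is a simple filter over candidate triples; the KB accessors below are hypothetical, and a full implementation would test subsumption in the type taxonomy rather than direct type membership:

```python
def passes_type_check(triple, kb_types, signatures):
    """Keep an extracted (s, p, o) triple only if both arguments satisfy
    the property's type signature. kb_types: entity -> set of KB types
    (obtained via entity linking); signatures: property -> (domain, range)."""
    s, p, o = triple
    domain_type, range_type = signatures[p]
    return (domain_type in kb_types.get(s, set())
            and range_type in kb_types.get(o, set()))

signatures = {"birthplace": ("person", "city")}
kb_types = {"Elvis_Presley": {"person", "musician"},
            "Tupelo": {"city", "location"}}
assert passes_type_check(("Elvis_Presley", "birthplace", "Tupelo"),
                         kb_types, signatures)
```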
Often, a number of patterns, rules, type-checking and other steps have to be combined into an entire execution plan to accomplish some extraction task. The underlying steps can be seen as operators in an algebraic language. The
System T project [485, 90, 89] has developed a declarative language, called AQL (Annotation Query Language), and a framework for orchestrating, optimizing and executing algebraic plans combining such operators. In addition to expressive kinds of pattern matching, the framework includes operators for text spans and for combining intermediate results. A similar project, with a declarative language called
Xlog, was pursued by [522, 521], and closely related research was carried out by [50] and [260].

To illustrate the notion and value of operator-based execution plans, assume that we want to extract from a large corpus of web pages statements about music bands and their live concerts, specifically, the relation between the involved musicians and the instruments that they played. An example input could look as follows:
Led Zeppelin returns with rocking London reunion.
The quartet had a crowd of around 20,000 at London’s 02 Arena calling for more at the end of 16 tracks ranging from their most famous numbers to less familiar fare. Lead singer Robert Plant, 59, strutted his way through “Good Times Bad Times” to kick off one of the most eagerly-anticipated concerts in recent years. A grey-haired Jimmy Page, 63, reminded the world why he is considered one of the lead guitar greats, while John Paul Jones, 61, showed his versatility jumping from bass to keyboards. Completing the quartet was Jason Bonham on drums.
We aim to extract all triples of the form playsInstrument: musician × instrument, namely, the five SO pairs (Robert Plant, vocals), (John Paul Jones, bass), (John Paul Jones, keyboards), (Jimmy Page, guitar), (Jason Bonham, drums). In addition, we want to check that this really refers to a live performance. The extraction task entails several steps:

1. Detect all person names in the text.
2. Check that the mentions of people are indeed musicians, by entity linking and type checking, or accept them as out-of-KB entities.
3. Detect all mentions of musical instruments, using the instances from the KB type music instruments, including specializations such as Gibson Les Paul for electric guitar and paraphrases from the KB dictionary of labels, such as “singer” for vocals.
4. Check that the instrument mentions refer to specific mentions of musicians, for example, by considering only pairs of musician and instrument that appear in the same sentence. Here the text proximity condition may have to be varied depending on the nature and style of the text, for example, by testing for co-occurrences within a text span of 30 words, or by first applying co-reference resolution. Sometimes, even deeper analysis is called for, to handle difficult cases where subject and object co-occur incidentally without standing in the proper relation to each other.
5. Check that the entire text refers to a live performance. For example, this could require checking for the occurrence of a date and specific kinds of location like concert halls, theaters, performance arenas, music clubs and bars, or festivals.

There are many ways for ordering the execution of these steps, each of which involves sub-steps (e.g., for matching performance locations), or for running them in parallel or in a pipelined manner (with intermediate results streamed between steps). When applying such an entire operator ensemble to a large collection of input pages, the choice of order or parallel execution is crucial. The reason is that different choices incur largely different costs, because they materialize intermediate results of highly varying sizes. The System T approach therefore views the entire ensemble as a declarative task, and invokes a query optimizer to pick the best execution plan. This involves estimating the selectivities of operators, that is, the fraction of pages that yield intermediate results and the number of intermediate candidates per page. For query optimization over text, this cost-estimation aspect is still underexplored, notwithstanding results by the SystemT project [485] as well as [522] and [258].
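The core idea of selectivity-based operator ordering can be illustrated with the textbook predicate-ordering heuristic (a simplification for illustration, not SystemT's optimizer): run cheap filters that discard many pages first.

```python
def order_operators(operators):
    """operators: list of (name, predicate, pass_rate, cost_per_page).
    Rank filters by cost / (1 - pass_rate): cheap, highly selective
    operators come first, so pages are discarded as early as possible."""
    return sorted(operators, key=lambda op: op[3] / max(1e-9, 1.0 - op[2]))

def run_plan(pages, operators):
    """Stream the page collection through the reordered filter pipeline."""
    for name, predicate, _, _ in order_operators(operators):
        pages = [page for page in pages if predicate(page)]
    return pages
```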
Extraction patterns can also be specified for texts or lists as inputs. This is often amazingly easy and can yield high-quality outputs. For example, many Wikipedia articles and other kinds of online biographies contain sentences such as “Presley was born in Tupelo, Mississippi”. By the stylistic conventions of such web sites, there is little variation in these formulations. Therefore, a regex pattern like
  P * born in * C

with P bound to a person entity and C to a city can robustly extract the birth places of many people. Analogously, a pattern like

  P .* (received|won) .* (award(s)?)? .* A

applied to single sentences (with (...)? denoting optional occurrences) can extract outputs such as

  (Bob Dylan, hasWon, Grammy Award)
  (Bob Dylan, hasWon, Academy Award)
  (Elvis Presley, hasWon, Grammy Award)
Obviously, there are many other ways of phrasing statements about someone winning an award (e.g., “awarded with”, “honored by”). Manually specifying all of these would be a bottleneck; so we will discuss how to learn patterns based on distant supervision in Section 6.2.2. Nevertheless, the effort of identifying a few widely used patterns is modest and goes a long way in “picking low-hanging fruit”.
Simple regex patterns with surface tokens and wildcards, like the ones shown above, often face a dilemma of either being too specific or too liberal, thus sacrificing either recall or precision. For example, the pattern

  P played his X

with P and X matching a musician and a musical instrument, respectively, can capture only a subset of male guitarists, drummers, etc. Moreover, it misses out on more elaborate phrases such as “Jimmy Page played his bowed guitar”. By employing NLP tools for first creating word-level annotations like lemmatization and POS tags (see Section 3.2), more expressive regex patterns can be specified, for example
  P play $PPZ ($ADJ)? X

where “play” is the lemma for “plays”, “played” etc., and $PPZ and $ADJ are the tags for possessive personal pronouns and adjectives, respectively. This pattern would also capture a triple ⟨PJ Harvey, playsInstrument, guitar⟩ from the sentence “PJ Harvey plays her grungy guitar”. Further generalization could consider pre-processing sentences by dependency parsing, so as to capture arguments that are distant in the token sequence of surface text but close when understanding the grammatical structure. Figure 6.2 shows examples where the parsing reveals short-distance paths between arguments for extracting instances of playsInstrument. The third example may fail in practice, as the path between musician and instrument is not sufficiently short. However, by additionally running coreference resolution (see Section 5.1), the word “himself” can be linked back to “Bob Dylan”, thus shortening the path.

Note, though, that the extra effort of dependency parsing or coreference resolution is worthwhile only for sufficiently frequent patterns and properties that cannot be harvested by easier means. Moreover, parsing may fail on inputs with ungrammatical sentences, like in social media.
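Over pre-annotated tokens (surface form, lemma, POS tag), such a lexico-syntactic pattern can be matched with a few lines of code; the tag names follow the pattern's convention above, and the tokenized sentence is invented for illustration:

```python
# tokens: (surface, lemma, tag); tag names PPZ / ADJ follow the pattern above
sentence = [("PJ", "PJ", "NNP"), ("Harvey", "Harvey", "NNP"),
            ("plays", "play", "VBZ"), ("her", "her", "PPZ"),
            ("grungy", "grungy", "ADJ"), ("guitar", "guitar", "NN")]

def match_plays_instrument(tokens):
    """Match the pattern  P play $PPZ ($ADJ)? X  over lemma/POS annotations."""
    for i, (_, lemma, _) in enumerate(tokens):
        if lemma != "play" or i == 0 or i + 1 >= len(tokens):
            continue
        j = i + 1
        if tokens[j][2] != "PPZ":
            continue
        j += 1
        if j < len(tokens) and tokens[j][2] == "ADJ":   # optional adjective
            j += 1
        if j < len(tokens):
            subject = " ".join(t[0] for t in tokens[:i])  # naive subject span
            return (subject, "playsInstrument", tokens[j][0])
    return None

print(match_plays_instrument(sentence))
# ('PJ Harvey', 'playsInstrument', 'guitar')
```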
Figure 6.2: Examples for Relation Extraction based on Dependency Parsing Paths (“Jimmy Page used a cello bow on his double-necked electric guitar”; “PJ Harvey delighted the audience with new songs and surprised them with her saxophone playing”; “Bob Dylan performed Blind Willie McTell with Mark Knopfler on guitar and himself playing piano”)
Patterns are also frequent in headings of lists, including Wikipedia categories. The English edition of Wikipedia contains more than a million categories and lists with informative names such as “list of German scientists”, “list of French philosophers”, “Chinese businesswomen” or “astronauts by nationality”. As discussed in Section 3.2, we use such cues for inferring semantic types, but we can also exploit them for deriving statements for specific properties like hasProfession, hasNationality or bornInCountry and more. Especially when harvesting Wikipedia, judicious specification of patterns with frequent occurrences yields high-quality outputs at substantial scale. This has been demonstrated by the WikiNet project [415].
List of awards and nominations received by Bob Dylan

Academy_Awards

Year | Category | … |
---|---|---|
2000 | Best Original Song | … |

Year | Nominee | … | Result |
---|---|---|---|
1965 | The Times They Are A-Changin‘ | … | Nominated |
1973 | The Concert for Bangla Desh | … | Won |