Machine Knowledge: Creation and Curation of Comprehensive Knowledge Bases
Gerhard Weikum, Luna Dong, Simon Razniewski, Fabian Suchanek
Submitted to Foundations and Trends in Databases
Gerhard Weikum, Max Planck Institute for Informatics, [email protected]
Luna Dong, [email protected]
Simon Razniewski, Max Planck Institute for Informatics, [email protected]
Fabian Suchanek, Telecom Paris University, [email protected]
September 25, 2020
Abstract
Equipping machines with comprehensive knowledge of the world's entities and their relationships has been a long-standing goal of AI. Over the last decade, large-scale knowledge bases, also known as knowledge graphs, have been automatically constructed from web contents and text sources, and have become a key asset for search engines. This machine knowledge can be harnessed to semantically interpret textual phrases in news, social media and web tables, and contributes to question answering, natural language processing and data analytics.
This article surveys fundamental concepts and practical methods for creating and curating large knowledge bases. It covers models and methods for discovering and canonicalizing entities and their semantic types and organizing them into clean taxonomies. On top of this, the article discusses the automatic extraction of entity-centric properties. To support the long-term life-cycle and the quality assurance of machine knowledge, the article presents methods for constructing open schemas and for knowledge curation. Case studies on academic projects and industrial knowledge graphs complement the survey of concepts and methods.
Enhancing computers with "machine knowledge" that can power intelligent applications is a long-standing goal of computer science [323]. This formerly elusive vision has become practically viable today, made possible by major advances in knowledge harvesting. This comprises methods for turning noisy Internet content into crisp knowledge structures on entities and relations. The knowledge harvesting methodology has enabled the automatic construction of knowledge bases (KB): collections of machine-readable facts about the real world. Today, publicly available KBs provide millions of entities (such as people, organizations, locations and creative works like books, music etc.) and billions of statements about them (such as who studied where, which country has which capital, or which singer performed which song). Proprietary KBs deployed at major companies comprise knowledge at an even larger scale, with one or two orders of magnitude more entities.
A prominent use case where knowledge bases have become a key asset is web search. When we send a query like "dylan protest songs" to Baidu, Bing or Google, we obtain a crisp list of songs such as Blowin' in the Wind, Masters of War, or A Hard Rain's a-Gonna Fall. So the search engine automatically detects that we are interested in facts about an individual entity – Bob Dylan in this case – and asks for specifically related entities of a certain type – protest songs – as answers. This is feasible because the search engine has a huge knowledge base in its back-end data centers, aiding in the discovery of entities in user requests (and their contexts) and in finding concise answers.
The KBs in this setting are centered on individual entities, containing (at least) the following backbone information:
• entities like people, places, organizations, products, events, such as Bob Dylan or the Stockholm City Hall
• the semantic classes to which entities belong, for example ⟨Bob Dylan, type, singer-songwriter⟩, ⟨Bob Dylan, type, poet⟩
• relationships between entities, such as ⟨Bob Dylan, created, Blowin' in the Wind⟩, ⟨Bob Dylan, won, Nobel Prize in Literature⟩
Some KBs also contain validity times such as
• ⟨Bob Dylan, married to, Sara Lownds, [1965,1977]⟩
This temporal scoping is optional, but very important for the life-cycle management of a KB as the real world evolves over time. In the same vein of long-term quality assurance, KBs may also contain constraints and provenance information.
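To make the shape of such entity-centric statements concrete, the following minimal sketch (our own illustration; the names Statement, subject, predicate and valid are chosen for exposition and not prescribed by any particular KB) shows how statements with an optional validity time could be represented in a few lines of Python.

from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class Statement:
    subject: str                             # canonical entity identifier, e.g. "Bob_Dylan"
    predicate: str                           # relation identifier, e.g. "married_to"
    obj: str                                 # entity identifier or literal value
    valid: Optional[Tuple[int, int]] = None  # optional validity interval (start year, end year)

kb = {
    Statement("Bob_Dylan", "type", "singer-songwriter"),
    Statement("Bob_Dylan", "created", "Blowin'_in_the_Wind"),
    Statement("Bob_Dylan", "won", "Nobel_Prize_in_Literature"),
    Statement("Bob_Dylan", "married_to", "Sara_Lownds", valid=(1965, 1977)),
}

# Retrieve all statements about one subject.
for s in kb:
    if s.subject == "Bob_Dylan":
        print(s.predicate, s.obj, s.valid)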
The concept of a comprehensive KB goes back to pioneering work in Artificial Intelligence on universal knowledge bases in the 1980s and 1990s, most notably, the Cyc project at MCC in Austin [322] and the WordNet project at Princeton [159]. However, these knowledge collections have been hand-crafted and curated manually. Thus, the knowledge acquisition was inherently limited in scope and scale. With the Semantic Web vision in the early 2000s, domain-specific ontologies [551] have been developed, but these were also manually created. In the first decade of the 2000s, automatic knowledge harvesting from Web and text sources became a major research avenue, and has made substantial practical impact. Knowledge harvesting is the core methodology for the automatic construction of large knowledge bases, going beyond manually compiled knowledge collections like Cyc or WordNet.
These achievements are rooted in academic research and community projects. Salient projects that started in the 2000s are DBpedia [18], Freebase [51], KnowItAll [153], WebOfConcepts [110], WikiTaxonomy [456] and YAGO [562]. More recent projects with publicly available data include BabelNet [418], ConceptNet [548], DeepDive [527], EntityCube (aka. Renlifang) [425], KnowledgeVault [129], NELL [71], Probase [631], WebIsALOD [232], Wikidata [600], and XLore [617]. More on the history of KB technology can be found in the overview article [245].
At the time of writing this survey, the largest general-purpose KBs with publicly accessible contents are Wikidata (wikidata.org), BabelNet (babelnet.org), DBpedia (dbpedia.org), and YAGO (yago-knowledge.org). They contain millions of entities, organized in hundreds to hundreds of thousands of semantic classes, and hundreds of millions to billions of relational statements on entities. These and other knowledge resources are interlinked at the entity level, forming the Web of Linked Open Data [225, 244].
Over the 2010s, knowledge harvesting has been adopted at big industrial stakeholders [429], and large KBs have become a key asset in a variety of commercial applications, including semantic search (see, e.g., [35, 484]), analytics (e.g., aggregating by entities), recommendations (see, e.g., [195]), and data integration (i.e., to combine heterogeneous datasets in and across enterprises). Examples are the Google Knowledge Graph [536], the use of KBs in IBM Watson [163], the Amazon Product Graph [127, 133], the Alibaba e-Commerce Graph [360], the Baidu Knowledge Graph [26], Microsoft Satori [467], Wolfram Alpha [248] as well as domain-specific knowledge bases in business, finance, life sciences, and more (e.g., at Bloomberg [375]).
In addition, KBs have found wide use as a source of distant supervision for a variety of tasks in natural language processing, such as entity linking.
Knowledge bases enable or enhance a wide variety of applications.
Semantic Search and Question Answering:
All major search engines have some form of KB as a background asset. Whenever a user's information need centers around an entity or a specific type of entities, such as singers, songs, tourist locations, companies, products, sports events etc., the KB can return a precise and concise list of entities rather than merely giving "ten blue links" to web pages. The earlier example of asking for "dylan protest songs" is typical for this line of semantic search. Even when the query is too complex or the KB is not complete enough to enable entity answers, the KB information can help to improve the ranking of web-page results by considering the types and other properties of entities. Similar use cases arise in enterprises as well, for example, when searching for customers or products with specific properties, or when forming a new team with employees who have specific expertise and experience.
An additional step towards user-friendly interfaces is question answering (QA) where the user poses a full-fledged question in natural language and the system aims to return crisp entity-style answers from the KB or from a text corpus or a combination of both. An example for KB-based QA is "Which songs written by Bob Dylan received Grammys?"; answers include All Along the Watchtower, performed by Jimi Hendrix, which received a Hall of Fame Grammy Award. An ambitious example that probably requires tapping into both KB and text would be "Who filled in for Bob Dylan at the Nobel Prize ceremony in Stockholm?"; the answer is Patti Smith.
Overviews on semantic search and question answering with KBs include [35, 119, 320, 484, 589].
Language Understanding and Text Analytics:
Both written and spoken language are full of ambiguities. Knowledge is the key to mapping surface phrases to their proper meanings, so that machines interpret language as fluently as humans. AI-style use cases include machine translation, and conversational assistants like chatbots. Prominent examples include Amazon's Alexa, Apple's Siri, Google's Assistant and new chatbot initiatives [2], and Microsoft's Cortana.
In these applications, world knowledge plays a crucial role. Consider, for example, sentences like "Jordan holds the record of 30 points per match" or "The forecast for Jordan is a record high of 110 degrees". The meaning of the word "Jordan" can be inferred by having world knowledge about the basketball champion Michael Jordan and the Middle East country Jordan.
Understanding entities (and their properties and relations) in text is also key to large-scale analytics over news articles, scientific publications, review forums, or social media discussions. For example, we can identify mentions of products (and associated consumer opinions), link them to a KB, and then perform comparative and aggregated studies. We can even incorporate filters and groupings on product categories, geographic regions etc., by combining the textual information with structured data from the KB or from product and customer databases. All this can be enabled by the KB as a clean and comprehensive repository of entities (see, e.g., [523] for a survey on the core task of entity linking).
A trending example of semantic text analytics is detecting gender bias in news and other online content (see, e.g., [565]). By identifying people in the text and looking up their gender in the KB, we can compute statistics over male vs. female people in political offices or on company boards. If we also extract earnings from movies and ask the KB to give us actors and actresses, we can shed light on potential unfairness in the movie industry.
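As a toy illustration of how such background knowledge resolves ambiguity, the following sketch (entirely our own; the candidate entities and their context terms are hand-picked and hypothetical) scores candidates for the mention "Jordan" by word overlap with context terms that a KB could supply. Real entity-linking systems are far more elaborate (see [523]).

# Candidate entities with KB-derived context terms (hypothetical, hand-picked for illustration).
candidates = {
    "Michael_Jordan": {"basketball", "nba", "points", "record", "bulls", "match"},
    "Jordan_(country)": {"amman", "middle", "east", "forecast", "degrees", "record", "temperature"},
}

def disambiguate(sentence: str) -> str:
    words = set(sentence.lower().replace(".", "").split())
    # Pick the candidate whose KB context overlaps most with the sentence.
    return max(candidates, key=lambda e: len(candidates[e] & words))

print(disambiguate("Jordan holds the record of 30 points per match"))            # Michael_Jordan
print(disambiguate("The forecast for Jordan is a record high of 110 degrees"))   # Jordan_(country)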
Visual Understanding:
For detecting objects and concepts in images (and videos), computer vision has made great advances using machine learning. The training data for these tasks are collections of semantically annotated images, which can be viewed as visual knowledge bases. The most well-known example is ImageNet ([115]) which has populated a subset of WordNet concepts with a large number of example images. A more recent and more advanced endeavor along these lines is VisualGenome ([295]). By themselves, these assets already go a long way, but their value can be further boosted by combining them with additional world knowledge. For example, knowing that lions and tigers are both predators from the big cat family and usually prey on deer or antelopes, can help to automatically label scenes as "cats attack prey". Likewise, recognizing landmark sites such as the Brandenburg Gate and having background knowledge about them (e.g., other sites of interest in their vicinity) helps to understand details and implications of an image. In such computer vision and further AI applications, a KB often serves as an informed prior for machine learning models, or as a reference for consistency (or plausibility) checks.
Data Cleaning:
Coping with incomplete and erroneous records in large heterogeneous data is a classical topic in database research (see, e.g., [472]). The problem is more timely and pressing than ever. Data scientists and business analysts want to rapidly tap into diverse datasets, for comparison, aggregation and joint analysis. So different kinds of data need to be combined and fused, more or less on the fly and thus largely depending on automated tools. This trend amplifies the crucial role of identifying and repairing missing and incorrect values.
In many cases, the key to spotting and repairing errors, or to inferring missing values, is consistency across a set of records. For example, suppose that a database about music has a new tuple stating that Jeff Bezos won the Grammy Award. A background knowledge base would tell that Bezos is an instance of types like businesspeople, billionaires, company founders etc., but there is no type related to music. As the Grammy is given only for songs, albums and musicians, the tuple about Bezos is likely a data-entry error. In fact, the requirement that Grammy winners, if they are of type person, have to be musicians, can be encoded into a logical consistency constraint. Several KBs contain such consistency constraints. They typically include:
• type constraints, e.g.: a Grammy winner who belongs to the type person must also be an instance of type musician (or a sub-type),
• functional dependencies, e.g.: for each year and each award category, there is exactly one Grammy winner,
• inclusion dependencies, e.g.: composers are also musicians and thus can win a Grammy, and all Grammy winners must have at least one song to which they contributed (i.e., the set of Grammy winners is a subset of the set of people with at least one song),
• disjointness constraints, e.g.: songs and albums are disjoint, so no piece of music can simultaneously win both of these award categories for the Grammy,
• temporal constraints, e.g.: the Grammy award is given only to living people, so that someone who died in a certain year cannot win it in any later year.
Data cleaning as a key stage in data integration is surveyed by [256, 255] and the articles in [328].
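As a minimal sketch of the first kind of constraint above, assuming a toy dictionary of class memberships taken from a background KB (the entity names and type labels are our own), a type-constraint check could flag the erroneous tuple as follows.

# Hypothetical class memberships looked up in a background KB.
kb_types = {
    "Jeff_Bezos": {"person", "businessperson", "billionaire", "company_founder"},
    "Billie_Eilish": {"person", "musician", "singer"},
}

# Type constraint: a Grammy winner that is a person must also be a musician.
def violates_grammy_constraint(winner: str) -> bool:
    types = kb_types.get(winner, set())
    return "person" in types and "musician" not in types

new_tuples = [("Jeff_Bezos", "won", "Grammy_Award"),
              ("Billie_Eilish", "won", "Grammy_Award")]
for subj, pred, obj in new_tuples:
    if pred == "won" and obj == "Grammy_Award" and violates_grammy_constraint(subj):
        print("likely data-entry error:", subj)   # flags Jeff_Bezos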
Are knowledge bases part of the Semantic Web, and can they be used only for Semantic Web applications?
The big revival of knowledge bases in this millennium originated from research projects in the Semantic Web community. This is why some design choices favor models and methods from the Semantic Web. For example, the RDF data model is popular among KBs, and the query language of choice is often close to the SPARQL language rather than SQL. However, it is easy to move KBs into the ecosystem of other data models and their tool suites. In particular, KBs can be stored, accessed and managed by relational database systems as well, and can be used also with NoSQL platforms such as Apache Spark and with cloud-based services. Likewise, it is easy to combine KBs with popular tools for machine learning such as TensorFlow, SciPy etc.
How is knowledge harvesting related to the field of information extraction?
Information extraction (IE) (see, e.g., [511, 162, 208]) comprises methodologies for recognizing and semantically annotating meaningful units in natural language and other kinds of noisy contents (e.g., ad-hoc tables in web pages, or query-and-click logs), building on text mining and machine learning. Given an arbitrary input, IE aims at a best-effort job on computing value-added mark-up. Knowledge harvesting leverages IE methods, but it is output-driven. To construct a high-quality KB, judicious choices about input sources and extraction strategies are crucial considerations. For example, we often want to "pick low-hanging fruit" first, for high quality, and tap into noisier sources only subsequently and with specific focus and customized techniques.
Why do we need machine knowledge, when we already have end-to-end machine learning working so well?
Machine learning (ML), especially deep neural networks, works well when there is sufficient training data with gold-standard labels. However, there are fundamental reasons why ML alone is not a full solution. First, training data is the typical bottleneck when tackling new applications, so that a lot of time and money needs to be spent on compiling and organizing the relevant data. This cost arises in each and every application again and again. Machine knowledge is an easily re-usable, versatile asset that can simplify and accelerate these expensive steps. Second, even the best deep learning methods are far from near-human quality in identifying crisp statements in complex, noisy and ambiguous texts and other input sources. There are fundamental reasons for these limitations: ML is based on the paradigm of learning with iid samples from a data distribution and applying the trained model to new samples from the same distribution (iid = independently identically distributed). In contrast, humans do not rely on situative data alone, but interpret observations and learn with a rich body of background experience – the human's general world knowledge. Therefore, machine learning and machine knowledge are complementary pillars of modern AI. The more a machine knows, the better it can learn; and better learning enables acquiring more and deeper knowledge.
Are knowledge bases simply some sort of databases?
Knowledge bases are data, and can, of course, be stored in standard databases. But their scope and use cases make them special in several ways. First, a KB is a reference repository of entities, types and vocabulary with open-ended scope, modeling broad domains or enterprise-wide knowledge or even aiming for universal encyclopedic coverage. To this end, KBs include a rich taxonomy of types, far more expressive than the schemas of usual databases. Second, as a consequence, KBs operate under the Open World Assumption, which allows data to be incomplete, and KBs are grown with continuous curation. Third, to fulfill these requirements, KBs need to continuously adapt and enhance their schemas for types and properties, following the paradigm of agile data spaces [206] rather than traditional "schema first" databases. This aspect led to the pragmatic choice for the RDF data model with binary relations, to allow new types and new properties to be added in a light-weight manner.
Are knowledge bases useful for enterprises?
KBs are useful as reference data in many ways. They contain encyclopedic knowledge about the world's notable entities, including people, places, events, and organizations. This can serve as background knowledge in enterprises and their applications. In the travel and tourism industry, for example, KBs can contribute knowledge about vacation sites, natural or cultural points of interest, and geography. Several of the publicly available KBs are rich in this kind of geo-spatial knowledge. Industrial applications can combine this with their data about hotels, flights, commercial tours etc.
There are also KBs specifically geared for a vertical domain or a company. In the health domain, for example, KBs can collect background knowledge about diseases, drugs, symptoms, therapies and their properties. In a company, a KB can provide relevant knowledge about customers, products, product categories, sales regions, and so on.
Industrial applications require involving domain experts within the enterprise. The methodology presented in this article can help such teams automate and accelerate their endeavors.
Can high-tech startups benefit from knowledge bases?
KBs contribute to methods and tools for language understanding, data cleaning, machine-learning-based AI, semantic reasoning, and knowledge modeling. Therefore, several startups have set out to offer KB technology as a product. Some of these turned into big success stories: Freebase (by Metaweb Technologies, Inc.) was acquired by Google and kick-started the Google Knowledge Graph, and DeepDive (by Lattice Data, Inc.) was acquired by Apple.
This article covers methods for automatically constructing and curating large knowledge bases from web and text sources. We hope that it will be useful for doctoral students and faculty interested in a wide spectrum of topics – from machine knowledge and data quality to machine learning and data science as well as applications in web content mining and natural language understanding. In addition, this article aims to be useful also for industrial researchers and practitioners working on semantic technologies for web, social media, or enterprise contents, including all kinds of applications where sense-making from text or semi-structured data is an issue. Prior knowledge on natural language processing or statistical learning is not required; we will introduce relevant methods as they are needed (or at least give specific pointers to literature).
The article is organized into ten chapters. Chapter 2 gives foundational basics on knowledge representation and discusses the design space for building a KB. Chapters 3, 4 and 5 cover the methodology for constructing the core of a KB that comprises entities and types. Chapter 3 discusses tapping premium sources with rich and clean semi-structured contents, and Chapter 4 addresses knowledge harvesting from textual contents. Chapter 5 specifically focuses on the important issue of canonicalizing entities into unique representations. Chapters 6 and 7 extend the scope of the KB by methods for discovering and extracting attributes of entities and relations between entities. Chapter 6 focuses on the case where a schema is designed upfront for the properties of interest. Chapter 7 discusses the case of discovering new property types for attributes and relations that are not (yet) specified in the KB schema. Chapter 8 discusses the issue of quality assurance for KB curation and the long-term maintenance of KBs. Chapter 9 presents several case studies on specific KBs including industrial knowledge graphs (KGs). We conclude in Chapter 10 with key lessons and an outlook on where the theme of machine knowledge may be heading.
Knowledge bases, KBs for short, comprise salient information about entities, semantic classes to which entities belong, attributes of entities, and relationships between entities. When the focus is on classes and their logical connections such as subsumption and disjointness, knowledge repositories are often referred to as ontologies. In database terminology, this is referred to as the schema. The class hierarchy alone is often called a taxonomy. The notion of KBs in this article covers all these aspects of knowledge, including ontologies and taxonomies.
This chapter presents foundations for casting knowledge into formal representations. Knowledge representation has a long history, spanning decades of AI research, from the classical model of frames to recent variants of description logics. Overviews on this spectrum are given by [504] and [551]. In this article, we restrict ourselves to the knowledge representation that has emerged as a pragmatic consensus for entity-centric knowledge bases (see [564] for an extended discussion). More on the wide spectrum of knowledge modeling can be found in the survey [245].
The most basic element of a KB is an entity.
An entity is any abstract or concrete object of fiction or reality.
This definition includes people, places, products and also events and creative works (books, poems, songs etc.), real people (living or dead) as well as fictional people (e.g., Harry Potter), and also general concepts such as empathy and Buddhism. KBs take a pragmatic approach: they model only entities that match their scope and purpose. A KB on writers and their biographies would include Shakespeare and his drama Macbeth, but it may not include the characters of the drama such as King Duncan or Lady Macbeth. However, a KB for literature scholars – who want to analyze character relationships in literature content – should include all characters from Shakespeare's works.
Individual Entities (aka. Named Entities):
We often narrow down the set of entities of interest by emphasizing uniquely identifiable entities and distinguishing them from general concepts.
An individual entity is an entity that can be uniquely identified against all other entities.
To uniquely identify a location, we can use its geo-coordinates: longitude and latitude with a sufficiently precise resolution. To identify a person, we would – in the extreme case – have to use the person's DNA sequence, but for all realistic purposes a combination of full name, birthplace and birthdate is sufficient. In practice, we are typically even coarser and just use location or person names when there is near-universal social consensus about what or who is denoted by the name. Wikipedia article names usually follow this principle of using names that are sufficiently unique. For these reasons, individual entities are also referred to as named entities.
Identifiers and Labels:
To denote an entity unambiguously, we need a name that can refer to only a single entity. Such an identifier can be a unique name, but identifiers can also be specifically introduced keys such as URLs for web sites, ISBNs for books (which can even distinguish different editions of the same book), DOIs for publications, Google Scholar URLs or ORCID IDs for authors, etc. In the data model of the Semantic Web, RDF (for Resource Description Framework) [602], identifiers always take the form of URIs (Uniform Resource Identifiers, a generalization of URLs).
An identifier for an entity is a string of characters that uniquely denotes the entity.
As identifiers are not necessarily of a form that is nicely readable and directly interpretable by a human, we often want to have human-readable labels or names in addition. An entity can have several such labels, called synonyms or alias names (different names, same meaning). When different entities share a label such as "Hamlet", this label is a homonym (same name, different meanings).
Entities of interest often come in groups where all elements have a shared characteristic. For example, Bob Dylan, Elvis Presley and Lisa Gerrard are all musicians and singers. We capture this knowledge by organizing entities into classes or, synonymously, types.
A class (or type) is a named set of entities that share a common trait. An element of that set is called an instance of the class.
For Dylan, Presley and Gerrard, classes of interest include: musicians and singers with all three as instances, guitarists with Dylan and Presley as instances, men with these two, women containing only Gerrard, and so on.
Note that an entity can belong to multiple classes, and classes can relate to each other in terms of their members as being disjoint (e.g., men and women), overlapping, or one subsuming the other (e.g., musicians and guitarists). Classes can be quite specific; for example, left-handed electric guitar players could be a class in the KB containing Jimi Hendrix, Paul McCartney and others.
It is not always obvious whether something should be modeled as an entity or as a class. We could construct, for every entity, a singleton class that contains just this entity. Classes of interest typically have multiple instances, though. By this token, we do not consider general concepts such as love, Buddhism or pancreatic cancer as classes, unless we were interested in specific instances (e.g., the individual cancer of one particular patient).
Taxonomies (aka. Class Hierarchies):
By relating the instance sets of two classes, we can specify invariants that must hold between the classes, most notably, subsumption, also known as the subclass/superclass relation. By combining these pairwise invariants across all classes, we can thus construct a class hierarchy. We refer to this aspect of the KB as a taxonomy.
Class A is a subclass of (is subsumed by) class B if all instances of A must also be instances of B.
For example, the classes singers and guitarists are subclasses of musicians because every singer and every guitarist is a musician. We say that class X is a direct subclass of Y if there is no other class that subsumes X and is subsumed by Y. Classes can have multiple superclasses, but there should not be any cycles in the subsumption relation. For example, left-handed electric guitar players are a subclass of both left-handed people and guitarists.
A taxonomy is a directed acyclic graph, where the nodes are classes and there is an edge from class X to class Y if X is a direct subclass of Y.
Note that these invariants do not just describe the current instances of such class pairs, but actually prescribe that the invariant holds for all possible instance sets. So the taxonomy acts like a database schema, and is instrumental for keeping the KB consistent. In database terminology, a subclass/superclass pair is also called an inclusion dependency. In the Semantic Web, the RDFS extension of the RDF model (RDFS = RDF Schema) allows specifying such constraints [604]. Other Semantic Web models, notably OWL, also support disjointness constraints (aka. mutual exclusion), for example, specifying that men and women are disjoint.
One of the largest taxonomic repositories is the WordNet lexicon [159], comprising more than 100,000 classes. Figure 2.1 visualizes an excerpt of the WordNet taxonomy. The nodes are classes, called word senses in WordNet; the edges indicate subsumption.
Figure 2.1: Excerpt from the WordNet Taxonomy. (Figure omitted in this text version; it shows a fragment of the class hierarchy under Person, with subclasses such as Creator, Artist, Scientist, Adventurer and Musician, and instrument-specific musician classes such as Accordionist, Bassist, Cellist, Guitarist and Harpist, connected by subsumption edges.)
In linguistic terminology, this lexical relation is called hypernymy: an edge connects a more special class, called hyponym, with a generalized class, called hypernym. [203] gives an overview of this kind of lexical resource. Further examples of (potentially unclean) taxonomies include the Wikipedia category system or product catalogs.
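As a quick way to inspect such hypernym chains, the following sketch uses the open-source nltk library and its WordNet interface (this assumes nltk and its WordNet corpus are installed; the synset names are WordNet's own, and the exact chain printed may vary across WordNet versions).

# Requires: pip install nltk, then nltk.download("wordnet") once.
from nltk.corpus import wordnet as wn

guitarist = wn.synset("guitarist.n.01")
# Walk the hypernym (superclass) chain upwards until a root synset is reached.
chain = [guitarist]
while chain[-1].hypernyms():
    chain.append(chain[-1].hypernyms()[0])
print(" -> ".join(s.name() for s in chain))
# e.g. guitarist.n.01 -> musician.n.01 -> ... -> person.n.01 -> ... -> entity.n.01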
Subsumption vs. Part-Whole Relation:
Class subsumption should not be confused or conflated with the relationship between parts and wholes. For example, a soprano saxophone is part of a jazz band. This does not mean, however, that every soprano saxophone is a jazz band. Likewise, New York is part of the USA, but New York is not a subclass of the USA.
Instance-of vs. Subclass-of:
Some KBs do not make a clear distinction between classes and instances, and they collapse the instance-of and subclass-of relations into a single is-a hierarchy. Instead of stating that Bob Dylan is an instance of the class singers and that singers are a subclass of musicians, they would view all three as general entities and connect them in a generalization graph.
Entities have properties such as birthdate, birthplace and height of a person, prizes won, books, songs or software written, and so on. KBs capture these in the form of mathematical relations:
A relation or relationship for the instances of classes C1, ..., Cn is a subset of the Cartesian product C1 × ... × Cn, along with an identifier (i.e., unique name) for the relation.
For example, we can state the birthdate and birthplace of Bob Dylan in the relational form:
⟨Bob Dylan, 1941-05-24, Duluth (Minnesota)⟩ ∈ birth
where birth is the identifier of the relation. This instance of the birth relation is a ternary tuple, that is, it has three arguments: the person entity, the birthdate, and the birthplace. The underlying Cartesian product of the relation is persons × dates × cities.
In logical notation, we also write R(x1, ..., xn) instead of ⟨x1, ..., xn⟩ ∈ R, and we refer to R as a predicate. The number of arguments, n, is called the arity of R. The domain C1 × ... × Cn is also called the relation's type signature.
As most KBs are of encyclopedic nature, the instances of a relation are often referred to as facts. We do not want to exclude knowledge that is not fact-centric (e.g., commonsense knowledge with a socio-cultural dimension); so we call relational instances more generally statements. The literature also speaks of facts, and sometimes uses the terminology assertion as well. For this article, the three terms statement, fact and assertion are more or less interchangeable.
In logical terms, statements are grounded expressions of first-order predicate logic (where "grounded" means that the expression has no variables). In the KB literature, the term "relation" is sometimes used to denote both the relation identifier R and an instance ⟨x1, ..., xn⟩. We avoid this ambiguity, and more precisely speak of the relation and its (relational) tuples.
Attributes of Entities:
In the above example about the birth relation, we made use of the class dates. By stating this, we consider individual dates, such as 1941-05-24, entities. It is a design choice whether we regard numerical expressions like dates, heights or monetary amounts as entities or not. Often, we want to treat them simply as values for which we do not have any additional properties. In the RDF data model, such values are called literals. Strings such as nicknames of people (e.g., "Air Jordan") are another frequent type of literals.
We introduce a special class of relations with two arguments where the first argument is an entity of interest, such as Bob Dylan or Michael Jordan (the basketball player), and the second argument is a value of interest, such as their heights "171 cm" and "198 cm", respectively.
The case for binary relations with values as second argument largely corresponds to the modeling of entity attributes in database terminology. Such relations are restricted to be functions: for each entity as first argument there is only one value for the second argument. We denote attributes in the same style as other relational properties, but we use numeric or string notation to distinguish the literals from entities:
⟨Michael Jordan, 198cm⟩ ∈ height or height(Michael Jordan, 198cm)
⟨Michael Jordan, "Air Jordan"⟩ ∈ nickname or nickname(Michael Jordan, "Air Jordan")
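A minimal sketch of this functionality requirement (using the height values quoted above and our own function naming) checks that no entity is assigned two distinct values for the same attribute.

height = [("Michael_Jordan", "198 cm"), ("Bob_Dylan", "171 cm")]

def is_functional(pairs):
    # An attribute relation is a function if no entity has two distinct values.
    seen = {}
    for entity, value in pairs:
        if entity in seen and seen[entity] != value:
            return False
        seen[entity] = value
    return True

print(is_functional(height))                               # True
print(is_functional(height + [("Bob_Dylan", "180 cm")]))   # False: second, conflicting value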
Relations between Entities:
In addition to their attributes, entities are characterized by their relationships with other entities, for example, the birthplaces of people, prizes won, songs written or performed, and so on. Mathematical relations over classes, as introduced above, are the proper formalism for representing this kind of knowledge. The frequent case of binary relations captures the relationship between exactly two entities.
Some KBs focus exclusively on binary relations, and the Semantic Web data model RDF has specific terminology and formal notation for this case of so-called subject-predicate-object triples, or SPO triples, or merely triples for short.
The RDF model restricts the three roles in a subject-predicate-object (SPO) triple as follows:
• S must be a URI identifying an entity,
• P must be a URI identifying a relation, and
• O must be a URI identifying an entity for a relationship between entities, or a literal denoting the value of an attribute.
As binary relations can be easily cast into a labeled graph – with node labels for S and O and edge labels for P – knowledge bases that focus on SPO triples are widely referred to as knowledge graphs. SPO triples are often written in the form ⟨S, P, O⟩ or as S P O, with P being the relation between subject and object. Examples of SPO triples are:
Bob Dylan married to Sara Lownds
Bob Dylan composed Blowin' in the Wind
Blowin' in the Wind composed by Bob Dylan
Bob Dylan has won Nobel Prize in Literature
Bob Dylan type Nobel Laureate
The examples also illustrate the notion of inverse relations: composed by is inverse to composed, and can also be written as composed⁻¹: ⟨S, O⟩ ∈ P ⇔ ⟨O, S⟩ ∈ P⁻¹.
The last example in the above table shows that an entity belonging to a certain class can also be written as a binary relation, with type as the predicate, following the RDF standard. It also shows that knowledge can sometimes be expressed either by class membership or by a binary-relation property. In this case, the latter adds information (Nobel Prize in Literature) and the former is convenient for querying (about all Nobel Laureates). Moreover, having a class Nobel Laureate allows us to define further relations and attributes with this class as domain. To get the benefits of all this, we may want to have both of the example triples in the KB.
An advantage of binary relations is that they can express facts in a self-contained manner, even if some of the arguments for a higher-arity relation are missing or the instances of the relations are only partly known. For example, if we know only Dylan's birthplace but not his birthdate (or vice versa), capturing this in the ternary relation birth is a bit awkward as the unknown argument would have to be represented as a null value (i.e., a placeholder for an unknown or undefined value). In database systems, null values are standard practice, but they often make things complicated. In KBs, the common practice is to avoid null values and prefer binary relations where we can simply have a triple for the known argument (birthplace) and nothing else.
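As one concrete (but by no means mandated) way to work with such triples in practice, the following sketch uses the open-source rdflib library to store three of the example triples under a made-up namespace and to retrieve them with a SPARQL query; the namespace URI and entity identifiers are our own.

# pip install rdflib
from rdflib import Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/kb/")
g = Graph()
g.add((EX.Bob_Dylan, EX.composed, EX.Blowin_in_the_Wind))
g.add((EX.Bob_Dylan, EX.has_won, EX.Nobel_Prize_in_Literature))
g.add((EX.Bob_Dylan, RDF.type, EX.Nobel_Laureate))

# SPARQL query over the small graph: everything stated about Bob Dylan.
query = "SELECT ?p ?o WHERE { <http://example.org/kb/Bob_Dylan> ?p ?o }"
for p, o in g.query(query):
    print(p, o)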
Higher-Arity Relations:
Some KBs emphasize binary relations only, leading to the notion of knowledge graphs (KGs). However, ternary and higher-arity relations can play a big role, and these cannot be directly captured by a graph.
At first glance, it may seem that we can always decompose a higher-arity relation into multiple binary relations. For example, instead of introducing the ternary relation birth: person × date × city, we can alternatively use two binary relations: birthdate: person × date and birthplace: person × city. In this case, no information is lost by using simpler binary relations. Another case where such decomposition works well is the relation that contains all tuples of parents, sons and daughters: children: person × boys × girls. This could be equally represented by two separate relations sons: person × boys and daughters: person × girls. In fact, database design theory tells us that this decomposition is a better representation, based on the notion of multi-valued dependencies [588].
However, not every higher-arity relation is decomposable without losing information. Consider a quaternary relation won: person × award × year × field capturing who won which prize in which year for which scientific field. Instances would include ⟨Marie Curie, Nobel Prize, 1903, physics⟩ and ⟨Marie Curie, Nobel Prize, 1911, chemistry⟩. If we simply split these 4-tuples into a set of binary-relation tuples (i.e., SPO triples), we would end up with:
⟨Marie Curie, Nobel Prize⟩, ⟨Marie Curie, 1903⟩, ⟨Marie Curie, physics⟩,
⟨Marie Curie, Nobel Prize⟩, ⟨Marie Curie, 1911⟩, ⟨Marie Curie, chemistry⟩.
Leaving the technicality of two identical tuples aside, the crux here is that we can no longer reconstruct in which year Marie Curie won which of the two prizes. Joining the binary tuples using database operations would produce spurious tuples, namely, all four combinations of 1903 and 1911 with physics and chemistry.
The Semantic Web data model RDF and its associated W3C standards (including the SPARQL query language) support only binary relations. They therefore exploit clever ways of encoding higher-arity relations into a binary representation, based on techniques related to reification [603]. Essentially, each instance of the higher-arity relation is given an identifier of type statement and that identifier is combined with the original relation's arguments into a set of binary tuples.
For the n-ary relation instance R(X1, X2, ..., Xn) the reified representation consists of the set of binary instances type(id, statement), arg1(id, X1), arg2(id, X2), ..., argn(id, Xn), where id is an identifier.
With this technique, the triple ⟨id type statement⟩ asserts the existence of the higher-arity tuple, and the additional triples fill in the arguments. In some KBs, the technique is referred to as compound objects, as the ⟨id type statement⟩ is expanded into a set of facets, often called qualifiers, with the number of facets even being variable (see, e.g., [230]). Reification can be applied to binary relations as well (if desired): the representation of ⟨S P O⟩ then becomes ⟨id type statement⟩, ⟨id hasSubject S⟩, ⟨id hasPredicate P⟩, ⟨id hasObject O⟩. Our use case for reification is higher-arity relations, though, most importantly, to capture events and their different aspects. Attaching provenance or belief information to statements is another case for reification.
The identifier id could be a number or a URI (as required by RDF).
The names of the facets of arguments argi (i = 1..n) can be arbitrarily chosen, but often capture certain properties that can be aptly reflected in their names. For example, a tuple for the higher-arity relation wonAward may result in the following triples:
id type statement
id hasPredicate wonAward
id winner Marie Curie
id award Nobel Prize
id year 1903
id field physics
The example additionally includes a triple that encodes the name of the property wonAward for which the n-tuple holds. Strictly speaking, we could then drop the ⟨id type statement⟩ triple without losing information, and the remaining triples are a typical knowledge representation for n-ary predicates (e.g., for events): one triple for the predicate itself and one for each of the n-ary predicate arguments. If we want to emphasize two of the arguments as major subject and object, we could also use a hybrid form with triples like ⟨id hasSubject Marie Curie⟩, ⟨id hasObject Nobel Prize⟩, ⟨id year 1903⟩, ⟨id field physics⟩.
The advantage of the triples representation is that it stays in the world of binary relations, and the notion of a knowledge graph still applies. In a graph model, the SPO triple that encodes the existence of the original n-ary R instance is often called a compound node (e.g., in Freebase and in Wikidata), and serves as the "gateway" to the qualifier triples.
The downside of reification and related techniques for casting n-ary relations into RDF is that they make querying more difficult, if not tedious. It requires more joins, and considering paths and not just single edges when dealing with compound nodes in the graph model. For this reason, some KBs have also pursued hybrid representations where for each higher-arity relation, the most salient pair of arguments are represented as a standard binary relation and reification is used only for the other arguments.
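A small sketch of this reification scheme (our own helper function and naming, mirroring the triples above) expands an n-ary statement into a statement identifier plus one triple per facet.

import itertools

_counter = itertools.count(1)

def reify(predicate, **facets):
    """Turn an n-ary statement into binary (S, P, O) triples via a fresh statement id."""
    stmt_id = f"stmt_{next(_counter)}"
    triples = [(stmt_id, "type", "statement"), (stmt_id, "hasPredicate", predicate)]
    triples += [(stmt_id, facet, value) for facet, value in facets.items()]
    return triples

for triple in reify("wonAward", winner="Marie_Curie", award="Nobel_Prize",
                    year=1903, field="physics"):
    print(triple)
# ('stmt_1', 'type', 'statement'), ('stmt_1', 'hasPredicate', 'wonAward'),
# ('stmt_1', 'winner', 'Marie_Curie'), ('stmt_1', 'award', 'Nobel_Prize'),
# ('stmt_1', 'year', 1903), ('stmt_1', 'field', 'physics')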
An important objective for a clean knowledge base is the uniqueness of the subjects, predicates and objects in SPO triples and other relational statements. We want to capture every entity and every fact about it exactly once, just like an enterprise database should contain every customer and her orders and account balance once and only once. As soon as redundancy creeps in, this opens the door for variations of the same information and hence potential inconsistency. For example, if we include two entities in our KB, Bob Dylan and Robert Zimmerman (Dylan's real name), without knowing that they are the same, we could attach different facts to them that may eventually contradict each other. Furthermore, we would distort the results of counting queries (counting two people instead of one person). This motivates the following canonicalization principle:
Each entity, class and property in a KB is canonicalized by having a unique identifier and being included exactly once.
For entities this implies the need for named entity disambiguation, also known as entity linking ([523]). For example, we need to infer that Bob Dylan and Robert Zimmerman are the same person and should have him as one entity with two different labels rather than two entities with different identifiers. The same principle should hold for classes, for example, avoiding that we have both guitarists and guitar players, and for properties as well.
We strive to avoid redundancy and the resulting ambiguities and potential inconsistencies. However, this goal is not always perfectly achievable in the entire life-cycle of KB creation, growth and curation. Some KBs settle for softer standards and allow diverse representations for the same facts to co-exist, effectively demoting entities and relations into literal values. Here is an example of such a softer (and hence less desirable, but still useful) representation:
Bob Dylan has won "Nobel Prize in Literature"
Bob Dylan has won "Literature Nobel Prize"
Bob Dylan has won award "Nobel"
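A toy sketch of the normalization that canonicalization requires (the alias table is hypothetical and hand-written; real KBs use entity linking as discussed in Chapter 5) maps such surface forms to one canonical identifier before statements are added.

# Hypothetical alias table mapping surface forms to canonical entity identifiers.
aliases = {
    "nobel prize in literature": "Nobel_Prize_in_Literature",
    "literature nobel prize": "Nobel_Prize_in_Literature",
    "nobel": "Nobel_Prize_in_Literature",   # in practice this short form would need context to resolve
}

def canonicalize(label: str) -> str:
    return aliases.get(label.strip().lower(), label)

raw = [("Bob_Dylan", "has_won", "Nobel Prize in Literature"),
       ("Bob_Dylan", "has_won", "Literature Nobel Prize")]
canonical = {(s, p, canonicalize(o)) for s, p, o in raw}
print(canonical)   # collapses to a single canonical statement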
In addition to the grounded statements about entities, classes and relational properties, KBs can also contain intensional knowledge in the form of logical constraints and rules. The purpose of constraints is to enforce the consistency of the KB: grounded statements that violate a constraint cannot be entered. For example, we do not allow a second birthdate for a person, as the birthdate property is a function, and we require creators of songs to be musicians (including composers and bands). The former is an example of a functional dependency, and the latter is an example of a type constraint. We discuss consistency constraints and their crucial role for KB curation in Chapter 8 (particularly, Section 8.3.1). The purpose of rules is to derive additional knowledge that logically follows from what the KB already contains. For example, if Bob is married to Sara, then Sara is married to Bob, by the symmetry of the spouse relation. We discuss rules in Section 8.3.2.
Machines cannot create any knowledge on their own; all knowledge about our world is created by humans (and their instruments) and documented in the form of encyclopedias, scientific publications, books, daily news, all the way to contents in online discussion forums and other social media. What machines actually do to construct a knowledge base is to tap into these sources, harvest their nuggets of knowledge, and refine and semantically organize them into a formal knowledge representation. This process of distilling knowledge from online contents is best characterized as knowledge harvesting [622, 621, 620].
This big picture opens up a design space for how we go about harvesting online contents towards large-scale knowledge bases. Depending on the kinds of sources we tap into and the standards that we set for the output we want to achieve, there is a variety of design choices. Figure 2.2 depicts this design space. Note that the connections between methods and their associated inputs and outputs are merely indicative for typical approaches; they are not meant to be exhaustive. For example, NLP tools and Deep Learning are useful for discovering entities and their types as well, but they face higher complexity and usually yield lower quality than the simpler methods based on rules and patterns.
Input Sources:
There is a wide spectrum of input sources to be considered. The top part of Figure 2.2 shows some notable design points, with difficulty increasing from left to right. The difficulties arise from the decreasing ratio of valuable knowledge to noise in the respective sources.
To build a high-quality KB, we advocate to start with the cleanest sources, called premium sources in the figure. These include well-organized and curated encyclopedic content like Wikipedia. For example, Wikipedia's set of article names is a great source for KB construction, as these names constitute the world's notable entities in reasonably standardized form: millions of entities with human-readable unique labels. First harvesting these entities (and cues about their classes, e.g., via Wikipedia categories) forms a strong backbone for subsequent extensions and refinements. This design choice can be seen as an instantiation of the folk wisdom to "pick low-hanging fruit" first, widely applied in systems engineering.

Figure 2.2: Design Space for Knowledge Harvesting. (Figure omitted in this text version; it connects inputs of increasing difficulty, namely premium sources (Wikipedia, WordNet, GeoNames, Librarything, ...), semi-structured data (infoboxes, tables, lists, ...), high-quality text (news articles, Wikipedia, ...), difficult text (books, interviews, ...), text documents and web pages, online forums and social media, and queries and clicks, via methods, namely rules and patterns, logical inference, statistical inference, NLP tools, and deep learning, to outputs, namely entity names and classes, entities in taxonomy, relational statements, invariants and constraints, and canonicalized statements.)

Beyond Wikipedia as a general-purpose source, knowledge harvesting should generally start with the most authoritative high-quality sources for the domains of interest. For example, for a KB about movies, it would be a mistake to disregard IMDB as it is the world's largest and cleanest – manually constructed – repository of movies, characters, actors, keywords and phrases about movie plots, etc. Likewise, we must not overlook sources like GeoNames and OpenStreetMap for geographic entities, GoodReads and Librarything for books, MusicBrainz (or even Spotify's catalog) for music, DrugBank for medical drugs, and so on. Note that some sources may be proprietary and require licensing.
After considering premium sources, the next step is to tap into semi-structured elements in online data, like infoboxes (in Wikipedia and other wikis), tables, lists, headings, category systems, etc. This is almost always easier than comprehending textual contents. However, if we aim at rich coverage of facts about entities, we eventually have to extract knowledge from natural-language text as well – ranging from high-quality sources like Wikipedia articles all the way to user contents in online forums and other social media. Finally, mass-user data about online behavior – like queries and clicks – is yet another source, which comes with a large amount of noise and potential bias.
From these considerations it should be obvious that we generally face a precision-recall trade-off. In Figure 2.2, the precision of extracted knowledge tends to decrease from left to right, and the recall typically increases from left to right (with exceptions, though). That is, sources on the left end often yield highly accurate KBs but limited coverage, whereas sources on the right end usually yield less accurate KBs but with higher coverage. We define precision and recall as follows:
The precision of a KB of statements is the ratio (number of correct statements in the KB) / (number of statements in the KB).
The recall of a KB of statements is the ratio (number of correct statements in the KB) / (number of correct statements in the real world).
Precision can be evaluated by inspecting a KB alone, but recall can only be estimated, as the complete real-world knowledge is not known to us. However, we can sample on a per-entity basis and compare what a KB knows about an entity (in the form of relational tuples) against what a human can learn from reading the Wikipedia article or other high-coverage sources about the entity.
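A minimal sketch of such an evaluation (our own function names; the statements are toy examples) computes precision from a manually assessed sample of KB statements and a per-entity recall estimate against a human-compiled reference set.

def precision_estimate(sampled_statements, is_correct):
    # Fraction of a manually assessed sample of KB statements that is judged correct.
    return sum(1 for s in sampled_statements if is_correct(s)) / len(sampled_statements)

def recall_estimate(kb_statements, reference_statements):
    # Fraction of the reference statements for one entity (e.g., distilled by a human
    # from the entity's Wikipedia article) that the KB also contains.
    return len(kb_statements & reference_statements) / len(reference_statements)

kb = {("Bob_Dylan", "born_in", "Duluth"),
      ("Bob_Dylan", "won", "Nobel_Prize_in_Literature")}
reference = kb | {("Bob_Dylan", "created", "Blowin'_in_the_Wind")}
print(precision_estimate(list(kb), lambda s: True))   # 1.0 if an assessor confirms both sampled statements
print(recall_estimate(kb, reference))                 # 2 of 3 reference statements covered, ~0.67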
Output Scope and Quality:
Depending on our goals on precision and recall of a KB and our choice on dealing with the trade-off, we can expect different kinds of outputs from a knowledge-harvesting machinery. Figure 2.2 lists major options in the bottom part.
The minimum that every KB-building endeavor should have is that all entities in the KB are semantically typed in a system of classes. Without any classes, the KB would merely be a flat collection of entities, and a lot of the mileage that search applications get from KBs is through the class system. For example, queries or questions about "singers who are also poets" can be answered by intersecting the entities of two classes (returning, e.g., Bob Dylan, Leonard Cohen, Serge Gainsbourg).
As for the entities themselves, some KBs do not normalize, lacking unique identifiers and including real-world duplicates. Such a KB of names, containing, for example, both Bob Dylan and Robert Zimmerman as if they were two entities (see Section 2.1.4), can still be useful for many applications. However, a KB with disambiguated names and canonicalized representation of entities clearly offers more value for high-quality use cases such as data analytics.
Larger coverage and application value come from capturing also properties of entities, in the form of relational statements. Often, but not necessarily, this goes hand in hand with logical invariants about properties, which could be acquired by rule mining or by hand-crafted modeling using expert or crowdsourced inputs. Analogously to surface-form versus canonicalized entities, relational statements also come in these two forms: with a diversity of names for the same logical relation, or with unique names and no redundancy (see Section 2.1.4 for examples).
Methodological Repertoire:
To go from input to output, knowledge harvesting has a variety of options for its algorithmic methods and tools. The following is a list of the most notable options; there are further choices, and practical techniques often combine several of the listed options.
• Rules and Patterns: When inputs have rigorous structure and the desired output quality mandates conservative techniques, rule-based extraction can achieve the best results. The System T project at IBM is a prime example of rule-based knowledge extraction for industrial-strength applications (see, e.g., [89]). A toy pattern-extraction sketch follows at the end of this section.
• Logical Inference: Using consistency constraints can often eliminate spurious candidates for KB statements, and deduction rules can generate additional statements of interest. Both cases require reasoning with computational logics. This is usually combined with other paradigms such as extraction rules or statistical inference.
• Statistical Inference: Distilling crisp knowledge from vague and ambiguous text content or semi-structured tables and lists often builds on the observation that there is redundancy in content sources: the same KB statement can be spotted in (many) different places. Thus we can leverage statistics and corresponding inference methods. In the simplest case, this boils down to frequency arguments, but it can be much more elaborate, considering different statistical measures and joint reasoning. In particular, statistical inference can be combined with logical invariants, for example, by probabilistic graphical models ([126]).
• NLP Tools: Modern tools for natural language processing (see, e.g., [146, 273]) encompass a variety of methods, from rule-based to deep learning. They reveal structure and text parts of interest, such as dependency-parse trees for syntactic analysis, pronoun resolution, identification of entity names, sentiment-bearing phrases, and much more. However, as language becomes more informal with incomplete sentences and colloquial expressions (incl. social-media slang such as "LOL"), mainstream NLP does not always work well.
• Deep Learning: The most recent addition to the methodological repertoire is deep neural networks, trained in a supervised or distantly supervised manner (see, e.g., [68, 186]). The sweet spot here is when there is a large amount of "gold-standard" labeled training data, often in combination with learning so-called embeddings from large text corpora. Thus, deep learning is most naturally used for increasing the coverage of a KB after initial population, such that the initial KB can serve as a source of distant supervision.
In Figure 2.2, the edges between inputs and methods and between outputs and methods indicate choices for methods being applied to different kinds of inputs and outputs. Note that this is not meant to exclude further choices and additional combinations.
The outlined design space and the highlighted options are by no means complete, but merely reflect some of the prevalent choices as of today. We will largely use this big picture as a "roadmap" for organizing material in the following chapters. However, there are further options and plenty of underexplored (if not unexplored) opportunities for advancing the state of the art in knowledge harvesting.
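To make the rules-and-patterns option above concrete, here is a toy Hearst-style pattern (the regular expression and the example sentence are ours, and this is not the System T approach cited above) that extracts class/instance candidates from phrases of the form "<class> such as <instances>".

import re

# Toy Hearst-style pattern: "<class noun> such as <Instance>, <Instance>, ..."
pattern = re.compile(
    r"(\w+) such as ((?:[A-Z][\w']+(?: [A-Z][\w']+)*)(?:, (?:[A-Z][\w']+(?: [A-Z][\w']+)*))*)")

text = ("She admired protest singers such as Bob Dylan, Joan Baez "
        "and visited cities such as Duluth.")
for cls, instances in pattern.findall(text):
    for instance in instances.split(", "):
        print(instance, "type", cls)
# Bob Dylan type singers
# Joan Baez type singers
# Duluth type cities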
This chapter presents a powerful method for populating a knowledge base with entities and classes, and for organizing these into a systematic taxonomy. This is the backbone that any high-quality KB – broadly encyclopedic or focused on a vertical domain – must have. Following the rationale of our design-space discussion, we focus here on knowledge harvesting from premium sources such as Wikipedia or domain-specific repositories such as GeoNames for spatial entities or GoodReads and Librarything for the domain of books. This emphasizes the philosophy of "picking low-hanging fruit first" for the best benefit/cost ratio.
We recommend starting every KB construction project by tapping one or a few premium sources first. Such sources should have the following characteristics:
• authoritative high-quality content about entities of interest,
• high coverage of many entities, and
• clean and uniform representation of content, like having clean HTML markup or even wiki markup, unified headings and structure, well-organized lists, informative categories, and more.
Distilling some of the contents from such sources into machine-readable knowledge can create a strong core KB with a good ratio of “mileage per effort”. In particular, relatively simple extraction and cleaning methods go a long way already. The core KB can then be further expanded from other sources – with more advanced methods as presented in subsequent chapters.
Wikipedia:
For a general-purpose encyclopedic knowledge base,
Wikipedia is presumably the most suitable starting point, with its huge number of entities, highly informative descriptions and annotations, and quality assurance via curation by a large community and a sophisticated system of moderators. The English edition of Wikipedia (https://en.wikipedia.org) contains more than 6 million articles with 500 words of text on average (as of July 1, 2020), all with unique names, most of which feature individual entities. These include more than 1.5 million notable people, more than 750,000 locations of interest, more than 250,000 organizations, and instances of further major classes including events (e.g., sports tournaments, natural disasters, battles and wars) and creative works (e.g., books, movies, musical pieces).
Another great starting point, with even more entities (ca. 100 million), would be the
Wikidata knowledge base ( https://wikidata.org ), populated with entity-centric facts (SPO triples) by a knowledge-sharing community [600]. Wikidata is already a full-fledged KB, in a formal representation following the RDF data model. So there is no point in illustrating knowledge extraction from Wikidata. Moreover, Wikidata largely focuses on capturing basic biographic facts about entities, like birthdate, birthplace, spouses and children for people, or city, country and geo-coordinates for buildings and landmarks, and so on. Its type system is large, but most entities belong only to one or two types, whereas Wikipedia often offers several tens of highly informative categories that characterize an entity. Last but not least, Wikidata does not have full-text articles about entities with rich descriptions, lists, tables and more. We will come back to Wikidata in Chapter 9, including a case for integration with another knowledge source (see Section 9.1).
Wikipedia serves as an archetype of knowledge-sharing communities, which can be seen as “proto-KBs”: the right contents for a KB, but not yet in the proper representation. Another case in point would be the Chinese encyclopedia Baidu Baike with almost 20 million articles ( https://baike.baidu.com ). In this chapter, we focus on the English Wikipedia as an exemplary case. We will see that Wikipedia alone does not lend itself to building a clean KB as easily as one would hope. Therefore, we combine input from Wikipedia with another premium source: the
WordNet lexicon [159] as a key asset for the KB taxonomy.
Geographic Knowledge:
For building a KB about geographic and geo-political entities, like countries, cities, rivers, mountains, natural and cultural landmarks, Wikipedia itself is a good starting point, but there are very good alternatives as well. Wikivoyage is a travel-guide wiki with specialized articles about travel destinations. GeoNames is a huge repository of geographic entities, from mountains and volcanos to churches and city parks, more than 10 million in total. If city streets, highways, shops, buildings and hiking trails are of interest, too, then OpenStreetMap is another premium source to consider (or alternatively commercial maps if you can afford to license them). Even commercial review forums such as TripAdvisor could be of interest, to include hotels, restaurants and tourist services.
These sources complement each other, but they also overlap in entities. Therefore, simply taking their union as an entity repository is not a viable solution. Instead, we need to carefully integrate the sources, using techniques for entity matching to avoid duplicates and to combine their different pieces of knowledge for each entity (see Chapter 5, especially Section 5.2).
To obtain an expressive and clean taxonomy of classes, we could tap each of the sources separately, for example, by interpreting categories as semantic types. But again, simply taking a union of several category systems does not make sense. Instead, we need to find ways of aligning equivalent (and possibly also subsumption) pairs of categories, as a basis for constructing a unified type hierarchy. For example, can and should we map craters from one source to volcanoes in a second source, and how are both related to volcanic national parks? This alignment and integration is not an easy task, but it is still much simpler than extracting all volcanoes and craters from textual contents in a wide variety of diverse web pages.
Knowledge about Movies:
For the movie domain, we are primarily interested in entities like movies, directors, actors, producers, soundtrack music, contributors to special effects etc. IMDB (Internet Movie Database) is by far the best source of information for this scope. This premium source is commercial and disallows crawling, but it offers periodic dumps of its core data for downloading (subject to licensing conditions).
However, advanced users may ask for more: a movie KB should also provide convenient access to additional knowledge about the life of the movie contributors, for example, how often they are divorced and how they started their careers. To this end, we could combine IMDB entities with selected articles from
Wikipedia or entries from
Wikidata , but such combinations involve non-trivial knowledge integration tasks. Moreover, although IMDB is huge, it does not have perfect coverage of the world’s film footage, for example, missing many Bollywood productions, African movies or lesser-known documentaries. Therefore, merging its repository with entities from other sites would be desirable, with appropriate entity matching and type alignment, similar to the Wikipedia case discussed earlier.
An even better KB for movie aficionados would also cover the contents of movies and their characters. A rich source about popular movies and TV series is fan-community wikis, many of which are hosted on a common wiki-hosting platform. On these wikis, a large number of fictitious characters are systematically organized in semantic types, and are annotated with crisp statements about their traits and relationships in the respective stories. The type labels allow finding favorite villains, heroes, wizards, witches, and more. However, if we want a truly integrated and clean KB, we need to align these types with the categories that IMDB provides for real people and some movie characters, for example, to consider the gender and race of actors and their roles.
Health Knowledge:
Another vertical domain of great importance for society is health: building a KB with entity instances of diseases, symptoms, drugs, therapies etc. There is no direct counterpart to Wikipedia for this case, but there are large and widely used tagging catalogs and terminology lexicons like MeSH and UMLS (including the SNOMED clinical terminology for healthcare), and these can be treated as analogs to Wikipedia: rich categorization but not always semantically clean. The next step would then be to clean these raw assets, using methods like the presented ones, and populate the resulting classes with entities.
For the latter, additional premium sources could be considered: either Wikipedia articles about biomedical entities, or curated structured sources such as DrugBank or Disease Ontology ( http://disease-ontology.org/ ), and also human-oriented Web portals like the one by the Mayo Clinic. Research projects along the lines of this knowledge integration and taxonomy construction for health include KnowLife/DeepLife [150, 148], Life-INet [487] and Hetionet [234]; see also [270] for a general discussion of health knowledge.
Many premium sources come with a rich category system: assigning pages to relevant categories that can be viewed as proto-classes but are too noisy to be considered as a semantic type system. Wikipedia, as our canonical example, organizes its articles in a hierarchy of more than 1.9 million categories (as of July 1, 2020). For example,
Bob Dylan (the entity corresponding to article en.wikipedia.org/wiki/Bob_Dylan ) is placed in categories such as American male guitarists, Pulitzer Prize winners, Songwriters from Minnesota etc., and Blowin’ in the Wind (corresponding to en.wikipedia.org/wiki/Blowin’_in_the_Wind ) is in categories such as Songs written by Bob Dylan, Elvis Presley songs, Songs about freedom and Grammy Hall of Fame recipients, among others.
Using these categories as classes with their respective entities, it seems we could effortlessly construct an initial KB. So are we done already?
Unfortunately, the Wikipedia category system is almost a class taxonomy, but only almost. We face the following difficulties:
• High Specificity of Categories: The direct categories of entities (i.e., leaves in the category hierarchy) tend to be highly specific and often combine multiple classes into one multi-word phrase. Examples are American male singer-songwriters, or Nobel laureates absent at the ceremony. For humans, it is obvious that this implies membership in classes singers, men, guitar players, Nobel laureates etc., but for a computer, the categories are initially just noun phrases.
• Drifting Super-Categories:
By considering also super-categories (i.e., non-leaf nodes in the hierarchy) and the paths in the category system, we could possibly generalize the leaf categories and derive broader classes of interest, such as men, American people, musicians, etc. However, the Wikipedia category system exhibits conceptual drifts where super-categories imply classes that are incompatible with those of the corresponding leaves and the entity itself. Figure 3.1 shows excerpts of the category hierarchy for the entities Bob Dylan and Blowin’ in the Wind. By transitivity, the super-categories would imply that Bob Dylan is a location, a piece of art, a family and a kind of reproduction process (generalizing the “sex” category). For the example song, the category system alone would likewise lead to flawed or meaningless classes: locations, buildings, singers, actions, etc.
Figure 3.1: Example categories and super-categories from Wikipedia (excerpts of the category hierarchy for the entities Bob Dylan and Blowin’ in the Wind).
• Entity Types vs. Associative Categories: Some of the super-categories are general concepts, for example, Applied Ethics and Free Will in Figure 3.1. Some of the edges between a category and its immediate super-category are conceptual leaps, for example, moving from songs and works to the respective singers in Figure 3.1.
All this makes sense when the category hierarchy is viewed as a means to support user browsing in an associative way, but it is not acceptable for the taxonomic backbone of a clean KB. For example, queries about buildings in North America may return Blowin’ in the Wind as an answer, and statistical analytics on prizes, say by geo-region or gender, would confuse the awards and the awardees.
In the following we present a simple but powerful methodology to leverage the Wikipedia categories as raw input, with thorough cleansing of their noun-phrase names and integration with upper-level taxonomies for clean KB construction, based on the works of [456, 457, 562, 240] (applied in the WikiTaxonomy and YAGO projects).
Head Words of Category Names:
As virtually all category names are noun phrases, a basic building block for uncovering their semantics is to parse these multi-word phrases into their syntactic constituents. This task is known as noun-phrase parsing in NLP (see, e.g., [273] and [146]). In general, noun phrases consist of nouns, adjectives, determiners (like “the” or “a”), pronouns, coordinating conjunctions (“and” etc.), prepositions (“by”, “from” etc.), and possibly even further word variants.
A typical first step is to perform part-of-speech tagging, or POS tagging for short. This tags each word with its syntactic sort: noun, adjective etc. Nouns are further classified into common nouns, which can have an article, e.g., “guitarist”, and proper nouns, which denote names (e.g., “Minnesota”). POS tagging usually works by dynamic programming over a pre-trained statistical model of word-variant sequences in large corpora. The subsequent noun-phrase parsing computes a syntactic tree structure for the POS-tagged word sequence, inferring which word modifies or refines which other word. This is usually based on stochastic context-free grammars, again using some form of dynamic programming. Later chapters in this article will make intensive use of such NLP methods, too.
The root of the resulting tree is called the head word, and this is what we are mostly after. For example, for “American male guitarists” the head word is “guitarists” and for “songwriters from Minnesota” it is “songwriters”. The head word is preceded by so-called pre-modifiers, and followed by post-modifiers. Sometimes, words from these modifiers can be combined with the head word to form a semantically meaningful class as well (e.g., “female guitarists”).
Equipped with this building block, we can now tackle the desired category cleaning. Our goal is to distinguish taxonomic categories (such as “guitarists”) from associative categories (such as “music”). The key idea is the heuristic that plural-form common nouns are likely to denote classes, whereas singular-form nouns tend to correspond to general concepts (e.g., “free will”). The reason for this is that classes regroup several instances, and a plural form is thus a strong indicator for a class. We can possibly relax this heuristic to consider also singular-form nouns where the corresponding plural form occurs frequently in corpora such as news. For example, if a Wikipedia category were named “jazz band” rather than “jazz bands” we should accept it as a class, while still disregarding categories such as “free will” or “ethics” (where “wills” is very rare, and “ethics” is a singular word despite ending with “s”). These ideas can be cast into the following algorithm.
Algorithm for Category Cleaning
Input: Wikipedia category name c: leaf node or non-leaf node
Output: semantic class label or null
1. Run noun-phrase parsing to identify head word h and modifier structure: c = pre_1 .. pre_k h post_1 .. post_l.
2. Test if h is in plural form or has a frequently occurring plural form. If not, return null. Optionally, consider also pre_i .. pre_k h as class candidates, with increasing i from 1 to k.
3. For a leaf category c, return h (and optionally additional class labels pre_i .. pre_k h).
4. For a non-leaf category c, test if the class candidate h is a synonym or hypernym (i.e., generalization) of an already accepted class (including h). If so, keep it; otherwise, discard it.
The rationale for the additional test in Step 4 is that non-leaf categories in Wikipedia are often merely associative, as opposed to denoting semantically proper super-classes (see discussion above). So we impose this additional scrutiny, while still being able to harvest the cases when head words of super-categories are meaningful outputs (e.g., Musicians and Men in Figure 3.1). The test itself can be implemented by looking up head words in existing dictionaries like WordNet [159] or Wiktionary, which list synonyms and hypernyms for many words. This is a prime case of harnessing a second premium source.
Class Candidates from Wikipedia Articles:
We have so far focused on Wikipedia categories as a source of semantic class candidates. However, normal Wikipedia articles may be of interest as well. Most articles represent individual entities, but some feature concepts, among which some may qualify as classes. For example, the articles https://en.wikipedia.org/wiki/Cover_version and https://en.wikipedia.org/wiki/Aboriginal_Australians correspond to classes as they have instances, whereas https://en.wikipedia.org/wiki/Empathy is a singleton concept as it has no individual entities as instances of interest.
Simple but effective heuristics to capture these cases have been studied by [198, 440]. The key idea is that an article qualifies as a class if its textual body mentions the article’s title in both singular and plural forms. For example, “cover version” and “cover versions” are both present in the article about cover versions (of songs), but the article on empathy does not refer to “empathys”. Obviously, this technique is just another building block that should be combined with other heuristics and statistical inference.
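To make the head-word heuristic of the category-cleaning algorithm concrete, here is a minimal sketch using the spaCy NLP library; the model name and the example categories are merely illustrative, and the corpus-frequency relaxation for singular names as well as the WordNet test of Step 4 are omitted.

import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline; assumed to be installed

def class_label(category_name):
    doc = nlp(category_name)
    head = next(tok for tok in doc if tok.head == tok)   # syntactic head of the noun phrase
    if head.tag_ in ("NNS", "NNPS"):                     # plural noun -> likely a taxonomic class
        return head.lemma_.lower()                       # e.g. "guitarist"
    return None                                          # singular heads like "free will" are discarded

for cat in ["American male guitarists", "Songwriters from Minnesota", "Free will"]:
    print(cat, "->", class_label(cat))

A fuller version would also emit pre-modifier combinations such as “male guitarists” as additional class candidates.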
By applying the category cleaning algorithm to all Wikipedia categories, we can obtain a rich set of class labels for each entity. However, as the Wikipedia community does not enforce strict naming standards, we could arrive at duplicates for the same class, for example, accepting both guitarist and guitar player. Moreover, as we are mostly harvesting the leaf-node categories and expect to prune many of the more associative super-categories, our KB taxonomy may end up with many disconnected classes. To fix these issues, we resort to pre-existing high-quality taxonomies like
WordNet [159]. This lexicon already covers more than a hundred thousand concepts and classes – called word senses or synsets for sets of synonyms – along with a clean structure for hypernymy. Alternatively, we could consider Wiktionary. Both of these lexical resources also have multilingual extensions, covering a good fraction of mankind’s languages. See also [203] for a general overview of this kind of lexical resources.
A major caveat, however, is that WordNet has hardly any entity-level instances for its classes; you can think of it as an un-populated upper-level taxonomy. The same holds for Wiktionary. The goal now is to align the class candidates harvested from Wikipedia with the classes in WordNet.
Similarity between Categories and Classes:
The key idea is to perform a similarity test between a class candidate from Wikipedia and potentially corresponding classes in WordNet. In the simplest case, this is just a surface-form string similarity. For example, the Wikipedia-derived candidate “guitar player” has high similarity with the WordNet entry “guitarist”. There are two problems to address, though. First, we could still observe low string similarity for two matching classes, for example, “award” from Wikipedia against “prize” in WordNet. Second, we can find multiple matches with high similarity, for example “building” from Wikipedia matching two different senses in WordNet, namely, building in the sense of a man-made structure (e.g., houses, towers etc.) and building in the sense of a construction process (e.g., building a KB). We have to make the right choice among such alternatives for ambiguous words.
The solution for the first problem – similarity-based matching – is to consider contexts as well. WordNet, and Wiktionary alike, provide synonym sets as well as short descriptions (so-called glosses) for their entries, and the Wikipedia categories can be contextualized by related words occurring in super-categories or (articles for) their instances. This way, we are able to map “award” to “prize” because WordNet has the entry “prize, award (something given for victory or superiority in a contest or competition or for winning a lottery)”, stating that “prize” and “award” are synonyms (for this specific word sense). More generally, we could consider also entire neighborhoods of WordNet entries defined by hypernyms, hyponyms, derivationally related terms, and more. Such contextualized lexical similarity measures have been investigated in research on word sense disambiguation (WSD), see [417, 203] for overviews.
Another approach to strengthen the similarity comparisons is to incorporate word embeddings such as Word2Vec [384] or GloVe [447] (or even deep neural networks along these lines, such as BERT [118]). We will not go into this topic now, but will come back to it in Section 4.5.
For the second problem – ambiguity – we could apply state-of-the-art WSD methods, but it turns out that there is a very simple heuristic that works so well that it is hardly outperformed by any advanced WSD method. It is known as the most frequent sense (MFS) heuristic: whenever there is a choice among different word senses, pick the one that is more frequently used in large corpora such as news or literature. Conveniently, the WordNet team has already manually annotated large corpora with WordNet senses, and has recorded the frequency of each word sense. It is thus easy to identify, for each given word, its most frequent meaning. For example, the MFS for “building” is indeed the man-made structure. There are exceptions to the MFS heuristic, but they can be handled in other ways.
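As an illustration, the MFS lookup can be sketched on top of NLTK’s WordNet interface (assuming the WordNet corpus is installed): NLTK returns the synsets of a word ordered by the frequency of their sense annotations, so the first synset approximates the most frequent sense. The fuller alignment step would additionally compute string and gloss similarities before this tie-breaking.

from nltk.corpus import wordnet as wn   # requires the WordNet corpus to be downloaded

def most_frequent_sense(word):
    synsets = wn.synsets(word, pos=wn.NOUN)       # noun senses, ordered by annotated frequency
    return synsets[0] if synsets else None

sense = most_frequent_sense("building")
print(sense.name())                               # 'building.n.01' -- the man-made structure
print(sense.definition())                         # gloss, usable for contextual similarity
print([h.name() for h in sense.hypernyms()])      # hypernyms for linking into the taxonomy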
Putting Everything Together:
Putting these considerations together, we arrive at the following heuristic algorithm for aligning Wikipedia categories and WordNet senses.
Algorithm for Alignment with WordNet
Input: Class name c derived from Wikipedia category
Output: synonym or hypernym in WordNet, or null
1. Compute string or lexical similarity of c to WordNet entries s (for candidates s with a certain overlap of character-level substrings). Then pick the s with highest similarity if this is above a given threshold; otherwise return null.
2. If the highest-similarity entry s is unambiguous (i.e., the same word has only this sense), then return the WordNet sense for s. If s is an ambiguous word, then return the MFS for s (or use another WSD method for c and the s candidates).
Once we have mapped the accepted Wikipedia-derived classes onto WordNet, we have a complete taxonomy, with the upper-level part coming from the clean WordNet hierarchy of hypernyms. The last thing left to decide for the alignment task is whether the category-based class c is synonymous to the identified WordNet sense s or whether s is a hypernym of c. The latter occurs when c does not have a direct counterpart in WordNet at all. For example, we could keep the category c = “singer-songwriter”, but WordNet does not have this class at all. Instead we should align c to singer or songwriter or both. If WordNet did not have an entry for songwriters, we should map c to the next hypernym, which is composer.
This final issue can be settled heuristically, for example, by assuming a synonymy match if the similarity score is very high and assuming a hypernym otherwise, or we could resort to leveraging information-theoretic measures over additional text corpora (e.g., the full text of all Wikipedia articles). Specifically, for a pair c and s (e.g., songwriter and composer – a case of hypernymy), a symmetry-breaking measure is the conditional probability P[s|c] estimated from co-occurrence frequencies of words. If P[s|c] ≫ P[c|s], that is, c sort of implies s but not vice versa, then s is likely a hypernym of c, not a synonym. Various measures along these lines are investigated in [618, 183].
The presented methodology can also be adapted to other cases of taxonomy alignment, for example, to GeoNames, Wikivoyage and OpenStreetMap (and Wikipedia or Wikidata) about geographic categories and classes (see Section 3.1).
A different paradigm for aligning Wikipedia categories with WordNet classes, as prime examples of premium sources, has been developed by [418] for constructing the
BabelNet knowledge base. It is based on a candidate graph derived from the Wikipedia category at hand and potential matches in WordNet, and then uses graph metrics such as shortest paths or graph algorithms like random walks to infer the best alignment. Figure 3.2 illustrates this approach. The graph construction extracts the head word of interest and salient context words from Wikipedia – “play” as well as “film” and “fiction” in the example. Then all approximate matches are identified in WordNet, and their respective local neighborhoods are added to the graph, casting WordNet’s lexical relations like hypernymy/hyponymy, holonymy/meronymy (whole-part) etc. into edges. Edges could even be weighted based on similarity or salience metrics.
In the example, we have two main candidates to which “play” could refer: “play, drama” or “play (sports)” (WordNet contains even more). To rank these candidates, the simplest method is to look at how close they are to the Wikipedia-based start nodes “play”, “film” and “fiction”, for example, by aggregating the shortest paths from start nodes to candidate-class nodes. Alternatively, and typically better performing, we can use methods based on random walks over the graph, analogously to how Google’s original PageRank and Personalized PageRank measures were computed ([59, 263]).
The walk starts at the Wikipedia-derived head word “play” and randomly traverses edges to visit other nodes – where the probabilities for picking edges should be proportional to edge weights (i.e., uniform over all outgoing edges if the edges are unweighted). Occasionally, the walk could jump back to the start node “play”, as decided by a probabilistic coin toss. By repeating this random procedure sufficiently often, we obtain statistics about how often each node is visited. These statistics converge to well-defined stationary visiting probabilities as the walk length (or the number of repetitions after jumping back to the start node) approaches infinity.
Figure 3.2: Example for graph-based alignment (Wikipedia–WordNet candidate graph for the category “British Plays Adapted into Films”, with WordNet sense candidates such as “play, drama” and “play (sport)”).
The candidate class with the highest visiting probability is the winner: “play, drama” in the example. Such random-walk methods are amazingly powerful, easy to implement and widely applicable. We will see other use cases in later chapters.
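This ranking can be sketched with networkx’s personalized PageRank, which computes exactly such stationary visiting probabilities with restarts; the graph below is a hand-made miniature of Figure 3.2, and all node names and edge choices are illustrative only.

import networkx as nx

G = nx.Graph()
G.add_edges_from([
    ("play", "play, drama"), ("play", "play (sport)"),        # word and its candidate senses
    ("film", "film, movie"), ("fiction", "fiction (sense)"),  # context words and their senses
    ("play, drama", "literary work"), ("fiction (sense)", "literary work"),
    ("play, drama", "show"), ("film, movie", "show"),
    ("play (sport)", "plan of action"),
])

# restart mass concentrated on the Wikipedia-derived start words
scores = nx.pagerank(G, alpha=0.85, personalization={"play": 0.6, "film": 0.2, "fiction": 0.2})

best = max(["play, drama", "play (sport)"], key=lambda n: scores[n])
print(best)   # expected: "play, drama", which is better connected to the context senses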
There are various extensions of the presented method. Wikipedia category names do not just indicate class memberships, but often reflect other relations as well. For example, for Bob Dylan being in the category Nobel Laureates in Literature we can infer a relational triple ⟨Bob Dylan, has won, Literature Nobel Prize⟩. Such extensions have been developed by [415]; we will revisit these techniques in Chapter 6. Another line of generalization is to leverage the taxonomies from the presented methods as training data to learn generalized patterns in category names and other Wikipedia structures (infobox templates, list pages etc.). This way, the taxonomy can be further grown and refined. Such methods have been developed in the Kylin/KOG project by [626, 625].
Relationship with Ontology Alignment:
The alignment between Wikipedia categories and WordNet classes can be seen as a special case of ontology alignment (see, e.g., [551] for an overview, [20, 331] for representative state-of-the-art methods, and http://oaei.ontologymatching.org/ for a prominent benchmark series). Here, the task is to match classes and properties of one ontology with those of a second ontology, where the two ontologies are given in crisp logical forms like RDFS schemas or, even better, in the OWL description-logic language [601]. Ontology matching is in turn highly related to the classical task of database schema matching [123].
The case of matching Wikipedia categories and WordNet classes is a special case, though, for two reasons. First, WordNet has hardly any instances of its classes, and we chose to ignore the few existing ones. Second, the upper part of the Wikipedia category hierarchy is more associative than taxonomic, so that it had to be cleaned first (as discussed in Section 3.2). For these reasons, the case of Wikipedia and WordNet benefits from tailored alignment methods, and similar situations are likely to arise also for domain-specific premium sources.
Beyond Wikipedia:
We used Wikipedia and WordNet as exemplary cases of premium sources, and pointed out a few vertical domains for wider applicability of the presented methods. Aligning and enriching pre-existing knowledge sources is also a key pillar for industrial-strength KBs about retail products (see, e.g., [117, 133]). More discussion on this use case is offered in Section 9.5.
Apart from this mainstream, similar cases for knowledge integration can be made for less obvious domains, too, examples being food [15, 218] or fashion [277, 189]. Food KBs have integrated sources like the FoodOn ontology [136], the nutrients catalog https://catalog.data.gov/dataset/food-and-nutrient-database-for-dietary-studies-fndds and a large recipe collection [369], and fashion KBs could make use of contents from catalogs such as https://ssense.com . More exotic verticals to which the Wikipedia-inspired methodology has been carried over are fictional universes such as Game of Thrones, the Simpsons, etc., with input from rich category systems of fan-community wikis. Recent research on this topic includes works by [231] and [97]. Finally, another non-standard theme for KB construction is how-to knowledge: organizing human tasks and procedures for solving them in a principled taxonomy. Research in this direction includes [641, 98].
We summarize this chapter by the following take-home lessons.
• For building a core KB, with individual entities organized into a clean taxonomy of semantic types, it is often wise to start with one or a few premium sources. Examples are Wikipedia for general-purpose encyclopedic knowledge, GeoNames and Wikivoyage for geo-locations, or IMDB for movies.
• A key asset are the categories by which entities are annotated in these sources. As categories are often merely associative, designed for manual browsing, this typically involves a category cleaning step to identify taxonomically clean classes.
• To construct an expressive and clean taxonomy, while harvesting two or more premium sources, it is often necessary to integrate different type systems. This can be achieved by alignment heuristics based on NLP techniques (such as noun phrase parsing), or by random walks over candidate graphs.
This chapter presents advanced methods for populating a knowledge base with entities and classes (aka. types), by tapping into textual and semi-structured sources. Building on the previous chapter’s insights on harvesting premium sources first, this chapter extends the regime for inputs to discovering entities and type information in Web pages and text documents. This will often yield noisier output, in the form of entity duplicates. The following chapter, Chapter 5, will address this issue by presenting methods for canonicalizing entities into unique subjects in the KB.
Harvesting entities and their classes from premium sources goes a long way, but it is bound to be incomplete when the goal is to fully cover a certain domain such as music or health, and to associate all entities with their relevant classes. Premium sources alone are typically insufficient to capture long-tail entities, such as less prominent musicians, songs and concerts, as well as long-tail classes such as left-handed cello players or cover songs in a different language than the original. In this section, we present a suite of methods for automatically extracting such additional entities and classes from sources like web pages and other text documents.
In addressing this task, we typically leverage that premium sources already give us a taxonomic backbone populated with prominent entities. This leads to various discovery tasks:
1. Given a class T containing a set of entities E = {e_1 ... e_n}, find more entities for T that are not yet in E.
2. Given an entity e and its associated classes T = {t_1 ... t_k}, find more classes for e not yet captured in T.
3. Given an entity e with known names, find additional (alternative) names for e, such as acronyms or nicknames. This is often referred to as alias name discovery.
4. Given a class t with known names, find additional names for t. This is sometimes referred to as paraphrase discovery.
In the following, we organize different approaches by methodology rather than by these four tasks, as most methods apply to several tasks.
In principle, there is also a case where the initial repository of entities and classes is empty – that is, when there is no premium source to be harvested first. This is the case for ab-initio taxonomy induction from noisy observations, which will be discussed in Section 4.6.
An important building block for all of the outlined discovery tasks is to detect mentions of already known entities in web pages and text documents. These mentions do not necessarily take the form of an entity’s full name as derived from premium sources. For example, we want to be able to spot occurrences of
Steve Jobs and
Apple Inc. in a sentence such as “Apple co-founder Jobs gave an impressive demo of the new iPhone.” Essentially, this is a string matching task where we compare known names of entities in the existing KB against textual inputs. The key asset here is to have a rich dictionary of alias names from the KB. Early works on information extraction from text made extensive use of name dictionaries, called gazetteers, in combination with NLP techniques (POS tagging etc.) and string patterns. Two seminal projects of this sort are the
GATE toolkit [106, 107] and the
UIMA framework [164].
A good dictionary should include
• abbreviations (e.g., “Apple” instead of “Apple Inc.”),
• acronyms (e.g., “MS” instead of “Microsoft”),
• nicknames and stage names (e.g., “Bob Dylan” vs. his real name “Robert Zimmerman”, or “The King” for “Elvis Presley”),
• titles and roles (e.g., “CEO Jobs” or “President Obama”), and possibly even
• rules for deriving short-hand names (e.g., “Mrs. Y” for female people with last name “Y”).
Where do we get such dictionaries from? This is by itself a research issue, tackled, for example, by [76]. The easiest approach is to exploit redirects in premium sources and hyperlink anchors. Whenever a page with name X is redirected to an official page with name Y and whenever a hyperlink with anchor text X points to a page Y, we can consider X an alias name for entity Y. In Wikipedia, for example, a page with title “Elvis” ( https://en.wikipedia.org/w/index.php?title=Elvis ) redirects to the proper article ( https://en.wikipedia.org/wiki/Elvis_Presley ), and the page about the Safari browser contains a hyperlink with anchor “Apple” that points to the article https://en.wikipedia.org/wiki/Apple_Inc. . This approach extends to links from Wikipedia disambiguation pages: for example, the page https://en.wikipedia.org/wiki/Robert_Zimmerman lists 11 people with this name, including a link to https://en.wikipedia.org/wiki/Bob_Dylan . Of course, this does not resolve the ambiguity of the name, but it gives us one additional alias name for Bob Dylan. Hyperlink anchor texts in Wikipedia were first exploited for entity alias names by [67], and this simple technique was extended to hyperlinks in arbitrary web pages by [550]. A special case of interest is to harvest multilingual names from interwiki links in Wikipedia (connecting different language editions) or non-English web pages linking to English Wikipedia articles. For example, French pages with anchor text “Londres” link to the article https://en.wikipedia.org/wiki/London . All this is low-hanging fruit from an engineering perspective, and gives great mileage towards rich dictionaries for entity names.
An analogous issue arises for class names as well. Here, redirect and anchor texts are useful, too, but are fairly sparse. The WordNet thesaurus and the Wiktionary lexicon contain synonyms for many word senses (e.g., “vocalists” for singers, and vice versa) and can serve to populate a dictionary of class paraphrases.
The ideas that underlie the above heuristics can be cast into a more general principle of strong co-occurrence:
Strong Co-Occurrence Principle:
If an entity or class name X co-occurs with name Y in a context with cue Z, then Y is (likely) an alias name for X.
This principle can be instantiated in various ways, depending on what we consider as context cue Z:
• The cue Z is a hyperlink where X is the link target and Y is the anchor text.
• The cue Z is a specific wording in a sentence, like “also known as”, “aka.”, “born as”, “abbreviated as”, “for short” etc.
• The context cue Z is a query-click pair (observed by a search engine), where X is the query and Y is the title of a clicked result (with many clicks by different users).
• The context cue Z is the frequent occurrence of X in documents about Y (e.g., Wikipedia articles, biographies, product reviews etc.).
For example, when many users who query for “Apple” (or “MS”) subsequently click on the Wikipedia article or homepage of Apple Inc. (or “Microsoft”), we learn that “Apple” is a short-hand name for the company. This co-clicking technique has been studied by [577]. Extending this to textual co-occurrence (i.e., the last of the above itemized cases) comes with a higher risk of false positives, but could still be worthwhile for cases like short names or acronyms for products. The technique can be tuned towards either precision or recall by thresholding on the observation frequencies and by making the context cue more or less restrictive. More advanced techniques for learning co-occurrence cues about alias names – so-called synonym discovery – have been investigated by [469], among others.
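A minimal sketch of such an alias dictionary and a greedy dictionary-based mention spotter is shown below; the (alias, entity) pairs are made-up stand-ins for what would be harvested from redirects, anchor texts and disambiguation pages.

from collections import defaultdict

alias_pairs = [                       # toy harvest; real dictionaries have millions of pairs
    ("Elvis", "Elvis_Presley"), ("The King", "Elvis_Presley"),
    ("Apple", "Apple_Inc."), ("MS", "Microsoft"),
    ("Robert Zimmerman", "Bob_Dylan"), ("Dylan", "Bob_Dylan"),
]
aliases = defaultdict(set)
for alias, entity in alias_pairs:
    aliases[alias.lower()].add(entity)

def spot_mentions(text, max_len=3):
    """Greedy longest-match lookup of known aliases over the token sequence."""
    tokens = text.split()
    i, mentions = 0, []
    while i < len(tokens):
        for j in range(min(len(tokens), i + max_len), i, -1):   # try longest span first
            span = " ".join(tokens[i:j]).strip(".,").lower()
            if span in aliases:
                mentions.append((" ".join(tokens[i:j]), sorted(aliases[span])))
                i = j
                break
        else:
            i += 1
    return mentions

print(spot_mentions("Apple co-founder Jobs gave an impressive demo of the new iPhone."))

Note that spotting only surfaces candidate entities; ambiguous aliases still map to sets of entities and need to be canonicalized later (Chapter 5).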
To discover more entities of a given class or more classes of a given entity, a powerful approach is to consider specific patterns that co-occur with input (class or entity) and desired output (entity and class). For example, a text snippet like “singers such as Bob Dylan, Elvis Presley and Frank Sinatra” suggests that Dylan, Presley and Sinatra belong to the class of singers. Such patterns have been identified in the seminal work of Marti Hearst [224], and are thus known as
Hearst patterns. They are a special case of the strong co-occurrence principle where the Hearst patterns serve as context cues.
In addition to the “such as” pattern, the most important Hearst patterns are: “X like Y” (with class X in plural form and entity Y), “X and other Y” (with entity X and class Y in plural form), and “X including Y” (with class X in plural form and entity Y).
Some of the Hearst patterns also apply to discovering subclass relations between classes. In the pattern “X including Y”, X and Y could both be classes, for example, “singers including rappers”. In fact, the patterns alone cannot distinguish between observations of entity-class relations (types) versus subclass relations. Additional techniques can be applied to identify which surface strings denote entities and which ones refer to classes. Simple heuristics can already go a long way: for example, words that start with an uppercase letter are often entities whereas common nouns in plural form are more likely class names. Full-fledged approaches make use of dictionaries as discussed above or more advanced methods for entity recognition, discussed further below.
Hand-crafted patterns are useful also for discovering entities in semi-structured web contents like lists and tables. For example, if a list heading or a column header denotes a type, then the list or column entries could be considered as entities of that type. The pattern for this purpose would refer to HTML tags that mark headers and entries. Of course, this is just a crude heuristic that has a non-negligible risk of failing. We will discuss more advanced methods that handle this case more robustly. In particular, Sections 6.2.1.5 and 6.3 go into depth on extraction from semi-structured contents for the more general scope of entity properties.
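A minimal sketch of spotting the “such as” Hearst pattern with a regular expression is given below; the pattern and example sentence are illustrative, and real extractors would operate on POS-tagged or dependency-parsed text rather than raw strings.

import re

# class phrase (lowercase words) "such as" an enumeration of capitalized names
SUCH_AS = re.compile(r"(?P<cls>[a-z][a-z ]+?)\s+such as\s+(?P<ents>[A-Z][^.;]+)")

def extract_instances(sentence):
    pairs = []
    for m in SUCH_AS.finditer(sentence):
        cls = m.group("cls").split()[-1]                 # crude head word of the class phrase
        for ent in re.split(r",|\band\b", m.group("ents")):
            if ent.strip():
                pairs.append((ent.strip(), cls))
    return pairs

print(extract_instances("The festival features singers such as Bob Dylan, Elvis Presley and Frank Sinatra."))
# expected: [('Bob Dylan', 'singers'), ('Elvis Presley', 'singers'), ('Frank Sinatra', 'singers')]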
Multi-anchored Patterns:
Hearst patterns may pick up spurious observations. For example, the sentence “protest songs against war like Universal Soldier” could erroneously yield that
Universal Soldier is an instance of the class wars. One way of making the approach more robust is to include part-of-speech tags or even dependency-parsing trees (see [146, 273] for these NLP basics) in the specification of patterns. Another approach is to extend the context cue and strengthen its role. In addition to the pattern itself, we can demand that the context contains at least one additional entity which is already known to belong to the observed class. For example, the text “singers such as Elvis Presley” alone may be considered insufficient evidence to accept Elvis as a singer, but the text “singers such as Elvis Presley and Frank Sinatra” would have a stronger cue if
Frank Sinatra is a known singer already. This idea has been referred to as doubly-anchored patterns in the literature [292] for the case of observing two entities of the same class. With a strong cue like “singers such as”, insisting on a known witness may be an overkill, but the principle equally applies to weaker cues, for example, “voices such as ...” for the target class singers.
Multi-anchored patterns are particularly useful when going beyond text-based Hearst patterns by considering strong co-occurrence in enumerations, lists and tables. For simplicity, consider only the case of tables in web pages – as opposed to relational tables in databases. The co-occurring entities are typically the names in the cells of the same column, and the class is the name in the column header. Due to the ambiguity of words and the ad-hoc nature of web tables, the spotted entities in the same column may be very heterogeneous, mixing up apples and oranges. For example, a table column on Oscar winners could have both actors and movies as rows. Thus, we may incorrectly learn that
Godfather is an actor and
Bob Dylan is a movie. To overcome these difficulties, we can require that a new entity name is accepted only when it co-occurs with a certain number of known entities that belong to the proper class ([109]), for example, at least 10 actors in the same column for a table of 15 rows. Needless to say, all these are still heuristics that may occasionally fail, but they are easy to implement, powerful and of practical value.
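The column-witness heuristic can be sketched in a few lines; the entity names are toy data, and the witness threshold from the text is scaled down to fit the small example.

known_actors = {"Marlon Brando", "Al Pacino", "Meryl Streep", "Jack Nicholson"}

def accept_column(cells, known, min_witnesses=3):
    witnesses = [c for c in cells if c in known]
    if len(witnesses) < min_witnesses:
        return []                                   # too little evidence for this column
    return [c for c in cells if c not in known]     # remaining cells become entity candidates

column = ["Marlon Brando", "Al Pacino", "Meryl Streep", "Frances McDormand", "Godfather"]
print(accept_column(column, known_actors))          # candidates still need type checks, e.g. "Godfather"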
Pre-specified patterns are inherently limited in their coverage. This motivates approaches for automatically learning patterns, using initial entity-type pairs and/or initial patterns for distant supervision. For example, when frequently observing a phrase like “great voice in” for entities of type singers, this phrase could be added to a set of indicative patterns for discovering more singers. This idea is captured in the following principle of statement-pattern duality, first formulated by [58] (see also [478] in the context of question answering):
Principle of Statement-Pattern Duality
When correct statements about entities x (e.g., x belonging to class y) frequently co-occur with textual pattern p, then p is likely a good pattern to derive statements of this kind.
Conversely, when statements about entities x frequently co-occur with a good pattern p, then these statements are likely correct.
Thus, observations of good statements and good patterns reinforce each other; hence the name statement-pattern duality.
This insightful paradigm gives rise to a straightforward algorithm where we start with statements in the KB as seeds (and possibly also with pre-specified patterns), and then iterate between deriving patterns from statements and deriving statements from patterns.
Algorithm for Seed-based Pattern Learning
Input: Seed statements in the form of known entities for a class
Output: Patterns for this class, and new entities of the class
Initialize: S ← seed statements; P ← ∅ (or pre-specified patterns like Hearst patterns)
Repeat
1. Pattern discovery:
- search for mentions of entities x ∈ S in a web corpus, and identify co-occurring phrases;
- generalize phrases into patterns by substituting x with a placeholder $X;
- analyze frequencies of patterns (and other statistics);
- P ← P ∪ frequent patterns;
2. Statement expansion:
- search for occurrences of patterns p ∈ P in the web corpus, and identify co-occurring entities;
- analyze frequencies of entities co-occurring with multiple patterns (and other statistics);
- S ← S ∪ frequent entities;
A toy example for running this algorithm is shown in Table 4.1.
Table 4.1: Toy example for Seed-based Pattern Learning, with seed “Elvis Presley”
Pattern                       New entities
“singers like $X”             Nina Simone
“$X’s vocal performance”      Amy Winehouse; sentence: “Queen’s vocal performance led by Freddie Mercury ...” → new entity: Queen
“voice of $X”                 Francoise Hardy; sentence: “The great voice of Donald Trump got loud and angry.” → new entity: Donald Trump
...                           ...
In the example, phrases such as “$X’s vocal performance” (with $X as a placeholder for entities) are generalized patterns. They co-occur with at least one but typically multiple of the entities in S known so far, and their strength is the cumulative frequency of these occurrences. Newly discovered patterns also co-occur with seed entities, and this would further strengthen the patterns’ usefulness. The NELL project [394] has run a variant of this algorithm at large scale, and has found patterns for musicians, such as “original song by X”, “ballads reminiscent of X”, “bluesmen, including X”, “was later covered by X”, “also performed with X”, “X’s backing bands”, and hundreds more (see http://rtw.ml.cmu.edu/rtw/kbbrowser/predmeta:musician ).
Despite its elegance, the algorithm, in this basic form, has severe limitations:
1. Over-specific patterns:
Some patterns are overly specific. For example, a possible pattern “$X and her chansons” would apply only to female singers. This can be overcome by generalizing patterns into regular expressions over words and part-of-speech tags [153]: “$X ∗ and PRP chansons” in this example, where ∗ is a wildcard for any word sequence and PRP requires a personal pronoun. Likewise, the pattern “$X’s great voice” could be generalized into “$X’s JJ voice” to allow for other adjectives (with part-of-speech tag JJ) such as “haunting voice” or “angry voice”. Moreover, instead of considering these as sequences over the surface text, the patterns could also be derived from paths in dependency-parsing trees (see, e.g., [65, 402, 561]).
2. Open-ended iterations:
In principle, the loop could be run forever. Recall will continue to increase, but precision will degrade with more iterations, as the acquired patterns get diluted. So we need to define a meaningful stopping criterion. A simple heuristic could be to consider the fraction of seed occurrences obtained at the end of iteration i, as the new patterns should still co-occur with known entities. So a sudden drop in the observations of seeds would indicate a notable loss of quality.
3. False positives and pattern dilution:
Even after one or two iterations, some of the newly observed statements are false positives: Queen is a band, not a singer, and Donald Trump does not have a lot of musical talent. This is caused by picking up overly broad or ambiguous patterns. For example, “Grammy winner” applies to bands as well, and “$X’s great voice” may be observed in sarcastic news about politics, besides music.
As already stated for points 1 and 2, these weaknesses can be ameliorated by extending the method. The hardest issue is point 3. To mitigate the potential dilution of patterns, a number of techniques have been explored. One is to impose additional constraints for pruning out misleading patterns and spurious statements; we will discuss these in Chapter 6 for the more general task of acquiring relational statements. A second major technique is to compute statistical measures of pattern and statement quality after each iteration, and use these to drop doubtful candidates [3]. In the following, we list some of the salient measures that can be leveraged for this purpose.
The support of pattern p for seed statements S, supp(p, S), is the ratio of the frequency of joint occurrences of p with any of the entities x ∈ S to the co-occurrence frequency of any pattern with any x ∈ S:
supp(p, S) = Σ_{x∈S} freq(p, x) / Σ_q Σ_{x∈S} freq(q, x)
where Σ_q ranges over all observed patterns q (possibly with a lower bound on absolute frequency) and freq(·) is the total number of observations of a pattern or pattern-entity pair.
The confidence of pattern p with regard to seed statements S is the ratio of the frequency of p jointly with seed entities x ∈ S to the frequency of p with any entities:
conf(p, S) = freq(p, S) / freq(p)
The diversity of pattern p with regard to seed statements S, div(p, S), is the number of distinct entities x ∈ S that co-occur with pattern p:
div(p, S) = |{x ∈ S : freq(p, x) > 0}|
We can also contrast the positive occurrences of a pattern p, that is, co-occurrences with correct statements from S, against the negative occurrences with statements known to be incorrect. To this end, we have to additionally compile a set of incorrect statements as negative seeds, for example, specifying that the Beatles and Barack Obama are not singers to prevent noisy patterns that led to acquiring Queen and Donald Trump as new singers. Let us denote these negative seeds as S⁻.
This allows us to revise the definition of confidence:
Given positive seeds S and negative seeds S⁻, the confidence of pattern p, conf(p), is the ratio of positive occurrences to occurrences with either positive or negative seeds:
conf(p) = Σ_{x∈S} freq(p, x) / ( Σ_{x∈S} freq(p, x) + Σ_{x∈S⁻} freq(p, x) )
These quality measures can be used to restrict the acquired patterns to those for which support, confidence or diversity – or any combination of these – are above a given threshold. Moreover, the measures can also be carried over to the observed statements. This is again based on the principle of statement-pattern duality. To this end, we now identify a subset P⁺ of good patterns using the statistical quality measures.
The confidence of statement x (i.e., that an entity x belongs to the class of interest) is the normalized aggregate frequency of co-occurring with good patterns, weighted by the confidence of these patterns:
conf(x) = Σ_{p∈P⁺} freq(x, p) · conf(p) / Σ_q freq(x, q)
where Σ_q ranges over all observed patterns. That is, we achieve perfect confidence in statement x if it is observed only in conjunction with good patterns and all these patterns have perfect confidence 1.0.
The diversity of statement x is the number of distinct patterns p ∈ P⁺ that x co-occurs with:
div(x) = |{p ∈ P⁺ : freq(p, x) > 0}|
For both of these measures, variations are possible as well as combined measures. Diversity is a useful signal to avoid that a single pattern drives the acquisition of statements, which would incur a high risk of error propagation. Rather than relying directly on these measures, it is also possible to use probabilistic models or random-walk techniques for scoring and ranking newly acquired statements. In particular, algorithms for set expansion (aka. concept expansion) can be considered to this end (e.g., [612, 223, 608, 87]).
We can now leverage these considerations to extend the seed-based pattern learning algorithm. The key idea is to prune, in each round of the iterative method, both patterns and statements that do not exceed certain thresholds for support, confidence and/or diversity. Conversely, we can promote, in each round, the best statements to the status of seeds, to incorporate them into the calculation of the quality statistics for the next round.
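The following sketch shows how these statistics could be computed from a toy co-occurrence table and used to prune patterns; the counts and thresholds are arbitrary, and the pattern and entity names follow the toy example of Table 4.1.

from collections import Counter

freq = Counter({                                   # (pattern, entity) -> observation count
    ("singers like $X", "Nina Simone"): 12,
    ("$X's vocal performance", "Amy Winehouse"): 8,
    ("$X's vocal performance", "Queen"): 5,
    ("voice of $X", "Francoise Hardy"): 6,
    ("voice of $X", "Donald Trump"): 4,
})
S = {"Nina Simone", "Amy Winehouse", "Francoise Hardy"}    # positive seeds
S_neg = {"Queen", "Donald Trump"}                          # negative seeds

def support(p):
    num = sum(f for (q, x), f in freq.items() if q == p and x in S)
    den = sum(f for (q, x), f in freq.items() if x in S)
    return num / den if den else 0.0

def confidence(p):                                 # revised confidence with negative seeds
    pos = sum(f for (q, x), f in freq.items() if q == p and x in S)
    neg = sum(f for (q, x), f in freq.items() if q == p and x in S_neg)
    return pos / (pos + neg) if pos + neg else 0.0

def diversity(p):
    return len({x for (q, x), f in freq.items() if q == p and x in S and f > 0})

patterns = {q for (q, _) in freq}
good = {p for p in patterns if confidence(p) >= 0.7 and diversity(p) >= 1}
for p in sorted(patterns):
    print(p, round(support(p), 2), round(confidence(p), 2), diversity(p), p in good)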
For example, when observing Amy Winehouse as a high-confidence statement in some round, we can add her to the seed set, this way enhancing the informativeness of the statistics in the next round. We sketch this extended algorithm, whose key ideas have been developed by [3] and [153].
Extended algorithm for Seed-based Pattern Learning
Input: Seed statements in the form of known entities for a class
Output: Patterns for this class, and new entities of the class
Initialize: S ← seed statements; S⁺ ← S // acquired statements; P ← ∅ (or pre-specified patterns like Hearst patterns)
Repeat
1. Pattern discovery:
- same steps as in base algorithm
- P ← P \ { patterns below quality thresholds }
2. Statement expansion:
- same steps as in base algorithm
- S⁺ ← S⁺ ∪ { newly acquired statements }
- S ← S ∪ { new statements above quality thresholds }
All of the above assumes that patterns are either matched or not. However, it is often the case that a pattern is almost but not exactly matched, with small variations in the wording or using synonymous words. For example, the pattern “Nobel prize winner $X” (for scientists for a change) could be considered as approximately matched by phrases such as “Nobel winning $X” or “Nobel laureate $X”. If these phrases are frequent and we want to consider them as evidence, we can add a similarity kernel sim(p, q) to the observation statistics, based on edit distance or n-gram overlap, where n-grams are sub-sequences of n consecutive characters. This simply extends the frequency of pattern p into a weighted count freq(p) = Σ_q sim(p, q), where Σ_q ranges over all approximate matches q of p in the corpus with sim(p, q) > θ, and θ is a pruning threshold to eliminate weakly matching phrases.
The presented techniques are also applicable to acquiring subclass/superclass pairs (for the subclass-of relation, as opposed to instance-of). For example, we can detect from patterns in web pages that rappers and crooners are subclasses of singers. This has been further elaborated in work on ontology/taxonomy learning like [363] and [153].
A major alternative to learning patterns for entity discovery is to devise end-to-end machine learning models. In contrast to the paradigm of seed-based distant supervision of Subsection 4.3, we now consider fully supervised methods that require labeled training data in the form of annotated sentences (or other text snippets). Typically, these methods work well only if the training data has a substantial size, in the order of ten thousand samples or higher. By exploiting large corpora with appropriate markup, or annotations from crowdsourcing workers, or high-quality outputs of employing methods like those for premium sources, such large-scale training data is indeed available today. On the first direction, Wikipedia is again a first-choice asset, as it has many sentences where a named entity appears and is marked up as a hyperlink to the entity’s Wikipedia article.
In the following, we present two major families of end-to-end learning methods. The first one is probabilistic graphical models where a sequence of words, or, more generally, tokens, is mapped into the joint state of a graph of random variables, with states denoting tags for the input words. The second approach is deep neural networks for classifying the individual words of an input sequence onto a set of tags. Thus, both of these methods are geared for the task of sequence labeling, also known as sequence tagging.
This family of models considers a set of coupled random variables that take a finite set of tags as values. In our application, the tags are primarily used to demarcate entity names in a token sequence, like an input sentence or other snippet. Each random variable corresponds to one token in the input sequence, and the coupling reflects short-distance dependencies (e.g., between the tags for adjacent or nearby tokens). In the simplest and most widely used case, the coupling is pair-wise such that a random variable for an input token depends only on the random variable for the immediately preceding token. This setup amounts to learning conditional probabilities for subsequent pairs of tags as a function of the input tokens.
As the input sequence is completely known upfront, we can generalize this to each random variable being a function of all tokens, or of all kinds of feature functions over the entire token sequence.
Conditional Random Fields (CRF):
The most successful method from this family of probabilistic graphical models is known as Conditional Random Fields, or CRFs for short, originally developed by [304]. More recent tutorials are by [569] on foundations and algorithms, and [511] on applying CRFs for information extraction. CRFs are in turn a generalization of the prior notion of Hidden Markov Models (HMMs), the difference lying in the incorporation of feature functions over the entire input sequence versus only considering two successive tokens.

A Conditional Random Field (CRF), operating over an input sequence $X = x_1 \ldots x_n$, is an undirected graph with a set of finite-state random variables $Y = \{Y_1 \ldots Y_m\}$ as nodes and pair-wise couplings of variables as edges. An edge between variables $Y_i$ and $Y_j$ denotes that their value distributions are coupled. Conversely, in the absence of an edge, two variables are conditionally independent given their neighbors. More precisely, the following Markov condition is postulated for all variables $Y_i$ and all possible values t:
$$P[Y_i = t \mid x_1 \ldots x_n,\, Y_1 \ldots Y_{i-1}, Y_{i+1} \ldots Y_m] = P[Y_i = t \mid x_1 \ldots x_n,\, \text{all } Y_j \text{ with an edge } (Y_i, Y_j)]$$

Strictly speaking, the coupling of random variables may go beyond pairs by introducing factor nodes (or factors for short) for dependencies between two or more variables. In this section, we restrict ourselves to the basic case of pair-wise coupling. Moreover, we assume that the graph forms a linear chain: a so-called linear-chain CRF. Often the variables correspond one-to-one to the input tokens, so each token $x_i$ is associated with variable $Y_i$ (and we have n = m for the number of tokens and variables).

CRF Training:
The training of a CRF from labeled sequences, in the form of $(X, Y)$ value pairs, involves the posterior likelihoods
$$P[Y_i \mid X] = P[Y_i = t_i \mid x_1 \ldots x_n,\, \text{all neighbors } Y_j \text{ of } Y_i]$$
With feature functions $f_k$ over input $X$ and subsets $Y_c \subset Y$ of coupled random variables (with known values in the training data), this can be shown to be equivalent to
$$P[Y \mid X] \sim \frac{1}{Z} \prod_c \exp\Big(\sum_k w_k \cdot f_k(X, Y_c)\Big)$$
with an input-independent normalization constant Z, k ranging over all feature functions, and c ranging over all coupled subsets of variables, the so-called factors. For a linear-chain CRF, the factors are all pairs of adjacent variables:
$$P[Y \mid X] \sim \frac{1}{Z} \prod_i \exp\Big(\sum_k w_k \cdot f_k(X, Y_{i-1}, Y_i)\Big)$$
The parameters of the model are the feature-function weights $w_k$; these are the output of the training procedure. The training objective is to choose the weights $w_k$ so as to minimize the error between the model's maximum-posterior Y values and the ground-truth values, aggregated over all training samples. The error, or loss function, can take different forms, for example, the negative log-likelihood that the trained model generates the ground-truth tags.

As with all machine-learning models, the objective function is typically combined with a regularizer to counter the risk of overfitting to the training data. The training procedure is usually implemented as a form of (stochastic) gradient descent (see, e.g., [68] and further references given there). This guarantees convergence to a local optimum and empirically approximates the global optimum of the objective function fairly well (see [569]). For the case of linear-chain CRFs, the optimization is convex, so we can always approximately achieve the global optimum.
CRF Inference:
When a trained CRF is presented with a previously unseen sentence, the inference stage computes the posterior values of all random variables, that is, the tag sequence for the entire input that has the maximum likelihood given the input and the trained model with weights $w_k$:
$$Y^* = \operatorname{argmax}_Y \; P[Y \mid X, \text{all } w_k]$$
For linear-chain CRFs, this can be computed using dynamic programming, namely, variants of the Viterbi algorithm (for HMMs). For general CRFs, other – more expensive – techniques like Monte Carlo sampling or variational inference are needed. Alternatively, the CRF inference can also be cast into an Integer Linear Program (ILP) and solved by optimizers like the Gurobi software, following [500]. We will go into more depth on ILPs as a modeling and inference tool in Chapter 8, especially Section 8.5.3.
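To make this concrete, the following is a minimal sketch of training and applying a linear-chain CRF tagger with the third-party sklearn-crfsuite library; the toy training sentences, the feature choices and the tag set are illustrative assumptions, not the setup of any specific system discussed here.

import sklearn_crfsuite

def word_features(sent, i):
    # hand-crafted feature functions over the input sequence, in the spirit of f_k(X, ...)
    word = sent[i]
    return {
        "word.lower": word.lower(),                                  # lexical identity
        "word.istitle": word.istitle(),                              # capitalization cue
        "word.isupper": word.isupper(),
        "prev.lower": sent[i - 1].lower() if i > 0 else "<BOS>",     # left context
        "next.lower": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",  # right context
    }

# toy labeled sequences; real training needs thousands of annotated sentences
train_sents = [["Dylan", "composed", "Hurricane"], ["Elvis", "lived", "in", "Memphis"]]
train_tags  = [["NE", "O", "NE"],                  ["NE", "O", "O", "NE"]]

X_train = [[word_features(s, i) for i in range(len(s))] for s in train_sents]
y_train = train_tags

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1, max_iterations=100)
crf.fit(X_train, y_train)                       # learns the feature-function weights w_k

test = ["Presley", "recorded", "Hound", "Dog"]
print(crf.predict([[word_features(test, i) for i in range(len(test))]]))  # Viterbi decoding

The feature dictionaries play the role of the feature functions; the library learns one weight per feature-tag combination plus tag-transition weights, exactly the linear-chain setting described above.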
CRF for Part-of-Speech Tagging:
A classical application of CRF-based learning is part-of-speech tagging: labeling each word in an input sentence with its word category, like noun (NN), verb (VB), preposition (IN), article (DET) etc. Figure 4.1 shows two examples for this task, with their correct output tags.

Figure 4.1: Examples for CRF-based Part-of-Speech Tagging (input tokens $x_1 \ldots x_5$ with output variables $Y_1 \ldots Y_5$: "Time flies like an arrow" → NN VB IN DET NN; "Fruit flies like a banana" → NN NN VB DET NN)
The intuition why this works so well is that word-sequence frequencies as well as tag-sequence frequencies from large corpora can inform the learner, in combination with other feature functions, to derive very good weights. For example, nouns are frequently followed by verbs, verbs are frequently followed by prepositions, and some word pairs are composite noun phrases (such as "fruit flies"). Large corpus statistics also help to cope with exotic or even nonsensical inputs. For example, the sentence "Pizza flies like an eagle" would be properly tagged as NN VB IN DET NN because, unlike fruit flies, there is virtually no mention of pizza flies in any corpus.
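Off-the-shelf taggers make it easy to try such examples. The snippet below uses NLTK's pre-trained tagger (a statistical sequence model in the same spirit), under the assumption that the required NLTK resources can be downloaded; note that it emits Penn Treebank tags (DT, VBZ, ...), which differ slightly from the simplified tags in Figure 4.1.

import nltk

# resource name may vary slightly across NLTK versions
nltk.download("averaged_perceptron_tagger", quiet=True)

for sentence in ["Time flies like an arrow", "Fruit flies like a banana"]:
    tokens = sentence.split()          # simple whitespace tokenization for the toy example
    print(nltk.pos_tag(tokens))        # list of (word, tag) pairs, e.g. ('like', 'IN')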
CRF for Named Entity Recognition (NER) and Typing:
For the task at hand, entity discovery, the CRF tags of interest are primarily
NE for Named Entity and O for Others. For example, the sentence "Dylan composed the song Sad-eyed Lady of the Lowlands, about his wife Sara Lownds, while staying at the Chelsea hotel" should yield the tag sequence NE O O O NE NE NE NE NE O O O NE NE O O O O NE NE. Many entity mentions correspond to the part-of-speech tag NNP, for proper noun, that is, nouns that should not be prefixed with an article in any sentence, such as names of people. However, this is not sufficient, as it would accept false positives like abstractions (e.g., "love" or "peace") and would miss out on multi-word names that include non-nouns, such as song or book titles (e.g., "Sad-eyed Lady of the Lowlands"), and names that come with an article (e.g., "the Chelsea hotel"). For these reasons, CRFs for
Named Entity Recognition, or NER for short, have been specifically developed, with training over annotated corpora. The seminal work on this is [166]; advanced extensions are implemented in the Stanford CoreNLP software suite ([367]).

As the training of a CRF involves annotated corpora, instead of merely distinguishing entity mentions versus other words, we can piggyback on the annotation effort and incorporate more expressive tags for different types of entities, such as people, places, products (incl. songs and books). This idea has indeed been pursued already in the work of [166], integrating coarse-grained entity typing into the CRF for NER. The simple change is to move from output variables with tag set {NE, O} to a larger tag set like {PERS, LOC, ORG, MISC, O}, denoting persons (PERS), locations (LOC), organizations (ORG), entities of miscellaneous types (MISC) such as products or events, and non-entity words (O). For the above example about Bob Dylan, we should then obtain the tag sequence PERS O O O MISC MISC MISC MISC MISC O O O PERS PERS O O O O LOC LOC. The way the CRF is trained and used for inference stays the same, but the training data requires annotations for the entity types. Later work has even devised (non-CRF) classifiers for more fine-grained entity typing, with tags for hundreds of types, such as politicians, scientists, artists, musicians, singers, guitarists, etc. [168, 344, 413, 93]. An easy way of obtaining training data for this task is to consider hyperlink anchor texts in Wikipedia as entity names and derive their types from Wikipedia categories, or directly from a core KB constructed by methods from Chapter 3. An empirical comparison of various NER methods with fine-grained typing is given by [365].

Widely used feature functions for CRF-based NER tagging, or features for other kinds of NER/type classifiers, include the following:
• part-of-speech tags of words and their co-occurring words in left-hand and right-hand proximity,
• uppercase versus lowercase spelling,
• word occurrence statistics in type-specific dictionaries, such as dictionaries of people names, organization names, or location names, along with short descriptions (from yellow pages and so-called gazetteers),
• co-occurrence frequencies of word-tag pairs in the training data,
• further statistics for word n-grams and their co-occurrences with tags.
There is also substantial work on domain-specific NER, especially for the biomedical domain (see, e.g., [170] and references given there), and also for chemistry, restaurant names and menu items, and titles of entertainment products. In these settings, domain-specific dictionaries play a strong role as input for feature functions [476, 519].

In recent years, neural networks have become the most powerful methodology for supervised machine learning when sufficient training data are available. This holds for a variety of NLP tasks [186], including Named Entity Recognition.

The most primitive neural network is a single perceptron, which takes as input a set of real numbers, aggregates them by weighted summation, and applies a non-linear activation function (e.g., logistic function or hyperbolic tangent) to the sum to produce its output. Such building blocks can be connected to construct entire networks, typically organized into layers. Networks with many layers are called deep networks.
As a loose metaphor, one may think of the nodes as neurons and the interconnecting edges as synapses. The weights of incoming edges (for the weighted summation) are the parameters of such neural models, to be learned from labeled training data. The inputs to each node are usually entire vectors, not just single numbers, and the top layer's outputs are real values for regression models or, after applying a softmax function, scores for classification labels. The loss function for the training objective can take various forms of error measures, possibly combined with regularizers or constraint-based penalty terms. For neural learning, it is crucial that the loss function is differentiable in the model parameters (i.e., the weights), and that gradients can be backpropagated through the entire network. Under this condition, training is effectively performed by methods for stochastic gradient descent (see, e.g., [68]), and modern software libraries (e.g., TensorFlow) support scaling out these computations across many processors. For inference, with new inputs outside the training data, input vectors are simply fed forward through the network by performing matrix and tensor operations at each layer.
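The following minimal numpy sketch spells out these ingredients for a single hidden layer; all dimensions and the random weights are illustrative placeholders (training would adjust the weight matrices by backpropagation and stochastic gradient descent).

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(300, 128)), np.zeros(128)   # input layer -> hidden layer
W2, b2 = rng.normal(size=(128, 5)), np.zeros(5)       # hidden layer -> 5 tag scores

def forward(x):                        # x: 300-dimensional input vector (e.g., a word embedding)
    h = np.tanh(x @ W1 + b1)           # weighted summation + non-linear activation
    scores = h @ W2 + b2               # one score per output label
    return np.exp(scores) / np.exp(scores).sum()   # softmax -> label probabilities

print(forward(rng.normal(size=300)).round(3))       # untrained weights, so an arbitrary distribution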
LSTM Models:
There are various families of neural networks, with different topologies for interconnecting layers. For NLP, where the input to the entire network is a text sequence, so-called LSTM networks (for "Long Short-Term Memory") have become prevalent (see [515, 186] and references there). They belong to the broader family of recurrent neural networks with feedback connections between nodes of the same layer. This allows these nodes to aggregate latent state computed from seeing an input token and the latent state derived from the preceding tokens. To counter potential bias from processing the input sequence in forward direction alone, Bi-directional LSTMs (or
Bi-LSTMs for short) connect nodes in both directions.

We can think of LSTMs as the neural counterpart of CRFs. A key difference, however, is that neural networks do not require the explicit modeling of feature functions. Instead, they take the raw data (in vectorized form) as inputs and automatically learn latent representations that implicitly capture features and their cross-talk.

Figure 4.2 gives a pictorial illustration of an LSTM-based neural network for NER, applied to a variant of our Bob Dylan example sentence. The outputs of the forward LSTM and the backward LSTM are combined into a latent data representation, for example, by concatenating vectors. On top of the bi-LSTM, further network layers (learn to) compute the scores for each tag, typically followed by a softmax function to choose the best label. The output sequence of tags is slightly varied here, by prefixing each tag with its role in a subsequence of identical tags: B for Begin, I for In, and E for End. This serves to distinguish the case of a single multi-word mention from the case of different mentions without any interleaving "Other" words. The E tags are not really needed as a following B tag indicates the next mention anyway, hence E is usually omitted. This simple extension of the tag set is also adopted by CRFs and other sequence labeling learners; we disregarded this earlier for simplicity.

Figure 4.2: Illustration of LSTM network for NER (the input sequence "Dylan composed Sad-eyed Lady of the Lowlands at the Chelsea hotel" is mapped via input vectors, bi-LSTM layers, learned representations and additional network layers onto the output tags B-PERS O B-Misc I-Misc I-Misc I-Misc E-Misc O O B-LOC E-LOC)

LSTM-based neural networks can be combined with a CRF on top of the neural layers [250, 361, 306], this way combining the strengths of the two paradigms. Other enhancements (see [330] for a survey of neural NER) include additional bi-LSTM layers for learning character-level representations, capturing character n-grams in a latent manner. This can leverage large unlabeled corpora, analogously to the role of dictionaries in feature-based taggers. This line of methods has also extended the scope of fine-grained entity typing, yielding labels for thousands of types [93].

Overall, deep neural networks, in combination with CRFs, tend to outperform other methods for NER whenever a large amount of training data is at hand. When training data is not abundant, for example, in specific domains such as health, pattern-based methods and feature-driven graphical models (incl. CRFs) are still a good choice.
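As an illustration of the architecture sketched in Figure 4.2, here is a skeletal bi-LSTM tagger in PyTorch; vocabulary size, tag set and dimensions are assumptions for the sketch, and a production model would add pre-trained embeddings, character-level layers and typically a CRF output layer.

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=128):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, num_tags)   # forward + backward states -> tag scores

    def forward(self, token_ids):                  # token_ids: (batch, seq_len) integer-encoded tokens
        states, _ = self.lstm(self.emb(token_ids))
        return self.out(states)                    # per-token tag logits

tagger = BiLSTMTagger(vocab_size=50_000, num_tags=9)   # e.g. B/I tags for PERS, LOC, ORG, MISC plus O
logits = tagger(torch.randint(0, 50_000, (1, 12)))     # one dummy sentence of 12 tokens
print(logits.argmax(dim=-1))                           # predicted tag ids (untrained, hence arbitrary)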
Machine learning methods, and especially neural networks, do not operate directly on text, as they require numeric vectors as inputs. A popular way of casting text into this suitable form is by means of embeddings.

Word Embedding:
Embeddings of words (or multi-word phrases) are real-valued vectors of fixed dimensionality, such that the distance between two vectors (e.g., by their cosine) reflects the relatedness (sometimes called "semantic similarity") of the two words, based on the respective contexts in which the words typically occur.

Word embeddings are computed (or "learned") from co-occurrences and neighborhoods of words in large corpora. The rationale is that the meaning of a word is captured by the contexts in which it is often used, and that two words are highly related if they are used in similar contexts. This is often referred to as "distributional semantics" or "distributed semantics" of words [324]. This hypothesis should not be confused with two words directly co-occurring together. Instead, we are interested in indirect co-occurrences where the contexts of two input words share many words. For example, the words "car" and "automobile" have the same semantics, but rarely co-occur directly in the same text span. The point rather is that both often co-occur with third words such as "road", "traffic", "highway" and names of car models.

Technically, this becomes an optimization problem. Given a large set of text windows C of length k+1 with word sequences $w_0 \ldots w_k$, we aim to compute a fixed-length vector $\vec{w}$ for each word w such that the error for predicting the word's surrounding window $C(w_t) = w_{t-k/2} \ldots w_t \ldots w_{t+k/2}$, from all word-wise vectors alone, is minimized. This consideration leads to a non-convex continuous optimization with the objective function [384]:
$$\text{maximize} \sum_{C} \;\; \sum_{j \in C(w_t),\, j \neq t} \log \frac{\exp(\vec{w}_j^{\,T} \cdot \vec{w}_t)}{\sum_v \exp(\vec{v}^{\,T} \cdot \vec{w}_t)}$$
where the outermost sum ranges over all possible text windows (with overlapping windows). The dot product between the output vectors $\vec{w}_j$ and $\vec{w}_t$ reflects overlapping-context likelihoods of word pairs, and the softmax function normalizes these scores by considering all possible words v. Intuitively, the objective is maximized if the resulting word vectors can predict the surrounding window from a given word with high accuracy. This specific objective is known as the skip-gram model; there are also other variations, with a similar flavor. Computing solutions for the non-convex optimization is typically done via gradient descent methods.

Embedding vectors can be readily plugged into machine-learning models, and they are a major asset for the power of neural networks for NLP tasks. A degree of freedom is the choice of the dimensionality of the vectors. For most use cases, this is set to a few hundred, say 300, for robust behavior.

The most popular models and tools for this line of text embeddings are word2vec, by [384], and GloVe, by [447]. Both come with pre-packaged embeddings derived from news and other collections, but applications can also compute new embeddings from customized corpora. The word2vec approach has been further extended to compute embeddings for short paragraphs and entire documents (called doc2vec).
Important earlier work on latent text representations, with similar but less expressive models, includes Latent Semantic Indexing (LSI) and Latent Dirichlet Allocation (LDA) ([114, 243, 48]).

A recent, even more advanced way of computing and representing embeddings is by training deep neural networks for word-level or sentence-level prediction tasks, and then keeping the learned model as a building block for training an encompassing network for downstream tasks (e.g., question answering or conversational chatbots). The pre-training utilizes large corpora like the full text of all Wikipedia articles or the Google Books collection. A typical objective function is to minimize the error in predicting a masked-out word given a text window of successive words. The embeddings for all words are jointly given by the learned weights of the network's synapses (i.e., connections between neurons), with 100 million real numbers or even more. Popular instantiations of this approach are ELMo [448] and BERT [118].
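For the skip-gram model described above, a sketch with the gensim library (API of gensim 4.x) could look as follows; the two toy sentences are a stand-in for a large tokenized corpus, and all parameter values are illustrative.

from gensim.models import Word2Vec

corpus = [
    ["dylan", "recorded", "hurricane", "a", "protest", "song"],
    ["presley", "recorded", "hound", "dog", "a", "rock", "song"],
]
model = Word2Vec(sentences=corpus, vector_size=100, window=5, sg=1, min_count=1)  # sg=1: skip-gram

vec = model.wv["song"]                  # the learned embedding vector for "song"
print(model.wv.most_similar("song"))    # neighbors by cosine; meaningful only with a real corpus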
Embeddings for Word and Entity Pair Relatedness:
Embedding vectors are not directly interpretable; they are just vectors of numbers. However, we can apply linear algebra operators to them to obtain further results. Embeddings are additive and subtractive, which allows forming analogies of the form:
$$\vec{man} - \vec{king} \approx \vec{woman} - \vec{queen}$$
$$\vec{France} - \vec{Paris} \approx \vec{Germany} - \vec{Berlin}$$
$$\vec{Einstein} - \vec{scientist} \approx \vec{Messi} - \vec{footballer}$$
So we can solve an equation like
$$\vec{Rock\ and\ Roll} - \vec{Elvis\ Presley} = \vec{Folk\ Rock} - \vec{X}$$
yielding
$$\vec{X} = \vec{Folk\ Rock} - \vec{Rock\ and\ Roll} + \vec{Elvis\ Presley} \approx \vec{Bob\ Dylan}$$
Most importantly for practical purposes, we can compare the embeddings of two words(or phrases) by computing a distance measure between their respective vectors, typically,the cosine or the scalar product. This gives us a measure of how strongly the two words arerelated to each other, where (near-) synonyms would often have the highest relatedness.
Embedding-based Relatedness:
For two words v and w, their relatedness can be computed as $\cos(\vec{v}, \vec{w})$ from their embedding vectors $\vec{v}$ and $\vec{w}$.

The absolute values of the relatedness scores are not crucial, but we can now easily order related words by descending scores. For example, for the word "rock", the most related words and short phrases are "rock n roll", "band", "indie rock" etc., and for "knowledge" we obtain the most salient words "expertise", "understanding", "knowhow", "wisdom" etc. We will later see that such relatedness measures are very useful for many sub-tasks in knowledge base construction.

The embedding model can capture not just words or other text spans, but we can also apply it to compute distributional representations of entities. This is achieved by associating each entity in the knowledge base with a textual description of the entity, typically the Wikipedia article about the entity (but possibly also the external references given there, homepages of people and organizations, etc.). Once we have per-entity vectors, we can again compute cosine or scalar-product distances for entity pairs. This results in measures for entity-entity relatedness. Moreover, by coupling the computations of per-word and per-entity embeddings, we also obtain scores for entity-word relatedness, which is often handy when we need salient keywords or keyphrases for an entity. For example, the embedding for Elvis Presley should be close to the embeddings for "king", "rock n roll", etc. Technical details for these models can be found in [616, 677, 638]; the latter includes data and code for the wikipedia2vec tool.

Such embeddings have also been computed from domain-specific data sources, most notably, for biomedical entities and terminology, with consideration of the standard MeSH vocabulary. Resources of this kind include BioWordVec ([663]) and BioBERT ([314]). An important predecessor to all these works is the semantic relatedness model of [171], which was the first to harness Wikipedia articles for this purpose.

A related, recent direction is knowledge graph (KG) embeddings (see [611] for a survey). These kinds of embeddings capture the neighborhood of entities in an existing graph-structured KB. They do not use textual inputs, however, and serve different purposes. We will discuss KG embeddings in Chapter 8, specifically Section 8.4.
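A minimal sketch of embedding-based relatedness: given pre-computed word or entity vectors (here random stand-ins for trained embeddings), candidates are ranked by cosine similarity, as in the "rock" example above.

import numpy as np

def cosine(v, w):
    return float(v @ w / (np.linalg.norm(v) * np.linalg.norm(w)))

rng = np.random.default_rng(0)
vectors = {name: rng.normal(size=300)              # placeholders; real systems load trained vectors
           for name in ["rock", "rock n roll", "band", "indie rock", "wisdom"]}

query = vectors["rock"]
ranking = sorted(((cosine(query, v), name) for name, v in vectors.items() if name != "rock"),
                 reverse=True)
print(ranking)   # with real embeddings, the music-related terms would rank highest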
Assuming that we can extract a large pool of types, and optionally also entities for them, the task discussed here is to construct a taxonomic tree or DAG (directed acyclic graph) for these types – without assuming any prior structure such as WordNet. In the literature, the problem is also referred to as taxonomy induction [541, 457], as its output is a generalization of bottom-up observations. The input can take different forms, for example, starting from the noisy set of Wikipedia categories (but ignoring the graph structure), or from noisy and sparse pairs of hyponym-hypernym candidates derived by applying patterns to large text and web corpora. A good example for the latter is the
WebIsALOD project (http://webisa.webdatacommons.org/), which used more than 50 patterns to extract candidate pairs and a supervised classifier to prune out the noisiest ones [232]. This collection, and others of similar flavor, does not strictly focus on hypernymy but also captures meronymy/holonymy (part-of) and, to some extent, instance-type pairs. Hence the broader term
IsA in the project name.
Methods for Wikipedia Categories:
Seminal work that considered all Wikipedia categories as noisy type candidates and thesubcategory-supercategory pairs as hypernymy candidates was the
WikiTaxonomy project [456, 457]. Its approach can be characterized by three steps:
Wikipedia-based Taxonomy Induction:
• Category Cleaning: eliminating noisy categories that do not really denote types.
• Category-Pair Classification: using a rule-based classifier to eliminate pairs that do not denote hypernymy.
• Taxonomy Graph Construction: building a tree or DAG from the remaining types and hypernymy pairs.
The first step is very similar to the techniques presented in Section 3.2. State-of-the-art techniques for this purpose are discussed in [440]. The second step is based on heuristic but powerful rules that compare stems or lemmas of head words in multi-word noun phrases. The following shows two examples for rules:
• For sub-category S and direct super-category C: if head(S) is the same as head(C), then this is likely a hyponym-hypernym pair (e.g., S = "American baritones", C = "baritones by nationality").
• If the head word of one category appears in the other category only as a modifier, that is, not as its head word, then this is likely not a good pair (e.g., S = "American baritones", C = "baritone saxophone players").
A simplified sketch of this head-word heuristic is given below. Additional rules are used to refine the first case and to handle other cases. This includes considering instances of a category, at the entity level, and comparing their set of categories against the category at hand.

For the third step, graph construction, the method applies transitivity to build a multi-rooted graph, eliminates cycles by removing as few edges as possible, and connects all roots of the resulting DAG to the universal type entity.

An industrial-strength variation and extension of the presented method is discussed in [117]. Another Wikipedia component that has been considered as noisy input for taxonomy induction is infobox templates. The Wikipedia community has developed a large number of different templates for people, musicians, bands, songs, albums etc. They are instantiated in highly varying numbers, and there is redundancy, for example, different templates for songs, some used more than others. [625] proposed a learning approach, using SVM classifiers and CRF-like graphical models, to infer a clean taxonomy from this noisy data.
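A rough sketch of the head-word rules could look as follows; the head extraction (last word before a preposition, with naive de-pluralization) is a deliberate simplification of what a parser plus lemmatizer would provide, and the function names are illustrative.

PREPOSITIONS = {"by", "of", "in", "from", "with"}

def head(category):
    words = category.lower().split()
    for i, w in enumerate(words):
        if w in PREPOSITIONS:            # "baritones by nationality" -> "baritones"
            words = words[:i]
            break
    return words[-1].rstrip("s")         # crude singularization of the head word

def hypernym_candidate(sub, sup):
    """Heuristic decision whether (sub, sup) is a plausible hyponym-hypernym pair."""
    if head(sub) == head(sup):
        return True                      # "American baritones" / "baritones by nationality"
    if head(sub) in sup.lower() and head(sub) != head(sup):
        return False                     # "American baritones" / "baritone saxophone players"
    return False                         # remaining cases would need additional rules / instance checks

print(hypernym_candidate("American baritones", "baritones by nationality"))    # True
print(hypernym_candidate("American baritones", "baritone saxophone players"))  # False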
Taxonomies from Catalogs, Networks and User Behavior:
Alternatively to Wikipedia categories, other catalogs of categories can be processed in a similar manner, for example, the DMOZ directory of web sites (https://dmoz-odp.org/) or the Icecat open product catalog (https://icecat.biz/). Some methods combine information from catalogs with topical networks, for example, connecting business categories, users and reviews on sites such as Yelp or TripAdvisor, and potentially also informative terms from user reviews. Examples of such methods are [609, 520]. Last but not least, recent methods
AutoKnow pipeline [133], discussed further in Section 9.5.
Folksonomies from Social Tags:
The general approach has also been carried over to build taxonomies from social tagging ,resulting in so-called folksonomies [199]. The input is a set of items like images or webpages of certain types that are associated with concise tags to annotate items, such as“sports car”, “electric car”, “hybrid auto” etc. If the number of items and the taggingcommunity are very large, the frequencies and co-occurrences of (words in) tags provide cuesabout proper types as well as type pairs where one is subsumed by the other. Data miningtechniques (related to association rules) can then be applied to clean such a large but noisycandidate pool, and the subsequent DAG construction is straightforward (e.g.,[233, 247,262]).Another target for similar techniques are fan communities (e.g., hosted at http://fandom.com aka. Wikia), which have collaboratively built extensive but noisy category and taggingsystems for entertainment fiction like movie series or TV series (e.g., Lord of the Rings,Game of Thrones, The Simpsons etc.) [97, 231].
Methods for Web Contents:
Early approaches spotted entity names in web-page collections and clustered these byvarious similarity measures (e.g., [141]). Using Hearst patterns and other heuristics, typelabels are then derived for each cluster. Such techniques can be further refined for scoringand ranking the outputs (e.g., using label propagation with random-walk techniques [571]).The seminal
KnowItAll project [153] advanced this line of research by a suite of scoringand classification techniques to enhance the output quality. It also studied tapping into listsof named entities as a source of type cues. For example, headers or captions of lists mayserve as type candidates, and pairs of lists where one mostly subsumes the other in terms ofelements (with some tolerance for exceptions) can be viewed as candidates for hypernymy.In the
Probase project, noisy candidates for hypernymy pairs were mined from the Web index of a major search engine [631]. First, Hearst patterns were liberally applied to this huge text collection. Then, a probabilistic model was used to prune out noise and infer likely candidates for hyponym-hypernym pairs, based on (co-)occurrence frequencies. The approach led to a huge but still noisy and incomplete taxonomy. Also, the resulting types are not canonicalized, meaning that synonymous type names may appear as different nodes in the taxonomy with different neighborhoods of hyponyms and hypernyms. Nevertheless, for use cases like query recommendation in web search, such a large collection of taxonomic information can be a valuable asset.

Recent works have approached taxonomy induction as a supervised machine-learning task, using factor graphs or neural networks, or by reinforcement learning (see, e.g., [31, 529, 368]).
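To illustrate the liberal application of Hearst patterns in pipelines of this kind, the following sketch applies a single "such as" pattern with a regular expression; real systems use many more patterns, part-of-speech constraints, conjunction handling, and corpus-scale statistics for pruning (the toy corpus below is a made-up stand-in).

import re
from collections import Counter

# "<class> such as <Instance>" with a crude one/two-word class and a capitalized instance
PATTERN = re.compile(r"([A-Za-z][a-z]+(?: [a-z]+)?)\s+such as\s+([A-Z][a-z]+(?: [A-Z][a-z]+)?)")

corpus = [
    "He admired protest singers such as Bob Dylan and Joan Baez.",
    "Cities such as Berlin attract many musicians.",
]

pair_counts = Counter()
for sentence in corpus:
    for cls, inst in PATTERN.findall(sentence):
        pair_counts[(cls.strip().lower(), inst.strip())] += 1   # conjuncts like "and Joan Baez" need extra handling

print(pair_counts)   # these frequencies feed the support/confidence-style pruning discussed earlier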
Methods for Query-Click Logs:
Search engine companies have huge logs of query-click pairs: user-issued keyword queries and subsequent clicks on web pages after seeing the preview snippets of top-ranked results. When a sufficiently large fraction of queries is about types (aka. classes), such as "American song writers" or "pop music singers from the midwest", one can derive various signals towards inferring type synonymy and pairs for the IsA relation:
• Surface cues in query strings: frequent patterns of query formulations that indicate type names. Examples are queries that start with "list of", or noun-phrase query strings with a prefix (or head word) known to be a type followed by a modifier, such as "musicians who were shot" or "IT companies started in garages".
• Co-Clicks: pages that are (frequently) clicked for two different queries. For example, if the queries "American song writers" and "Americana composers" have many clicks in common, they could be viewed as synonymous types (a toy computation of this cue is sketched below).
• Overlap of query and page title: the word-level n-gram overlap between the query string and the title of a (frequently) clicked page. For example, if the query "pop music singers from the midwest" often leads to clicking the page with title "baritone singers from the midwest", this pair is a candidate for the IsA relation (or, specifically, hypernymy between types, if the two strings are classified to denote types, not entity instances or other phrases).
A variety of methods have been devised to harness these cues for inferring taxonomic relations (synonymy and hypernymy, or more coarsely IsA) by [24, 441, 347, 346]. These involve scoring and ranking the candidates, so that different slices can be compiled depending on whether the priority is precision or recall. By incorporating word-embedding-based similarities and learning techniques, the directly observed cues can also convey generalizations, for example, inferring that "crooners from Mississippi" are a subtype of "singers from the midwest".

Some of the resulting collections of types and type-name pairs are much richer and more fine-grained than the taxonomies that hinge on Wikipedia-like sources. Their strength is that they cover very specific types absent in current KBs, such as "musicians who were shot" (e.g., John Lennon), "musicians who died at 27" (e.g., Jim Morrison, Amy Winehouse, etc.), or "IT companies started in garages" (e.g., Apple), along with sets of paraphrases for these (e.g., "27 club" for "musicians who died at 27"). Such repositories are very useful for query suggestions (i.e., auto-completion or re-formulations) and explorative browsing (see, e.g., [347, 346]), but they do not (yet) reach the semantic rigor and near-human quality of full-fledged knowledge bases.
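As a toy illustration of the co-click cue mentioned above, the following sketch computes Jaccard overlaps of click sets from a made-up query-click log; the threshold is an arbitrary illustration.

from collections import defaultdict
from itertools import combinations

click_log = [                                  # hypothetical (query, clicked page) pairs
    ("american song writers", "page/bob_dylan"),
    ("american song writers", "page/carole_king"),
    ("americana composers",   "page/bob_dylan"),
    ("americana composers",   "page/carole_king"),
    ("pop music singers",     "page/madonna"),
]

clicks = defaultdict(set)
for query, page in click_log:
    clicks[query].add(page)

for q1, q2 in combinations(clicks, 2):
    jaccard = len(clicks[q1] & clicks[q2]) / len(clicks[q1] | clicks[q2])
    if jaccard > 0.5:
        print(f"candidate type synonyms: {q1!r} ~ {q2!r} ({jaccard:.2f})")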
Discussion:
Overall, the methods presented in this section have not yet achieved taxonomies of betterquality and much wider coverage than those built directly from premium sources (seeChapter 3). Nevertheless, the outlined methodologies for coping with noisier input areof interest and value, for example, for query suggestion by search engines and towardsconstructing domain-specific KBs (e.g. on health where user queries could be valuable cues;see [286] and references there).
The following are key points to remember.
• The task of entity discovery involves finding more entities for a given type as well as finding more informative types for a given entity. These two goals are intertwined, and many methods in this chapter apply to both of them.
• To discover entity names in web contents, dictionaries and patterns are an easy and very effective way. Patterns can be hand-crafted, such as Hearst patterns, or automatically computed by seed-based distantly supervised learning, following the principle of statement-pattern duality. For assuring the quality of newly acquired entity-type pairs, quantitative measures like support and confidence must be considered.
• When sufficient amounts of labeled training data are available, in the form of annotated sentences, end-to-end supervised learning is a powerful approach. This is typically cast into a sequence tagging task, known as Named Entity Recognition (NER) and Named Entity Typing. Methods for this purpose are based on probabilistic graphical models like CRFs, or deep neural networks like LSTMs, or combinations of both.
• A useful building block for all these methods is word and entity embeddings, which latently encode the degree of relatedness between pairs of words or entities.
• While most of these methods start with a core KB that already contains a (limited) set of entities and types, it is also possible to compute IsA relations by ab-initio taxonomy induction from text-based observations only. This has potential for obtaining more long-tail items, but comes at a higher risk of quality degradation.
The entity discovery methods discussed in Chapter 4 may inflate the KB with alias namesthat refer to the same real-world entity. For example, we may end up with entity namessuch as “Elvis Presley”, “Elvis” and “The King”, or “Harry Potter Volume 1” and “HarryPotter and the Philosopher’s Stone”. If we treated all of them as distinct entities, we wouldend up with redundancy in the KB and, eventually, inconsistencies. For example, the birthand death dates for
Elvis Presley and
Elvis could be different, causing uncertainty aboutthe correct dates. For some KB applications, this kind of inconsistency may not cause muchharm, as long as humans are satisfied with the end results, such as finding songs for asearch-engine query about
Elvis . However, applications that combine , compare and reason with KB data, such as entity-centric analytics or recommendations, need to be aware ofcases when two names denote the same entity. For example, counting book mentions formarket studies should properly combine the two variants of the same Harry Potter bookwhile avoiding conflation with other book titles that denote different volumes of the series.Likewise, a user should not erroneously get recommendations for a book that she alreadyread.This motivates why a high-quality KB needs to tame ambiguity by canonicalizing entitymentions, creating one entry for all observations of the same entity regardless of namevariants. The task comes in a number of different settings. The most widely studied case is called
Entity Linking (EL) , where we assume an existingKB with a rich set of canonicalized entities (e.g., harvested from premium sources) and weobserve a new set of mentions in additional inputs like text documents or web tables. Whenthe input is text, the task is also known as
Named Entity Disambiguation (NED) inthe computational linguistics community. Historically, the so-called
Wikification task [383,386] has aimed to map both named entities and general concepts onto Wikipedia articles(including common nouns such as “football”, which could mean either American football orEuropean football aka. soccer, or the ball itself). This leads to the broad task of
Word Sense Disambiguation (WSD). Text with entity mentions also contains general words that are often equally ambiguous. For example, words like "track" and "album" (cf. Figure 5.1) can have several and quite different meanings, referring to music (as in the figure) or to completely different topics. The WSD task is to map these surface words (and possibly also multi-word phrases) onto their proper word senses in WordNet or onto Wikipedia articles [417, 399, 203]. For the mission of KB construction, general concepts and WSD are out of scope.

Figure 5.1 gives an example for the EL task. In the input text on the left side, NER methods can detect mentions, and these need to be mapped to their proper entities in the KB, shown on the right-hand side. Candidate entities can be determined based on surface cues like string similarity of names. In the example, this leads to many candidates for the first name Bob, but also for the highly ambiguous mentions "Hurricane", "Carter" and "Washington". As each mention has so many mapping options, we face a complex combinatorial problem. Wikipedia knows more than 700 people with first name (or nickname) Bob, and Wikidata contains many more.
Figure 5.1: Example for Entity Linking. The input text (left) reads "Hurricane, about Carter, is one of Bob's tracks. It is played in the film with Washington."; the candidate entities (right) include Hurricane (song), Hurricane cocktail, Jimmy Carter, Rubin Carter, Bob Dylan, Robert Kennedy, Washington, DC, George Washington and Denzel Washington.
To compute, or learn to compute, the correct mapping, all methods consider varioussignals that relate input and output:
Mention-Entity Popularity:
If an entity is frequently referred to by the name of the mention, this entity is a likely candidate. For example, "Carter" and "Washington" most likely denote the former US president
Jimmy Carter and
Washington, DC.
Mention-Entity Context Similarity:
Mentions have surrounding text, which can be compared to descriptions of entitiessuch as short paragraphs from Wikipedia or keyphrases derived from such texts.For example, the context words “tracks” and “played” are cues towards music andmusicians, and “film with” suggests that “Washington” is an actor or actress.
Entity-Entity Coherence:
In meaningful texts, different entities do not co-occur uniformly at random. Whywould someone write a document about
Jimmy Carter drinking a
Hurricane (cocktail) together with
Robert Kennedy and
George Washington ?For two entities to co-occur, a semantic relationship should hold between them. Theexisting KB may have such prior knowledge that can be harnessed. For example,
Bob Dylan has composed Hurricane (song), and its lyrics are about the African-American boxer Rubin Carter, aka. Hurricane, who was wrongfully convicted of murder in the 1970s and later released after 20 years in prison. By mapping the mentions to these inter-related entities, we obtain a highly coherent interpretation.

In Figure 5.1, the edges between mentions and candidate entities indicate mention-entity similarities, and the edges among candidate entities indicate entity-entity coherence. Obviously, for quantifying the strength of similarity and coherence, these edges should be weighted, which is not shown in the figure. Some edge weights are stronger than others, and these are the cues for inferring the proper mapping. In Figure 5.1, these indicative edges are drawn as thicker blue lines.

Algorithmic and learning-based methods for EL are based on these three components. EL is intensively researched and applied not just for KB construction, with comprehensive surveys by [523, 343, 370] and widely used benchmarks (e.g., [494]). This chapter discusses major families of methods and their building blocks.
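To make the interplay of the three signals concrete, here is a toy scoring sketch for ranking the candidate entities of a single mention; the numbers and the linear weighting are illustrative assumptions, and actual EL methods optimize such scores jointly over all mentions, often with learned weights.

def score(popularity, context_similarity, coherence, alpha=0.3, beta=0.4, gamma=0.3):
    # simple weighted combination of mention-entity popularity, context similarity,
    # and coherence with the other entities chosen for the input text
    return alpha * popularity + beta * context_similarity + gamma * coherence

candidates = {                            # hypothetical signal values for the mention "Carter"
    "Rubin Carter": (0.10, 0.70, 0.90),   # boxer: strong context and coherence with Bob Dylan
    "Jimmy Carter": (0.80, 0.20, 0.10),   # president: popular, but wrong context here
}
best = max(candidates, key=lambda e: score(*candidates[e]))
print(best)   # -> "Rubin Carter"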
The EL task has many variations and extensions. An important case is the treatment of out-of-KB entities: mentions that denote entities that are not (yet) included in the KB. This situation often arises with emerging entities, such as newly created songs or books, people or organizations that suddenly become prominent, and long-tail entities such as garage bands or small startups. In such cases, the EL method has an additional option to map a mention to null, meaning that none of the already known entities is a proper match. This may hold even if the KB has reasonable string matches for the name itself. For example, the Nigerian saxophonist Peter Udo, who played with the Orchestra Baobab, is not included in any major KB (to the best of our knowledge), but there are many matches for the string "Peter Udo", as this is also a German first name. A good EL method needs to calibrate its linking decisions to avoid spurious choices, and should map mentions with low confidence in being KB entities to null. Such long-tail entities may become candidates to be included later in the KB life-cycle. We will revisit this issue in Chapter 8 on KB curation, specifically Section 8.6.3.
The initial KB against which EL methods operate is inevitably incomplete, regarding both coverage of entities and coverage of different names for the known entities. The former is addressed by awareness of out-of-KB entities. The latter calls for grouping mentions into equivalence classes that denote the same entities. In NLP, this task is known as coreference resolution (CR); in the world of structured data, its counterpart is the entity matching (EM) problem (see Section 5.2).

The CR task is highly related to the EL setting, as illustrated by Figure 5.2. Here, the input text contains underdetermined phrases like "the album", "the singer" and "wife" (or say "his wife"). Longer texts will likely contain pronouns as well, such as "she", "her", "it", "they", etc. All these are not immediately linkable to KB entities. However, we can first aim to identify to which other mentions these coreferences refer, this way computing equivalence classes of mentions. In Figure 5.2, possible groupings are indicated by edges between mentions, and the correct ones are marked by thick lines in blue.

The ideal output would thus state that "Desire" and "the album" denote the same entity, and by linking one of the two mentions to the Bob Dylan album
Desire (album), EL covers both mentions. In general, however, the grouping and linking will be partial, meaning that some coreferences may be missed and some of the coreference groups may still be unlinkable – either because of remaining uncertainty or because the proper entity does not exist in the KB. Although this partial picture may look unsatisfying, it does give valuable information for KB construction and completion:
• Mentions in the same coreference group linked to a KB entity may be added as alias names, or simply textual cues, for an existing entity.
• Coreference groups that cannot be linked to a KB entity can be captured as candidates for new entities to be added, or at least reconsidered, later.
For example, if we pick up mentions like "Peter Udo", "the sax player", "the Nigerian saxophonist" and "he" as a coreference group, not only can we assert that he is not an existing KB entity, but we already have informative cues about what type of entity this is and even a gender cue.

Methods for coreference resolution over text inputs, and for coupling this with entity linking, can be rule-based (see, e.g., [471, 313, 143]), based on CRF-like graphical models (see, e.g., [142]) or based on neural learning (see, e.g., [100, 315, 272]). The latter benefits
from feeding large unlabeled corpora into the model training (via embeddings such as BERT; e.g., [272], see also Section 4.5).

Figure 5.2: Example for Combined Entity Linking and Coreference Resolution. The input text reads "Hurricane is on Bob's Desire. The album also contains a track about Sara, the singer's former wife."; the mentions Hurricane, Bob, Desire, "the album", Sara, "the singer" and "wife" are connected among each other (coreference candidates) and to candidate entities such as Hurricane (song), Hurricane cocktail, Bob Dylan, Robert Kennedy, Desire (album), USS Desire (ship), Sara Lownds, Mia Sara and Sarah Connor.

The CR task mostly focuses on short-distance mentions, often looking at successive sentences or paragraphs only. However, the task can be extended to compute coreference groups across distant paragraphs or even across different documents. This aims to mark up an entire corpus with equivalence classes of mentions, and partial linking of these classes to KB entities. The extended task is known as cross-document coreference resolution (CCR), first studied by [25]. Methods along these lines include inference with CRF-like graphical models (e.g., [534]) and hierarchical clustering (e.g., [144]).
Benefit of CR for KB construction:
Partial linking of mentions to KB entities helps coreference resolution by capturing distant signals. Conversely, having good candidates for coreference groups is beneficial for EL, as it provides richer context for individual mentions. In addition to these dual benefits, coreference groups can be important for extracting types and properties of entities, when the latter are expressed with pronouns or underdetermined phrases (e.g., "the album"). For example, when given the text snippet "Hurricane was not exactly a big hit. It is a protest song about racism.", we can infer the type protest song and its super-type song only with the help of CR. Likewise, in the example of Figure 5.2, we can acquire knowledge about Bob Dylan's ex-wife only when considering coreferences. So CR can improve extraction recall and thus the coverage of knowledge bases.
Computing equivalence classes over a set of observed entity mentions is also a frequent task in data cleaning and integration. For example, when faced with two or more semantically overlapping databases or other datasets (incl. web tables), we need to infer which records or table rows correspond to the same entity. This task of entity matching (EM) or duplicate detection, historically called record linkage, is a long-standing problem in computer science [140, 160].

In a nutshell, all EM methods leverage cues from comparing records in a matching-candidate pair:
• Name similarity: The higher the string similarity between two names, the higher the likelihood that they denote matching entities (e.g., strings like "Sara" and "Sarah" being close).
• Context similarity: As context of a database record or table cell that denotes an entity of interest, we should consider the full record or row, as the data in such proximity often denote related entities or salient attribute values. Additionally, it can be beneficial to contextualize the mention in a table cell by considering the other cells in the same column, for example, as signals for the specific entity type of a cell (e.g., when all values in a table column are song titles).
• Consistency constraints: When a pair of rows from two tables is matched, this rules out matching one of the two rows to a third one, assuming that there are no duplicates within each table. With duplicates or when we consider more than two tables as input, matchings need to satisfy the transitivity of an equivalence relation.
• Distant knowledge:
By connecting highly related entities, using a background KB, we can establish matching contexts despite low scores on string similarity. For example, a singer's name (e.g., Mick Jagger) and the name of his or her band (e.g., Rolling Stones) could be very different, but there is nevertheless a strong connection (e.g., in matching a Stones song, using either one or both of the related names).

These ingredients can be fed into matching rules generated from human-provided samples (e.g., [284, 533]), or used as input to supervised learning or probabilistic graphical models (e.g., [537, 40, 624, 475, 559, 291]) or neural networks (e.g., [405, 675]). Similar techniques can be applied also to spotting matching entity pairs across datasets in the Web of (Linked) Open Data [225, 385]; see the survey by [419] on this link discovery problem. For query processing, the problem takes the form of finding joinable tables and the respective join columns and values (see, e.g., [671, 318]).

As EM may operate over large databases with millions of records, it faces a scalability challenge: conceptually comparing all record pairs from two databases may entail quadratic complexity and is bound to be intractable in practice. To overcome this potential show-stopper, the input needs to be partitioned into manageable blocks, based on a variety of blocking techniques. The simplest approach is to partition the records of both sides by the name string of the entities of interest (e.g., person names, company names, or songs or books, etc.) or by one or more of the most informative attributes (e.g., birthdate, address, etc.). More advanced techniques pre-compute fingerprints that approximate string similarity scores, such as min-hash sketches for n-gram overlap (and edit distance) [60, 82], and use these for partitioning. Because of these techniques, blocking-based EM algorithms are also referred to as similarity joins or fuzzy joins. The actual EM step then applies more refined techniques to compare records between blocks in the same partition. State-of-the-art methods do not rely on a single partitioning, but use adaptive and iterative blocking, so as to reduce the false negative rate and boost the overall accuracy (e.g., [623, 135, 96]).

Entity matching between databases or other structured data repositories is of interest for KB construction when incorporating premium sources into an existing KB. For example, when adding entities from GeoNames to a Wikipedia-derived KB, the detection of duplicates (to be eliminated from the ingest) is essentially an EM task. There is ample literature on EM methods for integrating heterogeneous databases and, recently, in the context of so-called data lakes [385]. Therefore, we do not discuss methods in more depth and instead refer to surveys and best-practice papers by [287, 416, 94, 135, 96].

All EL methods are based on quantitative measures for mention-entity popularity, mention-entity similarity and entity-entity coherence. This section elaborates on these measures. Throughout the following, we assume that many entities in the KB, and especially the prominent ones, are uniquely identified also in Wikipedia. It is very easy for a KB to align its entities with Wikipedia articles. This holds for large KBs like Wikidata, even if the KB itself is not built from Wikipedia as a premium source. We will use Wikipedia articles as a background asset for various aspects of EL methods.
As most texts, like news, books and social media, are about prominent entities, the popularity of an entity determines a prior likelihood of it being selected by an EL algorithm. For example, the Web content about Elvis Presley is an order of magnitude larger than about the less known musician Elvis Costello. Therefore, when EL sees a mention "Elvis", the probability that this denotes
Elvis Presley is a priori much higher than for Elvis Costello. To measure the global popularity of entities, a variety of indicators can be used:
• the length of an entity's Wikipedia article (or "homepage" in domain-specific platforms such as IMDB for movies or Goodreads for books),
• the number of incoming links of the Wikipedia page,
• the number of page visits based on Wikipedia usage statistics,
• the amount of user activity, such as clicks or likes, referring to the entity in a domain-specific platform or in social media, and more.
While considering global popularity is useful, it can be misleading and insufficient as a prior probability. For example, for the mention "Trump", the most likely entity is Donald Trump, but for the mention "Donald", although Donald Trump is still a candidate, the more likely entity is
Donald Duck . This suggests that we should consider the combination of mentionand entity for estimating popularity.The most widely used estimator for mention-entity popularity is based on Wikipedialinks [383, 386], exploiting the observation that href anchor texts are often short names andthe pages to which they link are canonicalized entities already.
Link-based Mention-Entity Popularity:
The mention-entity popularity score for mention m and entity e is proportional to the occurrence frequency of hyperlinks with href anchor text "m" that point to the main page about e. This includes redirects within Wikipedia as well as interwiki links between different language editions.

Obviously, this works only for entities that have Wikipedia articles, and it hinges on sufficiently many href links pointing to these articles. On the other hand, the popularity score is useful only for prominent entities and will not be a good signal for long-tail entities anyway. Nevertheless, the approach can be generalized for larger coverage, by considering all kinds of Web pages with links to Wikipedia [550]. Further alternatives are to leverage query-click logs, by considering names in search-engine queries as mentions and subsequent clicks on Wikipedia articles or other kinds of homepages as linked entities.

The most obvious cue for inferring that mention m denotes entity e is to compare their surface strings. For m, this is given by the input text, for example, "Trump" or "President Trump". For e, we can consider the preferred (i.e., official or most widely used) label for the entity, such as "Donald Trump" or "Donald John Trump", but also alias names that are already included in the KB, such as "the US president". String similarity between names, like edit distance or n-gram overlap, can then score how good a match e is for m, this way ranking the candidate entities. In doing this, we can consider weights for tokens at the word or even character level. The weights can be derived from frequency statistics, in the spirit of IR-style idf weights. The weights and the similarity measure can be type-specific or domain-specific, dealing differently with, say, person names, organization acronyms, song titles, etc.

Beyond the basic comparison of m and names for e, a fairly obvious approach is to leverage the mention context, that is, the text surrounding m, and compare it to concise descriptions of entities or other kinds of entity contextualization. The mention context can take the form of a single sentence, single paragraph or entire document, possibly in weighted variants (e.g., weights decreasing with distance from m). The entity context depends on the richness of the existing KB. Entities can be augmented with descriptions taken, for example, from their Wikipedia articles (e.g., the first paragraph stating the most salient points). Alternatively, the types of entities, their Wikipedia categories and other prominently appearing entities in a Wikipedia article (i.e., outgoing links) can be used for a contextualized representation. In essence, we create a pseudo-document for each candidate entity, and we may even expand these by external texts such as news articles about entities, or highly related concepts for the entity types using WordNet and other sources.

In the example of Figure 5.1, the mention "Hurricane" has words like "track" and "played" in its proximity, and the mention "Washington" is accompanied by the word "film". These should be compared to entity contexts with words like song, music etc. for Hurricane (song) versus beverage, alcohol, rum etc. for
Hurricane (cocktail).

The following are some of the widely used measures for scoring the mention-entity context similarity.

Bag-of-Words Context Similarity:
Both m and e are represented as tf-idf vectors derived from bags of words (BoW). Their similarity is the scalar product or cosine between these vectors:
$$sim(cxt(m), cxt(e)) = \overrightarrow{BoW(cxt(m))} \cdot \overrightarrow{BoW(cxt(e))}$$
or analogously for cosine. Some variants restrict the BoW representation to informative keywords from both contexts, using statistical and entropy measures to identify the keywords.

A generalization of keyword-based contexts is to focus on characteristic keyphrases [238]: multi-word phrases that are salient and specific for candidate entities. Keyphrase candidates can be derived from entity types, category names, href anchor texts in an entity's Wikipedia article, and other sources along these lines. For example, Rubin Carter (as a candidate for the mention "Carter" in Figure 5.1) would be associated with keyphrases such as "African-American boxer", "people convicted of murder", "overturned convictions", "racism victim", "nickname The Hurricane" etc. Such phrases can be gathered from noun phrases n in the Wikipedia page (or other entity description) for e, and then scored and filtered by criteria like pointwise mutual information (PMI):
$$weight(n \mid e) \sim \log \frac{P[n, e]}{P[n] \cdot P[e]}$$
or other information-theoretic entropy measures, with probabilities estimated from (co-)occurrence frequencies (in Wikipedia). Intuitively, the best keyphrases for entity e should frequently co-occur with e, but should not be globally frequent for all entities. The context of e, cxt(e), then becomes the weighted set of keyphrases n for which weight(n|e) is above some threshold.

Comparing a set of phrases against the mention context is a bit more difficult than for the Bag-of-Words representations. The reason is that exact matches of multi-word phrases are infrequent, and we need to pay attention to similar phrases with some words missing, different word order, and other variations. [238] proposed a word-proximity-aware window-based model for such approximate matches. For example, the keyphrase "racism victim" can be matched by "victim in a notorious case of racism".

Keyphrase Context Similarity:
Representing cxt(e) as a weighted set KP of keyphrases n, and cxt(m) as a sequence of words conceptually broken down into a set W of (overlapping) small text windows of bounded length, the similarity is computed by aggregating the following scores:
• for each n ∈ KP identify the best matching window ω ∈ W;
• for each such ω compute the (sub-)set of words w ∈ n ∩ ω and their maximum positional distance δ in ω;
• aggregate the word matches for n in ω with consideration of δ and the entity-specific weight for w (treating w as if it were a keyphrase by itself);
• aggregate these per-keyphrase scores over all n ∈ KP.

This template can be varied and extended in a number of ways.
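One possible rendering of this template in Python; the coverage and distance discounts below are simplified stand-ins for the proximity-aware scoring of [238]:

```python
def keyphrase_context_similarity(keyphrases, context_tokens, window_size=10):
    """keyphrases: dict mapping a phrase (tuple of words) to its entity-specific
    weight. Slides windows over the mention context, finds the best window per
    keyphrase, and discounts partial matches by their positional spread delta."""
    windows = [context_tokens[i:i + window_size]
               for i in range(max(1, len(context_tokens) - window_size + 1))]
    total = 0.0
    for phrase, weight in keyphrases.items():
        best = 0.0
        for window in windows:
            positions = [i for i, tok in enumerate(window) if tok in phrase]
            if not positions:
                continue
            delta = max(positions) - min(positions) + 1   # positional spread
            coverage = len({window[i] for i in positions}) / len(phrase)
            best = max(best, weight * coverage / delta)
        total += best    # aggregate per-keyphrase scores over all n in KP
    return total
```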
With the advent of latent embeddings (see Subsection 4.5), both BoW and keyphrase models may seem to be superseded by word2vec-like and other kinds of embeddings, which implicitly capture also synonyms and other strongly related terms. However, at the word level alone, the embeddings are susceptible to over-generalization and drifting focus. For example, the word embedding for “Hurricane” brings out highly related terms that have nothing to do with the song or the boxer (e.g., “storm”, “damage”, “deaths” etc.). So it is important to use entity-centric embeddings like the ones for wikipedia2vec [638], and ideally, these should reflect multi-word phrases as well.

Embedding-based Context Similarity:

With embedding vectors for cxt(m) and cxt(e), the context similarity between m and e is cos(cxt(m), cxt(e)).

Each of these context-similarity models has its sweet spots as well as limitations. Choosing the right model thus depends on the topical domain (e.g., business vs. music) and the language style of the input texts (e.g., news vs. social media).

Whenever an input text contains several mentions, EL methods should compute their corresponding entities jointly, based on the principle that co-occurring mentions usually map to semantically coherent entities. To this end, we need to define measures for entity-entity relatedness that capture this notion of coherence.

One of the most powerful and surprisingly simple measures exploits the rich link structure of Wikipedia. The idea is to consider two entities as highly related if the hyperlink sets of their Wikipedia pages have a large overlap. More specifically, the incoming links are a good signal. For example, two songs like “Hurricane” and “Sara” are highly related because they are both linked to from the articles about
Bob Dylan, Desire (album) and more (e.g., the other involved musicians). Likewise,
Bob Dylan and
Elvis Presley are notably related as there are quite a few Wikipedia categories that link to both of them. Intuitively, in-links are considered more informative than out-links, as outgoing links tend to refer to more general entities and concepts whereas incoming links often have the direction from more general to more specific. These considerations have given rise to the following link-based definition of entity-entity coherence.
Link-based Entity-Entity Coherence:
For two entities e and f with Wikipedia articles that have incoming-link sets In(e) and In(f), their coherence score is

  1 − [ log(max{|In(e)|, |In(f)|}) − log(|In(e) ∩ In(f)|) ] / [ log(|U|) − log(min{|In(e)|, |In(f)|}) ]

where U is the total set of known entities (e.g., Wikipedia articles about named entities).

This approach was pioneered by [386], and is thus sometimes referred to as the Milne-Witten metric. Similarly to link-based popularity measures, we can generalize this Wikipedia-centric link-overlap model to other settings. In a search engine’s query-and-click log, entities whose pages both appear in the clicked-pages set for the same queries (so-called “co-clicks”) should have a high relatedness score. In domain-specific content portals and web communities, such as Goodreads and LibraryThing for books or IMDB and Rottentomatoes for movies, the overlap of the user sets who expressed liking the same entity is a good measure for entity-entity relatedness. All these can be seen as instantiations of the strong co-occurrence principle (see Subsection 4.2).
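A minimal sketch of this score, assuming the incoming links are available as Python sets; the clipping at zero for unrelated entities is a common convention, not part of the formula above:

```python
import math

def link_coherence(in_e, in_f, num_entities):
    """Milne-Witten-style coherence from incoming-link sets in_e, in_f;
    num_entities is |U|, the total number of known entities."""
    overlap = len(in_e & in_f)
    if overlap == 0:
        return 0.0   # disjoint in-link sets: treat as unrelated
    numerator = math.log(max(len(in_e), len(in_f))) - math.log(overlap)
    denominator = math.log(num_entities) - math.log(min(len(in_e), len(in_f)))
    return max(0.0, 1.0 - numerator / denominator)
```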
When entities are represented as weighted sets of keywords (BoW) or weighted sets of keyphrases (KP), their relatedness can be captured by measures for the (weighted) overlap of these sets.

BoW-based and KP-based Entity-Entity Coherence:

Consider two entities e and f with associated keyword sets E and F whose entries x have weights w_E(x) and w_F(x), respectively. The coherence between e and f can be measured by the weighted Jaccard metric:

  Σ_{x ∈ E ∩ F} min{w_E(x), w_F(x)} / max{w_E(x), w_F(x)}

The extension to keyphrases is more sophisticated. It needs to consider also partial matches between keyphrases for e and for f (e.g., “rock and roll singer” vs. “rock ’n’ roll musician”), following the same ideas as the KP-based context similarity model (see [238] for details).
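In code, this weighted-overlap coherence is a one-liner over the shared keywords (a sketch; the keyword sets are assumed to be dicts from keyword to weight):

```python
def weighted_jaccard_coherence(w_e, w_f):
    """Weighted-overlap coherence between two keyword-to-weight dicts,
    summing the per-keyword min/max ratio over the shared keywords."""
    return sum(min(w_e[x], w_f[x]) / max(w_e[x], w_f[x])
               for x in w_e.keys() & w_f.keys())
```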
Analogously to the context-similarity aspect, embeddings are a strong alternative for entity-entity coherence as well. They are straightforward to apply.

Embedding-based Entity-Entity Coherence:

With embedding vectors for entities e and f, their relatedness for EL coherence is cos(e, f).

All these purely text-based coherence measures – keywords, keyphrases, embeddings – have the advantage that they can be computed solely from textual entity descriptions. There is no need for Wikipedia-style links, and not even for any relations between entities. This setting has been called EL with a linkless KB in [336]. That prior work, which predates embedding-based methods, made use of latent topic models (in the style of LDA [48]) for linkless EL. Today, word2vec-style embeddings or even BERT-like language models [118] (see Section 4.5) seem to be the more powerful choice, but they all fall into the same architecture presented above.

Yet another way of defining and computing entity-entity relatedness is by means of random walks over an existing knowledge graph (e.g., the link structure of Wikipedia), see, for example, [197]. Here each entity is represented by the (estimated) probability distribution of reaching related entities by random walks with restart. The relatedness score between entities can be defined as the relative entropy between two distributions.
The above are major cases within a wider space of measures for entity-entity coherence. A good discussion of the broader design space and empirical comparisons can be found in [557, 75, 455]. Ultimately, however, the best choice depends on the domain of interest (e.g., business vs. music) and the style of the ingested texts (e.g., news vs. social media).
All EL methods aim to optimize a scoring or ranking function that maximizes a combination of mention-entity popularity, mention-entity context similarity and entity-entity coherence. This can be formalized as follows.
EL Optimization Problem:
Consider an input text with entity mentions M = {m_1, m_2, ...}, each of which has entity candidates E(m_i) = {e_i1, e_i2, ...}, together forming a pool of target entities E = {e_1, e_2, ...}. The goal is to find a, possibly partial, function φ: M → E that maximizes the objective

  α · Σ_{m} pop(m, φ(m))
  + β · Σ_{m} sim(cxt(m), cxt(φ(m)))
  + γ · Σ_{e,f} { coh(e, f) | ∃ m, n ∈ M: m ≠ n, e = φ(m), f = φ(n) }

where α, β, γ are tunable hyper-parameters, pop denotes mention-entity popularity, cxt the context of mentions and entities, sim the contextual similarity, and coh the pair-wise coherence between entities.
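The objective itself is easy to state in code. The following sketch scores a given (possibly partial) mapping φ; pop, sim, coh and cxt are assumed to be supplied as functions, e.g., the measures defined earlier in this section:

```python
def el_objective(phi, pop, sim, coh, cxt, alpha, beta, gamma):
    """Score a (possibly partial) mention-to-entity mapping phi: dict m -> e."""
    score = alpha * sum(pop(m, e) for m, e in phi.items())
    score += beta * sum(sim(cxt(m), cxt(e)) for m, e in phi.items())
    mentions = list(phi)
    for i, m in enumerate(mentions):
        for n in mentions[i + 1:]:          # all pairs of distinct mentions
            score += gamma * coh(phi[m], phi[n])
    return score
```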
In principle, coherence could even be incorporated over the set of all entities in the image of φ, but the practical sweet spot is to break this down into pair-wise terms which allow more robust estimators.

Combinatorially, the function φ is the solution of a combined selection-and-assignment problem: selecting a subset of the entities and assigning the mentions onto them. We aim for the globally best solution, with a choice for φ that bundles the linking of all mentions together. Algorithmically, this optimization opens up a wide variety of design choices: from unsupervised scoring to graph-based algorithms all the way to neural learning.

The basic choice, found in the classical works of [120, 67, 383, 103, 386, 379], is to view EL as a local optimization problem (notwithstanding its global nature): for each mention in the input text, we compute its best-matching entity from a pool of candidates. This is carried out for each mention independently of the other mentions, hence the adjective local. By this restriction, coherence is largely disregarded, but some aspects can still be incorporated via clever ways of contextualization. Most notably, the method of [386] first identifies all unambiguous mentions and expands their contexts by (the descriptions of) their respective entities. With this enriched context, the method then optimizes a combination of popularity and similarity (i.e., setting γ to zero in the general problem formulation). Generalized techniques for enhancing mention context and entity context via document retrieval have been devised by [336]. The literature on EL contains further techniques along these lines.

These approaches can be used to score and rank mention-entity pairs, but can likewise be broken down into their underlying scoring components as features for learning a ranker. Such methods have been explored using support vector machines and other kinds of classifiers, picking the entity with the highest classification confidence (e.g., [67, 386, 138, 525, 312]). Alternatively, more advanced learning-to-rank (LTR) regression models [350, 329] have been investigated (e.g., [669, 477, 197]), for example, with learning from pairwise preferences between entity candidates for the same mention. Note that these supervised learning methods hinge on the availability of a labeled training corpus where mentions are marked up with ground-truth entities.

By factoring entity-entity relatedness scores into the context of candidate entities, context-similarity methods already go some way in considering global coherence, examples being [103, 104]. Nevertheless, more powerful EL methods make decisions on mapping mentions to entities jointly for all mentions of the input text. This family of methods is referred to as
Collective Entity Linking. Many of these can be seen as operating on an EL candidate graph.

For an entity linking task, the
EL Candidate Graph consists of
• a set M of mentions and a set E of entities as nodes,
• a set ME of mention-entity edges with weights derived from popularity and similarity scores, and
• a set EE of entity-entity edges with weights derived from relatedness scores between entities.

Figure 5.1, in Section 5.1, showed an example for such a candidate graph, with edge weights omitted.

[161] proposed a relatively simple but effective and highly efficient approach for incorporating entity-entity edge weights into the scoring of mention-entity edges. Given a candidate mapping m → e, its score is augmented by the coherence of e with all candidate entities for all other mentions in the graph:

  score(m → e) = · · · + Σ_{n ≠ m} Σ_{f: (n,f) ∈ ME} sim(n, f) · coh(e, f)

The intuition here is that a candidate e for m is rewarded if e has highly weighted edges or paths with all other entities in the entire candidate graph.

Obviously, the graph still contains spurious entities, but we expect those to have low coherence with others anyway. Conversely, entities for unambiguous mentions and entities with tight connections to many others in the graph have a strong influence on the decisions for all mentions. Hence the collective flavor of the method.

Motivated by these considerations, a generalization is to consider entire subgraphs that connect the best cues for joint mappings. This leads to a powerful framework, although it entails more expensive algorithms. More specifically, we are interested in identifying dense subgraphs in the candidate graph such that there is at most one mention-entity edge for each mention (or exactly one if we insist on linking all mentions). Density refers to high edge weights, where both mention-entity weights and entity-entity weights are aggregated over the subgraph. We assume the weights of the two edge types are calibrated before being combined (e.g., via hyper-parameters like α, β, γ).

EL based on Dense Subgraph:
Given a candidate graph with nodes M ∪ E and weighted edges ME ∪ EE, the goal is to compute the densest subgraph S, with nodes S.M ⊆ M and S.E ⊆ E and edges S.ME ⊆ ME and
S.EE ⊆ EE, maximizing

  aggr{ weight(s) | s ∈ S.ME ∪ S.EE }

subject to the constraint that for each m ∈ M there is at most one e ∈ S.E such that (m, e) ∈ S.ME. In this objective, aggr denotes an aggregation function over edge weights, a natural choice being summation (or the sum normalized by the number of nodes or edges in the subgraph).

Note that the resulting subgraph is not necessarily connected, as a text may be about different groups of entities, related within a group but unrelated across. However, this situation should be rare, and the method could enforce a connected subgraph. Also, it may be desirable to have identical mentions in a document mapped to the same entity (e.g., all mentions “Carter” linked to the same person), at least when the text is not too long. This can be enforced by an additional constraint.
Figure 5.3: Example for Dense Subgraphs in an EL Candidate Graph (mentions “Bob”, “Hurricane”, “Carter” with candidate entities Hurricane (song), Rubin Carter, Jimmy Carter, Bob Dylan and Robert Kennedy; edges carry weights)
Figure 5.3 illustrates this subgraph-based method with a simple example: 3 mentions and 5 candidate entities, with one mention being unambiguous. There are two subgraphs with high values for their total edge weight, shown in blue and red. The one in blue corresponds to the ground-truth mapping (assuming the context is still the Bob Dylan song), with a total weight of 30. The one in red is an alternative solution (centered on the two prominent politicians, one of which was a strong advocate against racism), with a total edge weight of 31. Both are substantially better than other choices, such as mapping the three mentions to Robert Kennedy, the song and Rubin Carter, which has a total edge weight of 23. However, the best subgraph by using sum for weight aggregation is the wrong output in red, albeit by a tiny margin only.

This observation motivates the following alternative choice for the aggregation function aggr: within the subgraph of interest, instead of paying attention only to the total weight, we consider the weakest link, that is, the edge with the lowest weight. The goal then is to maximize this minimum weight, making the weakest link as strong as possible. Using this objective, the best subgraph in Figure 5.3 is the one with Bob Dylan, the song and Rubin Carter, as the lowest edge weight in the respective subgraph is 3, whereas it is 2 for the alternative subgraph with Robert Kennedy, the song and Jimmy Carter.
EL based on Max-Min Subgraph:
For a given candidate graph, the goal is to compute the subgraph S that maximizes

  min{ weight(s) | s ∈ S.ME ∪ S.EE }

subject to the constraint that for each m ∈ M there is at most one e ∈ S.E such that (m, e) ∈ S.ME.

Both variants of the dense-subgraph approach are NP-hard, due to the constraint about mention-entity edges. However, there are good approximation algorithms, including greedy removal of weak edges as well as stochastic search to overcome local optima.
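One possible greedy rendering of this idea in Python, simplified for illustration (it is not the exact procedure of the works cited below): repeatedly drop the entity with the smallest total attached weight, as long as every mention keeps at least one candidate. For the max-min variant, one would instead remove the entity whose weakest attached edge is smallest, keeping track of the best subgraph seen so far.

```python
def greedy_dense_subgraph_el(mention_edges, entity_edges):
    """mention_edges: dict (m, e) -> weight; entity_edges: dict (e, f) -> weight.
    Greedily removes the entity node with the smallest total attached weight,
    never removing the last remaining candidate of any mention."""
    entities = {e for (_, e) in mention_edges}
    mentions = {m for (m, _) in mention_edges}

    def candidates(m):
        return [e for (mm, e) in mention_edges if mm == m and e in entities]

    def attached_weight(e):
        w = sum(wt for (_, f), wt in mention_edges.items() if f == e)
        w += sum(wt for (a, b), wt in entity_edges.items()
                 if (a == e and b in entities) or (b == e and a in entities))
        return w

    while True:
        removable = [e for e in entities
                     if all(e not in candidates(m) or len(candidates(m)) > 1
                            for m in mentions)]
        if not removable:
            break
        entities.discard(min(removable, key=attached_weight))

    # map each mention to its best surviving candidate
    return {m: max(candidates(m), key=lambda e: mention_edges[(m, e)])
            for m in mentions}
```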
Using the same EL candidate graph as before, an alternative approach that also has efficientimplementations, is based on random walks with restarts , essentially the same principlethat underlies
Personalized Page Rank [219]. First, edge weights in the candidate graph are re-calibrated to become proper transition probabilities, and a small restart probability is chosen to jump back to the starting node of a walk. Conceptually, we initiate such walks on each of the mentions, making probabilistic decisions for traversing both mention-entity and entity-entity edges many times, and occasionally jumping back to the origin. In the limit, as the walk length approaches infinity, the visiting frequencies of the various nodes converge to stationary visiting probabilities, which are then interpreted as scores for mapping mentions to entities. An actual implementation would bound the length of each walk, but walk repeatedly to obtain samples towards better approximation. Alternatively, iterative numeric algorithms from linear algebra, most notably Jacobi iteration, can be applied to the transition matrix of the graph, until some convergence criterion is reached for the best entity candidates of every mention.
EL based on Random Walks with Restart:
Given a candidate graph with weights, re-calibrate the weights into proper transition probabilities. For each mention m ∈ M
• approximate the visiting probabilities of the possible target entities e ∈ E with (m, e) ∈ ME, and
• map m to the entity e with the highest probability.

Although these algorithms make linking decisions one mention at a time, they do capture the essence of collective EL, as the walks involve the entire candidate graph and the stationary visiting probabilities take the mutual coherence into account. EL methods based on random walks and related techniques include, for example, [399, 196, 451, 190].
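A compact numeric rendering of this scheme, assuming NumPy and a row-stochastic transition matrix over all mention and entity nodes (a sketch with hypothetical inputs, not a specific system's implementation):

```python
import numpy as np

def visiting_probabilities(P, start, restart=0.15, iterations=100):
    """Stationary visiting probabilities of a random walk with restart.
    P: (n x n) row-stochastic transition matrix over the candidate graph;
    start: index of the mention node where the walk restarts."""
    n = P.shape[0]
    r = np.zeros(n)
    r[start] = 1.0
    pi = r.copy()
    for _ in range(iterations):   # power iteration until (approximate) convergence
        pi = (1.0 - restart) * (P.T @ pi) + restart * r
    return pi   # the mention is mapped to its candidate with the highest pi value
```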
The coherence-aware graph-based methods can also be cast into probabilistic graphical models, like CRFs and related models. They can be seen as reasoning over a joint probability distribution

  P[m_1, m_2, ..., e_1, e_2, ..., d]

with mentions m_i, entities e_j and the context given by document d. This denotes the likelihood that a document d contains entities e_1, e_2, ... and that these entities are textually expressed in the form of mentions m_1, m_2, .... Obviously, this high-dimensional distribution is not tractable. So it is factorized by making model assumptions and mathematical transformations, such as

  P[m_1, m_2, ..., e_1, e_2, ..., d] = Π_{i,j} P[m_i | e_j, d] · Π_{j,k} P[e_j, e_k]

where P[m_i | e_j, d] is the probability of m_i expressing e_j in the context of d, and P[e_j, e_k] is the probability of the two entities co-occurring in the same, semantically meaningful document.

This kind of probabilistic reasoning can be cast into a CRF model or factor graph (cf. Section 4.4) as follows.

CRF Model for Entity Linking:

For each mention m_i with entity candidates E_i = {e_i1, e_i2, ...}, the model has a random variable X_i with values from E_i. These variables capture the probabilities P[m_i | e_j]. For each candidate entity e_k (for any of the mentions), the model has a binary random variable Y_k that is true if e_k is mentioned in the document and false otherwise. These variables capture probabilities P[e_k | d] of entity occurrence in the document. All variables are assumed to be conditionally independent, except for the following coupling factors:
• X_i, Y_j are coupled if e_j is a candidate for m_i,
• Y_j, Y_k are coupled for all pairs e_j, e_k.

Figure 5.4 depicts an example, showing how the candidate graph for our running example can be cast into the CRF structure with variables for each mention and entity node, and coupling factors for each of the edges.

Unlike the CRF models for NER, discussed in Section 4.4, this kind of CRF does not operate on token sequences but on graph-structured input. So it falls under a more general class of Markov Random Fields (MRF), but the approach is nevertheless widely referred to as a CRF model. The joint distribution of all X_i and Y_j variables is factorized, by the Markov assumption, according to the cliques in the graph. This way, we obtain one “clique potential” or “coupling factor” per clique, often with restriction to binary cliques, coupling two variables (i.e., the edges in the graph). Training such a model entails learning per-clique weights (or estimating conditional probabilities for the coupled variables), typically using gradient-descent techniques. Alternatively, similarity and coherence scores can be used for these purposes, at least for a good initialization of the gradient-descent optimization.
Figure 5.4: Example for a CRF derived from the EL candidate graph (variables X1–X3 for the mentions “Bob”, “Hurricane”, “Carter”, and Y1–Y5 for the candidate entities Hurricane (song), Rubin Carter, Jimmy Carter, Bob Dylan and Robert Kennedy)

Inference on the values of variables, given a new text document as input, is usually limited to joint MAP inference: computing the combination of variable values that has the highest posterior likelihood. This involves Monte Carlo sampling, belief propagation, or variational calculus (cf. Section 4.4 on CRFs for NER).

CRF-based EL was first developed by [297], with a variety of enhancements in follow-up works such as [142, 178, 422].

Many CRF-like models can also be cast into
Integer Linear Programs (ILP): a discrete optimization problem with constraints [500, 498].
ILP Model for Entity Linking:
For each mention m_i with entity candidates E_i = {e_i1, e_i2, ...}, the model has a binary decision variable X_ij set to 1 if m_i truly denotes e_j. For each pair of candidate entities e_k, e_l (for any of the mentions), the model has a binary decision variable Y_kl set to 1 if e_k and e_l are indeed both mentioned in the input text.

The objective function for the ILP is to maximize the data evidence for the choice of 0-1 values for the X_ij and Y_kl variables:

  maximize  β · Σ_{i,j} weight(m_i, e_j) · X_ij + γ · Σ_{k,l} weight(e_k, e_l) · Y_kl

with weights corresponding to similarity and coherence scores and hyper-parameters β, γ.

The constraints specify that mappings are functions, couple the X_ij and Y_kl variables, and may optionally capture transitivity among identical mentions or (obvious) coreferences, if desired:
• Σ_j X_ij ≤ 1 for each mention m_i;
• Y_kl ≥ X_ik + X_jl − 1, stating that Y_kl must be 1 if both e_k and e_l are chosen as mapping targets;
• X_ik ≤ X_jk and X_ik ≥ X_jk for all k, for identical mentions m_i, m_j;
• X_ij ∈ {0, 1} and Y_kl ∈ {0, 1} for all i, j, k, l.

Solving such an ILP is computationally expensive: NP-hard in the worst case and also costly in practice for large instances. However, there are very efficient ILP solvers, such as Gurobi, which can handle reasonably sized inputs such as short news articles with tens of mentions and hundreds of candidate entities. Larger inputs could have their candidate space pruned first by other, simpler, techniques. Moreover, ILPs can be relaxed into LPs, linear programs with continuous variables, followed by randomized rounding. Often, this yields very good approximations for the discrete optimization (see also [297]).
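As an illustration, here is a sketch of this program using the open-source PuLP modeler (the choice of PuLP is ours; Gurobi, mentioned above, would be a production-grade alternative). Note that for a maximizing solver with positive coherence weights, the Y variables additionally need upper-bound couplings, so that coherence rewards can only be collected when the corresponding entities are actually selected:

```python
from pulp import LpProblem, LpMaximize, LpVariable, LpBinary, lpSum

def el_ilp(mention_weights, entity_weights, beta=1.0, gamma=1.0):
    """mention_weights: dict (i, j) -> weight(m_i, e_j);
    entity_weights: dict (k, l) -> weight(e_k, e_l).
    Returns the selected mention-entity pairs (i, j)."""
    prob = LpProblem("entity_linking", LpMaximize)
    X = {ij: LpVariable(f"x_{ij[0]}_{ij[1]}", cat=LpBinary) for ij in mention_weights}
    Y = {kl: LpVariable(f"y_{kl[0]}_{kl[1]}", cat=LpBinary) for kl in entity_weights}

    prob += (beta * lpSum(w * X[ij] for ij, w in mention_weights.items())
             + gamma * lpSum(w * Y[kl] for kl, w in entity_weights.items()))

    for i in {i for (i, _) in mention_weights}:      # mappings are functions
        prob += lpSum(X[(i2, j)] for (i2, j) in mention_weights if i2 == i) <= 1

    selectors = {}                                   # entity -> X variables choosing it
    for (i, j) in mention_weights:
        selectors.setdefault(j, []).append(X[(i, j)])
    for (k, l) in entity_weights:
        xs_k, xs_l = selectors.get(k, []), selectors.get(l, [])
        prob += Y[(k, l)] <= lpSum(xs_k)             # upper-bound couplings
        prob += Y[(k, l)] <= lpSum(xs_l)
        for xk in xs_k:                              # lower-bound coupling as above
            for xl in xs_l:
                prob += Y[(k, l)] >= xk + xl - 1

    prob.solve()
    return [ij for ij, var in X.items() if var.value() == 1]
```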
Early work on EL (e.g., [67, 386, 138, 477, 525, 104, 312]) already pursued machine learning for ranking entity candidates, building on labeled training data in the form of ground-truth mention-entity pairs in corpora (most notably, Wikipedia articles or annotated news articles). These methods used support vector machines, logistic regression and other learners, all relying on feature engineering, with features about mention contexts and a suite of cues for entity-entity relatedness. More recently, with the advent of deep neural networks, these feature-based learners have been superseded by end-to-end architectures without feature modeling. However, these methods still, and perhaps even more strongly, hinge on sufficiently large collections of training samples in the form of correct mention-entity pairs.

Recall from Section 4.4 that neural networks require real-valued vectors as input. Thus, a key point in applying neural learning to the EL problem is the embeddings of the inputs: mention context (both short-distance and long-distance), entity description and most strongly related entities, and more. This neural encoding is already a learning task by itself, successfully addressed, for example, by [169, 639, 640, 200]. The jointly learned embeddings are fed into a deep neural network, with a variety of architectures like LSTM, CNN, Feed-Forward, Attention learning, Transformer-style, etc. The output of the neural classifier is a scoring of entity candidates for each mention. For end-to-end training, a typical choice for the loss function is softmax over the cross-entropy between predictions and ground-truth distribution. Figure 5.5 illustrates such a neural EL architecture. As embedding vectors are fairly restricted in size, mention contexts can be captured at different scopes: short-distance like sentences as well as long-distance like entire documents. By the nature of neural networks, the “cross-talk” between mentions and entities and among entities is automatically considered, capturing similarity as well as coherence.
Figure 5.5: Illustration of a Neural EL Architecture (mention-context embeddings, e.g., for “Hurricane, about Carter, is one of Bob’s tracks ...”, and entity embeddings, e.g., for Jimmy Carter, Rubin Carter, Bob Dylan, Robert Kennedy, are fed through neural network layers – LSTM, CNN, FFN or others – which output scores of entity candidates)
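A minimal sketch of the feed-forward variant of such a scorer, assuming PyTorch; the architecture and dimensions are illustrative, not taken from a specific system, and LSTM/CNN context encoders and entity-entity cross-talk are omitted:

```python
import torch
import torch.nn as nn

class NeuralELScorer(nn.Module):
    """Scores candidate entities for one mention by feeding the concatenation
    of the mention-context embedding and each entity embedding through an FFN."""
    def __init__(self, ctx_dim, ent_dim, hidden_dim=256):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(ctx_dim + ent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, ctx_emb, cand_embs):
        # ctx_emb: (ctx_dim,); cand_embs: (num_candidates, ent_dim)
        ctx = ctx_emb.unsqueeze(0).expand(cand_embs.size(0), -1)
        return self.ffn(torch.cat([ctx, cand_embs], dim=-1)).squeeze(-1)

# end-to-end training uses softmax cross-entropy over the candidate scores:
# scores = scorer(ctx_emb, cand_embs)
# loss = nn.functional.cross_entropy(scores.unsqueeze(0), gold_idx.unsqueeze(0))
```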
Neural networks for EL are trained end-to-end (e.g., [151, 282, 406, 518, 252]) on labeled corpora where mentions are marked up with their proper entities, using gradient descent techniques. Some of these methods integrate EL with the NER task, jointly spotting mentions and linking them. In the literature, Wikipedia full-text with hyperlink targets as ground truth provides ample training samples. However, the articles all follow the same encyclopedic style. Therefore, the learned models, albeit achieving excellent performance on withheld Wikipedia articles, do not easily carry over to text with different stylistic characteristics and neither to domain-specific settings such as biomedical articles or health discussion forums. Another large resource for training is the
WikiLinks corpus [535], which comprises Web pages with links to Wikipedia articles. This captures a wider diversity of text styles, but the ground-truth labels have been compiled automatically, hence containing errors.

Overall, it seems that neural EL is not yet as mature and successful as its neural counterparts for NER (see Section 4.4), as it is easier to obtain training data for NER. Neural EL shines when trained with large labeled collections and the downstream texts to which the learned linker is applied have the same general characteristics. When training samples are scarce or the use-case data characteristics substantially deviate from the training data, it is much harder for neural EL to compete with feature-based unsupervised methods.

Very recently, methods for transfer learning have been integrated into neural EL (e.g., [357, 629]). These methods are trained on one labeled collection, but applied to a different collection which does not have any labels and has a disjoint set of target entities. A major asset to this end is the integration of large-scale language embeddings, like BERT [118], which covers both training and target domains. The effect is an implicit capability of “reading comprehension”, which latently captures relevant signals about context similarity and coherence. Transfer learning still seems a somewhat brittle approach, but such methods will be further advanced, leveraging even larger language embeddings, such as GPT-3 based on a neural network with over 100 billion parameters [62].

Regardless of future advances along these lines, we need to realize that EL comes in many different flavors: for different domains, text styles and objectives (e.g., precision vs. recall). Therefore, flexibly configurable, unsupervised methods with explicit feature engineering will continue to play a strong role. This includes methods that require tuning a handful of hyper-parameters, which can be done by domain experts or using a small set of labeled samples.
In addition to text documents, KB construction also benefits from tapping semi-structured contents with lists and tables. In the following, we focus on the case of ad-hoc web tables as input, to exemplify EL over semi-structured data. Table 5.1 shows an example with ambiguous mentions such as “Elvis”, “Adele”, “Columbia”, “RCA” as well as abbreviated or slightly misspelled names (e.g., “Pat Garrett” should be the album
Pat Garrett & Billy the Kid).

  Name      | Title                     | Album       | Label    | Year
  Bob Dylan | Hurricane                 | Desire      | Columbia | 1976
  Bob Dylan | Sara                      | Desire      | Columbia | 1976
  Bob Dylan | Knockin on Heavens Door   | Pat Garrett | Columbia | 1973
  Elvis     | Cant Help Falling in Love | Blue Hawaii | RCA      | 1961
  Adele     | Make You Feel My Love     | n/a         | XL       | 2008

Table 5.1: Example for an entity linking task over web tables

Note that such tables are usually surrounded by text – within web pages, with table captions, headings etc. – which can be harnessed as additional context.

From a traditional database perspective, it seems that the best cue for EL over tables is to exploit the table schema, that is, column headers and perhaps inferrable column types. However, these tables are very different from well-designed databases: they are hand-crafted in an ad-hoc manner, and their column names are often not exactly informative (e.g., “Name”, “Title”, “Label” are very generic). Thus, it seems that the EL problem is much harder for tables. However, we can leverage the tabular structure to guide the search for the proper entities. Specifically, we pay attention to same-row mentions and same-column mentions:
• Same-row mentions are most tightly related. So their coherence should be boosted in the objective function.
• Same-column mentions are not directly related, but they are typically of the same type, such as musicians, songs, music albums and record labels. So the objective function should incorporate a soft constraint for per-column homogeneity.

By taking these design considerations into account, the EL optimization can be varied as follows.

EL Optimization for Web Tables:
Consider a table with c columns, r rows and entity mentions m_ij, where i, j are the row and column where the mention occurs. Each m_ij has a set of entity candidates E(m_ij). All mentions together are denoted as M, and the pool of target entities overall as E. For ease of notation, we assume that all table cells are entity mentions (i.e., disregarding the fact that some columns are about literal values). The goal is to find a, possibly partial, function φ: M → E that maximizes the objective

  α · Σ_{m_ij ∈ M} pop(m_ij, φ(m_ij))
  + β_1 · Σ_{m_ij ∈ M} sim(rowcxt(m_ij), cxt(φ(m_ij)))
  + β_2 · Σ_{m_ij ∈ M} sim(doccxt(m_ij), cxt(φ(m_ij)))
  + γ · Σ_{e,f ∈ E} { coh(e, f) | m_ij, m_ik ∈ M: j ≠ k, e = φ(m_ij), f = φ(m_ik) }
  + δ · Σ_{j=1..c} hom{ type(e) | e = φ(m_ij), i = 1..r }

where α, β_1, β_2, γ, δ are tunable hyper-parameters. As before, pop denotes mention-entity popularity and cxt the context of mentions and entities, in two variants for mentions: rowcxt for same-row cells, doccxt for the entire document. sim denotes contextual similarity and coh pair-wise coherence between same-row entities. hom is a measure of type homogeneity and specificity.

This framework leaves many choices for the underlying measures: defining specifics of the two context models, defining the measures for type homogeneity and specificity, and so on. For example, hom may combine the fraction of per-column entities that have a common type and the depth of the type in the KB taxonomy. The latter is important to avoid overly generic types like entity, person or artefact. The inference of column types has been addressed as a problem by itself (e.g., [593, 85]). The case of lists, which can be viewed as single-column tables, has received special attention, too (e.g., [524, 228]).

The literature on EL over tables, most notably [338, 39, 253, 493], discusses a variety of viable design choices in depth. [660] is a recent survey on knowledge extraction from web tables.

Algorithmically, many of the previously presented methods for text-based EL carry over to the case of tables. Scoring and ranking methods can simply extend their objective functions to the above table-specific model. Graph-based methods, including dense subgraphs, random walks and CRF-based inference, merely have to re-define their input graphs accordingly. The seminal work of [338] did this with a probabilistic graphical model, integrating the column type inference in a joint learning task.

Figure 5.6 illustrates this graph-construction step: edges denote CRF-like coupling factors or guide random walks (only some edges are shown). In the figure, the type homogeneity is depicted by couplings with the prevalent column type. Alternatively, the CRF could couple all mention pairs in the same column including the column header [39].
Figure 5.6: Illustration of an EL Graph for Tables (candidate entities of table cells are coupled with each other and with per-column type nodes ColType1, ColType2)
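As one possible instantiation of the hom term, the following sketch combines the fraction of per-column entities sharing a type with the depth of that type in the taxonomy, as suggested above; types_of and type_depth are hypothetical accessors into the KB:

```python
from collections import Counter

def column_homogeneity(column_entities, types_of, type_depth):
    """Fraction of entities in one column that share the most common type,
    scaled by the type's taxonomy depth to penalize overly generic types
    such as entity or person."""
    counts = Counter(t for e in column_entities for t in types_of(e))
    if not counts:
        return 0.0
    # prefer the type covering most entities; break ties by specificity (depth)
    best_type, n = max(counts.items(), key=lambda kv: (kv[1], type_depth(kv[0])))
    return (n / len(column_entities)) * type_depth(best_type)
```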
Iterative Linking:
A simple yet powerful principle that can be combined with virtually all EL methods is to make linking decisions in multiple rounds, based on the mapping confidence (see, e.g., [196, 421, 451]). Initially, only unambiguous mentions are mapped, unless there is uncertainty on whether they could denote out-of-KB entities. In the next round, only those mentions are mapped for which the method has high confidence. After every round, all similarity and coherence measures are updated, triggering updates to the graph or other model on which EL operates. As more and more entities are mapped, they create a more focused context for subsequent rounds. For the running example, suppose that we can map “Hurricane” to the song with high confidence. Once this context about music is established, the confidence in linking “Bob” to Bob Dylan, rather than any other prominent Bobs, is boosted.
Domain-specific Methods:
There are numerous variations and extensions of EL methods, including domain-specific approaches, for example, for mentions of proteins, diseases etc. in biomedical texts (see, e.g., [17, 170, 108, 267] and references given there), and multi-lingual approaches where training data is available only in some languages and transferred to the processing of other languages (see, e.g., [531] and references there). The case for domain-specific methods can also be made for music (e.g., names of songs, albums, bands – which include many common words and appear incomplete or misspelled), bibliography with focus on author names and publication titles, and even business where company acronyms are common and product names have many variants.
Methods for Specific Text Styles:
There are approaches customized to specific kinds of text styles, most notably, social media posts such as tweets (e.g., [351, 526, 116]) with characteristics very different from encyclopedic pages or news articles. Yet another specific kind of input is search-engine queries, when users refer to entities by telegraphic phrases (e.g., “Dylan songs covered by Grammy and Oscar winners”). For EL over queries, [513] developed powerful methods based on probabilistic graphical models.
Methods for Specific Entity Types:
Finally, there are also specialized EL methods for disambiguating geo-spatial entities, such as “Twin Towers” (e.g., in Kuala Lumpur, or formerly in New York), and temporal entities such as “Mooncake Festival”. Methods for these types of entities are covered, for example, by [321, 508, 83] for spatial mentions, aka. toponyms, and [555, 301] for temporal expressions, aka. temponyms. A notable project where spatio-temporal entities have been annotated at scale is the GDELT news archive, supporting event-oriented knowledge discovery [316].
Key points to remember from this chapter are the following:
• Entity Linking (EL) (aka. Named Entity Disambiguation) is the task of mapping entity mentions in web pages (detected by NER, see Chapter 4) onto uniquely identified entities in a KB or similar repository. This is a key step for constructing canonicalized
KBs. Full-fledged EL methods also consider the case of out-of-KB entities, where a mention should be mapped to null rather than any entity in the KB.
• The input sources for EL can be text documents or semi-structured contents such as lists or tables in web pages. In the text case, related tasks like coreference resolution (CR) for pronouns and common noun phrases, or even general word sense disambiguation, may be incorporated as well.
• Entity Matching (EM) is a variation of the EL task where the inputs are structured data sources, such as database tables. The goal here is to map mentions in data records from one source to those of a different source and, this way, compute equivalence classes. There is not necessarily a KB as a reference repository.
• EL and EM leverage a variety of signals, most notably: a-priori popularity measures for entities and name-entity pairs, context similarity between mentions and entities, and coherence between entities that are considered as targets for different mentions in the same input. Specific instantiations may exploit existing links like those in Wikipedia, text cues like keywords and keyphrases, or embeddings for latent encoding of such contexts.
• EL methods can be chosen from a spectrum of paradigms, spanning graph algorithms, probabilistic graphical models, feature-based classifiers, all the way to feature-less neural learning. A good choice depends on the prioritization of complexity, efficiency, precision, recall, robustness and other quality dimensions.
• None of the state-of-the-art EL methods seems to universally dominate the others. There is (still) no one-size-fits-all solution. Instead, a good design choice depends on the application setting: scope and scale of entities under consideration (e.g., focus on vertical domains or entities of specific types), language style and structure of inputs (e.g., news articles vs. social media vs. scientific papers), and requirements of the use case (e.g., speed vs. precision vs. recall).
Using methods from the previous chapters, we can now assume that we have a knowledge base that has a clean and expressive taxonomy of semantic types (aka. classes) and that these types are populated with a comprehensive set of canonicalized (i.e., uniquely identified) entities.

The next step is to enrich the entities with properties in the form of SPO triples, covering both
• attributes with literal values such as the birthday of a person, the year when a song or album was released, the maximum speed and energy consumption of a car model, etc., and
• relations with other entities such as birthplace, spouse, composers and musicians for a song or album, manufacturer of a car, etc.

In this chapter, we present methods for extracting such SPO triples; most of these can handle both attributes and relations in a more or less unified way. We will see that many of the principles (e.g., statement-pattern duality), key ideas (e.g., pattern learning) and methodologies (e.g., CRFs or neural networks) of the previous chapters are applicable here as well.

Assumptions:
Best-practice methods build on a number of assumptions that are justified by already having a clean and large KB of entities and types.
• Argument Spotting:
Given input content in the form of a text document, Web page, list or table, we can spot and canonicalize arguments for the subject and object of a candidate triple. This assumption is valid because we already have methods for entity discovery and linking. As for attribute values, specific techniques for dates, monetary numbers and other quantities (with units) can be harnessed for spotting and normalization (e.g., [362, 502, 9, 555]).
• Target Properties:
We assume that, for each type of entities, we have a fairly good understanding and a reasonable initial list of which properties are relevant to capture in the KB. For example, we should know upfront that for people in general we are interested in birthdate, birthplace, spouse(s), children, organizations worked for, awards, etc., and for musicians, we additionally need to harvest songs composed or performed, albums released, concerts given, instruments played, etc. These lists are unlikely to be complete, but they provide the starting point for this chapter. We will revisit and relax the assumption in Chapter 7 on the construction and evolution of open schemas.
• Type Signatures:
We assume that each property of interest has a type signature such that we know the domain and range of the property upfront. This is part of the KB schema (or ontology). By associating properties with types, we already have the domain, but we require also that the range is specified in terms of data types for both attributes (literal values) and relations (entity types). This enables high-precision knowledge acquisition, as we can leverage type constraints for de-noising. For example, we will assume the following specifications:

  birthdate: person x date
  birthplace: person x location
  spouse: person x person
  worksFor: person x organization
Schema Repositories of Properties:
Where do these pre-specified properties of interest and their type signatures come from? Spontaneously, one may think this is a leap of faith, but on second thought, there are major assets already available.
• Early KB projects like Yago and Freebase demonstrated that it is entirely feasible, with limited effort, to manually compile schemas (aka. ontologies) for relevant properties.
Freebase comprised several thousand properties with type signatures.
• Frameworks like schema.org [193] have specified vocabularies for types and properties. These are not populated with entities, but one can easily use the schemas to drive the KB population. Currently, schema.org comprises mostly business-related types (ca. 1000) and their salient properties.
• There are rich catalogs and thesauri that cover a fair amount of vocabulary for types and properties. Some are well organized and clean, for example, the icecat.biz catalog of consumer products. Others are not quite so clean, but can still be starting points towards a schema, an example being the UMLS thesaurus for the biomedical domain.
• Domain-specific KBs, say on food, health or energy, can start with some of the above repositories and would then require expert efforts to extend and refine their schemas. This is manual work, but it is not a huge endeavor, as the KB is very unlikely to require more than a few thousand types and properties. For health KBs, for example, some tens of types and properties already cover a useful slice (e.g., [150, 605]).
The easiest and most effective way of harvesting attribute values and relational arguments, for given entities and a target property, is again to tap premium sources like Wikipedia (or IMDB, Goodreads etc. for specific domains). They feature entity-specific pages, and their structure follows fairly rigid templates. Therefore, extraction patterns can be specified with relatively little effort, most notably, in the form of regular expressions (regex) over the text and existing markup of the target pages. The underlying assumption for the viability of this approach is:
Consistent Patterns in Single Web Site:
In a single web site, all (or most) pages about entities of the same type (e.g., musicians or writers) exhibit the same patterns to express certain properties (e.g., their albums or their books, respectively). A limited amount of diversity and exceptions needs to be accepted, though.
Figure 6.1:
Examples of Wikipedia Infoboxes

Within Wikipedia, semi-structured elements like infoboxes, categories, lists, headings, etc. provide the best opportunity for harvesting facts by regular expressions. Consider the infoboxes shown in Figure 6.1 for three musicians (introducing new ones for a change, to give us a break from Bob Dylan and Elvis Presley). Our goal is to extract, say, the dates and places of birth of these people, to populate the birthdate attribute and birthplace relation. For these examples, the
Born fields provide this knowledge, with small variations, though, such as showing only the year for Nive Nielsen or repeating the person name for Jimmy Page. The following regular expressions specify the proper extractions, coping with the variations. For simplicity of explanation, we restrict ourselves to the birth year and birth city.

  birth year X:  Born .* (X = (1|2)[0-9]{3}) .*
  birth city Y:  Born .* ([0-9]{4}|")") (Y = [A-Z]([a-z])+) .*
In these expressions, “.*” denotes a wildcard for any token sequence, “|” and “[...]” denote disjunctions and ranges of tokens, “{...}” and “+” are repetition factors, and putting “)” itself in quotes is necessary to distinguish this token from the parenthesis symbol used to group sub-structures in a regex. Note that the specific syntax for regex varies among different pattern-matching tools.

Intuitively, the regex for birth year finds a subsequence X that has exactly four digits and starts with 1 or 2 (disregarding, for simplicity, people who were born more than 1020 years ago). The regex for birth cities identifies the first alphabetic string that starts with an upper-case letter and follows a digit or closing parenthesis.

This is still not perfectly covering all possible cases. For example, cities could be multi-word noun phrases (e.g., New Orleans). We do not show more complex expressions for ease of explanation. It is straightforward to extend the approach for both i) completeness, like extracting the full date rather than merely the year and the exact place, and ii) diversity of showing this in infoboxes. On the latter aspect, the moderation of Wikipedia has gone a long way towards standardizing infobox conventions by templates, but it could still be (and earlier was the case) that some people’s infoboxes show fields birth place, place of birth, born in, birth city, country of birth, etc. Nevertheless, it is limited effort to manually specify wide-coverage and robust regex patterns for hundreds of attributes and relations. The YAGO project, for example, did this for about 100 properties in a few days of single-person work [563]. Industrial knowledge bases harvest deterministic patterns from Web sites that are fed by back-end databases, such that each entity page has the very same structure (e.g., IMDB pages for the cast of movies).

To ease the specification of regex patterns, methods have been developed that merely require marking up examples of the desired output in a small set of pages, sometimes even supported by visual tools (e.g., [507, 162, 209]). For restricted kinds of patterns, it is then possible to automatically learn the regex or, equivalently, the underlying finite-state automaton, essentially inferring a regular grammar from examples of the language. This methodology applied to pattern extraction has become known as wrapper induction [300, 545, 299, 410, 36]. The survey [511] covers best-practice methods, with emphasis on CRF-based learning. Wrapper induction is a standard building block for information extraction today.
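Returning to the birth-year and birth-city patterns above: the schematic notation translates directly into executable regexes. A small Python rendering on hypothetical flattened Born field strings (the field values are invented for illustration):

```python
import re

# Hypothetical flattened "Born" infobox fields
born_fields = [
    "Born: James Patrick Page 9 January 1944 Heston, Middlesex, England",
    "Born: 1979 (age 41) Nuuk, Greenland",
]

BIRTH_YEAR = re.compile(r"Born.*?((?:1|2)[0-9]{3})")
BIRTH_CITY = re.compile(r"Born.*?(?:[0-9]{4}|\))[,\s]+([A-Z][a-z]+)")

for field in born_fields:
    year = BIRTH_YEAR.search(field)
    city = BIRTH_CITY.search(field)
    print(year.group(1) if year else None,
          city.group(1) if city else None)
# prints: 1944 Heston  /  1979 Nuuk
```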
Neither manually specified nor learned regex patterns are perfect: there could always be an unanticipated variation among the pages that are processed. An extremely useful technique to prune out false positives among the extracted results is semantic type checking, utilizing the a-priori knowledge of type signatures for the properties of interest (see Section 6.1). If we expect birthplace to have cities as its range rather than countries or even continents, we can test an extracted argument for this relation against the specification. The types themselves can be looked up in the existing KB of entities and classes, after running entity linking on the extracted argument. This technique substantially improves the precision of regex-based KB population [563, 240]. It equally applies to literal values of attributes if there are pre-specified patterns, for example, for dates, monetary values or quantities with units.
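In code, such a type check is a simple filter over candidate triples; the KB accessors below are hypothetical, and a full implementation would test subsumption in the type taxonomy rather than direct type membership:

```python
def passes_type_check(triple, kb_types, signatures):
    """Keep an extracted (s, p, o) triple only if both arguments satisfy
    the property's type signature. kb_types: entity -> set of KB types
    (obtained via entity linking); signatures: property -> (domain, range)."""
    s, p, o = triple
    domain_type, range_type = signatures[p]
    return (domain_type in kb_types.get(s, set())
            and range_type in kb_types.get(o, set()))

signatures = {"birthplace": ("person", "city")}
kb_types = {"Elvis_Presley": {"person", "musician"},
            "Tupelo": {"city", "location"}}
assert passes_type_check(("Elvis_Presley", "birthplace", "Tupelo"),
                         kb_types, signatures)
```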
Often, a number of patterns, rules, type-checking and other steps have to be combined into an entire execution plan to accomplish some extraction task. The underlying steps can be seen as operators in an algebraic language. The
System T project [485, 90, 89] has developed a declarative language, called AQL (Annotation Query Language), and a framework for orchestrating, optimizing and executing algebraic plans combining such operators. In addition to expressive kinds of pattern matching, the framework includes operators for text spans and for combining intermediate results. A similar project, with a declarative language called
Xlog, was pursued by [522, 521], and closely related research was carried out by [50] and [260].

To illustrate the notion and value of operator-based execution plans, assume that we want to extract from a large corpus of web pages statements about music bands and their live concerts, specifically, the relation between the involved musicians and the instruments that they played. An example input could look as follows:
Led Zeppelin returns with rocking London reunion.
The quartet had a crowd of around 20,000 at London’s 02 Arena calling for more at the end of 16 tracks ranging from their most famous numbers to less familiar fare. Lead singer Robert Plant, 59, strutted his way through “Good Times Bad Times” to kick off one of the most eagerly-anticipated concerts in recent years. A grey-haired Jimmy Page, 63, reminded the world why he is considered one of the lead guitar greats, while John Paul Jones, 61, showed his versatility jumping from bass to keyboards. Completing the quartet was Jason Bonham on drums.
We aim to extract all triples of the form playsInstrument: musician × instrument, namely, the five SO pairs (Robert Plant, vocals), (John Paul Jones, bass), (John Paul Jones, keyboards), (Jimmy Page, guitar), (Jason Bonham, drums). In addition, we want to check that this really refers to a live performance. The extraction task entails several steps:

1. Detect all person names in the text.
2. Check that the mentions of people are indeed musicians, by entity linking and type checking, or accept them as out-of-KB entities.
3. Detect all mentions of musical instruments, using the instances from the KB type music instruments, including specializations such as Gibson Les Paul for electric guitar and paraphrases from the KB dictionary of labels, such as “singer” for vocals.
4. Check that the instrument mentions refer to specific mentions of musicians, for example, by considering only pairs of musician and instrument that appear in the same sentence. Here the text proximity condition may have to be varied depending on the nature and style of the text, for example, by testing for co-occurrences within a text span of 30 words, or by first applying co-reference resolution. Sometimes, even deeper analysis is called for, to handle difficult cases where subject and object co-occur incidentally without standing in the proper relation to each other.
5. Check that the entire text refers to a live performance. For example, this could require checking for the occurrence of a date and specific kinds of location like concert halls, theaters, performance arenas, music clubs and bars, or festivals.

There are many ways for ordering the execution of these steps, each of which involves sub-steps (e.g., for matching performance locations), or for running them in parallel or in a pipelined manner (with intermediate results streamed between steps). When applying such an entire operator ensemble to a large collection of input pages, the choice of order or parallel execution is crucial. The reason is that different choices incur largely different costs, because they materialize intermediate results of highly varying sizes. The System T approach therefore views the entire ensemble as a declarative task, and invokes a query optimizer to pick the best execution plan. This involves estimating the selectivities of operators, that is, the fraction of pages that yield intermediate results and the number of intermediate candidates per page. For query optimization over text, this cost-estimation aspect is still underexplored, notwithstanding results by the SystemT project [485] as well as [522] and [258].
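The core idea of selectivity-based operator ordering can be illustrated with the textbook predicate-ordering heuristic (a simplification for illustration, not SystemT's optimizer): run cheap filters that discard many pages first.

```python
def order_operators(operators):
    """operators: list of (name, predicate, pass_rate, cost_per_page).
    Rank filters by cost / (1 - pass_rate): cheap, highly selective
    operators come first, so pages are discarded as early as possible."""
    return sorted(operators, key=lambda op: op[3] / max(1e-9, 1.0 - op[2]))

def run_plan(pages, operators):
    """Stream the page collection through the reordered filter pipeline."""
    for name, predicate, _, _ in order_operators(operators):
        pages = [page for page in pages if predicate(page)]
    return pages
```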
Extraction patterns can also be specified for texts or lists as inputs. This is often amazingly easy and can yield high-quality outputs. For example, many Wikipedia articles and other kinds of online biographies contain sentences such as “Presley was born in Tupelo, Mississippi”. By the stylistic conventions of such web sites, there is little variation in these formulations. Therefore, a regex pattern like
  P * born in * C

with P bound to a person entity and C to a city can robustly extract the birth places of many people. Analogously, a pattern like

  P .* (received|won) .* (award(s)?)? .* A

applied to single sentences (with (...)? denoting optional occurrences) can extract outputs such as

  (Bob Dylan, hasWon, Grammy Award)
  (Bob Dylan, hasWon, Academy Award)
  (Elvis Presley, hasWon, Grammy Award)
Obviously, there are many other ways of phrasing statements about someone winning an award (e.g., “awarded with”, “honored by”). Manually specifying all of these would be a bottleneck; so we will discuss how to learn patterns based on distant supervision in Section 6.2.2. Nevertheless, the effort of identifying a few widely used patterns is modest and goes a long way in “picking low-hanging fruit”.
Simple regex patterns with surface tokens and wildcards, like the ones shown above, often face a dilemma of either being too specific or too liberal, thus sacrificing either recall or precision. For example, the pattern

  P played his X

with P and X matching a musician and a musical instrument, respectively, can capture only a subset of male guitarists, drummers, etc. Moreover, it misses out on more elaborate phrases such as “Jimmy Page played his bowed guitar”. By employing NLP tools for first creating word-level annotations like lemmatization and POS tags (see Section 3.2), more expressive regex patterns can be specified, for example
  P play $PPZ ($ADJ)? X

where “play” is the lemma for “plays”, “played” etc., and $PPZ and $ADJ are the tags for possessive personal pronouns and adjectives, respectively. This pattern would also capture a triple ⟨PJ Harvey, playsInstrument, guitar⟩ from the sentence “PJ Harvey plays her grungy guitar”. Further generalization could consider pre-processing sentences by dependency parsing, so as to capture arguments that are distant in the token sequence of surface text but close when understanding the grammatical structure. Figure 6.2 shows examples where the parsing reveals short-distance paths between arguments for extracting instances of playsInstrument. The third example may fail in practice, as the path between musician and instrument is not sufficiently short. However, by additionally running coreference resolution (see Section 5.1), the word “himself” can be linked back to “Bob Dylan”, thus shortening the path.

Note, though, that the extra effort of dependency parsing or coreference resolution is worthwhile only for sufficiently frequent patterns and properties that cannot be harvested by easier means. Moreover, parsing may fail on inputs with ungrammatical sentences, like in social media.
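Over pre-annotated tokens (surface form, lemma, POS tag), such a lexico-syntactic pattern can be matched with a few lines of code; the tag names follow the pattern's convention above, and the tokenized sentence is invented for illustration:

```python
# tokens: (surface, lemma, tag); tag names PPZ / ADJ follow the pattern above
sentence = [("PJ", "PJ", "NNP"), ("Harvey", "Harvey", "NNP"),
            ("plays", "play", "VBZ"), ("her", "her", "PPZ"),
            ("grungy", "grungy", "ADJ"), ("guitar", "guitar", "NN")]

def match_plays_instrument(tokens):
    """Match the pattern  P play $PPZ ($ADJ)? X  over lemma/POS annotations."""
    for i, (_, lemma, _) in enumerate(tokens):
        if lemma != "play" or i == 0 or i + 1 >= len(tokens):
            continue
        j = i + 1
        if tokens[j][2] != "PPZ":
            continue
        j += 1
        if j < len(tokens) and tokens[j][2] == "ADJ":   # optional adjective
            j += 1
        if j < len(tokens):
            subject = " ".join(t[0] for t in tokens[:i])  # naive subject span
            return (subject, "playsInstrument", tokens[j][0])
    return None

print(match_plays_instrument(sentence))
# ('PJ Harvey', 'playsInstrument', 'guitar')
```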
Figure 6.2: Examples for Relation Extraction based on Dependency Parsing Paths (“Jimmy Page used a cello bow on his double-necked electric guitar”; “PJ Harvey delighted the audience with new songs and surprised them with her saxophone playing”; “Bob Dylan performed Blind Willie McTell with Mark Knopfler on guitar and himself playing piano”)
Patterns are also frequent in headings of lists, including Wikipedia categories. The English edition of Wikipedia contains more than a million categories and lists with informative names such as “list of German scientists”, “list of French philosophers”, “Chinese businesswomen” or “astronauts by nationality”. As discussed in Section 3.2, we use such cues for inferring semantic types, but we can also exploit them for deriving statements for specific properties like hasProfession, hasNationality or bornInCountry and more. Especially when harvesting Wikipedia, judicious specification of patterns with frequent occurrences yields high-quality outputs at substantial scale. This has been demonstrated by the WikiNet project [415].
List of awards and nominations received by Bob Dylan

Academy_Awards

Year | Category | … |
---|---|---|
2000 | Best Original Song | … |

Year | Nominee | … | Result |
---|---|---|---|
1965 | The Times They Are A-Changin‘ | … | Nominated |
1973 | The Concert for Bangla Desh | … | Won |