Typing Errors in Factual Knowledge Graphs: Severity and Possible Ways Out
Peiran Yao
University of Alberta, Edmonton, AB, Canada
[email protected]

Denilson Barbosa
University of Alberta, Edmonton, AB, Canada
[email protected]
ABSTRACT
Factual knowledge graphs (KGs) such as DBpedia and Wikidata have served as part of various downstream tasks and are also widely adopted by artificial intelligence research communities as benchmark datasets. However, we found these KGs to be surprisingly noisy. In this study, we question the quality of these KGs, where the typing error rate is estimated to be 27% for coarse-grained types on average, and even 73% for certain fine-grained types. In pursuit of solutions, we propose an active typing error detection algorithm that maximizes the utilization of both gold and noisy labels. We also comprehensively discuss and compare unsupervised, semi-supervised, and supervised paradigms to deal with typing errors in factual KGs. The outcomes of this study provide guidelines for researchers using noisy factual KGs. To help practitioners deploy the techniques and conduct further research, we published our code and data.

CCS CONCEPTS
• Information systems → Data cleaning; • Computing methodologies → Semi-supervised learning settings; Ontology engineering; • General and reference → Evaluation.

KEYWORDS
label noise, data cleaning, noise model, learning with noise, factual knowledge graph
Large-scale factual knowledge graphs (KGs) such as DBpedia [15] and Wikidata [30] organize factual knowledge extracted from trustworthy corpora like Wikipedia in a machine-readable way. As an accessible and effective source of information, they have served as a vital component of many AI systems, including question-answering systems [11, 17], recommendation systems [6], and contextualized language models [25]. Aside from being utilized by a variety of downstream applications, in recent years these factual KGs have also been widely adopted as benchmarks in a multitude of research on knowledge graphs or even general machine learning. DBpedia and Wikidata have been used to evaluate KG embeddings [28], few-shot link prediction [32], entity typing [33], and other tasks. Outside of the research area of knowledge graphs, these KGs are also used as datasets for general ML tasks such as semi-supervised text classification [8, 19, 31].

∗ Camera-ready for the 30th The Web Conference (WWW '2021).
Code and data: https://github.com/xavieryao/kg-type-err-corr

Although profoundly favored by the AI research community, the quality of these factual KGs might be questionable. In fact, our analysis shows that even at the coarse level, the percentage of typing errors in DBpedia has already reached 27%. This number suggests that DBpedia, as well as other KGs built similarly, is quite noisy. The facts in DBpedia were extracted automatically from Wikipedia based on a collectively-maintained set of rules, so it is not surprising that the extracted facts contain errors. This also indicates that more caution is needed when using factual KGs, and it is of great importance to develop methods to identify typing errors in KGs. In this study, we propose to use a semi-supervised noise model to effectively detect typing errors with only a minimal amount of human intervention.
We designed a neural network (NN) architecture specifically for entity typing in factual KGs, which combines heterogeneous information from entity descriptions, surface forms of entity names, and the network structure of the KG. In addition, we included a probabilistic noise model to enable the model to robustly learn from entities with noisy type labels, and used virtual adversarial training [20] to learn from all entities regardless of whether their labels are noisy. We also applied an active learning strategy to annotate only the most useful entities. Data-driven approaches to deal with typing errors in factual KGs span a very broad spectrum, covering fully unsupervised clustering and outlier detection [1, 21], semi-supervised noise models that can leverage noisy labels [10, 12, 14], and supervised noise detection methods that fully rely on gold labels [5, 36]. In this study, we present a taxonomy of KG typing error detection paradigms and comprehensively evaluate those paradigms on DBpedia. To the best of our knowledge, this work is the first to apply noise models and semi-supervised learning to resolve typing errors in factual KGs. Although the theory of noise models has been developed for years, they were mostly evaluated in vitro on synthesized datasets, and this is also one of the earliest attempts to test them on a real noisy dataset. The findings of our study reveal the practical difficulties of applying those models in reality and provide directions for dealing with KG typing errors. Despite the extensive effort we made to develop error detection methods with multiple paradigms, the problem remains largely unsolved. The code and data used in the study will be released to the public after the publication of this paper to encourage researchers to deploy the techniques to their needs and further the study.
Figure 1: Estimations of fine-grained and coarse-grained error rates in DBpedia. The left 4 bars (in blue) show the error rates of 4 fine-grained types, and the right-most bar (in red) shows the mean error rate of coarse-grained entity types.
Factual KGs like DBpedia are sets of triples like ⟨head, relation, tail⟩, where head and tail are entities. For typing errors we are interested in the rdf:type relation; for example, tuples asserting the types of dbr:Canada are present in DBpedia. Fine-grained types such as President, Mayor, and Governor have comparatively low error rates, while the rate is as high as 0.73 for ProgrammingLanguage. This makes it challenging to develop a universal method for error detection and correction. These statistics were estimated on DBpedia version 2016-10. We chose this version because it is widely used by the research community and, as a result, has a relatively large amount of public resources available, such as analyses, baseline methods, and pre-trained models.
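The triple view of typing information above can be sketched in a few lines of Python; the triples below are illustrative stand-ins for DBpedia resources, not data drawn from the actual KG.

```python
# Minimal sketch: reading rdf:type assertions out of a set of
# (head, relation, tail) triples. The triples here are illustrative.
def types_of(triples, entity):
    """Return all types asserted for `entity` via rdf:type."""
    return {t for h, r, t in triples if h == entity and r == "rdf:type"}

triples = {
    ("dbr:Canada", "rdf:type", "dbo:Country"),
    ("dbr:Canada", "rdf:type", "dbo:Place"),
    ("dbr:Canada", "dbo:capital", "dbr:Ottawa"),
}
print(sorted(types_of(triples, "dbr:Canada")))  # ['dbo:Country', 'dbo:Place']
```

Typing error detection asks which of these rdf:type assertions are wrong for a given entity.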
Despite the problems with the data acquisition process, such as imperfect rules, we also identified issues with the design of the DBpedia taxonomy which might have induced typing errors. In the type hierarchy, certain types are not positioned correctly, and there are overlaps between types. Certain types also have ambiguous names and lack formal definitions.
We categorized the possible paradigms for typing error detection based on the amount of intrinsic and extra information required. As illustrated in Figure 2, the approaches are positioned in four quadrants, in terms of whether they utilize the noisy type labels and whether they require additional annotations. In this section we briefly introduce the ideas behind each paradigm and propose our methods for KG typing error detection adopting these paradigms.
Inspired by recent advances in learning with noise [10, 12, 14], we propose to use the combination of an entity typing network and a probabilistic noise model for the typing error detection task. We trained a classification model that is robust to noise on a subset of a noisy KG, and applied that model on another subset to detect errors. Learning with noise enables us to leverage the vast amount of data in noisy KGs, without the need for human labour to obtain high-quality typing labels.
To leverage the heterogeneous information present in factual KGs, we designed a neural network architecture for entity representation learning, as shown in Figure 3. The description of an entity is a rich source of typing information: for entity typing, simple pattern matching could achieve 87% accuracy [13]. The network captures this information with a pre-trained BERT model [9]. The name or surface form of an entity can often suggest its type; for example, entities whose names end with "Script" are more likely to be programming languages. Therefore, surface forms are encoded with a character-level RNN in our model. Finally, the network captures the first-order network structure of the KG with a bag-of-words (BoW) model. Each relation in the KG is represented as an embedding vector $\vec{r}$, and the representation of the graph structure is the sum of the embeddings of all relations an entity has. Formally, for an entity $e$ in a KG $G$, the typing network works as follows, where $\vec{r}$, $W$ and $\vec{b}$ are parameters to be learned:

$$\vec{e} = \left[\,\mathrm{BERT}(description);\ \mathrm{RNN}(name);\ \textstyle\sum_{r:(e,r,e')\in G}\vec{r}\,\right] \quad (1)$$

$$\vec{o} = \mathrm{ReLU}(W^{T}\vec{e} + \vec{b}) \quad (2)$$

Figure 2: The spectrum of typing error detection paradigms. Triangles (green) and circles (purple) denote two distinct types. In (b) and (d), shapes with solid borders represent entities with gold type labels. In (c), shapes in yellow and with solid borders represent entities whose label was altered by the noise model.
Each component of the network (BERT for description, character-level RNN for surface form, and BoW for graph structure) was initialized independently by pre-training with an entity classification task on a noisy entity type dataset.
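How the three views are combined, as in Equations (1) and (2), can be sketched with NumPy. The encoder outputs, dimensions, and relation names below are illustrative placeholders for the real BERT, RNN, and learned relation embeddings.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins for the three encoders: a BERT vector for the
# description, a character-RNN vector for the surface form, and learned
# relation embeddings for the graph view. Dimensions are arbitrary here.
desc_vec = rng.normal(size=8)   # BERT(description)
name_vec = rng.normal(size=4)   # RNN(name)
rel_emb = {"dbo:designer": rng.normal(size=4),
           "dbo:fileExt": rng.normal(size=4)}   # hypothetical relations
entity_relations = ["dbo:designer", "dbo:fileExt"]

# Eq. (1): concatenate the three views; the graph view is the sum of the
# embeddings of all relations the entity participates in.
graph_vec = np.sum([rel_emb[r] for r in entity_relations], axis=0)
e = np.concatenate([desc_vec, name_vec, graph_vec])

# Eq. (2): a linear layer followed by ReLU produces the type scores o.
W = rng.normal(size=(e.size, 17))   # 17 coarse types, as in DBpedia-C
b = rng.normal(size=17)
o = np.maximum(0.0, W.T @ e + b)
print(o.shape)  # (17,)
```

In the actual model the three encoders and the linear layer are trained jointly; this sketch only shows the data flow.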
Since our goal is to detect errors, we choose to train a multi-label classification model instead of a multi-class one, to reduce complexity. The probability of the entity $e$ having type $z_i$ given by the typing network is as follows, where $\Theta$ is the parameters of the model:

$$\Pr(z_i \mid e; \Theta) = \frac{1}{1 + e^{-\vec{o}_i}} \quad (3)$$

Following recent developments in noise models [10, 12, 14], we use flip probabilities to model the process of typing error generation:

$$p_i \triangleq \Pr(y_i = z_i \mid z_i) \quad (4)$$

where $p_i$, $i = 1 \cdots T$, are the parameters of the noise model that are learned end-to-end during the training of the model, and $T$ is the total number of types. The final output after the noise model is then

$$\Pr(y_i \mid e; \Theta) = p_i \Pr(z_i \mid e; \Theta) + (1 - p_i)\big(1 - \Pr(z_i \mid e; \Theta)\big) \quad (5)$$

Our model was further extended to learn simultaneously from entities with noisy typing labels and those with human-verified gold labels. We used a two-fold approach here. First, we applied virtual adversarial training (VAT) [20] as a regularization method, which makes use of the input data disregarding the labels, and hence avoids being affected by the typing errors in the labels. When calculating the loss, we used the model prediction without the noise model, $\Pr(z_i \mid e)$, instead of $\Pr(y_i \mid e)$ if an entity has a gold label, and we proposed an active learning scheme to efficiently select entities to annotate. The learning rate is also dynamically adjusted based on the prior belief of the correctness of a type label estimated from word embeddings. We applied VAT to learn from all sample entities in the dataset without being affected by erroneous type labels. VAT enforces a soft constraint that the model predictions for similar input entities are the same. The similarity of entities is measured in the embedding space by the $L_2$ distance of the embedding $\vec{e}$.
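The noise-model output in Equations (3)–(5) can be sketched as follows; the scores and flip parameters are illustrative values, not learned ones.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Eq. (3): per-type probabilities from the typing network's scores o.
o = np.array([2.0, -1.0, 0.5])
p_z = sigmoid(o)                         # Pr(z_i | e)

# Eqs. (4)-(5): p_i is the probability that the observed noisy label
# y_i agrees with the true label z_i; the observed-label probability
# mixes the "kept" and "flipped" cases.
p = np.array([0.9, 0.7, 0.8])            # p_i, learned end-to-end in the model
p_y = p * p_z + (1.0 - p) * (1.0 - p_z)  # Pr(y_i | e)

# Sanity check: with p_i = 1 (no noise) the noise model is a no-op.
assert np.allclose(1.0 * p_z + 0.0 * (1 - p_z), p_z)
```

Because `p_y` is what the loss sees for noisy-labelled entities, gradients flow through the flip probabilities, letting the network keep a cleaner internal prediction `p_z`.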
For an input entity $e$ and its embedding $\vec{e}$, VAT adds the following local distributional smoothing (LDS) term to the loss function:

$$LDS(\vec{e}) \triangleq -\Delta_{KL}(\vec{r}_e, \vec{e}) \quad (6)$$

where

$$\Delta_{KL}(\vec{r}, \vec{e}) \triangleq KL\big[\Pr(y \mid \vec{e}) \,\|\, \Pr(y \mid \vec{e} + \vec{r})\big] \quad (7)$$

$$\vec{r}_e \triangleq \operatorname*{arg\,max}_{r:\ \|r\| \le \epsilon} \Delta_{KL}(\vec{r}, \vec{e}) \quad (8)$$

and $\epsilon$ is a hyper-parameter. We denote the set of entities with only noisy type labels as $S = \{(e^{(i)}, y^{(i)})\}_{i:1\cdots N}$ and the set of entities with gold labels as $\hat{S} = \{(e^{(i)}, \hat{y}^{(i)})\}_{i:1\cdots M}$. The adjusted loss function that considers both noisy labels and gold labels as well as VAT is:

$$L(S, \hat{S}; \Theta) = \mathbb{E}_{(e,y)\in S\cup\hat{S}}\Big[\ \mathbb{1}[(e,y)\in S] \cdot \sum_{i=1}^{T} BCE\big(\Pr(y_i \mid e;\Theta),\ \mathbb{1}(y=y_i)\big) \;+\; \mathbb{1}[(e,y)\notin S] \cdot \sum_{i=1}^{T} BCE\big(\Pr(z_i \mid e;\Theta),\ \mathbb{1}(y=y_i)\big) \;+\; \lambda\cdot LDS(\vec{e})\ \Big] \quad (9)$$

where $\lambda$ is the hyper-parameter controlling the impact of VAT and $\Theta$ is the parameters of the typing network and the noise model.
Figure 3: The architecture of the entity typing network. We use the entity C++ (ProgrammingLanguage) in DBpedia as the example here. Green nodes (starting with dbo:) represent relation nodes in the KG, and orange nodes (not starting with dbo:) represent entities in the KG. Example texts are from DBpedia (http://dbpedia.org/page/C++).
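The per-entity term of the loss in Equation (9) can be sketched as follows. The scores, labels, and noise parameters are illustrative, and the LDS term is passed in as a precomputed scalar rather than derived by VAT's inner maximization.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bce(prob, target):
    """Binary cross-entropy for one probability/target pair."""
    return -(target * np.log(prob) + (1 - target) * np.log(1 - prob))

def entity_loss(o, label, is_noisy, p, lds=0.0, lam=1.0):
    """Per-entity term of Eq. (9): BCE through the noise model for
    noisy-labelled entities, plain BCE for gold-labelled ones, plus
    the weighted VAT smoothing term LDS."""
    p_z = sigmoid(o)                      # Pr(z_i | e), Eq. (3)
    p_y = p * p_z + (1 - p) * (1 - p_z)   # Pr(y_i | e), Eq. (5)
    probs = p_y if is_noisy else p_z
    return sum(bce(probs[i], label[i]) for i in range(len(o))) + lam * lds

o = np.array([2.0, -2.0])       # confident scores agreeing with the label
label = np.array([1.0, 0.0])
p = np.array([0.8, 0.8])        # illustrative flip parameters
noisy = entity_loss(o, label, is_noisy=True, p=p)
gold = entity_loss(o, label, is_noisy=False, p=p)
print(gold < noisy)  # True: the gold path trusts the label fully
```

The noise model dampens the loss's confidence on noisy labels, which is what keeps wrong labels from dominating training.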
We propose to use uncertainty sampling (US) as the active learning strategy for selecting entity and type pairs to annotate. Although it is not theoretically optimal, it requires less computation and is thus more scalable on large KGs. Uncertainty sampling selects the entity whose prediction has the maximum entropy:

$$e_{sel}, y_{sel} = \operatorname*{arg\,max}_{(e,y)} \sum_{i=1}^{T} H_b\big(\Pr(y_i \mid e; \Theta)\big) \quad (10)$$

We also compared US with expected error reduction (ERR) [14, 29]. In ERR, samples are selected greedily to maximize the expected model change, and this process is approximated by the difference between the gradients before and after an annotation. Suppose $(e_{sel}, y_{sel})$ is selected for annotation and the obtained gold label is $\hat{y}_{sel}$; the loss function after the annotation is then

$$\tilde{L}(e_{sel}, y_{sel}; \Theta) \triangleq L\Big(S \setminus \{(e_{sel}, y_{sel})\},\ \hat{S} \cup \{(e_{sel}, \hat{y}_{sel})\};\ \Theta\Big) \quad (11)$$

The entity-label pair selected for annotation is then the one that maximizes the gradient difference:

$$e_{sel}, y_{sel} = \operatorname*{arg\,max}_{(e,y)} \left\| \frac{\partial L}{\partial \Theta} - \frac{\partial \tilde{L}(e,y)}{\partial \Theta} \right\| \quad (12)$$

We adjusted the learning rate for each entity in the training set according to the prior belief of the probability that the entity has correct type labels. This encourages the model to learn more from entities with correct labels and mitigates the negative impact of noisy labels. For an entity-type pair $(e, y)$, we estimate the prior probability that the label $y$ is correct to be the cosine similarity between the GloVe [24] embeddings of the names of the entity and the type, denoted as $\vec{w}_e$ and $\vec{w}_y$. If one of the two word embeddings does not exist, the prior probability falls back to the mean probability of all entities. Suppose the original learning rate is $lr$; then the dynamic learning rate for the entity-type pair $(e, y)$ is set to be within the range of $[0.5\,lr, 1.5\,lr]$ with the formula:

$$lr_{dyn}(e, y) = \big(0.5 + \cos(\vec{w}_e, \vec{w}_y)\big) \cdot lr \quad (13)$$

The complete framework of the semi-supervised typing error detection method we proposed is described in Algorithm 1.

Algorithm 1
Training the semi-supervised error detection model

    Ŝ ← ∅                                  ▷ Ŝ is the gold training set
    procedure Epoch(S, Ŝ)                  ▷ S is the noisy training set
        for batch ∈ S ∪ Ŝ do
            optimize L(S, Ŝ; Θ) with batch     ▷ L defined in (9)
            for i ∈ 1 ⋯ MaxQuery do
                select (e_sel, y_sel) ∈ S based on (12)
                annotate e_sel with gold label ŷ
                S ← S \ {(e_sel, y_sel)};  Ŝ ← Ŝ ∪ {(e_sel, ŷ)}
            end for
        end for
    end procedure

Outlier detection methods are the least costly to deploy for typing error detection, as they require no labels. For typing error detection, outlier detection algorithms are independently applied on individual types, and the input is the embeddings $\vec{e}$ of entities labelled with type $t$ in the KG. The assumption behind outlier detection algorithms is that entities with correct types form high-density clusters [23]. But this assumption is often violated because entities with correct labels are not the majority for certain types, as discussed in Section 2.2. We used the combination of Wikipedia2Vec [34] and RDF2Vec [28] embeddings for each entity as the input. These two embedding methods do not require labels, either manually provided or noisy, as they are self-supervised. Wikipedia2Vec captures text information from the entity descriptions, while RDF2Vec captures information from the graph structure of the KG. To reduce the dimensionality of the embeddings, we trained an MLP-based representation learning network with a triplet loss, as illustrated in Figure 4. The input of the MLP is the concatenation of the Wikipedia2Vec and RDF2Vec embeddings of an entity, and the output is the embedding with reduced dimensionality. For each entity-type pair, an anchor entity is sampled from the set of positive samples for the type and a negative entity is sampled from the negative samples.
The triplet loss imposes a constraint that the positive entity should be closer to the anchor than the negative one, measured by the cosine distance of the embeddings. The outlier detection algorithms we used are Local Outlier Factor (LOF) [4] and Isolation Forest (IF) [16].
Figure 4: The representation learning process used to reduce the dimensionality of the embedding for outlier detection.
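The triplet constraint on cosine distances can be sketched as follows; the vectors are illustrative, and the margin value is an assumption (the paper does not state one).

```python
import numpy as np

def cos_dist(a, b):
    """Cosine distance: 1 minus cosine similarity."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss on cosine distance: push the positive
    closer to the anchor than the negative, by at least `margin`
    (the margin is illustrative)."""
    return max(0.0, cos_dist(anchor, positive) - cos_dist(anchor, negative) + margin)

anchor = np.array([1.0, 0.0])
positive = np.array([0.9, 0.1])    # nearly parallel to the anchor
negative = np.array([-1.0, 0.5])   # pointing away from the anchor
loss = triplet_loss(anchor, positive, negative)
print(loss)  # 0.0 -- the positive is already much closer than the negative
```

During training the MLP outputs play the roles of anchor, positive, and negative, so minimizing this loss shapes the reduced-dimension space so that same-type entities cluster together.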
This paradigm requires providing gold type labels for entities and training binary or multi-class classifiers to classify entities into the right types. Some previous studies [5, 36] proposed to use supervised classification for typing error detection, but it is hard to scale as the number of types is large for many KGs (778 types in total in DBpedia). One work [5] tried to tackle the scalability problem with another entity type dataset of better quality, but this could not fundamentally solve the issue as external datasets are also noisy and may be unavailable. Due to these concerns, we omit this paradigm.
We chose DBpedia [15] as the factual KG of interest, because DBpedia is still actively developed and has been widely used in the research community. The datasets used for our experiments were derived from DBpedia version 2016-10 because of the popularity of this version and the large amount of available resources. We prepared a large-scale coarse-grained typing dataset to test our proposed methods based on noise models, and adapted the dataset from the thesis of Caminhas [5] as a fine-grained typing dataset to test outlier detection and classification methods.
This dataset (DBpedia-C) is a subset of DBpedia with a balanced type distribution and a coarse typing granularity. This dataset was created because our methods based on noise models follow a multi-label classification setting, and keeping only the coarse-grained types could reduce the complexity of the task. For each entity we kept its most general type in the DBpedia type ontology, which consists of 17 distinct types in total, including one (Other) for all minority types. There are 56 types in the first (most general) level of the DBpedia type hierarchy; for the types with more than 10,000 entities we uniformly sampled a subset of size 10,000, and the types with fewer than 10,000 entities were aggregated as the Other type, which was also subsampled to a subset of size 10,000. The final dataset was uniformly subsampled to a size of 500,000. We followed a 97:3 train-dev split, where the dev set was used to tune the hyper-parameters of the model. In addition, we have a test set with 600 entities with manually annotated gold labels for evaluation. During the active learning process, an additional 3,247 entities were annotated with gold labels.
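The per-type capping and minority aggregation described above can be sketched as follows; the type names, entities, and a cap of 3 (10,000 in the actual dataset) are illustrative.

```python
import random
from collections import defaultdict

random.seed(0)

# Sketch of the DBpedia-C construction: subsample majority types to a
# fixed cap, aggregate minority types into "Other", and cap that too.
CAP = 3  # 10,000 in the paper; tiny here for the example

entities = {
    "e1": "Person", "e2": "Person", "e3": "Person", "e4": "Person",
    "e5": "Place", "e6": "Place", "e7": "Species",
}

by_type = defaultdict(list)
for ent, typ in entities.items():
    by_type[typ].append(ent)

sampled, other = {}, []
for typ, ents in by_type.items():
    if len(ents) >= CAP:
        sampled[typ] = random.sample(ents, CAP)  # subsample majority type
    else:
        other.extend(ents)                       # aggregate minority types
sampled["Other"] = random.sample(other, min(CAP, len(other)))

print({t: len(es) for t, es in sampled.items()})  # {'Person': 3, 'Other': 3}
```

The real pipeline additionally keeps only each entity's most general ontology type before this step and subsamples the union to 500,000 entities.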
We adapted the dataset from the work by Caminhas [5] (denoted as DBpedia-F) to evaluate outlier detection and classification methods, as they are conducted on a type-by-type basis. The DBpedia-F dataset contains 83 fine-grained types from DBpedia with 5,889 positive samples and 3,395 negative samples. The positive and negative samples were obtained by sampling and examining a subset of entities for each type. Note that the sampling process was not uniform, so this dataset does not represent the true noise ratio in DBpedia. The dataset was divided by types to create the train/dev/test split, and the splits contain 48, 16, and 19 types respectively.
We implemented the entity typing network, the semi-supervised noise model, and the representation learning model for outlier detection with PyTorch. In the typing network, the BERT component we used is the bert-base-uncased model from HuggingFace. We pre-processed the input entity descriptions with the en_core_web_sm NER model from spaCy by replacing all named entities with a special token ENT and all locations with LOC. The character-level RNN for surface forms is a uni-directional RNN with hidden size 64. For the BoW model for graph structures, we kept all nodes with more than 20 occurrences and used a hidden size of 256. When training the typing network, we used a hidden size of 512 and a batch size of 128, and set λ after tuning on the development set. The parameters $p_i$ for the noise model were all initialized to the same value. We used the Adam optimizer, and used the pre-trained GloVe embeddings for the dynamic learning rate [24]. We applied the active learning strategy to label a batch of 20 entities every 400 iterations, which is around 70 annotations per epoch.

https://huggingface.co/transformers/pretrained_models.html
https://spacy.io

Table 1: Results of coarse-grained typing error detection.
Model                 | 80 annotations / epoch 1 | 140 annotations / epoch 2 | 200 annotations / epoch 3
                      | Prec.   Recall   F1      | Prec.   Recall   F1       | Prec.   Recall   F1
SSNM (US)             |
  - VAT               |
  - VAT, -dynamic lr  |
  - gold label        |
  - gold label, -NM   |
  - noisy data, -US   |
  - noisy label, -US  |
SSNM (ERR)            |
  - VAT               |

The outlier detection implementations from scikit-learn were used. We used the pre-trained enwiki_20180420 Wikipedia2Vec embeddings with 100 dimensions [34] and the uniform RDF2Vec embeddings with 200 dimensions [28], and reduced the dimensions to 128 with our representation learning network.
We evaluated the proposed semi-supervised noise model (SSNM) on the task of coarse-grained typing error detection with the DBpedia-C dataset. The performance of SSNM as well as several baselines is shown in Table 1. The results at different stages of the active training process are reported to compare the effect of training iterations. For models not involving active learning (SSNM(US) -noisy label, -US and SSNM(US) -noisy data, -US), we only limited the number of annotated entities and reported the results of the checkpoints with the best validation accuracy in the first 10 epochs. The results show that our model achieved a very high F1 score with only 80 gold labels, and its F1 scores are above all other baselines when there are only 80 or 140 gold labels. This is a strong indication of the efficiency and effectiveness of our model. We assume that our proposed method achieved good performance by leveraging information from various sources, including the entity data points, gold labels, and noisy labels. We verified this assumption by comparing its performance with a few ablated baselines. The -noisy data, -US baseline only used the limited number of gold-labelled entities to train a fully-supervised classifier, and it achieved the poorest F1 score. This indicates that relying only on gold labels is infeasible if the gold-labelled set is small. The -gold label and -gold label, -NM baselines only rely on entities with noisy typing labels, and their performance is not very satisfying, which justifies the need for additional human annotations. The -noisy label, -US baseline uses entities with and without gold labels, but treats entities without gold labels as unlabelled. Compared with the -noisy data, -US baseline it has a huge performance gain, which also suggests the usefulness of noisy data. Incorporating the prior belief of label correctness from self-supervised embeddings with our dynamic learning rate scheme (-VAT v.s. -VAT, -dynamic lr) also has positive impacts. Finally, the -VAT baseline uses less information from the noisy data points and suffered a performance drop, which suggests that entities with noisy labels are helpful even if we ignore the labels. We compared the performance of uncertainty sampling (US) and expected error reduction (ERR) as the active learning strategy for sampling entities to annotate.
The evaluation shows that uncertainty sampling, though theoretically imperfect, achieved better results than ERR. In practice, ERR is also computationally intensive, as it involves computing the gradient as many times as the number of input entities. This makes it infeasible to iterate through every entity in the dataset, and makes the sampling less accurate. When used with noise models, ERR requires two distinct loss functions for gold labels and noisy labels. In our case, the difference between the two loss functions is whether the noise model is applied or not, which might not work well since in the early stage of training the noise model has not yet taken effect.
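The uncertainty-sampling criterion of Equation (10) can be sketched as follows; the per-type probabilities are illustrative model outputs for three hypothetical entities.

```python
import numpy as np

def binary_entropy(p):
    """Element-wise binary entropy H_b(p), clipped for numerical safety."""
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -(p * np.log2(p) + (1 - p) * np.log2(1 - p))

def select_most_uncertain(predictions):
    """Uncertainty sampling (Eq. 10): pick the entity whose per-type
    probabilities have the highest total binary entropy."""
    scores = {e: binary_entropy(np.asarray(probs)).sum()
              for e, probs in predictions.items()}
    return max(scores, key=scores.get)

# Illustrative predictions Pr(y_i | e) over two types.
predictions = {
    "e_confident": [0.99, 0.01],   # nearly certain on both types
    "e_uncertain": [0.55, 0.48],   # close to 0.5 on both types
    "e_mixed":     [0.90, 0.40],
}
print(select_most_uncertain(predictions))  # e_uncertain
```

Unlike ERR, this requires only one forward pass per entity, which is what makes it scale to large KGs.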
When the number of training iterations is large (at epoch 3), although more entities are annotated with gold labels, the performance of all models involving noisy labels begins to decay (for the SSNM(US) model, F1 drops steadily from epoch 1 through epoch 3). We suspect that this is because, as the training proceeds, the model begins to fit more on the wrong labels. This coincides with the previous finding of Arazo et al. that samples with wrong labels are often hard to learn and are learned at a later stage of the training process [3]. Although we applied techniques such as fine-tuning on the entities with gold labels, this issue was not completely mitigated. In the meantime, the -noisy label, -US baseline does not have this issue as it ignores noisy labels, and therefore achieved the best performance at epoch 3.

The entity typing network we designed combines heterogeneous information from the entity description, the entity name (surface form), and the KG structure (neighbouring nodes of the entity in the KG). We assessed the contributions of each of these three features with an ablation study, where we used the typing network to perform entity type classification on the DBpedia-C dataset without gold labels. This coincides with the setting of typical entity typing or classification tasks where noisy typing labels are ignored. We report the best accuracy and loss on the dev set over the first 10 epochs, and the test accuracy at this best epoch. Ablation was achieved by replacing the corresponding part of the input embedding with random noise generated from a uniform distribution and keeping the other parts unmodified.

Table 2: Classification accuracy and loss of the entity typing network on the dev set and the test set.
Method              | Dev Loss | Dev Acc. | Test Acc.
original design     | 0.314    | 0.906    | 0.867
- surface form      | 0.318    | 0.900    | 0.848
- KG structure      | 0.331    | 0.896    | 0.847
- description text  | 1.170    | 0.630    | 0.619

The results, as listed in Table 2, suggest that the text feature from the entity description contributes the most to the performance. This coincides with previous findings that simple pattern matching on the description text could perform reasonably well for typing [13]. The other two features, surface form and KG structure, also have positive effects.
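The ablation procedure itself is simple to sketch; the slice boundaries and noise range below are illustrative, not the values used in the experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def ablate(embedding, start, end, rng):
    """Ablate one input feature by overwriting its slice of the
    concatenated embedding with uniform random noise, keeping the
    other parts unmodified (noise range and bounds are illustrative)."""
    out = embedding.copy()
    out[start:end] = rng.uniform(0.0, 1.0, size=end - start)
    return out

# A toy concatenated input: [description | surface form | graph structure].
emb = np.arange(10, dtype=float)
ablated = ablate(emb, 4, 7, rng)  # knock out the "surface form" slice

unchanged = np.array_equal(ablated[:4], emb[:4]) and np.array_equal(ablated[7:], emb[7:])
print(unchanged)  # True: only the ablated slice differs
```

Feeding such ablated inputs through the unchanged network isolates how much each feature contributes to the scores in Table 2.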
Outlier detection and classification methods for typing error detection are conducted on a per-type basis, so we evaluated these methods on the fine-grained DBpedia-F dataset and report the macro-averaged precision, recall, and F1 over all types. We also report the mean average precision (MAP) to exclude the effect of threshold choice. The results are compiled in Table 3. We did not report the precision and recall for Representation Learning + LOF as the scores are too low with the default threshold, and MAP is sufficient for comparison. We used the method of Caminhas [5] as the example of a classification method and directly used the results reported in the thesis in our table (marked with ★). This method used the singular value decomposition (SVD) of the one-hot entity-property matrix as property embeddings, and concatenated the property embedding and Wikipedia2Vec embedding as the input features. The classifier used was a nearest-centroid classifier, and gold labels were obtained by linking with the LHD dataset [13].

Table 3: Results of outlier detection and classification methods for typing error detection.
Method              | Prec. | Recall | F1  | MAP
ReprLearning + LOF  |       |        |     |
ReprLearning + IF   |       |        |     |
Caminhas [5]        | ★     | ★      | ★   | -

Overall, the embeddings learned by the representation learning network we proposed worked better than using Wikipedia2Vec or RDF2Vec individually, and Isolation Forest (IF) is more effective than Local Outlier Factor (LOF). However, the performance of outlier detection methods is still poor, with the maximum F1 score lower than the random baseline. As mentioned previously, outlier detection algorithms make assumptions about the density of normal data. However, these assumptions often break in factual KGs, as described in Section 2.2, since outliers are sometimes the majority. We also lack a feasible way to tell which types satisfy the density assumptions, so it is not easy to apply outlier detection only on selected types either. For classification methods, the results appear to be promising. However, they face scalability issues because it is labour-intensive to acquire a sufficient amount of labelled data to train good classifiers when the number of types is large, as discussed in Section 3.4. Besides, training large-scale classifiers to "re-type" every entity does much more than just typing error detection.

We have quantitatively compared all four paradigms for typing error detection. All four paradigms have limitations, but we conclude that the most feasible and effective solution to typing errors is semi-supervised noise models. This is because this paradigm can simultaneously leverage information from entity data points, gold labels, and noisy labels, and hence only requires a minimum amount of human intervention to achieve good performance. In the meantime, outlier detection methods are only applicable if it can be verified that the entity type of interest contains only a very small portion of errors, and classification is only feasible if a good source of supervision is available. With so many recent research projects relying on DBpedia, how severe is the impact of the typing errors?
Although noise has been shown to be beneficial for neural networks under certain circumstances [2], we believe that this is not the case here. Our experiments have shown a clear (30%) performance degradation when training our entity typing network with only noisy data. The signal-to-noise ratio (SNR) in factual KGs like DBpedia might be too low and is indeed causing harm. Therefore, it is not appropriate to ignore typing errors when using KGs as datasets, and error detection or correction methods should be considered.
General reviews on knowledge graphs, such as the one by Ringler [27], only compared basic statistics like size but did not go deep into quality. Zaveri et al. [35] performed a quantitative assessment of DBpedia and identified four major issues with it, including accuracy. Paulheim and Bizer used the statistical distributions of properties as an indicator of erroneous statements [22] and applied this idea in production when creating DBpedia 3.9; our dataset was created from a later version that includes these improvements. Ma et al. [18] came up with the disjoint axiom to estimate the number of typing errors in KGs, stating that an entity should not have two disjoint types. This method was also used by Caminhas [5] for error rate estimation. However, it can only estimate a lower bound of the error rate, as entities with compatible type labels could also contain errors [5]. Users of KGs have also paid little attention to their quality. For example, Dai et al. [8] described DBpedia as having "no duplication or tainting issues" (page 7), which is contrary to our findings.
Entity typing is a task closely related to typing error detection. Generally, however, entity typing aims at predicting the type of an entity from only a natural language text, while typing error detection can utilize more heterogeneous information such as KG structure, noisy labels, and other entities with the same type. This is the main difference between our typing network and other neural networks specifically designed for entity typing [33]. Hypernym detection aims at finding the hypernyms (usually type names) of words, and common methods include training hierarchical embeddings [7] or performing pattern matching [13]. Our goal, however, is closer to cleansing existing type-word pairs than to creating new ones.
Recent advances in representation learning have inspired research on taxonomy cleaning. For example, similar to our entity typing network, Ren et al. proposed an embedding method that combines information from both a text corpus and a knowledge graph and claimed that the embedding is resistant to noise [26]. But as our experiments showed, the use of a noise model, VAT, and active learning can provide additional help in learning embeddings. Aly et al. [1] used hyperbolic embeddings to refine taxonomies, but this method is hard to transfer to our case as it requires high-quality hypernym-hyponym pairs.
Goldberger and Ben-Reuven [10] were among the earliest to introduce noise models to the training of neural networks, and this method was applied to NLP by Jindal et al. [12]. Kremer et al. [14] further combined noise models with active learning. However, these works were only evaluated on synthesized data that may not represent label noise in real-world sources like KGs. Our work, on the contrary, is one of the first to apply learning with noise to a real noisy dataset.
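The core of the noise-adaptation idea referenced above can be sketched with plain numpy: a base classifier predicts a distribution over clean labels, and a row-stochastic transition matrix T, with T[i, j] = P(noisy label j | clean label i), maps it to the distribution over the noisy labels actually observed during training. In practice T is learned jointly with the classifier; the numeric values below are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

logits = np.array([[4.0, 0.5, 0.2]])   # base classifier output for one entity
p_clean = softmax(logits)              # P(clean label | x)

# Assumed noise transition matrix; each row sums to 1.
T = np.array([[0.8, 0.1, 0.1],
              [0.1, 0.8, 0.1],
              [0.2, 0.2, 0.6]])

p_noisy = p_clean @ T                  # P(noisy label | x)

# Training minimizes cross-entropy between p_noisy and the observed
# noisy label, so gradients flow through both the classifier and T;
# at test time only p_clean is used.
noisy_label = 0
loss = -np.log(p_noisy[0, noisy_label])
print(p_noisy, loss)
```

Because the adaptation layer absorbs the systematic part of the label noise, the base classifier underneath is pushed toward predicting the clean labels, which is what makes the paradigm attractive for noisy KG types.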
In this study, we exhaustively reviewed the available paradigms for typing error detection in KGs and concluded that semi-supervised noise models are the most feasible solution. Under that paradigm, we proposed a method optimized to use heterogeneous information from multiple sources. In our view, typing errors in KGs, and in DBpedia especially, are a severe problem, and methods such as ours should be deployed when using typing information from DBpedia. Besides errors in entity-type pairs, there are other issues with the DBpedia taxonomy, as described in Section 2.3; we leave the detailed analysis and possible solutions of those issues as future work.
ACKNOWLEDGMENTS
The authors would like to thank J. Lau for the discussions on problem formulation. We would also like to thank the anonymous reviewers for their generous and valuable feedback. This work is supported by the Natural Sciences and Engineering Research Council of Canada and the Scotiabank Artificial Intelligence Research Initiative.
REFERENCES
[1] Rami Aly, Shantanu Acharya, Alexander Ossa, Arne Köhn, Chris Biemann, and Alexander Panchenko. 2019. Every Child Should Have Parents: A Taxonomy Refinement Algorithm Based on Hyperbolic Term Embeddings. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Florence, Italy, 4811–4817. https://doi.org/10.18653/v1/P19-1474
[2] Guozhong An. 1996. The Effects of Adding Noise during Backpropagation Training on a Generalization Performance. Neural Computation 8, 3 (April 1996), 643–674. https://doi.org/10.1162/neco.1996.8.3.643
[3] Eric Arazo, Diego Ortego, Paul Albert, Noel O'Connor, and Kevin Mcguinness. 2019. Unsupervised Label Noise Modeling and Loss Correction. In International Conference on Machine Learning.
[4] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander. 2000. LOF: Identifying Density-Based Local Outliers. SIGMOD Rec. 29, 2, 93–104. https://doi.org/10.1145/335191.335388
[5] Daniel D. Caminhas. 2019. Detecting and Correcting Typing Errors in Open-Domain Knowledge Graphs Using Semantic Representation of Entities. Master's thesis. University of Alberta, Edmonton, Canada. https://doi.org/10.7939/r3-1qtb-st35
[6] Yixin Cao, Xiang Wang, Xiangnan He, Zikun Hu, and Tat-Seng Chua. 2019. Unifying Knowledge Graph Learning and Recommendation: Towards a Better Understanding of User Preferences. In The World Wide Web Conference (San Francisco, CA, USA) (WWW '19). Association for Computing Machinery, New York, NY, USA, 151–161. https://doi.org/10.1145/3308558.3313705
[7] Haw-Shiuan Chang, Ziyun Wang, Luke Vilnis, and Andrew McCallum. 2018. Distributional Inclusion Vector Embedding for Unsupervised Hypernymy Detection. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). Association for Computational Linguistics, New Orleans, Louisiana, 485–495. https://doi.org/10.18653/v1/N18-1045
[8] Andrew M Dai and Quoc V Le. 2015. Semi-supervised Sequence Learning. In Advances in Neural Information Processing Systems, C. Cortes, N. Lawrence, D. Lee, M. Sugiyama, and R. Garnett (Eds.), Vol. 28. Curran Associates, Inc., 3079–3087.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 4171–4186.
[10] Jacob Goldberger and Ehud Ben-Reuven. 2017. Training Deep Neural-Networks Using a Noise Adaptation Layer. In The International Conference on Learning Representations (ICLR).
[11] Xiao Huang, Jingyuan Zhang, Dingcheng Li, and Ping Li. 2019. Knowledge Graph Embedding Based Question Answering. In Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining (Melbourne VIC, Australia) (WSDM '19). Association for Computing Machinery, New York, NY, USA, 105–113. https://doi.org/10.1145/3289600.3290956
[12] Ishan Jindal, Daniel Pressel, Brian Lester, and Matthew Nokleby. 2019. An Effective Label Noise Model for DNN Text Classification. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, Minneapolis, Minnesota, 3246–3256. https://doi.org/10.18653/v1/N19-1328
[13] Tomáš Kliegr. 2015. Linked Hypernyms: Enriching DBpedia with Targeted Hypernym Discovery. Journal of Web Semantics 31 (2015), 59–69. https://doi.org/10.1016/j.websem.2014.11.001
[14] Jan Kremer, Fei Sha, and Christian Igel. 2018. Robust Active Label Correction. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (Proceedings of Machine Learning Research, Vol. 84), Amos Storkey and Fernando Perez-Cruz (Eds.). PMLR, 308–316.
[15] Jens Lehmann, Robert Isele, Max Jakob, Anja Jentzsch, Dimitris Kontokostas, Pablo N Mendes, Sebastian Hellmann, Mohamed Morsey, Patrick Van Kleef, Sören Auer, et al. 2015. DBpedia – a Large-Scale, Multilingual Knowledge Base Extracted from Wikipedia. Semantic Web 6, 2 (2015), 167–195.
[16] F. T. Liu, K. M. Ting, and Z. Zhou. 2008. Isolation Forest. In 2008 Eighth IEEE International Conference on Data Mining. 413–422. https://doi.org/10.1109/ICDM.2008.17
[17] Denis Lukovnikov, Asja Fischer, Jens Lehmann, and Sören Auer. 2017. Neural Network-Based Question Answering over Knowledge Graphs on Word and Character Level. In Proceedings of the 26th International Conference on World Wide Web (Perth, Australia) (WWW '17). International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE, 1211–1220. https://doi.org/10.1145/3038912.3052675
[18] Yanfang Ma, Huan Gao, Tianxing Wu, and Guilin Qi. 2014. Learning Disjointness Axioms With Association Rule Mining and Its Application to Inconsistency Detection of Linked Data. In The Semantic Web and Web Science. Springer Berlin Heidelberg, Berlin, Heidelberg, 29–41.
[19] Takeru Miyato, Andrew M Dai, and Ian Goodfellow. 2017. Adversarial Training Methods for Semi-Supervised Text Classification. In The International Conference on Learning Representations (ICLR).
[20] Takeru Miyato, Shin-ichi Maeda, Masanori Koyama, Ken Nakae, and Shin Ishii. 2016. Distributional Smoothing with Virtual Adversarial Training. In The International Conference on Learning Representations (ICLR).
[21] P. Oza and V. M. Patel. 2019. One-Class Convolutional Neural Network. IEEE Signal Processing Letters 26, 2 (2019), 277–281. https://doi.org/10.1109/LSP.2018.2889273
[22] Heiko Paulheim and Christian Bizer. 2014. Improving the Quality of Linked Data Using Statistical Distributions. International Journal on Semantic Web & Information Systems 10, 2 (April 2014), 63–86. https://doi.org/10.4018/ijswis.2014040104
[23] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research 12 (2011), 2825–2830.
[24] Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, 1532–1543. https://doi.org/10.3115/v1/D14-1162
[25] Matthew E Peters, Mark Neumann, Robert Logan, Roy Schwartz, Vidur Joshi, Sameer Singh, and Noah A Smith. 2019. Knowledge Enhanced Contextual Word Representations. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP). 43–54.
[26] Xiang Ren, Wenqi He, Meng Qu, Clare R. Voss, Heng Ji, and Jiawei Han. 2016. Label Noise Reduction in Entity Typing by Heterogeneous Partial-Label Embedding. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Francisco, California, USA) (KDD '16). Association for Computing Machinery, New York, NY, USA, 1825–1834. https://doi.org/10.1145/2939672.2939822
[27] Daniel Ringler and Heiko Paulheim. 2017. One Knowledge Graph to Rule Them All? Analyzing the Differences Between DBpedia, YAGO, Wikidata & co. In KI 2017: Advances in Artificial Intelligence, Gabriele Kern-Isberner, Johannes Fürnkranz, and Matthias Thimm (Eds.). Springer International Publishing, Cham, 366–372.
[28] Petar Ristoski and Heiko Paulheim. 2016. RDF2Vec: RDF Graph Embeddings for Data Mining. In The Semantic Web – ISWC 2016, Paul Groth, Elena Simperl, Alasdair Gray, Marta Sabou, Markus Krötzsch, Freddy Lecue, Fabian Flöck, and Yolanda Gil (Eds.). Springer International Publishing, Cham, 498–514.
[29] Nicholas Roy and Andrew McCallum. 2001. Toward Optimal Active Learning through Sampling Estimation of Error Reduction. In International Conference on Machine Learning. Morgan Kaufmann, 441–448.
[30] Denny Vrandečić and Markus Krötzsch. 2014. Wikidata: A Free Collaborative Knowledgebase. Commun. ACM 57, 10 (Sept. 2014), 78–85. https://doi.org/10.1145/2629489
[31] Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Luong, and Quoc V Le. 2020. Unsupervised Data Augmentation for Consistency Training. In Advances in Neural Information Processing Systems, Vol. 33.
[32] Wenhan Xiong, Mo Yu, Shiyu Chang, Xiaoxiao Guo, and William Yang Wang. 2018. One-Shot Relational Learning for Knowledge Graphs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 1980–1990.
[33] Peng Xu and Denilson Barbosa. 2018. Neural Fine-Grained Entity Type Classification with Hierarchy-Aware Loss. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). 16–25.
[34] Ikuya Yamada, Akari Asai, Jin Sakuma, Hiroyuki Shindo, Hideaki Takeda, Yoshiyasu Takefuji, and Yuji Matsumoto. 2020. Wikipedia2Vec: An Efficient Toolkit for Learning and Visualizing the Embeddings of Words and Entities from Wikipedia. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. 23–30. https://doi.org/10.18653/v1/2020.emnlp-demos.4
[35] Amrapali Zaveri, Dimitris Kontokostas, Mohamed A. Sherif, Lorenz Bühmann, Mohamed Morsey, Sören Auer, and Jens Lehmann. 2013. User-Driven Quality Evaluation of DBpedia. In I-SEMANTICS '13. Association for Computing Machinery, New York, NY, USA, 97–104. https://doi.org/10.1145/2506182.2506195
[36] Hanqing Zhou, Amal Zouaq, and Diana Inkpen. 2017. DBpedia Entity Type Detection Using Entity Embeddings and N-Gram Models. In